Building Supervised Fine-Tuning Data from NVIDIA Open-SWE-Traces: Trajectory Parsing, Patch Analysis, Token Budgets, and Tool-Use Metrics

The NVIDIA Open-SWE-Traces dataset is now available as a practical resource for building supervised fine-tuning data for agentic software engineering. This tutorial demonstrates how to stream the data directly from Hugging Face to Google Colab, avoiding the need to download the full dataset locally. The process involves inspecting individual records, normalising multi-turn agent conversations, parsing final code patches, and extracting metadata to build an analysis DataFrame. This allows for a clear view of trajectory length, tool usage, patch size, language distribution, and resolution outcomes. A curated subset is then created, keeping only high-quality trajectories based on success labels, token limits, language filters, and patch availability.

Installing Dependencies and Configuration

The first step involves installing and importing the core libraries required for streaming, parsing, analysis, and visualisation. Pandas and Matplotlib are configured to ensure tables and plots remain readable within the Google Colab environment. The code defines the dataset name, agent and model combinations, sampling size, and SFT filtering settings that control the rest of the tutorial.

import subprocess, sys
def _pip(*pkgs):
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)
_pip("-U", "datasets", "huggingface_hub")
_pip("tiktoken", "pandas", "matplotlib")
import json
import re
import textwrap
from itertools import islice
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 160)
plt.rcParams.update({
   "figure.figsize": (9, 4.6),
   "figure.dpi": 110,
   "axes.grid": True,
   "grid.alpha": 0.25,
   "axes.spines.top": False,
   "axes.spines.right": False,
   "font.size": 11,
   "axes.titlesize": 13,
   "axes.titleweight": "bold",
})
BLUE, ORANGE, GREEN, RED = "#4C72B0", "#DD8452", "#55A868", "#C44E52"
def banner(title):
   line = "=" * 78
   print(f"\n{line}\n  {title}\n{line}")
DATASET = "nvidia/Open-SWE-Traces"
AGENTS = ["openhands", "sweagent"]
MODELS = ["minimax_m25", "qwen35_122b"]
SAMPLE_ALL = True
PER_COMBO  = 400
N_SINGLE   = 1500
MAX_SFT_TOKENS = 32000
SFT_REQUIRE_RESOLVED = True
SFT_LANGUAGES = None

Defining Trajectory Parsing Helpers

The following helper functions make the dataset easier to process, even when fields appear in different formats. They normalise trajectories, extract message text, count roles, detect tool usage, parse code patches, and estimate token lengths. These utilities are built defensively so that analysis remains stable across schema variations found in large streamed datasets.

def message_text(msg):
   if not isinstance(msg, dict):
       return ""
   content = msg.get("content", "")
   if content is None:
       return ""
   if isinstance(content, str):
       return content
   if isinstance(content, list):
       parts = []
       for block in content:
           if isinstance(block, dict):
               parts.append(block.get("text") or block.get("content") or "")
           elif isinstance(block, str):
               parts.append(block)
       return "\n".join(p for p in parts if p)
   return str(content)
def normalize_trajectory(traj):
   if traj is None:
       return []
   if isinstance(traj, str):
       try:
           traj = json.loads(traj)
       except Exception:
           return []
   norm = []
   for msg in traj:
       if isinstance(msg, str):
           try:
               msg = json.loads(msg)
           except Exception:
               msg = {"role": "unknown", "content": msg}
       if isinstance(msg, dict):
           norm.append(msg)
   return norm
def normalize_metadata(meta):
   if isinstance(meta, str):
       try:
           return json.loads(meta)
       except Exception:
           return {}
   return meta if isinstance(meta, dict) else {}
def role_counts(trajectory):
   c = Counter()
   for msg in trajectory or []:
       if isinstance(msg, dict):
           c[msg.get("role", "unknown")] += 1
   return c
_FUNC_XML   = re.compile(r"<function\s*=\s*([a-zA-Z0-9_\-]+)", re.IGNORECASE)
_EXEC_TAG   = re.compile(r"<(execute_[a-z]+)>", re.IGNORECASE)
_BASH_FENCE = re.compile(r"```(?:bash|sh|shell)\b", re.IGNORECASE)
def extract_tool_names(trajectory):
   names = Counter()
   for msg in trajectory or []:
       if not isinstance(msg, dict):
           continue
       for call in msg.get("tool_calls") or []:
           fn = (call or {}).get("function", {}) if isinstance(call, dict) else {}
           if fn.get("name"):
               names[fn["name"]] += 1
       if msg.get("role") == "tool" and msg.get("name"):
           names[msg["name"]] += 1
       if msg.get("role") == "assistant":
           text = message_text(msg)
           for m in _FUNC_XML.findall(text):
               names[m.lower()] += 1
           for m in _EXEC_TAG.findall(text):
               names[m.lower()] += 1
           if _BASH_FENCE.search(text):
               names["bash_block"] += 1
   return names
def parse_patch(diff_text):
   if not diff_text or not isinstance(diff_text, str):
       return 0, 0, 0, [], Counter()
   files, exts = [], Counter()
   additions = deletions = 0
   for line in diff_text.splitlines():
       if line.startswith("diff --git"):
           parts = line.split()
           if len(parts) >= 3:
               path = parts[2][2:] if parts[2].startswith("a/") else parts[2]
               files.append(path)
               base = path.split("/")[-1]
               if "." in base:
                   exts[base.rsplit(".", 1)[-1].lower()] += 1
       elif line.startswith("+") and not line.startswith("+++"):
           additions += 1
       elif line.startswith("-") and not line.startswith("---"):
           deletions += 1
   return len(files), additions, deletions, files, exts
def make_token_counter():
   try:
       import tiktoken
       enc = tiktoken.get_encoding("cl100k_base")
       return lambda s: len(enc.encode(s, disallowed_special=()))
   except Exception:
       return lambda s: max(1, len(s) // 4)
count_tokens = make_token_counter()

Streaming and Inspecting Trajectories

The code streams a small sample of Open-SWE-Traces directly from Hugging Face instead of downloading the full dataset. It collects examples across agent and model combinations, then inspects the structure of a single record in detail. The walkthrough covers the first few trajectory messages and previews the final patch to understand what each training example contains.

def stream_take(agent, model, n):
   ds = load_dataset(DATASET, agent, split=model, streaming=True)
   rows = []
   for ex in islice(ds, n):
       ex = dict(ex)
       ex["_agent"], ex["_model"] = agent, model
       rows.append(ex)
   return rows
banner("STEP 1 — Streaming trajectories from the Hub")
raw_rows = []
if SAMPLE_ALL:
   combos = [(a, m) for a in AGENTS for m in MODELS]
   for agent, model in combos:
       try:
           part = stream_take(agent, model, PER_COMBO)
           raw_rows.extend(part)
           print(f"  ✓ {agent:<10} / {model:<12}  ->  {len(part):>4} rows")
       except Exception as e:
           print(f"  ✗ {agent}/{model} failed: {type(e).__name__}: {e}")
else:
   raw_rows = stream_take(AGENTS[0], MODELS[0], N_SINGLE)
   print(f"  ✓ {AGENTS[0]} / {MODELS[0]}  ->  {len(raw_rows)} rows")
print(f"\n  Total rows pulled into memory: {len(raw_rows)}")
assert raw_rows, "No rows streamed — check your internet connection and retry."
banner("STEP 2 — Anatomy of a single record")
sample = raw_rows[0]
print("Top-level fields :", list(sample.keys()))
print("instance_id      :", sample.get("instance_id"))
print("repo / language  :", sample.get("repo"), "/", sample.get("language"))
print("license          :", sample.get("license"))
print("resolved (1/0/-1):", sample.get("resolved"))
print("metadata         :", normalize_metadata(sample.get("metadata")))
traj0 = normalize_trajectory(sample.get("trajectory"))
print(f"\nTrajectory has {len(traj0)} messages. Role histogram: {dict(role_counts(traj0))}")
print("\n--- Trajectory walkthrough (each message truncated to 240 chars) ---")
for i, msg in enumerate(traj0[:8]):
   role = msg.get("role", "unknown").upper()
   body = " ".join(message_text(msg).split())
   print(f"\n[{i}] {role}")
   print(textwrap.fill(body[:240] + ("…" if len(body) > 240 else ""),
                       width=92, subsequent_indent="    "))
if len(traj0) > 8:
   print(f"\n… (+{len(traj0) - 8} more messages)")
print("\n--- Final patch (model_patch), first 25 lines ---")
print("\n".join((sample.get("model_patch") or "").splitlines()[:25]) or "(empty)")

Building the Analysis DataFrame

Once the data is loaded and parsed, the next step is to aggregate the information into a structured format. This involves creating a DataFrame that captures key metrics for every record. The code iterates through the raw rows, applying the previously defined normalisation and extraction functions. It calculates the total token count for each trajectory, counts the number of tool calls, and measures the size of the final code patch. This structured data makes

Source Read original →