Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken

For developers and data engineers, the sheer volume of NVIDIA’s Nemotron-Pretraining-Code-v3 dataset makes direct ingestion impractical. This workflow demonstrates how to bypass the multi-gigabyte download barrier by streaming the metadata directly. By inspecting the schema and constructing a filtered, shuffled sample, we gain immediate insight into the index’s composition without consuming excessive resources. The process involves analysing language distribution, file extensions, and repository frequency to understand the data’s structure. Finally, we reconstruct raw GitHub URLs to fetch actual source files, estimating the token scale of the retrieved code and saving the results for future experimentation.

Streaming the NVIDIA Nemotron-Pretraining-Code-v3 Dataset and Inspecting Its Schema

# Colab/Jupyter shell command: install the libraries used in this walkthrough.
!pip -q install -U "datasets>=2.19" huggingface_hub tiktoken pyarrow 2>/dev/null
import os, io, time, itertools, collections, textwrap, math
import pandas as pd
import requests
import matplotlib.pyplot as plt
from datasets import load_dataset, get_dataset_config_names
REPO_ID = "nvidia/Nemotron-Pretraining-Code-v3"
pd.set_option("display.max_colwidth", 80)
# List the dataset's configurations and use the first one.
configs = get_dataset_config_names(REPO_ID)
CONFIG = configs[0]
print(f"Configs available : {configs}")
print(f"Using config      : {CONFIG}")
# streaming=True iterates over records lazily instead of downloading the full split.
stream = load_dataset(REPO_ID, CONFIG, split="train", streaming=True)
print("\nFeatures / schema:")
print(stream.features)
print("\nFirst raw record:")
# Pull a single record to see concrete field values.
print(next(iter(stream)))

We set up the Colab environment by installing the required libraries and importing the tools used for streaming, analysis, and visualisation. We then define the dataset identifier, list the available configurations, select the first one, and load its training split in streaming mode so that no full download is needed. Printing the schema and the first record shows the structure of the data before any deeper analysis begins.
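A single printed record shows one example only; scanning the first few dozen records also reveals whether a field is sometimes null or changes type. The following is a minimal sketch of that idea, assuming the stream yields plain dictionaries when iterated. The helper name summarize_fields and the demo records are hypothetical; the field names mirror the columns used later in this walkthrough, and the values are made up.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Set


def summarize_fields(records: Iterable[Dict[str, Any]], k: int = 100) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names seen in the first k records."""
    seen: Dict[str, Set[str]] = {}
    for record in islice(records, k):
        for name, value in record.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen


# Hypothetical stand-in for streamed rows (values are invented for illustration).
demo_records = [
    {"repo": "octo/demo", "commit_id": "abc123", "rel_path": "src/main.py", "language": "Python"},
    {"repo": "octo/demo", "commit_id": "abc123", "rel_path": "README.md", "language": None},
]
field_types = summarize_fields(demo_records)
print(field_types)
```

In the notebook, the same helper could be pointed at the real stream, for example summarize_fields(stream, k=200), which reads only the first 200 records.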

Building a Shuffled Sample and Analyzing Code Metadata Features

N_SAMPLE = 30_000
# Approximate shuffle: rows are drawn at random from a 20,000-row buffer.
shuffled = stream.shuffle(seed=42, buffer_size=20_000)
t0 = time.time()
# Pull only the first N_SAMPLE shuffled rows from the stream.
rows = list(itertools.islice(shuffled, N_SAMPLE))
df = pd.DataFrame(rows)
print(f"\nPulled {len(df):,} rows in {time.time()-t0:,.1f}s")
print(df.head(10))
print("\nColumns:", list(df.columns), "| memory:",
      f"{df.memory_usage(deep=True).sum()/1e6:,.1f} MB")
# Derived features: lowercase extension, directory depth, and bare filename.
df["ext"]   = df["rel_path"].str.extract(r"\.([A-Za-z0-9_]+)$")[0].str.lower()
df["depth"] = df["rel_path"].str.count("/")
df["fname"] = df["rel_path"].str.rsplit("/", n=1).str[-1]
print("\n--- Top 15 languages (sample) ---")
lang_counts = df["language"].value_counts()
print(lang_counts.head(15))
print("\n--- Top 15 file extensions (sample) ---")
print(df["ext"].value_counts().head(15))
print("\n--- Most frequent repositories (sample) ---")
print(df["repo"].value_counts().head(10))
print("\n--- Path-depth summary ---")
print(df["depth"].describe())
print(f"\nUnique repos in sample : {df['repo'].nunique():,}")
print(f"Unique languages       : {df['language'].nunique():,}")

We draw a shuffled sample of 30,000 rows through a 20,000-row shuffle buffer so that the analysis is not biased by clustered rows at the start of the stream. Converting the records into a pandas DataFrame lets us derive three features from each relative path: the file extension, the directory depth, and the filename. We then count languages, file extensions, and repositories, and summarise path depth, to understand the composition of the sampled metadata.
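To see why a buffered shuffle is only approximate, it helps to write one out. The following is a simplified sketch of buffer-based shuffling in plain Python, not the datasets library's actual implementation; the function name buffer_shuffle is hypothetical.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffer_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill phase: nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Swap phase: emit a random buffered item and store the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain phase: emit whatever remains, in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled_demo = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled_demo)
```

Under this simplified model, the k-th emitted row can only come from the first buffer_size + k rows of the stream, so a 30,000-row sample drawn through a 20,000-row buffer would cover roughly the first 50,000 rows in stream order. The library may additionally shuffle the order of data shards, which this sketch does not model.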

Visualizing Languages, File Extensions, Directory Depth, and Repository Frequency

fig, ax = plt.subplots(2, 2, figsize=(14, 9))
lang_counts.head(12).iloc[::-1].plot.barh(ax=ax[0, 0], color="#76b900")
ax[0, 0].set_title("Top 12 languages (sample)"); ax[0, 0].set_xlabel("files")
df["ext"].value_counts().head(12).iloc[::-1].plot.barh(ax=ax[0, 1], color="#5b8def")
ax[0, 1].set_title("Top 12 file extensions (sample)"); ax[0, 1].set_xlabel("files")
df["depth"].clip(upper=12).plot.hist(bins=range(0, 14), ax=ax[1, 0],
                                    color="#f4a261", edgecolor="white")
ax[1, 0].set_title("Directory nesting depth"); ax[1, 0].set_xlabel("'/' count in path")
(df["repo"].value_counts().head(10).iloc[::-1]
  .plot.barh(ax=ax[1, 1], color="#9b5de5"))
ax[1, 1].set_title("Most common repos (sample)"); ax[1, 1].set_xlabel("files")
plt.tight_layout(); plt.show()

We visualise the primary patterns within the sampled metadata using a multi-panel plot. Comparisons include the top languages, file extensions, directory nesting depth, and repository frequency. These charts simplify interpretation and quickly highlight dominant structures within the metadata index.
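The bar charts show which categories dominate, but not by how much. One way to put a number on it is the share of the sample covered by the top k categories. The following is a minimal sketch using only the standard library; the helper name top_k_share and the demo values are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def top_k_share(values: Iterable[Hashable], k: int) -> Tuple[List[Tuple[Hashable, int]], float]:
    """Return the k most common values with counts, and the fraction of all items they cover."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(k)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share


# Invented language labels, for illustration only.
demo_languages = ["Python"] * 5 + ["C"] * 3 + ["Go"] + ["Rust"]
top, share = top_k_share(demo_languages, k=2)
print(top, f"{share:.0%}")
```

In the notebook, this could be applied to a column of the sample, for example top_k_share(df["language"].dropna(), k=12), to report how much of the sample the twelve plotted languages account for.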

Reconstructing Raw GitHub URLs and Fetching Real Source Files

from urllib.parse import quote
def raw_url(repo: str, commit_id: str, rel_path: str) -> str:
    # Pin the URL to the recorded commit; quote() percent-encodes the path but keeps "/".
    return (f"https://raw.githubusercontent.com/{repo}/{commit_id}/"
            f"{quote(rel_path)}")
df["raw_url"] = df.apply(lambda r: raw_url(r.repo, r.commit_id, r.rel_path), axis=1)
print("\nExample reconstructed URLs:")
for u in df["raw_url"].head(5):
    print(" ", u)
def fetch_code(url: str, max_bytes: int = 200_000, timeout: int = 10):
    # Return the file text, or None on a non-200 response, an oversized body,
    # or a network error.
    try:
        resp = requests.get(url, timeout=timeout)
        if resp.status_code == 200 and len(resp.content) <= max_bytes:
            return resp.text
        return None
    except requests.RequestException:
        return None
print("\n--- Attempting to fetch a few real files ---")
fetched, attempts = [], 0
# Walk the sample in random order and stop after five successful downloads.
for _, r in df.sample(frac=1, random_state=1).iterrows():
    if len(fetched) >= 5:
        break
    attempts += 1
    code = fetch_code(r["raw_url"])
    # An empty file is falsy, so it is also counted as a miss.
    status = "OK " if code else "MISS"
    # str() keeps the format spec from failing if the language field is missing.
    print(f"[{status}] {str(r['language']):<12} {r['repo']}/{r['rel_path']}")
    if code:
        fetched.append({**r.to_dict(), "code": code, "n_chars": len(code)})
print(f"\nFetched {len(fetched)} files in {attempts} attempts "
      f"(misses are normal: repos get deleted or renamed).")
if fetched:
    ex = fetched[0]
    print(f"\n----- PREVIEW: {ex['repo']}/{ex['rel_path']} ({ex['language']}) -----")
    # Collapse the file to one line and cut it at 600 characters for display.
    print(textwrap.shorten(ex["code"].replace("\n", "  "), width=600,
                           placeholder=" ...[truncated]"))
We reconstruct raw GitHub URLs from the repository name, commit ID, and relative file path found in the metadata. We then try to fetch a handful of real source files from GitHub, treating failed requests as misses rather than errors, since repositories may have been deleted or renamed.
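Percent-encoding the path matters because file names can contain spaces or characters such as "#", which an HTTP client would otherwise treat as the start of a URL fragment. The sketch below repeats the URL construction on a made-up path to show the effect; the repository, commit, and path are hypothetical.

```python
from urllib.parse import quote


def raw_url(repo: str, commit_id: str, rel_path: str) -> str:
    # quote() percent-encodes unsafe characters but leaves "/" untouched (safe="/" by default).
    return f"https://raw.githubusercontent.com/{repo}/{commit_id}/{quote(rel_path)}"


# Hypothetical values, chosen so the path contains a space and a "#".
demo_url = raw_url("octo/demo", "abc123", "docs/my notes#1.md")
print(demo_url)
# https://raw.githubusercontent.com/octo/demo/abc123/docs/my%20notes%231.md
```

The directory separators survive unchanged, while the space becomes %20 and the "#" becomes %23, so the whole path reaches the server as part of the request.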
