A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

For developers and data scientists building large language models, the FineWeb dataset remains a critical resource, but its multi-terabyte scale demands efficient handling. Rather than downloading the entire corpus, we can stream it to inspect the data, verify its integrity, and understand its composition without the storage overhead. This workflow demonstrates how to stream a sample, apply quality filters, detect near-duplicates using MinHash, and validate token counts against the GPT-2 tokenizer.

Streaming and Inspecting the Corpus

The first step involves setting up the necessary environment. We install dependencies such as datasets, datasketch, tiktoken, and pandas to handle data ingestion, hashing, tokenization, and analysis. Crucially, we set random seeds for reproducibility and configure pandas to display wide columns.

import subprocess, sys
def pip(*pkgs):
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")
import re, math, random, collections
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)

We then initiate a stream from the sample-10BT subset of FineWeb. By setting streaming=True, we process documents one by one in memory, avoiding the need to store thousands of files on disk. We limit the stream to 3,000 documents for this demonstration. Once loaded, we convert the stream into a pandas DataFrame to inspect the schema. Key fields include url, language, language_score, and token_count. Printing the first five rows and a detailed example record reveals the structure of the data, showing text snippets alongside their associated metadata.
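The streaming step described above is not shown in the listing, so here is a minimal sketch of how it could look. It assumes the Hugging Face repository id HuggingFaceFW/fineweb and its sample-10BT configuration; the helper names take and stream_fineweb_sample are ours, not part of any library. The network call sits inside a function so the helper can be tried on any iterable first.

```python
from itertools import islice

N_DOCS = 3_000  # sample size used in this walkthrough


def take(stream, n):
    """Materialize the first n records of any iterable without reading the rest."""
    return list(islice(stream, n))


def stream_fineweb_sample(n=N_DOCS):
    """Stream n documents from FineWeb's sample-10BT subset (needs network access).

    Assumption: the dataset is published as "HuggingFaceFW/fineweb" with a
    "sample-10BT" configuration and a single "train" split.
    """
    from datasets import load_dataset  # imported lazily; only needed for the real stream

    ds = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # iterate over records instead of downloading the subset
    )
    return take(ds, n)


# Offline check with a stand-in stream of FineWeb-shaped records.
fake_stream = ({"text": f"doc {i}", "url": f"https://example.com/{i}"} for i in range(10))
print(len(take(fake_stream, 3)))

# Real usage (uncomment when online):
# docs = stream_fineweb_sample()
# df = pd.DataFrame(docs)
# print(df[["url", "language", "language_score", "token_count"]].head())
```

Keeping the records in a plain list named docs matches how the later filtering and deduplication code iterates over them.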

Recreating Quality Filters

While FineWeb is a high-quality dataset, understanding the logic behind its curation is valuable for building custom pipelines. We implement simplified versions of the filtering heuristics originally used for the Gopher and C4 datasets, plus two custom checks for duplicated lines and list-like pages. These functions check for:

Word count anomalies: Rejecting texts with fewer than 50 or more than 100,000 words.
Word length statistics: Flagging documents where the average word length is outside the 3–10 character range.
Boilerplate detection: Identifying excessive symbols or list-like structures (bullet points).
Stopword presence: Requiring at least two distinct common words such as “the” or “and”.

WORD = re.compile(r"\b\w+\b")

def gopher_quality(text):
    # Simplified Gopher-style heuristics; returns (passed, reason).
    words = WORD.findall(text)
    n = len(words)
    if n < 50 or n > 100_000:
        return False, "word_count_out_of_range"
    mean_len = sum(len(w) for w in words) / n
    if mean_len < 3 or mean_len > 10:
        return False, "bad_mean_word_length"
    # Ratio of "#" and "..." occurrences to the number of words.
    if (text.count("#") + text.count("...")) / n > 0.1:
        return False, "too_many_symbols"
    lines = text.split("\n")
    if lines and sum(line.lstrip().startswith(("•", "-", "*")) for line in lines) / len(lines) > 0.9:
        return False, "mostly_bullets"
    # Require at least two distinct stopwords from this small set.
    stops = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if len(stops & {w.lower() for w in words}) < 2:
        return False, "too_few_stopwords"
    return True, "ok"

def c4_quality(text):
    # Simplified C4-style heuristics: boilerplate phrases and code-like pages.
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return False, "empty"
    low = text.lower()
    for bad in ("lorem ipsum", "javascript is disabled"):
        if bad in low:
            return False, f"boilerplate:{bad}"
    # More than one "{" per two non-empty lines suggests source code.
    if text.count("{") > 0 and text.count("{") / max(len(lines), 1) > 0.5:
        return False, "too_many_braces"
    return True, "ok"

def fineweb_custom(text):
    # Custom checks: repeated lines and pages made mostly of short lines.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return False, "empty"
    dup_frac = 1 - len(set(lines)) / len(lines)
    if dup_frac > 0.3:
        return False, "duplicated_lines"
    short_frac = sum(len(line) < 30 for line in lines) / len(lines)
    if short_frac > 0.67 and len(lines) > 5:
        return False, "list_like"
    return True, "ok"

# Record "kept" or the first failing reason (Gopher, then C4, then custom).
results = []
for d in docs:
    t = d["text"]
    g_ok, g_r = gopher_quality(t)
    c_ok, c_r = c4_quality(t)
    f_ok, f_r = fineweb_custom(t)
    reason = "kept" if (g_ok and c_ok and f_ok) else (g_r if not g_ok else c_r if not c_ok else f_r)
    results.append(reason)
filter_summary = pd.Series(results).value_counts()
print("\n--- Quality-filter outcomes on already-clean FineWeb data ---")
print("(Most pass: FineWeb is pre-filtered. Rejections show what the rules catch.)")
print(filter_summary)

Running these checks on the already-cleaned FineWeb sample should show that the vast majority of documents pass all filters. The few rejections highlight specific edge cases, such as boilerplate text or documents with many duplicated lines. Because these are simplified re-implementations, a rejection marks a disagreement between our rules and the original pipeline, not necessarily a flaw in FineWeb's curation.
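When more filters are added, the "first failing reason" logic is easier to maintain as an ordered list than as a nested conditional. The following is a small sketch of that pattern, not the original pipeline: first_rejection and the two toy filters (min_words, no_lorem) are hypothetical names used only for illustration.

```python
from collections import Counter


def first_rejection(text, filters):
    """Run (name, fn) filters in order; return "kept" or the first failure as "name:reason"."""
    for name, fn in filters:
        ok, reason = fn(text)
        if not ok:
            return f"{name}:{reason}"
    return "kept"


# Toy filters with the same (passed, reason) contract as the functions above.
def min_words(text, n=3):
    if len(text.split()) < n:
        return False, "too_short"
    return True, "ok"


def no_lorem(text):
    if "lorem ipsum" in text.lower():
        return False, "boilerplate"
    return True, "ok"


FILTERS = [("min_words", min_words), ("no_lorem", no_lorem)]

samples = [
    "a perfectly ordinary sentence with enough words",
    "too short",
    "Lorem ipsum dolor sit amet",
]
# Tally outcomes the same way the walkthrough tallies its rejection reasons.
outcome_counts = Counter(first_rejection(s, FILTERS) for s in samples)
print(outcome_counts)
```

Prefixing each reason with the filter name keeps the summary readable when two filters share a reason such as "empty".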

MinHash-Based Deduplication

To simulate how large-scale web corpora handle repeated content, we employ MinHash with Locality Sensitive Hashing (LSH). This probabilistic technique allows us to estimate the Jaccard similarity between documents without comparing them character-by-character.
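To make the idea concrete, here is a toy illustration of why MinHash works: for a random hash function, the probability that two sets share the same minimum hash value equals their Jaccard similarity, so the fraction of matching minima across many hash functions estimates it. This is a teaching sketch built on the standard library, not how datasketch is implemented; toy_minhash and estimate are hypothetical helper names.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, seed):
    # Seeded 64-bit hash: the seed is used as the BLAKE2b salt.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def toy_minhash(items, num_perm=128):
    """Signature = minimum hash value under each of num_perm seeded hash functions."""
    items = list(items)
    return [min(_hash(x, seed) for x in items) for seed in range(num_perm)]


def estimate(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Two synthetic "documents" sharing 60 of 100 distinct tokens.
doc_a = {f"w{i}" for i in range(80)}
doc_b = {f"w{i}" for i in range(20, 100)}

print("exact Jaccard   :", jaccard(doc_a, doc_b))
print("MinHash estimate:", estimate(toy_minhash(doc_a, 256), toy_minhash(doc_b, 256)))
```

The estimate gets tighter as the number of hash functions grows, which is the trade-off behind the num_perm setting used with datasketch.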

We break the text into shingles (contiguous sequences of words), generate MinHash signatures for each document, and insert them into an LSH index. We then query the index to find candidate pairs whose estimated Jaccard similarity is around 0.7 or higher. Because LSH is approximate, some returned pairs can fall slightly below the threshold and some true matches can be missed. FineWeb is deduplicated per crawl, so in a 3,000-document slice we expect few, if any, near-duplicates, but the method scales efficiently to much larger datasets.
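The threshold behaviour of LSH follows from banding: a signature is split into b bands of r rows, and two documents become candidates if any band matches exactly. With true similarity s, that happens with probability 1 - (1 - s^r)^b. The sketch below evaluates this curve for a hypothetical split of 16 bands by 8 rows (16 x 8 = 128 permutations); it is an illustrative choice, not necessarily the split datasketch selects for a threshold of 0.7.

```python
def candidate_probability(s, bands, rows):
    """Probability that two documents with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands, rows):
    """Similarity near which the S-curve rises most steeply: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


BANDS, ROWS = 16, 8  # assumed split of a 128-value signature

print(f"approximate threshold: {approx_threshold(BANDS, ROWS):.3f}")
for s in (0.3, 0.5, 0.7, 0.9):
    print(f"s = {s:.1f} -> candidate probability {candidate_probability(s, BANDS, ROWS):.3f}")
```

More rows per band push the curve to the right (fewer false positives), while more bands push it to the left (fewer missed pairs).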

from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # Lowercased word k-grams; a text shorter than k words yields a single shingle.
    toks = WORD.findall(text.lower())
    return {" ".join(toks[i:i+k]) for i in range(max(len(toks) - k + 1, 1))}

NUM_PERM = 128
THRESHOLD = 0.7
lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
minhashes = {}
# Build one MinHash signature per document and index it under its position.
for idx, d in enumerate(tqdm(docs, desc="MinHashing")):
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(d["text"]):
        m.update(s.encode("utf8"))
    minhashes[idx] = m
    lsh.insert(str(idx), m)

# Query every signature; LSH returns approximate candidates, including the document itself.
dup_pairs = set()
for idx, m in minhashes.items():
    for cand in lsh.query(m):
        c = int(cand)
        if c != idx:
            dup_pairs.add(tuple(sorted((idx, c))))
print(f"\nFound {len(dup_pairs)} near-duplicate pairs (Jaccard ≥ {THRESHOLD}).")
if dup_pairs:
    a, b = next(iter(dup_pairs))
    j = minhashes[a].jaccard(minhashes[b])
    print(f"Example pair (estimated Jaccard ≈ {j:.2f}):")
    print("  DOC A:", docs[a]["text"][:160].replace("\n", " "), "…")
    # The source listing is cut off after "DOC B:"; this line mirrors DOC A and is an assumption.
    print("  DOC B:", docs[b]["text"][:160].replace("\n", " "), "…")