Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

For developers and data scientists building retrieval systems, the search agent itself is often the bottleneck. Most search agents act as policies over ever-growing transcripts, forced to juggle both high-level search strategy and low-level bookkeeping. This dual burden often degrades performance. Researchers from the University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argue that the model shouldn’t have to do both. They have released Harness-1, a 20B retrieval subagent built on gpt-oss-20b that offloads state management to an external environment.

The system separates cognitive load. The policy handles semantic choices: what to search, what to verify, and when to stop. The harness manages the recoverable state: candidate pools, evidence graphs, and verification records. Both the weights and the harness code are open source, so teams can run and adapt the system in their own pipelines.

What is Harness-1 Actually

This tool does not generate answers; it curates a ranked list of documents for a downstream model to consume. It operates within a state-machine harness centred on a per-episode WORKINGMEMORY.

Execution follows a strict loop. The harness displays the current search state and recent actions. The model outputs a single structured command. The harness executes the command, updates the internal state, and renders the next observation. This separation ensures the model focuses on intent rather than data management.
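The loop described above can be sketched in a few lines. This is a toy illustration, not the released harness: the `Harness` class, its `render` and `execute` methods, and the scripted policy are all hypothetical names, and the 40-turn default simply borrows the cap the article reports for RL training.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Toy stand-in for the stateful environment (hypothetical API)."""
    state: dict = field(default_factory=lambda: {"curated": [], "log": []})
    done: bool = False

    def render(self) -> str:
        # Show the current search state plus the most recent actions.
        recent = self.state["log"][-3:]
        return f"curated={self.state['curated']} recent={recent}"

    def execute(self, command: dict) -> None:
        # Apply exactly one structured command, then record it.
        if command["tool"] == "curate":
            self.state["curated"].append(command["doc_id"])
        elif command["tool"] == "end_search":
            self.done = True
        self.state["log"].append(command["tool"])


def run_episode(policy: Callable[[str], dict], harness: Harness,
                max_turns: int = 40) -> list:
    """Observe, act, execute, update; stop on end_search or the turn cap."""
    for _ in range(max_turns):
        observation = harness.render()
        command = policy(observation)
        harness.execute(command)
        if harness.done:
            break
    return harness.state["curated"]


def scripted_policy(observation: str) -> dict:
    # Stand-in for the model: curate one document, then stop.
    if "d1" not in observation:
        return {"tool": "curate", "doc_id": "d1"}
    return {"tool": "end_search"}
```

Calling `run_episode(scripted_policy, Harness())` walks through two turns: one `curate` command and one `end_search`. The policy never touches the state directly; it only reads the rendered observation and emits a command.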

The Stateful Harness: What Moves Out of the Policy

The team describes this as stateful cognitive offloading. The policy retains the decision-making logic, while the harness preserves the context required to execute those decisions.

The state is structured into specific components. A candidate pool stores compressed, deduplicated documents. A curated set acts as the final output, limited to 30 documents and tagged with importance levels: very_high, high, fair, or low. A full-text store archives every retrieved chunk outside the immediate prompt context.
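A minimal sketch of these three stores, assuming plain dictionaries keyed by document or chunk ID. The class and field names are illustrative, and rejecting additions once the cap is reached is an assumption; the article states only that the curated set is limited to 30 documents.

```python
from dataclasses import dataclass, field

IMPORTANCE_LEVELS = ("very_high", "high", "fair", "low")  # from the article
MAX_CURATED = 30                                          # from the article


@dataclass
class WorkingMemory:
    """Per-episode state (field names are hypothetical)."""
    candidate_pool: dict = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict = field(default_factory=dict)         # doc_id -> importance tag
    full_text: dict = field(default_factory=dict)       # chunk_id -> full chunk

    def curate(self, doc_id: str, importance: str) -> bool:
        """Add or re-tag a document; return False if the set is full."""
        if importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"unknown importance: {importance}")
        # Assumption: new documents are refused once the cap is hit.
        if doc_id not in self.curated and len(self.curated) >= MAX_CURATED:
            return False
        self.curated[doc_id] = importance
        return True

    def ranked(self) -> list:
        """Return curated doc IDs ordered from very_high down to low."""
        order = {level: i for i, level in enumerate(IMPORTANCE_LEVELS)}
        return sorted(self.curated, key=lambda d: order[self.curated[d]])
```

Re-tagging an existing document is always allowed in this sketch, since promotion and demotion do not change the size of the set.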

An evidence graph imposes structure on the data. A regex extractor scans chunks for proper nouns, years, and dates. The harness then surfaces frequent entities, bridge documents (those containing two or more frequent entities), and singletons (entities appearing in only one document, which suggest follow-up leads).
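The idea can be sketched as follows. The regex is a deliberately crude assumption (capitalised word runs and four-digit years only, with no date handling), and the "appears in at least two documents" threshold for a frequent entity is also assumed; the article does not give the harness's actual patterns or cutoffs.

```python
import re
from collections import defaultdict

# Crude illustrative pattern: capitalised word runs, or years 1900-2099.
ENTITY_RE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|(?:19|20)\d{2})\b")


def build_evidence_graph(chunks: dict[str, str], min_docs: int = 2):
    """Return (frequent entities, bridge doc IDs, singleton entities)."""
    entity_docs = defaultdict(set)
    for doc_id, text in chunks.items():
        for entity in ENTITY_RE.findall(text):
            entity_docs[entity].add(doc_id)

    # Frequent: seen in at least `min_docs` documents (assumed threshold).
    frequent = {e for e, docs in entity_docs.items() if len(docs) >= min_docs}
    # Singletons: seen in exactly one document, so possible follow-up leads.
    singletons = {e for e, docs in entity_docs.items() if len(docs) == 1}

    # Bridges: documents containing two or more frequent entities.
    hits = defaultdict(set)
    for entity in frequent:
        for doc_id in entity_docs[entity]:
            hits[doc_id].add(entity)
    bridges = {doc_id for doc_id, ents in hits.items() if len(ents) >= 2}
    return frequent, bridges, singletons


chunks = {
    "a": "Acme Corp hired Jane Doe in 2019.",
    "b": "Jane Doe left Acme Corp.",
    "c": "Globex reported earnings in 2021.",
}
frequent, bridges, singletons = build_evidence_graph(chunks)
```

On this toy input, "Acme Corp" and "Jane Doe" are frequent, documents `a` and `b` are bridges, and "2019", "2021", and "Globex" are singletons. A real extractor would need to handle sentence-initial capitals, which this pattern treats as entities.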

The policy interacts via eight tools: fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search results are compressed using sentence-BM25, retaining only the top four sentences. Two-level deduplication removes repeats based on chunk ID and content fingerprint.
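Both steps can be approximated with the standard library. The sketch below scores each sentence of a chunk against the query with a textbook BM25 formula and keeps the top four in their original order, then drops repeats by chunk ID and by a hash of the normalised text. The tokeniser, the BM25 parameters `k1` and `b`, and the SHA-1 fingerprint are assumptions; the article does not specify them.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(query: str, text: str, top_k: int = 4,
             k1: float = 1.5, b: float = 0.75) -> str:
    """Keep the top_k sentences by BM25 score, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    docs = [tokenize(s) for s in sentences]
    if not docs:
        return ""
    n = len(docs)
    avg_len = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(term for d in docs for term in set(d))  # sentence frequency
    query_terms = set(tokenize(query))

    def score(doc: list[str]) -> float:
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(best))


def fingerprint(text: str) -> str:
    # Hash of the normalised token stream (assumed fingerprint scheme).
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def dedupe(chunks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Level 1: drop repeated chunk IDs. Level 2: drop repeated content."""
    seen_ids, seen_fps, kept = set(), set(), []
    for chunk_id, text in chunks:
        fp = fingerprint(text)
        if chunk_id in seen_ids or fp in seen_fps:
            continue
        seen_ids.add(chunk_id)
        seen_fps.add(fp)
        kept.append((chunk_id, text))
    return kept
```

Keeping the surviving sentences in document order, rather than score order, preserves readability of the compressed chunk; that ordering is a design choice of this sketch.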

To address cold starts, the first successful search automatically seeds the curated set with eight reranked results marked as fair importance. The policy then promotes strong documents and discards weak ones, transforming the task from building from scratch to refinement.
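A sketch of the warm start, assuming the curated set is a dictionary from document ID to importance tag and that an empty result list counts as an unsuccessful search. The function name and the `seeded` flag are hypothetical.

```python
def seed_curated(curated: dict, reranked_ids: list, seeded: bool,
                 seed_size: int = 8) -> tuple[dict, bool]:
    """Seed the curated set once, from the first search that returns results."""
    if seeded or not reranked_ids:
        # Already seeded, or the search found nothing: leave state unchanged.
        return curated, seeded
    for doc_id in reranked_ids[:seed_size]:
        # Seeds enter at "fair"; the policy promotes or discards them later.
        curated.setdefault(doc_id, "fair")
    return curated, True


curated, seeded = seed_curated({}, [], seeded=False)           # failed search
curated, seeded = seed_curated(curated, [f"d{i}" for i in range(12)], seeded)
```

After the second call the set holds `d0` through `d7`, all tagged `fair`, and later searches no longer add seeds.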

The researchers identify three requirements for a trainable harness: warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 satisfies all three.

How It is Trained

Training mirrors the architecture. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning then optimises search decisions relative to the maintained state.

A single teacher, GPT-5.4, runs live inside the full harness to produce demonstration trajectories. After filtering, 899 trajectories remain for supervised fine-tuning. Fine-tuning uses LoRA at rank 32 for three epochs. The step-550 checkpoint serves as the starting point for reinforcement learning.

RL employs on-policy CISPO with a 40-turn cap and terminal-only reward. Training focuses exclusively on SEC queries. Groups with identical rewards are excluded from the gradient. The process ran on Tinker.
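The group filter is the easiest part of this setup to illustrate. The sketch below does not implement CISPO; it only shows why a group whose rollouts all earn the same reward contributes nothing. It assumes a mean-centred, group-relative advantage, which is a common choice but is not confirmed by the article, and the function name is hypothetical.

```python
from statistics import mean


def group_advantages(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """Mean-centre terminal rewards per query; skip zero-signal groups."""
    advantages = {}
    for query_id, rewards in groups.items():
        if len(set(rewards)) <= 1:
            # Identical rewards: every advantage would be zero, so the
            # group is dropped rather than padding the batch.
            continue
        baseline = mean(rewards)
        advantages[query_id] = [r - baseline for r in rewards]
    return advantages


rollouts = {"q1": [0.5, 0.5, 0.5], "q2": [0.0, 1.0]}
adv = group_advantages(rollouts)
```

Here `q1` is excluded and `q2` yields advantages of -0.5 and 0.5.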

The reward function separates discovery from selection and includes a tool-diversity bonus. Without this bonus, the agent collapses into repeated searches, causing curated recall to plateau near 0.53. With the bonus, diversity stabilises and recall rises to approximately 0.60.
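The shape of such a reward can be sketched, with the caveat that every number and term below is an assumption made for illustration. The article says only that discovery and selection are scored separately and that a tool-diversity bonus is added; the weights, the bonus size, and the "distinct tools over eight" formula are hypothetical.

```python
ALL_TOOLS = ("fan_out_search", "search_corpus", "grep_corpus", "read_document",
             "review_docs", "curate", "verify", "end_search")  # from the article


def terminal_reward(trajectory_recall: float, curated_recall: float,
                    tools_used: list[str],
                    w_discovery: float = 0.5, w_selection: float = 0.5,
                    diversity_bonus: float = 0.1) -> float:
    """Hypothetical terminal reward: discovery + selection + diversity."""
    # Discovery: relevant evidence seen anywhere in the episode.
    discovery = w_discovery * trajectory_recall
    # Selection: relevant evidence that reached the final curated set.
    selection = w_selection * curated_recall
    # Diversity: fraction of the eight tools used at least once (assumed form).
    distinct = len(set(tools_used) & set(ALL_TOOLS))
    diversity = diversity_bonus * distinct / len(ALL_TOOLS)
    return discovery + selection + diversity


repetitive = terminal_reward(0.8, 0.5, ["search_corpus"] * 10)
varied = terminal_reward(0.8, 0.5, ["search_corpus", "grep_corpus", "verify", "curate"])
```

With equal recall, the varied episode scores higher than the repetitive one, which is the pressure the bonus is meant to apply against collapsing into repeated searches.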

The Benchmark Case

Harness-1 was tested on eight benchmarks covering web, finance, patents, and multi-hop QA. The primary metric is curated recall: the proportion of relevant documents in the final set. Trajectory recall measures evidence encountered anywhere during the episode.
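The two metrics differ only in which set of documents is compared against the relevant ones. A minimal sketch, assuming documents are identified by ID and using made-up example data:

```python
def recall(found, relevant) -> float:
    """Fraction of relevant documents that appear in `found`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}
seen_in_episode = ["d1", "d2", "d3", "d9"]   # anything retrieved at any turn
final_curated = ["d1", "d2"]                 # what the subagent hands off

trajectory_recall = recall(seen_in_episode, relevant)  # 0.75
curated_recall = recall(final_curated, relevant)       # 0.5
```

The gap between the two numbers (here 0.25) is evidence the agent found but failed to keep, which is why the article treats curated recall as the primary metric.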

Model	Type	Avg Curated Recall	Avg Trajectory Recall
Harness-1 (20B)	Open small	0.730	0.807
Tongyi DeepResearch 30B	Open small	0.616	0.673
Context-1 (20B)	Open small	0.603	0.756
Search-R1 (32B)	Open small	0.289	0.289
GPT-OSS-20B	Open small	0.262	0.590
Qwen3 (32B)	Open small	0.216	0.446
Opus-4.6	Frontier	0.764	0.794
GPT-5.4	Frontier	0.709	0.752
Sonnet-4.6	Frontier	0.688	0.725
Kimi-K2.5	Frontier	0.647	0.794
GPT-OSS-120B	Frontier	0.496	0.769

Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness.

Harness-1 achieves an average curated recall of 0.730. This surpasses the next best open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.

The transfer pattern highlights the mechanism’s effectiveness. Supervised fine-tuning used four benchmark families; reinforcement learning used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. This represents a 2.2x larger gain on tasks furthest from the training data.
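The headline figures can be checked directly from the numbers quoted in the article:

```python
# Average curated recall from the results table.
harness_1, tongyi = 0.730, 0.616
margin_points = round((harness_1 - tongyi) * 100, 1)   # 11.4

# Gains over the closest open baseline, in points.
source_gain, held_out_gain = 7.9, 17.0
ratio = round(held_out_gain / source_gain, 1)          # 2.2
```

The unrounded ratio is about 2.15, so the reported 2.2x is a rounding of that value.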

Ablations support the harness claim. Disabling all harness mechanisms reduces recall on BrowseComp+ by 12.2 percent in relative terms. The trained policy continues searching but fails to rank the retrieved content effectively.

Use Cases

The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this pattern.

One application is literature and patent review. The evidence graph and curated set help organise numerous sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.

A third use case is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and improved sets yield higher answer accuracy.

Strengths and Weaknesses

Strengths

Highest average curated recall among the open models tested, and second only to Opus-4.6 overall.
Gains hold on held-out benchmarks, suggesting domain-general search operations.
Trained on 4,352 unique items, far fewer than several baselines.
Open checkpoint and harness code, servable with common runtimes.
