For developers and data scientists building retrieval systems, the search agent itself is often the bottleneck. Most search agents act as policies over ever-growing transcripts, forced to juggle both high-level search strategy and low-level bookkeeping. This dual burden often degrades performance. Researchers from the University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argue that the model shouldn’t have to do both. They have released Harness-1, a 20B retrieval subagent built on gpt-oss-20b that offloads state management to an external environment.
The system splits the work in two. The policy handles semantic choices: what to search, what to verify, and when to stop. The harness manages the recoverable state: candidate pools, evidence graphs, and verification records. Both the weights and the harness code are open source, so the system can be tried in existing retrieval pipelines.

What Harness-1 Actually Is
This tool does not generate answers; it curates a ranked list of documents for a downstream model to consume. It operates within a state-machine harness centred on a per-episode WORKINGMEMORY.
Execution follows a strict loop. The harness displays the current search state and recent actions. The model outputs a single structured command. The harness executes the command, updates the internal state, and renders the next observation. This separation ensures the model focuses on intent rather than data management.
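The loop described above can be sketched in a few lines. This is a minimal illustration, not the released harness: `ToyHarness`, `run_episode`, and `scripted_policy` are hypothetical names, and the rendered observation format is invented for the example. The 40-turn cap mirrors the limit the article reports for RL training.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyHarness:
    """Hypothetical stand-in for the harness: it owns all state between turns."""
    max_turns: int = 40
    history: list = field(default_factory=list)
    done: bool = False

    def render(self) -> str:
        # Compact observation: turn count plus the three most recent actions.
        recent = ", ".join(a["tool"] for a in self.history[-3:]) or "none"
        return f"turn={len(self.history)} recent=[{recent}]"

    def step(self, action: dict) -> None:
        # Execute one structured command and update the state.
        self.history.append(action)
        if action["tool"] == "end_search":
            self.done = True


def run_episode(policy: Callable[[str], dict], harness: ToyHarness) -> ToyHarness:
    # render -> policy emits one command -> harness executes -> repeat
    while not harness.done and len(harness.history) < harness.max_turns:
        observation = harness.render()
        action = policy(observation)
        harness.step(action)
    return harness


def scripted_policy(observation: str) -> dict:
    # Placeholder for the model: search twice, then stop.
    turn = int(observation.split()[0].split("=")[1])
    return {"tool": "search_corpus" if turn < 2 else "end_search"}


final = run_episode(scripted_policy, ToyHarness())
print([a["tool"] for a in final.history])
```

The policy function sees only the rendered string, never the harness object, which is the point of the separation: everything the model needs must be recoverable from the observation.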
The Stateful Harness: What Moves Out of the Policy
The team describes this as stateful cognitive offloading. The policy retains the decision-making logic, while the harness preserves the context required to execute those decisions.
The state is structured into specific components. A candidate pool stores compressed, deduplicated documents. A curated set acts as the final output, limited to 30 documents and tagged with importance levels: very_high, high, fair, or low. A full-text store archives every retrieved chunk outside the immediate prompt context.
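One way to picture these components is as a small container class. The sketch below is an assumption-laden illustration: the class name `WorkingMemory`, its field names, and the choice to reject new documents once the cap is reached are all hypothetical; only the 30-document limit and the four importance levels come from the description above.

```python
from dataclasses import dataclass, field

IMPORTANCE_LEVELS = ("very_high", "high", "fair", "low")  # levels named in the article
MAX_CURATED = 30  # curated-set cap named in the article


@dataclass
class WorkingMemory:
    candidate_pool: dict = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict = field(default_factory=dict)         # doc_id -> importance level
    full_text: dict = field(default_factory=dict)       # chunk_id -> full chunk text

    def curate(self, doc_id: str, importance: str) -> bool:
        if importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"unknown importance: {importance}")
        # Assumption: a full set rejects new documents but allows re-tagging.
        if doc_id not in self.curated and len(self.curated) >= MAX_CURATED:
            return False
        self.curated[doc_id] = importance
        return True

    def ranked(self) -> list:
        # Most important first; ties keep insertion order (sorted is stable).
        return sorted(self.curated, key=lambda d: IMPORTANCE_LEVELS.index(self.curated[d]))


wm = WorkingMemory()
wm.curate("doc-a", "fair")
wm.curate("doc-b", "very_high")
print(wm.ranked())  # doc-b is listed before doc-a
```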
An evidence graph imposes structure on the data. A regex extractor scans chunks for proper nouns, years, and dates. The harness then surfaces three things: frequent entities, bridge documents (those containing two or more frequent entities), and singletons (entities that appear in only one document, suggesting follow-up leads).
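A rough sketch of how such a graph could be derived from retrieved chunks follows. The regex, the function name `build_evidence_graph`, and the `min_docs` threshold for "frequent" are assumptions made for illustration; the pattern here covers only capitalised phrases and four-digit years, not full dates.

```python
import re
from collections import defaultdict

# Assumed pattern: runs of capitalised words, or a year in the 1900s/2000s.
ENTITY_RE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|(?:19|20)\d{2})\b")


def build_evidence_graph(chunks: dict, min_docs: int = 2):
    """Return (frequent entities, bridge documents, singleton entities)."""
    entity_docs = defaultdict(set)
    for doc_id, text in chunks.items():
        for entity in ENTITY_RE.findall(text):
            entity_docs[entity].add(doc_id)

    frequent = {e for e, docs in entity_docs.items() if len(docs) >= min_docs}
    singletons = {e for e, docs in entity_docs.items() if len(docs) == 1}

    # A bridge document mentions two or more frequent entities.
    frequent_per_doc = defaultdict(set)
    for entity in frequent:
        for doc_id in entity_docs[entity]:
            frequent_per_doc[doc_id].add(entity)
    bridges = {d for d, ents in frequent_per_doc.items() if len(ents) >= 2}
    return frequent, bridges, singletons


chunks = {
    "d1": "Acme Corp named Jane Doe as chief executive in 2021.",
    "d2": "Jane Doe left Acme Corp after the merger.",
    "d3": "Globex reported earnings in 2019.",
}
frequent, bridges, singletons = build_evidence_graph(chunks)
print(sorted(frequent), sorted(bridges), sorted(singletons))
```

A pattern this simple will also pick up sentence-initial words such as "The", so a practical extractor would need a stop list or stricter rules.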
The policy interacts via eight tools: fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search results are compressed using sentence-BM25, retaining only the top four sentences. Two-level deduplication removes repeats based on chunk ID and content fingerprint.
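The compression and deduplication steps can be approximated with the standard library alone. In this sketch, sentences are treated as the "documents" for BM25 scoring; the tokeniser, the `k1` and `b` defaults, the SHA-1 fingerprint, and restoring the kept sentences to reading order are all assumptions, and the function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def top_sentences(query: str, text: str, k: int = 4, k1: float = 1.5, b: float = 0.75) -> list:
    """Keep the k sentences that score highest against the query under BM25."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    docs = [tokenize(s) for s in sentences]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # sentence frequency per term
    query_terms = set(tokenize(query))

    def score(doc: list) -> float:
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]  # restore reading order


def fingerprint(text: str) -> str:
    # Content fingerprint over normalised tokens (assumed scheme).
    return hashlib.sha1(" ".join(tokenize(text)).encode()).hexdigest()


def dedupe(chunks: list, seen_ids: set, seen_fps: set) -> list:
    """Two-level dedup: drop a chunk if its ID or its content was seen before."""
    fresh = []
    for chunk_id, text in chunks:
        fp = fingerprint(text)
        if chunk_id in seen_ids or fp in seen_fps:
            continue
        seen_ids.add(chunk_id)
        seen_fps.add(fp)
        fresh.append((chunk_id, text))
    return fresh
```

The second dedup level matters because the same passage can be returned under different chunk IDs, for example when a filing is indexed twice.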
To address cold starts, the first successful search automatically seeds the curated set with eight reranked results marked as fair importance. The policy then promotes strong documents and discards weak ones, transforming the task from building from scratch to refinement.
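The warm start can be illustrated as a one-time seeding step followed by ordinary edits. The function name `seed_once` and the explicit `already_seeded` flag are hypothetical; the seed size of eight and the `fair` default come from the description above.

```python
SEED_COUNT = 8  # number of reranked results seeded, per the article


def seed_once(curated: dict, reranked_ids: list, already_seeded: bool) -> bool:
    """Seed the curated set from the first successful search; return the new flag."""
    if already_seeded or not reranked_ids:
        return already_seeded  # nothing to do: seeded earlier, or the search failed
    for doc_id in reranked_ids[:SEED_COUNT]:
        curated.setdefault(doc_id, "fair")
    return True


curated = {}
seeded = seed_once(curated, [], False)                             # empty search: no seeding
seeded = seed_once(curated, [f"d{i}" for i in range(12)], seeded)  # first successful search
seeded = seed_once(curated, ["late-1", "late-2"], seeded)          # later searches do not reseed

# The policy then refines rather than builds from scratch.
curated["d0"] = "very_high"  # promote a strong document
del curated["d7"]            # discard a weak one
print(len(curated), curated["d0"])
```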
The researchers identify three requirements for a trainable harness: warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 satisfies all three.
How It Is Trained
Training mirrors the architecture. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning then optimises search decisions relative to the maintained state.
A single teacher, GPT-5.4, runs live inside the full harness to generate demonstration trajectories. After filtering, 899 trajectories remain for supervised fine-tuning. The student is fine-tuned with LoRA at rank 32 for three epochs, and the step-550 checkpoint serves as the starting point for reinforcement learning.
RL employs on-policy CISPO with a 40-turn cap and terminal-only reward. Training focuses exclusively on SEC queries. Groups with identical rewards are excluded from the gradient. The process ran on Tinker.
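The exclusion of identical-reward groups is easy to motivate with a small example. The sketch below assumes a group-relative baseline in which each rollout's advantage is its reward minus the group mean; under that assumption a group whose rewards are all equal yields only zero advantages and contributes no learning signal. It does not implement CISPO itself, and `group_advantages` is a hypothetical name.

```python
def group_advantages(groups: dict) -> dict:
    """Map query id -> mean-centred advantages, skipping uninformative groups."""
    kept = {}
    for query_id, rewards in groups.items():
        if max(rewards) == min(rewards):
            continue  # identical rewards: every advantage would be zero
        mean = sum(rewards) / len(rewards)
        kept[query_id] = [r - mean for r in rewards]
    return kept


rollout_rewards = {
    "q1": [0.0, 1.0, 0.5],  # mixed outcomes: informative
    "q2": [0.5, 0.5, 0.5],  # all rollouts tied: excluded
    "q3": [0.0, 0.0, 0.0],  # all rollouts failed: excluded
}
print(group_advantages(rollout_rewards))
```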
The reward function separates discovery from selection and includes a tool-diversity bonus. Without this bonus, the agent collapses into repeated searches, causing curated recall to plateau near 0.53. With the bonus, diversity stabilises and recall rises to approximately 0.60.
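The article does not give the reward formula, so the following is purely illustrative. It assumes a weighted sum of a discovery term, a selection term, and a bonus proportional to the share of distinct tools used; the weights, the bonus definition, and the names `diversity_bonus` and `terminal_reward` are all invented for the example.

```python
TOOLS = (
    "fan_out_search", "search_corpus", "grep_corpus", "read_document",
    "review_docs", "curate", "verify", "end_search",
)


def diversity_bonus(actions: list) -> float:
    # Assumed definition: fraction of the eight tools used at least once.
    return len(set(actions) & set(TOOLS)) / len(TOOLS)


def terminal_reward(discovery: float, selection: float, actions: list,
                    w_discovery: float = 0.3, w_selection: float = 0.6,
                    w_diversity: float = 0.1) -> float:
    # Hypothetical weights; the real reward may be shaped quite differently.
    return (w_discovery * discovery
            + w_selection * selection
            + w_diversity * diversity_bonus(actions))


repetitive = ["search_corpus"] * 10
varied = ["search_corpus", "grep_corpus", "read_document", "verify", "curate", "end_search"]

# Same retrieval quality, different tool usage.
print(terminal_reward(0.7, 0.5, repetitive) < terminal_reward(0.7, 0.5, varied))
```

Under this toy formulation, two episodes with equal recall are separated only by how varied their tool use was, which is the pressure the bonus is meant to supply against collapse into repeated searches.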
The Benchmark Case
Harness-1 was tested on eight benchmarks covering web, finance, patents, and multi-hop QA. The primary metric is curated recall: the proportion of all relevant documents that appear in the final curated set. Trajectory recall measures the proportion of relevant documents encountered anywhere during the episode, whether or not they were kept.
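The two metrics differ only in which set of documents is compared against the relevant ones. A minimal sketch, with made-up document IDs and a hypothetical `recall` helper:

```python
def recall(found, relevant) -> float:
    """Fraction of relevant documents present in `found`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)


relevant = {"a", "b", "c", "d"}
seen_during_episode = {"a", "b", "c", "x", "y"}  # everything the agent retrieved
curated_set = {"a", "b", "x"}                    # what it chose to keep

trajectory_recall = recall(seen_during_episode, relevant)
curated_recall = recall(curated_set, relevant)
print(trajectory_recall, curated_recall)
```

When the curated set is drawn from documents seen during the episode, curated recall cannot exceed trajectory recall. The gap between the two shows how much relevant evidence was found but not kept.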
| Model | Type | Avg Curated Recall | Avg Trajectory Recall |
|---|---|---|---|
| Harness-1 (20B) | Open small | 0.730 | 0.807 |
| Tongyi DeepResearch 30B | Open small | 0.616 | 0.673 |
| Context-1 (20B) | Open small | 0.603 | 0.756 |
| Search-R1 (32B) | Open small | 0.289 | 0.289 |
| GPT-OSS-20B | Open small | 0.262 | 0.590 |
| Qwen3 (32B) | Open small | 0.216 | 0.446 |
| Opus-4.6 | Frontier | 0.764 | 0.794 |
| GPT-5.4 | Frontier | 0.709 | 0.752 |
| Sonnet-4.6 | Frontier | 0.688 | 0.725 |
| Kimi-K2.5 | Frontier | 0.647 | 0.794 |
| GPT-OSS-120B | Frontier | 0.496 | 0.769 |
Harness-1 achieves an average curated recall of 0.730. This surpasses the next best open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.
The transfer pattern highlights the mechanism’s effectiveness. Supervised fine-tuning used four benchmark families; reinforcement learning used only SEC. On those source-family tasks, Harness-1 leads the closest open baseline by 7.9 points. On the four held-out benchmarks, it leads by 17.0 points. That is roughly a 2.2x larger gain on the tasks furthest from the training data.
Ablations support the harness claim. Disabling all harness mechanisms reduces recall on BrowseComp+ by 12.2 percent in relative terms. The trained policy continues searching but fails to rank the retrieved content effectively.
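The headline figures can be checked directly from the numbers quoted in the table and text. The snippet below only redoes that arithmetic; the variable names are arbitrary.

```python
# Average curated recall for the open small models, copied from the table.
open_small = {
    "Harness-1 (20B)": 0.730,
    "Tongyi DeepResearch 30B": 0.616,
    "Context-1 (20B)": 0.603,
    "Search-R1 (32B)": 0.289,
    "GPT-OSS-20B": 0.262,
    "Qwen3 (32B)": 0.216,
}

harness = open_small["Harness-1 (20B)"]
runner_up = max(v for name, v in open_small.items() if name != "Harness-1 (20B)")
lead_points = round((harness - runner_up) * 100, 1)  # lead in recall points

# Gains quoted in the text for source-family and held-out benchmarks.
source_gain, held_out_gain = 7.9, 17.0
ratio = round(held_out_gain / source_gain, 1)

print(lead_points, ratio)
```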

Use Cases
The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this pattern.
One application is literature and patent review. The evidence graph and curated set help organise numerous sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.
A third use case is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and improved sets yield higher answer accuracy.
Strengths and Weaknesses
Strengths
- Highest average curated recall among the open models tested, and second only to Opus-4.6 overall.
- Gains hold on held-out benchmarks, suggesting domain-general search operations.
- Trained on 4,352 unique items, far fewer than several baselines.
- Open checkpoint and harness code, servable with common runtimes.




