AI coding agents find the right file but miss the exact lines that matter, study shows


By AI Maestro June 14, 2026 3 min read

For developers using AI to build software, new research points to a frustrating pattern: the tools often locate the correct project files but fail to pinpoint the specific lines of code needed to fix a bug. The agents appear to navigate the codebase effectively, yet they frequently miss the context a repair depends on, producing failed patches that waste time and compute.

Separating search from the fix

A new evaluation framework, SWE-Explore, isolates the search phase of AI coding agents from the actual bug-fixing process. Previous assessments typically judged success solely on whether a bug was resolved, which masked the underlying failure modes. Under the new approach, an agent receives a bug description and a software repository, then returns a ranked list of the code sections it deems relevant before any fix is attempted.
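To make the setup concrete, here is a minimal sketch of what such a search-only result could look like. The class and field names (`CodeRegion`, `ExplorationResult`, `ranked_regions`) and the file paths are hypothetical and chosen for illustration; they are not taken from SWE-Explore itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeRegion:
    """A contiguous span of lines in one file (hypothetical schema)."""
    path: str
    start: int  # first line, 1-indexed, inclusive
    end: int    # last line, inclusive


@dataclass
class ExplorationResult:
    """What an agent hands back before any patch is attempted."""
    issue_text: str
    ranked_regions: list[CodeRegion]  # most relevant first

    def top_k(self, k: int) -> list[CodeRegion]:
        """Return the k highest-ranked regions."""
        return self.ranked_regions[:k]


# Illustrative values only; the paths are made up.
result = ExplorationResult(
    issue_text="RuntimeWarning on Overflow",
    ranked_regions=[
        CodeRegion("src/stats.py", 40, 72),
        CodeRegion("src/utils.py", 10, 18),
        CodeRegion("docs/faq.md", 1, 30),
    ],
)
print([region.path for region in result.top_k(2)])
```

Because the output is just a ranked list, it can be scored on its own, without ever running a repair step.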

Establishing a gold standard

Identifying which code lines are truly essential is difficult for humans, so the researchers established a reference point using successful runs from powerful models. The dataset, comprising 848 problems from 203 open-source projects across ten programming languages, draws heavily from Python, which accounts for 547 tasks. For each issue, the team identified which files and lines models such as GPT-5.4, Gemini 3 Pro, Claude Sonnet 4.6, and Kimi K2.6 examined before fixing the bug. Passages where multiple independent solution paths converge are treated as strong indicators of useful context.
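The convergence idea can be sketched as a simple vote across successful runs. This is an assumption about how such a reference set might be built, not the researchers' actual procedure; the function name `consensus_lines`, the `min_votes` parameter, and the sample data are all hypothetical.

```python
from collections import Counter


def consensus_lines(trajectories, min_votes=2):
    """Keep (path, line) pairs viewed by at least `min_votes` successful runs.

    `trajectories` is a list with one collection of (path, line) tuples per
    run. The voting threshold is an illustrative assumption.
    """
    votes = Counter()
    for viewed in trajectories:
        votes.update(set(viewed))  # each run votes at most once per line
    return {location for location, count in votes.items() if count >= min_votes}


# Three made-up runs that each fixed the same bug.
runs = [
    {("src/stats.py", 41), ("src/stats.py", 42), ("src/utils.py", 10)},
    {("src/stats.py", 41), ("src/stats.py", 42)},
    {("src/stats.py", 42), ("README.md", 1)},
]
core = consensus_lines(runs, min_votes=2)
print(sorted(core))
```

Lines that only one run happened to open drop out, while lines several independent runs needed remain as the reference context.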

Search is better than chance, but precision is low

The study pitted traditional keyword search against five general-purpose coding agents and four specialised research systems. Traditional keyword matching performs barely better than random guessing. For instance, a bug description like “RuntimeWarning on Overflow” often yields results dominated by documentation and templates rather than the actual source code. General agents outperform this method because they navigate the project step-by-step, yet their precision at the line level remains poor.
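A toy version of keyword matching shows why this happens. The sketch below ranks files by how many words they share with the bug description; the scoring rule, the function names, and the two sample files are assumptions made for illustration, not the baseline used in the study.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_rank(issue, files):
    """Rank file paths by the number of words shared with the issue text."""
    query = tokenize(issue)
    scores = {path: len(query & tokenize(body)) for path, body in files.items()}
    # Highest score first; ties broken alphabetically by path.
    return sorted(scores, key=lambda path: (-scores[path], path))


# Made-up repository contents.
files = {
    "docs/warnings.md": "How to silence a RuntimeWarning on overflow in your scripts",
    "src/reduce.py": "def safe_sum(xs):\n    return float(sum(xs))",
}
ranking = keyword_rank("RuntimeWarning on Overflow", files)
print(ranking)
```

The documentation page repeats the wording of the bug report and ranks first, while the source file that would actually need changing shares no words with it.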

The line-level accuracy gap

At the file level, agents perform competently, ranking the correct source files early and keeping selections tight. However, zooming in to individual lines exposes a significant weakness. On average, these agents capture only 14 to 19 percent of the relevant code lines. A more powerful language model does not resolve this; whether the model comes from OpenAI, Anthropic, Google, Moonshot, or Zhipu, the pattern of high file hit rates but low line coverage persists.
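The gap between the two granularities is easy to see with a small calculation. The definitions below are plausible simplifications rather than the benchmark's exact metrics, and the function names and numbers are illustrative assumptions.

```python
def file_hit(predicted, gold):
    """True if any predicted line sits in a file that contains gold lines."""
    gold_files = {path for path, _ in gold}
    return any(path in gold_files for path, _ in predicted)


def line_coverage(predicted, gold):
    """Fraction of gold (path, line) pairs that the prediction includes."""
    if not gold:
        return 0.0
    return len(predicted & gold) / len(gold)


# Made-up example: 20 relevant lines, and an agent that reads the right file
# but mostly the wrong part of it.
gold = {("src/stats.py", n) for n in range(40, 60)}       # lines 40..59
predicted = {("src/stats.py", n) for n in range(57, 70)}  # lines 57..69

print(file_hit(predicted, gold))
print(line_coverage(predicted, gold))
```

Here the agent scores a file-level hit while overlapping on only 3 of the 20 relevant lines, a coverage of 15 percent.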

Architecture variations show little impact on this specific metric. Systems like Claude Code, Codex, OpenHands, Mini-SWE-Agent, and AweAgent achieve nearly identical scores. The CoSIL research system stands out as an exception, scanning code as a network of interconnected blocks to achieve higher line coverage. Among specialised localisers, AutoCodeRover is precise but conservative, while OrcaLoca generates less noise but misses many relevant areas.

Context thresholds dictate success

Controlled experiments that vary the amount of context given to the repair model reveal a sharp threshold effect. When the model sees less than 50 percent of the necessary core regions, repairs mostly fail. Success rates jump significantly only between 50 and 75 percent coverage, which indicates that fixes do not improve gradually but need a minimum amount of context to work.
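An experiment of this kind can be sketched as sampling a fixed share of the core regions before handing them to a repair model. The sampling scheme, the function name `sample_context`, and the coverage levels are assumptions for illustration; the call to the repair model is deliberately left out.

```python
import random


def sample_context(core_regions, coverage, seed=0):
    """Return a random subset holding roughly `coverage` of the core regions."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    keep = round(len(core_regions) * coverage)
    return rng.sample(core_regions, keep)


# Eight placeholder regions standing in for the core context of one bug.
core = [f"region_{i}" for i in range(8)]
levels = (0.25, 0.5, 0.75, 1.0)
contexts = {level: sample_context(core, level) for level in levels}

for level in levels:
    # A real experiment would pass contexts[level] to the repair model here
    # and record whether the resulting patch passes the tests.
    print(level, len(contexts[level]))
```

Plotting repair success against these coverage levels is what would expose a step between 50 and 75 percent rather than a smooth curve.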

For harder tasks that exceed the model’s inherent capability, even additional context offers little help. However, once the critical spots are visible, extra irrelevant code does not significantly hinder performance. An agent that reads too little performs worse than one that reads too much. The clear implication for future system design is to filter less and read more.

This work follows the creation of SWE-bench two years ago, which tested agents against real GitHub issues. While variants of that benchmark have expanded to cover more languages and harder professional tasks, the underlying success metric faces scrutiny. Recent findings by the research organisation METR suggest that project managers would reject approximately half of the solutions accepted by automated reviewers, often due to basic functional errors.

Key takeaways

  • AI coding agents frequently locate the correct files but capture only 14 to 19 percent of the necessary lines of code, indicating a critical gap in precision.
  • Repair success depends on a minimum context threshold; models require at least half of the core relevant regions to be visible to execute fixes effectively.
  • Future improvements should prioritise reading broader contexts over aggressively filtering irrelevant information, as excess noise has less impact than missing context.
  • The dataset and code are available on GitHub and Hugging Face for further analysis and development.
