Researchers let Claude Code discover AI scaling algorithms that humans probably wouldn’t have designed

By AI Maestro May 24, 2026 4 min read

Instead of writing rules for more efficient AI reasoning themselves, researchers let a coding agent hunt for better control algorithms in a simulated environment. The result beats established methods while burning far less compute.

Test-time scaling (TTS) is meant to make large language models perform better by letting them spend more compute on a response, say, by running several solution paths in parallel or extending chains of thought. Until now, human-written rules almost always dictated when a model kicks off a new solution path, doubles down on a promising one, or kills it.

A research team from UMD, UVA, WUSTL, UNC, Google, and Meta flips that with AutoTTS. Humans don’t write the algorithm. Instead, they build the playground where an AI agent figures out algorithms on its own.

The paper argues that many known methods are really just special cases in a shared control space defined by width (how many solution paths run at once) and depth (how far each one goes). So why, the authors ask, do researchers keep plotting paths through this space by hand instead of letting a machine search it?
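
The width-and-depth framing can be made concrete with a small sketch. The `Allocation` class and the depth figures below are hypothetical illustrations, not taken from the paper; only the 64 parallel answers for self-consistency come from the reporting further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """One point in the control space (hypothetical helper, not from the paper)."""
    width: int  # how many solution paths run at once
    depth: int  # how far each path goes, measured here in tokens

    def cost(self) -> int:
        # Rough compute cost if every path runs to the full depth.
        return self.width * self.depth


# Two familiar strategies as points in the same space.
# The depth values are made-up illustrative numbers.
self_consistency = Allocation(width=64, depth=1000)
single_long_chain = Allocation(width=1, depth=8000)
```

Seen this way, a hand-written method fixes a route through the space in advance, while a learned controller can move between width and depth as a task unfolds.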

Simulating the search keeps costs down

At the core of AutoTTS sits an offline environment. For each task, the team pre-generates several solution paths from the language model and stores them. A candidate control algorithm is then evaluated by replaying those stored paths, so its decisions about how to spend compute are scored against data that already exists. That way, thousands of variants can run without firing up the actual language model each time.
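
A minimal sketch of such a replay environment might look like the following. The class names, the step-level granularity, and the token accounting are assumptions made for illustration; the article does not describe the authors' actual data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StoredPath:
    """A pre-generated solution path (hypothetical layout)."""
    step_tokens: List[int]   # tokens consumed by each reasoning step
    step_answers: List[str]  # interim answer visible after each step


@dataclass
class OfflineEnv:
    """Replays stored paths so a controller never calls the model."""
    paths: List[StoredPath]
    revealed: Dict[int, int] = field(default_factory=dict)  # path id -> steps shown
    tokens_spent: int = 0

    def open_path(self) -> Optional[int]:
        """Start the next unused stored path; None if the pool is exhausted."""
        path_id = len(self.revealed)
        if path_id >= len(self.paths):
            return None
        self.revealed[path_id] = 0
        return path_id

    def extend(self, path_id: int) -> Optional[str]:
        """Reveal one more stored step, charge its tokens, return the interim answer."""
        path = self.paths[path_id]
        step = self.revealed[path_id]
        if step >= len(path.step_tokens):
            return None  # this path has no steps left
        self.tokens_spent += path.step_tokens[step]
        self.revealed[path_id] = step + 1
        return path.step_answers[step]
```

A controller under test would call `open_path` and `extend` in whatever order it likes, and `tokens_spent` then gives the compute it would have used against the real model.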

Claude Code does the searching. Over several rounds, the agent reviews what came before, spots weaknesses in earlier proposals, and writes a new control algorithm directly in code. To stop the search from getting lost in thousands of tiny knobs, each proposal can only expose one high-level controller to the outside. That controller sets all the other thresholds on its own. Full logs from each run also show the agent where earlier attempts blew compute for nothing.
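
The single-controller constraint can be pictured as one exposed value from which every internal threshold is derived. In the sketch below, the name `effort`, the three thresholds, and the linear mappings are all invented for illustration; the article does not say which thresholds the discovered algorithm uses or how they are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Internal settings hidden behind the one exposed knob (hypothetical)."""
    max_width: int       # most solution paths allowed at once
    stall_delta: float   # confidence change treated as "barely budging"
    prune_patience: int  # rounds a diverging path survives before removal


def derive_thresholds(effort: float) -> Thresholds:
    """Map one knob in [0, 1] to every internal threshold (mapping is invented)."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be in [0, 1]")
    return Thresholds(
        max_width=4 + round(60 * effort),
        stall_delta=0.02 + 0.08 * effort,
        prune_patience=2 + round(3 * effort),
    )
```

With only `effort` visible from outside, a search over proposals compares whole strategies rather than tuning each threshold separately.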

Agent-written algorithm outperforms human-designed ones

On math benchmarks like AIME and HMMT, the algorithm the agent came up with gets better accuracy per unit of compute than established methods. The lean setting slashes token usage by about 70 percent compared to standard self-consistency, which just generates 64 answers in parallel and picks the winner by majority vote. Accuracy holds steady.
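
For reference, the self-consistency baseline and the reported saving can be sketched in a few lines. The figure of 1,000 tokens per answer is an illustrative assumption; the 64 answers and the roughly 70 percent reduction come from the article.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def tokens_after_saving(baseline_tokens: int, saving_percent: int) -> int:
    """Token budget left after cutting the baseline by a whole-number percentage."""
    return baseline_tokens * (100 - saving_percent) // 100


# 64 parallel answers at an assumed 1,000 tokens each.
baseline = 64 * 1000
lean = tokens_after_saving(baseline, 70)
```

Under that assumption the baseline spends 64,000 tokens per task and the lean setting about 19,200.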

The algorithm also carries over to a different model (DeepSeek-R1-Distill-Llama-8B) and a non-math benchmark (GPQA-Diamond). The whole discovery run cost about $40 and took 160 minutes.

A logic humans probably wouldn’t have come up with

More interesting than the raw numbers is how the discovered program actually works. Where other methods stop as soon as a majority emerges among the answers, it tracks how the model's confidence shifts over several rounds.

If confidence barely budges, the algorithm opens more solution paths. If it climbs quickly, it skips new ones. Solution paths whose interim result lines up with the current majority get extra compute. The algorithm only drops paths that diverge if they keep heading the wrong way over multiple rounds.
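
The rules in this paragraph can be sketched as a single planning step per round. This is a loose reading of the description, not the agent's discovered program: measuring confidence as the majority's share of interim answers, the `stall_delta` and `patience` defaults, and the function name are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def plan_round(
    interim: Dict[int, str],    # path id -> current interim answer
    prev_confidence: float,     # majority share from the previous round
    strikes: Dict[int, int],    # path id -> consecutive rounds off the majority
    stall_delta: float = 0.05,  # assumed threshold for "barely budges"
    patience: int = 3,          # assumed rounds before a path is dropped
) -> Tuple[float, bool, List[int], List[int]]:
    """Decide what to do next; updates `strikes` in place."""
    majority, votes = Counter(interim.values()).most_common(1)[0]
    confidence = votes / len(interim)

    # Open more paths only when confidence is not climbing quickly.
    open_new = (confidence - prev_confidence) < stall_delta

    extend: List[int] = []
    drop: List[int] = []
    for path_id, answer in interim.items():
        if answer == majority:
            strikes[path_id] = 0
            extend.append(path_id)  # agreeing paths get extra compute
        else:
            strikes[path_id] = strikes.get(path_id, 0) + 1
            if strikes[path_id] >= patience:
                drop.append(path_id)  # dropped only after repeated divergence
    return confidence, open_new, extend, drop
```

In this sketch a path that disagrees with the majority for only a round or two is neither extended nor dropped, which matches the article's point that divergence has to persist before a path is cut.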

The authors say this kind of coordination would have been nearly impossible to design by hand. An ablation study shows how much depends on two design choices. Drop the single high-level controller, and the agent falls back on extreme shortcuts that save tons of compute in testing but tank accuracy on new tasks. Take away the detailed logs, and the algorithm the agent discovers uses more compute and reaches lower accuracy, because a bare final result isn't enough for it to figure out what went wrong.

From writing algorithms to building search spaces

The authors place AutoTTS in the same line of work as FunSearch, AlphaEvolve, and ADAS, all of which use language models as program searchers. What's new here is applying that idea to test-time scaling, where strategies were mostly designed by hand before.

The current version only covers the trade-off between width and depth. It can’t handle more complex structures like tree searches. How good the discovery turns out also depends on the coding agent. The authors don’t say whether open-source alternatives would work just as well.

The bigger takeaway is that the work shifts where humans come in: instead of inventing the rules themselves, researchers set up the search environment those rules live in. The actual strategy then emerges as code that a language model writes and refines.

As early as 2024, researchers from Hugging Face showed that small language models can match much larger ones through smart test-time compute scaling, though with search strategies designed by hand. Meta and partners recently introduced hyperagents, AI systems that optimize their own improvement process.

Originally published at the-decoder.com. Curated by AI Maestro.
