Tokyo-based startup Sakana AI has launched Fugu, a system that coordinates multiple language models to compete with Anthropic‘s Fable 5 and Mythos Preview benchmarks.
In this article
The approach aims to reduce dependence on any single AI provider.
How the system works
Sakana AI previously achieved strong results with orchestrator setups for coding. Its ALE-Agent placed 21st out of 1,000 human experts in a coding competition.
Fugu acts as a language model trained to call other LLMs from an agent pool, which includes copies of itself. Depending on the request, it either handles a task on its own or pulls together a team of specialised models. Selection, delegation, checks, and synthesis all run internally. Users access everything through a single OpenAI-compatible API.
Two variants for different needs
The base Fugu model targets low latency and solid everyday performance across coding, code review, and chatbot use cases. Teams with privacy or compliance needs can exclude specific agents from the pool.
Fugu Ultra is built for maximum answer quality on complex, multi-step problems. Early users have put it to work on AI research, reproducing scientific papers, cybersecurity analysis, and patent and literature searches.
According to benchmark results Sakana AI published, Fugu Ultra performs on par with Anthropic’s Fable 5 and Mythos Preview across a range of coding, reasoning, science, and agent benchmarks.
Neither Anthropic model is in Fugu’s agent pool, though, since they aren’t publicly available. With those models included, Fugu would likely score even higher. Sakana AI says the baseline comparison numbers come from the model providers themselves. The table below shows how Fugu stacks up against the underlying base models.
| Benchmark | Fugu | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT 5.5 |
|---|---|---|---|---|---|
| SWE Bench Pro | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| LiveCodeBench Pro | 87.8 | 90.8 | 84.8 | 82.9 | 88.4 |
| Humanity’s Last Exam | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| CharXiv Reasoning | 85.1 | 86.6 | 84.2 | 83.3 | 84.1 |
| GPQA-D | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| SciCode | 60.1 | 58.7 | 53.5 | 58.9 | 56.1 |
| τ³ Banking | 21.7 | 20.6 | 20.6 | 8.4 | 20.6 |
| Long-Context Reasoning | 74.7 | 73.3 | 67.7 | 72.7 | 74.3 |
| MRCRv2 | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
A hedge against vendor lock-in
Sakana AI is pitching Fugu as a safeguard against single-provider dependence. The company points to the recent export controls on Anthropic’s Fable and Mythos models as a concrete example. Access to top AI systems can vanish overnight due to regulatory shifts or foreign policy decisions.
“For an organization or a nation, relying on a single company’s APIs for critical infrastructure, finance, or governance is a material vulnerability. This risk is no longer a hypothetical possibility, but a reality,” Sakana AI writes in its announcement. Fugu’s model pool is fully swappable, so the system can reroute to other models if one provider goes dark.
The system’s real-world performance depends entirely on which models are in the pool, though. If several top providers restrict access at the same time, Fugu’s options shrink too. An orchestrator like Fugu may boost resilience, but it’s not the same as true sovereignty. Still, Fugu could be worth watching on performance alone.
Early testers report gains on complex workflows
About 500 beta users have already tested the system in real-world settings, according to Sakana AI. Fugu proved strongest on long, multi-step workflows like automated data research, security analysis, and code reviews.
One software developer says Fugu Ultra catches far more bugs during code review than GPT-5.5. “Where other tools flag about three issues, Fugu surfaced more than twenty.” Sakana AI also claims Fugu beat Gemini 3.1 Pro, Opus 4.8, and GPT 5.5 in its own tests on automated research, mechanical design, and financial forecasting.
Video: According to Sakana, Fugu solves and visualises a Rubik’s Cube faster than the individual models.
“The beta made clear that multi-agent orchestration matters most when the task is messy, long-running, and difficult to solve with a single model call,” writes Sakana AI.
Both variants are live now through a single API on the product page and console. Sakana offers subscription plans for daily use and usage-based billing for bigger workloads.
Sakana’s bet is an AI ecosystem rather than a single model
Fugu’s technical approach builds on Sakana AI’s own research into learned model orchestration, specifically two papers presented at ICLR 2026 called Trinity and Conductor.
The idea fits Sakana AI’s broader vision of applying natural principles like swarm behaviour, evolution, and collective intelligence to AI systems. The company sees powerful AI not as a single-model problem but as a collaborative ecosystem that goes beyond what any one model can do alone.
Sakana AI was founded by former Google AI researchers Llion Jones and David Ha. Jones co-authored the 2017 “Attention Is All You Need” paper that gave us the Transformer.




