Sakana Fugu Beats GPT-5.5 With Model Orchestration
By Addy · June 25, 2026
The entire premise of the AI race is wrong.
That is not a contrarian take. It is an architectural argument. For three years, every major AI development has been framed as a competition between monolithic models: which single system has the most parameters, the best benchmark score, the strongest reasoning capability. OpenAI trains a bigger model. Anthropic trains a more aligned model. Google DeepMind trains a faster model. Chinese labs train cheaper models. The frame is consistent: one model, one frontier, one winner.
Sakana AI launched Fugu on June 22, 2026, and the launch is a direct challenge to that frame.
Fugu is not another monolithic foundation model. It is a learned orchestrator. Sakana's technical report describes Fugu as a family of language models trained to dynamically create agentic scaffolds over a pool of frontier LLM workers. You send one request to one API. Fugu decides whether to answer directly, route to one worker, or assemble a multi-model workflow that delegates, verifies, and synthesizes the result.
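The three-way decision described above can be pictured with a toy dispatcher. This is a minimal sketch, not Sakana's implementation: in Fugu the decision is learned, while here it is hand-written so the shape of the choice is visible. The `Request` fields, the `Plan` names, and `choose_plan` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Plan(Enum):
    DIRECT = "answer directly"
    SINGLE_WORKER = "route to one worker"
    WORKFLOW = "assemble a multi-model workflow"


@dataclass
class Request:
    text: str
    needs_tools: bool = False  # hypothetical feature, for illustration only
    subtasks: int = 1          # hypothetical feature, for illustration only


def choose_plan(req: Request) -> Plan:
    """Hand-written stand-in for a decision that Fugu is described as learning."""
    if req.subtasks > 1:
        return Plan.WORKFLOW       # delegate, verify, and synthesize
    if req.needs_tools:
        return Plan.SINGLE_WORKER  # one specialist is enough
    return Plan.DIRECT             # cheap path: no delegation


print(choose_plan(Request("What is a pufferfish?")).value)
print(choose_plan(Request("Refactor and test the parser", subtasks=3)).value)
```

The caller still sees one request and one response; only the plan behind it changes.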
On SWE-bench Pro, Fugu Ultra scores 73.7%, outperforming GPT-5.5 at 58.6% and Claude Opus 4.8 at 69.2%. On Terminal-Bench 2.1, it scores 82.1%, ahead of every standalone model in Sakana's comparison set. The regular Fugu model scores 80.2% on Terminal-Bench while balancing latency and performance.
The benchmark scores are impressive. The architectural claim underneath them is the story that will matter in 2026 and beyond.
The Orchestration Theory
The question Sakana is asking is deceptively simple: what if the bottleneck in AI capability is not model quality but coordination?
Every frontier model has its own strengths and its own failure modes. Claude is strong at debugging and careful code review. GPT-5.5 is strong at planning and terminal-heavy execution. Gemini 3.1 Pro has its own strengths in scientific and multimodal reasoning. No single model is the best choice for every component of a complex task.
A human expert team works the same way. A senior engineer who needs to ship a product does not do the architecture, the frontend, the backend, the security audit, and the code review with equal capability. They route work to specialists. The quality of the output depends on the quality of the routing as much as the quality of any individual specialist.
Sakana's TRINITY and Conductor papers, both accepted at ICLR 2026, formalize this intuition. TRINITY uses a lightweight evolved coordinator that assigns Thinker, Worker, and Verifier roles to different LLMs across multiple turns. Asking one model to think through a problem, generate a solution, and verify its own work forces three tasks to compete for the same model's attention and capability. TRINITY instead routes each role to the model best suited for it. The Verifier that checks the Worker's output is not the same model that generated it. Disagreement between models is a signal rather than noise.
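A rough way to picture the Thinker, Worker, and Verifier split is a loop in which three separate callables stand in for three separate models. This is only a sketch of the role separation, not the TRINITY algorithm itself: the stub models, the `OK` / `REJECT` convention, and `run_roles` are assumptions made for illustration.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # a model is reduced to "prompt in, text out"


def run_roles(task: str, roles: Dict[str, Model], max_turns: int = 3) -> str:
    """Plan once, then alternate Worker drafts and Verifier checks."""
    plan = roles["thinker"](task)
    feedback, draft = "", ""
    for _ in range(max_turns):
        draft = roles["worker"](f"{task}\nPlan: {plan}\nFeedback: {feedback}")
        verdict = roles["verifier"](f"{task}\nDraft: {draft}")
        if verdict.startswith("OK"):
            return draft
        feedback = verdict  # disagreement is fed back as a signal
    return draft


# Hypothetical stub models; real ones would be calls to different LLMs.
def thinker(prompt: str) -> str:
    return "check the loop bounds"


def worker(prompt: str) -> str:
    return "fixed" if "off by one" in prompt else "buggy"


def verifier(prompt: str) -> str:
    return "OK" if "Draft: fixed" in prompt else "REJECT: off by one"


result = run_roles(
    "repair the pagination loop",
    {"thinker": thinker, "worker": worker, "verifier": verifier},
)
print(result)
```

The point of the design is that the verifier never grades its own draft, so a rejection carries information the worker did not already have.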
The Conductor extends this through reinforcement learning. Instead of hand-coding the workflow, the coordinator learns to write natural-language coordination strategies: which worker should handle which subtask, which outputs should be shared, which communication topology should be used, and how the final answer should be synthesized.
Fugu turns that research into a product. The regular Fugu model uses a fast selection head to pick a worker model for an input, keeping latency close to a direct frontier-model call. Fugu Ultra uses deeper multi-agent workflows when answer quality matters more than speed. One is the everyday router. The other is the accuracy-first orchestrator.
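The article does not say how the fast selection head is built. One generic way to think about such a head is a scorer that rates every worker for a given input and picks the top one in a single cheap step. The sketch below assumes a plain linear scorer; the feature vector, the weights, and the worker names are hypothetical.

```python
from typing import Dict, List


def select_worker(features: List[float], heads: Dict[str, List[float]]) -> str:
    """Score each worker with a dot product and return the highest scorer."""
    scores = {
        name: sum(w * x for w, x in zip(weights, features))
        for name, weights in heads.items()
    }
    return max(scores, key=scores.get)


# Hypothetical two-feature input: [looks like debugging, looks like planning]
heads = {
    "worker_a": [0.9, 0.1],
    "worker_b": [0.2, 0.8],
}

print(select_worker([1.0, 0.0], heads))
print(select_worker([0.0, 1.0], heads))
```

Because the selection is one scoring pass followed by one worker call, the added latency over calling a frontier model directly stays small, which matches the tradeoff the regular Fugu model is described as making.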
This matters because Fugu improves as the underlying models improve. When a new frontier model ships publicly, Sakana says it expects to spend roughly two weeks training and evaluating updated Fugu models before rollout. The orchestrator does not need to become the best model at every task. It needs to know which model is best for each task.
The Infrastructure That Was Already Being Built
Fugu did not arrive in a vacuum. The infrastructure for model orchestration has been assembling itself throughout 2026.
The Sakana technical report names Claude Code, Codex, and OpenCode as coding-assistant environments used in its end-to-end training trajectories. That detail matters. The report is not treating model capability as a raw chat score. It is treating capability as something that emerges inside an agent harness: repository context, tool calls, code execution, editing loops, environment feedback, and task completion across multiple turns.
That is the same direction the rest of the agent ecosystem has been moving. LangGraph, CrewAI, the OpenAI Agents SDK, Claude Code, Codex, Cursor agents, and MCP-based workflows all share the same broad premise: the model is no longer the whole product. The surrounding system decides what context the model sees, which tools it can call, how failures are detected, and when another model or agent should take over.
Fugu's contribution is that the orchestration layer is itself learned. A rule-based router is only as good as the rules the developer wrote. A learned router is only as good as its training data, but it can generalize to task types that the rules never anticipated, and it can improve as the pool of underlying models changes.
That is the distinction. Most agent frameworks coordinate tools and agents because the developer told them how. Fugu coordinates because the routing behavior is part of the trained system.
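The difference between the two kinds of router can be shown at toy scale. The sketch below is an illustration only: the rule table, the word-count "training", and the worker names are hypothetical, and a real learned router would be far richer than word counts.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

RULES = {"bug": "debugger", "plan": "planner"}  # hypothetical developer-written rules


def rule_route(text: str, default: str = "generalist") -> str:
    """Route only when a hard-coded keyword appears."""
    words = text.lower().split()
    for keyword, worker in RULES.items():
        if keyword in words:
            return worker
    return default


def train_router(examples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count which words co-occur with which worker in labeled examples."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for text, worker in examples:
        counts[worker].update(text.lower().split())
    return counts


def learned_route(text: str, counts: Dict[str, Counter], default: str = "generalist") -> str:
    """Pick the worker whose training words overlap most with the input."""
    words = text.lower().split()
    scores = {worker: sum(c[w] for w in words) for worker, c in counts.items()}
    best = max(scores, key=scores.get, default=default)
    return best if scores.get(best, 0) > 0 else default


counts = train_router([
    ("fix the bug in the parser", "debugger"),
    ("parser crash with traceback", "debugger"),
    ("plan the migration steps", "planner"),
    ("outline a rollout schedule", "planner"),
])

query = "traceback after crash"
print(rule_route(query))             # no keyword matches, so the default is used
print(learned_route(query, counts))  # overlap with debugging examples decides
```

The rule table never mentions "traceback" or "crash", so it falls back to the default. The trained router has seen those words next to debugging work and routes accordingly, and it is still limited by what its examples covered.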
Why Japan Built This
To understand why Sakana built an orchestrator rather than another giant model, you have to understand Japan's position in the AI race.
Sakana AI is a Tokyo-based AI lab built around evolution and collective intelligence. Axios reported in November 2025 that Sakana raised about USD 135 million in Series B funding at a USD 2.65 billion post-money valuation. That makes it one of Japan's most important AI startups, but it is still not operating with the same compute base as OpenAI, Anthropic, Google, Meta, or the largest Chinese labs.
The decision to build an orchestrator rather than a frontier monolith is not a limitation. It is a strategy that converts the constraint into a structural advantage.
A foundation model lab must train, maintain, and iterate on the model itself. The compute cost is continuous. The data requirement is enormous. The capability ceiling is determined by training budget. Sakana's orchestrator has a different ceiling because its capability is determined by the frontier models it can route through, which improve independently.
This is the asymmetric strategy: achieve frontier performance without frontier training compute by routing through the frontier rather than replicating it.
The pufferfish name carries the right analogy. Fugu the fish is poisonous if handled wrong and extraordinary if handled correctly. The dish requires a licensed chef who knows exactly which parts to use and which to discard. Sakana's system is the chef: the coordinator that knows which model to use for which part of the task, and that keeps apart the pieces that would be dangerous or wasteful if combined carelessly.
The Honest Limits
The orchestration approach has specific failure modes that the benchmark scores do not reveal.
The most significant is dependency. Fugu's pool does not include Claude Fable 5 or Claude Mythos Preview because neither is publicly accessible, according to Sakana's own benchmark notes. The system's ceiling is determined by the publicly accessible frontier, which is real and capable, but is not the same as the actual frontier.
The access limits matter too. Sakana's product page says Fugu is not yet available in the EU or EEA while the company works toward GDPR and EU-specific compliance. That is not a footnote for European developers. It is the difference between an architecture being interesting and a product being usable.
The latency cost is real. Routing a query through a deeper workflow adds network round-trips, token generation time, and coordination overhead. Fugu is described as the low-latency default. Fugu Ultra is described as answer-quality-first. That is the accuracy-latency tradeoff every multi-agent system makes, and it becomes more visible under the time constraints of interactive work.
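A back-of-the-envelope model makes the tradeoff concrete. The sketch below assumes that stages run one after another, that calls inside a stage run in parallel, and that each stage adds a fixed coordination overhead. Every number is hypothetical and none comes from Sakana.

```python
from typing import List


def workflow_latency(stages: List[List[float]], overhead_per_stage: float = 0.0) -> float:
    """Sum over sequential stages of (slowest parallel call + coordination overhead)."""
    return sum(max(stage) + overhead_per_stage for stage in stages)


# One direct call taking 2.0 seconds (hypothetical).
direct = workflow_latency([[2.0]])

# Plan, then two parallel workers, then a verification pass (hypothetical timings).
deep = workflow_latency([[2.0], [3.0, 1.5], [1.0]], overhead_per_stage=0.25)

print(direct, deep)
```

Under these assumptions each additional sequential stage adds at least one full model call to the wait, which is why deeper workflows are easier to justify for batch work than for interactive sessions.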
The cost structure is unusual. Fugu Ultra is priced at USD 5 per million input tokens and USD 30 per million output tokens, with higher rates for contexts above 272,000 tokens. Sakana says it does not stack model fees: when multiple agents are active, the user pays one rate based on the top-tier model involved. That makes the product simpler to buy than a hand-built multi-model system, but the economics still depend on the orchestrator avoiding unnecessary work.
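The listed rates translate into per-request cost with simple arithmetic. The sketch below uses only the two prices quoted above. It assumes the 272,000-token threshold applies to the input token count, and it refuses to price longer contexts because the higher rates are not stated here. The function name and the example token counts are hypothetical.

```python
INPUT_USD_PER_MILLION = 5.0
OUTPUT_USD_PER_MILLION = 30.0
LONG_CONTEXT_THRESHOLD = 272_000  # assumed to apply to input tokens


def fugu_ultra_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the standard Fugu Ultra rates quoted in the article."""
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        raise ValueError("long-context rates are not given in the article")
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# 200,000 input tokens and 10,000 output tokens:
# 200,000 * 5 / 1e6 = 1.00 USD, plus 10,000 * 30 / 1e6 = 0.30 USD
print(fugu_ultra_cost_usd(200_000, 10_000))  # 1.3
```

Output tokens cost six times as much as input tokens, so a workflow that generates long intermediate drafts is where the bill grows fastest if those tokens are billed.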
The benchmark provenance matters too. Sakana published its own benchmark scores. The baseline scores are provider-reported where available. SWE-bench Pro is also the benchmark TheQuery has already covered as flawed, with verifier reliability issues and a git-history loophole. Fugu Ultra's 73.7% is an important signal. It is not the final word.
What the Industry Is Actually Converging On
Set the benchmark caveats aside and look at what several separate teams arrived at independently.
Sakana built a learned coordinator that routes tasks to frontier model specialists. Agent frameworks like LangGraph and CrewAI built developer-facing ways to coordinate tools, memory, agents, and subtasks. Claude Code, Codex, Cursor, and other coding environments wrap models in scaffolds that manage repositories, execution, retries, and feedback.
These are not different trends. They are the same architectural conclusion arrived at from different starting points. The industry is converging on the position that a capable coordinator routing between specialist models can outperform a single generalist model on complex multi-step tasks. Real-world tasks are multi-step and heterogeneous, and they benefit from specialization in the same way human expert teams do.
The specific contribution Sakana makes to this convergence is the learned coordinator. Every other orchestration system in common use today routes mostly through rules, prompts, or developer-defined workflows. Fugu routes through a trained model that has learned when to delegate, how to delegate, and when to synthesize.
That is a small but significant distinction. A rule-based router follows your map. A learned router builds a map while walking.
The Next AI Architecture
The Transformer architecture that Llion Jones co-authored in 2017 defined the AI era that followed it. Every model in every benchmark table in every TheQuery article (Claude, GPT-5.5, Gemini, DeepSeek, Qwen, GLM, MiniMax) is a transformer. The architecture from one paper became the substrate for the entire industry.
Sakana's TRINITY and Conductor papers are not claiming to replace the transformer. They are claiming to change what sits above it. The new layer is not a bigger model. It is a smarter coordinator, a learned routing system that treats frontier transformers as specialists and assembles them dynamically into teams that outperform any individual member.
The frame that says AI progress means bigger models and higher benchmark scores is not wrong. It is describing one layer of the stack. The layer above it (the coordination, routing, and learned orchestration layer) is what Fugu is building, and what the agent framework ecosystem is converging toward without yet having a consensus name for it.
The fish is not the dangerous part. The chef is.
Sources:
- Sakana Fugu: One Model to Command Them All - Sakana AI
- Sakana Fugu - Sakana AI
- Sakana Fugu Technical Report - arXiv
- Axios Pro Rata, November 18, 2025 - Axios