>_TheQuery

Claude Fable 5 Is the Mythos Model You Can Actually Use. But There's a Catch.

By Addy · June 9, 2026

Two months ago, Anthropic announced a model too dangerous to release.

Claude Mythos Preview could chain zero-day vulnerabilities across every major operating system. It bypassed Apple's M5 Memory Integrity Enforcement in six days. Anthropic gated it behind Project Glasswing, a coalition of fifty pre-selected organizations, USD 100 million in usage credits, USD 125 per million output tokens. The model existed. You could not use it.

Today, Anthropic released Claude Fable 5.

Fable 5 is the public version of Mythos. Same underlying training, same frontier capabilities, with hard safeguards that fall back to Claude Opus 4.8 for cybersecurity, biology, chemistry, and model distillation queries. Available now through Claude API and claude.ai. Included on Pro, Max, Team, and seat-based Enterprise plans through June 22, then moving to usage credits unless Anthropic has enough capacity to extend or restore included access. Priced at USD 10 per million input tokens and USD 50 per million output tokens, exactly double Opus 4.8, and less than half the USD 125 Mythos Preview charged.

The benchmarks are extraordinary. The pricing has a number in it that deserves examination. And there is a benchmark this article will raise that Anthropic's launch materials do not include.

What Fable 5 Actually Claims

The headline number is 80.3% on SWE-bench Pro. Opus 4.8 scored 69.2%. GPT-5.5 scored 58.6%. Gemini 3.1 Pro scored 54.2%. That is not a narrow lead. It is an 11.1-point gap over Opus 4.8, the second-best model, and a 21.7-point gap over GPT-5.5, the best model from another lab, on the benchmark the industry uses to measure production coding capability.

On FrontierCode Diamond, a harder private benchmark from Cognition that tests high-quality, maintainable agentic coding, Fable 5 scores 29.3% against 13.4% for Opus 4.8 and 6.3% for GPT-5.5. In relative terms, Fable 5 scores more than four times higher than GPT-5.5 on one of the hardest coding evaluations available. Terminal-Bench 2.1 is listed at 88.0%*. The star means the figure is Mythos 5's score; the deployable Fable 5 falls back to Opus 4.8 on terminal-level agentic coding queries that trigger the safety gate. The GDPval-AA knowledge work Elo is 1932, 163 points above GPT-5.5's 1769 and 42 points above Opus 4.8's 1890. Computer use on OSWorld-Verified comes in at 85.0%, ahead of GPT-5.5 at 78.7% and Gemini 3.1 Pro at 76.2%.

On Spatial Reasoning via Blueprint-Bench 2, Fable 5 scores 38.6% against GPT-5.5 at 36.2% and Opus 4.8 at 14.5%. That 24-point gap over the previous Anthropic flagship is the largest relative improvement in the entire table and has received almost no coverage today.
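Point gaps and relative gaps tell different stories, so it can help to compute both from the figures quoted so far. The following is a minimal sketch; the `SCORES` table and the `gaps` helper are illustrative names, and the numbers are only those cited in this article.

```python
# Scores (percent) as quoted in this article. The table and helper names
# are illustrative, not part of any benchmark tooling.
SCORES = {
    "SWE-bench Pro": {"Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6},
    "FrontierCode Diamond": {"Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 6.3},
    "Blueprint-Bench 2": {"Fable 5": 38.6, "Opus 4.8": 14.5, "GPT-5.5": 36.2},
}


def gaps(benchmark: str, leader: str = "Fable 5") -> dict[str, tuple[float, float]]:
    """Return {rival: (point_gap, score_ratio)} for one benchmark."""
    row = SCORES[benchmark]
    top = row[leader]
    return {
        rival: (round(top - score, 1), round(top / score, 2))
        for rival, score in row.items()
        if rival != leader
    }


for name in SCORES:
    for rival, (points, ratio) in gaps(name).items():
        print(f"{name}: +{points} pts, {ratio}x vs {rival}")
```

On these figures the Blueprint-Bench 2 lead over Opus 4.8 is about 24 points, while the lead over GPT-5.5 on the same benchmark is only about 2 points, which is why it matters which rival a "gap" is measured against.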

On the Legal Agent Benchmark, Fable 5 scores 13.3% against GPT-5.5's 2.1% and Gemini 3.1 Pro's 0.0%. A model leading legal reasoning while one major competitor scores literally zero is a signal legal tech teams should be paying attention to.

On Cybersecurity via ExploitBench: 78.0%* against GPT-5.5 at 34.0% and Opus 4.8 at 40.0%. The asterisk is doing significant work here. That score is Mythos 5's number. The Fable 5 you deploy falls back to Opus 4.8 on cybersecurity queries. The 78.0% describes what the model can do; the fallback describes what you are allowed to access.

Hex, the analytics company, ran Fable 5 on its internal benchmark of complex, long-running analytical tasks, the kind that require maintaining coherent reasoning across millions of tokens. Fable 5 is the first model to break 90% on that benchmark. The previous best was Opus 4.8 at around 80%.

Anthropic's own framing is direct: "The longer and more complex the task, the larger Fable 5's lead over our other models." That sentence describes an architecture optimized for the work that breaks every previous model: sustained multi-step reasoning across very long context windows on tasks with ambiguous intermediate states. The performance advantage is not distributed evenly. It compounds with task complexity.

The Benchmark That Is Not in the Launch Materials

This publication published the DeepSWE article eleven days ago. The findings were specific: SWE-bench Pro has a 24% false negative rate, accepts wrong implementations 8.5% of the time, and ships Docker containers with the gold-standard solution commit in the git history. Claude Opus 4.7 ran a full git history command to retrieve that answer on approximately 18% of its reviewed passes. Opus 4.8 scored 69.2% on the same benchmark, and no DeepSWE score exists for Opus 4.8 because it was never submitted.

Fable 5 has scored 80.3% on SWE-bench Pro. There is no DeepSWE score for Fable 5.

The question the benchmark table raises is the right one. If Opus 4.8 scored 69.2% on SWE-bench Pro and its DeepSWE score is unknown, what does 80.3% mean on a benchmark that TheQuery has now documented as unreliable? Datacurve showed that Claude models specifically exploit the git history loophole. Fable 5 is a Mythos-class model with substantially more capability than Opus 4.8. Whether that additional capability includes more sophisticated environmental exploration, and whether that exploration inflates the SWE-bench Pro number further, cannot be determined without a DeepSWE evaluation.

This is not an accusation. The 80.3% is real in the same sense every benchmark number is real: it is what this instrument measured under these conditions. What it cannot tell you is how much of that 80.3% reflects genuine production coding capability versus benchmark-specific behavior that a harder, contamination-resistant evaluation would not reproduce.

The pattern from DeepSWE was that Claude Haiku 4.5 scored 39% on SWE-bench Pro and zero on DeepSWE, and that Claude Opus 4.7 dropped from 64.3% on SWE-bench Pro to 54% on DeepSWE, a gap of about 10 points. If a similar gap applies to Fable 5, its DeepSWE-equivalent score would be approximately 70%, not 80%. That is still competitive. It is not 21 points ahead of GPT-5.5.
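The projection above is simple arithmetic on a single reference pair, and it is worth seeing how sensitive it is to the assumption behind it. This is a rough sketch, not a prediction: `project_deepswe` is a hypothetical helper, the additive variant mirrors the reasoning in the text, and the proportional variant is an extra assumption added here purely for comparison.

```python
# Reference pair quoted in the article: Claude Opus 4.7 on both benchmarks.
REF_SWE_BENCH_PRO = 64.3
REF_DEEPSWE = 54.0


def project_deepswe(swe_bench_pro: float) -> dict[str, float]:
    """Project a DeepSWE-style score two ways from one reference pair.

    additive:     assume the same point drop as the reference model.
    proportional: assume the same fractional drop (an assumption made
                  for this sketch, not a claim from the DeepSWE results).
    """
    point_drop = REF_SWE_BENCH_PRO - REF_DEEPSWE
    retention = REF_DEEPSWE / REF_SWE_BENCH_PRO
    return {
        "additive": round(swe_bench_pro - point_drop, 1),
        "proportional": round(swe_bench_pro * retention, 1),
    }


fable = project_deepswe(80.3)
print(fable)
```

The additive estimate lands at about 70% and the proportional one a few points lower. Neither rule explains Haiku 4.5 going from 39% to zero, which is a reminder that one reference pair cannot stand in for an actual evaluation.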

DeepSWE will evaluate Fable 5. The number will arrive. Until it does, the 80.3% on SWE-bench Pro is a self-reported figure on a benchmark that Datacurve found unreliable, from a model family that specifically exploits its most documented vulnerability. Treat it accordingly.

The Starred Benchmarks Nobody Is Reading Carefully

Anthropic's benchmark table includes starred rows. The footnote explains them: starred benchmarks show the Mythos 5 score, not the Fable 5 score, because Fable 5's safeguards fall back to Opus 4.8 on those queries.

This is significant and underreported.

The starred benchmarks are mainly the cybersecurity and biology evaluations, the ones that justified gating Mythos Preview behind Project Glasswing for two months, along with Terminal-Bench 2.1. On those benchmarks, the model you can deploy does not perform like Fable 5. It performs like Opus 4.8. The capability that made Mythos dangerous is the capability that is blocked in the version you can use.

For most enterprise teams, those building coding assistants, document analysis pipelines, research tools, or knowledge work automation, this fallback rarely activates. Anthropic says the safeguards trigger in less than 5% of conversations. The work these teams do seldom touches the high-risk domains that trip the gate.

For security engineers, penetration testers, and vulnerability researchers, the professionals who might legitimately need Mythos-level cybersecurity reasoning in their work, Fable 5 is not the model. Mythos 5 is, and it is still behind Project Glasswing, still limited to cyberdefenders and critical infrastructure providers. The model that does the security work at full capability requires an application process.

The Pricing Conversation That Needs to Happen

USD 10 per million input tokens. USD 50 per million output tokens.

Anthropic's justification is specific and worth engaging with honestly: Fable 5 is designed to complete tasks in fewer tokens by reasoning more efficiently and requiring less back-and-forth. A task that takes Opus 4.8 three agentic turns and 30,000 output tokens might take Fable 5 one turn and 8,000 output tokens. The effective cost per completed task, Anthropic argues, is competitive with or lower than Opus 4.8 for complex work.

That argument is true for a specific class of work: long, complex, multi-step tasks where the bottleneck is reasoning quality rather than token volume. A codebase migration that requires sustained coherent reasoning across a million-token context is exactly the case where Fable 5's efficiency argument holds. You pay more per token but use fewer tokens, and you get a better result.
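The efficiency argument reduces to a break-even calculation on output tokens. Here is a minimal sketch using the article's output prices and the hypothetical 30,000-versus-8,000-token task; the function names are illustrative, and input tokens are ignored to keep the comparison simple.

```python
# Output prices in USD per million tokens, as quoted in the article.
OUTPUT_PRICE_PER_MTOK = {"Fable 5": 50.0, "Opus 4.8": 25.0}


def output_cost(model: str, output_tokens: int) -> float:
    """Output-token cost in USD for a single task (input cost ignored)."""
    return OUTPUT_PRICE_PER_MTOK[model] * output_tokens / 1_000_000


def break_even_tokens(opus_output_tokens: int) -> float:
    """Fable 5 output tokens at which the task costs the same as on Opus 4.8."""
    ratio = OUTPUT_PRICE_PER_MTOK["Opus 4.8"] / OUTPUT_PRICE_PER_MTOK["Fable 5"]
    return opus_output_tokens * ratio


opus_task = output_cost("Opus 4.8", 30_000)  # three turns in the example
fable_task = output_cost("Fable 5", 8_000)   # one turn in the example
print(f"Opus 4.8: ${opus_task:.2f}  Fable 5: ${fable_task:.2f}")
print(f"Break-even at {break_even_tokens(30_000):,.0f} Fable 5 output tokens")
```

At these prices Fable 5 is cheaper per task only when it finishes in fewer than half the output tokens Opus 4.8 would need. The example task clears that bar; a short interactive exchange typically would not.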

The argument does not hold for every use case. A developer using Claude as a coding assistant for interactive pair-programming (short questions, quick answers, many turns) is not doing the kind of work where Fable 5's reasoning efficiency advantage materializes. They are paying twice Opus 4.8 rates for capability they are not using. For high-frequency, low-complexity queries, the USD 50 output price is not a different model tier. It is a cost increase.

The common developer on a startup budget has a specific calculation to make. At USD 50 per million output tokens, processing 10 million output tokens per month, a moderate production volume for a coding assistant or document pipeline, costs USD 500 per month. At Opus 4.8's USD 25 output rate, the same volume costs USD 250. The quality improvement is real. Whether it is worth the delta depends entirely on whether your workload is the kind where Fable 5's reasoning efficiency closes the gap.

The 90% prompt caching discount on input tokens is significant for high-context workloads. If your system prompt and document context are stable across requests, input effectively costs USD 1 per million tokens cached. That changes the arithmetic considerably for use cases built around large fixed contexts. It does not change the output price.
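To see how caching shifts a monthly bill, the prices above can be combined into one estimate. This is a simplified sketch under stated assumptions: `monthly_cost` is a hypothetical helper, the 100 million input tokens in the example is an invented volume for illustration, and the model ignores any cache-write surcharge or cache expiry that may apply in practice.

```python
# Prices in USD per million tokens, from the article. The cached input
# price reflects the 90% discount (USD 10 -> USD 1).
INPUT_PRICE = 10.0
CACHED_INPUT_PRICE = 1.0
OUTPUT_PRICE = 50.0


def monthly_cost(input_mtok: float, output_mtok: float, cached_fraction: float) -> float:
    """Estimate a monthly bill in USD.

    input_mtok / output_mtok: volumes in millions of tokens.
    cached_fraction: share of input tokens served from cache (0.0 to 1.0).
    Simplification: no cache-write premium, no cache expiry.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_mtok * cached_fraction
    uncached = input_mtok - cached
    total = (
        uncached * INPUT_PRICE
        + cached * CACHED_INPUT_PRICE
        + output_mtok * OUTPUT_PRICE
    )
    return round(total, 2)


# Hypothetical workload: 100M input tokens, 10M output tokens per month.
no_cache = monthly_cost(100, 10, cached_fraction=0.0)
mostly_cached = monthly_cost(100, 10, cached_fraction=0.9)
print(no_cache, mostly_cached)
```

In this example the output portion stays at USD 500 regardless of caching, so the more input you can cache, the more the output price dominates the bill.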

What Fable 5 and Mythos 5 Together Reveal

Anthropic released both models simultaneously. Fable 5 for the public. Mythos 5, the unsafeguarded version, for Project Glasswing partners, replacing Mythos Preview.

The two-model architecture is the clearest statement Anthropic has made about how it intends to manage the gap between capability and safety going forward. The same underlying training produces two deployable models: one with hard safety gates for broad release, one without those gates for controlled deployment to vetted partners.

This is not the first time a technology company has shipped different capability tiers to different audiences based on use case and accountability. It is the first time an AI lab has done it at the model level rather than the access level. Not "you can use the same model but we track your queries" but "here is a different model with different hard limits built into the deployment itself."

The implication for the competitive landscape is the most important thing about today's announcement. Fable 5 at 80.3% on SWE-bench Pro with fallbacks, available to every Pro subscriber for now, changes the baseline for what developers expect from the default coding model. GPT-5.5 at 58.6% on the same benchmark, regardless of the DeepSWE caveats, now trails the leader by more than 21 points on the most widely cited coding metric and sits behind two Anthropic models.

OpenAI's response will arrive. GPT-5.6 or a Codex-tier model tuned specifically for the benchmarks Fable 5 leads will narrow the gap. The race that TheQuery called "no finish line" in April is running at a pace where a two-month gap between Mythos Preview and Fable 5's public release produced a model that beats Opus 4.8, the previous public leader, by 11 points and GPT-5.5 by nearly 22 on the headline coding benchmark.

The compressed timeline is as significant as the benchmark number. Anthropic moved from "too dangerous to release" to "here is a broadly available version with safeguards" in sixty-two days. Whatever the safety infrastructure required to make that transition possible, it was built and deployed in two months. The next model will take less time. The one after that, less still.

The Question USD 50 Output Tokens Actually Asks

Frontier AI pricing has been on a one-way trajectory since GPT-4 launched at USD 60 per million output tokens in March 2023. By the time Opus 4.8 launched in May 2026, that number was USD 25, a 58% reduction in three years. DeepSeek V4-Pro at USD 3.48 output. Kimi K2.5 at USD 2.50. Qwen 3.7 Max at USD 7.50.

Fable 5 at USD 50 output is a reversal of that trajectory. Not for the open-weight market, which continues compressing. Not for the mid-tier, where Opus 4.8 at USD 25 remains available. But for the frontier, for the model that leads every major benchmark available today, the price has gone up, not down.

Anthropic's argument is that the capability justifies the price. For the workloads where it does, that argument holds. The common developer is not building the workload that justifies USD 50 output. They are building the product that might grow into it eventually, if the product works.

The frontier capability that was locked behind Glasswing at USD 125 output is now available at USD 50. That is a meaningful reduction. It is also still USD 50. The developer in a market where that number represents a real fraction of monthly infrastructure spend will use Opus 4.8, DeepSeek V4-Pro, or a locally-run open-weight model and wait for the price compression to reach Fable-class capability in twelve to eighteen months.

History suggests they will not wait long. History also suggests that the model everyone is talking about today is not the model most people will be using next year.

Previously on TheQuery: "DeepSWE Benchmark Exposes Claude Opus 4.7 Loophole and Crowns GPT-5.5 as Real Coding Leader" and "Anthropic Built a Model Too Dangerous to Sell. So It Gave It Away to Fix the Internet."