>_TheQuery

Claude Opus 4.8

Models

Anthropic's May 2026 Opus upgrade for long-horizon agentic coding, dynamic workflows, effort control, and faster premium inference.

A senior engineer who already knew the codebase came back with a bigger team, better checklists, and a faster emergency lane: not a new profession, but a much stronger way to do the hard jobs.

Claude Opus 4.8 is Anthropic's May 2026 upgrade to the Opus 4.x family. It builds on Claude Opus 4.7 and succeeds it as Anthropic's most capable generally available model for complex reasoning, long-horizon agentic coding, high-autonomy work, and professional knowledge tasks.

Core profile

Claude Opus 4.8 keeps the premium Opus pricing tier: USD 5 per million input tokens and USD 25 per million output tokens for regular usage. Fast mode is available as a research preview on the Claude API at USD 10 per million input tokens and USD 50 per million output tokens, with up to 2.5x higher output speed.
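A quick worked example makes the per-token prices concrete. This is a sketch that uses only the prices stated above; it ignores prompt-caching discounts and any other pricing adjustments, and the 200k-in / 8k-out request size is a hypothetical agent turn, not a measured workload.

```python
# Per-million-token prices in USD, taken from the text above.
PRICES_PER_MTOK = {
    "regular": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},  # fast mode (research preview, Claude API)
}


def request_cost(input_tokens: int, output_tokens: int, mode: str = "regular") -> float:
    """USD cost of one request; ignores prompt caching and any other discounts."""
    price = PRICES_PER_MTOK[mode]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical agent turn: 200k tokens of context in, 8k tokens out.
regular = request_cost(200_000, 8_000)
fast = request_cost(200_000, 8_000, mode="fast")
print(f"regular: ${regular:.2f}  fast: ${fast:.2f}")  # regular: $1.20  fast: $2.40
```

Because fast mode doubles both the input and output rates, any given request costs exactly twice as much in fast mode. The premium buys speed, not a different level of capability.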

The model supports a 1-million-token context window by default on the Claude API, Amazon Bedrock, and Google Cloud Vertex AI, and a 200k-token context window on Microsoft Foundry. It also supports up to 128k output tokens per response, adaptive thinking, and the same broad tool surface as Claude Opus 4.7.
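A pre-flight check shows how these limits interact when planning a long agent run. This is a sketch under stated assumptions: the platform keys are hypothetical labels, "1 million" and "200k" are read as 1,000,000 and 200,000 tokens, and the check assumes the context window must hold the prompt plus the requested output.

```python
# Context-window limits per platform in tokens, as stated in the text above.
# The dictionary keys are hypothetical labels, not official platform identifiers.
CONTEXT_WINDOW = {
    "claude_api": 1_000_000,
    "amazon_bedrock": 1_000_000,
    "google_vertex_ai": 1_000_000,
    "microsoft_foundry": 200_000,
}
MAX_OUTPUT_TOKENS = 128_000


def fits(platform: str, input_tokens: int, max_output_tokens: int) -> bool:
    """Check a planned request against the stated limits.

    Assumption: the context window must hold the prompt plus the requested output.
    """
    if max_output_tokens > MAX_OUTPUT_TOKENS:
        return False
    return input_tokens + max_output_tokens <= CONTEXT_WINDOW[platform]


print(fits("claude_api", 850_000, 128_000))        # True
print(fits("microsoft_foundry", 150_000, 64_000))  # False
```

The same 150k-token prompt that fits easily on the Claude API overflows the 200k window on Microsoft Foundry once a large output budget is reserved, so long-context agents may need per-platform compaction settings.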

What changed from Opus 4.7

Opus 4.8 is not a brand-new generation. It is a point release that improves the parts of Opus 4.7 that matter most in production agent systems: long-context handling, compaction recovery, tool triggering, effort calibration, and honesty about flawed results.

The launch also introduced several product and platform changes around the model:

  • Dynamic Workflows in Claude Code: Claude can break a large task into subtasks, run tens to hundreds of parallel subagents, verify their outputs, and merge the results into one coordinated answer.
  • Effort control: Users can choose how much effort Claude spends on a response, trading latency and token usage against quality.
  • Fast mode: Developers can pay a premium for lower latency when the same Opus 4.8 capability needs to run faster.
  • Mid-conversation system messages: Developers can update instructions during a long-running task without restating the entire system prompt or breaking prompt cache behavior.
  • Lower prompt-cache minimum: Opus 4.8 lowers the minimum cacheable prompt length to 1,024 tokens, making more agent loops cache-friendly.
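The text does not say how Claude Code implements Dynamic Workflows internally. The fan-out, verify, and merge shape it describes can be sketched generically with `asyncio`. In this sketch, `run_subagent`, `verify`, and the `max_parallel` cap are hypothetical placeholders, not Claude Code or Anthropic APIs.

```python
import asyncio


# Hypothetical stand-ins: a real harness would call a model or run tools here.
async def run_subagent(subtask: str) -> str:
    await asyncio.sleep(0)  # placeholder for a model call
    return f"result for {subtask}"


def verify(subtask: str, result: str) -> bool:
    # Placeholder check; a real verifier might run tests or a review pass.
    return result.endswith(subtask)


async def dynamic_workflow(task: str, subtasks: list[str], max_parallel: int = 8) -> dict:
    limit = asyncio.Semaphore(max_parallel)  # cap how many subagents run at once

    async def bounded(subtask: str) -> tuple[str, str]:
        async with limit:
            return subtask, await run_subagent(subtask)

    # Fan out: run subagents concurrently; gather preserves input order.
    results = await asyncio.gather(*(bounded(s) for s in subtasks))
    # Verify each output, keeping rejected subtasks visible instead of dropping them silently.
    checked = [(s, r, verify(s, r)) for s, r in results]
    merged = "\n".join(r for _, r, ok in checked if ok)
    rejected = [s for s, _, ok in checked if not ok]
    # Merge into one coordinated answer.
    return {"task": task, "merged": merged, "rejected": rejected}


out = asyncio.run(dynamic_workflow("migrate repo", ["module a", "module b", "module c"]))
print(out["merged"])
print(out["rejected"])
```

Returning the rejected subtasks alongside the merged answer is a deliberate choice: a coordinator that quietly drops failed branches would reproduce the silent-false-success problem at the workflow level.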

Official benchmark signals

Anthropic reported measurable gains over Opus 4.7 on several launch benchmarks:

Benchmark                Opus 4.7    Opus 4.8
SWE-bench Verified       87.6%       88.6%
SWE-bench Pro            64.3%       69.2%
Terminal-Bench 2.1       66.1%       74.6%
MCP-Atlas                77.3%       82.2%
GDPval-AA (Elo rating)   1753        1890
USAMO 2026               69.3%       96.7%
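Raw percentage-point gains can understate progress near a benchmark's ceiling. A small sketch over the launch numbers above also computes how much of the remaining headroom to 100% each gain closes. This "headroom closed" measure is an illustrative framing, not a metric Anthropic reports, and it inherits any reliability problems in the underlying benchmarks.

```python
# Launch scores from the table above as (Opus 4.7, Opus 4.8), in percent.
SCORES = {
    "SWE-bench Verified": (87.6, 88.6),
    "SWE-bench Pro": (64.3, 69.2),
    "Terminal-Bench 2.1": (66.1, 74.6),
    "MCP-Atlas": (77.3, 82.2),
    "USAMO 2026": (69.3, 96.7),
}
GDPVAL_ELO = (1753, 1890)  # an Elo rating, so only the raw difference is meaningful


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, share of the remaining headroom to 100% that was closed)."""
    points = new - old
    return points, points / (100.0 - old)


for name, (old, new) in SCORES.items():
    points, closed = gains(old, new)
    print(f"{name}: {points:+.1f} pts, {closed:.0%} of headroom closed")
print(f"GDPval-AA: {GDPVAL_ELO[1] - GDPVAL_ELO[0]:+d} Elo")
```

For example, SWE-bench Verified moves by only one point, but that closes roughly 8% of the gap Opus 4.7 left, while Terminal-Bench 2.1 closes about a quarter of its gap.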

The most practical improvement may be behavioral rather than a headline benchmark gain. Anthropic reports that Opus 4.8 is much less likely than Opus 4.7 to miss important code flaws or to report flawed results uncritically. For teams running coding agents overnight or across multi-day workflows, that kind of honesty matters: silent false success, where an agent reports a task as done when it is not, is one of the hardest agent failures to detect.
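Better model honesty helps, but a harness can also refuse to take any agent's "done" at face value. A minimal sketch, assuming the project has a test command that exits non-zero on failure; `independently_verified` is a hypothetical helper, not part of any Anthropic API, and the two demo commands merely stand in for a real test suite.

```python
import subprocess
import sys


def independently_verified(claimed_success: bool, test_cmd: list[str]) -> bool:
    """Accept an agent's success report only if an independent test run agrees.

    Assumption: `test_cmd` is the project's own test command and signals failure
    with a non-zero exit code.
    """
    if not claimed_success:
        return False
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # The agent claimed success but the tests disagree: a silent false success.
        tail = (result.stdout + result.stderr)[-500:]
        print("agent reported success but tests failed:", tail, file=sys.stderr)
        return False
    return True


# Demo with stand-in commands: a passing and a failing "test suite".
print(independently_verified(True, [sys.executable, "-c", "pass"]))                 # True
print(independently_verified(True, [sys.executable, "-c", "raise SystemExit(1)"]))  # False
```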

The DeepSWE caveat

Opus 4.8 launched the same day DeepSWE entered the broader AI news cycle. DeepSWE questioned SWE-bench Pro's verifier reliability and documented a git-history loophole that affected some Claude Opus 4.7 SWE-bench Pro passes. As of the Opus 4.8 launch, there was no public DeepSWE score for Opus 4.8.

That does not make the Opus 4.8 benchmark gains meaningless. It does mean the SWE-bench Pro number should be read with caution until Opus 4.8 is evaluated on a benchmark that closes the same loophole and uses stricter verifiers.

What it is good at

Claude Opus 4.8 is best suited for expensive, high-value work where weaker models tend to fail quietly: large codebase migrations, multi-file refactors, deep debugging, code review, complex enterprise agents, long-document analysis, and workflows that require repeated tool calls over hours or days.

Its strongest product story is not just that it is a better model. It is that it fits into a stronger agent harness: dynamic workflows, effort control, mid-task instruction updates, better caching, and faster premium inference all make Opus 4.8 easier to run inside long-lived systems.

Tradeoffs

The main tradeoff is still cost. Few teams should route every request to Opus 4.8. It is the escalation model: the one routed to hard tasks, risky code changes, high-value reviews, or agent workflows where the cost of failure is higher than the cost of tokens.
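The escalation pattern can be sketched as a small router. The routing criteria mirror the list above; the `Task` fields, the "cheaper model already failed" signal for hard tasks, and the tier names are hypothetical placeholders, not real model identifiers or a documented Anthropic routing API.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    risky_change: bool = False          # e.g. schema migrations, auth, production config
    high_value_review: bool = False
    long_running_agent: bool = False    # multi-hour or multi-day workflows
    cheaper_model_failed: bool = False  # assumption: a failed first attempt signals a hard task


# Placeholder tier names, not real model identifiers.
DEFAULT_MODEL = "default-model"
ESCALATION_MODEL = "escalation-model"


def choose_model(task: Task) -> str:
    """Escalate only when a failure would plausibly cost more than the extra tokens."""
    escalate = (
        task.risky_change
        or task.high_value_review
        or task.long_running_agent
        or task.cheaper_model_failed
    )
    return ESCALATION_MODEL if escalate else DEFAULT_MODEL


print(choose_model(Task("rename a local variable")))                    # default-model
print(choose_model(Task("migrate billing schema", risky_change=True)))  # escalation-model
```

Keeping the criteria as explicit boolean flags makes the routing decision auditable; a production router might instead score tasks, but the principle of paying for Opus-tier inference only where failure is expensive stays the same.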

The other tradeoff is benchmark uncertainty. Anthropic's published numbers are strong, but the coding benchmark landscape is moving quickly. Opus 4.8's real standing depends on independent evaluation on newer benchmarks such as DeepSWE and on how well it behaves inside real developer harnesses rather than isolated benchmark runs.

Last updated: May 29, 2026