DeepSeek V4 Is Almost at the Frontier. The Price Is Not.
By Addy · April 25, 2026
DeepSeek ships the way a hedge fund executes a trade. No press cycle. No staged rollout. No Jensen Huang on a stage. On the evening of April 24, 2026, the weights appeared on Hugging Face, the API went live, and the technical report dropped at essentially the same time. By the time most of the Western AI industry opened their laptops the next morning, DeepSeek V4 had already been running in production for hours.
That is not an accident. It is a strategy.
What arrived is V4-Pro and V4-Flash: two variants of a Mixture of Experts architecture built around a simple thesis. A one million token context window should be the default, not a premium feature.
The report is worth taking seriously because the architecture backing that claim is not hand-wavy. The benchmark table is worth taking seriously for the same reason. DeepSeek published enough detail that you can see both where V4 wins and where it still falls short.
What V4 Actually Is
DeepSeek-V4-Pro carries 1.6 trillion total parameters with 49 billion active per token. V4-Flash is 284 billion total with 13 billion active. Both models support a one million token context window natively. Both ship under the MIT license. Both are on Hugging Face right now.
The architectural story is the reason the context window matters.
DeepSeek combines Compressed Sparse Attention with Heavily Compressed Attention in what it describes as a hybrid attention design for long-context efficiency. In the one million token setting, the report says V4-Pro needs only 27% of the single-token inference FLOPs and 10% of the KV cache required by DeepSeek V3.2. DeepSeek says Flash pushes efficiency further still.
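Taken at face value, those ratios compound quickly at serving time. A back-of-the-envelope sketch: the 27% and 10% figures are the report's claims, but the V3.2 baseline numbers below are illustrative placeholders, not published values.

```python
# Apply the report's claimed ratios for V4-Pro at 1M-token context,
# relative to DeepSeek V3.2. Baseline values are placeholders, NOT
# published numbers; only the 0.27 and 0.10 ratios come from the report.

V32_FLOPS_PER_TOKEN = 1.0  # normalized single-token inference FLOPs for V3.2
V32_KV_CACHE_GB = 100.0    # hypothetical V3.2 KV cache footprint at 1M tokens

v4_flops = V32_FLOPS_PER_TOKEN * 0.27  # report: 27% of V3.2's FLOPs
v4_kv_gb = V32_KV_CACHE_GB * 0.10      # report: 10% of V3.2's KV cache

print(f"V4-Pro FLOPs per token (normalized): {v4_flops:.2f}")
print(f"V4-Pro KV cache at 1M tokens: {v4_kv_gb:.0f} GB "
      f"(vs {V32_KV_CACHE_GB:.0f} GB baseline)")
```

The KV cache ratio is the one that matters operationally: a 10x smaller cache is the difference between a million-token session fitting on one node or spilling across several.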
Two other design choices matter. Manifold-Constrained Hyper-Connections are there to stabilize signal flow across very deep stacks without flattening model expressivity. The Muon optimizer is there for convergence and training stability. And the model was trained with mixed FP4 and FP8 precision in the instruct checkpoints, which helps explain why the memory story is friendlier than the raw parameter counts would suggest.
This is not a small model pretending to be large. It is an enormous model built to make million-token context less theoretical than it used to be.
The Pricing, Because It Is the Story
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 |
| DeepSeek V4-Pro | $1.74 | $3.48 |
| Claude Opus 4.7 | $5.00 | $25.00 |
| GPT-5.5 | $5.00 | $30.00 |
That makes V4-Pro roughly 7x cheaper than Claude Opus 4.7 on output. V4-Flash is dramatically cheaper still.
These are not rounding errors. They describe a different economic model for AI access.
For a startup running 100 million output tokens per month, the gap between $25 per million and $3.48 per million is the difference between an expensive model and a different budget category altogether. Even before self-hosting enters the picture, DeepSeek is pricing like a company trying to turn switching into a default decision.
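The arithmetic is worth doing explicitly. A minimal sketch using the output prices from the table above, with input costs omitted for simplicity:

```python
# Monthly output-token bill at 100M output tokens, using the published
# per-million output prices from the pricing table. Input costs omitted.
prices_per_m = {
    "DeepSeek V4-Flash": 0.28,
    "DeepSeek V4-Pro": 3.48,
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
}
output_tokens_m = 100  # 100 million output tokens per month

for model, price in prices_per_m.items():
    print(f"{model}: ${price * output_tokens_m:,.2f}/month")
# At this volume: Opus 4.7 costs $2,500/month, V4-Pro $348, V4-Flash $28.
```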
The API design reinforces that strategy. DeepSeek's docs say V4 is available through both the OpenAI ChatCompletions format and the Anthropic API format. The company is making the migration path as unromantic as possible: keep the base URL pattern, change the model name, move on.
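What "unromantic migration" looks like in practice: the request body stays standard ChatCompletions JSON, and only the endpoint and model name move. The base URL and model identifier below are assumptions for the sketch, not confirmed values from DeepSeek's docs.

```python
# A ChatCompletions-shaped request body. Per DeepSeek's docs, V4 accepts
# this format as-is; the base URL and model name here are HYPOTHETICAL
# placeholders, so check the docs for the real values.
import json

BASE_URL = "https://api.deepseek.com"  # hypothetical; verify against the docs
request_body = {
    "model": "deepseek-v4-pro",        # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Review this pull request for bugs."}
    ],
    "max_tokens": 1024,
}

# The payload serializes exactly like an OpenAI ChatCompletions request;
# an existing OpenAI client only needs base_url and model swapped.
print(json.dumps(request_body, indent=2))
```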
Where V4 Wins
Competitive programming is the headline result.
DeepSeek V4-Pro-Max scores 3,206 on Codeforces. In DeepSeek's own comparison table, that puts it ahead of GPT-5.4 at 3,168 and Gemini 3.1 Pro at 3,052. For an open-weight model, that is a real frontier-level result.
On LiveCodeBench, V4-Pro-Max scores 93.5. That is higher than every model DeepSeek includes in its own table, including Gemini 3.1 Pro at 91.7 and Claude Opus 4.6 at 88.8.
On IMOAnswerBench, V4-Pro-Max scores 89.8. In DeepSeek's own table, Claude Opus 4.6 is listed at 75.3 and Gemini 3.1 Pro at 81.0. DeepSeek wins cleanly there.
On Toolathlon, which measures tool use across external systems, V4-Pro-Max scores 51.8. In DeepSeek's own table, Claude Opus 4.6 is at 47.2 and Gemini at 48.8.
And on SWE-bench Verified, V4-Pro-Max lands at 80.6, essentially tied with the Claude Opus 4.6 score DeepSeek includes in its table: 80.8. The caveat is that Anthropic has since moved on. Its official April 16, 2026 Opus 4.7 launch reported CursorBench at 70% versus 58% for Opus 4.6, a 14% lift on complex multi-step workflows with a third of the tool errors, and 3x more production tasks resolved on Rakuten-SWE-Bench. In other words, V4 is close to Claude's earlier published benchmark level, but not demonstrably at the level of Anthropic's latest generally available release across the board.
This is the part that matters most.
DeepSeek is no longer posting interesting open-model numbers that sit in a separate category from the frontier. On several coding and reasoning benchmarks, it is in the same conversation already.
Note: DeepSeek's published comparison table uses Claude Opus 4.6 for LiveCodeBench, IMOAnswerBench, Toolathlon, and SWE-bench Verified. Anthropic's April 16, 2026 Opus 4.7 launch published updated results for some other benchmarks, but not for all of these specific ones.
Where V4 Still Falls Short
The same benchmark table makes the limits visible too.
On SWE-bench Pro, V4-Pro-Max scores 55.4. In DeepSeek's own comparison table, Claude Opus 4.6 scores 57.3 and GPT-5.4 scores 57.7. That is not a collapse, but it is a real gap on the messier benchmark that is closer to production software engineering than SWE-bench Verified.
On Terminal-Bench 2.0, V4-Pro-Max scores 67.9. That beats Claude Opus 4.6 at 65.4, but it trails GPT-5.4 at 75.1 and Gemini 3.1 Pro at 68.5. So the model is strong at coding, but not yet dominant at the harder class of agentic terminal execution.
Factual recall is a clearer weakness. On SimpleQA-Verified, V4-Pro-Max scores 57.9. Gemini 3.1 Pro is at 75.6. That is not a rounding-error deficit. It is a meaningful gap.
Long-context retrieval is also less impressive than the raw context window alone might suggest. On MRCR 1M, V4-Pro-Max scores 83.5. Claude Opus 4.6 is listed at 92.9. You can absolutely give DeepSeek a million-token prompt. You should not assume it will retrieve the right thing inside that prompt as reliably as Claude will.
And then there is what is missing.
DeepSeek does not publish an ARC-AGI result for V4 in the report. For a benchmark specifically designed to resist shortcut pattern-matching, that absence is its own piece of information.
Flash May Be the More Important Model
Most of the attention will go to Pro. Flash may be the more disruptive release.
V4-Flash is much smaller in active compute terms at 13 billion active parameters, but the drop from Pro is narrower than the price collapse suggests. Flash-Max scores 79.0 on SWE-bench Verified against Pro-Max at 80.6. On LiveCodeBench, Flash-Max scores 91.6 against Pro-Max at 93.5.
Those are not identical numbers. But they are close enough that the economic question becomes unavoidable.
If your workload is standard coding help, code generation, review, and lighter tool use, Flash looks much closer to Pro than the price would lead you to expect. Where the gap opens up is on harder multi-step work and knowledge-heavy tasks. Terminal-Bench drops to 56.9 for Flash-Max. SimpleQA-Verified drops to 34.1.
That split is the whole product segmentation.
Pro is the stronger frontier challenger. Flash is the model that makes pricing from the proprietary labs look the least comfortable.
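One way a team might operationalize that split is a trivial router: send routine coding to Flash, send the hard tail to Pro. The task categories and model identifiers below are illustrative assumptions, not anything DeepSeek ships.

```python
# Illustrative router over the Pro/Flash split described above.
# Model identifiers and category names are HYPOTHETICAL; the routing
# rationale comes from the benchmark gaps quoted in this article
# (Terminal-Bench: 56.9 vs 67.9; SimpleQA-Verified: 34.1 vs 57.9).

FLASH = "deepseek-v4-flash"  # hypothetical identifier
PRO = "deepseek-v4-pro"      # hypothetical identifier

# Task kinds where the Flash-to-Pro gap is wide enough to pay for Pro.
PRO_ONLY = {"agentic_terminal", "factual_qa", "multi_step_tools"}

def pick_model(task_kind: str) -> str:
    """Route routine coding work to Flash, the hard tail to Pro."""
    return PRO if task_kind in PRO_ONLY else FLASH

print(pick_model("code_review"))       # routine coding stays on Flash
print(pick_model("agentic_terminal"))  # hard agentic work goes to Pro
```

The point of the sketch is the economics: if most traffic lands in the Flash bucket, the blended per-token cost sits far closer to $0.28 than to $3.48.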
The Strategy Is Visible in the Surface Area
The most interesting thing about V4 is not any single benchmark. It is how coherent the release strategy has become.
MIT license. Open weights. One million token context. OpenAI-compatible calls. Anthropic-compatible calls. Prices low enough to reset expectations before a product team even finishes procurement paperwork.
This is not just an engineering strategy. It is a distribution strategy.
DeepSeek is trying to make the switch away from proprietary APIs feel operationally boring. That is how software markets usually move: not when the new thing is perfect, but when it is good enough, cheaper, and easier to adopt than the incumbent expected.
The American labs still hold real advantages at the very top. Claude Opus 4.7 remains more expensive because it still carries real capability where the hardest production tasks live. But the category where that premium is obviously justified keeps getting smaller.
What the Pattern Means
DeepSeek R1 was the moment that made people look up. V4 feels more like the continuation of a pattern.
The pattern is that open-weight Chinese labs are no longer playing in a secondary tier. They are shipping models that are close enough to the frontier to force pricing questions every time a benchmark table lands.
V4-Pro is not the best model in the world at everything. The report itself makes that clear. It is weaker on factual retrieval than Gemini. It is weaker on million-token retrieval than Claude. It still trails on some of the hardest agentic coding evaluations.
But it is close enough, often enough, and cheap enough, that the pricing of the frontier proprietary models starts looking less like a law of nature and more like a temporary market condition.
That is the real story.
Not that DeepSeek has already won.
That it keeps making the frontier justify its markup in narrower and narrower slices of the market.
Sources:
- DeepSeek-V4 Technical Report - DeepSeek AI
- Models & Pricing - DeepSeek API Docs
- Introducing Claude Opus 4.7 - Anthropic
Previously on TheQuery: The Cheap Model Is Winning. The Expensive One Is Locked In a Room. and A 55GB Open Model Just Beat Claude Opus on Three Benchmarks