The Real AI Race Nobody Is Covering: MiniMax M3, GLM 5.2, and DeepSeek V4 Are Fighting Each Other
By Addy · June 18, 2026
Every article about a Chinese AI model release follows the same structure. New model drops. Benchmark table appears. Western frontier model in the comparison column. Headline reads: "Chinese lab challenges OpenAI" or "Open-weight model closes gap with Claude."
The frame is not wrong. The gap has closed. The open-weight frontier is real. But the frame misses the story that is actually happening underneath it.
In the past eight weeks, three Chinese labs shipped models that are genuinely competitive with the American frontier. MiniMax shipped M3 on June 1. Z.ai shipped GLM 5.2 on June 13. DeepSeek shipped V4 in April and has been iterating since. All three ship under MIT or Apache 2.0 licenses. All three have 1 million token context windows. All three beat or tie GPT-5.5 on at least one major benchmark.
They are also, unmistakably, fighting each other.
The developer who is choosing between these three models is not choosing between China and America. They are choosing between three distinct bets on what the open-weight frontier actually is -- a multimodal Swiss Army knife, a long-horizon agentic powerhouse, or an algorithmic reasoning specialist. Those are different products. The labs building them know this. The coverage pretending this is primarily a US vs China story is missing the competition that is actually shaping the ecosystem.
The Intelligence Index That Tells the Real Story
The top open-weight models by Artificial Analysis Intelligence Index are GLM 5.2 at 51, MiniMax M3 at 44, and DeepSeek V4 Pro at 44. All three come from Chinese labs. All three are ahead of every other open-weight model in existence. Places one through three on the composite intelligence ranking for open-weight models belong to a Tsinghua-affiliated lab, a Shanghai startup, and a Hangzhou hedge fund's AI division.
The American open-weight story -- Meta's Llama family, Mistral, the various fine-tunes and derivatives -- does not appear in the top three. This is not because American labs stopped doing open-weight research. It is because these three Chinese labs are specifically optimizing for the benchmarks that matter for production agentic coding work, shipping on aggressive timelines, and pricing at levels that make the American closed-model incumbents look expensive by default.
The race at the open-weight frontier is between these three. Everything else is context.
Three Models, Three Philosophies
DeepSeek V4 Pro: The Algorithmic Reasoning Specialist
DeepSeek dominates competitive coding: LiveCodeBench 93.5% -- number one globally -- Codeforces rating 3206, GPQA Diamond 90.1%. These are not production software engineering benchmarks. They are algorithmic reasoning benchmarks -- the kind of tasks that require holding a complex problem in memory, generating an elegant solution, and executing it without errors. The Codeforces rating of 3206 puts DeepSeek V4 Pro in the top 23 competitive programmers globally among humans.
What DeepSeek is not: the best model for long-horizon agentic coding on real production codebases. On SWE-bench Pro, DeepSeek V4 Pro scores 55.4% -- 6.7 points behind GLM 5.2 and 3.6 points behind MiniMax M3. The model that leads the world on competitive programming trails the other two Chinese labs on the benchmark that measures real software engineering. These are different skills, and DeepSeek is optimized for the former.
The pricing is aggressive, though not the lowest of the three. DeepSeek V4 Pro runs at $3.48 per million output tokens at standard rates, below GLM 5.2's $4.40 but above M3's $1.20. MIT license. Weights on Hugging Face. The model that led every benchmark in April is now the third-ranked open-weight model on the composite intelligence index -- not because it got worse, but because the other two caught up.
MiniMax M3: The Multimodal Swiss Army Knife
MiniMax M3 scores 59.0% on SWE-bench Pro, 66.0% on Terminal-Bench 2.1, and leads BrowseComp at 83.5%. The BrowseComp number is the one that distinguishes M3 from the other two: web navigation, information retrieval, and tool use across browser interfaces are capabilities neither GLM 5.2 nor DeepSeek V4 specifically optimizes for.
M3's structural advantage is native multimodality. The model processes video, images, desktop computer operation, and text in a single unified model -- not routed through specialist modules. M3 is 3.7 times cheaper than GLM 5.2 on output tokens, priced at $1.20 per million output tokens. For production workflows that involve mixed modalities -- analyzing screenshots, navigating interfaces, processing document images alongside code -- M3 is the only open-weight model that handles all of it natively and cheaply.
The honest weakness: GLM 5.2 leads every shared benchmark by 3 to 15 points, from 3.1 on SWE-bench Pro to 15.0 on Terminal-Bench 2.1. M3 trades benchmark points for modality breadth and cost efficiency. The developer who only needs text and code gets a better model with GLM 5.2. The developer who needs a model that can also see, browse, and operate a desktop gets M3, at lower cost.
GLM 5.2: The Long-Horizon Agentic Powerhouse
GLM 5.2 scores 62.1% on SWE-bench Pro -- outperforming GPT-5.5's 58.6% -- and 74.4% on FrontierSWE, ahead of GPT-5.5 at 72.6% and close to Claude Opus 4.8 at 75.1%. The FrontierSWE number is the one worth focusing on: it measures long-horizon task completion, not short code generation. Sustained multi-step reasoning across a full agentic session is specifically where GLM 5.2 is designed to lead.
The architecture supports this positioning. IndexShare reuses one lightweight indexer across every four sparse-attention layers to cut per-token FLOPs by 2.9x at 1M context. An improved MTP layer for speculative decoding raises accepted token length by up to 20%. These are not general-purpose optimizations. They are specific architectural choices for making long-context, long-session agentic work more efficient and more stable than the alternatives.
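The indexer-sharing idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Z.ai's implementation: random scores stand in for a learned indexer, and `top_k_indices`, `run_layers`, and every parameter name are hypothetical. It only counts how often the indexer runs and does not attempt to reproduce the 2.9x FLOPs figure, which depends on details the article does not give.

```python
import random


def top_k_indices(scores, k):
    """Return the positions of the k largest scores, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(num_layers, context_len, k, share_every, seed=0):
    """Walk through the layers, running the (stand-in) indexer only on
    the first layer of each group of `share_every` layers and reusing
    its selected token positions for the rest of the group."""
    rng = random.Random(seed)
    indexer_calls = 0
    selected = None
    per_layer_selection = []
    for layer in range(num_layers):
        if layer % share_every == 0:
            # Assumption: random scores replace a learned relevance score.
            scores = [rng.random() for _ in range(context_len)]
            selected = top_k_indices(scores, k)
            indexer_calls += 1
        per_layer_selection.append(selected)
    return indexer_calls, per_layer_selection


# Eight layers, sharing one indexer per group of four versus one per layer.
calls_shared, selection_shared = run_layers(8, 32, 4, share_every=4)
calls_unshared, _ = run_layers(8, 32, 4, share_every=1)
print(calls_shared, calls_unshared)
```

With eight layers the shared variant runs the indexer twice instead of eight times, and layers in the same group attend over the same selected positions. Whether that reuse costs accuracy is exactly the trade-off a real design would have to measure.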
GLM 5.2 sits at number one on the open-weight Intelligence Index at 51 points -- ahead of both M3 and DeepSeek on the composite benchmark that aggregates across the widest range of evaluations. It is the most expensive of the three at $4.40 per million output tokens -- still 5.7 times cheaper than Claude Opus 4.8.
The Benchmark Table
| Model | SWE-bench Pro | Terminal-Bench 2.1 | Price (output/1M) | License | Multimodal |
|---|---|---|---|---|---|
| GLM 5.2 | 62.1% | 81.0% | $4.40 | MIT | No |
| MiniMax M3 | 59.0% | 66.0% | $1.20 | Open | Yes |
| DeepSeek V4 Pro | 55.4% | 67.9% | $3.48 | MIT | No |
| GPT-5.5 (reference) | 58.6% | 78.2% | ~$30 | Closed | No |
GLM 5.2 leads on the benchmarks that measure production coding. MiniMax M3 leads on multimodal and cost. DeepSeek V4 Pro leads on algorithmic reasoning and math. GPT-5.5 is in the table as a reference point -- it is behind GLM 5.2 on SWE-bench Pro, behind M3 on BrowseComp, and costs roughly 7-25 times more than any of the three. The table that was supposed to have a Western model in the top position does not.
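The ratios quoted in this piece can be checked directly from the table. The sketch below uses only the table's figures; the dictionary layout and variable names are illustrative, and GPT-5.5's price is the table's approximate "~$30", so the resulting multiples are approximate too.

```python
# Figures copied from the benchmark table above.
models = {
    "GLM 5.2": {"swe_bench_pro": 62.1, "terminal_bench": 81.0, "price": 4.40},
    "MiniMax M3": {"swe_bench_pro": 59.0, "terminal_bench": 66.0, "price": 1.20},
    "DeepSeek V4 Pro": {"swe_bench_pro": 55.4, "terminal_bench": 67.9, "price": 3.48},
}
GPT_55_PRICE = 30.0  # the table lists "~$30", so treat this as approximate

# How many times more GPT-5.5 costs per million output tokens.
gpt_multiple = {
    name: round(GPT_55_PRICE / m["price"], 1) for name, m in models.items()
}

# GLM 5.2's output price relative to M3's.
glm_vs_m3 = round(models["GLM 5.2"]["price"] / models["MiniMax M3"]["price"], 1)

# SWE-bench Pro gaps between DeepSeek V4 Pro and the other two.
gap_to_glm = round(
    models["GLM 5.2"]["swe_bench_pro"] - models["DeepSeek V4 Pro"]["swe_bench_pro"], 1
)
gap_to_m3 = round(
    models["MiniMax M3"]["swe_bench_pro"] - models["DeepSeek V4 Pro"]["swe_bench_pro"], 1
)

swe_leader = max(models, key=lambda n: models[n]["swe_bench_pro"])
cheapest = min(models, key=lambda n: models[n]["price"])

print(gpt_multiple, glm_vs_m3, gap_to_glm, gap_to_m3, swe_leader, cheapest)
```

The multiples come out at roughly 6.8, 8.6, and 25, which is where the "7-25 times" range comes from, and the 3.7x and 6.7 / 3.6 point figures quoted earlier fall out of the same numbers.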
What Each Lab Is Actually Competing For
The three labs are not trying to build the same thing. The competition between them is not primarily on benchmark points -- it is on developer ecosystem capture.
DeepSeek's bet is on the competitive programmer, the research scientist, the math-heavy use case. The lab that runs the highest Codeforces rating and leads LiveCodeBench is building tooling for the developer who solves hard algorithmic problems and needs a model that can match their reasoning speed. The DeepSeek user is not primarily a product developer running long agentic coding sessions. They are a researcher or competitive programmer who needs the fastest, most capable reasoner available in open weights.
MiniMax's bet is on the developer building products that touch the real world -- browsing agents, computer-use pipelines, multimodal document processors. M3's BrowseComp lead is not an accident. It reflects a specific product decision: build the model that can operate across modalities natively, price it cheaply enough that the marginal cost of adding vision or video to a pipeline is negligible, and capture the developer building the next generation of agentic products that go beyond the text terminal.
Z.ai's bet is on the enterprise developer running long, complex, multi-day coding sessions on large codebases. The IndexShare architecture, the 131,072 output token ceiling, the FrontierSWE optimization -- these are choices made for the team running an overnight agentic refactor on a 500,000-line codebase. The model that scores 74.4% on FrontierSWE is the model you reach for when the task runs for hours, not minutes.
These are not competing for the same user. They are competing for adjacent users in the same ecosystem -- and the adjacency is close enough that the labs are watching each other's benchmark releases and shipping responses within weeks.
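For a developer, the positioning above reduces to a simple routing decision. The following is a minimal sketch of that decision as this article frames it, not a recommendation engine: the workload labels, the `needs_vision` flag, and the `pick_model` function are all hypothetical names introduced for illustration.

```python
# Hypothetical workload labels mapped to the model this article
# positions for each segment.
POSITIONING = {
    "algorithmic_reasoning": "DeepSeek V4 Pro",
    "multimodal_agents": "MiniMax M3",
    "long_horizon_coding": "GLM 5.2",
}


def pick_model(workload: str, needs_vision: bool = False) -> str:
    """Return the model the article associates with a workload.

    Per the benchmark table, M3 is the only one of the three marked
    multimodal, so any vision requirement overrides the workload label.
    """
    if needs_vision:
        return POSITIONING["multimodal_agents"]
    if workload not in POSITIONING:
        raise ValueError(f"unknown workload: {workload!r}")
    return POSITIONING[workload]


print(pick_model("long_horizon_coding"))
print(pick_model("long_horizon_coding", needs_vision=True))
print(pick_model("algorithmic_reasoning"))
```

The override is the interesting part: a single hard requirement, such as reading screenshots, can outweigh a benchmark lead, which is why the three labs can coexist in adjacent segments.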
The Speed of This Competition
Four Chinese AI labs released open-weight coding models inside a 12-day window in late April 2026 -- Z.ai's GLM-5.1, MiniMax M2.7, Moonshot's Kimi K2.6, and DeepSeek V4 -- all landing at roughly the same capability ceiling on agentic engineering at meaningfully lower cost than Western frontier models.
That was April. GLM 5.1 shipped in March. GLM 5.2 shipped in June -- 11 weeks later, with a 5x context expansion and a new architecture. MiniMax M2.7 shipped in April. M3 shipped in June -- 6 weeks later, with native multimodality added. The iteration speed is the competitive pressure that the Western labs are responding to, not any single model's benchmark.
The pattern from the open-source software world applies here: when multiple capable contributors are working on adjacent problems in an open ecosystem, the pace of improvement exceeds what any single closed team can match. Linux improved faster than Unix because more engineers were working on it. PyTorch improved faster than Theano because more researchers were using it. The Chinese open-weight cluster is iterating faster than any individual American lab because there are three of them, and each release from one forces a response from the others.
The Question the Frame Misses
The Western coverage of Chinese AI has a structural limitation: it frames every Chinese model release as a response to an American one. DeepSeek V4 is interesting because it challenges GPT-5.5. GLM 5.2 is interesting because it beats GPT-5.5 on SWE-bench Pro. MiniMax M3 is interesting because it undercuts Claude Opus pricing.
All of those framings are accurate. None of them explain why GLM 5.2 and MiniMax M3 shipped within two weeks of each other with deliberately different positioning, or why DeepSeek's V4 release in April was followed by two competing Chinese labs releasing models that specifically targeted the benchmarks where DeepSeek was weakest.
The competition that is shaping the open-weight ecosystem is not primarily East versus West. It is three Chinese labs with different architectural bets, different user positioning, and different pricing strategies, racing to establish the default open-weight model for different segments of the developer market before the market consolidates.
The American labs are the reference point in every benchmark table. They are not the primary competitive pressure in the ecosystem where these three models actually compete.
That is the story the frame misses. The race at the open-weight frontier has already moved on from "can Chinese labs match American frontier models." The answer to that question is confirmed. The new question is which Chinese lab's architectural bet wins.
Sources:
- GLM-5.2 vs MiniMax M3: Open-Weight Coding Showdown -- CodingFleet
- MiniMax M3 debuts, eclipsing GPT-5.5 and Gemini 3.1 Pro -- VentureBeat
- MiniMax-M3 vs DeepSeek V4 Pro Intelligence Index -- Artificial Analysis
Previously on TheQuery: DeepSeek V4 Is Almost at the Frontier. The Price Is Not. and MiniMax M3 Is the First Open-Weight Model With Frontier Coding, 1M Context, and Multimodality. The Weights Aren't Out Yet.