MiniMax M3 Is the First Open-Weight Model With Frontier Coding, 1M Context, and Multimodality. The Weights Aren't Out Yet.
By Addy · June 2, 2026
Something worth paying attention to happened on June 1, 2026.
A Shanghai-based AI lab called MiniMax released a model that beats Gemini 3.1 Pro on SWE-bench Pro, edges GPT-5.5 by four-tenths of a point on the same benchmark, delivers a native 1 million token context window, processes text, images, and video natively, can operate a desktop computer, and costs one-tenth of Claude Opus 4.7's input price. Then MiniMax called it open-weight and said the weights would arrive in ten days.
That last sentence is where the story gets complicated.
Every claim in the first paragraph is worth examining. The benchmark scores are real but vendor-run. The open-weight designation is a commitment, not a verifiable fact. The weights are not on Hugging Face today. The architecture that produced these results will not be independently reviewable until approximately June 11. Until that date, MiniMax M3 is simultaneously the most interesting open-weight model announcement of 2026 and a set of claims that cannot yet be verified.
Both of those things are true. The industry will spend the next ten days watching to see which one matters more.
What M3 Actually Claims
MiniMax positions M3 as the first open-weight model to simultaneously deliver three capabilities that have until now been the exclusive preserve of closed-source giants.
The first is frontier-level coding. On SWE-bench Pro, M3 scores 59.0%, ahead of GPT-5.5 at 58.6% and Gemini 3.1 Pro at 54.2%. On Terminal Bench 2.1, which tests agentic terminal coding in live environments, M3 scores 66.0%, trailing GPT-5.5's 78.2% and Gemini 3.1 Pro's 70.0% by meaningful margins. On MCP Atlas, which evaluates tool use across external services, it scores 74.2%. On SVG-Bench, which tests structured code generation for visual outputs, M3 leads at 63.7% ahead of Claude Opus 4.7 at 62.3%.
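The margins are easier to compare side by side. The short Python sketch below recomputes them from the vendor-reported scores quoted above; the `SCORES` table and the `margins` helper are illustrative names introduced here, not anything MiniMax publishes, and the numbers carry the same vendor-run caveat as the rest of the announcement.

```python
# Vendor-reported scores quoted in this article (percent).
# These are MiniMax's own numbers, not independently verified results.
SCORES = {
    "SWE-bench Pro": {"MiniMax M3": 59.0, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal Bench 2.1": {"MiniMax M3": 66.0, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.0},
    "SVG-Bench": {"MiniMax M3": 63.7, "Claude Opus 4.7": 62.3},
}


def margins(benchmark: str, model: str = "MiniMax M3") -> dict[str, float]:
    """Return the model's lead (positive) or deficit (negative), in points,
    against every other model listed for the given benchmark."""
    row = SCORES[benchmark]
    return {
        other: round(row[model] - score, 1)
        for other, score in row.items()
        if other != model
    }


if __name__ == "__main__":
    for name in SCORES:
        print(name, margins(name))
```

A positive value means M3 is ahead on that benchmark and a negative value means it trails, which makes the contrast between the narrow SWE-bench Pro lead and the wider Terminal Bench 2.1 deficit explicit.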
The second is a native 1 million token context window. The context window is not a retrofit. It is built into the architecture through MiniMax Sparse Attention, a new sparse attention mechanism that MiniMax says makes M3 more than 9x faster in prefilling and more than 15x faster in decoding at 1 million tokens versus its previous-generation model. The technical report will arrive alongside the weights in ten days. Until then, the speed claim is a benchmark number without an auditable implementation behind it.
The third is native multimodality. M3 processes text, images, and video in a single model rather than routing through separate specialist models. MiniMax also says it can operate a desktop computer, which puts it in the same broad computer-use category as the closed models that have been pushing agentic work beyond chat.
The pricing is the number that changes the competitive calculation. M3's API launched through OpenRouter at USD 0.30 per million input tokens and USD 1.20 per million output tokens during the launch promotion. MiniMax's own token plans start at USD 20 per month for roughly 1.7 billion M3 tokens. For production agentic workflows running millions of tokens per day, the cost differential is not marginal. It is the difference between a viable product and an unviable one.
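To make the pricing concrete, here is a minimal Python sketch of the arithmetic, using only the launch-promotion rates and the token-plan figures quoted above. The function names and the daily workload in the example are hypothetical, chosen purely for illustration; the plan's "roughly 1.7 billion" allowance is treated as exactly 1.7 billion tokens, which is an assumption.

```python
# Launch-promotion API rates quoted in this article (USD per million tokens).
M3_INPUT_USD_PER_MILLION = 0.30
M3_OUTPUT_USD_PER_MILLION = 1.20

# Entry token plan quoted in this article: USD 20 per month for roughly
# 1.7 billion tokens (treated as exactly 1.7e9 here, an assumption).
PLAN_USD_PER_MONTH = 20.0
PLAN_TOKENS_PER_MONTH = 1.7e9


def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost for a given token volume at the quoted API rates."""
    return (
        input_tokens / 1e6 * M3_INPUT_USD_PER_MILLION
        + output_tokens / 1e6 * M3_OUTPUT_USD_PER_MILLION
    )


def plan_usd_per_million_tokens() -> float:
    """Effective blended rate of the entry token plan, per million tokens."""
    return PLAN_USD_PER_MONTH / (PLAN_TOKENS_PER_MONTH / 1e6)


if __name__ == "__main__":
    # Hypothetical agentic workload: 5M input and 1M output tokens per day.
    daily = api_cost_usd(5_000_000, 1_000_000)
    print(f"API cost per day:   USD {daily:.2f}")
    print(f"API cost per month: USD {daily * 30:.2f}")
    print(f"Plan rate: USD {plan_usd_per_million_tokens():.4f} per million tokens")
```

On these figures the token plan works out to a little over one cent per million tokens, well below the pay-as-you-go rates, although the comparison assumes the full monthly allowance is actually consumed.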
The Benchmark Caveat That Applies Here
This publication ran the DeepSWE article four days ago. The timing is relevant.
MiniMax discloses that several M3 results were run on its own infrastructure, often using agent scaffolding including Claude Code, Mini-SWE-Agent, Terminus, and Codex. That disclosure is honest and appreciated. Most labs do not flag the scaffolding choices that influence benchmark scores. But it means the 59.0% SWE-bench Pro number carries the same caveat that every vendor-run benchmark carries: the setup conditions were chosen by the team that built the model.
M3 is not yet on the DeepSWE official board. The benchmark that TheQuery identified as the most rigorous coding evaluation available, the one that found Claude was reading git history and that Claude Haiku 4.5 collapses to zero on harder tasks, has not evaluated M3. That evaluation will not exist until the weights ship and Datacurve or an independent team can run M3 through the same standardized harness they used on every other model.
The DeepSWE board as of June 2 shows GPT-5.5 at 70%, Claude Opus 4.8 at 58%, GPT-5.4 at 56%, and Claude Opus 4.7 at 54%. Where M3 lands on that board, once the weights are available and an independent team submits it, is the number that will determine whether MiniMax's frontier coding claim holds under the most rigorous available scrutiny.
That number does not exist today.
Why the Open-Weight Claim Is the Real Story
Set aside the benchmark caveats for a moment and focus on what MiniMax is promising.
A model that scores 59% on SWE-bench Pro while delivering a 1 million token context window and native multimodality, as open weights, downloadable from Hugging Face, is a different category of development from anything the open-weight ecosystem has previously produced.
The closest comparison is DeepSeek V4-Flash, which this publication covered in April: MIT-licensed, 160GB on Hugging Face, competitive with frontier models on coding benchmarks at USD 0.28 per million output tokens. DeepSeek V4-Flash has native long context but not native multimodality at that tier. M3 claims all three simultaneously.
For enterprises that have been structurally excluded from AI-assisted coding because their code cannot leave the building (the use case TheQuery described in yesterday's RTX Spark article), an open-weight model with frontier coding capability and a 1 million token context window changes the calculation fundamentally. Once the weights ship, the model could run locally on the RTX Spark hardware arriving this fall. The code stays on-premises. The model reasons across the entire codebase. A capability that required a cloud API two months ago will, in six months, require only hardware.
That is the compounding that makes M3's open-weight commitment significant regardless of the benchmark caveats. The benchmarks tell you how good the model is relative to other models. The open-weight license tells you who gets to use it and under what conditions. A model that is near GPT-5.5's vendor-reported coding score at a fraction of the cost, downloadable for self-hosting, is not a second-tier alternative. For the use cases that cannot use cloud APIs, it may be the only viable option.
The National Intelligence Law Footnote
Every piece of coverage of a Chinese AI lab release requires this paragraph, and MiniMax M3 is no exception.
China's 2017 National Intelligence Law requires organizations and citizens operating under Chinese jurisdiction to support, assist, and cooperate with state intelligence work when required by law. MiniMax is a Shanghai-based company. The law applies. What cooperation has been requested, if any, is not public information. By definition, it cannot be.
This is not a reason to avoid M3 entirely. DeepSeek V4, Kimi K2.6, Qwen 3.7, and Ling 2.6 all carry the same footnote and all have significant production deployments in Western enterprises. The footnote is a risk factor to name in a procurement decision, not a disqualifying condition for all use cases. Self-hosted open-weight deployment on on-premises hardware in an air-gapped environment reduces the data exposure surface considerably compared to API usage that routes queries through MiniMax's servers.
Enterprises in regulated industries or with data sovereignty requirements should be making this distinction for every vendor they use, not only Chinese ones.
Ten Days
The next ten days are the most important ten days in MiniMax M3's story.
Three things will determine whether today's announcement becomes a lasting shift in the open-weight landscape or a carefully staged launch that fails to survive contact with independent evaluation.
First: do the weights ship on schedule? MiniMax committed to approximately June 11. A delay past that date would follow a pattern that has damaged the credibility of other Chinese labs; Qwen 3.7 Max's closed-weight pivot is the most recent example. Shipping the weights on time, under the promised license and alongside the technical report, is the minimum required for the announcement to be taken seriously.
Second: does the architecture paper hold up? The sparse attention mechanism claiming more than 15x faster decoding at long contexts is either a genuine architectural innovation or a benchmark artifact from favorable test conditions. Independent engineers will read the paper the day it ships and the AI research community will know within 48 hours which one it is.
Third: does M3 appear on DeepSWE, and what score does it get? The benchmark that TheQuery called the most rigorous coding evaluation available has not seen M3. If MiniMax submits M3 to Datacurve's standardized harness, or if the community runs it independently once the weights are available, the DeepSWE score will tell you more about M3's real coding capability than the vendor-run SWE-bench Pro number that is leading every other article today.
If the weights ship, the architecture holds, and the DeepSWE score is within range of the SWE-bench Pro number, MiniMax M3 will have done something genuinely significant: delivered the first open-weight model with frontier coding, million-token context, and native multimodality, at a fraction of the price of the closed alternatives. That outcome would validate every claim in today's announcement.
If any of those three things fails, the gap between the announcement and the reality will be the story instead.
TheQuery will cover whichever one happens. The ten-day clock starts today.
Sources:
- MiniMax M3: Frontier Coding, 1M Context, Native Multimodality - MiniMax
- MiniMax M3 model page - MiniMax
- DeepSWE leaderboard - Datacurve
- MiniMax M3 Open-Weight Coding Model: Frontier Claims, Unverified Benchmarks - TechTimes
- National Intelligence Law of the People's Republic of China - LawInfoChina
Previously on TheQuery: "DeepSWE Benchmark Exposes Claude Opus 4.7 Loophole and Crowns GPT-5.5 as Real Coding Leader" and "Qwen 3.7 Max Is Closed Source: Alibaba Just Copied the Playbook It Was Beating"