MiniMax M2.7 Built Itself. The Race Shifted.
By Addy · March 18, 2026
MiniMax just released M2.7: a model that participated in its own training loop, matches GPT-5.3-Codex on software engineering benchmarks, and is live on the API today. It is built by a Chinese lab. Most developers will not touch it anyway. That tension is the whole story.
The Self-Evolution Loop
During M2.7's development, MiniMax used earlier versions of the model to run its own reinforcement learning cycle. The model analyzed failure trajectories, planned changes, modified scaffold code, ran evaluations, compared results, and decided what to keep or revert - autonomously, for over 100 consecutive rounds.
The result was a 30 percent performance improvement on internal evaluation sets.
To be precise: M2.7 did not write its own weights. Human researchers directed every critical decision. But the model handled 30 to 50 percent of the research workflow - data pipelines, experiment monitoring, log reading, debugging, and merge requests - that previously required multiple researchers across teams.
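The keep-or-revert cycle described above is, at its core, a hill-climbing loop: propose a change, evaluate it, keep it only if the score improves. The sketch below is an illustrative reconstruction under that assumption, not MiniMax's actual scaffold code; `propose_change`, `run_eval`, and the configuration shape are all hypothetical stand-ins.

```python
import random

def run_eval(config):
    """Hypothetical stand-in for an evaluation run; returns a benchmark score."""
    return sum(config) / len(config)

def propose_change(config):
    """Hypothetical stand-in for the model proposing a scaffold edit."""
    candidate = list(config)
    i = random.randrange(len(candidate))
    candidate[i] += random.uniform(-0.1, 0.1)
    return candidate

def self_evolution_loop(config, rounds=100):
    """Keep a proposed change only if it improves the evaluation score."""
    best_score = run_eval(config)
    for _ in range(rounds):
        candidate = propose_change(config)        # model modifies scaffold code
        score = run_eval(candidate)               # run evaluations
        if score > best_score:                    # compare results
            config, best_score = candidate, score # keep the change
        # otherwise: revert (the old config is kept unchanged)
    return config, best_score

improved, score = self_evolution_loop([0.5, 0.5, 0.5], rounds=100)
```

The important design choice the article describes is that humans set the objective and reviewed critical decisions, while the model executed the inner loop itself.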
MiniMax calls this self-evolution. Whether that framing is accurate or aspirational is a fair debate. What is not debatable is where the benchmarks landed.
The Numbers
On SWE-Pro, which tests real-world software engineering across multiple languages, M2.7 scored 56.22% - matching GPT-5.3-Codex.
On MLE-Bench Lite, a suite of 22 machine learning competitions used by OpenAI to evaluate AI research capability, M2.7 achieved a 66.6% medal rate across three runs. For context: Opus 4.6 scored 75.7%, GPT-5.4 scored 71.2%, and Gemini-3.1 tied M2.7 at 66.6%.
On VIBE-Pro, which measures end-to-end full project delivery across Web, Android, iOS, and simulation tasks, M2.7 scored 55.6% - nearly on par with Opus 4.6.
On GDPval-AA, which measures professional domain expertise across 45 models, M2.7 achieved an ELO of 1495 - highest among open-source models, just behind Opus 4.6, Sonnet 4.6, and GPT-5.4.
These are not catch-up numbers. These are competitive numbers.
Chinese Labs Are No Longer Following. They Are Competing.
A year ago the AI narrative was simple: American labs set the frontier, everyone else chased it. That story no longer holds.
DeepSeek arrived in early 2025 and matched GPT-4-class performance at a fraction of the compute cost. Qwen from Alibaba became a serious open-source contender. MiniMax released M2, M2.1, M2-Her, M2.5, and now M2.7 in rapid succession - each iteration meaningfully closing gaps on benchmarks that matter for real engineering work.
The gap between American and Chinese frontier models is now a matter of percentage points on specific benchmarks, not a fundamental capability divide.
For developers building API-powered products, this matters practically. M2.7 is available today on the MiniMax API Platform at competitive pricing. For cost-sensitive workloads, or for developers regularly hitting rate limits on frontier models, M2.7 is worth a serious evaluation. The performance is real. The access is live.
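For developers who want to run that evaluation, a typical integration path is an OpenAI-style chat-completions request. Everything below is an assumption to verify against MiniMax's current platform documentation: the endpoint URL, the model identifier, and the payload shape are illustrative, not confirmed values.

```python
import json

# Hypothetical values - verify against the MiniMax API Platform docs.
MINIMAX_ENDPOINT = "https://api.minimax.io/v1/chat/completions"  # assumed URL
MODEL_NAME = "MiniMax-M2.7"                                      # assumed model id

def build_chat_request(prompt, temperature=0.2):
    """Build an OpenAI-style chat-completions payload for an M2.7 call."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": "You are a coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
    }

payload = build_chat_request("Refactor this function to be iterative.")
body = json.dumps(payload)
# POST `body` to MINIMAX_ENDPOINT with an "Authorization: Bearer <key>"
# header using any HTTP client (requests, httpx, curl).
```

Keeping the request shape OpenAI-compatible means an existing evaluation harness can often be pointed at a new provider by swapping the base URL and model name, which is what makes a side-by-side cost comparison cheap to run.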
The hesitation is also real.
The Trust Problem Nobody Wants to Talk About
Chinese AI models carry a trust problem that benchmark numbers cannot fix.
DeepSeek was banned on government devices in New York. Italy's data protection authority blocked it. South Korea removed it from app stores. These were institutional responses to a documented concern: data processed by Chinese AI systems may be subject to Chinese government access under national security law.
The concern goes beyond privacy. Independent testing and user reports indicate that some Chinese AI models suppress or deflect discussion of politically sensitive topics - events the Chinese government considers unfavorable. Developers have documented instances where these models decline to engage with certain geopolitical topics that Western models handle without restriction.
MiniMax is not DeepSeek. The engineering culture, model design, and benchmark transparency suggest a serious technical organization. But MiniMax is still a Chinese company operating under Chinese law. That legal reality does not change because the benchmarks are impressive.
For personal projects, open-source work, and non-sensitive workloads: the trust concern is manageable.
For enterprise use, proprietary codebases, or regulated industries: the concern is legitimate and due diligence is non-negotiable.
This is not an argument to avoid M2.7. It is an argument to use it with eyes open.
What M2.7 Actually Signals
The self-evolution framing matters not because M2.7 is fully autonomous, but because it shows where the trajectory is heading. If AI models can already handle 30 to 50 percent of the research workflow in their own training loops, the pace of iteration is going to keep compressing.
MiniMax shipped five model versions in roughly a year. Each one moved the benchmarks. The next one is already being built, and this time, the model itself is running part of the process.
The American labs are still ahead on the absolute frontier. But the margin is narrowing faster than most expected. And the labs doing the catching up are not slowing down.