>_TheQuery

The Model That Thinks With 12B Parameters but Knows Everything a 120B Model Knows.

By Addy · March 11, 2026

There is a quiet architectural shift happening across AI labs that the parameter count leaderboard does not reflect.

The assumption underneath every benchmark comparison for the last three years was simple: more parameters, more capability. A 70B model beats a 35B model. A 120B model beats a 60B model. Scale was the variable that mattered. Everything else was noise.

That assumption is being dismantled, not by a single model, not by a single lab, but by a pattern emerging across multiple independent releases from different teams working on different problems.

The pattern is this: separate what a model knows from what it activates. Store knowledge at scale. Think at efficiency. The models that do this well are outperforming models that simply store and activate everything.


What the Shift Actually Looks Like

A model with 120 billion parameters used to mean a model doing 120 billion parameters worth of work on every token. All 120 billion engaged on every forward pass. Inference cost scaled directly with parameter count. The number on the press release reflected the compute bill.

The new architecture decouples those two things.

Store 120 billion parameters of knowledge. Activate 12 billion of them per token. Route each token to the subset of parameters most relevant to the task at hand. The other 108 billion sit available, engaged when needed, idle when not.

This is not a compression trick. The knowledge is real. The routing is learned. The model is not approximating a 120B model from a smaller one. It is a 120B model that has learned to think efficiently.
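The mechanism behind that split is a learned gating function: it scores every expert but runs only the top few. Here is a minimal sketch of top-k Mixture-of-Experts routing in NumPy; the shapes, names, and two-expert choice are illustrative, not any lab's actual design.

```python
import numpy as np

def moe_forward(token, gate_w, experts, k=2):
    """Route one token through its top-k experts; the rest stay idle.

    token:   (d,) hidden vector for a single token
    gate_w:  (d, n_experts) learned routing weights (illustrative shapes)
    experts: list of n_experts weight matrices, each (d, d)
    """
    scores = token @ gate_w                 # one relevance score per expert
    top_k = np.argsort(scores)[-k:]         # indices of the k best-matched experts
    # softmax over just the selected scores to weight the chosen experts
    w = np.exp(scores[top_k] - scores[top_k].max())
    w /= w.sum()
    # only k experts do any work on this token; the others are never touched
    return sum(wi * (token @ experts[i]) for wi, i in zip(w, top_k))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
out = moe_forward(rng.normal(size=d),
                  rng.normal(size=(d, n_experts)),
                  [rng.normal(size=(d, d)) for _ in range(n_experts)])
```

With 8 experts and k=2, three quarters of the expert weights sit idle on every token: the stored-versus-active gap, shrunk to toy scale.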

NVIDIA's Nemotron 3 Super, released today, is the clearest example of this architecture at scale. 120 billion total parameters. 12 billion active at inference. Three architectural decisions make the gap possible.

LatentMoE projects tokens into a compressed latent dimension before expert routing, making the routing decision itself cheaper and allowing four times as many experts to be activated for the same inference cost as a standard MoE design. Hybrid Mamba-Attention replaces quadratically expensive transformer attention for most sequence processing with Mamba-2, a state space architecture that handles long sequences efficiently; this is what makes a one-million-token context window practical rather than theoretical. Multi-Token Prediction generates multiple future tokens per forward pass, enabling native speculative decoding at up to three times faster wall-clock inference without a separate draft model.
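Of the three, LatentMoE is the easiest to sketch: score experts in a compressed space so the routing decision stops scaling with the full hidden dimension. The sketch below is a guess at the shape of that idea; the names, dimensions, and expert count are illustrative, not NVIDIA's implementation.

```python
import numpy as np

def latent_route(token, down_proj, gate_w, k=8):
    """Pick the top-k experts by scoring in a compressed latent space.

    token:     (d,) hidden vector
    down_proj: (d, d_latent) learned compression, d_latent much smaller than d
    gate_w:    (d_latent, n_experts) routing weights in the latent space
    """
    latent = token @ down_proj      # compress before routing
    scores = latent @ gate_w        # scoring cost scales with d_latent, not d
    return np.argsort(scores)[-k:]  # cheaper routing leaves budget to activate more experts

rng = np.random.default_rng(0)
d, d_latent, n_experts = 64, 8, 32
chosen = latent_route(rng.normal(size=d),
                      rng.normal(size=(d, d_latent)),
                      rng.normal(size=(d_latent, n_experts)))
```

The budget argument is the point: if scoring an expert costs d_latent multiply-adds instead of d, the same routing budget covers several times as many experts.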

The result: 5x higher throughput than its predecessor. One million token context at 91.75% accuracy on RULER. A model running at 12B active-parameter cost that outperforms models activating three times as many parameters per token.


Qwen3-Coder-Next Did This First

Before Nemotron, Alibaba's Qwen3-Coder-Next made the same architectural argument in sharper terms.

80 billion total parameters. 3 billion active at inference. On SWE-Bench Pro it matched Claude Sonnet 4.5-level coding performance. It outperformed DeepSeek V3, which activates 37 billion parameters per token, more than twelve times Qwen3-Coder-Next's active count.

A model thinking with 3 billion parameters outperforming one thinking with 37 billion is not a benchmark anomaly. It is an architectural argument. The routing is doing work that used to require raw parameter scale. The stored knowledge is deep enough that activating the right 3 billion matters more than activating a generic 37 billion.
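The cost side of that argument can be made concrete with a common rule of thumb: a dense forward pass costs roughly 2 FLOPs per active parameter per token. Plugging in the active counts quoted above gives the per-token compute gap directly.

```python
# Rule-of-thumb cost model: ~2 FLOPs per active parameter per token.
def flops_per_token(active_params):
    return 2 * active_params

qwen_active = 3e9        # Qwen3-Coder-Next: 3B active parameters
deepseek_active = 37e9   # DeepSeek V3: 37B active parameters

ratio = flops_per_token(deepseek_active) / flops_per_token(qwen_active)
print(f"{ratio:.1f}x")  # prints "12.3x"
```

The rule of thumb ignores attention and routing overhead, but the ratio is the point: per-token compute tracks active parameters, not stored ones.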

The similarity between Qwen3-Coder-Next and Nemotron 3 Super is not superficial. Both use Mixture of Experts routing to separate stored capacity from active compute. Both achieve results that their active parameter counts should not make possible by dense model logic. Both are from labs that independently arrived at the same conclusion: the path to capability is not more activation. It is better routing.


Is This Shift Grounded or Gradual?

That is the honest question. Two models from two labs is a pattern. But is it a fundamental shift or an incremental one that will plateau?

The evidence suggests it is grounded, not gradual, for a specific reason: the efficiency gains compound rather than trade off.

DeepSeek V3 demonstrated this earlier: 671 billion total parameters, 37 billion active, outperforming Llama 3.1 at 405 billion dense parameters. That result forced a public reconsideration of dense scaling as a strategy. At the time it looked like a one-off. Now it looks like the first data point in a trend that Qwen and NVIDIA have independently confirmed.

What makes this grounded rather than a temporary advantage is that the architectural decisions are additive. LatentMoE makes routing cheaper. Mamba-Attention makes long contexts viable. Multi-Token Prediction accelerates inference. Each decision compounds with the others. The efficiency gap between this architecture and dense transformers does not shrink as models scale; it grows, because each component of the design benefits more from scale than dense attention does.

We are also seeing this pattern emerge not in research papers but in production releases. When efficiency architecture moves from arXiv to production deployment, it is not gradual. It is grounded.

The parameter count leaderboard will keep publishing numbers. But the number that matters at inference, active parameters per token, is quietly becoming the more honest metric. The labs that understood this earliest are already three models deep into proving it.


Previously on TheQuery: Alibaba's 9B Model Just Beat Its Own 30B. The Scaling Era Is Ending., where this story started.