>_TheQuery

Mamba-3

Deep Learning

A state space model architecture published at ICLR 2026 that advances Mamba-2 through trapezoidal discretization, a MIMO formulation for hardware efficiency, and complex-valued dynamics via data-dependent RoPE - delivering stronger state tracking at half the state size of its predecessor.

Like upgrading a hand-cranked adding machine to a calculator - the underlying math is the same, but trapezoidal discretization and MIMO give the machine enough arithmetic intensity to actually keep up with the hardware it runs on.

Mamba-3 is a sequence modeling architecture published at ICLR 2026 by Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu. It extends the Mamba family of state space models (SSMs), which are designed as an inference-efficient alternative to the Transformer's attention mechanism. While Transformers deliver strong quality at the cost of quadratic compute scaling with sequence length, SSMs operate with linear-time recurrence - making them faster and cheaper at inference, particularly for long sequences.

Mamba-3 is motivated by an inference-first perspective: how do you build a model that is fast and memory-efficient to run, while closing the capability gap that has historically made SSMs weaker than attention on tasks requiring precise retrieval and state tracking?

Three Core Innovations

The first innovation is trapezoidal discretization. Mamba-1 and Mamba-2 used Euler's first-order method to discretize the continuous SSM equations into a recurrent form. Mamba-3 replaces this with a second-order trapezoidal approximation that considers both the current and previous input timesteps, effectively introducing a local convolution of size 2. The result is a more expressive recurrence that captures richer input dependencies without increasing the state size.
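The difference between the two discretizations can be sketched for a scalar SSM. This is a toy illustration under stated assumptions (scalar state, fixed step size, illustrative function names), not Mamba-3's actual parameterization:

```python
import math

# Toy scalar SSM x'(t) = a*x(t) + b*u(t), discretized with step size dt.
# (Illustrative sketch only, not Mamba-3's exact update rule.)

def euler_step(x_prev, u_t, a, b, dt):
    # First-order Euler update: depends only on the current input u_t.
    return (1 + dt * a) * x_prev + dt * b * u_t

def trapezoid_step(x_prev, u_t, u_prev, a, b, dt):
    # Second-order trapezoidal update: averages the current and previous
    # inputs, i.e. a local convolution of size 2 over the input stream.
    denom = 1 - dt * a / 2
    return ((1 + dt * a / 2) / denom) * x_prev \
        + (dt * b / 2) * (u_t + u_prev) / denom

# Sanity check on x'(t) = -x(t), x(0) = 1, whose exact solution is e^(-t).
x_e = x_t = 1.0
for _ in range(10):
    x_e = euler_step(x_e, 0.0, -1.0, 0.0, 0.1)
    x_t = trapezoid_step(x_t, 0.0, 0.0, -1.0, 0.0, 0.1)
exact = math.exp(-1.0)
assert abs(x_t - exact) < abs(x_e - exact)  # trapezoid is more accurate
```

Both updates read and write the same amount of state; the trapezoidal one simply extracts more accuracy from each step, which is where the added expressivity comes from.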

The second innovation is a MIMO (multi-input multi-output) formulation. The fundamental hardware problem with standard SSM decoding is arithmetic intensity: single-input single-output recurrences perform roughly 2.5 FLOPs per byte of memory accessed - far below the compute throughput of modern GPUs like the H100, which operate at around 300 FLOPs per byte. This leaves the GPU sitting mostly idle during inference. Mamba-3's MIMO formulation expands the inputs from vectors to matrices, transforming inner operations from outer products into matrix multiplications. This increases compute per memory byte by a significant factor, pushing inference into compute-bound rather than memory-bound territory - and enabling better hardware utilization without enlarging the recurrent state.
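A back-of-the-envelope roofline model makes the point. The accounting below is a simplification under assumed conditions (fp16 state, memory traffic dominated by one read and one write of the state per decode step, small input/output tensors ignored), so the constants differ from the paper's:

```python
def decode_intensity(n, p, rank, bytes_per_elem=2):
    """FLOPs per byte for one recurrent decode step on an (n x p) state.

    Simplified model: a rank-`rank` state update S += B @ U.T costs about
    2*n*p*rank FLOPs, while memory traffic is dominated by reading and
    writing the fp16 state once per step.
    """
    flops = 2 * n * p * rank
    bytes_moved = 2 * n * p * bytes_per_elem  # read + write the state
    return flops / bytes_moved

siso = decode_intensity(n=128, p=64, rank=1)   # outer-product (SISO) update
mimo = decode_intensity(n=128, p=64, rank=16)  # matmul (MIMO) update
assert mimo == 16 * siso  # intensity scales linearly with the MIMO rank
```

The rank of 16 here is an arbitrary illustrative choice. Under this model, arithmetic intensity grows linearly with the MIMO rank while the state - and therefore the memory traffic - stays fixed, which is exactly the lever that moves decoding toward the GPU's compute roofline.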

The third innovation is complex-valued dynamics via data-dependent RoPE. Mamba-2 restricted its state transition matrix to real-valued scalars, eliminating the oscillatory dynamics needed to solve state-tracking tasks like parity detection. Mamba-3 shows that a discretized complex-valued SSM is mathematically equivalent to a real-valued SSM equipped with data-dependent Rotary Positional Embeddings (RoPE), recovering the rotational mechanics required for state tracking without the 4x overhead of explicit complex arithmetic. On the parity detection benchmark, Mamba-2 achieves near-random performance (0.90% accuracy); Mamba-3 achieves 100%.
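The equivalence is easy to verify in miniature. The sketch below is illustrative only (the function names and the parity demo are this article's constructions, not the paper's implementation): multiplying a complex state by e^(i*theta) is the same update as a 2x2 RoPE-style rotation of a real pair, and such rotations suffice to track parity.

```python
import cmath
import math

def complex_step(z, theta):
    # Complex-valued state update: multiply the state by e^(i*theta).
    return z * cmath.exp(1j * theta)

def rope_step(x, y, theta):
    # Equivalent real-valued update: rotate the pair (x, y) by theta,
    # the same 2x2 rotation RoPE applies to paired feature dimensions.
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

# The two trajectories coincide for any (data-dependent) angle sequence.
z, (x, y) = complex(1.0, 0.0), (1.0, 0.0)
for theta in [0.3, -1.2, 2.5]:
    z = complex_step(z, theta)
    x, y = rope_step(x, y, theta)
assert abs(z.real - x) < 1e-12 and abs(z.imag - y) < 1e-12

def parity_via_rotation(bits):
    # Rotate the state a half-turn on every 1-bit; the sign of x then
    # encodes the running parity, the oscillation Mamba-2's real scalar
    # transitions cannot express.
    x, y = 1.0, 0.0
    for b in bits:
        x, y = rope_step(x, y, math.pi * b)
    return 0 if x > 0 else 1
```

For example, `parity_via_rotation([1, 0, 1, 1])` returns 1 (three ones, odd parity), while `parity_via_rotation([1, 1])` returns 0.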

Mamba-3 vs Mamba-2

At 1.5B parameters on FineWeb-Edu language modeling, Mamba-3 outperforms Gated DeltaNet by 0.6 percentage points, with the MIMO variant adding a further 1.2 points for a 1.8-point total improvement. It matches Mamba-2's perplexity with half the state size - the same modeling quality at lower memory cost. Across all evaluated scales, from 180M to 1.5B parameters, Mamba-3 consistently outperforms both Mamba-2 and Gated DeltaNet.

The limitation that remains: like all fixed-state recurrent models, Mamba-3 cannot perfectly retrieve arbitrary content from a long context in the way attention can by revisiting the full KV cache. For tasks requiring precise, structured retrieval from long contexts, hybrid architectures that combine Mamba layers with attention remain the practical solution.

Mamba Architecture in Production Models

The Mamba family has moved well beyond research into deployed models, particularly in hybrid architectures that interleave Mamba layers with attention.

Jamba (AI21 Labs) was one of the first major production deployments of a hybrid Transformer-Mamba architecture. Jamba 1.5 interleaves Mamba and attention layers across 72 total layers, combined with a Mixture of Experts routing layer, and supports a 256K-token context window. It achieves state-of-the-art results on long-context benchmarks while maintaining inference efficiency comparable to smaller pure-attention models.

Nemotron-H (NVIDIA) is a family of hybrid Mamba-Transformer models at 8B, 47B, and 56B parameters. By replacing 92% of attention layers with Mamba-2 blocks, Nemotron-H delivers up to 3x faster inference throughput compared to similarly sized Transformers like LLaMA-3.1 and Qwen-2.5, while matching or exceeding accuracy on MMLU, GSM8K, HumanEval, and MATH. This is the most direct production validation that Mamba-architecture hybrid models can displace pure Transformers at scale without a capability penalty.

Nemotron 3 Super (NVIDIA), which uses a hybrid MoE and Mamba-Transformer design, is the architecture TheQuery covered in March 2026 - 120 billion total parameters, 12 billion active at inference, with Mamba layers enabling the long-context efficiency that makes the active parameter count sustainable.

Bamba-9B (IBM) is an open-source hybrid model combining Transformer expressivity with SSM efficiency, released to support community research into hybrid architectures.

FalconMamba (Technology Innovation Institute) is a 7B parameter model built on a pure Mamba architecture without attention layers - one of the first large-scale demonstrations that pure SSM models can reach competitive performance on standard language modeling benchmarks without any attention.

Last updated: March 21, 2026