Mixture of Experts Routing
Architecture

The mechanism within a Mixture of Experts model that determines which expert sub-networks process each input token, directly controlling how model capacity is utilized.
Like a bouncer deciding which room each person enters in a club - poor routing sends people to the wrong places, overcrowding some rooms while leaving others empty.
Mixture of Experts routing is the process by which a gating mechanism assigns input tokens to specific expert sub-networks within a Mixture of Experts architecture. The router is typically a small neural network, often a single linear layer followed by a softmax, that produces a score for each expert given an input token.
When a token arrives at a MoE layer, its hidden representation flows through the router network, which multiplies it by learned weights to produce a score for each expert. Softmax normalization turns these scores into probabilities that sum to 1, each representing the likelihood that the corresponding expert should process the token.
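The scoring step above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation; the function and variable names are hypothetical.

```python
import numpy as np

def router_scores(hidden, router_weights):
    """Score each expert for one token: linear layer followed by softmax.

    hidden: (d_model,) token hidden representation
    router_weights: (d_model, n_experts) learned routing matrix
    Returns a probability per expert; the probabilities sum to 1.
    """
    logits = hidden @ router_weights               # one logit per expert
    logits = logits - logits.max()                 # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()                         # softmax normalization

# Illustrative usage with random values: 16-dim token, 4 experts.
rng = np.random.default_rng(0)
probs = router_scores(rng.standard_normal(16), rng.standard_normal((16, 4)))
```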
In token-choice routing (used by Switch Transformers), each token selects its top expert based on the highest probability. The system then bins all tokens by their chosen expert and processes them sequentially until that expert reaches its capacity limit - a fixed maximum number of tokens it can handle. Tokens assigned to a full expert simply don't get processed by that expert in that forward pass; their representations pass through the layer unchanged via the residual connection.
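A toy version of this binning logic, with a hard capacity cutoff, might look like the following. This is a hypothetical sketch of the idea, not the actual Switch Transformers routine.

```python
import numpy as np

def token_choice_route(probs, capacity):
    """Top-1 token-choice routing with a hard per-expert capacity.

    probs: (n_tokens, n_experts) router probabilities
    capacity: maximum tokens one expert can accept this forward pass
    Returns one list of token indices per expert, in arrival order;
    tokens whose chosen expert is already full get no assignment.
    """
    n_experts = probs.shape[1]
    bins = [[] for _ in range(n_experts)]
    for tok, p in enumerate(probs):
        expert = int(np.argmax(p))           # each token picks its top expert
        if len(bins[expert]) < capacity:     # full experts drop overflow tokens
            bins[expert].append(tok)
    return bins
```

For example, if three tokens all prefer expert 0 but its capacity is 2, the third token is simply dropped from that expert's batch.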
The key design decision is simplicity over perfect load balancing. Rather than complex negotiations about who processes what, the system uses a hard capacity cutoff. During training, auxiliary losses push experts toward receiving equal numbers of tokens, and router logits include noise to prevent rigid routing patterns. At inference time, the noise is removed for deterministic routing.
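One widely used auxiliary loss (the Switch Transformers formulation) multiplies, for each expert, the fraction of tokens routed to it by the mean probability the router assigns to it. A minimal sketch, assuming top-1 routing:

```python
import numpy as np

def load_balance_loss(probs):
    """Switch-style auxiliary load-balancing loss (one common formulation).

    probs: (n_tokens, n_experts) router probabilities after softmax.
    f[i] = fraction of tokens whose top-1 choice is expert i
    p[i] = mean router probability assigned to expert i
    The loss reaches its minimum of 1.0 when routing is perfectly uniform,
    so gradient descent pushes experts toward equal token counts.
    """
    n_tokens, n_experts = probs.shape
    top1 = probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / n_tokens
    p = probs.mean(axis=0)
    return n_experts * float(f @ p)
```

An imbalanced batch (all tokens routed to one expert) yields a loss above 1.0, penalizing the router.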
After experts process their assigned tokens, outputs are combined using a weighted average where the weights come from the router probabilities. This weighted combination allows downstream layers to integrate contributions from multiple experts even though each token was routed to a single expert for computation.
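With top-1 routing, this weighted combination reduces to scaling the chosen expert's output by its router probability, which is also what lets gradients flow back into the router. A hypothetical sketch, where `experts` is an illustrative list of callables:

```python
import numpy as np

def combine_outputs(token, probs, experts):
    """Weight the selected expert's output by its router probability.

    token: (d_model,) input representation
    probs: (n_experts,) router probabilities for this token
    experts: list of callables, one per expert (illustrative stand-ins)
    Only the top expert runs, but multiplying by its probability keeps
    the routing decision differentiable.
    """
    expert = int(np.argmax(probs))
    return probs[expert] * experts[expert](token)
```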
The quality of routing directly determines whether a MoE model achieves its theoretical capacity advantage. Poor routing causes expert collapse, where a few experts handle most tokens while others sit idle. Modern architectures monitor router statistics during training to detect and correct imbalanced distributions. The routing decision happens at every layer, so a single token may visit different experts at different depths through the network.
References & Resources
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - Fedus et al. (2021)
- Mixture-of-Experts with Expert Choice Routing - Zhou et al. (2022)
- Sparse MoE Implementation - Hugging Face Transformers Documentation
- BASE Layers: Simplifying Training of Large, Sparse Models - Lewis et al. (2021)
Last updated: March 5, 2026