Mixture of Experts
Architecture

A model architecture that splits computation across multiple specialized sub-networks (experts), activating only a subset for each input to achieve large model capacity at a fraction of the compute cost.
Mixture of Experts (MoE) is a neural network architecture where the model is divided into multiple parallel sub-networks called experts, each potentially specializing in different types of inputs or tasks. A routing network (also called a gating network) examines each input and decides which experts to activate. Only a small fraction of the total experts process any given input, so the model can have billions or even trillions of parameters while only using a fraction of them per forward pass.
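The routing mechanism described above can be sketched in a few lines. This is a toy illustration, not any particular model's implementation: the router is a single linear map, each "expert" is a single weight matrix (real experts are full MLPs), and all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoELayer:
    """Toy top-k MoE layer: a linear router scores every expert per token,
    the top k experts run, and their outputs are gate-weighted and summed."""
    def __init__(self, d_model, n_experts, k):
        self.k = k
        self.router = rng.standard_normal((d_model, n_experts)) * 0.02
        # Each expert is a single linear map here purely for brevity.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def forward(self, x):           # x: (n_tokens, d_model)
        logits = x @ self.router    # (n_tokens, n_experts) routing scores
        topk = np.argsort(logits, axis=-1)[:, -self.k:]   # chosen expert ids
        gates = softmax(np.take_along_axis(logits, topk, axis=-1), axis=-1)
        out = np.zeros_like(x)
        for t in range(x.shape[0]):             # plain loops for clarity
            for j, e in enumerate(topk[t]):
                out[t] += gates[t, j] * (x[t] @ self.experts[e])
        return out, topk

layer = MoELayer(d_model=16, n_experts=8, k=2)
x = rng.standard_normal((4, 16))
y, chosen = layer.forward(x)
print(y.shape, chosen.shape)   # (4, 16) (4, 2)
```

Note that the experts not selected for a token contribute no compute for it at all; that sparsity is the whole point of the architecture.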
The key advantage is decoupling model capacity from inference cost. A 744B parameter MoE model with 44B active parameters per token has the knowledge capacity of a dense 744B model but an inference cost closer to that of a 44B model. This is how models like Mixtral, GLM-5, and Qwen 3.5's medium series achieve frontier-level performance at significantly lower serving costs than equivalently capable dense models.
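The arithmetic behind that claim, using the figures from the text and the common rule of thumb of roughly 2 FLOPs per active parameter per generated token (an approximation, not an exact cost model):

```python
# Per-token compute of an MoE scales with ACTIVE parameters,
# while knowledge capacity scales with TOTAL parameters.
total_params = 744e9   # figures from the text above
active_params = 44e9

# Rough rule of thumb: ~2 FLOPs per active parameter per token.
moe_flops_per_token = 2 * active_params
dense_flops_per_token = 2 * total_params   # an equally sized dense model

print(f"active fraction: {active_params / total_params:.1%}")
print(f"per-token compute saving: {dense_flops_per_token / moe_flops_per_token:.1f}x")
```

Under these assumptions, only about 6% of the parameters are active per token, a roughly 17x per-token compute saving over a dense model of the same total size. Memory is the catch: all 744B parameters must still be resident to serve requests.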
The main engineering challenges are load balancing (ensuring all experts get trained and used, not just a few popular ones), routing quality (the gating network must learn to send inputs to the right experts), and infrastructure complexity (distributing experts across hardware efficiently). Despite these challenges, MoE has become the dominant architecture for large-scale models where serving cost matters as much as capability.
Last updated: March 5, 2026