>_TheQuery

Routing Network

Architecture

A small neural network within a Mixture of Experts model that decides which expert sub-networks should process each input token.

A routing network (also called a gating network or router) is the decision-making component in a Mixture of Experts architecture. For each input token, the router produces a probability distribution over all available experts and selects the top-k experts to handle that token. The router is trained jointly with the experts, learning over time which experts are best suited for which types of inputs.
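The selection step described above can be sketched in a few lines. This is a minimal, illustrative example, not any particular model's implementation; the matrix `W_router` and the dimensions are hypothetical stand-ins for the learned routing weights:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(token, W_router, k=2):
    """Score every expert for one token and pick the top-k.

    token:    (d_model,) input representation
    W_router: (d_model, n_experts) learned routing weights (hypothetical)
    Returns the chosen expert indices and their renormalized mixing weights.
    """
    probs = softmax(token @ W_router)            # distribution over all experts
    top_k = np.argsort(probs)[-k:][::-1]         # indices of the k best experts
    weights = probs[top_k] / probs[top_k].sum()  # renormalize over the chosen k
    return top_k, weights

rng = np.random.default_rng(0)
idx, w = route(rng.normal(size=16), rng.normal(size=(16, 8)), k=2)
```

The token's final output would then be the weighted sum of the selected experts' outputs, using `w` as the mixing weights.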

The simplest router is a linear layer followed by a softmax that produces expert selection probabilities. More sophisticated approaches add auxiliary losses to encourage balanced expert utilization, noisy top-k gating to inject exploration during training, or learned token-to-expert affinity scores that sharpen as training progresses.
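Noisy top-k gating can be sketched as follows: the router perturbs its logits with input-dependent Gaussian noise during training, so near-tied experts are occasionally explored rather than always losing to the current favorite. The matrices `W_gate` and `W_noise` are hypothetical learned parameters, and this is a simplified sketch rather than a specific model's router:

```python
import numpy as np

def noisy_topk_logits(token, W_gate, W_noise, rng, train=True):
    """Router logits with learned, input-dependent exploration noise.

    token:   (d_model,) input representation
    W_gate:  (d_model, n_experts) clean gating weights (hypothetical)
    W_noise: (d_model, n_experts) weights producing a per-expert noise scale
    """
    clean = token @ W_gate
    if not train:
        return clean                       # inference: deterministic routing
    # softplus keeps the learned noise scale positive
    noise_scale = np.log1p(np.exp(token @ W_noise))
    return clean + rng.standard_normal(clean.shape) * noise_scale

rng = np.random.default_rng(0)
token = rng.normal(size=16)
W_gate = rng.normal(size=(16, 8))
W_noise = rng.normal(size=(16, 8))
eval_logits = noisy_topk_logits(token, W_gate, W_noise, rng, train=False)
train_logits = noisy_topk_logits(token, W_gate, W_noise, rng, train=True)
```

Top-k selection then proceeds on the noisy logits exactly as in plain routing; the noise is dropped at inference time.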

Router quality directly determines model quality. A poorly trained router creates two problems: expert collapse, where most tokens get sent to the same few experts while others go unused, and misrouting, where tokens reach experts that are not well-suited to process them. Both waste model capacity. Modern architectures address this through load balancing losses that penalize uneven expert usage and capacity factors that cap how many tokens any single expert can receive per batch.
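A load balancing loss of the kind described above can be sketched like this: it multiplies, per expert, the fraction of tokens actually dispatched to that expert by the router's mean probability for it, so the loss is minimized when usage is uniform. This is a simplified sketch of the general idea, not the exact formulation of any one architecture:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_index, n_experts):
    """Auxiliary loss that penalizes uneven expert usage (simplified sketch).

    router_probs: (n_tokens, n_experts) softmax outputs of the router
    expert_index: (n_tokens,) expert actually assigned to each token (top-1)
    The product f_i * P_i is minimized, summed over experts, when both the
    dispatched-token fraction and the mean router probability are uniform.
    """
    # f_i: fraction of tokens in the batch dispatched to expert i
    f = np.bincount(expert_index, minlength=n_experts) / len(expert_index)
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    return n_experts * np.sum(f * P)
```

With perfectly uniform routing the loss evaluates to 1.0; any skew toward a subset of experts pushes it higher, giving the router a gradient signal to spread tokens out. A capacity factor is enforced separately, by dropping or re-routing tokens once an expert's per-batch quota is full.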

Last updated: March 5, 2026