
Layer Normalization

Deep Learning

A technique that normalizes the inputs across the features of a single training example during a forward pass, stabilizing training and reducing sensitivity to learning rate choice.

Before passing notes down the row, you standardize the handwriting size so nobody has to squint -- the message stays the same but everyone can read it equally well.

Layer normalization is a technique that normalizes the activations within a single layer for each individual training example, rather than across the batch. For each input, it computes the mean and variance across all features in that layer, then shifts and scales the values to have zero mean and unit variance. Two learned parameters -- gamma (scale) and beta (shift) -- allow the network to restore any representational power lost through normalization.
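The computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular framework's implementation; the function name, the `eps` stability constant, and the toy input are all illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example across its feature dimension.

    x: array of shape (batch, features); gamma/beta: learned (features,) vectors.
    eps is a small constant that prevents division by zero.
    """
    mean = x.mean(axis=-1, keepdims=True)    # per-example mean over features
    var = x.var(axis=-1, keepdims=True)      # per-example variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per example
    return gamma * x_hat + beta              # learned scale and shift

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(), out.std())  # approximately 0 and 1 for each example
```

With `gamma` initialized to ones and `beta` to zeros the layer starts as a pure normalization; training then adjusts both so the network can recover whatever scale and offset the next layer needs.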

The key distinction from batch normalization is what gets averaged over. Batch normalization computes statistics across the batch dimension, meaning its behavior depends on batch size and it performs differently during training and inference. Layer normalization computes statistics across the feature dimension for each individual example, making it independent of batch size and consistent between training and inference.
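The difference in what gets averaged over reduces to which axis the statistics are taken along. A toy sketch (illustrative values, showing only the mean computation):

```python
import numpy as np

# Toy activations: 2 examples in a batch, 3 features each.
acts = np.array([[ 1.0,  2.0,  3.0],
                 [10.0, 20.0, 30.0]])

# Batch norm: one statistic per *feature*, averaged over the batch (axis 0).
# The result depends on which other examples happen to be in the batch.
bn_mean = acts.mean(axis=0)   # shape (3,) -> [ 5.5, 11. , 16.5]

# Layer norm: one statistic per *example*, averaged over features (axis -1).
# Each example is normalized using only its own values.
ln_mean = acts.mean(axis=-1)  # shape (2,) -> [ 2., 20.]
```

Because `ln_mean` never looks across the batch axis, the same computation runs identically for a batch of one at inference time, with no running statistics to maintain.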

This property makes layer normalization the standard choice for transformers and language models. Sequences in NLP have variable lengths and are often processed one at a time during autoregressive generation, making batch normalization impractical. Layer normalization works correctly regardless of sequence length or batch size, which is why every major transformer architecture -- GPT, BERT, Claude, Llama -- uses it or a close variant (Llama, for example, uses RMSNorm, which drops the mean-subtraction step).

In the original transformer paper, layer normalization was applied after each attention and feed-forward sublayer's residual connection (post-norm). Most modern architectures have switched to pre-norm, where normalization is applied before each sublayer. Pre-norm training is more stable at large scale and has largely become the default in frontier model architectures.
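The two orderings differ only in where the normalization sits relative to the residual connection. A schematic sketch, where `sublayer` and `norm` stand in for any attention/feed-forward block and normalization function (hypothetical callables, not tied to a specific library):

```python
def post_norm_block(x, sublayer, norm):
    """Original transformer: normalize *after* the residual addition."""
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    """Modern default: normalize *before* the sublayer.

    The residual path stays un-normalized, which keeps an identity
    route through the network and helps gradients flow at depth.
    """
    return x + sublayer(norm(x))
```

In pre-norm, the raw input `x` is added back untouched, so stacking many blocks never forces the signal through a normalization on the residual path itself; this is the property usually credited for its stability at large scale.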

Layer normalization also accelerates training. The original motivation was reducing internal covariate shift -- the tendency for the distribution of layer inputs to change as weights update -- though later analyses attribute much of the benefit to a smoother optimization landscape. Either way, it permits higher learning rates and reduces the number of steps needed to converge.


Last updated: March 13, 2026