>_TheQuery
← Glossary

Multi-Head Attention

Deep Learning

An extension of attention that runs multiple attention operations in parallel with different learned projections, allowing the model to capture different types of relationships simultaneously.

Multi-head attention divides the attention computation into h parallel 'heads', each with its own learned projection matrices. Each head computes attention independently in a lower-dimensional subspace (dimension d_model/h), then the results are concatenated and projected back to the full dimension: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O.
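The split-compute-concatenate structure above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the weight matrices W_Q, W_K, W_V, W_O and the helper names (`softmax`, `multi_head_attention`) are assumptions for the example, and scaling by sqrt(d_k) follows the standard scaled dot-product convention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention over X of shape (n, d_model) with h heads."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # each (n, d_model)
    # Split each projection into h heads of dimension d_k: (h, n, d_k)
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Scaled dot-product attention, computed independently per head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ Vh                         # (h, n, d_k)
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O
```

Note that a single reshape-and-transpose implements the "split into heads" step, so no explicit loop over heads is needed; all h attention computations run as one batched matrix multiply.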

The key benefit is that different heads can learn to attend to different types of relationships in the data. In language models, researchers have observed that individual heads specialize: some track syntactic dependencies (subject-verb agreement), others capture semantic similarity, and others focus on local context (adjacent tokens). This specialization emerges naturally during training without explicit programming.

Computationally, multi-head attention has the same cost as single-head attention with the same total dimension: O(n^2 * d). The parallelism across heads is handled efficiently by GPUs. Typical configurations use 8-16 heads (e.g., 8 heads with d_k = 64 for d_model = 512). Multi-head attention is a core component of every transformer layer, applied in self-attention (within a sequence), cross-attention (between encoder and decoder), and masked self-attention (for autoregressive generation).
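The cost equivalence follows from the arithmetic: because h * d_k = d_model, the attention FLOPs h * 2n^2 * d_k collapse to 2n^2 * d_model regardless of head count, and the four projection matrices are (d_model, d_model) whether there is one head or sixteen. A small check, with illustrative values of n and d_model chosen for the example:

```python
d_model, n = 512, 1024  # example sizes, not from any particular model

def attention_costs(h):
    d_k = d_model // h
    proj_params = 4 * d_model * d_model      # W_Q, W_K, W_V, W_O: independent of h
    attn_flops = h * (2 * n * n * d_k)       # = 2 * n^2 * d_model for every h
    return proj_params, attn_flops

for h in (1, 8, 16):
    print(h, attention_costs(h))
```

Every head count yields identical parameter and FLOP totals; the heads change how the dimension is partitioned, not how much work is done.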

Last updated: February 22, 2026