>_TheQuery

Multi-Head Attention

Deep Learning

An extension of attention that runs multiple attention operations in parallel with different learned projections, allowing the model to capture different types of relationships simultaneously.

Picture having multiple reviewers read the same document simultaneously, each looking for a different thing.

Multi-head attention divides the attention computation into h parallel 'heads', each with its own learned projection matrices. Each head computes attention independently in a lower-dimensional subspace (dimension d_model/h), then the results are concatenated and projected back to the full dimension: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O.
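The computation above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name, the list-of-matrices layout for the per-head projections, and the explicit Python loop over heads are assumptions chosen for readability (production code fuses the heads into batched matrix multiplies).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (n, d_model). W_Q, W_K, W_V: lists of h matrices,
    each (d_model, d_k) with d_k = d_model // h. W_O: (h*d_k, d_model)."""
    d_k = W_Q[0].shape[1]
    heads = []
    for i in range(h):
        # Each head projects into its own lower-dimensional subspace
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
        scores = Q @ K.T / np.sqrt(d_k)       # (n, n) scaled dot-product logits
        heads.append(softmax(scores) @ V)     # (n, d_k) per-head output
    # Concatenate the heads and project back to the model dimension
    return np.concatenate(heads, axis=-1) @ W_O   # (n, d_model)
```

Note that the output shape matches the input shape `(n, d_model)`, which is what lets the block be stacked inside a transformer layer.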

The key benefit is that different heads can learn to attend to different types of relationships in the data. In language models, researchers have observed that individual heads specialize: some track syntactic dependencies (subject-verb agreement), others capture semantic similarity, and others focus on local context (adjacent tokens). This specialization emerges naturally during training without explicit programming.

Computationally, multi-head attention has the same cost as single-head attention with the same total dimension: O(n^2 * d). GPUs handle the parallelism across heads efficiently. Typical configurations use 8-16 heads (e.g., 8 heads with d_k = 64 for d_model = 512). Multi-head attention is a core component of every transformer layer, appearing as self-attention (within a sequence), cross-attention (between encoder and decoder), and masked self-attention (for autoregressive generation).
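The "same total cost" claim follows from a quick parameter count: splitting d_model across h heads of dimension d_k = d_model / h leaves the combined projection sizes unchanged. A small sanity check for the typical configuration cited above (the variable names are illustrative):

```python
d_model, h = 512, 8
d_k = d_model // h   # 64, matching the typical configuration above
# Concatenating h heads of width d_k restores the full model dimension
assert h * d_k == d_model

# Attention parameters per layer: Q, K, V projections plus the output
# projection W_O, each d_model x d_model in total regardless of h
params = 3 * d_model * (h * d_k) + (h * d_k) * d_model
assert params == 4 * d_model ** 2   # head count does not change the total
print(params)  # 1048576
```

The head count changes how the dimensions are partitioned, not how many parameters or multiply-adds the layer performs.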

Last updated: February 22, 2026