
Positional Encoding

NLP

A technique that injects information about token position into transformer inputs, since the attention mechanism itself is permutation-invariant and has no inherent notion of sequence order.

Self-attention treats its input as a set, not a sequence: swapping the order of tokens simply permutes the attention weights without changing their values. This means 'dog bites man' and 'man bites dog' would be indistinguishable without positional information. Positional encodings solve this by adding position-dependent vectors to the token embeddings before they enter the transformer.
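This permutation invariance is easy to verify directly. The sketch below (a toy single-head attention in NumPy, with made-up weight matrices, no positional encoding) shows that permuting the input rows just permutes the output rows:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """Single-head self-attention with no positional encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

X = rng.normal(size=(5, d))   # 5 "token" embeddings
perm = rng.permutation(5)

# Shuffling the tokens shuffles the output identically -- the model
# sees no difference between the two orderings.
assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```

Adding a distinct positional vector to each row of `X` before the projections is what breaks this symmetry.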

The original transformer uses sinusoidal positional encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). This design is elegant: each position gets a unique encoding, relative positions can be computed as linear transformations (rotation matrices), different dimensions encode position at different frequency scales (like a clock's second/minute/hour hands), and it requires zero learnable parameters. Alternatively, many modern models (BERT, GPT) use learned positional embeddings.
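The sinusoidal scheme above can be written in a few lines of NumPy (a minimal sketch; the function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2), the "2i" index
    angles = pos / (10000.0 ** (i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_encoding(128, 64)   # add pe[:seq_len] to the token embeddings
```

Note how the low dimensions oscillate fast and the high dimensions slowly, giving the clock-hands behavior described above.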

Modern architectures increasingly use relative positional encodings. RoPE (Rotary Position Embedding), used in LLaMA and PaLM, applies rotation matrices to queries and keys so that attention scores depend only on relative position. ALiBi adds a linear distance penalty to attention scores. These approaches generalize better to sequence lengths not seen during training, which is critical for deploying LLMs on variable-length inputs.
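RoPE's key property, that query-key dot products depend only on the relative offset, can be checked with a small sketch. This is a simplified NumPy version (real implementations operate on batched, multi-head tensors; the pairing of dimensions here is one common convention):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair rotation frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).normal(size=8)
k = np.random.default_rng(2).normal(size=8)

# Same relative offset (2) at different absolute positions -> same score:
s1 = rope(q, 5) @ rope(k, 3)
s2 = rope(q, 105) @ rope(k, 103)
assert np.isclose(s1, s2)
```

Because the positions enter only through rotations, shifting both query and key by the same amount leaves the score unchanged, which is exactly why RoPE extrapolates more gracefully than absolute encodings.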

Last updated: February 22, 2026