>_TheQuery

Self-Attention

Deep Learning

An attention mechanism where queries, keys, and values all come from the same input sequence, allowing each token to attend to every token in the sequence, including itself.

Self-attention is the core operation in transformer architectures: each position in a sequence computes attention weights over all positions in the same sequence. Given an input sequence X, self-attention projects it into queries Q = XW_Q, keys K = XW_K, and values V = XW_V, then computes: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V, where d_k is the key dimension.
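The formula above can be sketched directly in NumPy. This is a minimal single-head version; the function name, random toy inputs, and dimensions (4 tokens, model dim 8, head dim 4) are illustrative choices, not from the source:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    Q = X @ W_Q                                # queries
    K = X @ W_K                                # keys
    V = X @ W_V                                # values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity matrix
    # Row-wise softmax (shifted by the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # each output row: weighted sum of value vectors

# Toy example with illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, model dim 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)
print(out.shape)                               # (4, 4): one head-dim vector per token
```

Note that every output position mixes information from every input position; the learned projection matrices determine which similarities produce large weights.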

Each output token is a weighted sum of all value vectors, where the weights are determined by the similarity between that token's query and every token's key. This lets the model route information from relevant context positions to each output position -- for example, when processing the ambiguous word 'bank', self-attention can look at surrounding words like 'river' or 'money' to determine the intended meaning.

Self-attention contrasts with cross-attention (where queries come from one sequence and keys/values from another, as in encoder-decoder models) and with masked self-attention (used in decoder-only models like GPT, where tokens can only attend to previous positions to prevent looking at future tokens during generation). Self-attention's ability to directly connect any two positions in a sequence, regardless of distance, is what gives transformers their advantage over RNNs for capturing long-range dependencies.
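The masked variant differs only in one step: positions above the diagonal of the score matrix are set to negative infinity before the softmax, so each token's weights over future positions become exactly zero. A minimal sketch under the same illustrative setup as above (function name and sizes are assumptions):

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked self-attention: position i may only attend to positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks future positions
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)     # -inf -> weight 0 after softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: 5 tokens, model dim 8, head dim 4 (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = causal_self_attention(X, W_Q, W_K, W_V)
print(np.triu(weights, k=1).max())                 # upper triangle is all zeros
```

Because the first token can attend only to itself, its attention row is entirely concentrated on position 0; later rows spread their weight over progressively more context.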

Last updated: February 22, 2026