
Transformer

NLP

A neural network architecture based on self-attention mechanisms that processes input data in parallel, forming the basis of modern large language models.

The transformer is a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. Unlike recurrent neural networks that process sequences step by step, transformers use self-attention mechanisms to process all positions of the input simultaneously, enabling massive parallelization during training.
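The core operation behind this parallelism is scaled dot-product self-attention, which computes pairwise interactions between all positions at once. Below is a minimal NumPy sketch; the weight matrices `Wq`, `Wk`, `Wv` and the toy dimensions are illustrative assumptions, not values from any particular model.

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # all pairwise scores in one matrix product
    # softmax over key positions (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of value vectors

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings (assumed sizes).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that nothing in the computation steps through the sequence one position at a time: the score matrix covers all token pairs in a single matrix multiplication, which is what lets transformers parallelize where RNNs cannot.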

A transformer consists of an encoder and a decoder, each built from stacks of layers containing multi-head self-attention and feed-forward sub-layers; each decoder layer additionally includes a cross-attention sub-layer that attends to the encoder's output. The self-attention mechanism allows each token in a sequence to attend to every other token, capturing long-range dependencies far more effectively than RNNs. Because attention itself is order-agnostic, positional encodings are added to the input embeddings to retain information about token order.
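The original paper uses fixed sinusoidal positional encodings, where even dimensions get a sine and odd dimensions a cosine at geometrically spaced frequencies. A short sketch of that scheme (the sequence length and model width below are arbitrary illustrative choices):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2) frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(6, 8)
print(pe.shape)  # (6, 8)
```

These encodings are simply added to the token embeddings before the first layer, giving each position a distinct signature while keeping nearby positions similar.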

Transformers have become the dominant architecture in natural language processing and are increasingly used in computer vision, audio processing, and multimodal AI. Models like GPT, BERT, and their successors are all built on the transformer architecture, and they power today's most capable large language models.

Last updated: February 20, 2026