Language Modeling
NLP
The task of learning a probability distribution over sequences of tokens, enabling a model to predict or generate text.
A language model assigns a probability to a sequence of tokens. The most common formulation is the autoregressive language model, which factorizes the joint probability of a sequence as a product of conditional probabilities: the probability of each token given all preceding tokens. The training objective is to maximize the likelihood of the training corpus, which is equivalent to minimizing the cross-entropy loss.
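The factorization and its connection to cross-entropy can be sketched with a toy model. The conditional probability table below is hand-made for illustration only, standing in for a learned model:

```python
import math

# Hypothetical conditional distribution P(token | prefix) over a tiny
# vocabulary; in a real model these would come from a neural network.
TABLE = {
    (): {"the": 0.5, "cat": 0.3, "sat": 0.2},
    ("the",): {"the": 0.1, "cat": 0.7, "sat": 0.2},
    ("the", "cat"): {"the": 0.1, "cat": 0.1, "sat": 0.8},
}

def sequence_log_prob(tokens):
    # Chain rule: log P(t_1..t_n) = sum_i log P(t_i | t_1..t_{i-1})
    return sum(
        math.log(TABLE[tuple(tokens[:i])][t]) for i, t in enumerate(tokens)
    )

seq = ["the", "cat", "sat"]
logp = sequence_log_prob(seq)
# Per-token cross-entropy (in nats) is the negative mean log-likelihood;
# maximizing likelihood and minimizing this quantity are the same thing.
cross_entropy = -logp / len(seq)
```

Here the sequence probability is 0.5 × 0.7 × 0.8 = 0.28, and the cross-entropy is just that log-probability negated and averaged over tokens.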
Classical n-gram language models estimated these probabilities from token co-occurrence counts with smoothing techniques to handle unseen sequences. Neural language models replaced lookup tables with learned representations, first using feedforward networks and then recurrent architectures that could theoretically capture arbitrarily long context. The transformer architecture, with its global self-attention, dramatically improved the ability to exploit long-range dependencies and scaled efficiently on modern hardware.
Language modeling is the pretraining objective behind GPT-style models (causal language modeling) and behind BERT-style models through masked language modeling, where randomly masked tokens are predicted from bidirectional context. A model trained well on this objective has implicitly acquired broad world knowledge, grammatical structure, and reasoning patterns, making it a powerful foundation for fine-tuning on specific downstream tasks.
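The masking step of the masked-language-modeling objective can be sketched as follows; the 15% rate and the `[MASK]` symbol follow common practice, but the function itself is illustrative:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Randomly replace a fraction of tokens with a [MASK] symbol.
    # The model is then trained to predict the originals at the
    # masked positions, using context from both directions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets
```

Unlike the autoregressive objective, the loss here is computed only at the masked positions, which is why the encoder can attend to the full sequence without leaking the answer.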
Last updated: March 6, 2026