Language Modeling
NLP
The task of learning a probability distribution over sequences of tokens, enabling a model to predict or generate text.
A language model assigns a probability to a sequence of tokens. The most common formulation is the autoregressive language model, which factorizes the joint probability of a sequence as a product of conditional probabilities: the probability of each token given all preceding tokens. The training objective is to maximize the likelihood of the training corpus, which is equivalent to minimizing the cross-entropy loss.
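The factorization and its connection to cross-entropy can be sketched with a toy model. The conditional probability table below is hand-made for illustration only, standing in for a learned model:

```python
import math

# Hypothetical conditional distribution P(token | prefix) over a tiny
# vocabulary; in a real model these would come from a neural network.
TABLE = {
    (): {"the": 0.5, "cat": 0.3, "sat": 0.2},
    ("the",): {"the": 0.1, "cat": 0.7, "sat": 0.2},
    ("the", "cat"): {"the": 0.1, "cat": 0.1, "sat": 0.8},
}

def sequence_log_prob(tokens):
    # Chain rule: log P(t_1..t_n) = sum_i log P(t_i | t_1..t_{i-1})
    return sum(
        math.log(TABLE[tuple(tokens[:i])][t]) for i, t in enumerate(tokens)
    )

seq = ["the", "cat", "sat"]
logp = sequence_log_prob(seq)
# Per-token cross-entropy (in nats) is the negative mean log-likelihood;
# maximizing likelihood and minimizing this quantity are the same thing.
cross_entropy = -logp / len(seq)
```

Here the sequence probability is 0.5 × 0.7 × 0.8 = 0.28, and the cross-entropy is just that log-probability negated and averaged over tokens.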
Classical n-gram language models estimated these probabilities from token co-occurrence counts with smoothing techniques to handle unseen sequences. Neural language models replaced lookup tables with learned representations, first using feedforward networks and then recurrent architectures that could theoretically capture arbitrarily long context. The transformer architecture, with its global self-attention, dramatically improved the ability to exploit long-range dependencies and scaled efficiently on modern hardware.
Language modeling is the pretraining objective behind GPT-style models (causal language modeling) and behind BERT-style models through masked language modeling, where randomly masked tokens are predicted from bidirectional context. A model trained well on this objective has implicitly acquired broad world knowledge, grammatical structure, and reasoning patterns, making it a powerful foundation for fine-tuning on specific downstream tasks.
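The masking step of the masked-language-modeling objective can be sketched as follows; the 15% rate and the `[MASK]` symbol follow common practice, but the function itself is illustrative:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Randomly replace a fraction of tokens with a [MASK] symbol.
    # The model is then trained to predict the originals at the
    # masked positions, using context from both directions.
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # position -> original token to predict
        else:
            masked.append(tok)
    return masked, targets
```

Unlike the autoregressive objective, the loss here is computed only at the masked positions, which is why the encoder can attend to the full sequence without leaking the answer.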
Last updated: March 6, 2026