Beam Search
NLP

A decoding algorithm that generates text by keeping track of the top K most probable partial sequences at each step, balancing output quality against the cost of exploring all possible completions.
Instead of always taking the first exit off the highway, you drive a few miles on K routes simultaneously before deciding which one actually leads to the best destination.
Beam search is a heuristic search algorithm used during text generation to find high-probability output sequences without exhaustively evaluating every possible completion. At each generation step, the algorithm maintains a fixed number K of candidate sequences, called the beam width. It extends each candidate with every possible next token, scores the extended sequences by cumulative log-probability, and keeps only the top K. This continues until all beams produce an end-of-sequence token or a maximum length is reached.
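The loop above can be sketched in a few dozen lines of plain Python. The `toy_model` below is a hypothetical hand-built distribution, not a real language model; it is constructed so that the locally best first token does not lead to the globally best sequence, which is exactly the situation beam search is designed to handle.

```python
import math

def toy_model(prefix):
    """Hypothetical hand-built next-token distributions (not a real LM)."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"<eos>": 0.55, "end": 0.45},
        ("a",): {"<eos>": 0.9, "end": 0.1},
        ("the", "end"): {"<eos>": 1.0},
        ("a", "end"): {"<eos>": 1.0},
    }
    return {tok: math.log(p) for tok, p in table[prefix].items()}

def beam_search(step_log_probs, eos, beam_width, max_len):
    """Return the highest-scoring sequence under step_log_probs.

    step_log_probs(prefix) -> {next_token: log_probability}.
    """
    beams = [((), 0.0)]          # (token_sequence, cumulative log-prob)
    finished = []                # completed hypotheses, kept aside
    for _ in range(max_len):
        # Extend every live beam with every possible next token.
        candidates = []
        for seq, score in beams:
            for tok, lp in step_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep only the top-K extensions by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == eos:
                finished.append((seq, score))  # done; doesn't occupy a beam slot
            else:
                beams.append((seq, score))
            if len(beams) == beam_width:
                break
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])
```

With `beam_width=1` this reduces to greedy decoding and returns `("the", "<eos>")` with probability 0.33; with `beam_width=2` it keeps the weaker first token "a" alive and finds the better sequence `("a", "<eos>")` with probability 0.36.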
Greedy decoding -- always picking the single most probable next token -- is the special case where beam width equals 1. The problem with greedy decoding is that a locally optimal choice can lead to a globally poor sequence. Beam search explores K paths simultaneously, recovering from early suboptimal choices by keeping alternatives alive. Wider beams explore more alternatives at the cost of more computation, though in practice quality gains diminish beyond moderate widths.
Beam search was the dominant decoding strategy for sequence-to-sequence tasks like machine translation and summarization throughout the 2010s, and it remains the default for tasks where deterministic, high-quality outputs are required. Translation systems, code generation pipelines, and structured output tasks often use beam search because it consistently produces fluent, coherent results.
For open-ended generation tasks like conversation and creative writing, beam search has largely been replaced by sampling-based methods such as top-k sampling, top-p (nucleus) sampling, and temperature scaling. Beam search tends to produce repetitive, generic text in open-ended settings because the highest-probability sequences are often common phrases. Sampling introduces randomness that produces more varied and natural-sounding outputs.
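The three sampling methods named above compose naturally into a single filtering pipeline. The sketch below is a minimal pure-Python illustration (real implementations operate on tensors); the function name `sample_next` and its parameters are chosen for this example, not taken from any particular library.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id from logits, after temperature scaling and
    optional top-k / top-p (nucleus) filtering. Illustrative sketch."""
    # Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [l / temperature for l in logits]
    # Softmax (shifted by the max for numerical stability).
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = sorted(((i, e / total) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability >= p.
    if top_p is not None:
        kept, cum = [], 0.0
        for i, p in probs:
            kept.append((i, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    # Renormalize over the surviving tokens and draw a sample.
    z = sum(p for _, p in probs)
    r = rng.random() * z
    for i, p in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][0]
```

Setting `top_k=1` recovers greedy decoding; loosening `top_k` or `top_p` admits more of the distribution's tail and produces more varied output.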
Length penalties and coverage penalties are common additions to standard beam search. Without a length penalty, beam search tends to favor shorter sequences, because each additional token multiplies in a probability less than one and drives the cumulative score lower. Length normalization divides the cumulative log-probability by sequence length (or a function of it), putting candidates of different lengths on a comparable footing. Coverage penalties, used mainly in translation, discourage hypotheses that repeatedly attend to the same source tokens while leaving others untranslated.
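Length normalization is a one-line change to the scoring function. The sketch below uses the `((5 + length) / 6)^alpha` penalty from Google's neural machine translation system (Wu et al., 2016); `alpha = 0` recovers the raw cumulative log-probability, and `alpha = 1` is close to plain division by length.

```python
def length_normalized_score(log_prob_sum, length, alpha=0.6):
    """Score a finished hypothesis with the GNMT-style length penalty.

    log_prob_sum: cumulative log-probability of the sequence.
    length:       number of tokens in the sequence.
    alpha:        strength of the normalization (0 = none).
    """
    penalty = ((5.0 + length) / 6.0) ** alpha
    return log_prob_sum / penalty
```

For example, a 3-token hypothesis with cumulative log-probability -4.0 beats a 10-token hypothesis at -6.0 on raw score, but with `alpha=1.0` the normalized scores are -3.0 and -2.4 respectively, so the longer completion wins.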
Last updated: March 13, 2026