
BLEU Score

Evaluation

A metric that measures how closely a machine-generated text matches one or more human reference translations, calculated by comparing overlapping word sequences (n-grams) between the output and reference.

Like grading an essay by counting how many exact phrases from the answer key appear in the student's response -- useful as a rough signal, but it penalizes a student who says the same thing in different words.

BLEU (Bilingual Evaluation Understudy) is an automatic evaluation metric originally designed for machine translation, introduced by Papineni et al. at IBM in 2002. It works by counting how many n-grams (contiguous sequences of n words) in the model's output appear in one or more human reference translations, then applying a brevity penalty to discourage overly short outputs.
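The mechanics above can be sketched in a few lines of Python. This is an illustrative single-sentence implementation with whitespace tokenization, not the official corpus-level formulation (real evaluations use tools such as sacreBLEU and aggregate n-gram counts over a whole test set):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precisions,
    geometric mean over n = 1..max_n, times a brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each n-gram count by its maximum count in any single reference,
        # so repeating a reference word cannot inflate the score.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # one zero precision zeroes the geometric mean
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: candidates shorter than the closest reference are penalized.
    closest_ref_len = min((len(r) for r in refs),
                          key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= closest_ref_len else math.exp(1 - closest_ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

An output identical to a reference scores 1.0; one sharing no words with any reference scores 0.0; partial overlap lands in between.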

The score ranges from 0 to 1 (often reported on a 0--100 scale). A score of 1 means the output is identical to a reference; a score of 0 means no n-gram overlap at all. Absolute values depend heavily on the language pair, tokenization, and number of references: strong machine translation systems typically score in the 0.3--0.5 range on standard benchmarks, and even professional human translations rarely approach 1.0 when scored against other humans' references. Scores are therefore only comparable within the same evaluation setup.

BLEU is fast to compute and has historically correlated well with human judgements at the corpus level, which made it the dominant MT evaluation metric for two decades. However, it has significant limitations: it is sensitive to tokenization choices, does not account for synonyms or paraphrasing, ignores fluency and meaning beyond surface overlap, and correlates poorly with human judgement at the sentence level.
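The tokenization sensitivity is easy to demonstrate with the clipped unigram precision (the 1-gram component of BLEU): the exact same sentence pair gets different scores depending only on how contractions are split. The two tokenizations below are hypothetical illustrations, not any standard scheme:

```python
from collections import Counter

def unigram_precision(cand_tokens, ref_tokens):
    # Clipped unigram precision: the 1-gram component of BLEU.
    cand, ref = Counter(cand_tokens), Counter(ref_tokens)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

hyp = "the model can't generalize"
ref = "the model cannot generalize"

# Tokenizer A: plain whitespace split.
p_a = unigram_precision(hyp.split(), ref.split())  # 3 of 4 tokens match: 0.75

# Tokenizer B: also splits contractions ("can't" -> "ca", "n't").
p_b = unigram_precision(["the", "model", "ca", "n't", "generalize"],
                        ref.split())               # 3 of 5 tokens match: 0.60
```

Same sentence pair, two defensible tokenizations, two different scores, which is why tools such as sacreBLEU fix the tokenization scheme so that reported numbers are comparable across papers.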

For large language model evaluation, BLEU is now largely supplemented or replaced by neural metrics like BERTScore, COMET, and BLEURT, which use contextual embeddings to capture semantic similarity rather than exact string matches. Despite its limitations, BLEU remains widely reported because of its long track record, reproducibility, and interpretability as a baseline comparison.

Last updated: March 14, 2026