Subword Tokenization
NLP
A tokenization strategy that splits words into smaller units, balancing vocabulary size against the ability to handle rare and unknown words.
Subword tokenization sits between character-level and word-level tokenization. Rather than treating each word as an indivisible token, subword methods break infrequent words into smaller pieces while keeping common words intact. For example, 'unhappiness' might become ['un', 'happiness'] or ['un', 'happy', 'ness'], depending on the algorithm and its learned vocabulary.
The dominant subword algorithms are Byte-Pair Encoding (BPE), used by GPT and RoBERTa; WordPiece, used by BERT; and SentencePiece, used by T5 and many multilingual models. BPE starts from individual characters and iteratively merges the most frequent adjacent pair of symbols until the target vocabulary size is reached. WordPiece follows a similar procedure but selects merges by how much they improve the likelihood of the training data, rather than by raw pair frequency.
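The BPE merge loop can be sketched in a few lines of Python. The toy corpus, its word frequencies, and the helper names (`most_frequent_pair`, `merge_pair`) are illustrative, not taken from any particular library.

```python
from collections import Counter

def most_frequent_pair(corpus):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Merge every occurrence of the chosen pair into one symbol.
    Note: plain string replace can over-merge once symbols are multi-character;
    a robust implementation uses boundary-aware matching. Fine for this toy."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in corpus.items()}

# Toy corpus: each word is pre-split into characters; values are frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):  # run four merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    merges.append(pair)

print(merges)
print(corpus)
```

After a few steps, frequent character runs such as 'l o w' collapse into single symbols like 'low'; a real tokenizer simply runs many more merges and then applies the learned merge list, in order, to new text.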
Subword tokenization solves the out-of-vocabulary problem that plagued word-level models: any string can be encoded because all characters exist in the vocabulary as fallback tokens. It also allows morphologically related words to share subword pieces, helping models generalize across inflections. The trade-off is that a single word may require several tokens, increasing sequence length and computational cost.
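The character-fallback behavior can be illustrated with a greedy longest-match-first segmenter, a simplification in the spirit of WordPiece decoding. The `encode` function and the toy vocabulary below are hypothetical.

```python
def encode(word, vocab):
    """Greedy longest-match-first segmentation. Because every single
    character is in the vocab, any lowercase string can be encoded."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no fallback token for {word[i]!r}")
    return tokens

# Hypothetical vocabulary: a few learned subwords plus every single letter
# as a fallback token.
vocab = {"un", "happi", "ness", "happiness"} | set("abcdefghijklmnopqrstuvwxyz")

print(encode("unhappiness", vocab))  # ['un', 'happiness']
print(encode("qzx", vocab))          # falls back to characters: ['q', 'z', 'x']
```

Real WordPiece vocabularies additionally mark word-internal pieces with a '##' prefix (e.g. ['un', '##happiness']), which this sketch omits for simplicity.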
Last updated: March 6, 2026