Token
Fundamentals
The basic unit of text that a language model reads and produces. A token is typically a word, part of a word, or a punctuation character, depending on how the model's vocabulary was constructed.
Just as a sentence is made of words but a typesetter works in individual letter blocks, the model thinks in tokens, not words - and the granularity of those blocks shapes everything from what it can remember to what you pay.
A token is the smallest unit of text that a language model processes. Before any text reaches a model, it is split into tokens by a tokenizer - a preprocessing step that converts raw strings into sequences of integers, each corresponding to a chunk of text in the model's fixed vocabulary.
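The string-to-integers step can be illustrated with a toy sketch. The vocabulary below is hypothetical and tiny, and the greedy longest-match loop stands in for what real subword tokenizers (e.g. byte-pair encoding) do with far larger vocabularies and merge rules:

```python
# Toy tokenizer: greedy longest-match against a small, hypothetical vocabulary.
# Real tokenizers use learned subword vocabularies of ~100k entries.
VOCAB = {"un": 0, "happiness": 1, "the": 2, " ": 3, "cat": 4, ".": 5}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest piece of text that exists in the vocabulary.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return ids

print(tokenize("the cat"))      # [2, 3, 4]
print(tokenize("unhappiness"))  # [0, 1]  ("un" + "happiness")
```

Note how "unhappiness" becomes two integer ids even though it is one word - the model only ever sees the id sequence, never the raw characters.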
Tokens do not map neatly to words. In English, common words like "the" or "is" are usually a single token. Longer or less frequent words are split into subword pieces: "unhappiness" might become ["un", "happiness"] or ["unhappy", "ness"] depending on the tokenizer. Punctuation, whitespace, and numbers each consume tokens too. As a rough rule of thumb, one token corresponds to about four characters of English text, or roughly three-quarters of a word - meaning 100 tokens is approximately 75 words.
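The rule of thumb above is easy to turn into a quick estimator. This is a heuristic only - actual counts depend on the specific tokenizer - and the function names here are illustrative, not part of any library:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters of English per token."""
    return max(1, round(len(text) / 4))

def estimate_words(token_count: int) -> float:
    """Rough heuristic: ~0.75 words per token, so 100 tokens ~ 75 words."""
    return token_count * 0.75

print(estimate_tokens("a" * 400))  # ~100 tokens for 400 characters
print(estimate_words(100))         # 75.0
```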
Tokens matter practically for three reasons. First, models have a fixed context window measured in tokens: the total number of tokens in the input plus output that the model can attend to at once. A model with a 200,000-token context window can process roughly 150,000 words of combined input and output. Second, API pricing for commercial models is almost universally denominated in tokens - typically per million input tokens and per million output tokens, with output tokens costing more because generating them requires a forward pass per token. Third, the efficiency of a model depends partly on how many tokens it needs to represent a given idea: a tokenizer that requires more tokens for the same content increases both cost and latency.
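The context-window and pricing arithmetic above can be sketched in a few lines. The per-million rates below are placeholders, not any provider's actual prices, and the 200,000-token window is taken from the example in the text:

```python
# Hypothetical per-million-token rates; real prices vary by provider and model.
PRICE_PER_M_INPUT = 3.00    # dollars per 1M input tokens
PRICE_PER_M_OUTPUT = 15.00  # dollars per 1M output tokens (output costs more)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call, billed separately for input and output tokens."""
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT \
         + (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

def fits_context(input_tokens: int, max_output_tokens: int,
                 window: int = 200_000) -> bool:
    """Input and output share one context window, so their sum is what matters."""
    return input_tokens + max_output_tokens <= window

print(round(request_cost(10_000, 2_000), 2))   # 0.06
print(fits_context(150_000, 50_000))           # True
print(fits_context(199_000, 2_000))            # False
```

Budgeting this way makes the input/output asymmetry concrete: at these placeholder rates, 2,000 output tokens cost as much as 10,000 input tokens.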
Different models use different tokenizers with different vocabularies. GPT-4 uses the cl100k_base tokenizer with a roughly 100,000-token vocabulary. Claude's tokenizer has a similar structure, but its vocabulary is not publicly released. Code, non-English languages, and structured formats like JSON often tokenize less efficiently than plain English prose, meaning the same number of characters consumes more tokens.
Understanding tokens is foundational to reasoning about model context limits, API costs, latency, and why certain inputs behave unexpectedly at the edges of a model's capability.
References & Resources
Last updated: March 16, 2026