KV Cache
Deep Learning

A memory optimization technique in LLM inference that stores previously computed key-value pairs from attention layers, avoiding redundant recalculation when generating each new token.
KV cache (Key-Value cache) is a critical optimization for large language model inference that stores the intermediate key and value tensors computed by the attention mechanism during token generation. Without it, the model would have to recompute the keys and values for the entire sequence at every step; since attention over t tokens costs O(t²), this makes the per-token cost grow quadratically with sequence length and generation prohibitively slow.
LLM inference has two phases. During the prefill phase, the model processes the input prompt and computes key-value pairs for every token across all attention layers. During the decode phase, new tokens are generated one at a time in an autoregressive fashion. The KV cache stores the key-value pairs from all previously processed tokens, so each generation step only computes the query, key, and value for the latest token and attends against the cached history, reducing the per-token cost from quadratic to linear in sequence length.
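The prefill/decode split can be sketched with a toy single-head attention loop. This is a minimal NumPy illustration, not a real model: the dimensions and projection matrices are arbitrary, and the final assertion checks that cached decode matches full recomputation.

```python
# Minimal single-head attention decode with a KV cache (NumPy sketch).
# All sizes and weights are illustrative, not from any real model.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # q: (1, d); K, V: (t, d) -> attention output of shape (1, d)
    return softmax(q @ K.T / np.sqrt(d)) @ V

# Prefill: compute K/V for every prompt token once and cache them.
prompt = rng.standard_normal((5, d))        # 5 "token embeddings"
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# Decode: each new token appends one K/V row; attention reuses the cache.
new_tok = rng.standard_normal((1, d))
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
out_cached = attend(new_tok @ Wq, K_cache, V_cache)

# The equivalent (wasteful) path: reproject the entire sequence each step.
full = np.vstack([prompt, new_tok])
out_full = attend(full[-1:] @ Wq, full @ Wk, full @ Wv)
assert np.allclose(out_cached, out_full)
```

The cached path projects only one token per step, while the uncached path reprojects all six; the outputs are identical, which is exactly why the recomputation is redundant.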
The KV cache is one of the largest consumers of GPU memory during inference, often accounting for up to 70% of total memory usage for long sequences. This has driven significant research into optimization techniques. PagedAttention, introduced by vLLM, borrows the idea of paging from operating system virtual memory: the KV cache is allocated in fixed-size blocks on demand rather than as one contiguous buffer, so memory grows with the sequence instead of being reserved up front. Other approaches include KV cache quantization to reduce precision, selective eviction to keep only the most relevant tokens, merging similar key-value pairs, and chunk-level compression that preserves semantic coherence. These optimizations are essential for serving models with long context windows efficiently.
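The memory pressure is easy to see with a back-of-the-envelope estimate: the cache stores one key and one value vector per token, per layer, per KV head. The configuration below is hypothetical (loosely shaped like a 7B-class model with fp16 weights), chosen only to show the scale.

```python
# Estimate KV cache size: 2 (K and V) x layers x kv_heads x head_dim
# x bytes per element, per token, per sequence in the batch.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 7B-class config: 32 layers, 32 KV heads of dim 128, fp16.
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch=8, dtype_bytes=2) / 1e9
print(f"{gb:.1f} GB")  # → 17.2 GB
```

The total grows linearly with both sequence length and batch size, which is why long-context serving quickly becomes memory-bound and why halving precision through quantization roughly halves the cache footprint.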
Last updated: February 26, 2026