Quantization
Deep Learning
A technique that reduces the numerical precision of a model's weights and activations, shrinking memory usage and speeding up inference with minimal loss in accuracy.
Quantization is the process of converting a neural network's parameters from high-precision floating point numbers (typically 32-bit or 16-bit) to lower-precision representations such as 8-bit integers (INT8) or even 4-bit integers (INT4). This reduces the model's memory footprint and increases inference speed, since lower-precision arithmetic requires less memory bandwidth and can be computed faster on modern hardware.
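The core operation can be sketched in a few lines. Below is a minimal, illustrative example of symmetric per-tensor INT8 quantization: a single scale factor maps the largest absolute weight to 127, and every weight is rounded onto that integer grid. The function names are made up for illustration; real libraries use per-channel or per-group scales and other refinements.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization.
# Function names are illustrative, not from any specific library.

def quantize_int8(weights):
    """Map float weights onto the INT8 grid using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest value maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 representation."""
    return [qi * scale for qi in q]

weights = [0.1, -0.54, 1.27, -1.0, 0.003]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered value differs from the original by at most scale / 2,
# which is the rounding error the quantized model must tolerate.
```

Storing the integer list plus one float scale is what saves memory: each weight takes 1 byte instead of 4, at the cost of the rounding error shown above.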
There are two main approaches. Post-training quantization (PTQ) takes an already-trained model and converts its weights to lower precision, sometimes using a small calibration dataset to minimize accuracy loss. Quantization-aware training (QAT) simulates low-precision arithmetic during the training process itself, allowing the model to learn to compensate for the reduced precision and typically producing better results than PTQ.
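The "simulates low-precision arithmetic" step in QAT is often implemented as a fake-quantization op: values are rounded to the integer grid and immediately converted back to float, so the forward pass experiences the rounding error while training itself stays in floating point. A hedged sketch of that op, with an assumed fixed scale (real frameworks also learn or calibrate the scale, and treat the op as identity in the backward pass via the straight-through estimator):

```python
# Sketch of the fake-quantization op used in quantization-aware training:
# round to the INT8 grid, then convert straight back to float, so later
# layers see the same rounding error the deployed INT8 model will produce.

def fake_quantize(x, scale):
    q = max(-128, min(127, round(x / scale)))  # snap to the INT8 grid
    return q * scale                           # back to float for the next layer

scale = 0.02  # assumed fixed for illustration; frameworks calibrate or learn it
activations = [0.333, -1.05, 0.071]
simulated = [fake_quantize(a, scale) for a in activations]
# `simulated` carries quantization error during training, letting the
# network adjust its weights to compensate.
```

Because the model trains against this error directly, QAT usually recovers more accuracy at low bit widths than converting the weights after the fact.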
Quantization has become essential for deploying large language models on consumer hardware. A 7-billion parameter model stored in 16-bit precision requires roughly 14 GB of memory, but the same model quantized to 4-bit precision fits in approximately 3.5 GB, making it runnable on a laptop GPU or even a phone. Popular quantization formats include GPTQ, GGUF (used by llama.cpp and Ollama), and AWQ. The tradeoff is always between model size, speed, and accuracy: aggressive quantization saves memory but can degrade output quality, particularly on tasks requiring precise reasoning.
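The memory figures above follow directly from parameter count times bytes per parameter. A back-of-the-envelope calculation (ignoring the small overhead real formats add for scales and zero points, and any activation memory):

```python
# Rough model memory: parameters * bits per parameter, converted to gigabytes.
# Real quantized formats add minor overhead (per-group scales, zero points)
# that this estimate ignores.

def model_size_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

params = 7e9                         # a "7B" model
fp16 = model_size_gb(params, 16)     # roughly 14 GB
int4 = model_size_gb(params, 4)      # roughly 3.5 GB
```

The same arithmetic explains why halving the bit width halves the memory footprint, independent of the model architecture.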
Last updated: February 26, 2026