Quantization
Deep Learning
A technique that reduces the numerical precision of a model's weights and activations, shrinking memory usage and speeding up inference with minimal loss in accuracy.
Quantization is the process of converting a neural network's parameters from high-precision floating point numbers (typically 32-bit or 16-bit) to lower-precision representations such as 8-bit integers (INT8) or even 4-bit integers (INT4). This reduces the model's memory footprint and increases inference speed, since lower-precision arithmetic requires less memory bandwidth and can be computed faster on modern hardware.
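The core operation can be sketched in a few lines. Below is a minimal, illustrative example of symmetric per-tensor INT8 quantization: a single scale factor maps the largest absolute weight to 127, and every weight is rounded onto that integer grid. The function names are made up for illustration; real libraries use per-channel or per-group scales and other refinements.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization.
# Function names are illustrative, not from any specific library.

def quantize_int8(weights):
    """Map float weights onto the INT8 grid using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0  # largest value maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 representation."""
    return [qi * scale for qi in q]

weights = [0.1, -0.54, 1.27, -1.0, 0.003]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered value differs from the original by at most scale / 2,
# which is the rounding error the quantized model must tolerate.
```

Storing the integer list plus one float scale is what saves memory: each weight takes 1 byte instead of 4, at the cost of the rounding error shown above.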
There are two main approaches. Post-training quantization (PTQ) takes an already-trained model and converts its weights to lower precision, sometimes using a small calibration dataset to minimize accuracy loss. Quantization-aware training (QAT) simulates low-precision arithmetic during the training process itself, allowing the model to learn to compensate for the reduced precision and typically producing better results than PTQ.
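The "simulates low-precision arithmetic" step in QAT is often implemented as a fake-quantization op: values are rounded to the integer grid and immediately converted back to float, so the forward pass experiences the rounding error while training itself stays in floating point. A hedged sketch of that op, with an assumed fixed scale (real frameworks also learn or calibrate the scale, and treat the op as identity in the backward pass via the straight-through estimator):

```python
# Sketch of the fake-quantization op used in quantization-aware training:
# round to the INT8 grid, then convert straight back to float, so later
# layers see the same rounding error the deployed INT8 model will produce.

def fake_quantize(x, scale):
    q = max(-128, min(127, round(x / scale)))  # snap to the INT8 grid
    return q * scale                           # back to float for the next layer

scale = 0.02  # assumed fixed for illustration; frameworks calibrate or learn it
activations = [0.333, -1.05, 0.071]
simulated = [fake_quantize(a, scale) for a in activations]
# `simulated` carries quantization error during training, letting the
# network adjust its weights to compensate.
```

Because the model trains against this error directly, QAT usually recovers more accuracy at low bit widths than converting the weights after the fact.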
Quantization has become essential for deploying large language models on consumer hardware. A 7-billion parameter model stored in 16-bit precision requires roughly 14 GB of memory, but the same model quantized to 4-bit precision fits in approximately 3.5 GB, making it runnable on a laptop GPU or even a phone. Popular quantization formats include GPTQ, GGUF (used by llama.cpp and Ollama), and AWQ. The tradeoff is always between model size, speed, and accuracy: aggressive quantization saves memory but can degrade output quality, particularly on tasks requiring precise reasoning.
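The memory figures above follow directly from parameter count times bytes per parameter. A back-of-the-envelope calculation (ignoring the small overhead real formats add for scales and zero points, and any activation memory):

```python
# Rough model memory: parameters * bits per parameter, converted to gigabytes.
# Real quantized formats add minor overhead (per-group scales, zero points)
# that this estimate ignores.

def model_size_gb(n_params, bits_per_param):
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

params = 7e9                         # a "7B" model
fp16 = model_size_gb(params, 16)     # roughly 14 GB
int4 = model_size_gb(params, 4)      # roughly 3.5 GB
```

The same arithmetic explains why halving the bit width halves the memory footprint, independent of the model architecture.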
Last updated: February 26, 2026