>_TheQuery

Inference

MLOps

The process of using a trained model to make predictions on new, unseen data, as opposed to the training phase where the model learns from labeled examples.

Inference is the deployment-time operation where a trained model processes new inputs to produce predictions. During training, the model learns parameters from labeled data through forward and backward passes; during inference, only the forward pass runs: input goes in, a prediction comes out, and no weight updates occur. This distinction matters for computational requirements, latency, and cost.
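The forward-pass-only distinction can be sketched with a hypothetical one-layer logistic-regression model in plain Python. The function names (`forward`, `train_step`) and the toy weights are illustrative, not from any real framework:

```python
import math

def forward(weights, bias, x):
    """Forward pass only: this is all that runs at inference time."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid probability

def train_step(weights, bias, x, label, lr=0.1):
    """Training runs the same forward pass, plus a backward pass
    that computes gradients and updates the parameters."""
    pred = forward(weights, bias, x)
    grad = pred - label  # dLoss/dz for log loss
    new_weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * grad
    return new_weights, new_bias

weights, bias = [0.5, -0.25], 0.0

# Inference: parameters are frozen; the same input always yields the same output.
p = forward(weights, bias, [1.0, 2.0])

# Training: the same forward pass followed by a gradient step that changes the parameters.
weights, bias = train_step(weights, bias, [1.0, 2.0], label=1.0)
```

In a real deep-learning framework the same idea appears as an explicit inference mode (e.g. disabling gradient tracking) so that intermediate activations are not retained for backpropagation.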

Inference has different performance characteristics from training. It typically requires less memory (there is no need to store intermediate activations for backpropagation), can run on less powerful hardware, and must meet latency constraints that training does not face. In production, inference latency is often the bottleneck: ad auctions require predictions in milliseconds, while training the same model may take hours or days.
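Because inference is latency-bound, it is usually monitored with tail percentiles rather than averages. A hedged sketch of how that measurement typically looks, with a stand-in `predict` function in place of a real model:

```python
import time

def predict(x):
    # Placeholder for a real model's forward pass.
    return sum(x) / len(x)

def measure_latency_ms(inputs, runs=200):
    """Record per-request wall-clock time and report p50/p99 in milliseconds."""
    samples = []
    for _ in range(runs):
        for x in inputs:
            start = time.perf_counter()
            predict(x)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[min(len(samples) - 1, int(len(samples) * 0.99))]
    return p50, p99

p50_ms, p99_ms = measure_latency_ms([[1.0, 2.0, 3.0]])
```

The p99 figure matters because a latency budget (e.g. an ad auction's millisecond deadline) is violated by the slowest requests, not the typical one.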

Optimizing inference is a major concern in production AI. Techniques include model quantization (reducing numerical precision, for example from 32-bit floats to 8-bit integers), pruning (removing unimportant weights), knowledge distillation (training a smaller model to mimic a larger one), batching requests for GPU efficiency, and caching results for repeated queries. For LLMs specifically, KV caching stores the attention keys and values of previously processed tokens so they are not recomputed at each generation step, dramatically reducing latency.
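Post-training quantization, the first technique above, can be illustrated in a few lines: map float weights to int8 with a single scale factor, then dequantize and measure the round-trip error. Real frameworks use per-channel scales and calibration data; this minimal sketch uses one symmetric scale for the whole tensor:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [qi * scale for qi in q]

weights = [0.82, -0.41, 0.035, -1.27]  # toy float32-style weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now costs 1 byte instead of 4, at the price of a bounded
# rounding error of at most half a quantization step (scale / 2).
max_err = max(abs(w - r) for w, r in zip(weights, restored))
```

The 4x memory reduction shrinks model size and memory bandwidth, which is often the limiting factor for inference throughput on both CPUs and GPUs.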

Last updated: February 22, 2026