>_TheQuery

Stochastic Gradient Descent

Optimization

An optimization algorithm that updates model parameters using the gradient computed on a small random subset (mini-batch) of the training data rather than the entire dataset.

Stochastic Gradient Descent (SGD) is a variant of gradient descent that computes parameter updates using randomly sampled mini-batches instead of the full training set. While standard gradient descent computes the exact gradient over all training examples (expensive for large datasets), SGD uses noisy gradient estimates from small batches (typically 32-256 examples), making each update much faster.
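The update rule can be sketched in a few lines. This is a minimal illustration on synthetic linear-regression data, not a production implementation; the learning rate, batch size, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + small noise
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

w = np.zeros(d)     # parameters to learn
lr = 0.1            # learning rate (illustrative)
batch_size = 32     # mini-batch size (illustrative)

for step in range(500):
    # Sample a random mini-batch instead of using all n examples
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error on the mini-batch only
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size
    # Parameter update: step opposite the (noisy) gradient estimate
    w -= lr * grad
```

Each iteration touches only 32 examples, so an update costs a small, constant amount of work regardless of dataset size.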

The noise in SGD's gradient estimates is actually beneficial. It helps escape saddle points and poor local minima in the high-dimensional loss landscape of neural networks. Research has shown that in high-dimensional spaces, most critical points are saddle points rather than local minima, and SGD's inherent noise provides the perturbation needed to escape them. SGD also has an implicit regularization effect: it biases the optimization toward flatter minima that tend to generalize better.
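The character of this noise is easy to see numerically: mini-batch gradients are unbiased estimates of the full gradient, but they scatter around it. The one-dimensional loss below is a hypothetical example chosen so both gradients have closed forms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Per-example loss l_i(w) = (w - x_i)^2, so grad_i(w) = 2 * (w - x_i)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)
w = 0.0

# Exact (full-dataset) gradient at w
full_grad = 2 * (w - data.mean())

# Many mini-batch gradient estimates taken at the same point w
estimates = np.array([
    2 * (w - rng.choice(data, size=32).mean())
    for _ in range(1_000)
])

# The estimates center on the full gradient (unbiased) but have
# nonzero spread -- this spread is the noise SGD injects at each step.
print(full_grad, estimates.mean(), estimates.std())
```

Averaged over many draws, the mini-batch gradient matches the full gradient; any single draw deviates from it, which is the perturbation discussed above.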

SGD with momentum is the standard variant, where a running average of past gradients accelerates convergence and dampens oscillations. While adaptive methods like Adam have become popular, SGD with momentum and careful learning rate tuning often achieves the best final generalization performance, particularly for computer vision tasks. The choice between SGD and adaptive methods remains an active question in practice.
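The momentum variant adds a velocity term that accumulates past gradients. A minimal sketch of the classical (heavy-ball) form follows; the learning rate and momentum coefficient are illustrative defaults, and the quadratic loss is a hypothetical example whose gradient at w is simply w.

```python
import numpy as np

def sgd_momentum_step(w, grad, v, lr=0.01, momentum=0.9):
    """One heavy-ball update: v <- momentum * v - lr * grad; w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

# Illustrative use: minimize f(w) = 0.5 * ||w||^2, whose gradient is w
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, w, v)
# w is now close to the minimum at the origin
```

Because v is a decaying sum of past gradients, components that keep pointing the same way build up speed, while components that flip sign between steps largely cancel, which is the dampening effect described above.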

Last updated: February 22, 2026