
Regularization

Optimization

A set of techniques that constrain model complexity during training to prevent overfitting and improve generalization to unseen data.

Regularization adds a penalty for model complexity to the training objective, encoding the principle that simpler models generalize better (Occam's Razor). Instead of just minimizing prediction error, the model minimizes error plus a complexity penalty: Loss = Error + lambda * Complexity.
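The objective above can be sketched in a few lines. This is a minimal illustration, assuming mean squared error as the error term and the L2 penalty (sum of squared weights) as the complexity term; the data, weights, and lambda values are synthetic, not from the entry.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Loss = Error + lambda * Complexity, with MSE error and an L2 penalty."""
    error = np.mean((X @ w - y) ** 2)   # prediction error (MSE)
    complexity = np.sum(w ** 2)         # complexity penalty (squared weights)
    return error + lam * complexity

# Tiny synthetic problem for illustration
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
w = np.array([0.5, 0.1])

print(regularized_loss(w, X, y, lam=0.0))  # plain prediction error
print(regularized_loss(w, X, y, lam=1.0))  # error plus complexity penalty
```

With lambda = 0 the objective reduces to ordinary error minimization; any positive lambda adds a cost for large weights, which is what discourages overly complex fits.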

The most common forms are L2 regularization (Ridge) and L1 regularization (Lasso). L2 penalizes the sum of squared weights and shrinks all weights toward zero; L1 penalizes the sum of absolute weights and drives some weights to exactly zero, performing automatic feature selection. Elastic Net combines both penalties. The regularization strength lambda controls the bias-variance tradeoff: higher lambda yields a simpler model with more bias but less variance.
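The contrasting behavior of the two penalties can be sketched with plain NumPy, avoiding any library dependencies. The data here is synthetic: ridge regression has the closed form w = (XᵀX + lambda·I)⁻¹Xᵀy, which shrinks every weight, while L1's proximal step is soft-thresholding, which snaps small weights to exactly zero. (The single soft-threshold step shown is an illustration of the L1 mechanism, not a full Lasso solver.)

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1; the L1 update used inside Lasso solvers."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Synthetic data: two strong features, one nearly irrelevant feature
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.05]) + 0.1 * rng.normal(size=50)

w_ols = ridge_weights(X, y, 0.0)   # unregularized least squares
w_l2 = ridge_weights(X, y, 10.0)   # L2: all weights shrunk toward zero
w_l1 = soft_threshold(w_ols, 0.5)  # L1 step: small weight set exactly to zero

print(np.round(w_ols, 3))  # all three weights nonzero
print(np.round(w_l2, 3))   # uniformly shrunk
print(np.round(w_l1, 3))   # the near-zero weight becomes exactly 0
```

Raising lambda in the ridge solve shrinks the whole weight vector (more bias, less variance), while the soft-threshold zeroes out the weak feature entirely, which is the feature-selection effect the entry attributes to L1.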

Regularization has deep mathematical foundations. L2 regularization is equivalent to maximum a posteriori (MAP) estimation with a Gaussian prior on the weights (encoding the belief that weights should be small), while L1 regularization corresponds to a Laplace prior (encoding sparsity). Other regularization techniques include dropout (randomly zeroing neurons during training), early stopping (halting training before the model overfits), and data augmentation (artificially expanding the training set). All regularization methods share the same philosophy: constrain the hypothesis space to prevent the model from fitting noise.
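Of the techniques above, dropout is simple enough to sketch directly. This is a minimal illustration of the common "inverted dropout" variant, assuming a drop probability p: each activation is zeroed with probability p during training and the survivors are scaled by 1/(1-p) so the expected activation is unchanged, while at inference the layer is the identity.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    if not training or p == 0.0:
        return activations          # inference: pass activations through unchanged
    mask = rng.random(activations.shape) >= p   # keep each unit with prob 1 - p
    return activations * mask / (1.0 - p)       # rescale to preserve the expectation

rng = np.random.default_rng(42)
a = np.ones(10)
out = dropout(a, p=0.5, rng=rng)
print(out)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Because a different random subset of units is silenced on every training step, no single neuron can be relied on, which pushes the network toward redundant, more robust representations rather than fitting noise.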

Last updated: February 22, 2026