>_TheQuery

Learning Rate

Optimization

A hyperparameter that controls how much model weights are adjusted in response to the estimated error during each step of gradient descent optimization.

The learning rate determines the step size when updating model parameters during training. At each iteration, weights are updated by: weight = weight - learning_rate * gradient. This single number has an outsized impact on whether training succeeds or fails.
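The update rule can be sketched in a few lines. This is a minimal illustration on a one-dimensional loss, L(w) = (w - 3)^2, chosen here for clarity; the function names and values are illustrative, not from any particular library.

```python
# One gradient descent step on a single weight.
# Loss: L(w) = (w - 3)^2, so the gradient is 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3.0)

learning_rate = 0.1
weight = 0.0

# The core update rule: weight = weight - learning_rate * gradient
weight = weight - learning_rate * gradient(weight)
print(weight)  # 0.6 -- one step toward the minimum at w = 3
```

Each step moves the weight against the gradient; the learning rate scales how far.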

If the learning rate is too high, updates overshoot the minimum and training diverges: the loss oscillates wildly or blows up to infinity or NaN. If the learning rate is too low, training converges extremely slowly, requiring thousands of extra iterations and potentially getting stuck in poor local minima. The ideal learning rate depends on the loss landscape, model architecture, batch size, and training stage.

Modern practice often uses learning rate schedules that change the rate during training: warmup (start small and increase), cosine annealing (gradually decrease), or step decay (reduce at fixed intervals). Adaptive optimizers like Adam automatically adjust effective learning rates per-parameter. Finding the right learning rate is often the single most important hyperparameter tuning decision, and techniques like learning rate range tests help identify good starting values.
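A warmup-plus-cosine schedule can be sketched as a standalone function. Everything here (the function name, the default values, the linear warmup shape) is an illustrative assumption, not the API of any specific framework.

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Linear warmup to base_lr, then cosine annealing toward zero.

    Illustrative sketch: names and defaults are assumptions, not a
    library API.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: decay smoothly from base_lr to 0 over the rest.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The rate rises during warmup, peaks at base_lr, then falls along a half cosine, reaching zero at the final step; in practice this curve is queried once per optimizer step.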

Last updated: February 22, 2026