Vanishing Gradients
Deep Learning
A training problem where gradients become exponentially smaller as they propagate backward through many layers, effectively preventing early layers from learning.
Vanishing gradients occur when the gradient signal diminishes as it is backpropagated through many layers of a deep network. Since backpropagation multiplies gradients at each layer via the chain rule, if each layer's gradient contribution is less than 1, the overall gradient decays exponentially: after L layers with a per-layer gradient factor of 0.9, the signal is 0.9^L. For 50 layers, this gives 0.9^50 ≈ 0.005 -- the first layers receive essentially zero learning signal.
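The exponential decay above can be sketched in a few lines; this is an illustrative computation (not from the source), multiplying a per-layer factor once per layer as the chain rule does:

```python
# Sketch: a per-layer gradient factor below 1 shrinks the
# backpropagated signal exponentially with depth.

def backprop_signal(factor: float, num_layers: int) -> float:
    """Gradient magnitude reaching the first layer after one
    chain-rule multiplication per layer."""
    signal = 1.0
    for _ in range(num_layers):
        signal *= factor  # one multiplication per layer
    return signal

print(f"{backprop_signal(0.9, 50):.3f}")  # prints 0.005, matching the text
```

With a factor of 1.1 instead, the same loop demonstrates the mirror-image problem of exploding gradients.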
This problem was historically most severe with sigmoid and tanh activations, which saturate (their gradient approaches zero) for inputs of large magnitude. In deep recurrent networks processing long sequences, the same weight matrix is applied at every time step, making vanishing gradients particularly acute: gradients from late time steps cannot reach early ones, so the network cannot learn how early tokens should influence later computations. This is why RNNs struggle with long-range dependencies.
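Sigmoid saturation is easy to see numerically. A small sketch (not from the source): the sigmoid's derivative s(x)(1 − s(x)) peaks at 0.25 when x = 0 and collapses toward zero once the unit saturates:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-05: a saturated unit passes almost no gradient
```

Note that even in the best case, stacking L sigmoid layers caps the gradient product at 0.25^L, so decay is guaranteed regardless of the weights.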
Multiple solutions have been developed: ReLU activation (gradient is 1 for positive inputs, never saturates), proper weight initialization (Xavier, He), residual connections (providing a gradient highway that bypasses learned layers), LSTM/GRU architectures (with gating mechanisms that control information flow), and normalization techniques (batch norm, layer norm) that keep activations in well-behaved ranges. These innovations collectively enabled training of networks with hundreds of layers.
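Among those fixes, the effect of residual connections on gradient flow can be sketched with simple arithmetic (an assumed illustration, not from the source): a plain layer contributes a factor f'(x) to the chain-rule product, while a residual block y = x + f(x) contributes 1 + f'(x), so the product stays well away from zero even when each f' is small:

```python
# Sketch: comparing the chain-rule gradient product for plain
# stacked layers versus residual blocks, with a small local
# gradient (0.1, an assumed value) at every layer.

def product(factors):
    out = 1.0
    for f in factors:
        out *= f
    return out

per_layer_grad = 0.1  # hypothetical local gradient f'(x) at each layer
depth = 50

plain = product([per_layer_grad] * depth)           # 0.1**50: vanishes
residual = product([1.0 + per_layer_grad] * depth)  # 1.1**50: survives

print(f"plain:    {plain:.1e}")
print(f"residual: {residual:.1e}")
```

The identity path contributes the constant 1 in each factor, which is the "gradient highway": the signal reaching the first layer no longer depends on every intermediate layer having a large local gradient.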
Last updated: February 22, 2026