
Exploding Gradients

Deep Learning

A training problem where gradients grow exponentially large as they propagate backward through many layers, causing weight updates to be enormous and training to diverge.

Exploding gradients are the opposite of vanishing gradients: when the gradient factor at each layer exceeds 1, the backpropagated signal grows exponentially. After L layers with a per-layer factor of 1.1, the gradient magnitude scales as 1.1^L; for 50 layers, 1.1^50 ≈ 117. Gradients become enormous, causing massive weight updates that destabilize training, and loss values spike to infinity or become NaN.
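The compounding effect above can be verified with a few lines of arithmetic; the factor and depth here are the illustrative values from the text, not measurements from any particular network:

```python
# Sketch: how a per-layer gradient factor greater than 1 compounds
# across layers during backpropagation.
factor = 1.1   # example per-layer gradient factor
layers = 50    # example network depth

magnitude = factor ** layers
print(f"Gradient magnitude after {layers} layers: {magnitude:.1f}")
```

Even a factor only slightly above 1 produces a two-order-of-magnitude blowup at this depth; at a factor of 1.5 the same 50 layers would exceed 6 x 10^8.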

This problem commonly manifests in deep recurrent networks, where gradients compound across many unrolled time steps, and in very deep feedforward networks without proper initialization or normalization. A team training a 50-layer RNN saw its loss go to NaN within 10 iterations due to gradient explosion. The symptoms are unmistakable: loss suddenly jumps to very large values or NaN, weights become extremely large, and training breaks down completely.

The primary solution is gradient clipping: capping the gradient magnitude at a threshold (e.g., clipping gradients to norm 1.0). Other remedies include proper weight initialization (Xavier, He), normalization layers (batch norm, layer norm), residual connections, and LSTM/GRU architectures for sequential data. Gradient clipping itself is simple: if ||g|| > threshold, rescale g to g · threshold / ||g||. Modern deep learning frameworks provide gradient clipping as a standard training utility.
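The clipping rule above can be sketched in a few lines of NumPy; `clip_grad_norm` is an illustrative helper name, not a library function:

```python
import numpy as np

def clip_grad_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm (clipping by norm)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        # Rescale: g <- g * threshold / ||g||, preserving direction
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])           # ||g|| = 5, exceeds the threshold
clipped = clip_grad_norm(g, 1.0)   # direction kept, norm capped at 1.0
print(clipped, np.linalg.norm(clipped))
```

In practice, frameworks apply this across all parameters at once; PyTorch, for example, exposes it as `torch.nn.utils.clip_grad_norm_`.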

Last updated: February 22, 2026