
Residual Connection

Deep Learning

A shortcut that adds a layer's input directly to its output (y = F(x) + x), enabling training of very deep networks by providing a gradient highway that prevents vanishing gradients.

Consider a highway bypass that lets traffic skip past a congested town: traffic takes the shortcut whenever the detour through town offers nothing useful.

Residual connections, introduced in ResNets by He et al. in 2015, pass the input unchanged around a layer and add it to the layer's output: y = F(x) + x, where F(x) is the learned transformation. Instead of learning the full desired mapping H(x), the network only needs to learn the residual F(x) = H(x) - x. If the optimal transformation is close to the identity, the network just needs to push F(x) toward zero, which is much easier than learning the identity from scratch.
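A minimal numpy sketch of this idea, assuming a two-layer MLP as F(x) (the names `residual_block`, `W1`, `W2` are illustrative, not from any library): with all weights at zero, F(x) = 0 and the block is exactly the identity, so "do nothing" costs the network no learning effort.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a small two-layer MLP."""
    return W2 @ relu(W1 @ x) + x

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)

# Zero weights make F(x) = 0, so the block reduces to the identity mapping.
W1 = np.zeros((d, d))
W2 = np.zeros((d, d))
y = residual_block(x, W1, W2)
print(np.allclose(y, x))  # → True
```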

Residual connections solve the gradient vanishing problem through a gradient highway. During backpropagation, the gradient through a residual block is d(y)/d(x) = d(F)/d(x) + I, where I is the identity matrix. The +I term ensures gradients can always flow directly backward regardless of what the learned transformation does. This enables training networks with 100+ layers, where plain networks fail even to fit training data beyond ~20 layers.
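The "+I" term can be checked numerically. The sketch below assumes a linear layer F(x) = Wx so the Jacobians are exact (W is illustrative): for a plain layer dy/dx = W, while for a residual block y = Wx + x it is W + I. Backpropagating through a deep stack of each kind shows the gradient vanishing in one and surviving in the other.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = 0.01 * rng.standard_normal((d, d))  # a layer with small weights

plain_J = W                  # Jacobian of y = W @ x
residual_J = W + np.eye(d)   # Jacobian of y = W @ x + x

# Backpropagate a unit gradient through 50 identical layers of each kind.
g_plain = np.ones(d)
g_residual = np.ones(d)
for _ in range(50):
    g_plain = plain_J.T @ g_plain
    g_residual = residual_J.T @ g_residual

print(np.linalg.norm(g_plain))     # effectively zero: vanished in the plain stack
print(np.linalg.norm(g_residual))  # stays O(1): the +I term preserves it
```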

Residual networks can also be viewed as implicit ensembles of exponentially many paths of varying depth (2^L paths for L blocks), where most gradient flow uses short paths of length O(log L). They are now a fundamental component of virtually all deep architectures: CNNs (ResNet, DenseNet), transformers (every layer uses residual connections around attention and feed-forward sub-layers), and generative models (U-Net skip connections).
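The path-counting claim can be sketched with the standard library. Each of the L blocks is either traversed through F or skipped via the shortcut, so there are 2^L paths, C(L, k) of which pass through exactly k blocks. Assuming an illustrative per-block gradient attenuation factor c < 1 (a modeling assumption, not a measured value), the total gradient flow per path length peaks at short paths:

```python
import math

L = 10
c = 0.5  # assumed per-block gradient attenuation (illustrative)

# C(L, k) paths traverse exactly k of the L residual blocks.
path_counts = [math.comb(L, k) for k in range(L + 1)]
# Weight each path length by the gradient surviving k traversals.
flow = [count * c**k for k, count in enumerate(path_counts)]

print(sum(path_counts))       # → 1024, i.e. 2**L distinct paths
print(flow.index(max(flow)))  # → 3, well below L/2 = 5: short paths dominate
```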

Last updated: February 22, 2026