Residual Connection
Deep Learning
A shortcut that adds a layer's input directly to its output (y = F(x) + x), enabling the training of very deep networks by providing a gradient highway that prevents vanishing gradients.
Residual connections, introduced in ResNets by He et al. in 2015, add the input via an identity shortcut to the output of a stack of layers: y = F(x) + x, where F(x) is the learned transformation. Instead of learning the full desired mapping H(x), the network only needs to learn the residual F(x) = H(x) - x. If the optimal transformation is close to the identity, the network just needs to learn an F(x) close to zero, which is much easier to optimize.
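As an illustrative sketch (not the original ResNet implementation), a residual block can be written in a few lines of NumPy, with F a small two-layer transformation whose weights are hypothetical:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a two-layer MLP with ReLU.
    The shortcut adds the unmodified input back to F's output."""
    h = np.maximum(0.0, x @ W1)   # first layer + ReLU
    fx = h @ W2                   # second layer: the residual F(x)
    return fx + x                 # identity shortcut

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W1 = rng.standard_normal((8, 16)) * 0.1
W2 = rng.standard_normal((16, 8)) * 0.1

y = residual_block(x, W1, W2)
print(y.shape)  # (4, 8)

# With zero weights, F(x) = 0 and the block reduces exactly to the identity:
assert np.allclose(residual_block(x, np.zeros((8, 16)), np.zeros((16, 8))), x)
```

The zero-weight check at the end makes the "easy to learn the identity" point concrete: F(x) = 0 gives y = x with no effort from the weights.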
Residual connections mitigate the vanishing gradient problem by providing a gradient highway. During backpropagation, the Jacobian of a residual block is d(y)/d(x) = d(F)/d(x) + I, where I is the identity matrix. The +I term guarantees that gradients can always flow directly backward, regardless of what the learned transformation does. This enables training networks with 100+ layers, whereas plain networks fail to fit even the training data beyond roughly 20 layers.
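A hedged numerical sketch of the +I effect, in the simplest scalar setting: each residual block y = f(x) + x has per-block derivative f'(x) + 1, so the end-to-end gradient (a product of per-block derivatives) stays near 1 even when every f' is tiny, while a plain chain's gradient shrinks geometrically. The slope value 0.01 below is an illustrative assumption, not from the source:

```python
def chain_gradient(n_layers, fprime, residual):
    """End-to-end derivative of n stacked scalar blocks, assuming
    every layer has the same local slope fprime (a toy setting).
    Residual block: y = f(x) + x  ->  dy/dx = fprime + 1
    Plain block:    y = f(x)      ->  dy/dx = fprime
    The total gradient is the product over all blocks (chain rule)."""
    per_block = fprime + 1.0 if residual else fprime
    return per_block ** n_layers

plain = chain_gradient(50, fprime=0.01, residual=False)
res = chain_gradient(50, fprime=0.01, residual=True)
print(plain)  # 1e-100: the gradient has vanished
print(res)    # ~1.64: still O(1) after 50 layers
```

The contrast shows why the identity term matters: (0.01)^50 underflows toward zero, while (1.01)^50 stays the same order of magnitude as 1.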
Residual networks can also be viewed as implicit ensembles of exponentially many paths of varying depth (2^L paths for L blocks), where most gradient flow uses short paths of length O(log L). They are now a fundamental component of virtually all deep architectures: CNNs (ResNet, DenseNet), transformers (every layer uses residual connections around attention and feed-forward sub-layers), and generative models (U-Net skip connections).
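The 2^L path count from the ensemble view can be made concrete: in each of the L blocks, a path either goes through F or takes the skip connection, so there are exactly C(L, k) paths that pass through k transformations and 2^L paths in total. A small combinatorial sketch (illustrative only):

```python
from math import comb

L = 10  # number of residual blocks (illustrative choice)

# Each block offers two routes (through F, or skip), giving 2^L total paths;
# exactly comb(L, k) of them pass through k of the L transformations.
path_counts = [comb(L, k) for k in range(L + 1)]

assert sum(path_counts) == 2 ** L  # 1024 paths for 10 blocks
print(path_counts)  # [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]
```

The binomial distribution of path lengths is what makes the "implicit ensemble of paths of varying depth" reading precise.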
Related Terms
Last updated: February 22, 2026