Convolutional Networks
Lesson 8 ⏱ 14 min

Skip connections and residual learning

Video coming soon

Skip Connections - Why ResNet Unlocked 100+ Layer Networks

The degradation problem in deep networks, the residual reformulation F(x) = H(x) - x, how the skip connection provides a gradient highway, the identity initialization property, and a comparison of 20-layer vs. ResNet-152 training curves.

⏱ ~8 min

🧮 Quick refresher

Vanishing gradients

During backpropagation, gradients are computed by the chain rule: each layer multiplies the gradient by the local Jacobian. In deep networks, if those Jacobians have values less than 1, the gradient shrinks exponentially as it travels backward through layers. By the time it reaches early layers, it may be near zero — those layers stop learning.

Example

If each layer's gradient factor is 0.8, after 20 layers the gradient is 0.8²⁰ ≈ 0.012.

After 50 layers: 0.8⁵⁰ ≈ 0.00001.

The early layers receive essentially no gradient signal and cannot update their weights.
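A quick back-of-the-envelope check of those numbers (a toy calculation in Python, not part of any training loop):

# Toy calculation: how a per-layer gradient factor of 0.8 compounds with depth.
factor = 0.8
for depth in (20, 50):
    print(f"{depth} layers: gradient factor {factor ** depth:.2e}")
# 20 layers: gradient factor 1.15e-02
# 50 layers: gradient factor 1.43e-05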

The Degradation Problem

In 2014, the conventional wisdom was that deeper networks should always be more expressive and therefore more accurate. But something strange was observed: networks with 56 layers performed worse than networks with 20 layers — not just on the test set (which might be overfitting), but on the training set itself.

This is the degradation problem: deep networks are harder to optimize, not just harder to generalize. Adding layers was actually hurting training.

The problem can be stated clearly: a deeper network should never need to be worse than a shallower one, because we could always construct an equally good solution by copying the shallower network and having the extra layers learn the identity function (just pass the input through unchanged). But gradient descent doesn't find this solution reliably — it's hard to learn the identity from random initializations.

He et al. (2015) proposed a deceptively simple fix.

The Residual Reformulation

Instead of asking the layers to learn the desired mapping H(x) directly, reformulate the problem: ask them to learn the residual F(x):

H(x) = F(x) + x \quad \Longleftrightarrow \quad F(x) = H(x) - x

where F(x) is the residual function learned by the layers, x is the input to the block (also called the identity shortcut), and H(x) is the desired block output.

The residual F(x) is implemented by the actual layers (two or three convolutions plus normalization). The input x bypasses those layers via a skip connection (also called a shortcut connection) and is added directly to the output.
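As a minimal sketch of the reformulation (toy shapes and a plain fully connected branch, purely illustrative; the full convolutional version appears later in this lesson):

import torch
import torch.nn as nn

# The layers implement only F(x); the skip connection adds x back,
# so the block as a whole computes H(x) = F(x) + x.
F_branch = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

x = torch.randn(4, 16)       # input to the block (the identity shortcut)
H = F_branch(x) + x          # H(x) = F(x) + x, so the layers only learn H(x) - x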

The Gradient Highway

The gradient flow through a residual block is what makes deep networks trainable. By chain rule:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H} \cdot \frac{\partial H}{\partial x} = \frac{\partial L}{\partial H} \cdot \left(1 + \frac{\partial F}{\partial x}\right)

where L is the loss function, H is the block output, F is the residual function learned by the layers, and x is the input (the skip connection).

Two terms:

  1. Direct path: ∂L/∂H — gradient flowing directly back through the skip connection (the "highway")
  2. Learned path: ∂L/∂H · ∂F/∂x — gradient flowing through the layers

Even if the layers saturate and ∂F/∂x ≈ 0 (vanishing case), there is still a direct gradient path; only the unlikely canceling case ∂F/∂x ≈ -1 could suppress it. The skip connection guarantees at least one clear route back to earlier layers.
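A small autograd sketch of this effect (a hypothetical one-dimensional "block" whose layer path is deliberately zeroed out, so ∂F/∂x = 0):

import torch

x = torch.randn(8, requires_grad=True)
w = torch.zeros(8)            # "dead" layers: F(x) = w * x = 0, so dF/dx = 0
H = w * x + x                 # residual block output H(x) = F(x) + x
H.sum().backward()            # upstream gradient dL/dH = 1 for every element
print(x.grad)                 # all ones: the skip connection alone carries the gradient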

In a network with many residual blocks, the gradient path has many branches:

\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{i=1}^{L} \left(1 + \frac{\partial F_i}{\partial x_i}\right)

where L is the loss, x_0 is the early-layer input, x_L is the output after L residual blocks, and F_i is the layer transformation in block i.

The product of terms (1 + ∂F_i/∂x_i) avoids the vanishing gradient problem: even if all ∂F_i/∂x_i → 0, each factor tends to 1, so the product is 1, not 0.
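A numerical toy comparison (the per-block Jacobian values here are made up for illustration): in a plain stack the chain-rule factors are the small values themselves and the product collapses, while the residual factors are 1 plus those values and stay near 1.

import random

random.seed(0)
plain, residual = 1.0, 1.0
for _ in range(50):
    f = random.uniform(-0.1, 0.1)    # small per-block dF_i/dx_i
    plain *= f                       # plain stack: factor is dF_i/dx_i itself
    residual *= 1.0 + f              # residual stack: factor is 1 + dF_i/dx_i
print(f"plain: {plain:.3e}   residual: {residual:.3f}")
# plain is vanishingly small; residual stays close to 1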

Identity Initialization Property

With small random weight initialization, F(x) ≈ 0 initially. Therefore:

H(x) = F(x) + x \approx x

Every residual block starts as approximately the identity function. The entire deep network starts as an identity mapping. This is a remarkably stable initialization: the network can be made arbitrarily deep without immediately producing garbage outputs.

Training then proceeds by having each block learn small perturbations from identity, gradually building up the complex transformation.
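A sketch of this property (it assumes zero-initializing the last layer of the branch so that F(x) is exactly zero at the start; real ResNets approximate the same effect, e.g. by zero-initializing the final BatchNorm scale in each branch):

import torch
import torch.nn as nn

branch = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
nn.init.zeros_(branch[2].weight)     # zero the last layer so F(x) = 0 initially
nn.init.zeros_(branch[2].bias)

x = torch.randn(4, 16)
H = branch(x) + x                    # H(x) = F(x) + x
print(torch.allclose(H, x))          # True: the block starts as the identity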

When Skip Dimensions Don't Match

If the input x has different dimensions than the output F(x) (a different number of channels or a different spatial size), the skip connection must project x to match:

H(x) = F(x) + W_s x

where W_s is a learnable projection matrix (a 1×1 convolution for spatial inputs, or a linear layer for fully connected ones).

In ResNet, projection shortcuts are used at the start of each stage when the number of channels doubles and spatial resolution halves. They use a 1×1 convolution with stride 2.

Code: Residual Block in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                         # save the skip connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # don't apply ReLU yet
        out = out + residual                  # add skip connection
        return F.relu(out)                   # ReLU after addition

class ProjectionResidualBlock(nn.Module):
    """Used when channel count changes (e.g., at stage transitions)."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection: 1×1 conv to match dimensions
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

The identity shortcut (no parameters) is preferred when dimensions match — it adds no computation. The projection shortcut (1×1 conv) is only used at stage boundaries where dimensions change.
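A quick shape check using the two blocks above (a hypothetical stage layout; the channel counts and input size are just examples):

stage = nn.Sequential(
    ResidualBlock(64),
    ResidualBlock(64),
    ProjectionResidualBlock(64, 128, stride=2),   # stage boundary: channels double, resolution halves
    ResidualBlock(128),
)
x = torch.randn(1, 64, 56, 56)
print(stage(x).shape)    # torch.Size([1, 128, 28, 28])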

Quiz

1 / 3

In a residual block with output H(x) = F(x) + x, what does F(x) represent?