The Degradation Problem
In 2014, the conventional wisdom was that deeper networks should always be more expressive and therefore more accurate. But something strange was observed: networks with 56 layers performed worse than networks with 20 layers — not just on the test set (which could be explained by overfitting), but on the training set itself.
This is the degradation problem: deep networks are harder to optimize, not just harder to generalize. Adding layers was actually hurting training.
The problem can be stated clearly: at any depth, we could always construct a solution by having the extra layers learn the identity function (just pass input through unchanged). But gradient descent doesn't find this solution reliably — it's hard to learn the identity from random initializations.
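This is easy to check empirically. A minimal sketch (hypothetical dimensions, a plain `nn.Linear` stack rather than the paper's conv nets): a randomly initialized deep stack is nowhere near the identity mapping it would need to learn.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A plain stack of 20 randomly initialized layers. "Copy the input
# through unchanged" would be a perfectly good solution, but random
# initialization puts the network nowhere near it.
plain = nn.Sequential(*[nn.Linear(16, 16) for _ in range(20)])

x = torch.randn(8, 16)
with torch.no_grad():
    y = plain(x)

# Relative distance from the identity mapping y = x.
rel_err = (y - x).norm() / x.norm()
print(f"relative distance from identity: {rel_err:.2f}")
```

At initialization the output is essentially unrelated to the input, so gradient descent has to discover the identity from scratch, layer by layer.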
He et al. (2015) proposed a deceptively simple fix.
The Residual Reformulation
Instead of asking layers to learn the desired mapping H(x) directly, reformulate: ask them to learn the residual F(x) = H(x) - x, so the block computes H(x) = F(x) + x, where:
- F(x): the residual function learned by the layers
- x: the input to the block (also called the identity shortcut)
- H(x): the desired block output

The residual F(x) is implemented by the actual layers (two or three convolutions + normalization). The input x bypasses the layers via a skip connection (also called a shortcut connection) and is added directly to the output.
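A minimal numeric sketch of the reformulation, with a single hypothetical linear layer standing in for the block's conv stack:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical residual function F: one linear layer standing in for
# the block's conv + normalization stack.
f = nn.Linear(4, 4)

x = torch.randn(2, 4)   # block input (the identity shortcut)
h = f(x) + x            # block output: H(x) = F(x) + x

# The layers only ever have to model the difference H(x) - x.
err = (h - x - f(x)).abs().max()
print(f"H(x) - x matches F(x) up to float error: {err:.1e}")
```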
The Gradient Highway
The gradient flow through a residual block is what makes deep networks trainable. For a block computing y = F(x) + x, the chain rule gives:

∂L/∂x = ∂L/∂y · (I + ∂F/∂x) = ∂L/∂y + ∂L/∂y · ∂F/∂x

where:
- L: the loss function
- y = F(x) + x: the block output
- F(x): the residual function learned by the layers
- x: the input / skip connection

Two terms:
- Direct path: ∂L/∂y, the gradient flowing directly back through the skip connection (the "highway")
- Learned path: ∂L/∂y · ∂F/∂x, the gradient flowing through the layers

Even if the layers saturate and the learned-path term partially cancels the direct one (canceling case) or shrinks toward zero because ∂F/∂x ≈ 0 (vanishing case), there's still a direct gradient path. Thanks to the skip connection, the gradient cannot be completely blocked: it always has at least one clear route back to earlier layers.
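A small autograd sketch of the vanishing case, with a hypothetical linear F (not the paper's conv block) whose parameters are zeroed out to mimic ∂F/∂x = 0:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# F is a linear layer whose weights and bias we zero out,
# mimicking the "vanishing" case dF/dx = 0.
f = nn.Linear(8, 8)
nn.init.zeros_(f.weight)
nn.init.zeros_(f.bias)

x = torch.randn(4, 8, requires_grad=True)
y = f(x) + x          # residual block: y = F(x) + x
loss = y.sum()
loss.backward()

# With dF/dx = 0, the chain rule leaves only the direct path:
# dL/dx = dL/dy * (I + dF/dx) = dL/dy, which is all-ones here.
print(torch.allclose(x.grad, torch.ones_like(x)))  # True
```

The learned path contributes nothing, yet the gradient at x is exactly the gradient at y: the highway stays open.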
In a network with many residual blocks, the gradient path has many branches. Applying the chain rule block by block:

∂L/∂x_0 = ∂L/∂x_N · ∏_{i=1}^{N} (I + ∂F_i/∂x_{i-1})

where:
- L: loss
- x_0: early layer input
- x_N: output after N blocks
- F_i: layer transformation in block i

The product of (I + ∂F_i/∂x_{i-1}) terms avoids the vanishing gradient problem: even if all ∂F_i/∂x_{i-1} → 0, the product is the identity, not 0.
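A sketch comparing the two regimes, using hypothetical stacks of small-weight linear layers (the std = 0.05 initialization is an arbitrary choice to force shrinking per-layer Jacobians):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, dim = 50, 16

def small_linear():
    layer = nn.Linear(dim, dim, bias=False)
    nn.init.normal_(layer.weight, std=0.05)  # deliberately small weights
    return layer

plain_layers = [small_linear() for _ in range(depth)]
res_layers = [small_linear() for _ in range(depth)]

x_plain = torch.randn(1, dim, requires_grad=True)
x_res = x_plain.detach().clone().requires_grad_(True)

# Plain chain: x_i = F_i(x_{i-1}); the gradient is a product of
# small Jacobians and collapses toward zero.
h = x_plain
for layer in plain_layers:
    h = layer(h)
h.sum().backward()

# Residual chain: x_i = x_{i-1} + F_i(x_{i-1}); each Jacobian factor
# is (I + dF_i/dx), so the product stays near the identity.
h = x_res
for layer in res_layers:
    h = h + layer(h)
h.sum().backward()

print(f"plain grad norm:    {x_plain.grad.norm():.2e}")
print(f"residual grad norm: {x_res.grad.norm():.2e}")
```

With the same weights, the plain chain's input gradient is numerically negligible while the residual chain's stays at a healthy scale.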
Identity Initialization Property
With small random weight initialization, F(x) ≈ 0 initially. Therefore y = F(x) + x ≈ x:
Every residual block starts as approximately the identity function. The entire deep network starts as an identity mapping. This is a remarkably stable initialization: the network can be made arbitrarily deep without immediately producing garbage outputs.
Training then proceeds by having each block learn small perturbations from identity, gradually building up the complex transformation.
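A sketch of this property, with an explicit (hypothetical) down-scaling of the residual branch's parameters standing in for "small random initialization":

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Scale the residual branch's parameters down and the whole block
# collapses to (approximately) the identity at initialization.
f = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
with torch.no_grad():
    for p in f.parameters():
        p.mul_(1e-3)   # small init => F(x) ≈ 0

x = torch.randn(4, 32)
y = f(x) + x           # y ≈ x: the block starts as the identity

rel_err = (y - x).norm() / x.norm()
print(f"relative deviation from identity: {rel_err:.1e}")
```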
When Skip Dimensions Don't Match
If the input has different dimensions than the output (different number of channels or different spatial size), the skip connection must project x to match: y = F(x) + W_s·x, where:
- W_s: a learnable projection matrix (1×1 conv for spatial, or linear for FC)
In ResNet, projection shortcuts are used at the start of each stage when the number of channels doubles and spatial resolution halves. They use a 1×1 convolution with stride 2.
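A shape-check sketch of that projection (64 → 128 channels at 56×56 → 28×28 is a typical stage-transition configuration):

```python
import torch
import torch.nn as nn

# ResNet-style stage transition: channels double (64 -> 128) while
# spatial resolution halves (stride 2). The skip path uses a 1x1
# convolution so its output shape matches the main path's.
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

x = torch.randn(1, 64, 56, 56)
shortcut = projection(x)
print(shortcut.shape)  # torch.Size([1, 128, 28, 28])
```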
Code: Residual Block in PyTorch
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x  # save the skip connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))  # don't apply ReLU yet
        out = out + residual  # add skip connection
        return F.relu(out)  # ReLU after addition


class ProjectionResidualBlock(nn.Module):
    """Used when channel count changes (e.g., at stage transitions)."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection: 1×1 conv to match dimensions
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))
```
The identity shortcut (no parameters) is preferred when dimensions match — it adds no computation. The projection shortcut (1×1 conv) is only used at stage boundaries where dimensions change.