Convolutional Networks
Lesson 8 ⏱ 14 min

Skip connections and residual learning

Video coming soon

Skip Connections - Why ResNet Unlocked 100+ Layer Networks

The degradation problem in deep networks, the residual reformulation F(x) = H(x) - x, how the skip connection provides a gradient highway, the identity initialization property, and a comparison of 20-layer vs. ResNet-152 training curves.

⏱ ~8 min

🧮 Quick refresher

Vanishing gradients

During backpropagation, gradients are computed by the chain rule: each layer multiplies the gradient by the local Jacobian. In deep networks, if those Jacobians have values less than 1, the gradient shrinks exponentially as it travels backward through layers. By the time it reaches early layers, it may be near zero — those layers stop learning.

Example

If each layer's gradient factor is 0.8, after 20 layers the gradient is 0.8²⁰ ≈ 0.012.

After 50 layers: 0.8⁵⁰ ≈ 0.00001.

The early layers receive essentially no gradient signal and cannot update their weights.
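A quick back-of-the-envelope check of those numbers (a toy calculation in Python, not part of any training loop):

# Toy calculation: how a per-layer gradient factor of 0.8 compounds with depth.
factor = 0.8
for depth in (20, 50):
    print(f"{depth} layers: gradient factor {factor ** depth:.2e}")
# 20 layers: gradient factor 1.15e-02
# 50 layers: gradient factor 1.43e-05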

The Degradation Problem

In 2014, the conventional wisdom was that deeper networks should always be more expressive and therefore more accurate. But something strange was observed: networks with 56 layers performed worse than networks with 20 layers — not just on the test set (which might be overfitting), but on the training set itself.

This is the degradation problem: deep networks are harder to optimize, not just harder to generalize. Adding layers was actually hurting training.

The problem can be stated clearly: a deeper network should never need to be worse than a shallower one, because we could always construct an equally good solution by copying the shallower network and having the extra layers learn the identity function (just pass the input through unchanged). But gradient descent doesn't find this solution reliably — it's hard to learn the identity from random initializations.

He et al. (2015) proposed a deceptively simple fix.

The Residual Reformulation

Instead of asking the layers to learn the desired mapping H(x) directly, reformulate the problem: ask them to learn the residual F(x):

H(x) = F(x) + x \quad \Longleftrightarrow \quad F(x) = H(x) - x

where F(x) is the residual function learned by the layers, x is the input to the block (also called the identity shortcut), and H(x) is the desired block output.

The residual F(x) is implemented by the actual layers (two or three convolutions plus normalization). The input x bypasses those layers via a skip connection (also called a shortcut connection) and is added directly to the output.
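As a minimal sketch of the reformulation (toy shapes and a plain fully connected branch, purely illustrative; the full convolutional version appears later in this lesson):

import torch
import torch.nn as nn

# The layers implement only F(x); the skip connection adds x back,
# so the block as a whole computes H(x) = F(x) + x.
F_branch = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))

x = torch.randn(4, 16)       # input to the block (the identity shortcut)
H = F_branch(x) + x          # H(x) = F(x) + x, so the layers only learn H(x) - x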

The Gradient Highway

The gradient flow through a residual block is what makes deep networks trainable. By chain rule:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H} \cdot \frac{\partial H}{\partial x} = \frac{\partial L}{\partial H} \cdot \left(1 + \frac{\partial F}{\partial x}\right)

where L is the loss function, H is the block output, F is the residual function learned by the layers, and x is the input (the skip connection).

Two terms:

  1. Direct path: ∂L/∂H — gradient flowing directly back through the skip connection (the "highway")
  2. Learned path: ∂L/∂H · ∂F/∂x — gradient flowing through the layers

Even if the layers saturate and ∂F/∂x ≈ 0 (vanishing case), there is still a direct gradient path; only the unlikely canceling case ∂F/∂x ≈ -1 could suppress it. The skip connection guarantees at least one clear route back to earlier layers.
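A small autograd sketch of this effect (a hypothetical one-dimensional "block" whose layer path is deliberately zeroed out, so ∂F/∂x = 0):

import torch

x = torch.randn(8, requires_grad=True)
w = torch.zeros(8)            # "dead" layers: F(x) = w * x = 0, so dF/dx = 0
H = w * x + x                 # residual block output H(x) = F(x) + x
H.sum().backward()            # upstream gradient dL/dH = 1 for every element
print(x.grad)                 # all ones: the skip connection alone carries the gradient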

In a network with many residual blocks, the gradient path has many branches:

\frac{\partial L}{\partial x_0} = \frac{\partial L}{\partial x_L} \cdot \prod_{i=1}^{L} \left(1 + \frac{\partial F_i}{\partial x_i}\right)

where L is the loss, x_0 is the early-layer input, x_L is the output after L residual blocks, and F_i is the layer transformation in block i.

The product of terms (1 + ∂F_i/∂x_i) avoids the vanishing gradient problem: even if all ∂F_i/∂x_i → 0, each factor tends to 1, so the product is 1, not 0.
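A numerical toy comparison (the per-block Jacobian values here are made up for illustration): in a plain stack the chain-rule factors are the small values themselves and the product collapses, while the residual factors are 1 plus those values and stay near 1.

import random

random.seed(0)
plain, residual = 1.0, 1.0
for _ in range(50):
    f = random.uniform(-0.1, 0.1)    # small per-block dF_i/dx_i
    plain *= f                       # plain stack: factor is dF_i/dx_i itself
    residual *= 1.0 + f              # residual stack: factor is 1 + dF_i/dx_i
print(f"plain: {plain:.3e}   residual: {residual:.3f}")
# plain is vanishingly small; residual stays close to 1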

Identity Initialization Property

With small random weight initialization, F(x) ≈ 0 initially. Therefore:

H(x) = F(x) + x \approx x

Every residual block starts as approximately the identity function. The entire deep network starts as an identity mapping. This is a remarkably stable initialization: the network can be made arbitrarily deep without immediately producing garbage outputs.

Training then proceeds by having each block learn small perturbations from identity, gradually building up the complex transformation.
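A sketch of this property (it assumes zero-initializing the last layer of the branch so that F(x) is exactly zero at the start; real ResNets approximate the same effect, e.g. by zero-initializing the final BatchNorm scale in each branch):

import torch
import torch.nn as nn

branch = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
nn.init.zeros_(branch[2].weight)     # zero the last layer so F(x) = 0 initially
nn.init.zeros_(branch[2].bias)

x = torch.randn(4, 16)
H = branch(x) + x                    # H(x) = F(x) + x
print(torch.allclose(H, x))          # True: the block starts as the identity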

When Skip Dimensions Don't Match

If the input x has different dimensions than the output F(x) (a different number of channels or a different spatial size), the skip connection must project x to match:

H(x) = F(x) + W_s x

where W_s is a learnable projection matrix (a 1×1 convolution for spatial inputs, or a linear layer for fully connected ones).

In ResNet, projection shortcuts are used at the start of each stage when the number of channels doubles and spatial resolution halves. They use a 1×1 convolution with stride 2.

Code: Residual Block in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x                         # save the skip connection
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))      # don't apply ReLU yet
        out = out + residual                  # add skip connection
        return F.relu(out)                   # ReLU after addition

class ProjectionResidualBlock(nn.Module):
    """Used when channel count changes (e.g., at stage transitions)."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection: 1×1 conv to match dimensions
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels)
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))

The identity shortcut (no parameters) is preferred when dimensions match — it adds no computation. The projection shortcut (1×1 conv) is only used at stage boundaries where dimensions change.
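A quick shape check using the two blocks above (a hypothetical stage layout; the channel counts and input size are just examples):

stage = nn.Sequential(
    ResidualBlock(64),
    ResidualBlock(64),
    ProjectionResidualBlock(64, 128, stride=2),   # stage boundary: channels double, resolution halves
    ResidualBlock(128),
)
x = torch.randn(1, 64, 56, 56)
print(stage(x).shape)    # torch.Size([1, 128, 28, 28])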

Quiz

1 / 3

In a residual block with output H(x) = F(x) + x, what does F(x) represent?