Backpropagation
Lesson 5 ⏱ 12 min

Vanishing and exploding gradients


Vanishing and Exploding Gradients: The Deep Network Training Crisis

Why gradients shrink exponentially in sigmoid networks, how ReLU solves vanishing gradients, and how residual connections create gradient highways for very deep architectures.

⏱ ~7 min

🧮 Quick refresher

Geometric sequences and exponential decay

Multiplying a number repeatedly by a factor less than 1 gives exponential decay. After n multiplications by r, the result is r^n. For r = 0.25 and n = 10: 0.25^{10} \approx 0.00000095.

Example

A sigmoid gradient at each layer is at most 0.25.

After 10 layers: 0.25^{10} \approx 10^{-6}.

The gradient at the first layer is about one millionth of the gradient at the last layer.
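The decay in the refresher can be checked directly; a minimal sketch:

```python
# Exponential decay: repeatedly multiplying by a factor r < 1.
r = 0.25
value = 1.0
for n in range(1, 11):
    value *= r  # value == r ** n after each step

print(f"0.25^10 = {value:.2e}")  # 9.54e-07, about one millionth
```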

The Core Problem

Recall the backward pass propagation rule:

\delta^{(l)} = \left(W^{(l+1)\top} \cdot \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})
W^{(l+1)}
weight matrix of the next layer
\delta^{(l+1)}
error signal at the next layer
\odot
elementwise product
\sigma'(z^{(l)})
activation derivative

At each layer, the error signal is multiplied by the activation derivative \sigma'(z^{(l)}). For a network with L layers, the gradient at the first layer involves approximately L-1 of these multiplicative factors:

\delta^{(1)} \approx \delta^{(L)} \cdot \prod_{l=2}^{L} \left[W^{(l)} \cdot \sigma'(z^{(l)})\right]
\delta^{(1)}
gradient at the first layer
\delta^{(L)}
gradient at the output layer
\prod
product over all layers

If each factor is less than 1: the product shrinks exponentially with depth. If each factor is greater than 1: it grows exponentially. Neither is good.

Imagine whispering a number to a friend, who multiplies it by a small fraction and passes it on — who multiplies it again, and so on, through 20 people. By the end, the number has shrunk to essentially zero. Now imagine that number is the "how to improve" signal for the first person in the chain. They receive an instruction so faint it might as well be silence. They don't update. The entire beginning of the network stops learning, while only the last few people (layers) get a clear signal.

Gradients flowing back through a deep network face exactly this problem. Each layer multiplies the signal by a small number, and by the time it reaches the early layers, there is nothing left to learn from. Those layers simply stop updating.

Vanishing Gradients: The Sigmoid Case

The sigmoid derivative \sigma'(z) = \sigma(z)(1 - \sigma(z)) is at most 0.25 (at z = 0). In practice it is often much smaller, especially when the sigmoid is near saturation.

Let's track what happens in a 10-layer sigmoid network, assuming an optimistic \sigma'(z) \approx 0.25 at every layer:

Layers passed    Gradient factor
1                0.25
5                0.25^5 \approx 0.001
10               0.25^{10} \approx 10^{-6}

That is about one millionth of the original gradient after just 10 layers. The first few layers receive a signal so small they barely update. The network cannot effectively use its depth.
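The table uses the optimistic bound \sigma'(z) = 0.25 at every layer. A short simulation with random pre-activations (hypothetical values, chosen only for illustration) shows the realistic case is even worse, because saturated sigmoids contribute factors far below 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Track the gradient-scaling factor sigma'(z) through 10 layers.
factors = []
grad = 1.0
for layer in range(10):
    z = rng.normal(0.0, 2.0)   # random pre-activation at this layer
    s = sigmoid(z)
    d = s * (1.0 - s)          # sigma'(z) <= 0.25 everywhere
    grad *= d
    factors.append(d)

print("per-layer factors:", [round(f, 4) for f in factors])
print("gradient after 10 layers:", grad)  # at most 0.25^10, usually far smaller
```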

The ReLU Solution

The ReLU derivative for positive inputs is exactly 1:

\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}
z
pre-activation value

For a neuron where z > 0: the gradient passes through completely unchanged in magnitude. For a 10-layer ReLU network with all positive pre-activations: 1^{10} = 1. No decay.

The "dying ReLU" (z < 0, derivative = 0) does block gradients completely — but that affects individual neurons rather than all neurons systematically. Gradients flow through active neurons without attenuation.

This is the single biggest practical reason why ReLU replaced sigmoid in hidden layers.
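The contrast can be made concrete by multiplying the per-layer derivative factors for both activations through the same 10-layer chain (random pre-activations are illustrative assumptions, not from the lesson):

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)       # peaks at 0.25 when z = 0

def relu_deriv(z):
    return (z > 0).astype(float)  # exactly 1 for positive inputs

rng = np.random.default_rng(1)
z = rng.normal(0.0, 1.0, size=10)       # one pre-activation per layer

sig_grad = np.prod(sigmoid_deriv(z))    # every factor <= 0.25: exponential decay
relu_grad = np.prod(relu_deriv(np.abs(z)))  # all-positive inputs: every factor is 1

print(f"sigmoid gradient after 10 layers: {sig_grad:.2e}")
print(f"ReLU gradient after 10 layers:    {relu_grad:.1f}")  # 1.0, no decay
```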

Interactive example

Compare gradient magnitude at each layer for sigmoid vs ReLU networks - watch vanishing in action

Coming soon

Residual Connections: A Structural Fix

Even ReLU is not perfect for very deep networks (50+ layers). Enter residual connections (He et al., ResNet, 2015).

A residual block computes:

a = F(x) + x
F(x)
learned transformation at this block
x
input to the block (skip connection)

Instead of learning a = F(x), the layer learns the residual F(x) = a - x — just the correction on top of the identity. The gradient now has two paths backward:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial a} \cdot \left(F'(x) + 1\right)
F'(x)
gradient through the learned transformation
1
gradient through the identity (skip connection)

Even if F'(x) \approx 0 (the learned transformation has near-zero gradient), the gradient still flows at full strength via the +1 term. Residual connections create a "gradient highway" through the network.
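A finite-difference check makes the +1 term visible. Here F is a toy transformation with a deliberately tiny derivative (a hypothetical stand-in for a near-saturated learned block):

```python
import math

def F(x):
    # Toy learned transformation with a tiny derivative (illustrative).
    return 0.001 * math.tanh(x)

def residual(x):
    return F(x) + x  # residual block: output = F(x) + x

# Central finite differences at x = 1.
h = 1e-6
x = 1.0
f_prime = (F(x + h) - F(x - h)) / (2 * h)                  # derivative of F alone
grad = (residual(x + h) - residual(x - h)) / (2 * h)       # derivative of F(x) + x

print(f"F'(x)        ~ {f_prime:.6f}")  # near zero: F barely passes gradient
print(f"d(F(x)+x)/dx ~ {grad:.6f}")     # F'(x) + 1: the skip path preserves it
```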

ResNet-152 (152 layers) became competitive with shallow architectures — not because depth alone helps, but because residual connections make training 152 layers tractable. Today, essentially every deep architecture uses residual or skip connections.

Exploding Gradients: The Opposite Problem

If weight matrices have large singular values (effectively eigenvalues > 1), gradients grow exponentially instead of shrinking:

\text{After } L \text{ layers: gradient} \approx r^L \text{ for } r > 1

For r = 1.1 and L = 100 layers: 1.1^{100} \approx 13{,}780. Gradients become astronomically large, weights update by enormous amounts, and training diverges. Loss becomes NaN.

Gradient clipping is the standard fix. Compute the gradient norm and rescale if it exceeds a threshold:

g \leftarrow g \cdot \frac{\text{max\_norm}}{\|g\|} \quad \text{if } \|g\| > \text{max\_norm}
g
full gradient vector
\|g\|
L2 norm of the gradient vector
\text{max\_norm}
threshold; typical values are 1.0 or 5.0

This caps the step size while preserving the gradient direction. The network still moves in the right direction — just not so far that it overshoots into instability.
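Gradient clipping by norm is a few lines; a minimal sketch applied to a gradient inflated by the r = 1.1, L = 100 growth from above:

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale gradient g if its L2 norm exceeds max_norm; direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

# An exploded gradient: r = 1.1 compounded over 100 layers.
g = np.array([1.1 ** 100, 0.0, 0.0])     # norm ~ 13,780
clipped = clip_by_norm(g, max_norm=1.0)

print(np.linalg.norm(g))        # ~13780.6
print(np.linalg.norm(clipped))  # 1.0 — same direction, capped magnitude
```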

Batch Normalization: Preventing Saturation

A deeper structural fix is batch normalization (Ioffe and Szegedy, 2015). After each layer, normalize the activations to have mean 0 and variance 1, then apply a learned scale \gamma and shift \beta:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta
\mu_B
batch mean
\sigma_B^2
batch variance
\gamma
learned scale
\beta
learned shift
\varepsilon
small constant for numerical stability

By keeping activations in a well-behaved range, batch normalization prevents them from saturating. It also reduces sensitivity to initialization and learning rate, making training faster and more stable. Batch normalization is now standard in most convolutional networks. Transformers typically use layer normalization instead (the same idea, normalized across features rather than across the batch).
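The training-mode forward pass of batch normalization is a direct transcription of the formula above; a sketch (the input statistics are made up to show the normalizing effect):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization, training mode.

    x: (batch, features) activations; gamma, beta: (features,) learned scale/shift.
    """
    mu = x.mean(axis=0)                    # batch mean, per feature
    var = x.var(axis=0)                    # batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))    # badly scaled activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))  # ~0 per feature
print(y.std(axis=0))   # ~1 per feature
```

With \gamma = 1 and \beta = 0 the output is purely normalized; during training the network is free to learn other values if a different scale helps.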

Quiz

1 / 3

Sigmoid activations cause vanishing gradients because...