The Core Problem
Recall the backward pass propagation rule:

\delta^{(l)} = \left( (W^{(l+1)})^\top \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})

- W^{(l+1)}: weight matrix of next layer
- \delta^{(l+1)}: error signal at next layer
- \odot: elementwise product
- \sigma'(z^{(l)}): activation derivative
At each layer, the error signal gets multiplied by the activation derivative \sigma'(z^{(l)}). For a network with L layers, the gradient at the first layer involves approximately L of these multiplicative factors:

\delta^{(1)} = \left[ \prod_{l=1}^{L-1} \operatorname{diag}\!\big(\sigma'(z^{(l)})\big)\,(W^{(l+1)})^\top \right] \delta^{(L)}

- \delta^{(1)}: gradient at first layer
- \delta^{(L)}: gradient at output layer
- \prod_{l=1}^{L-1}: product over all layers
If each factor is less than 1: the product shrinks exponentially with depth. If each factor is greater than 1: it grows exponentially. Neither is good.
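A quick numeric sketch in plain Python (the per-layer factors below are illustrative choices, not taken from any particular network) makes the compounding concrete:

```python
# Multiply an error signal by the same per-layer factor repeatedly,
# as the backward pass does, and watch what happens with depth.
def gradient_after(depth, per_layer_factor, output_gradient=1.0):
    grad = output_gradient
    for _ in range(depth):
        grad *= per_layer_factor
    return grad

for factor in (0.25, 0.9, 1.1):  # strongly shrinking, mildly shrinking, growing
    print(f"factor={factor}: "
          f"depth 10 -> {gradient_after(10, factor):.3e}, "
          f"depth 50 -> {gradient_after(50, factor):.3e}")
# factor=0.25: depth 10 -> 9.537e-07, depth 50 -> 7.889e-31
# factor=0.9:  depth 10 -> 3.487e-01, depth 50 -> 5.154e-03
# factor=1.1:  depth 10 -> 2.594e+00, depth 50 -> 1.174e+02
```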
Imagine whispering a number to a friend, who multiplies it by a small fraction and passes it on — who multiplies it again, and so on, through 20 people. By the end, the number has shrunk to essentially zero. Now imagine that number is the "how to improve" signal for the first person in the chain. They receive an instruction so faint it might as well be silence. They don't update. The entire beginning of the network stops learning, while only the last few people (layers) get a clear signal.
Gradients flowing back through a deep network face exactly this problem. Each layer multiplies the signal by a small number, and by the time it reaches the early layers, there is nothing left to learn from. Those layers simply stop updating.
Vanishing Gradients: The Sigmoid Case
The sigmoid derivative \sigma'(z) = \sigma(z)(1 - \sigma(z)) at its maximum is 0.25 (at z = 0). In practice it is often much smaller, especially when the sigmoid is near saturation.
Let's track what happens in a 10-layer sigmoid network, assuming an optimistic \sigma'(z) \approx 0.25 at every layer:
| Layers passed | Gradient factor |
|---|---|
| 1 | 0.25 |
| 5 | 0.25^5 \approx 10^{-3} |
| 10 | 0.25^{10} \approx 10^{-6} |
That is under one-millionth of the original gradient after just 10 layers. The first few layers receive a signal so small they barely update. The network cannot effectively use its depth.
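A one-line check of those table values (plain Python):

```python
# Per-layer sigmoid-derivative factor of 0.25, compounded over depth.
for depth in (1, 5, 10):
    print(depth, 0.25 ** depth)
# 1 0.25
# 5 0.0009765625
# 10 9.5367431640625e-07
```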
The ReLU Solution
The ReLU derivative for positive inputs is exactly 1:

\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}

- z: pre-activation value

For a neuron where z > 0: the gradient passes through completely unchanged in magnitude. For a 10-layer ReLU network with all positive pre-activations: 1^{10} = 1. No decay.
The "dying ReLU" (z < 0, derivative = 0) does block gradients completely — but that affects individual neurons rather than all neurons systematically. Gradients flow through active neurons without attenuation.
This is the single biggest practical reason why ReLU replaced sigmoid in hidden layers.
Interactive example
Compare gradient magnitude at each layer for sigmoid vs ReLU networks - watch vanishing in action
Coming soon
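In the meantime, here is a minimal NumPy sketch of the same comparison. Everything in it is an illustrative assumption: a toy fully connected network, random weights, and an all-ones stand-in for the output gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(activation, activation_grad, depth=10, width=64):
    """Toy forward/backward pass; returns the gradient-signal norm at each layer."""
    # Random weights and input, purely illustrative.
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]
    a = rng.normal(size=width)

    # Forward pass: remember the pre-activation z^{(l)} at every layer.
    zs = []
    for W in weights:
        z = W @ a
        zs.append(z)
        a = activation(z)

    # Backward pass: delta^{(l)} = (W^{(l+1)T} delta^{(l+1)}) * activation'(z^{(l)}).
    delta = np.ones(width)              # stand-in for the gradient at the output layer
    norms = [np.linalg.norm(delta)]
    for l in range(depth - 2, -1, -1):  # walk back from the next-to-last layer to the first
        delta = (weights[l + 1].T @ delta) * activation_grad(zs[l])
        norms.append(np.linalg.norm(delta))
    return norms[::-1]                  # norms[0] = earliest layer

sig = layer_gradient_norms(sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z)))
relu = layer_gradient_norms(lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float))

for l, (s, r) in enumerate(zip(sig, relu), start=1):
    print(f"layer {l:2d}: sigmoid grad norm {s:.2e}   relu grad norm {r:.2e}")
```

The sigmoid column shrinks by roughly a factor of four or more per layer, while the ReLU column stays on the same order of magnitude.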
Residual Connections: A Structural Fix
Even ReLU is not perfect for very deep networks (50+ layers). Enter residual connections (He et al., ResNet, 2015).
A residual block computes:

y = F(x) + x

- F(x): learned transformation at this block
- x: input to the block (skip connection)

Instead of learning the full mapping from x to y directly, the layer learns the residual F(x): just the correction on top of the identity. The gradient now has two paths backward:

\frac{\partial y}{\partial x} = F'(x) + 1

- F'(x): gradient through the learned transformation
- 1: gradient through the identity (skip connection)
Even if F'(x) \approx 0 (the learned transformation has near-zero gradient), the gradient still flows at full strength via the +1 term. Residual connections create a "gradient highway" through the network.
ResNet-152 (152 layers) became competitive with shallow architectures — not because depth alone helps, but because residual connections make training 152 layers tractable. Today, essentially every deep architecture uses residual or skip connections.
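A small NumPy sketch of the two backward paths, with an F chosen here (purely for illustration) to have a deliberately tiny derivative:

```python
import numpy as np

# A residual "block" on a scalar, with F(x) = 1e-3 * sin(x), so F'(x) = 1e-3 * cos(x)
# is nearly zero everywhere.
def F(x):
    return 1e-3 * np.sin(x)

def F_prime(x):
    return 1e-3 * np.cos(x)

def residual_block(x):
    return F(x) + x          # y = F(x) + x

x = 0.7
analytic_grad = F_prime(x) + 1.0   # two backward paths: through F and through the identity

# Numerical check via a central finite difference.
h = 1e-6
numeric_grad = (residual_block(x + h) - residual_block(x - h)) / (2 * h)

print(f"F'(x)          = {F_prime(x):.6f}")     # ~0.000765: almost nothing flows through F
print(f"dy/dx analytic = {analytic_grad:.6f}")  # ~1.000765: the skip path keeps it alive
print(f"dy/dx numeric  = {numeric_grad:.6f}")
```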
Exploding Gradients: The Opposite Problem
If weight matrices have large singular values (effectively eigenvalues > 1), gradients grow exponentially instead of shrinking: each backward step scales the error signal up rather than down. For a per-layer growth factor of 1.5 and 100 layers: 1.5^{100} \approx 4 \times 10^{17}. Gradients become astronomically large, weights update by enormous amounts, and training diverges. Loss becomes NaN.
Gradient clipping is the standard fix. Compute the gradient norm and rescale if it exceeds a threshold:

g \leftarrow g \cdot \frac{\text{max\_norm}}{\|g\|_2} \quad \text{if } \|g\|_2 > \text{max\_norm}

- g: full gradient vector
- \|g\|_2: L2 norm of the gradient vector
- \text{max\_norm}: threshold; typical values are 1.0 or 5.0
This caps the step size while preserving the gradient direction. The network still moves in the right direction — just not so far that it overshoots into instability.
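A minimal sketch of global-norm clipping in NumPy (PyTorch ships an equivalent built-in, torch.nn.utils.clip_grad_norm_):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # direction preserved, magnitude capped
    return grads, total_norm

# Example: a gradient that has blown up to norm ~1414 gets capped at 1.0.
grads = [np.full(4, 500.0), np.full(4, 500.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)                                     # ~1414.2
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # 1.0
```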
Batch Normalization: Preventing Saturation
A deeper structural fix is batch normalization (Ioffe and Szegedy, 2015). After each layer, normalize the activations to have mean 0 and variance 1, then apply a learned scale \gamma and shift \beta:

\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta

- \mu_B: batch mean
- \sigma_B^2: batch variance
- \gamma: learned scale
- \beta: learned shift
- \epsilon: small constant for numerical stability
By keeping activations in a well-behaved range, batch normalization prevents them from saturating. It also reduces sensitivity to initialization and learning rate, making training faster and more stable. Batch normalization is now standard in most convolutional networks. Transformers typically use layer normalization instead (the same idea, normalized across features rather than across the batch).
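A minimal NumPy sketch of the batch-norm forward pass for one layer's activations (training-time batch statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize activations x of shape (batch, features) per feature, then scale and shift."""
    mu = x.mean(axis=0)                     # batch mean, one value per feature
    var = x.var(axis=0)                     # batch variance, one value per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learned scale and shift

# Activations with a wildly off-center distribution get pulled back to a well-behaved range.
rng = np.random.default_rng(0)
x = rng.normal(loc=12.0, scale=7.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(6))  # ~0 per feature
print(y.std(axis=0).round(3))   # ~1 per feature
```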