You've seen that BPTT multiplies Jacobians together for every step back in time. The previous lesson showed this numerically for a small example. Now let's be precise about why this destroys learning for long sequences, and what kinds of problems it breaks.
The vanishing gradient problem is why early NLP systems could only model short-range dependencies — they literally could not learn that a subject at the beginning of a paragraph constrains the verb at the end. It is the key motivation for LSTMs, GRUs, and ultimately the attention mechanism. Understanding it precisely means understanding why every subsequent architecture was designed the way it was.
The Core Problem: Multiplicative Decay
In a vanilla RNN, the hidden state is h_t = \tanh(z_t) with pre-activations z_t = W_h h_{t-1} + W_x x_t + b. The Jacobian at each step is:

\partial h_t / \partial h_{t-1} = \text{diag}(\tanh'(z_t)) \, W_h

- a diagonal matrix of \tanh' evaluated at the pre-activations z_t, times the recurrent weight matrix W_h.

The gradient from step T back to step 1 is a product of T−1 such Jacobians:

\partial h_T / \partial h_1 = \prod_{t=2}^{T} \text{diag}(\tanh'(z_t)) \, W_h
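This product can be simulated directly. The following is a minimal numpy sketch — the hidden size, sequence length, and N(0, 1/d) initialization scale are illustrative assumptions, not values from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 32, 50          # hidden size and sequence length (illustrative choices)

# Recurrent weights at a typical initialization scale (assumption: N(0, 1/d))
W_h = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))
h = rng.normal(size=d)

# Multiply per-step Jacobians diag(tanh'(z_t)) @ W_h and watch the norm decay
J = np.eye(d)
norms = []
for _ in range(T):
    z = W_h @ h + rng.normal(size=d)        # pre-activation with a random input term
    h = np.tanh(z)
    step_jac = np.diag(1.0 - h ** 2) @ W_h  # tanh'(z) = 1 - tanh(z)^2
    J = step_jac @ J
    norms.append(np.linalg.norm(J, 2))

print(f"1 step back:   {norms[0]:.3f}")
print(f"10 steps back: {norms[9]:.2e}")
print(f"50 steps back: {norms[-1]:.2e}")
```

Running this shows the spectral norm of the accumulated Jacobian collapsing toward zero as more steps are multiplied in.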
Consider the magnitude of this product. Each Jacobian has two sources of shrinkage:
tanh saturation: \tanh'(z) = 1 - \tanh^2(z) \in (0, 1], with maximum 1 only at z = 0. For any nonzero pre-activation, the derivative is strictly less than 1.
Weight matrix spectrum: the spectral radius of W_h depends on how the weights were initialized and trained.
If the product of these two effects gives an effective spectral radius \rho < 1 for the per-step Jacobian, the gradient decays exponentially: after propagating back T steps, its magnitude scales like \rho^T.
| T (steps back) | ρ = 0.9 | ρ = 0.8 | ρ = 0.7 |
|---|---|---|---|
| 10 | 0.349 | 0.107 | 0.028 |
| 30 | 0.042 | 0.001 | 2×10⁻⁵ |
| 100 | 2.7×10⁻⁵ | ≈0 | ≈0 |
With any typical weight initialization, gradients from more than 10–30 steps back are effectively zero.
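The entries in the table are just ρ raised to the number of steps back, so they can be reproduced in a couple of lines:

```python
# Reproduce the decay table: effective spectral radius rho to the power T
for rho in (0.9, 0.8, 0.7):
    print(rho, [round(rho ** T, 6) for T in (10, 30, 100)])
```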
Vanishing vs. Feedforward Networks: The Scale Difference
In a feedforward network, gradients travel through at most L layers, where L is the network depth. Modern nets are 12–100 layers deep. With skip connections (ResNets, transformers), gradients often bypass most layers entirely.
In an RNN, gradients must travel through up to T steps, where T is the sequence length. And T can be:
- 100 words in a paragraph
- 1,000 characters in a document
- 10,000 timesteps in an audio clip
No one builds a 10,000-layer feedforward network. But processing a 10,000-token sequence requires the RNN to propagate gradients through 10,000 steps. The vanishing gradient problem is not just worse — it's qualitatively different.
A Concrete Failure: Long-Range Dependency
Consider language modeling on the sentence:
"The cat that sat on the mat by the window in the old kitchen was hungry."
To predict "hungry" correctly, the model needs to remember that the subject is "cat" — a word processed 13 steps earlier. The gradient from the loss at "hungry" must travel backward through "was," "kitchen," "old," "the," "in," "window," "the," "by," "mat," "the," "on," "sat," "that" — 13 steps — before it can update the hidden-state representation built when "cat" was processed.
With a gradient that decays as \rho^k over k steps, even for \rho = 0.8 the signal reaching the "cat" step is 0.8^{13} \approx 0.055, about 5% of its original strength. The weight update based on this gradient is roughly 20× smaller than it would be for a one-step dependency. The network will mostly learn short-range patterns.
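Under an assumed effective spectral radius of 0.8, the decay over those 13 steps works out directly:

```python
rho, k = 0.8, 13
signal = rho ** k       # fraction of the gradient surviving 13 steps of decay
print(f"{signal:.3f}")  # about 0.055, i.e. ~5% of the original gradient
```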
The Exploding Gradient Case
The mirror problem: if \rho(W_h) > 1, gradients don't vanish — they explode. The product of Jacobians grows without bound, producing gradient values in the millions or billions within a few steps.
Exploding gradients cause weight updates of enormous magnitude, usually resulting in loss = NaN within a few training steps. The fix is gradient clipping (clip the gradient norm to a maximum value). Vanishing gradients have no comparably simple fix — you can't "amplify" gradients that have already been zeroed by multiplicative decay.
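A minimal sketch of the clipping fix — the helper name and the threshold below are illustrative choices, not from any particular library:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Exploded gradients, e.g. after a few steps with rho(W_h) > 1
grads = [np.full(3, 1e6), np.full(2, -2e6)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"before: {norm_before:.1e}, after: {norm_after:.2f}")
```

Note that clipping rescales the whole gradient vector, preserving its direction — it caps the step size without changing where the step points.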
What's Actually Needed
The vanishing gradient problem isn't about the learning algorithm — it's about the architecture. No optimizer trick (momentum, Adam, better learning rates) can fix gradients that are numerically zero before they reach the parameters.
What's needed is an architecture where information can flow over long time spans without being multiplied through many small Jacobians. The LSTM's solution: introduce an additive update path — a "cell state" — that allows gradients to flow backward without shrinking. This is the subject of the next lesson.