Recurrent Networks
Lesson 4 ⏱ 12 min

Vanishing gradients in sequences

Video coming soon

Vanishing Gradients in RNNs: Why Long-Range Dependence Is Hard

Gives a precise analysis of why vanishing gradients are worse in RNNs than feedforward nets, shows a concrete long-range dependency example that vanilla RNNs fail on, and explains why this motivates the LSTM.

⏱ ~6 min

🧮 Quick refresher

Eigenvalues of a matrix

The eigenvalues of a matrix A tell you how much A stretches or shrinks vectors along its eigenvector directions. The spectral radius is the largest eigenvalue in absolute value. When you multiply a matrix by itself repeatedly (A^k), the result is dominated by the spectral radius raised to the power k.

Example

If A has eigenvalues 0.7 and 1.2, then A^10 is dominated by 1.2^10 ≈ 6.2 in the 1.2-eigenvector direction, while the 0.7-eigenvector direction has shrunk to 0.7^10 ≈ 0.028.
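
If you want to see this numerically, here is a minimal NumPy sketch; the diagonal matrix and the exponents are just illustrative choices:

```python
import numpy as np

# Illustrative matrix with eigenvalues 0.7 and 1.2 (diagonal, so the
# eigenvectors are simply the coordinate axes).
A = np.diag([0.7, 1.2])

v = np.array([1.0, 1.0])   # a vector with weight in both eigendirections
for k in (1, 5, 10):
    print(k, np.linalg.matrix_power(A, k) @ v)

# At k=10 the 1.2-direction has grown to 1.2**10 ≈ 6.2 while the
# 0.7-direction has shrunk to 0.7**10 ≈ 0.028, matching the numbers above.
```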

You've seen that BPTT multiplies Jacobians together for every step back in time. The previous lesson showed this numerically for a small example. Now let's be precise about why this destroys learning for long sequences, and what kind of problems it breaks.

The vanishing gradient problem is why early NLP systems could only model short-range dependencies — they literally could not learn that a subject at the beginning of a paragraph constrains the verb at the end. It is the key motivation for LSTMs, GRUs, and ultimately the attention mechanism. Understanding it precisely means understanding why every subsequent architecture was designed the way it was.

The Core Problem: Multiplicative Decay

In a vanilla RNN, the Jacobian at each step is:

J_t = W_h^T \cdot \text{diag}(\tanh'(z_t))

where J_t is the Jacobian of h_t with respect to h_{t-1}, and \text{diag}(\tanh'(z_t)) is the diagonal matrix of \tanh' evaluated at the pre-activations z_t = W_h h_{t-1} + W_x x_t + b.
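
As a sanity check, here is a small NumPy sketch of one recurrent step and its per-step Jacobian written exactly as above; the hidden size, input size, and random weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X = 4, 3                               # hypothetical hidden and input sizes

W_h = rng.normal(scale=0.5, size=(H, H))  # recurrent weights (random, illustrative)
W_x = rng.normal(scale=0.5, size=(H, X))  # input weights
b = np.zeros(H)

h_prev = rng.normal(size=H)
x_t = rng.normal(size=X)

z_t = W_h @ h_prev + W_x @ x_t + b        # pre-activations
h_t = np.tanh(z_t)                        # one vanilla RNN step

# Per-step Jacobian as in the formula above: W_h^T · diag(tanh'(z_t)),
# with tanh'(z) = 1 - tanh(z)**2.
J_t = W_h.T @ np.diag(1.0 - np.tanh(z_t) ** 2)
print(J_t.shape)                          # (4, 4): carries gradients from h_t back to h_{t-1}
```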

The gradient from step T back to step 1 is a product of T−1 such Jacobians:

\frac{\partial h_T}{\partial h_1} = J_T \cdot J_{T-1} \cdots J_2

Consider the magnitude of this product. Each J_t has two sources of shrinkage:

  1. tanh saturation: \tanh'(z) = 1 - \tanh^2(z) \in (0, 1], with maximum 1 only at z = 0. For any nonzero pre-activation, the derivative is strictly less than 1.

  2. Weight matrix spectrum: the spectral radius of W_h^T depends on how the weights were initialized and trained.

If the product of these two effects gives an effective spectral radius \rho < 1, the gradient decays exponentially:

\left|\frac{\partial h_T}{\partial h_1}\right| \approx \rho^{T-1}

where \rho is the effective spectral radius of the Jacobian.

T       ρ = 0.9      ρ = 0.8     ρ = 0.7
10      0.387        0.107       0.028
30      0.042        0.001       7×10⁻⁵
100     2.7×10⁻⁵     ≈0          ≈0

For any typical weight initialization, gradients from more than 10-30 steps back are effectively zero.
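
The decay is easy to reproduce. The sketch below runs a small random vanilla RNN forward and accumulates the product of per-step Jacobians; the hidden size, initialization scale, and sequence length are arbitrary choices, but the printed norm typically shrinks roughly geometrically with the number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50                                            # hypothetical hidden size and length

W_h = rng.normal(scale=1.0 / np.sqrt(H), size=(H, H))    # one plausible small-scale init
W_x = rng.normal(scale=1.0 / np.sqrt(H), size=(H, H))
b = np.zeros(H)

h = np.zeros(H)
prod = np.eye(H)                                         # running product J_T ... J_2
for t in range(1, T + 1):
    x = rng.normal(size=H)
    z = W_h @ h + W_x @ x + b                            # pre-activations at step t
    h = np.tanh(z)
    if t >= 2:                                           # the lesson's product starts at J_2
        J = W_h.T @ np.diag(1.0 - np.tanh(z) ** 2)       # per-step Jacobian as above
        prod = J @ prod
    if t % 10 == 0:
        print(f"t={t:3d}  ||dh_t/dh_1||_2 ≈ {np.linalg.norm(prod, 2):.2e}")
```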

Vanishing in RNNs vs. Feedforward Networks: The Scale Difference

In a feedforward network, gradients pass through at most one Jacobian per layer, and modern nets are 12-100 layers deep. With skip connections (ResNets, transformers), gradients often bypass most layers entirely.

In an RNN, gradients must travel through up to T steps, and T can be:

  • 100 words in a paragraph
  • 1,000 characters in a document
  • 10,000 timesteps in an audio clip

No one builds a 10,000-layer feedforward network. But processing a 10,000-token sequence requires the RNN to propagate gradients through 10,000 steps. The vanishing gradient problem is not just worse — it's qualitatively different.

A Concrete Failure: Long-Range Dependency

Consider language modeling on the sentence:

"The cat that sat on the mat by the window in the old kitchen was hungry."

To predict "hungry" correctly, the model needs to remember that the subject is "cat" — a word that appeared 12 tokens ago. The gradient from learning about "hungry" must travel backward through "was," "kitchen," "old," "the," "in," "window," "the," "by," "mat," "the," "on," "sat," "that" — 13 steps — before it can update the hidden state representation built when "cat" was processed.

With a gradient that decays as \rho^{13}, even for \rho = 0.8 the signal reaching the "cat" step is 0.8^{13} \approx 0.055 of its original strength. The weight update based on this gradient is roughly 20× smaller than it would be for a one-step dependency. The network will mostly learn short-range patterns.

The Exploding Gradient Case

The mirror problem: if \rho(W_h) > 1, gradients don't vanish — they explode. The product of Jacobians grows without bound, producing gradient values in the millions or billions within a few steps.

Exploding gradients cause weight updates of enormous magnitude, usually resulting in loss = NaN within a few training steps. The fix is gradient clipping (clip the gradient norm to a maximum value). Vanishing gradients have no comparably simple fix — you can't "amplify" gradients that have already been zeroed by multiplicative decay.
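
A minimal sketch of norm-based clipping, assuming the gradients are held as a list of NumPy arrays and using an arbitrary threshold of 5.0:

```python
import numpy as np

def clip_gradient_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# An exploded gradient is rescaled down to norm 5.0; an ordinary one passes through.
exploded = [np.full((3, 3), 1e6)]
ordinary = [np.full((3, 3), 0.1)]
print(np.linalg.norm(clip_gradient_norm(exploded)[0]))   # 5.0
print(np.linalg.norm(clip_gradient_norm(ordinary)[0]))   # 0.3 (unchanged)
```

Deep learning frameworks provide the same operation out of the box (for example, torch.nn.utils.clip_grad_norm_ in PyTorch).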

What's Actually Needed

The vanishing gradient problem isn't about the learning algorithm — it's about the architecture. No optimizer trick (momentum, Adam, better learning rates) can fix gradients that are numerically zero before they reach the parameters.

What's needed is an architecture where information can flow over long time spans without being multiplied through many small Jacobians. The LSTM's solution: introduce an additive update path — a "cell state" — that allows gradients to flow backward without shrinking. This is the subject of the next lesson.
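
As a preview of why an additive path helps, the toy comparison below contrasts a purely multiplicative gradient path with a purely additive one. The stand-in Jacobian with ρ = 0.8 and the sequence length are illustrative, and this is not an actual LSTM:

```python
import numpy as np

H, T = 8, 50                                    # illustrative sizes

# Multiplicative path (vanilla RNN): every step multiplies by a Jacobian whose
# effective spectral radius is below 1, so the T-1 step product shrinks geometrically.
J = 0.8 * np.eye(H)                             # stand-in Jacobian with rho = 0.8
grad_multiplicative = np.linalg.matrix_power(J, T - 1)

# Additive path (the idea behind the LSTM cell state): c_t = c_{t-1} + g_t, so
# dc_T/dc_1 is the identity regardless of sequence length.
grad_additive = np.eye(H)

print(np.linalg.norm(grad_multiplicative, 2))   # 0.8**49, about 2e-5
print(np.linalg.norm(grad_additive, 2))         # 1.0; the gradient survives intact
```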

Quiz

1 / 3

In a vanilla RNN processing a sequence of length 100, the gradient of the loss at step 100 w.r.t. the hidden state at step 1 involves...