You know how backpropagation works in feedforward networks: apply the chain rule backward through the computation graph, layer by layer. For RNNs, the same algorithm applies — but the computation graph has a special structure that creates new challenges.
Backpropagation through time is what makes RNN training possible — and its failure mode (vanishing/exploding gradients over long sequences) is what motivated every LSTM, GRU, and eventually transformer that followed. Understanding BPTT means understanding the root cause of the most important problem in sequence modeling.
The Unrolled Computation Graph
An RNN processing a sequence of length $T$ can be viewed as a feedforward network with $T$ layers, where:
- Layer $t$ takes $x_t$ and $h_{t-1}$ as inputs
- All layers share the same weights: $W_h$, $W_x$, and the output weights

This "unrolled" view is exactly how backpropagation is implemented. You unroll the RNN into $T$ copies, treat it as a deep feedforward network, and apply standard backprop. This algorithm is called backpropagation through time (BPTT).
Accumulating Gradients for Shared Weights
Here's the key consequence of weight sharing. Suppose we have a sequence of length 3 with loss contributions at each step. The total loss is:
- $L = L_1 + L_2 + L_3$: total loss over the sequence
- $L_t$: loss at time step $t$

The gradient of $L$ with respect to $W_h$:

- $\dfrac{\partial L}{\partial W_h} = \dfrac{\partial L_1}{\partial W_h} + \dfrac{\partial L_2}{\partial W_h} + \dfrac{\partial L_3}{\partial W_h}$: total gradient for $W_h$

Because $W_h$ appears in every time step's computation, it receives a gradient contribution from every time step. These contributions are summed (by the chain rule for parameters that appear multiple times).
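As a quick sanity check on this summation, here is a tiny illustrative Python snippet (the scalar RNN, the inputs, and the per-step loss $L_t = \tfrac{1}{2}h_t^2$ are all made up): the finite-difference gradient of the total loss equals the sum of the per-step gradients.

```python
import numpy as np

def step_losses(W_h, W_x=0.5, xs=(1.0, -0.5, 2.0), h0=0.0):
    """Scalar RNN, 3 steps, with an illustrative per-step loss L_t = 0.5 * h_t**2."""
    h, out = h0, []
    for x in xs:
        h = np.tanh(W_h * h + W_x * x)
        out.append(0.5 * h ** 2)
    return out                                   # [L_1, L_2, L_3]

def num_grad(f, w, eps=1e-6):                    # central finite difference
    return (f(w + eps) - f(w - eps)) / (2 * eps)

W_h = 0.9
total_grad = num_grad(lambda w: sum(step_losses(w)), W_h)          # dL/dW_h
per_step = [num_grad(lambda w, t=t: step_losses(w)[t], W_h) for t in range(3)]
print(total_grad, sum(per_step))                 # the two values agree
```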
The Chain Through Time
Now consider a specific path: how does the loss at step T affect the weights used back at step 1? We need:
- $\dfrac{\partial L_T}{\partial W_h}\Big|_{\text{step } 1} = \dfrac{\partial L_T}{\partial h_T} \cdot \dfrac{\partial h_T}{\partial h_{T-1}} \cdot \dfrac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \dfrac{\partial h_2}{\partial h_1} \cdot \dfrac{\partial h_1}{\partial W_h}$: gradient from step $T$ flowing back to the $W_h$ used at step 1
Each factor $\dfrac{\partial h_t}{\partial h_{t-1}}$ is a Jacobian matrix. From the RNN equation $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$:

- $\dfrac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}\!\left(\tanh'(a_t)\right) W_h$, where $\mathrm{diag}\!\left(\tanh'(a_t)\right)$ is a diagonal matrix of tanh derivative values at each neuron and $a_t$ is the pre-activation at step $t$
The full gradient from step $T$ to step 1 involves multiplying $T-1$ such Jacobians:

$$\frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \prod_{t=2}^{T} \mathrm{diag}\!\left(\tanh'(a_t)\right) W_h$$

This product of $T-1$ matrices is where the trouble starts.
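Here is a small NumPy sketch of that product, assuming the vanilla RNN update above; the weight scale is chosen to land in the vanishing regime, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
W_h = rng.normal(scale=0.3, size=(n, n))        # small recurrent weights
W_x = rng.normal(size=(n, n))

h = np.zeros(n)
product = np.eye(n)                             # running product of Jacobians
for t in range(1, 51):
    a = W_h @ h + W_x @ rng.normal(size=n)      # pre-activation a_t (random inputs)
    h = np.tanh(a)
    jacobian = np.diag(1 - h ** 2) @ W_h        # dh_t/dh_{t-1} = diag(tanh'(a_t)) W_h
    product = jacobian @ product                # now equals dh_t/dh_0
    if t % 10 == 0:
        print(t, np.linalg.norm(product))       # norm shrinks roughly geometrically
```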
Worked Example: Gradient Through 4 Steps
Let's trace a simple case. Suppose the hidden state is a scalar, and at every step $\tanh'(\cdot) = 0.8$ and $W_h = 0.9$. The gradient from step 4 to step 1:

$$\frac{\partial h_4}{\partial h_1} = (0.8 \times 0.9)^3 = 0.72^3 \approx 0.37$$

After 4 steps, the gradient is 37% of its starting value. Now stretch to 20 steps:

$$\frac{\partial h_{20}}{\partial h_1} = 0.72^{19} \approx 0.0019$$

Less than 0.2%. The signal from step 20 is almost invisible to the weights that processed step 1. For 100 steps: $0.72^{99} \approx 7.5 \times 10^{-15}$. Completely vanished.
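You can verify the arithmetic in a few lines (the numbers come straight from the example above):

```python
factor = 0.8 * 0.9          # tanh'() * W_h, the per-step shrink factor
print(factor ** 3)          # ~0.373   -> 4 steps back:  ~37% survives
print(factor ** 19)         # ~0.0019  -> 20 steps back: < 0.2% survives
print(factor ** 99)         # ~7.5e-15 -> 100 steps back: effectively zero
```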
The Total Gradient
For each weight matrix, the total gradient in BPTT is:
- $\dfrac{\partial L}{\partial W_h} = \displaystyle\sum_{t=1}^{T} \sum_{k=1}^{t} \dfrac{\partial L_t}{\partial h_t} \dfrac{\partial h_t}{\partial h_k} \dfrac{\partial h_k}{\partial W_h}$: total gradient, a sum of contributions from all time steps

The inner sum runs over all paths from each time step $t$ back to each earlier step $k$. Long paths (large $t-k$) have contributions that vanish through the repeated Jacobian product in $\frac{\partial h_t}{\partial h_k}$; only short paths contribute meaningfully. This is the formal statement of why vanilla RNNs struggle with long-range dependencies.
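For reference, here is a compact NumPy sketch of full BPTT for $W_h$, under an assumed squared-error loss at each step (everything here is illustrative, not a fixed API). The backward recursion evaluates the double sum efficiently: the running vector dh_next carries the influence of all later losses back through the chain of Jacobians.

```python
import numpy as np

def bptt_grad_Wh(xs, ys, W_h, W_x, b):
    """Gradient of L = sum_t 0.5*||h_t - y_t||^2 with respect to W_h, by full BPTT."""
    # Forward pass, storing every hidden state
    hs, h = [], np.zeros(W_h.shape[0])
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        hs.append(h)

    # Backward pass
    dW_h = np.zeros_like(W_h)
    dh_next = np.zeros_like(h)                  # gradient arriving from steps > t
    for t in reversed(range(len(xs))):
        dh = (hs[t] - ys[t]) + dh_next          # dL/dh_t: direct loss + later steps
        da = dh * (1 - hs[t] ** 2)              # back through tanh
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(h)
        dW_h += np.outer(da, h_prev)            # contribution of the W_h used at step t
        dh_next = W_h.T @ da                    # pass gradient back to step t-1
    return dW_h

rng = np.random.default_rng(2)
n, T = 4, 30
W_h, W_x, b = 0.5 * rng.normal(size=(n, n)), rng.normal(size=(n, n)), np.zeros(n)
xs = [rng.normal(size=n) for _ in range(T)]
ys = [rng.normal(size=n) for _ in range(T)]
print(bptt_grad_Wh(xs, ys, W_h, W_x, b).shape)  # (4, 4)
```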
Truncated BPTT
For sequences of length T=1000, full BPTT requires storing all T hidden states and computing gradients through all T steps, both of which are expensive. The practical solution is truncated BPTT:
- Process the sequence forward, keeping track of hidden states
- Every K steps, backpropagate through the last K steps and update weights
- The hidden state at the "boundary" is treated as constant (no gradient flows through it)
- Carry the most recent hidden state forward as the starting state for the next chunk
With T=100, K=20:
- Process steps 1-20 → backprop through 20 steps → update weights
- Process steps 21-40 (starting from h₂₀) → backprop through 20 steps → update weights
- ...
This is an approximation: the gradient for events more than K steps back is discarded. If K=20, the network can only learn dependencies within a 20-step window. For most tasks, K=20-50 is sufficient — vanilla RNNs struggle with longer dependencies anyway.
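Here is what that loop looks like in practice: a minimal PyTorch sketch (all sizes, data, and the regression loss are made up). The key line is h.detach(), which treats the hidden state at the chunk boundary as a constant so no gradient flows past it.

```python
import torch
import torch.nn as nn

T, K, n_in, n_hid = 100, 20, 8, 16
rnn = nn.RNN(n_in, n_hid, batch_first=True)     # vanilla tanh RNN
readout = nn.Linear(n_hid, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(1, T, n_in)                     # one sequence of length T (made up)
y = torch.randn(1, T, 1)                        # per-step targets (made up)

h = torch.zeros(1, 1, n_hid)                    # (num_layers, batch, hidden)
for start in range(0, T, K):
    chunk_x = x[:, start:start + K]
    chunk_y = y[:, start:start + K]
    out, h = rnn(chunk_x, h)                    # forward through K steps
    loss = ((readout(out) - chunk_y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                             # backprop through at most K steps
    opt.step()
    h = h.detach()                              # boundary: no gradient flows further back
```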
The BPTT analysis reveals a deep limitation of vanilla RNNs: they can't reliably learn from events more than 10-20 steps in the past. The next lesson makes this precise, and the lesson after that introduces the LSTM's solution.