Recurrent Networks
Lesson 3 ⏱ 16 min

Backpropagation through time

Video coming soon

Backpropagation Through Time: Unrolling and Accumulating Gradients

Shows how BPTT treats the unrolled RNN as a feedforward network, explains why shared weights require gradient accumulation across all time steps, and introduces truncated BPTT as a practical approximation.

⏱ ~7 min

🧮

Quick refresher

Chain rule for derivatives

The chain rule says d/dx[f(g(x))] = f'(g(x))·g'(x). For a chain of functions, you multiply the local derivatives. In a computation graph, backpropagation applies the chain rule at every node, multiplying local Jacobians as you move backward.

Example

If L = (h₃)² and h₃ = tanh(h₂), then dL/dh₂ = dL/dh₃ · dh₃/dh₂ = 2h₃ · tanh'(h₂) = 2h₃ · (1 − tanh²(h₂)).
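A quick numerical check in NumPy confirms this against a finite-difference estimate (a minimal sketch; the value h₂ = 0.5 is arbitrary):

```python
import numpy as np

# Check the refresher example: L = (h3)^2 with h3 = tanh(h2).
h2 = 0.5
h3 = np.tanh(h2)

# Chain rule: dL/dh2 = dL/dh3 * dh3/dh2 = 2*h3 * (1 - tanh^2(h2))
analytic = 2 * h3 * (1 - np.tanh(h2) ** 2)

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (np.tanh(h2 + eps) ** 2 - np.tanh(h2 - eps) ** 2) / (2 * eps)

print(analytic, numeric)  # both ~0.727
```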

You know how backpropagation works in feedforward networks: apply the chain rule backward through the computation graph, layer by layer. For RNNs, the same algorithm applies — but the computation graph has a special structure that creates new challenges.

Backpropagation through time is what makes RNN training possible — and its failure mode (vanishing/exploding gradients over long sequences) is what motivated every LSTM, GRU, and eventually transformer that followed. Understanding BPTT means understanding the root cause of the most important problem in sequence modeling.

The Unrolled Computation Graph

An RNN processing a sequence of length T can be viewed as a feedforward network with T layers, where:

  • Layer t takes h_{t-1} and x_t as inputs
  • All layers share the same weights W_h, W_x, b

This "unrolled" view is exactly how backpropagation is implemented. You unroll the RNN into T copies, treat it as a deep feedforward network, and apply standard backprop. This algorithm is called (BPTT).

Accumulating Gradients for Shared Weights

Here's the key consequence of weight sharing. Suppose we have a sequence of length 3 with loss contributions L_1, L_2, L_3 at each step. The total loss is:

L = L_1 + L_2 + L_3

where L is the total loss over the sequence and L_t is the loss at time step t.

The gradient of L with respect to W_h:

\frac{\partial L}{\partial W_h} = \frac{\partial L_1}{\partial W_h} + \frac{\partial L_2}{\partial W_h} + \frac{\partial L_3}{\partial W_h}

Because WhW_h appears in every time step's computation, it receives a gradient contribution from every time step. These are summed (by the chain rule for parameters that appear multiple times).
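If you use an autograd framework, this accumulation happens automatically: every use of a shared parameter adds its contribution into the same gradient buffer. A small PyTorch sketch makes it visible (the shapes, the squared-norm loss, and folding W_x into the inputs are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
n = 3
W_h = torch.randn(n, n, requires_grad=True)
xs = [torch.randn(n) for _ in range(3)]

def step_losses(W):
    h, losses = torch.ones(n), []        # nonzero h_0 so step 1 also uses W
    for x in xs:
        h = torch.tanh(W @ h + x)        # W_x x_t folded into x for brevity
        losses.append(h.pow(2).sum())    # L_t at each step
    return losses

# One backward pass on L = L_1 + L_2 + L_3 accumulates all three contributions.
sum(step_losses(W_h)).backward()
grad_total = W_h.grad.clone()

# Per-step gradients, computed independently, sum to the same matrix.
per_step = [torch.autograd.grad(Lt, W_h, retain_graph=True)[0]
            for Lt in step_losses(W_h)]
print(torch.allclose(grad_total, sum(per_step)))  # True
```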

The Chain Through Time

Now consider a specific path: how does the loss at step T affect the weights used back at step 1? We need:

\frac{\partial L_T}{\partial h_1} = \frac{\partial L_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}

Multiplying by \partial h_1 / \partial W_h then gives the gradient from step T flowing back to the copy of W_h used at step 1.

Each factor \partial h_t / \partial h_{t-1} is a Jacobian matrix. From the RNN equation h_t = \tanh(W_h h_{t-1} + W_x x_t + b):

\frac{\partial h_t}{\partial h_{t-1}} = W_h^T \cdot \text{diag}(\tanh'(W_h h_{t-1} + W_x x_t + b))
where \text{diag}(\tanh'(\cdot)) is the diagonal matrix of tanh derivative values at each neuron.

The full gradient from step T to step 1 involves multiplying T-1 such Jacobians:

\frac{\partial h_T}{\partial h_1} = \prod_{k=2}^{T} W_h^T \cdot \text{diag}(\tanh'(\cdot))

This product of T-1 matrices is where the trouble starts.
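You can watch this happen numerically. The sketch below (assumed sizes; W_h is rescaled so its largest singular value is 0.9, which forces the vanishing regime, while a value above 1 would make the product explode instead) multiplies the Jacobians and prints the running product's spectral norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 20
W_h = rng.normal(0, 1, (n, n))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # largest singular value -> 0.9

J = np.eye(n)                            # running product of Jacobians
h = rng.normal(0, 1, n)
for t in range(2, T + 1):
    z = W_h @ h                          # pre-activation (inputs omitted)
    D = np.diag(1 - np.tanh(z) ** 2)     # diag(tanh'(z))
    J = W_h.T @ D @ J                    # one more factor of dh_t/dh_{t-1}
    h = np.tanh(z)
    print(t, np.linalg.norm(J, 2))       # norm decays roughly geometrically
```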

Worked Example: Gradient Through 4 Steps

Let's trace a simple case. Suppose n = 1 (scalar hidden state) and, at every step, \tanh'(\cdot) = 0.8 and W_h = 0.9. The gradient from step 4 to step 1:

\frac{\partial h_4}{\partial h_1} = (W_h \cdot \tanh')^3 = (0.9 \times 0.8)^3 = 0.72^3 \approx 0.373

After 4 steps, the gradient is 37% of its starting value. Now stretch to 20 steps:

0.72^{19} \approx 0.0019

Less than 0.2%. The signal from step 20 is almost invisible to the weights that processed step 1. For 100 steps: 0.72^{99} \approx 10^{-14}. Completely vanished.
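These numbers take one line of Python to reproduce:

```python
for steps in (4, 20, 100):
    print(steps, (0.9 * 0.8) ** (steps - 1))
# 4 -> 0.373248, 20 -> ~0.00195, 100 -> ~7.6e-15
```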

The Total Gradient

For each weight matrix, the total gradient in BPTT is:

\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k}{\partial W_h}

The inner sum runs over all paths from each time step t back to each earlier step k. Contributions from long paths vanish; only short paths contribute meaningfully. This is the formal statement of why vanilla RNNs struggle with long-range dependencies.
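A manual implementation shows how one backward sweep realizes the double sum: a running vector delta carries \sum_{t \geq k} \partial L_t / \partial h_k, so each step contributes its \partial h_k / \partial W_h term exactly once. This sketch (with an assumed squared-norm loss and W_x folded into the inputs) checks the result against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 6
W_h = rng.normal(0, 0.5, (n, n))
xs = rng.normal(0, 1, (T, n))            # W_x x_t folded into the inputs

def forward(W):
    h, hs, zs, L = np.ones(n), [np.ones(n)], [], 0.0
    for x in xs:
        z = W @ h + x
        h = np.tanh(z)
        hs.append(h); zs.append(z); L += h @ h
    return L, hs, zs

L, hs, zs = forward(W_h)

grad = np.zeros((n, n))
delta = np.zeros(n)                      # sum of dL_t/dh_k over t >= k
for k in reversed(range(T)):
    delta = delta + 2 * hs[k + 1]        # add the direct term dL_k/dh_k
    g = (1 - np.tanh(zs[k]) ** 2) * delta    # backprop through tanh
    grad += np.outer(g, hs[k])           # this step's dh_k/dW_h contribution
    delta = W_h.T @ g                    # propagate to the previous step

# Finite-difference check on one entry of W_h
eps = 1e-6
Wp, Wm = W_h.copy(), W_h.copy()
Wp[0, 1] += eps; Wm[0, 1] -= eps
print(grad[0, 1], (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps))  # agree
```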

Truncated BPTT

For sequences of length T=1000, full BPTT requires storing all T hidden states and computing gradients through T steps, both of which are expensive. The practical solution is truncated BPTT:

  1. Process the sequence forward, keeping track of hidden states
  2. Every K steps, backpropagate through the last K steps and update weights
  3. The hidden state at the "boundary" is treated as constant (no gradient flows through it)
  4. Carry the most recent hidden state h_K forward as the starting state for the next chunk

Example with T=100, K=20:
Process steps 1-20 → backprop through 20 steps → update weights
Process steps 21-40 (starting from h₂₀) → backprop through 20 steps → update weights
...

This is an approximation: the gradient for events more than K steps back is discarded. If K=20, the network can only learn dependencies within a 20-step window. For most tasks, K=20-50 is sufficient — vanilla RNNs struggle with longer dependencies anyway.
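In an autograd framework, the boundary in step 3 is usually implemented by detaching the carried-over hidden state. A PyTorch sketch with K=20 (the model, the loss, and the data are placeholders, not from the lesson):

```python
import torch

torch.manual_seed(0)
T, K = 100, 20
rnn = torch.nn.RNN(input_size=8, hidden_size=16)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
xs = torch.randn(T, 1, 8)                  # (seq_len, batch, input_size)

h = torch.zeros(1, 1, 16)
for start in range(0, T, K):
    out, h = rnn(xs[start:start + K], h)   # forward through one K-step chunk
    loss = out.pow(2).mean()               # stand-in for a real loss
    opt.zero_grad()
    loss.backward()                        # backprop through this chunk only
    opt.step()
    h = h.detach()                         # boundary: treat h as a constant
```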

The BPTT analysis reveals a deep limitation of vanilla RNNs: they can't reliably learn from events more than 10-20 steps in the past. The next lesson makes this precise, and the lesson after that introduces the LSTM's solution.

Quiz

1 / 3

In BPTT, why is the gradient of the loss w.r.t. W_h the sum of gradients across all time steps?