Recurrent Networks
Lesson 3 ⏱ 16 min

Backpropagation through time

Video coming soon

Backpropagation Through Time: Unrolling and Accumulating Gradients

Shows how BPTT treats the unrolled RNN as a feedforward network, explains why shared weights require gradient accumulation across all time steps, and introduces truncated BPTT as a practical approximation.

⏱ ~7 min

🧮

Quick refresher

Chain rule for derivatives

The chain rule says d/dx[f(g(x))] = f'(g(x))·g'(x). For a chain of functions, you multiply the local derivatives. In a computation graph, backpropagation applies the chain rule at every node, multiplying local Jacobians as you move backward.

Example

If L = (h₃)² and h₃ = tanh(h₂), then dL/dh₂ = dL/dh₃ · dh₃/dh₂ = 2h₃ · tanh'(h₂) = 2h₃ · (1 − tanh²(h₂)).
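A quick numerical check in NumPy confirms this against a finite-difference estimate (a minimal sketch; the value h₂ = 0.5 is arbitrary):

```python
import numpy as np

# Check the refresher example: L = (h3)^2 with h3 = tanh(h2).
h2 = 0.5
h3 = np.tanh(h2)

# Chain rule: dL/dh2 = dL/dh3 * dh3/dh2 = 2*h3 * (1 - tanh^2(h2))
analytic = 2 * h3 * (1 - np.tanh(h2) ** 2)

# Finite-difference estimate of the same derivative
eps = 1e-6
numeric = (np.tanh(h2 + eps) ** 2 - np.tanh(h2 - eps) ** 2) / (2 * eps)

print(analytic, numeric)  # both ~0.727
```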

You know how backpropagation works in feedforward networks: apply the chain rule backward through the computation graph, layer by layer. For RNNs, the same algorithm applies — but the computation graph has a special structure that creates new challenges.

Backpropagation through time is what makes RNN training possible — and its failure mode (vanishing/exploding gradients over long sequences) is what motivated every LSTM, GRU, and eventually transformer that followed. Understanding BPTT means understanding the root cause of the most important problem in sequence modeling.

The Unrolled Computation Graph

An RNN processing a sequence of length T can be viewed as a feedforward network with T layers, where:

  • Layer t takes h_{t-1} and x_t as inputs
  • All layers share the same weights W_h, W_x, b

This "unrolled" view is exactly how backpropagation is implemented. You unroll the RNN into T copies, treat it as a deep feedforward network, and apply standard backprop. This algorithm is called (BPTT).

Accumulating Gradients for Shared Weights

Here's the key consequence of weight sharing. Suppose we have a sequence of length 3 with loss contributions L_1, L_2, L_3 at each step. The total loss is:

L = L_1 + L_2 + L_3

where L is the total loss over the sequence and L_t is the loss at time step t.

The gradient of L with respect to W_h:

\frac{\partial L}{\partial W_h} = \frac{\partial L_1}{\partial W_h} + \frac{\partial L_2}{\partial W_h} + \frac{\partial L_3}{\partial W_h}

Because WhW_h appears in every time step's computation, it receives a gradient contribution from every time step. These are summed (by the chain rule for parameters that appear multiple times).
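If you use an autograd framework, this accumulation happens automatically: every use of a shared parameter adds its contribution into the same gradient buffer. A small PyTorch sketch makes it visible (the shapes, the squared-norm loss, and folding W_x into the inputs are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
n = 3
W_h = torch.randn(n, n, requires_grad=True)
xs = [torch.randn(n) for _ in range(3)]

def step_losses(W):
    h, losses = torch.ones(n), []        # nonzero h_0 so step 1 also uses W
    for x in xs:
        h = torch.tanh(W @ h + x)        # W_x x_t folded into x for brevity
        losses.append(h.pow(2).sum())    # L_t at each step
    return losses

# One backward pass on L = L_1 + L_2 + L_3 accumulates all three contributions.
sum(step_losses(W_h)).backward()
grad_total = W_h.grad.clone()

# Per-step gradients, computed independently, sum to the same matrix.
per_step = [torch.autograd.grad(Lt, W_h, retain_graph=True)[0]
            for Lt in step_losses(W_h)]
print(torch.allclose(grad_total, sum(per_step)))  # True
```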

The Chain Through Time

Now consider a specific path: how does the loss at step T affect the weights used back at step 1? We need:

\frac{\partial L_T}{\partial h_1} = \frac{\partial L_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}

Multiplying by \partial h_1 / \partial W_h then gives the gradient from step T flowing back to the copy of W_h used at step 1.

Each factor \partial h_t / \partial h_{t-1} is a Jacobian matrix. From the RNN equation h_t = \tanh(W_h h_{t-1} + W_x x_t + b):

\frac{\partial h_t}{\partial h_{t-1}} = W_h^T \cdot \text{diag}(\tanh'(W_h h_{t-1} + W_x x_t + b))
where \text{diag}(\tanh'(\cdot)) is the diagonal matrix of tanh derivative values at each neuron.

The full gradient from step T to step 1 involves multiplying T-1 such Jacobians:

\frac{\partial h_T}{\partial h_1} = \prod_{k=2}^{T} W_h^T \cdot \text{diag}(\tanh'(\cdot))

This product of T-1 matrices is where the trouble starts.
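You can watch this happen numerically. The sketch below (assumed sizes; W_h is rescaled so its largest singular value is 0.9, which forces the vanishing regime, while a value above 1 would make the product explode instead) multiplies the Jacobians and prints the running product's spectral norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 8, 20
W_h = rng.normal(0, 1, (n, n))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # largest singular value -> 0.9

J = np.eye(n)                            # running product of Jacobians
h = rng.normal(0, 1, n)
for t in range(2, T + 1):
    z = W_h @ h                          # pre-activation (inputs omitted)
    D = np.diag(1 - np.tanh(z) ** 2)     # diag(tanh'(z))
    J = W_h.T @ D @ J                    # one more factor of dh_t/dh_{t-1}
    h = np.tanh(z)
    print(t, np.linalg.norm(J, 2))       # norm decays roughly geometrically
```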

Worked Example: Gradient Through 4 Steps

Let's trace a simple case. Suppose n = 1 (scalar hidden state) and, at every step, \tanh'(\cdot) = 0.8 and W_h = 0.9. The gradient from step 4 to step 1:

\frac{\partial h_4}{\partial h_1} = (W_h \cdot \tanh')^3 = (0.9 \times 0.8)^3 = 0.72^3 \approx 0.373

After 4 steps, the gradient is 37% of its starting value. Now stretch to 20 steps:

0.72^{19} \approx 0.0019

Less than 0.2%. The signal from step 20 is almost invisible to the weights that processed step 1. For 100 steps: 0.72^{99} \approx 10^{-14}. Completely vanished.
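These numbers take one line of Python to reproduce:

```python
for steps in (4, 20, 100):
    print(steps, (0.9 * 0.8) ** (steps - 1))
# 4 -> 0.373248, 20 -> ~0.00195, 100 -> ~7.6e-15
```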

The Total Gradient

For each weight matrix, the total gradient in BPTT is:

\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial h_k} \cdot \frac{\partial h_k}{\partial W_h}

The inner sum runs over all paths from each time step t back to each earlier step k. Contributions from long paths vanish; only short paths contribute meaningfully. This is the formal statement of why vanilla RNNs struggle with long-range dependencies.
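A manual implementation shows how one backward sweep realizes the double sum: a running vector delta carries \sum_{t \geq k} \partial L_t / \partial h_k, so each step contributes its \partial h_k / \partial W_h term exactly once. This sketch (with an assumed squared-norm loss and W_x folded into the inputs) checks the result against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 6
W_h = rng.normal(0, 0.5, (n, n))
xs = rng.normal(0, 1, (T, n))            # W_x x_t folded into the inputs

def forward(W):
    h, hs, zs, L = np.ones(n), [np.ones(n)], [], 0.0
    for x in xs:
        z = W @ h + x
        h = np.tanh(z)
        hs.append(h); zs.append(z); L += h @ h
    return L, hs, zs

L, hs, zs = forward(W_h)

grad = np.zeros((n, n))
delta = np.zeros(n)                      # sum of dL_t/dh_k over t >= k
for k in reversed(range(T)):
    delta = delta + 2 * hs[k + 1]        # add the direct term dL_k/dh_k
    g = (1 - np.tanh(zs[k]) ** 2) * delta    # backprop through tanh
    grad += np.outer(g, hs[k])           # this step's dh_k/dW_h contribution
    delta = W_h.T @ g                    # propagate to the previous step

# Finite-difference check on one entry of W_h
eps = 1e-6
Wp, Wm = W_h.copy(), W_h.copy()
Wp[0, 1] += eps; Wm[0, 1] -= eps
print(grad[0, 1], (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps))  # agree
```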

Truncated BPTT

For sequences of length T=1000, full BPTT requires storing all T hidden states and computing gradients through T steps, both of which are expensive. The practical solution is truncated BPTT:

  1. Process the sequence forward, keeping track of hidden states
  2. Every K steps, backpropagate through the last K steps and update weights
  3. The hidden state at the "boundary" is treated as constant (no gradient flows through it)
  4. Carry the most recent hidden state h_K forward as the starting state for the next chunk

Example with T=100, K=20:
Process steps 1-20 → backprop through 20 steps → update weights
Process steps 21-40 (starting from h₂₀) → backprop through 20 steps → update weights
...

This is an approximation: the gradient for events more than K steps back is discarded. If K=20, the network can only learn dependencies within a 20-step window. For most tasks, K=20-50 is sufficient — vanilla RNNs struggle with longer dependencies anyway.
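In an autograd framework, the boundary in step 3 is usually implemented by detaching the carried-over hidden state. A PyTorch sketch with K=20 (the model, the loss, and the data are placeholders, not from the lesson):

```python
import torch

torch.manual_seed(0)
T, K = 100, 20
rnn = torch.nn.RNN(input_size=8, hidden_size=16)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
xs = torch.randn(T, 1, 8)                  # (seq_len, batch, input_size)

h = torch.zeros(1, 1, 16)
for start in range(0, T, K):
    out, h = rnn(xs[start:start + K], h)   # forward through one K-step chunk
    loss = out.pow(2).mean()               # stand-in for a real loss
    opt.zero_grad()
    loss.backward()                        # backprop through this chunk only
    opt.step()
    h = h.detach()                         # boundary: treat h as a constant
```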

The BPTT analysis reveals a deep limitation of vanilla RNNs: they can't reliably learn from events more than 10-20 steps in the past. The next lesson makes this precise, and the lesson after that introduces the LSTM's solution.

Quiz

1 / 3

In BPTT, why is the gradient of the loss w.r.t. W_h the sum of gradients across all time steps?