Backpropagation
Lesson 3 ⏱ 16 min

Computing gradients layer by layer


Backpropagation Algorithm: Step-by-Step Formulas

The four concrete steps of backpropagation - forward pass with caching, output layer error signal, backward propagation with transposed weights, and weight gradient computation.

⏱ ~9 min

🧮

Quick refresher

Matrix transpose

Transposing a matrix swaps rows and columns. (Aᵀ)ᵢⱼ = Aⱼᵢ. Key: (AB)ᵀ = BᵀAᵀ (order reverses).

Example

If W is 4×3 (4 neurons, 3 inputs), then Wᵀ is 3×4.
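A quick NumPy check of the shapes and the reversal rule (illustrative only):

```python
import numpy as np

W = np.random.randn(4, 3)      # 4 neurons, 3 inputs
print(W.T.shape)               # (3, 4): transpose swaps rows and columns

A, B = np.random.randn(2, 5), np.random.randn(5, 7)
assert np.allclose((A @ B).T, B.T @ A.T)   # (AB)ᵀ = BᵀAᵀ, order reverses
```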

The Goal: Concrete, Implementable Formulas

The chain rule tells us that we can compute gradients layer by layer. Now let's pin down exactly what to compute at each layer, in an order that avoids redundant work.

This is where backpropagation becomes concrete. The formulas here are what frameworks like PyTorch implement — knowing them means understanding what actually happens during every training step.

Plain-language preview of the four steps:

  1. Forward pass — run the input through the network layer by layer, saving (caching) every intermediate result. This is just making a prediction, but carefully writing down all intermediate work.
  2. Output error signal — compare the prediction to the true answer. Compute a vector that says "here is how wrong the output layer was, and in which direction."
  3. Propagate backward — carry that error signal backward through the network, layer by layer. Each layer converts "how wrong was my output?" into "how wrong was my input?" using the chain rule.
  4. Compute weight gradients — once each layer knows its error signal, the weight gradients are simple outer products: (how wrong was I) × (what did I receive as input).

The key concept is the error signal:

\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}

  • δ⁽ˡ⁾: error signal at layer l
  • L: loss
  • z⁽ˡ⁾: pre-activation at layer l

Once you have δ⁽ˡ⁾ for every layer, the weight gradients follow immediately. The backward pass is really just computing all the δ values from output to input.

Step 1: Forward Pass (With Caching)

For l = 1 to L:

z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \qquad \text{(cache this)}

a^{(l)} = \sigma(z^{(l)}) \qquad \text{(cache this)}

  • z⁽ˡ⁾: pre-activation vector (must be cached)
  • a⁽ˡ⁾: post-activation vector (must be cached)
  • W⁽ˡ⁾: weight matrix at layer l
  • b⁽ˡ⁾: bias vector at layer l
  • σ: activation function

After the forward pass, compute the loss: L = loss(a⁽ᴸ⁾, y).
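As a minimal sketch (not a reference implementation), here is a forward pass in NumPy that caches every z⁽ˡ⁾ and a⁽ˡ⁾, assuming sigmoid activations and a hypothetical `params` list of (W, b) pairs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Forward pass with caching. params: list of (W, b) pairs, one per layer (assumed layout)."""
    a = x
    cache = [{"a": a}]                  # a^(0) is the input
    for W, b in params:
        z = W @ a + b                   # pre-activation (cached)
        a = sigmoid(z)                  # post-activation (cached)
        cache.append({"z": z, "a": a})
    return a, cache                     # prediction plus everything the backward pass needs
```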

Step 2: Output Layer Error Signal

The error signal for the output layer depends on both the loss function and the output activation. For the standard pairings:

  • Cross-entropy + softmax (multi-class): δ⁽ᴸ⁾ = ŷ − y
  • MSE + linear (regression): δ⁽ᴸ⁾ = ŷ − y
  • Cross-entropy + sigmoid (binary): δ⁽ᴸ⁾ = ŷ − y
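A minimal sketch of the first pairing, with example values standing in for the cached output-layer pre-activation and a one-hot target:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z_L = np.array([2.0, 1.0, 0.1])      # cached z^(L) (example values)
y   = np.array([1.0, 0.0, 0.0])      # one-hot target
y_hat = softmax(z_L)                 # prediction
delta_L = y_hat - y                  # output-layer error signal δ^(L)
```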

Step 3: Propagate Backward

Given δ⁽ˡ⁺¹⁾ (the error signal at layer l+1), compute δ⁽ˡ⁾:

\delta^{(l)} = \left(W^{(l+1)\top} \cdot \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})

  • W⁽ˡ⁺¹⁾: weight matrix of the NEXT layer
  • ⊙: elementwise (Hadamard) product
  • σ′: derivative of the activation function

Two parts:

W⁽ˡ⁺¹⁾ᵀ · δ⁽ˡ⁺¹⁾ — route the error signal backward through the weight matrix. The forward pass mapped a⁽ˡ⁾ (size nₗ) to z⁽ˡ⁺¹⁾ (size nₗ₊₁) using W⁽ˡ⁺¹⁾ (shape nₗ₊₁ × nₗ). To map the error backward, multiply by W⁽ˡ⁺¹⁾ᵀ (shape nₗ × nₗ₊₁).

⊙ σ′(z⁽ˡ⁾) — apply the activation derivative elementwise. This is why we cached z⁽ˡ⁾: we need the pre-activation values to compute σ′(z⁽ˡ⁾).

For ReLU: σ′(z) = 1 if z > 0, else 0. For sigmoid: σ′(z) = a(1 − a) where a = σ(z).
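A sketch of one backward step with ReLU hidden layers, using the cached pre-activation; the shapes here are example values:

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(z.dtype)            # 1 where z > 0, else 0

W_next = np.random.randn(3, 4)                # W^(l+1): 3 neurons, 4 inputs (example shapes)
delta_next = np.random.randn(3)               # δ^(l+1)
z_l = np.random.randn(4)                      # cached z^(l)

delta_l = (W_next.T @ delta_next) * relu_prime(z_l)   # ⊙ is the elementwise *
```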

Step 4: Compute Weight Gradients

Once you have all the error signals δ⁽ˡ⁾, the weight gradients are simple outer products:

\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \cdot \left(a^{(l-1)}\right)^\top

  • δ⁽ˡ⁾: error signal at layer l, shape nₗ × 1
  • a⁽ˡ⁻¹⁾: previous activation, shape nₗ₋₁ × 1

\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}

The error signal also equals the bias gradient directly.

Shape check for layer l with nₗ neurons and nₗ₋₁ inputs:

  • δ⁽ˡ⁾: shape (nₗ × 1)
  • (a⁽ˡ⁻¹⁾)ᵀ: shape (1 × nₗ₋₁)
  • Outer product: (nₗ × nₗ₋₁), which matches the shape of W⁽ˡ⁾
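A sketch of the gradient computation for one layer with NumPy's outer product; the vector sizes are example values:

```python
import numpy as np

delta_l = np.random.randn(4)          # δ^(l), n_l = 4 (example shapes)
a_prev  = np.random.randn(6)          # a^(l-1), n_{l-1} = 6

grad_W = np.outer(delta_l, a_prev)    # shape (4, 6), matches W^(l)
grad_b = delta_l                      # the bias gradient is the error signal itself
```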

Step 5: Update Parameters

After computing all gradients, apply gradient descent to every layer:

W^{(l)} \leftarrow W^{(l)} - \alpha \cdot \frac{\partial L}{\partial W^{(l)}}

b^{(l)} \leftarrow b^{(l)} - \alpha \cdot \frac{\partial L}{\partial b^{(l)}}

  • α: learning rate
  • l: layer index
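A sketch of the update step, assuming `params`, `grads_W`, and `grads_b` are lists aligned layer by layer (a hypothetical layout, matching the forward-pass sketch above):

```python
def sgd_update(params, grads_W, grads_b, alpha=0.01):
    """Vanilla gradient-descent step over all layers."""
    return [(W - alpha * gW, b - alpha * gb)
            for (W, b), gW, gb in zip(params, grads_W, grads_b)]
```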


Memory: The Cost of Caching

The forward pass caches all z⁽ˡ⁾ and a⁽ˡ⁾ values. For a batch of B examples and a network of depth L with average width d:

\text{Memory} \propto B \times d \times L

  • B: batch size
  • d: average layer width
  • L: number of layers

For a deep network with large batches, this can be tens of gigabytes. Gradient checkpointing is a technique that caches activations only at certain layers and recomputes the rest during the backward pass, trading computation for memory. It is commonly used when training very large models.
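A rough order-of-magnitude check with assumed numbers (float32, caching both z⁽ˡ⁾ and a⁽ˡ⁾ at every layer):

```python
B, d, L = 4096, 8192, 96     # batch size, average width, depth (hypothetical values)
bytes_per_value = 4          # float32
tensors_per_layer = 2        # both z^(l) and a^(l) are cached

total_bytes = B * d * L * tensors_per_layer * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")   # ≈ 25.8 GB of cached activations
```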

Quiz

1 / 3

In the backward pass formula δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾ᵀ · δ⁽ˡ⁺¹⁾) ⊙ σ'(z⁽ˡ⁾), what does the transpose W⁽ˡ⁺¹⁾ᵀ do?