The Goal: Concrete, Implementable Formulas
The chain rule tells us that we can compute gradients layer by layer. Now let's pin down exactly what to compute at each layer, in an order that avoids redundant work.
This is where backpropagation becomes concrete. The formulas here are what frameworks like PyTorch implement — knowing them means understanding what actually happens during every training step.
Plain-language preview of the four steps:
- Forward pass — run the input through the network layer by layer, saving (caching) every intermediate result. This is just making a prediction, but carefully writing down all intermediate work.
- Output error signal — compare the prediction to the true answer. Compute a vector that says "here is how wrong the output layer was, and in which direction."
- Propagate backward — carry that error signal backward through the network, layer by layer. Each layer converts "how wrong was my output?" into "how wrong was my input?" using the chain rule.
- Compute weight gradients — once each layer knows its error signal, the weight gradients are simple outer products: (how wrong was I) × (what did I receive as input).
The key concept is the error signal:

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$$

- $\delta^{(l)}$ - error signal at layer $l$
- $L$ - loss
- $z^{(l)}$ - pre-activation at layer $l$

Once you have $\delta^{(l)}$ for every layer, the weight gradients follow immediately. The backward pass is really just computing all the $\delta^{(l)}$ values from output to input.
Step 1: Forward Pass (With Caching)
For $l = 1$ to $L$:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)})$$

- $z^{(l)}$ - pre-activation vector - MUST BE CACHED
- $a^{(l)}$ - post-activation vector - MUST BE CACHED
- $W^{(l)}$ - weight matrix at layer $l$
- $b^{(l)}$ - bias vector at layer $l$
- $\sigma$ - activation function

After the forward pass, compute the loss: $L = \text{loss}(a^{(L)}, y)$.
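A minimal NumPy sketch of this loop. The layer-list structure (`Ws`, `bs`) and the ReLU choice for `sigma` are assumptions of the sketch, not a fixed API:

```python
import numpy as np

def sigma(z):
    # ReLU, standing in for whatever activation the network uses
    return np.maximum(0.0, z)

def forward(x, Ws, bs):
    """Forward pass that caches every z^(l) and a^(l) for the backward pass."""
    a = x
    z_cache, a_cache = [], [x]   # a_cache[0] is the input, a^(0)
    for W, b in zip(Ws, bs):
        z = W @ a + b            # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigma(z)             # a^(l) = sigma(z^(l))
        z_cache.append(z)        # MUST BE CACHED
        a_cache.append(a)        # MUST BE CACHED
    return a, z_cache, a_cache
```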
Step 2: Output Layer Error Signal
The error signal for the output layer depends on both the loss function and the output activation. For the standard pairings:
- Cross-entropy + softmax (multi-class): $\delta^{(L)} = a^{(L)} - y$
- MSE + linear (regression): $\delta^{(L)} = a^{(L)} - y$ (using the convention $L = \tfrac{1}{2}\|a^{(L)} - y\|^2$)
- Cross-entropy + sigmoid (binary): $\delta^{(L)} = a^{(L)} - y$

All three simplify to the same expression: prediction minus target. This is no accident; each pairing is chosen so that the derivative of the output activation cancels against the derivative of the loss.
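For the cross-entropy + softmax pairing, the cancellation makes the code one line. A sketch, assuming `y` is a one-hot column vector and `z_L` is the cached output pre-activation:

```python
def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def output_error(z_L, y):
    """delta^(L) for the cross-entropy + softmax pairing.

    The softmax Jacobian and the cross-entropy derivative cancel,
    leaving just prediction minus one-hot target.
    """
    return softmax(z_L) - y
```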
Step 3: Propagate Backward
Given $\delta^{(l+1)}$ (the error signal at layer $l+1$), compute $\delta^{(l)}$:

$$\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$$

- $(W^{(l+1)})^T$ - transposed weight matrix of the NEXT layer
- $\odot$ - elementwise (Hadamard) product
- $\sigma'$ - derivative of the activation function

Two parts:
- $(W^{(l+1)})^T \delta^{(l+1)}$ - route the error signal backward through the weight matrix. The forward pass mapped $a^{(l)}$ (size $n_l$) to $z^{(l+1)}$ (size $n_{l+1}$) using $W^{(l+1)}$ (shape $n_{l+1} \times n_l$). To map the error backward, multiply by $(W^{(l+1)})^T$ (shape $n_l \times n_{l+1}$).
- $\odot\, \sigma'(z^{(l)})$ - apply the local derivative of the activation. This is why we cached $z^{(l)}$: we need the pre-activation values to compute $\sigma'(z^{(l)})$.

For ReLU: $\sigma'(z) = 1$ if $z > 0$, else $0$. For sigmoid: $\sigma'(z) = a(1-a)$ where $a = \sigma(z)$.
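One backward step in NumPy, continuing the ReLU assumption from the forward-pass sketch:

```python
def sigma_prime(z):
    # ReLU derivative: 1 where z > 0, else 0
    return (z > 0).astype(float)

def backward_step(delta_next, W_next, z_l):
    """Convert delta^(l+1) into delta^(l).

    W_next.T routes the error backward through the weights; the
    elementwise product applies sigma'(z^(l)), which is why z^(l)
    had to be cached during the forward pass.
    """
    return (W_next.T @ delta_next) * sigma_prime(z_l)
```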
Step 4: Compute Weight Gradients
Once you have all the error signals $\delta^{(l)}$, the weight gradients are simple outer products:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T, \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

- $\delta^{(l)}$ - error signal at layer $l$ - shape $n_l \times 1$
- $a^{(l-1)}$ - previous activation - shape $n_{l-1} \times 1$
- the error signal also equals the bias gradient directly

Shape check for layer $l$ with $n_l$ neurons and $n_{l-1}$ inputs:
- $\delta^{(l)}$: shape $(n_l \times 1)$
- $(a^{(l-1)})^T$: shape $(1 \times n_{l-1})$
- Outer product: $(n_l \times n_{l-1})$, which equals the shape of $W^{(l)}$ ✓
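As a sketch, both gradients for one layer fall out of the cached values and the error signal:

```python
def layer_grads(delta_l, a_prev):
    """Gradients for one layer from its error signal.

    delta_l has shape (n_l, 1) and a_prev has shape (n_{l-1}, 1),
    so the outer product has shape (n_l, n_{l-1}), matching W^(l).
    """
    dW = delta_l @ a_prev.T   # dL/dW^(l) = delta^(l) (a^(l-1))^T
    db = delta_l              # dL/db^(l) = delta^(l)
    return dW, db
```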
Step 5: Update Parameters
After computing all gradients, apply gradient descent to every layer:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}$$

- $\eta$ - learning rate
- $l$ - layer index
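A minimal in-place update over the same layer lists used above (the learning rate value is illustrative):

```python
def sgd_step(Ws, bs, dWs, dbs, eta=0.01):
    # Vanilla gradient descent: every layer moves against its gradient
    for l in range(len(Ws)):
        Ws[l] -= eta * dWs[l]
        bs[l] -= eta * dbs[l]
```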
Interactive example (coming soon): run a complete forward and backward pass and watch the error signals $\delta^{(l)}$ flow backward through each layer.
Memory: The Cost of Caching
The forward pass caches all $z^{(l)}$ and $a^{(l)}$ values. For a batch of $B$ examples and a network of depth $L$ with average width $n$, that is roughly $2 \cdot B \cdot L \cdot n$ cached values (one $z$ and one $a$ vector per layer, per example):
- $B$ - batch size
- $n$ - average layer width
- $L$ - number of layers
For a deep network with large batches, this can be tens of gigabytes. Gradient checkpointing (also called activation checkpointing) is a technique that caches activations only at certain layers and recomputes the rest during the backward pass, trading computation for memory. It is commonly used when training very large models.
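A back-of-envelope sketch of the cache size, assuming float32 (4 bytes per value) and illustrative dimensions:

```python
def activation_cache_gb(B=256, n=4096, L=48, bytes_per_value=4):
    # 2 * B * L * n cached values: one z and one a vector of width n
    # per layer, per example, at 4 bytes each for float32
    return 2 * B * L * n * bytes_per_value / 1e9

print(activation_cache_gb())  # ~0.4 GB for this configuration
```

Even this modest configuration needs hundreds of megabytes just for the cache; larger widths, depths, and per-example activation counts (e.g. one set of activations per token in sequence models) are what push the total into tens of gigabytes.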