The Goal: Concrete, Implementable Formulas
The chain rule tells us that we can compute gradients layer by layer. Now let's pin down exactly what to compute at each layer, in an order that avoids redundant work.
This is where backpropagation becomes concrete. The formulas here are what frameworks like PyTorch implement — knowing them means understanding what actually happens during every training step.
Plain-language preview of the four steps:
- Forward pass — run the input through the network layer by layer, saving (caching) every intermediate result. This is just making a prediction, but carefully writing down all intermediate work.
- Output error signal — compare the prediction to the true answer. Compute a vector that says "here is how wrong the output layer was, and in which direction."
- Propagate backward — carry that error signal backward through the network, layer by layer. Each layer converts "how wrong was my output?" into "how wrong was my input?" using the chain rule.
- Compute weight gradients — once each layer knows its error signal, the weight gradients are simple outer products: (how wrong was I) × (what did I receive as input).
The key concept is the error signal:

$$\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$$

- $\delta^{(l)}$ - error signal at layer $l$
- $L$ - loss
- $z^{(l)}$ - pre-activation at layer $l$

Once you have $\delta^{(l)}$ for every layer, the weight gradients follow immediately. The backward pass is really just computing all the $\delta^{(l)}$ values from output to input.
Step 1: Forward Pass (With Caching)
For $l = 1$ to $L$:

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma(z^{(l)})$$

- $z^{(l)}$ - pre-activation vector - MUST BE CACHED
- $a^{(l)}$ - post-activation vector - MUST BE CACHED
- $W^{(l)}$ - weight matrix at layer $l$
- $b^{(l)}$ - bias vector at layer $l$
- $\sigma$ - activation function

After the forward pass, compute the loss: $L = \text{loss}(a^{(L)}, y)$.
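A minimal NumPy sketch of this loop. The layer-list structure (`Ws`, `bs`) and the ReLU choice for `sigma` are assumptions of the sketch, not a fixed API:

```python
import numpy as np

def sigma(z):
    # ReLU, standing in for whatever activation the network uses
    return np.maximum(0.0, z)

def forward(x, Ws, bs):
    """Forward pass that caches every z^(l) and a^(l) for the backward pass."""
    a = x
    z_cache, a_cache = [], [x]   # a_cache[0] is the input, a^(0)
    for W, b in zip(Ws, bs):
        z = W @ a + b            # z^(l) = W^(l) a^(l-1) + b^(l)
        a = sigma(z)             # a^(l) = sigma(z^(l))
        z_cache.append(z)        # MUST BE CACHED
        a_cache.append(a)        # MUST BE CACHED
    return a, z_cache, a_cache
```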
Step 2: Output Layer Error Signal
The error signal for the output layer depends on both the loss function and the output activation. For the standard pairings:
- Cross-entropy + softmax (multi-class): $\delta^{(L)} = a^{(L)} - y$
- MSE + linear (regression): $\delta^{(L)} = a^{(L)} - y$ (using the convention $L = \tfrac{1}{2}\|a^{(L)} - y\|^2$)
- Cross-entropy + sigmoid (binary): $\delta^{(L)} = a^{(L)} - y$

All three simplify to the same expression: prediction minus target. This is no accident; each pairing is chosen so that the derivative of the output activation cancels against the derivative of the loss.
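For the cross-entropy + softmax pairing, the cancellation makes the code one line. A sketch, assuming `y` is a one-hot column vector and `z_L` is the cached output pre-activation:

```python
def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def output_error(z_L, y):
    """delta^(L) for the cross-entropy + softmax pairing.

    The softmax Jacobian and the cross-entropy derivative cancel,
    leaving just prediction minus one-hot target.
    """
    return softmax(z_L) - y
```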
Step 3: Propagate Backward
Given $\delta^{(l+1)}$ (the error signal at layer $l+1$), compute $\delta^{(l)}$:

$$\delta^{(l)} = \left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(z^{(l)})$$

- $(W^{(l+1)})^T$ - transposed weight matrix of the NEXT layer
- $\odot$ - elementwise (Hadamard) product
- $\sigma'$ - derivative of the activation function

Two parts:
- $(W^{(l+1)})^T \delta^{(l+1)}$ - route the error signal backward through the weight matrix. The forward pass mapped $a^{(l)}$ (size $n_l$) to $z^{(l+1)}$ (size $n_{l+1}$) using $W^{(l+1)}$ (shape $n_{l+1} \times n_l$). To map the error backward, multiply by $(W^{(l+1)})^T$ (shape $n_l \times n_{l+1}$).
- $\odot\, \sigma'(z^{(l)})$ - apply the local derivative of the activation. This is why we cached $z^{(l)}$: we need the pre-activation values to compute $\sigma'(z^{(l)})$.

For ReLU: $\sigma'(z) = 1$ if $z > 0$, else $0$. For sigmoid: $\sigma'(z) = a(1-a)$ where $a = \sigma(z)$.
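One backward step in NumPy, continuing the ReLU assumption from the forward-pass sketch:

```python
def sigma_prime(z):
    # ReLU derivative: 1 where z > 0, else 0
    return (z > 0).astype(float)

def backward_step(delta_next, W_next, z_l):
    """Convert delta^(l+1) into delta^(l).

    W_next.T routes the error backward through the weights; the
    elementwise product applies sigma'(z^(l)), which is why z^(l)
    had to be cached during the forward pass.
    """
    return (W_next.T @ delta_next) * sigma_prime(z_l)
```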
Step 4: Compute Weight Gradients
Once you have all the error signals $\delta^{(l)}$, the weight gradients are simple outer products:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T, \qquad \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

- $\delta^{(l)}$ - error signal at layer $l$ - shape $n_l \times 1$
- $a^{(l-1)}$ - previous activation - shape $n_{l-1} \times 1$
- the error signal also equals the bias gradient directly

Shape check for layer $l$ with $n_l$ neurons and $n_{l-1}$ inputs:
- $\delta^{(l)}$: shape $(n_l \times 1)$
- $(a^{(l-1)})^T$: shape $(1 \times n_{l-1})$
- Outer product: $(n_l \times n_{l-1})$, which equals the shape of $W^{(l)}$ ✓
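As a sketch, both gradients for one layer fall out of the cached values and the error signal:

```python
def layer_grads(delta_l, a_prev):
    """Gradients for one layer from its error signal.

    delta_l has shape (n_l, 1) and a_prev has shape (n_{l-1}, 1),
    so the outer product has shape (n_l, n_{l-1}), matching W^(l).
    """
    dW = delta_l @ a_prev.T   # dL/dW^(l) = delta^(l) (a^(l-1))^T
    db = delta_l              # dL/db^(l) = delta^(l)
    return dW, db
```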
Step 5: Update Parameters
After computing all gradients, apply gradient descent to every layer:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial L}{\partial b^{(l)}}$$

- $\eta$ - learning rate
- $l$ - layer index
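A minimal in-place update over the same layer lists used above (the learning rate value is illustrative):

```python
def sgd_step(Ws, bs, dWs, dbs, eta=0.01):
    # Vanilla gradient descent: every layer moves against its gradient
    for l in range(len(Ws)):
        Ws[l] -= eta * dWs[l]
        bs[l] -= eta * dbs[l]
```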
Interactive example (coming soon): run a complete forward and backward pass and watch the error signals $\delta^{(l)}$ flow backward through each layer.
Memory: The Cost of Caching
The forward pass caches all $z^{(l)}$ and $a^{(l)}$ values. For a batch of $B$ examples and a network of depth $L$ with average width $n$, that is roughly $2 \cdot B \cdot L \cdot n$ cached values (one $z$ and one $a$ vector per layer, per example):
- $B$ - batch size
- $n$ - average layer width
- $L$ - number of layers
For a deep network with large batches, this can be tens of gigabytes. Gradient checkpointing (also called activation checkpointing) is a technique that caches activations only at certain layers and recomputes the rest during the backward pass, trading computation for memory. It is commonly used when training very large models.
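A back-of-envelope sketch of the cache size, assuming float32 (4 bytes per value) and illustrative dimensions:

```python
def activation_cache_gb(B=256, n=4096, L=48, bytes_per_value=4):
    # 2 * B * L * n cached values: one z and one a vector of width n
    # per layer, per example, at 4 bytes each for float32
    return 2 * B * L * n * bytes_per_value / 1e9

print(activation_cache_gb())  # ~0.4 GB for this configuration
```

Even this modest configuration needs hundreds of megabytes just for the cache; larger widths, depths, and per-example activation counts (e.g. one set of activations per token in sequence models) are what push the total into tens of gigabytes.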