The Chain Rule, Refreshed
When you have a composition of functions, derivatives multiply. For $y = f(g(x))$:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Plain-language intuition first: Imagine you are pulling a lever that controls a gear, which controls a second gear, which moves a platform. If moving the lever 1 cm turns the first gear by 2 degrees, and turning the first gear by 1 degree turns the second gear by 3 degrees, then moving the lever 1 cm causes 2 × 3 = 6 degrees of rotation in the second gear. That's the chain rule: effects along a chain of dependencies multiply. In a neural network, moving a weight changes a pre-activation, which changes an activation, which changes a loss. Each change is a local ratio — a derivative — and they multiply to give the total effect.
- $f$: the outer function
- $g$: the inner function
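To see the multiplication in code, here is a minimal numeric sketch (plain Python; the particular choices of $f$ and $g$ are arbitrary illustrations): the analytic product $f'(g(x)) \cdot g'(x)$ matches a finite-difference estimate of the composite's derivative.

```python
import math

# Arbitrary illustrative functions: g squares its input, f takes the sine.
def g(x): return x ** 2
def f(u): return math.sin(u)

def dg_dx(x): return 2 * x        # local derivative of the inner function
def df_du(u): return math.cos(u)  # local derivative of the outer function

x = 1.3

# Chain rule: the derivative of f(g(x)) is f'(g(x)) * g'(x).
analytic = df_du(g(x)) * dg_dx(x)

# Finite-difference estimate of the same derivative, for comparison.
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)

print(analytic, numeric)  # the two values agree to several decimal places
```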
Extend this to any number of nested functions. For $y = f(g(h(x)))$:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$
- $f$: the outermost function
- $g$: the middle function
- $h$: the innermost function
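The same idea works for a chain of any length, and it is essentially what backpropagation will do: evaluate the chain forward while remembering the intermediate values, then multiply the local derivatives on the way back. A sketch with three arbitrary illustrative functions:

```python
import math

# A chain of functions, innermost first, each paired with its own derivative.
# The specific functions are arbitrary illustrative choices.
chain = [
    (lambda x: 3 * x + 1,    lambda x: 3.0),            # h (innermost)
    (lambda u: u ** 2,       lambda u: 2 * u),          # g (middle)
    (lambda v: math.exp(-v), lambda v: -math.exp(-v)),  # f (outermost)
]

x = 0.4

# Forward pass: evaluate the chain, remembering the input to every link.
inputs, value = [], x
for fn, _ in chain:
    inputs.append(value)
    value = fn(value)

# Backward pass: multiply the local derivatives, outermost link first.
grad = 1.0
for (_, dfn), inp in zip(reversed(chain), reversed(inputs)):
    grad *= dfn(inp)

print(grad)  # dy/dx for the whole composition f(g(h(x)))
```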
Every step along the chain contributes its own derivative, and they all multiply together. This is the entire mathematical machinery behind backpropagation.
A deep network is just functions stacked inside other functions. The derivative of that stack is a product of each layer's local derivative — and that's exactly what the chain rule computes. Without it, there would be no principled way to ask "how does changing this weight in layer 1 affect the final loss?" With it, the answer falls out automatically, layer by layer.
Tracing Gradients Through One Neuron
Let's be concrete. One neuron with cross-entropy loss:
- linear pre-activation: $z = wx + b$
- sigmoid activation: $a = \sigma(z)$
- binary cross-entropy loss: $L = -\bigl[y \log a + (1 - y)\log(1 - a)\bigr]$
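Before differentiating anything, here is a quick forward-pass sketch of that setup; the values for $w$, $b$, $x$, and $y$ are made up purely for illustration:

```python
import math

# Illustrative values only -- any weight, bias, input, and label would do.
w, b = 0.5, -0.2
x, y = 1.5, 1.0

z = w * x + b                     # linear pre-activation
a = 1 / (1 + math.exp(-z))        # sigmoid activation
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))  # binary cross-entropy

print(z, a, loss)
```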
We want $\partial L / \partial w$ — how does the weight $w$ affect the loss?
The path from $w$ to $L$ goes through $z$, then $a$, then $L$:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$
- $\partial L / \partial a$: how the loss changes with the activation
- $\partial a / \partial z$: the sigmoid derivative
- $\partial z / \partial w$: how the linear output changes with the weight
Now compute each factor:
Factor 1 - $\partial L / \partial a$: differentiate the cross-entropy with respect to the activation.

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
Factor 2 - $\partial a / \partial z$: the sigmoid derivative. Writing $\sigma(z) = 1/(1 + e^{-z})$ and differentiating:

$$\frac{\partial a}{\partial z} = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = a(1 - a)$$

- $a$: the sigmoid output, $a = \sigma(z)$
Factor 3 - $\partial z / \partial w$: how the linear output changes with the weight. Since $z = wx + b$:

$$\frac{\partial z}{\partial w} = x$$
Multiply all three and simplify. The $a(1 - a)$ from the sigmoid derivative cancels beautifully with the denominators from the log derivative:

$$\frac{\partial L}{\partial w} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) \cdot a(1 - a) \cdot x = (a - y)\,x$$

- $a$: the predicted probability
- $y$: the true label
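A minimal check of the simplified result, reusing the illustrative numbers from the forward-pass sketch above: the closed form $(a - y)\,x$ agrees with a finite-difference estimate of $\partial L / \partial w$.

```python
import math

def loss_fn(w, b, x, y):
    z = w * x + b
    a = 1 / (1 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

w, b, x, y = 0.5, -0.2, 1.5, 1.0  # same illustrative values as before

# Analytic gradient from the chain rule: dL/dw = (a - y) * x.
a = 1 / (1 + math.exp(-(w * x + b)))
analytic = (a - y) * x

# Finite-difference estimate of dL/dw, for comparison.
h = 1e-6
numeric = (loss_fn(w + h, b, x, y) - loss_fn(w - h, b, x, y)) / (2 * h)

print(analytic, numeric)  # both roughly -0.549
```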
The Local Gradient Concept
Every node in a computation graph has two roles:
- Forward pass: compute the output from the inputs.
- Backward pass: compute the local gradient (the derivative of the node's output with respect to each of its inputs) and multiply it by the incoming gradient signal.
Key local gradients to memorize:
- Multiply node $z = x \cdot y$: local grad for $x$ is $y$; local grad for $y$ is $x$
- Add node $z = x + y$: local grad for both inputs is $1$
- ReLU: local grad is $1$ if $z > 0$, else $0$
- Sigmoid: local grad is $\sigma(z)\bigl(1 - \sigma(z)\bigr)$
The chain rule says: total gradient = incoming gradient × local gradient. Backpropagation applies this rule at every node, flowing from the loss backward to the inputs.
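Here is a sketch of that rule applied node by node to the single-neuron graph from earlier (multiply, add, sigmoid, cross-entropy); every backward line is just incoming gradient times the local gradient from the list above. The variable names are mine, not a standard API.

```python
import math

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # illustrative values

# ---- forward pass: one value per node ----
m = w * x                          # multiply node
z = m + b                          # add node
a = 1 / (1 + math.exp(-z))         # sigmoid node
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))  # cross-entropy node

# ---- backward pass: incoming gradient * local gradient, node by node ----
dloss = 1.0                                    # dL/dL, the seed gradient
da = dloss * (-(y / a) + (1 - y) / (1 - a))    # cross-entropy node's local grad
dz = da * (a * (1 - a))                        # sigmoid node's local grad
dm = dz * 1.0                                  # add node: local grad is 1
db = dz * 1.0
dw = dm * x                                    # multiply node: local grad for w is x
dx = dm * w                                    # multiply node: local grad for x is w

print(dw, (a - y) * x)  # both match the simplified single-neuron formula
```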
Interactive example (coming soon): backprop through a small graph - click a node to see its local gradient and the signal flowing backward.
Extending to Multiple Layers
For a two-layer network, the gradient for layer 1's weights involves more chain links:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_1}$$

- $W_1$: weights of layer 1
- $a_2$: layer 2 activation
- $z_l$: pre-activation at layer $l$
- $a_l$: post-activation at layer $l$
Working backward through the factors:
- Compute $\partial L / \partial a_2$: loss gradient at layer 2's output, computed first
- Compute $\partial a_2 / \partial z_2$: layer 2 activation derivative
- Compute $\partial z_2 / \partial a_1$: how layer 2's linear output changes with layer 1's output; this is $W_2$
- Compute $\partial a_1 / \partial z_1$: layer 1 activation derivative
- Compute $\partial z_1 / \partial W_1$: how layer 1's linear output changes with its weights; this is $a_0$ (the input)
Notice factor 3: $\partial z_2 / \partial a_1 = W_2$. To propagate gradients backward through a linear layer that used $W$ in the forward pass, you multiply by $W^\top$. This is where the $W^\top$ in backpropagation formulas comes from.
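A NumPy sketch of those five factors for a tiny two-layer sigmoid network with a binary cross-entropy loss (the layer sizes, random weights, and label are assumptions made purely for illustration). The `W2.T` line is exactly where factor 3 enters:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Tiny two-layer network: 3 inputs -> 4 hidden units -> 1 output.
a0 = rng.normal(size=(3, 1))              # input, i.e. the "layer 0 activation"
y = np.array([[1.0]])                     # true label
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# ---- forward pass ----
z1 = W1 @ a0 + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

# ---- backward pass: one line per chain-rule factor ----
dL_da2 = -(y / a2) + (1 - y) / (1 - a2)   # loss gradient at layer 2's output
delta2 = dL_da2 * a2 * (1 - a2)           # times layer 2 activation derivative
dL_da1 = W2.T @ delta2                    # times dz2/da1 -- the W2^T step
delta1 = dL_da1 * a1 * (1 - a1)           # times layer 1 activation derivative
dL_dW1 = delta1 @ a0.T                    # times dz1/dW1 -- outer product with the input

print(dL_dW1.shape)  # (4, 3): one gradient entry per weight in W1
```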
Why It Is Efficient
The brilliant part: you compute these products once, from output to input, reusing results.
When computing layer 2's gradient, you compute the error signal $\delta_2 = \partial L / \partial z_2$. To get layer 1's error signal, you use $\delta_1 = (W_2^\top \delta_2) \odot \sigma'(z_1)$; you do not recompute from the loss. And to get the error signal at layer 0 (the input), you use $W_1^\top \delta_1$. Each layer's computation is proportional to the number of weights in that layer, and the total cost is proportional to the total number of weights, the same order as a single forward pass.
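A sketch of that reuse for an arbitrary stack of sigmoid layers (the layer widths, random weights, and the cross-entropy-at-the-top simplification $\delta = a - y$ are illustrative assumptions): the backward loop builds each layer's error signal from the one above it, touching each weight matrix once, just like the forward loop.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

sizes = [3, 5, 4, 1]                                   # illustrative layer widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

a = rng.normal(size=(sizes[0], 1))                     # input
y = np.array([[1.0]])                                  # true label
activations, zs = [a], []

# Forward pass: one matrix multiply per layer, keeping every intermediate.
for W in Ws:
    z = W @ a
    a = sigmoid(z)
    zs.append(z)
    activations.append(a)

# Error signal at the top; sigmoid output + cross-entropy simplifies to (a - y).
delta = activations[-1] - y

grads = []
# Backward pass: each layer reuses the error signal from the layer above.
for l in reversed(range(len(Ws))):
    grads.append(delta @ activations[l].T)             # gradient for Ws[l]
    if l > 0:
        s = sigmoid(zs[l - 1])
        delta = (Ws[l].T @ delta) * s * (1 - s)        # propagate one layer down

print([g.shape for g in reversed(grads)])  # one gradient matrix per weight matrix
```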
This reuse of intermediate computations is the efficiency gain that makes deep learning feasible.