When Functions Are Nested
The power rule handles $x^2$ or $x^5$ with ease. But what about $(2x+1)^3$? Or $e^{x^2}$? Or $\sqrt{x^2+5}$?
These are compositions - one function inside another - and the power rule alone cannot handle them. You need the chain rule.
The chain rule is probably the single most important differentiation technique in machine learning. Backpropagation - the algorithm that trains every neural network - is the chain rule applied systematically through all layers. Understanding it here means you will understand backprop intuitively, not just mechanically.
The Driving Analogy
Imagine a car where your foot position controls engine RPM, and engine RPM controls wheel speed. Two separate relationships chained together:
- Wheel speed increases by 5 mph per 100 extra RPM
- Engine RPM increases by 200 RPM per inch you press the pedal
How fast do the wheels speed up per inch of pedal? Multiply the two rates:

$$\frac{200 \text{ RPM}}{\text{inch}} \times \frac{5 \text{ mph}}{100 \text{ RPM}} = 10 \text{ mph per inch}$$

That is the chain rule. When two functions are chained, multiply their individual derivatives.
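To make the multiplication concrete, here is a minimal Python sketch of the analogy (the two linear rates are the made-up numbers from above):

```python
# Two chained rates from the driving analogy (illustrative numbers)
rpm_per_inch = 200        # engine: +200 RPM per inch of pedal
mph_per_rpm = 5 / 100     # wheels: +5 mph per 100 RPM

# Chain rule: multiply the rates of the two chained relationships
mph_per_inch = mph_per_rpm * rpm_per_inch
print(mph_per_inch)       # 10.0 mph per inch of pedal
```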
The Formal Rule
For a composition $y = f(g(x))$ - "g of x, fed into f" - the chain rule states:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$

- $f$ - outer function
- $g$ - inner function
- $x$ - variable
In words: derivative of the outside (evaluated at the inside), times the derivative of the inside.
Or more memorably: outer′ × inner′ - but critically, the outer derivative is evaluated at the whole inner function $g(x)$, not at $x$ alone.
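As a sketch of the rule in code (the helper `compose_derivative` below is illustrative, not from any library):

```python
import math

# Build d/dx f(g(x)) = f'(g(x)) * g'(x) from the two pieces
def compose_derivative(f_prime, g, g_prime):
    return lambda x: f_prime(g(x)) * g_prime(x)

# f(u) = u^2, g(x) = sin(x)  =>  d/dx sin(x)^2 = 2*sin(x)*cos(x)
d = compose_derivative(lambda u: 2 * u, math.sin, math.cos)
print(d(1.0))                              # 0.9093...
print(2 * math.sin(1.0) * math.cos(1.0))   # same value
```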
Worked Example 1: $y = (2x+1)^3$
Identify the pieces:
- Outer: $f(u) = u^3$, so $f'(u) = 3u^2$
- Inner: $g(x) = 2x+1$, so $g'(x) = 2$

Apply the chain rule:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x) = 3(2x+1)^2 \cdot 2 = 6(2x+1)^2$$

- $u$ - shorthand for the inner function $2x+1$

We differentiated the outer to get $3u^2$, evaluated it at the full inner expression $2x+1$, and multiplied by the inner derivative $2$.
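You can sanity-check this numerically with a central finite difference (a minimal sketch; the test point $x = 1.5$ is arbitrary):

```python
# Check d/dx (2x+1)^3 against the chain-rule answer 6*(2x+1)^2
f = lambda x: (2 * x + 1) ** 3
x, h = 1.5, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
analytic = 6 * (2 * x + 1) ** 2
print(numeric, analytic)  # both ≈ 96.0
```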
Worked Example 2: $y = e^{x^2}$
Key fact: the derivative of $e^u$ with respect to $u$ is $e^u$ itself - the exponential function is its own derivative.
- Outer: $f(u) = e^u$, so $f'(u) = e^u$
- Inner: $g(x) = x^2$, so $g'(x) = 2x$

Apply the chain rule:

$$\frac{dy}{dx} = e^{x^2} \cdot 2x = 2x\,e^{x^2}$$

- $e$ - Euler's number, base of the natural exponential
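The same kind of check with PyTorch autograd (a sketch; the point $x = 1$ is arbitrary):

```python
import math
import torch

# Verify d/dx e^(x^2) = 2x * e^(x^2) at x = 1
x = torch.tensor(1.0, requires_grad=True)
y = torch.exp(x ** 2)
y.backward()
print(x.grad.item())     # ≈ 5.4366
print(2 * 1.0 * math.e)  # 2e ≈ 5.4366, the analytic answer
```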
Worked Example 3: $y = \sqrt{x^2+5}$
- Outer: $f(u) = \sqrt{u}$, derivative $f'(u) = \frac{1}{2\sqrt{u}}$
- Inner: $g(x) = x^2+5$, derivative $g'(x) = 2x$

$$\frac{dy}{dx} = \frac{1}{2\sqrt{x^2+5}} \cdot 2x = \frac{x}{\sqrt{x^2+5}}$$

- $u$ - the inner expression $x^2+5$
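A symbolic check with SymPy, which applies the chain rule automatically (a minimal sketch):

```python
import sympy as sp

x = sp.symbols('x')
y = sp.sqrt(x ** 2 + 5)
print(sp.diff(y, x))  # x/sqrt(x**2 + 5), matching the hand computation
```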
Interactive example
Chain rule visualizer - see how outer and inner derivatives combine for different compositions
Coming soon
A Key ML Example: The Sigmoid Derivative
The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ is one of the most important functions in neural networks. What is its derivative?
Write $\sigma(x) = (1 + e^{-x})^{-1}$. The outer function is $u^{-1}$, with derivative $-u^{-2}$; the inner function is $1 + e^{-x}$, with derivative $-e^{-x}$ (itself a small chain rule through $-x$). Multiplying:

$$\sigma'(x) = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\bigl(1-\sigma(x)\bigr)$$
The derivative of the sigmoid is the sigmoid times one minus itself. This formula is used in backpropagation every time a sigmoid activation appears - and now you know exactly where it comes from.
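You can confirm the identity at a few points with NumPy (a minimal sketch comparing the formula against a finite difference):

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
formula = sigmoid(x) * (1 - sigmoid(x))                # σ(1-σ)
print(np.allclose(numeric, formula))  # True
```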
Chains Within Chains
You can have triple compositions: $y = f(g(h(x)))$. Apply the chain rule at each level:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

- $f$ - outermost function
- $g$ - middle function
- $h$ - innermost function
Example: $y = e^{(x^2+1)^3}$
- Outer: $f(u) = e^u$, derivative $e^u$
- Middle: $g(v) = v^3$, derivative $3v^2$
- Inner: $h(x) = x^2+1$, derivative $2x$
Result:

$$\frac{dy}{dx} = e^{(x^2+1)^3} \cdot 3(x^2+1)^2 \cdot 2x = 6x\,(x^2+1)^2\,e^{(x^2+1)^3}$$
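Autograd handles the triple chain just as easily (a sketch checking the result at the arbitrary point $x = 0.5$):

```python
import math
import torch

# Verify d/dx e^((x^2+1)^3) = 6x * (x^2+1)^2 * e^((x^2+1)^3) at x = 0.5
x = torch.tensor(0.5, requires_grad=True)
y = torch.exp((x ** 2 + 1) ** 3)
y.backward()

analytic = 6 * 0.5 * (1.25 ** 2) * math.exp(1.25 ** 3)
print(x.grad.item(), analytic)  # both ≈ 33.05
```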
Why the Chain Rule Is Backpropagation
A neural network is a composition: output = layer3(layer2(layer1(input))). To find $\frac{\partial L}{\partial w_1}$ for a weight in layer 1, you chain through all three layers (writing $a_i$ for the activation of layer $i$):

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$$

- $L$ - loss
- $w_1$ - weight in layer 1
Four terms multiplied together - one per link in the chain. Backpropagation is the efficient algorithm for computing all such products without redundant work. The mathematical content is just the chain rule, applied over and over.
```python
import torch

# PyTorch autograd implements the chain rule for any composition
z = torch.tensor(0.0, requires_grad=True)

# Chain: sigmoid(z) = 1 / (1 + exp(-z))
sigma = torch.sigmoid(z)
sigma.backward()
print(f"σ'(0) = {z.grad.item():.4f}")  # → 0.25 (σ(0)·(1-σ(0)) = 0.5·0.5)

# Chain rule through a 3-layer composition
z = torch.tensor(2.0, requires_grad=True)
a1 = torch.relu(z)        # layer 1
a2 = torch.sigmoid(a1)    # layer 2
loss = (a2 - 0.8) ** 2    # squared-error loss
loss.backward()
print(f"∂L/∂z = {z.grad.item():.6f}")  # chain rule applied automatically
```
Interactive example
Backpropagation demo - trace how gradients flow backward through a 3-layer network using the chain rule
Coming soon