When Functions Are Nested
The power rule handles $x^2$ or $x^5$ with ease. But what about $(2x+1)^3$? Or $e^{x^2}$? Or $\sqrt{x^2+5}$?
These are compositions - one function inside another - and the power rule alone cannot handle them. You need the chain rule.
The chain rule is probably the single most important differentiation technique in machine learning. Backpropagation - the algorithm that trains every neural network - is the chain rule applied systematically through all layers. Understanding it here means you will understand backprop intuitively, not just mechanically.
The Driving Analogy
Imagine a car where your foot position controls engine RPM, and engine RPM controls wheel speed. Two separate relationships chained together:
- Wheel speed increases by 5 mph per 100 extra RPM
- Engine RPM increases by 200 RPM per inch you press the pedal
How fast do the wheels speed up per inch of pedal? Multiply the two rates:

$$\frac{200 \text{ RPM}}{\text{inch}} \times \frac{5 \text{ mph}}{100 \text{ RPM}} = 10 \text{ mph per inch}$$

That is the chain rule. When two functions are chained, multiply their individual derivatives.
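To make the multiplication concrete, here is a minimal Python sketch of the analogy (the two linear rates are the made-up numbers from above):

```python
# Two chained rates from the driving analogy (illustrative numbers)
rpm_per_inch = 200        # engine: +200 RPM per inch of pedal
mph_per_rpm = 5 / 100     # wheels: +5 mph per 100 RPM

# Chain rule: multiply the rates of the two chained relationships
mph_per_inch = mph_per_rpm * rpm_per_inch
print(mph_per_inch)       # 10.0 mph per inch of pedal
```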
The Formal Rule
For a composition $y = f(g(x))$ - "g of x, fed into f" - the chain rule states:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$

- $f$ - outer function
- $g$ - inner function
- $x$ - variable
In words: derivative of the outside (evaluated at the inside), times the derivative of the inside.
Or more memorably: outer′ × inner′ - but critically, the outer derivative is evaluated at the whole inner function $g(x)$, not at $x$ alone.
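As a sketch of the rule in code (the helper `compose_derivative` below is illustrative, not from any library):

```python
import math

# Build d/dx f(g(x)) = f'(g(x)) * g'(x) from the two pieces
def compose_derivative(f_prime, g, g_prime):
    return lambda x: f_prime(g(x)) * g_prime(x)

# f(u) = u^2, g(x) = sin(x)  =>  d/dx sin(x)^2 = 2*sin(x)*cos(x)
d = compose_derivative(lambda u: 2 * u, math.sin, math.cos)
print(d(1.0))                              # 0.9093...
print(2 * math.sin(1.0) * math.cos(1.0))   # same value
```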
Worked Example 1: $y = (2x+1)^3$
Identify the pieces:
- Outer: $f(u) = u^3$, so $f'(u) = 3u^2$
- Inner: $g(x) = 2x+1$, so $g'(x) = 2$

Apply the chain rule:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x) = 3(2x+1)^2 \cdot 2 = 6(2x+1)^2$$

- $u$ - shorthand for the inner function $2x+1$

We differentiated the outer to get $3u^2$, evaluated it at the full inner expression $2x+1$, and multiplied by the inner derivative $2$.
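You can sanity-check this numerically with a central finite difference (a minimal sketch; the test point $x = 1.5$ is arbitrary):

```python
# Check d/dx (2x+1)^3 against the chain-rule answer 6*(2x+1)^2
f = lambda x: (2 * x + 1) ** 3
x, h = 1.5, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central difference
analytic = 6 * (2 * x + 1) ** 2
print(numeric, analytic)  # both ≈ 96.0
```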
Worked Example 2: $y = e^{x^2}$
Key fact: the derivative of $e^u$ with respect to $u$ is $e^u$ itself - the exponential function is its own derivative.
- Outer: $f(u) = e^u$, so $f'(u) = e^u$
- Inner: $g(x) = x^2$, so $g'(x) = 2x$

Apply the chain rule:

$$\frac{dy}{dx} = e^{x^2} \cdot 2x = 2x\,e^{x^2}$$

- $e$ - Euler's number, base of the natural exponential
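The same kind of check with PyTorch autograd (a sketch; the point $x = 1$ is arbitrary):

```python
import math
import torch

# Verify d/dx e^(x^2) = 2x * e^(x^2) at x = 1
x = torch.tensor(1.0, requires_grad=True)
y = torch.exp(x ** 2)
y.backward()
print(x.grad.item())     # ≈ 5.4366
print(2 * 1.0 * math.e)  # 2e ≈ 5.4366, the analytic answer
```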
Worked Example 3: $y = \sqrt{x^2+5}$
- Outer: $f(u) = \sqrt{u}$, derivative $f'(u) = \frac{1}{2\sqrt{u}}$
- Inner: $g(x) = x^2+5$, derivative $g'(x) = 2x$

$$\frac{dy}{dx} = \frac{1}{2\sqrt{x^2+5}} \cdot 2x = \frac{x}{\sqrt{x^2+5}}$$

- $u$ - the inner expression $x^2+5$
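A symbolic check with SymPy, which applies the chain rule automatically (a minimal sketch):

```python
import sympy as sp

x = sp.symbols('x')
y = sp.sqrt(x ** 2 + 5)
print(sp.diff(y, x))  # x/sqrt(x**2 + 5), matching the hand computation
```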
Interactive example
Chain rule visualizer - see how outer and inner derivatives combine for different compositions
Coming soon
A Key ML Example: The Sigmoid Derivative
The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ is one of the most important functions in neural networks. What is its derivative?
Write $\sigma(x) = (1 + e^{-x})^{-1}$. The outer function is $u^{-1}$, with derivative $-u^{-2}$; the inner function is $1 + e^{-x}$, with derivative $-e^{-x}$ (itself a small chain rule through $-x$). Multiplying:

$$\sigma'(x) = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\bigl(1-\sigma(x)\bigr)$$
The derivative of the sigmoid is the sigmoid times one minus itself. This formula is used in backpropagation every time a sigmoid activation appears - and now you know exactly where it comes from.
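You can confirm the identity at a few points with NumPy (a minimal sketch comparing the formula against a finite difference):

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
formula = sigmoid(x) * (1 - sigmoid(x))                # σ(1-σ)
print(np.allclose(numeric, formula))  # True
```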
Chains Within Chains
You can have triple compositions: $y = f(g(h(x)))$. Apply the chain rule at each level:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

- $f$ - outermost function
- $g$ - middle function
- $h$ - innermost function
Example: $y = e^{(x^2+1)^3}$
- Outer: $f(u) = e^u$, derivative $e^u$
- Middle: $g(v) = v^3$, derivative $3v^2$
- Inner: $h(x) = x^2+1$, derivative $2x$
Result:

$$\frac{dy}{dx} = e^{(x^2+1)^3} \cdot 3(x^2+1)^2 \cdot 2x = 6x\,(x^2+1)^2\,e^{(x^2+1)^3}$$
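Autograd handles the triple chain just as easily (a sketch checking the result at the arbitrary point $x = 0.5$):

```python
import math
import torch

# Verify d/dx e^((x^2+1)^3) = 6x * (x^2+1)^2 * e^((x^2+1)^3) at x = 0.5
x = torch.tensor(0.5, requires_grad=True)
y = torch.exp((x ** 2 + 1) ** 3)
y.backward()

analytic = 6 * 0.5 * (1.25 ** 2) * math.exp(1.25 ** 3)
print(x.grad.item(), analytic)  # both ≈ 33.05
```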
Why the Chain Rule Is Backpropagation
A neural network is a composition: output = layer3(layer2(layer1(input))). To find $\frac{\partial L}{\partial w_1}$ for a weight in layer 1, you chain through all three layers (writing $a_i$ for the activation of layer $i$):

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$$

- $L$ - loss
- $w_1$ - weight in layer 1
Four terms multiplied together - one per link in the chain. Backpropagation is the efficient algorithm for computing all such products without redundant work. The mathematical content is just the chain rule, applied over and over.
```python
import torch

# PyTorch autograd implements the chain rule for any composition
z = torch.tensor(0.0, requires_grad=True)

# Chain: sigmoid(z) = 1 / (1 + exp(-z))
sigma = torch.sigmoid(z)
sigma.backward()
print(f"σ'(0) = {z.grad.item():.4f}")  # → 0.25 (σ(0)·(1-σ(0)) = 0.5·0.5)

# Chain rule through a 3-layer composition
z = torch.tensor(2.0, requires_grad=True)
a1 = torch.relu(z)        # layer 1
a2 = torch.sigmoid(a1)    # layer 2
loss = (a2 - 0.8) ** 2    # squared-error loss
loss.backward()
print(f"∂L/∂z = {z.grad.item():.6f}")  # chain rule applied automatically
```
Interactive example
Backpropagation demo - trace how gradients flow backward through a 3-layer network using the chain rule
Coming soon