
The Chain Rule: Derivatives of Nested Functions

Intuition through the driving analogy, three worked examples, the sigmoid derivative, and how backpropagation is the chain rule applied in layers.


Quick refresher

The power rule

d/dx xⁿ = n·xⁿ⁻¹: the exponent becomes a multiplier and then decreases by 1. Constants differentiate to 0. Sum rule: differentiate each term separately.

Example

d/dx x³ = 3x².

d/dx (5x² + 2x) = 10x + 2.
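A quick numerical sanity check of the refresher (a minimal sketch in plain Python; the point x = 3.0 and the step h are arbitrary choices):

# Centered finite difference vs. the power-rule answer for 5x² + 2x
def f(x):
    return 5 * x**2 + 2 * x

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 3.0
print(numerical_derivative(f, x))  # ≈ 32.0
print(10 * x + 2)                  # exactly 32.0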

When Functions Are Nested

The power rule handles $x^2$ or $x^5$ with ease. But what about $(2x+1)^2$? Or $e^{x^2}$? Or $\sqrt{x^2 + 3x}$?

These are compositions - one function inside another - and the power rule alone cannot handle them. You need the chain rule.

The chain rule is probably the single most important differentiation technique in machine learning. Backpropagation - the algorithm that trains every neural network - is the chain rule applied systematically through all layers. Understanding it here means you will understand backprop intuitively, not just mechanically.

The Driving Analogy

Imagine a car where your foot position controls engine RPM, and engine RPM controls wheel speed. Two separate relationships chained together:

  • Wheel speed increases by 5 mph per 100 extra RPM
  • Engine RPM increases by 200 RPM per inch you press the pedal

How fast do the wheels speed up per inch of pedal? Multiply the two rates:

5\,\text{mph}/100\,\text{RPM} \times 200\,\text{RPM}/\text{inch} = 10\,\text{mph}/\text{inch}

You multiply the two rates. That is the chain rule. When two functions are chained, multiply their individual derivatives.
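The same multiplication in code (a trivial sketch; the rates are just the numbers from the analogy):

wheel_speed_per_rpm = 5 / 100        # mph gained per extra RPM
rpm_per_inch = 200                   # RPM gained per inch of pedal
wheel_speed_per_inch = wheel_speed_per_rpm * rpm_per_inch
print(wheel_speed_per_inch)          # 10.0 mph per inch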

The Formal Rule

For a composition $f(g(x))$ - "g of x, fed into f":

\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)
where $f$ is the outer function, $g$ is the inner function, and $x$ is the variable.

In words: derivative of the outside (evaluated at the inside), times the derivative of the inside.

Or more memorably: outer′ × inner′ - but critically, the outer derivative is evaluated at the whole inner function, not at $x$ alone.
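One way to see the rule before the worked examples is to compare a numerical derivative of a composition against the product $f'(g(x)) \cdot g'(x)$ (a minimal sketch; the choice of $f = \sin$ and $g = x^2$ is arbitrary):

import math

f, fp = math.sin, math.cos          # outer function and its derivative
g  = lambda x: x ** 2               # inner function
gp = lambda x: 2 * x                # inner derivative

x, h = 1.5, 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # derivative of the composition
chain   = fp(g(x)) * gp(x)                        # f'(g(x)) * g'(x)
print(numeric, chain)                             # both ≈ -1.8846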

Worked Example 1: $\frac{d}{dx}(2x+1)^3$

Identify the pieces:

  • Outer: $f(u) = u^3$, so $f'(u) = 3u^2$
  • Inner: $g(x) = 2x+1$, so $g'(x) = 2$

Apply the chain rule:

\frac{d}{dx}(2x+1)^3 = 3(2x+1)^2 \cdot 2 = 6(2x+1)^2

(Here $u$ is shorthand for the inner function $2x+1$.)

We differentiated the outer $()^3$ to get $3()^2$, evaluated it at the full inner expression $(2x+1)$, and multiplied by the inner derivative 2.
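Autograd agrees (a sketch using PyTorch, which appears later in this lesson; the point x = 1.0 is arbitrary):

import torch

x = torch.tensor(1.0, requires_grad=True)
y = (2 * x + 1) ** 3
y.backward()
print(x.grad.item())            # 54.0
print(6 * (2 * 1.0 + 1) ** 2)   # 6(2x+1)² at x=1 is also 54.0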

Worked Example 2: $\frac{d}{dx} e^{x^2}$

Key fact: the derivative of $e^u$ with respect to $u$ is $e^u$ itself - the exponential function is its own derivative.

  • Outer: $e^{()}$, so $f'(u) = e^u$
  • Inner: $x^2$, so $g'(x) = 2x$

\frac{d}{dx} e^{x^2} = e^{x^2} \cdot 2x = 2x\,e^{x^2}

($e$ is Euler's number - the base of the natural exponential.)

Worked Example 3: $\frac{d}{dx}(x^2+5)^4$

  • Outer: $()^4$, derivative $4()^3$
  • Inner: $x^2+5$, derivative $2x$

\frac{d}{dx}(x^2+5)^4 = 4(x^2+5)^3 \cdot 2x = 8x(x^2+5)^3

(Here $u$ is the inner expression $x^2+5$.)
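The same autograd check covers both of the last two results (a sketch; the point x = 0.5 is arbitrary):

import torch

x = torch.tensor(0.5, requires_grad=True)

y = torch.exp(x ** 2)        # Worked Example 2
y.backward()
print(x.grad.item())         # ≈ 1.2840, i.e. 2x·e^(x²) at x=0.5

x.grad = None                # reset before the next check
y = (x ** 2 + 5) ** 4        # Worked Example 3
y.backward()
print(x.grad.item())         # ≈ 578.8125, i.e. 8x(x²+5)³ at x=0.5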

Interactive example (coming soon): chain rule visualizer - see how outer and inner derivatives combine for different compositions.

A Key ML Example: The Sigmoid Derivative

The sigmoid function $\sigma(z) = \frac{1}{1+e^{-z}}$ is one of the most important functions in neural networks. What is its derivative?

Write $\sigma(z) = (1 + e^{-z})^{-1}$. The outer function is $u^{-1}$, with derivative $-u^{-2}$; the inner function is $1 + e^{-z}$, with derivative $-e^{-z}$ (itself a mini chain rule: $\frac{d}{dz} e^{-z} = e^{-z} \cdot (-1)$). Multiplying the pieces:

\sigma'(z) = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}

A little algebra turns this into the famous form:

\sigma'(z) = \sigma(z)\,(1 - \sigma(z))

The derivative of the sigmoid is the sigmoid times one minus itself. This formula is used in backpropagation every time a sigmoid activation appears - and now you know exactly where it comes from.
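A direct numerical check of $\sigma'(z) = \sigma(z)(1-\sigma(z))$ (a minimal sketch in plain Python; z = 1.0 is an arbitrary point):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

z, h = 1.0, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
closed  = sigmoid(z) * (1 - sigmoid(z))
print(numeric, closed)   # both ≈ 0.196612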

Chains Within Chains

You can have triple compositions: $f(g(h(x)))$. Apply the chain rule at each level:

\frac{d}{dx} f(g(h(x))) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)
where $f$ is the outermost function, $g$ the middle function, and $h$ the innermost function.

Example: $\frac{d}{dx} \sin((3x+1)^2)$

  • Outer: $\sin()$, derivative $\cos()$
  • Middle: $()^2$, derivative $2()$
  • Inner: $3x+1$, derivative $3$

Result:

\cos((3x+1)^2) \cdot 2(3x+1) \cdot 3 = 6(3x+1)\cos((3x+1)^2)
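Verifying the triple chain numerically (a sketch; x = 0.2 is an arbitrary point):

import math

def composed(x):
    return math.sin((3 * x + 1) ** 2)

x, h = 0.2, 1e-6
numeric = (composed(x + h) - composed(x - h)) / (2 * h)
closed  = 6 * (3 * x + 1) * math.cos((3 * x + 1) ** 2)
print(numeric, closed)   # both ≈ -8.02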

Why the Chain Rule Is Backpropagation

A neural network is a composition: output = layer3(layer2(layer1(input))). To find $\partial L / \partial w$ for a weight in layer 1, you chain through all three layers:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a_3} \cdot \frac{\partial a_3}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w}

where $L$ is the loss, $w$ is a weight in layer 1, and $a_i$ is the activation of layer $i$.

Four terms multiplied together - one per link in the chain. Backpropagation is the efficient algorithm for computing all such products without redundant work. The mathematical content is just the chain rule, applied over and over.

import torch

# PyTorch autograd implements the chain rule for any composition
z = torch.tensor(0.0, requires_grad=True)

# Chain: sigmoid(z) = 1 / (1 + exp(-z))
sigma = torch.sigmoid(z)
sigma.backward()
print(f"σ'(0) = {z.grad.item():.4f}")   # → 0.25  (σ(0)·(1-σ(0)) = 0.5·0.5)

# Chain rule through a 3-layer composition
z = torch.tensor(2.0, requires_grad=True)
a1 = torch.relu(z)           # layer 1
a2 = torch.sigmoid(a1)       # layer 2
loss = (a2 - 0.8) ** 2       # MSE loss
loss.backward()
print(f"∂L/∂z = {z.grad.item():.6f}")  # chain rule applied automatically

Interactive example (coming soon): backpropagation demo - trace how gradients flow backward through a 3-layer network using the chain rule.

Quiz


Using the chain rule, what is d/dx (2x + 1)³?