Backpropagation
Lesson 2 ⏱ 14 min

The chain rule in networks

Video coming soon

Chain Rule in Neural Networks: Tracing Gradients Backward

Step-by-step derivation of gradients through a single neuron using the chain rule, extending to multiple layers, and the concept of local gradients at each computation node.

⏱ ~8 min


Quick refresher

Chain rule

For f(g(x)), the derivative is f'(g(x)) times g'(x). Derivative of outside (at inside) times derivative of inside.

Example

d/dx (2x+1)³ = 3(2x+1)² · 2 = 6(2x+1)².
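If you want to check the example rather than take it on faith, a quick symbolic pass in Python with sympy (assuming the library is available; this snippet is illustrative and not part of the lesson) reproduces the result:

import sympy as sp

x = sp.symbols("x")
# Differentiate (2x+1)^3; sympy applies the chain rule for us.
# The printed expression equals 6*(2*x + 1)**2, matching the hand computation.
print(sp.diff((2 * x + 1) ** 3, x))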

The Chain Rule, Refreshed

When you have a composition of functions, derivatives multiply. For y = f(g(x)):

Plain-language intuition first: Imagine you are pulling a lever that controls a gear, which controls a second gear, which moves a platform. If moving the lever 1 cm turns the first gear by 2 degrees, and turning the first gear by 1 degree turns the second gear by 3 degrees, then moving the lever 1 cm causes 2 × 3 = 6 degrees of rotation in the second gear. That's the chain rule: effects along a chain of dependencies multiply. In a neural network, moving a weight changes a pre-activation, which changes an activation, which changes a loss. Each change is a local ratio — a derivative — and they multiply to give the total effect.

\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}

f: outer function
g: inner function

Extend this to any number of nested functions. For y = f(g(h(x))):

\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}

f: outermost function
g: middle function
h: innermost function

Every step along the chain contributes its own derivative, and they all multiply together. This is the entire mathematical machinery behind backpropagation.

A deep network is just functions stacked inside other functions. The derivative of that stack is a product of each layer's local derivative — and that's exactly what the chain rule computes. Without it, there would be no principled way to ask "how does changing this weight in layer 1 affect the final loss?" With it, the answer falls out automatically, layer by layer.
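To see how effects along the chain multiply in running code, here is a small numerical sketch in plain Python (the functions f, g, h below are arbitrary choices made for illustration) that compares the three-factor chain rule against a finite-difference estimate:

import math

# y = f(g(h(x))) with f(u) = sin(u), g(u) = u^2, h(u) = 3u + 1
h = lambda x: 3 * x + 1
g = lambda u: u ** 2
f = lambda u: math.sin(u)

# Local derivative of each link in the chain
dh = lambda x: 3.0
dg = lambda u: 2 * u
df = lambda u: math.cos(u)

x = 0.7

# Chain rule: dy/dx = f'(g(h(x))) * g'(h(x)) * h'(x)
chain = df(g(h(x))) * dg(h(x)) * dh(x)

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)

print(chain, numeric)  # the two values agree to roughly six decimal places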

Tracing Gradients Through One Neuron

Let's be concrete. One neuron with cross-entropy loss:

z = wx + b, \quad a = \sigma(z), \quad L = -y\log(a) - (1-y)\log(1-a)

z: linear pre-activation, wx + b
a: sigmoid activation, \sigma(z)
L: binary cross-entropy loss

We want \partial L / \partial w: how does the weight w affect the loss?

The path from w to L goes through z, then a, then L:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

\partial L / \partial a: how the loss changes with the activation
\partial a / \partial z: the sigmoid derivative
\partial z / \partial w: how the linear output changes with the weight

Now compute each factor:

Factor 1 - \partial L / \partial a: differentiate the cross-entropy with respect to the activation.

\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}

Factor 2 - \partial a / \partial z: the sigmoid derivative.

\frac{\partial a}{\partial z} = \sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1-a)
a: sigmoid output, \sigma(z)

Factor 3 - \partial z / \partial w: how the linear output changes with the weight. Since z = wx + b:

\frac{\partial z}{\partial w} = x

Multiply all three and simplify. The sigmoid derivative a(1-a) cancels neatly against the 1/a and 1/(1-a) factors from the cross-entropy:

-\frac{y}{a} \cdot a(1-a) + \frac{1-y}{1-a} \cdot a(1-a) = -y(1-a) + (1-y)a = a - y

\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x

\hat{y}: predicted probability, equal to a
y: true label
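The closed-form result is easy to sanity-check. A minimal sketch in plain Python, with x, y, w and b chosen arbitrarily for illustration, compares (a - y) * x against a finite-difference estimate of the loss:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(w * x + b)  # forward pass through the single neuron
    return -y * math.log(a) - (1 - y) * math.log(1 - a)  # binary cross-entropy

x, y = 1.5, 1.0   # one training example
w, b = 0.3, -0.2  # current parameters

# Analytic gradient from the derivation above: dL/dw = (a - y) * x
a = sigmoid(w * x + b)
analytic = (a - y) * x

# Finite-difference estimate of the same gradient
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places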

The Local Gradient Concept

Every node in a computation graph has two roles:

  1. Forward pass: compute the output from the inputs.
  2. Backward pass: compute the local gradient (the derivative of its output with respect to each input) and multiply it by the incoming gradient signal.

Key local gradients to memorize:

  • Multiply node z = w \cdot x: local grad for w is x; local grad for x is w
  • Add node z = a + b: local grad for both inputs is 1
  • ReLU: local grad is 1 if z > 0, else 0
  • Sigmoid: local grad is a(1-a)

The chain rule says: total gradient = incoming gradient × local gradient. Backpropagation applies this rule at every node, flowing from the loss backward to the inputs.
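A minimal sketch of this two-role idea in plain Python (the class names Multiply, Add and ReLU are invented here for illustration): each node caches what it needs during the forward pass, and backward returns incoming gradient times local gradient.

class Multiply:
    def forward(self, w, x):
        self.w, self.x = w, x  # cache inputs for the backward pass
        return w * x

    def backward(self, grad_out):
        # local grad w.r.t. w is x, local grad w.r.t. x is w
        return grad_out * self.x, grad_out * self.w

class Add:
    def forward(self, a, b):
        return a + b

    def backward(self, grad_out):
        # local grad for both inputs is 1
        return grad_out, grad_out

class ReLU:
    def forward(self, z):
        self.z = z  # cache the pre-activation
        return max(z, 0.0)

    def backward(self, grad_out):
        # local grad is 1 if z > 0, else 0
        return grad_out * (1.0 if self.z > 0 else 0.0)

Chaining the backward calls from the loss node toward the inputs is backpropagation in miniature.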

Interactive example

Backprop through a small graph - click a node to see its local gradient and the signal flowing backward

Coming soon

Extending to Multiple Layers

For a two-layer network, the gradient for layer 1's weights involves more chain links:

\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}

W^{(1)}: weights of layer 1
a^{(2)}: layer 2 activation
z^{(l)}: pre-activation at layer l
a^{(l)}: post-activation at layer l

Working backward through the factors:

  1. \partial L / \partial a^{(2)}: loss gradient at layer 2's output, computed first
  2. \partial a^{(2)} / \partial z^{(2)}: layer 2's activation derivative
  3. \partial z^{(2)} / \partial a^{(1)}: how layer 2's linear output changes with layer 1's output; this is W^{(2)}
  4. \partial a^{(1)} / \partial z^{(1)}: layer 1's activation derivative
  5. \partial z^{(1)} / \partial W^{(1)}: how layer 1's linear output changes with its weights; this is a^{(0)}, the input

Notice factor 3: \partial z^{(2)} / \partial a^{(1)} = W^{(2)}. To propagate gradients backward through a linear layer that used W^{(2)} in the forward pass, you multiply by W^{(2)\top}. This is where the transposed weight matrix in backpropagation formulas comes from.
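The five factors can be written out directly with NumPy. The sketch below assumes a tiny two-layer network with sigmoid activations and a squared-error loss (choices made only to keep the example short; the chain of factors is the same for other losses), and checks one entry of the layer-1 gradient against a finite difference:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units -> 2 outputs
x = rng.normal(size=3)
y = np.array([1.0, 0.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass, factor by factor
dL_da2 = a2 - y                    # factor 1: gradient of 0.5*||a2 - y||^2
delta2 = dL_da2 * a2 * (1 - a2)    # fold in factor 2: dL/dz2
dL_da1 = W2.T @ delta2             # factor 3: multiply by W2 transposed
delta1 = dL_da1 * a1 * (1 - a1)    # factor 4: layer 1 activation derivative
dL_dW1 = np.outer(delta1, x)       # factor 5: outer product with the input

# Finite-difference check on a single weight of layer 1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
Lp = 0.5 * np.sum((sigmoid(W2 @ sigmoid(Wp @ x + b1) + b2) - y) ** 2)
Lm = 0.5 * np.sum((sigmoid(W2 @ sigmoid(Wm @ x + b1) + b2) - y) ** 2)
print(dL_dW1[0, 0], (Lp - Lm) / (2 * eps))  # the two values agree closely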

Why It Is Efficient

The brilliant part: you compute these products once, from output to input, reusing results.

When computing layer 2's gradient, you compute the error signal \delta^{(2)} = \partial L / \partial z^{(2)}. To get layer 1's error signal \delta^{(1)}, you reuse \delta^{(2)}; you do not recompute anything from the loss. And to get layer 0's error signal, you reuse \delta^{(1)}. Each layer's computation is O(n^2) in the layer size, and the total cost is O(\text{total parameters}), the same order as a single forward pass.

This reuse of intermediate computations is the efficiency gain that makes deep learning feasible.
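In code, the reuse is one backward loop in which each layer's delta is built from the next layer's delta and never recomputed from the loss. A schematic sketch, assuming sigmoid activations and values cached from a forward pass (the argument names here are placeholders, not from the lesson; bias gradients are omitted for brevity):

import numpy as np

def backward(Ws, zs, acts, x, dL_d_output, sigmoid_prime):
    # Ws[l]: weight matrix of layer l; zs[l], acts[l]: cached pre- and post-activations.
    grads = [None] * len(Ws)
    delta = dL_d_output * sigmoid_prime(zs[-1])  # error signal of the last layer
    for l in reversed(range(len(Ws))):
        inp = acts[l - 1] if l > 0 else x        # input that fed layer l
        grads[l] = np.outer(delta, inp)          # dL/dW for layer l
        if l > 0:
            # Reuse delta: one multiply by W[l] transposed, one activation derivative.
            delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads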

Quiz

1 / 3

For the single neuron z=wx+b, a=σ(z), L=loss(a,y): ∂L/∂w = ...