Backpropagation
Lesson 2 ⏱ 14 min

The chain rule in networks

Video coming soon

Chain Rule in Neural Networks: Tracing Gradients Backward

Step-by-step derivation of gradients through a single neuron using the chain rule, extending to multiple layers, and the concept of local gradients at each computation node.

⏱ ~8 min


Quick refresher

Chain rule

For f(g(x)), the derivative is f'(g(x)) times g'(x). Derivative of outside (at inside) times derivative of inside.

Example

d/dx (2x+1)³ = 3(2x+1)² · 2 = 6(2x+1)².
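If you want to check the example rather than take it on faith, a quick symbolic pass in Python with sympy (assuming the library is available; this snippet is illustrative and not part of the lesson) reproduces the result:

import sympy as sp

x = sp.symbols("x")
# Differentiate (2x+1)^3; sympy applies the chain rule for us.
# The printed expression equals 6*(2*x + 1)**2, matching the hand computation.
print(sp.diff((2 * x + 1) ** 3, x))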

The Chain Rule, Refreshed

When you have a composition of functions, derivatives multiply. For y = f(g(x)):

Plain-language intuition first: Imagine you are pulling a lever that controls a gear, which controls a second gear, which moves a platform. If moving the lever 1 cm turns the first gear by 2 degrees, and turning the first gear by 1 degree turns the second gear by 3 degrees, then moving the lever 1 cm causes 2 × 3 = 6 degrees of rotation in the second gear. That's the chain rule: effects along a chain of dependencies multiply. In a neural network, moving a weight changes a pre-activation, which changes an activation, which changes a loss. Each change is a local ratio — a derivative — and they multiply to give the total effect.

\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}

f: outer function
g: inner function

Extend this to any number of nested functions. For y = f(g(h(x))):

\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}

f: outermost function
g: middle function
h: innermost function

Every step along the chain contributes its own derivative, and they all multiply together. This is the entire mathematical machinery behind backpropagation.

A deep network is just functions stacked inside other functions. The derivative of that stack is a product of each layer's local derivative — and that's exactly what the chain rule computes. Without it, there would be no principled way to ask "how does changing this weight in layer 1 affect the final loss?" With it, the answer falls out automatically, layer by layer.
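To see how effects along the chain multiply in running code, here is a small numerical sketch in plain Python (the functions f, g, h below are arbitrary choices made for illustration) that compares the three-factor chain rule against a finite-difference estimate:

import math

# y = f(g(h(x))) with f(u) = sin(u), g(u) = u^2, h(u) = 3u + 1
h = lambda x: 3 * x + 1
g = lambda u: u ** 2
f = lambda u: math.sin(u)

# Local derivative of each link in the chain
dh = lambda x: 3.0
dg = lambda u: 2 * u
df = lambda u: math.cos(u)

x = 0.7

# Chain rule: dy/dx = f'(g(h(x))) * g'(h(x)) * h'(x)
chain = df(g(h(x))) * dg(h(x)) * dh(x)

# Finite-difference check of the same derivative
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)

print(chain, numeric)  # the two values agree to roughly six decimal places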

Tracing Gradients Through One Neuron

Let's be concrete. One neuron with cross-entropy loss:

z = wx + b, \quad a = \sigma(z), \quad L = -y\log(a) - (1-y)\log(1-a)

z: linear pre-activation, wx + b
a: sigmoid activation, \sigma(z)
L: binary cross-entropy loss

We want \partial L / \partial w: how does the weight w affect the loss?

The path from w to L goes through z, then a, then L:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

\partial L / \partial a: how the loss changes with the activation
\partial a / \partial z: the sigmoid derivative
\partial z / \partial w: how the linear output changes with the weight

Now compute each factor:

Factor 1 - \partial L / \partial a: differentiate the cross-entropy with respect to the activation.

\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}

Factor 2 - \partial a / \partial z: the sigmoid derivative.

\frac{\partial a}{\partial z} = \sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1-a)
a: sigmoid output, \sigma(z)

Factor 3 - \partial z / \partial w: how the linear output changes with the weight. Since z = wx + b:

\frac{\partial z}{\partial w} = x

Multiply all three and simplify. The sigmoid derivative a(1-a) cancels neatly against the 1/a and 1/(1-a) factors from the cross-entropy:

-\frac{y}{a} \cdot a(1-a) + \frac{1-y}{1-a} \cdot a(1-a) = -y(1-a) + (1-y)a = a - y

\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x

\hat{y}: predicted probability, equal to a
y: true label
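The closed-form result is easy to sanity-check. A minimal sketch in plain Python, with x, y, w and b chosen arbitrarily for illustration, compares (a - y) * x against a finite-difference estimate of the loss:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    a = sigmoid(w * x + b)  # forward pass through the single neuron
    return -y * math.log(a) - (1 - y) * math.log(1 - a)  # binary cross-entropy

x, y = 1.5, 1.0   # one training example
w, b = 0.3, -0.2  # current parameters

# Analytic gradient from the derivation above: dL/dw = (a - y) * x
a = sigmoid(w * x + b)
analytic = (a - y) * x

# Finite-difference estimate of the same gradient
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places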

The Local Gradient Concept

Every node in a computation graph has two roles:

  1. Forward pass: compute the output from the inputs.
  2. Backward pass: compute the local gradient (the derivative of its output with respect to each input) and multiply it by the incoming gradient signal.

Key local gradients to memorize:

  • Multiply node z = w \cdot x: local grad for w is x; local grad for x is w
  • Add node z = a + b: local grad for both inputs is 1
  • ReLU: local grad is 1 if z > 0, else 0
  • Sigmoid: local grad is a(1-a)

The chain rule says: total gradient = incoming gradient × local gradient. Backpropagation applies this rule at every node, flowing from the loss backward to the inputs.
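A minimal sketch of this two-role idea in plain Python (the class names Multiply, Add and ReLU are invented here for illustration): each node caches what it needs during the forward pass, and backward returns incoming gradient times local gradient.

class Multiply:
    def forward(self, w, x):
        self.w, self.x = w, x  # cache inputs for the backward pass
        return w * x

    def backward(self, grad_out):
        # local grad w.r.t. w is x, local grad w.r.t. x is w
        return grad_out * self.x, grad_out * self.w

class Add:
    def forward(self, a, b):
        return a + b

    def backward(self, grad_out):
        # local grad for both inputs is 1
        return grad_out, grad_out

class ReLU:
    def forward(self, z):
        self.z = z  # cache the pre-activation
        return max(z, 0.0)

    def backward(self, grad_out):
        # local grad is 1 if z > 0, else 0
        return grad_out * (1.0 if self.z > 0 else 0.0)

Chaining the backward calls from the loss node toward the inputs is backpropagation in miniature.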

Interactive example

Backprop through a small graph - click a node to see its local gradient and the signal flowing backward

Coming soon

Extending to Multiple Layers

For a two-layer network, the gradient for layer 1's weights involves more chain links:

\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(2)}} \cdot \frac{\partial a^{(2)}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}

W^{(1)}: weights of layer 1
a^{(2)}: layer 2 activation
z^{(l)}: pre-activation at layer l
a^{(l)}: post-activation at layer l

Working backward through the factors:

  1. \partial L / \partial a^{(2)}: loss gradient at layer 2's output, computed first
  2. \partial a^{(2)} / \partial z^{(2)}: layer 2's activation derivative
  3. \partial z^{(2)} / \partial a^{(1)}: how layer 2's linear output changes with layer 1's output; this is W^{(2)}
  4. \partial a^{(1)} / \partial z^{(1)}: layer 1's activation derivative
  5. \partial z^{(1)} / \partial W^{(1)}: how layer 1's linear output changes with its weights; this is a^{(0)}, the input

Notice factor 3: \partial z^{(2)} / \partial a^{(1)} = W^{(2)}. To propagate gradients backward through a linear layer that used W^{(2)} in the forward pass, you multiply by W^{(2)\top}. This is where the transposed weight matrix in backpropagation formulas comes from.
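The five factors can be written out directly with NumPy. The sketch below assumes a tiny two-layer network with sigmoid activations and a squared-error loss (choices made only to keep the example short; the chain of factors is the same for other losses), and checks one entry of the layer-1 gradient against a finite difference:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 hidden units -> 2 outputs
x = rng.normal(size=3)
y = np.array([1.0, 0.0])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass, factor by factor
dL_da2 = a2 - y                    # factor 1: gradient of 0.5*||a2 - y||^2
delta2 = dL_da2 * a2 * (1 - a2)    # fold in factor 2: dL/dz2
dL_da1 = W2.T @ delta2             # factor 3: multiply by W2 transposed
delta1 = dL_da1 * a1 * (1 - a1)    # factor 4: layer 1 activation derivative
dL_dW1 = np.outer(delta1, x)       # factor 5: outer product with the input

# Finite-difference check on a single weight of layer 1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
Lp = 0.5 * np.sum((sigmoid(W2 @ sigmoid(Wp @ x + b1) + b2) - y) ** 2)
Lm = 0.5 * np.sum((sigmoid(W2 @ sigmoid(Wm @ x + b1) + b2) - y) ** 2)
print(dL_dW1[0, 0], (Lp - Lm) / (2 * eps))  # the two values agree closely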

Why It Is Efficient

The brilliant part: you compute these products once, from output to input, reusing results.

When computing layer 2's gradient, you compute the error signal \delta^{(2)} = \partial L / \partial z^{(2)}. To get layer 1's error signal \delta^{(1)}, you reuse \delta^{(2)}; you do not recompute anything from the loss. And to get layer 0's error signal, you reuse \delta^{(1)}. Each layer's computation is O(n^2) in the layer size, and the total cost is O(\text{total parameters}), the same order as a single forward pass.

This reuse of intermediate computations is the efficiency gain that makes deep learning feasible.
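In code, the reuse is one backward loop in which each layer's delta is built from the next layer's delta and never recomputed from the loss. A schematic sketch, assuming sigmoid activations and values cached from a forward pass (the argument names here are placeholders, not from the lesson; bias gradients are omitted for brevity):

import numpy as np

def backward(Ws, zs, acts, x, dL_d_output, sigmoid_prime):
    # Ws[l]: weight matrix of layer l; zs[l], acts[l]: cached pre- and post-activations.
    grads = [None] * len(Ws)
    delta = dL_d_output * sigmoid_prime(zs[-1])  # error signal of the last layer
    for l in reversed(range(len(Ws))):
        inp = acts[l - 1] if l > 0 else x        # input that fed layer l
        grads[l] = np.outer(delta, inp)          # dL/dW for layer l
        if l > 0:
            # Reuse delta: one multiply by W[l] transposed, one activation derivative.
            delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grads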

Quiz

1 / 3

For the single neuron z=wx+b, a=σ(z), L=loss(a,y): ∂L/∂w = ...