Backpropagation
Lesson 1 ⏱ 10 min

Which weights caused the error?


The Backpropagation Problem: Why Naive Gradient Computation Fails

How numerical differentiation scales catastrophically with parameter count, and the key insight that the composition structure of networks allows all gradients to be computed in one backward pass.

⏱ ~6 min

🧮

Quick refresher

Partial derivatives and gradient descent

The gradient of a loss L with respect to weight w tells you which direction to nudge w to decrease L. Gradient descent update: w ← w - α(∂L/∂w). Computing this for every weight is the core challenge of training.
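As a toy illustration (not from the lesson), here is the update rule applied to a one-dimensional loss L(w) = (w − 3)², whose derivative 2(w − 3) we can write by hand:

```python
# Toy 1-D example: L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3).
def dL_dw(w):
    return 2 * (w - 3)

w = 0.0      # initial weight
alpha = 0.1  # learning rate

for _ in range(100):
    w = w - alpha * dL_dw(w)  # w <- w - alpha * dL/dw

print(w)  # converges toward the minimum at w = 3
```

Each step nudges w in the direction that lowers L; with many weights, the same update is applied to every parameter using its own partial derivative.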

Example

A network with 330,000 parameters needs 330,000 partial derivatives per training step.

Numerical estimation would require 330,000 forward passes.

Backprop computes all of them in one backward pass.

Setting the Stage

Think about how you would tune 330,000 dials by hand. Each dial is a weight. The question you need to answer for each one: "should I turn this clockwise or counterclockwise, and by how much?" If you just guessed randomly, it would take longer than the age of the universe to find a good combination. You need a systematic signal telling you the direction.

That signal is the gradient. The gradient of the loss with respect to a weight tells you exactly: "if I increase this weight by a tiny amount, the loss increases (or decreases) by this much." That's the directional instruction gradient descent needs. Without it, you can't train.

A business analogy: imagine 330,000 pricing knobs at a large retailer. You want to set them to maximize revenue. One way to test each knob: change it slightly, measure revenue, put it back, move to the next. That's 330,000 experiments for one round of adjustments. The neural network equivalent — numerical differentiation — has exactly this scaling problem. Backpropagation is the discovery that you can efficiently infer the effect of all 330,000 knobs simultaneously, in one measurement sweep.

You know the gradient descent update rule:

w \leftarrow w - \alpha \cdot \frac{\partial L}{\partial w}
w
a single weight parameter
\alpha
learning rate
L
loss function

For a single logistic regression neuron with two or three weights, computing \partial L / \partial w is a few lines of calculus. Done.

Backpropagation is the algorithm that trains neural networks. Without an efficient way to compute gradients across millions of weights, deep learning would be computationally impossible — backprop is what makes it feasible.

But now you have a neural network. A modest one: 3 hidden layers of 256 neurons each, classifying 784-dimensional inputs into 10 classes. Parameter count:

784 \times 256 + 256 \times 256 + 256 \times 256 + 256 \times 10 \approx 330{,}000 \text{ parameters}

The update rule still says "compute \partial L / \partial w for every parameter." That is 330,000 partial derivatives, per training example, per training step. How?
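The arithmetic can be checked in a couple of lines (weight matrices only; the bias vectors add a few hundred more parameters and don't change the order of magnitude):

```python
# Layer widths from the text: 784 -> 256 -> 256 -> 256 -> 10.
sizes = [784, 256, 256, 256, 10]

# Each consecutive pair contributes an (in x out) weight matrix.
weights = sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(weights)  # 334336, i.e. roughly 330,000
```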

The Naive Approach: Numerical Differentiation

The definition of the derivative as a limit suggests a finite-difference approximation:

\frac{\partial L}{\partial w} \approx \frac{L(w + \varepsilon) - L(w)}{\varepsilon}
\varepsilon
tiny perturbation, e.g. 0.0001
L
loss function
w
weight being differentiated

In principle, you could apply this to every weight. Perturb w_1 by a tiny amount, run the forward pass, measure how much the loss changes. Repeat for w_2, w_3, \ldots, w_{330{,}000}.

That means 330,000 forward passes per gradient computation. For a network taking 1 ms per forward pass: 330,000 × 1 ms = 5.5 minutes per gradient update. For 100,000 training steps: over 1 year per training run. For a network with hundreds of millions of parameters, multiply accordingly.

This cost scales catastrophically with parameter count. We need something smarter.
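A sketch of the naive procedure, with a toy quadratic loss standing in for a real network (the loss function and shapes here are illustrative, not from the lesson):

```python
import numpy as np

# Finite-difference gradient: one baseline forward pass, then one
# extra forward pass PER PARAMETER.
def numerical_gradient(loss_fn, w, eps=1e-4):
    grad = np.zeros_like(w)
    base = loss_fn(w)                 # one forward pass
    for i in range(w.size):           # ...plus one more per weight
        w[i] += eps
        grad[i] = (loss_fn(w) - base) / eps
        w[i] -= eps                   # restore the weight
    return grad

# Toy "network": loss = sum of squared weights, true gradient = 2*w.
w = np.array([1.0, -2.0, 0.5])
grad = numerical_gradient(lambda v: np.sum(v ** 2), w)
print(grad)  # approximately [2, -4, 1]
```

The loop body is the problem: one extra forward pass per parameter, so the cost grows linearly with parameter count.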

The Key Insight: Structure of the Computation

The loss L is not an arbitrary function of all the weights. It has specific structure: it is a composition of functions.

Starting from the input, you apply W^{(1)} then b^{(1)} then ReLU, then W^{(2)} then b^{(2)} then ReLU, and so on to the final loss function. Each step is a simple, differentiable operation.

The calculus rule for compositions is the chain rule:

\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)
f
outer function
g
inner function
x
input

The key insight: instead of computing each partial derivative independently (which would require independent passes), you can reuse intermediate computations. Once you know the gradient of the loss with respect to layer l's output, you can use it, together with layer l's local derivative, to get the gradient with respect to layer l-1's output. And so on, backward through the network.
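A minimal numeric sketch of this reuse, assuming two bias-free linear layers and a squared-norm loss (all names and shapes here are made up for illustration):

```python
import numpy as np

# h1 = W1 x, h2 = W2 h1, L = 0.5 * ||h2||^2 (toy setup).
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

h1 = W1 @ x
h2 = W2 @ h1

dL_dh2 = h2                    # gradient of L wrt the last layer's output
dL_dh1 = W2.T @ dL_dh2         # reuse it for the previous layer's output
dL_dW1 = np.outer(dL_dh1, x)   # ...and from that, the weight gradient

# Sanity check one entry against a central finite difference.
eps = 1e-6
W1[0, 0] += eps
L_plus = 0.5 * np.sum((W2 @ (W1 @ x)) ** 2)
W1[0, 0] -= 2 * eps
L_minus = 0.5 * np.sum((W2 @ (W1 @ x)) ** 2)
W1[0, 0] += eps                # restore
print(dL_dW1[0, 0], (L_plus - L_minus) / (2 * eps))  # should match
```

Note that dL_dh2 is computed once and reused for every earlier layer, rather than being recomputed per weight.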

Backpropagation: One Pass, All Gradients

The algorithm (Rumelhart, Hinton, and Williams, 1986) formalizes this idea. It computes all gradients in a single backward pass:

  1. Run the forward pass to compute predictions and cache all intermediate values.
  2. Compute the loss from the prediction and true label.
  3. Run the backward pass: starting from the loss, propagate gradient signals backward layer by layer, applying the chain rule at each step.
  4. After the backward pass, every weight has its gradient \partial L / \partial w ready.

Total cost: one forward pass + one backward pass. The backward pass costs roughly 2× a forward pass, so one training step costs about 3× a single forward pass, regardless of whether the network has 1,000 parameters or 1 billion.
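The four steps above can be sketched with NumPy for a hypothetical one-hidden-layer ReLU network with a squared-error loss (layer sizes and data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)                # toy input
y = rng.normal(size=2)                # toy target
W1 = rng.normal(size=(8, 5)) * 0.1
W2 = rng.normal(size=(2, 8)) * 0.1

# 1. Forward pass, caching intermediate values.
z1 = W1 @ x
h1 = np.maximum(z1, 0.0)              # ReLU
yhat = W2 @ h1

# 2. Compute the loss.
L = 0.5 * np.sum((yhat - y) ** 2)

# 3. Backward pass: apply the chain rule layer by layer.
d_yhat = yhat - y                     # dL/dyhat
dW2 = np.outer(d_yhat, h1)            # dL/dW2
d_h1 = W2.T @ d_yhat                  # propagate to the hidden layer
d_z1 = d_h1 * (z1 > 0)                # through the ReLU
dW1 = np.outer(d_z1, x)               # dL/dW1

# 4. Every weight has its gradient; take a gradient descent step.
alpha = 0.1
W1 -= alpha * dW1
W2 -= alpha * dW2
```

One forward sweep, one backward sweep, and every weight gradient is in hand, whatever the parameter count.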

Automatic Differentiation

In practice, you do not implement backpropagation by hand. Modern frameworks — PyTorch, TensorFlow, JAX — implement reverse-mode automatic differentiation.

During the forward pass, the framework silently records every operation in a computation graph: each node is an operation (matrix multiply, add, ReLU, log), and each edge records which values fed into it. When you call loss.backward() in PyTorch, the framework traverses this graph backward, computing local gradients at each node and multiplying them via the chain rule.

All gradients are computed exactly (not approximately) and stored in the .grad attribute of each tensor. You write the forward pass; the framework handles the backward pass for free.
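A minimal sketch in PyTorch (shapes chosen to match the lesson's 784-dimensional, 10-class setup; the single linear layer is illustrative):

```python
import torch

x = torch.randn(1, 784)               # one toy input
y = torch.randint(0, 10, (1,))        # one toy class label

W = torch.randn(784, 10, requires_grad=True)
b = torch.zeros(10, requires_grad=True)

logits = x @ W + b                    # forward pass, recorded in a graph
loss = torch.nn.functional.cross_entropy(logits, y)

loss.backward()                       # one backward pass fills every .grad
print(W.grad.shape, b.grad.shape)     # torch.Size([784, 10]) torch.Size([10])
```

You wrote only the forward computation; the graph traversal and chain-rule bookkeeping happened inside `loss.backward()`.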

Quiz

1 / 3

Why can't we compute gradients for deep networks by perturbing each weight numerically?