Setting the Stage
Think about how you would tune 330,000 dials by hand. Each dial is a weight. The question you need to answer for each one: "should I turn this clockwise or counterclockwise, and by how much?" If you just guessed randomly, it would take longer than the age of the universe to find a good combination. You need a systematic signal telling you the direction.
That signal is the gradient. The gradient of the loss with respect to a weight tells you exactly: "if I increase this weight by a tiny amount, the loss increases (or decreases) by this much." That's the directional instruction gradient descent needs. Without it, you can't train.
A business analogy: imagine 330,000 pricing knobs at a large retailer. You want to set them to maximize revenue. One way to test each knob: change it slightly, measure revenue, put it back, move to the next. That's 330,000 experiments for one round of adjustments. The neural network equivalent — numerical differentiation — has exactly this scaling problem. Backpropagation is the discovery that you can efficiently infer the effect of all 330,000 knobs simultaneously, in one measurement sweep.
You know the gradient descent update rule:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$

where:
- $w$: a single weight parameter
- $\eta$: the learning rate
- $L$: the loss function
For a single logistic regression neuron with two or three weights, computing $\partial L / \partial w$ is a few lines of calculus. Done.
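As a concrete sketch (with made-up numbers), here is that single-neuron case in Python. For a sigmoid neuron with binary cross-entropy loss, the gradient has the closed form $\partial L / \partial w_j = (\sigma(w \cdot x + b) - y)\, x_j$, and the update rule applies it directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2])   # one training example (hypothetical values)
y = 1.0                     # its label
w = np.array([0.1, 0.3])    # the two weights
b = 0.0
lr = 0.1                    # learning rate (eta)

p = sigmoid(w @ x + b)      # forward pass: predicted probability
grad_w = (p - y) * x        # dL/dw from the closed-form derivative above
grad_b = p - y              # dL/db

w -= lr * grad_w            # the gradient descent update rule
b -= lr * grad_b
```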
Backpropagation is the algorithm that makes the same computation tractable for neural networks: without an efficient way to compute gradients across millions of weights, deep learning would be computationally impossible.
But now you have a neural network. A modest one: 3 hidden layers of 256 neurons each, doing 10-class classification of a 784-dimensional input. Parameter count: $784 \times 256 + 256 \times 256 + 256 \times 256 + 256 \times 10$ weights, plus biases, for roughly 330,000 parameters.
The update rule still says "compute $\partial L / \partial w$ for every parameter." That is 330,000 partial derivatives, per training example, per training step. How?
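For reference, a quick back-of-the-envelope check of that count, assuming the architecture is 784 → 256 → 256 → 256 → 10 with one weight matrix and one bias vector per layer:

```python
# Rough parameter count for an assumed 784 -> 256 -> 256 -> 256 -> 10 network
# (one weight matrix plus one bias vector per layer).
layers = [784, 256, 256, 256, 10]
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
print(params)  # 335114 -- in the same ballpark as the ~330,000 used above
```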
The Naive Approach: Numerical Differentiation
The definition of a derivative is a limit, which you can approximate with a small finite step:

$$\frac{\partial L}{\partial w} \approx \frac{L(w + \epsilon) - L(w)}{\epsilon}$$

where:
- $\epsilon$: a tiny perturbation, e.g. 0.0001
- $L$: the loss function
- $w$: the weight being differentiated
In principle, you could apply this to every weight: perturb it by a tiny amount, run the forward pass, measure how much the loss changes. Repeat for all 330,000 weights.
That means 330,000 forward passes per gradient computation. For a network taking 1 ms per forward pass: 330,000 × 1 ms = 5.5 minutes per gradient update. For 100,000 training steps: over 1 year per training run. For a network with hundreds of millions of parameters, multiply accordingly.
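A minimal sketch of this naive scheme makes the cost explicit (loss_fn and params are hypothetical stand-ins for a real network's loss and flattened parameter vector): one extra forward pass per parameter.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-4):
    """Estimate dL/dw for every parameter by finite differences."""
    grads = np.zeros_like(params)
    base = loss_fn(params)                 # baseline forward pass
    for i in range(len(params)):           # one extra forward pass PER parameter
        perturbed = params.copy()
        perturbed[i] += eps
        grads[i] = (loss_fn(perturbed) - base) / eps
    return grads

# Tiny usage example with a toy quadratic "loss"
params = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda p: np.sum(p ** 2), params))  # ~[2, -4, 1]
```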
This cost scales catastrophically with parameter count. We need something smarter.
The Key Insight: Structure of the Computation
The loss is not an arbitrary function of all the weights. It has specific structure: it is a composition of functions.
Starting from the input, you apply $W_1$ and $b_1$, then ReLU, then $W_2$ and $b_2$, then ReLU, and so on, up to the final loss function. Each step is a simple, differentiable operation.
The calculus rule for compositions is the chain rule:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

where:
- $f$: the outer function
- $g$: the inner function
- $x$: the input
The key insight: instead of computing each partial derivative independently (requiring one independent pass per parameter), you can reuse intermediate computations. Once you know the gradient of the loss with respect to layer $k$'s output, you can use that, plus layer $k$'s local derivative, to get the gradient with respect to layer $k-1$'s output. And so on, backward.
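To see the chain rule at work on the smallest possible example, take an assumed two-step composition $g(x) = x^2$ and $f(u) = \sin u$: the analytic derivative $f'(g(x)) \cdot g'(x)$ agrees with a finite-difference check.

```python
import math

def composed(x):
    return math.sin(x ** 2)          # f(g(x)) with g(x) = x^2, f(u) = sin(u)

x = 1.3
analytic = math.cos(x ** 2) * 2 * x  # f'(g(x)) * g'(x), via the chain rule

eps = 1e-6
numeric = (composed(x + eps) - composed(x - eps)) / (2 * eps)
print(analytic, numeric)             # the two values agree to several decimals
```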
Backpropagation: One Pass, All Gradients
The algorithm (Rumelhart, Hinton, and Williams, 1986) formalizes this idea. It computes all gradients in a single backward pass:
- Run the forward pass to compute predictions and cache all intermediate values.
- Compute the loss from the prediction and true label.
- Run the backward pass: starting from the loss, propagate gradient signals backward layer by layer, applying the chain rule at each step.
- After the backward pass, every weight has its gradient ready.
Total cost: one forward pass plus one backward pass, and the backward pass costs roughly as much as a forward pass. A training step is therefore a small constant multiple of a single forward pass, regardless of whether the network has 1,000 parameters or 1 billion.
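Here is a minimal sketch of those four steps on a tiny two-layer network in NumPy (sizes and initialization are arbitrary, and biases are omitted for brevity): the forward pass caches its intermediates, and the backward pass reuses each upstream gradient as it moves toward the input.

```python
import numpy as np

# Tiny network: x -> W1, ReLU -> W2 -> squared-error loss (assumed setup).
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))            # input
y = rng.normal(size=(2,))            # target
W1 = rng.normal(size=(3, 4)) * 0.1
W2 = rng.normal(size=(2, 3)) * 0.1

# Forward pass: cache z1, h1, y_hat for the backward pass
z1 = W1 @ x
h1 = np.maximum(z1, 0.0)             # ReLU
y_hat = W2 @ h1
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: apply the chain rule layer by layer, back to front
d_yhat = y_hat - y                   # dL/dy_hat
dW2 = np.outer(d_yhat, h1)           # dL/dW2, reusing d_yhat
d_h1 = W2.T @ d_yhat                 # dL/dh1, reusing d_yhat
d_z1 = d_h1 * (z1 > 0)               # backprop through ReLU
dW1 = np.outer(d_z1, x)              # dL/dW1

# Every weight now has its gradient, after one forward + one backward pass.
```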
Automatic Differentiation
In practice, you do not implement backpropagation by hand. Modern frameworks (PyTorch, TensorFlow, JAX) implement automatic differentiation.
During the forward pass, the framework silently records every operation in a computation graph: each node is an operation (matrix multiply, add, ReLU, log), and each edge records which values fed into it. When you call loss.backward() in PyTorch, the framework traverses this graph backward, computing local gradients at each node and multiplying them via the chain rule.
All gradients are computed exactly (not approximately) and stored in the .grad attribute of each tensor. You write the forward pass; the framework handles the backward pass for free.
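A minimal PyTorch sketch of that workflow (the architecture mirrors the 784 → 256 → 256 → 256 → 10 example above; the data here is random placeholder input):

```python
import torch
import torch.nn as nn

# Assumed architecture matching the earlier example; inputs and labels are fake.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 784)            # a batch of placeholder inputs
y = torch.randint(0, 10, (32,))     # placeholder labels

logits = model(x)                   # forward pass: the graph is recorded here
loss = loss_fn(logits, y)
loss.backward()                     # backward pass: all gradients in one call

print(model[0].weight.grad.shape)   # torch.Size([256, 784]) -- exact, not approximate
```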