
The Training Loop: Five Steps That Train Every Model

A step-by-step walkthrough of the universal training algorithm - initialization, forward pass, loss, backprop, and gradient descent update.


Quick refresher

What a derivative measures

The derivative of a function at a point tells us its slope there — how fast it's changing. If we're minimizing a function, the derivative tells us which direction to step.

Example

If L is a loss function and ∂L/∂w = 3, the loss is increasing as w increases — so we should decrease w.

The Universal Training Algorithm

Every ML model - from linear regression to GPT - trains using some version of the same loop. The details differ enormously, but the skeleton is identical. Understand this loop and you understand how all models train.

Here it is in five steps:

  1. Initialize parameters $\boldsymbol{\theta}$ randomly
  2. Repeat until convergence:
    • a. Sample a batch of training examples
    • b. Forward pass: compute predictions $\hat{\mathbf{y}} = f(\mathbf{X}; \boldsymbol{\theta})$
    • c. Compute loss: $\mathcal{L} = \frac{1}{n}\sum_i L(y_i, \hat{y}_i)$
    • d. Backward pass: compute gradients $\nabla_{\boldsymbol{\theta}} \mathcal{L}$
    • e. Update: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}$
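To make the skeleton concrete, here is a minimal sketch of the loop as NumPy code. The setup is an illustrative assumption, not part of the lesson's formal material: a linear model, squared-error loss, and made-up data and hyperparameters. Each step of the loop maps to one line.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # toy inputs (illustrative)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta = rng.normal(scale=0.01, size=3)             # 1. random initialization
alpha, batch_size = 0.1, 64                        # assumed hyperparameters

for step in range(500):                            # 2. repeat until convergence
    idx = rng.choice(len(X), size=batch_size, replace=False)  # 2a. sample a batch
    Xb, yb = X[idx], y[idx]
    y_hat = Xb @ theta                             # 2b. forward pass
    loss = np.mean((yb - y_hat) ** 2)              # 2c. compute loss (MSE)
    grad = -2 * Xb.T @ (yb - y_hat) / batch_size   # 2d. gradient of the loss
    theta -= alpha * grad                          # 2e. gradient descent update
```

For a linear model the gradient has a closed form, so step 2d is one line; for deep networks that line is what backpropagation computes.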

Let us walk through each step - not just what it does, but why.

Step 1: Random Initialization

We start with random parameters. Why not all zeros?

If all weights start identical, every neuron in a layer produces the same output for any input. Every neuron gets the same gradient. Every neuron updates the same way. They never diverge - the entire layer behaves like a single neuron repeated.

This symmetry is the mathematical reason random initialization is required: random starting values give each neuron a different gradient, so neurons can specialize.

We initialize with small values (e.g., drawn from a Gaussian with $\sigma = 0.01$) to avoid saturating activation functions before training even begins.
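As a sketch, here is what that initialization looks like in NumPy for a single dense layer; the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 784, 128                 # assumed layer sizes

# Small random weights break the symmetry between neurons; sigma = 0.01
# keeps pre-activations near zero, away from the flat (saturated) regions
# of activations like sigmoid or tanh.
W = rng.normal(loc=0.0, scale=0.01, size=(n_in, n_out))
b = np.zeros(n_out)                    # biases can start at zero: the random
                                       # weights already break symmetry
```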

Step 2a: Sample a Batch

Rather than processing all training data at once, we process small batches - typically 32 to 256 examples. This is called mini-batch gradient descent.

This mini-batch strategy is the universal training approach in modern ML: each batch gives a cheap, slightly noisy estimate of the full-dataset gradient, so we get many more updates per unit of compute than full-batch gradient descent.

On GPU hardware, batches of 64 examples often run nearly as fast as a single example because the hardware processes them in parallel.
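A minimal sketch of mini-batch sampling, assuming the dataset fits in memory as NumPy arrays; real frameworks wrap this in data-loader utilities.

```python
import numpy as np

def sample_batch(X, y, batch_size, rng):
    """Draw a random mini-batch without replacement from the dataset."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 20)), rng.normal(size=10_000)  # toy data
Xb, yb = sample_batch(X, y, batch_size=64, rng=rng)            # (64, 20), (64,)
```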

Step 2b: Forward Pass

Apply the model to the batch: compute $\hat{\mathbf{y}} = f(\mathbf{X}; \boldsymbol{\theta})$. For a neural network this means computing each layer in sequence: multiply weights, add bias, apply activation - passing activations forward through the network.

Critically, we cache all intermediate values (activations, pre-activations). The backward pass needs these cached values to compute gradients efficiently via the chain rule. Without caching, we would need to recompute the entire forward pass.
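Here is a sketch of a forward pass that caches intermediates, for an assumed two-layer network with ReLU activations; the parameter names (`W1`, `b1`, `W2`, `b2`) are invented for illustration.

```python
import numpy as np

def forward(X, params):
    """Two-layer network: linear -> ReLU -> linear.
    Returns predictions plus cached intermediates for the backward pass."""
    z1 = X @ params["W1"] + params["b1"]      # layer-1 pre-activations
    a1 = np.maximum(z1, 0.0)                  # layer-1 activations (ReLU)
    y_hat = a1 @ params["W2"] + params["b2"]  # layer-2 output
    cache = {"X": X, "z1": z1, "a1": a1}      # saved so backprop need not
    return y_hat, cache                       # recompute the forward pass
```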

Step 2c: Compute Loss

Compare predictions to true labels and compute the loss:

$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$

  • $n$ - number of examples in this batch
  • $y_i$ - true label for example $i$
  • $\hat{y}_i$ - model prediction for example $i$

One number. Lower means better predictions. This is our signal for how the current parameters are performing.
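As a concrete sketch with mean squared error and made-up numbers:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Average squared error over the batch: one scalar signal."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 2.0])   # illustrative labels
y_pred = np.array([0.9, 0.2, 1.5])   # illustrative predictions
print(mse_loss(y_true, y_pred))      # (0.01 + 0.04 + 0.25) / 3 = 0.1
```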

Step 2d: Backward Pass (Backpropagation)

Here calculus earns its keep. We compute the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$: how does the loss change if we adjust each parameter by a tiny amount?

$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \left[\frac{\partial \mathcal{L}}{\partial \theta_1}, \frac{\partial \mathcal{L}}{\partial \theta_2}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_p}\right]$

  • $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ - gradient of the loss with respect to all parameters - a vector with one entry per parameter
  • $\partial \mathcal{L} / \partial \theta_j$ - how much the loss changes per unit change in parameter $\theta_j$

For a model with 1 million parameters, this is a 1-million-dimensional vector. The backpropagation algorithm computes all these partial derivatives efficiently using the chain rule, propagating error signals backward through the network layers. We cover backprop in the neural networks unit.
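Backprop itself is covered later, but a finite-difference sketch shows what the gradient vector means: nudge each parameter and watch the loss. This is only a checking tool - it needs two loss evaluations per parameter, while backprop gets all the partial derivatives in roughly one backward pass.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[j] += eps
        down[j] -= eps
        grad[j] = (loss_fn(up) - loss_fn(down)) / (2 * eps)
    return grad

# Toy loss L(theta) = theta_0**2 + 3*theta_1, so the exact gradient
# at [1, 0] is [2, 3].
loss = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(loss, np.array([1.0, 0.0])))  # approx [2.0, 3.0]
```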

Step 2e: Parameter Update

We now know the gradient - which direction is "uphill" on the loss surface. We step in the opposite direction to go downhill:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}$

  • $\boldsymbol{\theta}$ - current parameter values
  • $\alpha$ - learning rate - controls step size, typically 0.001 to 0.1
  • $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ - gradient vector - points uphill, so we subtract it

The symbol $\alpha$ (alpha) is the learning rate. Too large: we overshoot the minimum, bouncing back and forth. Too small: we take forever to converge. Typical values range from $10^{-4}$ to $10^{-1}$.

This single update rule is applied to every parameter simultaneously. Every weight and bias adjusts by its partial derivative, scaled by $\alpha$.

The update works because the gradient is a local linear approximation of the loss surface: the negative gradient is the direction of steepest local descent, so a small enough step reduces the loss.
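A sketch of the update applied to every parameter at once, assuming parameters and gradients are stored as matching dicts of arrays (as in the forward-pass sketch above):

```python
def sgd_update(params, grads, alpha=0.01):
    """One gradient descent step: theta <- theta - alpha * grad, in place."""
    for name in params:
        params[name] -= alpha * grads[name]  # same rule for every weight and bias
    return params
```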

Repeat Until Convergence

Steps 2a through 2e repeat for many iterations. Each full pass through the training data is one epoch. After enough epochs, the loss plateaus - we have found a local minimum.

Convergence, in plain terms, is when the model stops getting noticeably better: each update moves the parameters by a smaller and smaller amount, and the loss changes by less than some small threshold (e.g., 0.0001 per epoch). In practice you set an early-stopping patience: if validation loss has not improved in, say, 10 consecutive epochs, stop training. This prevents wasted compute and guards against overfitting.

In practice, we also monitor validation loss on held-out data to detect overfitting. If training loss keeps falling but validation loss starts rising, training should stop early.
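Here is a runnable sketch of that early-stopping logic with a patience of 10 epochs; the validation-loss function is a stand-in for a real evaluation, not an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(epoch):
    """Stand-in for real evaluation: loss decays, then plateaus with noise."""
    return 1.0 / (1 + epoch) + 0.01 * rng.random()

best_val, stale_epochs, patience = float("inf"), 0, 10

for epoch in range(1000):
    # ...steps 2a-2e over the training data would run here...
    val = validation_loss(epoch)
    if val < best_val - 1e-4:         # "noticeably better" threshold
        best_val, stale_epochs = val, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:  # no improvement in 10 consecutive epochs
            print(f"stopping early at epoch {epoch}")
            break
```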

