
The Training Loop: Five Steps That Train Every Model

A step-by-step walkthrough of the universal training algorithm - initialization, forward pass, loss, backprop, and gradient descent update.


Quick refresher

What a derivative measures

The derivative of a function at a point tells us its slope there — how fast it's changing. If we're minimizing a function, the derivative tells us which direction to step.

Example

If L is a loss function and ∂L/∂w = 3, the loss is increasing as w increases — so we should decrease w.

The Universal Training Algorithm

Every ML model - from linear regression to GPT - trains using some version of the same loop. The details differ enormously, but the skeleton is identical. Understand this loop and you understand how all models train.

Here it is in five steps:

  1. Initialize parameters $\boldsymbol{\theta}$ randomly
  2. Repeat until convergence:
    • a. Sample a batch of training examples
    • b. Forward pass: compute predictions $\hat{\mathbf{y}} = f(\mathbf{X}; \boldsymbol{\theta})$
    • c. Compute loss: $\mathcal{L} = \frac{1}{n}\sum_i L(y_i, \hat{y}_i)$
    • d. Backward pass: compute gradients $\nabla_{\boldsymbol{\theta}} \mathcal{L}$
    • e. Update: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}$
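To make the skeleton concrete, here is a minimal sketch of the loop as NumPy code. The setup is an illustrative assumption, not part of the lesson's formal material: a linear model, squared-error loss, and made-up data and hyperparameters. Each step of the loop maps to one line.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # toy inputs (illustrative)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta = rng.normal(scale=0.01, size=3)             # 1. random initialization
alpha, batch_size = 0.1, 64                        # assumed hyperparameters

for step in range(500):                            # 2. repeat until convergence
    idx = rng.choice(len(X), size=batch_size, replace=False)  # 2a. sample a batch
    Xb, yb = X[idx], y[idx]
    y_hat = Xb @ theta                             # 2b. forward pass
    loss = np.mean((yb - y_hat) ** 2)              # 2c. compute loss (MSE)
    grad = -2 * Xb.T @ (yb - y_hat) / batch_size   # 2d. gradient of the loss
    theta -= alpha * grad                          # 2e. gradient descent update
```

For a linear model the gradient has a closed form, so step 2d is one line; for deep networks that line is what backpropagation computes.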

Let us walk through each step - not just what it does, but why.

Step 1: Random Initialization

We start with random parameters. Why not all zeros?

If all weights start identical, every neuron in a layer produces the same output for any input. Every neuron gets the same gradient. Every neuron updates the same way. They never diverge - the entire layer behaves like a single neuron repeated.

This symmetry is the mathematical reason random initialization is required: random starting values give each neuron a different gradient, so neurons can specialize.

We initialize with small values (e.g., drawn from a Gaussian with $\sigma = 0.01$) to avoid saturating activation functions before training even begins.
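As a sketch, here is what that initialization looks like in NumPy for a single dense layer; the layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_out = 784, 128                 # assumed layer sizes

# Small random weights break the symmetry between neurons; sigma = 0.01
# keeps pre-activations near zero, away from the flat (saturated) regions
# of activations like sigmoid or tanh.
W = rng.normal(loc=0.0, scale=0.01, size=(n_in, n_out))
b = np.zeros(n_out)                    # biases can start at zero: the random
                                       # weights already break symmetry
```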

Step 2a: Sample a Batch

Rather than processing all training data at once, we process small batches - typically 32 to 256 examples. This is called mini-batch gradient descent.

This mini-batch strategy is the universal training approach in modern ML: each batch gives a cheap, slightly noisy estimate of the full-dataset gradient, so we get many more updates per unit of compute than full-batch gradient descent.

On GPU hardware, batches of 64 examples often run nearly as fast as a single example because the hardware processes them in parallel.
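A minimal sketch of mini-batch sampling, assuming the dataset fits in memory as NumPy arrays; real frameworks wrap this in data-loader utilities.

```python
import numpy as np

def sample_batch(X, y, batch_size, rng):
    """Draw a random mini-batch without replacement from the dataset."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10_000, 20)), rng.normal(size=10_000)  # toy data
Xb, yb = sample_batch(X, y, batch_size=64, rng=rng)            # (64, 20), (64,)
```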

Step 2b: Forward Pass

Apply the model to the batch: compute $\hat{\mathbf{y}} = f(\mathbf{X}; \boldsymbol{\theta})$. For a neural network this means computing each layer in sequence: multiply weights, add bias, apply activation - passing activations forward through the network.

Critically, we cache all intermediate values (activations, pre-activations). The backward pass needs these cached values to compute gradients efficiently via the chain rule. Without caching, we would need to recompute the entire forward pass.
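Here is a sketch of a forward pass that caches intermediates, for an assumed two-layer network with ReLU activations; the parameter names (`W1`, `b1`, `W2`, `b2`) are invented for illustration.

```python
import numpy as np

def forward(X, params):
    """Two-layer network: linear -> ReLU -> linear.
    Returns predictions plus cached intermediates for the backward pass."""
    z1 = X @ params["W1"] + params["b1"]      # layer-1 pre-activations
    a1 = np.maximum(z1, 0.0)                  # layer-1 activations (ReLU)
    y_hat = a1 @ params["W2"] + params["b2"]  # layer-2 output
    cache = {"X": X, "z1": z1, "a1": a1}      # saved so backprop need not
    return y_hat, cache                       # recompute the forward pass
```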

Step 2c: Compute Loss

Compare predictions to true labels and compute the loss:

$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)$

  • $n$ - number of examples in this batch
  • $y_i$ - true label for example $i$
  • $\hat{y}_i$ - model prediction for example $i$

One number. Lower means better predictions. This is our signal for how the current parameters are performing.
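As a concrete sketch with mean squared error and made-up numbers:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Average squared error over the batch: one scalar signal."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 0.0, 2.0])   # illustrative labels
y_pred = np.array([0.9, 0.2, 1.5])   # illustrative predictions
print(mse_loss(y_true, y_pred))      # (0.01 + 0.04 + 0.25) / 3 = 0.1
```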

Step 2d: Backward Pass (Backpropagation)

Here calculus earns its keep. We compute the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$: how does the loss change if we adjust each parameter by a tiny amount?

$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \left[\frac{\partial \mathcal{L}}{\partial \theta_1}, \frac{\partial \mathcal{L}}{\partial \theta_2}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_p}\right]$

  • $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ - gradient of the loss with respect to all parameters - a vector with one entry per parameter
  • $\partial \mathcal{L} / \partial \theta_j$ - how much the loss changes per unit change in parameter $\theta_j$

For a model with 1 million parameters, this is a 1-million-dimensional vector. The backpropagation algorithm computes all these partial derivatives efficiently using the chain rule, propagating error signals backward through the network layers. We cover backprop in the neural networks unit.
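Backprop itself is covered later, but a finite-difference sketch shows what the gradient vector means: nudge each parameter and watch the loss. This is only a checking tool - it needs two loss evaluations per parameter, while backprop gets all the partial derivatives in roughly one backward pass.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-6):
    """Approximate each partial derivative by nudging one parameter at a time."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[j] += eps
        down[j] -= eps
        grad[j] = (loss_fn(up) - loss_fn(down)) / (2 * eps)
    return grad

# Toy loss L(theta) = theta_0**2 + 3*theta_1, so the exact gradient
# at [1, 0] is [2, 3].
loss = lambda t: t[0] ** 2 + 3 * t[1]
print(numerical_gradient(loss, np.array([1.0, 0.0])))  # approx [2.0, 3.0]
```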

Step 2e: Parameter Update

We now know the gradient - which direction is "uphill" on the loss surface. We step in the opposite direction to go downhill:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}$

  • $\boldsymbol{\theta}$ - current parameter values
  • $\alpha$ - learning rate - controls step size, typically 0.001 to 0.1
  • $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ - gradient vector - points uphill, so we subtract it

The symbol $\alpha$ (alpha) is the learning rate. Too large: we overshoot the minimum, bouncing back and forth. Too small: we take forever to converge. Typical values range from $10^{-4}$ to $10^{-1}$.

This single update rule is applied to every parameter simultaneously. Every weight and bias adjusts by its partial derivative, scaled by $\alpha$.

The update works because the gradient is a local linear approximation of the loss surface: the negative gradient is the direction of steepest local descent, so a small enough step reduces the loss.
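A sketch of the update applied to every parameter at once, assuming parameters and gradients are stored as matching dicts of arrays (as in the forward-pass sketch above):

```python
def sgd_update(params, grads, alpha=0.01):
    """One gradient descent step: theta <- theta - alpha * grad, in place."""
    for name in params:
        params[name] -= alpha * grads[name]  # same rule for every weight and bias
    return params
```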

Repeat Until Convergence

Steps 2a through 2e repeat for many iterations. Each full pass through the training data is one epoch. After enough epochs, the loss plateaus - we have found a local minimum.

Convergence, in plain terms, is when the model stops getting noticeably better: each update moves the parameters by a smaller and smaller amount, and the loss changes by less than some small threshold (e.g., 0.0001 per epoch). In practice you set an early-stopping patience: if validation loss has not improved in, say, 10 consecutive epochs, stop training. This prevents wasted compute and guards against overfitting.

In practice, we also monitor validation loss on held-out data to detect overfitting. If training loss keeps falling but validation loss starts rising, training should stop early.
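Here is a runnable sketch of that early-stopping logic with a patience of 10 epochs; the validation-loss function is a stand-in for a real evaluation, not an actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_loss(epoch):
    """Stand-in for real evaluation: loss decays, then plateaus with noise."""
    return 1.0 / (1 + epoch) + 0.01 * rng.random()

best_val, stale_epochs, patience = float("inf"), 0, 10

for epoch in range(1000):
    # ...steps 2a-2e over the training data would run here...
    val = validation_loss(epoch)
    if val < best_val - 1e-4:         # "noticeably better" threshold
        best_val, stale_epochs = val, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:  # no improvement in 10 consecutive epochs
            print(f"stopping early at epoch {epoch}")
            break
```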

