Gradient Descent
Lesson 6 ⏱ 10 min

Convergence

Video coming soon

Convergence - Reading Loss Curves and Knowing When to Stop

Loss curve anatomy, early stopping algorithm, gradient norm criterion, and how to diagnose overfitting, underfitting, and learning rate problems from a single plot.

⏱ ~6 min


Quick refresher

Epoch and iteration

One iteration is one gradient update (one mini-batch). One epoch is a full pass through all training data. If you have 10,000 examples with batch size 100, one epoch = 100 iterations.

Example

Training for 50 epochs with batch size 64 on a dataset of 6,400 examples: 100 iterations per epoch × 50 epochs = 5,000 total gradient updates.

How Do We Know When to Stop?

Gradient descent can run indefinitely. The loss usually keeps decreasing — but at some point, further training gives diminishing returns or actively hurts the model's ability to generalize. There's no perfect universal stopping criterion, but there are good principled heuristics.

A model that doesn't converge is a model that doesn't learn. Understanding convergence tells you when to stop training, whether your setup is working, and how to diagnose common failures from a single loss curve.

Iteration, Epoch, and Steps

Two precise definitions you need:

  • Iteration: one mini-batch gradient update
  • Epoch: one full pass through the entire training dataset

If you have n = 50,000 training examples and batch size B = 100:

T = n / B = 50,000 / 100 = 500 iterations per epoch

n: total training examples
B: batch size
T: iterations per epoch

Training for 10 epochs means 5,000 total gradient updates.
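The arithmetic above is worth making concrete. A quick sketch, using the numbers from this example:

```python
# Computing total gradient updates from dataset size, batch size,
# and epoch count (values taken from the example above).
n = 50_000        # total training examples
B = 100           # batch size
epochs = 10

iters_per_epoch = n // B              # T = n / B
total_updates = iters_per_epoch * epochs

print(iters_per_epoch)  # 500
print(total_updates)    # 5000
```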

Fixed Epochs: The Baseline

The simplest approach: train for a fixed number of epochs, then stop.

Pros: simple, reproducible, easy to reason about. Cons: no automatic detection of overfitting or under-convergence. You might stop too early or too late.

Always pair with model checkpointing — save the model weights whenever a metric improves (or every N epochs). Even if you train too long and the model degrades, you can restore the best saved checkpoint.

Early Stopping: The Principled Approach

Early stopping monitors validation loss and stops training when it stops improving, automatically catching overfitting.

Algorithm:

  1. After each epoch, evaluate on the validation set
  2. If validation loss improved: save current weights as "best model," reset patience counter to 0
  3. If validation loss did not improve: increment patience counter
  4. If patience counter reaches threshold (e.g., K = 10): stop training and restore the best saved weights

This answers the right question: "has the model stopped getting better on unseen data?" — not "has it stopped getting better on training data?" (which is almost always the wrong question). Early stopping also acts as implicit regularization.
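The four steps above can be sketched as a small helper class (a toy illustration, not a specific library's API):

```python
import copy

class EarlyStopper:
    """Stop training after `patience` epochs without validation improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.counter = 0
        self.best_loss = float("inf")
        self.best_weights = None

    def step(self, val_loss, weights):
        """Call once per epoch; returns True when training should stop."""
        if val_loss < self.best_loss:                 # improved: save, reset
            self.best_loss = val_loss
            self.best_weights = copy.deepcopy(weights)
            self.counter = 0
        else:                                         # no improvement
            self.counter += 1
        return self.counter >= self.patience

# Toy run: validation loss improves for three epochs, then plateaus.
stopper = EarlyStopper(patience=3)
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
for epoch, loss in enumerate(losses):
    if stopper.step(loss, {"epoch": epoch}):
        break

print(stopper.best_loss)     # 0.7
print(stopper.best_weights)  # {'epoch': 2}
```

Note that `step` saves weights *before* resetting the counter, so the restored model is always the one from the best epoch, not the last one.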

Gradient Norm: Mathematical Convergence

A third criterion: stop when the gradient norm is nearly zero:

‖∇L‖ < ε

‖∇L‖: Euclidean length of the gradient vector
ε: convergence tolerance, a small threshold like 1e-6 or 1e-5

Near a minimum, gradients become small. This is a natural mathematical criterion. In practice, gradient-norm stopping is more common for convex problems (logistic regression, linear regression with regularization) where any stationary point is the global minimum. For neural networks, gradients can also be near zero at saddle points and flat regions.
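A sketch of gradient-norm stopping on a simple convex function, f(x, y) = x² + 2y², whose gradient is [2x, 4y] and whose unique minimum is at the origin:

```python
import math

def grad(p):
    """Gradient of f(x, y) = x^2 + 2y^2."""
    x, y = p
    return [2 * x, 4 * y]

p = [3.0, -2.0]   # starting point
lr = 0.1          # learning rate
eps = 1e-6        # convergence tolerance

for step in range(10_000):
    g = grad(p)
    if math.sqrt(g[0] ** 2 + g[1] ** 2) < eps:   # stop when ||grad|| < eps
        break
    p = [p[0] - lr * g[0], p[1] - lr * g[1]]

print(step)  # converges well before the 10,000-iteration cap
print(p)     # very close to the minimum at (0, 0)
```

Because the problem is convex, the near-zero gradient genuinely certifies a (near-)global minimum; for a neural network the same test could trigger at a saddle point or plateau.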

Reading Loss Curves

The Practical Workflow

For most projects:

  1. Start with fixed epochs (50–200 depending on dataset size) to establish a baseline loss curve
  2. Add early stopping with patience K = 10 to 20 epochs
  3. Save checkpoints every epoch (or when validation loss improves)
  4. Plot training and validation loss after every run
  5. Diagnose from the shape of the curves, adjust, repeat
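Step 5 can be partly automated. A toy heuristic for the two most common curve shapes (rising validation loss with falling training loss suggests overfitting; a training loss that has stopped falling suggests under-convergence or a learning-rate problem). The loss lists and window size are illustrative assumptions:

```python
# Toy logged losses: training keeps falling, validation turns back up.
train = [1.0, 0.6, 0.4, 0.3, 0.25, 0.2]
val   = [1.1, 0.7, 0.55, 0.5, 0.55, 0.6]

def diagnose(train, val, window=2):
    """Crude curve-shape check over the last `window` epochs."""
    train_falling = train[-1] < train[-1 - window]
    val_rising = val[-1] > val[-1 - window]
    if train_falling and val_rising:
        return "overfitting: validation loss rising while training loss falls"
    if not train_falling:
        return "not converging: training loss has stopped decreasing"
    return "still improving"

print(diagnose(train, val))  # overfitting: validation loss rising ...
```

In practice you would eyeball the plotted curves rather than trust a two-point comparison, but the logic mirrors what you are looking for in the plot.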

Interactive example

Interactive loss curve simulator - choose a training scenario and see the characteristic loss curve pattern

Coming soon

Quiz

1 / 3

One 'epoch' is defined as...