How Do We Know When to Stop?
Gradient descent can run indefinitely. The loss usually keeps decreasing — but at some point, further training gives diminishing returns or actively hurts the model's ability to generalize. There's no perfect universal stopping criterion, but there are good principled heuristics.
A model that doesn't converge is a model that doesn't learn. Understanding convergence tells you when to stop training, whether your setup is working, and how to diagnose common failures from a single loss curve.
Iterations, Epochs, and Steps
Two precise definitions you need:
- **Iteration** (also called a step): one mini-batch gradient update
- **Epoch**: one full pass through the entire training dataset
If you have $N$ training examples and batch size $B$:
- $N$ - total training examples
- $B$ - batch size
- $N/B$ - iterations per epoch
With $N/B = 500$ iterations per epoch (say, 50,000 examples and a batch size of 100), training for 10 epochs means 5,000 total gradient updates.
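A quick check of that arithmetic (the values of `N` and `B` here are just illustrative, chosen so that `N / B = 500`):

```python
N = 50_000            # total training examples (illustrative)
B = 100               # batch size (illustrative)
epochs = 10

iters_per_epoch = N // B                 # 500 iterations per epoch
total_updates = iters_per_epoch * epochs
print(total_updates)                     # -> 5000 gradient updates
```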
Fixed Epochs: The Baseline
The simplest approach: train for a fixed number of epochs, then stop.
Pros: simple, reproducible, easy to reason about. Cons: no automatic detection of overfitting or under-convergence. You might stop too early or too late.
Always pair with model checkpointing — save the model weights whenever a metric improves (or every N epochs). Even if you train too long and the model degrades, you can restore the best saved checkpoint.
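Here's a minimal sketch of that pattern. The model and the train/validate steps are toy stand-ins (not any real framework's API) so that the checkpointing logic itself runs as-is:

```python
import copy
import random

# Stand-in model and training/validation steps; swap in your real framework calls.
model = {"w": 0.0}                       # stand-in for real model weights

def train_one_epoch(m):
    m["w"] += random.uniform(-0.1, 0.2)  # pretend gradient updates (drifts upward)

def validate(m):
    return abs(m["w"] - 1.0)             # pretend validation loss, best at w = 1

best_val_loss = float("inf")
best_checkpoint = None

for epoch in range(100):                 # fixed number of epochs
    train_one_epoch(model)
    val_loss = validate(model)
    if val_loss < best_val_loss:         # metric improved: save a checkpoint
        best_val_loss = val_loss
        best_checkpoint = copy.deepcopy(model)

model = best_checkpoint                  # even if training overshot, restore the best
```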
Early Stopping: The Principled Approach
**Early stopping** monitors validation loss and halts training once it stops improving, automatically catching overfitting.
Algorithm:
- After each epoch, evaluate on the validation set
- If validation loss improved: save current weights as "best model," reset patience counter to 0
- If validation loss did not improve: increment patience counter
- If the patience counter reaches its threshold (e.g., 10 epochs with no improvement): stop training and restore the best saved weights
This answers the right question: "has the model stopped getting better on unseen data?" — not "has it stopped getting better on training data?" (which is almost always the wrong question). Early stopping also acts as implicit regularization.
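Here's that algorithm as a small helper class (the default patience of 10 is just a placeholder, and `copy.deepcopy` stands in for real checkpoint serialization):

```python
import copy

class EarlyStopping:
    """Patience-based early stopping with best-weight checkpointing."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0
        self.best_weights = None

    def step(self, val_loss, weights):
        """Record one epoch's validation loss; return True when it's time to stop."""
        if val_loss < self.best_loss:          # improved: checkpoint, reset counter
            self.best_loss = val_loss
            self.best_weights = copy.deepcopy(weights)
            self.counter = 0
        else:                                   # no improvement: count toward patience
            self.counter += 1
        return self.counter >= self.patience
```

Most frameworks ship an equivalent (Keras's `EarlyStopping` callback with `restore_best_weights=True`, for instance), so you rarely need to hand-roll this.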
Gradient Norm: Mathematical Convergence
A third criterion: stop when the gradient norm is nearly zero: $\|\nabla L(\theta)\| < \epsilon$, where:
- $\|\nabla L(\theta)\|$ - the Euclidean length of the gradient vector
- $\epsilon$ - convergence tolerance, a small threshold like 1e-6 or 1e-5
Near a minimum, gradients become small. This is a natural mathematical criterion. In practice, gradient-norm stopping is more common for convex problems (logistic regression, linear regression with regularization), where any stationary point is the global minimum. For neural networks, gradients can also be near zero at saddle points and flat regions, so a small norm alone doesn't guarantee a good solution.
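To make the criterion concrete, here's a sketch on a toy convex problem, $f(\theta) = \tfrac{1}{2}\|\theta\|^2$, whose gradient is simply $\theta$ (the tolerance and learning rate are arbitrary):

```python
import numpy as np

# Gradient descent on f(theta) = 0.5 * ||theta||^2; stop when the gradient
# norm drops below the tolerance eps.
eps = 1e-6
lr = 0.1
theta = np.array([1.0, -2.0, 0.5])

for step in range(10_000):
    grad = theta                       # gradient of f at the current point
    if np.linalg.norm(grad) < eps:     # Euclidean norm below tolerance: stop
        print(f"converged after {step} steps")
        break
    theta -= lr * grad                 # standard gradient descent update
```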
Reading Loss Curves
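As a rough preview of the patterns to look for, the synthetic curves below mimic three common shapes: train and validation loss falling together (healthy), training loss falling while validation loss turns back up (overfitting), and both stalling high (underfitting). The numbers are invented purely for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, 101)

# Synthetic curve shapes: illustrative only, not real training runs.
scenarios = {
    "healthy":      (2.0 * np.exp(-epochs / 20),
                     2.0 * np.exp(-epochs / 20) + 0.1),
    "overfitting":  (2.0 * np.exp(-epochs / 15),
                     2.0 * np.exp(-epochs / 15) + 0.1
                     + 0.01 * np.maximum(epochs - 30, 0)),   # val rises after ~30
    "underfitting": (2.0 - 0.002 * epochs,                   # barely decreasing
                     2.05 - 0.002 * epochs),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (name, (train, val)) in zip(axes, scenarios.items()):
    ax.plot(epochs, train, label="train")
    ax.plot(epochs, val, "--", label="val")
    ax.set_title(name)
    ax.set_xlabel("epoch")
axes[0].set_ylabel("loss")
axes[0].legend()
plt.tight_layout()
plt.show()
```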
The Practical Workflow
For most projects:
- Start with fixed epochs (50–200 depending on dataset size) to establish a baseline loss curve
- Add early stopping with a patience of 5–10 epochs
- Save checkpoints every epoch (or when validation loss improves)
- Plot training and validation loss after every run
- Diagnose from the shape of the curves, adjust, repeat
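Put together (reusing the stand-in `model`, `train_one_epoch`, and `validate` plus the `EarlyStopping` sketch from above, all illustrative), the loop looks like:

```python
model = {"w": 0.0}
stopper = EarlyStopping(patience=10)
val_losses = []

for epoch in range(200):                # fixed upper bound on epochs
    train_one_epoch(model)
    val_loss = validate(model)
    val_losses.append(val_loss)         # keep the curve for plotting afterward
    if stopper.step(val_loss, model):   # checkpoints internally; True = stop
        print(f"early stop at epoch {epoch}")
        break

model = stopper.best_weights            # restore the best checkpoint
```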
Interactive example
Interactive loss curve simulator - choose a training scenario and see the characteristic loss curve pattern
Coming soon