How Do We Know When to Stop?
Gradient descent can run indefinitely. The loss usually keeps decreasing — but at some point, further training gives diminishing returns or actively hurts the model's ability to generalize. There's no perfect universal stopping criterion, but there are good principled heuristics.
A model that doesn't converge is a model that doesn't learn. Understanding convergence tells you when to stop training, whether your setup is working, and how to diagnose common failures from a single loss curve.
Iterations, Epochs, and Steps
Two precise definitions you need:
- **Iteration** (also called a step): one mini-batch gradient update
- **Epoch**: one full pass through the entire training dataset
If you have $N$ training examples and batch size $B$:
- $N$ - total training examples
- $B$ - batch size
- $N/B$ - iterations per epoch
With $N/B = 500$ iterations per epoch (say, 50,000 examples and a batch size of 100), training for 10 epochs means 5,000 total gradient updates.
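A quick check of that arithmetic (the values of `N` and `B` here are just illustrative, chosen so that `N / B = 500`):

```python
N = 50_000            # total training examples (illustrative)
B = 100               # batch size (illustrative)
epochs = 10

iters_per_epoch = N // B                 # 500 iterations per epoch
total_updates = iters_per_epoch * epochs
print(total_updates)                     # -> 5000 gradient updates
```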
Fixed Epochs: The Baseline
The simplest approach: train for a fixed number of epochs, then stop.
Pros: simple, reproducible, easy to reason about. Cons: no automatic detection of overfitting or under-convergence. You might stop too early or too late.
Always pair with model checkpointing — save the model weights whenever a metric improves (or every N epochs). Even if you train too long and the model degrades, you can restore the best saved checkpoint.
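Here's a minimal sketch of that pattern. The model and the train/validate steps are toy stand-ins (not any real framework's API) so that the checkpointing logic itself runs as-is:

```python
import copy
import random

# Stand-in model and training/validation steps; swap in your real framework calls.
model = {"w": 0.0}                       # stand-in for real model weights

def train_one_epoch(m):
    m["w"] += random.uniform(-0.1, 0.2)  # pretend gradient updates (drifts upward)

def validate(m):
    return abs(m["w"] - 1.0)             # pretend validation loss, best at w = 1

best_val_loss = float("inf")
best_checkpoint = None

for epoch in range(100):                 # fixed number of epochs
    train_one_epoch(model)
    val_loss = validate(model)
    if val_loss < best_val_loss:         # metric improved: save a checkpoint
        best_val_loss = val_loss
        best_checkpoint = copy.deepcopy(model)

model = best_checkpoint                  # even if training overshot, restore the best
```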
Early Stopping: The Principled Approach
**Early stopping** monitors validation loss and halts training once it stops improving, automatically catching overfitting.
Algorithm:
- After each epoch, evaluate on the validation set
- If validation loss improved: save current weights as "best model," reset patience counter to 0
- If validation loss did not improve: increment patience counter
- If the patience counter reaches its threshold (e.g., 10 epochs with no improvement): stop training and restore the best saved weights
This answers the right question: "has the model stopped getting better on unseen data?" — not "has it stopped getting better on training data?" (which is almost always the wrong question). Early stopping also acts as implicit regularization.
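Here's that algorithm as a small helper class (the default patience of 10 is just a placeholder, and `copy.deepcopy` stands in for real checkpoint serialization):

```python
import copy

class EarlyStopping:
    """Patience-based early stopping with best-weight checkpointing."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0
        self.best_weights = None

    def step(self, val_loss, weights):
        """Record one epoch's validation loss; return True when it's time to stop."""
        if val_loss < self.best_loss:          # improved: checkpoint, reset counter
            self.best_loss = val_loss
            self.best_weights = copy.deepcopy(weights)
            self.counter = 0
        else:                                   # no improvement: count toward patience
            self.counter += 1
        return self.counter >= self.patience
```

Most frameworks ship an equivalent (Keras's `EarlyStopping` callback with `restore_best_weights=True`, for instance), so you rarely need to hand-roll this.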
Gradient Norm: Mathematical Convergence
A third criterion: stop when the gradient norm is nearly zero: $\|\nabla L(\theta)\| < \epsilon$, where:
- $\|\nabla L(\theta)\|$ - the Euclidean length of the gradient vector
- $\epsilon$ - convergence tolerance, a small threshold like 1e-6 or 1e-5
Near a minimum, gradients become small. This is a natural mathematical criterion. In practice, gradient-norm stopping is more common for convex problems (logistic regression, linear regression with regularization), where any stationary point is the global minimum. For neural networks, gradients can also be near zero at saddle points and flat regions, so a small norm alone doesn't guarantee a good solution.
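To make the criterion concrete, here's a sketch on a toy convex problem, $f(\theta) = \tfrac{1}{2}\|\theta\|^2$, whose gradient is simply $\theta$ (the tolerance and learning rate are arbitrary):

```python
import numpy as np

# Gradient descent on f(theta) = 0.5 * ||theta||^2; stop when the gradient
# norm drops below the tolerance eps.
eps = 1e-6
lr = 0.1
theta = np.array([1.0, -2.0, 0.5])

for step in range(10_000):
    grad = theta                       # gradient of f at the current point
    if np.linalg.norm(grad) < eps:     # Euclidean norm below tolerance: stop
        print(f"converged after {step} steps")
        break
    theta -= lr * grad                 # standard gradient descent update
```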
Reading Loss Curves
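As a rough preview of the patterns to look for, the synthetic curves below mimic three common shapes: train and validation loss falling together (healthy), training loss falling while validation loss turns back up (overfitting), and both stalling high (underfitting). The numbers are invented purely for plotting:

```python
import numpy as np
import matplotlib.pyplot as plt

epochs = np.arange(1, 101)

# Synthetic curve shapes: illustrative only, not real training runs.
scenarios = {
    "healthy":      (2.0 * np.exp(-epochs / 20),
                     2.0 * np.exp(-epochs / 20) + 0.1),
    "overfitting":  (2.0 * np.exp(-epochs / 15),
                     2.0 * np.exp(-epochs / 15) + 0.1
                     + 0.01 * np.maximum(epochs - 30, 0)),   # val rises after ~30
    "underfitting": (2.0 - 0.002 * epochs,                   # barely decreasing
                     2.05 - 0.002 * epochs),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (name, (train, val)) in zip(axes, scenarios.items()):
    ax.plot(epochs, train, label="train")
    ax.plot(epochs, val, "--", label="val")
    ax.set_title(name)
    ax.set_xlabel("epoch")
axes[0].set_ylabel("loss")
axes[0].legend()
plt.tight_layout()
plt.show()
```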
The Practical Workflow
For most projects:
- Start with fixed epochs (50–200 depending on dataset size) to establish a baseline loss curve
- Add early stopping with a patience of 5–10 epochs
- Save checkpoints every epoch (or when validation loss improves)
- Plot training and validation loss after every run
- Diagnose from the shape of the curves, adjust, repeat
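Put together (reusing the stand-in `model`, `train_one_epoch`, and `validate` plus the `EarlyStopping` sketch from above, all illustrative), the loop looks like:

```python
model = {"w": 0.0}
stopper = EarlyStopping(patience=10)
val_losses = []

for epoch in range(200):                # fixed upper bound on epochs
    train_one_epoch(model)
    val_loss = validate(model)
    val_losses.append(val_loss)         # keep the curve for plotting afterward
    if stopper.step(val_loss, model):   # checkpoints internally; True = stop
        print(f"early stop at epoch {epoch}")
        break

model = stopper.best_weights            # restore the best checkpoint
```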
Interactive example
Interactive loss curve simulator - choose a training scenario and see the characteristic loss curve pattern
Coming soon