The Most Critical Hyperparameter
If you had to tune just one thing in your training setup, it's the learning rate η. η doesn't appear in the model's predictions; it controls the training process itself. You can't learn the right learning rate from data; you have to choose it.
Too Large: Overshooting
Imagine finding the bottom of a valley by taking giant leaps. You overshoot to the other side. Then leap back. Then over again. You bounce between the walls and never land in the valley.
Concrete example: minimizing f(w) = w² with gradient descent (gradient 2w, update w ← w − η · 2w):
- learning rate η = 1.5 (too large): each step multiplies w by 1 − 2η = −2, so the iterate leaps to the other side of the minimum and doubles in magnitude every step
Loss explodes: divergence. Signs of a too-large η:
- Loss increases during training instead of decreasing
- Loss oscillates up and down, never settling
- Loss becomes NaN (numerical overflow)
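The overshooting dynamics are easy to reproduce. A minimal sketch on the quadratic f(w) = w², where gradient descent with η = 1.5 multiplies w by 1 − 2η = −2 on every step:

```python
# Gradient descent on f(w) = w^2, whose gradient is 2w.
# With lr = 1.5, each update multiplies w by (1 - 2*lr) = -2:
# the iterate leaps across the minimum and grows every step.
def gradient_descent(lr, w=1.0, steps=6):
    trajectory = [w]
    for _ in range(steps):
        w = w - lr * 2 * w  # w <- w - lr * f'(w)
        trajectory.append(w)
    return trajectory

print(gradient_descent(1.5))  # [1.0, -2.0, 4.0, -8.0, 16.0, -32.0, 64.0]
```

The iterate bounces between the valley walls with growing amplitude, which is exactly the oscillating-then-exploding loss curve described above.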
Too Small: Slow Death
A tiny learning rate means tiny steps. The math is correct — you'll eventually converge — but "eventually" might mean millions of steps instead of thousands. You have a finite compute budget.
Think of trying to cross a city by shuffling forward one inch at a time. You'll get there in principle — but by the time you arrive, the meeting is long over. A learning rate that is too small wastes compute budget on steps so tiny that thousands of them together amount to almost no real progress.
Signs of a too-small η:
- Loss decreases extremely slowly over many epochs
- Gradient norms are reasonable but weights barely move
- Training feels like it's doing nothing
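The slow-death failure mode is just as easy to demonstrate. A sketch on the same kind of quadratic, f(w) = w², with an η that is orders of magnitude too small:

```python
# Gradient descent on f(w) = w^2 with a tiny learning rate.
# Each step shrinks w by a factor (1 - 2*lr); with lr = 1e-4
# that factor is 0.9998, so 100 steps shrink w by only ~2%.
def descend(lr, w=1.0, steps=100):
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(1e-4))  # ~0.980: after 100 steps, barely moved from 1.0
```

Nothing is mathematically wrong here; the run will converge eventually. It just burns the step budget making almost no progress.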
Finding the Right Learning Rate
Start with the standard default: for deep learning with the Adam optimizer, try η = 1e-3 first. This is the most reliable default across diverse tasks.
Order-of-magnitude search: try η ∈ {1e-1, 1e-2, 1e-3, 1e-4}. Plot the loss curves. The right order of magnitude is the one that gives the fastest steady decrease without oscillation.
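A sketch of the sweep, using a toy quadratic as a stand-in for a real training run (in practice you would plot the full loss curve from each short run, not just the final loss):

```python
# Order-of-magnitude search: run a short, fixed step budget at each
# candidate rate and compare losses. train_for() is a toy stand-in
# that descends f(w) = w^2; substitute your own short training run.
def train_for(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w
    return w * w  # final loss

for lr in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"lr={lr:g}  final loss={train_for(lr):.3e}")
```

On this toy problem larger rates win cleanly because the quadratic is stable up to η = 1; on a real loss surface, the largest rate will often diverge instead, and the sweep is what reveals that boundary.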
LR range test: start training with a very small η and increase it gradually over a few hundred iterations while watching the loss. The loss initially decreases, then starts to oscillate or explode. The best static learning rate sits just before the instability begins.
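A sketch of the range test on the same toy quadratic. The exponential schedule and the bounds lr_min and lr_max are illustrative choices, not fixed conventions:

```python
# LR range test: increase lr exponentially from lr_min to lr_max,
# taking one optimization step per iteration and recording the loss.
# The best static lr sits just before the loss starts to blow up
# (for f(w) = w^2, instability begins at lr = 1, where |1 - 2*lr| >= 1).
def lr_range_test(lr_min=1e-5, lr_max=10.0, iters=300, w=1.0):
    history = []
    for i in range(iters):
        lr = lr_min * (lr_max / lr_min) ** (i / (iters - 1))
        w -= lr * 2 * w  # one step on f(w) = w^2
        history.append((lr, w * w))
    return history

history = lr_range_test()
```

Plotting loss against lr from `history` shows the characteristic shape: flat at first, then a steep decrease, then an explosion once lr crosses the stability threshold.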
Adaptive Optimizers: Let the Algorithm Decide
Instead of one global η, adaptive optimizers give each parameter its own effective learning rate based on its gradient history.
Adam (Adaptive Moment Estimation) is the most widely used. It combines:
- Momentum: uses a running average of past gradients, so updates have "direction memory" and don't bounce as much.
- Per-parameter adaptive rates: parameters with consistently large gradients get effectively smaller rates; sparse or infrequently updated parameters get relatively larger rates.
- m_t: first moment estimate, a running mean of gradients (momentum)
- v_t: second moment estimate, a running mean of squared gradients (adaptive rate)
- m̂_t = m_t / (1 − β₁ᵗ): bias-corrected first moment
- v̂_t = v_t / (1 − β₂ᵗ): bias-corrected second moment
- ε: small constant for numerical stability, prevents division by zero

The parameter update is θ ← θ − η · m̂_t / (√v̂_t + ε).
Good Adam defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8. With Adam, training is less sensitive to the exact value of η than with plain SGD, because the optimizer adapts per-parameter step sizes internally. But η still needs to be in roughly the right range.
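The update above is only a few lines of code. A minimal single-parameter sketch with the standard defaults (a real implementation, such as torch.optim.Adam, operates on whole tensors):

```python
import math

# One Adam step for a single scalar parameter.
def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (adaptive rate)
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2: Adam takes near-constant steps of size ~lr
# regardless of the raw gradient scale.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # moved steadily from 1.0 toward the minimum at 0
```

Note how the step size is roughly η no matter how large or small the raw gradient is; that scale-invariance is why Adam tolerates a wider range of η than plain SGD.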
Interactive example
Compare training curves with different learning rates on the same loss surface: see divergence, slow convergence, and the sweet spot.
Coming soon