Gradient Descent
Lesson 4 ⏱ 10 min

Learning rate

Video coming soon

The Learning Rate - Too Big, Too Small, Just Right

Concrete examples of divergence and slow convergence, the LR range test, and how adaptive optimizers like Adam handle the learning rate problem automatically.

⏱ ~6 min

🧮 Quick refresher

The update rule w ← w - α·∇L

Each gradient descent step moves w in the direction -∇L (downhill). The learning rate α scales the step size. Larger α means bigger steps.

Example

At w=4 with L=w², gradient=8, α=0.1: w ← 4 - 0.1·8 = 3.2.

With α=1.5: w ← 4 - 1.5·8 = -8 (overshoot!).
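
A minimal sketch of this step in Python, assuming the same toy loss L = w² (so ∇L = 2w):

```python
# One gradient descent step on L = w^2, whose gradient is dL/dw = 2w.
def gd_step(w, lr):
    grad = 2 * w          # gradient of L = w^2 at the current w
    return w - lr * grad  # w <- w - alpha * gradient

print(gd_step(4.0, 0.1))  # ~3.2: a modest step toward the minimum at w = 0
print(gd_step(4.0, 1.5))  # -8.0: overshoots past the minimum
```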

The Most Critical Hyperparameter

If you had to tune just one thing in your training setup, it would be the learning rate. It doesn't appear in predictions; it controls the training process itself. You can't learn the right learning rate from data; you have to choose it.

Too Large: Overshooting

Imagine finding the bottom of a valley by taking giant leaps. You overshoot to the other side. Then leap back. Then over again. You bounce between the walls and never land in the valley.

Concrete example: L = w², starting at w₀ = 4, with learning rate α = 1.5 (too large):

w₁ ← 4 - 1.5·(2·4) = 4 - 12 = -8   (L = 64, was 16)

w₂ ← -8 - 1.5·(2·(-8)) = -8 + 24 = 16   (L = 256, was 64)

Loss is exploding: divergence. Signs of a too-large α:

  • Loss increases during training instead of decreasing
  • Loss oscillates up and down, never settling
  • Loss becomes NaN (numerical overflow)
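
The same divergence, iterated in a short Python sketch (toy loss L = w², α = 1.5, no libraries assumed):

```python
# Repeated gradient steps on L = w^2 with alpha = 1.5, starting at w = 4.
# Each step flips the sign of w and doubles its magnitude, so the loss quadruples.
w, lr = 4.0, 1.5
for step in range(5):
    print(f"step {step}: w = {w:.1f}, loss = {w**2:.0f}")
    w = w - lr * 2 * w   # w <- w - alpha * dL/dw, with dL/dw = 2w
# step 0: w = 4.0, loss = 16
# step 1: w = -8.0, loss = 64
# step 2: w = 16.0, loss = 256   ...and the loss keeps exploding
```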

Too Small: Slow Death

A tiny learning rate means tiny steps. The math is correct — you'll eventually converge — but "eventually" might mean millions of steps instead of thousands. You have a finite compute budget.

Think of trying to cross a city by shuffling forward one inch at a time. You'll get there in principle — but by the time you arrive, the meeting is long over. A learning rate that is too small wastes compute budget on steps so tiny that thousands of them together amount to almost no real progress.

Signs of a too-small α:

  • Loss decreases extremely slowly over many epochs
  • Gradient norms are reasonable but weights barely move
  • Training feels like it's doing nothing
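
A quick sketch of this on the toy loss L = w², counting how many steps each rate needs to drive the loss below a small threshold (the exact counts are specific to this toy problem):

```python
# The "too small" failure mode: a tiny learning rate converges,
# but only after a huge number of steps.
def steps_to_converge(lr, w=4.0, tol=1e-6, max_steps=10_000_000):
    """Count gradient steps on L = w^2 until the loss drops below tol."""
    for step in range(max_steps):
        if w ** 2 < tol:
            return step
        w = w - lr * 2 * w   # gradient step on L = w^2
    return max_steps

print(steps_to_converge(0.1))     # 38 steps
print(steps_to_converge(0.0001))  # over 40,000 steps for the same problem
```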

Finding the Right Learning Rate

Start with the standard default: for deep learning with the Adam optimizer, try α = 0.001 first. This is the most reliable default across diverse tasks.

Order-of-magnitude search: try 0.1, 0.01, 0.001, 0.0001. Plot loss curves. The right order of magnitude is the one that gives the fastest steady decrease without oscillation.
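
A rough sketch of that sweep on the same toy loss L = w²; on a real model you would train with each candidate rate and compare the full loss curves:

```python
# Run a fixed budget of gradient steps with each candidate rate and compare.
def loss_after(lr, w=4.0, steps=50):
    for _ in range(steps):
        w = w - lr * 2 * w   # gradient step on L = w^2
    return w ** 2

for lr in [0.1, 0.01, 0.001, 0.0001]:
    print(f"lr = {lr}: loss after 50 steps = {loss_after(lr):.2e}")
# On this easy quadratic the largest rate tried wins by a wide margin;
# on a real model the sweep also exposes rates that oscillate or diverge.
```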

LR range test: start training with a very small α and increase it gradually over a few hundred iterations while watching the loss. The loss initially decreases, then starts to oscillate or explode. The best static learning rate sits just before the instability begins.
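
A sketch of the range test, again on the toy loss L = w² so it stays self-contained; in practice you would run this on real training batches, and the growth factor and stopping threshold here are arbitrary choices:

```python
# One run in which the learning rate grows geometrically each iteration.
# We track the best loss seen and stop once the loss climbs well above it.
w, lr, best_loss = 4.0, 1e-4, float("inf")
while lr < 3.0:
    loss = w ** 2
    best_loss = min(best_loss, loss)
    if loss > 4 * best_loss:             # the loss is rising again: instability
        print(f"instability begins near lr = {lr:.2f}")
        break
    w = w - lr * 2 * w                   # one gradient step at the current rate
    lr *= 1.2                            # grow the rate geometrically
# For L = w^2 the update is only stable for lr < 1, and this test flags the
# blow-up a few iterations after the rate crosses that point.
```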

Adaptive Optimizers: Let the Algorithm Decide

Instead of one global α, adaptive optimizers give each parameter its own effective learning rate based on its gradient history.

Adam (Adaptive Moment Estimation) is the most widely used. It combines:

  1. Momentum: uses a running average of past gradients, so updates have "direction memory" and don't bounce as much.
  2. Per-parameter adaptive rates: parameters with consistently large gradients get effectively smaller rates; sparse or infrequently updated parameters get relatively larger rates.
wₜ₊₁ = wₜ - α·m̂ₜ / (√v̂ₜ + ε)

  • mₜ: first moment estimate - running mean of gradients (momentum)
  • vₜ: second moment estimate - running mean of squared gradients (adaptive rate)
  • m̂ₜ: bias-corrected first moment
  • v̂ₜ: bias-corrected second moment
  • ε: small constant for numerical stability - prevents division by zero

Good Adam defaults: α = 0.001, β₁ = 0.9, β₂ = 0.999. With Adam, the choice of α is less sensitive than with plain SGD, since the optimizer adapts internally. But you still need to be in roughly the right range.
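
A minimal sketch of the Adam update above for a single parameter array, using the common default ε = 1e-8 and the toy loss L = w²; a teaching sketch, not a production optimizer:

```python
import numpy as np

def adam_update(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter array w, returning the updated state."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialized m
    v_hat = v / (1 - beta2 ** t)              # bias correction for the zero-initialized v
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage on L = w^2 (gradient 2w), starting from w = 4; t starts at 1 for the bias correction.
w = np.array([4.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 10_001):
    grad = 2 * w
    w, m, v = adam_update(w, grad, m, v, t)
print(float(w[0] ** 2))   # the loss is now tiny compared with the starting value of 16
```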

Interactive example

Compare training curves with different learning rates on the same loss - see divergence, slow convergence, and the sweet spot

Coming soon

Quiz

1 / 3

If the training loss increases or oscillates wildly during training, the most likely cause is...