Gradient Descent
Lesson 4 ⏱ 10 min

Learning rate

Video coming soon

The Learning Rate - Too Big, Too Small, Just Right

Concrete examples of divergence and slow convergence, the LR range test, and how adaptive optimizers like Adam handle the learning rate problem automatically.

⏱ ~6 min

🧮 Quick refresher

The update rule w ← w - α·∇L

Each gradient descent step moves w in the direction -∇L (downhill). The learning rate α scales the step size. Larger α means bigger steps.

Example

At w=4 with L=w², gradient=8, α=0.1: w ← 4 - 0.1·8 = 3.2.

With α=1.5: w ← 4 - 1.5·8 = -8 (overshoot!).
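
A minimal sketch of this step in Python, assuming the same toy loss L = w² (so ∇L = 2w):

```python
# One gradient descent step on L = w^2, whose gradient is dL/dw = 2w.
def gd_step(w, lr):
    grad = 2 * w          # gradient of L = w^2 at the current w
    return w - lr * grad  # w <- w - alpha * gradient

print(gd_step(4.0, 0.1))  # ~3.2: a modest step toward the minimum at w = 0
print(gd_step(4.0, 1.5))  # -8.0: overshoots past the minimum
```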

The Most Critical Hyperparameter

If you had to tune just one thing in your training setup, it would be the learning rate. It doesn't appear in predictions; it controls the training process itself. You can't learn the right learning rate from data; you have to choose it.

Too Large: Overshooting

Imagine finding the bottom of a valley by taking giant leaps. You overshoot to the other side. Then leap back. Then over again. You bounce between the walls and never land in the valley.

Concrete example: L = w², starting at w₀ = 4, with learning rate α = 1.5 (too large):

w₁ ← 4 - 1.5·(2·4) = 4 - 12 = -8   (L = 64, was 16)

w₂ ← -8 - 1.5·(2·(-8)) = -8 + 24 = 16   (L = 256, was 64)

Loss is exploding: divergence. Signs of a too-large α:

  • Loss increases during training instead of decreasing
  • Loss oscillates up and down, never settling
  • Loss becomes NaN (numerical overflow)
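
The same divergence, iterated in a short Python sketch (toy loss L = w², α = 1.5, no libraries assumed):

```python
# Repeated gradient steps on L = w^2 with alpha = 1.5, starting at w = 4.
# Each step flips the sign of w and doubles its magnitude, so the loss quadruples.
w, lr = 4.0, 1.5
for step in range(5):
    print(f"step {step}: w = {w:.1f}, loss = {w**2:.0f}")
    w = w - lr * 2 * w   # w <- w - alpha * dL/dw, with dL/dw = 2w
# step 0: w = 4.0, loss = 16
# step 1: w = -8.0, loss = 64
# step 2: w = 16.0, loss = 256   ...and the loss keeps exploding
```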

Too Small: Slow Death

A tiny learning rate means tiny steps. The math is correct — you'll eventually converge — but "eventually" might mean millions of steps instead of thousands. You have a finite compute budget.

Think of trying to cross a city by shuffling forward one inch at a time. You'll get there in principle — but by the time you arrive, the meeting is long over. A learning rate that is too small wastes compute budget on steps so tiny that thousands of them together amount to almost no real progress.

Signs of a too-small α:

  • Loss decreases extremely slowly over many epochs
  • Gradient norms are reasonable but weights barely move
  • Training feels like it's doing nothing
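
A quick sketch of this on the toy loss L = w², counting how many steps each rate needs to drive the loss below a small threshold (the exact counts are specific to this toy problem):

```python
# The "too small" failure mode: a tiny learning rate converges,
# but only after a huge number of steps.
def steps_to_converge(lr, w=4.0, tol=1e-6, max_steps=10_000_000):
    """Count gradient steps on L = w^2 until the loss drops below tol."""
    for step in range(max_steps):
        if w ** 2 < tol:
            return step
        w = w - lr * 2 * w   # gradient step on L = w^2
    return max_steps

print(steps_to_converge(0.1))     # 38 steps
print(steps_to_converge(0.0001))  # over 40,000 steps for the same problem
```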

Finding the Right Learning Rate

Start with the standard default: for deep learning with the Adam optimizer, try α = 0.001 first. This is the most reliable default across diverse tasks.

Order-of-magnitude search: try 0.1, 0.01, 0.001, 0.0001. Plot loss curves. The right order of magnitude is the one that gives the fastest steady decrease without oscillation.
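
A rough sketch of that sweep on the same toy loss L = w²; on a real model you would train with each candidate rate and compare the full loss curves:

```python
# Run a fixed budget of gradient steps with each candidate rate and compare.
def loss_after(lr, w=4.0, steps=50):
    for _ in range(steps):
        w = w - lr * 2 * w   # gradient step on L = w^2
    return w ** 2

for lr in [0.1, 0.01, 0.001, 0.0001]:
    print(f"lr = {lr}: loss after 50 steps = {loss_after(lr):.2e}")
# On this easy quadratic the largest rate tried wins by a wide margin;
# on a real model the sweep also exposes rates that oscillate or diverge.
```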

LR range test: start training with a very small α and increase it gradually over a few hundred iterations while watching the loss. The loss initially decreases, then starts to oscillate or explode. The best static learning rate sits just before the instability begins.
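
A sketch of the range test, again on the toy loss L = w² so it stays self-contained; in practice you would run this on real training batches, and the growth factor and stopping threshold here are arbitrary choices:

```python
# One run in which the learning rate grows geometrically each iteration.
# We track the best loss seen and stop once the loss climbs well above it.
w, lr, best_loss = 4.0, 1e-4, float("inf")
while lr < 3.0:
    loss = w ** 2
    best_loss = min(best_loss, loss)
    if loss > 4 * best_loss:             # the loss is rising again: instability
        print(f"instability begins near lr = {lr:.2f}")
        break
    w = w - lr * 2 * w                   # one gradient step at the current rate
    lr *= 1.2                            # grow the rate geometrically
# For L = w^2 the update is only stable for lr < 1, and this test flags the
# blow-up a few iterations after the rate crosses that point.
```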

Adaptive Optimizers: Let the Algorithm Decide

Instead of one global α, adaptive optimizers give each parameter its own effective learning rate based on its gradient history.

Adam (Adaptive Moment Estimation) is the most widely used. It combines:

  1. Momentum: uses a running average of past gradients, so updates have "direction memory" and don't bounce as much.
  2. Per-parameter adaptive rates: parameters with consistently large gradients get effectively smaller rates; sparse or infrequently updated parameters get relatively larger rates.
wₜ₊₁ = wₜ - α·m̂ₜ / (√v̂ₜ + ε)

  • mₜ: first moment estimate - running mean of gradients (momentum)
  • vₜ: second moment estimate - running mean of squared gradients (adaptive rate)
  • m̂ₜ: bias-corrected first moment
  • v̂ₜ: bias-corrected second moment
  • ε: small constant for numerical stability - prevents division by zero

Good Adam defaults: α = 0.001, β₁ = 0.9, β₂ = 0.999. With Adam, the choice of α is less sensitive than with plain SGD, since the optimizer adapts internally. But you still need to be in roughly the right range.
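
A minimal sketch of the Adam update above for a single parameter array, using the common default ε = 1e-8 and the toy loss L = w²; a teaching sketch, not a production optimizer:

```python
import numpy as np

def adam_update(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter array w, returning the updated state."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialized m
    v_hat = v / (1 - beta2 ** t)              # bias correction for the zero-initialized v
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage on L = w^2 (gradient 2w), starting from w = 4; t starts at 1 for the bias correction.
w = np.array([4.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 10_001):
    grad = 2 * w
    w, m, v = adam_update(w, grad, m, v, t)
print(float(w[0] ** 2))   # the loss is now tiny compared with the starting value of 16
```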

Interactive example

Compare training curves with different learning rates on the same loss - see divergence, slow convergence, and the sweet spot

Coming soon

Quiz

1 / 3

If the training loss increases or oscillates wildly during training, the most likely cause is...