Gradient Descent · Lesson 10

Learning Rate Schedules - When to Speed Up and Slow Down

Why a constant learning rate is suboptimal across a full training run; step decay, cosine annealing, and warmup + decay schedules; and practical rules for choosing a schedule.


Quick refresher

Learning rate

The learning rate α controls the size of each gradient descent step: θ ← θ - α·∇L. A large α takes big steps (fast, but may overshoot); a small α takes small steps (stable, but slow). The right α depends on the curvature of the loss surface, which changes throughout training.

Example

With α=0.1 and gradient 2.0, the parameter moves by 0.2.

With α=0.01 and the same gradient, it moves by 0.02 — ten times less.

Early in training when gradients are large and noisy, you might prefer the small step to stay stable.
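
In code, one update is a single line. A minimal sketch using the numbers from the example (the starting value of theta is illustrative):

alpha, grad = 0.1, 2.0
theta = 1.0                    # illustrative starting value
theta = theta - alpha * grad   # moves by alpha * grad = 0.2, so theta: 1.0 -> 0.8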

Why One Learning Rate Is Never Optimal

A training run has phases, and each phase demands something different from the optimizer.

Starting training with a fixed learning rate is like trying to parallel park at highway speed — you'll never settle precisely into the spot. But starting too small means you spend ages just finding the right neighborhood. The obvious fix: start bold enough to move quickly, then ease off as you get close. That's all a learning rate schedule does.

Early training: The model starts from random weights. Loss is high, gradients are large and noisy (they depend on random initializations interacting with the full complexity of the data). A learning rate that is too large here can send parameters flying into a region of the loss surface that's hard to recover from.

Mid training: The model is making rapid progress. This is the phase that benefits most from an appropriately sized learning rate: large enough to move quickly, small enough not to overshoot.

Late training: The model is near a good minimum. To settle precisely into the minimum rather than bouncing around it, we need a much smaller learning rate — detailed refinement of weights. Think of it like annealing metal: rapid cooling locks in whatever atomic structure currently exists; slow, gradual cooling lets atoms find lower-energy, more stable arrangements. A large learning rate near the minimum is like rapid cooling — it overshoots and locks in a slightly suboptimal solution. Slowing the learning rate (cooling down) lets the optimizer settle into fine-grained details of the loss surface that a large step would jump right over, reaching a sharper, better-defined minimum.

A fixed learning rate can be optimal for at most one of these phases. Learning rate schedules fix this by making α a function of the current step or epoch.

Schedule 1: Step Decay

The simplest schedule: reduce α by a fixed factor γ every N epochs.

\alpha(n) = \alpha_0 \cdot \gamma^{\lfloor n / N \rfloor}

  • \alpha_0 : initial learning rate
  • \gamma : drop factor, typically 0.1 (divide by 10)
  • n : current epoch
  • N : epoch interval between drops
  • \lfloor \cdot \rfloor : floor function

Example: α₀=0.1, γ=0.1, N=30.

  • Epochs 0–29: α = 0.1
  • Epochs 30–59: α = 0.01
  • Epochs 60–89: α = 0.001

Step decay is simple and works well for image classification (ResNet was famously trained with step decay at epochs 30, 60, 90 of a 90-epoch run). The downside: the sudden 10× drops can cause momentary spikes in the training loss as the optimizer adjusts.
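
A minimal sketch of step decay as a plain Python function (defaults mirror the example above; the names are illustrative):

def step_decay(epoch, alpha0=0.1, gamma=0.1, drop_every=30):
    # alpha(n) = alpha0 * gamma^floor(n / drop_every)
    return alpha0 * gamma ** (epoch // drop_every)

# Reproduces the example: 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 for 60-89
for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay(epoch))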

Schedule 2: Cosine Annealing

Cosine annealing smoothly decays α from a maximum to a minimum following a cosine curve:

\alpha(t) = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

  • \alpha_{\max} : maximum learning rate (at start)
  • \alpha_{\min} : minimum learning rate (at end, often 0 or 1e-6)
  • t : current step
  • T : total steps in the schedule

At t = 0, \alpha = \alpha_{\max}. At t = T, \alpha = \alpha_{\min}. In between, it decreases smoothly: gently at first, fastest at the midpoint, then very gently again as it approaches the minimum.
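
A sketch of the same formula in plain Python (the alpha_max and alpha_min defaults are illustrative):

import math

def cosine_annealing(t, T, alpha_max=1e-3, alpha_min=1e-6):
    # Decays smoothly from alpha_max at t=0 to alpha_min at t=T
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))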

Cosine restarts: a popular variant (SGDR) resets α to α_max periodically. This allows the model to escape local minima between restarts and creates natural "snapshot" points for ensembling (more in Unit 7).
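
A sketch of the restart idea, assuming a fixed cycle length for simplicity (SGDR also lets cycles grow by a multiplier after each restart):

import math

def cosine_with_restarts(t, cycle_len, alpha_max=1e-3, alpha_min=0.0):
    t_cur = t % cycle_len  # position within the current cycle; resets to 0 at each restart
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t_cur / cycle_len))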

Schedule 3: Linear Warmup + Decay

Used by virtually all large transformer models (BERT, GPT, T5, and their descendants). The schedule has two phases:

Phase 1 — Warmup: Increase α linearly from 0 (or a small value) to α_max over the first T_w steps.

\alpha(t) = \alpha_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w

  • t : current step
  • T_w : warmup steps
  • \alpha_{\max} : peak learning rate

Phase 2 — Decay: After warmup, decay α (often with cosine or inverse square root) over the remaining steps.

\alpha(t) = \alpha_{\max} \cdot \sqrt{\frac{T_w}{t}}, \quad t > T_w \quad \text{(inverse sqrt decay)}
  • T : total training steps
  • T_w : warmup steps
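
A minimal sketch of the two phases together, assuming inverse square root decay after warmup (names and defaults are illustrative):

import math

def warmup_inv_sqrt(t, warmup_steps, alpha_max=1e-3):
    if t <= warmup_steps:
        return alpha_max * t / warmup_steps          # Phase 1: linear ramp from 0 to alpha_max
    return alpha_max * math.sqrt(warmup_steps / t)   # Phase 2: equals alpha_max at t = warmup_steps, then decays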

Why Warmup?

Random weight initialization means the first gradient steps are computed with a model that's essentially random. The gradient signal at this point has high variance — some batches push parameters one way, others push the opposite way. A large α amplifies this noise and can create large, destructive parameter updates that take many steps to recover from.

By starting with a very small α and ramping up, we allow the model to first settle into a region of the loss landscape where gradients are more consistent, then accelerate once the signal is trustworthy.

Practical Comparison

Schedule         | Pros                  | Cons                              | Best for
-----------------|-----------------------|-----------------------------------|------------------------------
Step decay       | Simple, interpretable | Abrupt transitions                | CNNs for image classification
Cosine annealing | Smooth, well-behaved  | Requires choosing T_max           | General purpose
Warmup + cosine  | Best final accuracy   | Two hyperparameters (T_w, α_max)  | Transformers, large models

Rule of thumb: use cosine annealing + short warmup (≈4% of total steps) for most modern architectures. Set α_max via a brief sweep (try 1e-3, 3e-4, 1e-4 and pick the fastest initial descent).

Code: Schedules in PyTorch

import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Option 1: cosine annealing alone, defined in epochs; step it once per epoch.
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, dataloader)
    scheduler.step()

# Option 2: linear warmup -> cosine decay (PyTorch 1.13+).
# This schedule is defined in optimizer steps, so call scheduler.step()
# after every batch, not once per epoch.
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for epoch in range(num_epochs):
    for batch in dataloader:
        train_step(model, optimizer, batch)  # hypothetical per-batch training helper
        scheduler.step()

Call scheduler.step() at the scheduler's native granularity: once per epoch for epoch-based schedules, once per optimizer step for step-based ones like the warmup + cosine combination above. Log the current learning rate during training; unexpected lr behavior is a common source of bugs.
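
PyTorch schedulers expose the current value via get_last_lr(); a minimal logging example:

# After scheduler.step(), inside the training loop:
current_lr = scheduler.get_last_lr()[0]   # list with one entry per param group
print(f"lr = {current_lr:.2e}")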

Quiz


Why does warmup help at the start of training?