Why One Learning Rate Is Never Optimal
A training run has phases, and each phase demands something different from the optimizer.
Training with a single fixed learning rate is like trying to parallel park at highway speed: fast enough to cover ground, but you will never settle precisely into the spot. Pick a rate small enough to park with, though, and you spend ages just reaching the right neighborhood. The obvious fix: start bold enough to move quickly, then ease off as you get close. That's all a learning rate schedule does.
Early training: The model starts from random weights. Loss is high, gradients are large and noisy (they depend on random initializations interacting with the full complexity of the data). A learning rate that is too large here can send parameters flying into a region of the loss surface that's hard to recover from.
Mid training: The model is making rapid progress. This is the phase that benefits most from an appropriately-sized learning rate — large enough to move quickly, small enough to not overshoot.
Late training: The model is near a good minimum. To settle precisely into the minimum rather than bouncing around it, we need a much smaller learning rate — detailed refinement of weights. Think of it like annealing metal: rapid cooling locks in whatever atomic structure currently exists; slow, gradual cooling lets atoms find lower-energy, more stable arrangements. A large learning rate near the minimum is like rapid cooling — it overshoots and locks in a slightly suboptimal solution. Slowing the learning rate (cooling down) lets the optimizer settle into fine-grained details of the loss surface that a large step would jump right over, reaching a sharper, better-defined minimum.
A fixed learning rate can be optimal for at most one of these phases. Learning rate schedules resolve the tension by making α a function of the current step or epoch.
Schedule 1: Step Decay
The simplest schedule: reduce α by a fixed factor γ every N epochs:

α_t = α₀ · γ^⌊t/N⌋

- α₀: initial learning rate
- γ: drop factor, typically 0.1 (divide by 10)
- t: current epoch
- N: epoch interval between drops
- ⌊·⌋: floor function
Example: α₀=0.1, γ=0.1, N=30.
- Epochs 0–29: α = 0.1
- Epochs 30–59: α = 0.01
- Epochs 60–89: α = 0.001
Step decay is simple and works well for image classification (ResNet was famously trained with step decay at epochs 30, 60, 90 of a 90-epoch run). The downside: the sudden 10× drops can cause momentary spikes in the training loss as the optimizer adjusts.
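As a concrete sketch, the worked example above maps directly onto PyTorch's built-in StepLR scheduler. The one-layer model and the empty epoch loop below are placeholders for a real training script:

import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 1)                              # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # α₀ = 0.1
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)      # multiply lr by 0.1 every 30 epochs

for epoch in range(90):
    # ... train for one epoch ...
    scheduler.step()   # lr: 0.1 for epochs 0–29, 0.01 for 30–59, 0.001 for 60–89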
Schedule 2: Cosine Annealing
Cosine annealing smoothly decays α from a maximum to a minimum following a cosine curve:
α_t = α_min + ½ · (α_max − α_min) · (1 + cos(π · t / T_max))

- α_max: maximum learning rate (at start)
- α_min: minimum learning rate (at end, often 0 or 1e-6)
- t: current step
- T_max: total steps in the schedule
At t = 0 the cosine term is +1, so α = α_max; at t = T_max it is −1, so α = α_min. In between, α decreases smoothly: slowly at first, fastest near the midpoint, then very gently as it approaches the minimum.
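A few lines of plain Python make that endpoint behavior easy to check. This is just the formula above written out, not a framework API:

import math

def cosine_lr(t, t_max, lr_max, lr_min=0.0):
    # α at step t, decayed from lr_max to lr_min over t_max steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))

print(cosine_lr(0, 1000, 1e-3))      # ≈ 1e-3  (α_max at t = 0)
print(cosine_lr(500, 1000, 1e-3))    # ≈ 5e-4  (midpoint)
print(cosine_lr(1000, 1000, 1e-3))   # ≈ 0.0   (α_min at t = T_max)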
Cosine restarts: a popular variant (SGDR) resets α to α_max periodically. This allows the model to escape local minima between restarts and creates natural "snapshot" points for ensembling (more in Unit 7).
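PyTorch ships this variant as CosineAnnealingWarmRestarts; a minimal sketch, with a toy optimizer and illustrative cycle lengths:

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.SGD(torch.nn.Linear(10, 1).parameters(), lr=0.1)
# First cycle lasts T_0 scheduler steps; T_mult=2 doubles each later cycle (10, 20, 40, ...)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)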
Schedule 3: Linear Warmup + Decay
Used by virtually all large transformer models (BERT, GPT, T5, and their descendants). The schedule has two phases:
Phase 1 — Warmup: Increase α linearly from 0 (or a small value) to α_max over the first T_w steps:

α_t = α_max · t / T_w   (for t ≤ T_w)

- t: current step
- T_w: number of warmup steps
- α_max: peak learning rate
Phase 2 — Decay: After warmup, decay α (often with cosine or inverse square root) over the remaining T_total − T_w steps. With cosine decay, this is the cosine-annealing formula above applied with step t − T_w and horizon T_total − T_w.

- T_total: total training steps
- T_w: number of warmup steps
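Putting the two phases together in plain Python, as a sketch of the formulas above (cosine decay to 0 in phase 2; symbol names follow the definitions just given):

import math

def warmup_cosine_lr(t, t_warmup, t_total, lr_max):
    if t < t_warmup:                                   # Phase 1: linear ramp 0 → α_max
        return lr_max * t / t_warmup
    progress = (t - t_warmup) / (t_total - t_warmup)   # Phase 2: cosine decay to 0
    return 0.5 * lr_max * (1 + math.cos(math.pi * progress))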
Why Warmup?
Random weight initialization means the first gradient steps are computed with a model that's essentially random. The gradient signal at this point has high variance — some batches push parameters one way, others push the opposite way. A large α amplifies this noise and can create large, destructive parameter updates that take many steps to recover from.
By starting with a very small α and ramping up, we allow the model to first settle into a region of the loss landscape where gradients are more consistent, then accelerate once the signal is trustworthy.
Practical Comparison
| Schedule | Pros | Cons | Best for |
|---|---|---|---|
| Step decay | Simple, interpretable | Abrupt transitions | CNNs for image classification |
| Cosine annealing | Smooth, well-behaved | Requires choosing T_max | General purpose |
| Warmup + cosine | Best final accuracy | Two hyperparameters (T_w, α_max) | Transformers, large models |
Rule of thumb: use cosine annealing + short warmup (≈4% of total steps) for most modern architectures. Set α_max via a brief sweep (try 1e-3, 3e-4, 1e-4 and pick the fastest initial descent).
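One way to run that sweep is a short throwaway loop. The dummy model and synthetic data below are placeholders for a real pipeline, and "fastest initial descent" is approximated here as "lowest loss after a few hundred steps":

import torch
import torch.nn as nn

def quick_descent(lr, steps=300):
    torch.manual_seed(0)                                     # same start for every candidate
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    x, y = torch.randn(512, 32), torch.randn(512, 1)         # synthetic stand-in data
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

candidates = [1e-3, 3e-4, 1e-4]
alpha_max = min(candidates, key=quick_descent)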
Code: Schedules in PyTorch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# model, dataloader, train_one_epoch, num_epochs, warmup_steps and total_steps
# are assumed to be defined elsewhere in the training script.
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Option 1: cosine annealing alone (stepped once per epoch, so T_max is in epochs)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

# Option 2: linear warmup → cosine decay (PyTorch 1.13+)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, dataloader)
    scheduler.step()  # once per epoch; if warmup_steps/total_steps count optimizer steps
                      # rather than epochs, call scheduler.step() once per batch inside
                      # train_one_epoch instead
Always call scheduler.step() once per epoch (or once per step, depending on the scheduler). Log the current lr during training — unexpected lr behavior is a common source of bugs.
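For that logging, the current learning rate can be read from either the scheduler or the optimizer; for example, inside the loop above (both accessors are standard PyTorch):

current_lr = scheduler.get_last_lr()[0]          # lr the scheduler last applied
# equivalently: current_lr = optimizer.param_groups[0]["lr"]
print(f"epoch {epoch}: lr = {current_lr:.2e}")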