Why One Learning Rate Is Never Optimal
A training run has phases, and each phase demands something different from the optimizer.
Training with a single fixed learning rate is like trying to parallel park at highway speed: fast enough to cover ground, but you will never settle precisely into the spot. Pick a rate small enough to park with, though, and you spend ages just reaching the right neighborhood. The obvious fix: start bold enough to move quickly, then ease off as you get close. That's all a learning rate schedule does.
Early training: The model starts from random weights. Loss is high, gradients are large and noisy (they depend on random initializations interacting with the full complexity of the data). A learning rate that is too large here can send parameters flying into a region of the loss surface that's hard to recover from.
Mid training: The model is making rapid progress. This is the phase that benefits most from an appropriately-sized learning rate — large enough to move quickly, small enough to not overshoot.
Late training: The model is near a good minimum. To settle precisely into the minimum rather than bouncing around it, we need a much smaller learning rate — detailed refinement of weights. Think of it like annealing metal: rapid cooling locks in whatever atomic structure currently exists; slow, gradual cooling lets atoms find lower-energy, more stable arrangements. A large learning rate near the minimum is like rapid cooling — it overshoots and locks in a slightly suboptimal solution. Slowing the learning rate (cooling down) lets the optimizer settle into fine-grained details of the loss surface that a large step would jump right over, reaching a sharper, better-defined minimum.
A fixed learning rate can be optimal for at most one of these phases. Learning rate schedules resolve the tension by making α a function of the current step or epoch.
Schedule 1: Step Decay
The simplest schedule: reduce α by a fixed factor γ every N epochs:

α_t = α₀ · γ^⌊t/N⌋

- α₀: initial learning rate
- γ: drop factor, typically 0.1 (divide by 10)
- t: current epoch
- N: epoch interval between drops
- ⌊·⌋: floor function
Example: α₀=0.1, γ=0.1, N=30.
- Epochs 0–29: α = 0.1
- Epochs 30–59: α = 0.01
- Epochs 60–89: α = 0.001
Step decay is simple and works well for image classification (ResNet was famously trained with step decay at epochs 30, 60, 90 of a 90-epoch run). The downside: the sudden 10× drops can cause momentary spikes in the training loss as the optimizer adjusts.
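As a concrete sketch, the worked example above maps directly onto PyTorch's built-in StepLR scheduler. The one-layer model and the empty epoch loop below are placeholders for a real training script:

import torch
from torch.optim.lr_scheduler import StepLR

model = torch.nn.Linear(10, 1)                              # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)     # α₀ = 0.1
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)      # multiply lr by 0.1 every 30 epochs

for epoch in range(90):
    # ... train for one epoch ...
    scheduler.step()   # lr: 0.1 for epochs 0–29, 0.01 for 30–59, 0.001 for 60–89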
Schedule 2: Cosine Annealing
Cosine annealing smoothly decays α from a maximum to a minimum following a cosine curve:
α_t = α_min + ½ · (α_max − α_min) · (1 + cos(π · t / T_max))

- α_max: maximum learning rate (at start)
- α_min: minimum learning rate (at end, often 0 or 1e-6)
- t: current step
- T_max: total steps in the schedule
At t = 0 the cosine term is +1, so α = α_max; at t = T_max it is −1, so α = α_min. In between, α decreases smoothly: slowly at first, fastest near the midpoint, then very gently as it approaches the minimum.
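A few lines of plain Python make that endpoint behavior easy to check. This is just the formula above written out, not a framework API:

import math

def cosine_lr(t, t_max, lr_max, lr_min=0.0):
    # α at step t, decayed from lr_max to lr_min over t_max steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))

print(cosine_lr(0, 1000, 1e-3))      # ≈ 1e-3  (α_max at t = 0)
print(cosine_lr(500, 1000, 1e-3))    # ≈ 5e-4  (midpoint)
print(cosine_lr(1000, 1000, 1e-3))   # ≈ 0.0   (α_min at t = T_max)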
Cosine restarts: a popular variant (SGDR) resets α to α_max periodically. This allows the model to escape local minima between restarts and creates natural "snapshot" points for ensembling (more in Unit 7).
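PyTorch ships this variant as CosineAnnealingWarmRestarts; a minimal sketch, with a toy optimizer and illustrative cycle lengths:

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = torch.optim.SGD(torch.nn.Linear(10, 1).parameters(), lr=0.1)
# First cycle lasts T_0 scheduler steps; T_mult=2 doubles each later cycle (10, 20, 40, ...)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-6)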
Schedule 3: Linear Warmup + Decay
Used by virtually all large transformer models (BERT, GPT, T5, and their descendants). The schedule has two phases:
Phase 1 — Warmup: Increase α linearly from 0 (or a small value) to α_max over the first T_w steps:

α_t = α_max · t / T_w   (for t ≤ T_w)

- t: current step
- T_w: number of warmup steps
- α_max: peak learning rate
Phase 2 — Decay: After warmup, decay α (often with cosine or inverse square root) over the remaining T_total − T_w steps. With cosine decay, this is the cosine-annealing formula above applied with step t − T_w and horizon T_total − T_w.

- T_total: total training steps
- T_w: number of warmup steps
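Putting the two phases together in plain Python, as a sketch of the formulas above (cosine decay to 0 in phase 2; symbol names follow the definitions just given):

import math

def warmup_cosine_lr(t, t_warmup, t_total, lr_max):
    if t < t_warmup:                                   # Phase 1: linear ramp 0 → α_max
        return lr_max * t / t_warmup
    progress = (t - t_warmup) / (t_total - t_warmup)   # Phase 2: cosine decay to 0
    return 0.5 * lr_max * (1 + math.cos(math.pi * progress))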
Why Warmup?
Random weight initialization means the first gradient steps are computed with a model that's essentially random. The gradient signal at this point has high variance — some batches push parameters one way, others push the opposite way. A large α amplifies this noise and can create large, destructive parameter updates that take many steps to recover from.
By starting with a very small α and ramping up, we allow the model to first settle into a region of the loss landscape where gradients are more consistent, then accelerate once the signal is trustworthy.
Practical Comparison
| Schedule | Pros | Cons | Best for |
|---|---|---|---|
| Step decay | Simple, interpretable | Abrupt transitions | CNNs for image classification |
| Cosine annealing | Smooth, well-behaved | Requires choosing T_max | General purpose |
| Warmup + cosine | Best final accuracy | Two hyperparameters (T_w, α_max) | Transformers, large models |
Rule of thumb: use cosine annealing + short warmup (≈4% of total steps) for most modern architectures. Set α_max via a brief sweep (try 1e-3, 3e-4, 1e-4 and pick the fastest initial descent).
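One way to run that sweep is a short throwaway loop. The dummy model and synthetic data below are placeholders for a real pipeline, and "fastest initial descent" is approximated here as "lowest loss after a few hundred steps":

import torch
import torch.nn as nn

def quick_descent(lr, steps=300):
    torch.manual_seed(0)                                     # same start for every candidate
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    x, y = torch.randn(512, 32), torch.randn(512, 1)         # synthetic stand-in data
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

candidates = [1e-3, 3e-4, 1e-4]
alpha_max = min(candidates, key=quick_descent)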
Code: Schedules in PyTorch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# model, dataloader, train_one_epoch, num_epochs, warmup_steps and total_steps
# are assumed to be defined elsewhere in the training script.
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Option 1: cosine annealing alone (stepped once per epoch, so T_max is in epochs)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

# Option 2: linear warmup → cosine decay (PyTorch 1.13+)
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, dataloader)
    scheduler.step()  # once per epoch; if warmup_steps/total_steps count optimizer steps
                      # rather than epochs, call scheduler.step() once per batch inside
                      # train_one_epoch instead
Always call scheduler.step() once per epoch (or once per step, depending on the scheduler). Log the current lr during training — unexpected lr behavior is a common source of bugs.
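For that logging, the current learning rate can be read from either the scheduler or the optimizer; for example, inside the loop above (both accessors are standard PyTorch):

current_lr = scheduler.get_last_lr()[0]          # lr the scheduler last applied
# equivalently: current_lr = optimizer.param_groups[0]["lr"]
print(f"epoch {epoch}: lr = {current_lr:.2e}")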