The Problem with Fixed Learning Rates
The ideal learning rate is not constant throughout training. It changes based on where you are in the optimization:
- Early training: weights are randomly initialized, gradients are noisy and unreliable. A large learning rate causes explosions. A small rate is needed for stability.
- Middle training: the model is learning; you want a large enough rate to make progress.
- Late training: you're close to a good solution; large steps cause overshooting and oscillation. You want small, careful steps.
Learning rate schedules are used in every serious training run — warmup + cosine decay is the default recipe for training transformers, and cyclical schedules are standard in image classification. Getting the schedule wrong is one of the most common reasons a model underperforms despite correct architecture and data.
A learning rate schedule is a function η(t) that gives the appropriate learning rate at each training step t.
Schedule 1: Linear Warmup
The linear warmup schedule linearly increases the learning rate from 0 to η_max over the first T_warmup steps:

η(t) = η_max · t / T_warmup    for t ≤ T_warmup

- η(t) — learning rate at step t
- η_max — peak learning rate
- T_warmup — number of warmup steps
Why warmup is necessary for Adam specifically: At step 1, Adam's second moment estimate is v₁ = (1 − β₂) · g₁². With β₂=0.999, this is only 0.1% of the squared gradient. The bias-corrected estimate is v̂₁ = v₁ / (1 − β₂) = g₁² — which is correct but based on a single noisy gradient. If that gradient is unusually large (which is common at random initialization), the second moment gets poisoned and subsequent learning rates become unreliable.
Warmup keeps the actual step size small during this unreliable period, regardless of what the moment estimates say.
Rule of thumb: warm up for 5–10% of total training steps, or at minimum 1000 steps for large models.
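As a minimal sketch of the warmup rule in plain Python (the function name and arguments are illustrative, not from any library):

def warmup_lr(step, max_lr, warmup_steps):
    # Ramp linearly from 0 at step 0 to max_lr at warmup_steps, then hold
    return max_lr * min(1.0, step / warmup_steps)

For example, with max_lr=3e-4 and warmup_steps=1000, step 500 gives 1.5e-4, halfway up the ramp.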
Schedule 2: Cosine Annealing
After warmup, you need to decay the learning rate. Cosine annealing is the most popular decay schedule.
The cosine function starts at 1, smoothly curves down to -1 over half a rotation. We use only the first half (0 to π), which gives a smooth S-shaped decay from maximum to minimum — fast decay in the middle, slow at both ends.

η(t) = η_min + ½ · (η_max − η_min) · (1 + cos(π · t / T))
- η_min — minimum learning rate at end of training (often 0 or 10% of peak)
- η_max — maximum (peak) learning rate
- T — total number of training steps
- t — current step
At t = 0: cos(0) = 1, so η = η_max. At t = T/2: cos(π/2) = 0, so η = (η_max + η_min)/2. At t = T: cos(π) = −1, so η = η_min.
The cosine shape is important: it decays slowly at first (when loss is still decreasing rapidly and you can afford a larger LR) and faster near the end (when the model is converging and needs small steps). A linear decay does not match the typical learning curve shape.
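A minimal sketch of the cosine rule as a plain Python function (names are illustrative; PyTorch's CosineAnnealingLR implements the same curve, as shown in the code section below):

import math

def cosine_lr(step, max_lr, min_lr, total_steps):
    # Smoothly decay from max_lr at step 0 to min_lr at total_steps
    progress = step / total_steps  # runs from 0 to 1
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))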
Schedule 3: Cosine with Warm Restarts (SGDR)
The cosine-with-warm-restarts schedule (SGDR; Loshchilov & Hutter, 2017) extends cosine annealing by periodically resetting the learning rate:

η(t) = η_min + ½ · (η_max − η_min) · (1 + cos(π · T_cur / Tᵢ))

- T_cur — steps since the last restart
- Tᵢ — period of the current restart cycle
Each restart "shakes" the optimizer out of a local region and lets it explore elsewhere before cooling down again. Cycles often grow longer (e.g., T₁=100, T₂=200, T₃=400) to give each cycle time to converge.
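PyTorch ships this schedule as CosineAnnealingWarmRestarts. A minimal sketch with the growing cycle lengths from the example above (the stand-in parameter exists only to make the snippet runnable):

import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=3e-4)
# T_0=100 is the first cycle length; T_mult=2 doubles each cycle: 100, 200, 400, ...
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=100, T_mult=2, eta_min=3e-5)

Call scheduler.step() once per training step, as in the loop at the end of this section.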
Schedule 4: Polynomial Decay
Simple to tune, occasionally used for fine-tuning:

η(t) = (η_max − η_min) · (1 − t/T)^p + η_min

- p — polynomial power — p=1 is linear, p=2 is quadratic
Power p=1 gives linear decay. Power p=2 gives fast early decay and slow late decay (at t = T/2 the factor (1 − t/T)² has already fallen to 0.25). Less common than cosine in current practice.
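A sketch of polynomial decay as a plain function (illustrative names; recent PyTorch versions also ship a PolynomialLR scheduler):

def polynomial_lr(step, max_lr, min_lr, total_steps, power=2.0):
    # (1 - t/T)^p: with power=2 most of the decay happens early
    remaining = 1.0 - step / total_steps
    return (max_lr - min_lr) * remaining ** power + min_lr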
The Combined Schedule: Warmup + Cosine Decay
The standard recipe for modern deep learning (GPT, BERT, Llama, and derivatives):

η(t) = η_max · t / T_warmup    if t < T_warmup
η(t) = η_min + ½ (η_max − η_min)(1 + cos(π (t − T_warmup) / (T − T_warmup)))    if t ≥ T_warmup

- T_warmup — warmup steps
- T — total training steps
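Many training codebases implement the combined schedule as one pure function rather than chaining library schedulers; a sketch under the definitions above (names illustrative):

import math

def warmup_cosine_lr(step, max_lr, min_lr, warmup_steps, total_steps):
    # Phase 1: linear warmup from 0 to max_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    # Phase 2: cosine decay from max_lr to min_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

The equivalent PyTorch scheduler chain follows.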
In Code
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
total_steps = 10_000  # total optimizer steps for the run (set to your own budget)

# Linear warmup: start_factor=0.001 avoids a literal zero LR at step 0
warmup = LinearLR(optimizer, start_factor=0.001, end_factor=1.0, total_iters=1000)
# Cosine decay after warmup, down to 10% of the peak LR
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - 1000, eta_min=3e-5)
# Chain them: switch from warmup to cosine at step 1000
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
# Training loop
for step, batch in enumerate(dataloader):
    loss = model(batch)  # assumes the forward pass returns the loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip before the update
    optimizer.step()
    scheduler.step()  # advance the LR schedule once per step, not per epoch