Advanced Optimization · Lesson 8

Learning Rate Schedules: Warmup, Cosine, and Beyond

Why fixed learning rates fail. Linear warmup rationale. Cosine annealing derivation. Polynomial decay. SGDR warm restarts. Practical recipe for transformers.


Quick refresher

Adam optimizer

Adam maintains an EMA of gradients (first moment, β₁=0.9) and squared gradients (second moment, β₂=0.999). The update is α·m̂ₜ/(√v̂ₜ + ε). The learning rate α scales every update uniformly — it can be changed during training without breaking the algorithm.

Example

If you halve α mid-training, all subsequent steps are half as large.

The moment estimates m and v continue accumulating from their current state — they don't reset.

A schedule is just a rule for how to set α at each step.
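
In PyTorch, for instance, that rule can be applied by overwriting the optimizer's learning rate before each step. A minimal sketch of the halving example above (the step count and rates are illustrative, and model and dataloader are assumed to be defined):

import torch

# Sketch: halve the learning rate partway through training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step, batch in enumerate(dataloader):
    lr = 1e-3 if step < 5000 else 5e-4    # the "schedule": a rule mapping step -> α
    for group in optimizer.param_groups:
        group["lr"] = lr                  # only the step scale changes; m and v keep accumulating
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()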

The Problem with Fixed Learning Rates

The ideal learning rate is not constant throughout training. It changes based on where you are in the optimization:

  • Early training: weights are randomly initialized, gradients are noisy and unreliable. A large learning rate causes explosions. A small rate is needed for stability.
  • Middle training: the model is learning; you want a large enough rate to make progress.
  • Late training: you're close to a good solution; large steps cause overshooting and oscillation. You want small, careful steps.

Learning rate schedules are used in every serious training run — warmup + cosine decay is the default recipe for training transformers, and cyclical schedules are standard in image classification. Getting the schedule wrong is one of the most common reasons a model underperforms despite correct architecture and data.

A learning rate schedule is a function α(t) that gives the appropriate learning rate at each training step t.

Schedule 1: Linear Warmup

Linear warmup increases the learning rate linearly from 0 to the peak rate α_peak over the first N_warmup steps:

\alpha(t) = \alpha_{\text{peak}} \cdot \frac{t}{N_{\text{warmup}}} \quad \text{for } t \leq N_{\text{warmup}}

α(t): learning rate at step t
N_warmup: number of warmup steps

Why warmup is necessary for Adam specifically: at step 1, Adam's second moment is v₁ = (1 − β₂)g₁². With β₂ = 0.999, this is only 0.1% of the squared gradient. The bias-corrected estimate is v̂₁ = v₁/(1 − 0.999) = g₁², which is correct but based on a single noisy gradient. If that gradient is unusually large (which is common at random initialization), the second moment gets poisoned and subsequent step sizes become unreliable.

Warmup keeps the actual step size small during this unreliable period, regardless of what the moment estimates say.

Rule of thumb: warm up for 5–10% of total training steps, or at minimum 1000 steps for large models.
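
As a sketch, the warmup rule is a couple of lines of Python (the helper name and arguments are illustrative, not from any library):

def warmup_lr(step, peak_lr, warmup_steps):
    # Linear warmup: ramp from 0 to peak_lr over warmup_steps, then hold at peak_lr
    return peak_lr * min(1.0, step / warmup_steps)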

Schedule 2: Cosine Annealing

After warmup, you need to decay the learning rate. Cosine annealing is the most popular decay schedule.

The cosine function falls smoothly from 1 to −1 over the interval 0 to π. Mapping training progress onto that interval gives a smooth S-shaped decay from the maximum to the minimum learning rate: slow at both ends, fastest in the middle.

\alpha(t) = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

α_min: minimum learning rate at the end of training (often 0 or 10% of peak)
α_max: maximum (peak) learning rate
T: total number of training steps
t: current step

At t = 0: cos(0) = 1, so α = α_max. At t = T: cos(π) = −1, so α = α_min. At t = T/2: cos(π/2) = 0, so α = (α_max + α_min)/2.

The cosine shape is important: it decays slowly at first (while the loss is still dropping quickly and a large learning rate is productive), fastest in the middle of training, and slowly again at the end (when the model is converging and needs small, stable steps). A linear decay shrinks the learning rate at the same pace throughout, which matches typical learning curves less well.
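
A minimal sketch of the formula above (the helper name is illustrative):

import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Cosine annealing: lr_max at step 0, lr_min at step total_steps
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))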

Schedule 3: Cosine with Warm Restarts (SGDR)

SGDR, or cosine annealing with warm restarts (Loshchilov & Hutter, 2017), extends cosine annealing by periodically resetting the learning rate back to its maximum:

\alpha(t) = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi T_{\text{cur}}}{T_i}\right)\right)

T_cur: steps since the last restart
T_i: length of the current restart cycle, in steps

Each restart "shakes" the optimizer out of a local region and lets it explore elsewhere before cooling down again. Cycles often grow longer (e.g., T₁=100, T₂=200, T₃=400) to give each cycle time to converge.

Schedule 4: Polynomial Decay

Simple to tune, occasionally used for fine-tuning:

\alpha(t) = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)^p

p: polynomial power (p = 1 is linear, p = 2 is quadratic)

Power p = 1 gives linear decay. Power p = 2 gives fast early decay and slow late decay, so most of training is spent at a small learning rate. Polynomial decay is less common than cosine in current practice.
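
As a sketch (helper name and default are illustrative):

def poly_lr(step, total_steps, lr0, power=1.0):
    # Polynomial decay from lr0 down to 0; power=1 is linear, power=2 is quadratic
    return lr0 * (1 - step / total_steps) ** power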

The Combined Schedule: Warmup + Cosine Decay

The standard recipe for modern deep learning (GPT, BERT, Llama, and derivatives):

\alpha(t) = \begin{cases} \alpha_{\text{peak}} \cdot \dfrac{t}{N_w} & t \leq N_w \\[6pt] \alpha_{\min} + \dfrac{\alpha_{\text{peak}} - \alpha_{\min}}{2}\left(1 + \cos\dfrac{\pi(t - N_w)}{T - N_w}\right) & t > N_w \end{cases}

N_w: warmup steps
T: total training steps

In Code

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# lr=3e-4 is the peak learning rate; the schedulers below scale it
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 10_000     # total optimizer steps for the run (set from your data and epochs)
warmup_steps = 1000

# Linear warmup: ramp from ~0 up to the peak over the first 1000 steps
warmup = LinearLR(optimizer, start_factor=0.001, end_factor=1.0, total_iters=warmup_steps)

# Cosine decay over the remaining steps, down to eta_min
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=3e-5)

# Chain them: switch from warmup to cosine at step 1000
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Training loop (model and dataloader assumed to be defined; model returns a scalar loss)
for step, batch in enumerate(dataloader):
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradient norm to 1.0
    optimizer.step()
    scheduler.step()    # advance the schedule once per optimizer step

Quiz

Question 1 of 3

Why is learning rate warmup important for transformers but often unnecessary for small CNNs?