Advanced Optimization · Lesson 8

Learning Rate Schedules: Warmup, Cosine, and Beyond

Why fixed learning rates fail. Linear warmup rationale. Cosine annealing derivation. Polynomial decay. SGDR warm restarts. Practical recipe for transformers.


Quick refresher

Adam optimizer

Adam maintains an EMA of gradients (first moment, β₁=0.9) and squared gradients (second moment, β₂=0.999). The update is α·m̂ₜ/(√v̂ₜ + ε). The learning rate α scales every update uniformly — it can be changed during training without breaking the algorithm.

Example

If you halve α mid-training, all subsequent steps are half as large.

The moment estimates m and v continue accumulating from their current state — they don't reset.

A schedule is just a rule for how to set α at each step.
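
In PyTorch, for instance, that rule can be applied by overwriting the optimizer's learning rate before each step. A minimal sketch of the halving example above (the step count and rates are illustrative, and model and dataloader are assumed to be defined):

import torch

# Sketch: halve the learning rate partway through training.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step, batch in enumerate(dataloader):
    lr = 1e-3 if step < 5000 else 5e-4    # the "schedule": a rule mapping step -> α
    for group in optimizer.param_groups:
        group["lr"] = lr                  # only the step scale changes; m and v keep accumulating
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()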

The Problem with Fixed Learning Rates

The ideal learning rate is not constant throughout training. It changes based on where you are in the optimization:

  • Early training: weights are randomly initialized, gradients are noisy and unreliable. A large learning rate causes explosions. A small rate is needed for stability.
  • Middle training: the model is learning; you want a large enough rate to make progress.
  • Late training: you're close to a good solution; large steps cause overshooting and oscillation. You want small, careful steps.

Learning rate schedules are used in every serious training run — warmup + cosine decay is the default recipe for training transformers, and cyclical schedules are standard in image classification. Getting the schedule wrong is one of the most common reasons a model underperforms despite correct architecture and data.

A learning rate schedule is a function α(t) that gives the appropriate learning rate at each training step t.

Schedule 1: Linear Warmup

Linear warmup increases the learning rate linearly from 0 to the peak rate α_peak over the first N_warmup steps:

\alpha(t) = \alpha_{\text{peak}} \cdot \frac{t}{N_{\text{warmup}}} \quad \text{for } t \leq N_{\text{warmup}}

α(t): learning rate at step t
N_warmup: number of warmup steps

Why warmup is necessary for Adam specifically: at step 1, Adam's second moment is v₁ = (1 − β₂)g₁². With β₂ = 0.999, this is only 0.1% of the squared gradient. The bias-corrected estimate is v̂₁ = v₁/(1 − 0.999) = g₁², which is correct but based on a single noisy gradient. If that gradient is unusually large (which is common at random initialization), the second moment gets poisoned and subsequent step sizes become unreliable.

Warmup keeps the actual step size small during this unreliable period, regardless of what the moment estimates say.

Rule of thumb: warm up for 5–10% of total training steps, or at minimum 1000 steps for large models.
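
As a sketch, the warmup rule is a couple of lines of Python (the helper name and arguments are illustrative, not from any library):

def warmup_lr(step, peak_lr, warmup_steps):
    # Linear warmup: ramp from 0 to peak_lr over warmup_steps, then hold at peak_lr
    return peak_lr * min(1.0, step / warmup_steps)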

Schedule 2: Cosine Annealing

After warmup, you need to decay the learning rate. Cosine annealing is the most popular decay schedule.

The cosine function falls smoothly from 1 to −1 over the interval 0 to π. Mapping training progress onto that interval gives a smooth S-shaped decay from the maximum to the minimum learning rate: slow at both ends, fastest in the middle.

\alpha(t) = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)

α_min: minimum learning rate at the end of training (often 0 or 10% of peak)
α_max: maximum (peak) learning rate
T: total number of training steps
t: current step

At t = 0: cos(0) = 1, so α = α_max. At t = T: cos(π) = −1, so α = α_min. At t = T/2: cos(π/2) = 0, so α = (α_max + α_min)/2.

The cosine shape is important: it decays slowly at first (while the loss is still dropping quickly and a large learning rate is productive), fastest in the middle of training, and slowly again at the end (when the model is converging and needs small, stable steps). A linear decay shrinks the learning rate at the same pace throughout, which matches typical learning curves less well.
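
A minimal sketch of the formula above (the helper name is illustrative):

import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Cosine annealing: lr_max at step 0, lr_min at step total_steps
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))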

Schedule 3: Cosine with Warm Restarts (SGDR)

SGDR, or cosine annealing with warm restarts (Loshchilov & Hutter, 2017), extends cosine annealing by periodically resetting the learning rate back to its maximum:

\alpha(t) = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi T_{\text{cur}}}{T_i}\right)\right)

T_cur: steps since the last restart
T_i: length of the current restart cycle, in steps

Each restart "shakes" the optimizer out of a local region and lets it explore elsewhere before cooling down again. Cycles often grow longer (e.g., T₁=100, T₂=200, T₃=400) to give each cycle time to converge.

Schedule 4: Polynomial Decay

Simple to tune, occasionally used for fine-tuning:

\alpha(t) = \alpha_0 \cdot \left(1 - \frac{t}{T}\right)^p

p: polynomial power (p = 1 is linear, p = 2 is quadratic)

Power p = 1 gives linear decay. Power p = 2 gives fast early decay and slow late decay, so most of training is spent at a small learning rate. Polynomial decay is less common than cosine in current practice.
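
As a sketch (helper name and default are illustrative):

def poly_lr(step, total_steps, lr0, power=1.0):
    # Polynomial decay from lr0 down to 0; power=1 is linear, power=2 is quadratic
    return lr0 * (1 - step / total_steps) ** power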

The Combined Schedule: Warmup + Cosine Decay

The standard recipe for modern deep learning (GPT, BERT, Llama, and derivatives):

\alpha(t) = \begin{cases} \alpha_{\text{peak}} \cdot \dfrac{t}{N_w} & t \leq N_w \\[6pt] \alpha_{\min} + \dfrac{\alpha_{\text{peak}} - \alpha_{\min}}{2}\left(1 + \cos\dfrac{\pi(t - N_w)}{T - N_w}\right) & t > N_w \end{cases}

N_w: warmup steps
T: total training steps

In Code

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

# lr=3e-4 is the peak learning rate; the schedulers below scale it
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

total_steps = 10_000     # total optimizer steps for the run (set from your data and epochs)
warmup_steps = 1000

# Linear warmup: ramp from ~0 up to the peak over the first 1000 steps
warmup = LinearLR(optimizer, start_factor=0.001, end_factor=1.0, total_iters=warmup_steps)

# Cosine decay over the remaining steps, down to eta_min
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=3e-5)

# Chain them: switch from warmup to cosine at step 1000
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# Training loop (model and dataloader assumed to be defined; model returns a scalar loss)
for step, batch in enumerate(dataloader):
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradient norm to 1.0
    optimizer.step()
    scheduler.step()    # advance the schedule once per optimizer step

Quiz

Question 1 of 3

Why is learning rate warmup important for transformers but often unnecessary for small CNNs?