The Decision Guide
After six lessons on specific algorithms, here is how to choose in practice. This is the knowledge accumulated across hundreds of papers and thousands of training runs.
The optimizer you choose can make the difference between a model that converges in hours and one that doesn't converge at all. This guide distills the practical conventions used in industry — the defaults that experienced ML engineers reach for first.
By Architecture: The Primary Decision
Transformers / NLP
Always start here:
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999),
weight_decay=0.01, eps=1e-8)
# + linear warmup for 5-10% of steps
# + cosine decay to 3e-5
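The warmup-plus-cosine schedule in those comments can be written as a plain LambdaLR. A minimal sketch, assuming the AdamW optimizer above, 10,000 total steps, and warmup over the first 5% of them (all illustrative numbers, not prescriptions):
import math
from torch.optim.lr_scheduler import LambdaLR

total_steps = 10_000                 # illustrative; use your real step budget
warmup_steps = total_steps // 20     # 5% linear warmup
min_lr_ratio = 3e-5 / 3e-4           # decay floor relative to the peak LR

def lr_lambda(step):
    # Linear warmup from 0 up to the peak LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Cosine decay from the peak LR down to min_lr_ratio * peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() after each optimizer.step()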
Why not Adam? The decoupled weight decay in AdamW matters for large models — it provides more consistent regularization than L2 in the gradient.
Why not SGD? Gradient scales in a transformer vary widely across layers and parameter types, so a single global learning rate is hard to tune; the adaptive per-parameter rates in Adam are essential in practice.
CNNs / Vision
Default: SGD + momentum + cosine or step decay. This is the recipe documented in "Bag of Tricks for Image Classification with Convolutional Neural Networks" (He et al., 2019), and it reproduces reliably.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
weight_decay=1e-4, nesterov=True)
# Cosine decay from 0.1 to 0
# OR step decay: multiply LR by 0.1 at epochs 30, 60, 80
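Both schedules in those comments map onto built-in schedulers. A minimal sketch, assuming the SGD optimizer above, a 90-epoch run, and scheduler.step() called once per epoch (pick one):
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

# Cosine decay from 0.1 down to 0 over 90 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=90)

# OR classic step decay: multiply the LR by 0.1 at epochs 30, 60, 80
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)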
Note the learning rate: 0.1, not 3e-4. SGD can use much larger learning rates than Adam because there's no adaptive scaling. This often leads to better final accuracy (Adam can converge to slightly sharper minima).
Sparse Data (NLP Embeddings, Recommenders)
Use Adam or AdaGrad. The per-parameter adaptive rates are the key feature: sparse features get large effective learning rates when they do appear; dense features get smaller rates. RMSprop also works. SGD fails here without per-parameter rates.
# Word2Vec, embedding tables, recommendation systems
optimizer = Adam(model.parameters(), lr=0.001)
# Or AdaGrad for a single pass over very large datasets
optimizer = Adagrad(model.parameters(), lr=0.01)
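To see why the per-parameter accumulator matters, here is a toy illustration (not a real model): one weight receives a gradient every step, another only once every 20 steps, and AdaGrad's effective learning rate lr / √(Σg²) stays much larger for the rare one.
import torch

lr, eps = 0.01, 1e-10
accum = torch.zeros(2)   # AdaGrad's running sum of squared gradients, one entry per parameter
for step in range(100):
    grad = torch.tensor([1.0, 1.0 if step % 20 == 0 else 0.0])   # frequent vs. rare feature
    accum += grad ** 2
    effective_lr = lr / (accum.sqrt() + eps)

print(effective_lr)   # ≈ [0.0010, 0.0045]: the rarely-updated parameter keeps a ~4-5x larger effective LR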
Large Models (>1B parameters)
AdamW + gradient clipping. For very large models, gradient explosions are more common and more catastrophic. Always add:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
This is a global norm clip: the L2 norm ‖g‖₂ is computed over all parameters' gradients taken together, and if it exceeds max_norm, every gradient is scaled by max_norm / ‖g‖₂. This preserves the gradient direction while bounding its magnitude.
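The call has to sit between backward() and step(), after gradients exist but before they are applied. A minimal sketch of one training loop, assuming a model, a loss_fn, a data loader yielding (batch, targets), and the AdamW optimizer above (all placeholder names):
import torch

for batch, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()                              # gradients now exist on the parameters
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip before the update
    optimizer.step()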
Fine-tuning Pretrained Models
Rule: use a learning rate 10× to 100× smaller than the one used for pretraining.
# Pretrained with lr=3e-4 → fine-tune with lr=2e-5 to 5e-5
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Short warmup (200-500 steps), cosine decay to 0
Why so small? The pretrained weights encode everything an expensive pretraining run learned. Aggressive updates cause catastrophic forgetting: the new training data overwrites the representations that made the pretrained model valuable. Small LR = careful refinement of existing representations.
Tabular / Small Datasets
Try Adam first; also try SGD. On small datasets, Adam often converges faster and to a comparable minimum. But SGD can sometimes find better-generalizing solutions. Try both and pick by validation loss.
Common Gotchas
Gotcha 1: Forgetting Bias Correction
If you implement Adam from scratch, bias correction is non-optional. Both moment estimates start at zero and are biased toward zero early on; because β₂=0.999 decays so slowly, the second-moment estimate takes on the order of 1,000 steps to warm up, so the uncorrected denominator is far too small and the first updates come out several times larger than intended. The failure is easy to misdiagnose: early training is unstable or erratic with no obvious error message.
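For reference, here is a minimal sketch of a single Adam update for one tensor, with the two correction lines marked (an illustration of the formula, not a drop-in optimizer):
import torch

beta1, beta2, lr, eps = 0.9, 0.999, 3e-4, 1e-8

def adam_step(param, grad, m, v, t):
    # m, v start as torch.zeros_like(param); t counts update steps starting at 1
    m = beta1 * m + (1 - beta1) * grad           # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for m
    v_hat = v / (1 - beta2 ** t)                 # bias correction for v (omit these two lines and early steps are mis-scaled)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v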
Gotcha 2: Too Large LR with Adam
Adam's adaptive scaling makes each update roughly ±α per parameter regardless of gradient magnitude, so α directly sets the step size in parameter space. Starting too high (e.g., lr=0.01 instead of 3e-4) causes instability: Adam is sensitive to the absolute value of α in a way SGD, whose step size also scales with the raw gradient, is not.
Gotcha 3: No Warmup for Transformers
Training a large transformer from scratch without warmup almost always causes early instability or NaN loss. The noisy early gradients poison the second moment estimate. Warmup is not optional for transformers with random initialization.
Gotcha 4: Weight Decay in Adam vs AdamW
Using Adam(weight_decay=0.01) is NOT the same as AdamW. In PyTorch's Adam, weight decay is implemented as L2 regularization added to the gradient — it gets scaled by the adaptive rate. In AdamW, weight decay is applied directly to parameters. For transformers, use AdamW.
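Schematically, reusing the adam_step sketch from Gotcha 1 (illustrative values; a contrast of where the decay term enters, not PyTorch's exact internals):
# Adam(weight_decay=0.01): the L2 term is folded into the gradient, so each weight's
# decay gets rescaled by that weight's adaptive 1/sqrt(v_hat) factor.
grad = grad + 0.01 * param
param, m, v = adam_step(param, grad, m, v, t)

# AdamW(weight_decay=0.01): the gradient is left alone; the decay is applied
# directly and uniformly to the weights.
param, m, v = adam_step(param, grad, m, v, t)
param = param - lr * 0.01 * param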
Summary: The Optimizer Genealogy
| Optimizer | Adds over predecessor | Use case |
|---|---|---|
| Vanilla SGD | — | Baseline |
| SGD + Momentum | Velocity accumulation | CNNs, convex tasks |
| Nesterov | Look-ahead gradient | Better momentum convergence |
| AdaGrad | Per-parameter LR (cumulative sum) | Sparse features, one-pass |
| RMSprop | Per-parameter LR (EMA, so the rate never decays to zero) | RNNs, general adaptive |
| Adam | Momentum + RMSprop + bias correction | Default, transformers |
| AdamW | Decoupled weight decay | Transformers (standard) |
Each row adds exactly one idea from this unit. Together they tell a complete story of how practical deep learning optimization evolved.