The Decision Guide
After six lessons on specific algorithms, here is how to choose in practice. This is the knowledge accumulated across hundreds of papers and thousands of training runs.
The optimizer you choose can make the difference between a model that converges in hours and one that doesn't converge at all. This guide distills the practical conventions used in industry — the defaults that experienced ML engineers reach for first.
By Architecture: The Primary Decision
Transformers / NLP
Always start here:
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999),
weight_decay=0.01, eps=1e-8)
# + linear warmup for 5-10% of steps
# + cosine decay to 3e-5
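The warmup-plus-cosine schedule in those comments can be written as a plain LambdaLR. A minimal sketch, assuming the AdamW optimizer above, 10,000 total steps, and warmup over the first 5% of them (all illustrative numbers, not prescriptions):
import math
from torch.optim.lr_scheduler import LambdaLR

total_steps = 10_000                 # illustrative; use your real step budget
warmup_steps = total_steps // 20     # 5% linear warmup
min_lr_ratio = 3e-5 / 3e-4           # decay floor relative to the peak LR

def lr_lambda(step):
    # Linear warmup from 0 up to the peak LR
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    # Cosine decay from the peak LR down to min_lr_ratio * peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr_ratio + (1 - min_lr_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)   # call scheduler.step() after each optimizer.step()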
Why not Adam? The decoupled weight decay in AdamW matters for large models — it provides more consistent regularization than L2 in the gradient.
Why not SGD? Gradient scales in a transformer vary widely across layers and parameter types, so a single global learning rate is hard to tune; the adaptive per-parameter rates in Adam are essential in practice.
CNNs / Vision
Default: SGD + momentum + cosine or step decay. This is the recipe documented in "Bag of Tricks for Image Classification with Convolutional Neural Networks" (He et al., 2019), and it reproduces reliably.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
weight_decay=1e-4, nesterov=True)
# Cosine decay from 0.1 to 0
# OR step decay: multiply LR by 0.1 at epochs 30, 60, 80
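Both schedules in those comments map onto built-in schedulers. A minimal sketch, assuming the SGD optimizer above, a 90-epoch run, and scheduler.step() called once per epoch (pick one):
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

# Cosine decay from 0.1 down to 0 over 90 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=90)

# OR classic step decay: multiply the LR by 0.1 at epochs 30, 60, 80
scheduler = MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)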
Note the learning rate: 0.1, not 3e-4. SGD can use much larger learning rates than Adam because there's no adaptive scaling. This often leads to better final accuracy (Adam can converge to slightly sharper minima).
Sparse Data (NLP Embeddings, Recommenders)
Use Adam or AdaGrad. The per-parameter adaptive rates are the key feature: sparse features get large effective learning rates when they do appear; dense features get smaller rates. RMSprop also works. SGD fails here without per-parameter rates.
# Word2Vec, embedding tables, recommendation systems
optimizer = Adam(model.parameters(), lr=0.001)
# Or AdaGrad for a single pass over very large datasets
optimizer = Adagrad(model.parameters(), lr=0.01)
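To see why the per-parameter accumulator matters, here is a toy illustration (not a real model): one weight receives a gradient every step, another only once every 20 steps, and AdaGrad's effective learning rate lr / √(Σg²) stays much larger for the rare one.
import torch

lr, eps = 0.01, 1e-10
accum = torch.zeros(2)   # AdaGrad's running sum of squared gradients, one entry per parameter
for step in range(100):
    grad = torch.tensor([1.0, 1.0 if step % 20 == 0 else 0.0])   # frequent vs. rare feature
    accum += grad ** 2
    effective_lr = lr / (accum.sqrt() + eps)

print(effective_lr)   # ≈ [0.0010, 0.0045]: the rarely-updated parameter keeps a ~4-5x larger effective LR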
Large Models (>1B parameters)
AdamW + gradient clipping. For very large models, gradient explosions are more common and more catastrophic. Always add:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
This is a global norm clip: the L2 norm ‖g‖₂ is computed over all parameters' gradients taken together, and if it exceeds max_norm, every gradient is scaled by max_norm / ‖g‖₂. This preserves the gradient direction while bounding its magnitude.
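The call has to sit between backward() and step(), after gradients exist but before they are applied. A minimal sketch of one training loop, assuming a model, a loss_fn, a data loader yielding (batch, targets), and the AdamW optimizer above (all placeholder names):
import torch

for batch, targets in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(batch), targets)
    loss.backward()                              # gradients now exist on the parameters
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip before the update
    optimizer.step()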
Fine-tuning Pretrained Models
Rule: use a learning rate 10× to 100× smaller than the one used for pretraining.
# Pretrained with lr=3e-4 → fine-tune with lr=2e-5 to 5e-5
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
# Short warmup (200-500 steps), cosine decay to 0
Why so small? The pretrained weights encode everything an expensive pretraining run learned. Aggressive updates cause catastrophic forgetting: the new training data overwrites the representations that made the pretrained model valuable. Small LR = careful refinement of existing representations.
Tabular / Small Datasets
Try Adam first; also try SGD. On small datasets, Adam often converges faster and to a comparable minimum. But SGD can sometimes find better-generalizing solutions. Try both and pick by validation loss.
Common Gotchas
Gotcha 1: Forgetting Bias Correction
If you implement Adam from scratch, bias correction is non-optional. Both moment estimates start at zero and are biased toward zero early on; because β₂=0.999 decays so slowly, the second-moment estimate takes on the order of 1,000 steps to warm up, so the uncorrected denominator is far too small and the first updates come out several times larger than intended. The failure is easy to misdiagnose: early training is unstable or erratic with no obvious error message.
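For reference, here is a minimal sketch of a single Adam update for one tensor, with the two correction lines marked (an illustration of the formula, not a drop-in optimizer):
import torch

beta1, beta2, lr, eps = 0.9, 0.999, 3e-4, 1e-8

def adam_step(param, grad, m, v, t):
    # m, v start as torch.zeros_like(param); t counts update steps starting at 1
    m = beta1 * m + (1 - beta1) * grad           # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for m
    v_hat = v / (1 - beta2 ** t)                 # bias correction for v (omit these two lines and early steps are mis-scaled)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v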
Gotcha 2: Too Large LR with Adam
Adam's adaptive scaling makes each update roughly ±α per parameter regardless of gradient magnitude, so α directly sets the step size in parameter space. Starting too high (e.g., lr=0.01 instead of 3e-4) causes instability: Adam is sensitive to the absolute value of α in a way SGD, whose step size also scales with the raw gradient, is not.
Gotcha 3: No Warmup for Transformers
Training a large transformer from scratch without warmup almost always causes early instability or NaN loss. The noisy early gradients poison the second moment estimate. Warmup is not optional for transformers with random initialization.
Gotcha 4: Weight Decay in Adam vs AdamW
Using Adam(weight_decay=0.01) is NOT the same as AdamW. In PyTorch's Adam, weight decay is implemented as L2 regularization added to the gradient — it gets scaled by the adaptive rate. In AdamW, weight decay is applied directly to parameters. For transformers, use AdamW.
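Schematically, reusing the adam_step sketch from Gotcha 1 (illustrative values; a contrast of where the decay term enters, not PyTorch's exact internals):
# Adam(weight_decay=0.01): the L2 term is folded into the gradient, so each weight's
# decay gets rescaled by that weight's adaptive 1/sqrt(v_hat) factor.
grad = grad + 0.01 * param
param, m, v = adam_step(param, grad, m, v, t)

# AdamW(weight_decay=0.01): the gradient is left alone; the decay is applied
# directly and uniformly to the weights.
param, m, v = adam_step(param, grad, m, v, t)
param = param - lr * 0.01 * param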
Summary: The Optimizer Genealogy
| Optimizer | Adds over predecessor | Use case |
|---|---|---|
| Vanilla SGD | — | Baseline |
| SGD + Momentum | Velocity accumulation | CNNs, convex tasks |
| Nesterov | Look-ahead gradient | Better momentum convergence |
| AdaGrad | Per-parameter LR (cumulative sum) | Sparse features, one-pass |
| RMSprop | Per-parameter LR (EMA, so the rate never decays to zero) | RNNs, general adaptive |
| Adam | Momentum + RMSprop + bias correction | Default, transformers |
| AdamW | Decoupled weight decay | Transformers (standard) |
Each row adds exactly one idea from this unit. Together they tell a complete story of how practical deep learning optimization evolved.