Two Good Ideas, Combined
The previous two lessons introduced two insights:
- Momentum (Lesson 7): maintain a running average of gradients to smooth oscillations and accumulate speed in consistent directions
- RMSprop (Lesson 8): maintain a running average of squared gradients to give each parameter its own adaptive learning rate
Adam (Adaptive Moment Estimation) combines both ideas in one algorithm, adds a critical numerical fix called bias correction, and packages it with robust defaults that work across a wide range of problems.
Adam is the default optimizer for training transformers, diffusion models, and most modern neural networks. Understanding how it combines momentum and adaptive rates explains why it converges faster and more reliably than vanilla SGD on most problems.
The Algorithm
Adam maintains two running statistics per parameter: the first moment (gradient direction) and the second moment (gradient variance). Their rates of change are governed by β₁ (momentum decay) and β₂ (variance decay). The full update at step t is:

mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²
m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)
θₜ = θₜ₋₁ − α · m̂ₜ / (√v̂ₜ + ε)

where:
- mₜ: first moment (EMA of the gradient)
- vₜ: second moment (EMA of the squared gradient)
- β₁: decay rate for the first moment (default 0.9)
- β₂: decay rate for the second moment (default 0.999)
- gₜ: current gradient
- t: current step number (starts at 1)
- m̂ₜ: bias-corrected first moment
- v̂ₜ: bias-corrected second moment
- α: learning rate (default 1e-3)
- ε: numerical stability constant (default 1e-8)
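To make the update rule concrete, here is a minimal sketch of one Adam step for a single scalar parameter in plain Python (the function name adam_step and the scalar setting are illustrative, not the PyTorch implementation):

```python
import math

def adam_step(theta, m, v, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * g          # first moment: EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g      # second moment: EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction (see next section)
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v
```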
The Bias Correction
Both m and v are initialized to zero vectors. In the early steps, they are biased toward zero because they haven't had time to "fill up" with gradient information.
Example at step t=1 with β₁=0.9:
Without correction: m₁ = 0.9 · 0 + 0.1 · g₁ = 0.1 · g₁ — only 10% of the actual gradient!
The bias-corrected estimate: m̂₁ = m₁ / (1 − β₁¹) = m₁ / 0.1 = g₁
This restores the correct scale. As t grows, β₁ᵗ → 0 and the correction vanishes. By step 50 with β₁ = 0.9: 1 − 0.9⁵⁰ ≈ 0.995 — essentially no correction.
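To see how quickly the correction fades, you can print the correction factor 1 − β₁ᵗ for a few step counts (a tiny illustrative snippet):

```python
beta1 = 0.9
for t in [1, 5, 10, 50]:
    # divisor applied to the first moment at step t
    print(t, 1 - beta1 ** t)   # 0.1, 0.41, 0.65, 0.995 (rounded)
```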
Worked Numerical Example
Parameters: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8. Start: m=0, v=0, θ=2.0.
Step 1, gradient g=3.0:
First moment: m = 0.9 · 0 + 0.1 · 3.0 = 0.3
Second moment: v = 0.999 · 0 + 0.001 · 3.0² = 0.009
Bias-corrected: m̂ = 0.3 / (1 − 0.9¹) = 3.0, v̂ = 0.009 / (1 − 0.999¹) = 9.0
Update: Δθ = −α · m̂ / (√v̂ + ε) = −0.001 · 3.0 / (3.0 + 1e-8) ≈ −0.001
New value: θ = 2.0 − 0.001 = 1.999
Step 2, gradient g=3.0:
First moment: m = 0.9 · 0.3 + 0.1 · 3.0 = 0.57
Second moment: v = 0.999 · 0.009 + 0.001 · 9.0 = 0.017991
Bias-corrected: m̂ = 0.57 / (1 − 0.9²) = 3.0, v̂ = 0.017991 / (1 − 0.999²) = 9.0
Update: Δθ = −0.001 · 3.0 / (3.0 + 1e-8) ≈ −0.001, so θ = 1.999 − 0.001 = 1.998
Consistent gradients → consistent step size of ≈ α. This predictability is why Adam is easy to tune.
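You can reproduce these two steps with torch.optim.Adam on a single parameter by feeding it a constant gradient of 3.0 (a small sanity-check sketch; assigning .grad by hand is just for illustration):

```python
import torch

theta = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-3)   # defaults: betas=(0.9, 0.999), eps=1e-8

for step in range(1, 3):
    opt.zero_grad()
    theta.grad = torch.tensor([3.0])       # constant gradient, as in the worked example
    opt.step()
    print(step, round(theta.item(), 4))    # step 1: 1.999, step 2: 1.998
```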
Default Hyperparameters and Why They Work
| Parameter | Default | Role |
|---|---|---|
| α | 1e-3 | Overall step scale. This is the main tuning knob. |
| β₁ | 0.9 | ~10-step window for gradient direction |
| β₂ | 0.999 | ~1000-step window for gradient variance |
| ε | 1e-8 | Prevents division by zero |
β₂ = 0.999 gives a long window for estimating gradient variance — it's slow to react to changes in gradient scale, which is generally good (you want a stable estimate). β₁ = 0.9 adapts direction faster.
The practical rule: for most problems, keep β₁, β₂, and ε at their defaults and tune only α. Common values to try: 1e-3, 3e-4, 1e-4. If the loss oscillates, halve α; if it decreases too slowly, try 3× larger.
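In PyTorch these defaults correspond directly to the constructor arguments. Writing them out explicitly (equivalent to just passing lr=1e-3; assumes a model as in the next section) makes clear that α is the only knob you normally touch:

```python
import torch.optim as optim

optimizer = optim.Adam(
    model.parameters(),   # assumes an existing nn.Module named `model`
    lr=1e-3,              # α: the main tuning knob
    betas=(0.9, 0.999),   # (β₁, β₂)
    eps=1e-8,             # ε
)
```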
Code: Adam in PyTorch
```python
import torch.optim as optim

# Adam with default hyperparameters — works for most problems
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam + proper weight decay decoupling (preferred for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
```
Adam is the default optimizer in nearly every deep learning framework. When in doubt, start with Adam(lr=1e-3). If you see training instability early, try lr=3e-4. If you're training a transformer, use AdamW with a weight decay of 0.01–0.1.