Adam in One Sentence
Adam = momentum (an EMA of gradients) + RMSprop (an EMA of squared gradients) + bias correction.
Everything in this lesson follows from two lessons you have already seen: EMA (lesson 11-2) and RMSprop (lesson 11-6).
Adam is the default optimizer for almost every deep learning task — transformers, GANs, diffusion models, and most production models are trained with it. If you only learn one optimizer in depth, it should be this one.
The Algorithm: Line by Line
Initialize: m₀ = 0, v₀ = 0, t = 0. At each step:
Step 1: Compute gradient.
- gₜ = ∇L(θₜ₋₁) — gradient at step t
Step 2: Update first moment (EMA of gradients — momentum).
- mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ — first moment estimate — EMA of gradients
- β₁ — first moment decay rate (default: 0.9)
Step 3: Update second moment (EMA of squared gradients — RMSprop).
- vₜ = β₂·vₜ₋₁ + (1 − β₂)·gₜ² — second moment estimate — EMA of squared gradients
- β₂ — second moment decay rate (default: 0.999)
Step 4: Bias correction.
- m̂ₜ = mₜ / (1 − β₁ᵗ) — bias-corrected first moment
- v̂ₜ = vₜ / (1 − β₂ᵗ) — bias-corrected second moment
Step 5: Parameter update.
- θₜ = θₜ₋₁ − α·m̂ₜ / (√v̂ₜ + ε)
- α — learning rate (default: 3e-4)
- ε — numerical stability constant (default: 1e-8)
Default hyperparameters: α = 3e-4, β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
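Putting the five steps together, here is a minimal NumPy sketch of one Adam update on a single parameter tensor. The function name adam_step and its signature are illustrative, not from any library.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad                    # Step 2: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2                 # Step 3: EMA of squared gradients
    m_hat = m / (1 - beta1**t)                            # Step 4: bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Step 5: parameter update
    return theta, m, v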
Why Each Piece Matters
The numerator m̂ₜ is the bias-corrected momentum. It smooths the gradient direction over roughly the last 10 steps (the effective window 1/(1 − β₁) with β₁ = 0.9), reducing noise and accelerating in consistent directions.
The denominator √v̂ₜ + ε is the bias-corrected RMS of recent gradients, averaged over an effective window of roughly 1000 steps (1/(1 − β₂) with β₂ = 0.999). It normalizes the step to be roughly the same size regardless of the parameter's gradient magnitude.
Together, the update is approximately a signal-to-noise ratio: the smoothed gradient direction (signal) divided by the typical gradient magnitude (noise scale). When the gradient is consistently pointing one direction (high signal), the ratio is large. When the gradient is noisy and near zero in expectation, the ratio is small.
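To see this signal-to-noise behaviour numerically, here is a toy sketch (not from the lesson's code) comparing the step size m̂/(√v̂ + ε) for a perfectly consistent gradient stream and a zero-mean noisy one of the same scale:

import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates over grads and return |m̂ / (√v̂ + ε)|."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return abs(m_hat) / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
consistent = np.full(500, 0.2)            # always points the same way
noisy = rng.normal(0.0, 0.2, size=500)    # zero mean, same typical magnitude
print(adam_ratio(consistent))             # ≈ 1.0: a full-size step
print(adam_ratio(noisy))                  # well below 1: the step shrinks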
Full Numeric Example
One parameter θ, starting at θ₀ = 1.0. Gradients at steps 1–4: [0.5, 0.3, 0.8, 0.6]. Hyperparameters: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8.
Step t=1: g=0.5
- m₁ = 0.9·0 + 0.1·0.5 = 0.05
- v₁ = 0.999·0 + 0.001·0.25 = 0.00025
- m̂₁ = 0.05/(1-0.9) = 0.5
- v̂₁ = 0.00025/(1-0.999) = 0.25
- Update: 0.001 · 0.5 / (√0.25 + 1e-8) = 0.001 · 0.5 / 0.5 = 0.001
- θ₁ = 1.0 - 0.001 = 0.999
Step t=2: g=0.3
- m₂ = 0.9·0.05 + 0.1·0.3 = 0.045+0.03 = 0.075
- v₂ = 0.999·0.00025 + 0.001·0.09 = 0.0002498+0.00009 = 0.0003398
- m̂₂ = 0.075/(1−0.9²) = 0.075/0.19 ≈ 0.395
- v̂₂ = 0.0003398/(1−0.999²) = 0.0003398/0.002 ≈ 0.170
- Update: 0.001 · 0.395 / (√0.170 + 1e-8) = 0.001 · 0.395 / 0.412 ≈ 0.00096
- θ₂ = 0.999 - 0.00096 ≈ 0.998
Notice the bias correction is critical at step 1: m₁=0.05 but m̂₁=0.5 — the raw EMA is 10× smaller than the true gradient. Without bias correction, Adam would take a 10× smaller step than intended at the start of training.
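You can check these hand calculations directly against PyTorch. This is a small verification sketch: the gradients are assigned by hand rather than coming from a real loss.

import torch

theta = torch.nn.Parameter(torch.tensor([1.0]))
opt = torch.optim.Adam([theta], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

for g in [0.5, 0.3, 0.8, 0.6]:         # the gradient sequence from the example
    theta.grad = torch.tensor([g])     # set the gradient manually
    opt.step()
    print(theta.item())                # ≈ 0.9990, 0.9980, ... as computed above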
AdamW: Decoupled Weight Decay
Standard Adam + L2 regularization adds the penalty term to the gradient before computing moments: gₜ = ∇L(θₜ₋₁) + λ·θₜ₋₁.
This is problematic: the regularization term gets scaled by 1/(√v̂ₜ + ε), which means parameters with a large gradient history receive less regularization — the opposite of what you want.
The decoupled AdamW update (Loshchilov & Hutter, 2019) fixes this: θₜ = θₜᴬᵈᵃᵐ − α·λ·θₜ₋₁
- λ — weight decay coefficient — independent of adaptive scaling
- θₜᴬᵈᵃᵐ — parameters after the standard Adam update
Weight decay is applied directly to the parameters, with no adaptive scaling. The regularization effect is now clean and predictable. For transformers and language models, AdamW is the standard choice.
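Here is a minimal sketch of the decoupled update, following the formulation above (the function name adamw_step is illustrative). The key point is that the decay term is never divided by √v̂; library implementations such as torch.optim.AdamW apply the decay at a slightly different point in the step, but likewise keep it outside the adaptive scaling.

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update: a standard Adam step plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    prev = theta
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # standard Adam step
    theta = theta - lr * wd * prev                        # decay: no 1/√v̂ scaling
    return theta, m, v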
In Code
import torch

# Standard Adam
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),  # (β₁, β₂)
    eps=1e-8
)
# AdamW (preferred for transformers)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01  # λ — decoupled
)
For large models: add gradient clipping before the optimizer step — torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). This prevents a single bad batch from causing catastrophic gradient explosions.
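For placement, the clipping call goes between backward() and step(). Here is a sketch of one training step; model, batch, and loss_fn are placeholders, not names from this lesson.

optimizer.zero_grad()
loss = loss_fn(model(batch.inputs), batch.targets)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the step
optimizer.step()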