Advanced Optimization
Lesson 7 ⏱ 18 min

Adam: the complete derivation


Adam: Momentum + RMSprop + Bias Correction

Adam step-by-step derivation. First and second moment EMAs. Bias correction derivation. Full numeric example. AdamW fix. Why Adam is the default optimizer.


Quick refresher

Exponential moving averages and bias correction

EMA: vₜ = β·vₜ₋₁ + (1-β)·xₜ. Effective window 1/(1-β). Early estimates are biased toward zero because we initialize v₀=0. Bias correction: v̂ₜ = vₜ/(1-βᵗ) recovers the correct estimate, especially important for the first ~1/(1-β) steps.

Example

With β=0.9 and constant x=10: v₁=1.0, v̂₁=10.0; v₅≈4.10, v̂₅=10.0.

Bias correction recovers the true value from the underestimated raw EMA.
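
To make the refresher concrete, here is a minimal plain-Python sketch (no framework assumed) that reproduces these numbers:

beta = 0.9
x = 10.0          # constant input signal
v = 0.0           # raw EMA, initialized at zero

for t in range(1, 6):
    v = beta * v + (1 - beta) * x      # raw (biased) EMA
    v_hat = v / (1 - beta ** t)        # bias-corrected estimate
    print(f"t={t}: v={v:.3f}  v_hat={v_hat:.3f}")

# t=1: v=1.000  v_hat=10.000
# t=5: v=4.095  v_hat=10.000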

Adam in One Sentence

Adam = momentum + RMSprop + bias correction.

Everything in this lesson follows from two lessons you have already seen: EMA (lesson 11-2) and RMSprop (lesson 11-6).

Adam is the default optimizer for almost every deep learning task — transformers, GANs, diffusion models, and most production models are trained with it. If you only learn one optimizer in depth, it should be this one.

The Algorithm: Line by Line

Initialize: m₀ = 0, v₀ = 0, t = 0. At each step:

Step 1: Compute gradient.

g_t = \nabla L(\theta_t)
  gₜ — gradient at step t

Step 2: Update first moment (EMA of gradients — momentum).

m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t
  mₜ — first moment estimate (EMA of gradients)
  β₁ — first moment decay rate (default: 0.9)

Step 3: Update second moment (EMA of squared gradients — RMSprop).

v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2)\cdot g_t^2
  vₜ — second moment estimate (EMA of squared gradients)
  β₂ — second moment decay rate (default: 0.999)

Step 4: Bias correction.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
  m̂ₜ — bias-corrected first moment
  v̂ₜ — bias-corrected second moment

Step 5: Parameter update.

\theta_{t+1} = \theta_t - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
  α — learning rate (default: 3e-4)
  ε — numerical stability constant (default: 1e-8)

Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, α = 3×10⁻⁴.
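
The five steps translate almost line-for-line into code. Here is a minimal NumPy sketch of a single Adam step (the function name adam_step and its interface are ours, not PyTorch's):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # Step 2: first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # Step 3: second moment (RMSprop)
    m_hat = m / (1 - beta1 ** t)                # Step 4: bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Step 5: parameter update
    return theta, m, v

Calling it repeatedly with t = 1, 2, … and the running m, v reproduces the numeric example later in this lesson.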

Why Each Piece Matters

The numerator m̂ₜ is the bias-corrected momentum. It smooths the gradient direction over ~10 steps, reducing noise and accelerating in consistent directions.

The denominator √v̂ₜ is the bias-corrected RMS of recent gradients (~1000 steps). It normalizes the step to be roughly the same size regardless of the parameter's gradient magnitude.

Together, the update m̂ₜ/√v̂ₜ is approximately a signal-to-noise ratio: the smoothed gradient direction (signal) divided by the typical gradient magnitude (noise scale). When the gradient consistently points in one direction (high signal), the ratio is large. When the gradient is noisy and near zero in expectation, the ratio is small.
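
To see this behavior concretely, here is a toy Python experiment (our own construction, not part of the algorithm) comparing the ratio for a perfectly consistent gradient stream and a zero-mean noisy one:

import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return |m_hat| / sqrt(v_hat) after processing a sequence of gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return abs(m / (1 - beta1 ** t)) / np.sqrt(v / (1 - beta2 ** t))

rng = np.random.default_rng(0)
consistent = np.full(200, 0.5)               # always the same sign and size
noisy = rng.normal(0.0, 0.5, size=200)       # same scale, zero mean

print(adam_ratio(consistent))   # ≈ 1.0 — pure signal
print(adam_ratio(noisy))        # well below 1.0 — mostly noise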

Full Numeric Example

One parameter, starting at θ = 1.0. Gradients at steps 1–4: [0.5, 0.3, 0.8, 0.6]. Hyperparameters: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8.

Step t=1: g=0.5

  • m₁ = 0.9·0 + 0.1·0.5 = 0.05
  • v₁ = 0.999·0 + 0.001·0.25 = 0.00025
  • m̂₁ = 0.05/(1-0.9) = 0.5
  • v̂₁ = 0.00025/(1-0.999) = 0.25
  • Update: 0.001 · 0.5 / (√0.25 + 1e-8) = 0.001 · 0.5 / 0.5 = 0.001
  • θ₁ = 1.0 - 0.001 = 0.999

Step t=2: g=0.3

  • m₂ = 0.9·0.05 + 0.1·0.3 = 0.045+0.03 = 0.075
  • v₂ = 0.999·0.00025 + 0.001·0.09 = 0.0002498+0.00009 = 0.0003398
  • m̂₂ = 0.075/(1-0.81) = 0.075/0.19 ≈ 0.395
  • v̂₂ = 0.0003398/(1-0.999²) = 0.0003398/0.002 ≈ 0.170
  • Update: 0.001 · 0.395 / (√0.170 + 1e-8) ≈ 0.001 · 0.395 / 0.412 ≈ 0.00096
  • θ₂ = 0.999 - 0.00096 ≈ 0.998

Notice the bias correction is critical at step 1: m₁=0.05 but m̂₁=0.5 — the raw EMA is 10× smaller than the true gradient. Without bias correction, Adam would take a 10× smaller step than intended at the start of training.
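
The hand calculation is easy to check in a few lines of plain Python (variable names are ours):

theta, m, v = 1.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t, g in enumerate([0.5, 0.3, 0.8, 0.6], start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    print(f"t={t}: theta={theta:.3f}")

# The first two lines match the hand calculation: t=1: theta=0.999, t=2: theta=0.998.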

AdamW: Decoupled Weight Decay

Standard Adam + L2 regularization adds λθ to the gradient before computing moments:

g_t = \nabla L(\theta_t) + \lambda\theta_t

This is problematic: the regularization term gets scaled by 1/√v̂ₜ along with everything else, which means parameters with a large gradient history receive less regularization — the opposite of what you want.

The AdamW update (Loshchilov & Hutter, 2019) fixes this by decoupling weight decay from the gradient:

\theta_{t+1} = \theta_t - \frac{\alpha\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon} - \alpha\lambda\theta_t
  λ — weight decay coefficient, independent of the adaptive scaling
  αλθₜ — decay term applied directly to the parameters, separate from the adaptive step

Weight decay is applied directly to θ with no adaptive scaling. The regularization effect is now clean and predictable. For transformers and language models, AdamW is the standard choice.
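
For comparison, a minimal NumPy sketch of both variants (helper names adam_with_l2 and adamw_step are ours; the real implementations live in torch.optim, shown below):

import numpy as np

def adam_with_l2(theta, grad, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam + L2: lam*theta enters the gradient and is later rescaled by 1/sqrt(v_hat)."""
    g = grad + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the raw gradient; decay is applied directly to theta."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * theta, m, v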

In Code

import torch

# Standard Adam
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),   # (β₁, β₂)
    eps=1e-8
)

# AdamW (preferred for transformers)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01   # λ — decoupled
)

For large models: add gradient clipping before the optimizer step — torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). This prevents a single bad batch from causing catastrophic gradient explosions.
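
In a training loop the clipping call sits between loss.backward() and optimizer.step(). A minimal self-contained sketch (the toy linear model and random data are ours, purely to show the ordering):

import torch

model = torch.nn.Linear(10, 1)                    # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # random stand-in batch

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()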

Quiz


Adam's update θ ← θ - α·m̂ₜ/(√v̂ₜ + ε) combines which two ideas?