Advanced Optimization
Lesson 7 ⏱ 18 min

Adam: the complete derivation


Adam: Momentum + RMSprop + Bias Correction

Adam step-by-step derivation. First and second moment EMAs. Bias correction derivation. Full numeric example. AdamW fix. Why Adam is the default optimizer.


Quick refresher

Exponential moving averages and bias correction

EMA: vₜ = β·vₜ₋₁ + (1-β)·xₜ. Effective window 1/(1-β). Early estimates are biased toward zero because we initialize v₀=0. Bias correction: v̂ₜ = vₜ/(1-βᵗ) recovers the correct estimate, especially important for the first ~1/(1-β) steps.

Example

With β=0.9 and constant x=10: v₁=1.0, v̂₁=10.0; v₅≈4.10, v̂₅=10.0.

Bias correction recovers the true value from the underestimated raw EMA.
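
To make the refresher concrete, here is a minimal plain-Python sketch (no framework assumed) that reproduces these numbers:

beta = 0.9
x = 10.0          # constant input signal
v = 0.0           # raw EMA, initialized at zero

for t in range(1, 6):
    v = beta * v + (1 - beta) * x      # raw (biased) EMA
    v_hat = v / (1 - beta ** t)        # bias-corrected estimate
    print(f"t={t}: v={v:.3f}  v_hat={v_hat:.3f}")

# t=1: v=1.000  v_hat=10.000
# t=5: v=4.095  v_hat=10.000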

Adam in One Sentence

Adam = momentum + RMSprop + bias correction.

Everything in this lesson follows from two lessons you have already seen: EMA (lesson 11-2) and RMSprop (lesson 11-6).

Adam is the default optimizer for almost every deep learning task — transformers, GANs, diffusion models, and most production models are trained with it. If you only learn one optimizer in depth, it should be this one.

The Algorithm: Line by Line

Initialize: m₀ = 0, v₀ = 0, t = 0. At each step:

Step 1: Compute gradient.

g_t = \nabla L(\theta_t)
  gₜ — gradient at step t

Step 2: Update first moment (EMA of gradients — momentum).

m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t
  mₜ — first moment estimate (EMA of gradients)
  β₁ — first moment decay rate (default: 0.9)

Step 3: Update second moment (EMA of squared gradients — RMSprop).

v_t = \beta_2 \cdot v_{t-1} + (1-\beta_2)\cdot g_t^2
  vₜ — second moment estimate (EMA of squared gradients)
  β₂ — second moment decay rate (default: 0.999)

Step 4: Bias correction.

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
  m̂ₜ — bias-corrected first moment
  v̂ₜ — bias-corrected second moment

Step 5: Parameter update.

\theta_{t+1} = \theta_t - \frac{\alpha \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
  α — learning rate (default: 3e-4)
  ε — numerical stability constant (default: 1e-8)

Default hyperparameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸, α = 3×10⁻⁴.
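
The five steps translate almost line-for-line into code. Here is a minimal NumPy sketch of a single Adam step (the function name adam_step and its interface are ours, not PyTorch's):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # Step 2: first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # Step 3: second moment (RMSprop)
    m_hat = m / (1 - beta1 ** t)                # Step 4: bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # Step 5: parameter update
    return theta, m, v

Calling it repeatedly with t = 1, 2, … and the running m, v reproduces the numeric example later in this lesson.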

Why Each Piece Matters

The numerator m̂ₜ is the bias-corrected momentum. It smooths the gradient direction over ~10 steps, reducing noise and accelerating in consistent directions.

The denominator √v̂ₜ is the bias-corrected RMS of recent gradients (~1000 steps). It normalizes the step to be roughly the same size regardless of the parameter's gradient magnitude.

Together, the update m̂ₜ/√v̂ₜ is approximately a signal-to-noise ratio: the smoothed gradient direction (signal) divided by the typical gradient magnitude (noise scale). When the gradient consistently points in one direction (high signal), the ratio is large. When the gradient is noisy and near zero in expectation, the ratio is small.
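
To see this behavior concretely, here is a toy Python experiment (our own construction, not part of the algorithm) comparing the ratio for a perfectly consistent gradient stream and a zero-mean noisy one:

import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999):
    """Return |m_hat| / sqrt(v_hat) after processing a sequence of gradients."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
    return abs(m / (1 - beta1 ** t)) / np.sqrt(v / (1 - beta2 ** t))

rng = np.random.default_rng(0)
consistent = np.full(200, 0.5)               # always the same sign and size
noisy = rng.normal(0.0, 0.5, size=200)       # same scale, zero mean

print(adam_ratio(consistent))   # ≈ 1.0 — pure signal
print(adam_ratio(noisy))        # well below 1.0 — mostly noise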

Full Numeric Example

One parameter, starting at θ = 1.0. Gradients at steps 1–4: [0.5, 0.3, 0.8, 0.6]. Hyperparameters: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8.

Step t=1: g=0.5

  • m₁ = 0.9·0 + 0.1·0.5 = 0.05
  • v₁ = 0.999·0 + 0.001·0.25 = 0.00025
  • m̂₁ = 0.05/(1-0.9) = 0.5
  • v̂₁ = 0.00025/(1-0.999) = 0.25
  • Update: 0.001 · 0.5 / (√0.25 + 1e-8) = 0.001 · 0.5 / 0.5 = 0.001
  • θ₁ = 1.0 - 0.001 = 0.999

Step t=2: g=0.3

  • m₂ = 0.9·0.05 + 0.1·0.3 = 0.045+0.03 = 0.075
  • v₂ = 0.999·0.00025 + 0.001·0.09 = 0.0002498+0.00009 = 0.0003398
  • m̂₂ = 0.075/(1-0.81) = 0.075/0.19 ≈ 0.395
  • v̂₂ = 0.0003398/(1-0.999²) = 0.0003398/0.002 ≈ 0.170
  • Update: 0.001 · 0.395 / (√0.170 + 1e-8) ≈ 0.001 · 0.395 / 0.412 ≈ 0.00096
  • θ₂ = 0.999 - 0.00096 ≈ 0.998

Notice the bias correction is critical at step 1: m₁=0.05 but m̂₁=0.5 — the raw EMA is 10× smaller than the true gradient. Without bias correction, Adam would take a 10× smaller step than intended at the start of training.
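
The hand calculation is easy to check in a few lines of plain Python (variable names are ours):

theta, m, v = 1.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t, g in enumerate([0.5, 0.3, 0.8, 0.6], start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    print(f"t={t}: theta={theta:.3f}")

# The first two lines match the hand calculation: t=1: theta=0.999, t=2: theta=0.998.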

AdamW: Decoupled Weight Decay

Standard Adam + L2 regularization adds λθ to the gradient before computing moments:

g_t = \nabla L(\theta_t) + \lambda\theta_t

This is problematic: the regularization term gets scaled by 1/√v̂ₜ along with everything else, which means parameters with a large gradient history receive less regularization — the opposite of what you want.

The AdamW update (Loshchilov & Hutter, 2019) fixes this by decoupling weight decay from the gradient:

\theta_{t+1} = \theta_t - \frac{\alpha\hat{m}_t}{\sqrt{\hat{v}_t}+\varepsilon} - \alpha\lambda\theta_t
  λ — weight decay coefficient, independent of the adaptive scaling
  αλθₜ — decay term applied directly to the parameters, separate from the adaptive step

Weight decay is applied directly to θ with no adaptive scaling. The regularization effect is now clean and predictable. For transformers and language models, AdamW is the standard choice.
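
For comparison, a minimal NumPy sketch of both variants (helper names adam_with_l2 and adamw_step are ours; the real implementations live in torch.optim, shown below):

import numpy as np

def adam_with_l2(theta, grad, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam + L2: lam*theta enters the gradient and is later rescaled by 1/sqrt(v_hat)."""
    g = grad + lam * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: moments see only the raw gradient; decay is applied directly to theta."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * theta, m, v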

In Code

import torch

# Standard Adam
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),   # (β₁, β₂)
    eps=1e-8
)

# AdamW (preferred for transformers)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01   # λ — decoupled
)

For large models: add gradient clipping before the optimizer step — torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). This prevents a single bad batch from causing catastrophic gradient explosions.
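
In a training loop the clipping call sits between loss.backward() and optimizer.step(). A minimal self-contained sketch (the toy linear model and random data are ours, purely to show the ordering):

import torch

model = torch.nn.Linear(10, 1)                    # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)    # random stand-in batch

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip the global gradient norm before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()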

Quiz


Adam's update θ ← θ - α·m̂ₜ/(√v̂ₜ + ε) combines which two ideas?