Gradient Descent
Lesson 9 ⏱ 16 min

Adam: the full derivation

Video coming soon

Adam - Combining Momentum and Adaptive Rates

How Adam merges the momentum idea from SGD with the adaptive rates from RMSprop, why bias correction matters in the first few steps, default hyperparameter choices, and practical guidance on when to tune α vs. β₁/β₂.

⏱ ~8 min

🧮 Quick refresher

Exponential moving average

An EMA of a sequence is updated as EMA ← β·EMA_prev + (1-β)·xₜ. It weights recent values more than old ones, with decay rate β. The effective window is roughly 1/(1-β) steps.

Example

EMA starting at 0, β=0.9, receiving constant input 1.0: after step 1, EMA=0.1; step 2, EMA=0.19; step 10, EMA≈0.65; step 50, EMA≈0.99.

It asymptotically approaches the true mean of 1.0.
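A few lines of Python (illustrative only, not part of any library) reproduce these numbers:

# Illustrative sketch of the EMA example above: beta = 0.9, constant input 1.0.
beta = 0.9
ema = 0.0
for t in range(1, 51):
    ema = beta * ema + (1 - beta) * 1.0
    if t in (1, 2, 10, 50):
        print(f"step {t:2d}: EMA = {ema:.4f}")
# step  1: EMA = 0.1000
# step  2: EMA = 0.1900
# step 10: EMA = 0.6513
# step 50: EMA = 0.9948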

Two Good Ideas, Combined

The previous two lessons introduced two insights:

  1. Momentum (Lesson 7): maintain a running average of gradients to smooth oscillations and accumulate speed in consistent directions
  2. RMSprop (Lesson 8): maintain a running average of squared gradients to give each parameter its own adaptive learning rate

Adam (Adaptive Moment Estimation) combines both ideas in one algorithm, adds a critical numerical fix called bias correction, and packages it with robust defaults that work across a wide range of problems.

Adam is the default optimizer for training transformers, diffusion models, and most modern neural networks. Understanding how it combines momentum and adaptive rates explains why it outperforms vanilla SGD on nearly every benchmark.

The Algorithm

Adam maintains two running statistics per parameter: the first moment (gradient direction) and the second moment (gradient variance). Their rates of change are governed by β₁ (momentum decay) and β₂ (variance decay), both fully defined in the formula below.

$$
\begin{aligned}
m &\leftarrow \beta_1 m + (1-\beta_1)\, g \\
v &\leftarrow \beta_2 v + (1-\beta_2)\, g^2 \\[6pt]
\hat{m} &= \frac{m}{1 - \beta_1^t} \qquad \hat{v} = \frac{v}{1 - \beta_2^t} \\[6pt]
\theta &\leftarrow \theta - \frac{\alpha\, \hat{m}}{\sqrt{\hat{v}} + \varepsilon}
\end{aligned}
$$
m: first moment (EMA of the gradient)
v: second moment (EMA of the squared gradient)
β₁: decay rate for the first moment (default 0.9)
β₂: decay rate for the second moment (default 0.999)
g: current gradient
t: current step number (starts at 1)
m̂: bias-corrected first moment
v̂: bias-corrected second moment
α: learning rate (default 1e-3)
ε: numerical stability constant (default 1e-8)
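Written out in code, one update looks like the minimal NumPy sketch below. This is not a library API; the name adam_step and its signature are chosen just for illustration.

import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; t is the step number, starting at 1.
    m = beta1 * m + (1 - beta1) * g            # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2         # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias correction (matters only for small t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v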

The Bias Correction

Both m and v are initialized to zero vectors. In the early steps, they are biased toward zero because they haven't had time to "fill up" with gradient information.

Example at step t=1 with β₁=0.9:

Without correction: m = 0.1·g, only 10% of the actual gradient!

The bias-corrected estimate: m̂ = m / (1 − 0.9¹) = m / 0.1 = 10m

This restores the correct scale. As t grows, 1 − βᵗ → 1 and the correction vanishes. By step 50 with β₁ = 0.9: 1 − 0.9⁵⁰ ≈ 0.995, essentially no correction.
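To see the bias and its correction side by side, here is a small illustrative loop with a constant gradient of 1.0, so the "true" first moment is exactly 1.0:

# Uncorrected vs. bias-corrected first moment for a constant gradient g = 1.0.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0
    m_hat = m / (1 - beta1**t)
    print(f"t={t}: m = {m:.3f}   m_hat = {m_hat:.3f}")
# m starts at 0.100 and only slowly creeps upward; m_hat is exactly 1.000 at every step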

Worked Numerical Example

Parameters: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8. Start: m=0, v=0, θ=2.0.

Step 1, gradient g=3.0:

First moment: m = 0.1 × 3.0 = 0.3

Second moment: v = 0.001 × 9.0 = 0.009

Bias-corrected: m̂ = 0.3 / 0.1 = 3.0, v̂ = 0.009 / 0.001 = 9.0

Update: Δθ = 10⁻³ × 3.0 / (√9 + 10⁻⁸) ≈ 0.001

New value: θ = 2.0 − 0.001 = 1.999

Step 2, gradient g=3.0:

First moment: m = 0.9(0.3) + 0.1(3.0) = 0.57

Second moment: v = 0.999(0.009) + 0.001(9.0) = 0.017991

Bias-corrected: m̂ = 0.57 / (1 − 0.9²) = 0.57 / 0.19 = 3.0, v̂ = 0.017991 / (1 − 0.999²) = 0.017991 / 0.001999 = 9.0

Update: Δθ = 10⁻³ × 3.0 / (√9.0 + 10⁻⁸) ≈ 0.001

Consistent gradients → consistent step size of ≈ α. This predictability is why Adam is easy to tune.
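Assuming the adam_step sketch from earlier in this lesson, the two steps can be checked in a couple of lines:

# Reproduce the worked example: constant gradient g = 3.0, starting from theta = 2.0.
theta, m, v = 2.0, 0.0, 0.0
for t in (1, 2):
    theta, m, v = adam_step(theta, g=3.0, m=m, v=v, t=t)
    print(f"step {t}: m = {m:.4f}   v = {v:.6f}   theta = {theta:.6f}")
# step 1: theta ≈ 1.999000
# step 2: theta ≈ 1.998000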

Default Hyperparameters and Why They Work

| Parameter | Default | Role |
| --- | --- | --- |
| α | 1e-3 | Overall step scale. This is the main tuning knob. |
| β₁ | 0.9 | ~10-step window for gradient direction |
| β₂ | 0.999 | ~1000-step window for gradient variance |
| ε | 1e-8 | Prevents division by zero |

β₂ = 0.999 gives a long window for estimating gradient variance — it's slow to react to changes in gradient scale, which is generally good (you want a stable estimate). β₁ = 0.9 adapts direction faster.

The practical rule: for most problems, keep β₁, β₂, and ε at their defaults and tune only α. Common values to try: 1e-3, 3e-4, 1e-4. If the loss oscillates, halve α; if it decreases too slowly, try roughly 3× larger.

Code: Adam in PyTorch

import torch.optim as optim

# Adam with default hyperparameters — works for most problems
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam + proper weight decay decoupling (preferred for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()

Adam is the default optimizer in nearly every deep learning framework. When in doubt, start with Adam(lr=1e-3). If you see training instability early, try lr=3e-4. If you're training a transformer, use AdamW with a weight decay of 0.01–0.1.
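If you do decide to change β₁, β₂, or ε (rarely necessary), they are exposed directly on the constructor. The sketch below just spells out the stock defaults, assuming model is defined as above:

import torch.optim as optim

# Equivalent to optim.Adam(model.parameters(), lr=1e-3), with every default written out
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,             # alpha: the main tuning knob
    betas=(0.9, 0.999),  # (beta1, beta2): first- and second-moment decay rates
    eps=1e-8,            # epsilon: numerical stability constant
)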

Quiz

1 / 3

Adam's first moment estimate m tracks what quantity?