Gradient Descent
Lesson 9 ⏱ 16 min

Adam: the full derivation

Video coming soon

Adam - Combining Momentum and Adaptive Rates

How Adam merges the momentum idea from SGD with the adaptive rates from RMSprop, why bias correction matters in the first few steps, default hyperparameter choices, and practical guidance on when to tune α vs. β₁/β₂.

⏱ ~8 min

🧮 Quick refresher

Exponential moving average

An EMA of a sequence is updated as EMA ← β·EMA_prev + (1-β)·xₜ. It weights recent values more than old ones, with decay rate β. The effective window is roughly 1/(1-β) steps.

Example

EMA starting at 0, β=0.9, receiving constant input 1.0: after step 1, EMA=0.1; step 2, EMA=0.19; step 10, EMA≈0.65; step 50, EMA≈0.99.

It asymptotically approaches the true mean of 1.0.
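A few lines of Python (illustrative only, not part of any library) reproduce these numbers:

# Illustrative sketch of the EMA example above: beta = 0.9, constant input 1.0.
beta = 0.9
ema = 0.0
for t in range(1, 51):
    ema = beta * ema + (1 - beta) * 1.0
    if t in (1, 2, 10, 50):
        print(f"step {t:2d}: EMA = {ema:.4f}")
# step  1: EMA = 0.1000
# step  2: EMA = 0.1900
# step 10: EMA = 0.6513
# step 50: EMA = 0.9948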

Two Good Ideas, Combined

The previous two lessons introduced two insights:

  1. Momentum (Lesson 7): maintain a running average of gradients to smooth oscillations and accumulate speed in consistent directions
  2. RMSprop (Lesson 8): maintain a running average of squared gradients to give each parameter its own adaptive learning rate

Adam (Adaptive Moment Estimation) combines both ideas in one algorithm, adds a critical numerical fix called bias correction, and packages it with robust defaults that work across a wide range of problems.

Adam is the default optimizer for training transformers, diffusion models, and most modern neural networks. Understanding how it combines momentum and adaptive rates explains why it outperforms vanilla SGD on nearly every benchmark.

The Algorithm

Adam maintains two running statistics per parameter: the first moment (gradient direction) and the second moment (gradient variance). Their rates of change are governed by β₁ (momentum decay) and β₂ (variance decay), both fully defined in the formula below.

$$
\begin{aligned}
m &\leftarrow \beta_1 m + (1-\beta_1)\, g \\
v &\leftarrow \beta_2 v + (1-\beta_2)\, g^2 \\[6pt]
\hat{m} &= \frac{m}{1 - \beta_1^t} \qquad \hat{v} = \frac{v}{1 - \beta_2^t} \\[6pt]
\theta &\leftarrow \theta - \frac{\alpha\, \hat{m}}{\sqrt{\hat{v}} + \varepsilon}
\end{aligned}
$$
m: first moment (EMA of the gradient)
v: second moment (EMA of the squared gradient)
β₁: decay rate for the first moment (default 0.9)
β₂: decay rate for the second moment (default 0.999)
g: current gradient
t: current step number (starts at 1)
m̂: bias-corrected first moment
v̂: bias-corrected second moment
α: learning rate (default 1e-3)
ε: numerical stability constant (default 1e-8)
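Written out in code, one update looks like the minimal NumPy sketch below. This is not a library API; the name adam_step and its signature are chosen just for illustration.

import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; t is the step number, starting at 1.
    m = beta1 * m + (1 - beta1) * g            # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g**2         # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1**t)                 # bias correction (matters only for small t)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v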

The Bias Correction

Both m and v are initialized to zero vectors. In the early steps, they are biased toward zero because they haven't had time to "fill up" with gradient information.

Example at step t=1 with β₁=0.9:

Without correction: m = 0.1·g, only 10% of the actual gradient!

The bias-corrected estimate: m̂ = m / (1 − 0.9¹) = m / 0.1 = 10m

This restores the correct scale. As t grows, 1 − βᵗ → 1 and the correction vanishes. By step 50 with β₁ = 0.9: 1 − 0.9⁵⁰ ≈ 0.995, essentially no correction.
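To see the bias and its correction side by side, here is a small illustrative loop with a constant gradient of 1.0, so the "true" first moment is exactly 1.0:

# Uncorrected vs. bias-corrected first moment for a constant gradient g = 1.0.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0
    m_hat = m / (1 - beta1**t)
    print(f"t={t}: m = {m:.3f}   m_hat = {m_hat:.3f}")
# m starts at 0.100 and only slowly creeps upward; m_hat is exactly 1.000 at every step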

Worked Numerical Example

Parameters: α=1e-3, β₁=0.9, β₂=0.999, ε=1e-8. Start: m=0, v=0, θ=2.0.

Step 1, gradient g=3.0:

First moment: m = 0.1 × 3.0 = 0.3

Second moment: v = 0.001 × 9.0 = 0.009

Bias-corrected: m̂ = 0.3 / 0.1 = 3.0, v̂ = 0.009 / 0.001 = 9.0

Update: Δθ = 10⁻³ × 3.0 / (√9 + 10⁻⁸) ≈ 0.001

New value: θ = 2.0 − 0.001 = 1.999

Step 2, gradient g=3.0:

First moment: m = 0.9(0.3) + 0.1(3.0) = 0.57

Second moment: v = 0.999(0.009) + 0.001(9.0) = 0.017991

Bias-corrected: m̂ = 0.57 / (1 − 0.9²) = 0.57 / 0.19 = 3.0, v̂ = 0.017991 / (1 − 0.999²) = 0.017991 / 0.001999 = 9.0

Update: Δθ = 10⁻³ × 3.0 / (√9.0 + 10⁻⁸) ≈ 0.001

Consistent gradients → consistent step size of ≈ α. This predictability is why Adam is easy to tune.
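Assuming the adam_step sketch from earlier in this lesson, the two steps can be checked in a couple of lines:

# Reproduce the worked example: constant gradient g = 3.0, starting from theta = 2.0.
theta, m, v = 2.0, 0.0, 0.0
for t in (1, 2):
    theta, m, v = adam_step(theta, g=3.0, m=m, v=v, t=t)
    print(f"step {t}: m = {m:.4f}   v = {v:.6f}   theta = {theta:.6f}")
# step 1: theta ≈ 1.999000
# step 2: theta ≈ 1.998000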

Default Hyperparameters and Why They Work

| Parameter | Default | Role |
| --- | --- | --- |
| α | 1e-3 | Overall step scale. This is the main tuning knob. |
| β₁ | 0.9 | ~10-step window for gradient direction |
| β₂ | 0.999 | ~1000-step window for gradient variance |
| ε | 1e-8 | Prevents division by zero |

β₂ = 0.999 gives a long window for estimating gradient variance — it's slow to react to changes in gradient scale, which is generally good (you want a stable estimate). β₁ = 0.9 adapts direction faster.

The practical rule: for most problems, keep β₁, β₂, and ε at their defaults and tune only α. Common values to try: 1e-3, 3e-4, 1e-4. If the loss oscillates, halve α; if it decreases too slowly, try roughly 3× larger.

Code: Adam in PyTorch

import torch.optim as optim

# Adam with default hyperparameters — works for most problems
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: Adam + proper weight decay decoupling (preferred for transformers)
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()

Adam is the default optimizer in nearly every deep learning framework. When in doubt, start with Adam(lr=1e-3). If you see training instability early, try lr=3e-4. If you're training a transformer, use AdamW with a weight decay of 0.01–0.1.
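If you do decide to change β₁, β₂, or ε (rarely necessary), they are exposed directly on the constructor. The sketch below just spells out the stock defaults, assuming model is defined as above:

import torch.optim as optim

# Equivalent to optim.Adam(model.parameters(), lr=1e-3), with every default written out
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,             # alpha: the main tuning knob
    betas=(0.9, 0.999),  # (beta1, beta2): first- and second-moment decay rates
    eps=1e-8,            # epsilon: numerical stability constant
)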

Quiz

1 / 3

Adam's first moment estimate m tracks what quantity?