Exponential Moving Averages: The Core Primitive of Adam

EMA definition and expansion. The effective window. Why EMA weights recent observations more. Bias correction. Numerical walkthrough with β=0.9. This is all you need to understand Adam.

🧮 Quick refresher

Weighted average

A weighted average assigns different importance to different values: weighted_avg = Σ wᵢ·xᵢ / Σ wᵢ. The weights control how much each observation contributes to the average. If all weights are equal, it reduces to the standard mean.

Example

Scores [90, 60, 80] with weights [0.5, 0.3, 0.2]: weighted avg = (0.5·90 + 0.3·60 + 0.2·80) / (0.5+0.3+0.2) = (45+18+16)/1 = 79.

Here the first score carries the largest weight. An EMA uses the same mechanism, but assigns the largest weights to the most recent observations.
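To make the arithmetic concrete, here is the same computation in a few lines of Python (a minimal sketch; weighted_avg is our own helper, not a library function):

```python
def weighted_avg(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# The refresher example: scores [90, 60, 80] with weights [0.5, 0.3, 0.2]
print(weighted_avg([90, 60, 80], [0.5, 0.3, 0.2]))  # 79.0
```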

Why This Matters

Before deriving Adam, RMSprop, or momentum, you need one building block: the exponential moving average (EMA).

Once you understand EMA deeply, the derivation of Adam in lesson 11-7 will be immediate.

EMA is not just a math prerequisite — it is the mechanism that makes adaptive optimizers stable. Without it, Adam would react to every noisy gradient spike and diverge. The smoothing property of EMA is why modern optimizers can train billion-parameter models reliably.

Definition

Let x₁, x₂, … be a sequence of values (e.g., gradient values). The EMA with parameter β ∈ [0, 1) is defined by the recurrence below.

In plain English: the new running average is mostly the old running average (scaled by β) plus a small contribution from today's fresh observation (scaled by 1−β). The larger β is, the more you trust the past over the present.

v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot x_t

  • vₜ: the EMA at step t — the running estimate
  • β: decay rate — larger β means longer memory
  • xₜ: the new observation at step t

with initial condition v₀ = 0.

The factor (1−β) ensures that if xₜ is constant, the EMA converges to that constant (not to zero):

v_t \to x \text{ as } t \to \infty \text{ when } x_t = x \text{ for all } t
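The recurrence translates directly into code. A minimal sketch (the ema helper and its names are ours):

```python
def ema(xs, beta=0.9):
    """Run v_t = beta * v_{t-1} + (1 - beta) * x_t with v_0 = 0; return all v_t."""
    v, history = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        history.append(v)
    return history

# With a constant input, the EMA converges to that constant (not zero):
print(ema([5.0] * 100)[-1])  # ≈ 4.9999, approaching 5
```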

Expanding the Recurrence

Substituting the recurrence repeatedly:

v_t = (1-\beta)\sum_{k=0}^{t-1} \beta^k \cdot x_{t-k}

  • β^k: the weight on the observation k steps ago
  • xₜ₋ₖ: the observation k steps in the past

(plus a term β^t · v₀, which is zero since v₀ = 0.)

This is a weighted average where the weight on xₜ₋ₖ decays geometrically with age k:

  • Most recent observation xₜ: weight (1−β)·β⁰ = 1−β
  • One step ago xₜ₋₁: weight (1−β)·β
  • Two steps ago: weight (1−β)·β²

Older observations are exponentially downweighted. Recent observations dominate.
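You can verify the expansion numerically: the explicit geometrically weighted sum agrees with the recurrence. A sketch, reusing the ema helper from above:

```python
xs, beta = [10, 8, 12, 9, 11], 0.9
t = len(xs)

v_recur = ema(xs, beta)[-1]  # recurrence form

# Explicit expansion: v_t = (1-beta) * sum_k beta^k * x_{t-k}
v_expand = (1 - beta) * sum(beta**k * xs[t - 1 - k] for k in range(t))

print(v_recur, v_expand)  # both ≈ 4.1213
```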

The Effective Window

How far back does an EMA actually "see"? The half-life is the number of steps k for the weight to fall to half of its initial value: β^k = 0.5, so k = log(0.5)/log(β).

A simpler and widely used approximation is the effective window:

N_{\text{eff}} = \frac{1}{1-\beta}

  • N_eff: effective window — the approximate number of recent steps the EMA remembers

β        Effective window
0.5      2 steps
0.9      10 steps
0.99     100 steps
0.999    1000 steps

In Adam, the first moment uses β₁ = 0.9 (window ≈ 10 gradient steps) and the second moment uses β₂ = 0.999 (window ≈ 1000 steps).
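Both measures scale the same way: for β near 1, log(0.5)/log(β) ≈ 0.69/(1−β), so the half-life is about 0.69·N_eff. A short sketch comparing them:

```python
import math

for beta in [0.5, 0.9, 0.99, 0.999]:
    half_life = math.log(0.5) / math.log(beta)  # steps until the weight halves
    n_eff = 1 / (1 - beta)                      # effective window
    print(f"beta={beta}: half-life ≈ {half_life:.1f}, N_eff ≈ {n_eff:.0f}")
```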

Bias Correction: The Early-Training Problem

There is a subtle issue. We initialize v₀ = 0. At early steps:

v_1 = \beta \cdot 0 + (1-\beta) \cdot x_1 = (1-\beta) \cdot x_1

For β = 0.9: v₁ = 0.1·x₁, only 10% of the actual value. The EMA starts at zero and takes many steps to "warm up." During this warm-up, estimates are biased toward zero.

The fix is bias correction:

\hat{v}_t = \frac{v_t}{1 - \beta^t}

  • v̂ₜ: bias-corrected EMA estimate at step t
  • vₜ: raw (uncorrected) EMA
  • β^t: β raised to the power t — decays toward 0 as t grows

At t = 1 we divide by 1 − 0.9 = 0.1, multiplying by 10 — correcting the downward bias. At t = 100, β^100 = 0.9^100 ≈ 2.7×10⁻⁵, so the denominator is essentially 1 — no effect.

Bias correction matters only at the start of training.
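A two-line check of how quickly the correction factor 1/(1 − β^t) fades for β = 0.9:

```python
beta = 0.9
for t in [1, 5, 10, 50, 100]:
    print(t, 1 / (1 - beta**t))  # 10.0 at t=1, ≈ 1.005 by t=50, ≈ 1.0 by t=100
```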

Full Numerical Walkthrough (β = 0.9)

Let xₜ = [10, 8, 12, 9, 11, …] (say, some gradient component). Starting from v₀ = 0:

t    xₜ    vₜ = 0.9·vₜ₋₁ + 0.1·xₜ      1 − 0.9^t    v̂ₜ = vₜ / (1 − 0.9^t)
1    10    0.9·0 + 0.1·10 = 1.00       0.100        10.00
2     8    0.9·1.00 + 0.1·8 = 1.70     0.190        8.95
3    12    0.9·1.70 + 0.1·12 = 2.73    0.271        10.07
4     9    0.9·2.73 + 0.1·9 = 3.36     0.344        9.76
5    11    0.9·3.36 + 0.1·11 = 4.12    0.410        10.05

The raw EMA vₜ is still far below the true input magnitudes (around 10) at step 5. The bias-corrected v̂ₜ tracks the inputs accurately from step 1.
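The whole table can be reproduced in a few lines (values match up to rounding; computed exactly, v̂₅ comes out as 10.06 rather than 10.05, because the table rounds intermediate values):

```python
beta, xs = 0.9, [10, 8, 12, 9, 11]

v = 0.0
for t, x in enumerate(xs, start=1):
    v = beta * v + (1 - beta) * x   # raw EMA
    v_hat = v / (1 - beta**t)       # bias-corrected EMA
    print(f"t={t}: x={x:>2}  v_t={v:.2f}  v_hat={v_hat:.2f}")
```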

The Key Takeaway

An EMA with decay β maintains a running estimate of the recent average of its input, with an effective memory window of 1/(1−β) steps. Bias correction fixes the cold-start problem at the beginning of training.

With this understood: momentum = EMA of gradients. RMSprop = EMA of squared gradients. Adam = both, with bias correction. That is the entire story.
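To see the mapping in code, here is a minimal scalar sketch of one Adam step built from exactly these pieces (standard Adam hyperparameter names; a sketch, not a production optimizer):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # RMSprop term: EMA of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction (matters early)
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```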
