Exponential Moving Averages: The Core Primitive of Adam

EMA definition and expansion. The effective window. Why EMA weights recent observations more. Bias correction. Numerical walkthrough with β=0.9. This is all you need to understand Adam.

🧮 Quick refresher

Weighted average

A weighted average assigns different importance to different values: weighted_avg = Σ wᵢ·xᵢ / Σ wᵢ. The weights control how much each observation contributes to the average. If all weights are equal, it reduces to the standard mean.

Example

Scores [90, 60, 80] with weights [0.5, 0.3, 0.2]: weighted avg = (0.5·90 + 0.3·60 + 0.2·80) / (0.5+0.3+0.2) = (45+18+16)/1 = 79.

Here the first score carries the largest weight. An EMA uses the same mechanism, but assigns the largest weights to the most recent observations.
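To make the arithmetic concrete, here is the same computation in a few lines of Python (a minimal sketch; weighted_avg is our own helper, not a library function):

```python
def weighted_avg(values, weights):
    """Weighted average: sum(w * x) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# The refresher example: scores [90, 60, 80] with weights [0.5, 0.3, 0.2]
print(weighted_avg([90, 60, 80], [0.5, 0.3, 0.2]))  # 79.0
```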

Why This Matters

Before deriving Adam, RMSprop, or momentum, you need one building block: the exponential moving average (EMA).

Once you understand EMA deeply, the derivation of Adam in lesson 11-7 will be immediate.

EMA is not just a math prerequisite — it is the mechanism that makes adaptive optimizers stable. Without it, Adam would react to every noisy gradient spike and diverge. The smoothing property of EMA is why modern optimizers can train billion-parameter models reliably.

Definition

Let x₁, x₂, … be a sequence of values (e.g., gradient values). The EMA with parameter β ∈ [0, 1) is defined by the recurrence below.

In plain English: the new running average is mostly the old running average (scaled by β) plus a small contribution from today's fresh observation (scaled by 1−β). The larger β is, the more you trust the past over the present.

v_t = \beta \cdot v_{t-1} + (1-\beta) \cdot x_t

  • vₜ: the EMA at step t — the running estimate
  • β: decay rate — larger β means longer memory
  • xₜ: the new observation at step t

with initial condition v₀ = 0.

The factor (1−β) ensures that if xₜ is constant, the EMA converges to that constant (not to zero):

v_t \to x \text{ as } t \to \infty \text{ when } x_t = x \text{ for all } t
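The recurrence translates directly into code. A minimal sketch (the ema helper and its names are ours):

```python
def ema(xs, beta=0.9):
    """Run v_t = beta * v_{t-1} + (1 - beta) * x_t with v_0 = 0; return all v_t."""
    v, history = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        history.append(v)
    return history

# With a constant input, the EMA converges to that constant (not zero):
print(ema([5.0] * 100)[-1])  # ≈ 4.9999, approaching 5
```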

Expanding the Recurrence

Substituting the recurrence repeatedly:

v_t = (1-\beta)\sum_{k=0}^{t-1} \beta^k \cdot x_{t-k}

  • β^k: the weight on the observation k steps ago
  • xₜ₋ₖ: the observation k steps in the past

(plus a term β^t · v₀, which is zero since v₀ = 0.)

This is a weighted average where the weight on xₜ₋ₖ decays geometrically with age k:

  • Most recent observation xₜ: weight (1−β)·β⁰ = 1−β
  • One step ago xₜ₋₁: weight (1−β)·β
  • Two steps ago: weight (1−β)·β²

Older observations are exponentially downweighted. Recent observations dominate.
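You can verify the expansion numerically: the explicit geometrically weighted sum agrees with the recurrence. A sketch, reusing the ema helper from above:

```python
xs, beta = [10, 8, 12, 9, 11], 0.9
t = len(xs)

v_recur = ema(xs, beta)[-1]  # recurrence form

# Explicit expansion: v_t = (1-beta) * sum_k beta^k * x_{t-k}
v_expand = (1 - beta) * sum(beta**k * xs[t - 1 - k] for k in range(t))

print(v_recur, v_expand)  # both ≈ 4.1213
```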

The Effective Window

How far back does an EMA actually "see"? The half-life is the number of steps k for the weight to fall to half of its initial value: β^k = 0.5, so k = log(0.5)/log(β).

A simpler and widely used approximation is the effective window:

N_{\text{eff}} = \frac{1}{1-\beta}

  • N_eff: effective window — the approximate number of recent steps the EMA remembers

β        Effective window
0.5      2 steps
0.9      10 steps
0.99     100 steps
0.999    1000 steps

In Adam, the first moment uses β₁ = 0.9 (window ≈ 10 gradient steps) and the second moment uses β₂ = 0.999 (window ≈ 1000 steps).
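Both measures scale the same way: for β near 1, log(0.5)/log(β) ≈ 0.69/(1−β), so the half-life is about 0.69·N_eff. A short sketch comparing them:

```python
import math

for beta in [0.5, 0.9, 0.99, 0.999]:
    half_life = math.log(0.5) / math.log(beta)  # steps until the weight halves
    n_eff = 1 / (1 - beta)                      # effective window
    print(f"beta={beta}: half-life ≈ {half_life:.1f}, N_eff ≈ {n_eff:.0f}")
```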

Bias Correction: The Early-Training Problem

There is a subtle issue. We initialize v₀ = 0. At early steps:

v_1 = \beta \cdot 0 + (1-\beta) \cdot x_1 = (1-\beta) \cdot x_1

For β = 0.9: v₁ = 0.1·x₁, only 10% of the actual value. The EMA starts at zero and takes many steps to "warm up." During this warm-up, estimates are biased toward zero.

The fix is bias correction:

\hat{v}_t = \frac{v_t}{1 - \beta^t}

  • v̂ₜ: bias-corrected EMA estimate at step t
  • vₜ: raw (uncorrected) EMA
  • β^t: β raised to the power t — decays toward 0 as t grows

At t = 1 we divide by 1 − 0.9 = 0.1, multiplying by 10 — correcting the downward bias. At t = 100, β^100 = 0.9^100 ≈ 2.7×10⁻⁵, so the denominator is essentially 1 — no effect.

Bias correction matters only at the start of training.
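A two-line check of how quickly the correction factor 1/(1 − β^t) fades for β = 0.9:

```python
beta = 0.9
for t in [1, 5, 10, 50, 100]:
    print(t, 1 / (1 - beta**t))  # 10.0 at t=1, ≈ 1.005 by t=50, ≈ 1.0 by t=100
```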

Full Numerical Walkthrough (β = 0.9)

Let xₜ = [10, 8, 12, 9, 11, …] (say, some gradient component). Starting from v₀ = 0:

t    xₜ    vₜ = 0.9·vₜ₋₁ + 0.1·xₜ      1 − 0.9^t    v̂ₜ = vₜ / (1 − 0.9^t)
1    10    0.9·0 + 0.1·10 = 1.00       0.100        10.00
2     8    0.9·1.00 + 0.1·8 = 1.70     0.190        8.95
3    12    0.9·1.70 + 0.1·12 = 2.73    0.271        10.07
4     9    0.9·2.73 + 0.1·9 = 3.36     0.344        9.76
5    11    0.9·3.36 + 0.1·11 = 4.12    0.410        10.05

The raw EMA vₜ is still far below the true input magnitudes (around 10) at step 5. The bias-corrected v̂ₜ tracks the inputs accurately from step 1.
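The whole table can be reproduced in a few lines (values match up to rounding; computed exactly, v̂₅ comes out as 10.06 rather than 10.05, because the table rounds intermediate values):

```python
beta, xs = 0.9, [10, 8, 12, 9, 11]

v = 0.0
for t, x in enumerate(xs, start=1):
    v = beta * v + (1 - beta) * x   # raw EMA
    v_hat = v / (1 - beta**t)       # bias-corrected EMA
    print(f"t={t}: x={x:>2}  v_t={v:.2f}  v_hat={v_hat:.2f}")
```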

The Key Takeaway

An EMA with decay β maintains a running estimate of the recent average of its input, with an effective memory window of 1/(1−β) steps. Bias correction fixes the cold-start problem at the beginning of training.

With this understood: momentum = EMA of gradients. RMSprop = EMA of squared gradients. Adam = both, with bias correction. That is the entire story.
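To see the mapping in code, here is a minimal scalar sketch of one Adam step built from exactly these pieces (standard Adam hyperparameter names; a sketch, not a production optimizer):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad       # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # RMSprop term: EMA of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction (matters early)
    v_hat = v / (1 - beta2**t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```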
