Exponential Moving Averages: The Core Primitive of Adam
EMA definition and expansion. The effective window. Why EMA weights recent observations more. Bias correction. Numerical walkthrough with β=0.9. This is all you need to understand Adam.
⏱ ~7 min
Quick refresher: weighted average
A weighted average assigns different importance to different values: weighted_avg = Σ wᵢ·xᵢ / Σ wᵢ. The weights control how much each observation contributes to the average. If all weights are equal, it reduces to the standard mean.
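As a quick sanity check of the formula (the example values here are ours, for illustration):

```python
# Weighted average: each value contributes in proportion to its weight.
values  = [10.0, 8.0, 12.0]
weights = [0.5, 0.3, 0.2]

weighted_avg = sum(w * x for w, x in zip(weights, values)) / sum(weights)
print(weighted_avg)  # 0.5*10 + 0.3*8 + 0.2*12 = 9.8 (weights already sum to 1)
```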
Before deriving Adam, RMSprop, or momentum, you need one building block: the exponential moving average (EMA).
Once you understand EMA deeply, the derivation of Adam in lesson 11-7 will be immediate.
EMA is not just a math prerequisite — it is the mechanism that makes adaptive optimizers stable. Without it, Adam would react to every noisy gradient spike and diverge. The smoothing property of EMA is why modern optimizers can train billion-parameter models reliably.
Definition
Let x_1, x_2, … be a sequence of values (e.g., gradient values). The EMA with decay parameter β ∈ [0, 1) is:
In plain English: the new running average is mostly the old running average (scaled by β) plus a small contribution from today's fresh observation (scaled by 1−β). The larger β is, the more you trust the past over the present.
v_t = β·v_{t−1} + (1−β)·x_t

where v_t is the EMA at step t (the running estimate), β is the decay rate (larger β means longer memory), and x_t is the new observation at step t, with initial condition v_0 = 0.
The factor (1−β) ensures that if x_t is constant, the EMA converges to that constant (not to zero):

v_t → x as t → ∞ when x_t = x for all t
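To make the recurrence concrete, here is a minimal Python sketch (the function name `ema` and the demo values are ours, for illustration):

```python
def ema(xs, beta=0.9):
    """Exponential moving average of a sequence, with v_0 = 0."""
    v = 0.0
    history = []
    for x in xs:
        v = beta * v + (1 - beta) * x  # the EMA recurrence
        history.append(v)
    return history

# A constant input converges to that constant, not to zero:
print(ema([5.0] * 50)[-1])  # ≈ 4.974, approaching 5.0
```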
Expanding the Recurrence
Substituting the recurrence repeatedly:
v_t = (1−β) · Σ_{k=0}^{t−1} β^k · x_{t−k}

where β^k is the weight on the observation k steps ago and x_{t−k} is that observation, k steps in the past. (Plus a term β^t·v_0, which is zero since v_0 = 0.)
This is a weighted average where the weight on x_{t−k} decays geometrically with age k:

- Most recent observation x_t: weight (1−β)·β^0 = 1−β
- One step ago, x_{t−1}: weight (1−β)·β
- Two steps ago, x_{t−2}: weight (1−β)·β^2
Older observations are exponentially downweighted. Recent observations dominate.
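A quick numerical check that the expansion agrees with the recurrence (a sketch; the input values are arbitrary):

```python
beta = 0.9
xs = [10.0, 8.0, 12.0, 9.0, 11.0]

# Recurrence form
v = 0.0
for x in xs:
    v = beta * v + (1 - beta) * x

# Expanded form: v_t = (1 - beta) * sum_k beta^k * x_{t-k}
t = len(xs)
v_expanded = (1 - beta) * sum(beta**k * xs[t - 1 - k] for k in range(t))

print(v, v_expanded)  # both ≈ 4.1213
```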
The Effective Window
How far back does an EMA actually "see"? The half-life is the number of steps k at which the weight falls to half its initial value: β^k = 0.5, so k = log(0.5)/log(β). For β = 0.9, that gives k ≈ 6.6 steps.
A simpler and widely used approximation is the effective window:

N_eff = 1/(1−β)

where N_eff is the approximate number of recent steps the EMA remembers.

| β     | Effective window |
|-------|------------------|
| 0.5   | 2 steps          |
| 0.9   | 10 steps         |
| 0.99  | 100 steps        |
| 0.999 | 1000 steps       |
In Adam, the first moment uses β_1 = 0.9 (window ≈ 10 gradient steps) and the second moment uses β_2 = 0.999 (window ≈ 1000 steps).
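Both the half-life and the effective window are one-liners to compute; this sketch prints them for the values in the table above:

```python
import math

for beta in [0.5, 0.9, 0.99, 0.999]:
    n_eff = 1 / (1 - beta)                      # effective window 1/(1-beta)
    half_life = math.log(0.5) / math.log(beta)  # steps until a weight halves
    print(f"beta={beta}: N_eff={n_eff:.0f}, half-life={half_life:.1f}")
```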
Bias Correction: The Early-Training Problem
There is a subtle issue. We initialize v_0 = 0. At early steps:

v_1 = β·0 + (1−β)·x_1 = (1−β)·x_1

For β = 0.9: v_1 = 0.1·x_1, only 10% of the actual value. The EMA starts at zero and takes many steps to "warm up." During this warm-up, estimates are biased toward zero.
The fix is bias correction:
v̂_t = v_t / (1−β^t)

where v̂_t is the bias-corrected EMA estimate at step t, v_t is the raw (uncorrected) EMA, and β^t is β raised to the power t, which decays toward 0 as t grows.
At t = 1, the correction divides by 1 − 0.9 = 0.1, multiplying by 10 and correcting the downward bias. At t = 100, β^100 = 0.9^100 ≈ 2.7×10⁻⁵, so the denominator is essentially 1 and the correction has no effect.
Bias correction matters only at the start of training.
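A minimal sketch of the corrected estimate in code (the function name `ema_corrected` is ours):

```python
def ema_corrected(xs, beta=0.9):
    """EMA with bias correction, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, x in enumerate(xs, start=1):
        v = beta * v + (1 - beta) * x  # raw EMA
        out.append(v / (1 - beta**t))  # bias-corrected estimate
    return out

# At t=1 the correction divides by 0.1, recovering x_1 exactly:
print(ema_corrected([10.0]))  # ≈ [10.0]
```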
Full Numerical Walkthrough (β = 0.9)
Let x_t = [10, 8, 12, 9, 11, …] (say, some gradient component). Starting from v_0 = 0:
| t | x_t | v_t = 0.9·v_{t−1} + 0.1·x_t | 1 − 0.9^t | v̂_t = v_t/(1 − 0.9^t) |
|---|-----|------------------------------|-----------|------------------------|
| 1 | 10  | 0.9·0 + 0.1·10 = 1.00        | 0.100     | 10.00                  |
| 2 | 8   | 0.9·1.00 + 0.1·8 = 1.70      | 0.190     | 8.95                   |
| 3 | 12  | 0.9·1.70 + 0.1·12 = 2.73     | 0.271     | 10.07                  |
| 4 | 9   | 0.9·2.73 + 0.1·9 = 3.36      | 0.344     | 9.76                   |
| 5 | 11  | 0.9·3.36 + 0.1·11 = 4.12     | 0.410     | 10.06                  |
The raw EMA v_t is still far below the true input magnitudes (around 10) at step 5. The bias-corrected v̂_t tracks the inputs accurately from step 1.
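The whole table can be reproduced in a few lines (a self-contained sketch mirroring the recurrence above):

```python
beta, v = 0.9, 0.0
for t, x in enumerate([10.0, 8.0, 12.0, 9.0, 11.0], start=1):
    v = beta * v + (1 - beta) * x
    print(f"t={t}: x={x}, v={v:.2f}, 1-beta^t={1 - beta**t:.3f}, v_hat={v / (1 - beta**t):.2f}")
# Final row: v=4.12, v_hat=10.06, matching the table
```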
The Key Takeaway
An EMA with decay β maintains a running estimate of the recent average of its input, with an effective memory window of 1/(1−β) steps. Bias correction fixes the cold-start problem at the beginning of training.
With this understood: momentum = EMA of gradients. RMSprop = EMA of squared gradients. Adam = both, with bias correction. That is the entire story.
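As a preview of lesson 11-7, here is how those three pieces combine in an Adam-style update on a single scalar parameter (a simplified sketch; `lr` and `eps` are the usual learning-rate and numerical-stability hyperparameters):

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: two EMAs plus bias correction."""
    m = beta1 * m + (1 - beta1) * grad     # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad**2  # RMSprop: EMA of squared gradients
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v
```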
Quiz
With β=0.9, the EMA has an effective window of approximately how many steps?