Advanced Optimization
Lesson 6 ⏱ 12 min


RMSprop: Adaptive Rates Without the Death Spiral

AdaGrad's fatal flaw: monotonic decay. The fix: replace sum with EMA. RMSprop derivation. Effective window. Numerical stability. Why this is 75% of Adam.


Quick refresher

AdaGrad's accumulated squared gradient

AdaGrad accumulates Gₜ = Gₜ₋₁ + (∇L)² — a monotonically growing sum of squared gradients. The effective learning rate α/√Gₜ decreases permanently, eventually approaching zero and stopping learning.

Example

After 1000 steps with average squared gradient of 1, G₁₀₀₀ = 1000, effective LR = α/√1000 ≈ 0.032α.

After 1,000,000 steps, effective LR ≈ 0.001α.

AdaGrad essentially stops updating after enough steps.
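To make the decay concrete, here is a minimal sketch (illustrative only, not library code) that replays these numbers with a constant squared gradient of 1 and an assumed base learning rate α = 0.01:

import math

alpha = 0.01          # assumed base learning rate
G = 0.0               # AdaGrad's accumulated squared gradient
for step in range(1, 1_000_001):
    G += 1.0          # constant squared gradient of 1
    if step in (1_000, 1_000_000):
        # step 1000 -> ~0.032 * alpha; step 1,000,000 -> ~0.001 * alpha
        print(step, alpha / math.sqrt(G))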

The One-Line Fix

AdaGrad's problem: Gₜ is a sum that only grows. The fix: replace the sum with an EMA.

RMSprop was the fix that made AdaGrad practical — and it became the default optimizer in deep RL and recurrent networks before Adam arrived. Hinton introduced it in a 2012 Coursera lecture, making it one of the few widely used algorithms that has never been formally published in a paper.

This is RMSprop (Root Mean Square Propagation). It maintains a running estimate of the recent mean squared gradient:

vₜ = β·vₜ₋₁ + (1−β)·(∇L(θₜ))²

vₜ: EMA of squared gradients — tracks recent gradient variance
β: decay rate — controls the memory window (typical: 0.9)
∇L(θₜ): gradient at step t

Parameter update:

θₜ₊₁ = θₜ − (α / (√vₜ + ε))·∇L(θₜ)

α: learning rate
ε: numerical stability constant (typically 1e-8)

Compare to AdaGrad: the only change is Gₜ → vₜ, switching from a sum to an EMA.
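The two update equations translate line for line into code. Below is a minimal sketch of a single RMSprop step on a NumPy parameter vector; the function name rmsprop_step and the zero-initialized v are choices made here for illustration, not part of any library API:

import numpy as np

def rmsprop_step(theta, grad, v, alpha=0.001, beta=0.9, eps=1e-8):
    # v_t = beta * v_{t-1} + (1 - beta) * grad^2   (EMA of squared gradients)
    v = beta * v + (1 - beta) * grad ** 2
    # theta_{t+1} = theta_t - alpha / (sqrt(v_t) + eps) * grad
    theta = theta - alpha / (np.sqrt(v) + eps) * grad
    return theta, v

theta, v = np.zeros(3), np.zeros(3)   # v starts at zero, just like AdaGrad's G
grad = np.array([3.0, 0.1, -0.5])     # example gradient
theta, v = rmsprop_step(theta, grad, v)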

Why the EMA Fixes Everything

With β = 0.9, the effective window is 1/(1−0.9) = 10 steps (from lesson 11-2). The accumulated vₜ tracks the recent mean squared gradient — roughly the average of the last ~10 squared gradients.

When the gradient magnitude is roughly constant, vₜ stabilizes at the true mean squared gradient rather than growing without bound:

Fixed point: v = β·v + (1−β)·g²  ⟹  v·(1−β) = (1−β)·g²  ⟹  v = g².

So √vₜ ≈ g (the RMS of recent gradients), and the effective learning rate stabilizes at:

α_eff = α/√vₜ ≈ α/g

α_eff: effective learning rate — stabilizes instead of decaying

The effective learning rate no longer decays to zero. It floats around a value determined by the recent gradient scale.
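A quick numeric check of that fixed point (a sketch, using the same constant gradient g = 3, β = 0.9, and α = 0.1 as the worked example below):

beta, g, alpha = 0.9, 3.0, 0.1
v = 0.0
for _ in range(200):                  # run the EMA to steady state
    v = beta * v + (1 - beta) * g ** 2
print(v)                              # -> ~9.0, i.e. g^2
print(alpha / v ** 0.5)               # -> ~0.0333, i.e. alpha / g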

Full Numerical Example (β = 0.9)

Gradient sequence for one parameter: [3, 3, 3, 3, 3, …] (constant gradient of 3). α=0.1, ε=1e-8.

Expected steady-state: v → 9, so effective LR → 0.1/3 ≈ 0.033.

Step | (∇L)² | vₜ | √vₜ | Effective LR
1 | 9 | 0.9·0 + 0.1·9 = 0.90 | 0.949 | 0.105
2 | 9 | 0.9·0.90 + 0.1·9 = 1.71 | 1.308 | 0.076
3 | 9 | 0.9·1.71 + 0.1·9 = 2.44 | 1.562 | 0.064
5 | 9 | 3.69 | 1.920 | 0.052
10 | 9 | 5.86 | 2.421 | 0.041
30 | 9 | 8.62 | 2.936 | 0.034
∞ | 9 | 9.00 | 3.000 | 0.033

The effective learning rate converges to 0.033 and stays there — it does not continue decaying. For AdaGrad, at step 30 the accumulated G₃₀ = 270, giving effective LR = 0.1/√270 ≈ 0.006 and still falling.

Now contrast with a parameter that has smaller gradients: gradient sequence [0.1, 0.1, …].

Steady-state: v0.01v \to 0.01, effective LR → 0.1/0.1 = 1.0. RMSprop automatically assigns this parameter a much larger effective learning rate — 30× larger than the parameter with gradient=3. Per-parameter adaptation is working.
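A short sketch (illustrative, not library code) that runs the same EMA for both parameters side by side and reproduces the ~30× ratio derived above:

import numpy as np

beta, alpha, eps = 0.9, 0.1, 1e-8
grads = np.array([3.0, 0.1])          # large-gradient and small-gradient parameter
v = np.zeros(2)
for _ in range(200):                  # enough steps to reach steady state
    v = beta * v + (1 - beta) * grads ** 2
eff_lr = alpha / (np.sqrt(v) + eps)
print(eff_lr)                         # -> [~0.033, ~1.0]
print(eff_lr[1] / eff_lr[0])          # -> ~30x larger LR for the small-gradient parameter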

RMSprop vs AdaGrad: Side by Side

Property | AdaGrad | RMSprop
Denominator | Sum: Gₜ = Gₜ₋₁ + g² | EMA: vₜ = β·vₜ₋₁ + (1−β)·g²
Denominator behavior | Monotonically increases | Stabilizes around the mean g²
Effective LR over time | Always decays | Stabilizes
Long training runs | Eventually stops updating | Continues learning
Sparse features | Excellent | Good
Standard hyperparameter | N/A | β = 0.9

In Code

import torch

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.001,
    alpha=0.9,      # β: decay for squared gradient EMA
    eps=1e-8
)

# Internally:
# v = alpha * v + (1 - alpha) * grad**2
# param -= lr / (v.sqrt() + eps) * grad

Typical learning rate for RMSprop: 1e-3 or 1e-4 (the same order as Adam). The adaptive scaling does most of the work, so training is less sensitive to the exact learning rate than with SGD.
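For completeness, a minimal end-to-end usage sketch with a toy model and random data (the model, data, and loop length are illustrative assumptions, not part of this lesson):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # random batch
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                             # compute gradients
    optimizer.step()                            # RMSprop update using the EMA of squared gradients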

Up next: Adam combines RMSprop's second moment with a first-moment EMA (momentum) and adds bias correction — completing the full picture.

Quiz


RMSprop replaces AdaGrad's Gₜ = Gₜ₋₁ + (∇L)² with: