Advanced Optimization
Lesson 6 ⏱ 12 min


RMSprop: Adaptive Rates Without the Death Spiral

AdaGrad's fatal flaw: monotonic decay. The fix: replace sum with EMA. RMSprop derivation. Effective window. Numerical stability. Why this is 75% of Adam.


Quick refresher

AdaGrad's accumulated squared gradient

AdaGrad accumulates Gₜ = Gₜ₋₁ + (∇L)² — a monotonically growing sum of squared gradients. The effective learning rate α/√Gₜ decreases permanently, eventually approaching zero and stopping learning.

Example

After 1000 steps with average squared gradient of 1, G₁₀₀₀ = 1000, effective LR = α/√1000 ≈ 0.032α.

After 1,000,000 steps, effective LR ≈ 0.001α.

AdaGrad essentially stops updating after enough steps.
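To make the decay concrete, here is a minimal sketch (illustrative only, not library code) that replays these numbers with a constant squared gradient of 1 and an assumed base learning rate α = 0.01:

import math

alpha = 0.01          # assumed base learning rate
G = 0.0               # AdaGrad's accumulated squared gradient
for step in range(1, 1_000_001):
    G += 1.0          # constant squared gradient of 1
    if step in (1_000, 1_000_000):
        # step 1000 -> ~0.032 * alpha; step 1,000,000 -> ~0.001 * alpha
        print(step, alpha / math.sqrt(G))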

The One-Line Fix

AdaGrad's problem: Gₜ is a sum that only grows. The fix: replace the sum with an EMA.

RMSprop was the fix that made AdaGrad practical — and it became the default optimizer in deep RL and recurrent networks before Adam arrived. Hinton introduced it in a 2012 Coursera lecture, making it one of the few widely used algorithms that has never been formally published in a paper.

This is RMSprop (Root Mean Square Propagation). It maintains a running estimate of the recent mean squared gradient:

vₜ = β·vₜ₋₁ + (1−β)·(∇L(θₜ))²

vₜ: EMA of squared gradients — tracks recent gradient variance
β: decay rate — controls the memory window (typical: 0.9)
∇L(θₜ): gradient at step t

Parameter update:

θₜ₊₁ = θₜ − (α / (√vₜ + ε))·∇L(θₜ)

α: learning rate
ε: numerical stability constant (typically 1e-8)

Compare to AdaGrad: the only change is Gₜ → vₜ, switching from a sum to an EMA.
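The two update equations translate line for line into code. Below is a minimal sketch of a single RMSprop step on a NumPy parameter vector; the function name rmsprop_step and the zero-initialized v are choices made here for illustration, not part of any library API:

import numpy as np

def rmsprop_step(theta, grad, v, alpha=0.001, beta=0.9, eps=1e-8):
    # v_t = beta * v_{t-1} + (1 - beta) * grad^2   (EMA of squared gradients)
    v = beta * v + (1 - beta) * grad ** 2
    # theta_{t+1} = theta_t - alpha / (sqrt(v_t) + eps) * grad
    theta = theta - alpha / (np.sqrt(v) + eps) * grad
    return theta, v

theta, v = np.zeros(3), np.zeros(3)   # v starts at zero, just like AdaGrad's G
grad = np.array([3.0, 0.1, -0.5])     # example gradient
theta, v = rmsprop_step(theta, grad, v)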

Why the EMA Fixes Everything

With β = 0.9, the effective window is 1/(1−0.9) = 10 steps (from lesson 11-2). The accumulated vₜ tracks the recent mean squared gradient — roughly the average of the last ~10 squared gradients.

When the gradient magnitude is roughly constant, vₜ stabilizes at the true mean squared gradient rather than growing without bound:

Fixed point: v = β·v + (1−β)·g²  ⟹  v·(1−β) = (1−β)·g²  ⟹  v = g².

So √vₜ ≈ g (the RMS of recent gradients), and the effective learning rate stabilizes at:

α_eff = α/√vₜ ≈ α/g

α_eff: effective learning rate — stabilizes instead of decaying

The effective learning rate no longer decays to zero. It floats around a value determined by the recent gradient scale.
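A quick numeric check of that fixed point (a sketch, using the same constant gradient g = 3, β = 0.9, and α = 0.1 as the worked example below):

beta, g, alpha = 0.9, 3.0, 0.1
v = 0.0
for _ in range(200):                  # run the EMA to steady state
    v = beta * v + (1 - beta) * g ** 2
print(v)                              # -> ~9.0, i.e. g^2
print(alpha / v ** 0.5)               # -> ~0.0333, i.e. alpha / g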

Full Numerical Example (β = 0.9)

Gradient sequence for one parameter: [3, 3, 3, 3, 3, …] (constant gradient of 3). α=0.1, ε=1e-8.

Expected steady-state: v → 9, so effective LR → 0.1/3 ≈ 0.033.

Step | (∇L)² | vₜ | √vₜ | Effective LR
1 | 9 | 0.9·0 + 0.1·9 = 0.90 | 0.949 | 0.105
2 | 9 | 0.9·0.90 + 0.1·9 = 1.71 | 1.308 | 0.076
3 | 9 | 0.9·1.71 + 0.1·9 = 2.44 | 1.562 | 0.064
5 | 9 | 3.69 | 1.920 | 0.052
10 | 9 | 5.86 | 2.421 | 0.041
30 | 9 | 8.62 | 2.936 | 0.034
∞ | 9 | 9.00 | 3.000 | 0.033

The effective learning rate converges to 0.033 and stays there — it does not continue decaying. For AdaGrad, at step 30 the accumulated G₃₀ = 270, giving effective LR = 0.1/√270 ≈ 0.006 and still falling.

Now contrast with a parameter that has smaller gradients: gradient sequence [0.1, 0.1, …].

Steady-state: v0.01v \to 0.01, effective LR → 0.1/0.1 = 1.0. RMSprop automatically assigns this parameter a much larger effective learning rate — 30× larger than the parameter with gradient=3. Per-parameter adaptation is working.
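A short sketch (illustrative, not library code) that runs the same EMA for both parameters side by side and reproduces the ~30× ratio derived above:

import numpy as np

beta, alpha, eps = 0.9, 0.1, 1e-8
grads = np.array([3.0, 0.1])          # large-gradient and small-gradient parameter
v = np.zeros(2)
for _ in range(200):                  # enough steps to reach steady state
    v = beta * v + (1 - beta) * grads ** 2
eff_lr = alpha / (np.sqrt(v) + eps)
print(eff_lr)                         # -> [~0.033, ~1.0]
print(eff_lr[1] / eff_lr[0])          # -> ~30x larger LR for the small-gradient parameter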

RMSprop vs AdaGrad: Side by Side

Property | AdaGrad | RMSprop
Denominator | Sum: Gₜ = Gₜ₋₁ + g² | EMA: vₜ = β·vₜ₋₁ + (1−β)·g²
Denominator behavior | Monotonically increases | Stabilizes around the mean g²
Effective LR over time | Always decays | Stabilizes
Long training runs | Eventually stops updating | Continues learning
Sparse features | Excellent | Good
Standard hyperparameter | N/A | β = 0.9

In Code

import torch

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.001,
    alpha=0.9,      # β: decay for squared gradient EMA
    eps=1e-8
)

# Internally:
# v = alpha * v + (1 - alpha) * grad**2
# param -= lr / (v.sqrt() + eps) * grad

Typical learning rate for RMSprop: 1e-3 or 1e-4 (the same order as Adam). The adaptive scaling does most of the work, so training is less sensitive to the exact learning rate than with SGD.
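For completeness, a minimal end-to-end usage sketch with a toy model and random data (the model, data, and loop length are illustrative assumptions, not part of this lesson):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # toy model
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # random batch
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                             # compute gradients
    optimizer.step()                            # RMSprop update using the EMA of squared gradients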

Up next: Adam combines RMSprop's second moment with a first-moment EMA (momentum) and adds bias correction — completing the full picture.

Quiz


RMSprop replaces AdaGrad's Gₜ = Gₜ₋₁ + (∇L)² with: