The One-Line Fix
AdaGrad's problem: the accumulator Gₜ = Σᵢ gᵢ² is a sum that only grows. Fix: replace the sum with an EMA.
RMSprop was the fix that made AdaGrad practical — and it became the default optimizer in deep RL and recurrent networks before Adam arrived. Hinton introduced it in a 2012 Coursera lecture, making it one of the few widely used algorithms that has never been formally published in a paper.
This is RMSprop (Root Mean Square Propagation). It maintains a running estimate of the recent mean squared gradient:

vₜ = β·vₜ₋₁ + (1 − β)·gₜ²

- vₜ — EMA of squared gradients — tracks recent gradient variance
- β — decay rate — controls the memory window (typical: 0.9)
- gₜ — gradient at step t

Parameter update:

θₜ₊₁ = θₜ − (α / (√vₜ + ε)) · gₜ

- α — learning rate
- ε — numerical stability constant (typically 1e-8)

Compare to AdaGrad: the only change is Gₜ = Σᵢ gᵢ² → vₜ = β·vₜ₋₁ + (1 − β)·gₜ², switching from a sum to an EMA.
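A minimal NumPy sketch makes the one-line difference concrete (the helper names here are illustrative, not from any library):

```python
import numpy as np

def adagrad_step(param, grad, G, lr=0.1, eps=1e-8):
    """AdaGrad: denominator is the full sum of squared gradients."""
    G = G + grad**2                                  # sum only grows
    param = param - lr / (np.sqrt(G) + eps) * grad
    return param, G

def rmsprop_step(param, grad, v, lr=0.1, beta=0.9, eps=1e-8):
    """RMSprop: identical update, but the sum becomes an EMA."""
    v = beta * v + (1 - beta) * grad**2              # the one-line fix
    param = param - lr / (np.sqrt(v) + eps) * grad
    return param, v
```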
Why the EMA Fixes Everything
With β = 0.9, the effective window is roughly 1/(1 − β) = 10 steps (from lesson 11-2). The accumulated vₜ tracks the recent mean squared gradient — roughly the average of the last ~10 squared gradients.
When the gradient magnitude is roughly constant (gₜ ≈ g), vₜ stabilizes at the true mean squared gradient rather than growing without bound:
Fixed point: v = β·v + (1 − β)·g² → (1 − β)·v = (1 − β)·g² → v = g².
So √vₜ ≈ |g| (the RMS of recent gradients), and the effective learning rate stabilizes at:
- α / (√vₜ + ε) ≈ α / |g| — effective learning rate — stabilizes instead of decaying
The effective learning rate no longer decays to zero. It floats around a value determined by the recent gradient scale.
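A toy check in plain Python (not library code) confirms the fixed point: with a constant gradient, v converges to g² and the effective learning rate to α/|g|:

```python
# Toy check of the fixed point: constant gradient g = 3, beta = 0.9
beta, alpha, eps, g = 0.9, 0.1, 1e-8, 3.0
v = 0.0
for _ in range(100):
    v = beta * v + (1 - beta) * g**2   # EMA of squared gradients
print(v)                       # ~9.0, i.e. g**2 -- bounded, not growing
print(alpha / (v**0.5 + eps))  # ~0.033, i.e. alpha / |g|
```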
Full Numerical Example (β = 0.9)
Gradient sequence for one parameter: [3, 3, 3, 3, 3, …] (constant gradient of 3). α=0.1, ε=1e-8.
Expected steady state: v → g² = 9, so effective LR → 0.1/3 ≈ 0.033.
| Step | gₜ² | vₜ | √vₜ | Effective LR α/(√vₜ + ε) |
|---|---|---|---|---|
| 1 | 9 | 0.9·0 + 0.1·9 = 0.90 | 0.949 | 0.105 |
| 2 | 9 | 0.9·0.90 + 0.1·9 = 1.71 | 1.308 | 0.076 |
| 3 | 9 | 0.9·1.71 + 0.1·9 = 2.44 | 1.562 | 0.064 |
| 5 | 9 | 3.69 | 1.921 | 0.052 |
| 10 | 9 | 5.86 | 2.421 | 0.041 |
| 30 | 9 | 8.62 | 2.936 | 0.034 |
| ∞ | 9 | 9.00 | 3.000 | 0.033 |
The effective learning rate converges to 0.033 and stays there — it does not continue decaying. For AdaGrad, at step 30 the accumulated G₃₀ = 270, giving effective LR = 0.1/√270 ≈ 0.006 and still falling.
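The sketch below (plain Python, illustrative only) reproduces both trajectories for the constant gradient of 3, so the step-30 comparison can be checked directly:

```python
# Effective learning rates after 30 steps of constant gradient 3
alpha, beta, eps, g = 0.1, 0.9, 1e-8, 3.0
v = G = 0.0
for _ in range(30):
    v = beta * v + (1 - beta) * g**2   # RMSprop EMA
    G = G + g**2                       # AdaGrad sum
print(alpha / (v**0.5 + eps))   # ~0.034 -- stabilizing near 0.033
print(alpha / (G**0.5 + eps))   # ~0.006 -- still shrinking as G grows
```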
Now contrast with a parameter that has smaller gradients: gradient sequence [0.1, 0.1, …].
Steady state: v → g² = 0.01, effective LR → 0.1/0.1 = 1.0. RMSprop automatically assigns this parameter a much larger effective learning rate — 30× larger than the parameter with gradient=3. Per-parameter adaptation is working, as the sketch below shows.
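A vectorized sketch (NumPy, with these two hypothetical gradient scales) shows the adaptation: both parameters share one α, but the small-gradient parameter gets a roughly 30× larger effective step:

```python
import numpy as np

# Two parameters with very different (illustrative) gradient scales
alpha, beta, eps = 0.1, 0.9, 1e-8
grads = np.array([3.0, 0.1])
v = np.zeros_like(grads)
for _ in range(200):                        # run to steady state
    v = beta * v + (1 - beta) * grads**2
eff_lr = alpha / (np.sqrt(v) + eps)
print(eff_lr)                  # ~[0.033, 1.0]
print(eff_lr[1] / eff_lr[0])   # ~30 -- larger steps for the small-gradient parameter
```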
RMSprop vs AdaGrad: Side by Side
| Property | AdaGrad | RMSprop |
|---|---|---|
| Denominator | Sum: Gₜ = Σᵢ gᵢ² | EMA: vₜ = β·vₜ₋₁ + (1 − β)·gₜ² |
| Denominator behavior | Monotonically increases | Stabilizes around mean |
| Effective LR over time | Always decays | Stabilizes |
| Long training runs | Eventually stops updating | Continues learning |
| Sparse features | Excellent | Good |
| Standard hyperparameter | N/A | β = 0.9 |
In Code
```python
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.001,
    alpha=0.9,   # β: decay rate for the squared-gradient EMA
    eps=1e-8
)
# Internally:
#   v = alpha * v + (1 - alpha) * grad**2
#   param -= lr / (v.sqrt() + eps) * grad
```
Typical learning rate for RMSprop: 1e-3 or 1e-4 (same order as Adam). The adaptive scaling does most of the work, so LR tuning is less sensitive than for SGD.
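For completeness, here is a minimal usage sketch around that optimizer; the model, data, and loss are placeholders chosen for illustration, not part of the lesson:

```python
import torch

# Placeholder model, data, and loss purely for illustration
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()   # each parameter's step is scaled by 1 / (sqrt(v) + eps)
```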
Up next: Adam combines RMSprop's second moment with a first-moment EMA (momentum) and adds bias correction — completing the full picture.