
RMSprop: adaptive per-parameter rates


This lesson covers the problem with a global learning rate for heterogeneous parameters, the exponential moving average of squared gradients, and a side-by-side comparison of vanilla SGD and RMSprop on a two-parameter loss surface.

🧮 Quick refresher

Exponential moving average

An exponential moving average (EMA) of a sequence s₁, s₂, … is updated as EMA ← β·EMA + (1-β)·sₜ. It weights recent values more heavily than old ones, with the influence of past values decaying exponentially. The decay rate is controlled by β.

Example

Starting from EMA = 0, β = 0.9, receiving values 4, 4, 4: after step 1, EMA = 0.4; after step 2, EMA = 0.9×0.4 + 0.1×4 = 0.76; after step 3, EMA = 0.9×0.76 + 0.1×4 = 1.084.

It converges toward 4 gradually.
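The same arithmetic in a few lines of Python (a minimal sketch; ema_update is a name invented here for illustration):

def ema_update(ema, x, beta=0.9):
    # Weight the old average by beta and the new value by 1 - beta
    return beta * ema + (1 - beta) * x

ema = 0.0
for step, x in enumerate([4, 4, 4], start=1):
    ema = ema_update(ema, x)
    print(f"step {step}: EMA = {ema:.3f}")
# step 1: EMA = 0.400
# step 2: EMA = 0.760
# step 3: EMA = 1.084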

The Problem with One Learning Rate

Every parameter in a neural network has a different story. A weight connected to a high-frequency, high-variance feature may receive gradients of magnitude 10 at every step. A weight connected to a rare or weakly predictive feature may receive gradients near 0.01 most of the time.

A single global learning rate is a compromise. Set it large enough to move the rarely-updated parameters forward, and the frequently-updated ones overshoot. Set it small enough for the frequently-updated ones, and the sparse parameters barely move.

RMSprop solves this by giving each parameter its own effective learning rate, adapted dynamically based on its recent gradient history.

Imagine tuning a large mixing board where some knobs are adjusted constantly and others barely move all session. You'd want to nudge the inactive ones more aggressively and be cautious with the ones you've already dialed in. Vanilla gradient descent uses the same step size for every parameter regardless of its history. RMSprop watches each parameter's recent behavior and adapts: frequent, large gradients get smaller steps; rare, small gradients get larger ones.

The RMSprop Algorithm

RMSprop maintains one extra number per parameter: v, the exponential moving average (EMA) of squared gradients.

v ← β·v + (1-β)·g²
θ ← θ - (α/√(v+ε))·g
  • v: EMA of squared gradients (per parameter)
  • β: decay rate for the EMA (default 0.9)
  • g: gradient of this parameter at the current step
  • α: global learning rate
  • ε: small constant for numerical stability (default 1e-8)
  • θ: the parameter being updated

The effective learning rate for this parameter is α/√(v+ε). When recent squared gradients are large, v is large, and the effective rate shrinks. When recent squared gradients are small, v is small, and the effective rate grows.
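Written out in code, one update step looks like the sketch below (plain NumPy with illustrative names, not the PyTorch implementation shown later):

import numpy as np

def rmsprop_step(theta, g, v, alpha=0.01, beta=0.9, eps=1e-8):
    # v: per-parameter EMA of squared gradients
    v = beta * v + (1 - beta) * g**2
    # Each parameter gets its own effective rate alpha / sqrt(v + eps)
    theta = theta - alpha / np.sqrt(v + eps) * g
    return theta, v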

Worked Example: Two Parameters

Suppose α = 0.1, β = 0.9, ε = 10⁻⁸, starting from θ₁ = θ₂ = 1.0 and v₁ = v₂ = 0.

Gradients at step 1: g₁ = 10.0, g₂ = 0.1

Update v: v₁ = 0.9(0) + 0.1(10)² = 10.0, v₂ = 0.9(0) + 0.1(0.1)² = 0.001

Effective rates: 0.1/√(10 + ε) ≈ 0.032 for θ₁, 0.1/√(0.001 + ε) ≈ 3.16 for θ₂

Gradients at step 2: g₁ = 10.0, g₂ = 0.1 (unchanged)

Update v: v₁ = 0.9(10) + 0.1(100) = 19.0, v₂ = 0.9(0.001) + 0.1(0.01) = 0.0019

Effective rates: 0.1/√19 ≈ 0.023 for θ₁ (getting smaller as large gradients persist), 0.1/√0.0019 ≈ 2.29 for θ₂ (still far larger)

Parameter 1 — with large, consistent gradients — gets a small effective step. Parameter 2 — with tiny gradients — moves much more aggressively per step.
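You can reproduce the trace above in a few lines (assuming NumPy; the numbers match the worked example):

import numpy as np

theta = np.array([1.0, 1.0])
v = np.zeros(2)
g = np.array([10.0, 0.1])          # same gradients at both steps
alpha, beta, eps = 0.1, 0.9, 1e-8

for step in (1, 2):
    v = beta * v + (1 - beta) * g**2
    print(f"step {step}: v = {v}, effective rates = {alpha / np.sqrt(v + eps)}")
# step 1: v = [10, 0.001],  effective rates ≈ [0.032, 3.162]
# step 2: v = [19, 0.0019], effective rates ≈ [0.023, 2.294]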

What β Controls

The decay rate β determines the timescale of adaptation. With β = 0.9, the effective window is roughly 1/(1-β) = 10 steps: the RMSprop estimate of gradient magnitude is based on the last ~10 gradients.

  • β close to 1 (0.99): slow adaptation, stable effective rates, better for stationary gradient statistics
  • β = 0.9 (default): responsive to recent gradient history
  • β close to 0: relies almost entirely on the current squared gradient, so the estimate is very noisy
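A quick sanity check on the 1/(1-β) window (a small sketch; the EMA's weight on the value from k steps ago is (1-β)·βᵏ):

beta = 0.9
window = int(1 / (1 - beta))        # = 10 steps

# Total EMA weight carried by the most recent `window` values
recent_mass = sum((1 - beta) * beta**k for k in range(window))
print(f"weight on the last {window} values: {recent_mass:.2f}")
# ≈ 0.65: about two-thirds of the estimate comes from the last ~10 steps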

RMSprop vs. AdaGrad

AdaGrad (an earlier adaptive method) accumulates the sum of all squared gradients: v ← v + g². The effective rate then shrinks monotonically and eventually approaches zero — learning stops. RMSprop fixes this by using an EMA instead, so old gradients decay away and the effective rate can recover if recent gradients become small.
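The contrast is easy to simulate: hold the gradient constant and watch the two effective rates diverge (an illustrative sketch, not library code):

alpha, beta, eps, g = 0.1, 0.9, 1e-8, 1.0
v_ada, v_rms = 0.0, 0.0

for _ in range(100):
    v_ada += g**2                              # AdaGrad: sum keeps growing
    v_rms = beta * v_rms + (1 - beta) * g**2   # RMSprop: EMA levels off

print(alpha / (v_ada + eps)**0.5)   # ≈ 0.01, still shrinking every step
print(alpha / (v_rms + eps)**0.5)   # ≈ 0.1, settled near alpha / |g|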

Code: RMSprop in PyTorch

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model; substitute your own
criterion = nn.MSELoss()   # placeholder loss

# PyTorch's alpha plays the role of β (its default is 0.99; we pass 0.9
# to match this lesson), and eps corresponds to ε
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9, eps=1e-8)

for x_batch, y_batch in dataloader:   # dataloader: your training DataLoader
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()

PyTorch's alpha parameter corresponds to β in the formula above. RMSprop is often a good choice for recurrent networks — it was the optimizer used in the original DeepMind Atari paper and remains competitive with Adam in certain settings.

Unit 11 (Advanced Optimization) derives the theoretical motivation for adaptive gradient methods and presents the complete convergence analysis.
