The Problem with One Learning Rate
Every parameter in a neural network has a different story. A weight connected to a high-frequency, high-variance feature may receive gradients of magnitude 10 at every step. A weight connected to a rare or weakly predictive feature may receive gradients near 0.01 most of the time.
A single global learning rate is a compromise. Set it large enough to move the rarely-updated parameters forward, and the frequently-updated ones overshoot. Set it small enough for the frequently-updated ones, and the sparse parameters barely move.
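To see the compromise in numbers, here is a quick sketch using the hypothetical gradient magnitudes above:

# One global learning rate applied to both kinds of parameters
lr = 0.01
g_frequent, g_rare = 10.0, 0.01       # gradient magnitudes from the scenario above

print(lr * g_frequent)                # 0.1    per step -- likely to overshoot
print(lr * g_rare)                    # 0.0001 per step -- barely moves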
RMSprop solves this by giving each parameter its own effective learning rate, adapted dynamically based on its recent gradient history.
Imagine tuning a large mixing board where some knobs are adjusted constantly and others barely move all session. You'd want to nudge the inactive ones more aggressively and be cautious with the ones you've already dialed in. Vanilla gradient descent uses the same step size for every parameter regardless of its history. RMSprop watches each parameter's recent behavior and adapts: frequent, large gradients get smaller steps; rare, small gradients get larger ones.
The RMSprop Algorithm
RMSprop maintains one extra number per parameter: v, the exponential moving average (EMA) of squared gradients. At each step it updates v and then the parameter:

v ← β·v + (1 − β)·g²
θ ← θ − η·g / (√v + ε)

where:
- v — EMA of squared gradients (per parameter)
- β — decay rate for the EMA — default 0.9
- g — gradient of this parameter at the current step
- η — global learning rate
- ε — small constant for numerical stability — default 1e-8
- θ — the parameter being updated
The effective learning rate for this parameter is η / (√v + ε). When recent squared gradients are large, v is large, and the effective rate shrinks. When recent squared gradients are small, v is small, and the effective rate grows.
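A minimal sketch of the update in code (plain Python with NumPy, variable names matching the symbols above):

import numpy as np

def rmsprop_step(theta, g, v, lr=0.01, beta=0.9, eps=1e-8):
    # v: running EMA of squared gradients, same shape as theta
    v = beta * v + (1 - beta) * g**2
    # each parameter is scaled by its own effective rate lr / (sqrt(v) + eps)
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v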
Worked Example: Two Parameters
Suppose β = 0.9, η = 0.01, ε = 1e-8, starting from v₁ = 0 and v₂ = 0.
Gradients at step 1: g₁ = 10, g₂ = 0.01
Update v: v₁ = 0.9·0 + 0.1·(10)² = 10, v₂ = 0.9·0 + 0.1·(0.01)² = 1e-5
Effective rates: η/(√v₁ + ε) ≈ 0.01/3.16 ≈ 0.0032, η/(√v₂ + ε) ≈ 0.01/0.0032 ≈ 3.16
Gradients at step 2: g₁ = 10, g₂ = 0.01 (the same again)
Update v: v₁ = 0.9·10 + 0.1·100 = 19, v₂ = 0.9·1e-5 + 0.1·1e-4 = 1.9e-5
Effective rates: ≈ 0.01/4.36 ≈ 0.0023 (getting smaller), ≈ 0.01/0.0044 ≈ 2.29 (also adapting)
Parameter 1, with large, consistent gradients, ends up with a small effective learning rate. Parameter 2, with tiny gradients, gets an effective rate roughly a thousand times larger, so it moves far more per step than it would under vanilla gradient descent.
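A few lines of Python reproduce the arithmetic (same β, η, ε, and gradients as above):

v = [0.0, 0.0]                            # one EMA per parameter
beta, lr, eps = 0.9, 0.01, 1e-8
for step in (1, 2):
    g = [10.0, 0.01]                      # the same gradients arrive each step
    v = [beta * vi + (1 - beta) * gi**2 for vi, gi in zip(v, g)]
    rates = [lr / (vi**0.5 + eps) for vi in v]
    print(step, v, rates)
# step 1: v = [10, 1e-5],   rates ≈ [0.0032, 3.16]
# step 2: v = [19, 1.9e-5], rates ≈ [0.0023, 2.29]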
What β Controls
The decay rate β determines the timescale of adaptation. With β = 0.9, the effective window is roughly 1/(1 − β) = 10 steps: the RMSprop estimate of gradient magnitude is based on the last ~10 gradient steps (see the short sketch after the list below).
- β close to 1 (0.99): slow adaptation, stable effective rates, better for stationary gradient statistics
- β = 0.9 (default): responsive to recent gradient history
- β close to 0: almost uses only the current gradient squared — very noisy
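The 1/(1 − β) rule of thumb comes from the fact that the EMA weights a squared gradient from k steps ago by (1 − β)·β^k, which has decayed to roughly 1/e of its initial weight after about 1/(1 − β) steps. A small sketch:

# Weight the EMA places on the squared gradient from k steps ago: (1 - beta) * beta**k
for beta in (0.5, 0.9, 0.99):
    window = 1 / (1 - beta)              # rule-of-thumb averaging window
    decay_at_window = beta ** window     # ~1/e for beta near 1
    print(beta, window, round(decay_at_window, 3))
# windows: 0.5 -> 2 steps, 0.9 -> 10 steps, 0.99 -> 100 steps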
RMSprop vs. AdaGrad
AdaGrad (an earlier adaptive method) accumulates the sum of all squared gradients: v ← v + g². The effective rate η / (√v + ε) then shrinks monotonically and eventually approaches zero — learning stops. RMSprop fixes this by using an EMA instead, so old gradients decay away and the effective rate can recover if recent gradients become small.
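A side-by-side sketch (toy numbers, purely illustrative) shows the difference: feed both accumulators a burst of large gradients followed by many small ones, and only the RMSprop estimate lets the effective rate recover.

from math import sqrt

g_history = [10.0] * 50 + [0.1] * 200      # large gradients, then small ones
v_adagrad, v_rmsprop, beta = 0.0, 0.0, 0.9

for g in g_history:
    v_adagrad += g**2                                   # sum only ever grows
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g**2    # old gradients decay away

lr, eps = 0.01, 1e-8
print(lr / (sqrt(v_adagrad) + eps))    # ~1.4e-4: AdaGrad's rate stays tiny
print(lr / (sqrt(v_rmsprop) + eps))    # ~0.1: RMSprop's rate has recovered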
Code: RMSprop in PyTorch
import torch.optim as optim
# alpha is PyTorch's name for β (PyTorch's own default is alpha=0.99; 0.9 matches the text)
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9, eps=1e-8)
for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
PyTorch's alpha parameter corresponds to β in the formula above. RMSprop is often a good choice for recurrent networks — it was the optimizer used in the original DeepMind Atari paper and remains competitive with Adam in certain settings.
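As a quick sanity check of that correspondence (a toy single-parameter setup, not from the text), a manual update with β = 0.9 should track optim.RMSprop with alpha=0.9 step for step:

import torch
import torch.optim as optim

p = torch.tensor([1.0], requires_grad=True)
opt = optim.RMSprop([p], lr=0.01, alpha=0.9, eps=1e-8)

p_manual = p.detach().clone()
v = torch.zeros_like(p_manual)          # EMA of squared gradients

for _ in range(3):
    g = torch.tensor([10.0])            # a fixed toy gradient
    v = 0.9 * v + 0.1 * g**2            # manual RMSprop update
    p_manual -= 0.01 * g / (v.sqrt() + 1e-8)
    opt.zero_grad()                     # PyTorch update with the same gradient
    p.grad = g.clone()
    opt.step()

print(p_manual, p.detach())             # the two trajectories agree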
Unit 11 (Advanced Optimization) derives the theoretical motivation for adaptive gradient methods and presents the complete convergence analysis.