The Problem with One Learning Rate
Every parameter in a neural network has a different story. A weight connected to a high-frequency, high-variance feature may receive gradients of magnitude 10 at every step. A weight connected to a rare or weakly predictive feature may receive gradients near 0.01 most of the time.
A single global learning rate is a compromise. Set it large enough to move the rarely-updated parameters forward, and the frequently-updated ones overshoot. Set it small enough for the frequently-updated ones, and the sparse parameters barely move.
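To see the compromise in numbers, here is a quick sketch using the hypothetical gradient magnitudes above:

# One global learning rate applied to both kinds of parameters
lr = 0.01
g_frequent, g_rare = 10.0, 0.01       # gradient magnitudes from the scenario above

print(lr * g_frequent)                # 0.1    per step -- likely to overshoot
print(lr * g_rare)                    # 0.0001 per step -- barely moves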
RMSprop solves this by giving each parameter its own effective learning rate, adapted dynamically based on its recent gradient history.
Imagine tuning a large mixing board where some knobs are adjusted constantly and others barely move all session. You'd want to nudge the inactive ones more aggressively and be cautious with the ones you've already dialed in. Vanilla gradient descent uses the same step size for every parameter regardless of its history. RMSprop watches each parameter's recent behavior and adapts: frequent, large gradients get smaller steps; rare, small gradients get larger ones.
The RMSprop Algorithm
RMSprop maintains one extra number per parameter: v, the exponential moving average (EMA) of squared gradients. At each step it updates v and then the parameter:

v ← β·v + (1 − β)·g²
θ ← θ − η·g / (√v + ε)

where:
- v — EMA of squared gradients (per parameter)
- β — decay rate for the EMA — default 0.9
- g — gradient of this parameter at the current step
- η — global learning rate
- ε — small constant for numerical stability — default 1e-8
- θ — the parameter being updated
The effective learning rate for this parameter is η / (√v + ε). When recent squared gradients are large, v is large, and the effective rate shrinks. When recent squared gradients are small, v is small, and the effective rate grows.
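A minimal sketch of the update in code (plain Python with NumPy, variable names matching the symbols above):

import numpy as np

def rmsprop_step(theta, g, v, lr=0.01, beta=0.9, eps=1e-8):
    # v: running EMA of squared gradients, same shape as theta
    v = beta * v + (1 - beta) * g**2
    # each parameter is scaled by its own effective rate lr / (sqrt(v) + eps)
    theta = theta - lr * g / (np.sqrt(v) + eps)
    return theta, v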
Worked Example: Two Parameters
Suppose β = 0.9, η = 0.01, ε = 1e-8, starting from v₁ = 0 and v₂ = 0.
Gradients at step 1: g₁ = 10, g₂ = 0.01
Update v: v₁ = 0.9·0 + 0.1·(10)² = 10, v₂ = 0.9·0 + 0.1·(0.01)² = 1e-5
Effective rates: η/(√v₁ + ε) ≈ 0.01/3.16 ≈ 0.0032, η/(√v₂ + ε) ≈ 0.01/0.0032 ≈ 3.16
Gradients at step 2: g₁ = 10, g₂ = 0.01 (the same again)
Update v: v₁ = 0.9·10 + 0.1·100 = 19, v₂ = 0.9·1e-5 + 0.1·1e-4 = 1.9e-5
Effective rates: ≈ 0.01/4.36 ≈ 0.0023 (getting smaller), ≈ 0.01/0.0044 ≈ 2.29 (also adapting)
Parameter 1, with large, consistent gradients, ends up with a small effective learning rate. Parameter 2, with tiny gradients, gets an effective rate roughly a thousand times larger, so it moves far more per step than it would under vanilla gradient descent.
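A few lines of Python reproduce the arithmetic (same β, η, ε, and gradients as above):

v = [0.0, 0.0]                            # one EMA per parameter
beta, lr, eps = 0.9, 0.01, 1e-8
for step in (1, 2):
    g = [10.0, 0.01]                      # the same gradients arrive each step
    v = [beta * vi + (1 - beta) * gi**2 for vi, gi in zip(v, g)]
    rates = [lr / (vi**0.5 + eps) for vi in v]
    print(step, v, rates)
# step 1: v = [10, 1e-5],   rates ≈ [0.0032, 3.16]
# step 2: v = [19, 1.9e-5], rates ≈ [0.0023, 2.29]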
What β Controls
The decay rate β determines the timescale of adaptation. With β = 0.9, the effective window is roughly 1/(1 − β) = 10 steps: the RMSprop estimate of gradient magnitude is based on the last ~10 gradient steps (see the short sketch after the list below).
- β close to 1 (0.99): slow adaptation, stable effective rates, better for stationary gradient statistics
- β = 0.9 (default): responsive to recent gradient history
- β close to 0: almost uses only the current gradient squared — very noisy
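The 1/(1 − β) rule of thumb comes from the fact that the EMA weights a squared gradient from k steps ago by (1 − β)·β^k, which has decayed to roughly 1/e of its initial weight after about 1/(1 − β) steps. A small sketch:

# Weight the EMA places on the squared gradient from k steps ago: (1 - beta) * beta**k
for beta in (0.5, 0.9, 0.99):
    window = 1 / (1 - beta)              # rule-of-thumb averaging window
    decay_at_window = beta ** window     # ~1/e for beta near 1
    print(beta, window, round(decay_at_window, 3))
# windows: 0.5 -> 2 steps, 0.9 -> 10 steps, 0.99 -> 100 steps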
RMSprop vs. AdaGrad
AdaGrad (an earlier adaptive method) accumulates the sum of all squared gradients: v ← v + g². The effective rate η / (√v + ε) then shrinks monotonically and eventually approaches zero — learning stops. RMSprop fixes this by using an EMA instead, so old gradients decay away and the effective rate can recover if recent gradients become small.
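A side-by-side sketch (toy numbers, purely illustrative) shows the difference: feed both accumulators a burst of large gradients followed by many small ones, and only the RMSprop estimate lets the effective rate recover.

from math import sqrt

g_history = [10.0] * 50 + [0.1] * 200      # large gradients, then small ones
v_adagrad, v_rmsprop, beta = 0.0, 0.0, 0.9

for g in g_history:
    v_adagrad += g**2                                   # sum only ever grows
    v_rmsprop = beta * v_rmsprop + (1 - beta) * g**2    # old gradients decay away

lr, eps = 0.01, 1e-8
print(lr / (sqrt(v_adagrad) + eps))    # ~1.4e-4: AdaGrad's rate stays tiny
print(lr / (sqrt(v_rmsprop) + eps))    # ~0.1: RMSprop's rate has recovered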
Code: RMSprop in PyTorch
import torch.optim as optim
# alpha is PyTorch's name for β (PyTorch's own default is alpha=0.99; 0.9 matches the text)
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9, eps=1e-8)
for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
PyTorch's alpha parameter corresponds to β in the formula above. RMSprop is often a good choice for recurrent networks — it was the optimizer used in the original DeepMind Atari paper and remains competitive with Adam in certain settings.
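As a quick sanity check of that correspondence (a toy single-parameter setup, not from the text), a manual update with β = 0.9 should track optim.RMSprop with alpha=0.9 step for step:

import torch
import torch.optim as optim

p = torch.tensor([1.0], requires_grad=True)
opt = optim.RMSprop([p], lr=0.01, alpha=0.9, eps=1e-8)

p_manual = p.detach().clone()
v = torch.zeros_like(p_manual)          # EMA of squared gradients

for _ in range(3):
    g = torch.tensor([10.0])            # a fixed toy gradient
    v = 0.9 * v + 0.1 * g**2            # manual RMSprop update
    p_manual -= 0.01 * g / (v.sqrt() + 1e-8)
    opt.zero_grad()                     # PyTorch update with the same gradient
    p.grad = g.clone()
    opt.step()

print(p_manual, p.detach())             # the two trajectories agree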
Unit 11 (Advanced Optimization) derives the theoretical motivation for adaptive gradient methods and presents the complete convergence analysis.