Advanced Optimization
Lesson 5 ⏱ 12 min


AdaGrad: Give Each Parameter Its Own Learning Rate

The single-LR problem. AdaGrad's solution: divide by accumulated squared gradient. Why rare features benefit. The fatal flaw: monotonic accumulation kills the learning rate.

🧮 Quick refresher

Why one learning rate is insufficient

Different parameters receive gradients of very different magnitudes. Frequent features (common words, dense inputs) accumulate large gradients; rare features receive sparse, small-magnitude gradients. One learning rate is simultaneously too large for frequent parameters (causing oscillation) and too small for rare parameters (causing slow learning).

Example

A word embedding model: the word 'the' appears in nearly every sentence, so its embedding gradient is large.

The word 'phosphorescent' appears rarely, so its gradient is small and infrequent.

Optimal step sizes differ by orders of magnitude.

The Problem

Every parameter has its own gradient history. Some parameters receive large, frequent gradients — they're well-covered by the training data. Others receive small, rare gradients — they need larger steps to learn anything.

Momentum and Nesterov didn't address this: they smooth the update direction but still use one global learning rate α.

Adaptive learning rate methods are what made modern NLP possible. Word embeddings for rare words, parameters connected to infrequent features — these need larger updates than their high-frequency counterparts. AdaGrad was the first optimizer to recognize and solve this, paving the way for RMSprop and Adam.

AdaGrad (Adaptive Gradient) was the first optimizer to give each parameter its own effective learning rate.

The Algorithm

Maintain a running sum of squared gradients for each parameter:

$$G_t = G_{t-1} + \left(\nabla L(\theta_t)\right)^2$$

$G_t$: accumulated sum of squared gradients from step 1 to $t$
$\nabla L(\theta_t)$: gradient of the loss with respect to the parameters at step $t$

(All operations are elementwise for vectors; each parameter has its own $G$.)

The parameter update:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \varepsilon} \cdot \nabla L(\theta_t)$$

$\alpha$: global learning rate, a scalar hyperparameter
$\varepsilon$: small constant for numerical stability (typically 1e-8)
$\nabla L(\theta_t)$: current gradient

The effective learning rate for each parameter is $\alpha / \sqrt{G_t}$. Parameters that have received large historical gradients get smaller effective steps. Parameters with small historical gradients get larger effective steps.
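Both update rules translate directly into code. Below is a minimal from-scratch sketch in NumPy (the function and variable names are ours, not from any library), applied to a toy quadratic loss:

```python
import numpy as np

def adagrad_step(theta, grad, G, alpha=0.1, eps=1e-8):
    """One AdaGrad update; all operations are elementwise."""
    G = G + grad ** 2                     # accumulate squared gradients
    theta = theta - alpha / (np.sqrt(G) + eps) * grad
    return theta, G

# Toy loss L(θ) = 0.5·‖θ‖², so ∇L(θ) = θ
theta = np.array([5.0, 5.0])
G = np.zeros_like(theta)
for _ in range(100):
    theta, G = adagrad_step(theta, theta, G)  # here grad = theta
print(theta)  # both parameters shrink toward 0
```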

Worked Numerical Example

Two parameters: θ1\theta_1 (frequent, large gradients) and θ2\theta_2 (rare, small gradients). Learning rate α = 0.1, ε = 1e-8.

Parameter θ₁ (gradients: 10, 9, 11, 10, 10):

| Step | Gradient | $G_t$ | $\sqrt{G_t}$ | Effective LR | Step size |
|------|----------|-------|--------------|--------------|-----------|
| 1    | 10       | 100   | 10.0         | 0.01         | 0.10      |
| 2    | 9        | 181   | 13.4         | 0.0075       | 0.067     |
| 3    | 11       | 302   | 17.4         | 0.0057       | 0.063     |
| 5    | 10       | 502   | 22.4         | 0.0045       | 0.045     |

The effective learning rate for θ₁ drops from 0.01 to 0.0045 in just 5 steps — and keeps shrinking.

Parameter θ₂ (gradients: 0, 0, 0, 0.1, 0):

| Step | Gradient | $G_t$ | Effective LR                  |
|------|----------|-------|-------------------------------|
| 1–3  | 0        | 0     | α/ε → huge (capped by ε)      |
| 4    | 0.1      | 0.01  | 0.1/0.1 = 1.0                 |

At step 4, when the rare parameter finally gets a gradient, its accumulated G is tiny (0.01), giving it a large effective learning rate of 1.0. This is precisely what we want: large steps for rarely-updated parameters.
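The arithmetic in both tables is easy to verify with a short script (ours, not part of the lesson's code):

```python
import math

alpha, eps = 0.1, 1e-8

def effective_lrs(grads):
    """Yield (step, G_t, effective LR, step size) for a gradient sequence."""
    G = 0.0
    for t, g in enumerate(grads, start=1):
        G += g ** 2
        eff = alpha / (math.sqrt(G) + eps)
        yield t, G, eff, eff * abs(g)

# θ₁: frequent, large gradients
for t, G, eff, step in effective_lrs([10, 9, 11, 10, 10]):
    print(f"θ₁ step {t}: G={G:.0f}, eff LR={eff:.4f}, step size={step:.3f}")

# θ₂: rare, small gradients; eff LR is huge (~1e7) while G is 0,
# then settles at 1.0 when the first real gradient arrives at step 4
for t, G, eff, step in effective_lrs([0, 0, 0, 0.1]):
    print(f"θ₂ step {t}: G={G:.2f}, eff LR={eff:.4g}")
```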

The Fatal Flaw: Monotonic Decay

$G_t$ only ever grows: squared gradients are added and nothing is ever subtracted. The effective learning rate $\alpha / \sqrt{G_t}$ therefore shrinks monotonically toward zero. On long training runs, progress eventually stalls, not because the loss is minimized but because every step has been scaled down to almost nothing.
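To see the decay rate concretely: with a constant gradient $g$, the accumulator after $t$ steps is $G_t = t g^2$, so the effective learning rate is $\alpha / (g\sqrt{t})$, which shrinks like $1/\sqrt{t}$ with no floor. A quick illustration (ours):

```python
import math

alpha, g = 0.1, 1.0               # constant gradient of 1.0
for t in [1, 10, 100, 1_000, 10_000]:
    G = t * g ** 2                # closed form: G_t = t·g² for constant g
    print(f"t={t:>6}: effective LR = {alpha / math.sqrt(G):.5f}")
# t=     1: effective LR = 0.10000
# t=    10: effective LR = 0.03162
# t=   100: effective LR = 0.01000
# t=  1000: effective LR = 0.00316
# t= 10000: effective LR = 0.00100
```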

Where AdaGrad Still Shines

Despite this flaw, AdaGrad remains the best choice for specific scenarios:

Sparse NLP tasks: bag-of-words features, n-gram models, early word2vec training. Here the "fatal decay" works in your favor: common words (which have learned enough) stop being updated; rare words (which need more updates) keep learning.

One-pass over data: if you will see each example roughly once, AdaGrad's monotonic decay maps well to the training time horizon.

Convex with known convergence horizon: AdaGrad has provably optimal regret bounds for online convex optimization.

For everything else — especially deep learning with long training runs — RMSprop (next lesson) fixes the fatal flaw by replacing the sum with an EMA.
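The difference is a single line in the accumulator update (a sketch only; `decay` is RMSprop's smoothing hyperparameter, covered next lesson):

```python
# AdaGrad: the sum grows forever, so the effective LR only shrinks
G = G + grad ** 2

# RMSprop: an exponential moving average lets old gradients fade out
G = decay * G + (1 - decay) * grad ** 2
```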

In Code

```python
import torch

optimizer = torch.optim.Adagrad(
    model.parameters(),   # model: any torch.nn.Module
    lr=0.01,
    eps=1e-8
)

# What AdaGrad computes internally, per parameter:
# G += grad ** 2
# param -= lr / (G.sqrt() + eps) * grad
```

The learning rate for AdaGrad is often set higher than for SGD (e.g., 0.01 instead of 0.001) because the adaptive scaling shrinks it quickly anyway; the per-parameter normalization does much of the tuning automatically.
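For completeness, here is how the optimizer slots into a standard training loop (a minimal sketch; the model, batch, and loss function are placeholders chosen for illustration):

```python
import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()                             # applies the AdaGrad update
```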

Quiz

Question 1 of 3

In AdaGrad, parameter θⱼ receives large gradients for the first 100 steps, then small gradients. What happens to its effective learning rate?