Advanced Optimization
Lesson 5 ⏱ 12 min


AdaGrad: Give Each Parameter Its Own Learning Rate

The single-LR problem. AdaGrad's solution: divide by accumulated squared gradient. Why rare features benefit. The fatal flaw: monotonic accumulation kills the learning rate.

🧮 Quick refresher

Why one learning rate is insufficient

Different parameters receive gradients of very different magnitudes. Frequent features (common words, dense inputs) accumulate large gradients; rare features receive sparse, small-magnitude gradients. One learning rate is simultaneously too large for frequent parameters (causing oscillation) and too small for rare parameters (causing slow learning).

Example

A word embedding model: the word 'the' appears in nearly every sentence, so its embedding gradient is large.

The word 'phosphorescent' appears rarely, so its gradient is small and infrequent.

Optimal step sizes differ by orders of magnitude.

The Problem

Every parameter has its own gradient history. Some parameters receive large, frequent gradients — they're well-covered by the training data. Others receive small, rare gradients — they need larger steps to learn anything.

Momentum and Nesterov didn't address this: they smooth the update direction but still use one global learning rate α.

Adaptive learning rate methods are what made modern NLP possible. Word embeddings for rare words, parameters connected to infrequent features — these need larger updates than their high-frequency counterparts. AdaGrad was the first optimizer to recognize and solve this, paving the way for RMSprop and Adam.

AdaGrad (Adaptive Gradient) was the first optimizer to give each parameter its own effective learning rate.

The Algorithm

Maintain a running sum of squared gradients for each parameter:

$$G_t = G_{t-1} + \left(\nabla L(\theta_t)\right)^2$$

$G_t$: accumulated sum of squared gradients from step 1 to $t$
$\nabla L(\theta_t)$: gradient of the loss with respect to the parameters at step $t$

(All operations are elementwise for vectors; each parameter has its own $G$.)

The parameter update:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \varepsilon} \cdot \nabla L(\theta_t)$$

$\alpha$: global learning rate, a scalar hyperparameter
$\varepsilon$: small constant for numerical stability (typically 1e-8)
$\nabla L(\theta_t)$: current gradient

The effective learning rate for each parameter is $\alpha / \sqrt{G_t}$. Parameters that have received large historical gradients get smaller effective steps. Parameters with small historical gradients get larger effective steps.
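Both update rules translate directly into code. Below is a minimal from-scratch sketch in NumPy (the function and variable names are ours, not from any library), applied to a toy quadratic loss:

```python
import numpy as np

def adagrad_step(theta, grad, G, alpha=0.1, eps=1e-8):
    """One AdaGrad update; all operations are elementwise."""
    G = G + grad ** 2                     # accumulate squared gradients
    theta = theta - alpha / (np.sqrt(G) + eps) * grad
    return theta, G

# Toy loss L(θ) = 0.5·‖θ‖², so ∇L(θ) = θ
theta = np.array([5.0, 5.0])
G = np.zeros_like(theta)
for _ in range(100):
    theta, G = adagrad_step(theta, theta, G)  # here grad = theta
print(theta)  # both parameters shrink toward 0
```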

Worked Numerical Example

Two parameters: θ1\theta_1 (frequent, large gradients) and θ2\theta_2 (rare, small gradients). Learning rate α = 0.1, ε = 1e-8.

Parameter θ₁ (gradients: 10, 9, 11, 10, 10):

| Step | Gradient | $G_t$ | $\sqrt{G_t}$ | Effective LR | Step size |
|------|----------|-------|--------------|--------------|-----------|
| 1    | 10       | 100   | 10.0         | 0.01         | 0.10      |
| 2    | 9        | 181   | 13.4         | 0.0075       | 0.067     |
| 3    | 11       | 302   | 17.4         | 0.0057       | 0.063     |
| 5    | 10       | 502   | 22.4         | 0.0045       | 0.045     |

The effective learning rate for θ₁ drops from 0.01 to 0.0045 in just 5 steps — and keeps shrinking.

Parameter θ₂ (gradients: 0, 0, 0, 0.1, 0):

| Step | Gradient | $G_t$ | Effective LR                  |
|------|----------|-------|-------------------------------|
| 1–3  | 0        | 0     | α/ε → huge (capped by ε)      |
| 4    | 0.1      | 0.01  | 0.1/0.1 = 1.0                 |

At step 4, when the rare parameter finally gets a gradient, its accumulated G is tiny (0.01), giving it a large effective learning rate of 1.0. This is precisely what we want: large steps for rarely-updated parameters.
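The arithmetic in both tables is easy to verify with a short script (ours, not part of the lesson's code):

```python
import math

alpha, eps = 0.1, 1e-8

def effective_lrs(grads):
    """Yield (step, G_t, effective LR, step size) for a gradient sequence."""
    G = 0.0
    for t, g in enumerate(grads, start=1):
        G += g ** 2
        eff = alpha / (math.sqrt(G) + eps)
        yield t, G, eff, eff * abs(g)

# θ₁: frequent, large gradients
for t, G, eff, step in effective_lrs([10, 9, 11, 10, 10]):
    print(f"θ₁ step {t}: G={G:.0f}, eff LR={eff:.4f}, step size={step:.3f}")

# θ₂: rare, small gradients; eff LR is huge (~1e7) while G is 0,
# then settles at 1.0 when the first real gradient arrives at step 4
for t, G, eff, step in effective_lrs([0, 0, 0, 0.1]):
    print(f"θ₂ step {t}: G={G:.2f}, eff LR={eff:.4g}")
```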

The Fatal Flaw: Monotonic Decay

$G_t$ only ever grows: squared gradients are added and nothing is ever subtracted. The effective learning rate $\alpha / \sqrt{G_t}$ therefore shrinks monotonically toward zero. On long training runs, progress eventually stalls, not because the loss is minimized but because every step has been scaled down to almost nothing.
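To see the decay rate concretely: with a constant gradient $g$, the accumulator after $t$ steps is $G_t = t g^2$, so the effective learning rate is $\alpha / (g\sqrt{t})$, which shrinks like $1/\sqrt{t}$ with no floor. A quick illustration (ours):

```python
import math

alpha, g = 0.1, 1.0               # constant gradient of 1.0
for t in [1, 10, 100, 1_000, 10_000]:
    G = t * g ** 2                # closed form: G_t = t·g² for constant g
    print(f"t={t:>6}: effective LR = {alpha / math.sqrt(G):.5f}")
# t=     1: effective LR = 0.10000
# t=    10: effective LR = 0.03162
# t=   100: effective LR = 0.01000
# t=  1000: effective LR = 0.00316
# t= 10000: effective LR = 0.00100
```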

Where AdaGrad Still Shines

Despite this flaw, AdaGrad remains the best choice for specific scenarios:

Sparse NLP tasks: bag-of-words features, n-gram models, early word2vec training. Here the "fatal decay" works in your favor: common words (which have learned enough) stop being updated; rare words (which need more updates) keep learning.

One-pass over data: if you will see each example roughly once, AdaGrad's monotonic decay maps well to the training time horizon.

Convex with known convergence horizon: AdaGrad has provably optimal regret bounds for online convex optimization.

For everything else — especially deep learning with long training runs — RMSprop (next lesson) fixes the fatal flaw by replacing the sum with an EMA.
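The difference is a single line in the accumulator update (a sketch only; `decay` is RMSprop's smoothing hyperparameter, covered next lesson):

```python
# AdaGrad: the sum grows forever, so the effective LR only shrinks
G = G + grad ** 2

# RMSprop: an exponential moving average lets old gradients fade out
G = decay * G + (1 - decay) * grad ** 2
```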

In Code

```python
import torch

optimizer = torch.optim.Adagrad(
    model.parameters(),   # model: any torch.nn.Module
    lr=0.01,
    eps=1e-8
)

# What AdaGrad computes internally, per parameter:
# G += grad ** 2
# param -= lr / (G.sqrt() + eps) * grad
```

The learning rate for AdaGrad is often set higher than for SGD (e.g., 0.01 instead of 0.001) because the adaptive scaling shrinks it quickly anyway; the per-parameter normalization does much of the tuning automatically.
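For completeness, here is how the optimizer slots into a standard training loop (a minimal sketch; the model, batch, and loss function are placeholders chosen for illustration):

```python
import torch

model = torch.nn.Linear(10, 1)                   # placeholder model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)   # placeholder batch
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()                             # applies the AdaGrad update
```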

Quiz

Question 1 of 3

In AdaGrad, parameter θⱼ receives large gradients for the first 100 steps, then small gradients. What happens to its effective learning rate?