The Problem: Zigzagging Across a Ravine
Imagine a loss surface shaped like a long, narrow valley — steep walls on the sides, gentle slope along the floor toward the minimum. This shape is called a ravine, and it's extremely common in practice: it appears whenever the loss curves much more steeply along one parameter direction than another.
Vanilla gradient descent gets stuck in a frustrating pattern here. In the steep direction (across the ravine), gradients are large — so the update overshoots the other side. In the shallow direction (along the valley), gradients are small — so progress is glacial. The optimizer zigzags: left, right, left, right, inching forward with every oscillation.
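To make the zigzag concrete, here is a minimal sketch of vanilla gradient descent on a ravine-shaped quadratic. The curvatures (20 across the valley, 1 along it) and the learning rate are illustrative choices, not values from any particular model:

```python
import torch

# Ravine-shaped quadratic: curvature 20 across the valley (x), 1 along it (y)
def loss(theta):
    x, y = theta
    return 0.5 * (20 * x**2 + y**2)

theta = torch.tensor([1.0, 5.0], requires_grad=True)
lr = 0.09  # big enough to overshoot in the steep direction

for step in range(5):
    L = loss(theta)
    L.backward()
    with torch.no_grad():
        theta -= lr * theta.grad
    theta.grad.zero_()
    print(step, [round(t, 3) for t in theta.tolist()])
# x flips sign every step (the zigzag); y shrinks by only 9% per step
```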
Think of a ball rolling down a hillside. Once it's moving, it doesn't stop the instant the ground flattens — it carries its speed forward and keeps going. Gradient descent without momentum behaves the opposite way: it slows to a crawl whenever the slope becomes shallow, even if the minimum is just a short distance ahead. Momentum gives gradient descent the same memory a rolling ball has.
The Velocity Idea
Momentum borrows from classical mechanics. Instead of moving in the direction of the current gradient alone, we maintain a velocity vector that accumulates past gradients:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \nabla_\theta L(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} - \alpha\, v_t$$

where:
- $v_t$ — velocity vector (same shape as $\theta$)
- $\beta$ — momentum coefficient, how much previous velocity is retained. Typical: 0.9
- $\alpha$ — learning rate
- $\nabla_\theta L$ — gradient of the loss with respect to $\theta$
- $\theta$ — model parameters
The parameter $\beta$ (beta) controls how much history is retained. With $\beta = 0.9$, each update keeps 90% of the previous velocity and adds 10% of the current gradient's contribution.
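The two update equations translate line-for-line into code. This is a minimal sketch of the EMA convention used in this article (momentum_step is just an illustrative name; it is not how PyTorch's optimizer is written internally):

```python
import torch

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad  # keep 90% of old velocity, add 10% of new gradient
    theta = theta - lr * v            # step along the accumulated velocity
    return theta, v

# one update from rest: the first step is just 10% of the gradient, scaled by lr
theta, v = torch.tensor([1.0, 5.0]), torch.zeros(2)
theta, v = momentum_step(theta, v, grad=torch.tensor([20.0, 5.0]))
```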
Physical Analogy: The Rolling Ball
Think of a ball rolling down the valley. On a flat surface, it accelerates steadily. When it hits a bump, it doesn't immediately reverse — its momentum carries it forward. The ball averages out small obstacles and speeds up on consistent downhill terrain.
That's exactly what's happening mathematically. The velocity is a weighted average of all past gradients, with exponentially decaying weights:

$$v_t = (1 - \beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_k$$

where:
- $v_t$ — velocity at step $t$
- $g_k$ — gradient at step $k$
- $\beta$ — momentum coefficient
Recent gradients matter most; older ones decay exponentially. The sum of all weights is $(1-\beta)\sum_{j=0}^{t-1}\beta^{j} = 1 - \beta^{t}$, which approaches 1 — so with $\beta = 0.9$ and a constant gradient $g$, the velocity settles at $g$ and the effective step scale is $\alpha g$ in steady state.
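To confirm the recursive update and the explicit weighted sum are the same thing, here's a quick check using placeholder random gradients:

```python
import torch

beta = 0.9
grads = torch.randn(50)  # arbitrary gradient sequence

# recursive form: v_t = beta * v_{t-1} + (1 - beta) * g_t
v = torch.tensor(0.0)
for g in grads:
    v = beta * v + (1 - beta) * g

# explicit form: v_t = (1 - beta) * sum_k beta^(t-k) * g_k
t = len(grads)
weights = (1 - beta) * beta ** torch.arange(t - 1, -1, -1, dtype=torch.float32)
v_explicit = (weights * grads).sum()

print(torch.allclose(v, v_explicit))  # True, up to float rounding
```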
Worked Numerical Example
Consider a 2D loss with parameters θ₁ and θ₂. Suppose the gradients alternate for θ₂ (±2.0) but stay consistent for θ₁ (always +1.0). With β = 0.9 and v₀ = 0:
| Step | g₁ | g₂ | v₁ | v₂ | Net v₂ direction |
|---|---|---|---|---|---|
| 1 | +1.0 | +2.0 | 0.10 | +0.20 | forward |
| 2 | +1.0 | −2.0 | 0.19 | −0.02 | nearly zero! |
| 3 | +1.0 | +2.0 | 0.27 | +0.18 | forward |
| 4 | +1.0 | −2.0 | 0.34 | −0.04 | nearly zero! |
The alternating θ₂ gradients cancel almost completely in the velocity. Meanwhile θ₁ velocity grows toward its steady-state value of 1.0 (the constant gradient itself), accelerating progress along the valley floor.
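The table can be reproduced in a few lines, using the gradient sequences assumed above:

```python
beta = 0.9
g1 = [1.0, 1.0, 1.0, 1.0]    # consistent gradient along the valley
g2 = [2.0, -2.0, 2.0, -2.0]  # alternating gradient across the ravine

v1 = v2 = 0.0
for step, (a, b) in enumerate(zip(g1, g2), start=1):
    v1 = beta * v1 + (1 - beta) * a
    v2 = beta * v2 + (1 - beta) * b
    print(f"step {step}: v1 = {v1:.2f}, v2 = {v2:+.2f}")
# v2 flips between ~+0.2 and ~-0.02 while v1 climbs toward 1.0
```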
Choosing β
The standard default is β = 0.9. This means the effective window of gradients contributing meaningfully to the velocity is roughly 1/(1 − β) = 10 steps. The sketch after the list below compares how the common settings behave on the same noisy gradient stream:
- β = 0.5: short memory, responds quickly to new gradients, less smoothing
- β = 0.9: standard choice, good balance of smoothing and responsiveness
- β = 0.99: very long memory, strong smoothing, but slow to change direction
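Here's that comparison, feeding synthetic noisy gradients (centered on a true value of 1.0; all values illustrative) through the EMA at each β:

```python
import torch

torch.manual_seed(0)
grads = 1.0 + 0.5 * torch.randn(200)  # noisy gradients around a true value of 1.0

for beta in (0.5, 0.9, 0.99):
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g.item()
        history.append(v)
    tail = torch.tensor(history[100:])  # ignore the warm-up phase
    print(f"beta={beta}: mean={tail.mean().item():.3f}, std={tail.std().item():.3f}")
# Larger beta gives a smaller std (more smoothing), but beta=0.99's
# mean is still catching up to 1.0 even after 100 warm-up steps.
```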
Code: SGD with Momentum in PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # any loss works; MSE for this toy regression

# SGD with momentum=0.9 (the standard setting)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# stand-in for a real DataLoader: 100 random batches of (input, target)
dataloader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()                      # clear gradients from the last step
    loss = criterion(model(x_batch), y_batch)  # forward pass
    loss.backward()                            # backprop: populate .grad
    optimizer.step()                           # momentum update of the parameters
```
The momentum=0.9 argument plays the role of β in the update equation, with one caveat: PyTorch's SGD uses the convention v ← βv + g, without the (1 − β) factor on the gradient (you can recover the exact EMA form above by also setting dampening=0.9). The smoothing behavior is the same; the missing factor just rescales the effective learning rate. PyTorch internally maintains the velocity buffer — it accumulates across calls to optimizer.step() and starts from zero when you create a new optimizer.
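To see the convention concretely, here is a small check comparing PyTorch's internal buffer against a hand-rolled v ← βv + g update (the quadratic loss and parameter shapes are arbitrary):

```python
import torch
import torch.optim as optim

w = torch.zeros(3, requires_grad=True)
opt = optim.SGD([w], lr=0.1, momentum=0.9)

v_manual = torch.zeros(3)
for _ in range(3):
    opt.zero_grad()
    loss = ((w - 1.0) ** 2).sum()
    loss.backward()
    v_manual = 0.9 * v_manual + w.grad  # PyTorch's form: no (1 - beta) factor
    opt.step()

# The optimizer's state holds the velocity ("momentum_buffer") per parameter
buffer = opt.state[w]["momentum_buffer"]
print(torch.allclose(buffer, v_manual))  # True
```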