Advanced Optimization
Lesson 3 ⏱ 12 min

SGD with momentum

Momentum: Accumulating Velocity to Smooth the Path

The velocity accumulation idea. Why momentum damps oscillations and accelerates in consistent directions. Physical analogy. Numerical example on a ravine. PyTorch usage.

⏱ ~7 min

Quick refresher

Exponential moving average

An EMA vₜ = β·vₜ₋₁ + (1-β)·xₜ computes a running weighted average of its inputs, with recent values weighted more. The effective window is 1/(1-β) steps. Early estimates are biased toward zero, corrected by dividing by (1-βᵗ).

Example

With β=0.9, x=[10,10,10,...]: v₁=1, v₂=1.9, v₃=2.71 (raw).

Bias-corrected: v̂₁=10, v̂₂=10, v̂₃=10.

The correction is essential early in training.
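The refresher's numbers can be checked with a few lines of Python (β and the input sequence are the ones from the example above):

```python
# EMA with bias correction: raw vs corrected estimates.
beta = 0.9
xs = [10.0, 10.0, 10.0]

v = 0.0
raw, corrected = [], []
for t, x in enumerate(xs, start=1):
    v = beta * v + (1 - beta) * x        # raw EMA update
    raw.append(v)
    corrected.append(v / (1 - beta**t))  # divide by (1 - beta^t)

print([round(r, 2) for r in raw])  # raw estimates start near zero
print([round(c, 2) for c in corrected])  # corrected estimates recover 10
```

The raw values climb slowly toward 10 (1, 1.9, 2.71, ...), while the corrected values are 10 from the first step.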

The Problem Revisited

Vanilla gradient descent zigzags in ravines because it responds only to the current gradient — with no memory of where it has been. Each step is independent of the last.

The fix is physical: give the optimizer a velocity that accumulates over time. In directions where the gradient consistently points the same way, speed builds up. In directions where the gradient oscillates, opposite contributions cancel out.

Momentum is the reason SGD can train production image classifiers. Without it, vanilla SGD either crawls or oscillates — with it, the same architecture trains in a fraction of the time. Almost every real training run uses momentum, and understanding it is essential before tackling Adam.

The Algorithm

Maintain a velocity vector vₜ, initialized to zero. The update rule:

vₜ = β·vₜ₋₁ + α·∇L(θₜ)
θₜ₊₁ = θₜ − vₜ

  • vₜ: velocity at step t, the accumulated gradient history
  • β: momentum coefficient, controls how much past gradient is retained (typical: 0.9)
  • α: learning rate, the step size scalar
  • ∇L(θₜ): gradient of the loss at the current parameters
  • θₜ, θₜ₊₁: model parameters before and after the update

This is equivalent to the EMA formulation from lesson 11-2: scaling the gradient by α instead of (1−β) is a convention difference, and both forms are in common use.
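A minimal sketch of this update rule in plain Python (the gradient value and starting point below are illustrative, not from the lesson):

```python
# One SGD-with-momentum step in the lesson's convention:
# v_t = beta * v_{t-1} + alpha * grad;  theta_{t+1} = theta_t - v_t.

def momentum_step(theta, v, grad, alpha=0.05, beta=0.9):
    """Return the updated (theta, v) after one momentum step."""
    v = beta * v + alpha * grad
    return theta - v, v

# Illustrative 1-D example: constant gradient 2.0, starting from theta=1.0.
theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=2.0)
# First step: v = 0.9*0 + 0.05*2 = 0.1, theta = 1.0 - 0.1 = 0.9
```

With v initialized to zero, the first step is identical to a vanilla GD step; the history only starts to matter from step 2 onward.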

Physical Analogy

Think of a ball rolling down a hill. In vanilla GD, the ball stops dead at every step and gets a fresh push exactly downhill from wherever it landed, with no memory of how it was moving before. With momentum, the ball has mass: it accumulates velocity in directions of consistent descent, overshoots a little (kept in check by a sensible β), and rolls steadily toward the minimum.

In the valleys of a ravine:

  • Along the valley floor (shallow slope, consistent gradient direction): velocity builds up steadily. The ball accelerates. Progress is fast.
  • Across the ravine walls (steep curvature, gradient alternates sign): each bounce is partially canceled by the previous bounce in the velocity. The oscillations are damped. The ball rolls smoothly rather than zigzagging.

Numerical Example: Ravine

Consider L(θ₁, θ₂) = θ₁² + 10θ₂², starting at θ = (4, 0.5).

Gradients: ∇L = (2θ₁, 20θ₂). At the start: ∇L = (8, 10).

Vanilla GD (α=0.05): Step → (4−0.4, 0.5−0.5) = (3.6, 0.0). Then gradient = (7.2, 0). Step → (3.6−0.36, 0) = (3.24, 0). Progress along θ₂ required only one step but was violent; along θ₁ it crawls.

Momentum (α=0.05, β=0.9), starting with v=0:

Step 1: gradient = (8, 10). v = (0.4, 0.5). θ = (3.6, 0.0). (Same as the first GD step.)

Step 2: gradient = (7.2, 0). v = (0.9·0.4 + 0.05·7.2, 0.9·0.5 + 0) = (0.72, 0.45). θ = (3.6−0.72, 0−0.45) = (2.88, −0.45).

Notice: along θ₂, the gradient has dropped to 0, but the velocity still carries 0.45 from step 1, so θ₂ overshoots to −0.45. The amplitude has already shrunk from 0.5 to 0.45: the oscillation is being damped.

After many steps: under a steady gradient, the velocity in the consistent θ₁ direction builds toward its steady-state value α·∇L/(1−β), up to 1/(1−β) = 10× a single vanilla step, making progress roughly 5–10× faster than vanilla GD.
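The two-step traces above can be reproduced in a few lines of Python (all constants are the ones from this example):

```python
# Ravine example: L(t1, t2) = t1^2 + 10*t2^2, start (4, 0.5),
# alpha = 0.05, beta = 0.9, two steps of each method.

def grad(t1, t2):
    return 2 * t1, 20 * t2

alpha, beta = 0.05, 0.9

# Vanilla GD
t1, t2 = 4.0, 0.5
for _ in range(2):
    g1, g2 = grad(t1, t2)
    t1, t2 = t1 - alpha * g1, t2 - alpha * g2
gd_after_2 = (t1, t2)    # matches the text: (3.24, 0.0)

# Momentum
t1, t2 = 4.0, 0.5
v1 = v2 = 0.0
for _ in range(2):
    g1, g2 = grad(t1, t2)
    v1, v2 = beta * v1 + alpha * g1, beta * v2 + alpha * g2
    t1, t2 = t1 - v1, t2 - v2
mom_after_2 = (t1, t2)   # matches the text: (2.88, -0.45)
```

Extending the loops to more steps shows the damping directly: the θ₂ bounces shrink each time, while θ₁ keeps accelerating along the valley floor.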

In Code

# PyTorch: SGD with momentum
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9    # β = 0.9
)

# The update PyTorch computes:
# v = 0.9 * v_prev + grad
# param -= 0.01 * v

For most tasks, momentum β=0.9 is a safe default. This is the same β₁ used in Adam's first moment — Adam's momentum is just the EMA of gradients, which is exactly what we've derived here.
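The convention difference between the lesson's update and PyTorch's is harmless: PyTorch's velocity is just the lesson's velocity divided by α, so both trace the same trajectory. A quick sketch (the gradient sequence below is arbitrary, chosen only for illustration):

```python
# Same trajectory under both conventions.
alpha, beta = 0.05, 0.9
grads = [8.0, 7.2, 6.0, 5.1]   # arbitrary illustrative gradient sequence

# Lesson's convention: v = beta*v + alpha*g; theta -= v
theta_a, v_a = 4.0, 0.0
for g in grads:
    v_a = beta * v_a + alpha * g
    theta_a -= v_a

# PyTorch's convention: v = beta*v + g; theta -= alpha*v
theta_b, v_b = 4.0, 0.0
for g in grads:
    v_b = beta * v_b + g
    theta_b -= alpha * v_b

# theta_a and theta_b agree up to floating-point error
```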

Quiz

1 / 3

With momentum (β=0.9), the gradient in direction A has been consistently positive for 10 steps. What happens to the effective step size in direction A compared to vanilla GD?