Advanced Optimization
Lesson 3 ⏱ 12 min

SGD with momentum

Momentum: Accumulating Velocity to Smooth the Path

The velocity accumulation idea. Why momentum damps oscillations and accelerates in consistent directions. Physical analogy. Numerical example on a ravine. PyTorch usage.

⏱ ~7 min

Quick refresher

Exponential moving average

An EMA vₜ = β·vₜ₋₁ + (1-β)·xₜ computes a running weighted average of its inputs, with recent values weighted more. The effective window is 1/(1-β) steps. Early estimates are biased toward zero, corrected by dividing by (1-βᵗ).

Example

With β=0.9, x=[10,10,10,...]: v₁=1, v₂=1.9, v₃=2.71 (raw).

Bias-corrected: v̂₁=10, v̂₂=10, v̂₃=10.

The correction is essential early in training.
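The refresher's numbers can be checked with a few lines of Python (β and the input sequence are the ones from the example above):

```python
# EMA with bias correction: raw vs corrected estimates.
beta = 0.9
xs = [10.0, 10.0, 10.0]

v = 0.0
raw, corrected = [], []
for t, x in enumerate(xs, start=1):
    v = beta * v + (1 - beta) * x        # raw EMA update
    raw.append(v)
    corrected.append(v / (1 - beta**t))  # divide by (1 - beta^t)

print([round(r, 2) for r in raw])  # raw estimates start near zero
print([round(c, 2) for c in corrected])  # corrected estimates recover 10
```

The raw values climb slowly toward 10 (1, 1.9, 2.71, ...), while the corrected values are 10 from the first step.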

The Problem Revisited

Vanilla gradient descent zigzags in ravines because it responds only to the current gradient — with no memory of where it has been. Each step is independent of the last.

The fix is physical: give the optimizer a velocity that accumulates over time. In directions where the gradient consistently points the same way, speed builds up. In directions where the gradient oscillates, opposite contributions cancel out.

Momentum is the reason SGD can train production image classifiers. Without it, vanilla SGD either crawls or oscillates — with it, the same architecture trains in a fraction of the time. Almost every real training run uses momentum, and understanding it is essential before tackling Adam.

The Algorithm

Maintain a velocity vector vₜ, initialized to zero. The update rule:

vₜ = β·vₜ₋₁ + α·∇L(θₜ)
θₜ₊₁ = θₜ − vₜ

  • vₜ: velocity at step t, the accumulated gradient history
  • β: momentum coefficient, controls how much past gradient is retained (typical: 0.9)
  • α: learning rate, the step size scalar
  • ∇L(θₜ): gradient of the loss at the current parameters
  • θₜ, θₜ₊₁: model parameters before and after the update

This is equivalent to the EMA formulation from lesson 11-2: scaling the gradient by α instead of (1−β) is a convention difference, and both forms are in common use.
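A minimal sketch of this update rule in plain Python (the gradient value and starting point below are illustrative, not from the lesson):

```python
# One SGD-with-momentum step in the lesson's convention:
# v_t = beta * v_{t-1} + alpha * grad;  theta_{t+1} = theta_t - v_t.

def momentum_step(theta, v, grad, alpha=0.05, beta=0.9):
    """Return the updated (theta, v) after one momentum step."""
    v = beta * v + alpha * grad
    return theta - v, v

# Illustrative 1-D example: constant gradient 2.0, starting from theta=1.0.
theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=2.0)
# First step: v = 0.9*0 + 0.05*2 = 0.1, theta = 1.0 - 0.1 = 0.9
```

With v initialized to zero, the first step is identical to a vanilla GD step; the history only starts to matter from step 2 onward.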

Physical Analogy

Think of a ball rolling down a hill. In vanilla GD, the ball stops dead at every step and gets a fresh push exactly downhill from wherever it landed, with no memory of how it was moving before. With momentum, the ball has mass: it accumulates velocity in directions of consistent descent, overshoots a little (kept in check by a sensible β), and rolls steadily toward the minimum.

In the valleys of a ravine:

  • Along the valley floor (shallow slope, consistent gradient direction): velocity builds up steadily. The ball accelerates. Progress is fast.
  • Across the ravine walls (steep curvature, gradient alternates sign): each bounce is partially canceled by the previous bounce in the velocity. The oscillations are damped. The ball rolls smoothly rather than zigzagging.

Numerical Example: Ravine

Consider L(θ₁, θ₂) = θ₁² + 10θ₂², starting at θ = (4, 0.5).

Gradients: ∇L = (2θ₁, 20θ₂). At the start: ∇L = (8, 10).

Vanilla GD (α=0.05): Step → (4−0.4, 0.5−0.5) = (3.6, 0.0). Then gradient = (7.2, 0). Step → (3.6−0.36, 0) = (3.24, 0). Progress along θ₂ required only one step but was violent; along θ₁ it crawls.

Momentum (α=0.05, β=0.9), starting with v=0:

Step 1: gradient = (8, 10). v = (0.4, 0.5). θ = (3.6, 0.0). (Same as the first GD step.)

Step 2: gradient = (7.2, 0). v = (0.9·0.4 + 0.05·7.2, 0.9·0.5 + 0) = (0.72, 0.45). θ = (3.6−0.72, 0−0.45) = (2.88, −0.45).

Notice: along θ₂, the gradient has dropped to 0, but the velocity still carries 0.45 from step 1, so θ₂ overshoots to −0.45. The amplitude has already shrunk from 0.5 to 0.45: the oscillation is being damped.

After many steps: under a steady gradient, the velocity in the consistent θ₁ direction builds toward its steady-state value α·∇L/(1−β), up to 1/(1−β) = 10× a single vanilla step, making progress roughly 5–10× faster than vanilla GD.
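The two-step traces above can be reproduced in a few lines of Python (all constants are the ones from this example):

```python
# Ravine example: L(t1, t2) = t1^2 + 10*t2^2, start (4, 0.5),
# alpha = 0.05, beta = 0.9, two steps of each method.

def grad(t1, t2):
    return 2 * t1, 20 * t2

alpha, beta = 0.05, 0.9

# Vanilla GD
t1, t2 = 4.0, 0.5
for _ in range(2):
    g1, g2 = grad(t1, t2)
    t1, t2 = t1 - alpha * g1, t2 - alpha * g2
gd_after_2 = (t1, t2)    # matches the text: (3.24, 0.0)

# Momentum
t1, t2 = 4.0, 0.5
v1 = v2 = 0.0
for _ in range(2):
    g1, g2 = grad(t1, t2)
    v1, v2 = beta * v1 + alpha * g1, beta * v2 + alpha * g2
    t1, t2 = t1 - v1, t2 - v2
mom_after_2 = (t1, t2)   # matches the text: (2.88, -0.45)
```

Extending the loops to more steps shows the damping directly: the θ₂ bounces shrink each time, while θ₁ keeps accelerating along the valley floor.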

In Code

# PyTorch: SGD with momentum
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9    # β = 0.9
)

# The update PyTorch computes:
# v = 0.9 * v_prev + grad
# param -= 0.01 * v

For most tasks, momentum β=0.9 is a safe default. This is the same β₁ used in Adam's first moment — Adam's momentum is just the EMA of gradients, which is exactly what we've derived here.
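The convention difference between the lesson's update and PyTorch's is harmless: PyTorch's velocity is just the lesson's velocity divided by α, so both trace the same trajectory. A quick sketch (the gradient sequence below is arbitrary, chosen only for illustration):

```python
# Same trajectory under both conventions.
alpha, beta = 0.05, 0.9
grads = [8.0, 7.2, 6.0, 5.1]   # arbitrary illustrative gradient sequence

# Lesson's convention: v = beta*v + alpha*g; theta -= v
theta_a, v_a = 4.0, 0.0
for g in grads:
    v_a = beta * v_a + alpha * g
    theta_a -= v_a

# PyTorch's convention: v = beta*v + g; theta -= alpha*v
theta_b, v_b = 4.0, 0.0
for g in grads:
    v_b = beta * v_b + g
    theta_b -= alpha * v_b

# theta_a and theta_b agree up to floating-point error
```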

Quiz

1 / 3

With momentum (β=0.9), the gradient in direction A has been consistently positive for 10 steps. What happens to the effective step size in direction A compared to vanilla GD?