Gradient Descent
Lesson 7 ⏱ 12 min

Momentum: adding velocity to gradient steps

Video coming soon

Momentum - Why Gradient Descent Gets Stuck in Ravines

Visual walkthrough of oscillating zigzag paths in ravine-shaped loss surfaces, the velocity accumulation idea, the β parameter, and a side-by-side comparison of vanilla SGD vs. SGD with momentum on a 2D loss function.

⏱ ~7 min


Quick refresher

Gradient descent update rule

In gradient descent we update each parameter by subtracting a small multiple of the gradient: θ ← θ - α·∇L. The gradient points uphill; we step downhill by a size controlled by the learning rate α.

Example

With α = 0.1 and a single parameter θ = 3.0 where ∂L/∂θ = 2.0, the update is θ ← 3.0 - 0.1×2.0 = 2.8.
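As a quick sanity check, here is the same arithmetic in plain Python (variable names are ours):

```python
# One vanilla gradient descent step, using the numbers from the refresher
alpha = 0.1   # learning rate
theta = 3.0   # current parameter value
grad = 2.0    # dL/dtheta at theta = 3.0

theta = theta - alpha * grad
print(theta)  # → 2.8
```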

The Problem: Zigzagging Across a Ravine

Imagine a loss surface shaped like a long, narrow valley — steep walls on the sides, gentle slope along the floor toward the minimum. This is called a ravine, and it's extremely common in practice (any time two parameters interact asymmetrically).

Vanilla gradient descent gets stuck in a frustrating pattern here. In the steep direction (across the ravine), gradients are large — so the update overshoots the other side. In the shallow direction (along the valley), gradients are small — so progress is glacial. The optimizer zigzags: left, right, left, right, inching forward with every oscillation.
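The zigzag is easy to reproduce. Below is a sketch on a made-up ravine loss L(x, y) = 5x² + 0.5y² (our choice of numbers, not from the lesson): the steep x coordinate flips sign every step while the shallow y coordinate barely moves.

```python
# Vanilla gradient descent on a toy ravine: steep in x, shallow in y
lr = 0.18
x, y = 1.0, 1.0
history = []
for _ in range(5):
    gx, gy = 10 * x, y          # gradients of L(x, y) = 5x^2 + 0.5y^2
    x, y = x - lr * gx, y - lr * gy
    history.append((round(x, 3), round(y, 3)))
print(history)
# x oscillates: -0.8, 0.64, -0.512, ...   y creeps: 0.82, 0.672, 0.551, ...
```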

Think of a ball rolling down a hillside. Once it's moving, it doesn't stop the instant the ground flattens — it carries its speed forward and keeps going. Gradient descent without momentum behaves the opposite way: it halts whenever the slope becomes shallow, even if the minimum is just a short distance ahead. Momentum gives gradient descent the same memory a rolling ball has.

The Velocity Idea

Momentum borrows from classical mechanics. Instead of moving in the direction of the current gradient alone, we maintain a velocity vector that accumulates past gradients:

v ← β·v_prev + α·∇L
θ ← θ − v

where:

v — velocity vector (same shape as θ)
β — momentum coefficient: how much previous velocity is retained (typical: 0.9)
α — learning rate
∇L — gradient of the loss with respect to θ
θ — model parameters

The parameter β (beta) controls how much history is retained. With β = 0.9, each update keeps 90% of the previous velocity and adds the current gradient's contribution scaled by α.
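The two-line update rule can be sketched directly in Python (the helper name momentum_step is ours):

```python
def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One step of v ← β·v_prev + α·∇L, then θ ← θ − v."""
    v = beta * v + alpha * grad
    theta = theta - v
    return theta, v

# A constant gradient of +1.0 makes the step size build up over time
theta, v = 3.0, 0.0
theta, v = momentum_step(theta, v, grad=1.0)   # v = 0.10, theta = 2.90
theta, v = momentum_step(theta, v, grad=1.0)   # v = 0.19, theta = 2.71
```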

Physical Analogy: The Rolling Ball

Think of a ball rolling down the valley. On a flat surface, it accelerates steadily. When it hits a bump, it doesn't immediately reverse — its momentum carries it forward. The ball averages out small obstacles and speeds up on consistent downhill terrain.

That's exactly what's happening mathematically. The velocity vv is a weighted average of all past gradients, with exponentially decaying weights:

v_t = α · Σ_{k=0..t} β^(t−k) · g_k

where:

v_t — velocity at step t
g_k — gradient at step k
β — momentum coefficient

Recent gradients matter most; older ones decay exponentially. The weights sum to α/(1−β) — so with β = 0.9 and α = 0.01, the effective step scale is 0.01/0.1 = 0.1 in steady state.
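You can check the α/(1−β) claim numerically; the geometric series of weights converges quickly (a throwaway sketch, our numbers):

```python
alpha, beta = 0.01, 0.9
# Partial sum of the weights alpha * beta^k; the tail beyond k = 1000 is negligible
total = sum(alpha * beta**k for k in range(1000))
print(round(total, 6))  # → 0.1, matching alpha / (1 - beta)
```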

Worked Numerical Example

Consider a 2D loss with parameters θ₁ and θ₂. Suppose the gradients alternate in sign for θ₂ (±2.0) but stay consistent for θ₁ (always +1.0). With α = 0.1, β = 0.9:

Step   g₁      g₂      v₁      v₂       Net v₂ direction
1      +1.0    +2.0    0.10    +0.20    forward
2      +1.0    −2.0    0.19    −0.02    nearly zero!
3      +1.0    +2.0    0.27    +0.18    forward
4      +1.0    −2.0    0.34    −0.04    nearly zero!

The alternating θ₂ gradients cancel almost completely in the velocity. Meanwhile the θ₁ velocity grows toward its steady-state value of α·g₁/(1−β) = 0.1/0.1 = 1.0, accelerating progress along the valley floor.
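The table values can be reproduced in a few lines (variable names are ours):

```python
alpha, beta = 0.1, 0.9
grads = [(1.0, 2.0), (1.0, -2.0), (1.0, 2.0), (1.0, -2.0)]  # (g1, g2) per step
v1 = v2 = 0.0
rows = []
for g1, g2 in grads:
    v1 = beta * v1 + alpha * g1   # consistent direction: velocity builds
    v2 = beta * v2 + alpha * g2   # alternating direction: velocity cancels
    rows.append((round(v1, 2), round(v2, 2)))
print(rows)  # → [(0.1, 0.2), (0.19, -0.02), (0.27, 0.18), (0.34, -0.04)]
```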

Choosing β

The standard default is β = 0.9. This means the effective window of gradients contributing meaningfully to the velocity is roughly 1/(1−β) = 10 steps.

  • β = 0.5: short memory, responds quickly to new gradients, less smoothing
  • β = 0.9: standard choice, good balance of smoothing and responsiveness
  • β = 0.99: very long memory, strong smoothing, but slow to change direction
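One way to see these memory lengths: with zero incoming gradient, a fraction β^n of the velocity survives after n steps (a quick check, our numbers):

```python
# Fraction of the old velocity remaining after 10 zero-gradient steps
for beta in (0.5, 0.9, 0.99):
    print(beta, round(beta ** 10, 4))
# β = 0.5 has essentially forgotten (0.001); β = 0.99 still retains ~90% (0.9044)
```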

Code: SGD with Momentum in PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# Toy data so the training loop below is runnable
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=16)

# SGD with momentum=0.9 (the standard setting)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()

The momentum=0.9 argument maps directly to the β in the update equation. PyTorch internally maintains the velocity buffer — it accumulates across calls to optimizer.step() and is reset to zero when you create a new optimizer.
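One subtlety worth noting: PyTorch's SGD stores its buffer as buf ← β·buf + g and then applies θ ← θ − lr·buf, while the lesson writes v ← β·v + α·∇L with θ ← θ − v. With a constant learning rate the two produce the same trajectory, since v = lr·buf at every step. A quick pure-Python check (toy gradient sequence, our numbers):

```python
lr, beta = 0.01, 0.9
grads = [2.0, -1.0, 0.5, 3.0]

buf, theta_pt = 0.0, 1.0      # PyTorch-style: buffer holds unscaled velocity
v, theta_lesson = 0.0, 1.0    # lesson-style: velocity already includes lr
for g in grads:
    buf = beta * buf + g
    theta_pt -= lr * buf
    v = beta * v + lr * g
    theta_lesson -= v

print(abs(theta_pt - theta_lesson) < 1e-12)  # → True
```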

Quiz

1 / 3

In SGD with momentum, the velocity update is v ← β·v_prev + α·∇L. What does the β parameter control?