The Problem: Zigzagging Across a Ravine
Imagine a loss surface shaped like a long, narrow valley — steep walls on the sides, gentle slope along the floor toward the minimum. This shape is called a ravine, and it's extremely common in practice: it appears whenever the loss curves much more steeply along one parameter direction than another.
Vanilla gradient descent gets stuck in a frustrating pattern here. In the steep direction (across the ravine), gradients are large — so the update overshoots the other side. In the shallow direction (along the valley), gradients are small — so progress is glacial. The optimizer zigzags: left, right, left, right, inching forward with every oscillation.
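To make the zigzag concrete, here is a minimal sketch of vanilla gradient descent on a ravine-shaped quadratic. The curvatures (20 across the valley, 1 along it) and the learning rate are illustrative choices, not values from any particular model:

```python
import torch

# Ravine-shaped quadratic: curvature 20 across the valley (x), 1 along it (y)
def loss(theta):
    x, y = theta
    return 0.5 * (20 * x**2 + y**2)

theta = torch.tensor([1.0, 5.0], requires_grad=True)
lr = 0.09  # big enough to overshoot in the steep direction

for step in range(5):
    L = loss(theta)
    L.backward()
    with torch.no_grad():
        theta -= lr * theta.grad
    theta.grad.zero_()
    print(step, [round(t, 3) for t in theta.tolist()])
# x flips sign every step (the zigzag); y shrinks by only 9% per step
```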
Think of a ball rolling down a hillside. Once it's moving, it doesn't stop the instant the ground flattens — it carries its speed forward and keeps going. Gradient descent without momentum behaves the opposite way: it slows to a crawl whenever the slope becomes shallow, even if the minimum is just a short distance ahead. Momentum gives gradient descent the same memory a rolling ball has.
The Velocity Idea
Momentum borrows from classical mechanics. Instead of moving in the direction of the current gradient alone, we maintain a velocity vector that accumulates past gradients:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \nabla_\theta L(\theta_{t-1})$$

$$\theta_t = \theta_{t-1} - \alpha\, v_t$$

where:
- $v_t$ — velocity vector (same shape as $\theta$)
- $\beta$ — momentum coefficient, how much previous velocity is retained. Typical: 0.9
- $\alpha$ — learning rate
- $\nabla_\theta L$ — gradient of the loss with respect to $\theta$
- $\theta$ — model parameters
The parameter $\beta$ (beta) controls how much history is retained. With $\beta = 0.9$, each update keeps 90% of the previous velocity and adds 10% of the current gradient's contribution.
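The two update equations translate line-for-line into code. This is a minimal sketch of the EMA convention used in this article (momentum_step is just an illustrative name; it is not how PyTorch's optimizer is written internally):

```python
import torch

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad  # keep 90% of old velocity, add 10% of new gradient
    theta = theta - lr * v            # step along the accumulated velocity
    return theta, v

# one update from rest: the first step is just 10% of the gradient, scaled by lr
theta, v = torch.tensor([1.0, 5.0]), torch.zeros(2)
theta, v = momentum_step(theta, v, grad=torch.tensor([20.0, 5.0]))
```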
Physical Analogy: The Rolling Ball
Think of a ball rolling down the valley. On a flat surface, it accelerates steadily. When it hits a bump, it doesn't immediately reverse — its momentum carries it forward. The ball averages out small obstacles and speeds up on consistent downhill terrain.
That's exactly what's happening mathematically. The velocity is a weighted average of all past gradients, with exponentially decaying weights:

$$v_t = (1 - \beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g_k$$

where:
- $v_t$ — velocity at step $t$
- $g_k$ — gradient at step $k$
- $\beta$ — momentum coefficient
Recent gradients matter most; older ones decay exponentially. The sum of all weights is $(1-\beta)\sum_{j=0}^{t-1}\beta^{j} = 1 - \beta^{t}$, which approaches 1 — so with $\beta = 0.9$ and a constant gradient $g$, the velocity settles at $g$ and the effective step scale is $\alpha g$ in steady state.
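To confirm the recursive update and the explicit weighted sum are the same thing, here's a quick check using placeholder random gradients:

```python
import torch

beta = 0.9
grads = torch.randn(50)  # arbitrary gradient sequence

# recursive form: v_t = beta * v_{t-1} + (1 - beta) * g_t
v = torch.tensor(0.0)
for g in grads:
    v = beta * v + (1 - beta) * g

# explicit form: v_t = (1 - beta) * sum_k beta^(t-k) * g_k
t = len(grads)
weights = (1 - beta) * beta ** torch.arange(t - 1, -1, -1, dtype=torch.float32)
v_explicit = (weights * grads).sum()

print(torch.allclose(v, v_explicit))  # True, up to float rounding
```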
Worked Numerical Example
Consider a 2D loss with parameters θ₁ and θ₂. Suppose the gradients alternate for θ₂ (±2.0) but stay consistent for θ₁ (always +1.0). With β = 0.9 and v₀ = 0:
| Step | g₁ | g₂ | v₁ | v₂ | Net v₂ direction |
|---|---|---|---|---|---|
| 1 | +1.0 | +2.0 | 0.10 | +0.20 | forward |
| 2 | +1.0 | −2.0 | 0.19 | −0.02 | nearly zero! |
| 3 | +1.0 | +2.0 | 0.27 | +0.18 | forward |
| 4 | +1.0 | −2.0 | 0.34 | −0.04 | nearly zero! |
The alternating θ₂ gradients cancel almost completely in the velocity. Meanwhile θ₁ velocity grows toward its steady-state value of 1.0 (the constant gradient itself), accelerating progress along the valley floor.
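The table can be reproduced in a few lines, using the gradient sequences assumed above:

```python
beta = 0.9
g1 = [1.0, 1.0, 1.0, 1.0]    # consistent gradient along the valley
g2 = [2.0, -2.0, 2.0, -2.0]  # alternating gradient across the ravine

v1 = v2 = 0.0
for step, (a, b) in enumerate(zip(g1, g2), start=1):
    v1 = beta * v1 + (1 - beta) * a
    v2 = beta * v2 + (1 - beta) * b
    print(f"step {step}: v1 = {v1:.2f}, v2 = {v2:+.2f}")
# v2 flips between ~+0.2 and ~-0.02 while v1 climbs toward 1.0
```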
Choosing β
The standard default is β = 0.9. This means the effective window of gradients contributing meaningfully to the velocity is roughly 1/(1 − β) = 10 steps. The sketch after the list below compares how the common settings behave on the same noisy gradient stream:
- β = 0.5: short memory, responds quickly to new gradients, less smoothing
- β = 0.9: standard choice, good balance of smoothing and responsiveness
- β = 0.99: very long memory, strong smoothing, but slow to change direction
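Here's that comparison, feeding synthetic noisy gradients (centered on a true value of 1.0; all values illustrative) through the EMA at each β:

```python
import torch

torch.manual_seed(0)
grads = 1.0 + 0.5 * torch.randn(200)  # noisy gradients around a true value of 1.0

for beta in (0.5, 0.9, 0.99):
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g.item()
        history.append(v)
    tail = torch.tensor(history[100:])  # ignore the warm-up phase
    print(f"beta={beta}: mean={tail.mean().item():.3f}, std={tail.std().item():.3f}")
# Larger beta gives a smaller std (more smoothing), but beta=0.99's
# mean is still catching up to 1.0 even after 100 warm-up steps.
```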
Code: SGD with Momentum in PyTorch
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # any loss works; MSE for this toy regression

# SGD with momentum=0.9 (the standard setting)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# stand-in for a real DataLoader: 100 random batches of (input, target)
dataloader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()                      # clear gradients from the last step
    loss = criterion(model(x_batch), y_batch)  # forward pass
    loss.backward()                            # backprop: populate .grad
    optimizer.step()                           # momentum update of the parameters
```
The momentum=0.9 argument plays the role of β in the update equation, with one caveat: PyTorch's SGD uses the convention v ← βv + g, without the (1 − β) factor on the gradient (you can recover the exact EMA form above by also setting dampening=0.9). The smoothing behavior is the same; the missing factor just rescales the effective learning rate. PyTorch internally maintains the velocity buffer — it accumulates across calls to optimizer.step() and starts from zero when you create a new optimizer.
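To see the convention concretely, here is a small check comparing PyTorch's internal buffer against a hand-rolled v ← βv + g update (the quadratic loss and parameter shapes are arbitrary):

```python
import torch
import torch.optim as optim

w = torch.zeros(3, requires_grad=True)
opt = optim.SGD([w], lr=0.1, momentum=0.9)

v_manual = torch.zeros(3)
for _ in range(3):
    opt.zero_grad()
    loss = ((w - 1.0) ** 2).sum()
    loss.backward()
    v_manual = 0.9 * v_manual + w.grad  # PyTorch's form: no (1 - beta) factor
    opt.step()

# The optimizer's state holds the velocity ("momentum_buffer") per parameter
buffer = opt.state[w]["momentum_buffer"]
print(torch.allclose(buffer, v_manual))  # True
```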