The limitation of standard momentum. The look-ahead idea: evaluate gradient at the future position. Why this corrects overshooting. Convergence comparison. PyTorch usage.
Quick refresher
SGD with momentum
Momentum maintains a velocity $v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla L(\theta_t)$ and updates $\theta \leftarrow \theta - v_t$. It accumulates speed in consistent gradient directions and damps oscillations when gradients alternate sign.
Example
With $\beta = 0.9$ and a constant gradient $g = 1$, the velocity builds up to $v_\infty = \alpha \cdot g / (1 - \beta) = 10\alpha$.
The effective step size is 10× the raw learning rate in the direction of a consistent gradient.
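A quick numerical check of that limit, as a minimal sketch in plain Python (constants match the example above):

```python
# Velocity buildup under momentum with a constant gradient g = 1:
# v_t = beta * v_{t-1} + alpha * g converges to alpha * g / (1 - beta).
alpha, beta, g = 0.1, 0.9, 1.0

v = 0.0
for _ in range(200):
    v = beta * v + alpha * g

print(v)                       # ~1.0, i.e. 10 * alpha
print(alpha * g / (1 - beta))  # closed-form limit: 1.0
```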
The Problem with Standard Momentum
Standard momentum works well, but it has a subtle flaw: it computes the gradient at the wrong place.
Nesterov momentum is the version used in many high-performance training setups, including the SGD configurations that train ImageNet models. The improvement in convergence speed is consistent enough that it became the standard choice whenever momentum SGD is used.
The update sequence in standard momentum is:

1. Evaluate the gradient at the current position $\theta_t$.
2. Update the velocity: $v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla L(\theta_t)$
3. Step: $\theta_{t+1} = \theta_t - v_t$
The problem: the velocity $\beta \cdot v_{t-1}$ was already "committed" before you evaluated the gradient. You know you're going to move by approximately $\beta \cdot v_{t-1}$ regardless; that's the momentum. So why evaluate the gradient at $\theta_t$, which is where you currently are, rather than at the position you're about to move to?
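In code, the standard update is just a few lines. This minimal sketch uses an illustrative quadratic gradient, not any particular library:

```python
def grad(theta):
    return 10.0 * theta   # illustrative gradient, e.g. of L = 5 * theta^2

alpha, beta = 0.1, 0.9
theta, v = 1.0, 0.0

# Standard momentum: the gradient is evaluated at the CURRENT theta,
# even though the beta * v part of the move is already committed.
g = grad(theta)           # gradient at theta_t, where we stand now
v = beta * v + alpha * g  # velocity accumulates the new gradient
theta = theta - v         # the beta * v_{t-1} part happens regardless of g
```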
The Nesterov Idea: Look Before You Step
The Nesterov update swaps the order:

1. First, take the momentum step to the look-ahead position:

   $\theta_{\text{look}} = \theta_t - \beta \cdot v_{t-1}$

   where $\theta_{\text{look}}$ is the look-ahead position (where pure momentum would take you), $\beta$ is the momentum coefficient, and $v_{t-1}$ is the previous velocity.

2. Evaluate the gradient at the look-ahead position: $\nabla L(\theta_{\text{look}})$

3. Update velocity and parameters:

   $v_t = \beta \cdot v_{t-1} + \alpha \cdot \nabla L(\theta_{\text{look}})$

   $\theta_{t+1} = \theta_t - v_t$

   where $v_t$ is the updated velocity, $\alpha$ is the learning rate, $\nabla L(\theta_{\text{look}})$ is the gradient at the look-ahead position, and $\theta_{t+1}$ is the updated parameter vector.
The gradient is computed after the "free" momentum step, giving a more accurate signal about the terrain you are actually moving into.
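The same sketch rearranged for Nesterov; only where the gradient is evaluated changes (again plain Python with an illustrative quadratic gradient):

```python
def grad(theta):
    return 10.0 * theta       # illustrative gradient, e.g. of L = 5 * theta^2

alpha, beta = 0.1, 0.9
theta, v = 1.0, 0.0

# Nesterov momentum: peek at where momentum alone would carry you,
# and evaluate the gradient THERE.
look = theta - beta * v       # look-ahead position theta_look
g = grad(look)                # gradient at the look-ahead position
v = beta * v + alpha * g      # velocity update uses the corrected signal
theta = theta - v             # same parameter step as before
```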
The Physical Analogy
Standard momentum: running downhill. You look at the slope where you are standing, then leap in that direction with accumulated velocity. You may overshoot the valley floor before realizing you should have braked.
Nesterov: running downhill. You project yourself forward to where your momentum will carry you, look at the slope there, then apply the corrected step. You see the upslope coming before you reach it, and you brake earlier.
The difference is subtle but consistently beneficial.
Numerical Comparison
Consider minimizing $L(\theta) = 5\theta^2$ starting at $\theta_0 = 1.0$. The true minimum is at $\theta = 0$. Use $\alpha = 0.1$, $\beta = 0.9$, $v_0 = 0$.

Standard momentum:

- $\nabla L(\theta_0) = 10\theta_0 = 10$
- $v_1 = 0.9 \cdot 0 + 0.1 \cdot 10 = 1.0$
- $\theta_1 = 1.0 - 1.0 = 0.0$ (happened to land on the minimum in one step)
- $\nabla L(\theta_1) = 0$
- $v_2 = 0.9 \cdot 1.0 + 0 = 0.9$
- $\theta_2 = 0 - 0.9 = -0.9$ (overshot past the minimum!)

Nesterov momentum:

- $\theta_{\text{look}} = 1.0 - 0.9 \cdot 0 = 1.0$ (same as the current position at the start)
- $v_1 = 0.9 \cdot 0 + 0.1 \cdot 10 = 1.0$; $\theta_1 = 0.0$
- $\theta_{\text{look}} = 0.0 - 0.9 \cdot 1.0 = -0.9$
- $\nabla L(-0.9) = 10 \cdot (-0.9) = -9$ (the gradient points back toward 0!)
- $v_2 = 0.9 \cdot 1.0 + 0.1 \cdot (-9) = 0.9 - 0.9 = 0.0$
- $\theta_2 = 0.0 - 0.0 = 0.0$ (stays at the minimum!)
Nesterov recognized it was about to overshoot (the look-ahead gradient was negative) and applied a braking correction. Standard momentum overshot to -0.9.
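The whole comparison can be reproduced in a few lines; a minimal script in plain Python, matching the constants above:

```python
def grad(theta):
    return 10.0 * theta  # dL/dtheta for L(theta) = 5 * theta^2

alpha, beta = 0.1, 0.9

# Standard momentum: two steps from theta = 1.0.
theta, v = 1.0, 0.0
for _ in range(2):
    v = beta * v + alpha * grad(theta)
    theta = theta - v
print(f"standard: theta_2 = {theta:+.2f}")  # -0.90 (overshoots)

# Nesterov momentum: two steps from theta = 1.0.
theta, v = 1.0, 0.0
for _ in range(2):
    look = theta - beta * v            # look-ahead position
    v = beta * v + alpha * grad(look)  # braking signal when overshooting
    theta = theta - v
print(f"nesterov: theta_2 = {theta:+.2f}")  # +0.00 (stays at the minimum)
```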
The Equivalent Reformulation
In practice, computing $\nabla L(\theta_{\text{look}})$ requires storing the look-ahead parameters alongside the current ones. An equivalent formulation that avoids this uses the change of variables $\phi = \theta - \beta v$, i.e. it stores the look-ahead point itself as the parameters. With $g = \nabla L(\phi)$, the update becomes:

$v \leftarrow \beta \cdot v + \alpha \cdot g$

$\phi \leftarrow \phi - \beta \cdot v - \alpha \cdot g$ (using the freshly updated $v$)

Every gradient is now evaluated at the parameters you actually store, which is essentially the form PyTorch's `nesterov=True` implements.
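A quick check that the two formulations trace the same trajectory; a minimal sketch on the quadratic from the comparison above, where $\phi$ should always equal $\theta - \beta v$:

```python
def grad(x):
    return 10.0 * x  # gradient of L(x) = 5 * x^2

alpha, beta, steps = 0.1, 0.9, 5

# Look-ahead form: store theta, evaluate gradients at theta - beta * v.
theta, v = 1.0, 0.0
for _ in range(steps):
    g = grad(theta - beta * v)
    v = beta * v + alpha * g
    theta = theta - v

# Reformulated version: store phi = theta - beta * v directly.
phi, w = 1.0, 0.0
for _ in range(steps):
    g = grad(phi)
    w = beta * w + alpha * g
    phi = phi - beta * w - alpha * g

print(abs(phi - (theta - beta * v)) < 1e-12)  # True: same trajectory
```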
One boolean change
In PyTorch, switching from standard to Nesterov momentum is a single flag on the optimizer. In practice, Nesterov almost always performs equal to or better than standard momentum, and it is the preferred default when using SGD with momentum.
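A usage sketch (the model and batch are placeholders; the optimizer call is the real `torch.optim.SGD` API):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# Standard momentum would be:
#   torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Nesterov momentum is one boolean change:
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, nesterov=True
)

x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```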