Nesterov Momentum: Correct Before You Overshoot

The limitation of standard momentum. The look-ahead idea: evaluate gradient at the future position. Why this corrects overshooting. Convergence comparison. PyTorch usage.

Quick refresher: SGD with momentum

Momentum maintains a velocity vₜ = β·vₜ₋₁ + α·∇L(θₜ) and updates θ ← θ - vₜ. It accumulates speed in consistent gradient directions and damps oscillations when gradients alternate sign.

Example

With β=0.9 and constant gradient g=1, the velocity builds up to v∞ = α·g/(1-β) = 10α.

The effective step size is 10× the raw learning rate in the direction of consistent gradient.
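
To see the buildup concretely, here is a minimal sketch in plain Python (the names alpha, beta, and g simply mirror the symbols above) that iterates the velocity update under a constant gradient:

# Velocity saturation under a constant gradient g = 1.
alpha, beta, g = 0.01, 0.9, 1.0

v = 0.0
for _ in range(200):
    v = beta * v + alpha * g   # momentum velocity update

print(v)                       # ~0.1, within float noise of the limit
print(alpha * g / (1 - beta))  # closed-form limit: 10 * alpha = 0.1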

The Problem with Standard Momentum

Standard momentum works well, but it has a subtle flaw: it computes the gradient at the wrong place.

Nesterov momentum is the version used in many high-performance training setups, including the SGD configurations that train ImageNet models. The improvement in convergence speed is consistent enough that it became the standard choice whenever momentum SGD is used.

The update sequence in standard momentum is:

  1. Evaluate the gradient at the current position θₜ
  2. Update the velocity: vₜ = β·vₜ₋₁ + α·∇L(θₜ)
  3. Step: θₜ₊₁ = θₜ - vₜ

The problem: the velocity β·vₜ₋₁ was already "committed" before you evaluated the gradient. You know you're going to move by approximately β·vₜ₋₁ regardless; that's the momentum. So why evaluate the gradient at θₜ, which is where you currently are, rather than at the position you're about to be?
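
To make the ordering concrete, here is one step of standard momentum as a minimal Python sketch (sgd_momentum_step, grad_fn, theta, and v are illustrative names, not from any library):

def sgd_momentum_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    # Standard momentum: the gradient is evaluated at the CURRENT position,
    # even though the beta*v part of the move is already committed.
    g = grad_fn(theta)
    v = beta * v + alpha * g   # velocity update
    return theta - v, v        # parameter step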

The Nesterov Idea: Look Before You Step

The Nesterov update swaps the order:

  1. First, take the momentum step to the look-ahead position:

     θ_look = θₜ - β·vₜ₋₁

     where θ_look is the look-ahead position (where pure momentum would take you), β is the momentum coefficient, and vₜ₋₁ is the previous velocity.

  2. Evaluate the gradient at the look-ahead position: ∇L(θ_look)

  3. Update the velocity and parameters:

     vₜ = β·vₜ₋₁ + α·∇L(θ_look)
     θₜ₊₁ = θₜ - vₜ

     where α is the learning rate and ∇L(θ_look) is the gradient at the look-ahead position. Note that the step is still taken from θₜ, not from θ_look.

The gradient is computed after the "free" momentum step, giving a more accurate signal about where the objective is going.
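
The same step with the Nesterov ordering, again as a sketch with the same illustrative names:

def nesterov_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    # Nesterov momentum: peek at where momentum alone would carry you,
    # and evaluate the gradient THERE instead.
    theta_look = theta - beta * v   # look-ahead position
    g = grad_fn(theta_look)         # gradient at the look-ahead point
    v = beta * v + alpha * g        # velocity update
    return theta - v, v             # the step still starts from theta

The only line that changed relative to the standard version is where grad_fn is called.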

The Physical Analogy

Standard momentum: running downhill. You look at the slope where you are standing, then leap in that direction with accumulated velocity. You may overshoot the valley floor before realizing you should have braked.

Nesterov: running downhill. You project yourself forward to where your momentum will carry you, look at the slope there, then apply the corrected step. You see the upslope coming before you reach it, and you brake earlier.

The difference is subtle but consistently beneficial.

Numerical Comparison

Consider minimizing L(θ) = 5θ², whose gradient is ∇L(θ) = 10θ, starting at θ₀ = 1.0. The true minimum is at θ = 0. Use α = 0.1, β = 0.9, v₀ = 0.

Standard momentum:

  • ∇L(θ₀) = 10·θ₀ = 10
  • v₁ = 0.9·0 + 0.1·10 = 1.0
  • θ₁ = 1.0 - 1.0 = 0.0 (happened to land on the minimum in one step)
  • ∇L(θ₁) = 0
  • v₂ = 0.9·1.0 + 0 = 0.9
  • θ₂ = 0 - 0.9 = -0.9 (overshot past the minimum!)

Nesterov momentum:

  • θ_look = 1.0 - 0.9·0 = 1.0 (same as the current position at the start)
  • v₁ = 0.9·0 + 0.1·10 = 1.0; θ₁ = 0.0
  • θ_look = 0.0 - 0.9·1.0 = -0.9
  • ∇L(-0.9) = 10·(-0.9) = -9 (the gradient points back toward 0!)
  • v₂ = 0.9·1.0 + 0.1·(-9) = 0.9 - 0.9 = 0.0
  • θ₂ = 0.0 - 0.0 = 0.0 (stays at the minimum!)

Nesterov recognized it was about to overshoot (the look-ahead gradient was negative) and applied a braking correction. Standard momentum overshot to -0.9.
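
The whole trace above can be reproduced with the two sketch functions from earlier in this lesson:

# Two steps on L(theta) = 5 * theta**2, whose gradient is 10 * theta.
grad = lambda theta: 10.0 * theta

theta_s, v_s = 1.0, 0.0   # standard momentum state
theta_n, v_n = 1.0, 0.0   # Nesterov state
for _ in range(2):
    theta_s, v_s = sgd_momentum_step(theta_s, v_s, grad)
    theta_n, v_n = nesterov_step(theta_n, v_n, grad)

print(theta_s)   # -0.9: standard momentum overshoots
print(theta_n)   #  0.0: Nesterov brakes and stays put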

The Equivalent Reformulation

In practice, computing ∇L(θ_look) requires stepping to (and storing) the look-ahead parameters before every gradient evaluation. An equivalent formulation avoids this via the change of variables φ = θ - β·v, which tracks the look-ahead point directly:

vₜ = β·vₜ₋₁ + α·∇L(φₜ)
φₜ₊₁ = φₜ - α·∇L(φₜ) - β·vₜ

Here φₜ is the reparameterized position (the look-ahead point), so the gradient is always evaluated at the current iterate and no separate parameter copy is needed. Eliminating the velocity gives the same method as a single recurrence:

φₜ₊₁ = φₜ - α·∇L(φₜ) + β·(φₜ - φₜ₋₁) - α·β·(∇L(φₜ) - ∇L(φₜ₋₁))
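
A quick numerical check of the equivalence, as a sketch (theta and v follow the look-ahead form, phi and v2 the reformulated one):

grad = lambda x: 10.0 * x          # gradient of L(x) = 5*x**2 from the example above
alpha, beta = 0.1, 0.9

theta, v = 1.0, 0.0                # look-ahead form
phi, v2 = theta - beta * v, 0.0    # reformulated form: phi = theta - beta*v

for _ in range(5):
    g = grad(theta - beta * v)     # needs the look-ahead parameters
    v = beta * v + alpha * g
    theta = theta - v

    g2 = grad(phi)                 # gradient at phi itself: no parameter copy
    v2 = beta * v2 + alpha * g2
    phi = phi - alpha * g2 - beta * v2

print(theta - beta * v, phi)       # identical at every step

This bookkeeping trick is what lets an optimizer expose Nesterov momentum without storing a second copy of the parameters.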

In PyTorch, this is handled automatically:

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True    # Enable Nesterov look-ahead
)

One boolean change. In practice, Nesterov almost always performs as well as or better than standard momentum, which makes it the preferred default when using SGD with momentum.
