Advanced Optimization
Lesson 1 ⏱ 10 min

Why vanilla gradient descent struggles


The Four Problems with Vanilla Gradient Descent

Ravines and zigzagging. Saddle points in high dimensions. Mini-batch noise. One learning rate for all parameters. These four problems are why Adam, momentum, and learning rate schedules exist.


Quick refresher

Gradient descent update rule

Gradient descent updates parameters by stepping opposite to the gradient: θ ← θ - α·∇L(θ). The learning rate α controls step size. It follows the direction of steepest descent in the loss landscape.

Example

For L(θ) = θ², ∇L = 2θ.

Starting at θ=3, with α=0.1: θ ← 3 - 0.1·6 = 2.4.

Then 2.4 - 0.1·4.8 = 1.92.

Converges toward 0.
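
The same iteration in code, a minimal Python sketch reproducing the numbers above:

```python
# Gradient descent on L(θ) = θ², matching the worked example.
theta = 3.0   # starting point from the example
alpha = 0.1   # learning rate

for step in range(1, 4):
    grad = 2 * theta              # ∇L = 2θ
    theta = theta - alpha * grad  # θ ← θ - α·∇L(θ)
    print(f"step {step}: θ = {theta:.4f}")

# step 1: θ = 2.4000
# step 2: θ = 1.9200
# step 3: θ = 1.5360
```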

Vanilla GD Works. Barely.

Basic gradient descent is correct: it points downhill. For a smooth, convex function like a bowl, it converges reliably. But the loss landscapes of neural networks are nothing like bowls. They have elongated ridges, saddle points, noisy gradients, and parameters operating at wildly different scales.

Every practical optimizer — Adam, RMSprop, momentum SGD — was invented to address the specific failure modes described in this lesson. Once you understand what breaks vanilla gradient descent, the entire history of optimizer development makes sense.

Here are the four fundamental problems that motivated everything in this unit.

Problem 1: Ravines (Different Curvatures)

Imagine a loss surface shaped like an elongated valley — steep walls in one direction, shallow slope along the valley floor. This is a ravine.

[Interactive example, coming soon: gradient descent zigzagging in an elongated bowl, showing how large curvature along one axis causes overshooting.]

Along the steep axis: the gradient is large, so steps are large, so you overshoot, bounce to the other side, overshoot again. The optimizer zigzags.

Along the shallow axis: the gradient is small, so steps are tiny. Progress toward the minimum is agonizingly slow.
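
You can watch both behaviors in a small sketch. The bowl below is an assumed toy example, L(x, y) = ½(100x² + y²), where the x-axis is 100 times steeper than the y-axis:

```python
import numpy as np

# Assumed toy ravine: L(x, y) = 0.5 * (100*x**2 + y**2).
# Curvature along x (steep walls) is 100x larger than along y (valley floor).
def grad(p):
    return np.array([100.0 * p[0], 1.0 * p[1]])

p = np.array([1.0, 1.0])
alpha = 0.019  # just under the stability limit 2/100 set by the steep axis

for step in range(5):
    p = p - alpha * grad(p)
    print(f"step {step + 1}: x = {p[0]:+.3f}, y = {p[1]:.3f}")

# x flips sign every step (zigzagging across the steep walls), while
# y shrinks by only ~2% per step (crawling along the valley floor).
```

Any learning rate large enough to make real progress along y would blow up along x, so vanilla GD is stuck with the worst of both axes.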

To measure this curvature mismatch precisely, we use the Hessian matrix, which tells us the curvature in every direction simultaneously.

Formally: the loss surface curvature is captured by the Hessian, the matrix of second derivatives of the loss. When the ratio of its largest to smallest eigenvalue — the condition number κ — is large, vanilla GD needs O(κ) steps to converge where an ideally-scaled method needs O(1).

For deep networks, condition numbers in the thousands are common.
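
For a quadratic loss L(θ) = ½·θᵀHθ, the condition number falls straight out of the Hessian's eigenvalues. A minimal sketch, using an assumed toy Hessian that matches the bowl above:

```python
import numpy as np

# Assumed toy Hessian for L(θ) = 0.5 * θᵀ H θ, matching the ravine example.
H = np.array([[100.0, 0.0],
              [0.0,   1.0]])

eigenvalues = np.linalg.eigvalsh(H)
kappa = eigenvalues.max() / eigenvalues.min()
print(f"condition number κ = {kappa:.0f}")  # κ = 100

# Stability caps the learning rate at ~2/λ_max, but progress along the
# shallowest direction shrinks by only (1 - α·λ_min) per step, so the
# number of steps to converge scales with λ_max/λ_min = κ.
```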

Problem 2: Saddle Points

A saddle point is a point where ∇L = 0 but which is not a local minimum: the Hessian has both positive and negative eigenvalues.

At a saddle point, gradient descent stalls: the gradient is zero or near-zero, so updates become tiny.

The critical insight: in high dimensions, saddle points are far more common than local minima. A local minimum requires every eigenvalue of the Hessian to be positive. With d parameters, if eigenvalues are randomly positive or negative with equal probability, the chance that all are positive is (1/2)^d. For d = 10^6, this is essentially zero.

Dauphin et al. (2014) showed empirically that in large neural networks, the loss at critical points is well-predicted by assuming they are saddle points — local minima are rare at high loss values. Training a neural network is mostly navigating saddle points, not falling into local minima.
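
A toy saddle makes the stall visible. The sketch below uses an assumed function L(x, y) = x² − y², whose Hessian eigenvalues are +2 (along x) and −2 (along y), so the origin is a saddle point:

```python
import numpy as np

# Assumed toy saddle: L(x, y) = x**2 - y**2, with ∇L = (2x, -2y).
p = np.array([0.5, 1e-6])  # start near the saddle, almost on the x-axis
alpha = 0.1

for step in range(0, 60, 10):
    for _ in range(10):
        g = np.array([2.0 * p[0], -2.0 * p[1]])
        p = p - alpha * g
    print(f"step {step + 10:2d}: |∇L| = {np.hypot(2 * p[0], 2 * p[1]):.2e}")

# The gradient norm collapses as x decays toward 0, so updates become tiny.
# Escape along the negative-curvature y direction starts from 1e-6 and takes
# dozens of steps to become noticeable: that is the stall.
```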

Problem 3: Noisy Gradients (Mini-Batch Variance)

Using the full dataset to compute a gradient is expensive. Mini-batch SGD estimates the gradient from a small batch. This introduces variance: each mini-batch gradient is a noisy estimate of the true gradient.

The result: even near a minimum, vanilla SGD never fully settles — it jitters around due to the noisy gradient. A decaying learning rate is needed to reduce step size as convergence approaches, but choosing the right schedule is an art.

The tradeoff: large batches → low variance gradients → more stable steps, but each step is expensive. Small batches → high variance → cheap steps, but noisy direction. Modern deep learning uses mini-batches of 32–4096 for this balance.
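
A quick simulation makes the tradeoff concrete. This sketch assumes i.i.d. per-example gradients with mean g (the full-batch gradient) and standard deviation σ, so the noise in a batch-averaged estimate should scale as σ/√B:

```python
import numpy as np

rng = np.random.default_rng(0)
g_true, sigma, n_trials = 1.0, 4.0, 10_000  # assumed illustrative values

for batch_size in [8, 32, 128, 512]:
    # Each trial averages `batch_size` simulated per-example gradients.
    batch_grads = rng.normal(g_true, sigma, size=(n_trials, batch_size)).mean(axis=1)
    print(f"B = {batch_size:4d}: std of estimate = {batch_grads.std():.3f} "
          f"(theory σ/√B = {sigma / np.sqrt(batch_size):.3f})")

# Quadrupling the batch size only halves the noise, at 4x the compute per step.
```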

Problem 4: Shared Learning Rate Across Parameters

Vanilla GD uses one learning rate α for every parameter. But parameters live at different scales:

  • The first-layer weights of a language model receive gradients from millions of training tokens. Their gradients are large and well-estimated.
  • An embedding for a rare word is updated only on the few batches where that word appears. Its effective learning rate needs to be larger.
  • Bias terms often have very different gradient magnitudes than weight matrices.

A single learning rate is either too large for well-trained parameters (causing oscillation) or too small for rarely-updated parameters (causing slow learning). What we want is a per-parameter effective learning rate that adapts to each parameter's history.
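
As a preview of where this unit is headed, here is a minimal AdaGrad-style sketch (derived properly in a later lesson). Dividing by each parameter's accumulated gradient history yields exactly such a per-parameter effective learning rate; the numbers below are made up for illustration:

```python
import numpy as np

def adagrad_step(theta, grad, accum, alpha=0.1, eps=1e-8):
    accum += grad ** 2                               # per-parameter gradient history
    theta -= alpha * grad / (np.sqrt(accum) + eps)   # per-parameter step size
    return theta, accum

theta = np.zeros(2)
accum = np.array([100.0, 0.01])  # param 0: frequent updates; param 1: rare updates
grad = np.array([1.0, 1.0])      # identical current gradients

theta, accum = adagrad_step(theta, grad, accum)
print(theta)  # the rarely updated parameter takes a ~10x larger step
```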

The Roadmap Forward

These four problems directly motivate the algorithms in this unit:

Problem → Solution
Ravines / oscillation → Momentum (smooths out the update direction)
Slow convergence near saddles → Nesterov momentum (looks ahead before stepping)
Shared learning rate → AdaGrad, RMSprop, Adam (per-parameter rates)
Noisy gradients → Learning rate schedules (decay as training stabilizes)

Each of the next lessons derives one of these solutions from scratch.

Quiz

Question 1 of 3

A loss surface has very different curvatures along two axes: it curves steeply along axis A and gently along axis B. What does vanilla gradient descent do?