Gradient Descent
Lesson 1 ⏱ 10 min

The loss landscape

Video coming soon: The Loss Landscape: Your Training Target Visualized (⏱ ~5 min)

3D animation of loss surfaces for linear regression (bowl) vs. neural networks (chaotic mountains).

🧮 Quick refresher

Functions of multiple variables

A function f(w₁, w₂) takes two inputs and returns one output. We can plot it as a 3D surface — like a terrain map where height = output value.

Example

f(w₁, w₂) = w₁² + w₂² is a bowl shape.

f(0,0)=0 (bottom), f(1,0)=1 (up the side).
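
To make the refresher concrete, here is a minimal Python sketch of the bowl function and its gradient. The function and the two points come from the lesson; the gradient formula (∂f/∂w₁, ∂f/∂w₂) = (2w₁, 2w₂) is standard calculus rather than something stated above.

```python
def f(w1, w2):
    """Bowl-shaped surface: height above the (w1, w2) plane."""
    return w1**2 + w2**2

def grad_f(w1, w2):
    """Analytic gradient of f: points uphill, away from the bottom."""
    return (2 * w1, 2 * w2)

print(f(0, 0))       # 0 -> the bottom of the bowl
print(f(1, 0))       # 1 -> up the side
print(grad_f(1, 0))  # (2, 0) -> uphill direction; descent steps the other way
```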

Training Is Navigation

To train a model, you need to answer one question over and over: which direction should I adjust the parameters to make predictions better?

The loss landscape is the map that answers this.

For a model with two parameters, w₁ and w₂, we can draw a 3D surface:

  • x-axis: all possible values of w₁
  • y-axis: all possible values of w₂
  • Height at any point (w₁, w₂): the loss L(w₁, w₂) for those parameters

Training means walking downhill on this surface until you find a valley.

Real models have millions of parameters — you can't visualize that space. But the intuition transfers: there's a high-dimensional surface over parameter space, and gradient descent is your downhill-hiking algorithm.

Interactive: Gradient Descent on a Non-Convex Function

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
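
The lesson doesn't give the demo's exact function, so the sketch below uses a hypothetical stand-in: an asymmetric double well f(x) = x⁴/4 - x² + 0.2x, whose minima sit near x ≈ -1.46 (deeper) and x ≈ 1.36. Running plain gradient descent from two starting points shows the dependence described above.

```python
# Hypothetical stand-in for the interactive demo's function.
def f(x):
    return x**4 / 4 - x**2 + 0.2 * x

def df(x):
    # Analytic derivative of the stand-in function.
    return x**3 - 2 * x + 0.2

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)  # step against the slope
    return x

# Same algorithm, different starting points -> different minima.
print(gradient_descent(-2.0))  # settles near -1.46, the global minimum
print(gradient_descent(+2.0))  # settles near +1.36, a local minimum
```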

Contour Maps

Instead of a 3D plot, you'll often see 2D contour plots — overhead views where each line connects points of equal loss. Like topographic maps.

The closer the contour lines are packed, the steeper the terrain. The gradient at any point is perpendicular to the contour lines, pointing toward higher values.

The negative gradient therefore points toward lower loss, which is the direction gradient descent follows.
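
Here is a quick numerical check of the perpendicularity claim, using a made-up elliptical bowl f(w₁, w₂) = w₁² + 2w₂² (not a function from the lesson). Stepping along a contour line, i.e. perpendicular to the gradient, leaves the loss unchanged to first order, while stepping along the gradient climbs it.

```python
import math

def f(w1, w2):
    return w1**2 + 2 * w2**2

def grad(w1, w2):
    return (2 * w1, 4 * w2)

p = (1.0, 0.5)
g = grad(*p)
norm = math.hypot(*g)

# Rotate the gradient 90 degrees to get a direction along the contour line.
tangent = (-g[1] / norm, g[0] / norm)

eps = 1e-4
along_contour = f(p[0] + eps * tangent[0], p[1] + eps * tangent[1]) - f(*p)
along_gradient = f(p[0] + eps * g[0] / norm, p[1] + eps * g[1] / norm) - f(*p)

print(along_contour)   # ~0: no first-order change along the contour
print(along_gradient)  # > 0: the gradient direction climbs the surface
```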

Convex Surfaces: The Easy Case

The loss surface for linear regression is a convex bowl:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where:
  • L: mean squared error loss
  • n: number of training examples
  • yᵢ: true label for example i
  • ŷᵢ: model prediction for example i

A convex function has one critical property: any local minimum is also the global minimum. If you find a flat point (where the gradient is zero), you're at the best possible solution.

This is mathematically clean. But it only holds for simple linear models.
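
A small sketch of that cleanness, using made-up data for a one-parameter linear model ŷ = w·x: tracing the MSE over a grid of w values produces a parabola with exactly one minimum.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x with noise

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# L(w) expands to a quadratic in w, so the curve is convex.
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(w, round(mse(w), 3))
# Loss falls to a single minimum near w = 2 and rises again on both sides.
```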

Non-Convex Surfaces: The Real World

Neural network loss surfaces are not convex. They're high-dimensional terrain with:

Local minima: valleys that aren't the global minimum. The gradient is zero (looks like a bottom), but there are better valleys elsewhere.

Saddle points: flat points where some directions go downhill and others go uphill. The gradient is zero but you're not at a minimum.

Flat plateaus: vast regions where the gradient is tiny, causing very slow progress. Can feel like training has stalled.
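
To see why a zero gradient doesn't guarantee a minimum, here is the textbook saddle example f(x, y) = x² - y² (not a function from the lesson): the gradient vanishes at the origin, yet one direction goes uphill and the other downhill.

```python
def f(x, y):
    return x**2 - y**2

def grad(x, y):
    return (2 * x, -2 * y)

print(grad(0, 0))           # (0, 0): looks like a flat point
print(f(0.1, 0), f(0, 0))   # moving along x increases the loss
print(f(0, 0.1), f(0, 0))   # moving along y decreases it, so keep going
```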

Quiz

Question 1 of 3

In a loss landscape visualization, what does 'height' represent?