Gradient Descent
Lesson 1 ⏱ 10 min

The loss landscape

Video coming soon: The Loss Landscape: Your Training Target Visualized (⏱ ~5 min)

3D animation of loss surfaces for linear regression (bowl) vs. neural networks (chaotic mountains).

🧮 Quick refresher

Functions of multiple variables

A function f(w₁, w₂) takes two inputs and returns one output. We can plot it as a 3D surface — like a terrain map where height = output value.

Example

f(w₁, w₂) = w₁² + w₂² is a bowl shape.

f(0,0)=0 (bottom), f(1,0)=1 (up the side).
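
To make the refresher concrete, here is a minimal Python sketch of the bowl function and its gradient. The function and the two points come from the lesson; the gradient formula (∂f/∂w₁, ∂f/∂w₂) = (2w₁, 2w₂) is standard calculus rather than something stated above.

```python
def f(w1, w2):
    """Bowl-shaped surface: height above the (w1, w2) plane."""
    return w1**2 + w2**2

def grad_f(w1, w2):
    """Analytic gradient of f: points uphill, away from the bottom."""
    return (2 * w1, 2 * w2)

print(f(0, 0))       # 0 -> the bottom of the bowl
print(f(1, 0))       # 1 -> up the side
print(grad_f(1, 0))  # (2, 0) -> uphill direction; descent steps the other way
```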

Training Is Navigation

To train a model, you need to answer one question over and over: which direction should I adjust the parameters to make predictions better?

The loss landscape is the map that answers this.

For a model with two parameters, w₁ and w₂, we can draw a 3D surface:

  • x-axis: all possible values of w₁
  • y-axis: all possible values of w₂
  • Height at any point (w₁, w₂): the loss L(w₁, w₂) for those parameters

Training means walking downhill on this surface until you find a valley.

Real models have millions of parameters — you can't visualize that space. But the intuition transfers: there's a high-dimensional surface over parameter space, and gradient descent is your downhill-hiking algorithm.

Interactive: Gradient Descent on a Non-Convex Function

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
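
The lesson doesn't give the demo's exact function, so the sketch below uses a hypothetical stand-in: an asymmetric double well f(x) = x⁴/4 - x² + 0.2x, whose minima sit near x ≈ -1.46 (deeper) and x ≈ 1.36. Running plain gradient descent from two starting points shows the dependence described above.

```python
# Hypothetical stand-in for the interactive demo's function.
def f(x):
    return x**4 / 4 - x**2 + 0.2 * x

def df(x):
    # Analytic derivative of the stand-in function.
    return x**3 - 2 * x + 0.2

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x -= lr * df(x)  # step against the slope
    return x

# Same algorithm, different starting points -> different minima.
print(gradient_descent(-2.0))  # settles near -1.46, the global minimum
print(gradient_descent(+2.0))  # settles near +1.36, a local minimum
```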

Contour Maps

Instead of a 3D plot, you'll often see 2D contour plots — overhead views where each line connects points of equal loss. Like topographic maps.

The closer the contour lines are packed, the steeper the terrain. The gradient at any point is perpendicular to the contour lines, pointing toward higher values.

The negative gradient therefore points toward lower loss, which is the direction gradient descent follows.
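
Here is a quick numerical check of the perpendicularity claim, using a made-up elliptical bowl f(w₁, w₂) = w₁² + 2w₂² (not a function from the lesson). Stepping along a contour line, i.e. perpendicular to the gradient, leaves the loss unchanged to first order, while stepping along the gradient climbs it.

```python
import math

def f(w1, w2):
    return w1**2 + 2 * w2**2

def grad(w1, w2):
    return (2 * w1, 4 * w2)

p = (1.0, 0.5)
g = grad(*p)
norm = math.hypot(*g)

# Rotate the gradient 90 degrees to get a direction along the contour line.
tangent = (-g[1] / norm, g[0] / norm)

eps = 1e-4
along_contour = f(p[0] + eps * tangent[0], p[1] + eps * tangent[1]) - f(*p)
along_gradient = f(p[0] + eps * g[0] / norm, p[1] + eps * g[1] / norm) - f(*p)

print(along_contour)   # ~0: no first-order change along the contour
print(along_gradient)  # > 0: the gradient direction climbs the surface
```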

Convex Surfaces: The Easy Case

The loss surface for linear regression is a convex bowl:

L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where:
  • L: mean squared error loss
  • n: number of training examples
  • yᵢ: true label for example i
  • ŷᵢ: model prediction for example i

A convex function has one critical property: any local minimum is also the global minimum. If you find a flat point (where the gradient is zero), you're at the best possible solution.

This is mathematically clean. But it only holds for simple linear models.
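
A small sketch of that cleanness, using made-up data for a one-parameter linear model ŷ = w·x: tracing the MSE over a grid of w values produces a parabola with exactly one minimum.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # made-up data, roughly y = 2x with noise

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# L(w) expands to a quadratic in w, so the curve is convex.
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(w, round(mse(w), 3))
# Loss falls to a single minimum near w = 2 and rises again on both sides.
```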

Non-Convex Surfaces: The Real World

Neural network loss surfaces are not convex. They're high-dimensional terrain with:

Local minima: valleys that aren't the global minimum. The gradient is zero (looks like a bottom), but there are better valleys elsewhere.

Saddle points: flat points where some directions go downhill and others go uphill. The gradient is zero but you're not at a minimum.

Flat plateaus: vast regions where the gradient is tiny, causing very slow progress. Can feel like training has stalled.
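
To see why a zero gradient doesn't guarantee a minimum, here is the textbook saddle example f(x, y) = x² - y² (not a function from the lesson): the gradient vanishes at the origin, yet one direction goes uphill and the other downhill.

```python
def f(x, y):
    return x**2 - y**2

def grad(x, y):
    return (2 * x, -2 * y)

print(grad(0, 0))           # (0, 0): looks like a flat point
print(f(0.1, 0), f(0, 0))   # moving along x increases the loss
print(f(0, 0.1), f(0, 0))   # moving along y decreases it, so keep going
```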

Quiz

Question 1 of 3

In a loss landscape visualization, what does 'height' represent?