Neural Networks
Lesson 5 ⏱ 10 min

The loss surface for deep networks

Video coming soon

Loss Surfaces: Why Neural Networks Are Hard to Optimize

Visualizing the non-convex loss landscape of a neural network - saddle points, local minima, and why gradient descent still works in high dimensions.

⏱ ~6 min

🧮 Quick refresher

Convexity and gradient descent

A convex function has a bowl shape - any local minimum is the global minimum. Gradient descent on a convex function is guaranteed to find the optimal solution. Neural network loss functions are not convex.

Example

MSE loss for linear regression is convex: one global minimum.

Adding a hidden layer with ReLU creates multiple local minima and saddle points.
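The refresher can be checked numerically. This is a toy sketch (not the lesson's demo code): even a single *linear* hidden unit already breaks convexity in the parameters, which a simple midpoint test exposes; ReLU adds flat regions and further structure on top of that.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x  # targets lie exactly on a slope-2 line

def linear_loss(w):
    """MSE of plain linear regression y_hat = w * x (convex in w)."""
    return float(np.mean((w * x - y) ** 2))

def two_layer_loss(w, v):
    """MSE of a two-layer model y_hat = v * (w * x): one hidden unit, no bias."""
    return float(np.mean((v * w * x - y) ** 2))

# Convexity check for linear regression: the loss at the midpoint of two
# points never exceeds the average of the endpoint losses.
assert linear_loss(2.0) <= 0.5 * (linear_loss(0.0) + linear_loss(4.0))

# With a hidden layer, convexity fails: (w, v) = (2, 1) and (-2, -1) are both
# global minima (v * w = 2 either way), but their midpoint (0, 0) has strictly
# higher loss than the average of the two endpoints.
mid = two_layer_loss(0.0, 0.0)
print(two_layer_loss(2.0, 1.0), two_layer_loss(-2.0, -1.0), mid)
assert mid > 0.5 * (two_layer_loss(2.0, 1.0) + two_layer_loss(-2.0, -1.0))
```

Note the two symmetric minima at (2, 1) and (-2, -1): composing layers multiplies weights, so distinct parameter settings compute the same function, and the straight line between them leaves the set of minimizers.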

Contrast With Linear Regression

For linear regression with MSE loss, the loss landscape is a perfect bowl: one global minimum, smooth everywhere, convex. Gradient descent is guaranteed to find the minimum. You can even solve it analytically with the Normal Equation.

Neural network loss surfaces are far more complex — and understanding why matters for debugging training failures, choosing optimizers, and knowing when your model has actually converged.

For a neural network with even a single hidden layer, the loss landscape is a high-dimensional, non-convex, complicated surface. Yet in practice, training neural networks works remarkably well. Here is why.

Saddle Points, Not Local Minima

Your intuition might be: "gradient descent could get stuck in a bad local minimum." In low dimensions, that is a real concern. In 2D, a local minimum is a point lower than all nearby points — it could be far from the global minimum.

In high dimensions, the picture changes completely. A true local minimum requires the loss to be higher in every direction you move. With millions of dimensions, that is nearly impossible — there is almost always some direction to decrease the loss.

What does exist in abundance: saddle points, where the gradient is zero but the surface curves up in some directions and down in others. Gradient descent stalls near them because the gradient approaches zero. Mini-batch gradient descent saves you: the noise from mini-batches kicks the optimizer off saddle points and keeps it moving.
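A minimal sketch of that rescue effect on the textbook saddle f(x, y) = x² − y², with mini-batch noise modeled as small Gaussian noise added to the gradient (an assumption for illustration, not a real data pipeline):

```python
import numpy as np

def grad(p):
    # f(x, y) = x**2 - y**2 has a saddle at the origin:
    # a minimum along x, a maximum along y.
    x, y = p
    return np.array([2 * x, -2 * y])

lr, steps = 0.1, 50
rng = np.random.default_rng(0)

# Full-batch GD started exactly on the y = 0 axis: the gradient never
# develops a y-component, so it walks straight into the saddle and stalls.
p = np.array([1.0, 0.0])
for _ in range(steps):
    p = p - lr * grad(p)
print("plain GD:", p)  # ~ [0, 0]: stuck at the saddle

# Noisy gradients knock the iterate off the y = 0 axis; the -y**2
# descent direction then takes over and y leaves the saddle.
q = np.array([1.0, 0.0])
for _ in range(steps):
    q = q - lr * (grad(q) + rng.normal(0, 0.01, size=2))
print("noisy GD:", q)  # y is no longer pinned at 0
```

On this unbounded toy function the noisy iterate keeps descending forever; in a real loss landscape the same kick simply moves the optimizer into the next basin.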

Interactive demo: Gradient Descent on a Non-Convex Function (live readout of x, f(x), f'(x), and the step count).

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
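The same experiment is easy to reproduce offline. The demo's exact function isn't given, so this sketch uses a qualitatively similar double-well with a deeper left minimum; only the starting point differs between the two runs.

```python
def f(x):
    # Double-well stand-in: deeper minimum on the left, shallower on the right.
    return x**4 / 4 - x**2 + 0.2 * x

def df(x):
    # Derivative of f; roots near x ~ -1.46 (deep well) and x ~ 1.36 (shallow well).
    return x**3 - 2 * x + 0.2

def gradient_descent(x0, lr=0.05, steps=200):
    x = x0
    for _ in range(steps):
        x = x - lr * df(x)
    return x

# Same function, same learning rate -- only the starting point differs.
left = gradient_descent(-0.5)   # rolls into the deeper left well
right = gradient_descent(+0.5)  # rolls into the shallower right well
print(round(left, 2), round(right, 2))
```

Starting points on opposite sides of the local maximum near x ≈ 0.1 converge to different minima, and only one of them is the global one: gradient descent is a purely local procedure.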

Local Minima Are Actually Fine

Research in the past decade produced a surprising result: for sufficiently large neural networks, most local minima have similar loss values. The loss at a "bad" local minimum is only marginally higher than the loss at the global minimum.

The intuition: an overparameterized model has so much capacity that there are many equivalent solutions. If you can adjust 175 billion weights freely, there are effectively infinite ways to achieve near-zero training error. The loss surface is riddled with good solutions.

Permutation Symmetry

Take any two neurons j and k in hidden layer l. Swap their rows in W^{(l)} (and the matching entries of the bias b^{(l)}) and swap the corresponding columns in W^{(l+1)}. The network computes exactly the same function — you have just relabeled which neuron is which.

For a layer with n neurons, there are n! ways to permute them. For 100 neurons: 100! ≈ 9 × 10^157 equivalent parameter configurations. Every minimum in the loss landscape has this many copies.

This is why comparing specific weight values across different training runs is meaningless. Two trained networks can represent identical functions with completely different weight matrices.
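Permutation symmetry can be verified in a few lines. A sketch with a tiny random MLP (arbitrary sizes, chosen only for illustration): permute the hidden neurons and check that the function is unchanged.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

# A tiny 2-layer MLP: input 3 -> hidden 4 -> output 2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(W1, b1, W2, b2, x):
    return W2 @ relu(W1 @ x + b1) + b2

# Swap hidden neurons 0 and 2: rows of W1 (with their biases in b1),
# and the corresponding columns of W2.
perm = [2, 1, 0, 3]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
out_a = forward(W1, b1, W2, b2, x)
out_b = forward(W1p, b1p, W2p, b2, x)
print(np.allclose(out_a, out_b))  # True: identical function, different weights

# And the count from the text checks out: 100 hidden neurons give 100! copies.
print(math.factorial(100) > 9 * 10**157)  # True
```

The weight matrices W1 and W1p are genuinely different arrays, so naively diffing weights across training runs compares labels, not functions.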

Why Overparameterization Helps

Classical statistics says: more parameters than data means overfitting. The neural network reality is more nuanced.

Heavily overparameterized networks (parameters >> training examples) can fit training data perfectly, yet often generalize well to new data. More surprisingly, they are often easier to train. The loss surface of a heavily overparameterized network is smoother and better-connected — gradient descent navigates it more reliably.

This "double descent" phenomenon is one of the active research frontiers in deep learning theory. The practical takeaway: do not be afraid to use large models.

Why Gradient Descent Is the Only Tool

For a network with 175 billion parameters, there is no analytical solution. The Normal Equation requires inverting a matrix of size (parameters × parameters) — a 175-billion × 175-billion matrix. That is not computationally feasible.
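To make the scale gap concrete, here is a sketch of the Normal Equation at a size where it is trivially feasible, followed by a back-of-envelope memory estimate at the 175-billion-parameter scale mentioned above (the toy data is invented for illustration):

```python
import numpy as np

# Normal Equation for linear regression: w = (X^T X)^{-1} X^T y.
# Perfectly feasible when the parameter count p is small:
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))        # 1000 examples, p = 10 parameters
true_w = rng.normal(size=10)
y = X @ true_w                          # noiseless targets for a clean check

w = np.linalg.solve(X.T @ X, X.T @ y)   # solves a 10 x 10 linear system
print(np.allclose(w, true_w))           # True: recovered exactly

# But the p x p system grows quadratically in memory and roughly cubically
# in solve time. At p = 175e9, merely *storing* the matrix in float32 takes:
p = 175e9
print(p * p * 4 / 1e9, "GB")            # ~1.2e14 GB, i.e. over 100 billion TB
```

No amount of hardware makes that matrix materialize, which is why iterative, gradient-based methods are the only option at this scale.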

Gradient descent is the only tool. And despite the theoretical challenges — non-convexity, saddle points, permutation symmetry — it works extraordinarily well in practice. The reason is partly the structure of the loss landscape (well-connected, many good solutions), partly the noise injection from mini-batching, and partly good engineering choices in optimizers and initialization that we will cover in upcoming units.

Quiz

1 / 3

In high-dimensional neural network loss landscapes, the most common type of critical point (where gradient = 0) is...