One Number vs. Many Numbers
For a function of one variable like $f(w) = w^2$, the derivative $f'(w) = 2w$ gives one slope at any point. But a loss function has thousands of parameters; we need a slope in every direction simultaneously. That generalization is the gradient: a vector of partial derivatives, one per parameter.
The gradient tells you which direction to nudge your parameters to reduce loss. Without it, training a neural network is guesswork — you'd have no principled way to know which of millions of weights to adjust.
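To make the one-variable case concrete, here is a minimal Python sketch (the helper names are illustrative): a finite-difference slope of $f(w) = w^2$ recovers the analytic $f'(w) = 2w$.

```python
def f(w):
    return w ** 2

def slope(f, w, h=1e-6):
    # Central difference approximation of f'(w).
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [-3.0, 0.0, 2.0]:
    print(f"w={w}: numeric {slope(f, w):.4f} vs analytic {2 * w:.4f}")
```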
Definition
The gradient of the loss $L$ with respect to the weight vector $w$ is:

$$\nabla L = \left(\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \ldots,\ \frac{\partial L}{\partial w_n}\right)$$

- $\nabla L$ - gradient of the loss - a vector with one entry per parameter
- $\partial L / \partial w_i$ - partial derivative - how much $L$ changes when only $w_i$ changes
- $n$ - total number of parameters (model size)
Each component answers: if I nudge $w_i$ by a tiny amount while holding all other weights fixed, how much does the loss change? (See the sketch after this list.)
- If $\partial L / \partial w_i > 0$: increasing $w_i$ increases the loss, an uphill slope in this direction.
- If $\partial L / \partial w_i < 0$: increasing $w_i$ decreases the loss, a downhill slope.
- If $\partial L / \partial w_i = 0$: the loss is flat, a stationary point in that direction.
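That per-component recipe translates directly into code. Below is a minimal finite-difference sketch (the function name `numerical_gradient` and the example loss are illustrative, not from any particular library): each partial is estimated by nudging one weight while every other weight stays fixed.

```python
import numpy as np

def numerical_gradient(loss, w, h=1e-6):
    """Estimate each partial: nudge w_i alone, hold every other weight fixed."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * h)  # ~ dL/dw_i
    return grad

def loss(w):
    # A simple bowl-shaped loss: sum of squares.
    return np.sum(w ** 2)

g = numerical_gradient(loss, np.array([3.0, -1.0]))
print(g)  # ~ [ 6. -2.]: uphill in w_1 (positive), downhill in w_2 (negative)
```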
Concrete Example
Let $L(w_1,\thinspace w_2) = w_1^2 + 2w_2^2$, a 2D bowl that is more steeply curved in the $w_2$ direction.
Partial derivatives: $\partial L / \partial w_1 = 2w_1$ and $\partial L / \partial w_2 = 4w_2$.
- $\nabla L(w_1,\thinspace w_2) = (2w_1,\ 4w_2)$ - gradient evaluated at the current point $(w_1,\thinspace w_2)$
Three illustrative points (checked in the sketch after this list):
- At $(2,\thinspace 1)$: $\nabla L = (4,\ 4)$. Loss is increasing equally steeply in both directions.
- At $(0,\thinspace 2)$: $\nabla L = (0,\ 8)$. Flat in $w_1$, very steep in $w_2$. Only $w_2$ needs to move.
- At $(0,\thinspace 0)$: $\nabla L = (0,\ 0)$. We have reached the minimum. Gradient descent stops here.
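These three points are easy to verify in a few lines of Python, assuming the bowl $L(w_1, w_2) = w_1^2 + 2w_2^2$ defined above:

```python
import numpy as np

def grad_L(w1, w2):
    # Analytic gradient of L(w1, w2) = w1^2 + 2*w2^2.
    return np.array([2 * w1, 4 * w2])

for point in [(2.0, 1.0), (0.0, 2.0), (0.0, 0.0)]:
    print(point, "->", grad_L(*point))
# (2.0, 1.0) -> [4. 4.]  equally steep in both directions
# (0.0, 2.0) -> [0. 8.]  flat in w_1, steep in w_2
# (0.0, 0.0) -> [0. 0.]  the minimum: the gradient vanishes
```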
The Gradient Points Uphill
The key geometric fact: $\nabla L$ always points in the direction of steepest increase of $L$.
Think of standing on a hillside in fog. You can feel the slope under your feet in every direction. The gradient is the direction of steepest uphill. To descend to the valley, step in the opposite direction: $-\nabla L$.
This is the entire logic of gradient descent: at each step, look at the gradient, then take a step against it to go downhill on the loss surface.
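That logic fits in a short loop. Here is a minimal gradient-descent sketch on the bowl from the example above (the step count and the learning rate of 0.1 are arbitrary illustrative choices):

```python
import numpy as np

def grad_L(w):
    # Gradient of the bowl L(w1, w2) = w1^2 + 2*w2^2.
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([2.0, 1.0])    # start somewhere on the hillside
lr = 0.1                    # step size
for _ in range(50):
    w = w - lr * grad_L(w)  # step *against* the gradient: downhill
print(w)                    # ~ [0. 0.], the bottom of the bowl
```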
Gradient for 1000 Parameters
If your model has 1000 weights, the gradient is a 1000-dimensional vector. Each of those 1000 numbers answers: "if I nudge weight $w_i$ a tiny bit, how does the total loss change?"
Some components will be large: those weights have strong influence right now and need significant adjustment. Others will be near zero: those weights barely affect the loss at the current position, so they barely move.
The vector update $w \leftarrow w - \eta\,\nabla L$ (with step size $\eta$) adjusts all 1000 weights at once, each in proportion to its own partial derivative. The gradient automatically allocates how much to adjust each weight.
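In code, that is a single vector operation. A sketch with an illustrative quadratic loss over 1000 parameters (the loss and the random initialization are assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)        # 1000 parameters
target = rng.normal(size=1000)   # defines where this toy loss is minimized

def grad(w):
    # Gradient of L(w) = 0.5 * sum((w - target)^2): one entry per weight.
    return w - target

lr = 0.1
g = grad(w)
print(g.shape)   # (1000,) -- one slope per parameter
w = w - lr * g   # one vector update moves every weight at once,
                 # each in proportion to its own partial derivative
```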
Interactive example
Explore the gradient on a 2D loss surface: hover over any point to see the gradient vector and the steepest-descent direction.
Coming soon