Gradient Descent
Lesson 2 ⏱ 12 min

The gradient

Video coming soon

The Gradient - Steepest Ascent in Any Dimension (⏱ ~7 min)

Visual walkthrough of partial derivatives, the gradient vector, and why it always points in the direction of steepest increase on the loss surface.


Quick refresher

Partial derivatives

For a function of multiple variables, the partial derivative ∂f/∂x measures how f changes when we move in the x direction (keeping everything else fixed). The gradient ∇f is the vector of all partial derivatives.

Example

f(w₁, w₂) = w₁² + w₂².

∂f/∂w₁ = 2w₁, ∂f/∂w₂ = 2w₂.

Gradient: [2w₁, 2w₂].
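
That refresher gradient is simple enough to sanity-check in a few lines of Python (a minimal sketch, not part of the lesson's materials):

```python
def grad_f(w1, w2):
    # Gradient of f(w1, w2) = w1^2 + w2^2: one partial derivative per input.
    return [2 * w1, 2 * w2]

print(grad_f(1.0, -2.0))  # [2.0, -4.0]
```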

One Number vs. Many Numbers

For a function of one variable like f(w) = w², the derivative f'(w) = 2w gives one slope at any point. But a loss function L(w₁, w₂, …, wₚ) has thousands of parameters — we need a slope in every direction simultaneously. That generalization is the gradient: a vector of partial derivatives, one per parameter.

The gradient tells you which direction to nudge your parameters to reduce loss. Without it, training a neural network is guesswork — you'd have no principled way to know which of millions of weights to adjust.

Definition

The gradient of the loss with respect to the weight vector w is:

∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₚ]

where:

  • ∇L - gradient of the loss, a vector with one entry per parameter
  • ∂L/∂wᵢ - partial derivative: how much L changes when only wᵢ changes
  • p - total number of parameters (model size)

Each component ∂L/∂wᵢ answers: if I nudge wᵢ by a tiny amount while holding all other weights fixed, how much does the loss change? (A numeric check follows the list below.)

  • If ∂L/∂wᵢ > 0: increasing wᵢ increases the loss — uphill slope in this direction.
  • If ∂L/∂wᵢ < 0: increasing wᵢ decreases the loss — downhill slope.
  • If ∂L/∂wᵢ = 0: the loss is flat — a flat spot in that direction.
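
To make "nudge wᵢ by a tiny amount" concrete, you can estimate each partial derivative with a finite difference and compare against the hand-derived gradient. A minimal Python sketch using the refresher's bowl L = w₁² + w₂² (the step size h is an illustrative choice):

```python
def L(w):
    # The refresher's bowl: L = w1^2 + w2^2.
    return w[0] ** 2 + w[1] ** 2

def partial(f, w, i, h=1e-6):
    # Central difference: nudge w_i by +/- h, holding every other weight fixed.
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += h
    w_minus[i] -= h
    return (f(w_plus) - f(w_minus)) / (2 * h)

w = [3.0, -1.0]
print([partial(L, w, i) for i in range(len(w))])  # approximately [6.0, -2.0]
print([2 * w[0], 2 * w[1]])                       # exact gradient: [6.0, -2.0]
```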

Concrete Example

Let L = w₁² + 3w₂² — a 2D bowl, more steeply curved in the w₂ direction.

Partial derivatives: ∂L/∂w₁ = 2w₁ and ∂L/∂w₂ = 6w₂.

∇L = [2w₁, 6w₂]

where ∇L is the gradient evaluated at the current point (w₁, w₂).

Three illustrative points (reproduced in the sketch after this list):

  1. At (3, 1): ∇L = [6, 6]. Loss is increasing equally steeply in both directions.
  2. At (0, 2): ∇L = [0, 12]. Flat in w₁, very steep in w₂. Only w₂ needs to move.
  3. At (0, 0): ∇L = [0, 0]. We have reached the minimum. Gradient descent stops here.
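
These three cases are easy to reproduce with the hand-derived gradient (a minimal sketch):

```python
def grad_L(w1, w2):
    # Gradient of L = w1^2 + 3*w2^2.
    return [2 * w1, 6 * w2]

for w1, w2 in [(3, 1), (0, 2), (0, 0)]:
    print((w1, w2), "->", grad_L(w1, w2))
# (3, 1) -> [6, 6]
# (0, 2) -> [0, 12]
# (0, 0) -> [0, 0]
```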

The Gradient Points Uphill

The key geometric fact: ∇L always points in the direction of steepest increase of L.

Think of standing on a hillside in fog. You can feel the slope under your feet in every direction. The gradient is the direction of steepest uphill. To descend to the valley, step in the opposite direction: −∇L.

This is the entire logic of gradient descent: at each step, look at the gradient, then take a step against it to go downhill on the loss surface.
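
You can verify the steepest-ascent claim numerically: the rate of change of L along a unit direction u is the dot product ∇L · u, and scanning over directions shows it peaks when u lines up with the gradient. A minimal sketch at the point (3, 1) of the bowl above (the one-degree scan is an illustrative choice):

```python
import math

g = [2 * 3.0, 6 * 1.0]  # gradient of L = w1^2 + 3*w2^2 at (3, 1): [6.0, 6.0]

# Directional derivative along the unit vector at angle d: g . (cos d, sin d).
best = max(
    range(360),
    key=lambda d: g[0] * math.cos(math.radians(d)) + g[1] * math.sin(math.radians(d)),
)
print(best)                                  # 45: steepest increase at 45 degrees
print(math.degrees(math.atan2(g[1], g[0])))  # 45.0: the gradient's own direction
```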

Gradient for 1000 Parameters

If your model has p = 1,000 weights, the gradient is a 1000-dimensional vector. Each of those 1000 numbers answers: "if I nudge weight wᵢ a tiny bit, how does the total loss change?"

Some components will be large: those weights have strong influence right now and need significant adjustment. Others will be near zero: those weights barely affect the loss at the current position, so they barely move.

The vector update w ← w − α∇L adjusts all 1000 weights at once, each in proportion to its own partial derivative. The gradient automatically allocates how much to adjust each weight.
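
A minimal NumPy sketch of that update on an illustrative quadratic loss L = Σᵢ cᵢwᵢ² (the coefficients cᵢ, learning rate, and step count are assumptions for the demo, not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000
c = rng.uniform(0.1, 5.0, size=p)   # per-weight curvature: some steep, some shallow
w = rng.standard_normal(p)          # random starting point
alpha = 0.05                        # learning rate

print("initial loss:", float((c * w**2).sum()))
for _ in range(500):
    grad = 2 * c * w                # one partial derivative per weight
    w -= alpha * grad               # w <- w - alpha * grad, all 1000 weights at once
print("final loss:", float((c * w**2).sum()))
```

Weights with large cᵢ get large gradient components and move quickly; weights with tiny cᵢ barely move, exactly the proportional allocation described above.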

Interactive example

Explore the gradient on a 2D loss surface - hover over any point to see the gradient vector and the steepest descent direction

Coming soon

Quiz


The gradient ∇L points in the direction of...