One Number vs. Many Numbers
For a function of one variable like $f(w) = w^2$, the derivative $f'(w) = 2w$ gives one slope at any point. But a loss function has thousands of parameters; we need a slope in every direction simultaneously. That generalization is the gradient: a vector of partial derivatives, one per parameter.
The gradient tells you which direction to nudge your parameters to reduce loss. Without it, training a neural network is guesswork — you'd have no principled way to know which of millions of weights to adjust.
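To make the one-variable case concrete, here is a minimal Python sketch (the helper names are illustrative): a finite-difference slope of $f(w) = w^2$ recovers the analytic $f'(w) = 2w$.

```python
def f(w):
    return w ** 2

def slope(f, w, h=1e-6):
    # Central difference approximation of f'(w).
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [-3.0, 0.0, 2.0]:
    print(f"w={w}: numeric {slope(f, w):.4f} vs analytic {2 * w:.4f}")
```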
Definition
The gradient of the loss $L$ with respect to the weight vector $w$ is:

$$\nabla L = \left(\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \ldots,\ \frac{\partial L}{\partial w_n}\right)$$

- $\nabla L$ - gradient of the loss - a vector with one entry per parameter
- $\partial L / \partial w_i$ - partial derivative - how much $L$ changes when only $w_i$ changes
- $n$ - total number of parameters (model size)
Each component answers: if I nudge $w_i$ by a tiny amount while holding all other weights fixed, how much does the loss change? (See the sketch after this list.)
- If $\partial L / \partial w_i > 0$: increasing $w_i$ increases the loss, an uphill slope in this direction.
- If $\partial L / \partial w_i < 0$: increasing $w_i$ decreases the loss, a downhill slope.
- If $\partial L / \partial w_i = 0$: the loss is flat, a stationary point in that direction.
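That per-component recipe translates directly into code. Below is a minimal finite-difference sketch (the function name `numerical_gradient` and the example loss are illustrative, not from any particular library): each partial is estimated by nudging one weight while every other weight stays fixed.

```python
import numpy as np

def numerical_gradient(loss, w, h=1e-6):
    """Estimate each partial: nudge w_i alone, hold every other weight fixed."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * h)  # ~ dL/dw_i
    return grad

def loss(w):
    # A simple bowl-shaped loss: sum of squares.
    return np.sum(w ** 2)

g = numerical_gradient(loss, np.array([3.0, -1.0]))
print(g)  # ~ [ 6. -2.]: uphill in w_1 (positive), downhill in w_2 (negative)
```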
Concrete Example
Let $L(w_1,\thinspace w_2) = w_1^2 + 2w_2^2$, a 2D bowl that is more steeply curved in the $w_2$ direction.
Partial derivatives: $\partial L / \partial w_1 = 2w_1$ and $\partial L / \partial w_2 = 4w_2$.
- $\nabla L(w_1,\thinspace w_2) = (2w_1,\ 4w_2)$ - gradient evaluated at the current point $(w_1,\thinspace w_2)$
Three illustrative points (checked in the sketch after this list):
- At $(2,\thinspace 1)$: $\nabla L = (4,\ 4)$. Loss is increasing equally steeply in both directions.
- At $(0,\thinspace 2)$: $\nabla L = (0,\ 8)$. Flat in $w_1$, very steep in $w_2$. Only $w_2$ needs to move.
- At $(0,\thinspace 0)$: $\nabla L = (0,\ 0)$. We have reached the minimum. Gradient descent stops here.
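These three points are easy to verify in a few lines of Python, assuming the bowl $L(w_1, w_2) = w_1^2 + 2w_2^2$ defined above:

```python
import numpy as np

def grad_L(w1, w2):
    # Analytic gradient of L(w1, w2) = w1^2 + 2*w2^2.
    return np.array([2 * w1, 4 * w2])

for point in [(2.0, 1.0), (0.0, 2.0), (0.0, 0.0)]:
    print(point, "->", grad_L(*point))
# (2.0, 1.0) -> [4. 4.]  equally steep in both directions
# (0.0, 2.0) -> [0. 8.]  flat in w_1, steep in w_2
# (0.0, 0.0) -> [0. 0.]  the minimum: the gradient vanishes
```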
The Gradient Points Uphill
The key geometric fact: $\nabla L$ always points in the direction of steepest increase of $L$.
Think of standing on a hillside in fog. You can feel the slope under your feet in every direction. The gradient is the direction of steepest uphill. To descend to the valley, step in the opposite direction: $-\nabla L$.
This is the entire logic of gradient descent: at each step, look at the gradient, then take a step against it to go downhill on the loss surface.
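That logic fits in a short loop. Here is a minimal gradient-descent sketch on the bowl from the example above (the step count and the learning rate of 0.1 are arbitrary illustrative choices):

```python
import numpy as np

def grad_L(w):
    # Gradient of the bowl L(w1, w2) = w1^2 + 2*w2^2.
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([2.0, 1.0])    # start somewhere on the hillside
lr = 0.1                    # step size
for _ in range(50):
    w = w - lr * grad_L(w)  # step *against* the gradient: downhill
print(w)                    # ~ [0. 0.], the bottom of the bowl
```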
Gradient for 1000 Parameters
If your model has 1000 weights, the gradient is a 1000-dimensional vector. Each of those 1000 numbers answers: "if I nudge weight $w_i$ a tiny bit, how does the total loss change?"
Some components will be large: those weights have strong influence right now and need significant adjustment. Others will be near zero: those weights barely affect the loss at the current position, so they barely move.
The vector update $w \leftarrow w - \eta\,\nabla L$ (with step size $\eta$) adjusts all 1000 weights at once, each in proportion to its own partial derivative. The gradient automatically allocates how much to adjust each weight.
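In code, that is a single vector operation. A sketch with an illustrative quadratic loss over 1000 parameters (the loss and the random initialization are assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)        # 1000 parameters
target = rng.normal(size=1000)   # defines where this toy loss is minimized

def grad(w):
    # Gradient of L(w) = 0.5 * sum((w - target)^2): one entry per weight.
    return w - target

lr = 0.1
g = grad(w)
print(g.shape)   # (1000,) -- one slope per parameter
w = w - lr * g   # one vector update moves every weight at once,
                 # each in proportion to its own partial derivative
```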
Interactive example
Explore the gradient on a 2D loss surface: hover over any point to see the gradient vector and the steepest-descent direction.
Coming soon