The Core Idea
You're lost in mountains, trying to find the lowest valley. You can't see the whole map — only the slope of the ground directly under your feet. What do you do? Step in the direction that goes most steeply downhill. Pause. Check the slope again. Step again. Repeat.
That's gradient descent — and it's the core algorithmic idea behind training every modern machine learning model.
The Update Rule
For a single parameter $w$:

$$w \;\leftarrow\; w - \alpha \frac{\partial L}{\partial w}$$

- $w$ - the parameter being updated - changes each iteration
- $\alpha$ - learning rate - small positive number controlling step size
- $\frac{\partial L}{\partial w}$ - gradient - how much $L$ changes per unit change in $w$
For all parameters as a vector:

$$\mathbf{w} \;\leftarrow\; \mathbf{w} - \alpha \nabla L(\mathbf{w})$$

- $\mathbf{w}$ - weight vector (all parameters stacked together)
- $\alpha$ - learning rate - same for every weight in plain gradient descent
- $\nabla L(\mathbf{w})$ - gradient vector - one partial derivative per parameter
Let's dissect every piece:
- $\mathbf{w}$ (left of arrow): the new weight vector after this update
- $\mathbf{w}$ (right of arrow): the current weight vector before this update
- $\leftarrow$: assignment — "set $\mathbf{w}$ to the value on the right"
- $\alpha$: the learning rate — a small positive number, typically 0.001 to 0.1
- $\nabla L(\mathbf{w})$: the gradient — points uphill
- $-\alpha \nabla L(\mathbf{w})$: a small step in the downhill direction
The minus sign is everything. We subtract the gradient, stepping against it — in the direction that decreases the loss.
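To make the update concrete, here is a minimal sketch of a single update step in NumPy; the weights and the loss $L(\mathbf{w}) = \lVert\mathbf{w}\rVert^2$ are illustrative choices, not taken from the text above:

```python
import numpy as np

# One update step for L(w) = ‖w‖², whose gradient is 2w (illustrative numbers).
w = np.array([3.0, -2.0])       # current weights
alpha = 0.1                     # learning rate
grad = 2 * w                    # gradient at w: points uphill
w_new = w - alpha * grad        # step *against* the gradient

print(np.sum(w**2), np.sum(w_new**2))  # loss drops: 13.0 → 8.32
```

Subtracting $\alpha$ times the gradient moved every component a little downhill, and the loss dropped accordingly.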
Worked Example: L = w²
Let's watch gradient descent converge step by step. Loss: $L(w) = w^2$, minimum at $w = 0$.
Gradient: $\frac{dL}{dw} = 2w$. At any point $w$, the gradient is $2w$.
Start: $w_0 = 4$, $\alpha = 0.1$.
- $w_k$ - weight after $k$ updates
- $\alpha$ - learning rate = 0.1
- Compute: $w_1 = 4 - 0.1 \cdot 2(4) = 3.2$, loss = 10.24
- Compute: $w_2 = 3.2 - 0.1 \cdot 2(3.2) = 2.56$, loss = 6.55
- Compute: $w_3 = 2.56 - 0.1 \cdot 2(2.56) = 2.048$, loss = 4.19
- Compute: $w_4 = 2.048 - 0.1 \cdot 2(2.048) = 1.6384$, loss = 2.68
Notice: each step is smaller than the last. As $w$ approaches 0, the gradient shrinks, so the update shrinks. The algorithm naturally slows down and lands gently at the minimum — it doesn't need to know where the minimum is in advance.
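In this example you can make the shrinking explicit: with $\alpha = 0.1$ and gradient $2w$, every update multiplies the weight by the same factor,

$$w_{k+1} = w_k - 0.1 \cdot 2 w_k = 0.8\, w_k, \qquad \text{so} \qquad w_k = 4 \cdot 0.8^{\,k},$$

which decays geometrically toward 0.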
After 20 steps: $w_{20} \approx 0.046$. After 50 steps: $w_{50} \approx 0.00006$. Converging to 0 — the true minimum — without ever knowing where it was.
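To reproduce the numbers above, a few lines of Python suffice (same setup: $L(w) = w^2$, $w_0 = 4$, $\alpha = 0.1$):

```python
w, alpha = 4.0, 0.1                      # start at w₀ = 4 with learning rate 0.1
for k in range(1, 5):
    w = w - alpha * (2 * w)              # update rule with gradient dL/dw = 2w
    print(f"step {k}: w = {w:.4f}, loss = {w**2:.2f}")
# step 1: w = 3.2000, loss = 10.24
# step 2: w = 2.5600, loss = 6.55
# step 3: w = 2.0480, loss = 4.19
# step 4: w = 1.6384, loss = 2.68
```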
For Multiple Parameters
The vector update adjusts all parameters simultaneously:

$$\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \;\leftarrow\; \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} - \alpha \begin{bmatrix} \partial L / \partial w_1 \\ \partial L / \partial w_2 \\ \vdots \\ \partial L / \partial w_n \end{bmatrix}$$

- $w_1, w_2, \dots, w_n$ - individual weight parameters
- $\partial L / \partial w_1$ - partial derivative for parameter 1
Every parameter moves independently, scaled by its own partial derivative. Parameters with large gradients (strongly influencing the loss right now) move more. Parameters with near-zero gradients barely move.
In code: `w = w - alpha * grad` — one line, regardless of whether $\mathbf{w}$ has 3 components or 3 billion.
```python
import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate=0.1, n_steps=50):
    """Run gradient descent for n_steps and return the final parameter value."""
    w = np.array(w_init, dtype=float)
    for step in range(n_steps):
        grad = grad_fn(w)                 # compute gradient ∇L at current w
        w = w - learning_rate * grad      # update rule: w ← w − α · ∇L
    return w

# Example: minimize L(w) = w² (gradient = 2w, minimum at w = 0)
result = gradient_descent(grad_fn=lambda w: 2 * w, w_init=4.0, learning_rate=0.1)
print(f"Converged to w = {float(result):.6f}")  # → w ≈ 0.000057, essentially 0

# Works identically for a vector of parameters — numpy broadcasts automatically:
result_vec = gradient_descent(
    grad_fn=lambda w: 2 * w,              # gradient of L = ‖w‖² is 2w element-wise
    w_init=np.array([3.0, -2.0, 1.5]),
    learning_rate=0.1,
)
print(f"Converged to w = {result_vec}")  # → all components near 0.0
```
The picture changes when the loss has more than one valley. Consider a function with two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
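The function itself isn't reproduced here, so as a stand-in take $L(x) = (x^2 - 1.69)^2 + 0.3x$, an assumed example with a shallow minimum near $x \approx 1.3$ and a deeper one near $x \approx -1.3$. Reusing the gradient_descent function defined above, a quick sketch shows how the starting point decides the outcome:

```python
# Assumed stand-in for the two-minima loss described above (not from the original text):
#   L(x) = (x² − 1.69)² + 0.3x,   dL/dx = 4x(x² − 1.69) + 0.3
grad = lambda x: 4 * x * (x**2 - 1.69) + 0.3

left = gradient_descent(grad, w_init=-0.5, learning_rate=0.05, n_steps=200)
right = gradient_descent(grad, w_init=0.5, learning_rate=0.05, n_steps=200)
print(left, right)  # ≈ -1.32 (the deeper minimum) vs ≈ +1.28 (the shallower one)
```

Same update rule, same learning rate; only the starting point differs, and the two runs settle into different valleys.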