Gradient Descent
Lesson 3 ⏱ 14 min

The update rule: w ← w − α∇L


The Update Rule - One Step at a Time Down the Loss Surface

Step-by-step dissection of w ← w - α·∇L, worked numeric example on a parabola, and why gradient descent naturally slows near the minimum.

⏱ ~8 min


Quick refresher

The gradient points uphill

The gradient ∇L is a vector pointing in the direction that increases L the most. To minimize L, we step in the direction -∇L.

Example

At w = 4, if ∂L/∂w = 8, the gradient is positive, so we should decrease w to decrease L.

Update: w ← 4 - α·8.
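In code, that single refresher update looks like this (a quick sketch, with α = 0.1 chosen for illustration):

```python
# One gradient-descent update from the refresher: w = 4, dL/dw = 8
w, grad, alpha = 4.0, 8.0, 0.1

w_new = w - alpha * grad   # w ← w − α · ∂L/∂w
print(w_new)               # a step downhill: 4 − 0.8
```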

The Core Idea

You're lost in mountains, trying to find the lowest valley. You can't see the whole map — only the slope of the ground directly under your feet. What do you do? Step in the direction that goes most steeply downhill. Pause. Check the slope again. Step again. Repeat.

That's gradient descent — and it's the core algorithmic idea behind training every modern machine learning model.

The Update Rule

For a single parameter w:

w ← w − α · ∂L/∂w

  • w: the parameter being updated - changes each iteration
  • α: learning rate - small positive number controlling step size
  • ∂L/∂w: gradient - how much L changes per unit change in w

For all parameters as a vector:

w ← w − α · ∇L

  • w: weight vector (all parameters stacked together)
  • α: learning rate - same for every weight in plain gradient descent
  • ∇L: gradient vector - one partial derivative per parameter

Let's dissect every piece:

  • w (left of arrow): the new weight vector after this update
  • w (right of arrow): the current weight vector before this update
  • ←: assignment — "set w to the value on the right"
  • α: the learning rate — a small positive number, typically 0.001 to 0.1
  • ∇L: the gradient — points uphill
  • −α·∇L: a small step in the downhill direction

The minus sign is everything. We subtract the gradient, stepping against it — in the direction that decreases the loss.
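To see just how much the sign matters, flip it (a minimal sketch on L = w², where the gradient is 2w): adding α·∇L steps uphill, and w grows every iteration instead of shrinking.

```python
# Descent vs. ascent on L = w^2 (gradient = 2w), starting at w = 4
alpha = 0.1
w_down, w_up = 4.0, 4.0
for _ in range(5):
    w_down = w_down - alpha * (2 * w_down)  # minus: step against the gradient
    w_up   = w_up   + alpha * (2 * w_up)    # plus: step with the gradient

print(w_down)  # shrinks toward 0: 4 * 0.8^5 ≈ 1.31
print(w_up)    # blows up: 4 * 1.2^5 ≈ 9.95
```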

Worked Example: L = w²

Let's watch gradient descent converge step by step. Loss: L = w², minimum at w = 0.

Gradient: ∂L/∂w = 2w. At any point w, the gradient is 2w.

Start: w_0 = 4, α = 0.1.

w_{k+1} = w_k − α · 2w_k = w_k(1 − 2α) = 0.8 · w_k

  • w_k: weight after k updates
  • α: learning rate = 0.1

  1. Compute: w_1 = 4 − 0.1·8 = 4 − 0.8 = 3.2, loss = 10.24
  2. Compute: w_2 = 3.2 − 0.1·6.4 = 3.2 − 0.64 = 2.56, loss = 6.55
  3. Compute: w_3 = 2.56 − 0.1·5.12 = 2.048, loss = 4.19
  4. Compute: w_4 = 2.048 − 0.1·4.096 = 1.638, loss = 2.68

Notice: each step is smaller than the last. As w approaches 0, the gradient 2w shrinks, so the update shrinks. The algorithm naturally slows down and lands gently at the minimum — it doesn't need to know where the minimum is in advance.

After 20 steps: w ≈ 0.046. After 50 steps: w ≈ 0.00006. Converging to 0 — the true minimum — without ever knowing where it was.
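The whole trajectory takes a few lines to check (a sketch that also verifies the closed form w_k = 4 · 0.8^k derived above):

```python
# Replay the worked example: L = w^2, gradient 2w, w_0 = 4, alpha = 0.1
alpha, w = 0.1, 4.0
for k in range(1, 51):
    w = w - alpha * (2 * w)  # each update multiplies w by (1 - 2*alpha) = 0.8
    if k in (1, 2, 3, 4, 20, 50):
        print(f"step {k:2d}: w = {w:.5f}, loss = {w*w:.5f}")
    assert abs(w - 4 * 0.8**k) < 1e-9  # matches w_k = 4 * 0.8^k
```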

For Multiple Parameters

The vector update adjusts all parameters simultaneously:

[w_1, w_2, w_3] ← [w_1, w_2, w_3] − α · [∂L/∂w_1, ∂L/∂w_2, ∂L/∂w_3]

  • w_1, w_2, w_3: individual weight parameters
  • ∂L/∂w_1: partial derivative for parameter 1

Every parameter moves independently, scaled by its own partial derivative. Parameters with large gradients (strongly influencing the loss right now) move more. Parameters with near-zero gradients barely move.

In code: w = w - alpha * grad — one line, regardless of whether w has 3 components or 3 billion.

import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate=0.1, n_steps=50):
    """Run gradient descent for n_steps and return the final parameter value."""
    w = np.array(w_init, dtype=float)
    for step in range(n_steps):
        grad = grad_fn(w)             # compute gradient ∇L at current w
        w = w - learning_rate * grad  # update rule: w ← w − α · ∇L
    return w

# Example: minimize L(w) = w²   (gradient = 2w,  minimum at w = 0)
result = gradient_descent(grad_fn=lambda w: 2 * w, w_init=4.0, learning_rate=0.1)
print(f"Converged to w = {result:.6f}")   # → w ≈ 0.000000

# Works identically for a vector of parameters — numpy broadcasts automatically:
result_vec = gradient_descent(
    grad_fn=lambda w: 2 * w,                    # gradient of L = ‖w‖² is 2w element-wise
    w_init=np.array([3.0, -2.0, 1.5]),
    learning_rate=0.1,
)
print(f"Converged to w = {result_vec}")          # → all components near 0.0
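One consequence worth seeing concretely (a small sketch, not from the lesson): with an uneven loss such as L = 10·w₁² + 0.1·w₂², the first parameter has a gradient 100× larger than the second, so under the same learning rate it converges far faster while the second barely moves.

```python
import numpy as np

# L(w) = 10*w1^2 + 0.1*w2^2  →  gradient = [20*w1, 0.2*w2]
grad = lambda w: np.array([20.0, 0.2]) * w

w = np.array([1.0, 1.0])
for _ in range(20):
    w = w - 0.04 * grad(w)  # alpha = 0.04 keeps the steep direction stable

print(w)  # w1 has nearly vanished; w2 has barely moved
```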
Interactive: Gradient Descent on a Non-Convex Function

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
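The lesson's exact function isn't reproduced here, but the basin-dependence is easy to demonstrate with any double-well function (a sketch using f(x) = x⁴ − 2x², which has minima at x = ±1; unlike the lesson's function its two wells are equally deep, but the starting point still decides where you land):

```python
def descend(x, alpha=0.05, n_steps=200):
    """Plain gradient descent on f(x) = x^4 - 2x^2 (gradient = 4x^3 - 4x)."""
    for _ in range(n_steps):
        x = x - alpha * (4 * x**3 - 4 * x)
    return x

print(descend(0.5))    # starts right of 0 → lands near +1
print(descend(-0.5))   # starts left of 0  → lands near -1
```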

Quiz

1 / 3

In w ← w − α·∇L, the learning rate α controls...