The Core Idea
You're lost in mountains, trying to find the lowest valley. You can't see the whole map — only the slope of the ground directly under your feet. What do you do? Step in the direction that goes most steeply downhill. Pause. Check the slope again. Step again. Repeat.
That's gradient descent — and it's the core algorithmic idea behind training every modern machine learning model.
The Update Rule
For a single parameter $w$:

$$w \;\leftarrow\; w - \alpha \frac{\partial L}{\partial w}$$

- $w$ - the parameter being updated - changes each iteration
- $\alpha$ - learning rate - small positive number controlling step size
- $\frac{\partial L}{\partial w}$ - gradient - how much $L$ changes per unit change in $w$
For all parameters as a vector:

$$\mathbf{w} \;\leftarrow\; \mathbf{w} - \alpha \nabla L(\mathbf{w})$$

- $\mathbf{w}$ - weight vector (all parameters stacked together)
- $\alpha$ - learning rate - same for every weight in plain gradient descent
- $\nabla L(\mathbf{w})$ - gradient vector - one partial derivative per parameter
Let's dissect every piece:
- $\mathbf{w}$ (left of arrow): the new weight vector after this update
- $\mathbf{w}$ (right of arrow): the current weight vector before this update
- $\leftarrow$: assignment — "set $\mathbf{w}$ to the value on the right"
- $\alpha$: the learning rate — a small positive number, typically 0.001 to 0.1
- $\nabla L(\mathbf{w})$: the gradient — points uphill
- $-\alpha \nabla L(\mathbf{w})$: a small step in the downhill direction
The minus sign is everything. We subtract the gradient, stepping against it — in the direction that decreases the loss.
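To make the update concrete, here is a minimal sketch of a single update step in NumPy; the weights and the loss $L(\mathbf{w}) = \lVert\mathbf{w}\rVert^2$ are illustrative choices, not taken from the text above:

```python
import numpy as np

# One update step for L(w) = ‖w‖², whose gradient is 2w (illustrative numbers).
w = np.array([3.0, -2.0])       # current weights
alpha = 0.1                     # learning rate
grad = 2 * w                    # gradient at w: points uphill
w_new = w - alpha * grad        # step *against* the gradient

print(np.sum(w**2), np.sum(w_new**2))  # loss drops: 13.0 → 8.32
```

Subtracting $\alpha$ times the gradient moved every component a little downhill, and the loss dropped accordingly.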
Worked Example: L = w²
Let's watch gradient descent converge step by step. Loss: $L(w) = w^2$, minimum at $w = 0$.
Gradient: $\frac{dL}{dw} = 2w$. At any point $w$, the gradient is $2w$.
Start: $w_0 = 4$, $\alpha = 0.1$.
- $w_k$ - weight after $k$ updates
- $\alpha$ - learning rate = 0.1
- Compute: $w_1 = 4 - 0.1 \cdot 2(4) = 3.2$, loss = 10.24
- Compute: $w_2 = 3.2 - 0.1 \cdot 2(3.2) = 2.56$, loss = 6.55
- Compute: $w_3 = 2.56 - 0.1 \cdot 2(2.56) = 2.048$, loss = 4.19
- Compute: $w_4 = 2.048 - 0.1 \cdot 2(2.048) = 1.6384$, loss = 2.68
Notice: each step is smaller than the last. As $w$ approaches 0, the gradient shrinks, so the update shrinks. The algorithm naturally slows down and lands gently at the minimum — it doesn't need to know where the minimum is in advance.
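In this example you can make the shrinking explicit: with $\alpha = 0.1$ and gradient $2w$, every update multiplies the weight by the same factor,

$$w_{k+1} = w_k - 0.1 \cdot 2 w_k = 0.8\, w_k, \qquad \text{so} \qquad w_k = 4 \cdot 0.8^{\,k},$$

which decays geometrically toward 0.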
After 20 steps: $w_{20} \approx 0.046$. After 50 steps: $w_{50} \approx 0.00006$. Converging to 0 — the true minimum — without ever knowing where it was.
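To reproduce the numbers above, a few lines of Python suffice (same setup: $L(w) = w^2$, $w_0 = 4$, $\alpha = 0.1$):

```python
w, alpha = 4.0, 0.1                      # start at w₀ = 4 with learning rate 0.1
for k in range(1, 5):
    w = w - alpha * (2 * w)              # update rule with gradient dL/dw = 2w
    print(f"step {k}: w = {w:.4f}, loss = {w**2:.2f}")
# step 1: w = 3.2000, loss = 10.24
# step 2: w = 2.5600, loss = 6.55
# step 3: w = 2.0480, loss = 4.19
# step 4: w = 1.6384, loss = 2.68
```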
For Multiple Parameters
The vector update adjusts all parameters simultaneously:

$$\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \;\leftarrow\; \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} - \alpha \begin{bmatrix} \partial L / \partial w_1 \\ \partial L / \partial w_2 \\ \vdots \\ \partial L / \partial w_n \end{bmatrix}$$

- $w_1, w_2, \dots, w_n$ - individual weight parameters
- $\partial L / \partial w_1$ - partial derivative for parameter 1
Every parameter moves independently, scaled by its own partial derivative. Parameters with large gradients (strongly influencing the loss right now) move more. Parameters with near-zero gradients barely move.
In code: `w = w - alpha * grad` — one line, regardless of whether $\mathbf{w}$ has 3 components or 3 billion.
```python
import numpy as np

def gradient_descent(grad_fn, w_init, learning_rate=0.1, n_steps=50):
    """Run gradient descent for n_steps and return the final parameter value."""
    w = np.array(w_init, dtype=float)
    for step in range(n_steps):
        grad = grad_fn(w)                 # compute gradient ∇L at current w
        w = w - learning_rate * grad      # update rule: w ← w − α · ∇L
    return w

# Example: minimize L(w) = w² (gradient = 2w, minimum at w = 0)
result = gradient_descent(grad_fn=lambda w: 2 * w, w_init=4.0, learning_rate=0.1)
print(f"Converged to w = {float(result):.6f}")  # → w ≈ 0.000057, essentially 0

# Works identically for a vector of parameters — numpy broadcasts automatically:
result_vec = gradient_descent(
    grad_fn=lambda w: 2 * w,              # gradient of L = ‖w‖² is 2w element-wise
    w_init=np.array([3.0, -2.0, 1.5]),
    learning_rate=0.1,
)
print(f"Converged to w = {result_vec}")  # → all components near 0.0
```
The picture changes when the loss has more than one valley. Consider a function with two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
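The function itself isn't reproduced here, so as a stand-in take $L(x) = (x^2 - 1.69)^2 + 0.3x$, an assumed example with a shallow minimum near $x \approx 1.3$ and a deeper one near $x \approx -1.3$. Reusing the gradient_descent function defined above, a quick sketch shows how the starting point decides the outcome:

```python
# Assumed stand-in for the two-minima loss described above (not from the original text):
#   L(x) = (x² − 1.69)² + 0.3x,   dL/dx = 4x(x² − 1.69) + 0.3
grad = lambda x: 4 * x * (x**2 - 1.69) + 0.3

left = gradient_descent(grad, w_init=-0.5, learning_rate=0.05, n_steps=200)
right = gradient_descent(grad, w_init=0.5, learning_rate=0.05, n_steps=200)
print(left, right)  # ≈ -1.32 (the deeper minimum) vs ≈ +1.28 (the shallower one)
```

Same update rule, same learning rate; only the starting point differs, and the two runs settle into different valleys.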