From One Variable to Many
Everything learned about derivatives so far involves functions of a single variable - f(x). But real ML models do not have one parameter. They have thousands. Millions. Billions.
GPT-3 has 175 billion parameters. During training, you need to know how the loss changes with respect to each one. That is where partial derivatives come in.
A partial derivative answers: "If I change just this one parameter, holding everything else fixed, how does the output change?" Same idea as a regular derivative - slope, rate of change - applied to one variable at a time.
The Hilly Landscape Analogy
Picture yourself standing on a mountain. You can look in any direction and see the ground sloping differently depending on which way you face. Looking north, the ground rises steeply. Looking east, it is flat. Looking south, it descends.
The partial derivative ∂f/∂x measures the slope in the x-direction. The partial derivative ∂f/∂y measures the slope in the y-direction. Separate measurements of the same landscape, each focusing on one direction.
This is exactly what "holding other variables constant" means: you are asking about slope in one direction while standing still in all others.
The Computation Rule
To compute ∂f/∂x:
- Look at every term containing x.
- Differentiate those terms normally using the power rule, chain rule, etc.
- Treat every term that does not contain x as a plain constant - it differentiates to zero.
That is it. Partial differentiation is just regular differentiation with a narrowed focus.
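Because a partial derivative is just "nudge one input, freeze the rest," it can be estimated numerically with a finite difference. A minimal sketch - the helper `partial` and the function `g` are hypothetical examples, not from the text:

```python
def partial(f, point, i, h=1e-6):
    """Estimate the partial derivative of f with respect to its i-th
    argument at `point`, holding all other arguments fixed."""
    bumped = list(point)
    bumped[i] += h          # nudge only the i-th variable
    return (f(*bumped) - f(*point)) / h

# Hypothetical two-variable function for illustration
def g(x, y):
    return x * y**2

# d/dx (x*y^2) = y^2 = 9 at (2, 3); d/dy (x*y^2) = 2xy = 12 at (2, 3)
print(round(partial(g, (2.0, 3.0), 0), 3))  # → 9.0
print(round(partial(g, (2.0, 3.0), 1), 3))  # → 12.0
```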
Interactive example
3D surface explorer - click any point to see the partial derivative slopes in x and y directions
Coming soon
Worked Example: f(x, y) = x² + 3xy + y²
Finding ∂f/∂x (treat y as a constant):
- Term x²: differentiate normally, giving 2x
- Term 3xy: y is constant, so this is (3y)·x, derivative 3y
- Term y²: no x anywhere, constant, derivative 0
- x - variable we differentiate w.r.t.
- y - treated as constant
So ∂f/∂x = 2x + 3y.
Finding ∂f/∂y (treat x as a constant):
- Term x²: no y, constant, derivative 0
- Term 3xy: x is constant, so this is (3x)·y, derivative 3x
- Term y²: differentiate normally, giving 2y
- y - variable we differentiate w.r.t.
- x - treated as constant
So ∂f/∂y = 3x + 2y.
Evaluating at (x, y) = (1, 2):
- Result: ∂f/∂x = 2(1) + 3(2) = 8
- Result: ∂f/∂y = 3(1) + 2(2) = 7
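The hand computation can be double-checked numerically with central differences. A short sketch using f(x, y) = x² + 3xy + y² as the concrete function:

```python
def f(x, y):
    return x**2 + 3 * x * y + y**2

h = 1e-5
# Central differences: nudge one variable, hold the other fixed
df_dx = (f(1 + h, 2) - f(1 - h, 2)) / (2 * h)  # analytic: 2x + 3y = 8
df_dy = (f(1, 2 + h) - f(1, 2 - h)) / (2 * h)  # analytic: 3x + 2y = 7
print(round(df_dx, 4), round(df_dy, 4))  # → 8.0 7.0
```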
The ∂ Symbol
The curly ∂ (partial) is used instead of a regular d to signal "this is a partial derivative." When you see ∂L/∂w₁, it means: "the partial derivative of the loss with respect to weight w₁, treating all other weights and biases as constants." Pure notation - the mechanics are identical to regular derivatives.
The Gradient: All Partial Derivatives in One Vector
The gradient of f, written ∇f (nabla f), is the vector of all partial derivatives:
∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
- ∇f - gradient vector - one entry per variable
- xᵢ - the i-th input variable
For the worked example f(x, y) = x² + 3xy + y² at (x, y) = (1, 2): ∇f = (8, 7) - a 2D vector.
The Critical Property of the Gradient
The gradient has a remarkable geometric property:
The gradient vector ∇f points in the direction of steepest increase of f.
If you stand at a point on the landscape and walk in the direction of ∇f, you climb as steeply as possible. The magnitude ‖∇f‖ tells you how steep that incline is.
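This "steepest increase" claim can be sanity-checked numerically: probe the function in many directions from a point and see which direction increases it most. A sketch using f(x, y) = x² + 3xy + y² at (1, 2), where the gradient is (8, 7) - any smooth function would work the same way:

```python
import math

def f(x, y):
    return x**2 + 3 * x * y + y**2

# Analytic gradient at (1, 2): (2x + 3y, 3x + 2y) = (8, 7)
gx, gy = 8.0, 7.0

# Probe f in 360 unit directions from (1, 2) with a small step
step = 1e-4
best_angle, best_gain = None, -float("inf")
for k in range(360):
    a = math.radians(k)
    gain = f(1 + step * math.cos(a), 2 + step * math.sin(a)) - f(1, 2)
    if gain > best_gain:
        best_gain, best_angle = gain, a

# The winning direction lines up with the gradient direction
grad_angle = math.atan2(gy, gx)
print(math.degrees(best_angle), math.degrees(grad_angle))  # both ≈ 41°
```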
Conversely: -∇f points in the direction of steepest decrease. If you want to minimize f, move in the direction of the negative gradient. That is gradient descent:
θ ← θ - η∇L(θ)
- θ - parameter vector
- η - learning rate - step size
- L - loss function
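The update rule can be sketched as a short loop. A toy example minimizing the stand-in loss L(θ) = θ₁² + θ₂² (minimum at the origin), with an arbitrary learning rate of 0.1:

```python
theta = [4.0, -2.0]   # parameter vector, deliberately far from the minimum
eta = 0.1             # learning rate

def grad(theta):
    # For L = theta1^2 + theta2^2, the gradient is (2*theta1, 2*theta2)
    return [2 * t for t in theta]

for _ in range(50):
    g = grad(theta)
    # Move against the gradient: theta <- theta - eta * grad L(theta)
    theta = [t - eta * gi for t, gi in zip(theta, g)]

print(theta)  # both entries have shrunk very close to 0
```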
A Concrete Loss Gradient
Single linear model ŷ = wx + b, one training example (x, y) = (2, 3):
- w - weight
- b - bias
- L = (y - ŷ)² - squared-error loss
Let u = y - wx - b, so L = u². By the chain rule, ∂L/∂w = 2u · ∂u/∂w = -2xu and ∂L/∂b = 2u · ∂u/∂b = -2u.
At w = 0, b = 0: u = 3, so ∂L/∂w = -2 · 2 · 3 = -12 and ∂L/∂b = -2 · 3 = -6. Gradient descent with η = 0.1:
- Update: w ← 0 - 0.1 · (-12) = 1.2
- Update: b ← 0 - 0.1 · (-6) = 0.6
After one step, both parameters have moved toward lower loss. Repeat thousands of times - that is training.
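That repetition is easy to sketch as a plain loop, starting from w = 0, b = 0 with the single example (x, y) = (2, 3) and η = 0.1. (By a quirk of these particular numbers, the loop fits the example exactly after the first step.)

```python
w, b = 0.0, 0.0
x, y = 2.0, 3.0
eta = 0.1

for step in range(100):
    u = y - w * x - b    # residual
    dw = -2 * x * u      # ∂L/∂w from the chain rule
    db = -2 * u          # ∂L/∂b from the chain rule
    w -= eta * dw        # step against the gradient
    b -= eta * db

print(f"w = {w:.2f}, b = {b:.2f}, loss = {(y - w * x - b) ** 2:.6f}")
# → w = 1.20, b = 0.60, loss = 0.000000
```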
Scaling to Real Models
In a model with 1,000 weights, the gradient is a 1,000-dimensional vector:
∇L = (∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂w₁₀₀₀)
- ∇L - gradient of loss - vector with one entry per parameter
Each entry tells you how sensitive the loss is to that particular weight. GPT-3 has 175 billion parameters - its gradient is a 175-billion-dimensional vector. Same concept, vastly different scale.
import torch
# Partial derivatives computed automatically for any function
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x, y = 2.0, 3.0
loss = (y - w * x - b) ** 2
loss.backward() # PyTorch chains ∂L/∂u · ∂u/∂w internally
print(f"∂L/∂w = {w.grad.item():.2f}") # → -12.0
print(f"∂L/∂b = {b.grad.item():.2f}") # → -6.0
# For a model with many parameters, the gradient is a vector
params = torch.randn(5, requires_grad=True)
loss2 = (params ** 2).sum() # toy loss: sum of squares (minimum at 0)
loss2.backward()
print("Gradient vector:", params.grad) # each entry = 2 * param
Interactive example
Gradient descent on a function with two local minima - one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.