Math Foundation Derivatives
Lesson 4 ⏱ 10 min

Partial derivatives

Video coming soon

Partial Derivatives and the Gradient Vector

From single-variable to multi-variable: treating other variables as constants, assembling the gradient vector, and why it points uphill.

⏱ ~6 min


Quick refresher

Single-variable derivatives

d/dx f(x) measures how fast f changes as x changes - the slope of the tangent line. Power rule: d/dx xⁿ = nxⁿ⁻¹. Chain rule: d/dx f(g(x)) = f'(g(x))·g'(x).

Example

d/dx (3x²) = 6x.

d/dx (x²+5)³ = 3(x²+5)² · 2x = 6x(x²+5)².
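Both results are easy to sanity-check numerically with a central-difference quotient (a quick sketch in plain Python; the step size h is an arbitrary small number):

```python
# Numerically check the two refresher examples with a central difference.
def deriv(f, x, h=1e-5):
    """Approximate f'(x) with a central-difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: 3 * x**2            # d/dx 3x^2       = 6x
g = lambda x: (x**2 + 5) ** 3     # d/dx (x^2+5)^3  = 6x(x^2+5)^2

x = 2.0
print(deriv(f, x))   # ≈ 6 * 2 = 12
print(deriv(g, x))   # ≈ 6 * 2 * (4 + 5)^2 = 972
```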

From One Variable to Many

Everything learned about derivatives so far involves functions of a single variable, f(x). But real ML models do not have one parameter. They have thousands. Millions. Billions.

GPT-3 has 175 billion parameters. During training, you need to know how the loss changes with respect to each one. That is where partial derivatives come in.

A partial derivative answers: "If I change just this one parameter, holding everything else fixed, how does the output change?" Same idea as a regular derivative - slope, rate of change - applied to one variable at a time.

The Hilly Landscape Analogy

Picture yourself standing on a mountain. You can look in any direction and see the ground sloping differently depending on which way you face. Looking north, the ground rises steeply. Looking east, it is flat. Looking south, it descends.

The partial derivative ∂f/∂x measures the slope in the x-direction. The partial derivative ∂f/∂y measures the slope in the y-direction. Separate measurements of the same landscape, each focusing on one direction.

This is exactly what "holding other variables constant" means: you are asking about slope in one direction while standing still in all others.

The Computation Rule

To compute ∂f/∂x:

  1. Look at every term containing x.
  2. Differentiate those terms normally using power rule, chain rule, etc.
  3. Treat every term that does not contain x as a plain constant - it differentiates to zero.

That is it. Partial differentiation is just regular differentiation with a narrowed focus.

Interactive example (coming soon)

3D surface explorer - click any point to see the partial derivative slopes in the x and y directions.

Worked Example: f(x, y) = 3x² + 2xy + y²

Finding ∂f/∂x (treat y as a constant):

  • Term 3x²: differentiate normally → 6x
  • Term 2xy: y is constant, so this is 2y · x, derivative → 2y
  • Term y²: no x anywhere, constant → 0

∂f/∂x = 6x + 2y

(x: the variable we differentiate with respect to; y: treated as a constant.)

Finding ∂f/∂y (treat x as a constant):

  • Term 3x²: no y anywhere, constant → 0
  • Term 2xy: x is constant, so this is 2x · y, derivative → 2x
  • Term y²: differentiate normally → 2y

∂f/∂y = 2x + 2y

(y: the variable we differentiate with respect to; x: treated as a constant.)

Evaluating at (x = 1, y = 2):

  • ∂f/∂x = 6(1) + 2(2) = 10
  • ∂f/∂y = 2(1) + 2(2) = 6
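These numbers can be checked in code by literally "holding the other variable fixed": nudge one input, keep the other still, and measure the change (a plain-Python sketch using central differences):

```python
# Check the hand-computed partials of f(x, y) = 3x^2 + 2xy + y^2 at (1, 2).
def f(x, y):
    return 3 * x**2 + 2 * x * y + y**2

def partial_x(f, x, y, h=1e-5):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)  # y held fixed

def partial_y(f, x, y, h=1e-5):
    return (f(x, y + h) - f(x, y - h)) / (2 * h)  # x held fixed

grad = [partial_x(f, 1.0, 2.0), partial_y(f, 1.0, 2.0)]
print(grad)  # ≈ [10.0, 6.0], matching 6x + 2y and 2x + 2y at (1, 2)
```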

The ∂ Symbol

The curly ∂ (partial) is used instead of a regular d to signal "this is a partial derivative." When you see ∂L/∂w, it means: "the partial derivative of loss L with respect to weight w, treating all other weights and biases as constants." Pure notation - the mechanics are identical to regular derivatives.

The Gradient: All Partial Derivatives in One Vector

The gradient of f, written ∇f ("nabla f"), is the vector of all partial derivatives:

∇f = [∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ]

(∇f: the gradient vector - one entry per variable; xᵢ: the i-th input variable.)

For the example above at (1, 2): ∇f = [10, 6] - a 2D vector.

The Critical Property of the Gradient

The gradient has a remarkable geometric property:

The gradient vector points in the direction of steepest increase of f.

If you stand at a point on the landscape and walk in the direction of ∇f, you climb as steeply as possible. The magnitude |∇f| tells you how steep that incline is.
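This is easy to verify numerically. Using the worked example f(x, y) = 3x² + 2xy + y² at (1, 2), where the gradient is [10, 6], step the same tiny distance in every direction and see which one increases f the most (a sketch; the 1-degree grid and step size are arbitrary choices):

```python
import math

def f(x, y):
    return 3 * x**2 + 2 * x * y + y**2

eps = 1e-3                # tiny step length
base = f(1.0, 2.0)

# Step eps in each of 360 directions; record the gain in f for each.
gains = {}
for deg in range(360):
    t = math.radians(deg)
    gains[deg] = f(1.0 + eps * math.cos(t), 2.0 + eps * math.sin(t)) - base

best_deg = max(gains, key=gains.get)            # steepest-ascent direction
grad_deg = math.degrees(math.atan2(6.0, 10.0))  # direction of [10, 6]
print(best_deg, round(grad_deg, 1))             # both ≈ 31°
```

The winning direction matches the angle of the gradient vector itself, up to the 1-degree grid resolution.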

Conversely: −∇f points in the direction of steepest decrease. If you want to minimize f, move in the direction of the negative gradient. That is gradient descent:

w ← w − α · ∇L

(w: the parameter vector; α: the learning rate - step size; L: the loss function.)

A Concrete Loss Gradient

Single linear model ŷ = wx + b, one training example (x = 2, y = 3):

L(w, b) = (3 − 2w − b)²

(w: weight; b: bias; L: squared-error loss.)

Let u = 3 − 2w − b, so L = u². By the chain rule, ∂L/∂w = dL/du · ∂u/∂w = 2u · (−2) = −4u, and likewise ∂L/∂b = dL/du · ∂u/∂b = 2u · (−1) = −2u.

At w = 0, b = 0 we have u = 3, so ∇L = [−12, −6]. Gradient descent with α = 0.01:

  • Update: w ← 0 − 0.01 · (−12) = 0.12
  • Update: b ← 0 − 0.01 · (−6) = 0.06

After one step, both parameters have moved toward lower loss. Repeat thousands of times - that is training.
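That loop can be written out in a few lines of plain Python. With u = 3 − 2w − b, the chain rule gives ∂L/∂w = 2u · (−x) and ∂L/∂b = 2u · (−1), so the whole procedure is (a sketch, not a real training loop):

```python
# Gradient descent on L(w, b) = (3 - 2w - b)^2, one training example.
x_val, y_val = 2.0, 3.0
w, b = 0.0, 0.0
alpha = 0.01                          # learning rate

for step in range(1000):
    u = y_val - w * x_val - b         # residual: 3 - 2w - b
    grad_w = -2 * u * x_val           # ∂L/∂w by the chain rule
    grad_b = -2 * u                   # ∂L/∂b by the chain rule
    w -= alpha * grad_w               # w ← w - α·∂L/∂w
    b -= alpha * grad_b               # b ← b - α·∂L/∂b

loss = (y_val - w * x_val - b) ** 2
print(w, b, loss)                     # loss ≈ 0, i.e. 2w + b ≈ 3
```

After the first iteration w = 0.12 and b = 0.06, exactly the hand-computed step above; repeating drives the loss to essentially zero.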

Scaling to Real Models

In a model with 1,000 weights, the gradient is a 1,000-dimensional vector:

∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂w₁₀₀₀]

(∇L: the gradient of the loss - a vector with one entry per parameter.)

Each entry tells you how sensitive the loss is to that particular weight. GPT-3 has 175 billion parameters - its gradient is a 175-billion-dimensional vector. Same concept, vastly different scale.

```python
import torch

# Partial derivatives computed automatically for any function
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x, y = 2.0, 3.0

loss = (y - w * x - b) ** 2
loss.backward()   # PyTorch chains ∂L/∂u · ∂u/∂w internally

print(f"∂L/∂w = {w.grad.item():.2f}")  # → -12.00
print(f"∂L/∂b = {b.grad.item():.2f}")  # → -6.00

# For a model with many parameters, the gradient is a vector
params = torch.randn(5, requires_grad=True)
loss2 = (params ** 2).sum()   # toy loss: sum of squares (minimum at 0)
loss2.backward()
print("Gradient vector:", params.grad)  # each entry = 2 * param
```
Interactive: Gradient Descent on a Non-Convex Function - step through updates from a chosen starting point (e.g. x = 2.2) and watch x, f(x), f'(x), and the step count change.

This function has two local minima — one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.
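The widget's exact curve isn't given in the text; as an illustrative stand-in (an assumption, not the actual function), f(x) = x⁴ − 3x² + x has a deeper minimum near x ≈ −1.30 and a shallower one near x ≈ 1.13, and reproduces the same starting-point sensitivity:

```python
# Gradient descent on a non-convex stand-in function (assumed, not the
# widget's actual curve): two minima, so the start point decides the end.
def f(x):
    return x**4 - 3 * x**2 + x

def df(x):
    return 4 * x**3 - 6 * x + 1   # f'(x)

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * df(x)           # move against the gradient
    return x

print(descend(2.2))    # ≈ 1.13  -> shallower minimum
print(descend(-0.5))   # ≈ -1.30 -> deeper minimum
```

Same update rule, same learning rate, different basin: that is why non-convex training runs can land in different solutions.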

Quiz

1 / 3

For f(x,y) = 3x² + 2y, what is ∂f/∂x?