From One Variable to Many
Everything learned about derivatives so far involves functions of a single variable - f(x). But real ML models do not have one parameter. They have thousands. Millions. Billions.
GPT-3 has 175 billion parameters. During training, you need to know how the loss changes with respect to each one. That is where partial derivatives come in.
A partial derivative answers: "If I change just this one parameter, holding everything else fixed, how does the output change?" Same idea as a regular derivative - slope, rate of change - applied to one variable at a time.
The Hilly Landscape Analogy
Picture yourself standing on a mountain. You can look in any direction and see the ground sloping differently depending on which way you face. Looking north, the ground rises steeply. Looking east, it is flat. Looking south, it descends.
The partial derivative ∂f/∂x measures the slope in the x-direction. The partial derivative ∂f/∂y measures the slope in the y-direction. Separate measurements of the same landscape, each focusing on one direction.
This is exactly what "holding other variables constant" means: you are asking about slope in one direction while standing still in all others.
The Computation Rule
To compute ∂f/∂x:
- Look at every term containing x.
- Differentiate those terms normally using the power rule, chain rule, etc.
- Treat every term that does not contain x as a plain constant - it differentiates to zero.
That is it. Partial differentiation is just regular differentiation with a narrowed focus.
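Because a partial derivative is just "nudge one input, freeze the rest," it can be estimated numerically with a finite difference. A minimal sketch - the helper `partial` and the function `g` are hypothetical examples, not from the text:

```python
def partial(f, point, i, h=1e-6):
    """Estimate the partial derivative of f with respect to its i-th
    argument at `point`, holding all other arguments fixed."""
    bumped = list(point)
    bumped[i] += h          # nudge only the i-th variable
    return (f(*bumped) - f(*point)) / h

# Hypothetical two-variable function for illustration
def g(x, y):
    return x * y**2

# d/dx (x*y^2) = y^2 = 9 at (2, 3); d/dy (x*y^2) = 2xy = 12 at (2, 3)
print(round(partial(g, (2.0, 3.0), 0), 3))  # → 9.0
print(round(partial(g, (2.0, 3.0), 1), 3))  # → 12.0
```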
Interactive example
3D surface explorer - click any point to see the partial derivative slopes in x and y directions
Coming soon
Worked Example: f(x, y) = x² + 3xy + y²
Finding ∂f/∂x (treat y as a constant):
- Term x²: differentiate normally, giving 2x
- Term 3xy: y is constant, so this is (3y)·x, derivative 3y
- Term y²: no x anywhere, constant, derivative 0
- x - variable we differentiate w.r.t.
- y - treated as constant
So ∂f/∂x = 2x + 3y.
Finding ∂f/∂y (treat x as a constant):
- Term x²: no y, constant, derivative 0
- Term 3xy: x is constant, so this is (3x)·y, derivative 3x
- Term y²: differentiate normally, giving 2y
- y - variable we differentiate w.r.t.
- x - treated as constant
So ∂f/∂y = 3x + 2y.
Evaluating at (x, y) = (1, 2):
- Result: ∂f/∂x = 2(1) + 3(2) = 8
- Result: ∂f/∂y = 3(1) + 2(2) = 7
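The hand computation can be double-checked numerically with central differences. A short sketch using f(x, y) = x² + 3xy + y² as the concrete function:

```python
def f(x, y):
    return x**2 + 3 * x * y + y**2

h = 1e-5
# Central differences: nudge one variable, hold the other fixed
df_dx = (f(1 + h, 2) - f(1 - h, 2)) / (2 * h)  # analytic: 2x + 3y = 8
df_dy = (f(1, 2 + h) - f(1, 2 - h)) / (2 * h)  # analytic: 3x + 2y = 7
print(round(df_dx, 4), round(df_dy, 4))  # → 8.0 7.0
```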
The ∂ Symbol
The curly ∂ (partial) is used instead of a regular d to signal "this is a partial derivative." When you see ∂L/∂w₁, it means: "the partial derivative of the loss with respect to weight w₁, treating all other weights and biases as constants." Pure notation - the mechanics are identical to regular derivatives.
The Gradient: All Partial Derivatives in One Vector
The gradient of f, written ∇f (nabla f), is the vector of all partial derivatives:
∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
- ∇f - gradient vector - one entry per variable
- xᵢ - the i-th input variable
For the worked example f(x, y) = x² + 3xy + y² at (x, y) = (1, 2): ∇f = (8, 7) - a 2D vector.
The Critical Property of the Gradient
The gradient has a remarkable geometric property:
The gradient vector ∇f points in the direction of steepest increase of f.
If you stand at a point on the landscape and walk in the direction of ∇f, you climb as steeply as possible. The magnitude ‖∇f‖ tells you how steep that incline is.
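This "steepest increase" claim can be sanity-checked numerically: probe the function in many directions from a point and see which direction increases it most. A sketch using f(x, y) = x² + 3xy + y² at (1, 2), where the gradient is (8, 7) - any smooth function would work the same way:

```python
import math

def f(x, y):
    return x**2 + 3 * x * y + y**2

# Analytic gradient at (1, 2): (2x + 3y, 3x + 2y) = (8, 7)
gx, gy = 8.0, 7.0

# Probe f in 360 unit directions from (1, 2) with a small step
step = 1e-4
best_angle, best_gain = None, -float("inf")
for k in range(360):
    a = math.radians(k)
    gain = f(1 + step * math.cos(a), 2 + step * math.sin(a)) - f(1, 2)
    if gain > best_gain:
        best_gain, best_angle = gain, a

# The winning direction lines up with the gradient direction
grad_angle = math.atan2(gy, gx)
print(math.degrees(best_angle), math.degrees(grad_angle))  # both ≈ 41°
```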
Conversely: -∇f points in the direction of steepest decrease. If you want to minimize f, move in the direction of the negative gradient. That is gradient descent:
θ ← θ - η∇L(θ)
- θ - parameter vector
- η - learning rate - step size
- L - loss function
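The update rule can be sketched as a short loop. A toy example minimizing the stand-in loss L(θ) = θ₁² + θ₂² (minimum at the origin), with an arbitrary learning rate of 0.1:

```python
theta = [4.0, -2.0]   # parameter vector, deliberately far from the minimum
eta = 0.1             # learning rate

def grad(theta):
    # For L = theta1^2 + theta2^2, the gradient is (2*theta1, 2*theta2)
    return [2 * t for t in theta]

for _ in range(50):
    g = grad(theta)
    # Move against the gradient: theta <- theta - eta * grad L(theta)
    theta = [t - eta * gi for t, gi in zip(theta, g)]

print(theta)  # both entries have shrunk very close to 0
```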
A Concrete Loss Gradient
Single linear model ŷ = wx + b, one training example (x, y) = (2, 3):
- w - weight
- b - bias
- L = (y - ŷ)² - squared-error loss
Let u = y - wx - b, so L = u². By the chain rule, ∂L/∂w = 2u · ∂u/∂w = -2xu and ∂L/∂b = 2u · ∂u/∂b = -2u.
At w = 0, b = 0: u = 3, so ∂L/∂w = -2 · 2 · 3 = -12 and ∂L/∂b = -2 · 3 = -6. Gradient descent with η = 0.1:
- Update: w ← 0 - 0.1 · (-12) = 1.2
- Update: b ← 0 - 0.1 · (-6) = 0.6
After one step, both parameters have moved toward lower loss. Repeat thousands of times - that is training.
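That repetition is easy to sketch as a plain loop, starting from w = 0, b = 0 with the single example (x, y) = (2, 3) and η = 0.1. (By a quirk of these particular numbers, the loop fits the example exactly after the first step.)

```python
w, b = 0.0, 0.0
x, y = 2.0, 3.0
eta = 0.1

for step in range(100):
    u = y - w * x - b    # residual
    dw = -2 * x * u      # ∂L/∂w from the chain rule
    db = -2 * u          # ∂L/∂b from the chain rule
    w -= eta * dw        # step against the gradient
    b -= eta * db

print(f"w = {w:.2f}, b = {b:.2f}, loss = {(y - w * x - b) ** 2:.6f}")
# → w = 1.20, b = 0.60, loss = 0.000000
```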
Scaling to Real Models
In a model with 1,000 weights, the gradient is a 1,000-dimensional vector:
∇L = (∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂w₁₀₀₀)
- ∇L - gradient of loss - vector with one entry per parameter
Each entry tells you how sensitive the loss is to that particular weight. GPT-3 has 175 billion parameters - its gradient is a 175-billion-dimensional vector. Same concept, vastly different scale.
import torch
# Partial derivatives computed automatically for any function
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
x, y = 2.0, 3.0
loss = (y - w * x - b) ** 2
loss.backward() # PyTorch chains ∂L/∂u · ∂u/∂w internally
print(f"∂L/∂w = {w.grad.item():.2f}") # → -12.0
print(f"∂L/∂b = {b.grad.item():.2f}") # → -6.0
# For a model with many parameters, the gradient is a vector
params = torch.randn(5, requires_grad=True)
loss2 = (params ** 2).sum() # toy loss: sum of squares (minimum at 0)
loss2.backward()
print("Gradient vector:", params.grad) # each entry = 2 * param
Interactive example
Gradient descent on a function with two local minima - one near x ≈ -1.3 (deeper) and one near x ≈ 1.3. Where gradient descent ends up depends on the starting point and learning rate.