The Power Rule: Unlocking Polynomial Derivatives

Step-by-step walkthrough of the power rule, coefficient rule, and sum rule. Ends with a concrete ML loss-gradient example.

Quick refresher

What a derivative measures

The derivative f'(x) measures the slope of f at a given point - how fast the function is changing. Positive = increasing, negative = decreasing, zero = flat.

Example

For f(x) = x², f'(x) = 2x.

At x=3, the slope is 6.
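If you want to sanity-check that slope numerically, a central finite difference gets close. A minimal sketch in plain Python (the step size h = 1e-6 is an arbitrary choice, not part of the lesson):

def f(x):
    return x ** 2

h = 1e-6
x = 3.0
print((f(x + h) - f(x - h)) / (2 * h))   # ≈ 6.0, matching f'(3) = 2·3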

The One Rule That Unlocks Most of Calculus

The power rule is the workhorse of differentiation. Learn it once and you can differentiate any polynomial instantly.

\frac{d}{dx} x^n = n \cdot x^{n-1}

where n is the exponent (any real number) and x is the variable.

The exponent slides down to become a coefficient, and the original exponent decreases by one. That is the entire rule.

Every ML loss function you will use — mean squared error, regularization terms, polynomial activations — is built from expressions the power rule can differentiate exactly. Learning this rule once makes gradient computation for all of them trivial.
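If you have SymPy available, you can also confirm the rule symbolically for a general exponent. This is just a sketch, not part of the lesson:

import sympy as sp

# Symbolic check of the power rule for a general exponent n (assumes sympy is installed).
x, n = sp.symbols('x n', positive=True)
lhs = sp.diff(x**n, x)            # what sympy computes
rhs = n * x**(n - 1)              # what the power rule predicts
print(sp.simplify(lhs - rhs))     # 0: the two agree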

Four Core Examples

Example 1: \frac{d}{dx} x^3

Bring the 3 down, subtract 1: 3 \cdot x^{3-1} = 3x^2.

Example 2: \frac{d}{dx} x^4 = 4x^3

Example 3: \frac{d}{dx} x = \frac{d}{dx} x^1 = 1 \cdot x^0 = 1

The derivative of x is always 1. Makes sense: f(x) = x is a straight line with slope 1.

Example 4: \frac{d}{dx} \sqrt{x} = \frac{d}{dx} x^{0.5}

\frac{d}{dx} x^{0.5} = 0.5 \cdot x^{-0.5} = \frac{1}{2\sqrt{x}}

where x^{0.5} is the square root written as a power.

The power rule works for any real exponent - fractional, negative, or zero.
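As a numeric cross-check, PyTorch's autograd (used again at the end of this lesson) applies these same rules. A rough sketch evaluating all four examples at x = 4 (the point is an arbitrary choice; any x > 0 works):

import torch

# Check each example's derivative at x = 4 with autograd.
x = torch.tensor(4.0, requires_grad=True)
cases = [
    (lambda t: t ** 3,        "x^3",     3 * 4.0 ** 2),      # 48
    (lambda t: t ** 4,        "x^4",     4 * 4.0 ** 3),      # 256
    (lambda t: t,             "x",       1.0),
    (lambda t: torch.sqrt(t), "sqrt(x)", 0.5 * 4.0 ** -0.5), # 0.25
]
for f, name, expected in cases:
    x.grad = None                      # clear the previous gradient
    f(x).backward()
    print(f"{name}: autograd {x.grad.item():.3f} vs power rule {expected:.3f}")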

Interactive demo (drag the point, watch the tangent line): the orange dashed line is the tangent to f(x) = x²; its slope is the derivative f'(x) = 2x. At x = 0 the derivative is 0: that's the minimum.

The Constant Rule

The derivative of any standalone constant is zero:

\frac{d}{dx}\thinspace c = 0

where c is any constant number.

Why? A constant function is a flat horizontal line. Slope = 0. There is nothing to change, so the rate of change is zero.
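A tiny symbolic check (a sketch, assuming SymPy is installed):

import sympy as sp

x = sp.symbols('x')
print(sp.diff(7, x))       # 0: a standalone constant has zero rate of change
print(sp.diff(sp.pi, x))   # 0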

The Coefficient Rule: Constants Slide Out

The coefficient rule says a constant factor passes cleanly through differentiation:

\frac{d}{dx}(c \cdot x^n) = c \cdot n \cdot x^{n-1}

where c is the constant coefficient and n is the exponent.

Examples:

  • 5x²: 5 \cdot 2x = 10x
  • 3x⁴: 3 \cdot 4x^3 = 12x^3
  • -2x³: -2 \cdot 3x^2 = -6x^2
  • 100x: 100 \cdot 1 = 100

The intuition: if a function is scaled by a constant, its rate of change is scaled by the same constant.
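The same bullets, verified symbolically (again a sketch assuming SymPy is available):

import sympy as sp

x = sp.symbols('x')
print(sp.diff(5 * x**2, x))    # 10*x
print(sp.diff(3 * x**4, x))    # 12*x**3
print(sp.diff(-2 * x**3, x))   # -6*x**2
print(sp.diff(100 * x, x))     # 100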

The Sum Rule: Differentiate Term by Term

The sum rule says each term of a polynomial can be differentiated independently:

\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)
where f(x) and g(x) are the two functions being added.

Full example:

\frac{d}{dx}(4x^5 - 3x^2 + 7x - 2) = 20x^4 - 6x + 7

Walk each term: 4 \cdot 5x^4 = 20x^4, -3 \cdot 2x = -6x, 7 \cdot 1 = 7, and -2 \to 0 (constant).
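The whole polynomial in one call (a SymPy sketch), matching the term-by-term result above:

import sympy as sp

x = sp.symbols('x')
print(sp.diff(4*x**5 - 3*x**2 + 7*x - 2, x))   # 20*x**4 - 6*x + 7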

A Real ML Example: Differentiating a Loss Function

Your model predicts \hat{y} = wx + b with input x = 2, true label y = 3, and squared-error loss:

L(w, b) = (y - \hat{y})^2 = (3 - 2w - b)^2

where w is the weight parameter, b is the bias parameter, and L is the scalar loss value.

Substitute \hat{y} = wx + b = 2w + b and write k = 3 - b as shorthand, so the loss expands to a polynomial in w:

L = (k - 2w)^2 = k^2 - 4kw + 4w^2

Now differentiate with respect to w using the sum rule (one term at a time):

\frac{\partial L}{\partial w} = \underbrace{\frac{d}{dw}(k^2)}_{0} + \underbrace{\frac{d}{dw}(-4kw)}_{-4k} + \underbrace{\frac{d}{dw}(4w^2)}_{8w} = -4k + 8w = -4(3 - b) + 8w

where k = 3 - b is treated as a constant with respect to w.

At w = 0.5,\thinspace b = 0: \partial L/\partial w = -12 + 4 = -8.

The gradient-descent update with learning rate \alpha:

w \leftarrow w - \alpha \cdot (-8) = w + 8\alpha

where \alpha is the learning rate and w is the current weight.

The negative gradient pushes w upward - exactly the direction that reduces this loss. That is learning.
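To see that play out, here is a rough sketch of a few gradient-descent steps using the hand-derived gradient, with b held at 0 and an arbitrary learning rate alpha = 0.1:

# Repeated updates with the analytic gradient dL/dw = -4*(3 - b) + 8*w (b fixed at 0).
w, b, alpha = 0.5, 0.0, 0.1
for step in range(5):
    grad_w = -4 * (3 - b) + 8 * w          # the formula derived above
    w = w - alpha * grad_w                 # gradient-descent update
    loss = (3 - 2 * w - b) ** 2
    print(f"step {step}: w = {w:.3f}, loss = {loss:.4f}")
# w climbs from 0.5 toward 1.5, where the prediction 2w equals y = 3 and the loss reaches 0.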

With these four rules you can differentiate any polynomial. Most ML loss functions - MSE, L2 regularization - are polynomials or closely related. The next lessons extend this to nested functions (chain rule) and multi-variable functions (partial derivatives).

import torch

# PyTorch applies all these rules automatically via autograd
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

x, y = 2.0, 3.0
loss = (y - w * x - b) ** 2   # polynomial in w and b

loss.backward()
print(f"∂L/∂w = {w.grad.item():.2f}")   # → -8.0  (matches our hand calc: -4(3-0)+8·0.5)
print(f"∂L/∂b = {b.grad.item():.2f}")   # → -6.0
