Linear Regression
Lesson 3 ⏱ 14 min

Video: Finding the minimum with calculus (coming soon)

Finding the Minimum: Derivatives, Gradients, and the Analytical Solution

Step-by-step derivation of the optimal weights for single-feature linear regression by setting partial derivatives to zero and solving the resulting system.

⏱ ~9 min

🧮 Quick refresher

Derivatives and finding minimums

To find the minimum of a function, take its derivative and set it to zero. The derivative is zero at the minimum (the function is flat there).

Example

For L = w², dL/dw = 2w.

Set 2w = 0 → w = 0 is the minimum.
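A quick numerical sanity check of the refresher (a minimal Python sketch, not part of the lesson's own material): a centered finite difference approximates dL/dw for L = w² and confirms the slope vanishes at the minimum w = 0.

```python
# Finite-difference check: for L = w^2, dL/dw = 2w, which is 0 at w = 0.

def loss(w):
    return w ** 2

def numerical_derivative(f, w, h=1e-6):
    # Centered finite difference: (f(w + h) - f(w - h)) / (2h)
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(w, numerical_derivative(loss, w))  # prints approximately 2w for each w
```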

The Mathematical Strategy

We want to find the values of w and b that minimize the MSE loss.

This is the lesson where calculus earns its place in machine learning. Finding the parameters that minimize the loss is a calculus problem: the derivative tells gradient descent which direction to step, and setting it to zero and solving is how the normal equation finds the exact optimum in one shot.

L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - wx_i - b)^2

where:
L is the MSE loss, the average squared residual over all training examples
w is the weight (slope)
b is the bias (intercept)
y_i is the true label for example i
x_i is the input feature value for example i

There is a powerful observation from calculus: at the minimum of any smooth function, the derivative is zero. Think physically: at the bottom of a bowl, the surface is flat. The slope is zero.

So we can find the minimum by computing partial derivatives and solving where they equal zero. With two parameters, we take partial derivatives with respect to each:

\frac{\partial L}{\partial w} = 0 \qquad \text{and} \qquad \frac{\partial L}{\partial b} = 0

where ∂L/∂w is the partial derivative of the loss with respect to the weight w, and ∂L/∂b is the partial derivative of the loss with respect to the bias b.

Two equations, two unknowns. Solvable.
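Before differentiating, it can help to see the loss itself as code. Here is a minimal sketch (assuming NumPy and a small synthetic dataset invented purely for illustration) of the MSE loss L(w, b) that the rest of the lesson minimizes.

```python
import numpy as np

# Synthetic data for illustration: y is roughly 3x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

def mse_loss(w, b, x, y):
    """L(w, b) = (1/n) * sum of (y_i - w*x_i - b)^2 over all examples."""
    residuals = y - (w * x + b)
    return np.mean(residuals ** 2)

print(mse_loss(3.0, 1.0, x, y))  # small: close to the noise variance
print(mse_loss(0.0, 0.0, x, y))  # much larger: a poor fit
```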

Computing ∂L/∂w

Focus on one training example first to build intuition. For example i:

L_i = (y_i - wx_i - b)^2

Define the residual e_i = y_i - wx_i - b, so L_i = e_i^2.

Apply the chain rule — derivative of the outside (squaring) times derivative of the inside (residual).

First, differentiate the inside. Since e_i = y_i - wx_i - b, differentiate each term with respect to w: y_i does not contain w (gives 0), the term -wx_i gives -x_i, and -b does not contain w (gives 0). So:

\frac{\partial e_i}{\partial w} = \frac{\partial}{\partial w}(y_i) - \frac{\partial}{\partial w}(wx_i) - \frac{\partial}{\partial w}(b) = 0 - x_i - 0 = -x_i

Now chain with the outer squaring function, whose derivative with respect to e_i is 2e_i:

\frac{\partial L_i}{\partial w} = 2e_i \cdot \frac{\partial e_i}{\partial w} = 2e_i \cdot (-x_i) = -2x_i e_i

where e_i is the residual for example i, ∂e_i/∂w describes how the residual changes when w changes, and the chain rule multiplies the derivative of the outer function by the derivative of the inner function.

The chain rule is what connects the outer squaring operation to the inner linear model.

Substituting back and averaging over all n examples:

\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i(y_i - \hat{y}_i)

where:
n is the number of training examples
x_i is the input for example i
y_i is the true label for example i
ŷ_i = wx_i + b is the prediction for example i

Interpretation: up to the factor of -2/n, the gradient is a weighted sum of residuals, where each residual is weighted by its input x_i. If large-input examples are consistently underpredicted (positive residuals), this sum is large and the gradient is strongly negative - telling us to increase w.
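As a sanity check on the derivation, here is a minimal sketch (assuming NumPy and the same kind of synthetic data as above) that evaluates this formula and compares it against a finite-difference estimate of ∂L/∂w.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

def mse_loss(w, b):
    return np.mean((y - (w * x + b)) ** 2)

def grad_w(w, b):
    """Analytic dL/dw = -(2/n) * sum of x_i * (y_i - y_hat_i)."""
    residuals = y - (w * x + b)
    return -2.0 * np.mean(x * residuals)

w, b, h = 2.0, 0.5, 1e-6
numeric = (mse_loss(w + h, b) - mse_loss(w - h, b)) / (2 * h)
print(grad_w(w, b), numeric)  # the two estimates should agree closely
```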

Computing ∂L/∂b

Identical structure. Since e_i = y_i - wx_i - b, differentiate each term with respect to b: y_i gives 0, -wx_i gives 0, and -b gives -1. So ∂e_i/∂b = -1. The chain rule then gives:

\frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)

The gradient with respect to b is simply -2 times the average residual. If we are consistently predicting too low (positive mean residual), this tells us to increase b to shift all predictions upward. Intuitively, b controls the overall level of the predictions.
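The same check works for the bias gradient; a minimal sketch under the same assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

def mse_loss(w, b):
    return np.mean((y - (w * x + b)) ** 2)

def grad_b(w, b):
    """Analytic dL/db = -(2/n) * sum of (y_i - y_hat_i)."""
    return -2.0 * np.mean(y - (w * x + b))

w, b, h = 2.0, 0.5, 1e-6
numeric = (mse_loss(w, b + h) - mse_loss(w, b - h)) / (2 * h)
print(grad_b(w, b), numeric)  # should match to several decimal places
```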

Setting to Zero and Solving

Setting ∂L/∂w = 0:

\sum_i x_i(y_i - wx_i - b) = 0

\sum_i x_i y_i = w\sum_i x_i^2 + b\sum_i x_i \qquad \text{(equation 1)}

where \sum_i x_i y_i is the cross-sum of inputs and outputs, \sum_i x_i^2 is the sum of squared inputs, \sum_i x_i is the sum of inputs, and n is the number of examples.

Setting ∂L/∂b = 0:

\sum_i y_i = w\sum_i x_i + nb \qquad \text{(equation 2)}

Two equations, two unknowns. Here is the step-by-step algebra.

Step 1 - Solve for b from equation (2):

nb = \sum_i y_i - w\sum_i x_i \implies b = \frac{\sum_i y_i - w\sum_i x_i}{n}

Step 2 - Substitute into equation (1), replacing b:

\sum_i x_i y_i = w\sum_i x_i^2 + \frac{\sum_i y_i - w\sum_i x_i}{n}\cdot\sum_i x_i

Step 3 - Multiply both sides by n to clear the fraction:

n\sum_i x_i y_i = nw\sum_i x_i^2 + \left(\sum_i y_i - w\sum_i x_i\right)\sum_i x_i

Step 4 - Expand and collect the terms in w on the right:

n\sum_i x_i y_i - \sum_i y_i \sum_i x_i = w\left(n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right)

Step 5 - Divide both sides to isolate w:

w^* = \frac{n\sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad b^* = \frac{\sum_i y_i - w^*\sum_i x_i}{n}

where w^* is the optimal weight (the value that minimizes the MSE loss) and b^* is the optimal bias.

These formulas give the exact optimal parameters. Compute them once from the data - no iteration, no learning rate, no approximation.
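Here is a minimal sketch of those formulas in code (assuming NumPy and synthetic data invented for illustration), with NumPy's own least-squares fit as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

# Closed-form optimal parameters from the derivation above.
w_star = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_star = (sum_y - w_star * sum_x) / n

print(w_star, b_star)           # close to the true slope 3 and intercept 1
print(np.polyfit(x, y, deg=1))  # NumPy's least-squares fit gives the same pair
```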

The Power and Limits of This Approach

Power: this is an analytic (closed-form) solution. Given your data, evaluate the formulas once and you immediately have the optimal parameters. Exact, with no iteration. It works for any single-feature dataset as long as the inputs are not all identical (otherwise the denominator is zero).

The two limits below are the fundamental limitations of the analytic approach.

Limit - scale: for p features, the solution generalizes to the Normal Equation \mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. But inverting a p \times p matrix costs O(p^3). For p = 10,000 that is about 10^12 operations; for p = 1,000,000, about 10^18. Completely intractable.
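A minimal sketch of the multi-feature case (assuming NumPy and synthetic data): in practice one solves the linear system (X^T X) w = X^T y rather than forming the inverse explicitly, but the O(p^3) scaling is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
X = np.hstack([X, np.ones((n, 1))])      # extra column of ones absorbs the bias
true_w = np.array([2.0, -1.0, 0.5, 4.0])
y = X @ true_w + rng.normal(0.0, 0.1, size=n)

# Normal equation: solve (X^T X) w = X^T y for w.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(w_star)  # close to true_w
```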

Limit - nonlinearity: neural networks have non-convex loss landscapes. "Set derivative to zero and solve" gives a massive nonlinear system with no closed-form answer. Gradient descent is mandatory.
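For contrast, here is a minimal gradient-descent sketch for the single-feature case (the learning rate and iteration count are arbitrary choices for this synthetic data): it uses exactly the two gradients derived above and creeps toward the same w*, b* that the closed-form formulas give in one shot.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

w, b = 0.0, 0.0
lr = 0.01  # learning rate; chosen by hand for this data
for _ in range(5000):
    residuals = y - (w * x + b)
    grad_w = -2.0 * np.mean(x * residuals)  # dL/dw from the derivation
    grad_b = -2.0 * np.mean(residuals)      # dL/db from the derivation
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches the closed-form w*, b* from the previous section
```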

Interactive example (coming soon): see the gradients approach zero as the parameters move toward their optimal values.

Quiz


To find the minimum of the loss function analytically, we...