The Mathematical Strategy
We want to find the values of $w$ and $b$ that minimize the MSE loss.
This is the lesson where calculus earns its place in machine learning. Finding the parameters that minimize the loss is a calculus problem: the derivative of the loss tells gradient descent which direction to step, and setting that derivative to zero and solving is how the normal equation finds the exact optimum in one shot.
$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)^2$$

- $L(w, b)$ - MSE loss, the average squared residual over all training examples
- $w$ - the weight (slope)
- $b$ - the bias (intercept)
- $y_i$ - true label for example $i$
- $x_i$ - input feature value for example $i$
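To make the symbols concrete, here is a minimal NumPy sketch of the loss. The data values and the function name are illustrative, not from the lesson:

```python
import numpy as np

# Toy data: x is the input feature, y the true label (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse_loss(w, b, x, y):
    """Average squared residual of the linear model y_hat = w*x + b."""
    residuals = y - (w * x + b)
    return np.mean(residuals ** 2)

print(mse_loss(w=2.0, b=0.0, x=x, y=y))  # loss at one candidate (w, b)
```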
There is a powerful observation from calculus: at the minimum of any smooth function, the derivative is zero. Think physically: at the bottom of a bowl, the surface is flat. The slope is zero.
So we can find the minimum by computing partial derivatives and solving where they equal zero. With two parameters, we take partial derivatives with respect to each:
- $\frac{\partial L}{\partial w}$ - partial derivative of the loss with respect to the weight $w$
- $\frac{\partial L}{\partial b}$ - partial derivative of the loss with respect to the bias $b$
Two equations, two unknowns. Solvable.
Computing ∂L/∂w
Focus on one training example first to build intuition. For example $i$:

$$L_i = \bigl(y_i - (w x_i + b)\bigr)^2$$

Define the residual $e_i = y_i - (w x_i + b)$, so $L_i = e_i^2$.
Apply the chain rule — derivative of the outside (squaring) times derivative of the inside (residual).
First, differentiate the inside. Since $e_i = y_i - w x_i - b$, differentiate each term with respect to $w$: $y_i$ does not contain $w$ (gives 0), the term $-w x_i$ gives $-x_i$, and $-b$ does not contain $w$ (gives 0). So:

$$\frac{\partial e_i}{\partial w} = -x_i$$

Now chain with the outer squaring function — the derivative of $e_i^2$ with respect to $e_i$ is $2e_i$:

$$\frac{\partial L_i}{\partial w} = 2e_i \cdot \frac{\partial e_i}{\partial w} = 2e_i \cdot (-x_i) = -2\,x_i\,e_i$$
- $2e_i \cdot (-x_i)$ - derivative of the outer function times derivative of the inner function
- $e_i$ - residual for example $i$
- $\frac{\partial e_i}{\partial w}$ - how the residual changes when $w$ changes
The $\frac{\partial e_i}{\partial w} = -x_i$ factor is what connects the outer squaring operation to the inner linear model.
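A quick sanity check, not part of the original derivation: compare the chain-rule result $-2 x_i e_i$ on a single example against a finite-difference approximation. All numbers below are made up for illustration:

```python
import numpy as np

x_i, y_i = 3.0, 6.2          # one illustrative training example
w, b = 1.5, 0.5              # arbitrary current parameters

def loss_i(w, b):
    """Squared residual for this single example."""
    e = y_i - (w * x_i + b)
    return e ** 2

# Analytic derivative from the chain rule: dL_i/dw = -2 * x_i * e_i
e_i = y_i - (w * x_i + b)
analytic = -2.0 * x_i * e_i

# Central finite-difference approximation: (L(w + h) - L(w - h)) / (2h)
h = 1e-6
numeric = (loss_i(w + h, b) - loss_i(w - h, b)) / (2 * h)

print(analytic, numeric)  # the two values should agree to several decimal places
```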
Substituting back and averaging over all examples:

$$\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - \hat{y}_i\bigr), \qquad \hat{y}_i = w x_i + b$$

- $n$ - number of training examples
- $x_i$ - input for example $i$
- $y_i$ - true label for example $i$
- $\hat{y}_i$ - prediction for example $i$
Interpretation: the gradient is a negative weighted sum of residuals, where each residual is weighted by its input $x_i$. If large-input examples are consistently underpredicted (positive residuals), this sum is large and the gradient is strongly negative - telling us to increase $w$.
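A small sketch of this interpretation, using made-up data where the current $w$ underpredicts every example:

```python
import numpy as np

# Illustrative data where the current model underpredicts the large-x examples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = 1.0, 0.0                      # the data's slope is about 2, so w = 1 is too small

y_hat = w * x + b
residuals = y - y_hat                # all positive: every prediction is too low
grad_w = -(2.0 / len(x)) * np.sum(x * residuals)

print(residuals)   # [1. 2. 3. 4.] -- the large-x examples are most underpredicted
print(grad_w)      # negative, so stepping against the gradient increases w
```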
Computing ∂L/∂b
Identical structure. Since $e_i = y_i - w x_i - b$, differentiate each term with respect to $b$: $y_i$ gives 0, $-w x_i$ gives 0, and $-b$ gives $-1$. So $\frac{\partial e_i}{\partial b} = -1$. The chain rule then gives:

$$\frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)$$

- $\frac{\partial L}{\partial b}$ - gradient of the loss with respect to the bias $b$
The gradient with respect to $b$ is simply the negative average residual (times the constant factor of 2). If we are consistently predicting too low (positive mean residual), this tells us to increase $b$ to shift all predictions upward. Intuitively, $b$ controls the overall level of the predictions.
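The same intuition in a few lines of NumPy; the data and parameter values are again illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative data
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = 2.0, -1.0                      # slope is right, bias is too low

residuals = y - (w * x + b)           # every residual is +1: all predictions too low
grad_b = -(2.0 / len(x)) * np.sum(residuals)

print(residuals.mean())  # +1.0 -> predicting too low on average
print(grad_b)            # -2.0 -> negative gradient, so increase b
```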
Setting to Zero and Solving
Setting $\frac{\partial L}{\partial w} = 0$:

$$\sum_{i=1}^{n} x_i y_i = w \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \tag{1}$$

- $\sum_i x_i y_i$ - cross-sum of inputs and outputs
- $\sum_i x_i^2$ - sum of squared inputs
- $\sum_i x_i$ - sum of inputs
- $n$ - number of examples

Setting $\frac{\partial L}{\partial b} = 0$:

$$\sum_{i=1}^{n} y_i = w \sum_{i=1}^{n} x_i + n b \tag{2}$$
Two equations, two unknowns. Here is the step-by-step algebra.
Step 1 — Solve for $b$ from equation (2):

$$b = \frac{\sum_i y_i - w \sum_i x_i}{n}$$

Step 2 — Substitute into equation (1), replacing $b$:

$$\sum_i x_i y_i = w \sum_i x_i^2 + \frac{\sum_i y_i - w \sum_i x_i}{n} \sum_i x_i$$

Step 3 — Multiply both sides by $n$ to clear the fraction:

$$n \sum_i x_i y_i = n w \sum_i x_i^2 + \Bigl(\sum_i y_i - w \sum_i x_i\Bigr) \sum_i x_i$$

Step 4 — Expand and collect the terms in $w$ on the right:

$$n \sum_i x_i y_i - \sum_i x_i \sum_i y_i = w \Bigl(n \sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2\Bigr)$$

Step 5 — Divide both sides to isolate $w$:

$$w^* = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2}$$

Back-substituting $w^*$ into Step 1 gives the optimal bias:

$$b^* = \frac{\sum_i y_i - w^* \sum_i x_i}{n}$$

- $w^*$ - optimal weight, the value that minimizes the MSE loss
- $b^*$ - optimal bias
These formulas give the exact optimal parameters. Compute them once from the data - no iteration, no learning rate, no approximation.
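A minimal sketch of these formulas in NumPy, cross-checked against np.polyfit; the data values are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

# Closed-form optimum from the derivation above
w_star = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_star = (sum_y - w_star * sum_x) / n

# Cross-check against NumPy's least-squares line fit
w_ref, b_ref = np.polyfit(x, y, deg=1)

print(w_star, b_star)   # exact optimum from the formulas
print(w_ref, b_ref)     # should agree to floating-point precision
```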
The Power and Limits of This Approach
Power: this is an analytic (closed-form) solution. Given your data, evaluate the formulas once and immediately have the optimal parameters. Exact. Always works for single-feature linear regression.
That last qualifier - single-feature linear regression - points to the fundamental limitation of the analytic approach.
Limit - scale: for $d$ features, the solution generalizes to the Normal Equation $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$. But inverting a $d \times d$ matrix costs $O(d^3)$. For $d = 10{,}000$ features that is on the order of $10^{12}$ operations; for $d = 10^6$, on the order of $10^{18}$. Completely intractable.
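A sketch of the multi-feature case on synthetic data (sizes and values are illustrative). In practice the linear system is solved directly rather than by forming the inverse explicitly, though the $O(d^3)$ cost is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                          # illustrative sizes
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=n)

# Add a column of ones so the bias is folded into the weight vector
X_aug = np.hstack([X, np.ones((n, 1))])

# Normal Equation: w* = (X^T X)^{-1} X^T y, solved as a linear system
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w_star)  # approximately [1.5, -2.0, 0.5, 3.0]
```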
Limit - nonlinearity: neural networks have non-convex loss landscapes. "Set derivative to zero and solve" gives a massive nonlinear system with no closed-form answer. Gradient descent is mandatory.
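For contrast with the closed form, here is a minimal gradient descent loop for the same one-feature problem; the learning rate and iteration count are arbitrary choices, not values from the lesson:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])
n = len(x)

w, b = 0.0, 0.0                # arbitrary starting point
lr = 0.02                      # learning rate (assumed; tune for your data)

for _ in range(5000):
    residuals = y - (w * x + b)
    grad_w = -(2.0 / n) * np.sum(x * residuals)
    grad_b = -(2.0 / n) * np.sum(residuals)
    w -= lr * grad_w           # step against the gradient
    b -= lr * grad_b

print(w, b)  # converges toward the same w*, b* as the closed-form solution
```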
Interactive example
See the gradient approach zero as parameters move toward the optimal values
Coming soon