The Mathematical Strategy
We want to find the values of $w$ and $b$ that minimize the MSE loss.
This is the lesson where calculus earns its place in machine learning. Finding the parameters that minimize the loss is a calculus problem: the derivative of the loss tells gradient descent which direction to step, and setting that derivative to zero and solving is how the normal equation finds the exact optimum in one shot.
$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (w x_i + b)\bigr)^2$$

- $L(w, b)$ - MSE loss, the average squared residual over all training examples
- $w$ - the weight (slope)
- $b$ - the bias (intercept)
- $y_i$ - true label for example $i$
- $x_i$ - input feature value for example $i$
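To make the symbols concrete, here is a minimal NumPy sketch of the loss. The data values and the function name are illustrative, not from the lesson:

```python
import numpy as np

# Toy data: x is the input feature, y the true label (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse_loss(w, b, x, y):
    """Average squared residual of the linear model y_hat = w*x + b."""
    residuals = y - (w * x + b)
    return np.mean(residuals ** 2)

print(mse_loss(w=2.0, b=0.0, x=x, y=y))  # loss at one candidate (w, b)
```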
There is a powerful observation from calculus: at the minimum of any smooth function, the derivative is zero. Think physically: at the bottom of a bowl, the surface is flat. The slope is zero.
So we can find the minimum by computing partial derivatives and solving where they equal zero. With two parameters, we take partial derivatives with respect to each:
- $\frac{\partial L}{\partial w}$ - partial derivative of the loss with respect to the weight $w$
- $\frac{\partial L}{\partial b}$ - partial derivative of the loss with respect to the bias $b$
Two equations, two unknowns. Solvable.
Computing ∂L/∂w
Focus on one training example first to build intuition. For example $i$:

$$L_i = \bigl(y_i - (w x_i + b)\bigr)^2$$

Define the residual $e_i = y_i - (w x_i + b)$, so $L_i = e_i^2$.
Apply the chain rule — derivative of the outside (squaring) times derivative of the inside (residual).
First, differentiate the inside. Since $e_i = y_i - w x_i - b$, differentiate each term with respect to $w$: $y_i$ does not contain $w$ (gives 0), the term $-w x_i$ gives $-x_i$, and $-b$ does not contain $w$ (gives 0). So:

$$\frac{\partial e_i}{\partial w} = -x_i$$

Now chain with the outer squaring function — the derivative of $e_i^2$ with respect to $e_i$ is $2e_i$:

$$\frac{\partial L_i}{\partial w} = 2e_i \cdot \frac{\partial e_i}{\partial w} = 2e_i \cdot (-x_i) = -2\,x_i\,e_i$$
- $2e_i \cdot (-x_i)$ - derivative of the outer function times derivative of the inner function
- $e_i$ - residual for example $i$
- $\frac{\partial e_i}{\partial w}$ - how the residual changes when $w$ changes
The $\frac{\partial e_i}{\partial w} = -x_i$ factor is what connects the outer squaring operation to the inner linear model.
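A quick sanity check, not part of the original derivation: compare the chain-rule result $-2 x_i e_i$ on a single example against a finite-difference approximation. All numbers below are made up for illustration:

```python
import numpy as np

x_i, y_i = 3.0, 6.2          # one illustrative training example
w, b = 1.5, 0.5              # arbitrary current parameters

def loss_i(w, b):
    """Squared residual for this single example."""
    e = y_i - (w * x_i + b)
    return e ** 2

# Analytic derivative from the chain rule: dL_i/dw = -2 * x_i * e_i
e_i = y_i - (w * x_i + b)
analytic = -2.0 * x_i * e_i

# Central finite-difference approximation: (L(w + h) - L(w - h)) / (2h)
h = 1e-6
numeric = (loss_i(w + h, b) - loss_i(w - h, b)) / (2 * h)

print(analytic, numeric)  # the two values should agree to several decimal places
```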
Substituting back and averaging over all examples:

$$\frac{\partial L}{\partial w} = -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - \hat{y}_i\bigr), \qquad \hat{y}_i = w x_i + b$$

- $n$ - number of training examples
- $x_i$ - input for example $i$
- $y_i$ - true label for example $i$
- $\hat{y}_i$ - prediction for example $i$
Interpretation: the gradient is a negative weighted sum of residuals, where each residual is weighted by its input $x_i$. If large-input examples are consistently underpredicted (positive residuals), this sum is large and the gradient is strongly negative - telling us to increase $w$.
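A small sketch of this interpretation, using made-up data where the current $w$ underpredicts every example:

```python
import numpy as np

# Illustrative data where the current model underpredicts the large-x examples.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = 1.0, 0.0                      # the data's slope is about 2, so w = 1 is too small

y_hat = w * x + b
residuals = y - y_hat                # all positive: every prediction is too low
grad_w = -(2.0 / len(x)) * np.sum(x * residuals)

print(residuals)   # [1. 2. 3. 4.] -- the large-x examples are most underpredicted
print(grad_w)      # negative, so stepping against the gradient increases w
```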
Computing ∂L/∂b
Identical structure. Since $e_i = y_i - w x_i - b$, differentiate each term with respect to $b$: $y_i$ gives 0, $-w x_i$ gives 0, and $-b$ gives $-1$. So $\frac{\partial e_i}{\partial b} = -1$. The chain rule then gives:

$$\frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)$$

- $\frac{\partial L}{\partial b}$ - gradient of the loss with respect to the bias $b$
The gradient with respect to $b$ is simply the negative average residual (times the constant factor of 2). If we are consistently predicting too low (positive mean residual), this tells us to increase $b$ to shift all predictions upward. Intuitively, $b$ controls the overall level of the predictions.
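The same intuition in a few lines of NumPy; the data and parameter values are again illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative data
y = np.array([2.0, 4.0, 6.0, 8.0])
w, b = 2.0, -1.0                      # slope is right, bias is too low

residuals = y - (w * x + b)           # every residual is +1: all predictions too low
grad_b = -(2.0 / len(x)) * np.sum(residuals)

print(residuals.mean())  # +1.0 -> predicting too low on average
print(grad_b)            # -2.0 -> negative gradient, so increase b
```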
Setting to Zero and Solving
Setting $\frac{\partial L}{\partial w} = 0$:

$$\sum_{i=1}^{n} x_i y_i = w \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i \tag{1}$$

- $\sum_i x_i y_i$ - cross-sum of inputs and outputs
- $\sum_i x_i^2$ - sum of squared inputs
- $\sum_i x_i$ - sum of inputs
- $n$ - number of examples

Setting $\frac{\partial L}{\partial b} = 0$:

$$\sum_{i=1}^{n} y_i = w \sum_{i=1}^{n} x_i + n b \tag{2}$$
Two equations, two unknowns. Here is the step-by-step algebra.
Step 1 — Solve for $b$ from equation (2):

$$b = \frac{\sum_i y_i - w \sum_i x_i}{n}$$

Step 2 — Substitute into equation (1), replacing $b$:

$$\sum_i x_i y_i = w \sum_i x_i^2 + \frac{\sum_i y_i - w \sum_i x_i}{n} \sum_i x_i$$

Step 3 — Multiply both sides by $n$ to clear the fraction:

$$n \sum_i x_i y_i = n w \sum_i x_i^2 + \Bigl(\sum_i y_i - w \sum_i x_i\Bigr) \sum_i x_i$$

Step 4 — Expand and collect the terms in $w$ on the right:

$$n \sum_i x_i y_i - \sum_i x_i \sum_i y_i = w \Bigl(n \sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2\Bigr)$$

Step 5 — Divide both sides to isolate $w$:

$$w^* = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2}$$

Back-substituting $w^*$ into Step 1 gives the optimal bias:

$$b^* = \frac{\sum_i y_i - w^* \sum_i x_i}{n}$$

- $w^*$ - optimal weight, the value that minimizes the MSE loss
- $b^*$ - optimal bias
These formulas give the exact optimal parameters. Compute them once from the data - no iteration, no learning rate, no approximation.
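A minimal sketch of these formulas in NumPy, cross-checked against np.polyfit; the data values are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])

n = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x ** 2).sum()

# Closed-form optimum from the derivation above
w_star = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_star = (sum_y - w_star * sum_x) / n

# Cross-check against NumPy's least-squares line fit
w_ref, b_ref = np.polyfit(x, y, deg=1)

print(w_star, b_star)   # exact optimum from the formulas
print(w_ref, b_ref)     # should agree to floating-point precision
```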
The Power and Limits of This Approach
Power: this is an analytic (closed-form) solution. Given your data, evaluate the formulas once and immediately have the optimal parameters. Exact. Always works for single-feature linear regression.
That last qualifier - single-feature linear regression - points to the fundamental limitation of the analytic approach.
Limit - scale: for $d$ features, the solution generalizes to the Normal Equation $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$. But inverting a $d \times d$ matrix costs $O(d^3)$. For $d = 10{,}000$ features that is on the order of $10^{12}$ operations; for $d = 10^6$, on the order of $10^{18}$. Completely intractable.
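A sketch of the multi-feature case on synthetic data (sizes and values are illustrative). In practice the linear system is solved directly rather than by forming the inverse explicitly, though the $O(d^3)$ cost is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                          # illustrative sizes
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=n)

# Add a column of ones so the bias is folded into the weight vector
X_aug = np.hstack([X, np.ones((n, 1))])

# Normal Equation: w* = (X^T X)^{-1} X^T y, solved as a linear system
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w_star)  # approximately [1.5, -2.0, 0.5, 3.0]
```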
Limit - nonlinearity: neural networks have non-convex loss landscapes. "Set derivative to zero and solve" gives a massive nonlinear system with no closed-form answer. Gradient descent is mandatory.
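For contrast with the closed form, here is a minimal gradient descent loop for the same one-feature problem; the learning rate and iteration count are arbitrary choices, not values from the lesson:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative data
y = np.array([2.2, 3.9, 6.1, 8.0, 9.9])
n = len(x)

w, b = 0.0, 0.0                # arbitrary starting point
lr = 0.02                      # learning rate (assumed; tune for your data)

for _ in range(5000):
    residuals = y - (w * x + b)
    grad_w = -(2.0 / n) * np.sum(x * residuals)
    grad_b = -(2.0 / n) * np.sum(residuals)
    w -= lr * grad_w           # step against the gradient
    b -= lr * grad_b

print(w, b)  # converges toward the same w*, b* as the closed-form solution
```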
Interactive example
See the gradient approach zero as parameters move toward the optimal values
Coming soon