The Closed-Form Solution
The Normal Equation is remarkable: a single matrix formula that computes the exact optimal weights for linear regression in one step, with no iterations, no learning rate to tune, and no convergence to worry about. Understanding it also reveals when gradient descent is the better choice, and why, in practice, it almost always is.
From the previous lesson, the gradient of MSE loss with respect to the weights is:

$$\nabla_w \, \text{MSE} = \frac{2}{n} X^\top (Xw - y)$$

Set this to zero to find the minimum:

$$X^\top X w = X^\top y$$

- $X^\top$ - X transpose, shape $p \times n$
- $y$ - label vector, shape $n \times 1$
- $X$ - data matrix, shape $n \times p$
- $w$ - weight vector, shape $p \times 1$

If $X^\top X$ is invertible, multiply both sides on the left by $(X^\top X)^{-1}$:

$$w^* = (X^\top X)^{-1} X^\top y$$

- $w^*$ - optimal weight vector - the exact solution that minimizes MSE
- $X^\top$ - X transpose
- $(X^\top X)^{-1}$ - inverse of the $p \times p$ matrix $X^\top X$
- $X^\top y$ - the cross-correlation between features and labels
This is the Normal Equation. It gives the exact optimal weights for multi-feature linear regression in a single computation - no iterations, no learning rate, no approximation.
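As a concrete check, here is a minimal NumPy sketch on made-up synthetic data (the dataset size, true weights, and noise level are all illustrative assumptions). It transcribes the formula literally; a numerically sturdier variant appears later in the lesson:

```python
import numpy as np

# Synthetic data, made up for illustration: X is n x p, y is n x 1.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_w + rng.normal(scale=0.1, size=(n, 1))

# Normal Equation, transcribed literally: w* = (X^T X)^{-1} X^T y.
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_star.ravel())  # close to [2.0, -1.0, 0.5]
```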
Anatomy of the Formula
Before the technical breakdown, here is the intuition in plain terms. $X^\top$ is just $X$ turned on its side: rows become columns. Multiplying $X^\top$ by $X$ creates a compact $p \times p$ matrix that records how every pair of features relates to each other across all training examples (i.e., how correlated they are). Multiplying $X^\top$ by $y$ captures how each feature relates to the target. The inverse then "divides out" those inter-feature correlations - exactly like isolating a variable in algebra - leaving the weights that best explain $y$.
Read it right to left to understand what each piece computes:
- $X^\top y$ - a $p \times 1$ vector. Entry $j$ is $\sum_i x_{ij} \, y_i$: how much feature $j$ co-varies with the target across all training examples. Features that correlate strongly with the target tend to get large weights.
- $X^\top X$ - a $p \times p$ matrix. Entry $(j, k)$ is $\sum_i x_{ij} \, x_{ik}$: how correlated features $j$ and $k$ are with each other. Diagonal entry $(j, j)$ is the total squared magnitude of feature $j$.
The invertibility of $X^\top X$ is the core requirement for the Normal Equation to work.
- $(X^\top X)^{-1}$ - inverts the feature correlation matrix. This is the critical step: it adjusts for redundancy between features. If two features are highly correlated, you should not give both large weights - the inverse handles this by "dividing out" shared information between them, as the sketch below makes concrete.
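To make the anatomy tangible, a small sketch that builds each piece explicitly (it reuses the same made-up synthetic setup as the earlier example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([[2.0], [-1.0], [0.5]]) + rng.normal(scale=0.1, size=(n, 1))

XtX = X.T @ X  # p x p feature correlation matrix
Xty = X.T @ y  # p x 1 feature-target correlation vector

# Diagonal of XtX is the squared magnitude of each feature column.
print(np.allclose(np.diag(XtX), (X ** 2).sum(axis=0)))  # True

# Entry (j, k) of XtX is the dot product of feature columns j and k.
j, k = 0, 1
print(np.isclose(XtX[j, k], X[:, j] @ X[:, k]))  # True
```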
When the Normal Equation Excels
For small to medium datasets with moderate feature counts ($p$ up to a few thousand):
- One computation - no iterations, no learning rate to tune, no convergence checking
- Exact - not an approximation; the true optimal weights
- Stable - production implementations use QR decomposition internally rather than explicit matrix inversion, which is numerically more robust
This is why numpy.linalg.solve is preferred over numpy.linalg.inv.
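A sketch of the difference (same made-up data shapes as before): hand the system $(X^\top X) w = X^\top y$ to numpy.linalg.solve rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=(100, 1))

# Preferred: solve the linear system (X^T X) w = X^T y directly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Discouraged: form the explicit inverse, then multiply.
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# The two agree here, but solve stays accurate when X^T X is
# ill-conditioned, where the explicit inverse amplifies rounding error.
print(np.allclose(w_solve, w_inv))  # True
```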
When the Normal Equation Fails
Too many features: inverting a $p \times p$ matrix costs $O(p^3)$. For $p = 10^4$: $10^{12}$ operations. For $p = 10^5$: $10^{15}$. Modern NLP models have billions of parameters - the Normal Equation is not an option.
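To feel the cubic growth, a rough timing sketch (the sizes are arbitrary and the timings machine-dependent; a random $p \times p$ system stands in for $X^\top X$):

```python
import time

import numpy as np

rng = np.random.default_rng(0)
for p in (500, 1000, 2000):
    A = rng.normal(size=(p, p))  # stand-in for the p x p matrix X^T X
    b = rng.normal(size=(p, 1))
    t0 = time.perf_counter()
    np.linalg.solve(A, b)
    print(f"p={p:5d}  {time.perf_counter() - t0:.3f}s")
# Each doubling of p should cost roughly 8x as much time.
```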
Perfectly correlated features: if two features are identical (or one is a linear combination of others), $X^\top X$ is singular. Fix: add L2 regularization, solving $w^* = (X^\top X + \lambda I)^{-1} X^\top y$ instead. For any $\lambda > 0$, the matrix $X^\top X + \lambda I$ is always invertible.
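A sketch of that fix, forcing singularity with a duplicated feature column (the $\lambda$ value is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x0 = rng.normal(size=(n, 1))
X = np.hstack([x0, x0, rng.normal(size=(n, 1))])  # columns 0 and 1 identical
y = rng.normal(size=(n, 1))

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 out of 3: XtX is singular

# Ridge fix: X^T X + lambda * I is invertible for any lambda > 0.
lam = 1e-3
w_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge.ravel())  # finite weights; the duplicated columns share credit
```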
Non-linear models: neural networks have non-convex loss surfaces. There is no algebraic structure to "set derivative to zero and solve." Gradient descent is mandatory.
Interactive example
Compare Normal Equation (one shot) vs. gradient descent (iterative) - same answer, very different paths
Coming soon
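In the meantime, a minimal sketch of that comparison in NumPy (synthetic data; the learning rate and iteration count are untuned illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_w + rng.normal(scale=0.1, size=(n, 1))

# One shot: Normal Equation.
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: batch gradient descent on the same MSE loss.
w_gd = np.zeros((p, 1))
lr = 0.01  # illustrative learning rate, not tuned
for _ in range(5000):
    grad = (2 / n) * X.T @ (X @ w_gd - y)  # gradient from the lesson
    w_gd -= lr * grad

print(np.allclose(w_ne, w_gd, atol=1e-4))  # True: same answer, different paths
```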