Linear Regression
Lesson 5 ⏱ 12 min

The Normal Equation: Linear Regression's Closed-Form Solution

Deriving the Normal Equation from the gradient condition, understanding its anatomy, and seeing exactly why it does not scale to large models.

⏱ ~7 min

🧮 Quick refresher

Matrix inverse

The inverse of matrix A, written A⁻¹, satisfies AA⁻¹ = I. Only square, non-singular matrices are invertible.

Example

If XᵀX is a 5×5 matrix, its inverse (XᵀX)⁻¹ is also 5×5.

Multiplying them gives the 5×5 identity matrix.
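
A quick NumPy check of that property (a hypothetical random 5×5 matrix, which is almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))      # hypothetical 5x5 matrix
A_inv = np.linalg.inv(A)         # its 5x5 inverse

# A @ A_inv gives the 5x5 identity matrix, up to floating-point error
print(np.allclose(A @ A_inv, np.eye(5)))   # True
```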

The Closed-Form Solution

The Normal Equation is remarkable: a single matrix formula that computes the exact optimal weights for linear regression in one step, with no iterations, no learning rate to tune, and no convergence to worry about. Understanding it also reveals when gradient descent is the better choice — and why, in practice, it almost always is.

From the previous lesson, the gradient of MSE loss with respect to weights is:

\dfrac{\partial L}{\partial \mathbf{w}} = -\dfrac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

Set this to zero to find the minimum:

\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0 \implies \mathbf{X}^\top\mathbf{y} = \mathbf{X}^\top\mathbf{X}\thinspace\mathbf{w}

  • Xᵀ - X transpose, shape p × n
  • y - label vector, shape n × 1
  • X - data matrix, shape n × p
  • w - weight vector, shape p × 1

If \mathbf{X}^\top\mathbf{X} is invertible, multiply both sides on the left by (\mathbf{X}^\top\mathbf{X})^{-1}:

\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

  • w* - optimal weight vector, the exact solution that minimizes MSE
  • Xᵀ - X transpose
  • (XᵀX)⁻¹ - inverse of the p × p matrix X-transpose times X
  • Xᵀy - X-transpose times y, the cross-correlation between features and labels

This is the Normal Equation. It gives the exact optimal weights for multi-feature linear regression in a single computation - no iterations, no learning rate, no approximation.
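
As a minimal sketch of that single computation (synthetic data with made-up true weights; assumes any bias term is already folded into X as a column of ones):

```python
import numpy as np

# Synthetic problem: n = 200 examples, p = 3 features, known "true" weights
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# The Normal Equation, computed literally as written: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(w_star)   # approximately [2.0, -1.0, 0.5]
```

The explicit np.linalg.inv call mirrors the formula; the sections below cover why production code solves the system instead of inverting.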

Anatomy of the Formula

Before the technical breakdown, here is the intuition in plain terms. \mathbf{X}^\top is just \mathbf{X} turned on its side — rows become columns. Multiplying \mathbf{X}^\top by \mathbf{X} creates a compact p \times p matrix that records how every pair of features relates to each other across all training examples (i.e., how correlated they are). Multiplying \mathbf{X}^\top by \mathbf{y} captures how each feature relates to the target. The inverse (\mathbf{X}^\top\mathbf{X})^{-1} then "divides out" those inter-feature correlations — exactly like isolating a variable in algebra — leaving the weights that best explain \mathbf{y}.

Read it right to left to understand what each piece computes:

\mathbf{X}^\top\mathbf{y} - a p \times 1 vector. Entry j is \sum_i x_{ij} y_i: how much feature j co-varies with the target across all training examples. Features that correlate strongly with the target tend to get large weights.

\mathbf{X}^\top\mathbf{X} - a p \times p matrix. Entry (j, k) is \sum_i x_{ij} x_{ik}: how correlated features j and k are with each other. Diagonal entry (j, j) = \sum_i x_{ij}^2 is the total squared magnitude of feature j.

Invertibility of this matrix is the core requirement for the Normal Equation to work.

(\mathbf{X}^\top\mathbf{X})^{-1} - inverts the feature correlation matrix. This is the critical step: it adjusts for redundancy between features. If two features are highly correlated, you should not give both large weights - the inverse handles this by "dividing out" shared information between features.
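
A short sketch of that right-to-left reading, printing the shape of each piece (same kind of synthetic data as above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

XtX = X.T @ X      # (p, p): how each pair of features co-varies across examples
Xty = X.T @ y      # (p,):   how each feature co-varies with the target
print(XtX.shape, Xty.shape)   # (3, 3) (3,)

# "Dividing out" the inter-feature correlations yields the optimal weights
w_star = np.linalg.solve(XtX, Xty)
```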

When the Normal Equation Excels

For small to medium datasets with moderate feature counts (p up to a few thousand):

  • One computation - no iterations, no learning rate to tune, no convergence checking
  • Exact - not an approximation; the true optimal weights
  • Stable - production implementations use QR or SVD factorizations (numpy.linalg.lstsq, for instance, relies on an SVD-based LAPACK routine internally) rather than explicit matrix inversion, which is numerically more robust

This is also why numpy.linalg.solve is preferred over numpy.linalg.inv.
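
A small comparison of those options (synthetic data again; all three agree to floating-point precision because this X is well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Explicit inverse: matches the formula, but the least numerically robust option
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Solve the linear system (X^T X) w = X^T y without forming an explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly, never forming X^T X at all
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))   # True True
```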

When the Normal Equation Fails

Too many features: inverting a p \times p matrix costs O(p^3). For p = 10,000: 10^{12} operations. For p = 1,000,000: 10^{18}. Modern NLP models have billions of parameters - the Normal Equation is not an option.
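
A rough back-of-the-envelope version of that scaling argument (the 1e11 operations-per-second throughput is an assumed ballpark for a single machine, not a measurement):

```python
# Estimated cost of the O(p^3) inversion at an assumed ~1e11 ops/second
for p in (1_000, 10_000, 1_000_000):
    ops = p ** 3
    print(f"p = {p:>9,}: ~{ops:.0e} ops, ~{ops / 1e11:.0e} seconds")
# p = 1,000,000 works out to ~1e7 seconds, i.e. months of compute for one solve
```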

Perfectly correlated features: if two features are identical (or one is a linear combination of others), \mathbf{X}^\top\mathbf{X} is singular. Fix: add L2 regularization, solving (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} instead. For any \lambda > 0, this matrix is always invertible.
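
A sketch of that fix (lam = 1e-3 is an arbitrary hypothetical value; the duplicated column makes XᵀX singular on purpose):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Duplicate a column so X^T X becomes rank-deficient (singular)
X_dup = np.hstack([X, X[:, :1]])
p = X_dup.shape[1]

# Inverting X_dup.T @ X_dup directly would fail or be wildly unstable.
# Adding lambda * I makes the matrix invertible for any lambda > 0.
lam = 1e-3
w_ridge = np.linalg.solve(X_dup.T @ X_dup + lam * np.eye(p), X_dup.T @ y)
print(w_ridge)   # the duplicated feature's weight is shared across both copies
```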

Non-linear models: neural networks have non-convex loss surfaces. There is no algebraic structure to "set derivative to zero and solve." Gradient descent is mandatory.

Interactive example

Compare Normal Equation (one shot) vs. gradient descent (iterative) - same answer, very different paths
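
A minimal sketch of that comparison (synthetic data; the learning rate and iteration count are arbitrary, untuned choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# One shot: the Normal Equation
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: batch gradient descent on the MSE loss
w_gd = np.zeros(p)
lr = 0.1
for _ in range(2_000):
    grad = -(2 / n) * X.T @ (y - X @ w_gd)   # the gradient from the top of this lesson
    w_gd -= lr * grad

print(np.allclose(w_normal, w_gd))   # True: same answer, very different paths
```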


Quiz


The Normal Equation gives the optimal weights...