Linear Regression
Lesson 5 ⏱ 12 min

The Normal Equation: Linear Regression's Closed-Form Solution

Deriving the Normal Equation from the gradient condition, understanding its anatomy, and seeing exactly why it does not scale to large models.

⏱ ~7 min

🧮 Quick refresher

Matrix inverse

The inverse of matrix A, written A⁻¹, satisfies AA⁻¹ = I. Only square, non-singular matrices are invertible.

Example

If XᵀX is a 5×5 matrix, its inverse (XᵀX)⁻¹ is also 5×5.

Multiplying them gives the 5×5 identity matrix.
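
A quick NumPy check of that property (a hypothetical random 5×5 matrix, which is almost surely non-singular):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))      # hypothetical 5x5 matrix
A_inv = np.linalg.inv(A)         # its 5x5 inverse

# A @ A_inv gives the 5x5 identity matrix, up to floating-point error
print(np.allclose(A @ A_inv, np.eye(5)))   # True
```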

The Closed-Form Solution

The Normal Equation is remarkable: a single matrix formula that computes the exact optimal weights for linear regression in one step, with no iterations, no learning rate to tune, and no convergence to worry about. Understanding it also reveals when gradient descent is the better choice — and why, in practice, it almost always is.

From the previous lesson, the gradient of MSE loss with respect to weights is:

\dfrac{\partial L}{\partial \mathbf{w}} = -\dfrac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

Set this to zero to find the minimum:

\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0 \implies \mathbf{X}^\top\mathbf{y} = \mathbf{X}^\top\mathbf{X}\thinspace\mathbf{w}

  • Xᵀ - X transpose, shape p × n
  • y - label vector, shape n × 1
  • X - data matrix, shape n × p
  • w - weight vector, shape p × 1

If \mathbf{X}^\top\mathbf{X} is invertible, multiply both sides on the left by (\mathbf{X}^\top\mathbf{X})^{-1}:

\mathbf{w}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}

  • w* - optimal weight vector, the exact solution that minimizes MSE
  • Xᵀ - X transpose
  • (XᵀX)⁻¹ - inverse of the p × p matrix X-transpose times X
  • Xᵀy - X-transpose times y, the cross-correlation between features and labels

This is the Normal Equation. It gives the exact optimal weights for multi-feature linear regression in a single computation - no iterations, no learning rate, no approximation.
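
As a minimal sketch of that single computation (synthetic data with made-up true weights; assumes any bias term is already folded into X as a column of ones):

```python
import numpy as np

# Synthetic problem: n = 200 examples, p = 3 features, known "true" weights
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# The Normal Equation, computed literally as written: w* = (X^T X)^{-1} X^T y
w_star = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(w_star)   # approximately [2.0, -1.0, 0.5]
```

The explicit np.linalg.inv call mirrors the formula; the sections below cover why production code solves the system instead of inverting.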

Anatomy of the Formula

Before the technical breakdown, here is the intuition in plain terms. \mathbf{X}^\top is just \mathbf{X} turned on its side — rows become columns. Multiplying \mathbf{X}^\top by \mathbf{X} creates a compact p \times p matrix that records how every pair of features relates to each other across all training examples (i.e., how correlated they are). Multiplying \mathbf{X}^\top by \mathbf{y} captures how each feature relates to the target. The inverse (\mathbf{X}^\top\mathbf{X})^{-1} then "divides out" those inter-feature correlations — exactly like isolating a variable in algebra — leaving the weights that best explain \mathbf{y}.

Read it right to left to understand what each piece computes:

\mathbf{X}^\top\mathbf{y} - a p \times 1 vector. Entry j is \sum_i x_{ij} y_i: how much feature j co-varies with the target across all training examples. Features that correlate strongly with the target tend to get large weights.

\mathbf{X}^\top\mathbf{X} - a p \times p matrix. Entry (j, k) is \sum_i x_{ij} x_{ik}: how correlated features j and k are with each other. Diagonal entry (j, j) = \sum_i x_{ij}^2 is the total squared magnitude of feature j.

Invertibility of this matrix is the core requirement for the Normal Equation to work.

(\mathbf{X}^\top\mathbf{X})^{-1} - inverts the feature correlation matrix. This is the critical step: it adjusts for redundancy between features. If two features are highly correlated, you should not give both large weights - the inverse handles this by "dividing out" shared information between features.
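
A short sketch of that right-to-left reading, printing the shape of each piece (same kind of synthetic data as above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

XtX = X.T @ X      # (p, p): how each pair of features co-varies across examples
Xty = X.T @ y      # (p,):   how each feature co-varies with the target
print(XtX.shape, Xty.shape)   # (3, 3) (3,)

# "Dividing out" the inter-feature correlations yields the optimal weights
w_star = np.linalg.solve(XtX, Xty)
```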

When the Normal Equation Excels

For small to medium datasets with moderate feature counts (p up to a few thousand):

  • One computation - no iterations, no learning rate to tune, no convergence checking
  • Exact - not an approximation; the true optimal weights
  • Stable - production implementations use QR or SVD factorizations (numpy.linalg.lstsq, for instance, relies on an SVD-based LAPACK routine internally) rather than explicit matrix inversion, which is numerically more robust

This is also why numpy.linalg.solve is preferred over numpy.linalg.inv.
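
A small comparison of those options (synthetic data again; all three agree to floating-point precision because this X is well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Explicit inverse: matches the formula, but the least numerically robust option
w_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Solve the linear system (X^T X) w = X^T y without forming an explicit inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver that works on X directly, never forming X^T X at all
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_inv, w_solve), np.allclose(w_solve, w_lstsq))   # True True
```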

When the Normal Equation Fails

Too many features: inverting a p \times p matrix costs O(p^3). For p = 10,000: 10^{12} operations. For p = 1,000,000: 10^{18}. Modern NLP models have billions of parameters - the Normal Equation is not an option.
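
A rough back-of-the-envelope version of that scaling argument (the 1e11 operations-per-second throughput is an assumed ballpark for a single machine, not a measurement):

```python
# Estimated cost of the O(p^3) inversion at an assumed ~1e11 ops/second
for p in (1_000, 10_000, 1_000_000):
    ops = p ** 3
    print(f"p = {p:>9,}: ~{ops:.0e} ops, ~{ops / 1e11:.0e} seconds")
# p = 1,000,000 works out to ~1e7 seconds, i.e. months of compute for one solve
```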

Perfectly correlated features: if two features are identical (or one is a linear combination of others), \mathbf{X}^\top\mathbf{X} is singular. Fix: add L2 regularization, solving (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y} instead. For any \lambda > 0, this matrix is always invertible.
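
A sketch of that fix (lam = 1e-3 is an arbitrary hypothetical value; the duplicated column makes XᵀX singular on purpose):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Duplicate a column so X^T X becomes rank-deficient (singular)
X_dup = np.hstack([X, X[:, :1]])
p = X_dup.shape[1]

# Inverting X_dup.T @ X_dup directly would fail or be wildly unstable.
# Adding lambda * I makes the matrix invertible for any lambda > 0.
lam = 1e-3
w_ridge = np.linalg.solve(X_dup.T @ X_dup + lam * np.eye(p), X_dup.T @ y)
print(w_ridge)   # the duplicated feature's weight is shared across both copies
```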

Non-linear models: neural networks have non-convex loss surfaces. There is no algebraic structure to "set derivative to zero and solve." Gradient descent is mandatory.

Interactive example

Compare Normal Equation (one shot) vs. gradient descent (iterative) - same answer, very different paths
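
A minimal sketch of that comparison (synthetic data; the learning rate and iteration count are arbitrary, untuned choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# One shot: the Normal Equation
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Iterative: batch gradient descent on the MSE loss
w_gd = np.zeros(p)
lr = 0.1
for _ in range(2_000):
    grad = -(2 / n) * X.T @ (y - X @ w_gd)   # the gradient from the top of this lesson
    w_gd -= lr * grad

print(np.allclose(w_normal, w_gd))   # True: same answer, very different paths
```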


Quiz


The Normal Equation gives the optimal weights...