Linear Regression
Lesson 4 ⏱ 12 min

Multiple features: matrix form

Video coming soon

Multiple Features and the Matrix Form of Linear Regression

Extending linear regression from one feature to many, absorbing bias into the weight vector, and expressing the gradient as a single matrix equation.

⏱ ~7 min

🧮

Quick refresher

Matrix-vector multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector.

Example

X is 3×2 (3 examples, 2 features), w is 2×1.

Xw is 3×1 — one prediction per example.
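
If it helps to see the shapes concretely, here is a minimal NumPy sketch of the same 3×2 by 2×1 product; the numbers are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # 3x2: 3 examples, 2 features
w = np.array([[0.5],
              [-1.0]])        # 2x1 weight vector

y_hat = X @ w                 # each row of X dots with w
print(y_hat.shape)            # (3, 1) - one prediction per example
```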

From One Feature to Many

Real-world ML problems rarely involve a single input. Predicting house prices means combining square footage, bedrooms, location, and age. Detecting fraud means weighing dozens of transaction signals simultaneously. Multiple-feature linear regression is the simplest model that captures how features combine, and its matrix form is the template for every neural network layer you will build.

Single-feature linear regression: \hat{y} = wx + b

With p features, each example is a vector \mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}] and the prediction is:

\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip} + b = \mathbf{w} \cdot \mathbf{x}_i + b

  • w_j: weight for feature j
  • x_{ij}: value of feature j for example i
  • b: bias term
  • \hat{y}_i: prediction for example i

The model is still linear - it defines a hyperplane in p-dimensional space. Adding more features does not change the structure; it just adds more terms to the weighted sum.
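
As a concrete illustration, here is a minimal NumPy sketch of the weighted-sum prediction for a single example; the feature values and weights are made up for the example.

```python
import numpy as np

# One example with p = 4 features (illustrative values only)
x_i = np.array([1200.0, 3.0, 2.0, 15.0])   # e.g. sqft, bedrooms, bathrooms, age
w   = np.array([0.3, 5.0, 2.0, -0.1])      # one weight per feature
b   = 10.0                                  # bias term

# Weighted sum of the features plus the bias: w . x_i + b
y_hat_i = np.dot(w, x_i) + b
```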

Absorbing the Bias

A common algebraic trick simplifies the math: absorb b into the weight vector by prepending a column of 1s to \mathbf{X}.

Add a fake feature x_0 = 1 to every example:

\hat{y}_i = w_0 \cdot 1 + w_1 x_{i1} + \cdots + w_p x_{ip} = \mathbf{w} \cdot \mathbf{x}_i

  • w_0: the bias, now treated as the weight for the constant feature x_0 = 1
  • x_0: constant feature always equal to 1
  • \mathbf{w}: augmented weight vector of length p+1

This is a notational convenience that unifies weights and bias into one vector.

In the remaining sections, we assume bias is absorbed: \mathbf{w} has length p+1 and \mathbf{X} has a leading column of 1s.
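
A minimal NumPy sketch of the trick, using random numbers to stand in for a real dataset: prepending a column of 1s to X and prepending b to w leaves every prediction unchanged.

```python
import numpy as np

n, p = 4, 3
X = np.random.rand(n, p)                  # original data matrix, shape (n, p)
w = np.array([0.5, -1.0, 3.0])            # original weights, length p
b = 2.0                                   # bias term

# Prepend the constant feature x_0 = 1 to every example
X_aug = np.hstack([np.ones((n, 1)), X])   # shape (n, p + 1)
w_aug = np.concatenate([[b], w])          # bias becomes w_0, length p + 1

# Both formulations give identical predictions
assert np.allclose(X @ w + b, X_aug @ w_aug)
```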

Matrix Form: All Examples at Once

Define:

  • The \mathbf{X}: n × p data matrix
  • The \mathbf{w}: p × 1 weight vector
  • The \mathbf{y}: n × 1 vector of true labels

The prediction vector for all n examples simultaneously:

\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}

  • \mathbf{X}: data matrix, shape n × p
  • \mathbf{w}: weight vector, shape p × 1
  • \hat{\mathbf{y}}: prediction vector, shape n × 1 - one prediction per example

Check shapes: (n × p) · (p × 1) = n × 1 ✓ - one prediction per example.

Concretely: with 1,000 examples and 5 features, \mathbf{X} is 1000 × 5 and \mathbf{w} is 5 × 1. The matrix multiply computes 1,000 dot products simultaneously, exploiting hardware parallelism on CPUs and GPUs.
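
The same concrete example as a NumPy sketch, with random numbers standing in for real data:

```python
import numpy as np

n, p = 1000, 5
X = np.random.rand(n, p)        # data matrix, shape (1000, 5)
w = np.random.rand(p, 1)        # weight vector, shape (5, 1)

y_hat = X @ w                   # 1,000 dot products in one matrix multiply
print(y_hat.shape)              # (1000, 1) - one prediction per example
```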

MSE in Matrix Form

The MSE loss, written compactly using the squared L2 norm \|\cdot\|^2:

L = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \mathbf{y}: true label vector, shape n × 1
  • \mathbf{X}\mathbf{w}: predicted label vector, shape n × 1
  • \|\cdot\|^2: squared L2 norm - the sum of squared components of the vector
  • n: number of examples

In words: the loss is the squared Euclidean distance between the true label vector \mathbf{y} and the prediction vector \mathbf{X}\mathbf{w}, divided by n. We are literally minimizing the distance between predictions and truth in an n-dimensional space.
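
A minimal sketch of this loss in NumPy, assuming y and the predictions are 1-D arrays of length n; the elementwise form np.mean((y - X @ w) ** 2) gives the same number.

```python
import numpy as np

def mse(X, w, y):
    """Mean squared error in matrix form: (1/n) * ||y - Xw||^2."""
    residual = y - X @ w                    # residual vector, shape (n,)
    return (residual @ residual) / len(y)   # squared L2 norm divided by n

# Equivalent elementwise form: np.mean((y - X @ w) ** 2)
```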

The Gradient Vector

Taking the derivative of L with respect to the entire weight vector \mathbf{w}:

\frac{\partial L}{\partial \mathbf{w}} = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \mathbf{X}^\top: X transpose, shape p × n
  • \mathbf{y} - \mathbf{X}\mathbf{w}: residual vector - true labels minus predictions, shape n × 1
  • \partial L / \partial \mathbf{w}: gradient vector, shape p × 1 - one partial derivative per weight

Shape check: \mathbf{X}^\top is p × n, (\mathbf{y} - \mathbf{X}\mathbf{w}) is n × 1, product is p × 1 ✓ - same shape as \mathbf{w}.

The residual vector \mathbf{y} - \mathbf{X}\mathbf{w} gives the intuition behind the gradient formula: each weight's partial derivative is the dot product of its feature column with the residuals, so features that correlate with the remaining error get the largest updates.

The transpose \mathbf{X}^\top (X-transpose) flips the matrix: its rows are the columns of \mathbf{X}, which is exactly what dots each feature column with the residual vector.
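
A minimal NumPy sketch of the gradient formula, again assuming y and the predictions are 1-D arrays of length n:

```python
import numpy as np

def mse_gradient(X, w, y):
    """Gradient of MSE: -(2/n) * X^T (y - Xw), one entry per weight."""
    residual = y - X @ w                       # shape (n,)
    return -2.0 / len(y) * (X.T @ residual)    # shape (p,)
```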

Gradient Descent Update in Matrix Form

The gradient descent update for all weights simultaneously:

\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \frac{2\alpha}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \alpha: learning rate
  • \partial L / \partial \mathbf{w}: gradient vector computed above

This single vector equation updates all p weights in parallel. On GPU hardware, this is one matrix-vector multiply, one addition, and one subtraction - regardless of whether p is 10 or 10 million.
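
Putting the pieces together, here is a minimal sketch of matrix-form gradient descent in NumPy; the learning rate and iteration count are arbitrary placeholders, not tuned values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Fit linear regression by matrix-form gradient descent.

    Assumes X already has a leading column of 1s, so w[0] acts as the bias.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        residual = y - X @ w                   # shape (n,)
        grad = -2.0 / n * (X.T @ residual)     # shape (p,)
        w = w - alpha * grad                   # update every weight at once
    return w
```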

Interactive example

Step through matrix-form gradient descent - watch all weights update simultaneously each iteration

Coming soon

Quiz

1 / 3

For a dataset with 100 training examples and 5 features, the data matrix X has shape...