Linear Regression
Lesson 4 ⏱ 12 min

Multiple features: matrix form

Video coming soon

Multiple Features and the Matrix Form of Linear Regression

Extending linear regression from one feature to many, absorbing bias into the weight vector, and expressing the gradient as a single matrix equation.

⏱ ~7 min

🧮

Quick refresher

Matrix-vector multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector.

Example

X is 3×2 (3 examples, 2 features), w is 2×1.

Xw is 3×1 — one prediction per example.
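
If it helps to see the shapes concretely, here is a minimal NumPy sketch of the same 3×2 by 2×1 product; the numbers are arbitrary.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # 3x2: 3 examples, 2 features
w = np.array([[0.5],
              [-1.0]])        # 2x1 weight vector

y_hat = X @ w                 # each row of X dots with w
print(y_hat.shape)            # (3, 1) - one prediction per example
```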

From One Feature to Many

Real-world ML problems rarely involve a single input. Predicting house prices means combining square footage, bedrooms, location, and age. Detecting fraud means weighing dozens of transaction signals simultaneously. Multiple-feature linear regression is the simplest model that captures how features combine, and its matrix form is the template for every neural network layer you will build.

Single-feature linear regression: \hat{y} = wx + b

With p features, each example is a vector \mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{ip}] and the prediction is:

\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip} + b = \mathbf{w} \cdot \mathbf{x}_i + b

  • w_j: weight for feature j
  • x_{ij}: value of feature j for example i
  • b: bias term
  • \hat{y}_i: prediction for example i

The model is still linear - it defines a hyperplane in p-dimensional space. Adding more features does not change the structure; it just adds more terms to the weighted sum.
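
As a concrete illustration, here is a minimal NumPy sketch of the weighted-sum prediction for a single example; the feature values and weights are made up for the example.

```python
import numpy as np

# One example with p = 4 features (illustrative values only)
x_i = np.array([1200.0, 3.0, 2.0, 15.0])   # e.g. sqft, bedrooms, bathrooms, age
w   = np.array([0.3, 5.0, 2.0, -0.1])      # one weight per feature
b   = 10.0                                  # bias term

# Weighted sum of the features plus the bias: w . x_i + b
y_hat_i = np.dot(w, x_i) + b
```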

Absorbing the Bias

A common algebraic trick simplifies the math: absorb b into the weight vector by prepending a column of 1s to \mathbf{X}.

Add a fake feature x_0 = 1 to every example:

\hat{y}_i = w_0 \cdot 1 + w_1 x_{i1} + \cdots + w_p x_{ip} = \mathbf{w} \cdot \mathbf{x}_i

  • w_0: the bias, now treated as the weight for the constant feature x_0 = 1
  • x_0: constant feature always equal to 1
  • \mathbf{w}: augmented weight vector of length p+1

This is a notational convenience that unifies weights and bias into one vector.

In the remaining sections, we assume bias is absorbed: \mathbf{w} has length p+1 and \mathbf{X} has a leading column of 1s.
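
A minimal NumPy sketch of the trick, using random numbers to stand in for a real dataset: prepending a column of 1s to X and prepending b to w leaves every prediction unchanged.

```python
import numpy as np

n, p = 4, 3
X = np.random.rand(n, p)                  # original data matrix, shape (n, p)
w = np.array([0.5, -1.0, 3.0])            # original weights, length p
b = 2.0                                   # bias term

# Prepend the constant feature x_0 = 1 to every example
X_aug = np.hstack([np.ones((n, 1)), X])   # shape (n, p + 1)
w_aug = np.concatenate([[b], w])          # bias becomes w_0, length p + 1

# Both formulations give identical predictions
assert np.allclose(X @ w + b, X_aug @ w_aug)
```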

Matrix Form: All Examples at Once

Define:

  • The \mathbf{X}: n × p data matrix
  • The \mathbf{w}: p × 1 weight vector
  • The \mathbf{y}: n × 1 vector of true labels

The prediction vector for all n examples simultaneously:

\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}

  • \mathbf{X}: data matrix, shape n × p
  • \mathbf{w}: weight vector, shape p × 1
  • \hat{\mathbf{y}}: prediction vector, shape n × 1 - one prediction per example

Check shapes: (n × p) · (p × 1) = n × 1 ✓ - one prediction per example.

Concretely: with 1,000 examples and 5 features, \mathbf{X} is 1000 × 5 and \mathbf{w} is 5 × 1. The matrix multiply computes 1,000 dot products simultaneously, exploiting hardware parallelism on CPUs and GPUs.
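
The same concrete example as a NumPy sketch, with random numbers standing in for real data:

```python
import numpy as np

n, p = 1000, 5
X = np.random.rand(n, p)        # data matrix, shape (1000, 5)
w = np.random.rand(p, 1)        # weight vector, shape (5, 1)

y_hat = X @ w                   # 1,000 dot products in one matrix multiply
print(y_hat.shape)              # (1000, 1) - one prediction per example
```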

MSE in Matrix Form

The MSE loss, written compactly using the squared L2 norm \|\cdot\|^2:

L = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \frac{1}{n}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \mathbf{y}: true label vector, shape n × 1
  • \mathbf{X}\mathbf{w}: predicted label vector, shape n × 1
  • \|\cdot\|^2: squared L2 norm - the sum of squared components of the vector
  • n: number of examples

In words: the loss is the squared Euclidean distance between the true label vector \mathbf{y} and the prediction vector \mathbf{X}\mathbf{w}, divided by n. We are literally minimizing the distance between predictions and truth in an n-dimensional space.
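
A minimal sketch of this loss in NumPy, assuming y and the predictions are 1-D arrays of length n; the elementwise form np.mean((y - X @ w) ** 2) gives the same number.

```python
import numpy as np

def mse(X, w, y):
    """Mean squared error in matrix form: (1/n) * ||y - Xw||^2."""
    residual = y - X @ w                    # residual vector, shape (n,)
    return (residual @ residual) / len(y)   # squared L2 norm divided by n

# Equivalent elementwise form: np.mean((y - X @ w) ** 2)
```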

The Gradient Vector

Taking the derivative of L with respect to the entire weight vector \mathbf{w}:

\frac{\partial L}{\partial \mathbf{w}} = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \mathbf{X}^\top: X transpose, shape p × n
  • \mathbf{y} - \mathbf{X}\mathbf{w}: residual vector - true labels minus predictions, shape n × 1
  • \partial L / \partial \mathbf{w}: gradient vector, shape p × 1 - one partial derivative per weight

Shape check: \mathbf{X}^\top is p × n, (\mathbf{y} - \mathbf{X}\mathbf{w}) is n × 1, product is p × 1 ✓ - same shape as \mathbf{w}.

The residual vector \mathbf{y} - \mathbf{X}\mathbf{w} gives the intuition behind the gradient formula: each weight's partial derivative is the dot product of its feature column with the residuals, so features that correlate with the remaining error get the largest updates.

The transpose \mathbf{X}^\top (X-transpose) flips the matrix: its rows are the columns of \mathbf{X}, which is exactly what dots each feature column with the residual vector.
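
A minimal NumPy sketch of the gradient formula, again assuming y and the predictions are 1-D arrays of length n:

```python
import numpy as np

def mse_gradient(X, w, y):
    """Gradient of MSE: -(2/n) * X^T (y - Xw), one entry per weight."""
    residual = y - X @ w                       # shape (n,)
    return -2.0 / len(y) * (X.T @ residual)    # shape (p,)
```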

Gradient Descent Update in Matrix Form

The gradient descent update for all weights simultaneously:

\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \frac{2\alpha}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

  • \alpha: learning rate
  • \partial L / \partial \mathbf{w}: gradient vector computed above

This single vector equation updates all p weights in parallel. On GPU hardware, this is one matrix-vector multiply, one addition, and one subtraction - regardless of whether p is 10 or 10 million.
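
Putting the pieces together, here is a minimal sketch of matrix-form gradient descent in NumPy; the learning rate and iteration count are arbitrary placeholders, not tuned values.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Fit linear regression by matrix-form gradient descent.

    Assumes X already has a leading column of 1s, so w[0] acts as the bias.
    """
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iters):
        residual = y - X @ w                   # shape (n,)
        grad = -2.0 / n * (X.T @ residual)     # shape (p,)
        w = w - alpha * grad                   # update every weight at once
    return w
```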

Interactive example

Step through matrix-form gradient descent - watch all weights update simultaneously each iteration

Coming soon

Quiz

1 / 3

For a dataset with 100 training examples and 5 features, the data matrix X has shape...