From One Feature to Many
Real-world ML problems rarely involve a single input. Predicting house prices means combining square footage, bedrooms, location, and age. Detecting fraud means weighing dozens of transaction signals simultaneously. Multiple-feature linear regression is the simplest model that captures how features combine, and its matrix form is the template for every neural network layer you will build.
Single-feature linear regression:

$$\hat{y}_i = w x_i + b$$

With $p$ features, each example is a vector $x_i = (x_{i1}, \dots, x_{ip})$ and the prediction is:

$$\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip} + b = \sum_{j=1}^{p} w_j x_{ij} + b$$

- $w_j$ - weight for feature $j$
- $x_{ij}$ - value of feature $j$ for example $i$
- $b$ - bias term
- $\hat{y}_i$ - prediction for example $i$

The model is still linear - it defines a hyperplane in $p$-dimensional space. Adding more features does not change the structure; it just adds more terms to the weighted sum.
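To make the weighted sum concrete, here is a small NumPy sketch of the multi-feature prediction for a single example. The feature values and weights are made-up numbers for a hypothetical house-price setting, not fitted parameters.

```python
import numpy as np

# Hypothetical house-price example: p = 4 features
# (square footage, bedrooms, distance to city center in km, age in years).
x_i = np.array([1500.0, 3.0, 8.2, 25.0])        # feature vector for one example
w = np.array([120.0, 9000.0, -2500.0, -300.0])  # one weight per feature (illustrative)
b = 50_000.0                                     # bias term

# Prediction: weighted sum of features plus bias.
y_hat_i = np.dot(w, x_i) + b
print(y_hat_i)  # a single number: the predicted price
```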
Absorbing the Bias
A common algebraic trick simplifies the math: absorb $b$ into the weight vector by prepending a column of 1s to $X$.

Add a fake feature $x_0 = 1$ to every example:

$$\hat{y}_i = w_0 x_{i0} + w_1 x_{i1} + \cdots + w_p x_{ip} = \sum_{j=0}^{p} w_j x_{ij}, \qquad x_{i0} = 1$$

- $w_0 = b$ - the bias, now treated as the weight for the constant feature $x_0 = 1$
- $x_0$ - constant feature, always equal to 1
- $w = (w_0, w_1, \dots, w_p)$ - augmented weight vector of length $p+1$

The $x_0 = 1$ trick is a notational convenience that unifies weights and bias into one vector.

In the remaining sections, we assume the bias is absorbed: $w$ has length $p+1$ and $X$ has a leading column of 1s.
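A minimal sketch of the bias-absorbing trick in NumPy, using made-up numbers; the point is only that the augmented form gives identical predictions.

```python
import numpy as np

# Toy data: n = 3 examples, p = 2 features (illustrative values).
X = np.array([[1500.0, 3.0],
              [2100.0, 4.0],
              [ 900.0, 2.0]])
w = np.array([120.0, 9000.0])  # one weight per feature
b = 50_000.0                   # bias term

# Absorb the bias: prepend a column of 1s to X and prepend b to w.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # shape (n, p+1)
w_aug = np.concatenate([[b], w])                  # length p+1

# The two formulations produce the same predictions.
assert np.allclose(X @ w + b, X_aug @ w_aug)
```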
Matrix Form: All Examples at Once
Define:
- $X$: the data matrix
- $w$: the weight vector
- $y$: the vector of true labels

The prediction vector for all $n$ examples simultaneously:

$$\hat{y} = Xw$$

- $X$ - data matrix, shape $n \times p$
- $w$ - weight vector, shape $p \times 1$
- $\hat{y}$ - prediction vector, shape $n \times 1$ - one prediction per example

Check shapes: $(n \times p)(p \times 1) = n \times 1$ ✓ - one prediction per example.

Concretely: with 1,000 examples and 5 features, $X$ is $1000 \times 5$ and $w$ is $5 \times 1$. The matrix multiply computes 1,000 dot products simultaneously, exploiting hardware parallelism on CPUs and GPUs.
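As a sketch of the shapes involved, here is the batched prediction in NumPy, with random data standing in for the 1,000 examples and 5 features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5                # 1,000 examples, 5 features
X = rng.normal(size=(n, p))   # data matrix, shape (n, p)
w = rng.normal(size=p)        # weight vector, length p

# One matrix-vector multiply = 1,000 dot products at once.
y_hat = X @ w
print(y_hat.shape)            # (1000,) - one prediction per example
```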
MSE in Matrix Form
The MSE loss written compactly using the squared L2 norm $\|\cdot\|^2$:

$$L(w) = \frac{1}{n}\,\|y - Xw\|^2$$

- $y$ - true label vector, shape $n \times 1$
- $Xw$ - predicted label vector, shape $n \times 1$
- $\|v\|^2$ - squared L2 norm: sum of squared components of the vector $v$
- $n$ - number of examples

In words: the loss is the squared Euclidean distance between the true label vector $y$ and the prediction vector $Xw$, divided by $n$. We are literally minimizing the distance between predictions and truth in an $n$-dimensional space.
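A sketch of the loss in NumPy, assuming `y` and `y_hat` are 1-D arrays of the same length:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: squared L2 distance between labels and predictions, divided by n."""
    residual = y - y_hat
    return np.dot(residual, residual) / len(y)

# Equivalent form using the norm directly:
# np.linalg.norm(y - y_hat) ** 2 / len(y)
```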
The Gradient Vector
Taking the derivative of $L(w)$ with respect to the entire weight vector $w$:

$$\nabla_w L = -\frac{2}{n}\,X^\top (y - Xw)$$

- $X^\top$ - X transpose, shape $p \times n$
- $(y - Xw)$ - residual vector: true labels minus predictions, shape $n \times 1$
- $\nabla_w L$ - gradient vector, shape $p \times 1$ - one partial derivative per weight

Shape check: $X^\top$ is $p \times n$, $(y - Xw)$ is $n \times 1$, the product is $p \times 1$ ✓ - same shape as $w$.
The $X^\top$ factor gives the intuition behind the gradient formula. $X^\top$ (X-transpose) flips the matrix: its rows are the columns of $X$, so each row holds one feature's values across all $n$ examples. Multiplying by the residual vector therefore accumulates, feature by feature, how that feature's values line up with the errors - one partial derivative per weight.
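A sketch of the gradient computation, assuming `X` has shape (n, p) and `y`, `w` are 1-D arrays; the sign and 2/n factor follow the formula above.

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of the MSE loss with respect to w: -(2/n) * X^T (y - Xw)."""
    n = X.shape[0]
    residual = y - X @ w                  # shape (n,): true labels minus predictions
    return -(2.0 / n) * (X.T @ residual)  # shape (p,): one partial derivative per weight
```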
Gradient Descent Update in Matrix Form
The gradient descent update for all weights simultaneously:

$$w \leftarrow w - \alpha\,\nabla_w L = w + \frac{2\alpha}{n}\,X^\top (y - Xw)$$

- $\alpha$ - learning rate
- $\nabla_w L$ - gradient vector computed above

This single vector equation updates all weights in parallel. On GPU hardware, this is one matrix-vector multiply, one addition, and one subtraction - regardless of whether $p$ is 10 or 10 million.
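Putting the pieces together, a minimal sketch of the full loop; the zero initialization, learning rate, and step count are arbitrary choices for illustration, and the bias column is assumed to be already absorbed into X.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_steps=1000):
    """Matrix-form gradient descent for linear regression with MSE loss."""
    n, p = X.shape
    w = np.zeros(p)                           # start from all-zero weights
    for _ in range(num_steps):
        residual = y - X @ w                  # shape (n,)
        grad = -(2.0 / n) * (X.T @ residual)  # shape (p,)
        w = w - alpha * grad                  # update every weight at once
    return w
```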
Interactive example
Step through matrix-form gradient descent - watch all weights update simultaneously each iteration
Coming soon