Linear Regression
Lesson 1 ⏱ 10 min

The model: ŷ = wx + b


The Linear Model: Lines, Weights, and Predictions

Building intuition for ŷ = wx + b from a single feature up to the full matrix form ŷ = Xw + b, with a concrete house-price running example.

⏱ ~6 min

🧮 Quick refresher

Dot product

The dot product of two vectors w and x is Σᵢ wᵢxᵢ — element-wise multiplication then sum. The prediction in linear models is exactly a dot product.

Example

For w=[2,3] and x=[4,5], w·x = 2·4 + 3·5 = 8+15 = 23.
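The same computation in NumPy, as a quick sketch (assumes NumPy is installed; nothing here is specific to this lesson's code):

import numpy as np

w = np.array([2, 3])
x = np.array([4, 5])

# Element-wise multiply then sum, and NumPy's built-in dot product, agree.
print((w * x).sum())   # 23
print(np.dot(w, x))    # 23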

The Simplest Useful Model

Linear regression is the foundation of predictive modeling. It is not the most powerful tool - neural networks learn far more complex relationships - but it is the bedrock that everything else builds on. And it remains genuinely useful: housing price prediction, demand forecasting, risk scoring.

The core assumption: the relationship between input features x and output y is approximately linear. Add more square footage to a house and the price rises by a roughly constant amount per extra square foot. Many real relationships, especially over small ranges, are close enough to linear to make this practical.

The Single-Feature Model

Start with one input and one output. The linear model is:

\hat{y} = wx + b

\hat{y}
the prediction - the hat means "estimated"
w
weight - the slope, how much the prediction changes per unit increase in x
x
the single input feature
b
bias - the y-intercept, shifts the entire line up or down

This is the equation of a line, y = mx + b, from middle school, with m renamed to w. The symbols w and b are the model's parameters - the numbers we will learn from data.

Concrete example: predict house price from size. Suppose w = 0.2 (each extra square foot adds $200, since price is in thousands) and b = 50 (base price $50k). For a 1,500 sq ft house:

\hat{y} = 0.2 \times 1500 + 50 = 300 + 50 = $350\text{k}
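A minimal Python sketch of this prediction (the function name predict_price and the variable names are illustrative, not from any particular library):

# Single-feature linear model: price in $k from size in sq ft
w = 0.2   # weight: 0.2 $k per sq ft, i.e. $200 per extra square foot
b = 50    # bias: base price of $50k

def predict_price(size_sqft):
    return w * size_sqft + b

print(predict_price(1500))   # 350.0 -> $350k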

Why "Weight" and "Bias"?

The term weight captures how influential a feature is. A feature with weight w = 10 is 100 times more influential than one with weight w = 0.1 (assuming the features are on comparable scales). The magnitude tells you relative importance; the sign tells you direction (positive means higher value → higher prediction, negative means higher value → lower prediction).

The bias is what allows the line to not pass through the origin.
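A quick numeric check of both claims, using made-up weights rather than the house example:

w1, w2, b = 10.0, 0.1, 5.0   # illustrative weights and bias

def predict(x1, x2):
    return w1 * x1 + w2 * x2 + b

# With every feature at zero, the prediction is just the bias.
print(predict(0, 0))                   # 5.0
# Raising a feature by one unit moves the prediction by exactly its weight.
print(predict(1, 0) - predict(0, 0))   # 10.0 (the influential feature)
print(predict(0, 1) - predict(0, 0))   # 0.1  (the weak feature)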

Multiple Features

Real problems have many features. For p features x_1, x_2, \ldots, x_p:

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = \mathbf{w} \cdot \mathbf{x} + b

w_j
weight for feature j - how much feature j contributes to the prediction
x_j
value of feature j for this example
p
total number of features
b
bias term

The sum w_1 x_1 + \cdots + w_p x_p is exactly a dot product \mathbf{w} \cdot \mathbf{x}: multiply each pair of matching elements and sum. This notation is compact and computationally efficient.

Three-feature house example with weights w = (0.15, 5, -3) for (size, bedrooms, distance), bias b = 20, and a house with x = (1800, 3, 5):

\hat{y} = 0.15 \times 1800 + 5 \times 3 + (-3) \times 5 + 20 = 270 + 15 - 15 + 20 = $290\text{k}
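The same computation as a dot product in NumPy, a sketch using the illustrative weights above:

import numpy as np

w = np.array([0.15, 5.0, -3.0])   # weights for size, bedrooms, distance
x = np.array([1800, 3, 5])        # one house: 1800 sq ft, 3 bedrooms, 5 miles out
b = 20.0                          # bias in $k

y_hat = np.dot(w, x) + b
print(y_hat)   # 290.0 -> $290k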

The model is still just a weighted sum - it defines a hyperplane in the p-dimensional feature space instead of a line in 2D.

This weighted sum, the dot product plus a bias, is the core of every linear model.

For the Full Dataset

To predict all nn training examples simultaneously, stack inputs as a matrix:

\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b

\mathbf{X}
data matrix, shape n × p
\mathbf{w}
weight vector, shape p × 1
b
bias scalar, broadcast across all n examples
\hat{\mathbf{y}}
prediction vector, shape n × 1

Here, \mathbf{X} is n × p, \mathbf{w} is p × 1, and \hat{\mathbf{y}} is n × 1 - one prediction per example. This single matrix operation replaces a loop over all training examples.
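A sketch of the vectorized version for a tiny made-up dataset (the numbers are arbitrary; in NumPy a 1-D weight array plays the role of the p × 1 column vector):

import numpy as np

# Three houses (n = 3), three features each (p = 3): size, bedrooms, distance
X = np.array([[1800, 3,  5],
              [1200, 2,  1],
              [2500, 4, 10]])       # shape (n, p)
w = np.array([0.15, 5.0, -3.0])     # shape (p,) - one weight per feature
b = 20.0                            # scalar, broadcast across all n rows

y_hat = X @ w + b                   # shape (n,) - one prediction per example
print(y_hat)                        # [290. 207. 385.]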

Interactive example

Adjust weight and bias sliders - watch the regression line move over the scatter plot

Coming soon

Quiz

Question 1 of 3

In the model ŷ = wx + b, what does b (bias) do?