Linear Regression
Lesson 1 ⏱ 10 min

The model: ŷ = wx + b


The Linear Model: Lines, Weights, and Predictions

Building intuition for ŷ = wx + b from a single feature up to the full matrix form ŷ = Xw + b, with a concrete house-price running example.

⏱ ~6 min

🧮 Quick refresher

Dot product

The dot product of two vectors w and x is Σᵢ wᵢxᵢ — element-wise multiplication then sum. The prediction in linear models is exactly a dot product.

Example

For w=[2,3] and x=[4,5], w·x = 2·4 + 3·5 = 8+15 = 23.
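The same computation in NumPy, as a quick sketch (assumes NumPy is installed; nothing here is specific to this lesson's code):

import numpy as np

w = np.array([2, 3])
x = np.array([4, 5])

# Element-wise multiply then sum, and NumPy's built-in dot product, agree.
print((w * x).sum())   # 23
print(np.dot(w, x))    # 23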

The Simplest Useful Model

Linear regression is the foundation of predictive modeling. It is not the most powerful tool - neural networks learn far more complex relationships - but it is the bedrock that everything else builds on. And it remains genuinely useful: housing price prediction, demand forecasting, risk scoring.

The core assumption: the relationship between input features x and output y is approximately linear. Add more square footage to a house and the price rises by a roughly constant amount per extra square foot. Many real relationships, especially over small ranges, are close enough to linear to make this practical.

The Single-Feature Model

Start with one input and one output. The linear model is:

\hat{y} = wx + b

\hat{y}
the prediction - the hat means "estimated"
w
weight - the slope, how much the prediction changes per unit increase in x
x
the single input feature
b
bias - the y-intercept, shifts the entire line up or down

This is the equation of a line, y = mx + b, from middle school, with m renamed to w. The symbols w and b are the model's parameters - the numbers we will learn from data.

Concrete example: predict house price from size. Suppose w = 0.2 (each extra square foot adds $200, since price is in thousands) and b = 50 (base price $50k). For a 1,500 sq ft house:

\hat{y} = 0.2 \times 1500 + 50 = 300 + 50 = $350\text{k}
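A minimal Python sketch of this prediction (the function name predict_price and the variable names are illustrative, not from any particular library):

# Single-feature linear model: price in $k from size in sq ft
w = 0.2   # weight: 0.2 $k per sq ft, i.e. $200 per extra square foot
b = 50    # bias: base price of $50k

def predict_price(size_sqft):
    return w * size_sqft + b

print(predict_price(1500))   # 350.0 -> $350k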

Why "Weight" and "Bias"?

The term weight captures how influential a feature is. A feature with weight w = 10 is 100 times more influential than one with weight w = 0.1 (assuming the features are on comparable scales). The magnitude tells you relative importance; the sign tells you direction (positive means higher value → higher prediction, negative means higher value → lower prediction).

The bias is what allows the line to not pass through the origin.
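A quick numeric check of both claims, using made-up weights rather than the house example:

w1, w2, b = 10.0, 0.1, 5.0   # illustrative weights and bias

def predict(x1, x2):
    return w1 * x1 + w2 * x2 + b

# With every feature at zero, the prediction is just the bias.
print(predict(0, 0))                   # 5.0
# Raising a feature by one unit moves the prediction by exactly its weight.
print(predict(1, 0) - predict(0, 0))   # 10.0 (the influential feature)
print(predict(0, 1) - predict(0, 0))   # 0.1  (the weak feature)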

Multiple Features

Real problems have many features. For p features x_1, x_2, \ldots, x_p:

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = \mathbf{w} \cdot \mathbf{x} + b

w_j
weight for feature j - how much feature j contributes to the prediction
x_j
value of feature j for this example
p
total number of features
b
bias term

The sum w_1 x_1 + \cdots + w_p x_p is exactly a dot product \mathbf{w} \cdot \mathbf{x}: multiply each pair of matching elements and sum. This notation is compact and computationally efficient.

Three-feature house example with weights w = (0.15, 5, -3) for (size, bedrooms, distance), bias b = 20, and a house with x = (1800, 3, 5):

\hat{y} = 0.15 \times 1800 + 5 \times 3 + (-3) \times 5 + 20 = 270 + 15 - 15 + 20 = $290\text{k}
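The same computation as a dot product in NumPy, a sketch using the illustrative weights above:

import numpy as np

w = np.array([0.15, 5.0, -3.0])   # weights for size, bedrooms, distance
x = np.array([1800, 3, 5])        # one house: 1800 sq ft, 3 bedrooms, 5 miles out
b = 20.0                          # bias in $k

y_hat = np.dot(w, x) + b
print(y_hat)   # 290.0 -> $290k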

The model is still just a weighted sum - it defines a hyperplane in the p-dimensional feature space instead of a line in 2D.

This weighted sum, the dot product plus a bias, is the core of every linear model.

For the Full Dataset

To predict all nn training examples simultaneously, stack inputs as a matrix:

\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b

\mathbf{X}
data matrix, shape n × p
\mathbf{w}
weight vector, shape p × 1
b
bias scalar, broadcast across all n examples
\hat{\mathbf{y}}
prediction vector, shape n × 1

Here, \mathbf{X} is n × p, \mathbf{w} is p × 1, and \hat{\mathbf{y}} is n × 1 - one prediction per example. This single matrix operation replaces a loop over all training examples.
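A sketch of the vectorized version for a tiny made-up dataset (the numbers are arbitrary; in NumPy a 1-D weight array plays the role of the p × 1 column vector):

import numpy as np

# Three houses (n = 3), three features each (p = 3): size, bedrooms, distance
X = np.array([[1800, 3,  5],
              [1200, 2,  1],
              [2500, 4, 10]])       # shape (n, p)
w = np.array([0.15, 5.0, -3.0])     # shape (p,) - one weight per feature
b = 20.0                            # scalar, broadcast across all n rows

y_hat = X @ w + b                   # shape (n,) - one prediction per example
print(y_hat)                        # [290. 207. 385.]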

Interactive example

Adjust weight and bias sliders - watch the regression line move over the scatter plot

Coming soon

Quiz

Question 1 of 3

In the model ŷ = wx + b, what does b (bias) do?