Linear Regression
Lesson 2 ⏱ 12 min

The cost function


The Cost Function: Measuring How Wrong Your Model Is

From residuals to Mean Squared Error - why we square errors, what MSE means geometrically, and the relationship between MSE and RMSE.

⏱ ~7 min

🧮

Quick refresher

Summation notation

Σᵢ₌₁ⁿ xᵢ means 'add up x₁ + x₂ + ... + xₙ.' In ML, we use it to average errors over all n training examples.

Example

Σᵢ₌₁³ (yᵢ - ŷᵢ) with values 2, -1, 3 = 2 + (-1) + 3 = 4.
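The refresher sum can be checked in a couple of lines of plain Python (no libraries needed):

```python
# The sum Σᵢ₌₁³ (yᵢ - ŷᵢ) with residual values 2, -1, 3
residuals = [2, -1, 3]
total = sum(residuals)  # 2 + (-1) + 3
print(total)  # 4
```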

What Makes a Prediction Wrong?

Before training, we need to be precise about what "wrong" means. We need a single number summarizing how bad our predictions are across all training examples. This is the cost function.

The cost function is the only thing your model ever optimizes — it is the single number that defines what "better" means during training. Choose the wrong loss and your model will be mathematically flawless at optimizing something you did not actually want. Understanding cost functions means understanding what you are asking your model to learn.

A good cost function has three properties:

  1. Outputs 0 when all predictions are perfect
  2. Outputs larger values for worse predictions
  3. Is smooth and differentiable everywhere so we can use calculus to minimize it

The design of the loss function is a modeling choice. Different choices lead to different sensitivity to outliers, different computational properties, and different behaviors near the optimum.

The Residual

For one training example, the prediction error is:

e_i = y_i - ŷ_i

  • e_i - residual for example i: actual minus predicted
  • y_i - true label for example i
  • ŷ_i - model prediction for example i

This is the residual: actual minus predicted. It is the vertical distance from the prediction to the true value.

  • Positive residual (e_i > 0): actual value was higher than predicted — we predicted too low
  • Negative residual (e_i < 0): actual value was lower than predicted — we predicted too high
  • Zero residual (e_i = 0): prediction matched reality exactly for this example
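The three cases can be sketched with a tiny helper function (the name `residual` is illustrative, not from the lesson):

```python
def residual(y_true, y_pred):
    """Residual e_i = y_i - ŷ_i: actual minus predicted."""
    return y_true - y_pred

print(residual(5, 3))  # 2  -> positive: we predicted too low
print(residual(3, 5))  # -2 -> negative: we predicted too high
print(residual(4, 4))  # 0  -> exact match
```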

Why Not Just Average the Residuals?

First attempt: L = (1/n) Σᵢ eᵢ = (1/n) Σᵢ (yᵢ - ŷᵢ)

Problem: positive and negative errors cancel. A model always $10 too high has mean residual -10. A model always $10 too low has mean residual +10. But a model that is +$10 on half the examples and -$10 on the other half has mean residual 0 - yet it makes a $10 error on every prediction.

In fact, a model that predicts the mean of y for every example gets mean residual = 0, no matter how scattered the data. This loss function would rate a useless model as perfect.
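The cancellation problem is easy to demonstrate numerically - a quick sketch with made-up dollar values (the data here is purely illustrative):

```python
y_true = [10, 20, 30, 40]

# Model A: always predicts $10 too high -> every residual is -10
pred_high = [y + 10 for y in y_true]
mean_a = sum(t - p for t, p in zip(y_true, pred_high)) / len(y_true)
print(mean_a)  # -10.0

# Model B: $10 too low on half, $10 too high on the other half
pred_mixed = [y_true[0] - 10, y_true[1] - 10, y_true[2] + 10, y_true[3] + 10]
mean_b = sum(t - p for t, p in zip(y_true, pred_mixed)) / len(y_true)
print(mean_b)  # 0.0 -- rated "perfect" despite a $10 error on every example
```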

Why Not Mean Absolute Error?

Better attempt: L = (1/n) Σᵢ |yᵢ - ŷᵢ| (MAE)

This fixes cancellation - errors in both directions count positively. But the absolute value function has a sharp corner at 0 where the derivative is undefined.

This sharp corner is why MSE is preferred over MAE for introductory gradient descent.

MAE is a valid and widely-used loss function in practice - it is more robust to outliers. For learning gradient descent, MSE's clean calculus is an advantage.
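For reference, MAE is only a few lines of Python (the helper name `mae` is an assumption, not a library function):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - ŷ_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae([5, 8, 2], [3, 9, 2]))  # (2 + 1 + 0) / 3 = 1.0
```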

Mean Squared Error (MSE)

The standard choice: square each residual.

L = (1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

  • n - number of training examples
  • yᵢ - true label for example i
  • ŷᵢ - model prediction for example i
  • L - mean squared error: the average squared residual

Why squaring works:

  1. Any squared number is non-negative - errors in both directions count positively. No cancellation.
  2. Large errors are penalized disproportionately. An error of 2 → squared error 4. An error of 10 → squared error 100 (25 times worse, not 5 times). MSE is sensitive to outliers.
  3. The squared function x² is smooth and differentiable everywhere. Calculus works perfectly.

The smooth, bowl-shaped loss surface that squaring produces is what makes gradient descent converge reliably for linear regression.
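The disproportionate penalty from point 2 above can be verified directly:

```python
small_error, large_error = 2, 10
print(large_error / small_error)        # 5.0  -- the error is 5x larger...
print(large_error**2 / small_error**2)  # 25.0 -- ...but the squared penalty is 25x larger
```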

Worked example: three training examples:

  • Example 1: y₁ = 5, ŷ₁ = 3 → residual 2, squared 4
  • Example 2: y₂ = 8, ŷ₂ = 9 → residual -1, squared 1
  • Example 3: y₃ = 2, ŷ₃ = 2 → residual 0, squared 0

L = (4 + 1 + 0) / 3 = 5/3 ≈ 1.67
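The worked example translates directly into code (the helper name `mse` is illustrative):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - ŷ_i)²."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# The three training examples from above
print(mse([5, 8, 2], [3, 9, 2]))  # 5/3 ≈ 1.6667
```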

RMSE: Back to Original Units

One downside of MSE: it is in squared units. If y is in dollars, MSE is in dollars². That is hard to interpret.

Root Mean Squared Error (RMSE) takes the square root to restore original units:

RMSE = √L = √( (1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)² )

  • RMSE - root mean squared error: same units as the target variable y
  • L - the mean squared error

RMSE is what you report to stakeholders: "our model's typical prediction error is $15,000." MSE is what you use for optimization, because derivatives are cleaner without the square root. Minimizing one minimizes the other - they are monotonically related.
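RMSE is just the square root of the MSE computed earlier - a minimal sketch using only the standard library (helper names are assumptions):

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Square root restores the target's original units (e.g. dollars, not dollars²)
    return math.sqrt(mse(y_true, y_pred))

print(rmse([5, 8, 2], [3, 9, 2]))  # sqrt(5/3) ≈ 1.291
```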


Quiz


If y=5 and ŷ=3, the residual is...