What Makes a Prediction Wrong?
Before training, we need to be precise about what "wrong" means. We need a single number summarizing how bad our predictions are across all training examples. This is the cost function.
The cost function is the only thing your model ever optimizes — it is the single number that defines what "better" means during training. Choose the wrong loss and your model will be mathematically flawless at optimizing something you did not actually want. Understanding cost functions means understanding what you are asking your model to learn.
A good cost function has three properties:
- Outputs zero when all predictions are perfect
- Outputs larger values for worse predictions
- Is smooth and differentiable everywhere so we can use calculus to minimize it
The design of the loss function is a modeling choice. Different choices lead to different sensitivity to outliers, different computational properties, and different behaviors near the optimum.
The Residual
For one training example, the prediction error is:

$$e_i = y_i - \hat{y}_i$$

where:

- $e_i$ - residual for example i
- $y_i$ - true label for example i
- $\hat{y}_i$ - model prediction for example i

This is the residual: actual minus predicted. It is the vertical distance from the prediction to the true value.
- Positive residual (e_i > 0): actual value was higher than predicted — we predicted too low
- Negative residual (e_i < 0): actual value was lower than predicted — we predicted too high
- Zero residual (e_i = 0): prediction matched reality exactly for this example
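
As a concrete sketch (using NumPy and made-up labels and predictions, not values from this lesson), the residuals are just element-wise subtraction:

```python
import numpy as np

# Hypothetical labels and predictions, chosen only for illustration.
y_true = np.array([200.0, 150.0, 310.0])   # actual values
y_pred = np.array([190.0, 160.0, 310.0])   # model predictions

residuals = y_true - y_pred  # actual minus predicted
print(residuals)  # [ 10. -10.   0.] -> predicted too low, too high, exact
```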
Why Not Just Average the Residuals?
First attempt: the mean residual.

$$\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)$$
Problem: positive and negative errors cancel. A model that always predicts $10 too high has mean residual -$10. A model that always predicts $10 too low has mean residual +$10. But a model that is +$10 on half the examples and -$10 on the other half has mean residual = 0 - yet it makes $10 errors on every prediction.
In fact, a model that predicts the mean of $y$ for every example gets mean residual = 0, no matter how scattered the data. This loss function would rate a useless model as perfect.
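
A quick sketch of the cancellation problem (hypothetical numbers, assuming NumPy):

```python
import numpy as np

y_true = np.array([100.0, 100.0, 100.0, 100.0])
y_pred = np.array([110.0,  90.0, 110.0,  90.0])  # alternating $10-too-high / $10-too-low

residuals = y_true - y_pred
print(residuals.mean())          # 0.0  -- looks "perfect"
print(np.abs(residuals).mean())  # 10.0 -- the real typical error size
```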
Why Not Mean Absolute Error?
Better attempt: Mean Absolute Error (MAE).

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
This fixes cancellation - errors in both directions count positively. But the absolute value function has a sharp corner at 0 where the derivative is undefined.
That sharp corner is why MSE is preferred over MAE for introductory gradient descent.
MAE is a valid and widely-used loss function in practice - it is more robust to outliers. For learning gradient descent, MSE's clean calculus is an advantage.
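
A small comparison sketch (hypothetical data, assuming NumPy): MAE counts every error at face value, while MSE is dominated by a single large miss.

```python
import numpy as np

y_true = np.array([100.0, 100.0, 100.0, 100.0])
y_pred = np.array([ 98.0, 102.0,  99.0,  50.0])  # the last prediction is an outlier miss

errors = y_true - y_pred
mae = np.abs(errors).mean()   # (2 + 2 + 1 + 50) / 4 = 13.75
mse = (errors ** 2).mean()    # (4 + 4 + 1 + 2500) / 4 = 627.25
print(mae, mse)               # the one outlier dominates MSE far more than MAE
```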
Mean Squared Error (MSE)
The standard choice: square each residual.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where:

- $n$ - number of training examples
- $y_i$ - true label for example i
- $\hat{y}_i$ - model prediction for example i
- $\text{MSE}$ - mean squared error: the average squared residual
Why squaring works:
- Any squared number is non-negative - errors in both directions count positively. No cancellation.
- Large errors are penalized disproportionately. An error of 2 → squared error 4. An error of 10 → squared error 100 (25 times worse, not 5 times). MSE is sensitive to outliers.
- The squared function is smooth and differentiable everywhere. Calculus works perfectly.
The smooth, bowl-shaped (convex) cost surface that comes from squaring is what makes gradient descent converge reliably for linear regression.
Worked example: three training examples with residuals 2, -1, and 0:
- Example 1: residual 2 → squared error 4
- Example 2: residual -1 → squared error 1
- Example 3: residual 0 → squared error 0

MSE = (4 + 1 + 0) / 3 = 5/3 ≈ 1.67
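
The same computation as a sketch in code (the y and ŷ values here are hypothetical, chosen only so the residuals come out to 2, -1, and 0):

```python
import numpy as np

# Hypothetical values whose residuals match the worked example: 2, -1, 0.
y_true = np.array([5.0, 2.0, 4.0])
y_pred = np.array([3.0, 3.0, 4.0])

residuals = y_true - y_pred   # [ 2. -1.  0.]
squared = residuals ** 2      # [ 4.  1.  0.]
mse = squared.mean()          # (4 + 1 + 0) / 3
print(mse)                    # 1.666...
```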
RMSE: Back to Original Units
One downside of MSE: it is in squared units. If $y$ is in dollars, MSE is in dollars². That is hard to interpret.
Root Mean Squared Error (RMSE) takes the square root to restore original units:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where:

- $\text{RMSE}$ - root mean squared error: same units as the target variable $y$
- $\text{MSE}$ - the mean squared error
RMSE is what you report to stakeholders: "our model's typical prediction error is $15,000." MSE is what you use for optimization, because derivatives are cleaner without the square root. Minimizing one minimizes the other - they are monotonically related.
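
A short sketch (hypothetical dollar-valued data, assuming NumPy) showing that RMSE is just the square root of MSE and lands back in the target's original units:

```python
import numpy as np

y_true = np.array([250_000.0, 310_000.0, 180_000.0])  # e.g. prices in dollars
y_pred = np.array([240_000.0, 330_000.0, 175_000.0])

mse = ((y_true - y_pred) ** 2).mean()  # in dollars^2 -- hard to interpret
rmse = np.sqrt(mse)                    # back to dollars
print(mse, rmse)                       # rmse ≈ 13,229: a "typical" error size
```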
Interactive example
See how residuals contribute to MSE - move points and watch the squared errors update
Coming soon