Framing the Problem
Lesson 4 ⏱ 12 min

Generalization and overfitting

Video coming soon

Generalization: Why Training Accuracy Is Not Enough

Overfitting, underfitting, the bias-variance tradeoff, and why you need a train/validation/test split to measure real performance.

⏱ ~8 min

🧮

Quick refresher

Averages and distributions

The average of n numbers is their sum divided by n. A good model has low average error on examples it has never seen - not just on examples it trained on.

Example

Training errors [0.1, 0.2, 0.0] → mean = 0.1.

Test errors [2.1, 3.5, 1.8] → mean = 2.47.

The model memorized training data but fails on new examples.
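The refresher's comparison is a one-line computation. A minimal sketch, using the error values from the example above:

```python
# Mean error on training vs. test examples (values from the refresher above).
train_errors = [0.1, 0.2, 0.0]
test_errors = [2.1, 3.5, 1.8]

train_mean = sum(train_errors) / len(train_errors)
test_mean = sum(test_errors) / len(test_errors)

print(round(train_mean, 2))  # 0.1
print(round(test_mean, 2))   # 2.47
```

The large gap between the two means is the signature of memorization rather than learning.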

The Trap of Training Accuracy

Imagine two students preparing for an exam:

  1. One deeply understands the subject and can reason about new questions
  2. One memorizes the practice exam answers word for word

Both score 100% on the practice exam. But on the real exam - with different questions - the memorizer fails. They learned the answers, not the subject.

This is exactly the ML problem. A model that memorizes training examples does great on training data but fails on new data. And new data is the whole point - you are not building a model to predict things you already know.

The actual goal of ML is generalization: performing well on data the model has never seen.

Overfitting

Overfitting is when a model learns the training data too well - memorizing noise instead of learning the underlying pattern.

Signs of overfitting:

  • Training accuracy is very high (99%)
  • Test accuracy is much lower (65%)
  • The gap between training and test performance is large and growing

A 100-degree polynomial will fit any 100 points perfectly. But you would not trust it to predict the 101st point - it memorized rather than generalized.

Overfitting tends to happen when the model has too many parameters relative to training examples, training runs too long, or there is no regularization.
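The polynomial example above can be sketched with NumPy's `polyfit`. This is an illustrative setup (the data and degree are not from the lesson): a degree-9 polynomial threaded through 10 noisy points nails the training set but misses fresh points from the same underlying line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth is a simple line; the noise is irreducible.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=10)

# A degree-9 polynomial through 10 points fits them (almost) exactly...
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ...but it has memorized the noise, so it misses new points on the line.
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_err < test_err)  # True: near-zero training error, worse test error
```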

Underfitting

Underfitting is the opposite: the model is too simple.

Signs of underfitting:

  • Both training AND test performance are poor — the model lacks the capacity to fit the data at all
  • The gap between training and test is small (the model is consistently bad everywhere)

A straight line fit to curved data will underfit. Fix: more model capacity, more features, or a different architecture.
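The straight-line example can be sketched the same way (synthetic, illustrative data): a degree-1 fit to quadratic data is consistently bad, while one extra degree of capacity fixes it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Curved (quadratic) ground truth with a little noise.
x = np.linspace(-1, 1, 50)
y = x ** 2 + rng.normal(0, 0.05, size=50)

# Degree-1 (a straight line) underfits: it cannot represent the curve.
line = np.polyfit(x, y, deg=1)
line_err = np.mean((np.polyval(line, x) - y) ** 2)

# Degree-2 has just enough capacity for the underlying pattern.
quad = np.polyfit(x, y, deg=2)
quad_err = np.mean((np.polyval(quad, x) - y) ** 2)

print(line_err > quad_err)  # True: the line is worse even on its own training data
```

Note the underfitting signature: the line's error is high on the data it trained on, not just on held-out data.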

The Bias-Variance Tradeoff

Every model lives somewhere on a spectrum between two failure modes:

High bias (underfitting): makes strong wrong assumptions. Gets the broad strokes wrong consistently. Even with more training data, the error stays high because the model cannot represent the truth.

High variance (overfitting): fits training data tightly but is very sensitive to which specific examples were in the training set. Use a different training set and you get a very different model.

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

  • $\text{Bias}^2$ — squared bias: systematic error from wrong model assumptions
  • $\text{Variance}$ — sensitivity to training set fluctuations
  • $\text{Irreducible Noise}$ — inherent randomness in the data, unavoidable

The irreducible noise is the floor - even a perfect model cannot eliminate genuine randomness in the world. We can only minimize the other two terms, and they trade off against each other.

Adding model complexity reduces bias but increases variance. Adding regularization reduces variance at some cost to bias. The optimal model balances both.
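The "regularization reduces variance" direction of the tradeoff can be seen in a closed-form ridge regression sketch (the data here is synthetic and illustrative): a larger penalty shrinks the weights toward zero, making the fit less sensitive to the particular training sample at the cost of some bias.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: only the first of five features actually matters.
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, size=30)

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, 0.01)   # weak regularization: low bias, high variance
w_large = ridge(X, y, 100.0)  # strong regularization: shrunken, more biased weights

print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # True
```

Shrinking the weight norm is exactly the variance-for-bias trade: the heavily regularized model changes less between training sets, but it also underestimates the true coefficient.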

This is why simply making the model bigger does not always help.

Train / Validation / Test Splits

Different stages of development need different data:

Training set: used to compute gradients and update parameters. The model sees this data many times. It can in principle memorize it.

Validation set (dev set): used to make decisions - tune hyperparameters (learning rate, architecture depth, regularization strength), decide when to stop training. The model does not train on this directly, but your decisions are guided by it. Slightly "contaminated" by engineering choices.

Test set: the final, honest estimate of performance. Used exactly once, at the very end, after all decisions are made.

Typical splits: 80/10/10 or 70/15/15 for large datasets. For small datasets, k-fold cross-validation uses data more efficiently. Split all examples into $k$ equal groups (folds), then train $k$ times — each time using one different fold as validation and the remaining $k - 1$ folds as training. Average the $k$ validation scores for a stable performance estimate. Common choices are $k = 5$ or $k = 10$.
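The k-fold procedure can be sketched generically. In this sketch, `fit` and `score` are hypothetical caller-supplied callables (any model with that shape works); the names are illustrative, not from a particular library.

```python
import numpy as np

def k_fold_scores(X, y, k, fit, score):
    """k-fold cross-validation: train k times, each time holding out a
    different fold for validation, then average the k validation scores.
    `fit(X, y) -> model` and `score(model, X, y) -> float` are hypothetical
    caller-supplied callables."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Toy usage: a mean-predictor "model" on constant targets scores a perfect 0 MSE.
X = np.arange(10, dtype=float)
y = 3.0 * np.ones(10)
fit = lambda Xt, yt: float(np.mean(yt))
score = lambda m, Xv, yv: float(np.mean((yv - m) ** 2))
print(k_fold_scores(X, y, 5, fit, score))  # 0.0
```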

The Generalization Gap

$$\text{Generalization Gap} = \mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$$

  • $\mathcal{L}_{\text{test}}$ — average loss on the held-out test set
  • $\mathcal{L}_{\text{train}}$ — average loss on the training set
  • the gap — how much worse the model is on new data

A small gap means the model learned something general. A large gap means overfitting. A gap near zero with high loss on both sets means underfitting.

Plotting both training loss and validation loss against epochs is one of your primary diagnostic tools. The point where they diverge - validation loss starts rising while training loss keeps falling - is where overfitting begins and where early stopping should trigger.
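The early-stopping rule reads directly off that plot. A minimal sketch (the helper name and the loss values are illustrative, not from a real training run): keep the epoch with the best validation loss, and halt once it has failed to improve for a few epochs in a row.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, halting once the loss
    has failed to improve for `patience` consecutive epochs (a hypothetical
    helper sketching the early-stopping rule)."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has been rising: overfitting has begun
    return best_epoch

# Validation loss falls, then rises - stop at its minimum (epoch 3).
print(early_stop_epoch([1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]))  # 3
```

In practice you would also checkpoint the model weights at `best_epoch` rather than keeping the final, overfit ones.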

Interactive example

Adjust model complexity and training epochs - watch the bias-variance tradeoff in real time

Coming soon

Quiz

1 / 3

A model achieves 99% accuracy on training data but only 55% on test data. This indicates...