Framing the Problem
Lesson 4 ⏱ 12 min

Generalization and overfitting

Video coming soon

Generalization: Why Training Accuracy Is Not Enough

Overfitting, underfitting, the bias-variance tradeoff, and why you need a train/validation/test split to measure real performance.

⏱ ~8 min

🧮

Quick refresher

Averages and distributions

The average of n numbers is their sum divided by n. A good model has low average error on examples it has never seen - not just on examples it trained on.

Example

Training errors [0.1, 0.2, 0.0] → mean = 0.1.

Test errors [2.1, 3.5, 1.8] → mean = 2.47.

The model memorized training data but fails on new examples.
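The refresher's comparison is a one-line computation. A minimal sketch, using the error values from the example above:

```python
# Mean error on training vs. test examples (values from the refresher above).
train_errors = [0.1, 0.2, 0.0]
test_errors = [2.1, 3.5, 1.8]

train_mean = sum(train_errors) / len(train_errors)
test_mean = sum(test_errors) / len(test_errors)

print(round(train_mean, 2))  # 0.1
print(round(test_mean, 2))   # 2.47
```

The large gap between the two means is the signature of memorization rather than learning.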

The Trap of Training Accuracy

Imagine two students preparing for an exam:

  1. One deeply understands the subject and can reason about new questions
  2. One memorizes the practice exam answers word for word

Both score 100% on the practice exam. But on the real exam - with different questions - the memorizer fails. They learned the answers, not the subject.

This is exactly the ML problem. A model that memorizes training examples does great on training data but fails on new data. And new data is the whole point - you are not building a model to predict things you already know.

The actual goal of ML is generalization: performing well on data the model has never seen.

Overfitting

Overfitting is when a model learns the training data too well - memorizing noise instead of learning the underlying pattern.

Signs of overfitting:

  • Training accuracy is very high (99%)
  • Test accuracy is much lower (65%)
  • The gap between training and test performance is large and growing

A 100-degree polynomial will fit any 100 points perfectly. But you would not trust it to predict the 101st point - it memorized rather than generalized.

Overfitting tends to happen when the model has too many parameters relative to training examples, training runs too long, or there is no regularization.
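The polynomial example above can be sketched with NumPy's `polyfit`. This is an illustrative setup (the data and degree are not from the lesson): a degree-9 polynomial threaded through 10 noisy points nails the training set but misses fresh points from the same underlying line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth is a simple line; the noise is irreducible.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=10)

# A degree-9 polynomial through 10 points fits them (almost) exactly...
coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ...but it has memorized the noise, so it misses new points on the line.
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(train_err < test_err)  # True: near-zero training error, worse test error
```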

Underfitting

Underfitting is the opposite: the model is too simple.

Signs of underfitting:

  • Both training AND test performance are poor — the model lacks the capacity to fit the data at all
  • The gap between training and test is small (the model is consistently bad everywhere)

A straight line fit to curved data will underfit. Fix: more model capacity, more features, or a different architecture.
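The straight-line example can be sketched the same way (synthetic, illustrative data): a degree-1 fit to quadratic data is consistently bad, while one extra degree of capacity fixes it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Curved (quadratic) ground truth with a little noise.
x = np.linspace(-1, 1, 50)
y = x ** 2 + rng.normal(0, 0.05, size=50)

# Degree-1 (a straight line) underfits: it cannot represent the curve.
line = np.polyfit(x, y, deg=1)
line_err = np.mean((np.polyval(line, x) - y) ** 2)

# Degree-2 has just enough capacity for the underlying pattern.
quad = np.polyfit(x, y, deg=2)
quad_err = np.mean((np.polyval(quad, x) - y) ** 2)

print(line_err > quad_err)  # True: the line is worse even on its own training data
```

Note the underfitting signature: the line's error is high on the data it trained on, not just on held-out data.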

The Bias-Variance Tradeoff

Every model lives somewhere on a spectrum between two failure modes:

High bias (underfitting): makes strong wrong assumptions. Gets the broad strokes wrong consistently. Even with more training data, the error stays high because the model cannot represent the truth.

High variance (overfitting): fits training data tightly but is very sensitive to which specific examples were in the training set. Use a different training set and you get a very different model.

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

  • $\text{Bias}^2$ — squared bias: systematic error from wrong model assumptions
  • $\text{Variance}$ — sensitivity to training set fluctuations
  • $\text{Irreducible Noise}$ — inherent randomness in the data, unavoidable

The irreducible noise is the floor - even a perfect model cannot eliminate genuine randomness in the world. We can only minimize the other two terms, and they trade off against each other.

Adding model complexity reduces bias but increases variance. Adding regularization reduces variance at some cost to bias. The optimal model balances both.
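The "regularization reduces variance" direction of the tradeoff can be seen in a closed-form ridge regression sketch (the data here is synthetic and illustrative): a larger penalty shrinks the weights toward zero, making the fit less sensitive to the particular training sample at the cost of some bias.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: only the first of five features actually matters.
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.5, size=30)

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, 0.01)   # weak regularization: low bias, high variance
w_large = ridge(X, y, 100.0)  # strong regularization: shrunken, more biased weights

print(np.linalg.norm(w_large) < np.linalg.norm(w_small))  # True
```

Shrinking the weight norm is exactly the variance-for-bias trade: the heavily regularized model changes less between training sets, but it also underestimates the true coefficient.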

This is why simply making the model bigger does not always help.

Train / Validation / Test Splits

Different stages of development need different data:

Training set: used to compute gradients and update parameters. The model sees this data many times. It can in principle memorize it.

Validation set (dev set): used to make decisions - tune hyperparameters (learning rate, architecture depth, regularization strength), decide when to stop training. The model does not train on this directly, but your decisions are guided by it. Slightly "contaminated" by engineering choices.

Test set: the final, honest estimate of performance. Used exactly once, at the very end, after all decisions are made.

Typical splits: 80/10/10 or 70/15/15 for large datasets. For small datasets, k-fold cross-validation uses data more efficiently. Split all examples into $k$ equal groups (folds), then train $k$ times — each time using one different fold as validation and the remaining $k - 1$ folds as training. Average the $k$ validation scores for a stable performance estimate. Common choices are $k = 5$ or $k = 10$.
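The k-fold procedure can be sketched generically. In this sketch, `fit` and `score` are hypothetical caller-supplied callables (any model with that shape works); the names are illustrative, not from a particular library.

```python
import numpy as np

def k_fold_scores(X, y, k, fit, score):
    """k-fold cross-validation: train k times, each time holding out a
    different fold for validation, then average the k validation scores.
    `fit(X, y) -> model` and `score(model, X, y) -> float` are hypothetical
    caller-supplied callables."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Toy usage: a mean-predictor "model" on constant targets scores a perfect 0 MSE.
X = np.arange(10, dtype=float)
y = 3.0 * np.ones(10)
fit = lambda Xt, yt: float(np.mean(yt))
score = lambda m, Xv, yv: float(np.mean((yv - m) ** 2))
print(k_fold_scores(X, y, 5, fit, score))  # 0.0
```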

The Generalization Gap

$$\text{Generalization Gap} = \mathcal{L}_{\text{test}} - \mathcal{L}_{\text{train}}$$

  • $\mathcal{L}_{\text{test}}$ — average loss on the held-out test set
  • $\mathcal{L}_{\text{train}}$ — average loss on the training set
  • the gap — how much worse the model is on new data

A small gap means the model learned something general. A large gap means overfitting. A gap near zero with high loss on both sets means underfitting.

Plotting both training loss and validation loss against epochs is one of your primary diagnostic tools. The point where they diverge - validation loss starts rising while training loss keeps falling - is where overfitting begins and where early stopping should trigger.
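The early-stopping rule reads directly off that plot. A minimal sketch (the helper name and the loss values are illustrative, not from a real training run): keep the epoch with the best validation loss, and halt once it has failed to improve for a few epochs in a row.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, halting once the loss
    has failed to improve for `patience` consecutive epochs (a hypothetical
    helper sketching the early-stopping rule)."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss has been rising: overfitting has begun
    return best_epoch

# Validation loss falls, then rises - stop at its minimum (epoch 3).
print(early_stop_epoch([1.0, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6]))  # 3
```

In practice you would also checkpoint the model weights at `best_epoch` rather than keeping the final, overfit ones.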

Interactive example

Adjust model complexity and training epochs - watch the bias-variance tradeoff in real time

Coming soon

Quiz

1 / 3

A model achieves 99% accuracy on training data but only 55% on test data. This indicates...