Regularization
Lesson 5 ⏱ 10 min

Train/val/test splits


Train, Validation, Test - Using Data Honestly

Why the three-way split exists, how data leakage silently poisons results, and when to use k-fold cross-validation.


Quick refresher

Overfitting to the test set

Every time you check the test set and make a decision based on what you see, you implicitly select models that do well on those specific examples. Over many decisions, the test set becomes part of your training signal.

Example

If you train 50 models and pick the one with the highest test accuracy, you have effectively optimized for the test set.

Your reported test accuracy is overoptimistic - the model is not that good in the real world.

You have trained a model. It performs well on training data. But does it generalize? The only honest way to find out is to evaluate it on data it has never seen. But that immediately raises a practical question: if you use the test data to make decisions during development, it is no longer "unseen" - you have leaked information from it into your development process.

Getting this right separates reliable ML from models that look good in demos and fail in production.

The Three-Way Split

A clean ML workflow uses three separate data splits:

The training set: the data used to learn model parameters. Gradient descent happens here.

The validation set (also called the dev set): used to make decisions during development - which architecture, which hyperparameters, when to stop training. You look at validation performance to guide choices.

The test set: used exactly once, at the very end, to report the final unbiased estimate of model performance. You do not make any decisions based on the test set.

The roles are strict. If you find yourself thinking "my test accuracy is lower than I expected, let me try a different architecture" - you have just used the test set to make a development decision. It is now a second validation set. Your reported test performance is no longer trustworthy.
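
As a concrete sketch of the mechanics, one common way to produce the three splits is two calls to scikit-learn's train_test_split; the 60/20/20 ratio and the placeholder X, y arrays below are illustrative assumptions, not part of the lesson:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real dataset.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First carve off the held-out test set (20%).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remainder into train (60% overall) and validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42  # 0.25 * 0.8 = 0.2
)

# Fit on X_train, tune on X_val, and touch X_test exactly once at the very end.
```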

Data Leakage: The Silent Problem

Data leakage is when information from the validation or test set accidentally influences training. Common forms:

  • Normalization leakage: computing mean/std on the full dataset (including test) and normalizing all splits with those statistics. Fix: compute statistics on the training set only, then apply them to val/test (see the sketch below).
  • Feature engineering leakage: a feature computed using information unavailable at prediction time. Example: predicting whether a loan defaults using transactions made after the loan was issued.
  • Target leakage: a feature that is essentially a proxy for the label. Example: predicting hospital readmission using "number of discharge medications" - determined after the hospital stay.
  • Augmentation leakage: augmenting the validation or test set, which inflates apparent performance.

Leakage is insidious because it silently makes your model look better. You only discover it when the model fails in deployment.
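
For the normalization case in the list above, here is a minimal sketch of the leakage-free pattern, assuming scikit-learn's StandardScaler and placeholder arrays in place of real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder splits standing in for real train/val/test arrays.
X_train = np.random.rand(600, 10)
X_val = np.random.rand(200, 10)
X_test = np.random.rand(200, 10)

scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)  # statistics (mean/std) come from train only
X_val_norm = scaler.transform(X_val)          # reuse the training statistics, do not refit
X_test_norm = scaler.transform(X_test)        # same for the test set
```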

K-Fold Cross-Validation

When your dataset is small, a simple 80/10/10 split means your validation set might have only 100 examples - too noisy to trust for hyperparameter decisions. K-fold cross-validation solves this:

  1. Split the data into $k$ equal folds ($k=5$ or $k=10$ are common)
  2. For each fold $i$: train on all folds except fold $i$, validate on fold $i$
  3. Repeat $k$ times (each fold is the validation set exactly once)
  4. Average the $k$ validation scores
$$\text{CV score} = \frac{1}{k} \sum_{i=1}^{k} \text{score}_i$$

where $k$ is the number of folds (typically 5 or 10) and $\text{score}_i$ is the validation score when fold $i$ is the validation set.

Every example appears in the validation set exactly once. The result is a much more reliable performance estimate.

Use k-fold when: small datasets (hundreds to tens of thousands of examples) where simple holdout wastes too much data. Cost: k full training runs.

Use simple holdout when: large datasets (100K+ examples) where a 10-20% validation split is already thousands of examples. The estimate is reliable and k-fold's extra compute is not worth it.
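
To make the loop concrete, here is a minimal sketch of 5-fold cross-validation, assuming scikit-learn's KFold and a logistic regression as a stand-in model on placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data standing in for a small real dataset.
X = np.random.rand(500, 8)
y = np.random.randint(0, 2, size=500)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])               # train on the other k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

cv_score = np.mean(scores)  # average of the k validation scores
print(f"CV score: {cv_score:.3f}")
```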

Interactive example

K-fold visualization - see how each fold rotates as the validation set

Coming soon

Stratified Splits

If your dataset has class imbalance - say 95% class 0 and 5% class 1 - a random split might produce a validation set with zero examples of class 1 purely by chance.

A stratified split ensures each split contains the same class proportions as the full dataset. If the full dataset is 95/5, then training, validation, and test are also 95/5. Always use stratified splits for classification with imbalanced classes.
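
A minimal sketch of a stratified split, assuming scikit-learn's train_test_split with its stratify argument and a toy 95/5 label array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 95% class 0, 5% class 1.
X = np.random.rand(1000, 10)
y = np.array([0] * 950 + [1] * 50)

# stratify=y keeps the 95/5 proportion in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_val.mean())  # both close to 0.05
```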

Reading the Validation Loss Curve

Plot training loss and validation loss against epoch number. This plot tells you almost everything about your model's state.

The point where validation loss stops decreasing (before it starts rising) is your optimal stopping point. Early stopping automates this: stop training when validation loss has not improved for $N$ consecutive epochs.
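
Here is a minimal sketch of the early-stopping logic itself; the validation losses are simulated with a toy curve (dip, then rise) rather than produced by a real model, so only the patience bookkeeping is the point:

```python
import numpy as np

patience = 5                     # N: epochs to wait without improvement
best_val_loss = float("inf")
best_epoch = 0
epochs_without_improvement = 0

# Simulated validation losses: a curve that dips and then rises again.
# In practice each value would come from evaluating on the validation set
# after one epoch of training.
epochs = np.arange(100)
val_losses = 0.5 + 0.3 * (epochs / 40 - 1) ** 2 + 0.01 * np.random.randn(100)

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        epochs_without_improvement = 0   # improvement: reset the counter (checkpoint here)
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                        # validation loss has stopped improving

print(f"Stopped at epoch {epoch}, best epoch was {best_epoch}")
```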

Interactive example

Early stopping demo - watch validation loss curve and see optimal stopping point

Coming soon
