The Central Problem: Generalization
You are not training a model to memorize your training data. You are training it to make good predictions on new, unseen data. The gap between training performance and real-world performance is the central challenge of machine learning.
Choosing the right model complexity is the difference between a system that works in production and one that only works on your laptop. Every ML engineer has shipped something that performed well in testing and failed in deployment — this lesson is how you avoid that.
The bias-variance tradeoff is the formal framing of this tension:
- Bias: error from wrong assumptions. A linear model predicting a curved relationship has high bias - it will systematically miss the true pattern regardless of how much data you give it.
- Variance: error from sensitivity to the specific training set. A high-degree polynomial fitted to 20 points will perform differently every time you retrain on a new 20-point sample.
Formally, expected test error decomposes into three terms:

Expected error = Bias² + Variance + Irreducible error

- Bias²: systematic error from wrong model assumptions
- Variance: sensitivity to training data fluctuations
- Irreducible error: noise inherent to the task
You cannot eliminate all three. Decreasing one tends to increase another. Finding the sweet spot is model selection.
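To make the tradeoff concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the synthetic sine-wave task, noise level, and polynomial degrees are illustrative choices, not part of the lesson) that refits a degree-1 and a degree-15 polynomial on many independent 20-point samples and estimates each model's bias² and variance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample_data(n=20):
    # Noisy samples from a curved ground-truth function.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x.reshape(-1, 1), y

x_grid = np.linspace(0, 1, 200).reshape(-1, 1)
y_true = np.sin(2 * np.pi * x_grid).ravel()

for degree in (1, 15):
    # Refit the same model class on 100 independent 20-point training sets.
    preds = []
    for _ in range(100):
        X, y = sample_data()
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_grid))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - y_true) ** 2)  # systematic miss
    variance = np.mean(preds.var(axis=0))                  # retrain-to-retrain spread
    print(f"degree {degree:2d}:  bias^2 = {bias_sq:.3f}   variance = {variance:.3f}")
```

With these settings you should see the bias term dominate for the degree-1 model (it misses the curve the same way on every resample) and the variance term blow up for the degree-15 model (it changes drastically between resamples).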
The Model Complexity Ladder
Think of models along a spectrum from simple to complex:
| Model | Typical Use Case | Data Needed |
|---|---|---|
| Linear / Logistic Regression | Baselines, interpretability required | 100s |
| Decision Trees | Tabular data, non-linear | 100s-1000s |
| Random Forests / XGBoost | Tabular data, strong baseline | 1000s-10k |
| SVMs | Small high-dimensional data | 1000s |
| Shallow NNs | Flexible non-linear patterns | 10k+ |
| Deep NNs / CNNs | Images, audio | 100k+ |
| Transformers | Text, multimodal, large scale | Millions+ |
Start at the simple end of this ladder. A logistic regression baseline takes five minutes to implement and tells you whether the problem is even solvable. If that baseline already hits 95% accuracy, it changes what you do next far more than any architecture choice would.
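Such a baseline really is only a handful of lines. A minimal sketch with scikit-learn, using its built-in breast-cancer dataset purely as a stand-in for your own tabular features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; swap in your own features X and labels y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling + logistic regression: the bar every fancier model has to beat.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("validation accuracy:", round(baseline.score(X_val, y_val), 3))
```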
A Decision Framework
When choosing a model, answer these questions in order:
1. How much labeled data do you have?
- Less than 1000 examples: linear models, simple trees, or transfer learning from pretrained models. Deep learning from scratch will overfit.
- 1k-100k examples: tree ensembles and moderate neural networks are in play.
- More than 100k: full deep learning models become viable and often superior.
2. What is the data modality?
Images, text, audio, and video have specialized architectures that incorporate domain structure (spatial, sequential, etc.). Tabular data with hand-engineered features is tree ensemble territory.
3. What are your inference constraints?
A model that needs to run on an edge device in 5ms has fundamentally different constraints than a cloud API with no latency requirement. Latency, memory, and compute all constrain model size.
4. Do you need interpretability?
In healthcare, finance, and legal applications, you may need to explain individual predictions. Linear models and decision trees are inherently interpretable. Neural networks require post-hoc tools (SHAP, LIME) that are less reliable.
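As a rough illustration of what "inherently interpretable" means in practice (the dataset and the top-5 cutoff are arbitrary choices for this sketch), a linear model's standardized coefficients can be read directly as feature effects:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# With standardized inputs, coefficient magnitude is a rough global
# importance and the sign gives the direction of the effect.
coefs = model.named_steps["logisticregression"].coef_.ravel()
for idx in np.argsort(np.abs(coefs))[::-1][:5]:
    print(f"{data.feature_names[idx]:>25}  {coefs[idx]:+.2f}")
```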
Interactive example (coming soon): a model selector guide - answer questions about your problem and get model recommendations.
Underfitting vs. Overfitting: Diagnosis
After training, your loss curves tell you where you are:
Underfitting (high bias):
- Training loss is high
- Validation loss ≈ training loss
- Fix: use a more complex model, add features, reduce regularization, train longer
Overfitting (high variance):
- Training loss is low
- Validation loss is much higher than training loss
- Fix: get more data, use a simpler model, add regularization, use dropout, use early stopping
Well-fit (goal):
- Training loss is low
- Validation loss ≈ training loss
- The gap between them is small and acceptable
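This triage can be written down directly. A sketch of the same logic with hypothetical thresholds (what counts as a "high" loss or an "acceptable" gap depends entirely on your task and loss scale):

```python
def diagnose(train_loss: float, val_loss: float,
             high_loss: float = 0.5, max_gap: float = 0.1) -> str:
    """Crude fit diagnosis from final training and validation losses."""
    if train_loss > high_loss:
        # Both losses high: the model cannot even fit the training data.
        return "underfitting: add capacity/features, regularize less, train longer"
    if val_loss - train_loss > max_gap:
        # Training loss low but validation much worse: memorizing, not generalizing.
        return "overfitting: more data, simpler model, regularization, early stopping"
    return "well-fit: both losses low and close"

print(diagnose(train_loss=0.62, val_loss=0.65))  # underfitting
print(diagnose(train_loss=0.05, val_loss=0.40))  # overfitting
print(diagnose(train_loss=0.08, val_loss=0.11))  # well-fit
```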
Model Selection in Practice
Always use a held-out validation set that the model never sees during training. For small datasets, use k-fold cross-validation:
- k: number of folds, typically 5 or 10
- L_i: validation loss on fold i
Split the training data into k equal folds. Train on k - 1 of them, validate on the remaining 1. Rotate so every fold serves as the validation set once. Average the k validation losses L_1, ..., L_k. This gives a much more reliable estimate of generalization than a single train/val split.
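A minimal k-fold sketch with scikit-learn (k = 5 here; the model and dataset are stand-ins for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds: train on 4, validate on the held-out 1, rotate, then average.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean:", scores.mean().round(3), " std:", scores.std().round(3))
```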
Never touch the test set until the very end. If you evaluate on the test set while making modeling decisions, you are implicitly optimizing for it and your final test accuracy will be optimistic.
Interactive example (coming soon): a bias-variance visualizer - fit polynomials of different degrees and watch training vs. validation loss.