The Central Problem: Generalization
You are not training a model to memorize your training data. You are training it to make good predictions on new, unseen data. The gap between training performance and real-world performance is the central challenge of machine learning.
Choosing the right model complexity is the difference between a system that works in production and one that only works on your laptop. Every ML engineer has shipped something that performed well in testing and failed in deployment — this lesson is how you avoid that.
The bias-variance tradeoff is the formal framing of this tension:
- Bias: error from wrong assumptions. A linear model predicting a curved relationship has high bias - it will systematically miss the true pattern regardless of how much data you give it.
- Variance: error from sensitivity to the specific training set. A high-degree polynomial fitted to 20 points will perform differently every time you retrain on a new 20-point sample.
Formally, expected test error decomposes into three terms:

Expected error = Bias² + Variance + Irreducible error

- Bias²: systematic error from wrong model assumptions
- Variance: sensitivity to training data fluctuations
- Irreducible error: noise inherent to the task
You cannot eliminate all three. Decreasing one tends to increase another. Finding the sweet spot is model selection.
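To make the tradeoff concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the synthetic sine-wave task, noise level, and polynomial degrees are illustrative choices, not part of the lesson) that refits a degree-1 and a degree-15 polynomial on many independent 20-point samples and estimates each model's bias² and variance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample_data(n=20):
    # Noisy samples from a curved ground-truth function.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x.reshape(-1, 1), y

x_grid = np.linspace(0, 1, 200).reshape(-1, 1)
y_true = np.sin(2 * np.pi * x_grid).ravel()

for degree in (1, 15):
    # Refit the same model class on 100 independent 20-point training sets.
    preds = []
    for _ in range(100):
        X, y = sample_data()
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_grid))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - y_true) ** 2)  # systematic miss
    variance = np.mean(preds.var(axis=0))                  # retrain-to-retrain spread
    print(f"degree {degree:2d}:  bias^2 = {bias_sq:.3f}   variance = {variance:.3f}")
```

With these settings you should see the bias term dominate for the degree-1 model (it misses the curve the same way on every resample) and the variance term blow up for the degree-15 model (it changes drastically between resamples).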
The Model Complexity Ladder
Think of models along a spectrum from simple to complex:
| Model | Typical Use Case | Data Needed |
|---|---|---|
| Linear / Logistic Regression | Baselines, interpretability required | 100s |
| Decision Trees | Tabular data, non-linear | 100s-1000s |
| Random Forests / XGBoost | Tabular data, strong baseline | 1000s-10k |
| SVMs | Small high-dimensional data | 1000s |
| Shallow NNs | Flexible non-linear patterns | 10k+ |
| Deep NNs / CNNs | Images, audio | 100k+ |
| Transformers | Text, multimodal, large scale | Millions+ |
Start at the simple end of this ladder. A logistic regression baseline takes five minutes to implement and tells you whether the problem is even solvable. If that baseline already hits 95% accuracy, it changes what you do next far more than any architecture choice would.
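Such a baseline really is only a handful of lines. A minimal sketch with scikit-learn, using its built-in breast-cancer dataset purely as a stand-in for your own tabular features:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; swap in your own features X and labels y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling + logistic regression: the bar every fancier model has to beat.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("validation accuracy:", round(baseline.score(X_val, y_val), 3))
```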
A Decision Framework
When choosing a model, answer these questions in order:
1. How much labeled data do you have?
- Less than 1000 examples: linear models, simple trees, or transfer learning from pretrained models. Deep learning from scratch will overfit.
- 1k-100k examples: tree ensembles and moderate neural networks are in play.
- More than 100k: full deep learning models become viable and often superior.
2. What is the data modality?
Images, text, audio, and video have specialized architectures that incorporate domain structure (spatial, sequential, etc.). Tabular data with hand-engineered features is tree ensemble territory.
3. What are your inference constraints?
A model that needs to run on an edge device in 5ms has fundamentally different constraints than a cloud API with no latency requirement. Latency, memory, and compute all constrain model size.
4. Do you need interpretability?
In healthcare, finance, and legal applications, you may need to explain individual predictions. Linear models and decision trees are inherently interpretable. Neural networks require post-hoc tools (SHAP, LIME) that are less reliable.
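As a rough illustration of what "inherently interpretable" means in practice (the dataset and the top-5 cutoff are arbitrary choices for this sketch), a linear model's standardized coefficients can be read directly as feature effects:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(data.data, data.target)

# With standardized inputs, coefficient magnitude is a rough global
# importance and the sign gives the direction of the effect.
coefs = model.named_steps["logisticregression"].coef_.ravel()
for idx in np.argsort(np.abs(coefs))[::-1][:5]:
    print(f"{data.feature_names[idx]:>25}  {coefs[idx]:+.2f}")
```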
Interactive example (coming soon): a model selector guide - answer questions about your problem and get model recommendations.
Underfitting vs. Overfitting: Diagnosis
After training, your loss curves tell you where you are:
Underfitting (high bias):
- Training loss is high
- Validation loss ≈ training loss
- Fix: use a more complex model, add features, reduce regularization, train longer
Overfitting (high variance):
- Training loss is low
- Validation loss is much higher than training loss
- Fix: get more data, use a simpler model, add regularization, use dropout, use early stopping
Well-fit (goal):
- Training loss is low
- Validation loss ≈ training loss
- The gap between them is small and acceptable
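This triage can be written down directly. A sketch of the same logic with hypothetical thresholds (what counts as a "high" loss or an "acceptable" gap depends entirely on your task and loss scale):

```python
def diagnose(train_loss: float, val_loss: float,
             high_loss: float = 0.5, max_gap: float = 0.1) -> str:
    """Crude fit diagnosis from final training and validation losses."""
    if train_loss > high_loss:
        # Both losses high: the model cannot even fit the training data.
        return "underfitting: add capacity/features, regularize less, train longer"
    if val_loss - train_loss > max_gap:
        # Training loss low but validation much worse: memorizing, not generalizing.
        return "overfitting: more data, simpler model, regularization, early stopping"
    return "well-fit: both losses low and close"

print(diagnose(train_loss=0.62, val_loss=0.65))  # underfitting
print(diagnose(train_loss=0.05, val_loss=0.40))  # overfitting
print(diagnose(train_loss=0.08, val_loss=0.11))  # well-fit
```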
Model Selection in Practice
Always use a held-out validation set that the model never sees during training. For small datasets, use k-fold cross-validation:
- k: number of folds, typically 5 or 10
- L_i: validation loss on fold i
Split the training data into k equal folds. Train on k - 1 of them, validate on the remaining 1. Rotate so every fold serves as the validation set once. Average the k validation losses L_1, ..., L_k. This gives a much more reliable estimate of generalization than a single train/val split.
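A minimal k-fold sketch with scikit-learn (k = 5 here; the model and dataset are stand-ins for your own):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds: train on 4, validate on the held-out 1, rotate, then average.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores.round(3))
print("mean:", scores.mean().round(3), " std:", scores.std().round(3))
```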
Never touch the test set until the very end. If you evaluate on the test set while making modeling decisions, you are implicitly optimizing for it and your final test accuracy will be optimistic.
Interactive example (coming soon): a bias-variance visualizer - fit polynomials of different degrees and watch training vs. validation loss.