
Choosing a model

Video coming soon

Choosing the Right Model: A Framework for Real Projects

The bias-variance trade-off, the model complexity ladder, and a decision tree for matching model family to your data size and structure.

⏱ ~7 min

🧮 Quick refresher

Loss functions and gradient descent

A loss function measures prediction error. Gradient descent iteratively updates parameters to minimize loss: θ ← θ - α·∇L. The learning rate α controls step size.

Example

MSE loss: L = (1/n)Σ(yᵢ - ŷᵢ)².

Gradient descent nudges weights in the direction that reduces this average squared error.
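
As a minimal sketch of that loop (plain NumPy on synthetic data), here is gradient descent fitting a one-feature linear model under MSE; the data-generating line y = 2x + 1 is made up for illustration:

```python
import numpy as np

# Synthetic data from a known line: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0   # parameters θ, initialized at zero
alpha = 0.1       # learning rate α

for _ in range(500):
    y_hat = w * x + b
    # Gradients of MSE: ∂L/∂w = (2/n)Σ(ŷᵢ - yᵢ)xᵢ, ∂L/∂b = (2/n)Σ(ŷᵢ - yᵢ)
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    w -= alpha * grad_w   # θ ← θ - α·∇L
    b -= alpha * grad_b

print(w, b)  # should recover roughly 2.0 and 1.0
```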

The Central Problem: Generalization

You are not training a model to memorize your training data. You are training it to make good predictions on new, unseen data. The gap between training performance and real-world performance is the central challenge of machine learning.

Choosing the right model complexity is the difference between a system that works in production and one that only works on your laptop. Every ML engineer has shipped something that performed well in testing and failed in deployment — this lesson is how you avoid that.

The bias-variance trade-off is the formal framing of this tension:

  • Bias: error from wrong assumptions. A linear model predicting a curved relationship has high bias: it will systematically miss the true pattern regardless of how much data you give it.
  • Variance: error from sensitivity to the specific training set. A high-degree polynomial fitted to 20 points will perform differently every time you retrain on a new 20-point sample.

Expected Test Error = Bias² + Variance + Noise

  • Bias²: systematic error from wrong model assumptions
  • Variance: sensitivity to training data fluctuations
  • Noise: irreducible error inherent to the task

You cannot eliminate all three. Decreasing one tends to increase another. Finding the sweet spot is model selection.
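
You can see the trade-off empirically by refitting models of different complexity on many fresh training samples. The sketch below (synthetic sine-curve data, NumPy's polyfit, arbitrarily chosen degrees 1 and 9) estimates bias² and variance of the prediction at a single test point:

```python
import numpy as np

# Refit two model families on many fresh 20-point samples drawn from a
# curved ground truth (sin), then measure bias² and variance of the
# prediction at one fixed test point. All data here is synthetic.
rng = np.random.default_rng(0)
x_test = 1.5  # a point inside the training range [0, π]

for degree in (1, 9):  # a straight line vs. a degree-9 polynomial
    preds = []
    for _ in range(200):  # 200 independent training sets
        x = rng.uniform(0, np.pi, size=20)
        y = np.sin(x) + rng.normal(scale=0.1, size=20)
        coeffs = np.polyfit(x, y, deg=degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias = preds.mean() - np.sin(x_test)  # systematic miss of the truth
    print(f"degree {degree}: bias² = {bias**2:.4f}, variance = {preds.var():.4f}")

# Expected pattern: the line shows large bias² and small variance; the
# degree-9 fit shows near-zero bias² but much larger variance.
```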

The Model Complexity Ladder

Think of models along a spectrum from simple to complex:

| Model | Typical Use Case | Data Needed |
| --- | --- | --- |
| Linear / Logistic Regression | Baselines, interpretability required | 100s |
| Decision Trees | Tabular data, non-linear | 100s-1000s |
| Random Forests / XGBoost | Tabular data, strong baseline | 1000s-10k |
| SVMs | Small high-dimensional data | 1000s |
| Shallow NNs | Flexible non-linear patterns | 10k+ |
| Deep NNs / CNNs | Images, audio | 100k+ |
| Transformers | Text, multimodal, large scale | Millions+ |

Start at the bottom. A logistic regression baseline takes 5 minutes to implement and tells you whether the problem is even solvable. If that baseline already hits 95% accuracy, that fact changes what you do next far more than any architecture choice.
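
A baseline along those lines might look like the following sketch; the bundled scikit-learn dataset is a stand-in for your own data, and the majority-class dummy gives the floor any real model must beat:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset; substitute your own X and y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Majority-class floor: any useful model has to beat this number
floor = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("majority class:", floor.score(X_val, y_val))

# The 5-minute baseline: scale features, fit logistic regression
baseline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("logistic regression:", baseline.score(X_val, y_val))
```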

A Decision Framework

When choosing a model, answer these questions in order (a code sketch pulling the four together follows the list):

1. How much labeled data do you have?

  • Less than 1000 examples: linear models, simple trees, or transfer learning from pretrained models. Deep learning from scratch will overfit.
  • 1k-100k examples: tree ensembles and moderate neural networks are in play.
  • More than 100k: full deep learning models become viable and often superior.

2. What is the data modality?

Images, text, audio, and video have specialized architectures that incorporate domain structure (spatial, sequential, etc.). Tabular data with hand-engineered features is tree ensemble territory.

3. What are your inference constraints?

A model that needs to run on an edge device in 5ms has fundamentally different constraints than a cloud API with no latency requirement. Latency, memory, and compute all constrain model size.

4. Do you need interpretability?

In healthcare, finance, and legal applications, you may need to explain individual predictions. Linear models and decision trees are inherently interpretable. Neural networks require post-hoc tools (SHAP, LIME) that are less reliable.
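
Collapsed into code, the four questions become a rough rule-of-thumb function. The names, categories, and thresholds below are illustrative only, not canonical cutoffs:

```python
def recommend_model(n_labeled: int, modality: str,
                    needs_interpretability: bool,
                    tight_latency: bool) -> str:
    """Rule-of-thumb sketch of the four questions above.
    Thresholds and categories are illustrative, not canonical."""
    if needs_interpretability:
        # Question 4 dominates: regulated domains need inherently
        # interpretable models, not post-hoc explanations
        return "linear/logistic regression or a shallow decision tree"
    if modality in {"image", "text", "audio", "video"}:
        if n_labeled < 1_000:
            return "transfer learning from a pretrained model"
        if tight_latency:
            return "a small/distilled architecture sized for the device"
        return "deep architecture matched to the modality (CNN, transformer)"
    # Tabular data
    if n_labeled < 1_000:
        return "linear model or a simple decision tree"
    return "tree ensemble (random forest / XGBoost)"


print(recommend_model(500, "tabular",
                      needs_interpretability=False, tight_latency=False))
# -> linear model or a simple decision tree
```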

Interactive example

Model selector guide - answer questions about your problem and get model recommendations

Coming soon

Underfitting vs. Overfitting: Diagnosis

After training, your loss curves tell you where you are; a toy diagnostic sketch follows these three cases:

Underfitting (high bias):

  • Training loss is high
  • Validation loss ≈ training loss
  • Fix: use a more complex model, add features, reduce regularization, train longer

Overfitting (high variance):

  • Training loss is low
  • Validation loss is much higher than training loss
  • Fix: get more data, use a simpler model, add regularization, use dropout, use early stopping

Well-fit (goal):

  • Training loss is low
  • Validation loss ≈ training loss
  • The gap between them is small and acceptable
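
Here is that diagnosis as code, assuming you only have the two final loss values. The `acceptable_loss` threshold and the 1.5× gap ratio are arbitrary illustrative choices, not standard constants:

```python
def diagnose_fit(train_loss: float, val_loss: float,
                 acceptable_loss: float, gap_ratio: float = 1.5) -> str:
    """Toy diagnosis from two final loss values. `acceptable_loss` is
    whatever counts as 'low' for your task; the 1.5x gap ratio is an
    arbitrary illustrative threshold, not a standard constant."""
    if train_loss > acceptable_loss:
        return "underfitting: more complex model, more features, less regularization"
    if val_loss > gap_ratio * train_loss:
        return "overfitting: more data, simpler model, regularization, early stopping"
    return "well-fit: both losses low, gap small"


print(diagnose_fit(train_loss=0.05, val_loss=0.40, acceptable_loss=0.10))
# -> overfitting: more data, simpler model, regularization, early stopping
```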

Model Selection in Practice

Always use a held-out validation set that the model never sees during training. For small datasets, use k-fold cross-validation:

CV Score = (1/k) · Σᵢ Lᵢ

  • k: number of folds, typically 5 or 10
  • Lᵢ: validation loss on fold i

Split the training data into k equal folds. Train on k-1 of them, validate on the remaining one. Rotate. Average the k validation scores. This gives a much more reliable estimate of generalization than a single train/val split.
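
A minimal sketch of that rotation using scikit-learn's KFold (again on a stand-in dataset):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your data

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):  # rotate through the k folds
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])               # train on k-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out fold

print(f"CV score: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```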

Never touch the test set until the very end. If you evaluate on the test set while making modeling decisions, you are implicitly optimizing for it and your final test accuracy will be optimistic.

Interactive example

Bias-variance visualizer - fit polynomials of different degrees and watch training vs. validation loss

Coming soon

Quiz

1 / 3

A model with very low training loss but very high validation loss is showing...