Imagine you're a loan officer deciding whether to approve a loan application. Your mental process: "Is their credit score above 700? If yes, is their debt-to-income ratio below 40%? If yes, approve." You've just described a decision tree.
Decision trees are among the most interpretable machine learning models: you can literally read the rules they've learned.
The Anatomy of a Tree
Every decision tree has three types of nodes:
Root node: The starting point. Every prediction begins here.
Internal nodes: Each applies a binary test: "Is feature F ≤ threshold T?" If yes, go left; if no, go right. Each internal node has exactly two children.
Leaf nodes: The endpoints. Each leaf stores a prediction:
- Classification: the majority class of training examples reaching this leaf (or class probabilities)
- Regression: the mean of target values of training examples reaching this leaf
The depth of the tree is the length of the longest path from root to leaf. A depth-1 tree makes a single split (a "stump"). A depth-5 tree makes up to 5 sequential decisions.
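To make this anatomy concrete, here is a minimal Python sketch of a tree as linked nodes, with prediction as a walk from root to leaf. The names (`Node`, `predict`) are illustrative, not a real library API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal-node fields (None for leaves)
    feature: Optional[int] = None      # index of the feature to test
    threshold: Optional[float] = None  # test: x[feature] <= threshold
    left: Optional["Node"] = None      # child for "yes" (<= threshold)
    right: Optional["Node"] = None     # child for "no"  (> threshold)
    # Leaf field (None for internal nodes)
    prediction: Optional[str] = None   # majority class or mean target

def predict(node: Node, x: list) -> str:
    """Route one example from the root down to a leaf."""
    while node.prediction is None:  # stop once we hit a leaf
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction

# The depth-1 loan tree from the worked example later in this lesson;
# features are [credit_score, income_k].
root = Node(feature=0, threshold=640,
            left=Node(prediction="DEFAULT"),
            right=Node(prediction="NO DEFAULT"))
print(predict(root, [600, 45]))  # -> DEFAULT
```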
The Greedy Top-Down Construction
Decision tree training uses a greedy recursive algorithm called CART (Classification and Regression Trees):
1. Start with all training data at the root
2. Find the best feature-threshold pair (split) to divide the data
3. Divide the data into left (feature ≤ threshold) and right (feature > threshold)
4. Recursively apply steps 2–4 to each child node
5. Stop when a stopping criterion is met (max depth reached, too few points, no improvement possible)
The "best split" is measured by an impurity criterion — information gain, Gini index, or variance reduction for regression. The next lesson covers these in detail.
A Worked Example: Predicting Loan Default
Training data (10 loans):
| Loan | Credit Score | Income (K$) | Default? |
|---|---|---|---|
| L1 | 750 | 80 | No |
| L2 | 600 | 45 | Yes |
| L3 | 680 | 65 | No |
| L4 | 550 | 30 | Yes |
| L5 | 720 | 70 | No |
| L6 | 590 | 40 | Yes |
| L7 | 700 | 55 | No |
| L8 | 620 | 35 | Yes |
| L9 | 660 | 60 | No |
| L10 | 570 | 50 | Yes |
At the root, we need to pick the best first split by scoring every candidate feature-threshold pair. Two candidates here are "Credit Score ≤ 640" and "Income ≤ 50". Take the first:
Split on Credit Score ≤ 640:
- Left (Credit ≤ 640): {L2, L4, L6, L8, L10} — 5 defaults, 0 no-defaults → pure! → predict Yes
- Right (Credit > 640): {L1, L3, L5, L7, L9} — 0 defaults, 5 no-defaults → pure! → predict No
The tree has learned: if credit score ≤ 640, predict default; otherwise, predict no default. Accuracy: 100% on training data with just one split. (In this toy dataset, "Income ≤ 50" happens to separate the classes perfectly as well; when candidate splits tie, the algorithm can pick either.)
The resulting tree at depth 1:
Credit Score ≤ 640?
├── Yes → Predict: DEFAULT
└── No → Predict: NO DEFAULT
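If you want to verify this with a library, a short script (assuming scikit-learn is installed) fits a depth-1 tree to the table above and prints the learned rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The 10 loans from the table: [credit_score, income_k]
X = [[750, 80], [600, 45], [680, 65], [550, 30], [720, 70],
     [590, 40], [700, 55], [620, 35], [660, 60], [570, 50]]
y = ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["credit_score", "income_k"]))
# Expect a single perfectly pure split. Because income_k <= 50 also
# separates the classes perfectly here, which feature gets printed
# depends on internal tie-breaking.
```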
Classification Trees vs. Regression Trees
Classification trees: Predict a class label. Splits maximize reduction in class impurity (information gain or Gini). Leaf prediction = majority class.
Regression trees: Predict a continuous value. Splits minimize residual variance. Leaf prediction = mean of target values in that leaf.
In symbols, the regression prediction at a leaf is the mean of the targets that reach it:

$$\hat{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where $\hat{y}$ is the prediction at the leaf node, $n$ is the number of training points reaching this leaf, and $y_i$ is the target value for training point $i$.
For regression, a deep tree approximates the target function as a step function — piecewise constant with one step per leaf.
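A quick way to see this step-function behavior is to fit a small regression tree to a smooth 1-D target. This sketch assumes scikit-learn and NumPy; the sine target and the depth are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # at most 8 leaves

# Predictions are constant between learned thresholds: a step function
# with one level per leaf (the mean of the training targets it holds).
grid = np.linspace(0, 6, 13).reshape(-1, 1)
print(np.round(tree.predict(grid), 2))
print("distinct levels:", len(set(tree.predict(grid))))  # at most 8
```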
Reading a Decision Tree
A 3-level tree for loan default might look like:
Credit Score ≤ 650?
├── Yes (likely default)
│ └── Debt-to-Income > 40%?
│ ├── Yes → DEFAULT (high confidence)
│ └── No → DEFAULT (moderate confidence)
└── No (likely safe)
└── Income ≤ 40K?
├── Yes → DEFAULT (income too low despite good credit)
└── No → NO DEFAULT (safe)
Each path from root to leaf is a complete "rule": "If credit score ≤ 650 AND debt-to-income > 40%, then predict default." This is the interpretability advantage — a model that underwriters can actually read and audit.
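Extracting those rules can even be automated. Here is a sketch (reusing the illustrative `Node` class and the depth-1 loan tree `root` from earlier in this lesson) that walks every root-to-leaf path and prints it as a rule:

```python
def rules(node, feature_names, path=()):
    """Yield one human-readable rule per root-to-leaf path."""
    if node.prediction is not None:  # reached a leaf
        yield f"IF {' AND '.join(path) or 'always'} THEN {node.prediction}"
        return
    name = feature_names[node.feature]
    yield from rules(node.left, feature_names,
                     path + (f"{name} <= {node.threshold}",))
    yield from rules(node.right, feature_names,
                     path + (f"{name} > {node.threshold}",))

for r in rules(root, ["credit_score", "income_k"]):
    print(r)
# IF credit_score <= 640 THEN DEFAULT
# IF credit_score > 640 THEN NO DEFAULT
```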
The Bias-Variance Tradeoff in Trees
A shallow tree: high bias (misses patterns), low variance (stable across different training sets).
A deep tree: low bias (fits training data well), high variance (overfit — small changes to training data produce very different trees).
Pruning and depth limits are the tools for controlling this tradeoff — covered in lesson 16-3.
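To observe the tradeoff empirically, one option (a sketch assuming scikit-learn; the synthetic dataset and depth values are arbitrary choices) is to compare train and test accuracy at several depths:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # stump, moderate, fully grown
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    print(f"max_depth={depth}: train={clf.score(Xtr, ytr):.2f}, "
          f"test={clf.score(Xte, yte):.2f}")
# Typically: the unrestricted tree scores ~1.0 on train but drops on
# test (high variance); the stump underfits both (high bias).
```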
What to Remember
- Decision trees route inputs through binary feature tests to make predictions
- Internal nodes: feature ≤ threshold; leaf nodes: predict majority class (classification) or mean (regression)
- Construction: greedy top-down CART algorithm — pick the best split at each node
- Depth controls complexity: shallow = underfitting, deep = overfitting
- Trees are highly interpretable — predictions can be traced through explicit rules
- Classification trees minimize class impurity; regression trees minimize variance