Imagine you're a loan officer deciding whether to approve a loan application. Your mental process: "Is their credit score above 700? If yes, is their debt-to-income ratio below 40%? If yes, approve." You've just described a decision tree.
Decision trees are among the most interpretable machine learning models: you can literally read the rules they've learned.
The Anatomy of a Tree
Every decision tree has three types of nodes:
Root node: The starting point. Every prediction begins here.
Internal nodes: Each applies a binary test: "Is feature F ≤ threshold T?" If yes, go left; if no, go right. Each internal node has exactly two children.
Leaf nodes: The endpoints. Each leaf stores a prediction:
- Classification: the majority class of training examples reaching this leaf (or class probabilities)
- Regression: the mean of target values of training examples reaching this leaf
The depth of the tree is the length of the longest path from root to leaf. A depth-1 tree makes a single split (a "stump"). A depth-5 tree makes up to 5 sequential decisions.
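To make this anatomy concrete, here is a minimal Python sketch of a tree as linked nodes, with prediction as a walk from root to leaf. The names (`Node`, `predict`) are illustrative, not a real library API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal-node fields (None for leaves)
    feature: Optional[int] = None      # index of the feature to test
    threshold: Optional[float] = None  # test: x[feature] <= threshold
    left: Optional["Node"] = None      # child for "yes" (<= threshold)
    right: Optional["Node"] = None     # child for "no"  (> threshold)
    # Leaf field (None for internal nodes)
    prediction: Optional[str] = None   # majority class or mean target

def predict(node: Node, x: list) -> str:
    """Route one example from the root down to a leaf."""
    while node.prediction is None:  # stop once we hit a leaf
        if x[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction

# The depth-1 loan tree from the worked example later in this lesson;
# features are [credit_score, income_k].
root = Node(feature=0, threshold=640,
            left=Node(prediction="DEFAULT"),
            right=Node(prediction="NO DEFAULT"))
print(predict(root, [600, 45]))  # -> DEFAULT
```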
The Greedy Top-Down Construction
Decision tree training uses a greedy recursive algorithm called CART (Classification and Regression Trees):
1. Start with all training data at the root
2. Find the best feature-threshold pair (split) to divide the data
3. Divide the data into left (feature ≤ threshold) and right (feature > threshold)
4. Recursively apply steps 2–4 to each child node
5. Stop when a stopping criterion is met (max depth reached, too few points, no improvement possible)
The "best split" is measured by an impurity criterion — information gain, Gini index, or variance reduction for regression. The next lesson covers these in detail.
A Worked Example: Predicting Loan Default
Training data (10 loans):
| Loan | Credit Score | Income (K$) | Default? |
|---|---|---|---|
| L1 | 750 | 80 | No |
| L2 | 600 | 45 | Yes |
| L3 | 680 | 65 | No |
| L4 | 550 | 30 | Yes |
| L5 | 720 | 70 | No |
| L6 | 590 | 40 | Yes |
| L7 | 700 | 55 | No |
| L8 | 620 | 35 | Yes |
| L9 | 660 | 60 | No |
| L10 | 570 | 50 | Yes |
At the root, we need to pick the best first split by scoring every candidate feature-threshold pair. Two candidates here are "Credit Score ≤ 640" and "Income ≤ 50". Take the first:
Split on Credit Score ≤ 640:
- Left (Credit ≤ 640): {L2, L4, L6, L8, L10} — 5 defaults, 0 no-defaults → pure! → predict Yes
- Right (Credit > 640): {L1, L3, L5, L7, L9} — 0 defaults, 5 no-defaults → pure! → predict No
The tree has learned: if credit score ≤ 640, predict default; otherwise, predict no default. Accuracy: 100% on training data with just one split. (In this toy dataset, "Income ≤ 50" happens to separate the classes perfectly as well; when candidate splits tie, the algorithm can pick either.)
The resulting tree at depth 1:
Credit Score ≤ 640?
├── Yes → Predict: DEFAULT
└── No → Predict: NO DEFAULT
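If you want to verify this with a library, a short script (assuming scikit-learn is installed) fits a depth-1 tree to the table above and prints the learned rule:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The 10 loans from the table: [credit_score, income_k]
X = [[750, 80], [600, 45], [680, 65], [550, 30], [720, 70],
     [590, 40], [700, 55], [620, 35], [660, 60], [570, 50]]
y = ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["credit_score", "income_k"]))
# Expect a single perfectly pure split. Because income_k <= 50 also
# separates the classes perfectly here, which feature gets printed
# depends on internal tie-breaking.
```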
Classification Trees vs. Regression Trees
Classification trees: Predict a class label. Splits maximize reduction in class impurity (information gain or Gini). Leaf prediction = majority class.
Regression trees: Predict a continuous value. Splits minimize residual variance. Leaf prediction = mean of target values in that leaf.
In symbols, the regression prediction at a leaf is the mean of the targets that reach it:

$$\hat{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where $\hat{y}$ is the prediction at the leaf node, $n$ is the number of training points reaching this leaf, and $y_i$ is the target value for training point $i$.
For regression, a deep tree approximates the target function as a step function — piecewise constant with one step per leaf.
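A quick way to see this step-function behavior is to fit a small regression tree to a smooth 1-D target. This sketch assumes scikit-learn and NumPy; the sine target and the depth are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)  # at most 8 leaves

# Predictions are constant between learned thresholds: a step function
# with one level per leaf (the mean of the training targets it holds).
grid = np.linspace(0, 6, 13).reshape(-1, 1)
print(np.round(tree.predict(grid), 2))
print("distinct levels:", len(set(tree.predict(grid))))  # at most 8
```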
Reading a Decision Tree
A 3-level tree for loan default might look like:
Credit Score ≤ 650?
├── Yes (likely default)
│ └── Debt-to-Income > 40%?
│ ├── Yes → DEFAULT (high confidence)
│ └── No → DEFAULT (moderate confidence)
└── No (likely safe)
└── Income ≤ 40K?
├── Yes → DEFAULT (income too low despite good credit)
└── No → NO DEFAULT (safe)
Each path from root to leaf is a complete "rule": "If credit score ≤ 650 AND debt-to-income > 40%, then predict default." This is the interpretability advantage — a model that underwriters can actually read and audit.
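Extracting those rules can even be automated. Here is a sketch (reusing the illustrative `Node` class and the depth-1 loan tree `root` from earlier in this lesson) that walks every root-to-leaf path and prints it as a rule:

```python
def rules(node, feature_names, path=()):
    """Yield one human-readable rule per root-to-leaf path."""
    if node.prediction is not None:  # reached a leaf
        yield f"IF {' AND '.join(path) or 'always'} THEN {node.prediction}"
        return
    name = feature_names[node.feature]
    yield from rules(node.left, feature_names,
                     path + (f"{name} <= {node.threshold}",))
    yield from rules(node.right, feature_names,
                     path + (f"{name} > {node.threshold}",))

for r in rules(root, ["credit_score", "income_k"]):
    print(r)
# IF credit_score <= 640 THEN DEFAULT
# IF credit_score > 640 THEN NO DEFAULT
```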
The Bias-Variance Tradeoff in Trees
A shallow tree: high bias (misses patterns), low variance (stable across different training sets).
A deep tree: low bias (fits training data well), high variance (overfit — small changes to training data produce very different trees).
Pruning and depth limits are the tools for controlling this tradeoff — covered in lesson 16-3.
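To observe the tradeoff empirically, one option (a sketch assuming scikit-learn; the synthetic dataset and depth values are arbitrary choices) is to compare train and test accuracy at several depths:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2,
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):  # stump, moderate, fully grown
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    print(f"max_depth={depth}: train={clf.score(Xtr, ytr):.2f}, "
          f"test={clf.score(Xte, yte):.2f}")
# Typically: the unrestricted tree scores ~1.0 on train but drops on
# test (high variance); the stump underfits both (high bias).
```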
What to Remember
- Decision trees route inputs through binary feature tests to make predictions
- Internal nodes: feature ≤ threshold; leaf nodes: predict majority class (classification) or mean (regression)
- Construction: greedy top-down CART algorithm — pick the best split at each node
- Depth controls complexity: shallow = underfitting, deep = overfitting
- Trees are highly interpretable — predictions can be traced through explicit rules
- Classification trees minimize class impurity; regression trees minimize variance