Tree Methods & Ensembles
Lesson 1 ⏱ 12 min

Decision trees: the splitting problem

Video coming soon

Decision Trees: A Series of Yes/No Questions

Building a decision tree by hand on a loan default dataset — showing how the algorithm picks features and thresholds to split data.

⏱ ~8 min

🧮 Quick refresher

Classification and prediction

Classification is a supervised learning task where the goal is to predict a discrete label (class) for each input. A classifier maps input features to one of K possible categories. The training process finds the mapping that minimizes misclassification on labeled examples.

Example

Predicting whether a bank loan will default (yes/no) based on credit score, income, and loan amount is a binary classification problem.

Imagine you're a loan officer deciding whether to approve a loan application. Your mental process: "Is their credit score above 700? If yes, is their debt-to-income ratio below 40%? If yes, approve." You've just described a decision tree.
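One way to see the connection is to write that mental process as plain nested conditionals. This is a minimal illustrative sketch; the function name and the thresholds (700, 40%) are just the ones from the paragraph above, not rules from any real underwriting system.

def approve_loan(credit_score: float, debt_to_income: float) -> bool:
    """The loan officer's mental process, written as nested yes/no tests."""
    if credit_score > 700:          # "Is their credit score above 700?"
        if debt_to_income < 0.40:   # "Is their debt-to-income ratio below 40%?"
            return True             # approve
    return False                    # otherwise, decline

print(approve_loan(credit_score=720, debt_to_income=0.35))  # True
print(approve_loan(credit_score=650, debt_to_income=0.30))  # False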

Decision trees are the most interpretable machine learning models: you can literally read the rules they've learned.

The Anatomy of a Tree

Every decision tree has three types of nodes:

Root node: The starting point. Every prediction begins here.

Internal nodes: Each applies a binary test: "Is feature F ≤ threshold T?" If yes, go left; if no, go right. Each internal node has exactly two children.

Leaf nodes: The endpoints. Each leaf stores a prediction:

  • Classification: the majority class of training examples reaching this leaf (or class probabilities)
  • Regression: the mean of target values of training examples reaching this leaf

The depth of the tree is the length of the longest path from root to leaf. A depth-1 tree makes a single split (a "stump"). A depth-5 tree makes up to 5 sequential decisions.
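If it helps to see this anatomy in code, here is a minimal sketch of a node structure and the routing logic, assuming features are indexed numerically. The names TreeNode and predict_one are illustrative, not from any particular library.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """One node of a binary decision tree.

    An internal node stores a test (feature index + threshold) and two children;
    a leaf node stores only a prediction."""
    feature: Optional[int] = None       # index of the feature tested (internal nodes only)
    threshold: Optional[float] = None   # go left if x[feature] <= threshold
    left: Optional["TreeNode"] = None   # child for "yes" (feature <= threshold)
    right: Optional["TreeNode"] = None  # child for "no"  (feature > threshold)
    prediction: Optional[float] = None  # stored prediction (leaf nodes only)

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def predict_one(node: TreeNode, x) -> float:
    """Route a single example from the root down to a leaf."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction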

The Greedy Top-Down Construction

Decision tree training uses a greedy recursive algorithm called CART (Classification and Regression Trees):

  1. Start with all training data at the root
  2. Find the best feature-threshold pair (split) to divide the data
  3. Divide the data into left (feature ≤ threshold) and right (feature > threshold)
  4. Recursively apply steps 2–4 to each child node
  5. Stop when a stopping criterion is met (max depth reached, too few points, no improvement possible)

The "best split" is measured by an impurity criterion — information gain, Gini index, or variance reduction for regression. The next lesson covers these in detail.

A Worked Example: Predicting Loan Default

Training data (10 loans):

Loan   Credit Score   Income (K$)   Default?
L1     750            80            No
L2     600            45            Yes
L3     680            65            No
L4     550            30            Yes
L5     720            70            No
L6     590            40            Yes
L7     700            55            No
L8     620            35            Yes
L9     660            60            No
L10    570            50            Yes

At the root, we need to pick the best first split. Let's say we evaluate "Credit Score ≤ 640" vs. "Income ≤ 50":

Split on Credit Score ≤ 640:

  • Left (Credit ≤ 640): {L2, L4, L6, L8, L10} — 5 defaults, 0 no-defaults → pure! → predict Yes
  • Right (Credit > 640): {L1, L3, L5, L7, L9} — 0 defaults, 5 no-defaults → pure! → predict No

The tree has learned: if credit score ≤ 640, predict default; otherwise, predict no default. Accuracy: 100% on training data with just one split.

The resulting tree at depth 1:

Credit Score ≤ 640?
├── Yes → Predict: DEFAULT
└── No  → Predict: NO DEFAULT
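To double-check the split arithmetic, here is a quick verification on the 10-loan table, assuming NumPy is available. It simply partitions the labels on "Credit Score ≤ 640" and confirms each side is pure.

import numpy as np

# The 10 training loans from the table above: [credit score, income in K$]
X = np.array([[750, 80], [600, 45], [680, 65], [550, 30], [720, 70],
              [590, 40], [700, 55], [620, 35], [660, 60], [570, 50]])
y = np.array(["No", "Yes", "No", "Yes", "No",
              "Yes", "No", "Yes", "No", "Yes"])   # Default? for L1..L10

left = y[X[:, 0] <= 640]    # loans with credit score <= 640
right = y[X[:, 0] > 640]    # loans with credit score > 640

print(sorted(set(left)))    # ['Yes'] -- all 5 left-side loans defaulted
print(sorted(set(right)))   # ['No']  -- all 5 right-side loans did not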

Classification Trees vs. Regression Trees

Classification trees: Predict a class label. Splits maximize reduction in class impurity (information gain or Gini). Leaf prediction = majority class.

Regression trees: Predict a continuous value. Splits minimize residual variance. Leaf prediction = mean of target values in that leaf.

\hat{y}_{\text{leaf}} = \frac{1}{n_{\text{leaf}}} \sum_{i \in \text{leaf}} y_i

where:

  • \hat{y}_{\text{leaf}} — prediction at a leaf node for regression
  • n_{\text{leaf}} — number of training points reaching this leaf
  • y_i — target value for training point i

For regression, a deep tree approximates the target function as a step function — piecewise constant with one step per leaf.
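To see the step-function behavior directly, here is a minimal sketch using scikit-learn (assumed installed): a depth-3 regression tree fitted to a smooth 1-D target can only output as many distinct values as it has leaves.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A smooth 1-D regression target the tree must approximate with flat steps
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=100)).reshape(-1, 1)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)   # at most 2^3 = 8 leaves
pred = tree.predict(X)

# Each leaf predicts a single constant (the mean of its training targets),
# so the fitted curve takes at most 8 distinct values: piecewise constant.
print(len(np.unique(pred)))   # <= 8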

Reading a Decision Tree

A 3-level tree for loan default might look like:

Credit Score ≤ 650?
├── Yes (likely default)
│   └── Debt-to-Income > 40%?
│       ├── Yes → DEFAULT (high confidence)
│       └── No  → DEFAULT (moderate confidence)
└── No (likely safe)
    └── Income ≤ 40K?
        ├── Yes → DEFAULT (income too low despite good credit)
        └── No  → NO DEFAULT (safe)

Each path from root to leaf is a complete "rule": "If credit score ≤ 650 AND debt-to-income > 40%, then predict default." This is the interpretability advantage — a model that underwriters can actually read and audit.
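One way to get such a readable rule list from a fitted tree is scikit-learn's export_text helper, sketched here on the 10-loan toy dataset from the worked example (the feature names passed in are just labels for printing).

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[750, 80], [600, 45], [680, 65], [550, 30], [720, 70],
              [590, 40], [700, 55], [620, 35], [660, 60], [570, 50]])
y = ["No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["credit_score", "income_k"]))
# Prints the learned tests and leaf classes as an indented, human-readable
# rule list: each root-to-leaf path is one auditable rule.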

Interactive example

Click to split a dataset and grow a tree — watch how each split carves up the feature space

Coming soon

The Bias-Variance Tradeoff in Trees

A shallow tree: high bias (misses patterns), low variance (stable across different training sets).

A deep tree: low bias (fits training data well), high variance (overfit — small changes to training data produce very different trees).

Pruning and depth limits are the tools for controlling this tradeoff — covered in lesson 16-3.
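To put rough numbers on the tradeoff, here is a minimal sketch on synthetic data (scikit-learn assumed) comparing training accuracy with cross-validated accuracy as the depth limit grows; exact scores will vary with the data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A synthetic, somewhat noisy classification problem
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

for depth in (1, 3, None):   # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    train_acc = tree.fit(X, y).score(X, y)
    cv_acc = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: train={train_acc:.2f}, cv={cv_acc:.2f}")

# Typical pattern: the unrestricted tree scores near 1.0 on its own training
# data while its cross-validated accuracy lags behind -- the overfitting signature.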

What to Remember

  • Decision trees route inputs through binary feature tests to make predictions
  • Internal nodes: feature ≤ threshold; leaf nodes: predict majority class (classification) or mean (regression)
  • Construction: greedy top-down CART algorithm — pick the best split at each node
  • Depth controls complexity: shallow = underfitting, deep = overfitting
  • Trees are highly interpretable — predictions can be traced through explicit rules
  • Classification trees minimize class impurity; regression trees minimize variance

Quiz

1 / 3

What is a 'leaf node' in a decision tree?