A single decision tree is a flawed expert — ask it a question and it gives a confident answer based on one path through the data. Sometimes that answer is brilliant; sometimes it's completely wrong because the tree overfit on noise. What if you could ask 500 experts, each trained on slightly different data, each occasionally looking at different clues — and then take a vote?
That's a random forest.
Random forests remain remarkably competitive with neural networks on tabular data and are a default choice for structured data problems at companies like Airbnb, Uber, and many banks. Understanding how they're built from scratch explains why they're so reliable, and when they'll fail.
From One Tree to Many: Bagging
The bootstrap aggregating (bagging) algorithm is the foundation:
For b = 1 to B:
  a. Draw a bootstrap sample D_b of n examples from the training set (with replacement)
  b. Train a decision tree T_b on D_b (typically full depth, no pruning)
Classification prediction: majority vote among T₁(x), T₂(x), ..., T_B(x)
Regression prediction: ŷ = (1/B) Σ_b T_b(x)
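Here is a minimal sketch of that loop in Python, assuming scikit-learn's DecisionTreeClassifier and integer class labels; the names bagged_trees and predict_majority are illustrative, not a standard API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=100, seed=0):
    """Train B fully grown trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)    # draw n row indices with replacement
        tree = DecisionTreeClassifier()     # full depth, no pruning
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_majority(trees, X_new):
    """Classification: majority vote among T_1(x), ..., T_B(x)."""
    votes = np.stack([t.predict(X_new) for t in trees])  # shape (B, n_points)
    # bincount assumes non-negative integer class labels
    return np.array([np.bincount(col).argmax() for col in votes.T])
```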
Why does this work? Each tree has high variance — it's sensitive to the specific training examples it sees. But different bootstrap samples lead to different trees, and their prediction errors tend to be only weakly correlated. Averaging B uncorrelated errors reduces the variance to roughly 1/B of a single tree's variance.
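A quick numerical sanity check of that 1/B factor, under the idealized assumption of perfectly uncorrelated, equal-variance errors (all numbers here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)
B = 500
# 100,000 independent trials, each with B uncorrelated unit-variance "tree errors"
errors = rng.normal(loc=0.0, scale=1.0, size=(100_000, B))

print(errors[:, 0].var())         # one tree's error variance: ~1.0
print(errors.mean(axis=1).var())  # the ensemble average's variance: ~1/B = 0.002
```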
The Random Subspace: Making Trees Decorrelated
Pure bagging of trees still has a problem: if one feature is very dominant (say, credit score is by far the most predictive feature for loan default), all B trees will split on credit score at their root. The trees remain highly correlated, and averaging correlated predictions reduces variance much less than averaging uncorrelated ones.
The random subspace trick fixes this. At every split in every tree:
- m: the number of features randomly selected as candidates at each split
- d: the total number of features in the dataset
- m = √d: the standard choice for classification (m ≈ d/3 is common for regression)
Only m features are considered when choosing the split at each node. This forces different trees to find different patterns — the dominant feature gets "blocked" in many trees, so other features get a chance to split. The trees become less correlated, and the variance reduction from averaging is much more effective.
This is the key insight that Leo Breiman discovered: random feature selection per split, not just per tree, produces dramatically better results than bagging alone.
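In scikit-learn this per-split subsampling is controlled by max_features; a minimal sketch contrasting plain bagging with Breiman-style random feature selection:

```python
from sklearn.ensemble import RandomForestClassifier

# Plain bagging: max_features=None means every split considers all d features,
# so a dominant feature tends to win the root split in every tree.
bagged = RandomForestClassifier(n_estimators=500, max_features=None)

# Random forest: only sqrt(d) randomly chosen features are candidates at each
# split, which decorrelates the trees and makes averaging far more effective.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt")
```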
Out-of-Bag Error: Free Validation
Each bootstrap sample contains roughly 63% of the unique training examples (some appear multiple times). The remaining ~37% — the out-of-bag (OOB) examples — can be used to estimate generalization error without ever touching a test set.
For each training example x_i, collect predictions from all trees T_b where x_i was NOT in D_b. Average (or vote) those predictions → OOB prediction for x_i.
OOB error = average error of OOB predictions across all training examples.
This is essentially a "free" cross-validation estimate, produced as a byproduct of training. It correlates closely with held-out test error and is very useful for hyperparameter tuning.
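In scikit-learn, OOB scoring is a single flag; a minimal sketch assuming a feature matrix X and labels y:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
forest.fit(X, y)

# Accuracy estimated from out-of-bag predictions -- no held-out test set needed
print(forest.oob_score_)
```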
Feature Importance: Permutation and Impurity
Random Forests provide two measures of feature importance:
Mean Decrease in Impurity (MDI): For each feature, sum the information gain from every split on that feature, weighted by the number of examples at that node, across all trees. Fast, but biased toward high-cardinality features.
Permutation Importance: For each feature, randomly permute its values in the test set (breaking its relationship with the target), and measure how much the OOB (or test) error increases. Large increase = important feature; small increase = unimportant.
Permutation importance is less biased but slower. Use it when you suspect high-cardinality features are being overvalued.
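Both measures are available in scikit-learn; a sketch assuming a training split X_train, y_train and a held-out X_test, y_test:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

forest = RandomForestClassifier(n_estimators=500).fit(X_train, y_train)

# MDI: computed during training; fast but biased toward high-cardinality features
print(forest.feature_importances_)

# Permutation importance: shuffle each feature on held-out data and measure
# how much the score drops; slower but less biased
result = permutation_importance(forest, X_test, y_test, n_repeats=10)
print(result.importances_mean)
```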
Worked Example: 3 Trees Voting
Problem: Classify whether a customer will churn. Training data: 100 customers.
Three bootstrap samples produce three trees:
- Tree 1 (trained on B1): sees mostly high-spend customers → learns "if spend > 500 and tenure < 12 months, churn"
- Tree 2 (trained on B2): different sample → learns "if calls to support > 3 and satisfaction < 3, churn"
- Tree 3 (trained on B3): another sample → learns "if spend > 500 or satisfaction < 3, churn"
New customer: spend = 600, tenure = 6 months, support calls = 1, satisfaction = 4
- Tree 1: spend > 500 ✓ AND tenure < 12 ✓ → Churn
- Tree 2: support calls = 1 (not > 3) → No Churn
- Tree 3: spend > 500 ✓ → Churn
Vote: 2 Churn vs. 1 No Churn → Prediction: Churn (with 67% confidence)
Each tree has a different perspective. The ensemble combines them, and the majority vote reduces the chance that any one tree's idiosyncratic mistakes dominate.
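The same vote, transcribed into a few lines of Python (the three rule functions are hypothetical stand-ins for the trained trees):

```python
def tree1(c): return c["spend"] > 500 and c["tenure"] < 12
def tree2(c): return c["support_calls"] > 3 and c["satisfaction"] < 3
def tree3(c): return c["spend"] > 500 or c["satisfaction"] < 3

customer = {"spend": 600, "tenure": 6, "support_calls": 1, "satisfaction": 4}

votes = [tree1(customer), tree2(customer), tree3(customer)]  # [True, False, True]
print(sum(votes) / len(votes))  # 0.667 -> predict Churn with ~67% confidence
```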
Hyperparameters to Tune
| Parameter | Effect | Typical setting |
|---|---|---|
| n_estimators (B) | More trees = lower variance; diminishing returns past ~300 | 100–500 |
| max_features (m) | Lower = more decorrelated, more bias | √d classification, d/3 regression |
| max_depth | Controls individual tree complexity | None (fully grown) |
| min_samples_leaf | Prevents tiny leaves | 1 |
| n_jobs | Parallelism — trees are independent | -1 (all cores) |
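Since OOB error comes free, a small grid over the two parameters that matter most can skip cross-validation entirely; a sketch with an illustrative grid (X, y assumed):

```python
from sklearn.ensemble import RandomForestClassifier

best = None
for max_features in ["sqrt", 0.3, 0.5]:  # sqrt(d), or a fraction of d
    for max_depth in [None, 10, 20]:
        forest = RandomForestClassifier(
            n_estimators=300,
            max_features=max_features,
            max_depth=max_depth,
            oob_score=True,
            n_jobs=-1,                   # trees train in parallel
        ).fit(X, y)
        if best is None or forest.oob_score_ > best[0]:
            best = (forest.oob_score_, max_features, max_depth)

print(best)  # (best OOB accuracy, max_features, max_depth)
```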
Interactive example
Coming soon: see 10 trees vote on a test point, and toggle individual trees on/off to watch how the ensemble prediction changes.
What to Remember
- Bagging: train B trees on bootstrap samples; average/vote predictions to reduce variance
- Random feature subsampling (√d per split): decorrelates trees, making averaging more effective
- OOB error: free estimate of generalization error from the ~37% not in each bootstrap sample
- Feature importance: impurity-based (fast, biased) or permutation-based (slower, less biased)
- More trees = lower variance, rarely overfits — tune max_features and max_depth instead
- Random Forests are embarrassingly parallel — all trees train independently