Random Forests build trees in parallel and hope they compensate for each other's mistakes. Gradient boosting takes a different philosophy: build trees sequentially, where each new tree explicitly corrects what the previous ensemble got wrong.
The insight sounds simple but produces some of the most powerful algorithms ever created for tabular data.
The Additive Model Framework
Gradient boosting builds an additive model:

F_M(x) = F_0(x) + ν Σ_{m=1}^{M} h_m(x)

where:
- F_M(x): final ensemble prediction using M trees
- F_0(x): initial constant prediction (usually the mean of y for regression)
- h_m(x): the m-th tree (weak learner)
- ν: learning rate — scales each tree's contribution
Starting from F_0 (a simple constant), we add trees one at a time, each designed to reduce the remaining error.
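As a minimal sketch of this additive structure (assuming each fitted tree exposes a scikit-learn-style .predict method; the function name is just for illustration), the final prediction is the initial constant plus the scaled sum of the trees' outputs:

```python
import numpy as np

def ensemble_predict(X, f0, trees, nu):
    """F_M(x) = F_0 + nu * sum of the M trees' predictions."""
    pred = np.full(X.shape[0], f0, dtype=float)   # start from the constant F_0
    for tree in trees:                            # add each weak learner's scaled output
        pred += nu * tree.predict(X)
    return pred
```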
Fitting Residuals (Regression with MSE)
For squared error loss, L(y, F) = (1/2)(y - F)², the procedure is:
Step 1: Initialize F_0(x) = ȳ (mean of training targets)
For m = 1 to M:
- Compute residuals: r_i^{(m)} = y_i - F_{m-1}(x_i)
- Fit a shallow tree h_m to the residuals (treating r_i as the target)
- Update: F_m(x) = F_{m-1}(x) + ν × h_m(x)
This is gradient descent in function space: each h_m is a step in the direction that reduces the loss, and ν is the step size.
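Below is a from-scratch sketch of this loop, using shallow scikit-learn regression trees as the weak learners (the function name, depth, and defaults are illustrative choices, not a reference implementation). New points can be scored with the ensemble_predict sketch above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_mse(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Gradient boosting for squared error: each tree is fit to the current residuals."""
    f0 = y.mean()                          # Step 1: F_0(x) = mean of training targets
    pred = np.full(len(y), f0)             # running ensemble prediction F_{m-1}(x_i)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # r_i = y_i - F_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # fit the weak learner to the residuals
        pred += nu * tree.predict(X)       # F_m = F_{m-1} + nu * h_m
        trees.append(tree)
    return f0, trees
```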
Why "Gradient" Boosting?
For MSE loss, the negative gradient at each point is exactly the residual. With L = (1/2)(y − F)², the gradient of the loss with respect to the current prediction is ∂L/∂F = −(y − F), so:

−∂L/∂F = y − F = residual
For any differentiable loss function, replacing "residuals" with "negative gradients" gives the pseudo-residuals:

r_i^{(m)} = −∂L(y_i, F(x_i))/∂F(x_i), evaluated at F = F_{m-1}

- r_i^{(m)}: pseudo-residual for general loss L
- y_i: true target
- F_{m-1}(x_i): current ensemble prediction at step m−1
These pseudo-residuals are what h_m gets fit to in the general case. The same algorithm applies to log-loss (binary classification), absolute error, and Huber loss; only the pseudo-residuals change.
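A small sketch of what the pseudo-residuals look like for a few common losses (for the log-loss case, F is taken to be the log-odds and y ∈ {0, 1}; the helper name is made up for illustration):

```python
import numpy as np

def pseudo_residuals(y, F, loss="mse"):
    """Negative gradient of the loss with respect to the current prediction F."""
    if loss == "mse":          # L = 1/2 (y - F)^2  ->  -dL/dF = y - F (ordinary residual)
        return y - F
    if loss == "absolute":     # L = |y - F|        ->  -dL/dF = sign(y - F)
        return np.sign(y - F)
    if loss == "logloss":      # binary y in {0, 1}, F is the log-odds
        return y - 1.0 / (1.0 + np.exp(-F))       # y - sigmoid(F)
    raise ValueError(f"unknown loss: {loss}")
```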
Worked Example: 3 Rounds of Boosting
Dataset: 5 data points with features x and target y (house prices in $100K).
| i | x (size, sqft) | y (price) |
|---|---|---|
| 1 | 1000 | 2.0 |
| 2 | 1500 | 3.0 |
| 3 | 2000 | 4.5 |
| 4 | 2500 | 5.5 |
| 5 | 3000 | 7.0 |
F_0: ȳ = (2+3+4.5+5.5+7)/5 = 4.4
Round 1 — compute residuals:
| i | y_i | F_0 | r_i = y_i - F_0 |
|---|---|---|---|
| 1 | 2.0 | 4.4 | -2.4 |
| 2 | 3.0 | 4.4 | -1.4 |
| 3 | 4.5 | 4.4 | +0.1 |
| 4 | 5.5 | 4.4 | +1.1 |
| 5 | 7.0 | 4.4 | +2.6 |
Fit a depth-1 tree h₁ to the residuals {-2.4, -1.4, +0.1, +1.1, +2.6}, using x as the only feature. Splitting at x ≤ 2250 separates the three smaller houses from the two larger ones:
- Left (x ≤ 2250): mean residual = (-2.4 + -1.4 + 0.1)/3 ≈ -1.23
- Right (x > 2250): mean residual = (+1.1 + +2.6)/2 = +1.85
h₁ predicts -1.23 for x ≤ 2250 and +1.85 for x > 2250.
Update with ν = 0.5: F₁(x) = 4.4 + 0.5 × h₁(x)
- F₁ for x ≤ 2250 = 4.4 + 0.5 × (-1.23) ≈ 3.79
- F₁ for x > 2250 = 4.4 + 0.5 × (1.85) ≈ 5.33
New predictions vs. targets: [i=1: 3.79 vs 2.0], [i=3: 3.79 vs 4.5], [i=5: 5.33 vs 7.0]
Round 2 — new residuals: Errors are now smaller: i=1: -1.79, i=2: -0.79, i=3: +0.71, i=4: +0.17, i=5: +1.67. Fit another tree h₂ to these. F₂ = F₁ + 0.5 × h₂. Errors keep shrinking.
Round 3: Even smaller errors, another small correction. By M = 20–50 rounds, the predictions are very close to the true values (assuming the model has not started to overfit).
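The rounds above can be reproduced approximately with scikit-learn decision stumps. A greedy stump may choose a slightly different split than the one worked by hand, so the exact leaf values can differ, but the residuals shrink the same way round after round:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The 5-house dataset from the worked example (prices in $100K).
X = np.array([[1000], [1500], [2000], [2500], [3000]], dtype=float)
y = np.array([2.0, 3.0, 4.5, 5.5, 7.0])

nu = 0.5                              # same learning rate as the hand calculation
pred = np.full(len(y), y.mean())      # F_0 = 4.4

for m in range(1, 4):                 # three boosting rounds
    residuals = y - pred
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    pred = pred + nu * stump.predict(X)
    print(f"round {m}: residuals {np.round(y - pred, 2)}, "
          f"sum of squared errors {np.sum((y - pred) ** 2):.3f}")
```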
The Learning Rate - Trees Tradeoff
The two hyperparameters trade off against each other; roughly, their product sets the total amount of fitting:

ν × M ≈ effective total learning step

- ν: learning rate (0.01 to 0.3 typically)
- M: number of trees
Rule of thumb: use ν = 0.1 with M = 100–500 trees. Or ν = 0.01 with M = 1000–5000. Both can achieve similar test performance, but lower ν + more trees often generalizes better, at the cost of longer training.
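A small experiment sketch of this tradeoff on synthetic data (the dataset, noise level, and the two configurations are illustrative; exact scores will vary):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to compare the two settings.
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two configurations with roughly the same "total learning step" nu * M.
for nu, M in [(0.1, 500), (0.01, 5000)]:
    gbm = GradientBoostingRegressor(learning_rate=nu, n_estimators=M,
                                    max_depth=3, random_state=0)
    gbm.fit(X_train, y_train)
    print(f"nu={nu}, M={M}: test R^2 = {gbm.score(X_test, y_test):.3f}")
```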
Key Hyperparameters
| Parameter | Effect | Typical range |
|---|---|---|
| n_estimators (M) | Number of trees — more = lower bias | 100–10,000 |
| learning_rate (ν) | Shrinkage — smaller = more regularization | 0.01–0.3 |
| max_depth | Individual tree depth — shallow trees preferred | 3–8 |
| subsample | Row subsampling per tree | 0.6–1.0 |
| max_features | Feature subsampling per split | 0.5–1.0 |
| min_samples_leaf | Prevents tiny leaves | 1–50 |
Unlike Random Forests, gradient boosting CAN overfit as more trees are added — a smaller learning rate slows this down but does not prevent it. Use early stopping: monitor held-out validation (or out-of-bag) error and stop adding trees once it stops improving.
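With scikit-learn's GradientBoostingRegressor, early stopping can be requested via validation_fraction and n_iter_no_change; a sketch on the same kind of synthetic data (the hyperparameter values here are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hold out 10% of the training rows internally and stop adding trees once the
# validation score has not improved for 20 consecutive rounds.
gbm = GradientBoostingRegressor(
    n_estimators=5000,          # generous upper bound; early stopping halts sooner
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=0,
)
gbm.fit(X_train, y_train)
print("trees actually used:", gbm.n_estimators_,
      "| test R^2:", round(gbm.score(X_test, y_test), 3))
```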
What to Remember
- Gradient boosting is additive: F_M = F_0 + ν Σ h_m, where each h_m corrects the previous ensemble's errors
- For MSE: fit h_m to residuals. For general losses: fit to negative gradients (pseudo-residuals)
- Learning rate ν shrinks each tree's contribution — small ν + more trees usually generalizes better
- CAN overfit with too many trees — use early stopping or regularize with small learning rate
- Typical hyperparameters: ν = 0.1, M = 500, max_depth = 3–6, subsample = 0.8
- Gradient boosting is sequential — cannot be parallelized across trees (unlike Random Forests)