Random Forests build trees in parallel and hope they compensate for each other's mistakes. Gradient boosting takes a different philosophy: build trees sequentially, where each new tree explicitly corrects what the previous ensemble got wrong.
The insight sounds simple but produces some of the most powerful algorithms ever created for tabular data.
The Additive Model Framework
Gradient boosting builds an additive model:

F_M(x) = F_0(x) + ν Σ_{m=1}^{M} h_m(x)

where:
- F_M(x): final ensemble prediction using M trees
- F_0(x): initial constant prediction (usually the mean of y for regression)
- h_m(x): the m-th tree (weak learner)
- ν: learning rate — scales each tree's contribution
Starting from F_0 (a simple constant), we add trees one at a time, each designed to reduce the remaining error.
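As a minimal sketch of this additive structure (assuming each fitted tree exposes a scikit-learn-style .predict method; the function name is just for illustration), the final prediction is the initial constant plus the scaled sum of the trees' outputs:

```python
import numpy as np

def ensemble_predict(X, f0, trees, nu):
    """F_M(x) = F_0 + nu * sum of the M trees' predictions."""
    pred = np.full(X.shape[0], f0, dtype=float)   # start from the constant F_0
    for tree in trees:                            # add each weak learner's scaled output
        pred += nu * tree.predict(X)
    return pred
```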
Fitting Residuals (Regression with MSE)
For squared error loss, L(y, F) = (1/2)(y - F)², the procedure is:
Step 1: Initialize F_0(x) = ȳ (mean of training targets)
For m = 1 to M:
- Compute residuals: r_i^{(m)} = y_i - F_{m-1}(x_i)
- Fit a shallow tree h_m to the residuals (treating r_i as the target)
- Update: F_m(x) = F_{m-1}(x) + ν × h_m(x)
This is gradient descent in function space: each h_m is a step in the direction that reduces the loss, and ν is the step size.
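Below is a from-scratch sketch of this loop, using shallow scikit-learn regression trees as the weak learners (the function name, depth, and defaults are illustrative choices, not a reference implementation). New points can be scored with the ensemble_predict sketch above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_mse(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Gradient boosting for squared error: each tree is fit to the current residuals."""
    f0 = y.mean()                          # Step 1: F_0(x) = mean of training targets
    pred = np.full(len(y), f0)             # running ensemble prediction F_{m-1}(x_i)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred               # r_i = y_i - F_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)             # fit the weak learner to the residuals
        pred += nu * tree.predict(X)       # F_m = F_{m-1} + nu * h_m
        trees.append(tree)
    return f0, trees
```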
Why "Gradient" Boosting?
For MSE loss, the negative gradient at each point is exactly the residual. With L = (1/2)(y − F)², the gradient of the loss with respect to the current prediction is ∂L/∂F = −(y − F), so:

−∂L/∂F = y − F = residual
For any differentiable loss function, replacing "residuals" with "negative gradients" gives the pseudo-residuals:

r_i^{(m)} = −∂L(y_i, F(x_i))/∂F(x_i), evaluated at F = F_{m-1}

- r_i^{(m)}: pseudo-residual for general loss L
- y_i: true target
- F_{m-1}(x_i): current ensemble prediction at step m−1
These pseudo-residuals are what h_m gets fit to in the general case. The same algorithm applies to log-loss (binary classification), absolute error, and Huber loss; only the pseudo-residuals change.
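A small sketch of what the pseudo-residuals look like for a few common losses (for the log-loss case, F is taken to be the log-odds and y ∈ {0, 1}; the helper name is made up for illustration):

```python
import numpy as np

def pseudo_residuals(y, F, loss="mse"):
    """Negative gradient of the loss with respect to the current prediction F."""
    if loss == "mse":          # L = 1/2 (y - F)^2  ->  -dL/dF = y - F (ordinary residual)
        return y - F
    if loss == "absolute":     # L = |y - F|        ->  -dL/dF = sign(y - F)
        return np.sign(y - F)
    if loss == "logloss":      # binary y in {0, 1}, F is the log-odds
        return y - 1.0 / (1.0 + np.exp(-F))       # y - sigmoid(F)
    raise ValueError(f"unknown loss: {loss}")
```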
Worked Example: 3 Rounds of Boosting
Dataset: 5 data points with features x and target y (house prices in $100K).
| i | x (size, sqft) | y (price) |
|---|---|---|
| 1 | 1000 | 2.0 |
| 2 | 1500 | 3.0 |
| 3 | 2000 | 4.5 |
| 4 | 2500 | 5.5 |
| 5 | 3000 | 7.0 |
F_0: ȳ = (2+3+4.5+5.5+7)/5 = 4.4
Round 1 — compute residuals:
| i | y_i | F_0 | r_i = y_i - F_0 |
|---|---|---|---|
| 1 | 2.0 | 4.4 | -2.4 |
| 2 | 3.0 | 4.4 | -1.4 |
| 3 | 4.5 | 4.4 | +0.1 |
| 4 | 5.5 | 4.4 | +1.1 |
| 5 | 7.0 | 4.4 | +2.6 |
Fit a depth-1 tree h₁ to the residuals {-2.4, -1.4, +0.1, +1.1, +2.6}, using x as the only feature. Splitting at x ≤ 2250 separates the three smaller houses from the two larger ones:
- Left (x ≤ 2250): mean residual = (-2.4 + -1.4 + 0.1)/3 ≈ -1.23
- Right (x > 2250): mean residual = (+1.1 + +2.6)/2 = +1.85
h₁ predicts -1.23 for x ≤ 2250 and +1.85 for x > 2250.
Update with ν = 0.5: F₁(x) = 4.4 + 0.5 × h₁(x)
- F₁ for x ≤ 2250 = 4.4 + 0.5 × (-1.23) ≈ 3.79
- F₁ for x > 2250 = 4.4 + 0.5 × (1.85) ≈ 5.33
New predictions vs. targets: [i=1: 3.79 vs 2.0], [i=3: 3.79 vs 4.5], [i=5: 5.33 vs 7.0]
Round 2 — new residuals: Errors are now smaller: i=1: -1.79, i=2: -0.79, i=3: +0.71, i=4: +0.17, i=5: +1.67. Fit another tree h₂ to these. F₂ = F₁ + 0.5 × h₂. Errors keep shrinking.
Round 3: Even smaller errors, another small correction. By M = 20–50 rounds, the predictions are very close to the true values (assuming the model has not started to overfit).
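The rounds above can be reproduced approximately with scikit-learn decision stumps. A greedy stump may choose a slightly different split than the one worked by hand, so the exact leaf values can differ, but the residuals shrink the same way round after round:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# The 5-house dataset from the worked example (prices in $100K).
X = np.array([[1000], [1500], [2000], [2500], [3000]], dtype=float)
y = np.array([2.0, 3.0, 4.5, 5.5, 7.0])

nu = 0.5                              # same learning rate as the hand calculation
pred = np.full(len(y), y.mean())      # F_0 = 4.4

for m in range(1, 4):                 # three boosting rounds
    residuals = y - pred
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    pred = pred + nu * stump.predict(X)
    print(f"round {m}: residuals {np.round(y - pred, 2)}, "
          f"sum of squared errors {np.sum((y - pred) ** 2):.3f}")
```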
The Learning Rate - Trees Tradeoff
The two hyperparameters trade off against each other; roughly, their product sets the total amount of fitting:

ν × M ≈ effective total learning step

- ν: learning rate (0.01 to 0.3 typically)
- M: number of trees
Rule of thumb: use ν = 0.1 with M = 100–500 trees. Or ν = 0.01 with M = 1000–5000. Both can achieve similar test performance, but lower ν + more trees often generalizes better, at the cost of longer training.
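A small experiment sketch of this tradeoff on synthetic data (the dataset, noise level, and the two configurations are illustrative; exact scores will vary):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, used only to compare the two settings.
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two configurations with roughly the same "total learning step" nu * M.
for nu, M in [(0.1, 500), (0.01, 5000)]:
    gbm = GradientBoostingRegressor(learning_rate=nu, n_estimators=M,
                                    max_depth=3, random_state=0)
    gbm.fit(X_train, y_train)
    print(f"nu={nu}, M={M}: test R^2 = {gbm.score(X_test, y_test):.3f}")
```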
Key Hyperparameters
| Parameter | Effect | Typical range |
|---|---|---|
| n_estimators (M) | Number of trees — more = lower bias | 100–10,000 |
| learning_rate (ν) | Shrinkage — smaller = more regularization | 0.01–0.3 |
| max_depth | Individual tree depth — shallow trees preferred | 3–8 |
| subsample | Row subsampling per tree | 0.6–1.0 |
| max_features | Feature subsampling per split | 0.5–1.0 |
| min_samples_leaf | Prevents tiny leaves | 1–50 |
Unlike Random Forests, gradient boosting CAN overfit as more trees are added — a smaller learning rate slows this down but does not prevent it. Use early stopping: monitor held-out validation (or out-of-bag) error and stop adding trees once it stops improving.
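With scikit-learn's GradientBoostingRegressor, early stopping can be requested via validation_fraction and n_iter_no_change; a sketch on the same kind of synthetic data (the hyperparameter values here are illustrative):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hold out 10% of the training rows internally and stop adding trees once the
# validation score has not improved for 20 consecutive rounds.
gbm = GradientBoostingRegressor(
    n_estimators=5000,          # generous upper bound; early stopping halts sooner
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,
    n_iter_no_change=20,
    random_state=0,
)
gbm.fit(X_train, y_train)
print("trees actually used:", gbm.n_estimators_,
      "| test R^2:", round(gbm.score(X_test, y_test), 3))
```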
What to Remember
- Gradient boosting is additive: F_M = F_0 + ν Σ h_m, where each h_m corrects the previous ensemble's errors
- For MSE: fit h_m to residuals. For general losses: fit to negative gradients (pseudo-residuals)
- Learning rate ν shrinks each tree's contribution — small ν + more trees usually generalizes better
- CAN overfit with too many trees — use early stopping or regularize with small learning rate
- Typical hyperparameters: ν = 0.1, M = 500, max_depth = 3–6, subsample = 0.8
- Gradient boosting is sequential — cannot be parallelized across trees (unlike Random Forests)