Tree Methods & Ensembles
Lesson 5 ⏱ 16 min

Gradient boosting: additive training

Video coming soon: "Gradient Boosting: Correcting Mistakes Sequentially" (~11 min), walking through 3 rounds of gradient boosting on a regression problem and showing how residuals shrink at each step.


Quick refresher

Gradient descent and residuals

Gradient descent minimizes a loss function by iteratively moving in the direction of the negative gradient. Residuals in regression are the differences between actual and predicted values: r_i = y_i - ŷ_i. A large positive residual means the model is under-predicting; a large negative residual means it's over-predicting.

Example

If you predict a house price as $300K but the actual price is $350K, the residual is +$50K.

Adding that residual to your prediction (a step in the negative-gradient direction for squared error) corrects it toward the truth.

Random Forests build trees in parallel and hope they compensate for each other's mistakes. Gradient boosting takes a different philosophy: build trees sequentially, where each new tree explicitly corrects what the previous ensemble got wrong.

The insight sounds simple but produces some of the most powerful algorithms ever created for tabular data.

The Additive Model Framework

Gradient boosting builds an additive model:

F_M(x) = F_0(x) + ν Σ_{m=1}^{M} h_m(x)

where
  • F_M(x): final ensemble prediction using M trees
  • F_0(x): initial constant prediction (usually the mean of y for regression)
  • h_m(x): the m-th tree (weak learner)
  • ν: learning rate — scales each tree's contribution

Starting from F_0(x) (a simple constant), we add trees one at a time, each designed to reduce the remaining error.

Fitting Residuals (Regression with MSE)

For squared error loss, L(y, F) = (1/2)(y - F)², the procedure is:

Step 1: Initialize F_0(x) = ȳ (mean of training targets)

For m = 1 to M:

  1. Compute residuals: r_i^{(m)} = y_i - F_{m-1}(x_i)
  2. Fit a shallow tree h_m to the residuals (treating r_i as the target)
  3. Update: F_m(x) = F_{m-1}(x) + ν × h_m(x)

This is gradient descent in function space: each h_m is a step in the direction that reduces the loss, and ν is the step size.
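
This loop translates almost directly into code. Below is a minimal sketch using scikit-learn's DecisionTreeRegressor as the weak learner; the function and parameter names (boost_mse, n_rounds, nu) are illustrative, not from any particular library.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_mse(X, y, n_rounds=100, nu=0.1, max_depth=3):
    """Gradient boosting for squared error: each tree is fit to the residuals."""
    F0 = y.mean()                              # Step 1: constant initial prediction
    pred = np.full(len(y), F0, dtype=float)
    trees = []
    for m in range(n_rounds):
        residuals = y - pred                   # 1. residuals = negative gradient of MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # 2. fit a shallow tree to the residuals
        pred += nu * tree.predict(X)           # 3. F_m = F_{m-1} + nu * h_m
        trees.append(tree)
    return F0, trees

def boost_predict(X, F0, trees, nu=0.1):
    pred = np.full(X.shape[0], F0, dtype=float)
    for tree in trees:
        pred += nu * tree.predict(X)
    return pred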

Why "Gradient" Boosting?

For MSE loss, the negative gradient at each point is:

r_i^{(m)} = -∂L(y_i, F(x_i)) / ∂F(x_i) |_{F = F_{m-1}} = y_i - F_{m-1}(x_i)

where
  • r_i^{(m)}: negative gradient = residual for MSE loss
  • ∂L/∂F: gradient of the loss with respect to the current prediction

For any differentiable loss function, replacing "residuals" with "negative gradient" gives:

r_i^{(m)} = -[∂L(y_i, F(x_i)) / ∂F(x_i)]_{F = F_{m-1}}

where
  • r_i^{(m)}: pseudo-residual for general loss L
  • y_i: true target
  • F_{m-1}(x_i): current ensemble prediction at step m-1

These pseudo-residuals are what h_m gets fit to in the general case. Whether the loss is log-loss (binary classification), absolute error, or Huber loss, the same algorithm applies; only the pseudo-residuals change.
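
As a rough sketch (not any library's exact internals), the pseudo-residuals for a few common losses can be written as follows; F is taken to be the ensemble's raw prediction (log-odds in the log-loss case).

import numpy as np

def pseudo_residuals(y, F, loss="mse"):
    """Negative gradient of the loss with respect to the current prediction F."""
    if loss == "mse":       # L = 1/2 (y - F)^2          -> r = y - F
        return y - F
    if loss == "mae":       # L = |y - F|                -> r = sign(y - F)
        return np.sign(y - F)
    if loss == "logloss":   # y in {0, 1}, F = log-odds  -> r = y - sigmoid(F)
        return y - 1.0 / (1.0 + np.exp(-F))
    raise ValueError(f"unknown loss: {loss}")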

Worked Example: 3 Rounds of Boosting

Dataset: 5 data points with features x and target y (house prices in $100K).

i | x (size, sqft) | y (price, $100K)
1 | 1000 | 2.0
2 | 1500 | 3.0
3 | 2000 | 4.5
4 | 2500 | 5.5
5 | 3000 | 7.0

F_0 = ȳ = (2 + 3 + 4.5 + 5.5 + 7)/5 = 4.4

Round 1 — compute residuals:

i | y_i | F_0 | r_i = y_i - F_0
1 | 2.0 | 4.4 | -2.4
2 | 3.0 | 4.4 | -1.4
3 | 4.5 | 4.4 | +0.1
4 | 5.5 | 4.4 | +1.1
5 | 7.0 | 4.4 | +2.6

Fit a depth-1 tree h₁ to the residuals {-2.4, -1.4, +0.1, +1.1, +2.6} with x as the feature, splitting at x ≤ 2250:

  • Left (x ≤ 2250): mean residual = (-2.4 + -1.4 + 0.1)/3 ≈ -1.23
  • Right (x > 2250): mean residual = (+1.1 + +2.6)/2 = +1.85

h₁ predicts: -1.23 for x ≤ 2250, +1.85 for x > 2250.

Update with ν = 0.5: F₁(x) = 4.4 + 0.5 × h₁(x)

  • F₁ for x ≤ 2250 = 4.4 + 0.5 × (-1.23) ≈ 3.79
  • F₁ for x > 2250 = 4.4 + 0.5 × (+1.85) ≈ 5.33

New predictions vs. targets: [i=1: 3.79 vs 2.0], [i=3: 3.79 vs 4.5], [i=5: 5.33 vs 7.0]

Round 2 — new residuals: Errors are now smaller: i=1: -1.79, i=2: -0.79, i=3: +0.71, i=4: +0.17, i=5: +1.67. Fit another tree h₂ to these. F₂ = F₁ + 0.5 × h₂. Errors keep shrinking.

Round 3: Even smaller errors, another small correction. By M = 20–50 rounds, the predictions are very close to the true values (assuming the model hasn't started to overfit).
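
You can check Round 1's arithmetic numerically. Here is a small NumPy sketch using the same split as above; the numbers match the text up to rounding.

import numpy as np

x = np.array([1000, 1500, 2000, 2500, 3000])   # size in sqft
y = np.array([2.0, 3.0, 4.5, 5.5, 7.0])        # price in $100K
nu = 0.5

F0 = y.mean()                                  # 4.4
r1 = y - F0                                    # [-2.4, -1.4, 0.1, 1.1, 2.6]

left = x <= 2250                               # the split used above
h1 = np.where(left, r1[left].mean(), r1[~left].mean())   # -1.23 left, +1.85 right
F1 = F0 + nu * h1                              # approx [3.78, 3.78, 3.78, 5.33, 5.33]
r2 = y - F1                                    # Round 2 residuals, already smaller
print(np.round(F1, 2), np.round(r2, 2))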

The Learning Rate vs. Number of Trees Tradeoff

Typically: smaller ν ⟹ more trees M needed for the same fit

where
  • ν: learning rate (typically 0.01 to 0.3)
  • M: number of trees
  • ν · M: effective total learning step

Rule of thumb: use ν = 0.1 with M = 100–500 trees. Or ν = 0.01 with M = 1000–5000. Both can achieve similar test performance, but lower ν + more trees often generalizes better, at the cost of longer training.
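
One way to see the tradeoff empirically is to train two models whose ν · M is comparable and compare their test scores. The sketch below uses scikit-learn's GradientBoostingRegressor on a synthetic dataset; the dataset and exact settings are placeholders.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same effective total step nu * M, traded off two different ways
fast = GradientBoostingRegressor(learning_rate=0.1, n_estimators=300, max_depth=3)
slow = GradientBoostingRegressor(learning_rate=0.01, n_estimators=3000, max_depth=3)

for name, model in [("nu=0.1, M=300", fast), ("nu=0.01, M=3000", slow)]:
    model.fit(X_tr, y_tr)
    print(name, "test R^2:", round(model.score(X_te, y_te), 3))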

Key Hyperparameters

Parameter | Effect | Typical range
n_estimators (M) | Number of trees — more = lower bias | 100–10,000
learning_rate (ν) | Shrinkage — smaller = more regularization | 0.01–0.3
max_depth | Individual tree depth — shallow trees preferred | 3–8
subsample | Row subsampling per tree | 0.6–1.0
max_features | Feature subsampling per split | 0.5–1.0
min_samples_leaf | Prevents tiny leaves | 1–50

Unlike Random Forests, gradient boosting CAN overfit as more trees are added — the learning rate slows this down but does not prevent it. Use early stopping: monitor validation (or out-of-bag) error and stop adding trees once it stops improving.
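
With scikit-learn's GradientBoostingRegressor, for example, early stopping can be enabled through an internal validation split; the parameter values below are illustrative, not recommendations.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)

# Stop adding trees once the held-out score hasn't improved for 10 rounds
gbr = GradientBoostingRegressor(
    n_estimators=2000,        # upper bound; early stopping usually ends well before this
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    validation_fraction=0.1,  # fraction of training data held out for early stopping
    n_iter_no_change=10,
    random_state=0,
)
gbr.fit(X, y)
print("trees actually used:", gbr.n_estimators_)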

Interactive example (coming soon): step through boosting rounds and watch residuals shrink as the ensemble converges to the true function.

What to Remember

  • Gradient boosting is additive: F_M = F_0 + ν Σ h_m, where each h_m corrects the previous ensemble's errors
  • For MSE: fit h_m to residuals. For general losses: fit to negative gradients (pseudo-residuals)
  • Learning rate ν shrinks each tree's contribution — small ν + more trees usually generalizes better
  • CAN overfit with too many trees — use early stopping or regularize with small learning rate
  • Typical hyperparameters: ν = 0.1, M = 500, max_depth = 3–6, subsample = 0.8
  • Gradient boosting is sequential — cannot be parallelized across trees (unlike Random Forests)

Quiz

1 / 3

In gradient boosting for regression, what target does each new tree h_m predict?