Tree Methods & Ensembles
Lesson 6 ⏱ 14 min

XGBoost: the second-order approximation

Video (coming soon): "XGBoost: Engineering Gradient Boosting to Win Kaggle" (~12 min). Walks through XGBoost's second-order expansion, regularization, and the exact split-finding algorithm.

🧮 Quick refresher

Taylor series approximation

A Taylor series approximates a function f(x+Δ) near x using f's derivatives at x: f(x+Δ) ≈ f(x) + f'(x)·Δ + (1/2)f''(x)·Δ². The first derivative gives the slope (gradient); the second derivative gives the curvature (Hessian). Using both the first and second derivatives gives a better local approximation than using only the first.

Example

Approximating sin(x+0.1) near x=0: sin(0) + cos(0)·0.1 - (1/2)sin(0)·0.01 = 0 + 0.1 - 0 = 0.1.

The true value is 0.0998 — very close.
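If you want to see the effect of the curvature term away from x = 0 (where it happens to vanish), here is a minimal Python sketch; the function name taylor_sin is purely illustrative:

    import math

    def taylor_sin(x, delta, order=2):
        """Approximate sin(x + delta) by a Taylor expansion around x."""
        approx = math.sin(x) + math.cos(x) * delta           # first-order (slope) term
        if order >= 2:
            approx += 0.5 * (-math.sin(x)) * delta ** 2      # second-order (curvature) term
        return approx

    x, delta = 0.7, 0.1
    print(taylor_sin(x, delta, order=1))   # ~0.72070  (first order only)
    print(taylor_sin(x, delta, order=2))   # ~0.71748  (adding curvature)
    print(math.sin(x + delta))             # ~0.71736  (true value)

The second-order approximation lands noticeably closer to the true value, and that is exactly the advantage XGBoost exploits below.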

For several years after its 2014 release, XGBoost appeared in more Kaggle winning solutions than any other single algorithm: in the analysis reported in Chen and Guestrin's XGBoost paper, 17 of the 29 winning solutions published on Kaggle's blog in 2015 used it. So the question isn't "why is XGBoost good?" but "what exactly did Chen and Guestrin engineer that makes it so consistently dominant on tabular data?"

XGBoost (eXtreme Gradient Boosting) is, in short, gradient boosting done right.

The Regularized Objective

Standard gradient boosting minimizes the training loss directly. XGBoost adds an explicit complexity penalty to the objective:

\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_m \Omega(f_m), \quad \text{where} \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2

  • ℒ: the total XGBoost objective
  • Σ_i L(y_i, ŷ_i): training loss over all examples
  • Ω(f_m): complexity penalty for tree m
  • T: number of leaves in tree m
  • w_j: weight (prediction) of leaf j
  • γ: penalty per leaf (encourages fewer leaves)
  • λ: L2 penalty on leaf weights (encourages small weights)
  • γ (gamma): penalizes the number of leaves T. A candidate split is only kept if its gain outweighs the γ penalty.
  • λ (lambda): L2 regularization on leaf weights. Shrinks leaf predictions toward zero.
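As a quick sanity check on the formula, here is a tiny sketch that scores a tree's complexity from its leaf weights (the helper name complexity_penalty is illustrative, not XGBoost's internal API):

    def complexity_penalty(leaf_weights, gamma, lam):
        """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
        T = len(leaf_weights)
        return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

    # A 3-leaf tree with gamma = 0.5 and lambda = 1.0:
    print(complexity_penalty([1.2, -0.4, 0.9], gamma=0.5, lam=1.0))   # 1.5 + 1.205 = 2.705

Larger trees and larger leaf weights both make Ω(f) grow, so the objective trades training loss against tree complexity.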

Second-Order Taylor Expansion

At boosting round m, we're adding tree f_m to the current ensemble. XGBoost approximates the loss with a second-order Taylor expansion around the current predictions; dropping the constant terms that don't depend on f_m leaves:

\mathcal{L}^{(m)} \approx \sum_i \left[g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2\right] + \Omega(f_m)

  • g_i: first derivative of the loss for example i, g_i = ∂L(y_i, ŷ_i)/∂ŷ_i, evaluated at the current prediction
  • h_i: second derivative of the loss for example i, h_i = ∂²L(y_i, ŷ_i)/∂ŷ_i²
  • f_m(x_i): the new tree's prediction for example i
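What g_i and h_i actually look like depends on the loss. For two common cases the derivatives have simple closed forms; a sketch assuming squared error and logistic loss (the helper names are illustrative):

    import numpy as np

    def grad_hess_squared_error(y, y_pred):
        """L = 0.5 * (y - y_pred)^2  ->  g = y_pred - y,  h = 1."""
        return y_pred - y, np.ones_like(y)

    def grad_hess_logistic(y, raw_score):
        """Log loss on raw scores, p = sigmoid(raw)  ->  g = p - y,  h = p * (1 - p)."""
        p = 1.0 / (1.0 + np.exp(-raw_score))
        return p - y, p * (1.0 - p)

    y = np.array([1.0, 0.0, 1.0])
    raw = np.array([0.2, -0.1, 1.5])
    g, h = grad_hess_logistic(y, raw)
    print(g, h)

Because only g_i and h_i enter the objective, the same tree-building machinery works for any twice-differentiable loss.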

This quadratic approximation can be minimized analytically. For a fixed tree structure, the optimal leaf weight for leaf j is:

w_j^* = -\frac{G_j}{H_j + \lambda}

  • w_j^*: optimal weight for leaf j
  • G_j: sum of the gradients g_i over all examples in leaf j
  • H_j: sum of the Hessians h_i over all examples in leaf j
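In code the optimal weight is a one-liner; a sketch with an illustrative helper name:

    def optimal_leaf_weight(G, H, lam):
        """w* = -G / (H + lambda) for a leaf with gradient sum G and Hessian sum H."""
        return -G / (H + lam)

    print(optimal_leaf_weight(G=-3.5, H=2.0, lam=1.0))   # +1.1667, the left leaf of the worked example below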

And the resulting minimum objective value for that tree structure is:

\mathcal{L}^*(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

  • ℒ*(q): minimum achievable loss for a tree with structure q

This is the score of a tree structure — XGBoost uses it to evaluate candidate splits.
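A sketch of that structure score, taking per-leaf gradient and Hessian sums (illustrative helper, not the library's API):

    def structure_score(leaf_G, leaf_H, gamma, lam):
        """-(1/2) * sum_j G_j^2 / (H_j + lambda) + gamma * T  (lower is better)."""
        T = len(leaf_G)
        return -0.5 * sum(G * G / (H + lam) for G, H in zip(leaf_G, leaf_H)) + gamma * T

    # Same data, two candidate structures (gamma = 0.5, lambda = 1.0):
    print(structure_score([-0.5], [4.0], gamma=0.5, lam=1.0))             #  0.475  (keep a single leaf)
    print(structure_score([-3.5, 3.0], [2.0, 2.0], gamma=0.5, lam=1.0))   # -2.542  (split into two leaves)

The drop from 0.475 to about -2.542 is roughly 3.02, which is exactly the split gain computed in the worked example below.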

The Split Gain Formula

To decide whether to split leaf j into left (L) and right (R) children:

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H + \lambda}\right] - \gamma

  • Gain: the improvement in objective from this split
  • G_L, H_L: gradient/Hessian sums for the left child
  • G_R, H_R: gradient/Hessian sums for the right child
  • G, H: gradient/Hessian sums for the parent node (G = G_L + G_R, H = H_L + H_R)

A split is accepted only if Gain > 0 (otherwise, the γ penalty makes it not worth splitting). This formula is evaluated for every candidate (feature, threshold) pair.
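The same decision rule as a short sketch (split_gain is an illustrative name; real implementations evaluate it over pre-sorted candidate thresholds for every feature):

    def split_gain(G_L, H_L, G_R, H_R, gamma, lam):
        """Objective improvement from splitting a node into left/right children."""
        parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
        left = G_L ** 2 / (H_L + lam)
        right = G_R ** 2 / (H_R + lam)
        return 0.5 * (left + right - parent) - gamma

    # Accept the split only when the gain is positive:
    print(split_gain(G_L=-3.5, H_L=2.0, G_R=3.0, H_R=2.0, gamma=0.5, lam=1.0) > 0)   # True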

Worked Example: Computing Split Gain

Suppose a leaf contains 4 examples with the following gradients and Hessians of the loss L(y_i, ŷ_i):

i | g_i  | h_i
1 | -2.0 | 1.0
2 | -1.5 | 1.0
3 | +1.0 | 1.0
4 | +2.0 | 1.0

Parent: G = (-2 -1.5 +1 +2) = -0.5, H = 4.0

Proposed split: examples {1,2} go left, {3,4} go right.

  • G_L = -3.5, H_L = 2.0
  • G_R = +3.0, H_R = 2.0

With λ = 1.0 and γ = 0.5:

\text{Gain} = \frac{1}{2}\left[\frac{(-3.5)^2}{2+1} + \frac{(3.0)^2}{2+1} - \frac{(-0.5)^2}{4+1}\right] - 0.5

= 0.5 × [12.25/3 + 9.0/3 - 0.25/5] - 0.5

= 0.5 × [4.083 + 3.0 - 0.05] - 0.5

= 0.5 × 7.033 - 0.5

= 3.517 - 0.5 = 3.017

Gain = 3.017 > 0 → accept the split. Optimal leaf weights: w_L* = -G_L/(H_L+λ) = 3.5/3 ≈ +1.17, w_R* = -G_R/(H_R+λ) = -3.0/3 = -1.0.
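To double-check the arithmetic, a self-contained NumPy sketch that recomputes the gain and leaf weights from the table above:

    import numpy as np

    g = np.array([-2.0, -1.5, 1.0, 2.0])
    h = np.array([1.0, 1.0, 1.0, 1.0])
    left = np.array([True, True, False, False])   # examples {1, 2} go left
    lam, gamma = 1.0, 0.5

    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[~left].sum(), h[~left].sum()
    G, H = g.sum(), h.sum()

    gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)) - gamma
    w_L, w_R = -G_L / (H_L + lam), -G_R / (H_R + lam)
    print(round(gain, 3), round(w_L, 3), round(w_R, 3))   # 3.017 1.167 -1.0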

XGBoost's Engineering Advantages

Column subsampling (colsample_bytree, colsample_bylevel): Sample features at the tree level and/or node level. Reduces correlation between trees (like Random Forests) and speeds up training.

Row subsampling (subsample): Use only a fraction of rows for each tree (stochastic gradient boosting).

Sparsity-aware split finding: XGBoost handles missing values natively — it learns the default direction for missing values at each split (left or right), based on which direction reduces loss more.

Cache-aware, block-based computation: XGBoost stores pre-sorted feature values in in-memory blocks and scans them in a cache-friendly order during exact greedy split finding. For datasets too large for exact enumeration, it switches to approximate split finding over feature histograms; histogram-based splitting is also the approach used by LightGBM and CatBoost.
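These knobs map directly onto parameters of the xgboost package. A minimal sketch using the scikit-learn wrapper on synthetic data; the parameter values are illustrative, not tuned recommendations:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < 0.1] = np.nan       # 10% missing values, routed by sparsity-aware splits

    model = xgb.XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        subsample=0.8,           # row subsampling (stochastic gradient boosting)
        colsample_bytree=0.8,    # feature subsampling per tree
        tree_method="hist",      # histogram-based (approximate) split finding
    )
    model.fit(X, y)

No imputation step is needed: each split learns a default direction for missing values during training.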

XGBoost vs. LightGBM vs. CatBoost

All three are gradient boosting implementations. The key differences:

Feature | XGBoost | LightGBM | CatBoost
Tree growth | Level-wise (depth-wise) | Leaf-wise | Symmetric (oblivious) trees
Categorical features | Encode manually | Native support (limited) | Native, ordered boosting
Speed | Fast | Fastest | Medium
Memory | Medium | Low | Medium
Cold-start robustness | Good | Good | Excellent

For most tabular tasks today, LightGBM trains faster and often matches XGBoost's accuracy. CatBoost excels when you have many categorical features. XGBoost remains the most battle-tested and well-documented.

Key Hyperparameters

Parameter | Effect | Typical range
n_estimators | Number of trees | 100–5000 (use early stopping)
learning_rate (η) | Shrinkage per tree | 0.01–0.3
max_depth | Tree depth | 3–10
min_child_weight | Minimum Σh_i in a leaf; prunes low-confidence leaves | 1–20
gamma (γ) | Minimum gain to split | 0–5
lambda (λ) | L2 regularization on leaf weights | 1–100
subsample | Row sampling fraction | 0.5–1.0
colsample_bytree | Feature sampling fraction per tree | 0.5–1.0
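An end-to-end sketch showing how these parameters appear in the scikit-learn wrapper, with early stopping on a validation set (values are illustrative; note that recent xgboost versions take early_stopping_rounds as a constructor argument, while older versions accepted it in fit()):

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=2000)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = xgb.XGBRegressor(
        n_estimators=2000,           # large cap; early stopping chooses the actual count
        learning_rate=0.05,
        max_depth=5,
        min_child_weight=5,          # minimum sum of h_i in a leaf
        gamma=0.1,                   # minimum gain required to keep a split
        reg_lambda=1.0,              # the lambda of the objective (named reg_lambda here)
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    print(model.best_iteration)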

Interactive example (coming soon): input gradient and Hessian values and watch the split gain formula pick the optimal split in real time.

What to Remember

  • XGBoost adds γ (penalizes leaves) and λ (penalizes leaf weight magnitude) to the objective
  • Second-order Taylor expansion gives analytic optimal leaf weights: w* = -G/(H+λ)
  • Split gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ
  • Splits with Gain ≤ 0 are not performed
  • Native missing value handling: learns the default direction for missing values
  • Column/row subsampling, cache-aware computation, and sparsity handling give speed advantages
  • LightGBM (leaf-wise growth) often trains faster; CatBoost handles categoricals better

Quiz

Question 1 of 3

In XGBoost's regularized objective, what does the term γT penalize?