Tree Methods & Ensembles
Lesson 6 ⏱ 14 min

XGBoost: the second-order approximation

Video (coming soon): "XGBoost: Engineering Gradient Boosting to Win Kaggle" (~12 min). Walks through XGBoost's second-order expansion, regularization, and the exact split-finding algorithm.

🧮 Quick refresher

Taylor series approximation

A Taylor series approximates a function f(x+Δ) near x using f's derivatives at x: f(x+Δ) ≈ f(x) + f'(x)·Δ + (1/2)f''(x)·Δ². The first derivative gives the slope (gradient); the second derivative gives the curvature (Hessian). Using both the first and second derivatives gives a better local approximation than using only the first.

Example

Approximating sin(x+0.1) near x=0: sin(0) + cos(0)·0.1 - (1/2)sin(0)·0.01 = 0 + 0.1 - 0 = 0.1.

The true value is 0.0998 — very close.
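If you want to see the effect of the curvature term away from x = 0 (where it happens to vanish), here is a minimal Python sketch; the function name taylor_sin is purely illustrative:

    import math

    def taylor_sin(x, delta, order=2):
        """Approximate sin(x + delta) by a Taylor expansion around x."""
        approx = math.sin(x) + math.cos(x) * delta           # first-order (slope) term
        if order >= 2:
            approx += 0.5 * (-math.sin(x)) * delta ** 2      # second-order (curvature) term
        return approx

    x, delta = 0.7, 0.1
    print(taylor_sin(x, delta, order=1))   # ~0.72070  (first order only)
    print(taylor_sin(x, delta, order=2))   # ~0.71748  (adding curvature)
    print(math.sin(x + delta))             # ~0.71736  (true value)

The second-order approximation lands noticeably closer to the true value, and that is exactly the advantage XGBoost exploits below.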

For several years after its 2014 release, XGBoost appeared in more Kaggle winning solutions than any other single algorithm: in the analysis reported in Chen and Guestrin's XGBoost paper, 17 of the 29 winning solutions published on Kaggle's blog in 2015 used it. So the question isn't "why is XGBoost good?" but "what exactly did Chen and Guestrin engineer that makes it so consistently dominant on tabular data?"

XGBoost (eXtreme Gradient Boosting) is, in short, gradient boosting done right.

The Regularized Objective

Standard gradient boosting minimizes the training loss directly. XGBoost adds an explicit complexity penalty to the objective:

\mathcal{L} = \sum_i L(y_i, \hat{y}_i) + \sum_m \Omega(f_m), \quad \text{where} \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_j w_j^2

  • ℒ: the total XGBoost objective
  • Σ_i L(y_i, ŷ_i): training loss over all examples
  • Ω(f_m): complexity penalty for tree m
  • T: number of leaves in tree m
  • w_j: weight (prediction) of leaf j
  • γ: penalty per leaf (encourages fewer leaves)
  • λ: L2 penalty on leaf weights (encourages small weights)
  • γ (gamma): penalizes the number of leaves T. A candidate split is only kept if its gain outweighs the γ penalty.
  • λ (lambda): L2 regularization on leaf weights. Shrinks leaf predictions toward zero.
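As a quick sanity check on the formula, here is a tiny sketch that scores a tree's complexity from its leaf weights (the helper name complexity_penalty is illustrative, not XGBoost's internal API):

    def complexity_penalty(leaf_weights, gamma, lam):
        """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
        T = len(leaf_weights)
        return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

    # A 3-leaf tree with gamma = 0.5 and lambda = 1.0:
    print(complexity_penalty([1.2, -0.4, 0.9], gamma=0.5, lam=1.0))   # 1.5 + 1.205 = 2.705

Larger trees and larger leaf weights both make Ω(f) grow, so the objective trades training loss against tree complexity.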

Second-Order Taylor Expansion

At boosting round m, we're adding tree f_m to the current ensemble. XGBoost approximates the loss with a second-order Taylor expansion around the current predictions; dropping the constant terms that don't depend on f_m leaves:

\mathcal{L}^{(m)} \approx \sum_i \left[g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2\right] + \Omega(f_m)

  • g_i: first derivative of the loss for example i, g_i = ∂L(y_i, ŷ_i)/∂ŷ_i, evaluated at the current prediction
  • h_i: second derivative of the loss for example i, h_i = ∂²L(y_i, ŷ_i)/∂ŷ_i²
  • f_m(x_i): the new tree's prediction for example i
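What g_i and h_i actually look like depends on the loss. For two common cases the derivatives have simple closed forms; a sketch assuming squared error and logistic loss (the helper names are illustrative):

    import numpy as np

    def grad_hess_squared_error(y, y_pred):
        """L = 0.5 * (y - y_pred)^2  ->  g = y_pred - y,  h = 1."""
        return y_pred - y, np.ones_like(y)

    def grad_hess_logistic(y, raw_score):
        """Log loss on raw scores, p = sigmoid(raw)  ->  g = p - y,  h = p * (1 - p)."""
        p = 1.0 / (1.0 + np.exp(-raw_score))
        return p - y, p * (1.0 - p)

    y = np.array([1.0, 0.0, 1.0])
    raw = np.array([0.2, -0.1, 1.5])
    g, h = grad_hess_logistic(y, raw)
    print(g, h)

Because only g_i and h_i enter the objective, the same tree-building machinery works for any twice-differentiable loss.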

This quadratic approximation can be minimized analytically. For a fixed tree structure, the optimal leaf weight for leaf j is:

w_j^* = -\frac{G_j}{H_j + \lambda}

  • w_j^*: optimal weight for leaf j
  • G_j: sum of the gradients g_i over all examples in leaf j
  • H_j: sum of the Hessians h_i over all examples in leaf j
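In code the optimal weight is a one-liner; a sketch with an illustrative helper name:

    def optimal_leaf_weight(G, H, lam):
        """w* = -G / (H + lambda) for a leaf with gradient sum G and Hessian sum H."""
        return -G / (H + lam)

    print(optimal_leaf_weight(G=-3.5, H=2.0, lam=1.0))   # +1.1667, the left leaf of the worked example below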

And the resulting minimum objective value for that tree structure is:

\mathcal{L}^*(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

  • ℒ*(q): minimum achievable loss for a tree with structure q

This is the score of a tree structure — XGBoost uses it to evaluate candidate splits.
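A sketch of that structure score, taking per-leaf gradient and Hessian sums (illustrative helper, not the library's API):

    def structure_score(leaf_G, leaf_H, gamma, lam):
        """-(1/2) * sum_j G_j^2 / (H_j + lambda) + gamma * T  (lower is better)."""
        T = len(leaf_G)
        return -0.5 * sum(G * G / (H + lam) for G, H in zip(leaf_G, leaf_H)) + gamma * T

    # Same data, two candidate structures (gamma = 0.5, lambda = 1.0):
    print(structure_score([-0.5], [4.0], gamma=0.5, lam=1.0))             #  0.475  (keep a single leaf)
    print(structure_score([-3.5, 3.0], [2.0, 2.0], gamma=0.5, lam=1.0))   # -2.542  (split into two leaves)

The drop from 0.475 to about -2.542 is roughly 3.02, which is exactly the split gain computed in the worked example below.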

The Split Gain Formula

To decide whether to split leaf j into left (L) and right (R) children:

\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H + \lambda}\right] - \gamma

  • Gain: the improvement in objective from this split
  • G_L, H_L: gradient/Hessian sums for the left child
  • G_R, H_R: gradient/Hessian sums for the right child
  • G, H: gradient/Hessian sums for the parent node (G = G_L + G_R, H = H_L + H_R)

A split is accepted only if Gain > 0 (otherwise, the γ penalty makes it not worth splitting). This formula is evaluated for every candidate (feature, threshold) pair.
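The same decision rule as a short sketch (split_gain is an illustrative name; real implementations evaluate it over pre-sorted candidate thresholds for every feature):

    def split_gain(G_L, H_L, G_R, H_R, gamma, lam):
        """Objective improvement from splitting a node into left/right children."""
        parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
        left = G_L ** 2 / (H_L + lam)
        right = G_R ** 2 / (H_R + lam)
        return 0.5 * (left + right - parent) - gamma

    # Accept the split only when the gain is positive:
    print(split_gain(G_L=-3.5, H_L=2.0, G_R=3.0, H_R=2.0, gamma=0.5, lam=1.0) > 0)   # True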

Worked Example: Computing Split Gain

Suppose a leaf contains 4 examples with the following gradients and Hessians of the loss L(y_i, ŷ_i):

i | g_i  | h_i
1 | -2.0 | 1.0
2 | -1.5 | 1.0
3 | +1.0 | 1.0
4 | +2.0 | 1.0

Parent: G = (-2 -1.5 +1 +2) = -0.5, H = 4.0

Proposed split: examples {1,2} go left, {3,4} go right.

  • G_L = -3.5, H_L = 2.0
  • G_R = +3.0, H_R = 2.0

With λ = 1.0 and γ = 0.5:

\text{Gain} = \frac{1}{2}\left[\frac{(-3.5)^2}{2+1} + \frac{(3.0)^2}{2+1} - \frac{(-0.5)^2}{4+1}\right] - 0.5

= 0.5 × [12.25/3 + 9.0/3 - 0.25/5] - 0.5

= 0.5 × [4.083 + 3.0 - 0.05] - 0.5

= 0.5 × 7.033 - 0.5

= 3.517 - 0.5 = 3.017

Gain = 3.017 > 0 → accept the split. Optimal leaf weights: w_L* = -G_L/(H_L+λ) = 3.5/3 ≈ +1.17, w_R* = -G_R/(H_R+λ) = -3.0/3 = -1.0.
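To double-check the arithmetic, a self-contained NumPy sketch that recomputes the gain and leaf weights from the table above:

    import numpy as np

    g = np.array([-2.0, -1.5, 1.0, 2.0])
    h = np.array([1.0, 1.0, 1.0, 1.0])
    left = np.array([True, True, False, False])   # examples {1, 2} go left
    lam, gamma = 1.0, 0.5

    G_L, H_L = g[left].sum(), h[left].sum()
    G_R, H_R = g[~left].sum(), h[~left].sum()
    G, H = g.sum(), h.sum()

    gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)) - gamma
    w_L, w_R = -G_L / (H_L + lam), -G_R / (H_R + lam)
    print(round(gain, 3), round(w_L, 3), round(w_R, 3))   # 3.017 1.167 -1.0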

XGBoost's Engineering Advantages

Column subsampling (colsample_bytree, colsample_bylevel): Sample features at the tree level and/or node level. Reduces correlation between trees (like Random Forests) and speeds up training.

Row subsampling (subsample): Use only a fraction of rows for each tree (stochastic gradient boosting).

Sparsity-aware split finding: XGBoost handles missing values natively — it learns the default direction for missing values at each split (left or right), based on which direction reduces loss more.

Cache-aware, block-based computation: XGBoost stores pre-sorted feature values in in-memory blocks and scans them in a cache-friendly order during exact greedy split finding. For datasets too large for exact enumeration, it switches to approximate split finding over feature histograms; histogram-based splitting is also the approach used by LightGBM and CatBoost.
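These knobs map directly onto parameters of the xgboost package. A minimal sketch using the scikit-learn wrapper on synthetic data; the parameter values are illustrative, not tuned recommendations:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    X[rng.random(X.shape) < 0.1] = np.nan       # 10% missing values, routed by sparsity-aware splits

    model = xgb.XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=4,
        subsample=0.8,           # row subsampling (stochastic gradient boosting)
        colsample_bytree=0.8,    # feature subsampling per tree
        tree_method="hist",      # histogram-based (approximate) split finding
    )
    model.fit(X, y)

No imputation step is needed: each split learns a default direction for missing values during training.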

XGBoost vs. LightGBM vs. CatBoost

All three are gradient boosting implementations. The key differences:

Feature | XGBoost | LightGBM | CatBoost
Tree growth | Level-wise (depth-wise) | Leaf-wise | Symmetric (oblivious) trees
Categorical features | Encode manually | Native support (limited) | Native, ordered boosting
Speed | Fast | Fastest | Medium
Memory | Medium | Low | Medium
Cold-start robustness | Good | Good | Excellent

For most tabular tasks today, LightGBM trains faster and often matches XGBoost's accuracy. CatBoost excels when you have many categorical features. XGBoost remains the most battle-tested and well-documented.

Key Hyperparameters

Parameter | Effect | Typical range
n_estimators | Number of trees | 100–5000 (use early stopping)
learning_rate (η) | Shrinkage per tree | 0.01–0.3
max_depth | Tree depth | 3–10
min_child_weight | Minimum Σh_i in a leaf; prunes low-confidence leaves | 1–20
gamma (γ) | Minimum gain to split | 0–5
lambda (λ) | L2 regularization on leaf weights | 1–100
subsample | Row sampling fraction | 0.5–1.0
colsample_bytree | Feature sampling fraction per tree | 0.5–1.0
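An end-to-end sketch showing how these parameters appear in the scikit-learn wrapper, with early stopping on a validation set (values are illustrative; note that recent xgboost versions take early_stopping_rounds as a constructor argument, while older versions accepted it in fit()):

    import numpy as np
    import xgboost as xgb
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=2000)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = xgb.XGBRegressor(
        n_estimators=2000,           # large cap; early stopping chooses the actual count
        learning_rate=0.05,
        max_depth=5,
        min_child_weight=5,          # minimum sum of h_i in a leaf
        gamma=0.1,                   # minimum gain required to keep a split
        reg_lambda=1.0,              # the lambda of the objective (named reg_lambda here)
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=50,
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    print(model.best_iteration)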

Interactive example (coming soon): input gradient and Hessian values and watch the split gain formula pick the optimal split in real time.

What to Remember

  • XGBoost adds γ (penalizes leaves) and λ (penalizes leaf weight magnitude) to the objective
  • Second-order Taylor expansion gives analytic optimal leaf weights: w* = -G/(H+λ)
  • Split gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ
  • Splits with Gain ≤ 0 are not performed
  • Native missing value handling: learns the default direction for missing values
  • Column/row subsampling, cache-aware computation, and sparsity handling give speed advantages
  • LightGBM (leaf-wise growth) often trains faster; CatBoost handles categoricals better

Quiz

Question 1 of 3

In XGBoost's regularized objective, what does the term γT penalize?