Between 2014 and 2018, XGBoost won more Kaggle competitions than any other algorithm. In one analysis of 29 winning solutions published on the Kaggle competition platform, 17 used XGBoost. The question isn't "why is XGBoost good?" — it's "what exactly did Chen and Guestrin engineer that makes it so consistently dominant on tabular data?"
XGBoost (eXtreme Gradient Boosting) is gradient boosting, done right.
The Regularized Objective
Standard gradient boosting minimizes the loss directly. XGBoost adds explicit regularization to the objective:

Obj = Σ_i L(y_i, ŷ_i) + Σ_m Ω(f_m),   with   Ω(f_m) = γ·T_m + ½·λ·Σ_j w_j²

- Obj: total XGBoost objective
- Σ_i L(y_i, ŷ_i): training loss over all examples
- Ω(f_m): complexity penalty for tree m
- T_m: number of leaves in tree m
- w_j: weight (prediction) of leaf j
- γ (gamma): penalty per leaf (encourages fewer leaves). A split is only performed if its gain exceeds γ.
- λ (lambda): L2 penalty on leaf weights (encourages small weights). Shrinks leaf predictions toward zero.
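To make the complexity penalty concrete, here is a minimal Python sketch (not XGBoost's internal code) that computes Ω for a tree represented simply by its list of leaf weights; the representation and function name are illustrative assumptions:

```python
import numpy as np

def tree_complexity(leaf_weights, gamma=0.5, reg_lambda=1.0):
    """Omega(f_m) = gamma * T_m + 0.5 * lambda * sum_j w_j^2.

    leaf_weights: the tree's leaf values w_j (an illustrative
    representation; XGBoost stores trees in its own structures).
    """
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * reg_lambda * np.sum(w ** 2)

# A 3-leaf tree with weights [0.8, -0.3, 0.1]:
# Omega = 0.5*3 + 0.5*1.0*(0.64 + 0.09 + 0.01) = 1.5 + 0.37 = 1.87
print(tree_complexity([0.8, -0.3, 0.1]))  # 1.87
```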
Second-Order Taylor Expansion
At boosting round m, we're adding tree f_m to the current ensemble. XGBoost approximates the loss with a second-order Taylor expansion around the current predictions:
Obj^(m) ≈ Σ_i [ g_i·f_m(x_i) + ½·h_i·f_m(x_i)² ] + Ω(f_m)

- g_i: first derivative of loss for example i: g_i = ∂L(y_i, ŷ_i)/∂ŷ_i
- h_i: second derivative of loss for example i: h_i = ∂²L(y_i, ŷ_i)/∂ŷ_i²
- f_m(x_i): the new tree's prediction for example i
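The gradients and Hessians depend only on the loss and the current predictions, so they can be computed once per boosting round. A short sketch for two common losses, squared error and logistic loss; the function names are assumptions for illustration:

```python
import numpy as np

def grad_hess_squared_error(y, y_pred):
    # L = 0.5 * (y - y_pred)^2  ->  g_i = y_pred - y,  h_i = 1
    return y_pred - y, np.ones_like(y_pred)

def grad_hess_logistic(y, raw_score):
    # L = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(raw_score)
    # ->  g_i = p - y,  h_i = p * (1 - p)
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)
```

Note that for squared error every h_i equals 1, which is why the worked example below uses h_i = 1.0 throughout.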
This quadratic approximation can be minimized analytically. For a fixed tree structure, the optimal leaf weight for leaf j is:
w_j* = -G_j / (H_j + λ)

- w_j*: optimal weight for leaf j
- G_j = Σ_{i∈leaf j} g_i: sum of gradients g_i for all examples in leaf j
- H_j = Σ_{i∈leaf j} h_i: sum of Hessians h_i for all examples in leaf j
And the resulting minimum objective value for that tree structure is:
Obj*(q) = -½ · Σ_j G_j² / (H_j + λ) + γ·T

- Obj*(q): minimum achievable loss for tree with structure q
This is the score of a tree structure — XGBoost uses it to evaluate candidate splits.
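As a sketch (illustrative helper names, not XGBoost's API), the optimal leaf weight and the structure score follow directly from the per-leaf sums G_j and H_j:

```python
def optimal_leaf_weight(G, H, reg_lambda=1.0):
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + reg_lambda)

def structure_score(G_leaves, H_leaves, gamma=0.5, reg_lambda=1.0):
    # Obj*(q) = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    T = len(G_leaves)
    return -0.5 * sum(G * G / (H + reg_lambda)
                      for G, H in zip(G_leaves, H_leaves)) + gamma * T
```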
The Split Gain Formula
To decide whether to split leaf j into left (L) and right (R) children:
Gain = ½ · [ G_L²/(H_L + λ) + G_R²/(H_R + λ) - G²/(H + λ) ] - γ

- Gain: the improvement in objective from this split
- G_L, H_L: gradient/Hessian sums for the left child
- G_R, H_R: gradient/Hessian sums for the right child
- G = G_L + G_R, H = H_L + H_R: gradient/Hessian sums for the parent node
A split is accepted only if Gain > 0 (otherwise, the γ penalty makes it not worth splitting). This formula is evaluated for every candidate (feature, threshold) pair.
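The same formula as a small Python sketch (again with illustrative names); the real library evaluates it over presorted feature columns rather than calling a function per candidate:

```python
def split_gain(G_L, H_L, G_R, H_R, gamma=0.5, reg_lambda=1.0):
    # Gain = 0.5*[G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - (G_L+G_R)^2/(H_L+H_R+λ)] - γ
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```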
Worked Example: Computing Split Gain
Suppose a leaf contains 4 examples with the following gradients and Hessians of the loss L(y_i, ŷ_i):
| i | g_i | h_i |
|---|---|---|
| 1 | -2.0 | 1.0 |
| 2 | -1.5 | 1.0 |
| 3 | +1.0 | 1.0 |
| 4 | +2.0 | 1.0 |
Parent: G = (-2 -1.5 +1 +2) = -0.5, H = 4.0
Proposed split: examples {1,2} go left, {3,4} go right.
- G_L = -3.5, H_L = 2.0
- G_R = +3.0, H_R = 2.0
With λ = 1.0 and γ = 0.5:
Gain = 0.5 × [12.25/3 + 9.0/3 - 0.25/5] - 0.5
= 0.5 × [4.083 + 3.0 - 0.05] - 0.5
= 0.5 × 7.033 - 0.5
= 3.517 - 0.5
= 3.017
Gain = 3.017 > 0 → accept the split. Optimal leaf weights: w_L* = -G_L/(H_L+λ) = 3.5/3 ≈ +1.17, w_R* = -G_R/(H_R+λ) = -3.0/3 = -1.0.
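Plugging the same numbers into the split_gain and optimal_leaf_weight sketches from above reproduces the hand calculation:

```python
g = [-2.0, -1.5, 1.0, 2.0]
h = [1.0, 1.0, 1.0, 1.0]

G_L, H_L = sum(g[:2]), sum(h[:2])   # -3.5, 2.0
G_R, H_R = sum(g[2:]), sum(h[2:])   # +3.0, 2.0

print(round(split_gain(G_L, H_L, G_R, H_R, gamma=0.5, reg_lambda=1.0), 3))  # 3.017
print(round(optimal_leaf_weight(G_L, H_L), 3))   # 1.167
print(round(optimal_leaf_weight(G_R, H_R), 3))   # -1.0
```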
XGBoost's Engineering Advantages
Column subsampling (colsample_bytree, colsample_bylevel): Sample features at the tree level and/or node level. Reduces correlation between trees (like Random Forests) and speeds up training.
Row subsampling (subsample): Use only a fraction of rows for each tree (stochastic gradient boosting).
Sparsity-aware split finding: XGBoost handles missing values natively — it learns the default direction for missing values at each split (left or right), based on which direction reduces loss more.
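In practice this means you can leave NaNs in the feature matrix and let the library route them at each split. A minimal sketch using the scikit-learn-style wrapper (synthetic data; the settings are illustrative):

```python
import numpy as np
import xgboost as xgb

# Feature matrix with missing values left as NaN -- no imputation needed.
X = np.array([[1.0,    np.nan],
              [2.0,    0.5   ],
              [np.nan, 1.5   ],
              [4.0,    2.0   ]])
y = np.array([1.0, 1.5, 2.0, 2.5])

# XGBoost treats NaN as "missing" by default and learns, at each split,
# which child branch missing values should be sent to.
model = xgb.XGBRegressor(n_estimators=10, max_depth=2)
model.fit(X, y)
print(model.predict(X))
```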
Cache-aware computation: XGBoost stores data in sorted, compressed column blocks and prefetches gradient statistics, so the exact greedy algorithm accesses feature data in a cache-friendly order. For datasets too large to scan exactly, it offers approximate split finding based on quantile sketches over feature histograms; histogram-based split finding is also the approach used by LightGBM and CatBoost.
XGBoost vs. LightGBM vs. CatBoost
All three are gradient boosting implementations. The key differences:
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise (depth-wise) | Leaf-wise | Symmetric (oblivious) trees |
| Categorical features | Encode manually | Native support (limited) | Native, ordered boosting |
| Speed | Fast | Fastest | Medium |
| Memory | Medium | Low | Medium |
| Cold-start robustness | Good | Good | Excellent |
For most tabular tasks today, LightGBM trains faster and often matches XGBoost's accuracy. CatBoost excels when you have many categorical features. XGBoost remains the most battle-tested and well-documented.
Key Hyperparameters
| Parameter | Effect | Typical range |
|---|---|---|
| n_estimators | Number of trees | 100–5000 (use early stopping) |
| learning_rate (η) | Shrinkage per tree | 0.01–0.3 |
| max_depth | Tree depth | 3–10 |
| min_child_weight | Minimum Σh_i in a leaf — prunes low-confidence leaves | 1–20 |
| gamma (γ) | Minimum gain to split | 0–5 |
| lambda (λ) | L2 regularization on leaf weights | 1–100 |
| subsample | Row sampling fraction | 0.5–1.0 |
| colsample_bytree | Feature sampling fraction per tree | 0.5–1.0 |
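A typical training setup tying these parameters together, using the core xgb.train API with a validation set and early stopping (synthetic data; the specific values are illustrative, not recommendations):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Synthetic regression data just to make the snippet runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=1000)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "objective": "reg:squarederror",
    "eta": 0.05,             # learning_rate
    "max_depth": 6,
    "min_child_weight": 5,
    "gamma": 0.5,
    "lambda": 1.0,           # L2 penalty on leaf weights
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}

# num_boost_round is the upper bound on trees; early stopping picks the best round.
booster = xgb.train(params, dtrain, num_boost_round=5000,
                    evals=[(dval, "validation")],
                    early_stopping_rounds=50, verbose_eval=False)
print(booster.best_iteration)
```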
What to Remember
- XGBoost adds γ (penalizes leaves) and λ (penalizes leaf weight magnitude) to the objective
- Second-order Taylor expansion gives analytic optimal leaf weights: w* = -G/(H+λ)
- Split gain = ½ [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ
- Splits with Gain ≤ 0 are not performed
- Native missing value handling: learns the default direction for missing values
- Column/row subsampling, cache-aware computation, and sparsity handling give speed advantages
- LightGBM (leaf-wise growth) often trains faster; CatBoost handles categoricals better