In Part 2, we saw that gradient boosting works by fitting weak learners to pseudo-residuals. At each round, we train a function $f_m$ that approximates the negative gradient of the loss.
But we left something unspecified: what kind of function is $f_m$? In principle, it could be anything. In practice, it's almost always a decision tree.
This post explains why trees dominate gradient boosting, then derives the formulas that make second-order optimization possible: the optimal leaf weight and the split gain formula.
Why Trees?
Decision trees have properties that make them ideal weak learners:
Non-linear without feature engineering: A tree naturally captures interactions and non-linear patterns. You don't need to create polynomial features or specify interaction terms; the tree discovers them.
No feature scaling required: Trees only care about the ordering of values, not their magnitude. Features on different scales (age in years, income in dollars) work together without normalization.
Handles missing values: Modern tree implementations learn which direction to send missing values at each split. No imputation required.
Constrained complexity: By limiting depth or leaf count, we get a weak learner that captures broad patterns without overfitting. This is exactly what boosting needs.
Fast to evaluate: At inference time, a tree is just a sequence of comparisons. A depth-6 tree requires at most 6 feature lookups per prediction.
Anatomy of a Tree in Boosting
In gradient boosting, each tree is a function that maps inputs to corrections. Let's define the components:
Splits: Internal nodes that partition the data. A split tests a feature against a threshold: "Is income < 50000?" Samples go left if yes, right if no.
Leaves: Terminal nodes that output predictions. In standard GBDT, each leaf outputs a single constant value.
Regions: Each leaf corresponds to a region of feature space, defined by the conjunction of all split conditions on the path from root to that leaf.
The tree function can be written as:

$$f(x) = \sum_{j=1}^{T} w_j \, \mathbb{1}[x \in R_j]$$

where $T$ is the number of leaves, $w_j$ is the weight (prediction) of leaf $j$, $R_j$ is the region corresponding to leaf $j$, and $\mathbb{1}[x \in R_j]$ is 1 if $x$ falls in region $R_j$ and 0 otherwise.
Trees as Basis Functions
From an optimization perspective, each tree defines a set of "basis functions" (the indicator functions for each region). Training the tree selects and shapes these regions, while the leaf weights are the coefficients.
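To make this concrete, here is a minimal sketch of a tree viewed as regions plus leaf weights; the income threshold and leaf weights are made-up values for illustration, not output of any library:

```python
# A depth-1 tree as regions + leaf weights: f(x) = sum_j w_j * 1[x in R_j].
# The threshold and weights are hypothetical, hand-picked for illustration.
def tree_predict(x, threshold=50_000.0, w_left=-0.3, w_right=0.2):
    """Split on income; each leaf outputs a constant correction."""
    indicators = [x["income"] < threshold, x["income"] >= threshold]
    weights = [w_left, w_right]
    # Exactly one indicator is True, so the sum picks out a single leaf weight.
    return sum(w * int(ind) for w, ind in zip(weights, indicators))

print(tree_predict({"income": 42_000.0}))  # -0.3 (left leaf)
print(tree_predict({"income": 80_000.0}))  #  0.2 (right leaf)
```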
The Second-Order Approximation
In Part 2, we used only the gradient (first derivative). XGBoost and LightGBM go further by using a second-order Taylor expansion of the loss.
Given a sample $x_i$ with true label $y_i$ and current prediction $\hat{y}_i$, we want to add a correction $f(x_i)$. The new prediction will be $\hat{y}_i + f(x_i)$.
We can approximate the loss after adding $f(x_i)$ using a Taylor expansion:

$$L\big(y_i,\, \hat{y}_i + f(x_i)\big) \approx L(y_i, \hat{y}_i) + g_i\, f(x_i) + \tfrac{1}{2}\, h_i\, f(x_i)^2$$

where:
- $g_i = \partial L(y_i, \hat{y}_i) / \partial \hat{y}_i$ is the gradient at the current prediction
- $h_i = \partial^2 L(y_i, \hat{y}_i) / \partial \hat{y}_i^2$ is the Hessian (second derivative)
Why "Second-Order"?
The gradient tells us which direction decreases the loss. The Hessian tells us how quickly the loss curves in that direction. With both, we can estimate not just which way to go, but how far to step before we overshoot.
Summing over all samples and ignoring the constant terms:

$$\tilde{\mathcal{L}}_m = \sum_{i=1}^{n} \left[ g_i\, f(x_i) + \tfrac{1}{2}\, h_i\, f(x_i)^2 \right]$$

This is the objective we want to minimize when building tree $f_m$.
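As a sanity check, here is a small sketch (assuming squared-error loss, for which the expansion is exact) comparing the true change in loss to the second-order approximation:

```python
import numpy as np

# For L = 0.5 * (y - y_hat)^2 we have g = y_hat - y and h = 1, so the
# second-order Taylor expansion of the loss change is exact.
rng = np.random.default_rng(0)
y = rng.normal(size=5)             # true labels
y_hat = rng.normal(size=5)         # current predictions F_{m-1}(x)
f = rng.normal(scale=0.1, size=5)  # candidate corrections f(x)

g = y_hat - y                      # gradient w.r.t. the current prediction
h = np.ones_like(y)                # Hessian (constant 1 for squared error)

exact = 0.5 * (y - (y_hat + f)) ** 2 - 0.5 * (y - y_hat) ** 2
approx = g * f + 0.5 * h * f ** 2
print(np.allclose(exact, approx))  # True
```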
Optimal Leaf Weights
Now, consider a specific tree structure. Each sample lands in exactly one leaf. For samples in leaf $j$, we can group them together:

$$\tilde{\mathcal{L}}_m = \sum_{j=1}^{T} \left[ G_j\, w_j + \tfrac{1}{2}\, H_j\, w_j^2 \right]$$

where:
- $G_j = \sum_{i \in I_j} g_i$ is the sum of gradients for samples in leaf $j$
- $H_j = \sum_{i \in I_j} h_i$ is the sum of Hessians for samples in leaf $j$
- $w_j$ is the leaf weight (what we're solving for)

This is a quadratic function in $w_j$, and it separates by leaf. For each leaf independently, we can take the derivative and set it to zero: $G_j + H_j w_j = 0$. Solving gives us $w_j^* = -G_j / H_j$.
This is the optimal leaf weight when we have no regularization.
Adding Regularization
To prevent overfitting, we add L2 regularization on leaf weights: $\Omega(f) = \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$.
The objective becomes: $\sum_{j=1}^{T} \left[ G_j\, w_j + \tfrac{1}{2}(H_j + \lambda)\, w_j^2 \right]$.
Taking the derivative and solving gives the regularized optimal leaf weight:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

The regularization term $\lambda$ appears in the denominator. Larger $\lambda$ shrinks leaf weights toward zero.
Leaf Weight Calculation
Suppose a leaf contains 3 samples with (illustrative values):
- Gradients: $g_1 = 1.2$, $g_2 = 0.8$, $g_3 = 1.0$
- Hessians: $h_1 = h_2 = h_3 = 1.0$ (squared error)
- Regularization: $\lambda = 1.0$

Sums: $G_j = 3.0$, $H_j = 3.0$

Optimal weight: $w_j^* = -\dfrac{3.0}{3.0 + 1.0} = -0.75$

The positive gradient sum means we're overpredicting these samples (for squared error, $g_i$ is prediction minus target); the negative weight pulls predictions back down (recall: the update is $F_m(x) = F_{m-1}(x) + \nu\, f_m(x)$, and our pseudo-residuals are the negative gradients).
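A quick check of this arithmetic in Python (the gradient and Hessian values are the illustrative ones above):

```python
# Regularized leaf weight w* = -G / (H + lambda) for the example above.
g = [1.2, 0.8, 1.0]   # per-sample gradients in the leaf (illustrative)
h = [1.0, 1.0, 1.0]   # per-sample Hessians (squared error => all 1)
lam = 1.0             # L2 regularization strength

G, H = sum(g), sum(h)
w_star = -G / (H + lam)
print(G, H, w_star)   # 3.0 3.0 -0.75
```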
The Split Gain Formula
Now we can answer the key question: how do we decide where to split?
Given a node with samples $I$, we're considering splitting it into left ($I_L$) and right ($I_R$) children. Is this split worthwhile?
Objective Reduction from a Split
For the parent node alone (no split), the optimal objective value is $-\tfrac{1}{2}\,\frac{G^2}{H + \lambda}$, where $G$ and $H$ are the gradient and Hessian sums over the node. This comes from substituting the optimal weight $w^* = -\frac{G}{H + \lambda}$ back into the quadratic objective: $G\, w^* + \tfrac{1}{2}(H + \lambda)(w^*)^2 = -\tfrac{1}{2}\,\frac{G^2}{H + \lambda}$.
If we split, the two children have objectives $-\tfrac{1}{2}\,\frac{G_L^2}{H_L + \lambda}$ and $-\tfrac{1}{2}\,\frac{G_R^2}{H_R + \lambda}$.
The gain from the split is the reduction in objective: $\text{Gain} = -\tfrac{1}{2}\,\frac{G^2}{H + \lambda} - \left(-\tfrac{1}{2}\,\frac{G_L^2}{H_L + \lambda} - \tfrac{1}{2}\,\frac{G_R^2}{H_R + \lambda}\right)$.
Substituting $G = G_L + G_R$, $H = H_L + H_R$ and rearranging:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right]$$
Adding Split Complexity Penalty
We often subtract a penalty $\gamma$ for each split to encourage simpler trees. This gives the split gain formula used by XGBoost and LightGBM:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
Where:
- $G_L, H_L$ = gradient/Hessian sums for the left child
- $G_R, H_R$ = gradient/Hessian sums for the right child
- $\lambda$ = L2 regularization on leaf weights (`reg_lambda`)
- $\gamma$ = minimum gain to make a split (`min_split_gain`, `gamma`)
Split Gain Calculation
Consider a node with 100 samples. We're evaluating whether to split on `age < 30`, which would send 40 samples left and 60 right. Say $\lambda = 1.0$ and $\gamma = 1.0$, with squared-error loss so each $h_i = 1$ (all values illustrative):

Parent: $G = 66$, $H = 100$
- Score: $\frac{G^2}{H + \lambda} = \frac{66^2}{101} \approx 43.1$

Left child (40 samples, age < 30): $G_L = 60$, $H_L = 40$
- Score: $\frac{G_L^2}{H_L + \lambda} = \frac{60^2}{41} \approx 87.8$

Right child (60 samples, age >= 30): $G_R = 6$, $H_R = 60$
- Score: $\frac{G_R^2}{H_R + \lambda} = \frac{6^2}{61} \approx 0.6$

Gain: $\frac{1}{2}\left(87.8 + 0.6 - 43.1\right) - 1.0 \approx 21.6$
The split is worthwhile (gain > 0) because it separates high-gradient samples (left) from low-gradient samples (right). We'd compare this gain to all other candidate splits and pick the best.
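Here is a small sketch of the gain computation that reproduces the numbers above; `split_gain` is a hypothetical helper written for illustration, not a library function:

```python
# Second-order split gain: objective reduction from splitting a node.
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Illustrative numbers from the example: 40 samples left, 60 right.
print(split_gain(G_L=60.0, H_L=40.0, G_R=6.0, H_R=60.0, lam=1.0, gamma=1.0))
# ~21.6: positive, so the split is worth making
```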
Understanding the Formula
Let's break down what the split gain formula tells us.
The Score Terms
Each $\frac{G^2}{H + \lambda}$ term is the "score" of a node. It measures how much objective reduction we get from the optimal leaf weight.
Intuition: When gradients in a region all point the same direction (large $|G|$), we can make a strong correction. When they're mixed (small $|G|$), corrections cancel out. The score is higher when gradients are aligned.
Why Squared Gradient?
The $G^2$ in the numerator might seem odd. Here's the intuition:
- If half the samples have $g_i = +1$ and half have $g_i = -1$, then $G \approx 0$
  - No correction can help both groups; the optimal weight is zero
  - Score is zero, reflecting that this node can't reduce the objective
- If all samples have $g_i = +1$, then $G = n$
  - A strong negative correction helps everyone
  - Score is large, reflecting potential for improvement
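A tiny numerical illustration of the two scenarios above (the gradient values are made up):

```python
import numpy as np

# Mixed vs. aligned gradients: the score G^2 / (H + lambda) only rewards
# regions whose gradients agree in sign. Values are illustrative.
lam = 1.0
mixed = np.array([+1.0, -1.0] * 50)    # gradients cancel: G ~ 0
aligned = np.array([+1.0] * 100)       # gradients agree:  G = 100

for g in (mixed, aligned):
    G, H = g.sum(), float(len(g))      # squared error: each h_i = 1
    print(G, G**2 / (H + lam))         # 0.0 -> 0.0, 100.0 -> ~99.0
```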
The Role of Hessian
The Hessian sum $H$ in the denominator controls step size:
- Large $H$: loss curves sharply, so the optimal weight is small (careful steps)
- Small $H$: loss is flat, so we can take larger steps

For squared error, $h_i = 1$ for all samples, so $H = n$ (sample count).
For logistic loss, $h_i = p_i(1 - p_i)$. Samples with confident predictions ($p_i \approx 0$ or $p_i \approx 1$) have small Hessians; uncertain samples ($p_i \approx 0.5$) have large Hessians.
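For reference, a short sketch of the logistic-loss gradient and Hessian with respect to the raw score, assuming labels in {0, 1}:

```python
import numpy as np

# Gradient g = p - y and Hessian h = p * (1 - p) of the logistic loss
# with respect to the raw score (logit), for labels y in {0, 1}.
def logistic_grad_hess(y, raw_score):
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 1.0, 0.0])
raw = np.array([4.0, 0.0, -4.0])       # confident, uncertain, confident
g, h = logistic_grad_hess(y, raw)
print(h.round(3))                      # [0.018 0.25  0.018]
```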
The Role of Regularization
$\lambda$ (L2 regularization): Appears in every denominator. Larger $\lambda$ reduces the score difference between splits, preferring simpler trees with smaller weights.
$\gamma$ (split penalty): Subtracted from the gain. If the structural improvement is less than $\gamma$, don't split. This is an explicit cost for tree complexity.
Building a Tree: The Complete Picture
Now we can describe how a tree is built:
Algorithm: Tree Building with Second-Order Optimization
```
Input:
  - Samples with gradients g[], Hessians H[]
  - Parameters: max_depth, lambda, gamma, min_child_weight

1. Create root node containing all samples
2. While there are nodes to expand:
   a. For each node:
      - Compute G, H (gradient/Hessian sums)
   b. For each feature f:
      - For each possible split threshold t:
        - Compute G_L, H_L (samples with f < t)
        - Compute G_R, H_R (samples with f >= t)
        - Skip if H_L < min_child_weight or H_R < min_child_weight
        - Compute Gain using the formula
        - Track best (f, t, Gain)
   c. If best Gain > 0:
      - Split node at (best_f, best_t)
      - Create left and right children
   d. Else:
      - Mark as leaf
      - Set weight w* = -G/(H + lambda)
3. For all leaves: compute final weights

Output: Tree with splits and leaf weights
```
The expensive part is trying all features and thresholds (step 2b). With $n$ samples and $d$ features, naive enumeration means sorting and scanning every feature at every node, roughly $O(d \cdot n \log n)$ work just to pick one split, which quickly becomes prohibitive.
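To make step 2b concrete, here is a minimal sketch of exact greedy split finding for a single node; `find_best_split` is a hypothetical helper written for illustration, not the XGBoost or LightGBM implementation:

```python
import numpy as np

# Exact greedy split search for one node: sort each feature, sweep thresholds,
# and keep the split with the highest second-order gain.
def find_best_split(X, g, h, lam=1.0, gamma=0.0, min_child_weight=1.0):
    n, d = X.shape
    G, H = g.sum(), h.sum()
    parent_score = G * G / (H + lam)
    best_gain, best_feature, best_threshold = 0.0, None, None

    for f in range(d):
        order = np.argsort(X[:, f])
        x_sorted, g_sorted, h_sorted = X[order, f], g[order], h[order]
        G_L = H_L = 0.0
        for i in range(n - 1):
            G_L += g_sorted[i]
            H_L += h_sorted[i]
            if x_sorted[i] == x_sorted[i + 1]:   # can't split between equal values
                continue
            G_R, H_R = G - G_L, H - H_L
            if H_L < min_child_weight or H_R < min_child_weight:
                continue
            gain = 0.5 * (G_L**2 / (H_L + lam)
                          + G_R**2 / (H_R + lam)
                          - parent_score) - gamma
            if gain > best_gain:
                best_gain = gain
                best_feature = f
                best_threshold = (x_sorted[i] + x_sorted[i + 1]) / 2
    return best_gain, best_feature, best_threshold

# Toy usage: squared error, so g = prediction - target and h = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(float)          # the signal lives in feature 0
g, h = np.zeros(100) - y, np.ones(100)   # current prediction is 0 everywhere
print(find_best_split(X, g, h))          # best split should pick feature 0
```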
The Computational Bottleneck
For a dataset with 1 million samples and 100 features, evaluating every possible split at every node is the dominant cost. Part 4 shows how histogram-based training reduces this from $O(n \log n)$ to $O(\text{bins})$ per feature.
Regularization Hyperparameters
Letβs connect the math to the hyperparameters youβll tune:
| Parameter | XGBoost | LightGBM | Effect |
|---|---|---|---|
| $\lambda$ | `reg_lambda` | `lambda_l2` | Shrinks leaf weights toward zero |
| $\gamma$ | `gamma`, `min_split_loss` | `min_split_gain` | Minimum gain required to split |
| - | `min_child_weight` | `min_child_weight` | Minimum Hessian sum in leaf |
| - | `max_depth` | `max_depth` | Maximum tree depth |
| - | `max_leaves` | `num_leaves` | Maximum leaf count |
Practical Defaults
- Start with `lambda = 1.0`, `gamma = 0` (XGBoost defaults)
- Increase `lambda` if overfitting on small datasets
- Increase `gamma` to prune trees more aggressively

`min_child_weight` is the minimum Hessian sum allowed in a child. For squared error where $h_i = 1$, this equals minimum sample count. For classification where $h_i = p_i(1 - p_i)$, samples with confident predictions contribute less, so 100 samples might only sum to 10 in Hessian.
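As a hedged sketch of how these knobs map onto the scikit-learn style wrappers (parameter names per the table above; the values shown are only placeholders, not tuning advice):

```python
# Setting the regularization knobs via the scikit-learn style wrappers
# (assumes the xgboost and lightgbm packages are installed).
import lightgbm as lgb
import xgboost as xgb

xgb_model = xgb.XGBRegressor(
    reg_lambda=1.0,         # lambda: L2 on leaf weights
    gamma=0.0,              # minimum gain required to split
    min_child_weight=1.0,   # minimum Hessian sum per child
    max_depth=6,
)

lgb_model = lgb.LGBMRegressor(
    reg_lambda=1.0,         # alias of lambda_l2
    min_split_gain=0.0,     # LightGBM's gamma
    min_child_weight=1e-3,  # minimum Hessian sum per child
    num_leaves=31,
)
```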
Special Case: Squared Error
For squared error loss, the formulas simplify nicely:
- Gradient: $g_i = \hat{y}_i - y_i$ (prediction minus target)
- Hessian: $h_i = 1$ (constant)

So $G_j = \sum_{i \in I_j} (\hat{y}_i - y_i)$ and $H_j = n_j$ (sample count).
The optimal leaf weight becomes:

$$w_j^* = -\frac{\sum_{i \in I_j} (\hat{y}_i - y_i)}{n_j + \lambda} = \frac{\sum_{i \in I_j} (y_i - \hat{y}_i)}{n_j + \lambda}$$
This is (approximately) the mean residual in the leaf, which matches our intuition from Part 1.
The split gain simplifies to comparing variance reduction, similar to classical CART.
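A small sketch confirming that, with $\lambda = 0$, the optimal squared-error leaf weight is exactly the mean residual (the targets here are random illustrative values):

```python
import numpy as np

# With squared error and lambda = 0, w* = -G / H is exactly the mean residual.
rng = np.random.default_rng(0)
y = rng.normal(size=8)        # targets for the samples in one leaf
y_hat = np.zeros(8)           # current predictions

g = y_hat - y                 # gradient: prediction minus target
H = float(len(y))             # Hessian sum: sample count

w_star = -g.sum() / (H + 0.0)
print(np.isclose(w_star, (y - y_hat).mean()))  # True
```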
Whatβs Next
Weβve derived the mathematics that power modern gradient boosting:
- The second-order Taylor approximation of the loss
- The optimal leaf weight formula
- The split gain formula
But finding the best split still requires trying every feature and threshold. With millions of samples and hundreds of features, this is the computational bottleneck.
The next post, Histogram-Based Split Finding, shows how XGBoost and LightGBM solve this with histogram-based training, reducing complexity from O(n log n) to O(bins).
Summary
Second-order optimization uses both gradient and Hessian to find better tree structures:

$$w_j^* = -\frac{G_j}{H_j + \lambda} \qquad\qquad \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
Key insights:
- Squared gradients measure alignment: how much can we correct together?
- Hessians control step size: how curved is the loss here?
- $\lambda$ regularizes weights: shrink toward zero
- $\gamma$ regularizes structure: don't split unless it's worth it
- Trees are ideal weak learners: non-linear, no scaling, missing values handled
References
- Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD 2016.
- Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine". Annals of Statistics, 29(5), 1189-1232.
- Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS 2017.