In Part 2, we saw that gradient boosting works by fitting weak learners to pseudo-residuals. At each round, we train a function that approximates the negative gradient of the loss.

But we left something unspecified: what kind of function is $f_m$, the learner we fit at round $m$? In principle, it could be anything. In practice, it's almost always a decision tree.

This post explains why trees dominate gradient boosting, then derives the formulas that make second-order optimization possible: the optimal leaf weight and the split gain formula.


Why Trees?

Decision trees have properties that make them ideal weak learners:

Non-linear without feature engineering: A tree naturally captures interactions and non-linear patterns. You don’t need to create polynomial features or specify interaction terms; the tree discovers them.

No feature scaling required: Trees only care about the ordering of values, not their magnitude. Features on different scales (age in years, income in dollars) work together without normalization.

Handles missing values: Modern tree implementations learn which direction to send missing values at each split. No imputation required.

Constrained complexity: By limiting depth or leaf count, we get a weak learner that captures broad patterns without overfitting. This is exactly what boosting needs.

Fast to evaluate: At inference time, a tree is just a sequence of comparisons. A depth-6 tree requires at most 6 feature lookups per prediction.


Anatomy of a Tree in Boosting

In gradient boosting, each tree is a function $f(x)$ that maps inputs to corrections. Let's define the components:

Splits: Internal nodes that partition the data. A split tests a feature against a threshold: "Is income < 50000?" Samples go left if yes, right if no.

Leaves: Terminal nodes that output predictions. In standard GBDT, each leaf outputs a single constant value.

Regions: Each leaf corresponds to a region of feature space, defined by the conjunction of all split conditions on the path from root to that leaf.

The tree function can be written as:

$$f(x) = \sum_{j=1}^{T} w_j \, \mathbb{1}[x \in R_j]$$

where $T$ is the number of leaves, $w_j$ is the weight (prediction) of leaf $j$, $R_j$ is the region corresponding to leaf $j$, and $\mathbb{1}[x \in R_j]$ is 1 if $x$ falls in region $R_j$ and 0 otherwise.

Trees as Basis Functions

From an optimization perspective, each tree defines a set of "basis functions" (the indicator functions for each region). Training the tree selects and shapes these regions, while the leaf weights $w_j$ are the coefficients.
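As a tiny illustration of this piecewise-constant view, here is a depth-2 regression tree written out explicitly; the features, thresholds, and leaf weights are made up for illustration, not learned from data:

```python
# A depth-2 regression tree as explicit regions + leaf weights (illustrative values only).
def tree_predict(income: float, age: float) -> float:
    if income < 50_000:        # root split
        if age < 30:           # second split on the left branch
            return 0.8         # leaf weight w_1 for region R_1
        return 0.3             # leaf weight w_2 for region R_2
    return -0.5                # leaf weight w_3 for region R_3
```

Every input falls into exactly one region and receives that region's constant weight.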


The Second-Order Approximation

In Part 2, we used only the gradient (first derivative). XGBoost and LightGBM go further by using a second-order Taylor expansion of the loss.

Given a sample $x_i$ with true label $y_i$ and current prediction $\hat{y}_i$, we want to add a correction $f(x_i)$. The new prediction will be $\hat{y}_i + f(x_i)$.

We can approximate the loss after adding $f(x_i)$ using a Taylor expansion:

$$\ell(y_i,\; \hat{y}_i + f(x_i)) \approx \ell(y_i, \hat{y}_i) + g_i\, f(x_i) + \tfrac{1}{2}\, h_i\, f(x_i)^2$$

where:

  • $g_i = \frac{\partial \ell(y_i, \hat{y}_i)}{\partial \hat{y}_i}$ is the gradient at the current prediction
  • $h_i = \frac{\partial^2 \ell(y_i, \hat{y}_i)}{\partial \hat{y}_i^2}$ is the Hessian (second derivative)

Why "Second-Order"?

The gradient tells us which direction decreases the loss. The Hessian tells us how quickly the loss curves in that direction. With both, we can estimate not just which way to go, but how far to step before we overshoot.
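To make $g_i$ and $h_i$ concrete, here is a minimal Python sketch of the per-sample gradient and Hessian for two common losses; the function names are mine, not any library's API:

```python
import math

# Squared error l = 0.5 * (yhat - y)^2  ->  g = yhat - y,  h = 1
def grad_hess_squared_error(y: float, yhat: float) -> tuple[float, float]:
    return yhat - y, 1.0

# Binary logistic loss on a raw score (log-odds) yhat  ->  g = p - y,  h = p * (1 - p)
def grad_hess_logistic(y: float, yhat: float) -> tuple[float, float]:
    p = 1.0 / (1.0 + math.exp(-yhat))
    return p - y, p * (1.0 - p)
```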

Summing over all samples and ignoring the constant terms:

$$\text{Obj} = \sum_{i=1}^{n} \left[ g_i\, f(x_i) + \tfrac{1}{2}\, h_i\, f(x_i)^2 \right]$$

This is the objective we want to minimize when building the next tree $f$.


Optimal Leaf Weights

Now, consider a specific tree structure. Each sample lands in exactly one leaf. For the samples in leaf $j$, we can group them together:

$$\text{Obj} = \sum_{j=1}^{T} \left[ G_j w_j + \tfrac{1}{2} H_j w_j^2 \right]$$

where:

  • $G_j = \sum_{i \in I_j} g_i$ is the sum of gradients for samples in leaf $j$ (with $I_j$ the set of samples routed to that leaf)
  • $H_j = \sum_{i \in I_j} h_i$ is the sum of Hessians for samples in leaf $j$
  • $w_j$ is the leaf weight (what we're solving for)

This is a quadratic function in $w_j$, and it separates by leaf. For each leaf $j$ independently, we can take the derivative and set it to zero: $G_j + H_j w_j = 0$. Solving gives us $w_j^* = -\frac{G_j}{H_j}$.

This is the optimal leaf weight when we have no regularization.

Adding Regularization

To prevent overfitting, we add L2 regularization on the leaf weights: $\Omega = \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$.

The objective becomes: $\text{Obj} = \sum_{j=1}^{T} \left[ G_j w_j + \tfrac{1}{2} (H_j + \lambda) w_j^2 \right]$.

Taking the derivative and solving gives the regularized optimal leaf weight:

$$w_j^* = -\frac{G_j}{H_j + \lambda}$$

The regularization term $\lambda$ appears in the denominator. Larger $\lambda$ shrinks leaf weights toward zero.
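In code, the regularized leaf weight is a one-liner. This is a minimal sketch of the formula, not any library's internal implementation:

```python
def leaf_weight(G: float, H: float, lam: float) -> float:
    """Regularized optimal leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)
```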

Leaf Weight Calculation

Suppose a leaf contains 3 samples with:

  • Gradients: $g_1 = 1.2$, $g_2 = 0.8$, $g_3 = 1.0$
  • Hessians: $h_1 = h_2 = h_3 = 1.0$
  • Regularization: $\lambda = 1.0$

Sums: $G = 3.0$, $H = 3.0$

Optimal weight: $w^* = -\frac{G}{H + \lambda} = -\frac{3.0}{4.0} = -0.75$

The positive gradient sum means the model is currently overpredicting for these samples; the negative weight pulls their predictions back down (recall: the update is $F_m(x) = F_{m-1}(x) + \eta\, f_m(x)$, and the pseudo-residuals are the negative gradients, which are negative here).
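Plugging the example's sums into the leaf_weight sketch above reproduces the same value:

```python
print(leaf_weight(G=3.0, H=3.0, lam=1.0))   # -0.75
```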


The Split Gain Formula

Now we can answer the key question: how do we decide where to split?

Given a node containing a set of samples $I$, we're considering splitting it into left ($I_L$) and right ($I_R$) children. Is this split worthwhile?

Objective Reduction from a Split

For the parent node alone (no split), the optimal objective value is: $\text{Obj}^* = -\frac{1}{2}\frac{G^2}{H + \lambda}$. This comes from substituting the optimal weight $w^* = -\frac{G}{H + \lambda}$ back into the quadratic objective.

If we split, the two children have objectives: $-\frac{1}{2}\frac{G_L^2}{H_L + \lambda}$ and $-\frac{1}{2}\frac{G_R^2}{H_R + \lambda}$.

The gain from the split is the reduction in objective: $\text{Gain} = \text{Obj}^*_{\text{parent}} - \left(\text{Obj}^*_{\text{left}} + \text{Obj}^*_{\text{right}}\right)$.

Substituting and rearranging:

$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right]$$

Adding Split Complexity Penalty

We often add a penalty $\gamma$ for each split to encourage simpler trees. This gives the split gain formula used by XGBoost and LightGBM:

The Split Gain Formula

$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Where:

  • $G_L, H_L$ = gradient/Hessian sums for the left child
  • $G_R, H_R$ = gradient/Hessian sums for the right child
  • $\lambda$ = L2 regularization on leaf weights (reg_lambda)
  • $\gamma$ = minimum gain to make a split (min_split_gain, gamma)
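The same formula as a small Python sketch (illustrative, not library code); it reuses the node score $G^2/(H + \lambda)$ from above:

```python
def split_gain(G_L: float, H_L: float, G_R: float, H_R: float,
               lam: float, gamma: float) -> float:
    """Gain = 0.5 * [score(left) + score(right) - score(parent)] - gamma."""
    def score(G: float, H: float) -> float:
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```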

Split Gain Calculation

Consider a node with 100 samples. We're evaluating whether to split on age < 30, which would send 40 samples left and 60 right. With $\lambda = 1.0$ and $\gamma = 1.0$:

Parent: $G = 10$, $H = 100$

  • Score: $\frac{G^2}{H + \lambda} = \frac{100}{101} \approx 0.99$

Left child (40 samples, age < 30): $G_L = 25$, $H_L = 40$

  • Score: $\frac{G_L^2}{H_L + \lambda} = \frac{625}{41} \approx 15.24$

Right child (60 samples, age >= 30): $G_R = -15$, $H_R = 60$

  • Score: $\frac{G_R^2}{H_R + \lambda} = \frac{225}{61} \approx 3.69$

Gain: $\frac{1}{2}(15.24 + 3.69 - 0.99) - 1.0 \approx 7.97$

The split is worthwhile (gain > 0) because it separates high-gradient samples (left) from low-gradient samples (right). We'd compare this gain to all other candidate splits and pick the best.
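Running the split_gain sketch on the example's sums gives the same answer:

```python
print(split_gain(G_L=25, H_L=40, G_R=-15, H_R=60, lam=1.0, gamma=1.0))   # ~7.97
```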


Understanding the Formula

Let’s break down what the split gain formula tells us.

The Score Terms

Each term of the form $\frac{G^2}{H + \lambda}$ is the "score" of a node. It measures how much objective reduction we get from the optimal leaf weight.

Intuition: When gradients in a region all point the same direction (large $|G|$), we can make a strong correction. When they're mixed (small $|G|$), corrections cancel out. The score is higher when gradients are aligned.

Why Squared Gradient?

The $G^2$ in the numerator might seem odd. Here's the intuition (a short numeric sketch follows this list):

  • If half the samples have $g_i = +1$ and half have $g_i = -1$, then $G = 0$. No correction can help both groups; the optimal weight is zero. The score is zero, reflecting that this node can't reduce the objective.

  • If all samples have $g_i = +1$, then $G = n$. A strong negative correction helps everyone. The score is large, reflecting the potential for improvement.
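A two-line numeric check of this intuition (using $h_i = 1$ per sample, as with squared error, and $\lambda = 1$):

```python
def node_score(grads, lam=1.0):
    """Score G^2 / (H + lambda) with h_i = 1 per sample (squared-error Hessians)."""
    G, H = sum(grads), float(len(grads))
    return G**2 / (H + lam)

print(node_score([+1.0, -1.0] * 5))   # 0.0    -> gradients cancel, nothing to correct
print(node_score([+1.0] * 10))        # ~9.09  -> gradients aligned, a correction pays off
```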

The Role of Hessian

The Hessian sum $H$ in the denominator controls step size:

  • Large $H$: the loss curves sharply, so the optimal weight is small (careful steps)
  • Small $H$: the loss is flat, so we can take larger steps

For squared error, $h_i = 1$ for all samples, so $H$ is simply the sample count.

For logistic loss, $h_i = p_i(1 - p_i)$, where $p_i$ is the current predicted probability. Samples with confident predictions ($p_i \approx 0$ or $p_i \approx 1$) have small Hessians; uncertain samples ($p_i \approx 0.5$) have large Hessians.

The Role of Regularization

$\lambda$ (L2 regularization): Appears in every denominator. Larger $\lambda$ reduces the score difference between splits, preferring simpler trees with smaller weights.

$\gamma$ (split penalty): Subtracted from the gain. If the structural improvement is less than $\gamma$, don't split. This is an explicit cost for tree complexity.


Building a Tree: The Complete Picture

Now we can describe how a tree is built:

Algorithm: Tree Building with Second-Order Optimization

Input: 
  - Samples with gradients g[], Hessians h[]
  - Parameters: max_depth, lambda, gamma, min_child_weight

1. Create root node containing all samples

2. While there are nodes to expand:
    a. For each node:
       - Compute G, H (gradient/Hessian sums)
       
    b. For each feature f:
       - For each possible split threshold t:
         - Compute G_L, H_L (samples with f < t)
         - Compute G_R, H_R (samples with f >= t)
         - Skip if H_L < min_child_weight or H_R < min_child_weight
         - Compute Gain using the formula
         - Track best (f, t, Gain)
       
    c. If best Gain > 0:
       - Split node at (best_f, best_t)
       - Create left and right children
    d. Else:
       - Mark as leaf
       - Set weight w* = -G/(H + lambda)

3. For all leaves: compute final weights

Output: Tree with splits and leaf weights

The expensive part is trying all features and thresholds (step 2b). With $n$ samples and $d$ features, there are $O(n \cdot d)$ candidate splits per node, and naively recomputing the gradient/Hessian sums costs $O(n)$ per split candidate, which quickly becomes prohibitive.
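As a sketch of what step 2b looks like for a single node and a single feature, here is exact greedy enumeration over sorted values with running prefix sums; the names and structure are illustrative, not XGBoost's actual code:

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=0.0, min_child_weight=1.0):
    """Scan sorted feature values, keeping running sums of g and h for the left child."""
    x, g, h = (np.asarray(a, dtype=float) for a in (x, g, h))
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def score(G_, H_):
        return G_ * G_ / (H_ + lam)

    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[i + 1]:                     # can't split between identical values
            continue
        G_R, H_R = G - G_L, H - H_L
        if H_L < min_child_weight or H_R < min_child_weight:
            continue                             # child too small (by Hessian mass)
        gain = 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, 0.5 * (x[i] + x[i + 1])
    return best_gain, best_threshold
```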

The Computational Bottleneck

For a dataset with 1 million samples and 100 features, evaluating every possible split at every node is the dominant cost. Part 4 shows how histogram-based training reduces this from $O(n \log n)$ to $O(\text{bins})$ per feature.


Regularization Hyperparameters

Let’s connect the math to the hyperparameters you’ll tune:

| Parameter | XGBoost | LightGBM | Effect |
| --- | --- | --- | --- |
| $\lambda$ | reg_lambda | lambda_l2 | Shrinks leaf weights toward zero |
| $\gamma$ | gamma, min_split_loss | min_split_gain | Minimum gain required to split |
| - | min_child_weight | min_child_weight | Minimum Hessian sum in a leaf |
| - | max_depth | max_depth | Maximum tree depth |
| - | max_leaves | num_leaves | Maximum leaf count |

Practical Defaults

  • Start with lambda = 1.0, gamma = 0 (XGBoost defaults)
  • Increase lambda if overfitting on small datasets
  • Increase gamma to prune trees more aggressively
  • min_child_weight is the minimum Hessian sum allowed in a child. For squared error, where $h_i = 1$, this equals a minimum sample count. For classification, where $h_i = p_i(1 - p_i)$, samples with confident predictions contribute less, so 100 samples might sum to only 10 in Hessian.
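For reference, the same knobs expressed through each library's scikit-learn wrapper might look like this (assuming the xgboost and lightgbm packages are installed; everything else left at its default):

```python
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# XGBoost: lambda -> reg_lambda, gamma -> gamma
xgb_model = XGBRegressor(reg_lambda=1.0, gamma=0.0,
                         min_child_weight=1.0, max_depth=6)

# LightGBM: lambda -> reg_lambda, gamma -> min_split_gain
lgbm_model = LGBMRegressor(reg_lambda=1.0, min_split_gain=0.0,
                           min_child_weight=1e-3, num_leaves=31)
```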

Special Case: Squared Error

For squared error loss, the formulas simplify nicely:

  • Gradient: $g_i = \hat{y}_i - y_i$ (prediction minus target)
  • Hessian: $h_i = 1$ (constant)

So $G = \sum_i (\hat{y}_i - y_i)$ and $H = n$ (the sample count).

The optimal leaf weight becomes:

$$w^* = -\frac{G}{H + \lambda} = \frac{\sum_i (y_i - \hat{y}_i)}{n + \lambda}$$
This is (approximately) the mean residual in the leaf, which matches our intuition from Part 1.

The split gain simplifies to comparing variance reduction, similar to classical CART.
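Concretely, with $h_i = 1$ and writing $S$ for a node's gradient sum (the negated residual sum), the gain becomes:

$$\text{Gain} = \frac{1}{2}\left[ \frac{S_L^2}{n_L + \lambda} + \frac{S_R^2}{n_R + \lambda} - \frac{S^2}{n + \lambda} \right] - \gamma$$

With $\lambda = 0$ and $\gamma = 0$, this is (up to a factor of $\tfrac{1}{2}$) the reduction in squared error obtained by fitting a constant to the residuals in each child instead of the parent.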


What’s Next

We’ve derived the mathematics that power modern gradient boosting:

  1. The second-order Taylor approximation of the loss
  2. The optimal leaf weight formula
  3. The split gain formula

But finding the best split still requires trying every feature and threshold. With millions of samples and hundreds of features, this is the computational bottleneck.

The next post, Histogram-Based Split Finding, shows how XGBoost and LightGBM solve this with histogram-based training, reducing complexity from O(n log n) to O(bins).


Summary

Second-order optimization uses both gradient and Hessian to find better tree structures:

$$w_j^* = -\frac{G_j}{H_j + \lambda} \qquad \text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

Key insights:

  • Squared gradients measure alignment: how much can we correct together?
  • Hessians control step size: how curved is the loss here?
  • $\lambda$ regularizes weights: shrink toward zero
  • $\gamma$ regularizes structure: don't split unless it's worth it
  • Trees are ideal weak learners: non-linear, no scaling, missing values handled

References

  1. Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD 2016. arXiv

  2. Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine". Annals of Statistics, 29(5), 1189-1232. PDF

  3. Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS 2017. PDF