Split Finding#
Split finding determines how to partition the samples at each tree node. Given a set of samples with their gradient and Hessian values, we search for the feature and threshold that maximize the split gain defined below.
The Split Gain Formula#
The gain from a split measures the improvement in the regularized objective:

\[
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma
\]

Where:
- \(G_L, H_L\) = sum of gradients/Hessians in the left partition
- \(G_R, H_R\) = sum of gradients/Hessians in the right partition
- \(G, H\) = sum of gradients/Hessians in the parent node
- \(\lambda\) = L2 regularization on leaf weights
- \(\gamma\) = complexity penalty per split
Intuition: The formula compares “what we can achieve with two leaves” vs “what we have with one leaf.” The difference is the gain from splitting.
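As a concrete reference, here is a minimal Python sketch of this computation; the function and parameter names (`compute_gain`, `reg_lambda`) are illustrative, not any particular library's API:

```python
def compute_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Gain of a split: score of the two children minus score of the parent,
    minus the per-split penalty gamma.

    score(G, H) = G^2 / (H + lambda) is proportional to the loss reduction
    achieved by the optimal leaf weight w* = -G / (H + lambda).
    """
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma
```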
Split Enumeration#
For histogram-based training, we enumerate splits by scanning the histogram:
Forward Scan (Cumulative Sum)#
    Algorithm: Enumerate Splits for Feature f
    ─────────────────────────────────────────
    Given histogram H[0..B-1] with (sum_g, sum_h) per bin

    G_total = Σ H[b].sum_g
    H_total = Σ H[b].sum_h

    G_left = 0, H_left = 0
    best_gain = -∞

    for b = 0 to B-2:            # B-1 possible split points
        G_left += H[b].sum_g
        H_left += H[b].sum_h
        G_right = G_total - G_left
        H_right = H_total - H_left

        # Skip if too few samples on either side
        if H_left < min_child_weight or H_right < min_child_weight:
            continue

        gain = compute_gain(G_left, H_left, G_right, H_right)
        if gain > best_gain:
            best_gain = gain
            best_split = b
Time complexity: O(bins) per feature.
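The same forward scan can be written compactly with NumPy cumulative sums. This is a sketch under the assumption that `hist_g` and `hist_h` are 1-D arrays of per-bin gradient and Hessian sums, not any library's actual histogram type:

```python
import numpy as np

def best_split_for_feature(hist_g, hist_h, reg_lambda=1.0, gamma=0.0,
                           min_child_weight=1e-3):
    """Return (best_gain, best_bin) for one feature, or (-inf, -1) if none is valid."""
    G_total, H_total = hist_g.sum(), hist_h.sum()

    # Left statistics when splitting after bin b, for b = 0 .. B-2.
    G_left = np.cumsum(hist_g)[:-1]
    H_left = np.cumsum(hist_h)[:-1]
    G_right, H_right = G_total - G_left, H_total - H_left

    def score(G, H):
        return G * G / (H + reg_lambda)

    gain = 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_total, H_total)) - gamma

    # Disallow splits that leave too little Hessian mass on either side.
    valid = (H_left >= min_child_weight) & (H_right >= min_child_weight)
    gain = np.where(valid, gain, -np.inf)

    b = int(np.argmax(gain))
    return (float(gain[b]), b) if np.isfinite(gain[b]) else (float("-inf"), -1)
```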
Missing Value Handling#
Real data often contains missing values. Instead of requiring imputation, GBDT learns a default direction for each split: the branch that samples with a missing feature value follow.
Bidirectional Scanning#
    Algorithm: Find Best Split with Missing Values
    ─────────────────────────────────────────
    Given histogram H[0..B-1] and missing_sum = (G_miss, H_miss)

    # Option 1: Missing values go LEFT
    Scan left-to-right, adding missing_sum to the left partition
    Record best_gain_left

    # Option 2: Missing values go RIGHT
    Scan right-to-left, adding missing_sum to the right partition
    Record best_gain_right

    if best_gain_left > best_gain_right:
        default_left = true
        best_gain = best_gain_left
    else:
        default_left = false
        best_gain = best_gain_right
The learned default_left is stored with the split and used during inference.
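A sketch of the bidirectional idea in plain Python, reusing `compute_gain` from above. Rather than literally scanning in both directions, it seeds the left partition with the missing-value statistics in one of the two passes, which is equivalent:

```python
def best_split_with_missing(hist_g, hist_h, G_miss, H_miss,
                            reg_lambda=1.0, gamma=0.0, min_child_weight=1e-3):
    """Try routing missing values left and right; keep whichever is better."""
    best_gain, best_bin, default_left = float("-inf"), -1, False
    G_total = sum(hist_g) + G_miss
    H_total = sum(hist_h) + H_miss

    for missing_left in (True, False):
        # Seed the left partition with the missing statistics (or not).
        G_left, H_left = (G_miss, H_miss) if missing_left else (0.0, 0.0)
        for b in range(len(hist_g) - 1):
            G_left += hist_g[b]
            H_left += hist_h[b]
            G_right, H_right = G_total - G_left, H_total - H_left
            if H_left < min_child_weight or H_right < min_child_weight:
                continue
            gain = compute_gain(G_left, H_left, G_right, H_right,
                                reg_lambda, gamma)
            if gain > best_gain:
                best_gain, best_bin, default_left = gain, b, missing_left

    return best_gain, best_bin, default_left  # default_left is stored with the split
```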
Constraints#
Minimum Child Weight#
Prevent splits that create leaves with too few samples:
    if H_left < min_child_weight or H_right < min_child_weight:
        skip this split
The Hessian sum acts as an effective sample count: for squared-error loss every sample's Hessian is 1, so it equals the exact count; for other losses it weights samples by curvature.
Minimum Split Gain#
Only accept splits with sufficient improvement:
    if gain < min_split_gain:
        don't split (make this node a leaf)
The \(\gamma\) parameter in the gain formula serves the same purpose: since it is subtracted from every candidate's gain, only splits whose raw gain exceeds \(\gamma\) remain worthwhile.
Monotonic Constraints#
Force predictions to be monotonically increasing/decreasing with a feature:
    Monotonic increasing constraint on feature f:
        only allow splits where left_prediction ≤ right_prediction

    Monotonic decreasing constraint on feature f:
        only allow splits where left_prediction ≥ right_prediction
Implementation: After computing optimal leaf weights, verify the constraint is satisfied before accepting the split.
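A sketch of that check, using the Newton leaf weight \(w^* = -G/(H+\lambda)\) derived in the Leaf Weight Calculation section below (illustrative only; production implementations such as XGBoost and LightGBM additionally propagate weight bounds from ancestor nodes so that deeper splits cannot undo a constraint):

```python
def split_respects_monotonicity(G_L, H_L, G_R, H_R, constraint, reg_lambda=1.0):
    """constraint: +1 increasing, -1 decreasing, 0 unconstrained."""
    w_left = -G_L / (H_L + reg_lambda)    # candidate left-leaf weight
    w_right = -G_R / (H_R + reg_lambda)   # candidate right-leaf weight
    if constraint > 0:
        return w_left <= w_right          # higher feature values -> higher prediction
    if constraint < 0:
        return w_left >= w_right
    return True
```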
Interaction Constraints#
Limit which features may appear together on the same root-to-leaf path:

    If feature A is used in an ancestor node:
        only allow features in the same interaction group as A
This reduces model complexity and can improve interpretability.
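One way to encode this rule in Python, assuming interaction groups are given as sets of feature indices (a hypothetical helper, not a library API):

```python
def allowed_features(ancestor_features, interaction_groups, all_features):
    """Features eligible for splitting at a node, given the features used above it."""
    if not ancestor_features:
        return set(all_features)              # at the root, everything is allowed
    allowed = set()
    for group in interaction_groups:
        # A group remains usable only if it contains every feature already used
        # on the path from the root to this node.
        if ancestor_features <= group:
            allowed |= group
    return allowed
```

Only features returned by this filter are then considered during split enumeration at that node.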
Leaf Weight Calculation#
Once a split is chosen, we compute the optimal weight for each resulting leaf:

\[
w^* = -\frac{G}{H + \lambda}
\]

where \(G\) and \(H\) are the sums of gradients and Hessians over the samples that land in the leaf.
This is the Newton-Raphson update step, treating the leaf as a single “parameter.”
For multi-output, each output dimension has its own weight using the same formula.
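A minimal sketch covering both cases; `leaf_weight` is an illustrative name, and the multi-output variant assumes a diagonal (per-dimension) Hessian:

```python
import numpy as np

def leaf_weight(G, H, reg_lambda=1.0):
    """Newton step for a leaf: w* = -G / (H + lambda).

    Works elementwise, so G and H may be scalars or per-output vectors.
    """
    return -np.asarray(G, dtype=float) / (np.asarray(H, dtype=float) + reg_lambda)

leaf_weight(12.5, 40.0)                          # scalar leaf, ≈ -0.305
leaf_weight([1.0, -2.0, 0.5], [10.0, 8.0, 9.0])  # one weight per output dimension
```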
Categorical Splits#
For categorical features, the split isn’t “< threshold” but “∈ category set”:
    Standard split:     feature < threshold             → go left
    Categorical split:  feature ∈ {cat_a, cat_c, cat_e} → go left
Finding the optimal partition of k categories is exponential (2^k subsets). LightGBM uses gradient-based sorting to find a good partition in O(k log k).
See Categorical Features for details.
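A sketch of the gradient-based sorting trick, reusing `compute_gain` from earlier: categories are ordered by their gradient-to-Hessian ratio (LightGBM additionally applies smoothing, shown here as an illustrative `cat_smooth` term), and the ordered list is then scanned like a numerical feature:

```python
def best_categorical_split(cat_g, cat_h, reg_lambda=1.0, gamma=0.0,
                           cat_smooth=10.0):
    """cat_g[c], cat_h[c]: gradient/Hessian sums of samples with category c."""
    k = len(cat_g)
    # O(k log k): order categories by smoothed gradient/Hessian ratio.
    order = sorted(range(k), key=lambda c: cat_g[c] / (cat_h[c] + cat_smooth))

    G_total, H_total = sum(cat_g), sum(cat_h)
    best_gain, best_subset = float("-inf"), None
    G_left = H_left = 0.0
    for i in range(k - 1):          # prefixes of the sorted order
        c = order[i]
        G_left += cat_g[c]
        H_left += cat_h[c]
        gain = compute_gain(G_left, H_left, G_total - G_left, H_total - H_left,
                            reg_lambda, gamma)
        if gain > best_gain:
            best_gain, best_subset = gain, set(order[:i + 1])
    return best_gain, best_subset   # categories in best_subset go left
```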
Parallelization#
Split finding parallelizes naturally:
Feature-parallel: Each thread evaluates different features:

    parallel for f in features:
        best_split[f] = find_best_split(histogram[f])
    global_best = argmax(best_split)
Within-feature: For very wide features, parallelize the bin scan (rarely needed with 256 bins).
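For illustration only, here is the feature-parallel pattern in Python with a thread pool, using a per-feature routine like `best_split_for_feature` above; the real implementations parallelize this loop with OpenMP in C++, and Python threads are used here just to show the structure:

```python
from concurrent.futures import ThreadPoolExecutor

def find_global_best_split(histograms, find_best_split, n_threads=8):
    """histograms: {feature_id: (hist_g, hist_h)}; find_best_split -> (gain, bin)."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = {f: pool.submit(find_best_split, g, h)
                   for f, (g, h) in histograms.items()}
    # Reduce: pick the feature whose best split has the highest gain.
    best_feature = max(futures, key=lambda f: futures[f].result()[0])
    return best_feature, futures[best_feature].result()
```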
Numerical Stability#
The gain formula can run into numerical issues:

- Issue: division by small Hessian sums.
  Solution: the \(\lambda\) term ensures the denominator is never too small.
- Issue: subtracting large, nearly equal numbers (catastrophic cancellation).
  Solution: use numerically stable formulations, or accumulate in higher precision.
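A quick demonstration of the accumulation-precision point: sequentially summing a million small float32 gradients drifts, while a float64 accumulator stays essentially exact for the same inputs, which is one reason implementations often accumulate histogram sums in double precision:

```python
import numpy as np

grads = np.full(1_000_000, 1e-4, dtype=np.float32)   # many small gradients

acc32 = np.float32(0.0)
for g in grads:              # naive sequential accumulation in float32
    acc32 += g

acc64 = grads.astype(np.float64).sum()

print(acc32)   # drifts away from 100.0: rounding error at every addition
print(acc64)   # ~100.0, limited only by the float32 representation of the inputs
```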
Complexity#
| Operation | Complexity |
|---|---|
| Single feature | O(bins) |
| All features | O(features × bins) |
| With missing values | O(features × bins); 2 scans but still linear |
| With constraints | O(features × bins); extra checks per candidate |
The histogram approach keeps this fast regardless of sample count.
Source References#
XGBoost#
- `src/tree/split_evaluator.cc`: Split gain calculation
- `src/tree/hist/evaluate_splits.cc`: Split enumeration
- `include/xgboost/tree_model.h`: SplitEntry structure
LightGBM#
- `src/treelearner/feature_histogram.cpp`: Split finding from histograms
- `src/treelearner/split_info.hpp`: Split information structure