# Histogram-Based Training
Histogram-based training is the core algorithmic innovation that makes modern gradient boosting fast and scalable. Instead of considering every unique feature value as a potential split point, we discretize features into bins and aggregate gradient statistics per bin.
## The Core Insight

Traditional split finding examines every unique value:

```text
Sort feature values:  [1.2, 2.3, 3.7, 5.1, 8.9, ...]   ← O(n log n)
Try each as split:    for each value, compute gain     ← O(n)
```

The histogram-based approach:

```text
Bin boundaries:           [0, 2, 4, 6, 8, 10]
Count gradients per bin:  bin[0]: sum_g=5.2, sum_h=3.1
                          bin[1]: sum_g=2.1, sum_h=1.8
                          ...
Try each bin as split:    for each bin, compute gain   ← O(bins), where bins << n
```
For a candidate split at bin boundary \(b\), we partition samples into left (\(I_L\)) and right (\(I_R\)) sets. The split gain is:

\[
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
\]

where \(G_L = \sum_{i \in I_L} g_i\) and \(H_L = \sum_{i \in I_L} h_i\) are the gradient and Hessian sums for the left partition, \(G_R\) and \(H_R\) are defined analogously for the right, \(\lambda\) is the L2 regularization on leaf weights, and \(\gamma\) is the per-split complexity penalty.
Key insight: We only need \(G\) and \(H\) sums, not individual gradients. This enables aggregation into histograms.
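To make the gain formula concrete, here is a minimal Python sketch; the function name `split_gain` and the default values of `lam` and `gamma` are illustrative, not taken from either library's API.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a split whose children have gradient/Hessian sums (G_L, H_L)
    and (G_R, H_R); lam is the L2 leaf regularization, gamma the split penalty."""
    def leaf_term(G, H):
        # Each leaf contributes G^2 / (H + lambda) to the (halved) loss reduction.
        return G * G / (H + lam)

    return 0.5 * (leaf_term(G_L, H_L) + leaf_term(G_R, H_R)
                  - leaf_term(G_L + G_R, H_L + H_R)) - gamma
```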
## Why Histograms?

| Approach | Split Finding | Memory | Distributed |
|---|---|---|---|
| Exact (sort-based) | O(n log n) | O(n) floats | Hard |
| Histogram-based | O(bins) | O(bins) | Easy (merge histograms) |
Typical values: n = 1,000,000 samples, bins = 256.

Speedup: roughly 4000× for split finding (1,000,000 / 256 ≈ 3,900 candidates per feature instead of 1,000,000), and about 4× less memory (a uint8 bin index replaces a float32 feature value).
## Building Histograms

### Algorithm

```text
Algorithm: Build Histogram for a Node
─────────────────────────────────────────
Input:
  - Quantized features X_bin (n × d matrix of bin indices)
  - Gradients g, Hessians h (n-vectors)
  - Sample indices I belonging to this node
Output: Histogram H[f][b] = (sum_g, sum_h) for each feature f and bin b

for each sample i in I:
    for each feature f:
        bin = X_bin[i, f]
        H[f][bin].sum_g += g[i]
        H[f][bin].sum_h += h[i]
```
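As a concrete (if unoptimized) illustration, here is a NumPy version of the same accumulation, looping over features and using `np.bincount` for the per-bin sums; the helper name `build_histogram` and the array layout are assumptions of this sketch, not XGBoost's or LightGBM's internals.

```python
import numpy as np

def build_histogram(X_bin, g, h, sample_idx, num_bins=256):
    """Accumulate per-bin (sum_g, sum_h) for the samples belonging to one node.

    X_bin: (n, d) array of bin indices; g and h: (n,) gradient/Hessian vectors.
    Returns a (d, num_bins, 2) array: [..., 0] holds sum_g, [..., 1] holds sum_h.
    """
    n, d = X_bin.shape
    hist = np.zeros((d, num_bins, 2))
    for f in range(d):  # column-wise variant; see the next subsection
        bins = X_bin[sample_idx, f]
        hist[f, :, 0] = np.bincount(bins, weights=g[sample_idx], minlength=num_bins)
        hist[f, :, 1] = np.bincount(bins, weights=h[sample_idx], minlength=num_bins)
    return hist
```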
### Row-wise vs Column-wise

Row-wise (iterate over samples, then features):

- Better when #samples is large relative to #features
- More cache-friendly for row-major data
- Used by: XGBoost (default), LightGBM (adaptive)

Column-wise (iterate over features, then samples):

- Better when #features is large relative to #samples
- Enables feature-parallel building
- Used by: LightGBM (adaptive)
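The distinction is simply the nesting order of the two loops. A small runnable comparison (pure NumPy, illustrative only; both orders produce the same histogram):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_bins = 1_000, 4, 16
X_bin = rng.integers(0, num_bins, size=(n, d), dtype=np.uint8)
g, h = rng.normal(size=n), rng.random(n)

# Row-wise: outer loop over samples; each row of X_bin is read contiguously,
# but writes scatter across all d feature histograms.
hist_row = np.zeros((d, num_bins, 2))
for i in range(n):
    for f in range(d):
        hist_row[f, X_bin[i, f]] += (g[i], h[i])

# Column-wise: outer loop over features; one small histogram stays cache-resident
# while that feature's column is streamed.
hist_col = np.zeros((d, num_bins, 2))
for f in range(d):
    for i in range(n):
        hist_col[f, X_bin[i, f]] += (g[i], h[i])

assert np.allclose(hist_row, hist_col)  # identical results, different access patterns
```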
## Histogram Subtraction Trick

For a binary split, the histograms satisfy `parent = left_child + right_child`. Therefore: `sibling = parent - built_child`.

Algorithm:

1. Build the histogram for the smaller child (fewer samples)
2. Compute the sibling's histogram by subtraction
3. Save roughly 50% of the histogram-building work per split

This optimization is critical for efficiency; both XGBoost and LightGBM implement it.

```text
Cost Analysis:
─────────────────────────────────────────
Without subtraction:  O(n_left) + O(n_right) = O(n_parent)
With subtraction:     O(min(n_left, n_right)) + O(bins)
Savings:              at least ~50%, more when the split is unbalanced
```
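With the array layout from the earlier `build_histogram` sketch, the trick is a single elementwise subtraction (illustrative helper name; the commented usage assumes hypothetical index variables):

```python
def sibling_histogram(parent_hist, built_child_hist):
    """Derive the other child's histogram in O(bins) time.

    Valid because every (sum_g, sum_h) entry of the parent is exactly the sum
    of the corresponding entries of its two children.
    """
    return parent_hist - built_child_hist

# Typical usage: build only the smaller child explicitly, subtract for the sibling.
# small_hist = build_histogram(X_bin, g, h, smaller_child_idx)
# large_hist = sibling_histogram(parent_hist, small_hist)
```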
## Parallel Histogram Building

Histograms can be built in parallel using different strategies:

### Data-Parallel (XGBoost approach)

1. Partition samples across threads
2. Each thread builds a local histogram
3. Reduce (sum) the local histograms into the global one

Trade-off: requires O(threads × bins × features) memory for the local histograms.
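A sketch of this pattern, reusing the `build_histogram` helper from above with a thread pool; the chunking and reduction shown here are illustrative and, in pure Python, mainly demonstrate the structure rather than real speedup.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_histogram_data_parallel(X_bin, g, h, sample_idx, num_bins=256, n_threads=4):
    """Each thread accumulates a local histogram over its slice of the node's
    samples; the local histograms are then reduced (summed) into one."""
    chunks = np.array_split(np.asarray(sample_idx), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        local_hists = list(pool.map(
            lambda idx: build_histogram(X_bin, g, h, idx, num_bins), chunks))
    # Reduction step: O(threads × features × bins) additions.
    return np.sum(local_hists, axis=0)
```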
### Feature-Parallel

1. Partition features across threads
2. Each thread builds the histograms for its features
3. No reduction step needed

Trade-off: all threads share read access to the node's sample indices, and each thread must scan every sample in the node.
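The feature-parallel variant for comparison: each worker owns a disjoint block of features, so no reduction is needed, but every worker scans the full `sample_idx` (again an illustrative sketch, not either library's implementation).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_histogram_feature_parallel(X_bin, g, h, sample_idx, num_bins=256, n_threads=4):
    """Threads write disjoint feature slices of a shared histogram array,
    so no post-hoc reduction is required."""
    d = X_bin.shape[1]
    hist = np.zeros((d, num_bins, 2))

    def fill(feature_block):
        for f in feature_block:
            bins = X_bin[sample_idx, f]
            hist[f, :, 0] = np.bincount(bins, weights=g[sample_idx], minlength=num_bins)
            hist[f, :, 1] = np.bincount(bins, weights=h[sample_idx], minlength=num_bins)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fill, np.array_split(np.arange(d), n_threads)))
    return hist
```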
## Gradient Discretization (LightGBM)

LightGBM can quantize the gradients themselves for even faster histogram building:

| Precision | Storage | Histogram Entry |
|---|---|---|
| float64 (default) | 8 bytes | 16 bytes (g + h) |
| int16 packed | 2 bytes | 4 bytes |
| int8 packed | 1 byte | 2 bytes |

Benefits:

- 4-8× less memory bandwidth
- Faster histogram accumulation
- Slight accuracy trade-off
The discretization adapts based on leaf size — smaller leaves use lower precision since they’re less likely to overflow.
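A rough sketch of the idea only (LightGBM's actual scheme uses stochastic rounding and adaptive bit widths; the scaling below is a simplification): map gradients onto a small signed integer range, accumulate integers, and rescale when reading sums back.

```python
import numpy as np

def quantize_gradients(g, num_levels=256):
    """Map float gradients to signed 8-bit levels plus a scale factor.

    Integer histogram accumulation then touches far fewer bytes; approximate
    float sums are recovered as scale * integer_sum.
    """
    scale = np.abs(g).max() / (num_levels // 2 - 1)
    g_q = np.round(g / scale).astype(np.int8)
    return g_q, scale

g = np.random.default_rng(0).normal(size=10_000)
g_q, scale = quantize_gradients(g)
print(g.sum(), g_q.sum(dtype=np.int64) * scale)  # close, but not identical
```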
## Histogram Structure

A histogram entry stores the gradient and Hessian sums for each bin:

```text
Feature histogram for feature f:
┌──────────────────────────────────────┐
│ Bin 0:   (sum_g = 1.5, sum_h = 0.8)  │
│ Bin 1:   (sum_g = 2.3, sum_h = 1.2)  │
│ Bin 2:   (sum_g = 0.1, sum_h = 0.3)  │
│ ...                                  │
│ Bin 255: (sum_g = 0.7, sum_h = 0.4)  │
└──────────────────────────────────────┘
```

Memory layout considerations:

- Interleaved (g, h, g, h, …): better for split finding (both sums for a bin are accessed together)
- Separate (g[], h[]): better for SIMD (process all g, then all h)
XGBoost and LightGBM use interleaved layout.
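A small sketch of the two layouts (buffer names are illustrative):

```python
import numpy as np

num_bins = 256

# Interleaved: one buffer of alternating (g, h) pairs; a split scan that needs
# both sums for bin b reads adjacent elements 2*b and 2*b + 1.
hist_interleaved = np.zeros(2 * num_bins)   # [g0, h0, g1, h1, ...]
b = 42
sum_g, sum_h = hist_interleaved[2 * b], hist_interleaved[2 * b + 1]

# Separate: two contiguous buffers; friendlier for SIMD passes that process
# all gradient sums first, then all Hessian sums.
hist_g = np.zeros(num_bins)
hist_h = np.zeros(num_bins)
```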
## Finding Splits from Histograms

Once histograms are built, finding the best split is O(bins):

```text
Algorithm: Find Best Split for Feature f
─────────────────────────────────────────
Input: Histogram H[b] for b = 0..num_bins-1

1. Compute totals: G_total = Σ H[b].sum_g, H_total = Σ H[b].sum_h
2. Scan left-to-right:
   G_left = 0, H_left = 0
   for b = 0 to num_bins-2:          (the last bin cannot be a split point)
       G_left += H[b].sum_g
       H_left += H[b].sum_h
       G_right = G_total - G_left
       H_right = H_total - H_left
       gain = compute_gain(G_left, H_left, G_right, H_right)
       if gain > best_gain:
           best_gain = gain
           best_threshold = b
```

This cumulative scan avoids recomputing the partition sums for each candidate split.
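Expressed against the (num_bins, 2) per-feature histogram from the earlier sketches, and reusing the illustrative `split_gain` helper:

```python
def best_split_for_feature(feature_hist, lam=1.0, gamma=0.0):
    """feature_hist: (num_bins, 2) array of (sum_g, sum_h) per bin.

    Returns (best_gain, best_bin), where the split sends bins <= best_bin left.
    """
    G_total, H_total = feature_hist.sum(axis=0)
    best_gain, best_bin = float("-inf"), -1
    G_left = H_left = 0.0
    for b in range(len(feature_hist) - 1):   # the last bin cannot be a split point
        G_left += feature_hist[b, 0]
        H_left += feature_hist[b, 1]
        gain = split_gain(G_left, H_left, G_total - G_left, H_total - H_left, lam, gamma)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```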
## Missing Value Handling

Missing values need special handling in histograms:

XGBoost approach: try both directions.

```text
For each split:
  1. Try all missing → left:  compute gain
  2. Try all missing → right: compute gain
  3. Choose the better direction
```

Implementation: scan the histogram in both directions (forward and backward), tracking the best gain from each pass, and store the learned default_left direction with the split.
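One way to sketch this, building on the previous helpers: accumulate the missing-value samples separately from the histogram and try routing their sums to either side at every candidate split. This considers the same candidates as XGBoost's forward/backward double scan, though it is not how the library structures the loop.

```python
def best_split_with_missing(feature_hist, missing_g, missing_h, lam=1.0, gamma=0.0):
    """feature_hist excludes missing-value samples, whose gradient/Hessian
    sums are passed separately. Returns (gain, bin, default_left)."""
    G_total = feature_hist[:, 0].sum() + missing_g
    H_total = feature_hist[:, 1].sum() + missing_h

    best = (float("-inf"), -1, True)
    G_left = H_left = 0.0
    for b in range(len(feature_hist) - 1):
        G_left += feature_hist[b, 0]
        H_left += feature_hist[b, 1]
        for default_left in (True, False):
            # Route the aggregated missing-value statistics to one side.
            g_l = G_left + missing_g if default_left else G_left
            h_l = H_left + missing_h if default_left else H_left
            gain = split_gain(g_l, h_l, G_total - g_l, H_total - h_l, lam, gamma)
            if gain > best[0]:
                best = (gain, b, default_left)
    return best
```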
## Complexity Summary

| Operation | Time | Notes |
|---|---|---|
| Histogram building | O(n × d) | Per node, n samples, d features |
| With subtraction | O(n/2 × d) | Build the smaller child only |
| Split finding | O(bins × d) | Per node |
| Total per tree | O(n × d × depth) | Dominated by histogram building |
The key win is that bins (typically 256) is much smaller than n (often millions).
## Source References

### XGBoost

- `src/tree/hist/histogram.h` - Histogram building
- `src/tree/updater_quantile_hist.cc` - Main training loop
- `src/common/hist_util.cc` - Histogram utilities

### LightGBM

- `src/treelearner/feature_histogram.hpp` - Histogram structure
- `src/io/dense_bin.hpp` - Dense histogram building
- `src/treelearner/serial_tree_learner.cpp` - Training loop