# Histogram-Based Training
Histogram-based training is the core algorithmic innovation that makes modern gradient boosting fast and scalable. Instead of considering every unique feature value as a potential split point, we discretize features into bins and aggregate gradient statistics per bin.
## The Core Insight

Traditional split finding examines every unique value:

```text
Sort feature values:  [1.2, 2.3, 3.7, 5.1, 8.9, ...]   ← O(n log n)
Try each as split:    for each value, compute gain     ← O(n)
```

The histogram-based approach:

```text
Bin boundaries:           [0, 2, 4, 6, 8, 10]
Count gradients per bin:  bin[0]: sum_g=5.2, sum_h=3.1
                          bin[1]: sum_g=2.1, sum_h=1.8
                          ...
Try each bin as split:    for each bin, compute gain   ← O(bins), where bins << n
```
For a candidate split at bin boundary \(b\), we partition samples into left (\(I_L\)) and right (\(I_R\)) sets. The split gain is:

\[
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma
\]

where \(G_L = \sum_{i \in I_L} g_i\) and \(H_L = \sum_{i \in I_L} h_i\) are the gradient and Hessian sums for the left partition, \(G_R\) and \(H_R\) are defined analogously for the right, \(\lambda\) is the L2 regularization on leaf weights, and \(\gamma\) is the per-split complexity penalty.
Key insight: We only need \(G\) and \(H\) sums, not individual gradients. This enables aggregation into histograms.
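To make the gain formula concrete, here is a minimal Python sketch; the function name `split_gain` and the default values of `lam` and `gamma` are illustrative, not taken from either library's API.

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of a split whose children have gradient/Hessian sums (G_L, H_L)
    and (G_R, H_R); lam is the L2 leaf regularization, gamma the split penalty."""
    def leaf_term(G, H):
        # Each leaf contributes G^2 / (H + lambda) to the (halved) loss reduction.
        return G * G / (H + lam)

    return 0.5 * (leaf_term(G_L, H_L) + leaf_term(G_R, H_R)
                  - leaf_term(G_L + G_R, H_L + H_R)) - gamma
```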
## Why Histograms?

| Approach | Split Finding | Memory | Distributed |
|---|---|---|---|
| Exact (sort-based) | O(n log n) | O(n) floats | Hard |
| Histogram-based | O(bins) | O(bins) | Easy (merge histograms) |
Typical values: n = 1,000,000 samples, bins = 256.

Speedup: roughly 4000× for split finding (1,000,000 / 256 ≈ 3,900 candidates per feature instead of 1,000,000), and about 4× less memory (a uint8 bin index replaces a float32 feature value).
## Building Histograms

### Algorithm

```text
Algorithm: Build Histogram for a Node
─────────────────────────────────────────
Input:
  - Quantized features X_bin (n × d matrix of bin indices)
  - Gradients g, Hessians h (n-vectors)
  - Sample indices I belonging to this node
Output: Histogram H[f][b] = (sum_g, sum_h) for each feature f and bin b

for each sample i in I:
    for each feature f:
        bin = X_bin[i, f]
        H[f][bin].sum_g += g[i]
        H[f][bin].sum_h += h[i]
```
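As a concrete (if unoptimized) illustration, here is a NumPy version of the same accumulation, looping over features and using `np.bincount` for the per-bin sums; the helper name `build_histogram` and the array layout are assumptions of this sketch, not XGBoost's or LightGBM's internals.

```python
import numpy as np

def build_histogram(X_bin, g, h, sample_idx, num_bins=256):
    """Accumulate per-bin (sum_g, sum_h) for the samples belonging to one node.

    X_bin: (n, d) array of bin indices; g and h: (n,) gradient/Hessian vectors.
    Returns a (d, num_bins, 2) array: [..., 0] holds sum_g, [..., 1] holds sum_h.
    """
    n, d = X_bin.shape
    hist = np.zeros((d, num_bins, 2))
    for f in range(d):  # column-wise variant; see the next subsection
        bins = X_bin[sample_idx, f]
        hist[f, :, 0] = np.bincount(bins, weights=g[sample_idx], minlength=num_bins)
        hist[f, :, 1] = np.bincount(bins, weights=h[sample_idx], minlength=num_bins)
    return hist
```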
### Row-wise vs Column-wise

Row-wise (iterate over samples, then features):

- Better when #samples is large relative to #features
- More cache-friendly for row-major data
- Used by: XGBoost (default), LightGBM (adaptive)

Column-wise (iterate over features, then samples):

- Better when #features is large relative to #samples
- Enables feature-parallel building
- Used by: LightGBM (adaptive)
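The distinction is simply the nesting order of the two loops. A small runnable comparison (pure NumPy, illustrative only; both orders produce the same histogram):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, num_bins = 1_000, 4, 16
X_bin = rng.integers(0, num_bins, size=(n, d), dtype=np.uint8)
g, h = rng.normal(size=n), rng.random(n)

# Row-wise: outer loop over samples; each row of X_bin is read contiguously,
# but writes scatter across all d feature histograms.
hist_row = np.zeros((d, num_bins, 2))
for i in range(n):
    for f in range(d):
        hist_row[f, X_bin[i, f]] += (g[i], h[i])

# Column-wise: outer loop over features; one small histogram stays cache-resident
# while that feature's column is streamed.
hist_col = np.zeros((d, num_bins, 2))
for f in range(d):
    for i in range(n):
        hist_col[f, X_bin[i, f]] += (g[i], h[i])

assert np.allclose(hist_row, hist_col)  # identical results, different access patterns
```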
## Histogram Subtraction Trick

For a binary split, the histograms satisfy `parent = left_child + right_child`. Therefore: `sibling = parent - built_child`.

Algorithm:

1. Build the histogram for the smaller child (fewer samples)
2. Compute the sibling's histogram by subtraction
3. Save roughly 50% of the histogram-building work per split

This optimization is critical for efficiency; both XGBoost and LightGBM implement it.

```text
Cost Analysis:
─────────────────────────────────────────
Without subtraction:  O(n_left) + O(n_right) = O(n_parent)
With subtraction:     O(min(n_left, n_right)) + O(bins)
Savings:              at least ~50%, more when the split is unbalanced
```
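With the array layout from the earlier `build_histogram` sketch, the trick is a single elementwise subtraction (illustrative helper name; the commented usage assumes hypothetical index variables):

```python
def sibling_histogram(parent_hist, built_child_hist):
    """Derive the other child's histogram in O(bins) time.

    Valid because every (sum_g, sum_h) entry of the parent is exactly the sum
    of the corresponding entries of its two children.
    """
    return parent_hist - built_child_hist

# Typical usage: build only the smaller child explicitly, subtract for the sibling.
# small_hist = build_histogram(X_bin, g, h, smaller_child_idx)
# large_hist = sibling_histogram(parent_hist, small_hist)
```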
## Parallel Histogram Building

Histograms can be built in parallel using different strategies:

### Data-Parallel (XGBoost approach)

1. Partition samples across threads
2. Each thread builds a local histogram
3. Reduce (sum) the local histograms into the global one

Trade-off: requires O(threads × bins × features) memory for the local histograms.
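A sketch of this pattern, reusing the `build_histogram` helper from above with a thread pool; the chunking and reduction shown here are illustrative and, in pure Python, mainly demonstrate the structure rather than real speedup.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_histogram_data_parallel(X_bin, g, h, sample_idx, num_bins=256, n_threads=4):
    """Each thread accumulates a local histogram over its slice of the node's
    samples; the local histograms are then reduced (summed) into one."""
    chunks = np.array_split(np.asarray(sample_idx), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        local_hists = list(pool.map(
            lambda idx: build_histogram(X_bin, g, h, idx, num_bins), chunks))
    # Reduction step: O(threads × features × bins) additions.
    return np.sum(local_hists, axis=0)
```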
### Feature-Parallel

1. Partition features across threads
2. Each thread builds the histograms for its features
3. No reduction step needed

Trade-off: all threads share read access to the node's sample indices, and each thread must scan every sample in the node.
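The feature-parallel variant for comparison: each worker owns a disjoint block of features, so no reduction is needed, but every worker scans the full `sample_idx` (again an illustrative sketch, not either library's implementation).

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def build_histogram_feature_parallel(X_bin, g, h, sample_idx, num_bins=256, n_threads=4):
    """Threads write disjoint feature slices of a shared histogram array,
    so no post-hoc reduction is required."""
    d = X_bin.shape[1]
    hist = np.zeros((d, num_bins, 2))

    def fill(feature_block):
        for f in feature_block:
            bins = X_bin[sample_idx, f]
            hist[f, :, 0] = np.bincount(bins, weights=g[sample_idx], minlength=num_bins)
            hist[f, :, 1] = np.bincount(bins, weights=h[sample_idx], minlength=num_bins)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(fill, np.array_split(np.arange(d), n_threads)))
    return hist
```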
## Gradient Discretization (LightGBM)

LightGBM can quantize the gradients themselves for even faster histogram building:

| Precision | Storage | Histogram Entry |
|---|---|---|
| float64 (default) | 8 bytes | 16 bytes (g + h) |
| int16 packed | 2 bytes | 4 bytes |
| int8 packed | 1 byte | 2 bytes |

Benefits:

- 4-8× less memory bandwidth
- Faster histogram accumulation
- Slight accuracy trade-off
The discretization adapts based on leaf size — smaller leaves use lower precision since they’re less likely to overflow.
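A rough sketch of the idea only (LightGBM's actual scheme uses stochastic rounding and adaptive bit widths; the scaling below is a simplification): map gradients onto a small signed integer range, accumulate integers, and rescale when reading sums back.

```python
import numpy as np

def quantize_gradients(g, num_levels=256):
    """Map float gradients to signed 8-bit levels plus a scale factor.

    Integer histogram accumulation then touches far fewer bytes; approximate
    float sums are recovered as scale * integer_sum.
    """
    scale = np.abs(g).max() / (num_levels // 2 - 1)
    g_q = np.round(g / scale).astype(np.int8)
    return g_q, scale

g = np.random.default_rng(0).normal(size=10_000)
g_q, scale = quantize_gradients(g)
print(g.sum(), g_q.sum(dtype=np.int64) * scale)  # close, but not identical
```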
## Histogram Structure

A histogram entry stores the gradient and Hessian sums for each bin:

```text
Feature histogram for feature f:
┌──────────────────────────────────────┐
│ Bin 0:   (sum_g = 1.5, sum_h = 0.8)  │
│ Bin 1:   (sum_g = 2.3, sum_h = 1.2)  │
│ Bin 2:   (sum_g = 0.1, sum_h = 0.3)  │
│ ...                                  │
│ Bin 255: (sum_g = 0.7, sum_h = 0.4)  │
└──────────────────────────────────────┘
```

Memory layout considerations:

- Interleaved (g, h, g, h, …): better for split finding (both sums for a bin are accessed together)
- Separate (g[], h[]): better for SIMD (process all g, then all h)
XGBoost and LightGBM use interleaved layout.
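A small sketch of the two layouts (buffer names are illustrative):

```python
import numpy as np

num_bins = 256

# Interleaved: one buffer of alternating (g, h) pairs; a split scan that needs
# both sums for bin b reads adjacent elements 2*b and 2*b + 1.
hist_interleaved = np.zeros(2 * num_bins)   # [g0, h0, g1, h1, ...]
b = 42
sum_g, sum_h = hist_interleaved[2 * b], hist_interleaved[2 * b + 1]

# Separate: two contiguous buffers; friendlier for SIMD passes that process
# all gradient sums first, then all Hessian sums.
hist_g = np.zeros(num_bins)
hist_h = np.zeros(num_bins)
```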
## Finding Splits from Histograms

Once histograms are built, finding the best split is O(bins):

```text
Algorithm: Find Best Split for Feature f
─────────────────────────────────────────
Input: Histogram H[b] for b = 0..num_bins-1

1. Compute totals: G_total = Σ H[b].sum_g, H_total = Σ H[b].sum_h
2. Scan left-to-right:
   G_left = 0, H_left = 0
   for b = 0 to num_bins-2:          (the last bin cannot be a split point)
       G_left += H[b].sum_g
       H_left += H[b].sum_h
       G_right = G_total - G_left
       H_right = H_total - H_left
       gain = compute_gain(G_left, H_left, G_right, H_right)
       if gain > best_gain:
           best_gain = gain
           best_threshold = b
```

This cumulative scan avoids recomputing the partition sums for each candidate split.
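Expressed against the (num_bins, 2) per-feature histogram from the earlier sketches, and reusing the illustrative `split_gain` helper:

```python
def best_split_for_feature(feature_hist, lam=1.0, gamma=0.0):
    """feature_hist: (num_bins, 2) array of (sum_g, sum_h) per bin.

    Returns (best_gain, best_bin), where the split sends bins <= best_bin left.
    """
    G_total, H_total = feature_hist.sum(axis=0)
    best_gain, best_bin = float("-inf"), -1
    G_left = H_left = 0.0
    for b in range(len(feature_hist) - 1):   # the last bin cannot be a split point
        G_left += feature_hist[b, 0]
        H_left += feature_hist[b, 1]
        gain = split_gain(G_left, H_left, G_total - G_left, H_total - H_left, lam, gamma)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```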
## Missing Value Handling

Missing values need special handling in histograms:

XGBoost approach: try both directions.

```text
For each split:
  1. Try all missing → left:  compute gain
  2. Try all missing → right: compute gain
  3. Choose the better direction
```

Implementation: scan the histogram in both directions (forward and backward), tracking the best gain from each pass, and store the learned default_left direction with the split.
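One way to sketch this, building on the previous helpers: accumulate the missing-value samples separately from the histogram and try routing their sums to either side at every candidate split. This considers the same candidates as XGBoost's forward/backward double scan, though it is not how the library structures the loop.

```python
def best_split_with_missing(feature_hist, missing_g, missing_h, lam=1.0, gamma=0.0):
    """feature_hist excludes missing-value samples, whose gradient/Hessian
    sums are passed separately. Returns (gain, bin, default_left)."""
    G_total = feature_hist[:, 0].sum() + missing_g
    H_total = feature_hist[:, 1].sum() + missing_h

    best = (float("-inf"), -1, True)
    G_left = H_left = 0.0
    for b in range(len(feature_hist) - 1):
        G_left += feature_hist[b, 0]
        H_left += feature_hist[b, 1]
        for default_left in (True, False):
            # Route the aggregated missing-value statistics to one side.
            g_l = G_left + missing_g if default_left else G_left
            h_l = H_left + missing_h if default_left else H_left
            gain = split_gain(g_l, h_l, G_total - g_l, H_total - h_l, lam, gamma)
            if gain > best[0]:
                best = (gain, b, default_left)
    return best
```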
## Complexity Summary

| Operation | Time | Notes |
|---|---|---|
| Histogram building | O(n × d) | Per node, n samples, d features |
| With subtraction | O(n/2 × d) | Build the smaller child only |
| Split finding | O(bins × d) | Per node |
| Total per tree | O(n × d × depth) | Dominated by histogram building |
The key win is that bins (typically 256) is much smaller than n (often millions).
## Source References

### XGBoost

- `src/tree/hist/histogram.h` - Histogram building
- `src/tree/updater_quantile_hist.cc` - Main training loop
- `src/common/hist_util.cc` - Histogram utilities

### LightGBM

- `src/treelearner/feature_histogram.hpp` - Histogram structure
- `src/io/dense_bin.hpp` - Dense histogram building
- `src/treelearner/serial_tree_learner.cpp` - Training loop