In Part 3, we derived the split gain formula. To find the best split, we evaluate every feature and every possible threshold, then pick the one with the highest gain.

But “every possible threshold” is the problem. With a million samples, each feature might have hundreds of thousands of unique values. Evaluating all of them is prohibitively slow.

This post explains histogram-based training, the optimization that makes modern gradient boosting practical. We’ll see how discretizing features reduces complexity by orders of magnitude, and how the subtraction trick cuts histogram building in half.


The Bottleneck: Exact Split Finding

Traditional (exact) split finding works like this:

  1. For each feature, sort samples by that feature’s value
  2. Scan through sorted values, computing the split gain at each unique value
  3. Track the best split across all features

The problem is step 1: sorting is O(n log n) per feature. With m features, that’s O(m · n log n) per node. For a dataset with 1 million samples and 100 features, you’re looking at billions of operations for each split decision.

Why Sorting?

The split gain formula needs G_left and H_left (the gradient and Hessian sums for the left partition). By sorting samples and scanning left-to-right, we can compute these incrementally with running sums. Without sorting, we’d need to re-sum for every candidate threshold.
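To make the running-sum idea concrete, here is a minimal numpy sketch of exact split finding for a single feature (not how any particular library implements it; names like exact_best_split and lam are mine, with lam playing the role of the λ regularizer in the gain formula):

import numpy as np

def exact_best_split(x, g, h, lam=1.0):
    """Exact split finding for one feature: sort, then scan with running sums."""
    order = np.argsort(x)                    # O(n log n) -- the expensive part
    x_s, g_s, h_s = x[order], g[order], h[order]

    G_total, H_total = g.sum(), h.sum()
    G_left = H_left = 0.0
    best_gain, best_threshold = 0.0, None

    for i in range(len(x_s) - 1):
        G_left += g_s[i]
        H_left += h_s[i]
        if x_s[i] == x_s[i + 1]:             # can only split between distinct values
            continue
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left**2 / (H_left + lam)
                      + G_right**2 / (H_right + lam)
                      - G_total**2 / (H_total + lam))
        if gain > best_gain:
            best_gain, best_threshold = gain, (x_s[i] + x_s[i + 1]) / 2
    return best_threshold, best_gain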

Early gradient boosting implementations (like the original GBM) used exact split finding, which limited them to small datasets.


The Insight: Aggregate Statistics

Here’s the key observation: we only need gradient and Hessian sums, not individual values.

The split gain formula involves only G_left, H_left, G_right, and H_right. We don’t care which specific samples are on each side; we only need the aggregate statistics.

This means we can:

  1. Group samples into buckets (bins) based on feature values
  2. Pre-compute gradient and Hessian sums for each bin
  3. Evaluate splits at bin boundaries instead of individual values

If we use 256 bins instead of 1 million unique values, split finding becomes roughly 4000× faster (since 1,000,000 / 256 ≈ 4000).


Feature Quantization

Before training, we convert continuous features to discrete bin indices. This is called quantization or binning.

The Process

For each feature:

  1. Determine bin boundaries from training data
  2. Map each value to its corresponding bin index (0 to 255)
  3. Store the quantized data as u8 (1 byte per value per feature)

The quantized feature value is

    quantize(x) = b  such that  t_b ≤ x < t_{b+1}

where t_0 < t_1 < … < t_B are the bin boundaries (with the last bin closed on the right so the maximum value is included).
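As a small sketch of this mapping (quantize_feature is a hypothetical helper, not a library function), numpy’s searchsorted against the interior boundaries does the work:

import numpy as np

def quantize_feature(values, boundaries):
    """Map raw values to bin indices 0..B-1 and store them as one byte each."""
    interior = boundaries[1:-1]                        # t_1 .. t_{B-1}
    bins = np.searchsorted(interior, values, side="right")
    return bins.astype(np.uint8)

quantize_feature(np.array([0, 1, 2, 3, 100.0]),
                 np.array([0, 25, 50, 75, 100.0]))     # -> [0, 0, 0, 0, 3]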

Quantile vs Uniform Binning

Two strategies for choosing bin boundaries:

Uniform binning divides the range into equal-width intervals:

Values: [0, 1, 2, 3, 100]  bins=4
Boundaries: [0, 25, 50, 75, 100]
Result: [0, 0, 0, 0, 3]  ← All small values in one bin!

Quantile binning places boundaries so each bin has roughly equal samples:

Values: [0, 1, 2, 3, 100]  bins=4
Boundaries: [0, 1, 2, 3, 100]
Result: [0, 1, 2, 3, 3]  ← Each small value gets its own bin
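Both strategies are easy to reproduce with numpy. A sketch that derives the boundaries for the toy example above (np.linspace for uniform edges, np.quantile for quantile edges):

import numpy as np

values = np.array([0, 1, 2, 3, 100], dtype=np.float64)
n_bins = 4

# Uniform: equal-width intervals over the observed range
uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)
# -> [0., 25., 50., 75., 100.]

# Quantile: edges at evenly spaced quantiles, so bins hold roughly equal counts
quantile_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
# -> [0., 1., 2., 3., 100.]

# Map values to bins using the interior edges (same convention as above)
print(np.searchsorted(uniform_edges[1:-1], values, side="right"))   # [0 0 0 0 3]
print(np.searchsorted(quantile_edges[1:-1], values, side="right"))  # [0 1 2 3 3]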

Why Quantile Binning Wins

Quantile binning places bins where the data is, maximizing split resolution. XGBoost and LightGBM both default to quantile-based binning. XGBoost additionally uses Hessian-weighted quantiles, giving more resolution to samples with higher curvature.

Why 256 Bins?

Most implementations default to 256 bins because:

  • 1 byte per value: A u8 can represent 0-255, minimizing memory
  • Sufficient resolution: 256 possible split points per feature is enough for most problems
  • Cache-friendly: Small histograms fit in CPU cache

With 256 bins, a dataset with 1M samples × 100 features shrinks from 400 MB (float32) to 100 MB (u8).

Binning Hyperparameters

Parameter     XGBoost               LightGBM              Default                  Effect
Max bins      max_bin               max_bin               256 (255 in LightGBM)    More bins = finer splits, more memory
Bin method    tree_method='hist'    (always histogram)    -                        Enable histogram training

When to Change max_bin

  • Increase (512, 1024) for high-cardinality features where you see underfitting
  • Decrease (64, 128) for faster training or as regularization on small datasets
  • Default 256 is a good starting point for almost all cases

Building Histograms

A histogram stores gradient and Hessian sums for each bin:

Feature histogram for feature f:
┌──────────────────────────────────────┐
│ Bin 0:   (sum_g = 1.5, sum_h = 0.8)  │
│ Bin 1:   (sum_g = 2.3, sum_h = 1.2)  │
│ Bin 2:   (sum_g = 0.1, sum_h = 0.3)  │
│ ...                                  │
│ Bin 255: (sum_g = 0.7, sum_h = 0.4)  │
└──────────────────────────────────────┘
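In code this is just a dense array; a minimal layout sketch (the shape and names are my choice, and the same layout is assumed in the sketches below):

import numpy as np

n_features, n_bins = 100, 256
# hist[f, b, 0] = sum of gradients in bin b of feature f
# hist[f, b, 1] = sum of Hessians  in bin b of feature f
hist = np.zeros((n_features, n_bins, 2), dtype=np.float64)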

Algorithm: Build Histogram

Input: 
  - Quantized features X_bin[i, f] (bin index for sample i, feature f)
  - Gradients g[i], Hessians h[i]
  - Sample indices I belonging to this node

Output: Histogram H[f][b] = (sum_g, sum_h) for each feature f, bin b

Initialize: H[f][b] = (0, 0) for all f, b

For each sample i in I:
    For each feature f:
        b = X_bin[i, f]
        H[f][b].sum_g += g[i]
        H[f][b].sum_h += h[i]

The cost is O(n · m), where n is the number of samples in the node and m is the number of features.


Split Finding from Histograms

Once the histogram is built, finding the best split for a feature is simple:

Algorithm: Find Best Split from Histogram

Input: Histogram H[0..255] for one feature

1. Compute totals:
   G_total = sum(H[b].sum_g for all b)
   H_total = sum(H[b].sum_h for all b)

2. Scan left-to-right with running sums:
   G_left = 0, H_left = 0
   best_gain = 0, best_bin = -1
   
   For b = 0 to 254:   # 255 possible split points
       G_left += H[b].sum_g
       H_left += H[b].sum_h
       G_right = G_total - G_left
       H_right = H_total - H_left
       
        gain = 0.5 * (G_left²/(H_left+λ) + G_right²/(H_right+λ) - G_total²/(H_total+λ))
       
       If gain > best_gain:
           best_gain = gain
           best_bin = b

Output: best_bin, best_gain
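The same scan as a numpy sketch (lam again stands in for λ; cumulative sums replace the explicit running-sum loop):

import numpy as np

def best_split_from_histogram(hist_f, lam=1.0):
    """hist_f: (n_bins, 2) array of (sum_g, sum_h) for a single feature."""
    G_total, H_total = hist_f[:, 0].sum(), hist_f[:, 1].sum()
    G_left = np.cumsum(hist_f[:-1, 0])        # stats left of each candidate split
    H_left = np.cumsum(hist_f[:-1, 1])
    G_right, H_right = G_total - G_left, H_total - H_left

    gain = 0.5 * (G_left**2 / (H_left + lam)
                  + G_right**2 / (H_right + lam)
                  - G_total**2 / (H_total + lam))
    best = int(np.argmax(gain))
    return (best, float(gain[best])) if gain[best] > 0 else (-1, 0.0)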

The complexity is O(B) per feature, or O(B · m) for all features, where B = 256 is the number of bins.


The Subtraction Trick

Here’s where histogram-based training gets even faster.

When we split a node into left and right children, we need histograms for both. But we already have the parent’s histogram. And histograms are additive: H_parent = H_left + H_right, bin by bin, for both the gradient and Hessian sums.

Therefore: H_right = H_parent - H_left.

The subtraction trick: Build the histogram for only one child, then subtract it from the parent’s histogram to get the other.

Subtraction Trick in Action

Consider splitting a parent node (1000 samples) into left (400 samples) and right (600 samples):

Step 1: Build histogram for the smaller child (left, 400 samples). Cost: 400 · m bin updates.

Step 2: Subtract it from the parent’s histogram to get the right histogram. Cost: 256 · m subtractions, independent of the sample count.

Without subtraction: 400 · m + 600 · m = 1000 · m bin updates.

With subtraction: 400 · m + 256 · m = 656 · m operations.

Savings: ~35% in this case. On average across all splits, we save about 50%.

Always Build the Smaller Child

The subtraction trick is most effective when we build the histogram for the smaller child: it contains at most half of the node’s samples, so the per-sample histogram work is at least halved.
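In code, the trick is a single array subtraction. A sketch that reuses build_histogram and the histogram layout from earlier:

def child_histograms(hist_parent, X_bin, g, h, left_idx, right_idx, n_bins=256):
    """Build the smaller child's histogram explicitly; derive the other by subtraction."""
    if len(left_idx) <= len(right_idx):
        hist_left = build_histogram(X_bin, g, h, left_idx, n_bins)
        hist_right = hist_parent - hist_left       # histograms are additive
    else:
        hist_right = build_histogram(X_bin, g, h, right_idx, n_bins)
        hist_left = hist_parent - hist_right
    return hist_left, hist_right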


The Complete Picture

Let’s trace through one split decision with histograms for a node with 10,000 samples, 100 features, and 256 bins.

Step 1: Build histogram

For each sample (10,000), for each feature (100), look up the bin index and add the gradient and Hessian to that bin. This costs approximately 10,000 × 100 = 1,000,000 operations. The result is 100 histograms, each with 256 entries containing (sum_g, sum_h) pairs.

Step 2: Find best split

For each feature (100), scan through all bin boundaries (256) computing the split gain at each. This costs 100 × 256 = 25,600 operations.

Step 3: Pick the best

Compare gains across all features and pick the winner. Cost: 100 comparisons.

Total: ~1,000,000 operations, dominated by histogram building.

Compare this to exact split finding: sorting alone would cost roughly 100 × 10,000 × log₂(10,000) ≈ 13 million operations.


Complexity Comparison

Operation           Exact                       Histogram-Based
Sort/Partition      O(n log n) per feature      None (pre-quantized)
Build aggregates    O(n) per feature            O(n · m) per node, all features
Find split          O(n) per feature            O(B) per feature
Total per node      O(m · n log n)              O(n · m + B · m)
With subtraction    -                           ~O(n · m / 2) on average

For typical values (n = 1,000,000 samples, m = 100 features, B = 256 bins):

  • Exact: ~2 billion operations per node
  • Histogram: ~50 million operations per node
  • Speedup: ~40Γ—

Missing Value Handling

Real data has missing values. How do histograms handle them?

Strategy: Track missing samples in a separate bin, then try both directions. For each candidate split, compute the gain with missing values going left, then with missing values going right, and choose whichever is better. Store the learned “default direction” with the split.

This adds minimal overhead (two scans instead of one) but gives the model the ability to learn the best handling for missing values, which often outperforms imputation.
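A sketch of the two-direction scan, assuming the gradients and Hessians of samples with a missing value were accumulated separately into g_miss and h_miss (the names and the helper are mine, building on best_split_from_histogram above):

import numpy as np

def best_split_with_missing(hist_f, g_miss, h_miss, lam=1.0):
    """Try sending missing values left, then right; keep the better direction."""
    G_total = hist_f[:, 0].sum() + g_miss
    H_total = hist_f[:, 1].sum() + h_miss

    def scan(missing_left):
        extra_g = g_miss if missing_left else 0.0
        extra_h = h_miss if missing_left else 0.0
        G_left = np.cumsum(hist_f[:-1, 0]) + extra_g
        H_left = np.cumsum(hist_f[:-1, 1]) + extra_h
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left**2 / (H_left + lam)
                      + G_right**2 / (H_right + lam)
                      - G_total**2 / (H_total + lam))
        b = int(np.argmax(gain))
        return float(gain[b]), b

    gain_l, bin_l = scan(missing_left=True)    # missing values go left
    gain_r, bin_r = scan(missing_left=False)   # missing values go right
    if gain_l >= gain_r:
        return bin_l, gain_l, "left"
    return bin_r, gain_r, "right"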


Parallel Histogram Building

Histogram building is the dominant cost, so parallelization matters.

Feature-Parallel

Each thread handles different features:

Thread 0: Build histogram for features 0-24
Thread 1: Build histogram for features 25-49
Thread 2: Build histogram for features 50-74
Thread 3: Build histogram for features 75-99
 
No synchronization needed: each thread writes to its own histograms.

This is the approach used by LightGBM and XGBoost for multi-core training.

Data-Parallel

Each thread handles different samples, then results are merged:

Thread 0: Partial histogram from samples 0-2499
Thread 1: Partial histogram from samples 2500-4999
Thread 2: Partial histogram from samples 5000-7499
Thread 3: Partial histogram from samples 7500-9999
 
Merge: Sum all partial histograms

This requires extra memory for partial histograms.
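Because histograms are additive, the merge step is a plain sum of the partial histograms. A sketch reusing build_histogram from earlier; a real implementation would run the chunk loop on separate threads instead of serially:

import numpy as np

def build_histogram_data_parallel(X_bin, g, h, sample_idx, n_chunks=4, n_bins=256):
    """Each chunk of samples yields a partial histogram; merging is elementwise addition."""
    chunks = np.array_split(sample_idx, n_chunks)     # one chunk per worker
    partials = [build_histogram(X_bin, g, h, c, n_bins) for c in chunks]
    return np.sum(partials, axis=0)                   # merge partial histograms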

Which is Better?

Feature-parallel is usually preferred because it has no merge overhead and no extra memory. Data-parallel becomes useful in distributed settings where samples are partitioned across machines.


What’s Next

Histogram-based training answers “how do we evaluate splits fast?” But it doesn’t answer “in what order should we split nodes?”

The next post, Depth-Wise vs Leaf-Wise Tree Growth, explores the two main strategies:

  • Depth-wise (XGBoost default): Split all nodes at each level before moving deeper
  • Leaf-wise (LightGBM default): Always split the highest-gain leaf

Summary

Histogram-based training makes modern gradient boosting fast:

  1. Quantization: Convert continuous features to discrete bins (typically 256)
  2. Histograms: Aggregate gradient/Hessian sums per bin
  3. Fast split finding: O(B) per feature instead of O(n)
  4. Subtraction trick: Build only the smaller child’s histogram, derive the other

Key insights:

  • Split gain only needs aggregate statistics, not individual values
  • 256 bins provide sufficient resolution while fitting in u8
  • Subtraction trick saves ~50% of histogram building work
  • Feature-parallel building scales well on multi-core CPUs

(The headline speedups above are for split finding alone; total training speedup is typically 10-100×.)


References

  1. Chen, T. & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System”. KDD 2016. arXiv

  2. Ke, G. et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. NeurIPS 2017. PDF

  3. Greenwald, M. & Khanna, S. (2001). “Space-Efficient Online Computation of Quantile Summaries”. SIGMOD 2001. PDF