RFC-0007: Histogram Building and Optimization#
Status: Implemented
Created: 2025-12-27
Updated: 2026-01-02
Depends on: RFC-0003 (Binning)
Scope: Gradient histogram construction, memory management, and performance optimization
Summary#
Histogram building is typically the dominant cost in histogram-based gradient boosted tree training. This RFC documents the histogram architecture used in boosters, including ordered gradients, the subtraction trick, LRU caching, and the parallelization strategy.
Motivation#
Problem Statement#
Training a gradient boosted tree requires finding optimal splits at each node. The naive approach examines every unique value for every feature at every node—O(samples × features × nodes). For production datasets with millions of samples, this is prohibitive.
Histogram-based splitting discretizes features into bins (typically ≤256), reducing split search to O(bins × features × nodes). However, building histograms—accumulating gradient/hessian sums per bin—becomes the new bottleneck.
Why Histograms Dominate Training Time#
In practice, histogram building dominates because it is performed for (many) nodes and touches every active feature for every row in the node. Reducing the amount of work in this hot path (better memory access, fewer histogram builds, less synchronization) tends to have the highest leverage on end-to-end training time.
This RFC intentionally avoids embedding point-in-time benchmark numbers or library “scorecards”. Performance results live in benchmark reports (see docs/benchmarks/) and Criterion suites; the RFC focuses on the architecture and the reasoning behind decisions.
Design#
Histogram Layout#
Each histogram bin stores gradient and hessian sums:
/// Single histogram bin: (gradient_sum, hessian_sum)
pub type HistogramBin = (f64, f64);
/// Offset and size for a feature's histogram region
pub struct HistogramLayout {
    pub offset: u32, // Start position in flat array
    pub n_bins: u32, // Number of bins for this feature
}
All feature histograms are stored contiguously in a flat array:
[feat0: bin0..bin255][feat1: bin0..bin63][feat2: bin0..bin127]...
 ↑ offset=0           ↑ offset=256        ↑ offset=320
This layout enables cache-efficient sequential access during split finding and simple parallel builds with disjoint write regions.
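For illustration, the per-feature offsets in this layout are just a running sum of bin counts. Below is a minimal sketch (not the exact constructor used in the codebase) of deriving HistogramLayout values and the total length of the flat array:

/// Sketch: derive per-feature offsets from bin counts.
fn compute_layouts(feature_bins: &[u32]) -> (Vec<HistogramLayout>, usize) {
    let mut layouts = Vec::with_capacity(feature_bins.len());
    let mut offset = 0u32;
    for &n_bins in feature_bins {
        layouts.push(HistogramLayout { offset, n_bins });
        offset += n_bins;
    }
    // The final offset is the total number of bins, i.e. the flat array length.
    (layouts, offset as usize)
}

With feature_bins = [256, 64, 128], this yields offsets 0, 256, and 320, matching the diagram above.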
Histogram Builder#
pub struct HistogramBuilder {
    parallelism: Parallelism,
}

impl HistogramBuilder {
    /// Build histogram for active features in a node
    pub fn build_histogram(
        &self,
        feature_views: &EffectiveViews<'_>,
        ordered_indices: &[u32],
        ordered_grad_hess: &[GradsTuple],
        histogram: &mut [HistogramBin],
        feature_indices: Option<&[u32]>, // Column sampling
    );
}
The builder supports two modes:
Sequential: Single-threaded, processes features one at a time
Parallel: Multi-threaded, partitions features across threads
Ordered Gradients#
Before histogram building, gradients are reordered to match the sample partition:
Original gradients: [g0, g1, g2, g3, g4, g5, g6, g7]
Node partition: [0, 2, 5, 7] (samples in this node)
Ordered gradients: [g0, g2, g5, g7] (sequential access)
Benefits:
Sequential memory access: Inner loop reads gradients in order
Cache efficiency: Prefetcher works effectively
Vectorization: Compiler can auto-vectorize sequential loops
Implementation:
/// Pre-gather gradients for a node's samples
fn gather_gradients(
gradients: &Gradients,
indices: &[u32],
ordered: &mut [GradsTuple],
) {
for (i, &idx) in indices.iter().enumerate() {
ordered[i] = gradients.get(idx as usize);
}
}
Histogram Building Kernels#
The inner loop accumulates gradients into bins:
fn build_u8_gathered(
    bins: &[u8],
    ordered_grad_hess: &[GradsTuple],
    histogram: &mut [HistogramBin],
    indices: &[u32],
) {
    for i in 0..indices.len() {
        let row = indices[i] as usize;
        let bin = bins[row] as usize;
        let gh = ordered_grad_hess[i];
        histogram[bin].0 += gh.grad as f64;
        histogram[bin].1 += gh.hess as f64;
    }
}
Different kernels handle storage variants:
build_u8_* / build_u16_*: Dense bins
build_sparse_*: Sparse (CSC-like) binned features (contiguous-row fast path)
Notes on sparse support:
For contiguous row ranges, sparse features are supported by iterating the feature's sample_indices and checking whether each non-default entry falls inside [start_row, start_row + n_rows); see the sketch after this list.
For gathered row sets (arbitrary row indices), sparse histogram building is currently not implemented in the gathered kernels; use bundling (EFB) for very sparse one-hot style data (bundles are encoded as dense u16 columns), or ensure the training hot path stays on dense views.
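For illustration, a simplified sketch of the contiguous-row sparse path described above. Field names (sample_indices, bin_values) and the binary-search skip are illustrative; the real kernel may differ:

/// Sketch of the contiguous-row sparse path: only non-default entries whose row
/// falls inside [start_row, start_row + n_rows) contribute to the histogram.
fn build_sparse_contiguous(
    sample_indices: &[u32],   // rows with non-default bins, sorted ascending
    bin_values: &[u16],       // bin per non-default entry
    start_row: u32,
    n_rows: u32,
    grad_hess: &[GradsTuple], // ordered gradients for rows start_row..start_row + n_rows
    histogram: &mut [HistogramBin],
) {
    let end_row = start_row + n_rows;
    // Skip entries before the range, then scan forward until we leave it.
    let first = sample_indices.partition_point(|&r| r < start_row);
    for i in first..sample_indices.len() {
        let row = sample_indices[i];
        if row >= end_row {
            break;
        }
        let bin = bin_values[i] as usize;
        let gh = grad_hess[(row - start_row) as usize];
        histogram[bin].0 += gh.grad as f64;
        histogram[bin].1 += gh.hess as f64;
    }
}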
Subtraction Trick#
When splitting a node, we need histograms for both children. The subtraction trick computes the larger child by difference:
Parent (retained from previous level)
├── Smaller child → Build explicitly
└── Larger child = Parent - Smaller
Implementation:
/// Compute sibling histogram by subtraction
fn subtract_histograms(
parent: &[HistogramBin],
smaller: &[HistogramBin],
larger: &mut [HistogramBin],
) {
for i in 0..parent.len() {
larger[i].0 = parent[i].0 - smaller[i].0;
larger[i].1 = parent[i].1 - smaller[i].1;
}
}
This reduces work because only one child histogram is built explicitly; the other is derived from already-available data. It is especially effective when splits are skewed, since the smaller child is cheaper to build.
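A sketch of the grower-side decision this implies. The helper name and the build_explicit callback are illustrative; the real grower also coordinates with the histogram pool and pinning:

enum Child { Left, Right }

/// Sketch: build the smaller child explicitly, derive the larger one by
/// subtraction from the parent histogram.
fn split_node_histograms(
    parent_hist: &[HistogramBin],
    left_count: usize,
    right_count: usize,
    left_hist: &mut [HistogramBin],
    right_hist: &mut [HistogramBin],
    mut build_explicit: impl FnMut(Child, &mut [HistogramBin]),
) {
    if left_count <= right_count {
        build_explicit(Child::Left, left_hist);                   // explicit build: smaller child
        subtract_histograms(parent_hist, left_hist, right_hist);  // derived: larger child
    } else {
        build_explicit(Child::Right, right_hist);
        subtract_histograms(parent_hist, right_hist, left_hist);
    }
}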
Histogram Pooling and LRU Cache#
Storing all histograms for deep trees exceeds memory. The implementation uses a fixed-size histogram pool with an LRU eviction policy.
Key points:
The pool allocates cache_size slots up-front.
Each slot stores a full per-node histogram: total_bins = sum(feature_bins).
Training nodes are identified by the grower/partitioner's training node id (leaf id during growth), not the final inference node id.
pub struct HistogramPool {
    // data: Box<[HistogramBin]> where bins are laid out as:
    // [slot0_bin0..slot0_binN][slot1_bin0..slot1_binN]...
    // plus mapping + timestamps for eviction.
}

impl HistogramPool {
    /// Acquire a slot for a training node.
    /// Returns Hit(slot) if cached, or Miss(slot) if it must be rebuilt.
    pub fn acquire(&mut self, node_id: NodeId) -> AcquireResult;

    /// Get a cached histogram view for a node (if present).
    pub fn get(&self, node_id: NodeId) -> Option<HistogramSlot<'_>>;

    /// Move a cached histogram mapping (used by subtraction trick).
    pub fn move_mapping(&mut self, from_node: NodeId, to_node: NodeId);
}
Pooling details (what’s actually implemented)#
Slot allocation and clearing#
On a cache miss, the caller clears the slot and rebuilds the histogram.
On a cache hit, the existing bins are reused as-is.
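A sketch of the caller-side flow, assuming AcquireResult carries a slot id as documented above; clear_bins and rebuild_histogram are illustrative stand-ins for the real clearing and building code:

// Sketch only: helper names are hypothetical.
fn get_or_build(pool: &mut HistogramPool, node_id: NodeId) {
    match pool.acquire(node_id) {
        AcquireResult::Hit(_slot) => {
            // Bins already hold this node's histogram; reuse them as-is.
        }
        AcquireResult::Miss(slot) => {
            // The slot may hold stale bins from an evicted node.
            clear_bins(pool, slot);
            rebuild_histogram(pool, slot, node_id);
        }
    }
}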
Eviction policy#
The LRU policy is timestamp-based (per-slot last_used_time and a global counter), not a linked list.
Eviction chooses the minimum timestamp among unpinned slots.
This makes eviction O(cache_size), which is fine because cache_size is small (typically on the order of active leaves).
Pinning (subtraction trick correctness)#
The grower temporarily pins the parent's slot while performing move_mapping + build_small + subtract.
Pinning prevents the parent histogram (now owned by the large child) from being chosen as an eviction victim mid-operation.
Identity mode (cache_size >= total_nodes)#
When cache_size >= total_nodes, the pool switches to an optimized “identity mapping” mode:
node_id == slot_id (no mapping arrays and no LRU timestamps).
Slot validity is tracked via a per-tree epoch counter and a per-slot epoch marker.
reset_mappings() bumps the epoch, so the next acquire(node) returns Miss unless that slot has already been built in the current epoch.
This avoids resetting large mapping arrays when the cache can hold all nodes.
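A minimal sketch of the epoch bookkeeping in identity mode (struct and method names are illustrative, not the actual pool internals):

/// Sketch of identity-mode validity: node_id == slot_id, and a slot is valid
/// only if its epoch marker matches the current per-tree epoch.
struct IdentityModeSketch {
    current_epoch: u64,
    slot_epoch: Vec<u64>, // one marker per slot / node
}

impl IdentityModeSketch {
    /// reset_mappings(): bump the epoch instead of clearing any arrays.
    fn reset_mappings(&mut self) {
        self.current_epoch += 1;
    }

    /// acquire(node): Hit if the slot was built in this epoch, otherwise Miss.
    fn is_hit(&mut self, node_id: usize) -> bool {
        if self.slot_epoch[node_id] == self.current_epoch {
            true
        } else {
            self.slot_epoch[node_id] = self.current_epoch;
            false
        }
    }
}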
What happens on eviction#
If a parent histogram needed for the subtraction trick is not in the pool when splitting, training falls back to building both child histograms directly.
Increasing cache_size reduces the likelihood of this fallback at the cost of memory.
Parallelization Strategy#
┌─────────────┐
│ TreeGrower │
└──────┬──────┘
│
┌────────────────────┴────────────────────┐
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Feature-parallel│ │ Level-parallel │
│ (single node) │ │ (multiple nodes)│
└─────────────────┘ └─────────────────┘
↓ ↓
Many features Many nodes
Large histograms Shallow trees
Feature-parallel is default: Each feature’s histogram region is disjoint, enabling safe parallel writes without locks.
active_features.par_iter().for_each(|feature| {
    // Each feature writes to histogram[layout[feature].offset..]
    // No synchronization needed
});
Thresholds for parallel dispatch:
MIN_FEATURES_PARALLEL = 4: Don't parallelize fewer features
MIN_WORK_PER_THREAD = 4096: Minimum rows × features per thread
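A sketch of the dispatch check these thresholds imply (the exact heuristic in the builder may differ):

// Constant names match the prose above; the decision logic is illustrative.
const MIN_FEATURES_PARALLEL: usize = 4;
const MIN_WORK_PER_THREAD: usize = 4096;

fn use_parallel_build(n_features: usize, n_rows: usize, n_threads: usize) -> bool {
    n_features >= MIN_FEATURES_PARALLEL
        && n_rows * n_features >= MIN_WORK_PER_THREAD * n_threads
}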
Gradient Storage#
/// Packed gradient/hessian pair
#[derive(Clone, Copy)]
#[repr(C, align(8))]
pub struct GradsTuple {
    pub grad: f32,
    pub hess: f32,
}
The 8-byte size and alignment let a gradient/hessian pair be loaded as a single 64-bit value, and the compiler can vectorize these loads.
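As a sanity check, these layout properties can be expressed as compile-time assertions (a sketch; not claiming such assertions exist in the codebase):

// Layout invariants: one pair fits in a single 8-byte, 8-byte-aligned unit.
const _: () = assert!(core::mem::size_of::<GradsTuple>() == 8);
const _: () = assert!(core::mem::align_of::<GradsTuple>() == 8);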
Design Decisions#
DD-1: f64 Histogram Bins#
Context: Gradients are stored as f32, but histograms accumulate many values. What precision for bins?
Options considered:
f32 bins: Matches gradient precision, half the memory
f64 bins: Higher precision for accumulation
Quantized i16/i32: LightGBM approach, complex packing
Decision: Use f64 for histogram bins.
Consequences:
Prevents numerical drift when summing millions of f32 values
Split gain computation is precision-sensitive (difference of large sums)
Memory overhead is acceptable—histograms are small (256 bins × features)
Simpler implementation than quantized approach
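A small illustration of the drift that f64 bins avoid. With squared-error loss every per-sample hessian is exactly 1.0, so an f32 accumulator stops changing once the running sum reaches 2^24 (about 16.7M), while an f64 accumulator stays exact:

// Illustrative only; not a test from the suite.
fn accumulation_drift_demo() {
    let n = 20_000_000u64;
    let mut sum_f32 = 0.0f32;
    let mut sum_f64 = 0.0f64;
    for _ in 0..n {
        sum_f32 += 1.0f32; // saturates at 16_777_216.0 (2^24)
        sum_f64 += 1.0f64; // exact: 20_000_000.0
    }
    assert_eq!(sum_f32, 16_777_216.0);
    assert_eq!(sum_f64, 20_000_000.0);
}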
DD-2: Ordered Gradients#
Context: During histogram building, we access gradients for samples in the current node. Access can be random (using indices) or sequential (pre-gathered).
Options considered:
Random access: gradients[sample_index] directly
Pre-ordered: Gather gradients into partition order before building
Decision: Pre-order gradients per node before histogram building.
Consequences:
Adds O(n) gather step per node
Enables sequential memory access in hot loop
Enables compiler auto-vectorization
DD-3: LRU Cache vs Full Storage#
Context: Subtraction trick requires parent histograms. Deep trees have many nodes.
Options considered:
Full storage: Keep all histograms in memory
Depth-limited: Keep recent levels only
LRU cache: Evict least-recently-used when full
Decision: LRU cache with configurable size (default 8 slots).
Consequences:
Memory-efficient for deep trees
Works with any growth strategy
Parent eviction before child build causes rebuild (rare in practice)
8 slots sufficient for typical depth-first growth
DD-4: Feature-Parallel over Row-Parallel#
Context: Histogram building can parallelize over rows (each thread builds partial histogram, then merge) or features (each thread builds complete histogram for subset of features).
Options considered:
Row-parallel: Each thread processes sample subset, merge at end
Feature-parallel: Each thread processes feature subset, no merge needed
Hybrid: Row-parallel for few features, feature-parallel otherwise
Decision: Feature-parallel only.
Consequences:
No merge overhead (disjoint write regions)
Scales with feature count (typical: 10-1000 features)
Less effective for very few features (but rare in practice)
DD-5: No Software Prefetching#
Context: Bin access is pseudo-random (determined by sample ordering). Software prefetch could hide latency.
Options considered:
No prefetching: Rely on hardware prefetcher
Software prefetch: Explicit prefetch instructions with lookahead (64 elements)
Decision: No software prefetching.
Consequences:
Hardware prefetcher handles our access pattern effectively
In our experiments, explicit software prefetching was not beneficial
Results may vary across CPUs; revisit only if profiling shows a clear latency bottleneck
Simpler, more portable code
DD-6: No SIMD Vectorization#
Context: Inner loop accumulates two f64 values. SIMD could parallelize.
Options considered:
Scalar loop: Let compiler auto-vectorize
Manual SIMD: Use intrinsics (AVX2, NEON)
pulp crate: Portable SIMD abstraction
Decision: Scalar loops only.
Consequences:
LLVM auto-vectorizes the simple scalar loop effectively
Manual SIMD added overhead on ARM (pulp crate)
Minimal benefit on x86 over auto-vectorization
Simpler, more portable code
DD-7: No Quantized Histograms#
Context: LightGBM uses quantized i16/i32 histogram bins for reduced memory bandwidth.
Options considered:
f64 bins: Current approach
Quantized i16 + scale: Pack grad/hess in 32 bits, dequantize for split finding
Quantized i32: More precision, less packing benefit
Decision: Keep f64 bins, do not implement quantization.
Consequences:
Simpler implementation
No accuracy loss risk from quantization
A high-performance quantized design would likely require accumulating in integer space and dequantizing only for split evaluation
This remains a possible future direction if profiling shows we are memory-bandwidth bound
Benchmarks and Regression Testing#
Benchmark suites live alongside code (Criterion) and human-readable reports live under docs/benchmarks/.
Use cargo bench --bench training_gbdt for end-to-end training benchmarks.
Use the component benches under crates/boosters/benches/ for focused histogram work.
Keep benchmark numbers out of the RFC so the design stays stable while performance data evolves.
Integration#
| Component | Integration Point | Notes |
|---|---|---|
| RFC-0003 (Binning) | BinnedDataset | Provides bin storage and views |
| RFC-0008 (GBDT Training) | SplitFinder | Iterates histogram bins |
| RFC-0008 (GBDT Training) | TreeGrower | Orchestrates build + split |
| RFC-0008 (GBDT Training) | GBDTConfig | |
| RFC-0006 (Sampling) | GOSS/subsampling | Sampling happens before histogram building; histograms receive pre-sampled gradients |
Usage#
Configuration#
Histogram cache size is the primary user-tunable parameter:
let config = GBDTConfig {
    cache_size: 8, // Default, sufficient for typical depth-first growth
    ..Default::default()
};
When to increase cache size:
Breadth-first tree growth strategy
Very deep trees (max_depth > 15)
Seeing rebuild warnings in logs
Typical settings:
| Growth Strategy | Recommended Cache Size |
|---|---|
| Depth-first | 8 (default) |
| Breadth-first | max_depth × 2 |
| Best-first (leaf-wise) | 16-32 |
Observability#
Histogram operations are not directly observable, but training logs include:
Per-tree training time (histogram building is ~60-70%)
Memory usage includes histogram cache
Users experiencing slow training with high max_depth should increase cache_size as a first step.
Troubleshooting#
| Symptom | Likely Cause | Solution |
|---|---|---|
| Slow training with deep trees | Histogram cache eviction | Increase cache_size |
| Slow single-threaded mode | Expected—compute-bound | Use multi-threaded mode |
| High memory usage | Large histogram cache | Decrease cache_size |
Testing Strategy#
Unit Tests#
| Category | Tests |
|---|---|
| Histogram accumulation | Sum correctness, numerical precision |
| Subtraction trick | parent - child = sibling identity |
| LRU eviction | Correct eviction order, slot reuse |
| Parallel consistency | Same result as sequential |
Concurrency Tests#
| Test | Verification |
|---|---|
| Data race detection | Run under ThreadSanitizer (TSAN) |
| Determinism | Parallel build == sequential build (bit-identical) |
| Thread scaling | 1, 2, 4, 8 threads produce same histograms |
Feature-parallel builds should be deterministic since each feature writes to disjoint memory regions. Tests verify this property.
Determinism guarantee: Given identical inputs and ordered_indices, histogram builds produce bit-identical results regardless of thread count. This is because:
Each feature accumulates independently (no cross-feature interaction)
Within a feature, samples are processed in ordered_indices order
Floating-point addition order is fixed by the iteration order
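A sketch of how the determinism check can compare results (helper name is illustrative): bit patterns are compared directly rather than using an approximate tolerance.

/// Sketch: sequential and parallel builds must match bit-for-bit.
fn assert_bitwise_equal(seq: &[HistogramBin], par: &[HistogramBin]) {
    assert_eq!(seq.len(), par.len());
    for (a, b) in seq.iter().zip(par) {
        // Determinism means identical floating-point results, not merely close ones.
        assert_eq!(a.0.to_bits(), b.0.to_bits());
        assert_eq!(a.1.to_bits(), b.1.to_bits());
    }
}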
Edge Cases#
| Case | Test Approach |
|---|---|
| Empty node (0 samples) | Histogram should be zero |
| Single sample | Histogram should equal that gradient |
| All same bin | Single bin should have full sum |
| Max bins (256/65536) | Boundary bins accumulate correctly |
| Overflow potential | Large gradients × many samples |
| Numerical precision | f64 vs f32 sum comparison |
| Subnormal gradients | Very small hessians (near 0) accumulate correctly |
Subnormal handling: f64 bins correctly accumulate subnormal f32 gradients. The test verifies that summing millions of tiny values (e.g., hessians ≈ 1e-38) doesn’t collapse to zero and maintains reasonable precision.
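A sketch of such a check (illustrative test, not the exact one in the suite):

#[test]
fn subnormal_hessians_accumulate() {
    // 1e-38 is below f32's minimum normal value, i.e. subnormal.
    let hess = 1e-38f32;
    let n = 1_000_000usize;
    let mut sum = 0.0f64;
    for _ in 0..n {
        sum += hess as f64; // widened to f64 before accumulation, as in the bins
    }
    assert!(sum > 0.0);
    let expected = n as f64 * hess as f64;
    assert!((sum - expected).abs() / expected < 1e-9);
}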
Performance Regression Testing#
Performance is tracked via Criterion benchmarks in benches/suites/component/:
cargo bench --bench training_gbdt
The histogram building benchmark measures:
Time per histogram build (microseconds)
Throughput (samples × features / second)
CI can optionally flag regressions against stored baselines (thresholds are project-policy, not part of this RFC).
Numerical Validation#
#[test]
fn histogram_sum_equals_node_sum() {
    // Sum of all histogram bins should equal sum of node gradients
    let hist_sum: (f64, f64) = histogram.iter()
        .fold((0.0, 0.0), |acc, b| (acc.0 + b.0, acc.1 + b.1));
    let grad_sum: (f64, f64) = ordered_grads.iter()
        .fold((0.0, 0.0), |acc, g| (acc.0 + g.grad as f64, acc.1 + g.hess as f64));
    assert_relative_eq!(hist_sum.0, grad_sum.0, epsilon = 1e-6);
    assert_relative_eq!(hist_sum.1, grad_sum.1, epsilon = 1e-6);
}
Files#
| Path | Contents | Visibility |
|---|---|---|
| | Module organization | Internal |
| | | Internal |
| | | Internal |
| | Safe iteration over disjoint feature regions | Internal |
| | | Public |
| | | Internal |
Note: Users interact with histograms only through GBDTConfig::cache_size. The histogram building machinery is internal and may change between versions.
Concrete proposal: extracting the LRU policy#
If we want to “split off” the LRU algorithm without changing behavior or performance, the lowest-risk refactor is to extract a small, array-based helper that owns only the eviction bookkeeping (timestamps + pins), while leaving node/slot mapping and histogram storage in HistogramPool.
Proposed internal API (no public surface):
training/gbdt/histograms/lru.rs with struct TimestampLru { last_used_time: Box<[u64]>, pinned: Box<[bool]>, current_time: u64 }
TimestampLru methods:
touch(slot_id) increments time + records timestamp
find_victim() scans for the min timestamp among unpinned slots
pin(slot_id) / unpin(slot_id)
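A minimal sketch of the proposed helper (proposal only, not existing code; method bodies are illustrative):

/// Sketch of the proposed TimestampLru: owns only eviction bookkeeping.
struct TimestampLru {
    last_used_time: Box<[u64]>,
    pinned: Box<[bool]>,
    current_time: u64,
}

impl TimestampLru {
    fn touch(&mut self, slot_id: usize) {
        self.current_time += 1;
        self.last_used_time[slot_id] = self.current_time;
    }

    fn find_victim(&self) -> Option<usize> {
        // Same O(cache_size) scan as today: minimum timestamp among unpinned slots.
        (0..self.last_used_time.len())
            .filter(|&s| !self.pinned[s])
            .min_by_key(|&s| self.last_used_time[s])
    }

    fn pin(&mut self, slot_id: usize) { self.pinned[slot_id] = true; }
    fn unpin(&mut self, slot_id: usize) { self.pinned[slot_id] = false; }
}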
Migration plan:
Keep HistogramPool's existing mapping arrays as-is.
Move only last_used_time / current_time / pinned / find_lru_slot() into TimestampLru.
Keep "identity mode" inside HistogramPool (it's not LRU; it's validity-by-epoch).
Why this is feasible:
The current eviction is already O(cache_size) scanning over a dense array; the helper preserves that exact behavior.
No heap allocations on the hot path; all arrays remain preallocated.
Pin/unpin stays a slot-local boolean.
Why we might not want to do it:
HistogramPool is performance-critical and currently self-contained; splitting files may not improve readability meaningfully.
The tricky part is not LRU itself; it's the interaction with move_mapping() and identity-mode semantics.
Open Questions#
Quantized mode for memory-bound hardware: If profiling shows histogram building becomes memory-bandwidth bound on some hardware, should we add a quantized histogram mode?
Adaptive cache sizing: Could we dynamically size the histogram cache based on tree depth and available memory?
References#
LightGBM feature_histogram.hpp - Histogram operations
XGBoost histogram.cuh - GPU histogram building
Ke, G., et al. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree” (2017) - Section 2.2 on histogram-based algorithms
Changelog#
2026-01-02: Consolidated and clarified histogram design and tests
2025-12-27: Initial draft