Feature Bundling and Sparse Feature Optimization#
This document explores techniques for optimizing gradient boosting when datasets contain many sparse or binary features, particularly from one-hot encoding. The key insight is that mutually exclusive features (like one-hot encoded columns) can be bundled together, dramatically reducing memory usage and computation.
Problem Statement#
The One-Hot Encoding Problem#
When categorical features are one-hot encoded, they create a sparse matrix:
Original: color = {red, blue, green} → 3 binary features
Row 1: color=red → [1, 0, 0]
Row 2: color=blue → [0, 1, 0]
Row 3: color=green → [0, 0, 1]
Row 4: color=red → [1, 0, 0]
Observation: In any row, at most one of these features is non-zero. These features are mutually exclusive.
Waste in Current Approach#
With separate features, we:
Store 3 separate binned columns (3× memory vs 1 categorical column)
Build 3 separate histograms (3× histogram work)
Find splits for 3 features (3× split-finding work)
For a dataset with 50 categorical features of average cardinality 20:
1000 one-hot features instead of 50 categorical features
20× memory and compute overhead
Real-World Impact#
Our quality benchmark shows that boosters is 4% worse than LightGBM on the Adult dataset (105 one-hot features from 14 original). This is partly due to:
Equal-width binning on binary features (addressed separately via quantile binning)
No feature bundling optimization (this RFC)
No native categorical support (future work)
Exclusive Feature Bundling (EFB)#
Core Idea#
Key insight: Mutually exclusive features can be merged into a single feature without losing information. If features \(f_1, f_2, f_3\) never have non-zero values in the same row, we can encode them as one feature with distinct value ranges:
Original features:
f1 ∈ {0, 1} (0 = absent, 1 = present)
f2 ∈ {0, 1}
f3 ∈ {0, 1}
Bundled feature f_bundle:
0 = all absent (f1=f2=f3=0)
1 = f1 is active
2 = f2 is active
3 = f3 is active
Bundle value: offset_j + bin_j(x) for the single non-zero feature j, or 0 if every feature in the bundle is zero.
Example: with offsets [1, 2, 3] and each binary feature contributing one bin (bin 0), f1=1 → 1 + 0 = 1, f2=1 → 2 + 0 = 2, f3=1 → 3 + 0 = 3.
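As a concrete illustration, here is a minimal NumPy sketch (column names and offsets are illustrative, not part of any library API) that collapses the three one-hot color columns above into a single bundled column:

```python
import numpy as np

# Three mutually exclusive one-hot columns for color = {red, blue, green}.
one_hot = np.array([
    [1, 0, 0],   # red
    [0, 1, 0],   # blue
    [0, 0, 1],   # green
    [1, 0, 0],   # red
])

# Each binary feature contributes one bin; offsets [1, 2, 3] skip the
# reserved bundle bin 0 (all features zero).
offsets = np.array([1, 2, 3])

active = one_hot.argmax(axis=1)        # which column is non-zero in each row
any_active = one_hot.any(axis=1)       # all-zero rows stay at the default bin 0
bundle = np.where(any_active, offsets[active], 0)

print(bundle)  # [1 2 3 1] -- one bundled column instead of three
```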
Formal Definition#
Exclusive Feature Bundling: Given features \(F = \{f_1, ..., f_k\}\), we say they can be bundled if for every sample \(i\): \(|\{f_j : f_j(x_i) \neq 0\}| \leq 1\)
i.e., at most one feature has a non-zero value per sample.
Relaxation: LightGBM allows a small conflict rate (default 0.01%) for near-exclusive features. This trades a tiny accuracy loss for significant efficiency gains.
Algorithm: Bundle Construction#
ALGORITHM: ConstructExclusiveBundles(features, max_conflict_rate)
-----------------------------------------------------------------
INPUT:
features: List of sparse features
max_conflict_rate: Maximum allowed conflict (default 0.0001)
OUTPUT:
bundles: List of feature sets that can be bundled
1. // Build conflict graph: edge (i,j) if features i,j conflict
2. conflict_graph = ComputeConflicts(features)
3. // Greedy graph coloring (NP-hard to optimize, greedy is good enough)
4. bundles = []
5. sorted_features = SortByNonZeroCount(features, descending=True)
6. FOR feature IN sorted_features:
7. best_bundle = None
8. best_conflict = infinity
9.
10. FOR bundle IN bundles:
11. conflict = CountConflicts(feature, bundle, features)
12. IF conflict <= max_conflict_rate * n_samples:
13. IF conflict < best_conflict:
14. best_bundle = bundle
15. best_conflict = conflict
16.
17. IF best_bundle is not None:
18. best_bundle.append(feature)
19. ELSE:
20. bundles.append([feature])
21. RETURN bundles
Complexity: O(n_features² × n_samples) worst case, but conflict detection uses sparse indices making it O(n_features² × avg_nnz).
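Below is a minimal Python sketch of the greedy construction above, assuming each feature is represented as the set of row indices where it is non-zero; the function name and representation are illustrative, not a library API:

```python
def construct_exclusive_bundles(nonzero_rows, n_samples, max_conflict_rate=0.0001):
    """Greedy bundling sketch: nonzero_rows[f] is the set of rows where feature f != 0."""
    max_conflicts = int(max_conflict_rate * n_samples)
    # Process the densest features first, as in the pseudocode above.
    order = sorted(range(len(nonzero_rows)),
                   key=lambda f: len(nonzero_rows[f]), reverse=True)

    bundles = []        # each bundle: list of feature indices
    bundle_rows = []    # union of non-zero rows per bundle, for conflict counting
    for f in order:
        rows = nonzero_rows[f]
        best, best_conflict = None, max_conflicts + 1
        for b, used in enumerate(bundle_rows):
            conflict = len(rows & used)          # rows where both sides are non-zero
            if conflict < best_conflict:
                best, best_conflict = b, conflict
        if best is not None and best_conflict <= max_conflicts:
            bundles[best].append(f)
            bundle_rows[best] |= rows
        else:
            bundles.append([f])
            bundle_rows.append(set(rows))
    return bundles


# Three mutually exclusive sparse features over 4 rows -> a single bundle.
print(construct_exclusive_bundles([{0, 3}, {1}, {2}], n_samples=4))  # [[0, 1, 2]]
```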
Algorithm: Bundle Value Encoding#
Once features are bundled, encode them as a single feature:
ALGORITHM: EncodeBundleValue(sample, bundle, bin_offsets)
---------------------------------------------------------
INPUT:
sample: Feature values for one row
bundle: List of (feature_idx, bin_mapper) in the bundle
bin_offsets: Cumulative bin offsets for each feature in bundle
OUTPUT:
bin_value: Single bin index representing all features
1. bundle_value = 0 // Default: all features are zero/missing
2. FOR i, (feat_idx, mapper) IN enumerate(bundle):
3. val = sample[feat_idx]
4. IF val != 0: // Only one should be non-zero (by exclusivity)
5. bin = mapper.GetBin(val)
6. bundle_value = bin_offsets[i] + bin
7. BREAK // At most one active feature
8. RETURN bundle_value
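A small Python sketch of this encoding, assuming each bin mapper is a callable returning the feature's 0-based bin and bin_offsets[i] is the first bundle bin assigned to the i-th feature; these names are illustrative:

```python
def encode_bundle_value(row, bundle, bin_offsets):
    """row: raw feature values for one sample;
    bundle: list of (feature_idx, bin_mapper) pairs;
    bin_offsets[i]: first bundle bin used by bundle[i] (bundle bin 0 = all zero)."""
    for i, (feat_idx, mapper) in enumerate(bundle):
        val = row[feat_idx]
        if val != 0:                      # by exclusivity, at most one is non-zero
            return bin_offsets[i] + mapper(val)
    return 0                              # default bin: every feature in the bundle is zero


# Illustrative use: three binary features, each with a single bin (bin 0).
binary_bin = lambda v: 0
bundle = [(0, binary_bin), (1, binary_bin), (2, binary_bin)]
print(encode_bundle_value([0, 1, 0], bundle, bin_offsets=[1, 2, 3]))  # 2 -> f2 active
```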
Decoding at Split Time#
When finding splits, we need to map bundle bins back to original features:
Bundle: [f1 (2 bins), f2 (3 bins), f3 (4 bins)] → 10 bundle bins total
Per-feature offsets: [1, 3, 6] (bundle bin 0 is reserved for the all-zero default)
Bundle bin 0: default (all features zero)
Bundle bins 1-2: f1 (bins 0-1)
Bundle bins 3-5: f2 (bins 0-2)
Bundle bins 6-9: f3 (bins 0-3)
Split on bundle bin 4 → “f2 <= threshold_1” (bundle bin 4 − offset 3 = f2 bin 1)
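A hedged sketch of this reverse mapping, assuming the per-feature offsets above are available as a sorted list of bundle-bin boundaries (names are illustrative):

```python
import bisect

def decode_bundle_bin(bundle_bin, feature_names, boundaries):
    """Map a bundle bin back to (feature, per-feature bin).

    boundaries[i] is the first bundle bin of feature i; bundle bin 0 is the
    reserved all-zero default.
    """
    if bundle_bin == 0:
        return None, None
    i = bisect.bisect_right(boundaries, bundle_bin) - 1
    return feature_names[i], bundle_bin - boundaries[i]


# Bundle [f1 (2 bins), f2 (3 bins), f3 (4 bins)] -> boundaries [1, 3, 6]
print(decode_bundle_bin(4, ["f1", "f2", "f3"], [1, 3, 6]))  # ('f2', 1)
```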
Conflict Detection#
Conflict Definition#
Two features conflict on sample \(i\) if both have non-zero values there; the total conflict count is \(\text{conflict}(f_a, f_b) = |\{i : f_a(x_i) \neq 0 \land f_b(x_i) \neq 0\}|\)
Efficient Conflict Detection#
For sparse features (stored as non-zero indices), use set intersection:
ALGORITHM: CountConflicts(feature_a, bundle, features)
------------------------------------------------------
1. conflicts = 0
2. indices_a = NonZeroIndices(feature_a)
3. FOR feature_b IN bundle:
4. indices_b = NonZeroIndices(feature_b)
5. conflicts += |indices_a ∩ indices_b|
6. RETURN conflicts
Optimization: Use bitsets for conflict tracking:
conflict_mask[bundle] = bitset of rows where bundle has any non-zero
new_conflicts = popcount(conflict_mask[bundle] & indices_a)
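For illustration, a small Python sketch of both variants, assuming non-zero rows are available either as sets of indices or as boolean masks (names are illustrative):

```python
import numpy as np

def count_conflicts(indices_a, bundle_indices):
    """Rows where feature a and any feature already in the bundle are both non-zero."""
    return sum(len(indices_a & indices_b) for indices_b in bundle_indices)

def count_conflicts_mask(mask_a, bundle_mask):
    """Bitset-style variant: popcount of the AND of two boolean row masks."""
    return int(np.count_nonzero(mask_a & bundle_mask))


a = {0, 4, 7}
bundle = [{1, 2}, {4, 9}]
print(count_conflicts(a, bundle))  # 1 -- only row 4 conflicts
```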
LightGBM’s Threshold#
From LightGBM source (dataset.cpp:118):
const data_size_t single_val_max_conflict_cnt =
static_cast<data_size_t>(total_sample_cnt / 10000);
Default: Allow conflicts in 0.01% of samples.
Multi-Value Bins (Dense Bundling)#
When Bundles Become Dense#
If bundled features collectively cover many rows (e.g., >40% non-zero), the bundle behaves like a dense feature. LightGBM uses two modes:
Single-Value Bin: Each row has one bin value (works when exclusive)
Multi-Value Bin: Each row can have multiple active features
Multi-Value Storage#
For dense bundles (>40% coverage), LightGBM stores multiple values per row:
Traditional: bins[row] = single_bin_value
Multi-value: bins[row] = [bin1, bin2, ...] // CSR-like format
Use case: When the exclusivity assumption breaks down for a subset of features.
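A minimal sketch of such CSR-like per-row storage (illustrative layout, not LightGBM's actual structures):

```python
import numpy as np

# CSR-like storage: all bin values concatenated, plus per-row start offsets.
# Row i's active bins are bin_values[row_ptr[i]:row_ptr[i + 1]].
bin_values = np.array([3, 7, 1, 5, 8], dtype=np.uint8)
row_ptr    = np.array([0, 2, 3, 3, 5])   # 4 rows; row 2 has no active features

def bins_for_row(i):
    return bin_values[row_ptr[i]:row_ptr[i + 1]]

print(bins_for_row(0))  # [3 7] -- two features active in the same row
```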
Binary Feature Detection#
Identifying Binary Features#
A feature is binary if it has only 2 unique values (typically 0 and 1):
ALGORITHM: IsBinaryFeature(feature)
-----------------------------------
1. unique_values = CountUnique(feature)
2. RETURN unique_values == 2
Binary Feature Optimization#
Binary features need only 1 bit per sample instead of a full bin index:
| Storage | Memory per sample | For 1M samples |
|---|---|---|
| u8 bin | 8 bits | 1 MB |
| 1-bit | 1 bit | 125 KB |
Bitpacking Multiple Binary Features#
Pack 8 binary features into a single byte:
Features: [is_male, is_employed, has_car, ..., is_urban]
Bitpacked: byte = (is_male << 0) | (is_employed << 1) | (has_car << 2) | ...
Reading: is_employed = (byte >> 1) & 1
Histogram optimization: For 8 packed features, need 2^8 = 256 combined bins. But can use bit manipulation to extract individual feature histograms efficiently.
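A short NumPy sketch of this bitpacking, using hypothetical binary columns; the shift/mask layout mirrors the byte diagram above:

```python
import numpy as np

# Hypothetical binary columns (unpacked, each costs one byte per sample).
is_male     = np.array([1, 0, 1, 1], dtype=np.uint8)
is_employed = np.array([0, 1, 1, 0], dtype=np.uint8)
has_car     = np.array([1, 1, 0, 0], dtype=np.uint8)

# Pack up to 8 binary features into one byte per sample.
packed = (is_male << 0) | (is_employed << 1) | (has_car << 2)

# Reading a single feature back is a shift and a mask.
employed_again = (packed >> 1) & 1
assert np.array_equal(employed_again, is_employed)
print(packed)  # [5 6 3 1]
```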
Integration with Quantile Binning#
Binary Features Don’t Need Quantile Binning#
Quantile binning is designed for continuous features with skewed distributions. For binary features:
Only 2 possible values
Sorting/percentile computation is wasted effort
Equal-width and quantile binning give identical results
Optimization: Detect binary features and skip expensive quantile computation.
ALGORITHM: BinFeature(feature, config)
--------------------------------------
1. unique_vals = CountUnique(feature)
2. IF unique_vals == 1:
3. RETURN TrivialBinMapper() // Constant feature, skip
4. IF unique_vals == 2:
5. RETURN BinaryBinMapper(min_val, max_val) // Special fast path
6. IF config.strategy == Quantile:
7. RETURN QuantileBinMapper(feature, config.max_bins)
8. ELSE:
9. RETURN EqualWidthBinMapper(feature, config.max_bins)
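A rough Python sketch of this dispatch, assuming simple quantile and equal-width boundary computations; it is illustrative, not the library's binning code:

```python
import numpy as np

def bin_feature(values, max_bins=256, strategy="quantile"):
    """Return bin upper boundaries for one feature, with fast paths for trivial cases."""
    values = np.asarray(values, dtype=float)
    uniques = np.unique(values)
    if len(uniques) == 1:
        return np.array([])                  # constant feature: nothing to split on
    if len(uniques) == 2:
        return np.array([uniques.mean()])    # binary: one threshold, skip quantile work
    qs = np.linspace(0, 1, max_bins + 1)[1:-1]
    if strategy == "quantile":
        return np.unique(np.quantile(values, qs))
    return np.linspace(values.min(), values.max(), max_bins + 1)[1:-1]  # equal-width
```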
Performance Analysis#
Memory Savings#
For Adult dataset (105 one-hot features from 14 original):
| Approach | Features | Memory (256 bins, 48K samples) |
|---|---|---|
| No bundling | 105 | 105 × 48K = 5.04 MB |
| Perfect bundling (14 bundles) | 14 | 14 × 48K = 0.67 MB |
| With bitpacking | ~3 bundles | ~0.2 MB |
Reduction: Up to 25× less memory.
Histogram Building Speedup#
Fewer features = fewer histograms to build:
| Approach | Histograms per leaf |
|---|---|
| 105 features | 105 × 256 bins |
| 14 bundles | 14 × 256 bins |
Speedup: ~7× for Adult dataset.
Split Finding#
Splits must still be expressed in terms of the original features for interpretability, but the histogram work is reduced:
Build histogram for bundle
Map splits back to original features
Library Comparison#
LightGBM (Full EFB)#
Automatic exclusive feature bundling
Multi-value bins for dense bundles
Conflict-tolerant bundling (0.01% threshold)
Enabled by default (enable_bundle=true)
XGBoost (Limited)#
No explicit EFB
Some sparse optimizations in tree_method=hist
Relies on native categorical support instead
CatBoost#
Native categorical encoding (ordered target statistics)
No one-hot explosion
Different approach: avoid the problem rather than optimize it
References#
Ke, G., et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”
Section 3.1: Exclusive Feature Bundling
LightGBM source:
src/io/dataset.cpp: FastFeatureBundling, FindGroups
include/LightGBM/feature_group.h: FeatureGroup class
src/treelearner/feature_histogram.hpp: Histogram with bundles
XGBoost source:
include/xgboost/tree_model.h: Sparse handling
Summary#
| Technique | Benefit | Complexity | Priority |
|---|---|---|---|
| Binary detection | Skip quantile, potential bitpacking | Low | High |
| Exclusive bundling | 5-25× memory reduction | Medium | Medium |
| Multi-value bins | Handles non-exclusive features | High | Low |