Feature Bundling and Sparse Feature Optimization#

This document explores techniques for optimizing gradient boosting when datasets contain many sparse or binary features, particularly from one-hot encoding. The key insight is that mutually exclusive features (like one-hot encoded columns) can be bundled together, dramatically reducing memory usage and computation.


Problem Statement#

The One-Hot Encoding Problem#

When categorical features are one-hot encoded, they create a sparse matrix:

Original: color = {red, blue, green}  →  3 binary features

Row 1: color=red   →  [1, 0, 0]
Row 2: color=blue  →  [0, 1, 0]  
Row 3: color=green →  [0, 0, 1]
Row 4: color=red   →  [1, 0, 0]

Observation: In any row, at most one of these features is non-zero. These features are mutually exclusive.

Waste in Current Approach#

With separate features, we:

  1. Store 3 separate binned columns (3× memory vs 1 categorical column)

  2. Build 3 separate histograms (3× histogram work)

  3. Find splits for 3 features (3× split-finding work)

For a dataset with 50 categorical features of average cardinality 20:

  • 1000 one-hot features instead of 50 categorical features

  • 20× memory and compute overhead

Real-World Impact#

Our quality benchmark shows that boosters performs 4% worse than LightGBM on the Adult dataset (105 one-hot features from 14 original). This is partly due to:

  1. Equal-width binning on binary features (addressed separately via quantile binning)

  2. No feature bundling optimization (this RFC)

  3. No native categorical support (future work)


Exclusive Feature Bundling (EFB)#

Core Idea#

Key insight: Mutually exclusive features can be merged into a single feature without losing information. If features \(f_1, f_2, f_3\) never have non-zero values in the same row, we can encode them as one feature with distinct value ranges:

Original features:
  f1 ∈ {0, 1}  (0 = absent, 1 = present)
  f2 ∈ {0, 1}
  f3 ∈ {0, 1}

Bundled feature f_bundle:
  0 = all absent (f1=f2=f3=0)
  1 = f1 is active
  2 = f2 is active
  3 = f3 is active

Bundle value: offset_j + bin(f_j) + 1 for the single active feature f_j
  (bundle bin 0 is reserved for "all absent"). With offsets (0, 1, 2) and
  bin = 0 for "present":  f1=1 → 0+0+1 = 1,  f2=1 → 1+0+1 = 2,  f3=1 → 2+0+1 = 3

Formal Definition#

Exclusive Feature Bundling: Given features \(F = \{f_1, ..., f_k\}\), we say they can be bundled if for every sample \(i\): \(|\{f_j \in F : f_j(x_i) \neq 0\}| \leq 1\)

i.e., at most one feature has a non-zero value per sample.

Relaxation: LightGBM allows a small conflict rate (default 0.01%) for near-exclusive features. This trades a tiny accuracy loss for significant efficiency gains.

Algorithm: Bundle Construction#

ALGORITHM: ConstructExclusiveBundles(features, max_conflict_rate)
-----------------------------------------------------------------
INPUT:
  features: List of sparse features
  max_conflict_rate: Maximum allowed conflict (default 0.0001)

OUTPUT:
  bundles: List of feature sets that can be bundled

1. // Conceptually: graph coloring on a conflict graph with edge (i,j)
2. //   whenever features i and j conflict. Optimal coloring is NP-hard,
3. //   so we color greedily, counting conflicts lazily below.
4. bundles = []
5. sorted_features = SortByNonZeroCount(features, descending=True)

6. FOR feature IN sorted_features:
7.     best_bundle = None
8.     best_conflict = infinity
9.     
10.    FOR bundle IN bundles:
11.        conflict = CountConflicts(feature, bundle, features)
12.        IF conflict <= max_conflict_rate * n_samples:
13.            IF conflict < best_conflict:
14.                best_bundle = bundle
15.                best_conflict = conflict
16.    
17.    IF best_bundle is not None:
18.        best_bundle.append(feature)
19.    ELSE:
20.        bundles.append([feature])

21. RETURN bundles

Complexity: O(n_features² × n_samples) in the worst case, but conflict detection operates on sparse non-zero indices, making it O(n_features² × avg_nnz) in practice.
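Below is a minimal Python sketch of this greedy construction, assuming each sparse feature is represented as the set of row indices where it is non-zero; the names (build_bundles, count_conflicts) are illustrative, not taken from any library.

def count_conflicts(nz_a, bundle_nz):
    # Rows where the candidate feature and the bundle are both non-zero.
    return len(nz_a & bundle_nz)

def build_bundles(features_nz, n_samples, max_conflict_rate=0.0001):
    # features_nz[f] = set of row indices where feature f is non-zero.
    max_conflicts = max_conflict_rate * n_samples
    bundles = []    # each bundle is a list of feature indices
    bundle_nz = []  # union of non-zero rows per bundle
    # Visit denser features first so they anchor their own bundles.
    order = sorted(range(len(features_nz)),
                   key=lambda f: len(features_nz[f]), reverse=True)
    for f in order:
        nz = features_nz[f]
        best, best_conflict = None, float("inf")
        for b in range(len(bundles)):
            c = count_conflicts(nz, bundle_nz[b])
            if c <= max_conflicts and c < best_conflict:
                best, best_conflict = b, c
        if best is not None:
            bundles[best].append(f)
            bundle_nz[best] |= nz
        else:
            bundles.append([f])
            bundle_nz.append(set(nz))
    return bundles

# The three one-hot color columns from above collapse into one bundle:
print(build_bundles([{0, 3}, {1}, {2}], n_samples=4))  # [[0, 1, 2]]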

Algorithm: Bundle Value Encoding#

Once features are bundled, encode them as a single feature:

ALGORITHM: EncodeBundleValue(sample, bundle, bin_offsets)
---------------------------------------------------------
INPUT:
  sample: Feature values for one row
  bundle: List of (feature_idx, bin_mapper) in the bundle
  bin_offsets: Cumulative bin offsets for each feature in bundle

OUTPUT:
  bin_value: Single bin index representing all features

1. bundle_value = 0  // Default: all features are zero/missing

2. FOR i, (feat_idx, mapper) IN enumerate(bundle):
3.     val = sample[feat_idx]
4.     IF val != 0:  // Only one should be non-zero (by exclusivity)
5.         bin = mapper.GetBin(val)
6.         bundle_value = bin_offsets[i] + bin + 1  // +1: bundle bin 0 means all-zero
7.         BREAK  // At most one active feature

8. RETURN bundle_value
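A runnable sketch of the encoding, assuming each feature's bin mapper assigns its non-zero values to bins starting at 0, so the +1 shift reserves bundle bin 0 for the all-zero row (encode_bundle_value is a hypothetical helper):

def encode_bundle_value(row, bundle, bin_offsets):
    # row: dict feature_idx -> raw value
    # bundle: list of (feature_idx, to_bin); bin_offsets[i] = total bins of
    #   the features before position i in the bundle.
    for i, (feat_idx, to_bin) in enumerate(bundle):
        val = row.get(feat_idx, 0)
        if val != 0:                       # exclusivity: at most one non-zero
            return bin_offsets[i] + to_bin(val) + 1
    return 0                               # every feature in the bundle is zero

# Three one-hot features, one bin each (bin 0 = "present"):
bundle = [(0, lambda v: 0), (1, lambda v: 0), (2, lambda v: 0)]
print(encode_bundle_value({1: 1}, bundle, [0, 1, 2]))  # 2, i.e. "f2 is active"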

Decoding at Split Time#

When finding splits, we need to map bundle bins back to original features:

Bundle: [f1 (2 bins), f2 (3 bins), f3 (4 bins)]
Offsets: [0, 2, 5, 9]  // cumulative bin counts, with a trailing sentinel

Bundle bin 0: default (all features zero)
Bundle bins 1-2: f1 (bins 0-1, i.e. offset 0 + bin + 1)
Bundle bins 3-5: f2 (bins 0-2, i.e. offset 2 + bin + 1)
Bundle bins 6-9: f3 (bins 0-3, i.e. offset 5 + bin + 1)

Split on bundle bin 4 → “f2 <= threshold_1”  (bundle bin 4 is f2's bin 1)
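The reverse mapping is a binary search over the same offsets. A small sketch under the trailing-sentinel layout shown above (decode_bundle_bin is a hypothetical helper):

import bisect

def decode_bundle_bin(bundle_bin, offsets):
    # offsets: cumulative bin counts with a trailing sentinel, e.g.
    #   [0, 2, 5, 9] for features with 2, 3 and 4 bins.
    if bundle_bin == 0:
        return None                          # default bin: no feature active
    i = bisect.bisect_left(offsets, bundle_bin) - 1
    return i, bundle_bin - offsets[i] - 1    # (feature position, feature bin)

print(decode_bundle_bin(4, [0, 2, 5, 9]))    # (1, 1): f2, bin 1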


Conflict Detection#

Conflict Definition#

Two features conflict on sample \(i\) if both have non-zero values: \(\text{conflict}(f_a, f_b) = |\{i : f_a(x_i) \neq 0 \land f_b(x_i) \neq 0\}|\)

Efficient Conflict Detection#

For sparse features (stored as non-zero indices), use set intersection:

ALGORITHM: CountConflicts(feature_a, bundle, features)
------------------------------------------------------
1. conflicts = 0
2. indices_a = NonZeroIndices(feature_a)

3. FOR feature_b IN bundle:
4.     indices_b = NonZeroIndices(feature_b)
5.     conflicts += |indices_a ∩ indices_b|

6. RETURN conflicts

Optimization: Use bitsets for conflict tracking:

conflict_mask[bundle] = bitset of rows where the bundle has any non-zero
new_conflicts = popcount(conflict_mask[bundle] & bitset(feature_a))
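In Python, for instance, arbitrary-precision integers can serve as the bitsets; a sketch (illustrative, not LightGBM's actual data structure):

def to_bitset(nz_indices):
    # One bit per row; bit i is set when row i is non-zero.
    mask = 0
    for i in nz_indices:
        mask |= 1 << i
    return mask

bundle_mask = to_bitset({0, 3, 7})    # rows the bundle already covers
feature_mask = to_bitset({3, 5})      # rows where the candidate is non-zero

new_conflicts = bin(bundle_mask & feature_mask).count("1")  # popcount -> 1
bundle_mask |= feature_mask           # absorb the feature into the bundle
print(new_conflicts)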

LightGBM’s Threshold#

From LightGBM source (dataset.cpp:118):

const data_size_t single_val_max_conflict_cnt =
    static_cast<data_size_t>(total_sample_cnt / 10000);

Default: Allow conflicts in 0.01% of samples.


Multi-Value Bins (Dense Bundling)#

When Bundles Become Dense#

If bundled features collectively cover many rows (e.g., >40% non-zero), the bundle behaves like a dense feature. LightGBM uses two modes:

  1. Single-Value Bin: Each row has one bin value (works when exclusive)

  2. Multi-Value Bin: Each row can have multiple active features

Multi-Value Storage#

For dense bundles (>40% coverage), LightGBM stores multiple values per row:

Traditional: bins[row] = single_bin_value

Multi-value: bins[row] = [bin1, bin2, ...]  // CSR-like format

Use case: When the exclusivity assumption breaks down for a subset of features.
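A small sketch of such a CSR-like layout, assuming a flat bin array addressed through row pointers (illustrative only):

rows = [[1, 4], [], [2], [3, 5, 6]]   # active bundle bins per row

row_ptr, flat_bins = [0], []
for r in rows:
    flat_bins.extend(r)
    row_ptr.append(len(flat_bins))

def row_bins(i):
    # Slice out the bins belonging to row i.
    return flat_bins[row_ptr[i]:row_ptr[i + 1]]

print(row_bins(3))   # [3, 5, 6]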


Binary Feature Detection#

Identifying Binary Features#

A feature is binary if it has only 2 unique values (typically 0 and 1):

ALGORITHM: IsBinaryFeature(feature)
-----------------------------------
1. unique_values = CountUnique(feature)
2. RETURN unique_values == 2

Binary Feature Optimization#

Binary features need only 1 bit per sample instead of a full bin index:

| Storage | Memory per sample | For 1M samples |
|---------|-------------------|----------------|
| u8 bin  | 8 bits            | 1 MB           |
| 1-bit   | 1 bit             | 125 KB         |

Bitpacking Multiple Binary Features#

Pack 8 binary features into a single byte:

Features: [is_male, is_employed, has_car, ..., is_urban]
Bitpacked: byte = (is_male << 0) | (is_employed << 1) | (has_car << 2) | ...

Reading: is_employed = (byte >> 1) & 1

Histogram optimization: A combined histogram over 8 packed features needs 2^8 = 256 bins, but bit manipulation can extract the individual per-feature histograms efficiently.
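A runnable sketch of the packing and the per-feature extraction, counting value occurrences rather than gradient sums for brevity (pack_row and feature_histogram are hypothetical helpers):

def pack_row(bits):
    # bits: 8 values in {0, 1}, packed LSB-first into one byte.
    byte = 0
    for i, b in enumerate(bits):
        byte |= (b & 1) << i
    return byte

def feature_histogram(packed, i):
    # Counts of 0s and 1s for feature i across all packed rows.
    ones = sum((byte >> i) & 1 for byte in packed)
    return [len(packed) - ones, ones]

packed = [pack_row([1, 0, 1, 0, 0, 0, 0, 1]),
          pack_row([0, 1, 1, 0, 0, 0, 0, 0])]
print(feature_histogram(packed, 2))   # [0, 2]: feature 2 is 1 in both rows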


Integration with Quantile Binning#

Binary Features Don’t Need Quantile Binning#

Quantile binning is designed for continuous features with skewed distributions. For binary features:

  • Only 2 possible values

  • Sorting/percentile computation is wasted effort

  • Equal-width and quantile binning give identical results

Optimization: Detect binary features and skip expensive quantile computation.

ALGORITHM: BinFeature(feature, config)
--------------------------------------
1. unique_vals = CountUnique(feature)

2. IF unique_vals == 1:
3.     RETURN TrivialBinMapper()  // Constant feature, skip

4. IF unique_vals == 2:
5.     RETURN BinaryBinMapper(Min(feature), Max(feature))  // Special fast path

6. IF config.strategy == Quantile:
7.     RETURN QuantileBinMapper(feature, config.max_bins)
8. ELSE:
9.     RETURN EqualWidthBinMapper(feature, config.max_bins)
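A compact Python sketch of this dispatch, showing the binary fast path that never reaches the sort/percentile work (the mapper names follow this document; the bodies are placeholders):

def bin_feature(values, max_bins=256, strategy="quantile"):
    uniq = sorted(set(values))
    if len(uniq) == 1:
        return "trivial", None                 # constant feature, skip it
    if len(uniq) == 2:
        lo = uniq[0]                           # binary fast path: no percentile work
        return "binary", (lambda v: 0 if v == lo else 1)
    if strategy == "quantile":
        return "quantile", None                # full quantile path goes here
    return "equal_width", None                 # equal-width path goes here

kind, mapper = bin_feature([0, 1, 0, 0, 1])
print(kind, mapper(1))                         # binary 1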

Performance Analysis#

Memory Savings#

For Adult dataset (105 one-hot features from 14 original):

| Approach | Features | Memory (256 bins, 48K samples) |
|----------|----------|--------------------------------|
| No bundling | 105 | 105 × 48K = 5.04 MB |
| Perfect bundling (14 bundles) | 14 | 14 × 48K = 0.67 MB |
| With bitpacking | ~3 bundles | ~0.2 MB |

Reduction: Up to 25× less memory.

Histogram Building Speedup#

Fewer features = fewer histograms to build:

| Approach | Histograms per leaf |
|----------|---------------------|
| 105 features | 105 × 256 bins |
| 14 bundles | 14 × 256 bins |

Speedup: ~7.5× (105 / 14 histograms) for the Adult dataset.

Split Finding#

Splits must still be expressed in terms of the original features for interpretability, but the histogram work is reduced:

  • Build histogram for bundle

  • Map splits back to original features


Library Comparison#

LightGBM (Full EFB)#

  • Automatic exclusive feature bundling

  • Multi-value bins for dense bundles

  • Conflict-tolerant bundling (0.01% threshold)

  • Enabled by default (enable_bundle=true)

XGBoost (Limited)#

  • No explicit EFB

  • Some sparse optimizations in tree_method=hist

  • Relies on native categorical support instead

CatBoost#

  • Native categorical encoding (ordered target statistics)

  • No one-hot explosion

  • Different approach: avoid the problem rather than optimize it


References#

  • Ke, G., et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”

    • Section 3.1: Exclusive Feature Bundling

  • LightGBM source:

    • src/io/dataset.cpp: FastFeatureBundling, FindGroups

    • include/LightGBM/feature_group.h: FeatureGroup class

    • src/treelearner/feature_histogram.hpp: Histogram with bundles

  • XGBoost source:

    • include/xgboost/tree_model.h: Sparse handling


Summary#

| Technique | Benefit | Complexity | Priority |
|-----------|---------|------------|----------|
| Binary detection | Skip quantile binning, potential bitpacking | Low | High |
| Exclusive bundling | 5-25× memory reduction | Medium | Medium |
| Multi-value bins | Handles non-exclusive bundles | High | Low |