RFC-0004: Exclusive Feature Bundling (EFB)#
Status: Implemented
Created: 2025-12-20
Updated: 2026-01-02
Scope: Sparse feature optimization
Summary#
EFB merges mutually exclusive sparse features into bundles, reducing histogram building cost. A 100-feature one-hot encoding becomes 1 bundle with 100 distinct bin values.
Why EFB?#
Consider one-hot encoded categoricals:
Sample |
cat_a |
cat_b |
cat_c |
|---|---|---|---|
0 |
1 |
0 |
0 |
1 |
0 |
1 |
0 |
2 |
0 |
0 |
1 |
Only one feature is non-zero per sample. Building 3 separate histograms wastes work—we can build 1 histogram with 3 distinct values.
Impact: 2-5× speedup on sparse datasets (text, one-hot, multi-hot).
Accuracy trade-off: With max_conflict_ratio = 0 (exact mutual exclusion),
bundling is lossless. With max_conflict_ratio > 0, some samples may have
multiple features in a bundle active—the encoded bin is ambiguous. Typically
<1% accuracy loss with 5% conflict ratio on sparse data.
Layers#
High Level#
Users enable via config:
let config = GBDTConfig::builder()
.enable_bundling(true) // Default: true
.max_conflict_ratio(0.0) // Exact mutual exclusion
.build()?;
// To disable EFB for debugging
let config = GBDTConfig::builder()
.enable_bundling(false)
.build()?;
Medium Level (Bundling Algorithm)#
Conflict detection: Build bitsets marking which samples each feature covers
Conflict graph: Features with overlapping non-zero entries conflict
Greedy bundling: Assign features to bundles, avoiding conflicts
pub struct BundlingConfig {
pub max_conflict_ratio: f32, // 0.0 = exact, 0.05 = allow 5% conflicts
pub max_bundle_size: usize, // Max features per bundle (default 64)
pub min_sparsity: f32, // Only bundle features >90% sparse
}
Low Level (Offset Encoding)#
Bundle members share a single bin column using offset encoding:
Member 0: bins [0, max_bins_0)
Member 1: bins [max_bins_0, max_bins_0 + max_bins_1)
Member 2: bins [max_bins_0 + max_bins_1, ...)
pub struct BundlePlan {
pub members: Box<[usize]>, // Original feature indices
pub offsets: Box<[u16]>, // Cumulative offsets
pub total_bins: u16,
}
At training time, EffectiveViews exposes bundles as single features. The
histogram builder decodes encoded bins back to member histograms.
Conflict Detection#
Two features conflict if both are non-zero for the same sample:
// Bitset per feature: which samples have non-zero values
presence[feat_a].and(&presence[feat_b]).count_ones() > threshold
Complexity: O(n_features² × n_samples/64) for conflict detection. For 1M samples, each bitset = 125KB. Conflict check = AND + popcount (vectorized). Typically <1% of total training time.
Training Integration#
Training sees n_effective features, which may be fewer than n_original:
Original |
Bundling |
Effective |
|---|---|---|
100 features |
10 bundles of 10 |
10 features |
50 dense + 100 sparse |
50 + 5 bundles |
55 features |
Tree grower is unaware of bundling—it just iterates effective features. Splits on bundles record which member was split when building trees.
Files#
Path |
Contents |
|---|---|
|
|
|
|
|
|
Design Decisions#
DD-1: Greedy over optimal. Graph coloring for optimal bundling is NP-hard. Greedy runs in O(n_features²) and is within 2× of optimal.
DD-2: Bitset conflict detection. Bitwise AND + popcount is cache-friendly and vectorizes well.
DD-3: Decode during histogram build. Alternative: single histogram, decode during split search. We chose decode-at-build for simpler split code.
DD-4: Skip dense features. Only features with >90% zeros are candidates. Dense features rarely bundle profitably.
Portability Note#
Bundling is a training-time optimization. Trees store splits in terms of the original feature indices, so prediction does not depend on any bundling metadata.
Testing Strategy#
Category |
Tests |
|---|---|
Bundle correctness |
Verify no conflicts in bundles (with ratio=0) |
Offset encoding |
Round-trip: encode → decode preserves bin values |
Split decoding |
Bundle splits correctly map to original features |
Edge cases |
Single-feature bundles, max bundle size, empty bundles |
Regression |
Same predictions with/without bundling (ratio=0) |