RFC-0004: Exclusive Feature Bundling (EFB)#

Status: Implemented
Created: 2025-12-20
Updated: 2026-01-02
Scope: Sparse feature optimization

Summary#

EFB merges mutually exclusive sparse features into bundles, reducing histogram building cost. A 100-feature one-hot encoding becomes 1 bundle with 100 distinct bin values.

Why EFB?#

Consider one-hot encoded categoricals:

Sample	cat_a	cat_b	cat_c
0	1	0	0
1	0	1	0
2	0	0	1

Only one feature is non-zero per sample. Building 3 separate histograms wastes work—we can build 1 histogram with 3 distinct values.

Impact: 2-5× speedup on sparse datasets (text, one-hot, multi-hot).

Accuracy trade-off: With max_conflict_ratio = 0 (exact mutual exclusion), bundling is lossless. With max_conflict_ratio > 0, some samples may have multiple features in a bundle active—the encoded bin is ambiguous. Typically <1% accuracy loss with 5% conflict ratio on sparse data.

Layers#

High Level#

Users enable via config:

let config = GBDTConfig::builder()
    .enable_bundling(true)  // Default: true
    .max_conflict_ratio(0.0)  // Exact mutual exclusion
    .build()?;

// To disable EFB for debugging
let config = GBDTConfig::builder()
    .enable_bundling(false)
    .build()?;

Medium Level (Bundling Algorithm)#

Conflict detection: Build bitsets marking which samples each feature covers
Conflict graph: Features with overlapping non-zero entries conflict
Greedy bundling: Assign features to bundles, avoiding conflicts

pub struct BundlingConfig {
    pub max_conflict_ratio: f32,  // 0.0 = exact, 0.05 = allow 5% conflicts
    pub max_bundle_size: usize,   // Max features per bundle (default 64)
    pub min_sparsity: f32,        // Only bundle features >90% sparse
}

Low Level (Offset Encoding)#

Bundle members share a single bin column using offset encoding:

Member 0: bins [0, max_bins_0)
Member 1: bins [max_bins_0, max_bins_0 + max_bins_1)
Member 2: bins [max_bins_0 + max_bins_1, ...)

pub struct BundlePlan {
    pub members: Box<[usize]>,    // Original feature indices
    pub offsets: Box<[u16]>,      // Cumulative offsets
    pub total_bins: u16,
}

At training time, EffectiveViews exposes bundles as single features. The histogram builder decodes encoded bins back to member histograms.

Conflict Detection#

Two features conflict if both are non-zero for the same sample:

// Bitset per feature: which samples have non-zero values
presence[feat_a].and(&presence[feat_b]).count_ones() > threshold

Complexity: O(n_features² × n_samples/64) for conflict detection. For 1M samples, each bitset = 125KB. Conflict check = AND + popcount (vectorized). Typically <1% of total training time.

Training Integration#

Training sees n_effective features, which may be fewer than n_original:

Original	Bundling	Effective
100 features	10 bundles of 10	10 features
50 dense + 100 sparse	50 + 5 bundles	55 features

Tree grower is unaware of bundling—it just iterates effective features. Splits on bundles record which member was split when building trees.

Files#

Path	Contents
`data/binned/bundling.rs`	`BundlingConfig`, conflict detection
`data/binned/bundle.rs`	`BundlePlan`, offset encoding
`data/binned/effective.rs`	`EffectiveViews`

Design Decisions#

DD-1: Greedy over optimal. Graph coloring for optimal bundling is NP-hard. Greedy runs in O(n_features²) and is within 2× of optimal.

DD-2: Bitset conflict detection. Bitwise AND + popcount is cache-friendly and vectorizes well.

DD-3: Decode during histogram build. Alternative: single histogram, decode during split search. We chose decode-at-build for simpler split code.

DD-4: Skip dense features. Only features with >90% zeros are candidates. Dense features rarely bundle profitably.

Portability Note#

Bundling is a training-time optimization. Trees store splits in terms of the original feature indices, so prediction does not depend on any bundling metadata.

Testing Strategy#

Category	Tests
Bundle correctness	Verify no conflicts in bundles (with ratio=0)
Offset encoding	Round-trip: encode → decode preserves bin values
Split decoding	Bundle splits correctly map to original features
Edge cases	Single-feature bundles, max bundle size, empty bundles
Regression	Same predictions with/without bundling (ratio=0)