Inside Gradient Boosting, Part 7 of 9
This series explains gradient boosting from first principles to advanced implementation details.
Previous: Gradient-Based Sampling (GOSS) In this post: Feature-level optimizations for sparse and categorical data. Next: Regularization and Hyperparameter Tuning
So far, we’ve focused on sample-level optimizations: histograms, tree growth strategies, and sampling. But what about features?
Many real-world datasets have hundreds of sparse binary features (from one-hot encoding) or high-cardinality categorical columns. Both create computational challenges that specialized algorithms can address.
This post covers two related optimizations:
- Exclusive Feature Bundling (EFB): LightGBM’s technique for combining sparse features
- Native categorical splits: How to avoid one-hot encoding entirely
The One-Hot Problem
When you one-hot encode a categorical feature, you create a sparse matrix:
Original: color = {red, blue, green}
One-hot encoded:
Sample 1 (color=red): [1, 0, 0]
Sample 2 (color=blue): [0, 1, 0]
Sample 3 (color=green): [0, 0, 1]
Observation: in any row, at most one of these features is non-zero. These features are mutually exclusive: no two of them can be "on" in the same row.
The waste is significant:
- 3× memory: Three columns instead of one
- 3× histogram work: Build three separate histograms
- 3× split evaluation: Test splits on three features
For a typical dataset with 50 categorical features averaging 20 categories each:
- One-hot encoding creates 1,000 features instead of 50
- 20× overhead in memory and computation
Exclusive Feature Bundling (EFB)
LightGBM’s insight: mutually exclusive features can be merged into one without losing information.
The Core Idea
If features are mutually exclusive (at most one is non-zero per row), encode them as a single feature with distinct value ranges:
Original features (each 0 or 1):
f1: color_red
f2: color_blue
f3: color_green
Bundled feature:
0 = all zero (missing/default)
1 = f1 is active (red)
2 = f2 is active (blue)
3 = f3 is active (green)
One histogram instead of three. Same information, 3× less work.
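To make the encoding concrete, here is a minimal NumPy sketch (not LightGBM's internal code) that collapses three mutually exclusive 0/1 columns into one bundled column; the column names are illustrative:

```python
import numpy as np

# Three mutually exclusive one-hot columns (one entry per sample).
color_red   = np.array([1, 0, 0, 0])
color_blue  = np.array([0, 1, 0, 0])
color_green = np.array([0, 0, 1, 0])

# Bundle into one column: 0 = all zero, i+1 = column i is active.
onehot = np.stack([color_red, color_blue, color_green], axis=1)
bundled = np.where(onehot.any(axis=1), onehot.argmax(axis=1) + 1, 0)

print(bundled)  # [1 2 3 0] -> red, blue, green, none
```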
Finding Bundles
Not all features are perfectly exclusive. LightGBM uses a greedy algorithm (one that makes locally optimal choices at each step) to find near-exclusive bundles:
Algorithm: Exclusive Feature Bundling
Input: features, max_conflict_rate (default: 0.01%)

1. Compute conflict graph:
   - Edge between features i and j if both are non-zero in the same row
2. Sort features by non-zero count (descending)
3. Greedy bundling. For each feature:
   - Find the bundle with the lowest conflict
   - If conflict <= threshold: add to that bundle
   - Else: create a new bundle

Output: list of feature bundles
The conflict threshold (0.01% by default) allows bundling features that are almost exclusive, trading tiny accuracy loss for significant speedup.
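Below is a simplified Python sketch of the greedy bundling loop, assuming dense NumPy arrays. For brevity it assigns each feature to the first bundle whose conflict count is under the threshold (rather than the minimum-conflict bundle) and skips LightGBM's sparse bookkeeping:

```python
import numpy as np

def greedy_bundle(features, max_conflict_rate=0.0001):
    """Greedily group near-exclusive columns; `features` is a dense (rows x cols) array."""
    n_rows, _ = features.shape
    nonzero = features != 0
    order = np.argsort(-nonzero.sum(axis=0))   # densest features first
    max_conflicts = int(max_conflict_rate * n_rows)

    bundles = []        # each bundle is a list of column indices
    bundle_masks = []   # per bundle: rows where any member is non-zero

    for col in order:
        placed = False
        for b, mask in enumerate(bundle_masks):
            conflicts = int(np.count_nonzero(mask & nonzero[:, col]))
            if conflicts <= max_conflicts:     # near-exclusive: accept into this bundle
                bundles[b].append(int(col))
                bundle_masks[b] = mask | nonzero[:, col]
                placed = True
                break
        if not placed:                         # no compatible bundle: start a new one
            bundles.append([int(col)])
            bundle_masks.append(nonzero[:, col].copy())
    return bundles
```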
Encoding Bundle Values
Once features are bundled, encode the bundle value by assigning offset ranges:
Bundle: [f1 (2 bins), f2 (3 bins), f3 (4 bins)]
Offsets: [0, 2, 5]
If f1 is active with bin 1: bundle_value = 0 + 1 = 1
If f2 is active with bin 2: bundle_value = 2 + 2 = 4
If f3 is active with bin 0: bundle_value = 5 + 0 = 5
The bundle has 2 + 3 + 4 = 9 possible values, compactly represented in one column.
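A small sketch of the offset arithmetic, reproducing the worked example above (the bin counts and offsets are the ones from the example):

```python
import numpy as np

# Bin counts for each feature in the bundle: f1, f2, f3.
bin_counts = [2, 3, 4]
offsets = np.concatenate(([0], np.cumsum(bin_counts)[:-1]))  # [0, 2, 5]

def encode_bundle(active_feature, bin_value):
    """Map (feature index within bundle, bin index) to a single bundled bin."""
    return offsets[active_feature] + bin_value

print(encode_bundle(0, 1))  # 1  (f1 active, bin 1)
print(encode_bundle(1, 2))  # 4  (f2 active, bin 2)
print(encode_bundle(2, 0))  # 5  (f3 active, bin 0)
```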
Performance Impact
For a dataset with 105 one-hot features bundled into 14:
| Metric | Without EFB | With EFB |
|---|---|---|
| Features to process | 105 | 14 |
| Memory | 5 MB | 0.7 MB |
| Histogram work | 105 histograms | 14 histograms |
Reduction: ~7× faster, ~7× less memory.
The Combinatorial Challenge of Categorical Splits
Beyond bundling, there’s a deeper question: how do we split on categorical features natively?
For numerical features, we test thresholds: "is $x_j \le t$?"
For categorical features, we test set membership: "is $x_j \in S$?" where $S$ is a subset of categories.
The problem: for $k$ categories, there are $2^{k-1} - 1$ possible binary partitions.
| Categories | Possible Partitions |
|---|---|
| 3 | 3 |
| 5 | 15 |
| 10 | 511 |
| 20 | 524,287 |
| 100 | $\approx 6.3 \times 10^{29}$ |
Exhaustive search is only feasible for small $k$.
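You can reproduce the table with one line of arithmetic; the number of ways to split $k$ categories into two non-empty groups is $2^{k-1} - 1$:

```python
# Number of binary partitions of k categories into two non-empty groups: 2^(k-1) - 1
for k in [3, 5, 10, 20, 100]:
    print(k, 2 ** (k - 1) - 1)   # 3, 15, 511, 524287, ~6.3e29
```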
Native Categorical Splits
Approach 1: One-Hot Strategy
For few categories (at or below LightGBM's max_cat_to_onehot threshold, default 4), test each category individually:
For each category c:
Compute gain for {c} vs {all other categories}
Return the best single-category split
This is $O(k)$ but only finds "isolate one category" splits, not arbitrary subsets.
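As a sketch, the one-vs-rest search could look like the following, assuming the standard second-order split gain $G^2/(H+\lambda)$ and per-sample gradient/Hessian arrays; the function name and `lam` parameter are illustrative:

```python
import numpy as np

def one_vs_rest_split(cat_ids, grad, hess, lam=1.0):
    """Best single-category split: {c} vs all other categories.

    cat_ids: integer category per sample; grad/hess: per-sample gradient statistics.
    """
    cats = np.unique(cat_ids)
    G = np.array([grad[cat_ids == c].sum() for c in cats])
    H = np.array([hess[cat_ids == c].sum() for c in cats])
    G_tot, H_tot = G.sum(), H.sum()

    best_gain, best_cat = -np.inf, None
    for i, c in enumerate(cats):
        g_l, h_l = G[i], H[i]                    # category c alone goes left
        g_r, h_r = G_tot - g_l, H_tot - h_l      # everything else goes right
        gain = g_l**2 / (h_l + lam) + g_r**2 / (h_r + lam) - G_tot**2 / (H_tot + lam)
        if gain > best_gain:
            best_gain, best_cat = gain, c
    return best_cat, best_gain
```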
Approach 2: Gradient-Sorted Strategy
For more categories, LightGBM uses a clever approximation based on a classic result:
Fisher's Theorem (1958)
For squared error, the optimal binary partition of categories can be found by:
- Sort categories by their mean target value
- Test the split points in sorted order
The optimal split is contiguous in this ordering. For other losses (e.g., logistic), this is a heuristic that works well in practice.
For gradient boosting, we use gradient/Hessian ratio as the “mean target” proxy:
$$\mathrm{ratio}_c = \frac{G_c}{H_c}$$
where $G_c$ and $H_c$ are the gradient and Hessian sums for category $c$.
Algorithm: Gradient-Sorted Categorical Split
Input: categories, gradients, hessians

1. For each category c: ratio[c] = sum(gradients for c) / sum(hessians for c)
2. Sort categories by ratio
3. Scan the sorted order:
   - Running sums: G_left, H_left = 0
   - For split point i = 1 to k-1:
       Add category i-1 to the left
       Compute gain for left vs right
       Track the best split

Output: categories going left, gain
Complexity: $O(k \log k)$ instead of $O(2^k)$.
Why does this work? Categories with similar gradient ratios have similar optimal leaf values and should be grouped together. Sorting by ratio orders categories by their “effect direction.”
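Here is a runnable sketch of the gradient-sorted scan, under the same assumptions as the one-vs-rest example (standard second-order gain, a single L2 term `lam`). It omits LightGBM's cat_smooth smoothing, max_cat_threshold cap, and min_data_per_group filter:

```python
import numpy as np

def gradient_sorted_split(cat_ids, grad, hess, lam=1.0):
    """Fisher-style categorical split: sort categories by G/H, scan contiguous prefixes."""
    cats = np.unique(cat_ids)
    G = np.array([grad[cat_ids == c].sum() for c in cats])
    H = np.array([hess[cat_ids == c].sum() for c in cats])
    order = np.argsort(G / H)                    # sort categories by gradient/Hessian ratio
    G_tot, H_tot = G.sum(), H.sum()
    parent = G_tot**2 / (H_tot + lam)

    best_gain, best_left = -np.inf, None
    g_l = h_l = 0.0
    for i in range(len(cats) - 1):               # k-1 candidate split points
        g_l += G[order[i]]
        h_l += H[order[i]]
        g_r, h_r = G_tot - g_l, H_tot - h_l
        gain = g_l**2 / (h_l + lam) + g_r**2 / (h_r + lam) - parent
        if gain > best_gain:
            best_gain = gain
            best_left = cats[order[:i + 1]]      # categories routed left
    return best_left, best_gain                  # best_left is None if k < 2
```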
Representing Categorical Splits
Once we find a good split, we need to store which categories go left.
Bitset Storage
Store the “goes left” set as a bitset:
Categories: {0: apple, 1: banana, 2: cherry, 3: date}
Split: {apple, cherry} go left
Bitset: 0b0101 = 5
||||
|||+-- apple (0): 1 = left
||+--- banana (1): 0 = right
|+---- cherry (2): 1 = left
+----- date (3): 0 = right
Decision: if (bitset & (1 << category)) then LEFT else RIGHT
For up to 64 categories, this fits in a single u64. For more categories, LightGBM uses multiple 64-bit words packed together. Very high cardinality (1000+ categories) may require 16+ words per split, adding storage overhead.
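A minimal Python sketch of the encode/route logic (Python integers are arbitrary-precision, so this single-integer version handles any cardinality; LightGBM packs fixed 64-bit words instead):

```python
def build_bitset(left_categories):
    """Pack the 'goes left' category IDs into an integer bitset."""
    bits = 0
    for c in left_categories:
        bits |= 1 << c
    return bits

def goes_left(bitset, category):
    """Route a sample: bit set -> left, bit clear -> right."""
    return (bitset >> category) & 1 == 1

bitset = build_bitset([0, 2])      # apple (0) and cherry (2) go left
print(bin(bitset))                 # 0b101 -> 5
print(goes_left(bitset, 1))        # banana -> False (right)
```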
Preventing Overfitting
Categorical splits are prone to overfitting, especially with high cardinality. Each split can perfectly isolate small groups.
LightGBM’s Regularization Parameters
| Parameter | Default | Effect |
|---|---|---|
| cat_l2 | 10.0 | Extra L2 penalty on leaf weights |
| cat_smooth | 10.0 | Smoothing in ratio computation |
| min_data_per_group | 100 | Minimum samples per category group |
| max_cat_threshold | 32 | Consider only top-k categories by importance |
Practical Guidance
For high-cardinality features (100+ categories):
- Increase cat_l2 to 20-50 (see the example config below)
- Increase min_data_per_group to 200-500
- Consider grouping rare categories into "other"
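A hedged example of what such a configuration could look like; the values are illustrative starting points, not tuned recommendations:

```python
# Illustrative LightGBM settings for a high-cardinality categorical feature (100+ categories).
params = {
    'objective': 'binary',
    'cat_l2': 30.0,              # stronger L2 penalty on categorical splits
    'cat_smooth': 20.0,          # heavier smoothing of the G/H ratio
    'min_data_per_group': 300,   # require more samples per categorical group
    'max_cat_threshold': 16,     # consider fewer candidate categories per split
}
```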
When to Use Native vs One-Hot
Use One-Hot Encoding When
- Cardinality is low (< 10 categories)
- Using a library without native support
- Interpretability matters (one feature per category)
- Categories have independent effects
Use Native Categorical When
- Cardinality is moderate (10-1000 categories)
- Memory efficiency matters
- Categories should be grouped (similar products, locations)
- Using LightGBM (categorical_feature) or XGBoost (enable_categorical)
Consider Alternatives When
- Very high cardinality (> 1000): Target encoding, embeddings
- Ordinal relationship exists: Treat as numerical
- Too few samples per category: Group into "other" (see the sketch below)
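A small pandas sketch of the "group rare categories into other" step, assuming a string-typed column; the threshold and helper name are illustrative:

```python
import pandas as pd

def group_rare_categories(s: pd.Series, min_count: int = 50, other: str = "other") -> pd.Series:
    """Replace categories with fewer than `min_count` occurrences by a shared 'other' bucket."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other)

# Usage: df["merchant"] = group_rare_categories(df["merchant"], min_count=100)
```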
Handling Unknown Categories
At inference, you may encounter categories not seen during training.
| Strategy | Behavior | Use When |
|---|---|---|
| Default direction | Go right (or learned default) | Production systems |
| Error | Reject the input | Safety-critical systems |
| Map to “other” | Reserve during training | Expected unknown categories |
Encoding Consistency
Training and inference must use identical category-to-integer mappings. Store the encoder with your model artifact.
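One way to keep the mapping consistent is to fit it once on training data, reserve a code for unseen categories, and serialize it next to the model. The `__unknown__` key and file name below are illustrative, not a library convention:

```python
import json

def fit_category_map(values):
    """Fit a stable category-to-integer mapping on the training data."""
    cats = sorted(set(values))
    mapping = {c: i for i, c in enumerate(cats)}
    mapping["__unknown__"] = len(cats)       # reserved code for unseen categories
    return mapping

def encode(values, mapping):
    """Encode values, sending anything unseen to the reserved code."""
    unk = mapping["__unknown__"]
    return [mapping.get(v, unk) for v in values]

train_colors = ["red", "blue", "green", "blue"]
cat_map = fit_category_map(train_colors)

# Persist the mapping next to the model artifact and reload it at inference time.
with open("color_map.json", "w") as f:
    json.dump(cat_map, f)

print(encode(["green", "purple"], cat_map))  # [1, 3] -> 'purple' maps to __unknown__
```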
Library Configuration
LightGBM
params = {
'categorical_feature': [0, 3, 7], # Column indices
    'max_cat_to_onehot': 4,       # At or below this: one-vs-rest search
'cat_l2': 10.0, # Extra regularization
'cat_smooth': 10.0, # Ratio smoothing
'max_cat_threshold': 32, # Top-k categories
'min_data_per_group': 100, # Min samples
}

XGBoost
# Enable categorical support
dtrain = xgb.DMatrix(data, enable_categorical=True)
params = {
'tree_method': 'hist', # Required for categorical
'max_cat_to_onehot': 4, # One-hot threshold
}

Summary: Feature-Level Optimizations
| Technique | Problem Solved | Benefit |
|---|---|---|
| EFB | Sparse one-hot features | 5-25× memory, 5-10× speed |
| Native categorical | Arbitrary subset splits | Shallower trees, better accuracy |
| Gradient-sorted splits | Exponential split space | $O(k \log k)$ vs $O(2^k)$ |
When EFB Kicks In
EFB is automatic in LightGBM. It activates when:
- Features are sparse (< 1% non-zero)
- Features are nearly exclusive (< 0.01% conflict)
- Multiple features can be bundled together
You don’t need to configure anything; it just works.
What’s Next
We’ve covered the core algorithms: split finding, tree growth, sampling, and feature handling. The next post shifts to practical usage.
Regularization and Hyperparameter Tuning explains how to configure all these parameters to get the best performance: learning rate, tree depth, regularization, and tuning strategies.
References
- Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS 2017. Section 3.1: EFB. PDF
- Fisher, W.D. (1958). "On Grouping for Maximum Homogeneity". Journal of the American Statistical Association, 53(284), 789-798.
- Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD 2016. arXiv