Categorical Feature Handling#
Overview#
Categorical features (discrete values without natural ordering) require special split mechanisms because threshold-based splits are meaningless:
Numerical feature (ordered): \(\text{Split: } x_j \le t \rightarrow \text{left, otherwise right}\)
Categorical feature (unordered): \(\text{Split: } x_j \in S \rightarrow \text{left, otherwise right}\)
where \(S \subset \{\text{all categories}\}\) is a subset of categories.
The Combinatorial Challenge#
For a categorical feature with \(k\) categories, there are \(2^{k-1} - 1\) possible binary partitions (non-empty, non-trivial subsets up to symmetry).
| Categories | Possible Partitions |
|---|---|
| 3 | 3 |
| 5 | 15 |
| 10 | 511 |
| 20 | 524,287 |
| 100 | ~\(6 \times 10^{29}\) |
Exhaustive search is only feasible for small \(k\). Larger cardinalities require approximation algorithms.
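The growth of \(2^{k-1} - 1\) is easy to check directly; a minimal Python sketch (the function name is just for illustration):

```python
# Number of binary partitions of k unordered categories:
# non-empty, non-trivial subsets, counting {S, complement} once.
def num_binary_partitions(k: int) -> int:
    return 2 ** (k - 1) - 1

for k in (3, 5, 10, 20, 100):
    print(f"k={k:>3}: {num_binary_partitions(k):,} partitions")
```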
Approaches to Categorical Splits#
Approach 1: One-Hot Encoding (External Preprocessing)#
Convert each category to a binary feature:
Original: color ∈ {red, blue, green}
One-hot encoded:
color_red: [1, 0, 0] for red
color_blue: [0, 1, 0] for blue
color_green: [0, 0, 1] for green
Advantages:
Works with any tree implementation (no special support needed)
Each split tests one category (“is color = red?”)
Simple and well-understood
Disadvantages:
Feature explosion: k categories → k features
Memory inefficient for high cardinality
Tree depth increases (multiple splits needed to test subsets)
Loses relationships between categories
Approach 2: Native Categorical Splits#
Trees directly support partition-based splits:
Split: color ∈ {red, green} → left, {blue} → right
Advantages:
No feature expansion
More expressive splits (arbitrary subsets in one split)
Compact model representation
Can capture “similar category” relationships
Disadvantages:
Exponential split candidates (\(2^{k-1}\))
Requires specialized algorithms
Additional storage for category sets
Split Finding Algorithms#
Algorithm 1: Exhaustive Search (Small Cardinality)#
For a small number of categories (roughly \(k \le 8\)), enumerate all possible binary partitions:
ALGORITHM: ExhaustiveCategoricalSplit(categories, gradients, hessians)
----------------------------------------------------------------------
1.  k <- |categories|
2.  best_gain <- -infinity
3.  best_partition <- null
4.
5.  // Enumerate all non-empty subsets (up to complement symmetry)
6.  FOR subset IN all_subsets(categories) WHERE 0 < |subset| < k:
7.      // Compute gradient/hessian sums for left (subset) and right (complement)
8.      G_left, H_left <- sum_stats(subset, gradients, hessians)
9.      G_right, H_right <- sum_stats(categories - subset, gradients, hessians)
10.
11.     gain <- compute_split_gain(G_left, H_left, G_right, H_right)
12.
13.     IF gain > best_gain:
14.         best_gain <- gain
15.         best_partition <- subset
16.
17. RETURN best_partition, best_gain
Complexity: \(O(2^k)\) — exponential, only feasible for small k.
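A self-contained Python sketch of the same idea, assuming per-category gradient/hessian sums have already been accumulated; the `cat_stats` layout and the gain formula (a standard second-order score with L2 term `lam`, no gain threshold) are illustrative choices, not a particular library's implementation:

```python
from itertools import combinations

def exhaustive_categorical_split(cat_stats, lam=1.0):
    """cat_stats: dict mapping category -> (G_c, H_c) gradient/hessian sums.
    Returns (categories going left, gain) for the best binary partition."""
    def score(G, H):
        return G * G / (H + lam)

    cats = sorted(cat_stats)
    G_tot = sum(g for g, _ in cat_stats.values())
    H_tot = sum(h for _, h in cat_stats.values())
    parent = score(G_tot, H_tot)

    best_gain, best_subset = float("-inf"), None
    # Pin the first category to the left side so each {S, complement}
    # pair is enumerated exactly once (2^(k-1) - 1 candidates).
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {cats[0], *combo}
            if len(left) == len(cats):      # skip the trivial "everything left" split
                continue
            G_l = sum(cat_stats[c][0] for c in left)
            H_l = sum(cat_stats[c][1] for c in left)
            gain = score(G_l, H_l) + score(G_tot - G_l, H_tot - H_l) - parent
            if gain > best_gain:
                best_gain, best_subset = gain, left
    return best_subset, best_gain

stats = {"red": (-4.0, 2.0), "green": (-3.5, 2.0), "blue": (5.0, 3.0)}
print(exhaustive_categorical_split(stats))   # ({'blue'}, ...) i.e. {blue} vs {red, green}
```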
Algorithm 2: One-Hot Strategy (LightGBM)#
For k ≤ max_cat_to_onehot (default 4), test each category individually:
ALGORITHM: OneHotCategoricalSplit(categories, gradients, hessians)
------------------------------------------------------------------
1.  best_gain <- -infinity
2.  best_category <- null
3.
4.  FOR c IN categories:
5.      // Compute gain for splitting {c} vs {everything else}
6.      G_c, H_c <- sum_stats({c}, gradients, hessians)
7.      G_rest, H_rest <- sum_stats(categories - {c}, gradients, hessians)
8.
9.      gain <- compute_split_gain(G_c, H_c, G_rest, H_rest)
10.
11.     IF gain > best_gain:
12.         best_gain <- gain
13.         best_category <- c
14.
15. RETURN {best_category}, best_gain
This finds the single best category to isolate but won’t find multi-category partitions like {red, green} vs {blue}.
Complexity: O(k) per split—linear in number of categories.
Algorithm 3: Gradient-Sorted Strategy (LightGBM)#
For k > max_cat_to_onehot, LightGBM uses a clever approximation based on a classic result:
Key Insight (Fisher, 1958): For squared error loss, the optimal binary partition can be found by:
Sort categories by their mean target value
Test all O(k) split points in this sorted order
For gradient boosting, we use the gradient/hessian ratio as a proxy for “mean target”:
ALGORITHM: GradientSortedCategoricalSplit(categories, gradients, hessians)
--------------------------------------------------------------------------
1.  // Step 1: Compute per-category gradient ratio
2.  FOR c IN categories:
3.      G_c[c] <- sum(gradients[i] for i where x[i] = c)
4.      H_c[c] <- sum(hessians[i] for i where x[i] = c)
5.      ratio[c] <- G_c[c] / (H_c[c] + smoothing)   // smoothing prevents division by zero
6.
7.  // Step 2: Sort categories by ratio
8.  sorted_cats <- SORT(categories, key=ratio)
9.
10. // Step 3: Find optimal split point in sorted order
11. best_gain <- -infinity
12. G_left, H_left <- 0, 0
13. G_total, H_total <- sum(gradients), sum(hessians)
14.
15. FOR i FROM 1 TO k-1:
16.     c <- sorted_cats[i-1]
17.     G_left += G_c[c]
18.     H_left += H_c[c]
19.     G_right <- G_total - G_left
20.     H_right <- H_total - H_left
21.
22.     gain <- compute_split_gain(G_left, H_left, G_right, H_right)
23.
24.     IF gain > best_gain:
25.         best_gain <- gain
26.         best_split_point <- i
27.
28. RETURN sorted_cats[0:best_split_point], best_gain
Why This Works: Categories with similar gradient/hessian ratios have similar optimal leaf values and should be grouped together. Sorting by ratio orders categories by their “effect direction,” and the optimal partition is contiguous in this ordering.
Complexity: O(k log k) for sorting + O(k) for scanning = O(k log k)
Reference:
`LightGBM/src/treelearner/feature_histogram.cpp`; Fisher, W.D. (1958), “On Grouping for Maximum Homogeneity”
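A minimal Python sketch of the sorted-scan idea under the same assumptions as the exhaustive example above (per-category sums already computed); `lam` and `cat_smooth` are illustrative regularization knobs, not LightGBM's exact code path:

```python
def gradient_sorted_categorical_split(cat_stats, lam=1.0, cat_smooth=10.0):
    """cat_stats: dict mapping category -> (G_c, H_c).
    Sort by smoothed G/H ratio, then scan the k-1 contiguous split points."""
    def score(G, H):
        return G * G / (H + lam)

    # Step 1: order categories by their smoothed gradient/hessian ratio.
    order = sorted(cat_stats,
                   key=lambda c: cat_stats[c][0] / (cat_stats[c][1] + cat_smooth))

    G_tot = sum(g for g, _ in cat_stats.values())
    H_tot = sum(h for _, h in cat_stats.values())
    parent = score(G_tot, H_tot)

    # Steps 2-3: prefix scan over the sorted order.
    best_gain, best_i = float("-inf"), 0
    G_l = H_l = 0.0
    for i, c in enumerate(order[:-1], start=1):
        G_l += cat_stats[c][0]
        H_l += cat_stats[c][1]
        gain = score(G_l, H_l) + score(G_tot - G_l, H_tot - H_l) - parent
        if gain > best_gain:
            best_gain, best_i = gain, i
    return set(order[:best_i]), best_gain

stats = {"red": (-4.0, 2.0), "green": (-3.5, 2.0), "blue": (5.0, 3.0)}
print(gradient_sorted_categorical_split(stats))   # ({'red', 'green'}, ...)
```

On this toy input the contiguous scan recovers the same partition as exhaustive search ({red, green} vs {blue}) while testing only \(k-1\) candidates.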
Algorithm Comparison#
| Algorithm | Complexity | Optimality | Use When |
|---|---|---|---|
| Exhaustive | O(2^k) | Optimal | k ≤ 4 |
| One-hot | O(k) | Suboptimal | Finding single best category |
| Gradient-sorted | O(k log k) | Near-optimal for squared loss | k > 4 |
Storage: Bitsets for Category Sets#
Representing Partitions#
Store the “goes left” category set as a bitset:
Categories: {0: apple, 1: banana, 2: cherry, 3: date}
Split: {apple, cherry} go left
Bitset representation: 0b0101 = 5
                         ||||
                         |||+-- apple (0): 1 = goes left
                         ||+--- banana (1): 0 = goes right
                         |+---- cherry (2): 1 = goes left
                         +----- date (3): 0 = goes right
Decision: IF bitset & (1 << category) THEN left ELSE right
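A small Python sketch of building and testing such a bitset; the category names and IDs follow the toy example above:

```python
CATEGORY_IDS = {"apple": 0, "banana": 1, "cherry": 2, "date": 3}

def build_bitset(left_categories):
    """Pack the 'goes left' category set into an integer bitmask."""
    bits = 0
    for name in left_categories:
        bits |= 1 << CATEGORY_IDS[name]
    return bits

def goes_left(bitset, category_id):
    return (bitset >> category_id) & 1 == 1

bitset = build_bitset({"apple", "cherry"})
assert bitset == 0b0101 == 5
assert goes_left(bitset, CATEGORY_IDS["apple"])        # left
assert not goes_left(bitset, CATEGORY_IDS["banana"])   # right
```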
Compact Bitset Storage#
Pack bits into machine words:
For up to 64 categories: 1 uint64 word
For up to 128 categories: 2 uint64 words
For k categories: ceil(k / 64) uint64 words
Storage per split node: ceil(max_category / 64) * 8 bytes
Variable-Size Storage#
Different splits may have different cardinalities. Use CSR-like storage:
Categorical split storage:
data: [word0, word1, word2, word3, ...] // All bitsets packed
offsets: [0, 1, 3, 3, 5, ...] // Start index per node
sizes: [1, 2, 0, 2, ...] // Words per node (0 = numerical split)
Access bitset for node i:
start <- offsets[i]
len <- sizes[i]
bitset <- data[start : start + len]
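A Python sketch of this CSR-like access pattern; the array names follow the layout above and the toy values are illustrative:

```python
def node_bitset(data, offsets, sizes, node):
    """Return the packed 64-bit words for a node's category set,
    or None if the node holds a numerical split (size 0)."""
    start, length = offsets[node], sizes[node]
    return data[start:start + length] if length else None

def bitset_contains(words, category):
    """Membership test across multiple 64-bit words."""
    word, bit = divmod(category, 64)
    return word < len(words) and (words[word] >> bit) & 1 == 1

# Toy layout: node 0 categorical (1 word), node 1 categorical (2 words,
# i.e. up to 128 categories), node 2 numerical, node 3 categorical (2 words).
data    = [0b0101, 0b0001, 0b1000, 0b0010, 1 << 63]
offsets = [0, 1, 3, 3]
sizes   = [1, 2, 0, 2]

assert bitset_contains(node_bitset(data, offsets, sizes, 0), 2)    # category 2 goes left
assert bitset_contains(node_bitset(data, offsets, sizes, 1), 67)   # 67 -> word 1, bit 3
assert node_bitset(data, offsets, sizes, 2) is None                # numerical split
```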
Memory Requirements#
| Max Categories | Words per Split | Bytes per Split |
|---|---|---|
| 64 | 1 | 8 |
| 256 | 4 | 32 |
| 1,024 | 16 | 128 |
| 10,000 | 157 | 1,256 |
For high-cardinality features, consider:
Limiting to top-k most frequent categories
Hashing to fixed range
Rejecting categorical split if cardinality exceeds threshold
Decision Logic#
Traversal with Categorical Splits#
ALGORITHM: TraverseWithCategorical(node, features, tree)
--------------------------------------------------------
1.  WHILE node is not leaf:
2.      feat_idx <- tree.split_feature[node]
3.      fval <- features[feat_idx]
4.
5.      IF tree.is_categorical[node]:
6.          // Categorical decision: check bitset membership
7.          category <- INT(fval)
8.          bitset <- tree.get_category_bitset(node)
9.
10.         IF bitset.contains(category):
11.             node <- tree.left_child[node]
12.         ELSE:
13.             node <- tree.right_child[node]
14.     ELSE:
15.         // Numerical decision: threshold comparison
16.         IF fval <= tree.threshold[node]:
17.             node <- tree.left_child[node]
18.         ELSE:
19.             node <- tree.right_child[node]
20.
21. RETURN tree.leaf_value[node]
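A runnable Python version of this traversal; the `tree` dictionary of parallel arrays is an illustrative stand-in for a real model structure, and the field names echo the pseudocode rather than any specific library:

```python
def predict_one(tree, features):
    node = 0
    while tree["left_child"][node] != -1:              # -1 marks a leaf
        feat_idx = tree["split_feature"][node]
        fval = features[feat_idx]
        if tree["is_categorical"][node]:
            # Categorical decision: bitset membership
            go_left = (tree["cat_bitset"][node] >> int(fval)) & 1 == 1
        else:
            # Numerical decision: threshold comparison
            go_left = fval <= tree["threshold"][node]
        node = tree["left_child"][node] if go_left else tree["right_child"][node]
    return tree["leaf_value"][node]

# One categorical root split on feature 0: categories {0, 2} go left.
tree = {
    "left_child":     [1, -1, -1],
    "right_child":    [2, -1, -1],
    "split_feature":  [0,  0,  0],
    "is_categorical": [True, False, False],
    "cat_bitset":     [0b0101, 0, 0],
    "threshold":      [0.0, 0.0, 0.0],
    "leaf_value":     [0.0, -1.2, 0.7],
}
print(predict_one(tree, [2.0]))   # category 2 -> goes left -> leaf value -1.2
```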
Decision Type Encoding (LightGBM)#
LightGBM packs split metadata into a single byte:
decision_type byte layout:
Bit 0: Categorical flag (0=numerical, 1=categorical)
Bit 1: Default left flag (0=missing→right, 1=missing→left)
Bit 2-3: Missing type (00=None, 01=Zero, 10=NaN)
Example: decision_type = 0b00000101 = 5
- Bit 0: 1 → categorical split
- Bit 1: 0 → missing values go right
- Bit 2-3: 01 → treat 0.0 as missing
Reference:
`LightGBM/include/LightGBM/tree.h` (`kCategoricalMask`, `kDefaultLeftMask`)
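A sketch of decoding this byte in Python; the mask values simply mirror the bit layout described above (check the LightGBM headers for the authoritative constants):

```python
CATEGORICAL_MASK   = 0x01   # bit 0
DEFAULT_LEFT_MASK  = 0x02   # bit 1
MISSING_TYPE_SHIFT = 2      # bits 2-3: 0=None, 1=Zero, 2=NaN

def decode_decision_type(byte):
    missing = (byte >> MISSING_TYPE_SHIFT) & 0x3
    return {
        "categorical":  bool(byte & CATEGORICAL_MASK),
        "default_left": bool(byte & DEFAULT_LEFT_MASK),
        "missing_type": {0: "None", 1: "Zero", 2: "NaN"}.get(missing, "unknown"),
    }

print(decode_decision_type(0b00000101))
# {'categorical': True, 'default_left': False, 'missing_type': 'Zero'}
```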
Regularization for Categorical Splits#
Categorical splits are prone to overfitting, especially at high cardinality, because a partition can isolate small groups of samples almost perfectly. Apply extra regularization:
Additional L2 Penalty#
Standard leaf value: w = -G / (H + lambda)
With categorical penalty: w = -G / (H + lambda + cat_l2)
LightGBM default: cat_l2 = 10.0
Smoothing in Ratio Computation#
Prevent extreme ratios for rare categories:
Standard ratio: ratio = G / H
Smoothed ratio: ratio = G / (H + cat_smooth)
LightGBM default: cat_smooth = 10.0
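Both knobs amount to adding a constant to the hessian denominator; a small sketch of the two formulas (the names mirror the LightGBM parameters, but the functions themselves are illustrative):

```python
def categorical_leaf_value(G, H, lam=1.0, cat_l2=10.0):
    """Leaf weight with the extra categorical L2 term in the denominator."""
    return -G / (H + lam + cat_l2)

def category_ratio(G_c, H_c, cat_smooth=10.0):
    """Smoothed ordering key used when sorting categories for split finding."""
    return G_c / (H_c + cat_smooth)

# A rare category (tiny hessian) no longer dominates the ordering:
print(category_ratio(G_c=-0.9, H_c=0.2))   # ~ -0.088 with smoothing
print(-0.9 / 0.2)                          # -4.5 without smoothing
```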
Minimum Samples per Category#
Skip categories with insufficient data:
IF count[category] < min_data_per_group:
exclude category from split finding
LightGBM default: min_data_per_group = 100
Maximum Categories per Split#
Limit complexity of category sets:
Only consider top max_cat_threshold categories by |G/H| magnitude
LightGBM default: max_cat_threshold = 32
When to Use Each Approach#
Use One-Hot Encoding When#
Cardinality is low (< 10 categories)
Using a library without native categorical support
Interpretability is important (one feature per category)
Categories are expected to have independent effects
Use Native Categorical When#
Cardinality is moderate (10-1000 categories)
Memory efficiency matters
Categories should be grouped (e.g., similar products, related locations)
Using LightGBM, XGBoost (with `enable_categorical`), or a compatible implementation
Consider Alternatives When#
Very high cardinality (> 1000): Target encoding, embeddings
Ordinal relationship exists: Treat as numerical
Too few samples per category: Group rare categories into “other”
Encoding Consistency#
Critical: Training and inference must use identical category → integer mappings.
The Problem#
Training data categories: ["apple", "banana", "cherry"]
Mapping: apple→0, banana→1, cherry→2
New inference data: ["cherry", "date", "apple"]
If different mapping: cherry→0, date→1, apple→2 // WRONG!
The model learned that “0” means apple, but inference is sending cherry as 0.
Solution: Store Encoder with Model#
Model artifact should include:
1. Tree ensemble (splits, leaves)
2. Category encoder (category string → integer mapping per feature)
3. Feature metadata (which features are categorical)
At inference:
1. Load encoder
2. Transform input categories using stored mapping
3. Handle unknown categories (default direction or error)
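A minimal sketch of freezing and shipping the mapping, assuming a JSON side file stored next to the model; the file name and the -1 sentinel are arbitrary choices:

```python
import json

def fit_encoder(values):
    """Freeze a category -> integer mapping at training time."""
    return {cat: i for i, cat in enumerate(sorted(set(values)))}

def transform(values, mapping, unknown=-1):
    """Apply the stored mapping at inference; unseen categories get a
    sentinel code so the model can route them to its default direction."""
    return [mapping.get(v, unknown) for v in values]

encoder = fit_encoder(["apple", "banana", "cherry"])
with open("model_encoder.json", "w") as f:       # ship this next to the trees
    json.dump(encoder, f)

with open("model_encoder.json") as f:
    loaded = json.load(f)
print(transform(["cherry", "date", "apple"], loaded))   # [2, -1, 0]
```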
Handling Unknown Categories#
Options for categories seen at inference but not during training:
| Strategy | Behavior | Use When |
|---|---|---|
| Default direction | Go right (or configured default) | Production systems, graceful degradation |
| Error | Reject input | Safety-critical, want to catch data issues |
| Map to “other” | Reserve special category during training | Expected unknown categories |
Library Comparison#
| Feature | LightGBM | XGBoost |
|---|---|---|
| Native support | Yes | Yes (newer versions, with `enable_categorical`) |
| Split algorithm | Gradient-sorted | Partition-based |
| One-hot threshold | Configurable (`max_cat_to_onehot`) | Configurable (`max_cat_to_onehot`) |
| Max cardinality | Configurable | Limited |
| Bitset storage | Separate array | In-node |
| Linear tree compat | Yes | No |
LightGBM Configuration#
params = {
    'categorical_feature': [0, 3, 7],  # Column indices of categorical features
    'max_cat_to_onehot': 4,            # Below this, use one-hot-like search
    'cat_l2': 10.0,                    # Extra L2 regularization
    'cat_smooth': 10.0,                # Ratio smoothing
    'max_cat_threshold': 32,           # Max categories per split
    'min_data_per_group': 100,         # Min samples per category
}
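A minimal end-to-end sketch with these parameters, assuming synthetic data where column 0 is an integer-encoded categorical feature; here the categorical column is declared on the `lgb.Dataset` rather than in the params dict (both are accepted):

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
# Column 0: integer-encoded categories 0..4; column 1: a numerical feature.
X = np.column_stack([rng.integers(0, 5, 500), rng.normal(size=500)])
y = (X[:, 0] == 3).astype(float) + rng.normal(scale=0.1, size=500)

train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
params = {"objective": "regression", "cat_l2": 10.0, "cat_smooth": 10.0, "verbose": -1}
booster = lgb.train(params, train_set, num_boost_round=20)
print(booster.predict(X[:5]))
```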
XGBoost Configuration#
# Enable categorical support
dtrain = xgb.DMatrix(data, enable_categorical=True)
params = {
    'tree_method': 'hist',       # Required for categorical
    'max_cat_to_onehot': 4,      # One-hot threshold
}
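A minimal end-to-end sketch with these settings, assuming synthetic data in a pandas DataFrame whose categorical column uses the `category` dtype (which is what `enable_categorical` keys on):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": pd.Categorical(rng.choice(["red", "green", "blue"], size=500)),
    "x1": rng.normal(size=500),
})
y = (df["color"] == "green").astype(float) + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)
booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 4},
                    dtrain, num_boost_round=20)
print(booster.predict(dtrain)[:5])
```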
Summary#
| Aspect | One-Hot Encoding | Native Categorical |
|---|---|---|
| Preprocessing | Required (k → k features) | Not needed |
| Memory | High (sparse binary features) | Low (single feature + bitsets) |
| Split expressiveness | Single category per split | Arbitrary subsets |
| Algorithm | Standard threshold | Gradient-sorted or exhaustive |
| Tree depth | Deeper (multiple splits for subset) | Shallower |
| Overfitting risk | Lower | Higher (needs regularization) |
| Implementation | Universal | Requires library support |
Key References#
Fisher, W.D. (1958), “On Grouping for Maximum Homogeneity” — Mathematical foundation
`LightGBM/src/treelearner/feature_histogram.cpp` — Gradient-sorted split finding
`LightGBM/src/io/bin.cpp` — Category binning
`LightGBM/include/LightGBM/tree.h` — Categorical split storage and decision
`xgboost/include/xgboost/tree_model.h` — Categorical split in XGBoost