Categorical Feature Handling#

Overview#

Categorical features (discrete values without natural ordering) require special split mechanisms because threshold-based splits are meaningless:

Numerical feature (ordered): \(\text{Split: } x_j \le t \rightarrow \text{left, otherwise right}\)

Categorical feature (unordered): \(\text{Split: } x_j \in S \rightarrow \text{left, otherwise right}\)

where \(S \subset \{\text{all categories}\}\) is a subset of categories.

The Combinatorial Challenge#

For a categorical feature with \(k\) categories, there are \(2^{k-1} - 1\) possible binary partitions (non-empty, non-trivial subsets up to symmetry).

| Categories | Possible Partitions |
|---|---|
| 3 | 3 |
| 5 | 15 |
| 10 | 511 |
| 20 | 524,287 |
| 100 | ~\(10^{29}\) |

Exhaustive search is only feasible for small \(k\). Larger cardinalities require approximation algorithms.
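
These counts are easy to verify for small \(k\): fixing one category on the left side removes the left/right symmetry, so enumerating the subsets that contain it yields each partition exactly once. A short Python sketch (an illustrative helper, not part of any library):

from itertools import combinations

def binary_partitions(categories):
    """Yield each non-trivial binary partition (left, right) exactly once."""
    cats = list(categories)
    anchor, rest = cats[0], cats[1:]       # fixing the anchor removes mirror duplicates
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            right = set(cats) - left
            if right:                      # skip the partition with an empty side
                yield left, right

for k in (3, 5, 10):
    print(k, sum(1 for _ in binary_partitions(range(k))))  # 3 3, 5 15, 10 511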

Approaches to Categorical Splits#

Approach 1: One-Hot Encoding (External Preprocessing)#

Convert each category to a binary feature:

Original: color ∈ {red, blue, green}

One-hot encoded:
  color_red:   [1, 0, 0] for red
  color_blue:  [0, 1, 0] for blue  
  color_green: [0, 0, 1] for green
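
A minimal pandas sketch of this preprocessing (column and prefix names are illustrative):

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# Expand the single categorical column into one indicator column per category
one_hot = pd.get_dummies(df, columns=["color"], prefix="color")
print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']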

Advantages:

  • Works with any tree implementation (no special support needed)

  • Each split tests one category (“is color = red?”)

  • Simple and well-understood

Disadvantages:

  • Feature explosion: k categories → k features

  • Memory inefficient for high cardinality

  • Tree depth increases (multiple splits needed to test subsets)

  • Loses relationships between categories

Approach 2: Native Categorical Splits#

Trees directly support partition-based splits:

Split: color ∈ {red, green} → left, {blue} → right

Advantages:

  • No feature expansion

  • More expressive splits (arbitrary subsets in one split)

  • Compact model representation

  • Can capture “similar category” relationships

Disadvantages:

  • Exponential number of candidate partitions (\(2^{k-1} - 1\))

  • Requires specialized algorithms

  • Additional storage for category sets

Split Finding Algorithms#

Algorithm 1: Exhaustive Search (Small Cardinality)#

For a small number of categories (roughly k ≤ 8), enumerate all possible binary partitions:

ALGORITHM: ExhaustiveCategoricalSplit(categories, gradients, hessians)
----------------------------------------------------------------------
1. k <- |categories|
2. best_gain <- -infinity
3. best_partition <- null
4. 
5. // Enumerate all non-empty subsets (up to complement symmetry)
6. FOR subset IN all_subsets(categories) WHERE 0 < |subset| < k:
7.     // Compute gradient/hessian sums for left (subset) and right (complement)
8.     G_left, H_left <- sum_stats(subset, gradients, hessians)
9.     G_right, H_right <- sum_stats(complement, gradients, hessians)
10.    
11.    gain <- compute_split_gain(G_left, H_left, G_right, H_right)
12.    
13.    IF gain > best_gain:
14.        best_gain <- gain
15.        best_partition <- subset
16.
17. RETURN best_partition, best_gain

Complexity: \(O(2^k)\) — exponential, only feasible for small k.
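
A runnable Python version of this search is sketched below; the gain function is the standard second-order split gain with an L2 term lam (all names are illustrative, not library code):

from itertools import combinations

def split_gain(G_l, H_l, G_r, H_r, lam=1.0):
    """Second-order split gain: children's scores minus the parent's score."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_l, H_l) + score(G_r, H_r) - score(G_l + G_r, H_l + H_r))

def exhaustive_categorical_split(cat_values, gradients, hessians, lam=1.0):
    """cat_values[i] is the category of sample i; gradients/hessians are per-sample."""
    # Per-category gradient and hessian sums
    G, H = {}, {}
    for c, g, h in zip(cat_values, gradients, hessians):
        G[c] = G.get(c, 0.0) + g
        H[c] = H.get(c, 0.0) + h

    cats = sorted(G)
    anchor, rest = cats[0], cats[1:]       # fix one category to avoid mirror duplicates
    G_tot, H_tot = sum(G.values()), sum(H.values())

    best_gain, best_left = float("-inf"), None
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {anchor, *combo}
            if len(left) == len(cats):     # skip the trivial full/empty partition
                continue
            G_l = sum(G[c] for c in left)
            H_l = sum(H[c] for c in left)
            gain = split_gain(G_l, H_l, G_tot - G_l, H_tot - H_l, lam)
            if gain > best_gain:
                best_gain, best_left = gain, left
    return best_left, best_gain

# Toy data: categories 'a' and 'b' push one way, 'c' the other
print(exhaustive_categorical_split(
    ["a", "a", "b", "c", "c"], [1.0, 0.8, 0.9, -1.1, -1.0], [1.0] * 5))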

Algorithm 2: One-Hot Strategy (LightGBM)#

For k ≤ max_cat_to_onehot (default 4), test each category individually:

ALGORITHM: OneHotCategoricalSplit(categories, gradients, hessians)
------------------------------------------------------------------
1. best_gain <- -infinity
2. best_category <- null
3. 
4. FOR c IN categories:
5.     // Compute gain for splitting {c} vs {everything else}
6.     G_c, H_c <- sum_stats({c}, gradients, hessians)
7.     G_rest, H_rest <- sum_stats(categories - {c}, gradients, hessians)
8.     
9.     gain <- compute_split_gain(G_c, H_c, G_rest, H_rest)
10.    
11.    IF gain > best_gain:
12.        best_gain <- gain
13.        best_category <- c
14.
15. RETURN {best_category}, best_gain

This finds the single best category to isolate but won’t find multi-category partitions like {red, green} vs {blue}.

Complexity: O(k) per split—linear in number of categories.
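
The same strategy as a compact Python sketch (it uses the same second-order gain as in the exhaustive example; names are illustrative):

def one_hot_categorical_split(cat_values, gradients, hessians, lam=1.0):
    """Only test partitions of the form {c} vs. {all other categories}."""
    def score(G, H):
        return G * G / (H + lam)

    G, H = {}, {}
    for c, g, h in zip(cat_values, gradients, hessians):
        G[c] = G.get(c, 0.0) + g
        H[c] = H.get(c, 0.0) + h
    G_tot, H_tot = sum(G.values()), sum(H.values())

    best_gain, best_cat = float("-inf"), None
    for c in G:
        gain = 0.5 * (score(G[c], H[c])
                      + score(G_tot - G[c], H_tot - H[c])
                      - score(G_tot, H_tot))
        if gain > best_gain:
            best_gain, best_cat = gain, c
    return {best_cat}, best_gain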

Algorithm 3: Gradient-Sorted Strategy (LightGBM)#

For k > max_cat_to_onehot, LightGBM uses a clever approximation based on a classic result:

Key Insight (Fisher, 1958): For squared error loss, the optimal binary partition can be found by:

  1. Sort categories by their mean target value

  2. Test only the k - 1 split points in this sorted order

For gradient boosting, we use the gradient/hessian ratio as a proxy for “mean target”:

ALGORITHM: GradientSortedCategoricalSplit(categories, gradients, hessians)
--------------------------------------------------------------------------
1. // Step 1: Compute per-category gradient ratio
2. FOR c IN categories:
3.     G_sum[c] <- sum(gradients[i] for i where x[i] = c)
4.     H_sum[c] <- sum(hessians[i] for i where x[i] = c)
5.     ratio[c] <- G_sum[c] / (H_sum[c] + smoothing)  // smoothing prevents division by zero
6. 
7. // Step 2: Sort categories by ratio
8. sorted_cats <- SORT(categories, key=ratio)
9. 
10. // Step 3: Find optimal split point in sorted order
11. best_gain <- -infinity
12. G_left, H_left <- 0, 0
13. G_total, H_total <- sum(gradients), sum(hessians)
14.
15. FOR i FROM 1 TO k-1:
16.    c <- sorted_cats[i-1]
17.    G_left += G_sum[c]
18.    H_left += H_sum[c]
19.    G_right <- G_total - G_left
20.    H_right <- H_total - H_left
21.    
22.    gain <- compute_split_gain(G_left, H_left, G_right, H_right)
23.    
24.    IF gain > best_gain:
25.        best_gain <- gain
26.        best_split_point <- i
27.
28. RETURN sorted_cats[0:best_split_point], best_gain

Why This Works: Categories with similar gradient/hessian ratios have similar optimal leaf values and should be grouped together. Sorting by ratio orders categories by their “effect direction,” and the optimal partition is contiguous in this ordering.

Complexity: O(k log k) for sorting + O(k) for scanning = O(k log k)

References: LightGBM/src/treelearner/feature_histogram.cpp; Fisher, W.D. (1958), “On Grouping for Maximum Homogeneity”
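
A Python sketch of the gradient-sorted scan (this mirrors the pseudocode above rather than LightGBM's actual implementation, which adds the cat_l2, cat_smooth and max_cat_threshold refinements described later):

def gradient_sorted_categorical_split(cat_values, gradients, hessians,
                                      lam=1.0, smoothing=10.0):
    """Sort categories by G/(H + smoothing), then scan the k-1 prefix splits."""
    def score(G, H):
        return G * G / (H + lam)

    G, H = {}, {}
    for c, g, h in zip(cat_values, gradients, hessians):
        G[c] = G.get(c, 0.0) + g
        H[c] = H.get(c, 0.0) + h

    # Step 1 + 2: order categories by their smoothed gradient/hessian ratio
    sorted_cats = sorted(G, key=lambda c: G[c] / (H[c] + smoothing))
    G_tot, H_tot = sum(G.values()), sum(H.values())

    # Step 3: prefix scan over the sorted order
    best_gain, best_i = float("-inf"), 0
    G_left = H_left = 0.0
    for i, c in enumerate(sorted_cats[:-1], start=1):
        G_left += G[c]
        H_left += H[c]
        gain = 0.5 * (score(G_left, H_left)
                      + score(G_tot - G_left, H_tot - H_left)
                      - score(G_tot, H_tot))
        if gain > best_gain:
            best_gain, best_i = gain, i
    return set(sorted_cats[:best_i]), best_gain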

Algorithm Comparison#

| Algorithm | Complexity | Optimality | Use When |
|---|---|---|---|
| Exhaustive | O(2^k) | Optimal | k ≤ 4 |
| One-hot | O(k) | Suboptimal | Finding single best category |
| Gradient-sorted | O(k log k) | Near-optimal for squared loss | k > 4 |
Storage: Bitsets for Category Sets#

Representing Partitions#

Store the “goes left” category set as a bitset:

Categories: {0: apple, 1: banana, 2: cherry, 3: date}
Split: {apple, cherry} go left

Bitset representation: 0b0101 = 5
                         ||||
                         |||+-- apple (0): 1 = goes left
                         ||+--- banana (1): 0 = goes right
                         |+---- cherry (2): 1 = goes left
                         +----- date (3): 0 = goes right

Decision: IF bitset & (1 << category) THEN left ELSE right
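
A small Python sketch of building and testing such a bitset (category codes as in the example above):

def make_bitset(left_categories):
    """Pack the integer codes of the 'goes left' categories into one bitset."""
    bits = 0
    for c in left_categories:
        bits |= 1 << c
    return bits

def goes_left(bitset, category):
    return (bitset >> category) & 1 == 1

codes = {"apple": 0, "banana": 1, "cherry": 2, "date": 3}
bitset = make_bitset([codes["apple"], codes["cherry"]])
print(bin(bitset))                         # 0b101 (= 5): bits 0 and 2 set
print(goes_left(bitset, codes["banana"]))  # False: banana goes right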

Compact Bitset Storage#

Pack bits into machine words:

For up to 64 categories: 1 uint64 word
For up to 128 categories: 2 uint64 words
For k categories: ceil(k / 64) uint64 words

Storage per split node: ceil(max_category / 64) * 8 bytes

Variable-Size Storage#

Different splits may have different cardinalities. Use CSR-like storage:

Categorical split storage:
  data:    [word0, word1, word2, word3, ...]  // All bitsets packed
  offsets: [0, 1, 3, 3, 5, ...]               // Start index per node
  sizes:   [1, 2, 0, 2, ...]                  // Words per node (0 = numerical split)

Access bitset for node i:
  start <- offsets[i]
  len <- sizes[i]
  bitset <- data[start : start + len]
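
A Python sketch of this layout (plain Python integers stand in for the packed uint64 words; the class and field names are illustrative):

class CategoryBitsetStore:
    """CSR-like storage: a variable-length run of 64-bit words per split node."""

    def __init__(self):
        self.data = []      # all bitset words, concatenated
        self.offsets = []   # start index into data, per node
        self.sizes = []     # word count per node (0 = numerical split)

    def add_node(self, left_categories=None):
        self.offsets.append(len(self.data))
        if not left_categories:            # numerical split: no bitset stored
            self.sizes.append(0)
            return
        n_words = max(left_categories) // 64 + 1
        words = [0] * n_words
        for c in left_categories:
            words[c // 64] |= 1 << (c % 64)
        self.sizes.append(n_words)
        self.data.extend(words)

    def goes_left(self, node, category):
        start, n_words = self.offsets[node], self.sizes[node]
        word = category // 64
        if word >= n_words:                # category outside the stored range
            return False
        return (self.data[start + word] >> (category % 64)) & 1 == 1

store = CategoryBitsetStore()
store.add_node([0, 2, 70])    # categorical split whose bitset spans two words
store.add_node()              # numerical split node, no bitset
print(store.goes_left(0, 70), store.goes_left(0, 1))  # True False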

Memory Requirements#

| Max Categories | Words per Split | Bytes per Split |
|---|---|---|
| 64 | 1 | 8 |
| 256 | 4 | 32 |
| 1,024 | 16 | 128 |
| 10,000 | 157 | 1,256 |

For high-cardinality features, consider:

  • Limiting to top-k most frequent categories

  • Hashing to fixed range

  • Rejecting categorical split if cardinality exceeds threshold

Decision Logic#

Traversal with Categorical Splits#

ALGORITHM: TraverseWithCategorical(node, features, tree)
--------------------------------------------------------
1. WHILE node is not leaf:
2.     feat_idx <- tree.split_feature[node]
3.     fval <- features[feat_idx]
4.     
5.     IF tree.is_categorical[node]:
6.         // Categorical decision: check bitset membership
7.         category <- INT(fval)
8.         bitset <- tree.get_category_bitset(node)
9.         
10.        IF bitset.contains(category):
11.            node <- tree.left_child[node]
12.        ELSE:
13.            node <- tree.right_child[node]
14.    ELSE:
15.        // Numerical decision: threshold comparison
16.        IF fval <= tree.threshold[node]:
17.            node <- tree.left_child[node]
18.        ELSE:
19.            node <- tree.right_child[node]
20.
21. RETURN tree.leaf_value[node]
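
The same logic as a runnable Python sketch over a flattened array-of-nodes layout (the layout is illustrative, not any particular library's format):

def predict_one(tree, features):
    """Walk a single tree stored as a dict of parallel arrays indexed by node id."""
    node = 0
    while tree["left_child"][node] != -1:          # -1 marks a leaf
        fval = features[tree["split_feature"][node]]
        if tree["is_categorical"][node]:
            category = int(fval)
            go_left = (tree["cat_bitset"][node] >> category) & 1 == 1
        else:
            go_left = fval <= tree["threshold"][node]
        node = tree["left_child"][node] if go_left else tree["right_child"][node]
    return tree["leaf_value"][node]

# Tiny tree: the root splits on categorical feature 0, with categories {0, 2} going left
toy_tree = {
    "split_feature":  [0, -1, -1],
    "is_categorical": [True, False, False],
    "cat_bitset":     [0b101, 0, 0],
    "threshold":      [0.0, 0.0, 0.0],
    "left_child":     [1, -1, -1],
    "right_child":    [2, -1, -1],
    "leaf_value":     [0.0, -0.5, 0.7],
}
print(predict_one(toy_tree, [2.0]))   # category 2 is in the bitset -> left leaf, -0.5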

Decision Type Encoding (LightGBM)#

LightGBM packs split metadata into a single byte:

decision_type byte layout:
  Bit 0: Categorical flag     (0=numerical, 1=categorical)
  Bit 1: Default left flag    (0=missing→right, 1=missing→left)
  Bit 2-3: Missing type       (00=None, 01=Zero, 10=NaN)

Example: decision_type = 0b00000101 = 5
  - Bit 0: 1 → categorical split
  - Bit 1: 0 → missing values go right
  - Bit 2-3: 01 → treat 0.0 as missing

Reference: LightGBM/include/LightGBM/tree.h, kCategoricalMask, kDefaultLeftMask
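
A sketch of decoding such a byte with bit masks (the constants mirror the layout above; this is an illustration, not LightGBM's exact code):

CATEGORICAL_MASK = 0x01    # bit 0
DEFAULT_LEFT_MASK = 0x02   # bit 1
MISSING_TYPE_NAMES = {0b00: "None", 0b01: "Zero", 0b10: "NaN"}

def decode_decision_type(byte):
    return {
        "categorical": bool(byte & CATEGORICAL_MASK),
        "default_left": bool(byte & DEFAULT_LEFT_MASK),
        "missing_type": MISSING_TYPE_NAMES[(byte >> 2) & 0b11],
    }

print(decode_decision_type(0b00000101))
# {'categorical': True, 'default_left': False, 'missing_type': 'Zero'}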

Regularization for Categorical Splits#

Categorical splits are prone to overfitting, especially with high cardinality (can perfectly separate small groups). Apply extra regularization:

Additional L2 Penalty#

Standard leaf value: w = -G / (H + lambda)

With categorical penalty: w = -G / (H + lambda + cat_l2)

LightGBM default: cat_l2 = 10.0

Smoothing in Ratio Computation#

Prevent extreme ratios for rare categories:

Standard ratio: ratio = G / H
Smoothed ratio: ratio = G / (H + cat_smooth)

LightGBM default: cat_smooth = 10.0

Minimum Samples per Category#

Skip categories with insufficient data:

IF count[category] < min_data_per_group:
    exclude category from split finding

LightGBM default: min_data_per_group = 100

Maximum Categories per Split#

Limit complexity of category sets:

Only consider top max_cat_threshold categories by |G/H| magnitude

LightGBM default: max_cat_threshold = 32
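
A sketch of where each of these knobs enters the split search (parameter names follow the LightGBM options above; the helper functions themselves are illustrative):

def categorical_split_candidates(G, H, counts, cat_smooth=10.0,
                                 min_data_per_group=100, max_cat_threshold=32):
    """Filter and order per-category statistics before the gradient-sorted scan.

    G, H and counts map each category to its gradient sum, hessian sum and
    sample count.
    """
    # Skip categories with too little data
    cats = [c for c in G if counts[c] >= min_data_per_group]

    # Smoothed ratio tempers rare categories
    ratio = {c: G[c] / (H[c] + cat_smooth) for c in cats}

    # Keep only the most extreme categories, capped at max_cat_threshold
    cats = sorted(cats, key=lambda c: abs(ratio[c]), reverse=True)[:max_cat_threshold]

    # Order the survivors by ratio for the prefix scan
    return sorted(cats, key=lambda c: ratio[c])

def categorical_leaf_value(G_sum, H_sum, lam=1.0, cat_l2=10.0):
    """Leaf weight with the extra categorical L2 penalty."""
    return -G_sum / (H_sum + lam + cat_l2)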

When to Use Each Approach#

Use One-Hot Encoding When#

  • Cardinality is low (< 10 categories)

  • Using a library without native categorical support

  • Interpretability is important (one feature per category)

  • Categories are expected to have independent effects

Use Native Categorical When#

  • Cardinality is moderate (10-1000 categories)

  • Memory efficiency matters

  • Categories should be grouped (e.g., similar products, related locations)

  • Using LightGBM, XGBoost (with enable_categorical), or compatible implementation

Consider Alternatives When#

  • Very high cardinality (> 1000): Target encoding, embeddings

  • Ordinal relationship exists: Treat as numerical

  • Too few samples per category: Group rare categories into “other”

Encoding Consistency#

Critical: Training and inference must use identical category → integer mappings.

The Problem#

Training data categories: ["apple", "banana", "cherry"]
Mapping: apple→0, banana→1, cherry→2

New inference data: ["cherry", "date", "apple"]
If different mapping: cherry→0, date→1, apple→2  // WRONG!

The model learned that “0” means apple, but inference is sending cherry as 0.

Solution: Store Encoder with Model#

Model artifact should include:
  1. Tree ensemble (splits, leaves)
  2. Category encoder (category string → integer mapping per feature)
  3. Feature metadata (which features are categorical)

At inference:
  1. Load encoder
  2. Transform input categories using stored mapping
  3. Handle unknown categories (default direction or error)
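
A minimal sketch of persisting and reusing the mapping (file name and helper functions are hypothetical):

import json

def fit_encoder(values):
    """Assign a stable integer code to each category seen during training."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}

def transform(values, mapping, unknown_code=-1):
    """Map categories to codes; unseen categories get a sentinel code."""
    return [mapping.get(v, unknown_code) for v in values]

# Training time: fit the encoder and save it next to the model file
encoder = fit_encoder(["apple", "banana", "cherry"])
with open("model_categories.json", "w") as f:
    json.dump(encoder, f)

# Inference time: load and reuse the identical mapping
with open("model_categories.json") as f:
    encoder = json.load(f)
print(transform(["cherry", "date", "apple"], encoder))  # [2, -1, 0]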

Handling Unknown Categories#

Options for categories seen at inference but not during training:

| Strategy | Behavior | Use When |
|---|---|---|
| Default direction | Go right (or configured default) | Production systems, graceful degradation |
| Error | Reject input | Safety-critical, want to catch data issues |
| Map to “other” | Reserve special category during training | Expected unknown categories |

Library Comparison#

| Feature | LightGBM | XGBoost |
|---|---|---|
| Native support | Yes | Yes (newer versions, enable_categorical) |
| Split algorithm | Gradient-sorted | Partition-based |
| One-hot threshold | Configurable (max_cat_to_onehot=4) | Configurable (max_cat_to_onehot) |
| Max cardinality | Configurable | Limited |
| Bitset storage | Separate array | In-node |
| Linear tree compat | Yes | No |

LightGBM Configuration#

params = {
    'categorical_feature': [0, 3, 7],  # Column indices of categorical features
    'max_cat_to_onehot': 4,            # Below this, use one-hot-like search
    'cat_l2': 10.0,                    # Extra L2 regularization
    'cat_smooth': 10.0,                # Ratio smoothing
    'max_cat_threshold': 32,           # Max categories per split
    'min_data_per_group': 100,         # Min samples per category
}
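
A usage sketch with these parameters on a small synthetic dataset (the toy data and the lowered min_data_per_group are purely illustrative):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Column 0 is an integer-encoded categorical feature, column 1 is numerical
X = np.column_stack([rng.integers(0, 5, size=200), rng.normal(size=200)])
y = (X[:, 0] >= 3).astype(float) + 0.1 * rng.normal(size=200)

params = {
    "objective": "regression",
    "max_cat_to_onehot": 4,
    "cat_l2": 10.0,
    "cat_smooth": 10.0,
    "min_data_per_group": 10,   # lowered for this tiny dataset
    "verbose": -1,
}

# categorical_feature can also be passed here instead of in params
train_set = lgb.Dataset(X, label=y, categorical_feature=[0])
booster = lgb.train(params, train_set, num_boost_round=50)
preds = booster.predict(X)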

XGBoost Configuration#

# Enable categorical support
import xgboost as xgb

# data: e.g. a pandas DataFrame whose categorical columns use the 'category' dtype
dtrain = xgb.DMatrix(data, enable_categorical=True)

params = {
    'tree_method': 'hist',   # Required for categorical
    'max_cat_to_onehot': 4,  # One-hot threshold
}

Summary#

| Aspect | One-Hot Encoding | Native Categorical |
|---|---|---|
| Preprocessing | Required (k → k features) | Not needed |
| Memory | High (sparse binary features) | Low (single feature + bitsets) |
| Split expressiveness | Single category per split | Arbitrary subsets |
| Algorithm | Standard threshold | Gradient-sorted or exhaustive |
| Tree depth | Deeper (multiple splits for subset) | Shallower |
| Overfitting risk | Lower | Higher (needs regularization) |
| Implementation | Universal | Requires library support |

Key References#

  • Fisher, W.D. (1958), “On Grouping for Maximum Homogeneity” — Mathematical foundation

  • LightGBM/src/treelearner/feature_histogram.cpp — Gradient-sorted split finding

  • LightGBM/src/io/bin.cpp — Category binning

  • LightGBM/include/LightGBM/tree.h — Categorical split storage and decision

  • xgboost/include/xgboost/tree_model.h — Categorical split in XGBoost