RFC-0012: Categorical Features#
Status: Implemented
Created: 2025-12-20
Updated: 2026-01-02
Scope: Categorical feature handling in GBDT
Summary#
Categorical features use specialized split strategies: one-hot for low cardinality, sorted partition for high cardinality. Categories are stored as bitsets during training and in tree nodes.
Why Native Categoricals?#
One-hot encoding high-cardinality categoricals creates:
Many sparse features
Fragmented splits
Memory overhead
Native categorical handling:
Single split can partition multiple categories
More expressive partitions
Direct encoding in tree
Split Strategies#
One-Hot (Low Cardinality)#
For ≤ max_onehot_cats categories (default: 4):
```text
Category histogram: [A=100, B=50, C=75, D=25]

Try each as singleton:
  {A} vs {B,C,D} → gain_A
  {B} vs {A,C,D} → gain_B
  ...

Pick best singleton.
```
Simple, O(K) comparisons, works well for few categories.
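As a concrete illustration, here is a minimal sketch of the singleton scan, assuming a per-category histogram of (grad_sum, hess_sum) pairs and the usual second-order gain with L2 term lambda; `GradPair` and `best_onehot_split` are illustrative names, not the crate's actual API:

```rust
#[derive(Clone, Copy)]
struct GradPair {
    grad: f64,
    hess: f64,
}

/// Try each category as the singleton left set; return (category, gain) of the best.
fn best_onehot_split(hist: &[GradPair], lambda: f64) -> Option<(usize, f64)> {
    let total = hist.iter().fold(GradPair { grad: 0.0, hess: 0.0 }, |acc, p| GradPair {
        grad: acc.grad + p.grad,
        hess: acc.hess + p.hess,
    });
    let score = |g: f64, h: f64| g * g / (h + lambda);
    let parent = score(total.grad, total.hess);
    let mut best: Option<(usize, f64)> = None;
    for (cat, p) in hist.iter().enumerate() {
        // {cat} goes left, all other categories go right.
        let gain =
            score(p.grad, p.hess) + score(total.grad - p.grad, total.hess - p.hess) - parent;
        if best.map_or(true, |(_, g)| gain > g) {
            best = Some((cat, gain));
        }
    }
    best
}
```

A real finder would also apply minimum-gain and minimum-hessian checks before accepting the split.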
Sorted Partition (High Cardinality)#
For > max_onehot_cats categories:
Compute gradient ratio per category:
grad_sum / hess_sum
Sort categories by ratio
Scan for optimal partition point
```text
Sorted by ratio: [D, B, C, A]

Scan partitions:
  {D} vs {B,C,A}   → gain_1
  {D,B} vs {C,A}   → gain_2
  {D,B,C} vs {A}   → gain_3

Pick best partition.
```
O(K log K) from sorting. Higher-gradient categories go right (convention).
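The scan can be sketched as follows, reusing `GradPair` and the gain convention from the one-hot sketch; `best_sorted_split` and the 1e-12 hessian floor are illustrative choices:

```rust
/// Sort categories by grad/hess ratio, then scan prefix partitions.
/// Returns the categories in the best left set, plus the split gain.
fn best_sorted_split(hist: &[GradPair], lambda: f64) -> Option<(Vec<usize>, f64)> {
    if hist.len() < 2 {
        return None;
    }
    let mut order: Vec<usize> = (0..hist.len()).collect();
    // Ascending ratio: low-gradient categories accumulate on the left,
    // so high-gradient categories end up on the right (per the convention above).
    order.sort_by(|&a, &b| {
        let ra = hist[a].grad / hist[a].hess.max(1e-12);
        let rb = hist[b].grad / hist[b].hess.max(1e-12);
        ra.partial_cmp(&rb).unwrap()
    });
    let total = hist.iter().fold(GradPair { grad: 0.0, hess: 0.0 }, |acc, p| GradPair {
        grad: acc.grad + p.grad,
        hess: acc.hess + p.hess,
    });
    let score = |g: f64, h: f64| g * g / (h + lambda);
    let parent = score(total.grad, total.hess);
    let (mut gl, mut hl) = (0.0, 0.0);
    let mut best: Option<(usize, f64)> = None; // (prefix length, gain)
    for k in 0..hist.len() - 1 {
        // Leave at least one category on the right.
        let p = hist[order[k]];
        gl += p.grad;
        hl += p.hess;
        let gain = score(gl, hl) + score(total.grad - gl, total.hess - hl) - parent;
        if best.map_or(true, |(_, g)| gain > g) {
            best = Some((k + 1, gain));
        }
    }
    best.map(|(len, gain)| (order[..len].to_vec(), gain))
}
```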
CatBitset#
Compact storage for category sets:
```rust
pub struct CatBitset {
    bits: u64,                    // Inline: categories 0..63
    overflow: Option<Box<[u64]>>, // Heap: categories 64+
}

impl CatBitset {
    pub fn contains(&self, cat: u32) -> bool;
    pub fn insert(&mut self, cat: u32);
    pub fn count(&self) -> u32;
    pub fn iter(&self) -> impl Iterator<Item = u32>;
}
```
Why inline first 64?
Most categoricals have < 64 categories
Avoids allocation in common case
Fast bitwise operations
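The declared operations reduce to word/bit arithmetic. A possible implementation of `contains` and `insert`, assuming overflow word `i` covers categories 64 + 64·i through 64 + 64·(i+1) - 1 (a sketch only, not necessarily the crate's exact layout):

```rust
impl CatBitset {
    pub fn contains(&self, cat: u32) -> bool {
        if cat < 64 {
            self.bits >> cat & 1 == 1
        } else {
            let word = ((cat - 64) / 64) as usize;
            let bit = (cat - 64) % 64;
            self.overflow
                .as_deref()
                .and_then(|ov| ov.get(word))
                .map_or(false, |w| *w >> bit & 1 == 1)
        }
    }

    pub fn insert(&mut self, cat: u32) {
        if cat < 64 {
            self.bits |= 1u64 << cat;
        } else {
            let word = ((cat - 64) / 64) as usize;
            let bit = (cat - 64) % 64;
            // Grow the overflow slice on demand; allocation only happens above 63.
            let mut words = self.overflow.take().map(Vec::from).unwrap_or_default();
            if words.len() <= word {
                words.resize(word + 1, 0);
            }
            words[word] |= 1u64 << bit;
            self.overflow = Some(words.into_boxed_slice());
        }
    }
}
```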
Split Types#
```rust
pub enum SplitType {
    Numerical { bin: u16 },
    Categorical { left_cats: CatBitset },
}

pub struct SplitInfo {
    pub feature: u32,
    pub gain: f32,
    pub default_left: bool,
    pub split_type: SplitType,
}
```
Numerical: samples with bin ≤ threshold go left.
Categorical: samples with category in left_cats go left.
Tree Traversal#
At a categorical split node:
```rust
fn traverse_categorical(sample: &[f32], feature: usize, left_cats: &CatBitset) -> bool {
    let category = sample[feature] as u32;
    left_cats.contains(category) // true = go left
}
```
Missing values follow default_left.
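Combined with numerical splits, per-node routing might look like the sketch below, assuming feature values arrive pre-binned as f32 with NaN marking missing; `go_left` is an illustrative name:

```rust
fn go_left(sample: &[f32], split: &SplitInfo) -> bool {
    let value = sample[split.feature as usize];
    if value.is_nan() {
        return split.default_left; // missing: follow the learned default direction
    }
    match &split.split_type {
        SplitType::Numerical { bin } => (value as u16) <= *bin, // bin ≤ threshold → left
        SplitType::Categorical { left_cats } => left_cats.contains(value as u32),
    }
}
```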
Training Integration#
Histogram building treats categorical bins as category indices:
```text
Feature bins: [0, 2, 1, 0, 2]  (categories 0, 1, 2)

Histogram:
  bin 0: (grad_sum=1.2, hess_sum=3.0)
  bin 1: (grad_sum=0.5, hess_sum=1.5)
  bin 2: (grad_sum=0.8, hess_sum=2.0)
```
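A minimal sketch of the accumulation loop, reusing `GradPair` from the split-finding sketches; `build_histogram` is an illustrative name:

```rust
/// Accumulate per-bin gradient statistics.
fn build_histogram(bins: &[u16], grad: &[f32], hess: &[f32], n_bins: usize) -> Vec<GradPair> {
    let mut hist = vec![GradPair { grad: 0.0, hess: 0.0 }; n_bins];
    for i in 0..bins.len() {
        let b = bins[i] as usize; // for categoricals, the bin index IS the category index
        hist[b].grad += grad[i] as f64;
        hist[b].hess += hess[i] as f64;
    }
    hist
}
```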
Split finder detects categorical feature (via feature metadata) and uses appropriate strategy.
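Tying the strategies together, the dispatch is a cardinality check. A sketch using the helper names from the earlier blocks, with max_onehot_cats defaulting to 4 as described above:

```rust
/// Choose a categorical split strategy by cardinality (sketch).
fn find_categorical_split(
    hist: &[GradPair],
    lambda: f64,
    max_onehot_cats: usize, // default 4, per the one-hot section
) -> Option<(Vec<usize>, f64)> {
    if hist.len() <= max_onehot_cats {
        best_onehot_split(hist, lambda).map(|(cat, gain)| (vec![cat], gain))
    } else {
        best_sorted_split(hist, lambda)
    }
}
```

The output of `build_histogram` above feeds directly into this dispatch.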
Feature Metadata#
```rust
// In BinnedDataset
is_categorical: Vec<bool>, // Per-feature flag
```
Set during binning based on input feature type.
Files#
| Path | Contents |
|---|---|
|  | Tree category storage |
Design Decisions#
DD-1: Sorted partition for high cardinality. One-hot singleton splits lose expressiveness above ~10 categories (they can only carve one category off at a time), and brute-force enumeration of all 2^(K-1) binary partitions is infeasible. Sorted partition is O(K log K), and for second-order gains its prefix scan contains the optimal partition (Fisher's classical grouping result), so it finds better partitions than singletons without resorting to brute force.
DD-2: Higher gradient → right. Convention matches XGBoost. Sorted partition naturally produces right-going sets when sorting ascending by gradient ratio.
DD-3: Categories in bitset go LEFT. left_cats contains categories that route samples left. Matches how numerical splits use “≤ threshold → left”.
DD-4: Max categories ~1000. Sorted partition scratch buffer limits practical cardinality. Beyond 1000, consider hashing or embedding.
DD-5: Missing separate from categories. Missing values are handled by default_left, not as a special category bin. Keeps histogram logic simple.
Unseen Categories#
At prediction time, categories not seen during training are treated as missing:
Follow default_left direction at categorical splits
No error raised; graceful fallback
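Note that a bare bitset lookup would send an unseen category right (it is absent from left_cats), which is not the same as following default_left, so the router needs to know the trained category range. An illustrative sketch, where `n_trained_cats` is a hypothetical field:

```rust
/// Route a categorical value, treating out-of-vocabulary categories as missing.
fn route_categorical(
    cat: u32,
    n_trained_cats: u32, // number of categories observed during training
    left_cats: &CatBitset,
    default_left: bool,
) -> bool {
    if cat >= n_trained_cats {
        default_left // unseen at training time: same path as missing
    } else {
        left_cats.contains(cat)
    }
}
```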
High Cardinality Limits#
DD-4 limits practical cardinality to ~1000:
CatBitset memory: 64 bits (8 bytes) inline + 8 bytes per 64 categories above 64
Sorted partition scratch: ~16 bytes per category
Beyond 1000 categories: consider target encoding, embedding, or hashing
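For example, at K = 1000 the inline word covers the first 64 categories, the remaining 936 need ceil(936 / 64) = 15 overflow words (120 bytes) per bitset, and the sort scratch is on the order of 16 KB.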
If exceeded, training continues but may be slow and memory-intensive.
Feature Marking#
Mark features as categorical during Dataset construction:
```rust
// Via builder
let dataset = Dataset::builder()
    .add_feature("age", age_values.view())         // numeric by default
    .add_categorical("color", color_values.view()) // categorical
    .build()?;

// Via schema
let mut schema = DatasetSchema::default();
schema.set_feature_type(2, FeatureType::Categorical);
```
Testing Strategy#
| Category | Tests |
|---|---|
| One-hot splits | Each category as singleton works correctly |
| Sorted partition | Optimal partition found for known distributions |
| Bitset operations | Insert, contains, iterate correctness |
| Unseen categories | Prediction handles gracefully |
| High cardinality | 500+ categories don’t crash |