Storage Format Research#
This document evaluates existing tree ensemble serialization formats and their suitability for booste-rs.
Treelite Evaluation#
Treelite overview#
Treelite is a universal tree ensemble format designed for model deployment. It provides a common representation that can import from XGBoost, LightGBM, and scikit-learn.
Capabilities#
Supported Import Sources:
XGBoost (JSON, UBJSON, legacy binary)
LightGBM (text format)
scikit-learn (RandomForest, GradientBoosting, HistGradientBoosting, etc.)
Supported Model Features:
Tree structure (numerical and categorical splits)
Multiple output groups (multiclass classification)
Post-processors (sigmoid, softmax, exponential_standard_ratio)
Base scores
Tree statistics (hessian sums, data counts, gain)
Comparison operators (LT, LE, GT, GE, EQ)
Treelite v4 Format Structure:
Header:
- major_ver, minor_ver, patch_ver
- threshold_type (float32 or float64)
- leaf_output_type (float32 or float64)
- num_tree
Header 2:
- num_feature
- task_type: kBinaryClf, kRegressor, kMultiClf, kLearningToRank, kIsolationForest
- num_target, num_class, leaf_vector_shape
- target_id, class_id per tree
- postprocessor, sigmoid_alpha, ratio_c, base_scores
Per Tree:
- num_nodes, has_categorical_split
- node_type, cleft, cright, split_index
- default_left, leaf_value, threshold, cmp
- category_list (for categorical splits)
- leaf_vector (for multi-output leaves)
- data_count, sum_hess, gain (optional statistics)
Critical Limitations for booste-rs#
After thorough evaluation, Treelite has fundamental limitations that prevent its use as our primary format:
1. No GBLinear Support#
Treelite is fundamentally a tree-only format. XGBoost’s gblinear booster stores a weight matrix and bias vector—not trees. There is no mechanism in Treelite to represent:
GBLinear Model:
weights: [num_features × num_groups]
bias: [num_groups]
The TaskType enum only includes tree-based tasks:
kBinaryClf,kRegressor,kMultiClf,kLearningToRank,kIsolationForest
No kLinear or similar exists.
2. No Linear Leaves Support#
LightGBM supports linear_tree=True which fits a linear model in each leaf:
// From LightGBM tree.h
bool is_linear_;
std::vector<std::vector<double>> leaf_coeff_; // coefficients per leaf
std::vector<double> leaf_const_; // intercept per leaf
std::vector<std::vector<int>> leaf_features_; // features used per leaf
Prediction with linear leaves:
double output = leaf_const_[leaf];
for (size_t i = 0; i < leaf_features_[leaf].size(); ++i) {
output += leaf_coeff_[leaf][i] * feature_values[leaf_features_[leaf][i]];
}
Treelite’s leaf representation is limited to:
leaf_value: scalar floatleaf_vector: array of floats (for multi-output)
There is no provision for linear coefficients, feature indices, or intercepts within leaves.
3. DART Weights Not Preserved#
While Treelite can store DART trees structurally, the per-tree dropout weights used during DART prediction are not part of the format. DART requires:
struct DartModel {
trees: Vec<Tree>,
tree_weights: Vec<f32>, // Not in Treelite
}
Treelite as Export Target#
Despite limitations for our primary format, Treelite could serve as an export-only target for standard GBDT models:
Benefits:
Interoperability with TL2cgen (compiled model deployment)
ONNX conversion via Treelite ecosystem
Common format for model sharing
Constraints:
Only for models without linear leaves
Only for GBDT, not GBLinear
Export-only (not for loading back into booste-rs)
Recommendation#
Do not use Treelite as the primary storage format for booste-rs.
Instead:
Define a native format that supports all booste-rs features
Consider Treelite export as a future interoperability feature for standard GBDT models
XGBoost JSON Format#
XGBoost overview#
XGBoost uses JSON (and UBJSON for binary efficiency) as its primary model format since v1.0.
XGBoost structure#
{
"version": [2, 1, 0],
"learner": {
"learner_model_param": {
"base_score": "0.5",
"num_feature": "10",
"num_class": "0"
},
"gradient_booster": {
"name": "gbtree",
"model": {
"gbtree_model_param": { "num_trees": "100" },
"trees": [
{
"tree_param": { "num_nodes": "15" },
"split_indices": [...],
"split_conditions": [...],
"left_children": [...],
"right_children": [...],
"default_left": [...],
"categories": [...], // optional
"split_type": [...] // 0=numerical, 1=categorical
}
]
}
},
"objective": { "name": "binary:logistic", ... }
}
}
XGBoost considerations#
Pros:
Human readable
Well documented
Supports GBLinear (different structure)
Supports DART (with tree weights)
Cons:
Verbose (large file sizes)
Frequent format changes between XGBoost versions
Quirks (base_score as string, array, or bracketed string)
No linear leaves (XGBoost doesn’t support them)
LightGBM Text Format#
LightGBM overview#
LightGBM uses a line-based text format for model serialization.
LightGBM structure#
tree
version=v4
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=9
objective=binary sigmoid:1
parameters:
[boosting: gbdt]
...
end of parameters
Tree=0
num_leaves=31
num_cat=0
split_feature=3 7 2 ...
threshold=0.5 1.2 0.8 ...
decision_type=2 2 2 ...
left_child=-1 -2 0 ...
right_child=1 2 -3 ...
leaf_value=0.1 -0.2 0.05 ...
Tree=1
...
Linear Tree Extension#
is_linear=1
num_features_per_leaf=3 2 4 ...
leaf_features=0:2:5 1:3 0:1:2:4 ...
leaf_coeff=0.1:0.2:-0.1 0.3:0.4 ...
leaf_const=0.5 0.3 0.1 ...
LightGBM considerations#
Pros:
Supports linear trees
Relatively stable format
Human readable
Cons:
Text parsing overhead
No GBLinear equivalent
Format quirks (threshold adjustment for
<=vs<)
Conclusion#
Neither XGBoost nor LightGBM formats are ideal for booste-rs:
Feature |
XGBoost JSON |
LightGBM Text |
Treelite v4 |
|---|---|---|---|
GBLinear |
✅ |
❌ |
❌ |
Linear leaves |
❌ |
✅ |
❌ |
DART weights |
✅ |
❌ |
❌ |
Categorical |
✅ |
✅ |
✅ |
Human readable |
✅ |
✅ |
❌ |
Compact |
❌ |
❌ |
✅ |
Recommendation: Define a native booste-rs format that unifies all features.