Tree Storage Formats#
Overview#
Tree storage formats present a classic AoS vs SoA (Array of Structures vs Structure of Arrays) trade-off:
Format |
Structure |
Memory Pattern |
Mutation |
Cache Behavior |
|---|---|---|---|---|
AoS |
|
Per-node access |
Easy |
Good for single-node ops |
SoA |
|
Per-field access |
Harder |
Good for batch ops |
The fundamental insight:
Training needs frequent mutation (adding nodes, updating splits) → AoS is natural
Inference needs fast batch prediction (same operation on many samples) → SoA enables SIMD/coalescing
AoS: Array of Structures#
Concept#
Each node is a self-contained structure with all its fields:
Node 0: { parent=NULL, left=1, right=2, feature=5, threshold=0.5, leaf_value=N/A }
Node 1: { parent=0, left=NULL, right=NULL, feature=N/A, threshold=N/A, leaf_value=1.5 }
Node 2: { parent=0, left=3, right=4, feature=2, threshold=0.3, leaf_value=N/A }
...
Memory Layout#
Memory (contiguous nodes):
+--------+--------+--------+--------+
| Node 0 | Node 1 | Node 2 | Node 3 |
+--------+--------+--------+--------+
24B 24B 24B 24B
Each node:
+--------+-------+--------+---------+-----------+
| parent | left | right | feature | threshold |
| 4B | 4B | 4B | 4B | 4B |
+--------+-------+--------+---------+-----------+
XGBoost’s Training Tree (AoS)#
XGBoost uses AoS for tree construction:
// From xgboost/include/xgboost/tree_model.h
class RegTree::Node {
int32_t parent_; // Parent index (high bit: is_left_child)
int32_t cleft_; // Left child (kInvalidNodeId if leaf)
int32_t cright_; // Right child
uint32_t sindex_; // Split feature (high bit: default_left)
union {
float leaf_value; // If leaf node
float split_cond; // If internal node
} info_;
};
Note the union: internal nodes store split_cond (threshold), leaf nodes store
leaf_value. This saves memory but means you must check IsLeaf() before accessing.
Reference:
xgboost/include/xgboost/tree_model.h, classRegTree::Node
Why AoS for Training#
During training, trees are built incrementally:
1. Start with single leaf (root)
2. Find best split for current leaves
3. Split one leaf -> create two children
4. Repeat until stopping criterion
Each step modifies individual nodes:
Add new nodes to the array
Update parent pointers
Store split information
With AoS, this is natural—all node data is together:
ALGORITHM: SplitNode(tree, node_id, split)
------------------------------------------
1. left_id <- tree.add_node()
2. right_id <- tree.add_node()
3. tree.nodes[node_id].left <- left_id
4. tree.nodes[node_id].right <- right_id
5. tree.nodes[node_id].feature <- split.feature
6. tree.nodes[node_id].threshold <- split.threshold
7. tree.nodes[left_id].parent <- node_id
8. tree.nodes[right_id].parent <- node_id
SoA: Structure of Arrays#
Concept#
Separate arrays for each field, indexed by node ID:
left_child: [1, -1, 3, -1, -1] // -1 = leaf (no child)
right_child: [2, -1, 4, -1, -1]
split_feature: [5, -, 2, -, -] // - = N/A for leaves
threshold: [0.5, -, 0.3, -, -]
leaf_value: [-, 1.5, -, 0.8, 2.1]
Memory Layout#
Memory (separate arrays):
left_child: [1, -1, 3, -1, -1] <- contiguous
right_child: [2, -1, 4, -1, -1] <- contiguous
split_feature: [5, 0, 2, 0, 0] <- contiguous
threshold: [0.5, 0, 0.3, 0, 0] <- contiguous
leaf_value: [0, 1.5, 0, 0.8, 2.1] <- contiguous
LightGBM’s Inference Tree (SoA)#
LightGBM stores trained trees in SoA format:
// From LightGBM/include/LightGBM/tree.h
class Tree {
// Split node data (size = num_leaves - 1)
std::vector<int> left_child_;
std::vector<int> right_child_;
std::vector<int> split_feature_;
std::vector<double> threshold_;
std::vector<int8_t> decision_type_; // Packed flags
// Leaf data (size = num_leaves)
std::vector<double> leaf_value_;
int num_leaves_;
double shrinkage_;
};
Reference:
LightGBM/include/LightGBM/tree.h, classTree
Why SoA for Inference#
During batch prediction, you need to traverse many samples through the tree:
FOR each sample IN batch:
node <- root
WHILE node is not leaf:
IF sample.features[split_feature[node]] < threshold[node]:
node <- left_child[node]
ELSE:
node <- right_child[node]
output[sample] <- leaf_value[node]
With SoA, reading thresholds for many nodes loads contiguous memory:
SoA memory access pattern:
threshold = [t0, t1, t2, t3, t4, ...]
+--- contiguous reads ---+
AoS memory access pattern:
nodes = [{...t0...}, {...t1...}, {...t2...}]
+-- stride = 24 bytes between thresholds --+
SIMD Benefits#
SoA enables vectorization—comparing multiple values against multiple thresholds simultaneously:
// Conceptual SIMD operation
thresholds_vec = LOAD_8_FLOATS(threshold[node:node+8])
features_vec = GATHER_8_FLOATS(sample_features, split_feature[node:node+8])
mask = COMPARE_LESS_THAN(features_vec, thresholds_vec) // 8 comparisons in 1 instruction
GPU Coalescing#
For GPU inference, SoA is essential:
Warp of 32 threads at different tree nodes:
Thread 0 needs threshold[node_0]
Thread 1 needs threshold[node_1]
Thread 2 needs threshold[node_2]
...
SoA: All reads from contiguous threshold[] array -> coalesced memory access
AoS: Reads scattered across node structs -> uncoalesced, slow
Tree Unrolling (Top-Level Optimization)#
For the first few tree levels, we can eliminate pointer chasing entirely by using implicit indexing — storing nodes in level-order so child positions are computed arithmetically.
Implicit Indexing#
In a complete binary tree stored in level order:
Level 0: [0] ← root at index 0
Level 1: [1] [2] ← children at 2*0+1, 2*0+2
Level 2: [3][4][5][6] ← children at 2*1+1, 2*1+2, etc.
For node at index i:
left_child = 2*i + 1
right_child = 2*i + 2
parent = (i-1) / 2
Benefit: No pointer storage, no pointer chasing — just arithmetic.
Why Partial Unrolling?#
Fully unrolling a tree to depth D requires \(2^D - 1\) node slots:
Depth |
Nodes Required |
Memory (32B/node) |
|---|---|---|
6 |
63 |
2 KB |
10 |
1,023 |
32 KB |
15 |
32,767 |
1 MB |
20 |
1,048,575 |
32 MB |
XGBoost unrolls only the top 6 levels because:
Memory explosion: Sparse trees waste slots (padding to \(2^D\))
Gather bottleneck: Beyond ~6 levels, the random memory access to fetch features (gather) dominates, not tree structure cache misses
Sample divergence: With block size 64, samples diverge to ~64 different subtrees after 6 levels anyway
Diminishing returns: Top levels route samples to \(2^6 = 64\) subtrees, covering the highest-entropy decisions
Unrolled Traversal Algorithm#
ALGORITHM: TraverseUnrolled(features, tree)
─────────────────────────────────────────
pos ← 0 // Position in level-order array
// Unrolled levels (no pointer chasing)
FOR level FROM 0 TO UNROLL_DEPTH - 1:
feat ← tree.feature[pos]
thresh ← tree.threshold[pos]
IF features[feat] < thresh:
pos ← 2 * pos + 1 // Left child (arithmetic)
ELSE:
pos ← 2 * pos + 2 // Right child (arithmetic)
// Remaining levels use standard pointer-following
node ← subtree_roots[pos]
WHILE node is not leaf:
// Standard traversal...
RETURN leaf_value[node]
Trade-offs#
Aspect |
Unrolled |
Pointer-Based |
|---|---|---|
Memory |
Higher (padding) |
Lower (sparse OK) |
Cache misses (structure) |
Zero for unrolled levels |
One per level |
Implementation |
More complex |
Simple |
Best for |
Top ~6 levels |
Deep/sparse trees |
Node Indexing Conventions#
Different libraries use different conventions for identifying leaf nodes:
XGBoost: Sentinel Value#
Uses a special value (kInvalidNodeId = -1) to mark leaves:
cleft_ == kInvalidNodeId (-1) -> this is a leaf node
Otherwise -> this is an internal node
Tree structure:
[0] <- root (internal)
/ \
[1] [2] <- node 1 is leaf, node 2 is internal
/ \
[3] [4] <- both leaves
LightGBM: Negative Indices#
Uses negative indices (bitwise NOT) for leaves:
Positive index -> internal node
Negative index -> leaf (use ~index to get leaf ID)
left_child_ = [~0, ~1] // Node 0 -> Leaf 0, Node 1 -> Leaf 1
right_child_ = [1, ~2] // Node 0 -> Node 1, Node 1 -> Leaf 2
This convention separates internal nodes from leaves in storage, allowing different-sized arrays.
Decision Type Encoding (LightGBM)#
LightGBM packs multiple boolean flags into a single byte:
Bit 0: Categorical flag (0=numerical, 1=categorical)
Bit 1: Default left flag (0=missing->right, 1=missing->left)
Bit 2-3: Missing type (00=None, 01=Zero, 10=NaN)
Example: decision_type = 0b00000110 = 6
- Bit 0: 0 -> numerical split
- Bit 1: 1 -> missing values go left
- Bit 2-3: 01 -> treat 0.0 as missing
This saves memory compared to separate boolean fields and improves cache efficiency.
Reference:
LightGBM/include/LightGBM/tree.h,kCategoricalMask,kDefaultLeftMask
Multi-Output (Vector) Leaves#
For multi-target regression or multi-class classification, leaves may store multiple values:
Scalar Leaves (Standard)#
leaf_value: [v0, v1, v2, v3] // 4 leaves, 1 value each
Vector Leaves#
// 4 leaves, 3 outputs each
leaf_value: [v0_0, v0_1, v0_2, // Leaf 0
v1_0, v1_1, v1_2, // Leaf 1
v2_0, v2_1, v2_2, // Leaf 2
v3_0, v3_1, v3_2] // Leaf 3
XGBoost supports this via TreeParam::size_leaf_vector:
// From xgboost/include/xgboost/tree_model.h
struct TreeParam {
bst_target_t size_leaf_vector; // 1 for scalar, >1 for vector
};
LightGBM typically handles multi-output via separate trees per output rather than vector leaves.
Linear Tree Extension#
LightGBM supports linear models in leaves (leaf contains a linear function, not just a constant):
Prediction for leaf i:
output = leaf_const[i] + sum(leaf_coeff[i][j] * feature[leaf_features[i][j]])
This requires additional storage per leaf:
// Conceptual (from LightGBM tree.h)
std::vector<double> leaf_const_; // Intercept per leaf
std::vector<std::vector<double>> leaf_coeff_; // Coefficients per leaf
std::vector<std::vector<int>> leaf_features_; // Feature indices per leaf
Memory Estimates#
For a tree with L leaves (L-1 internal nodes, 2L-1 total nodes):
XGBoost AoS (Training)#
Nodes: (2L-1) * 24 bytes = ~48L bytes
Stats: (2L-1) * 16 bytes = ~32L bytes (RTreeNodeStat)
Total: ~80L bytes per tree
LightGBM SoA (Inference)#
Internal nodes: (L-1) * (4+4+4+8+1+4) = ~25(L-1) bytes
Leaf nodes: L * (8+8+4+4) = ~24L bytes
Total: ~49L bytes per tree (more compact)
Forest Memory#
For 100 trees with 31 leaves each (depth 5):
XGBoost AoS: 100 * 80 * 31 = ~248 KB
LightGBM SoA: 100 * 49 * 31 = ~152 KB
SoA is typically more compact because:
No padding between fields
Leaves and internal nodes can use different field sets
Bit-packing for flags
When to Use Each Format#
AoS (Training)#
Use when:
Building trees incrementally
Frequent node mutations
Need parent pointers for backtracking
Implementing tree updaters/pruners
Avoid when:
Batch inference with many samples
GPU inference
SIMD optimization is critical
SoA (Inference)#
Use when:
Batch prediction (many samples)
GPU inference
SIMD optimization
Memory-mapped model loading
Read-only tree access
Avoid when:
Training (frequent mutations are awkward)
Single-sample latency-critical paths (AoS may be simpler)
Summary#
Aspect |
Training (AoS) |
Inference (SoA) |
|---|---|---|
Layout |
Node structs in array |
Separate array per field |
Mutation |
Easy (modify one node) |
Hard (update multiple arrays) |
Cache |
Good for single node access |
Good for field across many nodes |
SIMD |
Limited |
Excellent (contiguous data) |
GPU |
Poor (uncoalesced access) |
Excellent (coalesced access) |
Use case |
Tree construction |
Batch prediction |
Key References#
xgboost/include/xgboost/tree_model.h—RegTree::Node(AoS)LightGBM/include/LightGBM/tree.h—Treeclass (SoA)“Data-Oriented Design” — General principles of AoS vs SoA trade-offs