Batch Traversal#

Overview#

Batch traversal is an optimization that processes multiple samples together through a tree ensemble, improving cache utilization and enabling vectorization. The key insight is that tree structure is read-only during inference—if we process many samples simultaneously, they all share access to the same tree nodes.

The Cache Efficiency Problem#

Modern CPUs have a memory hierarchy:

| Level | Size | Latency | What fits |
|-------|------|---------|-----------|
| L1 Cache | 32-64 KB | ~4 cycles | A few tree levels |
| L2 Cache | 256-512 KB | ~12 cycles | Several small trees |
| L3 Cache | 4-32 MB | ~40 cycles | Many trees |
| RAM | GBs | ~200+ cycles | Everything |

Single-sample processing problem:

FOR sample IN batch:
    FOR tree IN forest:
        output[sample] += traverse(tree, sample)  // Tree loaded from RAM
                                                  // Evicted before next sample uses it

Each tree is loaded from RAM, used once, and evicted before the next sample can benefit.

Batch processing solution:

FOR tree IN forest:
    FOR sample IN batch:
        output[sample] += traverse(tree, sample)  // Tree stays in cache
                                                  // Reused by many samples

Tree stays in cache while all samples in the batch traverse it.

Block-Based Traversal#

The Algorithm#

Rather than processing samples one at a time, group them into blocks:

ALGORITHM: BatchPredict(samples, trees, block_size=64)
------------------------------------------------------
1. FOR block IN samples.chunks(block_size):
2.     // Load features for entire block
3.     block_features <- LoadFeatures(block)
4.     
5.     FOR tree IN trees:
6.         // Traverse all samples in block through this tree
7.         leaf_indices <- TraverseBlock(tree, block_features)
8.         
9.         // Accumulate leaf values
10.        FOR i FROM 0 TO block.size:
11.            output[block.start + i] += tree.leaf_value[leaf_indices[i]]
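
As a concrete illustration, here is a minimal C++ sketch of the same loop structure. The `Node`/`Tree` layout and the `Traverse` helper are simplified stand-ins for illustration, not XGBoost's actual types:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node {
    int   feature;     // feature index to test; -1 marks a leaf
    float threshold;   // go left if feature value <= threshold
    int   left, right; // child indices within the tree's node array
    float leaf_value;  // valid only when feature == -1
};

struct Tree {
    std::vector<Node> nodes;  // nodes[0] is the root
};

// Walk one sample through one tree and return its leaf value.
float Traverse(const Tree& tree, const float* features) {
    int idx = 0;
    while (tree.nodes[idx].feature >= 0) {
        const Node& n = tree.nodes[idx];
        idx = (features[n.feature] <= n.threshold) ? n.left : n.right;
    }
    return tree.nodes[idx].leaf_value;
}

// Block-based prediction: within each block, trees are the outer loop,
// so a tree's nodes stay hot in cache while every sample reuses them.
void BatchPredict(const std::vector<Tree>& trees,
                  const float* data,  // row-major: n_samples x n_features
                  std::size_t n_samples, std::size_t n_features,
                  float* output,      // n_samples entries, zero-initialized
                  std::size_t block_size = 64) {
    for (std::size_t start = 0; start < n_samples; start += block_size) {
        const std::size_t end = std::min(start + block_size, n_samples);
        for (const Tree& tree : trees) {
            for (std::size_t i = start; i < end; ++i) {
                output[i] += Traverse(tree, data + i * n_features);
            }
        }
    }
}
```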

Why Block Size Matters#

| Block Size | Pros | Cons |
|-----------|------|------|
| Small (1-16) | Low memory, simple | No vectorization, high loop overhead |
| Medium (32-64) | Cache-friendly, vectorizable; good balance for most workloads | Few in practice |
| Large (128-256) | Better amortization | Memory pressure, samples diverge quickly |

Recommended default: 64 samples

This matches:

  • L1 cache granularity (64 floats per feature occupy 256 B, i.e. four 64-byte cache lines)

  • Common SIMD register widths (8 floats per AVX register, 16 per AVX-512; 64 divides evenly into both)

  • GPU warp size (32 threads; a 64-sample block maps to 2 warps)

Per-Thread Buffers#

To avoid allocation in the hot path, pre-allocate reusable buffers:

ThreadLocalBuffer:
  feature_values: array[BLOCK_SIZE * MAX_FEATURES] of float
  node_positions: array[BLOCK_SIZE] of int
  leaf_outputs:   array[BLOCK_SIZE] of float

Allocate once at initialization, reuse across all predictions.
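
In C++ this can be as simple as a `thread_local` struct; the sizes below (and `kMaxFeatures` in particular) are illustrative assumptions:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kBlockSize   = 64;
constexpr std::size_t kMaxFeatures = 256;  // assumed upper bound on feature count

struct ThreadLocalBuffer {
    std::array<float, kBlockSize * kMaxFeatures> feature_values;  // staged features
    std::array<int,   kBlockSize>                node_positions;  // per-sample tree positions
    std::array<float, kBlockSize>                leaf_outputs;    // per-sample accumulators
};

// Constructed once per thread on first use, then reused by every
// prediction call on that thread: no allocation in the hot path.
ThreadLocalBuffer& GetThreadBuffer() {
    thread_local ThreadLocalBuffer buffer;
    return buffer;
}
```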

Level-Synchronized Traversal#

Concept#

Instead of each sample following its own path independently, synchronize all samples at each tree level:

Level 0: All samples at root
         |
         v
Level 1: Samples split to left/right children
         |
         v
Level 2: Samples split again
         ...

Why This Helps#

Independent traversal: Each sample has unpredictable branches

Sample 0: root -> left -> right -> left -> leaf
Sample 1: root -> right -> left -> left -> leaf
Sample 2: root -> left -> left -> right -> leaf
          (completely different memory access patterns)

Level-synchronized traversal: All samples access the same tree level simultaneously

Level 0: All samples compare against root.threshold
Level 1: All samples compare against level-1 thresholds
         (contiguous memory access, predictable branches)
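
A C++ sketch of level-synchronized stepping, reusing the illustrative `Node` struct from the block-processing example above; each pass of the outer loop advances every sample in the block by exactly one level:

```cpp
#include <cstddef>
#include <vector>

// Advance all samples in a block one tree level per outer iteration.
// positions[i] holds sample i's current node index; leaves are
// absorbing (a sample that reaches a leaf simply stays there).
void TraverseBlockLevelSync(const std::vector<Node>& nodes,
                            const float* block_features,  // block_size x n_features, staged
                            std::size_t block_size, std::size_t n_features,
                            int max_depth,
                            std::vector<int>& positions) {
    positions.assign(block_size, 0);  // every sample starts at the root
    for (int level = 0; level < max_depth; ++level) {
        // No loop-carried dependency across samples: the inner loop has a
        // predictable structure and is amenable to vectorization.
        for (std::size_t i = 0; i < block_size; ++i) {
            const Node& n = nodes[positions[i]];
            if (n.feature < 0) continue;  // already at a leaf
            const float* f = block_features + i * n_features;
            positions[i] = (f[n.feature] <= n.threshold) ? n.left : n.right;
        }
    }
}
```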

Unrolled Tree Layouts#

The Concept#

Standard trees use pointer-based navigation (store child indices, follow them). This causes random memory access:

Standard: nodes[current].left -> nodes[that].left -> nodes[other].left
          (3 pointer chases, 3 potential cache misses)

Unrolled layout stores the top K levels in position-indexed arrays:

Unrolled (depth 3):
  Level 0: [root]           <- position 0
  Level 1: [left, right]    <- positions 0, 1
  Level 2: [LL, LR, RL, RR] <- positions 0, 1, 2, 3

Position in level d: directly indexes into level-d array
No pointer chasing for first K levels!

Implicit Indexing#

For a complete binary tree, child positions are computed arithmetically:

Parent at position p in level d:
  Left child:  position 2*p     in level d+1
  Right child: position 2*p + 1 in level d+1

This eliminates pointer storage and following for the unrolled levels.

Why Partial Unrolling?#

Fully unrolling a tree to depth D requires \(2^D - 1\) node slots:

| Depth | Nodes Required | Memory (32 B/node) |
|-------|----------------|--------------------|
| 6 | 63 | 2 KB |
| 10 | 1,023 | 32 KB |
| 15 | 32,767 | 1 MB |
| 20 | 1,048,575 | 32 MB |

XGBoost unrolls only the top 6 levels because:

  1. Memory explosion: Deep unrolling wastes memory on padding (sparse trees don’t use all positions)

  2. Gather bottleneck: Beyond ~6 levels, the random memory access (gather) to fetch features dominates, not cache misses for tree structure

  3. Diminishing returns: Top levels route samples to \(2^6 = 64\) subtrees, covering high-entropy decisions

  4. Block amortization: With block size 64, samples diverge to ~64 different subtrees after 6 levels anyway

Reference: xgboost/src/predictor/cpu_predictor.cc, array tree layout

Traversal with Unrolled Layout#

ALGORITHM: TraverseUnrolled(features, unrolled_tree)
----------------------------------------------------
1. pos <- 0  // Position within current level
2. 
3. // Unrolled levels (no indirection)
4. FOR level FROM 0 TO UNROLL_DEPTH:
5.     feat_idx <- unrolled_tree.feature[level][pos]
6.     threshold <- unrolled_tree.threshold[level][pos]
7.     fval <- features[feat_idx]
8.     
9.     // Next position: left child at 2*pos, right at 2*pos+1
10.    IF fval <= threshold:
11.        pos <- 2 * pos
12.    ELSE:
13.        pos <- 2 * pos + 1
14.
15. // Continue with standard traversal for deep levels
16. node <- subtree_roots[pos]
17. WHILE node is not leaf:
18.    // Standard pointer-following for remaining levels
19.    ...
20.
21. RETURN leaf_value[node]
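
The same routine in C++, reusing the illustrative `Node` struct from earlier and assuming (as real implementations do) that the top `kUnrollDepth` levels are padded to completeness so every position is valid; the container layout is a sketch, not XGBoost's actual representation:

```cpp
#include <vector>

constexpr int kUnrollDepth = 6;  // matches XGBoost's 6-level unrolling

struct UnrolledTree {
    // Level d holds 2^d slots; the children of position p sit at
    // positions 2p and 2p+1 in level d+1 (implicit indexing).
    std::vector<std::vector<int>>   feature;    // [level][pos]
    std::vector<std::vector<float>> threshold;  // [level][pos]
    std::vector<int>  subtree_roots;            // 2^kUnrollDepth entries into nodes
    std::vector<Node> nodes;                    // deep levels, standard pointer layout
};

float TraverseUnrolled(const UnrolledTree& tree, const float* features) {
    int pos = 0;
    // Unrolled levels: position arithmetic only, no pointer chasing.
    for (int level = 0; level < kUnrollDepth; ++level) {
        const float fval = features[tree.feature[level][pos]];
        const bool go_right = fval > tree.threshold[level][pos];
        pos = 2 * pos + (go_right ? 1 : 0);
    }
    // Deep levels: fall back to standard child-index following.
    int idx = tree.subtree_roots[pos];
    while (tree.nodes[idx].feature >= 0) {
        const Node& n = tree.nodes[idx];
        idx = (features[n.feature] <= n.threshold) ? n.left : n.right;
    }
    return tree.nodes[idx].leaf_value;
}
```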

Cache Efficiency Techniques#

Feature Staging#

Load all features for a sample into a contiguous buffer before traversing trees:

WITHOUT staging (bad):
  FOR tree IN trees:
      fval <- data[sample][tree.root.feature]  // Random access
      // ... traverse using random feature accesses

WITH staging (good):
  staged_features <- data[sample, :]  // One contiguous read
  FOR tree IN trees:
      fval <- staged_features[tree.root.feature]  // Cached access

Benefits:

  • Original data might be column-major (bad for row access)

  • Staged buffer is compact and cache-friendly

  • Same features reused across many trees
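
A sketch of the staging copy for feature-major (column-major) storage; the `columns` layout is an assumption for illustration:

```cpp
#include <cstddef>
#include <vector>

// Gather one sample's features from column-major storage into a
// contiguous buffer that every tree can then read cheaply.
void StageFeatures(const std::vector<std::vector<float>>& columns,  // [feature][sample]
                   std::size_t sample_idx,
                   std::vector<float>& staged) {
    staged.resize(columns.size());
    for (std::size_t f = 0; f < columns.size(); ++f) {
        staged[f] = columns[f][sample_idx];  // one strided read per feature
    }
}
```

The gather cost is paid once per sample and amortized across every tree in the ensemble.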

Prefetching#

Explicitly request next-level nodes before they’re needed:

ALGORITHM: TraverseWithPrefetch(node, features)
-----------------------------------------------
1. PREFETCH(nodes[node.left])   // Request from memory now
2. PREFETCH(nodes[node.right])
3. 
4. // By the time we need children, they're in cache
5. IF features[node.feature] <= node.threshold:
6.     RETURN TraverseWithPrefetch(node.left, features)
7. ELSE:
8.     RETURN TraverseWithPrefetch(node.right, features)

Hardware prefetch works well for sequential access but struggles with tree traversal’s irregular patterns. Explicit prefetching can help.
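
With GCC or Clang this is expressible via the `__builtin_prefetch` intrinsic (MSVC offers `_mm_prefetch` instead); the node layout is again the illustrative one from earlier:

```cpp
// Iterative traversal that requests both children from memory before
// the comparison resolves, hiding part of the miss latency.
float TraverseWithPrefetch(const Node* nodes, const float* features) {
    int idx = 0;
    while (nodes[idx].feature >= 0) {
        const Node& n = nodes[idx];
        __builtin_prefetch(&nodes[n.left]);   // start fetching both children now
        __builtin_prefetch(&nodes[n.right]);
        idx = (features[n.feature] <= n.threshold) ? n.left : n.right;
    }
    return nodes[idx].leaf_value;
}
```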

Parallelism Strategies#

Data Parallelism (Across Samples)#

Partition samples across threads:

PARALLEL FOR block IN data.chunks(BLOCK_SIZE):
    FOR tree IN forest:
        process(tree, block)

Pros: Simple, good load balancing, linear scaling.

Cons: Each thread loads all trees (duplicated cache pressure).
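
With OpenMP, data parallelism is a one-pragma change over the `BatchPredict` sketch above (compile with `-fopenmp`); each thread writes a disjoint slice of `output`, so no synchronization is needed:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void PredictDataParallel(const std::vector<Tree>& trees,
                         const float* data, std::size_t n_samples,
                         std::size_t n_features, float* output,
                         std::size_t block_size = 64) {
    const std::ptrdiff_t n_blocks =
        static_cast<std::ptrdiff_t>((n_samples + block_size - 1) / block_size);
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t b = 0; b < n_blocks; ++b) {
        const std::size_t start = static_cast<std::size_t>(b) * block_size;
        const std::size_t end = std::min(start + block_size, n_samples);
        for (const Tree& tree : trees) {
            for (std::size_t i = start; i < end; ++i) {
                output[i] += Traverse(tree, data + i * n_features);
            }
        }
    }
}
```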

Model Parallelism (Across Trees)#

Partition trees across threads:

FOR block IN data.chunks(BLOCK_SIZE):
    partial_sums <- PARALLEL FOR tree_chunk IN forest.chunks(TREES_PER_THREAD):
        sum(traverse(tree, block) FOR tree IN tree_chunk)
    output[block] <- sum(partial_sums)

Pros: Each thread caches a subset of trees.

Cons: Requires a reduction; less balanced if trees vary in size.
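
An OpenMP sketch of the tree-partitioned variant for a single block, again reusing the illustrative `Tree`/`Traverse` types; each thread accumulates into a private buffer, and the partial sums are merged at the end:

```cpp
#include <cstddef>
#include <vector>

void PredictModelParallel(const std::vector<Tree>& trees,
                          const float* block_features,  // block_size x n_features
                          std::size_t block_size, std::size_t n_features,
                          float* output) {  // block_size entries
    #pragma omp parallel
    {
        // Thread-private partial sums avoid contention on `output`.
        std::vector<float> partial(block_size, 0.0f);
        #pragma omp for schedule(static) nowait
        for (std::ptrdiff_t t = 0;
             t < static_cast<std::ptrdiff_t>(trees.size()); ++t) {
            for (std::size_t i = 0; i < block_size; ++i) {
                partial[i] += Traverse(trees[t], block_features + i * n_features);
            }
        }
        // Reduction: serialize each thread's merge into the shared output.
        #pragma omp critical
        for (std::size_t i = 0; i < block_size; ++i) {
            output[i] += partial[i];
        }
    }
}
```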

Hybrid Approach#

For large workloads, combine both:

// Outer: partition samples across thread groups
// Inner: partition trees within each group

PARALLEL FOR sample_chunk IN data.chunks(SAMPLES_PER_GROUP):
    local_output <- zeros(sample_chunk.size)
    
    PARALLEL FOR tree_chunk IN forest.chunks(TREES_PER_THREAD):
        FOR tree IN tree_chunk:
            accumulate(local_output, traverse(tree, sample_chunk))
    
    output[sample_chunk] <- local_output

SIMD Considerations#

What Vectorizes Well#

  1. Threshold comparisons (same threshold for multiple samples at the same node; see the intrinsics sketch after this list):

// 8 samples at same node, comparing against one threshold
thresholds <- BROADCAST(node.threshold)  // [t, t, t, t, t, t, t, t]
features <- LOAD_8_FLOATS(staged_features[0:8])
go_left_mask <- COMPARE_LE(features, thresholds)  // 8 comparisons, 1 instruction

  2. Position updates (predictable arithmetic):

positions <- 2 * positions + decisions  // SIMD multiply-add

  3. Leaf accumulation (independent additions):

outputs <- outputs + leaf_values  // SIMD add
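
For example, the broadcast-compare and position-update steps from items 1 and 2 map directly onto AVX2 intrinsics (a sketch; compile with `-mavx2`):

```cpp
#include <immintrin.h>

// Step 8 samples, all sitting at the same node, down one level of a
// complete-tree layout (left child = 2*pos, right child = 2*pos + 1).
__m256i StepEightSamples(__m256i positions,   // 8 current positions
                         const float* fvals,  // 8 gathered feature values
                         float threshold) {
    const __m256 t  = _mm256_set1_ps(threshold);        // broadcast one threshold
    const __m256 f  = _mm256_loadu_ps(fvals);           // load 8 feature values
    const __m256 le = _mm256_cmp_ps(f, t, _CMP_LE_OQ);  // lane = all-ones where f <= t
    // decision: 0 where f <= t (go left), 1 where f > t (go right)
    const __m256i decision =
        _mm256_andnot_si256(_mm256_castps_si256(le), _mm256_set1_epi32(1));
    // pos = 2*pos + decision, one fused step for all 8 samples
    return _mm256_add_epi32(_mm256_slli_epi32(positions, 1), decision);
}
```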

What Doesn’t Vectorize#

  1. Feature gathering (different feature index per sample):

// Each sample needs different feature
FOR i FROM 0 TO 8:
    features[i] <- data[sample_idx[i], feat_idx[i]]  // Random access

SIMD gather instructions exist but are slower than contiguous loads.

  2. Divergent paths (samples in different subtrees):

// After samples diverge to different nodes, synchronization breaks down
// Cannot vectorize across samples in different tree branches

Practical SIMD Strategy#

Focus SIMD on predictable parts:

  • Level-by-level comparison: All samples at same level, vectorize threshold comparison

  • Leaf accumulation: Vectorize adding leaf values to outputs

  • Multi-tree parallelism: Process 4 or 8 trees simultaneously for one sample

GPU Considerations#

Why GPU Needs Blocks#

GPU efficiency requires thousands of threads executing similar operations. Block-based traversal is essential:

GPU Kernel: One thread per sample
__global__ void predict(samples, trees, outputs):
    sample_idx <- blockIdx.x * blockDim.x + threadIdx.x
    
    FOR tree_idx FROM 0 TO num_trees:
        // Load tree structure into shared memory (block-wide)
        __shared__ TreeNodes shared_tree
        cooperative_load(trees[tree_idx], shared_tree)
        __syncthreads()
        
        // All threads in the block traverse the same tree (from shared memory)
        outputs[sample_idx] += traverse(shared_tree, samples[sample_idx])
        __syncthreads()  // Before the next tree overwrites shared_tree

Warp Efficiency#

Threads in a warp (32 threads) execute in lockstep. Divergent branches hurt performance:

Thread 0: goes left  --+
Thread 1: goes right   |-- Warp divergence (both paths execute, half threads idle)
Thread 2: goes left    |
...                  --+

Level-synchronized traversal and proper tree layout minimize divergence by keeping samples at similar tree positions.

Summary#

| Technique | Benefit | Implementation |
|-----------|---------|----------------|
| Block processing | Keep tree in cache | Process 64 samples per tree |
| Level synchronization | Predictable access | All samples at same depth |
| Tree unrolling | No pointer chasing | Implicit indexing for top 6 levels |
| Feature staging | Cache-friendly | Pre-load sample features |
| SIMD comparison | 8x throughput | Vectorize threshold comparisons |
| Multi-tree | Better utilization | Process multiple trees per sample |

Key References#

  • xgboost/src/predictor/cpu_predictor.cc — Block traversal implementation

  • XGBoost array tree layout (6-level unrolling)

  • LightGBM/src/boosting/gbdt.cpp — Batch prediction with iterators

  • “What Every Programmer Should Know About Memory” — Cache hierarchy deep dive