Inside Gradient Boosting, Part 10 of 10

This series explains gradient boosting from first principles to advanced implementation details.

Previous: XGBoost vs LightGBM: A Practical Comparison

In this post: The story of building a gradient boosting library from scratch.

This series started with a simple idea: how does gradient boosting actually work?

Over nine posts, we went from residual intuition to Taylor expansions, from histograms to GOSS, from categorical splits to hyperparameter tuning. But one question lingered: could I build one myself?

This final post is the story of boosters—a gradient boosting library I built from scratch in Rust. Not because the world needed another GBDT library, but because there’s no better way to understand something than to build it.


The Beginning: Dissecting XGBoost

It started with a deceptively simple goal: understand what’s inside an XGBoost model.

I’d been using XGBoost for years. I knew how to tune max_depth and learning_rate. I could explain the basics of gradient boosting in an interview. But I couldn’t answer a simple question: what exactly is stored in the model file?

So I opened an XGBoost JSON model and started reading:

{
  "learner": {
    "gradient_booster": {
      "model": {
        "trees": [{
          "split_indices": [2, 5, -1, ...],
          "split_conditions": [0.5, 1.2, 0.0, ...],
          "default_left": [true, false, ...],
          ...
        }]
      }
    }
  }
}

Trees are stored as arrays. Each node has a split feature, a threshold, a default direction for missing values. Leaves store prediction values. This I could understand.
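
In Rust terms, that flat layout and the lookup it enables look roughly like this. A sketch only: split_indices, split_conditions, and default_left mirror the JSON above, while the child-index and leaf-value arrays use illustrative names rather than the exact schema.

// A tree as parallel arrays: index i is a node, children are node indices.
struct Tree {
    split_indices: Vec<i32>,    // feature each internal node splits on
    split_conditions: Vec<f32>, // split threshold
    default_left: Vec<bool>,    // where missing values go
    left_children: Vec<i32>,    // -1 marks a leaf
    right_children: Vec<i32>,
    leaf_values: Vec<f32>,      // prediction stored at each leaf (illustrative)
}

// Walking the arrays from the root yields the prediction for one sample.
fn predict_one(tree: &Tree, features: &[f32]) -> f32 {
    let mut node = 0usize;
    loop {
        if tree.left_children[node] < 0 {
            return tree.leaf_values[node];
        }
        let x = features[tree.split_indices[node] as usize];
        let go_left = if x.is_nan() {
            tree.default_left[node] // missing value: follow the learned default
        } else {
            x < tree.split_conditions[node]
        };
        node = if go_left {
            tree.left_children[node] as usize
        } else {
            tree.right_children[node] as usize
        };
    }
}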

But then I looked at training. How are those thresholds chosen? How does histogram building work? What’s this “subtraction trick” I kept reading about?

The XGBoost and LightGBM codebases are excellent, but they're written for production—hundreds of files, GPU/CPU paths, distributed training, a decade of engineering. I wanted to understand the core algorithm, not navigate a massive codebase.

So I decided to build my own. But first, I started where any good engineering project should: with documentation.


The Early Decision: RFCs First

Before writing any code, I wrote RFCs (Request for Comments)—design documents that forced me to think through each component:

  • What problem am I solving?
  • What approaches did I consider?
  • What are the tradeoffs?
  • What will the implementation look like?

This felt slow at first. I wanted to code! But the RFC for histogram building alone prevented three architectural mistakes I would have had to revert later. The RFC for tree storage saved me from a layout that would have killed cache performance.

By the time I finished, I had 15 RFCs covering every major component: dataset representation, tree storage, binning, histograms, training, inference, Python bindings. These documents became the source material for this blog series.


Week 1: The Naive Implementation

The first version was embarrassingly slow.

I implemented exactly what Part 1 describes: fit trees to residuals, add to ensemble, repeat. No histograms. No optimizations. Just the algorithm.
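
The outer loop really is just "fit a tree to the residuals, shrink it, add it". A minimal sketch, generic over whatever fit_tree and predict you plug in (stand-in names, not the actual boosters API):

// Minimal boosting loop for squared error: the negative gradient is the residual.
fn boost<T>(
    x: &[Vec<f64>],
    y: &[f64],
    n_rounds: usize,
    learning_rate: f64,
    fit_tree: impl Fn(&[Vec<f64>], &[f64]) -> T,
    predict: impl Fn(&T, &[f64]) -> f64,
) -> Vec<T> {
    let mut ensemble = Vec::new();
    let mut preds = vec![0.0; y.len()];

    for _ in 0..n_rounds {
        // For squared error, the residual is exactly what the next tree should fit.
        let residuals: Vec<f64> = y.iter().zip(&preds).map(|(yi, pi)| yi - pi).collect();

        // Fit the next tree, then fold it into the predictions with shrinkage.
        let tree = fit_tree(x, residuals.as_slice());
        for (pred, row) in preds.iter_mut().zip(x) {
            *pred += learning_rate * predict(&tree, row.as_slice());
        }
        ensemble.push(tree);
    }
    ensemble
}

Split finding inside the tree fitter was equally naive: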

fn find_best_split(samples: &[Sample], feature: usize) -> Option<Split> {
    let mut best_gain = 0.0;
    let mut best_threshold = 0.0;

    // Sort samples by feature value (O(n log n) per feature per node!)
    let mut sorted: Vec<_> = samples.iter().collect();
    sorted.sort_by(|a, b| a.features[feature].partial_cmp(&b.features[feature]).unwrap());

    // Try every unique threshold
    for i in 1..sorted.len() {
        let threshold = sorted[i].features[feature];
        if threshold == sorted[i - 1].features[feature] {
            continue; // skip duplicate values
        }
        let (left, right) = partition_at(samples, feature, threshold);
        let gain = compute_gain(&left, &right);

        if gain > best_gain {
            best_gain = gain;
            best_threshold = threshold;
        }
    }

    // No threshold improved on the parent: this node becomes a leaf.
    if best_gain <= 0.0 {
        return None;
    }
    Some(Split { feature, threshold: best_threshold, gain: best_gain })
}

It worked. On the Iris dataset (150 samples, 4 features), it trained in milliseconds. On California Housing (20,000 samples, 8 features), it took minutes. On anything larger, it didn’t finish.

But that’s fine. I understood what the algorithm was doing. Time to make it fast.


Week 2: Histograms Change Everything

Part 4 explains why histograms matter. Implementing them was when the project got interesting.

Instead of sorting samples by feature value at every node (O(n log n)), I quantized each feature into 256 bins once, upfront. Building a node's histograms is then a single pass over its samples, and scanning for the best split costs O(bins × features) regardless of sample count.
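
That one-time quantization can be as simple as cutting each feature at evenly spaced quantiles of its observed values. A sketch (it assumes no NaNs and skips the care the real binning takes with ties, sparsity, and missing values):

// Map one feature's raw values to u8 bin indices using quantile edges.
fn bin_feature(values: &[f64], num_bins: usize) -> (Vec<u8>, Vec<f64>) {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap()); // assumes no NaNs

    // Bin edges at evenly spaced quantiles of the observed values.
    let edges: Vec<f64> = (1..num_bins)
        .map(|i| sorted[i * sorted.len() / num_bins])
        .collect();

    // A value's bin is the number of edges that are <= the value.
    let bins = values
        .iter()
        .map(|v| edges.partition_point(|e| e <= v) as u8)
        .collect();
    (bins, edges)
}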

The insight that finally clicked: histogram bins are accumulators. Each bin stores the sum of gradients for samples falling into that bin. To find the best split, scan bins left-to-right, maintaining running sums.

struct HistogramBin {
    gradient_sum: f64,
    hessian_sum: f64,
}
 
fn find_best_split_histogram(
    histogram: &[HistogramBin],
    total_g: f64,
    total_h: f64,
    lambda: f64,
) -> Split {
    let mut g_left = 0.0;
    let mut h_left = 0.0;
    let mut best_gain = 0.0;
    let mut best_bin = 0;

    // One left-to-right pass: the running sums are the left child's stats,
    // and the right child's stats follow by subtracting from the totals.
    for (bin, stats) in histogram.iter().enumerate() {
        g_left += stats.gradient_sum;
        h_left += stats.hessian_sum;
        let g_right = total_g - g_left;
        let h_right = total_h - h_left;

        let gain = split_gain(g_left, h_left, g_right, h_right, lambda);
        if gain > best_gain {
            best_gain = gain;
            best_bin = bin;
        }
    }

    Split { bin: best_bin, gain: best_gain }
}
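
Two helpers are implied but not shown above. Building the histogram in the first place is a single accumulation pass over the node's samples (a sketch; the bin_index layout, one precomputed bin per sample for this feature, is an assumption):

// One histogram build for one feature: a single pass over the node's samples.
fn build_feature_histogram(
    bins: &mut [HistogramBin],
    node_samples: &[usize],
    bin_index: &[u8],
    gradients: &[f64],
    hessians: &[f64],
) {
    for &i in node_samples {
        let b = bin_index[i] as usize;
        bins[b].gradient_sum += gradients[i];
        bins[b].hessian_sum += hessians[i];
    }
}

And split_gain is the second-order gain formula from earlier in the series, written out (this sketch omits the per-leaf complexity penalty; lambda is the L2 regularization term):

// Improvement from splitting a node with child stats (g_l, h_l) and (g_r, h_r)
// over keeping it as one leaf.
fn split_gain(g_l: f64, h_l: f64, g_r: f64, h_r: f64, lambda: f64) -> f64 {
    let leaf_score = |g: f64, h: f64| (g * g) / (h + lambda);
    0.5 * (leaf_score(g_l, h_l) + leaf_score(g_r, h_r) - leaf_score(g_l + g_r, h_l + h_r))
}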

Suddenly California Housing trained in seconds. But I was still 10× slower than XGBoost. Where was the time going?


Week 3: The Subtraction Trick

Profiling revealed that I was building two histograms for every split—one for each child. For a tree with 100 leaves, that’s 200 histogram builds.

The subtraction trick (from Part 4) cuts this in half:

// Instead of building both children:
//   build_histogram(left_child);
//   build_histogram(right_child);

// Keep the parent histogram, build only the smaller child:
build_histogram(smaller_child);

// Derive the larger child by subtraction, bin by bin:
for bin in 0..num_bins {
    larger[bin].gradient_sum = parent[bin].gradient_sum - smaller[bin].gradient_sum;
    larger[bin].hessian_sum = parent[bin].hessian_sum - smaller[bin].hessian_sum;
}

Building a histogram is O(samples in node). Subtracting histograms is O(bins). For a node with 100,000 samples and 256 bins, that’s a 390× speedup for the larger child.

But there’s a catch: you need the parent histogram when splitting. If you’ve already evicted it from memory, you’re back to building both children.

This led me to implement an LRU histogram cache. The implementation details were tricky—pinning slots during subtraction, handling depth-first vs breadth-first growth—but the payoff was real. Training time dropped by another 40%.
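
In spirit, the cache is a handful of slots keyed by node id, with a pinned flag so a parent histogram cannot be evicted while a child is being derived from it. A minimal sketch of the idea, not the actual boosters design:

use std::collections::HashMap;

// Per-node histogram cache with LRU eviction and pinning.
struct CachedHistogram {
    bins: Vec<HistogramBin>,
    last_used: u64,
    pinned: bool, // set while this histogram is the parent in a subtraction
}

struct HistogramCache {
    slots: HashMap<usize, CachedHistogram>, // keyed by node id
    capacity: usize,
    clock: u64,
}

impl HistogramCache {
    fn insert(&mut self, node_id: usize, bins: Vec<HistogramBin>) {
        if self.slots.len() >= self.capacity {
            // Evict the least recently used unpinned histogram
            // (if everything is pinned, temporarily exceed capacity).
            let victim = self
                .slots
                .iter()
                .filter(|(_, h)| !h.pinned)
                .min_by_key(|(_, h)| h.last_used)
                .map(|(&id, _)| id);
            if let Some(id) = victim {
                self.slots.remove(&id);
            }
        }
        self.clock += 1;
        self.slots.insert(
            node_id,
            CachedHistogram { bins, last_used: self.clock, pinned: false },
        );
    }
}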


Weeks 4-6: Chasing Performance

By now I had a working histogram-based GBDT. It matched XGBoost’s accuracy (verified against their predictions). But it was slower. Profiling showed the bottleneck was histogram building itself.

I went down several rabbit holes:

Ordered gradients: Reading gradients in sample order (as they appear in the node) causes cache misses. Pre-gathering gradients into partition order before building histograms improved memory locality. 15% improvement.

f64 bins vs f32: I initially used f32 for histograms to match gradient precision. But summing millions of f32 values accumulates rounding error. f64 bins fixed numerical issues and, surprisingly, weren’t slower—memory bandwidth wasn’t the bottleneck.

Feature-parallel vs row-parallel: I tried splitting histogram work by rows (each thread builds partial histogram, merge at end). More complex, required synchronization, and slower than the simple approach: split by features (each thread builds complete histogram for a subset of features, no merge needed).

SIMD: I spent a week on explicit SIMD vectorization using Rust’s std::simd. The result? Slower than letting LLVM auto-vectorize the scalar loop. The compiler is smarter than I thought.
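
To make the first of those rabbit holes concrete, the ordered-gradients change is just a gather into scratch buffers before the histogram pass (a sketch with assumed buffer names):

// Gather gradients/hessians into the node's partition order so the histogram
// pass reads them sequentially instead of jumping around the full arrays.
fn gather_ordered(
    node_samples: &[usize],
    gradients: &[f64],
    hessians: &[f64],
    ordered_g: &mut Vec<f64>,
    ordered_h: &mut Vec<f64>,
) {
    ordered_g.clear();
    ordered_h.clear();
    for &i in node_samples {
        ordered_g.push(gradients[i]);
        ordered_h.push(hessians[i]);
    }
}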

After a month of optimization, I was within 10% of XGBoost on most benchmarks. Good enough to move on.


Month 2: Feature Parity

Performance is nothing without features. I went through the XGBoost and LightGBM documentation and implemented:

GOSS sampling (Part 6): Keep high-gradient samples, subsample the rest. The weight amplification math took a few tries to get right.

Categorical features (Part 7): Native categorical splits with the gradient-sorted algorithm. This was more complex than expected—bitset storage, handling high cardinality, regularization to prevent overfitting.

Exclusive Feature Bundling: For one-hot encoded data, bundle mutually exclusive features into single columns. My benchmark on the Adult dataset improved by 7× after implementing EFB.

Missing value handling: XGBoost’s “learned default direction” is elegant—during split finding, try missing values going left and right, pick the better direction.

Quantile regression, ranking, multiclass: Each objective required implementing the gradient and Hessian formulas from scratch. The quantile objective was particularly finicky (check-loss has non-smooth derivatives).
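
For a flavor of that finickiness: the check-loss gradient is piecewise constant and its true Hessian is zero almost everywhere, so something has to stand in for curvature. A minimal sketch (the constant surrogate Hessian is an illustration, not necessarily what boosters ships):

// Check (pinball) loss for quantile level tau in (0, 1):
//   L(y, f) = tau * (y - f)       if y >= f
//             (1 - tau) * (f - y) otherwise
// The gradient w.r.t. the prediction f is piecewise constant; the Hessian is
// zero almost everywhere, so a constant surrogate is returned instead.
fn quantile_grad_hess(y: f64, pred: f64, tau: f64) -> (f64, f64) {
    let grad = if y >= pred { -tau } else { 1.0 - tau };
    let hess = 1.0; // surrogate curvature (assumption for this sketch)
    (grad, hess)
}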


Benchmarking and Validation

I built an evaluation framework (boosters-eval) that runs standardized benchmarks against XGBoost and LightGBM:

california (regression, RMSE):
  boosters:  0.4071±0.0050  train_time: 0.090s
  lightgbm:  0.4087±0.0061  train_time: 0.136s
  xgboost:   0.4089±0.0057  train_time: 0.127s
 
synthetic_reg_medium (RMSE):
  boosters:  0.1710±0.0039  train_time: 0.581s
  lightgbm:  0.2046±0.0029  train_time: 0.811s
  xgboost:   0.2043±0.0036  train_time: 1.660s

The numbers showed something surprising: boosters wasn’t just close to XGBoost—on some benchmarks, it was faster. Leaf-wise growth on medium-sized datasets hit a sweet spot in my implementation.

But accuracy parity was the real validation. Every test compares boosters’ predictions against XGBoost/LightGBM. Bit-for-bit matching isn’t always possible (floating-point ordering differs), but predictions within 1e-6 tolerance on the same data? That I could achieve.
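
In test code, that tolerance check is nothing fancy (a sketch; ours and reference are assumed to be prediction vectors from the two libraries on the same data):

// Parity check: every prediction must agree with the reference within 1e-6.
fn assert_close(ours: &[f64], reference: &[f64]) {
    assert_eq!(ours.len(), reference.len());
    for (a, b) in ours.iter().zip(reference) {
        assert!((a - b).abs() < 1e-6, "prediction mismatch: {a} vs {b}");
    }
}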

Two and a half months after starting, I had a working library.


What I Learned

Building boosters taught me more about gradient boosting than any paper or course. Some lessons:

The algorithm is simple; the performance isn’t. The core boosting loop fits in 50 lines. Making it fast took 10,000 lines of cache-aware, memory-optimized Rust.

Histograms are the key innovation. Pre-XGBoost, gradient boosting was slow. Histogram-based training made it competitive with random forests. Everything else—subtraction trick, GOSS, EFB—is optimization on top.

Documentation prevents mistakes. Writing RFCs before code forced me to think through edge cases. The histogram cache design went through three revisions on paper before I wrote any code.

Rust is a good fit. Zero-cost abstractions, fearless concurrency, no GC pauses. The language pushed me toward correct-by-construction designs (ownership for histogram slots, lifetimes for data views).

The established libraries are excellent. XGBoost and LightGBM have survived a decade of production use. My implementation found no bugs in their algorithms—just alternative ways to express the same ideas.


What’s Still Missing

boosters isn’t complete. Some important features are still on the roadmap:

  • GPU training: CUDA/Metal histogram building would significantly speed up training on large datasets
  • Monotonic constraints: Enforcing that predictions increase (or decrease) with a feature
  • Interaction constraints: Limiting which features can interact in splits

These are planned but not yet implemented.


Why Rust?

Beyond performance, Rust enables things that are hard in C++:

WebAssembly: boosters compiles to WASM, enabling client-side inference in browsers. No server round-trip needed for predictions.

MUSL builds: Static linking with MUSL produces single-binary executables with no dynamic dependencies. This is invaluable for production deployments—no more “works on my machine” issues with glibc versions.

Memory safety: Rust’s ownership system prevents entire classes of bugs. No use-after-free, no data races (without unsafe).


Try It

boosters is open source:

pip install boosters

from boosters.sklearn import GBDTRegressor
 
model = GBDTRegressor(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

The documentation includes everything from this blog series, plus an API reference and tutorials.

When to Use boosters

For research: I already use it myself. The sklearn-compatible API makes it a drop-in replacement for experimentation.

For production: Be cautious. The API may still change, and while there are no known algorithmic bugs, make sure you understand what you’re doing. If you need GPU training or monotonic constraints, stick with XGBoost/LightGBM for now.


Series Conclusion

This series covered gradient boosting from the intuition of iterative residual fitting to the implementation details of histogram caches. The key ideas:

  1. Boosting builds predictions additively, each model correcting the previous ensemble’s errors
  2. Gradients tell us which direction to correct; Hessians tell us how far
  3. Trees are trained using the split gain formula, which falls directly out of the second-order Taylor expansion
  4. Histograms make split finding O(bins) instead of O(samples)
  5. Growth strategies (depth-wise vs leaf-wise) trade off efficiency and overfitting risk
  6. Sampling (GOSS, column subsampling) provides regularization and speed
  7. Categorical handling and EFB optimize for real-world data patterns
  8. Hyperparameters mostly control the bias-variance tradeoff

These ideas are implemented in XGBoost, LightGBM, CatBoost, and now boosters. The libraries differ in defaults and details, but the core algorithm is the same.

If you’ve followed along, you now understand gradient boosting at a level deeper than most practitioners. You can read the source code. You can reason about why one configuration outperforms another. You could, if you wanted, build your own.

Thanks for reading.


References

  1. boosters Documentation

  2. boosters GitHub Repository

  3. Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD 2016.

  4. Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS 2017.