If you’ve spent any time with tabular machine learning, you’ve encountered gradient boosting. XGBoost, LightGBM, CatBoost: these libraries dominate Kaggle competitions and power countless production systems. But what’s actually happening under the hood?

This post builds the intuition for gradient boosting from scratch. We’ll start with a simple question (why would you ever want a “weak” model?) and work our way to the core boosting algorithm. No heavy math yet; that comes in Part 2. Here, we focus on understanding why this approach works.


The Prediction Problem

Let’s ground ourselves. We have training data: input features xᵢ and target values yᵢ. We want a function F(x) that predicts y well on new, unseen data.

Notation

Throughout this series, F denotes our prediction function (the ensemble), h denotes an individual weak learner we’re adding, and subscripts like Fₘ indicate the ensemble after m boosting rounds.

The obvious approach: find one really good model. Train a deep neural network, or a complex decision tree, or some other powerful learner. Let it absorb all the patterns in the data.

This works, sometimes. But powerful models have a problem: they’re prone to overfitting. A deep tree can memorize the training data perfectly, including its noise and quirks, and then fail spectacularly on new data.

So here’s an alternative idea: what if we combined many simple models instead?


Weak Learners: Intentionally Simple

A weak learner is a model with limited capacity; it can only represent simple patterns. In practice, this typically means:

  • A shallow decision tree (depth 3-6, controlled by max_depth)
  • A tree with few leaves (controlled by num_leaves)
  • Any model constrained to capture only broad patterns

What's a Decision Tree?

A decision tree makes predictions by asking a sequence of yes/no questions about the features. “Is age > 30?” then “Is income > 50k?” then “Predict: high risk.” Each question is a split, and the final answers are leaves. A shallow tree has few questions, so it can only make coarse distinctions.

Why would we want such a limited model? Because constrained models have a crucial property: they don’t overfit easily. A tree with 3 levels of splits can’t memorize a million training examples. It can only capture broad, robust patterns.
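To make this concrete, here’s a small sketch (assuming scikit-learn; the noisy sine data is made up purely for illustration) contrasting a depth-3 tree with an unconstrained one:

```python
# A minimal sketch, assuming scikit-learn; the sine data is illustrative.
# A depth-3 tree has at most 2**3 = 8 leaves, so it can only capture the
# broad shape of the signal, while an unconstrained tree can give nearly
# every training sample its own leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)  # signal + noise

shallow = DecisionTreeRegressor(max_depth=3).fit(X, y)   # weak learner
deep = DecisionTreeRegressor().fit(X, y)                 # unconstrained

print("shallow leaves:", shallow.get_n_leaves())  # at most 8
print("deep leaves:", deep.get_n_leaves())        # close to 1000
```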

The catch is obvious: a single weak learner gives poor predictions. It’s too simple to model complex relationships.

But what if we could build a team of weak learners, each contributing their small piece of understanding?

On Terminology

In theoretical computer science, “weak learner” has a precise definition from PAC learning: any classifier with accuracy slightly better than random guessing. In gradient boosting practice, we use it more loosely to mean “a model with limited complexity.” The concepts are related; both capture the idea of models that are individually limited but collectively powerful.


Combining Models: From Averaging to Boosting

The simplest way to combine models is averaging. Train 10 decision trees independently, average their predictions. This is the core idea behind Random Forests, where each tree is trained on a random subset of data and features. The diversity among trees helps reduce variance, and their uncorrelated errors tend to cancel out.
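Here’s a minimal sketch of that averaging idea (scikit-learn, ten trees, and synthetic data are my own illustrative choices, not a particular library’s bagging implementation):

```python
# A minimal sketch of bagging-style averaging, assuming scikit-learn;
# data and tree count are illustrative. Each tree sees a bootstrap
# sample, and the ensemble prediction is the plain mean of the trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
    trees.append(DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx]))

averaged = np.mean([tree.predict(X) for tree in trees], axis=0)
```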

But averaging treats all models equally and independently. What if we could be smarter about how we combine them?

Boosting takes a different approach. Instead of training models independently and averaging, we train them sequentially, where each new model specifically focuses on fixing the mistakes of the previous ones.

Here’s the key insight: after training the first model, look at where it’s wrong. Train the second model to predict those errors. Now the combination of both models is better than either alone.

This is iterative refinement. Each new model is a correction to the ensemble so far.


The Residual Intuition

Let’s make this concrete with an example.

House Price Prediction

Suppose we’re predicting house prices, and our first model gives these predictions:

House    True Price    Prediction    Residual
A        300k          280k          +20k
B        450k          460k          -10k
C        200k          180k          +20k
D        500k          500k          0

The residual is what’s left over: true value - prediction. It tells us how wrong we are and in which direction.

Now, instead of training the second model on the original prices, we train it on the residuals. The second model h₂ (another decision tree) learns to predict these corrections. If it does a reasonable job, our combined prediction F₁(x) + h₂(x) will be closer to the truth than F₁(x) alone.

We can repeat this process:

  1. Compute residuals: rᵢ = yᵢ − Fₘ₋₁(xᵢ)
  2. Train a new weak learner hₘ to predict the residuals rᵢ
  3. Update: Fₘ(x) = Fₘ₋₁(x) + hₘ(x)
  4. Repeat

Each iteration reduces the residuals. The ensemble gets better and better.
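Here’s a minimal sketch of a single correction round (again assuming scikit-learn; the data and variable names are illustrative):

```python
# A minimal sketch of one correction round, assuming scikit-learn.
# Fit a first shallow tree, compute its residuals, fit a second tree
# on those residuals, and compare errors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)

f1 = DecisionTreeRegressor(max_depth=2).fit(X, y)      # first weak model
r = y - f1.predict(X)                                  # its residuals
h2 = DecisionTreeRegressor(max_depth=2).fit(X, r)      # learn the errors

mse_before = np.mean((y - f1.predict(X)) ** 2)
mse_after = np.mean((y - (f1.predict(X) + h2.predict(X))) ** 2)
print(mse_before, mse_after)   # the combined prediction has lower error
```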


From Residuals to Gradients

The residual approach above is clean and intuitive, but it’s actually a special case of something more general.

What's a Gradient?

In calculus, a gradient tells you the slope of a function: which direction makes the function increase fastest. For a loss function, the gradient points toward higher loss. So to reduce loss, we move in the opposite direction: the negative gradient.

When we minimize squared error (that is, L(y, F) = ½ (y − F)²), the residual y − F happens to equal the negative gradient of the loss with respect to the prediction: −∂L/∂F = y − F. The residual points toward lower loss, which is exactly where we want to go.

This connection to gradients is why the method is called gradient boosting. And it unlocks a powerful generalization: we can use any differentiable loss function, not just squared error.

  • For classification, we use logistic loss (the model predicts log-odds, which we convert to probabilities)
  • For robust regression, we might use absolute error (less sensitive to outliers)
  • For ranking, we might use pairwise losses

In each case, the negative gradient tells us which direction reduces the loss. We train each weak learner (each tree) to approximate that direction.

We’ll derive this formally in Part 2. For now, the key intuition is: the negative gradient points toward improvement, and boosting follows that direction step by step.
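As a quick sanity check, the sketch below (illustrative, NumPy only, with made-up values) computes the negative gradient for squared error and for logistic loss; in both cases it takes the form “target minus current prediction” on the appropriate scale:

```python
# An illustrative numeric check (NumPy only; the values are made up).
# Squared error with the 1/2 factor: L = 0.5 * (y - F)**2, so the
# negative gradient -dL/dF is exactly the residual y - F.
# Logistic loss on log-odds F with labels in {0, 1}:
# L = log(1 + exp(F)) - y * F, so -dL/dF = y - sigmoid(F).
import numpy as np

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.3, -0.2, 1.5])            # current raw predictions

neg_grad_squared = y - F                  # residual = negative gradient
neg_grad_logistic = y - 1.0 / (1.0 + np.exp(-F))

print(neg_grad_squared)
print(neg_grad_logistic)
```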


The Shrinkage Trick

There’s one more ingredient that makes boosting work well in practice: shrinkage (called learning_rate in XGBoost and LightGBM).

Instead of adding each new model at full strength (Fₘ = Fₘ₋₁ + hₘ), we scale it down:

Fₘ(x) = Fₘ₋₁(x) + η · hₘ(x)

where η (the learning rate) is a small number like 0.1 or 0.01.

This seems counterproductive; we’re deliberately making each correction smaller! But shrinkage provides crucial regularization. By taking small steps, we:

  • Reduce the risk of any single model overfitting to noise
  • Give later models a chance to refine earlier corrections
  • Create a smoother optimization path

The tradeoff: smaller η requires more boosting rounds (controlled by n_estimators). A common practice is to set a small learning rate (0.01-0.1) and use early stopping to find the right number of rounds, stopping when validation performance stops improving.

Rule of Thumb

Start with learning_rate=0.1 and n_estimators=1000 with early stopping. Lower the learning rate and increase iterations if you have the compute budget.
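One way to wire this up (a sketch using scikit-learn’s GradientBoostingRegressor on synthetic data; XGBoost and LightGBM expose the same ideas under slightly different parameter names) looks like this:

```python
# A minimal sketch, assuming scikit-learn and synthetic data: a small
# learning rate, a generous round budget, and early stopping on a
# held-out validation fraction.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(2000, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=2000)

model = GradientBoostingRegressor(
    learning_rate=0.1,        # shrinkage
    n_estimators=1000,        # upper bound on boosting rounds
    max_depth=3,              # keep each tree weak
    validation_fraction=0.1,  # held-out split used for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without improvement
)
model.fit(X, y)
print("rounds actually used:", model.n_estimators_)
```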


The Complete Algorithm

Putting it all together, here’s gradient boosting for squared error:

Algorithm: Gradient Boosting (Squared Error)

Input:
  - Training data (X, y) with n samples
  - Number of rounds M (n_estimators)
  - Learning rate η (learning_rate)
  - Weak learner constraints (e.g., max_depth)

Output: Ensemble model F_M

1. Initialize F₀(x) = mean(y)

2. For m = 1 to M:
    a. Compute residuals for each sample:
       rᵢ = yᵢ - F_{m-1}(xᵢ)
    
    b. Fit a weak learner hₘ to targets (X, r):
       hₘ = fit_tree(X, r, max_depth=...)
    
    c. Update the ensemble:
       Fₘ(x) = F_{m-1}(x) + η · hₘ(x)

3. Return F_M

That’s it. The magic is in the iteration: each tree sees a progressively easier problem (smaller residuals), and their sum captures complexity that no single tree could.
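For readers who like to see the loop spelled out, here is a from-scratch sketch of the same algorithm (assuming scikit-learn for the weak learners; this is an illustration, not how the optimized libraries implement it):

```python
# A from-scratch sketch of the algorithm above, assuming scikit-learn for
# the weak learners. It mirrors the pseudocode; real libraries add
# histograms, regularized splits, and parallelism on top.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gbm(X, y, M=100, eta=0.1, max_depth=3):
    """Gradient boosting for squared error; returns (F0, eta, trees)."""
    F0 = y.mean()                      # 1. initialize with the mean
    F = np.full(len(y), F0)
    trees = []
    for _ in range(M):                 # 2. boosting rounds
        r = y - F                      # 2a. residuals
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)  # 2b.
        F += eta * h.predict(X)        # 2c. shrunken update
        trees.append(h)
    return F0, eta, trees              # 3. the fitted ensemble


def predict_gbm(model, X):
    F0, eta, trees = model
    return F0 + eta * sum(h.predict(X) for h in trees)
```

Calling fit_gbm(X, y) and then predict_gbm(model, X_new) reproduces the ensemble F_M from the pseudocode.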

Training vs Inference

Training is inherently sequential: you can’t train round m until you’ve computed residuals from round m−1. Inference is different: each sample can be predicted independently, and in practice we batch predictions for cache efficiency. However, a model with 1000 trees means 1000 tree traversals per prediction, so both latency and model size scale linearly with the number of trees.


Why Does This Work So Well?

Gradient boosting’s success comes from a few complementary factors:

Bias-variance management: Weak learners start with high bias (too simple) but low variance (stable predictions). Boosting gradually reduces bias by adding corrections. Shrinkage and tree constraints prevent the variance from growing too large.

Automatic feature handling: Tree-based boosting naturally handles mixed feature types (numeric, categorical), doesn’t require feature scaling, and automatically captures interactions through splits.

Flexibility: The framework works with any differentiable loss. Classification, regression, ranking, survival analysis: gradient boosting adapts to all of them.

Multiple regularization knobs: Besides shrinkage, practitioners can tune tree depth, minimum samples per leaf, row and column subsampling, and more. This makes it possible to control overfitting across diverse problems.


When to Use Gradient Boosting

Gradient boosting excels at:

  • Tabular data: Structured data with rows and columns (as opposed to images, text, or sequences)
  • Mixed feature types: Handles numeric and categorical features naturally
  • Moderate-sized datasets: Thousands to millions of samples
  • When accuracy matters: GBMs consistently win competitions on tabular benchmarks
  • Production stability: Trained models are fully deterministic; same input always gives same output

Limitations

  • Cannot extrapolate: Trees partition the feature space and assign constant values to each region. They cannot predict beyond the range of target values seen during training, and they struggle when test features fall outside training distributions (see the sketch after this list).
  • Very small data: May overfit despite regularization; simpler models might generalize better.
  • Unstructured data: For images or text, deep learning typically wins.
  • Interpretability: Ensembles of hundreds of trees are hard to explain directly (though SHAP helps; see Part 8).
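The extrapolation limitation is easy to demonstrate (a small illustrative sketch with scikit-learn and made-up data):

```python
# A small illustrative sketch, assuming scikit-learn and made-up data:
# a tree trained on x in [0, 10] returns a constant leaf value for any
# input beyond that range, so it cannot follow the trend past x = 10.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X[:, 0]                        # a simple linear trend

tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(tree.predict([[9.0], [20.0], [100.0]]))
# x = 20 and x = 100 fall into the same rightmost leaf as the largest
# training inputs, so their predictions are capped near 3 * 10 = 30.
```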

Try It Yourself

To experiment with gradient boosting, the official documentation for XGBoost, LightGBM, and scikit-learn all provide excellent tutorials with runnable examples.


A Brief History

Gradient boosting didn’t appear fully formed. It evolved through several key contributions:

Year    Development
1990    Boosting concept (Schapire): Proved weak learners can be combined into strong learners
1997    AdaBoost (Freund & Schapire): Boosting via adaptive sample reweighting
1999    Gradient descent view (Mason et al.): Connected boosting to optimization in function space
2001    Gradient Boosting Machines (Friedman): Shrinkage, arbitrary losses, the framework we use today
2016    XGBoost (Chen & Guestrin): Systems innovations that made GBM practical at scale
2017    LightGBM (Ke et al.): Histogram-based training, leaf-wise growth
2017    CatBoost (Prokhorenkova et al.): Improved categorical handling, ordered boosting

The theoretical foundations were laid in the 1990s, but it took until the mid-2010s for implementation innovations to make gradient boosting the dominant algorithm for tabular data.


What’s Next

We’ve built the intuition: gradient boosting trains weak learners sequentially, each one correcting the errors of the ensemble so far. The gradient tells us which direction to correct, and shrinkage keeps us from overcorrecting.

But we’ve glossed over some crucial details:

  • What exactly is “gradient descent in function space”?
  • How do we find the optimal leaf values?
  • What makes XGBoost and LightGBM so fast?

The next post, Functional Gradient Descent, formalizes the math. We’ll derive the pseudo-residual, show how it generalizes beyond squared error, and build toward the split gain formula that powers modern implementations.


Summary

Gradient boosting builds a prediction by summing weak learners, each trained to correct the errors of the previous ensemble:

F_M(x) = F₀(x) + η · (h₁(x) + h₂(x) + … + h_M(x))

Key ideas:

  • Weak learners are intentionally simple (e.g., shallow trees via max_depth) to avoid overfitting
  • Residuals show where the current model is wrong; the next model learns to predict them
  • Shrinkage (learning_rate) regularizes by taking small steps
  • Gradients generalize residuals to any differentiable loss function

This foundation enables everything that follows: efficient split finding, tree growth strategies, sampling optimizations, and more.


References

  1. Friedman, J.H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics, 29(5), 1189-1232. PDF

  2. Schapire, R.E. (1990). “The Strength of Weak Learnability”. Machine Learning, 5(2), 197-227. PDF

  3. Freund, Y. & Schapire, R.E. (1997). “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”. Journal of Computer and System Sciences, 55(1), 119-139. PDF

  4. Chen, T. & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System”. KDD 2016. arXiv