Gradient Boosting#
Gradient boosting is a method for building predictive models by iteratively adding weak learners that correct the errors of the ensemble. It is not an algorithm itself, but a framework that can be instantiated with different base learners:
- GBDT (Gradient Boosted Decision Trees): uses decision trees as base learners
- GBLinear: uses linear models as base learners
This document explains the theoretical foundations common to all gradient boosting variants.
The Core Idea#
Gradient boosting builds a prediction model as a sum of weak learners:
\[
F(x) = \sum_{i=1}^{M} f_i(x)
\]
Each weak learner \(f_i\) is trained to predict the residual errors of all previous learners combined. The key insight is that we can use gradient descent in function space: instead of adjusting parameters, we add entire functions that point in the direction of steepest descent.
Functional Gradient Descent#
Consider minimizing the empirical loss over a training set \(\{(x_i, y_i)\}_{i=1}^{n}\):
\[
\mathcal{L}(F) = \sum_{i=1}^{n} L(y_i, F(x_i))
\]
In standard gradient descent, we update parameters: \(\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}\)
In functional gradient descent, we update the function itself:
\[
F \leftarrow F - \eta \nabla_F \mathcal{L}
\]
The negative functional gradient evaluated at a training point \(x_i\) is:
\[
r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}
\]
This quantity is called the pseudo-residual: it tells us how the prediction at \(x_i\) should change to reduce the loss fastest.
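For the squared-error loss, for example, the pseudo-residual reduces to the ordinary residual, which is where the name comes from:
\[
L(y, F) = \tfrac{1}{2}(y - F)^2
\quad\Longrightarrow\quad
r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} = y_i - F(x_i)
\]
Fitting the next learner to the pseudo-residuals is therefore exactly fitting it to what the current ensemble still gets wrong.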
The Boosting Algorithm#
At each iteration \(m\):
1. Compute the pseudo-residuals: \(r_i^{(m)} = -\frac{\partial L(y_i, F^{(m-1)}(x_i))}{\partial F^{(m-1)}(x_i)}\)
2. Fit a weak learner \(h_m\) to the pseudo-residuals: \(h_m \approx \arg\min_h \sum_i (r_i^{(m)} - h(x_i))^2\)
3. Update the ensemble: \(F^{(m)} = F^{(m-1)} + \eta \cdot h_m\)
The learning rate \(\eta\) controls how much each weak learner contributes.
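This loop translates almost line-for-line into code. The sketch below is a minimal illustration for the squared-error loss, using shallow scikit-learn regression trees as weak learners; the function and parameter names (`fit_gradient_boosting`, `n_rounds`, and so on) are illustrative rather than taken from any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting sketch for the squared-error loss."""
    # F^(0): the constant prediction that minimizes squared error is the mean
    base_score = float(np.mean(y))
    F = np.full(len(y), base_score)
    learners = []
    for m in range(n_rounds):
        # Step 1: pseudo-residuals; for 1/2 (y - F)^2 they are simply y - F
        residuals = y - F
        # Step 2: fit a weak learner to the pseudo-residuals
        h = DecisionTreeRegressor(max_depth=max_depth)
        h.fit(X, residuals)
        # Step 3: shrunken additive update F^(m) = F^(m-1) + eta * h_m
        F = F + learning_rate * h.predict(X)
        learners.append(h)
    return base_score, learners

def predict(X, base_score, learners, learning_rate=0.1):
    """Evaluate the ensemble: base score plus the shrunken sum of weak learners."""
    F = np.full(X.shape[0], base_score)
    for h in learners:
        F = F + learning_rate * h.predict(X)
    return F
```

A full implementation would also store the learning rate with the model and support early stopping, but the three steps above are the entire algorithm.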
Loss Functions and Gradients#
Different loss functions lead to different gradient boosting behaviors:
| Task | Loss Function | Negative Gradient (Pseudo-Residual) |
|---|---|---|
| Regression | \(\frac{1}{2}(y - F)^2\) | \(y - F\) (the ordinary residual) |
| Classification | \(\log(1 + e^{-yF})\), with \(y \in \{-1, +1\}\) | \(y \cdot \sigma(-yF)\) |
| Ranking | LambdaRank | Pairwise gradients |
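As a sketch of the first two rows, the pseudo-residuals can be computed directly. The helper names below are illustrative, and the logistic case assumes labels in \(\{-1, +1\}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pseudo_residual_squared(y, F):
    # L = 1/2 (y - F)^2  ->  -dL/dF = y - F
    return y - F

def pseudo_residual_logistic(y, F):
    # L = log(1 + exp(-y F)) with y in {-1, +1}  ->  -dL/dF = y * sigmoid(-y F)
    return y * sigmoid(-y * F)
```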
Second-Order Approximation#
XGBoost and LightGBM use a second-order Taylor expansion of the loss around the current prediction \(F\):
\[
L(y, F + h) \approx L(y, F) + g \cdot h + \frac{1}{2} H \cdot h^2
\]
where:
- \(g = \frac{\partial L}{\partial F}\) is the gradient (first derivative)
- \(H = \frac{\partial^2 L}{\partial F^2}\) is the Hessian (second derivative)
Using the Hessian allows better approximation and enables Newton-like updates instead of pure gradient descent. This is why modern implementations track both gradient and Hessian for each sample.
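For the logistic loss, for instance, both derivatives have simple closed forms, and a leaf's Newton-style weight is the regularized ratio of their sums over the samples in that leaf. The sketch below is illustrative rather than library code; `reg_lambda` stands in for an L2 penalty on leaf weights, and labels are assumed to be in \(\{-1, +1\}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_hess(y, F):
    """Per-sample gradient and Hessian of log(1 + exp(-y F)), with y in {-1, +1}."""
    p = sigmoid(F)              # predicted probability of the positive class
    g = -y * sigmoid(-y * F)    # first derivative of the loss w.r.t. F
    h = p * (1.0 - p)           # second derivative (always positive)
    return g, h

def newton_leaf_weight(g, h, reg_lambda=1.0):
    """Newton step for one leaf: minimizes sum_i(g_i*w + 0.5*h_i*w^2) + 0.5*reg_lambda*w^2."""
    return -np.sum(g) / (np.sum(h) + reg_lambda)
```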
Regularization#
Gradient boosting is prone to overfitting. The most common regularization techniques are described below (a combined parameter sketch follows at the end of this section):
Shrinkage (Learning Rate)#
Reduce the contribution of each weak learner by scaling it with the learning rate:
\[
F^{(m)} = F^{(m-1)} + \eta \cdot h_m, \qquad 0 < \eta \le 1
\]
Smaller \(\eta\) requires more iterations but often achieves better generalization.
Subsampling#
Train each weak learner on a random subset of data:
- Row subsampling: a random fraction of the training examples
- Column subsampling: a random fraction of the features
Structural Regularization#
For tree-based learners:
- Limit tree depth
- Limit the number of leaves
- Require a minimum number of samples per leaf
- Apply L1/L2 regularization to leaf weights
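Taken together, these knobs map onto familiar library parameters. The following is an illustrative configuration using XGBoost's parameter names; the values are only reasonable starting points, not recommendations.

```python
params = {
    "eta": 0.05,              # shrinkage / learning rate: smaller values need more rounds
    "subsample": 0.8,         # row subsampling: fraction of examples used per boosting round
    "colsample_bytree": 0.8,  # column subsampling: fraction of features considered per tree
    "max_depth": 6,           # structural: limit tree depth
    "min_child_weight": 1.0,  # structural: minimum Hessian sum required in a leaf
    "lambda": 1.0,            # L2 regularization on leaf weights
    "alpha": 0.0,             # L1 regularization on leaf weights
}
```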
GBDT vs GBLinear#
The gradient boosting framework works with different base learners; the only requirement is that the loss function be differentiable. The two most common choices are compared below, and a parameter sketch for switching between them follows at the end of this section:
| Aspect | GBDT | GBLinear |
|---|---|---|
| Base learner | Decision tree | Linear model |
| Non-linearity | Captured by trees | None (linear) |
| Feature interactions | Automatic via splits | None |
| Interpretability | Lower | Higher |
| Speed | Slower | Faster |
| Best for | Complex patterns | Linear relationships |
When to Use Each#
GBDT excels when:
- Data has non-linear relationships
- Feature interactions matter
- You have sufficient training data
GBLinear excels when:
- Relationships are approximately linear
- You need fast training and inference
- Interpretability is important
- Data is very high-dimensional and sparse
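In XGBoost, both variants are selected through the `booster` parameter (implemented in the source files listed under Further Reading). A minimal sketch, assuming the `xgboost` Python package and using synthetic data purely to make the example self-contained:

```python
import numpy as np
import xgboost as xgb

# Synthetic regression data, only so the sketch runs end to end
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
dtrain = xgb.DMatrix(X, label=y)

# Tree booster: captures non-linearities and feature interactions
tree_params = {"booster": "gbtree", "max_depth": 6, "eta": 0.1,
               "objective": "reg:squarederror"}
tree_model = xgb.train(tree_params, dtrain, num_boost_round=200)

# Linear booster: faster and more interpretable for (near-)linear problems
linear_params = {"booster": "gblinear", "eta": 0.5, "lambda": 1.0,
                 "objective": "reg:squarederror"}
linear_model = xgb.train(linear_params, dtrain, num_boost_round=200)
```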
Historical Context#
Gradient boosting was introduced by Friedman (2001) as a generalization of AdaBoost. Key developments:
| Year | Development |
|---|---|
| 1997 | AdaBoost (Freund & Schapire) |
| 2001 | Gradient Boosting Machines (Friedman) |
| 2016 | XGBoost (Chen & Guestrin): second-order approximation, explicit regularization |
| 2017 | LightGBM (Ke et al.): histogram-based splits, leaf-wise growth |
| 2017 | CatBoost (Prokhorenkova et al.): native categorical-feature handling |
Modern implementations (XGBoost, LightGBM) add:
- Second-order approximation (gradient and Hessian)
- Histogram-based split finding (sketched below)
- Native missing-value handling
- Regularization built into the objective
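As an illustration of histogram-based split finding combined with the regularized second-order objective, the sketch below scans the bins of a single pre-binned feature for the best split. The function name and signature are illustrative; real implementations work on compressed histograms and many features in parallel.

```python
import numpy as np

def best_histogram_split(bin_ids, g, h, n_bins, reg_lambda=1.0):
    """Scan the histogram of one pre-binned feature for the split with the best
    second-order gain, using the score G^2 / (H + lambda) for each side."""
    # Accumulate gradient and Hessian sums per bin
    G = np.bincount(bin_ids, weights=g, minlength=n_bins)
    H = np.bincount(bin_ids, weights=h, minlength=n_bins)
    G_total, H_total = G.sum(), H.sum()

    def score(G_sum, H_sum):
        return G_sum * G_sum / (H_sum + reg_lambda)

    best_gain, best_bin = 0.0, None
    G_left = H_left = 0.0
    for b in range(n_bins - 1):          # candidate split: bins <= b go left
        G_left += G[b]
        H_left += H[b]
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                      - score(G_total, H_total))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```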
Further Reading#
In This Documentation#
Academic Papers#
- Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 29(5).
- Chen, T. & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System.” KDD 2016.
- Ke, G. et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” NeurIPS 2017.
Source Code#
- XGBoost: `src/gbm/gbtree.cc`, `src/gbm/gblinear.cc`
- LightGBM: `src/boosting/gbdt.cpp`