In Part 1, we built the intuition: gradient boosting trains weak learners sequentially, each one correcting the errors of the previous ensemble. We saw that for squared error, the “correction” is just the residual.
But where does the “gradient” in gradient boosting come from? Why does training on residuals work? And how do we extend this to classification, ranking, and other tasks?
This post answers those questions by formalizing gradient boosting as optimization in function space. We’ll derive the pseudo-residual, work through concrete loss functions, and see why this framework is so powerful.
From Parameter Space to Function Space
Let’s start with what you probably already know: gradient descent.
Gradient Descent Refresher
Given parameters $\theta$ and a loss function $L(\theta)$, gradient descent updates:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$

We compute the gradient (the direction of steepest ascent) and move in the opposite direction (toward lower loss). The learning rate $\eta$ controls the step size.
This works when your model has a fixed structure and you’re just tuning its parameters. A neural network, for instance, has weights that gradient descent adjusts.
But in gradient boosting, we’re not adjusting parameters of a single model. We’re adding entirely new models to an ensemble. The “thing we’re optimizing” is the prediction function $F(x)$ itself.
How do you do gradient descent when your variable is a function?
The Functional View
Here’s the key insight: at each training point $x_i$, the prediction $F(x_i)$ is just a number. We can think of the function $F$ as a vector of length $n$ (one component per training sample):

$$\mathbf{F} = \big(F(x_1), F(x_2), \ldots, F(x_n)\big)$$

Now our loss becomes a function of this vector: $L(\mathbf{F}) = \sum_{i=1}^{n} \ell(y_i, F(x_i))$, where $\ell$ is the per-sample loss (e.g., squared error, logistic loss).

We can take the gradient of this loss with respect to the predictions:

$$\frac{\partial L}{\partial F(x_i)} = \frac{\partial \ell(y_i, F(x_i))}{\partial F(x_i)}, \quad i = 1, \ldots, n$$
This is the direction in prediction space that increases the loss fastest. To reduce the loss, we want to move in the opposite direction: the negative gradient.
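To make the functional view concrete, here’s a small numpy sketch (with made-up values, not from the post) that checks this numerically for squared error: the gradient of the total loss with respect to the prediction vector is $F(x_i) - y_i$, component by component.

```python
import numpy as np

# Treat the predictions at the n training points as a plain vector F.
y = np.array([3.0, -1.0, 2.0, 0.5])
F = np.array([2.5,  0.0, 2.0, 1.5])

def total_loss(F):
    """Total squared-error loss summed over the training points."""
    return 0.5 * np.sum((y - F) ** 2)

# Analytic gradient with respect to the prediction vector: dL/dF_i = F_i - y_i.
analytic_grad = F - y

# Finite-difference check: perturb one prediction at a time.
eps = 1e-6
numeric_grad = np.array([
    (total_loss(F + eps * np.eye(len(F))[i]) - total_loss(F - eps * np.eye(len(F))[i])) / (2 * eps)
    for i in range(len(F))
])

print(analytic_grad)   # [-0.5  1.   0.   1. ]
print(numeric_grad)    # matches to numerical precision
```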
The Pseudo-Residual
Here’s where the magic happens. We want to update our function the same way gradient descent updates parameters: $\mathbf{F} \leftarrow \mathbf{F} - \eta \, \nabla_{\mathbf{F}} L$.

But we can’t just add arbitrary vectors to $\mathbf{F}$. We need to add a function that generalizes to new data. So we train a weak learner $h_m$ to approximate the negative gradient: $h_m(x_i) \approx -\partial \ell(y_i, F_{m-1}(x_i)) / \partial F_{m-1}(x_i)$.

The weak learner is trained to predict the negative gradient at each training point. Its target at sample $i$ is:

$$r_i = -\left[\frac{\partial \ell(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}$$
This target is called the pseudo-residual. For regression losses, it’s closely related to the actual residual; for other losses, it’s more general.
Why "Pseudo-Residual"?
For squared error loss, the pseudo-residual equals the residual: $r_i = y_i - F(x_i)$. For other losses, the pseudo-residual is the negative gradient, which points toward improvement but isn’t exactly the “error” in the intuitive sense. Friedman (2001) coined the term to emphasize this generality.

Once we’ve trained $h_m$ to fit these pseudo-residuals, we update: $F_m = F_{m-1} + \eta \, h_m$.
And repeat. Each iteration adds a function that pushes predictions in the direction of lower loss.
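Before adding trees to the picture, here’s a deliberately simplified numpy sketch of this loop for squared error (toy targets, no weak learner): it moves the prediction vector directly along the negative gradient, which is exactly what fitting $h_m$ to the pseudo-residuals approximates, with the crucial difference that $h_m$ is a function of $x$ and therefore generalizes.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0, 0.5])
F = np.zeros_like(y)        # start from all-zero predictions
eta = 0.3

def loss(F):
    return 0.5 * np.sum((y - F) ** 2)

for m in range(5):
    # Negative gradient at each training point = pseudo-residual (here: y_i - F(x_i)).
    r = y - F
    # Idealized step: nudge the prediction vector directly along the negative gradient.
    # Real gradient boosting instead fits a weak learner h_m(x) ≈ r, so the step is a
    # function of x that can be applied to new data.
    F = F + eta * r
    print(f"round {m+1}: loss = {loss(F):.4f}")   # loss shrinks every round
```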
Deriving Pseudo-Residuals for Common Losses
Let’s work through the math for the most common loss functions.
Squared Error (Regression)
The loss for a single sample is $\ell(y, F) = \tfrac{1}{2}(y - F)^2$. (The $\tfrac{1}{2}$ is for convenience; it cancels the factor of 2 when we differentiate.)

Taking the derivative: $\frac{\partial \ell}{\partial F} = -(y - F) = F - y$.

The negative gradient (our pseudo-residual) is:

$$r = -(F - y) = y - F$$
This is exactly the residual. Training on residuals is functional gradient descent for squared error.
Squared Error Pseudo-Residual
If the true value exceeds the current prediction by 20 (i.e., $y - F(x) = 20$):

- Gradient: $F - y = -20$ (loss decreases if we increase the prediction)
- Pseudo-residual: $y - F = +20$ (nudge the prediction up by 20)
Logistic Loss (Binary Classification)
For binary classification with labels $y \in \{0, 1\}$, the log-loss is: $\ell(y, F) = -\big[\,y \log p + (1 - y)\log(1 - p)\,\big]$, where $p = \sigma(F)$ is the sigmoid of the raw score and $F$ is the log-odds.

This simplifies to $\ell(y, F) = \log(1 + e^{F}) - yF$.

Taking the derivative and simplifying: $\frac{\partial \ell}{\partial F} = \sigma(F) - y = p - y$.

The pseudo-residual is:

$$r = y - p$$

where $p = \sigma(F)$ is the predicted probability.
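If you’d rather not do the algebra by hand, a quick sympy check (my own snippet, assuming sympy is available) confirms the derivative:

```python
import sympy as sp

F, y = sp.symbols('F y', real=True)
loss = sp.log(1 + sp.exp(F)) - y * F      # logistic loss in terms of the raw score F

grad = sp.simplify(sp.diff(loss, F))
print(grad)                                # exp(F)/(exp(F) + 1) - y, i.e. sigma(F) - y
```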
Logistic Loss Pseudo-Residual
If $y = 1$ (positive class) and $p = 0.3$ (model predicts 30% probability):

- Pseudo-residual: $r = y - p = 1 - 0.3 = +0.7$
- Interpretation: the model should increase its raw score for this sample.
If $y = 0$ (negative class), whatever probability $p$ the model predicts:

- Pseudo-residual: $r = y - p = -p$, which is always negative
- Interpretation: the model should decrease its raw score for this sample.
Absolute Error (Robust Regression)
For absolute error (L1 loss): $\ell(y, F) = |y - F|$.

The derivative (for $y \neq F$) is: $\frac{\partial \ell}{\partial F} = -\,\mathrm{sign}(y - F)$.

The pseudo-residual is: $r = \mathrm{sign}(y - F)$.

This is just the sign of the residual: either $+1$ or $-1$. The magnitude doesn’t matter; absolute error only cares about the direction of the error, making it robust to outliers.
Summary of Common Losses
| Loss | Formula | Pseudo-Residual |
|---|---|---|
| Squared Error | $\tfrac{1}{2}(y - F)^2$ | $y - F$ |
| Absolute Error | $\lvert y - F \rvert$ | $\mathrm{sign}(y - F)$ |
| Logistic (binary) | $\log(1 + e^{F}) - yF$ | $y - \sigma(F)$ |
The pattern: compute $r_i = -\,\partial \ell(y_i, F(x_i)) / \partial F(x_i)$; that’s your training target.
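The table translates directly into code. Here’s a small numpy sketch (the function names are my own, for illustration); the last two lines reproduce the ±0.7 logistic example above.

```python
import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

# Pseudo-residuals r_i = -dl/dF for each loss in the table.
def pseudo_residual_squared(y, F):
    return y - F                       # the ordinary residual

def pseudo_residual_absolute(y, F):
    return np.sign(y - F)              # only the direction of the error

def pseudo_residual_logistic(y, F):
    return y - sigmoid(F)              # y - p, with p the predicted probability

y = np.array([1.0, 0.0])
F = np.array([-0.85, 0.85])            # raw scores; sigmoid gives p ≈ 0.30 and 0.70
print(pseudo_residual_logistic(y, F))  # ≈ [ 0.70, -0.70 ]
```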
The Complete Boosting Algorithm
Now we can write down gradient boosting in its general form:
Algorithm: Gradient Boosting (General)
```
Input:
  - Training data (X, y) with n samples
  - Differentiable loss function L(y, F)
  - Number of rounds M
  - Learning rate η

Initialize:
  F₀(x) = argmin_c Σᵢ L(yᵢ, c)
  (e.g., mean for squared error, log-odds for logistic)

For m = 1 to M:
  1. Compute pseudo-residuals:
     rᵢ = -∂L(yᵢ, F_{m-1}(xᵢ)) / ∂F_{m-1}(xᵢ)
  2. Fit weak learner hₘ to (X, r):
     hₘ = argmin_h Σᵢ (rᵢ - h(xᵢ))²
  3. Update ensemble:
     Fₘ = F_{m-1} + η · hₘ

Output: F_M
```
A few notes on this algorithm:
Initialization: We start with the constant that minimizes the loss over all training samples. For squared error, this is the mean of $y$. For logistic loss, it’s the log-odds of the positive class.
Initialization for Classification
If your training data has 70% positive samples ($\bar{y} = 0.7$): $F_0 = \log\frac{0.7}{0.3} \approx 0.847$. This gives an initial probability of $\sigma(0.847) = 0.7$, matching the class balance.
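A tiny numpy sketch of that initialization, using a toy label vector with the same 70/30 balance:

```python
import numpy as np

y = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])   # 70% positive
p0 = y.mean()                                   # 0.7
F0 = np.log(p0 / (1 - p0))                      # log-odds ≈ 0.847

# Passing F0 back through the sigmoid recovers the class balance.
print(F0, 1 / (1 + np.exp(-F0)))                # 0.8473..., 0.7
```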
Fitting the weak learner: We always fit to the pseudo-residuals using squared error, regardless of the original loss function. This is because we’re approximating the gradient direction, not directly optimizing the original loss.
Why squared error for the weak learner? The negative gradient is a continuous value at each point. Fitting a tree to these values with squared error finds a function that points (approximately) in the gradient direction. This works even when the original loss is not squared error.
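To make the pseudocode concrete, here’s a minimal Python sketch of the same loop, using scikit-learn’s DecisionTreeRegressor as the weak learner. It’s an illustration of the algorithm above under those assumptions, not how XGBoost or LightGBM are implemented, and the function names are my own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

def gradient_boost(X, y, loss="squared", M=100, eta=0.1, max_depth=3):
    """Minimal gradient boosting loop following the pseudocode above."""
    if loss == "squared":
        F0 = y.mean()                              # constant minimizing squared error
        pseudo_residual = lambda y, F: y - F
    elif loss == "logistic":                       # y in {0, 1}; F is the log-odds
        F0 = np.log(y.mean() / (1 - y.mean()))     # log-odds of the positive class
        pseudo_residual = lambda y, F: y - sigmoid(F)
    else:
        raise ValueError(f"unknown loss: {loss}")

    F = np.full(len(y), F0, dtype=float)
    trees = []
    for _ in range(M):
        r = pseudo_residual(y, F)                  # 1. pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                             # 2. fit weak learner to r with squared error
        F += eta * tree.predict(X)                 # 3. shrunken update of the ensemble
        trees.append(tree)
    return F0, trees

def predict_raw(X, F0, trees, eta=0.1):
    """Raw ensemble output F_M(x); apply the sigmoid to get probabilities for logistic loss."""
    F = np.full(len(X), F0, dtype=float)
    for tree in trees:
        F += eta * tree.predict(X)
    return F
```

Note that the same learning rate must be used at training and prediction time; real libraries store it with the model.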
Why Shrinkage Works
In Part 1, we saw that shrinkage (small ) improves generalization. Now we can understand why.
Gradient descent has a similar principle: if you take steps that are too large, you can overshoot the minimum and oscillate. Smaller steps converge more reliably, though they take longer.
In functional gradient descent, each weak learner approximates the gradient direction but doesn’t capture it perfectly. A full step ($\eta = 1$) would add the weak learner at full strength, potentially overcorrecting in some regions.

By using $\eta < 1$, we:
- Reduce the impact of imperfect gradient approximation
- Allow subsequent learners to refine the correction
- Create a smoother path through function space
Friedman’s original paper showed empirically that a small learning rate with more iterations typically outperforms a large learning rate with fewer iterations.
Shrinkage as Regularization
Shrinkage is equivalent to early stopping along a regularization path. With smaller $\eta$, the model takes more steps before reaching any given complexity level, allowing early stopping to find a better stopping point.
From First-Order to Second-Order
The algorithm above uses only the gradient (first derivative). Modern implementations like XGBoost and LightGBM also use the Hessian (second derivative) for better optimization.
The idea is to approximate the loss around the current prediction with a second-order Taylor expansion:

$$\ell\big(y_i, F_{m-1}(x_i) + h_m(x_i)\big) \approx \ell\big(y_i, F_{m-1}(x_i)\big) + g_i \, h_m(x_i) + \tfrac{1}{2} \, H_i \, h_m(x_i)^2$$

where:

- $g_i = \dfrac{\partial \ell(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}$ is the gradient
- $H_i = \dfrac{\partial^2 \ell(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)^2}$ is the Hessian
This gives us more information about the curvature of the loss function. In regions where the Hessian is large (high curvature), we should take smaller steps; where it’s small (flat), we can take larger steps.
We’ll derive how the Hessian leads to the optimal leaf values and the split gain formula in Part 3.
For now, here are the Hessians for common losses:
| Loss | Gradient ($g$) | Hessian ($H$) |
|---|---|---|
| Squared Error | $F - y$ | $1$ |
| Logistic | $p - y$ | $p(1 - p)$ |
| Absolute Error | $-\mathrm{sign}(y - F)$ | $0$ (see note) |
The Hessian for squared error is constant (1), so first-order and second-order methods are equivalent. For logistic loss, the Hessian $p(1 - p)$ is largest when $p = 0.5$ (uncertain predictions) and smallest when $p$ is near 0 or 1 (confident predictions).
Absolute Error and Zero Hessian
The Hessian of absolute error is 0 everywhere (except at $y = F$, where it’s undefined). This means second-order methods can’t be applied directly. In practice, implementations use a smoothed variant such as Huber loss, or they clip the Hessian to a small positive value.
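In code, the table reads as follows (a sketch; the function names are mine). Note how the logistic Hessian $p(1-p)$ shrinks for confident predictions, which is one reason implementations guard against very small Hessian sums (e.g., XGBoost’s min_child_weight).

```python
import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

# Gradients g = dl/dF and Hessians H = d²l/dF² for the losses in the table.
def grad_hess_squared(y, F):
    return F - y, np.ones_like(F)                 # constant curvature

def grad_hess_logistic(y, F):
    p = sigmoid(F)
    return p - y, p * (1.0 - p)                   # curvature peaks at p = 0.5

def grad_hess_absolute(y, F):
    return -np.sign(y - F), np.zeros_like(F)      # zero curvature (see the note above)

y = np.array([1.0, 0.0])
F = np.array([0.0, 2.0])
print(grad_hess_logistic(y, F))
# gradients ≈ [-0.50, 0.88], Hessians ≈ [0.25, 0.10]
```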
Putting It Into Practice
Let’s trace through one boosting round for logistic loss.
One Boosting Round (Binary Classification)
Setup: 4 training samples, current predictions $F = [0.5, 1.0, -1.0, -0.5]$, true labels $y = [1, 0, 1, 0]$

| Sample | $y_i$ | $F(x_i)$ | $p_i = \sigma(F(x_i))$ | Gradient ($p_i - y_i$) | Pseudo-residual ($y_i - p_i$) |
|---|---|---|---|---|---|
| 1 | 1 | 0.5 | 0.62 | -0.38 | +0.38 |
| 2 | 0 | 1.0 | 0.73 | +0.73 | -0.73 |
| 3 | 1 | -1.0 | 0.27 | -0.73 | +0.73 |
| 4 | 0 | -0.5 | 0.38 | +0.38 | -0.38 |

Steps:
- Compute pseudo-residuals (rightmost column)
- Fit a shallow tree to predict these pseudo-residuals from features
- Add the tree’s predictions (scaled by $\eta$) to the current $F$
- Repeat
Sample 3 has the largest pseudo-residual (+0.73): a positive example that the model currently thinks is negative ($p = 0.27$). The tree will try to increase predictions for samples that look like sample 3.
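If you want to verify the table, here’s a small numpy sketch that reproduces it:

```python
import numpy as np

y = np.array([1, 0, 1, 0], dtype=float)
F = np.array([0.5, 1.0, -1.0, -0.5])

p = 1.0 / (1.0 + np.exp(-F))   # predicted probabilities
grad = p - y                   # gradient of the logistic loss
r = y - p                      # pseudo-residuals (targets for the next tree)

print(np.round(p, 2))          # [0.62 0.73 0.27 0.38]
print(np.round(r, 2))          # [ 0.38 -0.73  0.73 -0.38]
```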
What’s Next
We’ve now formalized gradient boosting as functional gradient descent. The pseudo-residual emerges naturally as the negative gradient of the loss, and the algorithm generalizes to any differentiable loss function.
But we’ve been vague about what “fit a weak learner” actually means. How do we build trees that are good at fitting pseudo-residuals? How do we decide where to split?
The next post, Trees and the Split Gain Formula, dives into the details:
- Why trees are the standard weak learner
- How the Hessian leads to optimal leaf values
- The split gain formula that powers XGBoost and LightGBM
Summary
Functional gradient descent is gradient descent where the “variable” being optimized is a function rather than parameters:
- Compute the negative gradient of the loss at each training point: the pseudo-residual
- Fit a weak learner to approximate this gradient direction
- Add the weak learner (scaled by learning rate) to the ensemble
- Repeat
Key ideas:
- Pseudo-residual: $r_i = -\,\partial \ell(y_i, F(x_i)) / \partial F(x_i)$, the negative gradient of the loss
- For squared error, pseudo-residual = residual
- For logistic loss, pseudo-residual = $y - p$
- Shrinkage ensures we don’t overshoot; smaller $\eta$ = more regularization
- Second-order methods use the Hessian for better step sizes
References
- Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics, 29(5), 1189–1232.
- Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). “Boosting Algorithms as Gradient Descent in Function Space”. NIPS 1999.
- Chen, T., & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System”. KDD 2016.