# GBLinear
GBLinear is a linear booster — an alternative to tree-based boosting that trains a generalized linear model (GLM) with elastic net regularization (L1 + L2).
It is a simpler model than GBDT, but it can be effective for problems with linear relationships or as a quick baseline.
## Contents

| Folder | Description |
|---|---|
|  | Coordinate descent optimization |
|  | Linear prediction |
## The Core Idea

A linear model predicts by summing weighted features:

\[ \hat{y} = \mathbf{w}^\top \mathbf{x} + b \]

GBLinear minimizes the elastic net objective:

\[ \min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \frac{\lambda_2}{2} \|\mathbf{w}\|_2^2 \]

where:

- \(\mathcal{L}(\mathbf{w})\) is the loss (squared error, logistic, etc.)
- \(\lambda_1\) controls L1 sparsity (feature selection)
- \(\lambda_2\) controls L2 weight decay (stability)
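Prediction is just a dot product plus a bias. A minimal numpy sketch (the weight and input values here are made up for illustration):

```python
import numpy as np

w = np.array([0.5, -1.2, 3.0])  # learned feature weights (illustrative)
b = 0.25                        # bias term
x = np.array([1.0, 2.0, 0.5])   # one input row

y_hat = w @ x + b  # 0.5*1.0 - 1.2*2.0 + 3.0*0.5 + 0.25 = -0.15
print(y_hat)
```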
## When to Use Linear vs Tree Models

| Use Case | Linear | Tree |
|---|---|---|
| Data has linear relationships | ✅ Good | Overkill |
| Need interpretable coefficients | ✅ Good | Limited |
| High-dimensional sparse data | ✅ Often better | Can overfit |
| Complex feature interactions | ❌ Can’t capture | ✅ Excels |
| Baseline comparison | ✅ Quick sanity check | — |
| Production latency matters | ✅ Simpler | More complex |
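In XGBoost's Python package the linear booster is selected with `booster="gblinear"`. A minimal training sketch (the synthetic data and parameter values are illustrative, not tuned):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)  # linear ground truth

dtrain = xgb.DMatrix(X, label=y)
params = {
    "booster": "gblinear",           # linear model instead of gbtree
    "objective": "reg:squarederror",
    "alpha": 0.01,                   # L1 penalty (lambda_1 above)
    "lambda": 0.1,                   # L2 penalty (lambda_2 above)
    "updater": "coord_descent",      # sequential updater; "shotgun" is the parallel one
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```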
## Linear Model Strengths

- **Speed**: Training and inference are very fast (just multiply-add)
- **Interpretability**: Coefficients directly show feature importance
- **Sparsity**: L1 regularization can zero out irrelevant features
- **Memory**: Minimal — just a weight vector
## Linear Model Limitations

- **No categorical features** — Requires numerical inputs only
- **No feature interactions** — Can’t capture XOR-like patterns
- **No tree-specific features** — No leaf indices, prediction layers, etc.
- **SHAP interactions** — Always zero (linear models have no interactions)
## Key Concepts

### Elastic Net Regularization

Combines L1 (Lasso) and L2 (Ridge) regularization:
| Type | Penalty | Effect |
|---|---|---|
| L1 (Lasso) | \(\sum \lvert w_i \rvert\) | Sparse weights — some become exactly 0 |
| L2 (Ridge) | \(\sum w_i^2\) | Small weights — stable, keeps all features |
| Elastic Net | Both | Best of both — sparse, stable, handles correlation |
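To make the Effect column concrete, here is a small sketch of the corresponding shrinkage (proximal) operators applied to a made-up weight vector, assuming unit step size:

```python
import numpy as np

w = np.array([-1.5, -0.05, 0.0, 0.08, 2.0])  # made-up weights
lam = 0.1

# L2 (ridge): proportional shrinkage — every weight gets smaller,
# but none become exactly zero
ridge = w / (1.0 + lam)

# L1 (lasso): soft thresholding — weights inside [-lam, lam]
# are snapped to exactly zero
lasso = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

print(ridge)  # [-1.364, -0.045, 0.0, 0.073, 1.818] (rounded)
print(lasso)  # [-1.4, 0.0, 0.0, 0.0, 1.9]
```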
### Coordinate Descent
Instead of updating all weights at once (gradient descent), coordinate descent updates one weight at a time while holding others fixed. For convex problems with separable regularization, each update has a closed-form solution.
### Soft Thresholding
The L1 penalty creates a “dead zone” around zero. The soft thresholding operator (proximal operator for L1) pushes small weights toward zero and can make them exactly zero — this is how L1 achieves sparsity.
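The following is a minimal sketch of both ideas for squared-error loss, assuming the objective \(\tfrac{1}{2}\|\mathbf{y} - X\mathbf{w}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \tfrac{\lambda_2}{2}\|\mathbf{w}\|_2^2\). It illustrates the algorithm, not XGBoost's actual C++ implementation:

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator for the L1 penalty: shrinks z toward zero and
    returns exactly 0 inside the dead zone [-lam, lam]."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def coordinate_descent(X, y, lam1, lam2, n_iters=100):
    """Elastic-net least squares via cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    residual = y - X @ w           # maintained incrementally
    col_sq = (X ** 2).sum(axis=0)  # ||x_j||^2 per feature
    for _ in range(n_iters):
        for j in range(d):
            # Closed-form minimizer over w_j with all other weights fixed
            rho = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            residual += X[:, j] * (w[j] - w_new)  # keep residual in sync
            w[j] = w_new
    return w

# Quick check on synthetic data with a sparse ground truth
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
w_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0])
y = X @ w_true
print(coordinate_descent(X, y, lam1=5.0, lam2=1.0).round(2))
```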
## Model Structure

The model is a weight matrix of shape `(num_features + 1) × num_output_groups`:

```text
┌──────────────────────────────────────────┐
│ Feature 0: [w₀₀, w₀₁, ..., w₀ₙ]          │ ← weights for each output group
│ Feature 1: [w₁₀, w₁₁, ..., w₁ₙ]          │
│ ...                                      │
│ Feature k: [wₖ₀, wₖ₁, ..., wₖₙ]          │
│ BIAS:      [b₀, b₁, ..., bₙ]             │ ← last row is bias (no regularization)
└──────────────────────────────────────────┘
```
| Configuration | Output Groups |
|---|---|
| Binary classification | 1 |
| K-class classification | K |
| Regression | 1 |
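From Python, the scikit-learn wrapper exposes this weight matrix directly; `coef_` and `intercept_` are only defined when `booster="gblinear"`:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0])  # known linear ground truth

reg = xgb.XGBRegressor(booster="gblinear", n_estimators=100)
reg.fit(X, y)

print(reg.coef_)       # one weight per feature (per output group for multiclass)
print(reg.intercept_)  # the bias row
```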
## XGBoost Source Files

| File | Purpose |
|---|---|
| `src/gbm/gblinear.cc` | Main booster implementation |
| `src/gbm/gblinear_model.h` | Weight matrix structure |
| `src/linear/param.h` | Training parameters |
| `src/linear/coordinate_common.h` | Coordinate update algorithms |
| `src/linear/updater_shotgun.cc` | Parallel (lock-free) updater |
| `src/linear/updater_coordinate.cc` | Sequential updater |
## Comparison: Linear vs Tree Booster

| Aspect | Linear | Tree |
|---|---|---|
| Model structure | Weight vector | Tree ensemble |
| Categorical features | ❌ Not supported | ✅ Supported |
| Feature interactions | ❌ None (linear) | ✅ Captures |
| Interpretability | ✅ Direct coefficients | Limited |
| Training speed | ✅ Very fast | Slower |
| Prediction speed | ✅ O(features) | O(trees × depth) |
| Memory usage | ✅ Minimal | Larger |
| Hyperparameters | Few (just regularization) | Many (tree structure) |