GBLinear#

GBLinear is a linear booster — an alternative to tree-based boosting that trains a generalized linear model (GLM) with elastic net regularization (L1 + L2).

It is a simpler model than GBDT, but it can be effective for problems with linear relationships or as a quick baseline.
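A minimal usage sketch via the scikit-learn wrapper (assuming a recent XGBoost; reg_alpha and reg_lambda are XGBoost's names for the L1 and L2 terms below, and the data here is synthetic):

```python
import numpy as np
import xgboost as xgb

# Synthetic data with a mostly linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)

model = xgb.XGBRegressor(
    booster="gblinear",   # linear booster instead of the default gbtree
    reg_alpha=0.1,        # L1 penalty: encourages sparse weights
    reg_lambda=1.0,       # L2 penalty: shrinks weights toward zero
    n_estimators=100,     # boosting rounds of coordinate updates
)
model.fit(X, y)
print(model.coef_, model.intercept_)  # weight vector and bias of the linear model
```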


Contents#

| Folder | Description |
| --- | --- |
| training/ | Coordinate descent optimization |
| inference/ | Linear prediction |

The Core Idea#

A linear model predicts by summing weighted features:

\[\hat{y} = b + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n = \mathbf{w}^T \mathbf{x} + b\]

GBLinear minimizes the elastic net objective:

\[ \min_{\mathbf{w}} \quad \mathcal{L}(\mathbf{w}) + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 \]

Where:

  • \(\mathcal{L}(\mathbf{w})\) is the loss (squared error, logistic, etc.)

  • \(\lambda_1\) controls L1 sparsity (feature selection)

  • \(\lambda_2\) controls L2 weight decay (stability)
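For squared-error loss, the whole objective fits in a few lines of NumPy (a sketch for illustration; note that the bias is excluded from both penalty terms):

```python
import numpy as np

def elastic_net_objective(w, b, X, y, lam1, lam2):
    # L(w): squared-error loss of the linear prediction X w + b
    loss = 0.5 * np.sum((X @ w + b - y) ** 2)
    # lambda_1 * ||w||_1  +  lambda_2 * ||w||_2^2  (the bias b is not penalized)
    return loss + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)
```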


When to Use Linear vs Tree Models#

| Use Case | Linear | Tree |
| --- | --- | --- |
| Data has linear relationships | ✅ Good | Overkill |
| Need interpretable coefficients | ✅ Good | Limited |
| High-dimensional sparse data | ✅ Often better | Can overfit |
| Complex feature interactions | ❌ Can’t capture | ✅ Excels |
| Baseline comparison | ✅ Quick sanity check | |
| Production latency matters | ✅ Simpler | More complex |

Linear Model Strengths#

  • Speed: Training and inference are very fast (just multiply-add)

  • Interpretability: Coefficients directly show feature importance

  • Sparsity: L1 regularization can zero out irrelevant features

  • Memory: Minimal — just a weight vector

Linear Model Limitations#

  • No categorical features — Requires numerical inputs only

  • No feature interactions — Can’t capture XOR-like patterns

  • No tree-specific features — No leaf indices, prediction layers, etc.

  • SHAP interactions — Always zero (linear models have no interactions)


Key Concepts#

Elastic Net Regularization#

Combines L1 (Lasso) and L2 (Ridge) regularization:

| Type | Penalty | Effect |
| --- | --- | --- |
| L1 (Lasso) | \(\sum \lvert w_i \rvert\) | Sparse weights — some become exactly 0 |
| L2 (Ridge) | \(\sum w_i^2\) | Small weights — stable, keeps all features |
| Elastic Net | Both | Best of both — sparse, stable, handles correlation |

Coordinate Descent#

Instead of updating all weights at once (gradient descent), coordinate descent updates one weight at a time while holding others fixed. For convex problems with separable regularization, each update has a closed-form solution.
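As a standalone illustration (not XGBoost's actual updater, which works on per-round gradient and Hessian statistics), cyclic coordinate descent for squared-error loss with an L2 penalty looks like this; each inner step is the exact one-dimensional minimizer for a single weight:

```python
import numpy as np

def coordinate_descent_l2(X, y, lam2, n_passes=100):
    """Cyclic coordinate descent for 0.5*||y - X w||^2 + 0.5*lam2*||w||^2."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    residual = y - X @ w                       # kept in sync incrementally
    for _ in range(n_passes):
        for j in range(n_features):
            x_j = X[:, j]
            # Correlation of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = x_j @ residual + w[j] * (x_j @ x_j)
            w_new = rho / (x_j @ x_j + lam2)   # closed-form 1-D minimizer
            residual += x_j * (w[j] - w_new)   # adjust residual for the change
            w[j] = w_new
    return w
```

The L1 half of the elastic net enters through the soft thresholding operator described next: rho is soft-thresholded before the division.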

Soft Thresholding#

The L1 penalty creates a “dead zone” around zero. The soft thresholding operator (proximal operator for L1) pushes small weights toward zero and can make them exactly zero — this is how L1 achieves sparsity.
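A minimal sketch, assuming the standard form \(S_\lambda(z) = \operatorname{sign}(z)\,\max(\lvert z \rvert - \lambda, 0)\):

```python
import numpy as np

def soft_threshold(z, lam1):
    # Proximal operator for the L1 penalty: shrink toward zero and snap
    # anything inside the [-lam1, +lam1] dead zone to exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 0.8]), lam1=0.5))
# -> [-1.5 -0.   0.   0.3]   (the two small entries become exactly zero)
```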


Model Structure#

The model is a weight matrix of shape (num_features + 1) × num_output_groups:

┌──────────────────────────────────────────┐
│  Feature 0:  [w₀₀, w₀₁, ..., w₀ₙ]       │  ← weights for each output group
│  Feature 1:  [w₁₀, w₁₁, ..., w₁ₙ]       │
│  ...                                      │
│  Feature k:  [wₖ₀, wₖ₁, ..., wₖₙ]       │
│  BIAS:       [b₀,  b₁,  ..., bₙ]         │  ← last row is bias (no regularization)
└──────────────────────────────────────────┘

| Configuration | Output Groups |
| --- | --- |
| Binary classification | 1 |
| K-class classification | K |
| Regression | 1 |
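A NumPy sketch of how predictions come out of this layout (hypothetical shapes; one column per output group, biases in the last row):

```python
import numpy as np

num_features, num_groups = 4, 3                # e.g. 3-class classification
W = np.zeros((num_features + 1, num_groups))   # last row holds the biases

X = np.random.rand(8, num_features)            # 8 input rows
margins = X @ W[:-1, :] + W[-1, :]             # (8, num_groups): w^T x + b per group
# Multiclass applies softmax across the group margins; regression and
# binary classification use a single column.
```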


XGBoost Source Files#

| File | Purpose |
| --- | --- |
| src/gbm/gblinear.cc | Main booster implementation |
| src/gbm/gblinear_model.h | Weight matrix structure |
| src/linear/param.h | Training parameters |
| src/linear/coordinate_common.h | Coordinate update algorithms |
| src/linear/updater_shotgun.cc | Parallel (lock-free) updater |
| src/linear/updater_coordinate.cc | Sequential updater |


Comparison: Linear vs Tree Booster#

| Aspect | Linear | Tree |
| --- | --- | --- |
| Model structure | Weight vector | Tree ensemble |
| Categorical features | ❌ Not supported | ✅ Supported |
| Feature interactions | ❌ None (linear) | ✅ Captures |
| Interpretability | ✅ Direct coefficients | Limited |
| Training speed | ✅ Very fast | Slower |
| Prediction speed | ✅ O(features) | O(trees × depth) |
| Memory usage | ✅ Minimal | Larger |
| Hyperparameters | Few (just regularization) | Many (tree structure) |