RFC-0010: GBLinear#
Status: Implemented
Created: 2025-12-01
Updated: 2026-01-02
Scope: Gradient boosted linear model training and inference
Summary#
GBLinear trains a linear model using gradient boosting with coordinate descent. It’s an alternative to GBDT when interpretability or linear relationships are desired. Compatible with XGBoost’s gblinear booster.
Why GBLinear?#
| Aspect | GBDT | GBLinear |
|---|---|---|
| Model | Trees | Linear weights |
| Complexity | Non-linear | Linear |
| Interpretability | Feature importance | Coefficients |
| Training speed | Slower (histogram building) | Faster (coordinate descent) |
| Accuracy | Usually better | Good for linear data |
GBLinear is useful when:
- Features have roughly linear relationships with the target
- Model interpretability is important
- Training speed matters and trees aren't necessary
Layers#
High Level#
Users call GBLinearModel::train:
```rust
let model = GBLinearModel::train(&dataset, eval_set, config, seed)?;
let predictions = model.predict(&test_data, n_threads);
```
Medium Level (Trainer)#
```rust
pub struct GBLinearTrainer<O: ObjectiveFn, M: MetricFn> {
    objective: O,
    metric: M,
    params: GBLinearParams,
    updater: Updater,
}
```
```rust
impl<O: ObjectiveFn, M: MetricFn> GBLinearTrainer<O, M> {
    pub fn train(
        &self,
        train: &Dataset,
        targets: TargetsView<'_>,
        weights: WeightsView<'_>,
        val_set: Option<&Dataset>,
    ) -> Option<LinearModel>;
}
```
Returns None if the dataset contains categorical features, which are unsupported (see DD-1).
Medium Level (Model)#
```rust
pub struct LinearModel {
    weights: Array2<f32>, // Shape: [n_features + 1, n_groups]
}

impl LinearModel {
    pub fn weight(&self, feature: usize, group: usize) -> f32;
    pub fn bias(&self, group: usize) -> f32;
    pub fn predict(&self, data: &Dataset, parallelism: Parallelism) -> Array2<f32>;
}
```
Weights layout: rows 0..n_features are coefficients, last row is bias.
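A minimal sketch of how accessors could index that layout (illustrative only; the struct and method bodies here are assumptions, not the crate's implementation):

```rust
use ndarray::Array2;

/// Illustrative accessor logic, assuming `weights` has shape
/// [n_features + 1, n_groups] with the bias stored in the final row.
struct LayoutSketch {
    weights: Array2<f32>,
}

impl LayoutSketch {
    fn n_features(&self) -> usize {
        self.weights.nrows() - 1
    }

    fn weight(&self, feature: usize, group: usize) -> f32 {
        self.weights[[feature, group]]
    }

    fn bias(&self, group: usize) -> f32 {
        // The bias lives in the last row, matching XGBoost's gblinear layout.
        self.weights[[self.n_features(), group]]
    }
}
```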
Low Level (Updater)#
```rust
pub struct Updater {
    strategy: UpdateStrategy,
    alpha: f32,          // L1 regularization
    lambda: f32,         // L2 regularization
    learning_rate: f32,
    max_delta_step: f32,
}

impl Updater {
    pub fn compute_delta(
        &self,
        grad_sum: f64,
        hess_sum: f64,
    ) -> f32;
}
```
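A hedged sketch of what compute_delta could look like, combining the Newton step from UpdateStrategy::Standard, the soft-thresholding rule described in DD-4 below, and max_delta_step clipping. Exactly how α enters the update is an assumption here, not a confirmed implementation detail:

```rust
/// Sketch of a regularized coordinate step. Fields mirror `Updater`;
/// the body is illustrative, not the actual implementation.
struct UpdaterSketch {
    alpha: f32,          // L1 regularization
    lambda: f32,         // L2 regularization
    learning_rate: f32,
    max_delta_step: f32, // 0.0 disables clipping
}

impl UpdaterSketch {
    fn compute_delta(&self, grad_sum: f64, hess_sum: f64) -> f32 {
        let denom = hess_sum + f64::from(self.lambda);
        if denom <= 0.0 {
            return 0.0;
        }
        // L1 via soft thresholding of the gradient sum (assumption):
        // sign(g) * max(|g| - alpha, 0).
        let g = grad_sum.signum() * (grad_sum.abs() - f64::from(self.alpha)).max(0.0);
        // Newton step scaled by the learning rate, as in the Standard strategy.
        let mut delta = -f64::from(self.learning_rate) * g / denom;
        // Optional clipping of the step magnitude.
        if self.max_delta_step > 0.0 {
            let m = f64::from(self.max_delta_step);
            delta = delta.clamp(-m, m);
        }
        delta as f32
    }
}
```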
Coordinate Descent#
GBLinear optimizes one weight at a time (a Rust sketch follows the outline below):

For each round:
1. Compute gradients from the objective
2. For each feature (selected by feature_selector):
   - sum_g = Σ grad[i] * x[i, feature]
   - sum_h = Σ hess[i] * x[i, feature]²
   - delta = compute_delta(sum_g, sum_h)
   - weight[feature] += delta
   - Update predictions: pred += delta * x[:, feature]
3. Update the bias similarly
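A sketch of one round for a single output group, assuming dense column access via ndarray and the compute_delta shape above (feature selection, bias update, and gradient refresh are omitted):

```rust
use ndarray::{Array1, ArrayView2};

/// One coordinate-descent round (sketch). `x` is [n_rows, n_features];
/// `grad`/`hess` were computed from the objective at the start of the round.
fn coordinate_descent_round(
    x: ArrayView2<f32>,
    grad: &Array1<f32>,
    hess: &Array1<f32>,
    weights: &mut [f32],
    preds: &mut Array1<f32>,
    compute_delta: impl Fn(f64, f64) -> f32,
) {
    for feature in 0..x.ncols() {
        let col = x.column(feature);
        // Accumulate first- and second-order sums for this coordinate.
        let mut sum_g = 0.0f64;
        let mut sum_h = 0.0f64;
        for i in 0..col.len() {
            sum_g += f64::from(grad[i] * col[i]);
            sum_h += f64::from(hess[i] * col[i] * col[i]);
        }
        let delta = compute_delta(sum_g, sum_h);
        if delta == 0.0 {
            continue;
        }
        weights[feature] += delta;
        // Keep predictions in sync so later coordinates see this update.
        for i in 0..col.len() {
            preds[i] += delta * col[i];
        }
    }
}
```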
Update Strategies#
```rust
pub enum UpdateStrategy {
    /// Full Newton step: Δw = -learning_rate * sum_g / (sum_h + λ)
    Standard,
    /// Shotgun: same as Standard, but updates all features in parallel
    Shotgun,
}
```
Shotgun updates features in parallel from the same stale predictions, which is an approximation; Standard updates them sequentially and is more accurate when features are correlated.
Performance: Shotgun is roughly 3-5× faster than Standard on datasets with many features, but may need more rounds to converge when features are highly correlated.
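A sketch of how the Shotgun variant could parallelize the per-feature statistics with rayon; the actual parallelization inside the updater is not specified here. All deltas are computed from the same stale predictions, which is the approximation noted above:

```rust
use ndarray::ArrayView2;
use rayon::prelude::*;

/// Shotgun-style round (sketch): compute every coordinate's delta in
/// parallel, then apply them all at once.
fn shotgun_round(
    x: ArrayView2<f32>,
    grad: &[f32],
    hess: &[f32],
    weights: &mut [f32],
    compute_delta: impl Fn(f64, f64) -> f32 + Sync,
) {
    let deltas: Vec<f32> = (0..x.ncols())
        .into_par_iter()
        .map(|feature| {
            let col = x.column(feature);
            let mut sum_g = 0.0f64;
            let mut sum_h = 0.0f64;
            for i in 0..col.len() {
                sum_g += f64::from(grad[i] * col[i]);
                sum_h += f64::from(hess[i] * col[i] * col[i]);
            }
            compute_delta(sum_g, sum_h)
        })
        .collect();

    for (w, d) in weights.iter_mut().zip(deltas) {
        *w += d;
    }
    // Predictions are then refreshed once from the new weights.
}
```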
Convergence: Training runs for exactly n_rounds; there is no early-convergence detection within coordinate descent, which matches XGBoost's behavior.
Feature Selection#
```rust
pub enum FeatureSelectorKind {
    /// Cyclic: iterate 0, 1, ..., n_features - 1
    Cyclic,
    /// Shuffle: random permutation each round
    Shuffle { seed: u64 },
    /// Random: sample with replacement
    Random { seed: u64 },
}
```
Shuffle typically converges faster than cyclic by reducing coordinate dependencies.
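A sketch of the per-round feature ordering each selector could produce, assuming rand 0.8-style seeded RNG APIs (the real selector implementation may differ):

```rust
use rand::{rngs::StdRng, seq::SliceRandom, Rng, SeedableRng};

/// Illustrative per-round ordering for each selector kind.
fn feature_order(kind: &FeatureSelectorKind, n_features: usize, round: u64) -> Vec<usize> {
    match kind {
        // Cyclic: deterministic 0, 1, ..., n_features - 1 every round.
        FeatureSelectorKind::Cyclic => (0..n_features).collect(),
        // Shuffle: a fresh seeded permutation each round.
        FeatureSelectorKind::Shuffle { seed } => {
            let mut order: Vec<usize> = (0..n_features).collect();
            let mut rng = StdRng::seed_from_u64(seed.wrapping_add(round));
            order.shuffle(&mut rng);
            order
        }
        // Random: sample feature indices with replacement.
        FeatureSelectorKind::Random { seed } => {
            let mut rng = StdRng::seed_from_u64(seed.wrapping_add(round));
            (0..n_features).map(|_| rng.gen_range(0..n_features)).collect()
        }
    }
}
```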
Parameters#
```rust
pub struct GBLinearParams {
    pub n_rounds: u32,        // Boosting rounds
    pub learning_rate: f32,   // Step size
    pub alpha: f32,           // L1 regularization
    pub lambda: f32,          // L2 regularization
    pub update_strategy: UpdateStrategy,
    pub feature_selector: FeatureSelectorKind,
    pub max_delta_step: f32,  // Clip step size
    pub early_stopping_rounds: u32,
    pub verbosity: Verbosity,
    pub seed: u64,
}
```
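An illustrative construction; the values are arbitrary choices for a small regression run, and the Verbosity::Silent variant name is assumed:

```rust
let params = GBLinearParams {
    n_rounds: 100,
    learning_rate: 0.3,
    alpha: 0.0,                   // no L1
    lambda: 1.0,                  // mild L2
    update_strategy: UpdateStrategy::Standard,
    feature_selector: FeatureSelectorKind::Cyclic,
    max_delta_step: 0.0,          // 0 disables step clipping
    early_stopping_rounds: 10,
    verbosity: Verbosity::Silent, // variant name assumed for illustration
    seed: 42,
};
```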
Weight Access#
```rust
impl LinearModel {
    // Full weight matrix access
    pub fn as_slice(&self) -> &[f32];             // Flat view of [n_features + 1, n_groups]
    pub fn weight_view(&self) -> ArrayView2<f32>; // [n_features, n_groups], excluding bias

    // Per-group combined view (preferred for training updates)
    pub fn weights_and_bias(&self, group: usize) -> ArrayView1<f32>; // [n_features + 1]
    pub fn weights_and_bias_mut(&mut self, group: usize) -> ArrayViewMut1<f32>;

    // Individual element access
    pub fn weight(&self, feature: usize, group: usize) -> f32;
    pub fn bias(&self, group: usize) -> f32;
    pub fn biases(&self) -> ArrayView1<f32>;      // [n_groups]
}
```
The weights_and_bias(group) view is the primary API for training updates: it returns a contiguous view [w_0, w_1, ..., w_{n-1}, bias] for a single output group. The individual weight() and bias() methods exist for inspection.
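For example, inspecting a fitted single-output model could look like this (sketch; assumes a trained LinearModel and a known feature count):

```rust
/// Print the fitted coefficients and bias of output group 0.
fn print_coefficients(model: &LinearModel, n_features: usize) {
    let group = 0;
    for feature in 0..n_features {
        println!("w[{feature}] = {}", model.weight(feature, group));
    }
    println!("bias = {}", model.bias(group));

    // Contiguous [w_0, ..., w_{n-1}, bias] view used during training updates.
    let wb = model.weights_and_bias(group);
    assert_eq!(wb.len(), n_features + 1);
}
```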
Multi-Output#
For K-class classification (see the sketch after this list):
- Weights shape: [n_features + 1, K]
- Each output group has its own weight vector
- Coordinate descent updates all K outputs per feature
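As a consumer-side sketch (not part of the crate's API), class indices can be read off the [n_rows, K] output of predict by row-wise argmax; the softmax transform itself belongs to the objective:

```rust
use ndarray::{Array2, Axis};

/// Row-wise argmax over the [n_rows, K] margin matrix returned by predict.
fn predicted_classes(margins: &Array2<f32>) -> Vec<usize> {
    margins
        .axis_iter(Axis(0))
        .map(|row| {
            row.iter()
                .enumerate()
                .max_by(|a, b| a.1.total_cmp(b.1))
                .map(|(class, _)| class)
                .unwrap_or(0)
        })
        .collect()
}
```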
Prediction#
Linear prediction is matrix multiplication:
```rust
impl LinearModel {
    pub fn predict(&self, data: &Dataset, parallelism: Parallelism) -> Array2<f32> {
        // output = data · weights[0..n_features, :] + bias
        let mut output = data.features.dot(&self.weight_view());
        output += &self.biases();
        output
    }
}
```
Prediction is parallelized via ndarray's optional BLAS integration or via rayon over row blocks (a sketch follows).
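A sketch of the rayon row-block variant (block size and concatenation are illustrative; the BLAS path simply calls dot on the full matrix as above):

```rust
use ndarray::{concatenate, s, Array2, ArrayView1, ArrayView2, Axis};
use rayon::prelude::*;

/// Predict in parallel over row blocks: each block computes
/// `block · weights + bias` independently, then blocks are concatenated.
fn predict_row_blocks(
    features: ArrayView2<f32>, // [n_rows, n_features]
    weights: ArrayView2<f32>,  // [n_features, n_groups], bias excluded
    biases: ArrayView1<f32>,   // [n_groups]
    block_size: usize,
) -> Array2<f32> {
    assert!(block_size > 0);
    let n_rows = features.nrows();
    let blocks: Vec<Array2<f32>> = (0..n_rows)
        .step_by(block_size)
        .collect::<Vec<_>>()
        .into_par_iter()
        .map(|start| {
            let end = (start + block_size).min(n_rows);
            let mut out = features.slice(s![start..end, ..]).dot(&weights);
            out += &biases;
            out
        })
        .collect();
    let views: Vec<_> = blocks.iter().map(|b| b.view()).collect();
    concatenate(Axis(0), &views).expect("row blocks share the same number of groups")
}
```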
Files#
| Path | Contents |
|---|---|
Design Decisions#
DD-1: No categorical support. GBLinear requires numeric features for meaningful linear coefficients. If categoricals are needed, users should one-hot encode or use GBDT.
DD-2: XGBoost-compatible layout. Weight storage matches XGBoost’s gblinear for model interoperability.
DD-3: Shotgun parallelism. Shotgun assumes feature updates are independent (approximately true with regularization). Faster but may need more rounds for highly correlated features.
DD-4: L1 via soft thresholding. L1 regularization uses the soft-thresholding rule sign(x) · max(|x| − α, 0) to produce sparse solutions.
XGBoost Differences#
| Aspect | Boosters | XGBoost |
|---|---|---|
| Feature selection | Cyclic, Shuffle, Random | Same options |
| Update strategy | Standard, Shotgun | coord_descent, shotgun |
| Regularization | L1 + L2 | Same |
| Multi-output | Supported | Supported |
Behavior is designed to match XGBoost for model interoperability.
Testing Strategy#
| Category | Tests |
|---|---|
| Coordinate descent | Single-feature convergence to optimal |
| L1 sparsity | Zero weights with sufficient α |
| XGBoost compat | Load XGBoost gblinear, compare predictions |
| Multi-output | K-class weights correctly shaped |