Prediction & Optimization#
How inference works for linear models, and why there are fewer optimizations than for tree methods.
Inference Algorithm#
Prediction is a weighted sum — the simplest possible model evaluation:
\[\text{output}[g] = \sum_{j} x_j \cdot w_{j,g} + \text{bias}[g]\]
Where:
\(x_j\) = feature value
\(w_{j,g}\) = weight for feature \(j\), output group \(g\)
\(\text{bias}[g]\) = bias term for output group \(g\)
Dense Features#
def predict_linear(features, weights, bias, num_groups):
    outputs = []
    for group in range(num_groups):
        total = bias[group]
        for feat_idx, value in enumerate(features):
            total += value * weights[feat_idx, group]
        outputs.append(total)
    return outputs
Sparse Features#
For sparse data, only non-zero features contribute:
def predict_sparse(indices, values, weights, bias, num_groups):
    outputs = [bias[g] for g in range(num_groups)]
    for idx, value in zip(indices, values):
        for group in range(num_groups):
            outputs[group] += value * weights[idx, group]
    return outputs
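A quick sanity check (with made-up numbers) that the sparse path matches the dense path once zero-valued features are dropped:

import numpy as np

weights = np.array([[0.5], [2.0], [-1.0]])   # 3 features, 1 output group
bias = [0.25]

dense = predict_linear([3.0, 0.0, 1.0], weights, bias, num_groups=1)
sparse = predict_sparse([0, 2], [3.0, 1.0], weights, bias, num_groups=1)
print(dense, sparse)   # both give [0.75]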
Base Score and Base Margin#
Like tree models, linear models add a base score to predictions:
output = bias + base_score + Σ(feature × weight)
If per-row base_margin values are provided, those override the global base score:
output = bias + base_margin[row] + Σ(feature × weight)
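A minimal sketch of how the constant offset is chosen per row, assuming a single output group (the helper below is illustrative, not XGBoost's internal code):

def prediction_offset(bias, base_score, base_margin, row):
    # Per-row base_margin, when supplied, replaces the global base_score;
    # the learned bias term is always added on top.
    if base_margin is not None:
        return bias + base_margin[row]
    return bias + base_score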
Feature Contributions#
Since the model is linear, feature contributions are trivial to compute: the contribution of feature \(j\) to output group \(g\) is simply \(x_j \cdot w_{j,g}\), and the remainder (bias plus base score) is attributed to the intercept.
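A minimal sketch of that breakdown for one dense row (illustrative only; in the library the same information is exposed through predict(..., pred_contribs=True)):

def linear_contributions(features, weights, bias, base_score, group=0):
    # Each feature contributes value x weight; the final entry is the
    # intercept (bias plus base score). The list sums to the raw prediction.
    contribs = [value * weights[j, group] for j, value in enumerate(features)]
    contribs.append(bias[group] + base_score)
    return contribs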
Note: Feature interactions are always zero. Linear models have no interaction terms — each feature contributes independently.
Complexity#
| Operation | Complexity |
|---|---|
| Single row (dense) | O(features × groups) |
| Single row (sparse, K non-zeros) | O(K × groups) |
| Batch of N rows | O(N × features × groups) |
Compare to tree models: O(trees × depth) per row.
For most practical cases, linear prediction is faster because:
features is typically less than trees × depth (see the quick comparison below)
Memory access is sequential (no tree pointer chasing)
Sparse data skips zero features entirely
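As a rough comparison of per-row work, with hypothetical model sizes (the numbers below are assumptions, not measurements):

# Hypothetical model sizes, purely to illustrate per-row operation counts.
features = 100            # linear model: one multiply-add per (dense) feature
trees, depth = 500, 8     # tree ensemble: one node test per level per tree

linear_ops = features     # 100 multiply-adds, sequential memory access
tree_ops = trees * depth  # 4000 node tests, each a dependent pointer chase

print(linear_ops, tree_ops)  # 100 vs 4000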
Threading#
Batch Parallelism#
XGBoost parallelizes over rows during batch prediction:
parallel_for(batch.size(), num_threads, [&](row_idx) {
  for (group in 0..num_groups) {
    preds[row_idx * num_groups + group] = predict(row[row_idx], group);
  }
});
Each row’s prediction is independent, so this parallelizes perfectly with no synchronization.
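For dense inputs the same batch computation collapses into a single matrix product; a minimal NumPy sketch (not the actual XGBoost code path, which walks sparse rows):

import numpy as np

def predict_batch_dense(X, weights, bias):
    # X: (rows, features), weights: (features, groups), bias: (groups,)
    # Each row of the result equals the per-row loop shown above.
    return X @ weights + bias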
Within-Row Parallelism?#
XGBoost does not parallelize the feature sum within a single row. Why?
Too little work — A single dot product is fast; threading overhead dominates
Memory-bound — The bottleneck is loading weights, not computing sums
Good enough — Row-level parallelism saturates cores for batch prediction
Why Fewer Optimizations Than Trees?#
Linear models have significantly fewer optimizations than tree methods. Here’s why:
1. Inherent Simplicity#
The prediction loop is just:
for (auto& entry : sparse_row) {
  sum += entry.fvalue * weights[entry.index];
}
There’s not much to optimize. The compiler handles it well.
2. Memory-Bound Workload#
| Operation | Compute | Memory |
|---|---|---|
| Load feature value | 0 | 1 read (4-8 bytes) |
| Load weight | 0 | 1 read (4-8 bytes) |
| Multiply-add | 2 ops | 0 |
Arithmetic intensity: ~2 ops per 8-16 bytes ≈ 0.125-0.25 ops/byte
Modern CPUs can sustain 10+ ops per byte of memory bandwidth. This workload is heavily memory-bound — faster compute doesn’t help.
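The same figure as a quick calculation, assuming 4-byte floats (purely illustrative):

bytes_per_term = 4 + 4    # one feature value load + one weight load
ops_per_term = 2          # multiply + add
print(ops_per_term / bytes_per_term)   # 0.25 ops/byte, far below what CPUs can feed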
3. Sparse Data Issues#
Most GBLinear use cases involve sparse features. This creates problems for:
SIMD: Requires contiguous data, but sparse iteration jumps around randomly. Gather instructions exist but are slow.
Prefetching: Access patterns are data-dependent and unpredictable.
Caching: Random access patterns defeat cache locality.
Optimization Comparison: Linear vs Tree#
| Optimization | Tree Models | Linear Models | Why? |
|---|---|---|---|
| Thread parallelism | ✅ Yes | ✅ Yes | Both benefit from row parallelism |
| SIMD | ❌ Limited | ❌ No | Gather operations too slow |
| GPU inference | ✅ Yes | ❌ No | Data transfer > compute time |
| GPU training | ✅ Yes | ⚠️ Deprecated | Same reason |
| Block processing | ✅ Yes | ❌ No | Not beneficial for dot products |
| Histogram tricks | ✅ Yes | N/A | Training only, trees only |
| SoA layout | ✅ Yes | N/A | Already minimal structure |
GPU: Why Not?#
Linear models on GPU face fundamental issues:
Data transfer overhead: Copying sparse features to GPU costs more than the computation
Low arithmetic intensity: Just one multiply-add per non-zero. GPUs need high compute/memory ratios.
Small models: Linear models are tiny (just weights). No benefit from massive parallelism.
XGBoost’s GPU training for linear models (gpu_coord_descent) is deprecated as of v2.0.
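A back-of-envelope sketch of the transfer-versus-compute imbalance, using assumed (not measured) PCIe and GPU throughput figures:

# Hypothetical batch: 1M rows, 100 non-zeros each, CSR with 4-byte indices
# and 4-byte values; assume 16 GB/s PCIe and 10 TFLOP/s of GPU compute.
rows, nnz = 1_000_000, 100
transfer_s = rows * nnz * (4 + 4) / 16e9     # ~0.05 s just to copy the data
compute_s = rows * nnz * 2 / 10e12           # ~0.00002 s of multiply-adds
print(transfer_s, compute_s)                 # transfer dominates by ~2500x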
Parameters#
Core Parameters#
| Parameter | Default | Description |
|---|---|---|
| booster | — | Set to gblinear |
| eta | 0.5 | Shrinkage factor for weight updates |
| lambda | 0.0 | L2 regularization strength |
| alpha | 0.0 | L1 regularization strength |
Note: GBLinear’s default learning rate (0.5) is higher than GBTree’s (0.3). Linear models are simpler and tolerate larger steps.
Training Parameters#
| Parameter | Default | Description |
|---|---|---|
| updater | shotgun | Weight update algorithm (shotgun or coord_descent) |
| feature_selector | cyclic | How to order feature updates |
| top_k | 0 | Limit to top-k features (thrifty/greedy only) |
| tolerance | 0.0 | Early stop if max weight change < tolerance |
Parameter Effects#
Learning rate (eta):
Lower → more conservative, may need more rounds
Higher → faster training, risk of instability
L2 regularization (lambda):
Higher → smaller weights, less overfitting
Too high → underfitting
L1 regularization (alpha):
Higher → more weights become exactly zero
Useful for automatic feature selection
Updater:
shotgun → faster, parallel, slight approximation
coord_descent → slower, exact, supports all selectors
Typical Configurations#
Fast Iteration (Default)#
params = {
    'booster': 'gblinear',
    'updater': 'shotgun',
    'feature_selector': 'cyclic',
    'lambda': 0,
    'alpha': 0,
    'eta': 0.5,
}
Good for initial experimentation.
Feature Selection#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'thrifty',
    'lambda': 0.1,
    'alpha': 1.0,  # Strong L1 for sparsity
    'eta': 0.3,
}
Use when you want the model to automatically select features.
Maximum Stability#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'shuffle',
    'lambda': 1.0,  # Strong L2
    'alpha': 0.1,
    'eta': 0.1,
    'tolerance': 1e-5,
}
Use when shotgun isn’t converging or you need reproducibility.
High-Dimensional Sparse Data#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'greedy',  # or 'thrifty' for speed
    'top_k': 1000,
    'lambda': 0.01,
    'alpha': 0.5,
    'eta': 0.3,
}
Focus computation on the most important features.
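Any of the parameter dicts above can be passed straight to the standard training call; a minimal usage sketch with synthetic data:

import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration.
X = np.random.rand(1000, 50)
y = X @ np.random.rand(50)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=50)
preds = booster.predict(xgb.DMatrix(X))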
Summary#
| Aspect | Details |
|---|---|
| Prediction | Simple dot product: \(\sum x_j \cdot w_j + \text{bias}\) |
| Complexity | O(features) dense, O(nnz) sparse |
| Threading | Row-level parallelism only |
| SIMD | Not used (sparse data, memory-bound) |
| GPU | Not supported (overhead > benefit) |
| Bottleneck | Memory bandwidth, not compute |
Linear models are simple, fast, and interpretable — but this simplicity means there’s less room for clever optimizations. The best approach is straightforward: parallelize across rows and use efficient sparse data structures.