Prediction & Optimization#
How inference works for linear models, and why there are fewer optimizations than for tree methods.
Inference Algorithm#
Prediction is a weighted sum — the simplest possible model evaluation:
\[\text{output}[g] = \sum_{j} x_j \cdot w_{j,g} + \text{bias}[g]\]
Where:
\(x_j\) = feature value
\(w_{j,g}\) = weight for feature \(j\), output group \(g\)
\(\text{bias}[g]\) = bias term for output group \(g\)
Dense Features#
def predict_linear(features, weights, bias, num_groups):
    outputs = []
    for group in range(num_groups):
        total = bias[group]
        for feat_idx, value in enumerate(features):
            total += value * weights[feat_idx, group]
        outputs.append(total)
    return outputs
Sparse Features#
For sparse data, only non-zero features contribute:
def predict_sparse(indices, values, weights, bias, num_groups):
    outputs = [bias[g] for g in range(num_groups)]
    for idx, value in zip(indices, values):
        for group in range(num_groups):
            outputs[group] += value * weights[idx, group]
    return outputs
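A quick sanity check (with made-up numbers) that the sparse path matches the dense path once zero-valued features are dropped:

import numpy as np

weights = np.array([[0.5], [2.0], [-1.0]])   # 3 features, 1 output group
bias = [0.25]

dense = predict_linear([3.0, 0.0, 1.0], weights, bias, num_groups=1)
sparse = predict_sparse([0, 2], [3.0, 1.0], weights, bias, num_groups=1)
print(dense, sparse)   # both give [0.75]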
Base Score and Base Margin#
Like tree models, linear models add a base score to predictions:
output = bias + base_score + Σ(feature × weight)
If per-row base_margin values are provided, those override the global base score:
output = bias + base_margin[row] + Σ(feature × weight)
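A minimal sketch of how the constant offset is chosen per row, assuming a single output group (the helper below is illustrative, not XGBoost's internal code):

def prediction_offset(bias, base_score, base_margin, row):
    # Per-row base_margin, when supplied, replaces the global base_score;
    # the learned bias term is always added on top.
    if base_margin is not None:
        return bias + base_margin[row]
    return bias + base_score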
Feature Contributions#
Since the model is linear, feature contributions are trivial to compute: the contribution of feature \(j\) to output group \(g\) is simply \(x_j \cdot w_{j,g}\), and the remainder (bias plus base score) is attributed to the intercept.
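A minimal sketch of that breakdown for one dense row (illustrative only; in the library the same information is exposed through predict(..., pred_contribs=True)):

def linear_contributions(features, weights, bias, base_score, group=0):
    # Each feature contributes value x weight; the final entry is the
    # intercept (bias plus base score). The list sums to the raw prediction.
    contribs = [value * weights[j, group] for j, value in enumerate(features)]
    contribs.append(bias[group] + base_score)
    return contribs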
Note: Feature interactions are always zero. Linear models have no interaction terms — each feature contributes independently.
Complexity#
| Operation | Complexity |
|---|---|
| Single row (dense) | O(features × groups) |
| Single row (sparse, K non-zeros) | O(K × groups) |
| Batch of N rows | O(N × features × groups) |
Compare to tree models: O(trees × depth) per row.
For most practical cases, linear prediction is faster because:
features is typically less than trees × depth (see the quick comparison below)
Memory access is sequential (no tree pointer chasing)
Sparse data skips zero features entirely
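As a rough comparison of per-row work, with hypothetical model sizes (the numbers below are assumptions, not measurements):

# Hypothetical model sizes, purely to illustrate per-row operation counts.
features = 100            # linear model: one multiply-add per (dense) feature
trees, depth = 500, 8     # tree ensemble: one node test per level per tree

linear_ops = features     # 100 multiply-adds, sequential memory access
tree_ops = trees * depth  # 4000 node tests, each a dependent pointer chase

print(linear_ops, tree_ops)  # 100 vs 4000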
Threading#
Batch Parallelism#
XGBoost parallelizes over rows during batch prediction:
parallel_for(batch.size(), num_threads, [&](row_idx) {
  for (group in 0..num_groups) {
    preds[row_idx * num_groups + group] = predict(row[row_idx], group);
  }
});
Each row’s prediction is independent, so this parallelizes perfectly with no synchronization.
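For dense inputs the same batch computation collapses into a single matrix product; a minimal NumPy sketch (not the actual XGBoost code path, which walks sparse rows):

import numpy as np

def predict_batch_dense(X, weights, bias):
    # X: (rows, features), weights: (features, groups), bias: (groups,)
    # Each row of the result equals the per-row loop shown above.
    return X @ weights + bias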
Within-Row Parallelism?#
XGBoost does not parallelize the feature sum within a single row. Why?
Too little work — A single dot product is fast; threading overhead dominates
Memory-bound — The bottleneck is loading weights, not computing sums
Good enough — Row-level parallelism saturates cores for batch prediction
Why Fewer Optimizations Than Trees?#
Linear models have significantly fewer optimizations than tree methods. Here’s why:
1. Inherent Simplicity#
The prediction loop is just:
for (auto& entry : sparse_row) {
  sum += entry.fvalue * weights[entry.index];
}
There’s not much to optimize. The compiler handles it well.
2. Memory-Bound Workload#
| Operation | Compute | Memory |
|---|---|---|
| Load feature value | 0 | 1 read (4-8 bytes) |
| Load weight | 0 | 1 read (4-8 bytes) |
| Multiply-add | 2 ops | 0 |
Arithmetic intensity: ~2 ops per 8-16 bytes ≈ 0.125-0.25 ops/byte
Modern CPUs can sustain 10+ ops per byte of memory bandwidth. This workload is heavily memory-bound — faster compute doesn’t help.
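The same figure as a quick calculation, assuming 4-byte floats (purely illustrative):

bytes_per_term = 4 + 4    # one feature value load + one weight load
ops_per_term = 2          # multiply + add
print(ops_per_term / bytes_per_term)   # 0.25 ops/byte, far below what CPUs can feed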
3. Sparse Data Issues#
Most GBLinear use cases involve sparse features. This creates problems for:
SIMD: Requires contiguous data, but sparse iteration jumps around randomly. Gather instructions exist but are slow.
Prefetching: Access patterns are data-dependent and unpredictable.
Caching: Random access patterns defeat cache locality.
Optimization Comparison: Linear vs Tree#
| Optimization | Tree Models | Linear Models | Why? |
|---|---|---|---|
| Thread parallelism | ✅ Yes | ✅ Yes | Both benefit from row parallelism |
| SIMD | ❌ Limited | ❌ No | Gather operations too slow |
| GPU inference | ✅ Yes | ❌ No | Data transfer > compute time |
| GPU training | ✅ Yes | ⚠️ Deprecated | Same reason |
| Block processing | ✅ Yes | ❌ No | Not beneficial for dot products |
| Histogram tricks | ✅ Yes | N/A | Training only, trees only |
| SoA layout | ✅ Yes | N/A | Already minimal structure |
GPU: Why Not?#
Linear models on GPU face fundamental issues:
Data transfer overhead: Copying sparse features to GPU costs more than the computation
Low arithmetic intensity: Just one multiply-add per non-zero. GPUs need high compute/memory ratios.
Small models: Linear models are tiny (just weights). No benefit from massive parallelism.
XGBoost’s GPU training for linear models (gpu_coord_descent) is deprecated as of v2.0.
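A back-of-envelope sketch of the transfer-versus-compute imbalance, using assumed (not measured) PCIe and GPU throughput figures:

# Hypothetical batch: 1M rows, 100 non-zeros each, CSR with 4-byte indices
# and 4-byte values; assume 16 GB/s PCIe and 10 TFLOP/s of GPU compute.
rows, nnz = 1_000_000, 100
transfer_s = rows * nnz * (4 + 4) / 16e9     # ~0.05 s just to copy the data
compute_s = rows * nnz * 2 / 10e12           # ~0.00002 s of multiply-adds
print(transfer_s, compute_s)                 # transfer dominates by ~2500x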
Parameters#
Core Parameters#
| Parameter | Default | Description |
|---|---|---|
| booster | — | Set to gblinear |
| eta | 0.5 | Shrinkage factor for weight updates |
| lambda | 0.0 | L2 regularization strength |
| alpha | 0.0 | L1 regularization strength |
Note: GBLinear’s default learning rate (0.5) is higher than GBTree’s (0.3). Linear models are simpler and tolerate larger steps.
Training Parameters#
| Parameter | Default | Description |
|---|---|---|
| updater | shotgun | Weight update algorithm (shotgun or coord_descent) |
| feature_selector | cyclic | How to order feature updates |
| top_k | 0 | Limit to top-k features (thrifty/greedy only) |
| tolerance | 0.0 | Early stop if max weight change < tolerance |
Parameter Effects#
Learning rate (eta):
Lower → more conservative, may need more rounds
Higher → faster training, risk of instability
L2 regularization (lambda):
Higher → smaller weights, less overfitting
Too high → underfitting
L1 regularization (alpha):
Higher → more weights become exactly zero
Useful for automatic feature selection
Updater:
shotgun → faster, parallel, slight approximation
coord_descent → slower, exact, supports all selectors
Typical Configurations#
Fast Iteration (Default)#
params = {
    'booster': 'gblinear',
    'updater': 'shotgun',
    'feature_selector': 'cyclic',
    'lambda': 0,
    'alpha': 0,
    'eta': 0.5,
}
Good for initial experimentation.
Feature Selection#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'thrifty',
    'lambda': 0.1,
    'alpha': 1.0,  # Strong L1 for sparsity
    'eta': 0.3,
}
Use when you want the model to automatically select features.
Maximum Stability#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'shuffle',
    'lambda': 1.0,  # Strong L2
    'alpha': 0.1,
    'eta': 0.1,
    'tolerance': 1e-5,
}
Use when shotgun isn’t converging or you need reproducibility.
High-Dimensional Sparse Data#
params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',
    'feature_selector': 'greedy',  # or 'thrifty' for speed
    'top_k': 1000,
    'lambda': 0.01,
    'alpha': 0.5,
    'eta': 0.3,
}
Focus computation on the most important features.
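Any of the parameter dicts above can be passed straight to the standard training call; a minimal usage sketch with synthetic data:

import numpy as np
import xgboost as xgb

# Synthetic regression data, purely for illustration.
X = np.random.rand(1000, 50)
y = X @ np.random.rand(50)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=50)
preds = booster.predict(xgb.DMatrix(X))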
Summary#
| Aspect | Details |
|---|---|
| Prediction | Simple dot product: \(\sum x_j \cdot w_j + \text{bias}\) |
| Complexity | O(features) dense, O(nnz) sparse |
| Threading | Row-level parallelism only |
| SIMD | Not used (sparse data, memory-bound) |
| GPU | Not supported (overhead > benefit) |
| Bottleneck | Memory bandwidth, not compute |
Linear models are simple, fast, and interpretable — but this simplicity means there’s less room for clever optimizations. The best approach is straightforward: parallelize across rows and use efficient sparse data structures.