# Coordinate Descent Training
How linear models learn feature weights using coordinate descent optimization.
## Overview
Linear boosters use coordinate descent — instead of updating all weights at once (like gradient descent), we update one weight at a time while holding others fixed.
For convex problems with separable regularization (like elastic net):

- **Closed-form updates** — Each coordinate update has an analytical solution (no step size tuning)
- **Convergence guaranteed** — Provably reaches the global optimum
- **Efficient for sparse data** — Only touches non-zero feature values
- **Natural for L1** — Soft thresholding integrates cleanly
## Elastic Net Regularization
The objective function is:

$$
\min_{\mathbf{w}}\;\; \mathcal{L}(\mathbf{w}) \;+\; \lambda_1 \lVert \mathbf{w} \rVert_1 \;+\; \frac{\lambda_2}{2} \lVert \mathbf{w} \rVert_2^2
$$
Where:
- \(\mathcal{L}(\mathbf{w})\) — Loss function (squared error, logistic, etc.)
- \(\lambda_1\) (`reg_alpha`) — L1 penalty, encourages sparse weights
- \(\lambda_2\) (`reg_lambda`) — L2 penalty, encourages small weights
### Why Elastic Net?

| Regularization | Pros | Cons |
|---|---|---|
| L1 alone (Lasso) | Feature selection, sparse | Unstable with correlated features |
| L2 alone (Ridge) | Stable, handles correlation | Keeps all features |
| Elastic Net | Sparse AND stable | Two hyperparameters |
Elastic net gets the best of both: it can zero out irrelevant features (like L1) while remaining stable when features are correlated (like L2).
### Handling Collinear Features
Invariant: Coordinate descent does not require features to be linearly independent.
In classical linear regression, perfect collinearity (one feature is a linear combination of others) makes the solution undefined — the design matrix \(X^TX\) is singular. However, elastic net regularization resolves this:
The L2 term \(\lambda_2 I\) ensures the matrix is always invertible. This means:
- **With \(\lambda_2 > 0\)**: Training always converges, even with perfectly collinear features
- **Practical implication**: No need to explicitly remove correlated features before training
- **Weight distribution**: Among collinear features, L2 spreads weights evenly; L1 tends to select one and zero others
The L1 component provides an additional benefit: it can automatically select among correlated features, effectively performing feature selection during training.
## The Coordinate Update Formula
For a single weight \(w_j\), the update proceeds as follows:
### Step 1: Compute Gradient Statistics

Sum over all training examples:

$$
G_j = \sum_i g_i x_{ij}, \qquad H_j = \sum_i h_i x_{ij}^2
$$
Where:
- \(g_i\) = gradient of loss for example \(i\)
- \(h_i\) = hessian of loss for example \(i\)
- \(x_{ij}\) = value of feature \(j\) for example \(i\)
### Step 2: Add L2 Regularization

$$
G_j' = G_j + \lambda_2 w_j, \qquad H_j' = H_j + \lambda_2
$$
### Step 3: Apply Soft Thresholding (L1)

The soft thresholding operator handles L1 regularization:

$$
S_\lambda(z) = \operatorname{sign}(z)\,\max\left(\lvert z \rvert - \lambda,\; 0\right)
$$

The tentative new weight is \(S_{\lambda_1/H_j'}\!\left(w_j - G_j'/H_j'\right)\): take the unregularized Newton step, then shrink the result toward zero by \(\lambda_1/H_j'\).
### Step 4: Apply Update

$$
w_j \;\leftarrow\; w_j + \eta\, \Delta w_j
$$

Where \(\eta\) is the learning rate.
### Pseudocode

```python
def coordinate_update(w_j, feature_column, gradients, hessians,
                      lambda1, lambda2, eta):
    # Step 1: Gradient stats
    grad_sum = sum(g * x for g, x in zip(gradients, feature_column))
    hess_sum = sum(h * x * x for h, x in zip(hessians, feature_column))

    # Step 2: L2 adjustment
    grad_L2 = grad_sum + lambda2 * w_j
    hess_L2 = hess_sum + lambda2

    # Step 3: Soft thresholding for L1
    tmp = w_j - grad_L2 / hess_L2
    if tmp >= 0:
        delta = max(-(grad_L2 + lambda1) / hess_L2, -w_j)
    else:
        delta = min(-(grad_L2 - lambda1) / hess_L2, -w_j)

    # Step 4: Apply with learning rate
    return w_j + eta * delta
```
### Bias Update

The bias term has no regularization — it takes a plain, unpenalized step:

$$
b \;\leftarrow\; b - \eta\,\frac{\sum_i g_i}{\sum_i h_i}
$$
## Updaters: Parallel vs Sequential
XGBoost provides two coordinate descent implementations:
### Shotgun Updater (Default)
Updates features in parallel — different threads update different weights simultaneously.
```text
Thread 0: update w₀ ┐
Thread 1: update w₁ ├── all at the same time
Thread 2: update w₂ │
Thread 3: update w₃ ┘
```
How it works:
- All threads read the same gradient vector
- Each thread updates its assigned weights
- Race conditions in gradient reads are tolerated
- Uses lock-free writes
Trade-offs:
| Pros | Cons |
|---|---|
| Significant speedup on multi-core | Approximate gradients (stale reads) |
| Good scaling | Only supports `cyclic` and `shuffle` selectors |
| Works well with small learning rates | Convergence is approximate |
When to use: Default choice for most use cases. Works well in practice.
### `coord_descent` Updater (Sequential)

Updates features one at a time, refreshing residuals after each change.
```text
update w₀ → refresh gradients
    ↓
update w₁ → refresh gradients
    ↓
update w₂ → refresh gradients
    ↓
...
```
Trade-offs:
| Pros | Cons |
|---|---|
| Exact gradients | Slower (sequential) |
| Supports all feature selectors | Can't parallelize feature updates |
| Better for greedy/thrifty selection | |
When to use: When using advanced feature selectors, or when shotgun isn’t converging well.
## Feature Selectors
How to choose which feature to update next?
| Selector | Strategy | Complexity | Shotgun? |
|---|---|---|---|
| `cyclic` | Round-robin: 0, 1, 2, …, n, 0, 1, … | O(n) | ✅ |
| `shuffle` | Random permutation each round | O(n) | ✅ |
| `random` | Random with replacement | O(n) | ❌ |
| `thrifty` | Sort by gradient magnitude, update important first | O(n log n) | ❌ |
| `greedy` | Always pick feature with largest gradient | O(n²) | ❌ |
### Why Feature Selection Matters
For high-dimensional sparse data, most features may have zero gradient at any given time. Smarter selection strategies can converge faster by focusing on features that actually matter.
### Selector Details

- **Cyclic** (default): Simple and fast. Processes features in order. No overhead.
- **Shuffle**: Like cyclic, but randomizes the order each round. Can break patterns and improve convergence for some problems.
- **Random**: Picks features randomly with replacement. Some features may be updated multiple times per round, others not at all.
- **Thrifty**: Pre-sorts features by gradient magnitude once per round, then processes them in that order. A good balance — focuses on important features without the O(n²) cost of greedy.
- **Greedy**: Recomputes gradients after each update and always picks the feature with the largest gradient. Best convergence but O(n²) per round — only practical for small feature sets.
### Top-K Pruning

For `thrifty` and `greedy`, the `top_k` parameter limits updates to the k most important features:

```python
params = {
    'feature_selector': 'thrifty',
    'top_k': 100,  # Only update top 100 features per round
}
```
Reduces computation when most features are irrelevant.
## Convergence Criteria

XGBoost tracks the largest weight change each round:

$$
\Delta_{\max} = \max_j\, \lvert \Delta w_j \rvert
$$

Training stops early if \(\Delta_{\max} < \texttt{tolerance}\).
Parameters:
- `tolerance = 0` (default): Disabled, run all rounds
- `tolerance = 1e-4`: Stop when weights stabilize
### Practical Notes

- With strong regularization, convergence is faster
- Shotgun may oscillate slightly due to stale gradients
- If not converging, try:
  - Reducing the learning rate
  - Switching to the `coord_descent` updater
  - Increasing regularization
## Data Format: CSC

Training uses CSC (Compressed Sparse Column) format for efficient feature-wise access:
```text
Column 0: [(row_2, 0.5), (row_7, 1.2), (row_9, 0.8)]
Column 1: [(row_0, 0.3), (row_5, 2.1)]
Column 2: [(row_1, 0.7), (row_3, 1.5), (row_6, 0.2), (row_8, 0.9)]
...
```
### Why CSC?
Coordinate descent iterates over features (columns):
| Format | Access pattern | Cost per feature |
|---|---|---|
| Row-major (CSR) | Scan all rows | O(total_nnz) |
| Column-major (CSC) | Direct column access | O(nnz in column) |
CSC is essential for efficient coordinate descent on sparse data.
## Threading Model
Training parallelism happens at multiple levels:
### 1. Gradient Accumulation (parallel over rows)
For each feature, the gradient sum is computed in parallel:
```cpp
// Parallel reduction with thread-local accumulators
parallel_for(rows, [&](row) {
  auto tid = thread_id();
  sum_grad_local[tid] += gradient[row] * feature_value[row];
  sum_hess_local[tid] += hessian[row] * feature_value[row] * feature_value[row];
});

// Single-thread reduce
total_grad = sum(sum_grad_local);
total_hess = sum(sum_hess_local);
```
Uses thread-local buffers to avoid atomics.
### 2. Feature Updates (shotgun only)
The shotgun updater parallelizes across features:
```cpp
parallel_for(features, [&](feature) {
  // Each thread updates different features
  // Gradients may be slightly stale
  update_weight(feature);
});
```
### 3. Residual Updates (sequential updater)

After each weight change, residuals must be updated sequentially — this is why `coord_descent` is slower than `shotgun`.
## Summary

| Aspect | Description |
|---|---|
| Algorithm | Coordinate descent with elastic net |
| Update | Closed-form with soft thresholding |
| Parallelism | Shotgun (features) or sequential + parallel gradient sums |
| Data format | CSC for efficient column access |
| Convergence | Guaranteed for convex objectives |
| Key params | `reg_alpha`, `reg_lambda`, `updater`, `feature_selector`, `top_k` |