Coordinate Descent Training#

How linear models learn feature weights using coordinate descent optimization.

Overview#

Linear boosters use coordinate descent — instead of updating all weights at once (like gradient descent), we update one weight at a time while holding others fixed.

For convex problems with separable regularization (such as elastic net), this approach has several advantages:

  1. Closed-form updates — Each coordinate update has an analytical solution (no step size tuning)

  2. Convergence guaranteed — Provably reaches the global optimum

  3. Efficient for sparse data — Only touches non-zero feature values

  4. Natural for L1 — Soft thresholding integrates cleanly


Elastic Net Regularization#

The objective function is:

\[ \text{minimize:} \quad \mathcal{L}(\mathbf{w}) + \lambda_1 \sum_i |w_i| + \frac{\lambda_2}{2} \sum_i w_i^2 \]

Where:

  • \(\mathcal{L}(\mathbf{w})\) — Loss function (squared error, logistic, etc.)

  • \(\lambda_1\) (reg_alpha) — L1 penalty, encourages sparse weights

  • \(\lambda_2\) (reg_lambda) — L2 penalty, encourages small weights
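
As a concrete illustration, here is a minimal sketch of setting these penalties on XGBoost's linear booster (reg_alpha, reg_lambda, and eta are the standard gblinear parameters; the synthetic data is only for demonstration):

import numpy as np
import xgboost as xgb

# Toy regression data (illustrative only)
X = np.random.rand(100, 10)
y = X @ np.arange(10) + 0.1 * np.random.randn(100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'booster': 'gblinear',            # linear model instead of trees
    'objective': 'reg:squarederror',
    'reg_alpha': 0.1,                 # lambda_1: L1 penalty
    'reg_lambda': 1.0,                # lambda_2: L2 penalty
    'eta': 0.5,                       # learning rate applied to each coordinate step
}
model = xgb.train(params, dtrain, num_boost_round=50)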

Why Elastic Net?#

| Regularization | Pros | Cons |
|---|---|---|
| L1 alone (Lasso) | Feature selection, sparse | Unstable with correlated features |
| L2 alone (Ridge) | Stable, handles correlation | Keeps all features |
| Elastic Net | Sparse AND stable | Two hyperparameters |

Elastic net gets the best of both: it can zero out irrelevant features (like L1) while remaining stable when features are correlated (like L2).

Handling Collinear Features#

Invariant: Coordinate descent does not require features to be linearly independent.

In classical linear regression, perfect collinearity (one feature is a linear combination of others) makes the solution undefined — the design matrix \(X^TX\) is singular and cannot be inverted. Elastic net regularization resolves this: the effective matrix to invert becomes

\[ X^TX + \lambda_2 I \]

The L2 term \(\lambda_2 I\) makes this matrix positive definite, so it is always invertible. This means:

  • With \(\lambda_2 > 0\): Training always converges, even with perfectly collinear features

  • Practical implication: No need to explicitly remove correlated features before training

  • Weight distribution: Among collinear features, L2 spreads weights evenly; L1 tends to select one and zero out the others

The L1 component provides an additional benefit: it can automatically select among correlated features, effectively performing feature selection during training.
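
A small numerical sketch of this point (numpy only; the duplicated column makes \(X^TX\) singular, while adding \(\lambda_2 I\) restores invertibility):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.hstack([X, X[:, [0]]])          # feature 3 duplicates feature 0 -> perfectly collinear

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))      # 3 < 4: singular, ordinary least squares is undefined

lam2 = 1.0
ridge = XtX + lam2 * np.eye(X.shape[1])
print(np.linalg.matrix_rank(ridge))    # 4: invertible whenever lambda_2 > 0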


The Coordinate Update Formula#

For a single weight \(w_j\), the update proceeds as follows:

Step 1: Compute Gradient Statistics#

Sum over all training examples:

\[ G_j = \sum_i g_i \cdot x_{ij} \qquad H_j = \sum_i h_i \cdot x_{ij}^2 \]

Where:

  • \(g_i\) = gradient of loss for example \(i\)

  • \(h_i\) = hessian of loss for example \(i\)

  • \(x_{ij}\) = value of feature \(j\) for example \(i\)
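
For squared error loss, for example, \(g_i = \hat{y}_i - y_i\) and \(h_i = 1\), so the statistics for one feature reduce to the following (a minimal numpy sketch; variable names are illustrative):

import numpy as np

def gradient_stats(x_j, preds, labels):
    """G_j and H_j for squared error loss on feature column x_j."""
    g = preds - labels          # per-example gradient of 1/2 * (pred - label)^2
    h = np.ones_like(preds)     # per-example hessian is constant 1
    G_j = np.sum(g * x_j)
    H_j = np.sum(h * x_j ** 2)
    return G_j, H_j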

Step 2: Add L2 Regularization#

\[ G_j^{L2} = G_j + \lambda_2 \cdot w_j \qquad H_j^{L2} = H_j + \lambda_2 \]

Step 3: Apply Soft Thresholding (L1)#

The soft thresholding operator handles L1 regularization:

\[\begin{split} \delta = \begin{cases} \max\left(-\frac{G_j^{L2} + \lambda_1}{H_j^{L2}}, -w_j\right) & \text{if } w_j - \frac{G_j^{L2}}{H_j^{L2}} \geq 0 \\[1em] \min\left(-\frac{G_j^{L2} - \lambda_1}{H_j^{L2}}, -w_j\right) & \text{otherwise} \end{cases} \end{split}\]

Step 4: Apply Update#

\[ w_j \leftarrow w_j + \eta \cdot \delta \]

Where \(\eta\) is the learning rate.

Pseudocode#

def coordinate_update(w_j, feature_column, gradients, hessians, lambda1, lambda2, eta):
    """Closed-form coordinate descent update for a single weight w_j."""
    # Step 1: Gradient stats over the examples
    grad_sum = sum(g * x for g, x in zip(gradients, feature_column))
    hess_sum = sum(h * x * x for h, x in zip(hessians, feature_column))

    # Step 2: L2 adjustment
    grad_L2 = grad_sum + lambda2 * w_j
    hess_L2 = hess_sum + lambda2

    # Guard: no curvature (e.g. an all-zero column with lambda2 == 0) — leave w_j unchanged
    if hess_L2 <= 0.0:
        return w_j

    # Step 3: Soft thresholding for L1
    tmp = w_j - grad_L2 / hess_L2
    if tmp >= 0:
        delta = max(-(grad_L2 + lambda1) / hess_L2, -w_j)
    else:
        delta = min(-(grad_L2 - lambda1) / hess_L2, -w_j)

    # Step 4: Apply with learning rate
    return w_j + eta * delta
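
A quick check of the update on toy data (squared error, so gradients are residuals and hessians are 1; the values are chosen only for illustration):

labels = [1.0, 2.0, 0.5]
preds = [0.0, 0.0, 0.0]          # model currently predicts zero
x_col = [1.0, 2.0, 0.0]          # feature j, sparse: last entry is zero

grads = [p - y for p, y in zip(preds, labels)]   # squared-error gradients
hess = [1.0] * len(labels)                       # squared-error hessians

w_new = coordinate_update(0.0, x_col, grads, hess,
                          lambda1=0.1, lambda2=1.0, eta=1.0)
print(w_new)   # ~0.817: moves toward the least-squares value, shrunk by the penalties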

Bias Update#

The bias term is not regularized — its update is a plain Newton step (negative gradient sum divided by hessian sum):

\[ b \leftarrow b + \eta \cdot \left(-\frac{\sum_i g_i}{\sum_i h_i}\right) \]
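
Putting the pieces together, here is a minimal sketch of a full training loop, assuming squared error loss and the coordinate_update function above. It follows the sequential (coord_descent-style) pattern of refreshing gradients after every update; it is not how the actual implementation is organized:

def train_linear(X, y, n_rounds=10, lambda1=0.0, lambda2=1.0, eta=0.5):
    """Sequential coordinate descent for squared error loss (illustrative sketch)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    preds = [0.0] * n
    for _ in range(n_rounds):
        grads = [preds[i] - y[i] for i in range(n)]   # squared-error gradients
        hess = [1.0] * n                              # squared-error hessians
        # Bias: unregularized Newton step, then refresh predictions and gradients
        db = eta * (-sum(grads) / sum(hess))
        b += db
        for i in range(n):
            preds[i] += db
            grads[i] += db
        # Cyclic pass over the weights, refreshing gradients after each update
        for j in range(d):
            col = [X[i][j] for i in range(n)]
            new_wj = coordinate_update(w[j], col, grads, hess, lambda1, lambda2, eta)
            dw, w[j] = new_wj - w[j], new_wj
            for i in range(n):
                preds[i] += dw * X[i][j]
                grads[i] += dw * X[i][j]
    return w, b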

Updaters: Parallel vs Sequential#

XGBoost provides two coordinate descent implementations:

Shotgun Updater (Default)#

Updates features in parallel — different threads update different weights simultaneously.

Thread 0: update w₀    ┐
Thread 1: update w₁    ├── all at the same time
Thread 2: update w₂    │
Thread 3: update w₃    ┘

How it works:

  • All threads read the same gradient vector

  • Each thread updates its assigned weights

  • Race conditions in gradient reads are tolerated

  • Uses lock-free writes

Trade-offs:

| Pros | Cons |
|---|---|
| Significant speedup on multi-core | Approximate gradients (stale reads) |
| Good scaling | Only supports cyclic and shuffle selectors |
| Works well with small learning rates | Convergence is approximate |

When to use: Default choice for most use cases. Works well in practice.
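
For reference, a sketch of selecting this updater explicitly (updater and feature_selector are the gblinear parameter names; shotgun only accepts cyclic or shuffle):

params = {
    'booster': 'gblinear',
    'updater': 'shotgun',           # parallel coordinate descent (the default)
    'feature_selector': 'shuffle',  # cyclic and shuffle are the supported selectors
}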

Coord_descent Updater (Sequential)#

Updates features one at a time, updating residuals after each change.

update w₀ → refresh gradients
        ↓
update w₁ → refresh gradients
        ↓
update w₂ → refresh gradients
        ↓
...

Trade-offs:

| Pros | Cons |
|---|---|
| Exact gradients | Slower (sequential) |
| Supports all feature selectors | Can’t parallelize feature updates |
| Better for greedy/thrifty selection | |

When to use: When using advanced feature selectors, or when shotgun isn’t converging well.
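
Correspondingly, a sketch of opting into the sequential updater with an advanced selector:

params = {
    'booster': 'gblinear',
    'updater': 'coord_descent',     # sequential, exact-gradient updates
    'feature_selector': 'greedy',   # any selector works with this updater
}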


Feature Selectors#

How to choose which feature to update next?

| Selector | Strategy | Complexity | Shotgun? |
|---|---|---|---|
| cyclic | Round-robin: 0, 1, 2, …, n, 0, 1, … | O(n) | Yes |
| shuffle | Random permutation each round | O(n) | Yes |
| random | Random with replacement | O(n) | No |
| thrifty | Sort by gradient magnitude, update important first | O(n log n) | No |
| greedy | Always pick feature with largest gradient | O(n²) | No |

Why Feature Selection Matters#

For high-dimensional sparse data, most features may have zero gradient at any given time. Smarter selection strategies can converge faster by focusing on features that actually matter.

Selector Details#

Cyclic (default): Simple and fast. Processes features in order. No overhead.

Shuffle: Like cyclic, but randomizes order each round. Can break patterns and improve convergence for some problems.

Random: Picks features randomly with replacement. Some features may be updated multiple times per round, others not at all.

Thrifty: Pre-sorts features by gradient magnitude once per round, then processes in that order. Good balance — focuses on important features without the O(n²) cost of greedy.

Greedy: Recomputes gradients after each update and always picks the feature with the largest gradient. Best convergence but O(n²) per round — only practical for small feature sets.
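
A rough sketch of the thrifty ordering as described above (a simplification of the actual implementation; it ranks features by gradient magnitude and optionally keeps only the top k):

def thrifty_order(grad_sums, top_k=None):
    """Order features by |G_j|, largest first, optionally keeping only top_k."""
    order = sorted(range(len(grad_sums)), key=lambda j: abs(grad_sums[j]), reverse=True)
    return order if top_k is None else order[:top_k]

# e.g. grad_sums = [0.1, -5.0, 2.5]  ->  update order [1, 2, 0]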

Top-K Pruning#

For thrifty and greedy, the top_k parameter limits updates to the k most important features:

params = {
    'feature_selector': 'thrifty',
    'top_k': 100,  # Only update top 100 features per round
}

Reduces computation when most features are irrelevant.


Convergence Criteria#

XGBoost tracks the largest weight change each round:

\[ \Delta_{\max} = \max_j |w_j^{\text{new}} - w_j^{\text{old}}| \]

Training stops early if:

\[ \Delta_{\max} \leq \texttt{tolerance} \]

Parameters:

  • tolerance = 0 (default): Disabled, run all rounds

  • tolerance = 1e-4: Stop when weights stabilize
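
In a hand-rolled loop, the check amounts to something like the following sketch (function and variable names are illustrative):

def converged(w_old, w_new, tolerance):
    """Stop when the largest per-weight change falls below tolerance."""
    delta_max = max(abs(a - b) for a, b in zip(w_old, w_new))
    return tolerance > 0 and delta_max <= tolerance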

Practical Notes#

  • With strong regularization, convergence is faster

  • Shotgun may oscillate slightly due to stale gradients

  • If not converging, try:

    • Reducing learning rate

    • Switching to coord_descent updater

    • Increasing regularization


Data Format: CSC#

Training uses CSC (Compressed Sparse Column) format for efficient feature-wise access:

Column 0: [(row_2, 0.5), (row_7, 1.2), (row_9, 0.8)]
Column 1: [(row_0, 0.3), (row_5, 2.1)]
Column 2: [(row_1, 0.7), (row_3, 1.5), (row_6, 0.2), (row_8, 0.9)]
...

Why CSC?#

Coordinate descent iterates over features (columns):

| Format | Access pattern | Cost per feature |
|---|---|---|
| Row-major (CSR) | Scan all rows | O(total_nnz) |
| Column-major (CSC) | Direct column access | O(nnz in column) |

CSC is essential for efficient coordinate descent on sparse data.
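
With scipy, for instance, the per-column arrays can be read directly from the CSC representation (scipy.sparse is used here only to illustrate the layout):

import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[0.0, 0.3, 0.0],
                         [0.0, 0.0, 0.7],
                         [0.5, 0.0, 0.0]]))

j = 0  # feature / column index
start, end = X.indptr[j], X.indptr[j + 1]
rows = X.indices[start:end]   # row ids of the non-zero entries in column j
vals = X.data[start:end]      # their values — exactly what a coordinate update needs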


Threading Model#

Training parallelism happens at multiple levels:

1. Gradient Accumulation (parallel over rows)#

For each feature, the gradient sum is computed in parallel:

// Parallel reduction with thread-local accumulators
parallel_for(rows, [&](row) {
    auto tid = thread_id();
    sum_grad_local[tid] += gradient[row] * feature_value[row];
    sum_hess_local[tid] += hessian[row] * feature_value[row] * feature_value[row];
});
// Single-thread reduce
total_grad = sum(sum_grad_local);
total_hess = sum(sum_hess_local);

Uses thread-local buffers to avoid atomics.

2. Feature Updates (shotgun only)#

The shotgun updater parallelizes across features:

parallel_for(features, [&](feature) {
    // Each thread updates different features
    // Gradients may be slightly stale
    update_weight(feature);
});

3. Residual Updates (sequential updater)#

After each weight change, residuals must be updated sequentially — this is why coord_descent is slower than shotgun.


Summary#

| Aspect | Description |
|---|---|
| Algorithm | Coordinate descent with elastic net |
| Update | Closed-form with soft thresholding |
| Parallelism | Shotgun (features) or sequential + parallel gradient sums |
| Data format | CSC for efficient column access |
| Convergence | Guaranteed for convex objectives |
| Key params | lambda (L2), alpha (L1), eta (learning rate) |