# Sampling Strategies in Gradient Boosting

## Overview

Row sampling (also called bagging or subsampling) trains each tree on a subset of the data. This provides:

- **Regularization**: reduces overfitting by introducing randomness
- **Speed**: less data to process per tree
- **Variance reduction**: an ensemble of diverse trees

There are two main approaches:

- **Random subsampling**: uniform random selection
- **GOSS** (Gradient-based One-Side Sampling): prioritizes informative samples
## Random Subsampling

### Algorithm

A NumPy sketch of uniform row subsampling without replacement:

```python
import numpy as np

def random_subsample(data, subsample_rate, seed):
    """Select a uniform random subset of rows, without replacement."""
    n = len(data)
    k = int(np.floor(n * subsample_rate))
    rng = np.random.default_rng(seed)
    indices = rng.choice(n, size=k, replace=False)
    return data[indices]
```
### Properties

| Property | Value |
|---|---|
| Selection | Uniform random |
| Sample size | ⌊n × subsample⌋ |
| Weight adjustment | None needed |
| Bias | Unbiased |
### Parameters

| Parameter | Typical Range | Description |
|---|---|---|
| `subsample` | 0.5 - 1.0 | Fraction of rows to sample per tree |
### When to Use

- **Large datasets**: speed up training proportionally
- **Overfitting**: add regularization through randomness
- **Ensemble diversity**: each tree sees different data
### Example

```text
Dataset: 1,000,000 rows
subsample = 0.5

Each tree trains on:
- 500,000 randomly selected rows
- a different random seed per tree
```
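A minimal sketch of what this looks like in code, reusing `random_subsample` from above with a different seed per tree (the per-tree seeding scheme here is illustrative):

```python
import numpy as np

data = np.arange(1_000_000)   # stand-in for a dataset with 1,000,000 rows

for tree_id in range(3):
    rows = random_subsample(data, subsample_rate=0.5, seed=tree_id)
    print(f"tree {tree_id}: trains on {len(rows):,} rows")
```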
## GOSS (Gradient-based One-Side Sampling)

GOSS is a smarter sampling strategy that keeps samples with large gradients (the informative ones) and randomly samples from the rest.

### Intuition

Not all samples are equally important for learning:

- **Large gradient**: the model is making big mistakes → very informative
- **Small gradient**: the model is already accurate → less informative

GOSS keeps all the “hard” examples and subsamples the “easy” ones.
### Algorithm

A NumPy sketch of GOSS (note that `other_rate` here is a fraction of the *remaining* rows, as defined in the Parameters table below):

```python
import numpy as np

def goss_sample(gradients, hessians, top_rate, other_rate, seed=0):
    """Select row indices via GOSS; amplifies the gradients and hessians
    of the randomly sampled low-gradient rows in place."""
    n = len(gradients)
    rng = np.random.default_rng(seed)

    # 1. Importance score: absolute gradient magnitude
    scores = np.abs(gradients)

    # 2. Keep the top-k samples with the largest gradients
    top_k = int(np.floor(n * top_rate))
    top_indices = np.argsort(scores)[::-1][:top_k]

    # 3. Randomly sample a fraction `other_rate` of the remaining rows
    other_indices = np.setdiff1d(np.arange(n), top_indices)
    other_k = int(np.floor(len(other_indices) * other_rate))
    sampled_other = rng.choice(other_indices, size=other_k, replace=False)

    # 4. Amplify the sampled "other" rows so gradient sums stay unbiased:
    #    each kept row stands in for 1 / other_rate rows of the remainder
    weight = 1.0 / other_rate
    gradients[sampled_other] *= weight
    hessians[sampled_other] *= weight

    return np.concatenate([top_indices, sampled_other])
```
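A quick usage example of the sketch above on synthetic gradients (the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gradients = rng.normal(size=n)
hessians = np.ones(n)

selected = goss_sample(gradients, hessians, top_rate=0.2, other_rate=0.1, seed=0)
print(len(selected) / n)   # ≈ 0.28, the effective sample rate
```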
### Weight Amplification

The sampled “other” rows need a weight adjustment to keep gradient estimates unbiased:

```text
Population of "other" rows:   (1 - top_rate) × n
Sampled "other" rows:         other_rate × (1 - top_rate) × n
Sampling rate for "other":    other_rate
Weight multiplier:            1 / other_rate
```

Note: LightGBM defines `other_rate` as a fraction of the full dataset rather than of the remaining rows, which gives its familiar amplification factor `(1 - top_rate) / other_rate`. With `other_rate` defined as a fraction of the remaining rows, as here, the factor simplifies to `1 / other_rate`.
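A quick numerical check of the unbiasedness claim (a standalone sketch, not library code): in expectation, the weighted gradient sum over the sampled “other” rows matches the full sum over all “other” rows.

```python
import numpy as np

other_grads = np.random.default_rng(0).normal(size=100_000)  # gradients of the non-top rows
other_rate = 0.1
full_sum = other_grads.sum()

estimates = []
for trial in range(200):
    sampled = np.random.default_rng(trial).choice(
        len(other_grads), size=int(len(other_grads) * other_rate), replace=False)
    estimates.append(other_grads[sampled].sum() / other_rate)  # weight = 1 / other_rate

print(full_sum, np.mean(estimates))   # the two values should be close
```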
### Parameters

| Parameter | Typical Value | Description |
|---|---|---|
| `top_rate` | 0.2 | Fraction of high-gradient samples to keep |
| `other_rate` | 0.1 | Fraction of the remaining samples to randomly select |
### Effective Sample Rate

```text
effective_rate = top_rate + other_rate × (1 - top_rate)

Example: top_rate=0.2, other_rate=0.1
  = 0.2 + 0.1 × 0.8
  = 0.2 + 0.08
  = 0.28  (28% of data)
```
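The same formula as a one-line helper, with the worked example as a sanity check:

```python
def effective_sample_rate(top_rate, other_rate):
    """Fraction of rows used per iteration under GOSS."""
    return top_rate + other_rate * (1 - top_rate)

assert abs(effective_sample_rate(0.2, 0.1) - 0.28) < 1e-9
```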
### Multi-Output Case

For multi-output objectives (multiclass, multi-quantile), each sample has K gradients. We need a single importance score per sample.

Option 1: sum of products (LightGBM)

$$
\text{score}_i = \sum_{k=1}^{K} |g_{i,k} \times h_{i,k}|
$$

Option 2: L2 norm of gradients

$$
\text{score}_i = \sqrt{\sum_{k=1}^{K} g_{i,k}^2}
$$

Both capture the overall “importance” of a sample across all outputs.
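A small sketch computing both scores for per-sample gradient and hessian matrices of shape (n, K) (the array names are illustrative):

```python
import numpy as np

n, K = 1_000, 3
rng = np.random.default_rng(0)
G = rng.normal(size=(n, K))           # gradients, one column per output
H = np.abs(rng.normal(size=(n, K)))   # hessians

score_sum_products = np.sum(np.abs(G * H), axis=1)   # Option 1 (LightGBM-style)
score_l2_norm = np.sqrt(np.sum(G ** 2, axis=1))      # Option 2

print(score_sum_products[:3], score_l2_norm[:3])
```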
### Warm-up Period

LightGBM disables GOSS for the first few iterations:

```text
warm_up_rounds = 1 / learning_rate

if iteration < warm_up_rounds:
    use all data (no GOSS)
else:
    apply GOSS
```

Rationale: early gradients are noisy and don’t reflect true sample importance.
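A minimal sketch of how a training loop might gate GOSS behind the warm-up period, reusing `goss_sample` from above (the loop structure is an assumption for illustration, not LightGBM’s actual code):

```python
import numpy as np

n, num_boost_round, learning_rate = 1_000, 20, 0.1
warm_up_rounds = int(1 / learning_rate)                 # 10 iterations here
gradients = np.random.default_rng(0).normal(size=n)
hessians = np.ones(n)

for iteration in range(num_boost_round):
    if iteration < warm_up_rounds:
        rows = np.arange(n)                             # warm-up: use all data
    else:
        rows = goss_sample(gradients.copy(), hessians.copy(),
                           top_rate=0.2, other_rate=0.1, seed=iteration)
    # ... fit the tree on `rows`, then recompute gradients/hessians ...
```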
## Comparison: Random vs GOSS

| | Random Subsampling (50%) | GOSS (top=20%, other=10%) |
|---|---|---|
| Selection | Uniform random 50% of rows | Top 20% by \|gradient\|, plus a random 10% of the rest |
| Weights | All selected rows: weight = 1 | Top samples: weight = 1; sampled “other” samples: weight = 10× (amplified) |
| Effective data | 50% | 28% |
| Gradient bias | None | None (due to weighting) |
| Variance | Higher | Lower (keeps the most informative samples) |
### When to Use Each

| Scenario | Recommendation |
|---|---|
| Baseline | Random subsampling |
| Large dataset, accuracy critical | GOSS |
| Highly imbalanced data | GOSS (keeps minority samples with large gradients) |
| Very large dataset (>10M rows) | GOSS (bigger speedup) |
| Small dataset | Random (or no sampling) |
| Debugging/reproducibility | Random (simpler to reason about) |
## Column Sampling

In addition to row sampling, gradient boosting supports column (feature) sampling.

### Levels of Column Sampling

- `colsample_bytree`: sample features once per tree
- `colsample_bylevel`: sample features once per depth level
- `colsample_bynode`: sample features once per node

### Combined Effect

The sampling rates multiply:

```text
effective_feature_rate = colsample_bytree × colsample_bylevel × colsample_bynode

Example: 0.8 × 0.8 × 0.8 = 0.512  (51.2% of features per node)
```
### Implementation

A sketch of per-node feature selection (the `params` dictionary and the deterministic per-tree/per-node seeding are illustrative):

```python
import numpy as np

def sample(items, rate, seed):
    """Randomly keep a fraction `rate` of `items` (at least one)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(len(items) * rate))
    return rng.choice(items, size=k, replace=False)

def get_features_for_node(tree, node, params, seed):
    available = np.arange(params['num_features'])
    # Tree-level sampling: the same subset for every node in this tree
    available = sample(available, params['colsample_bytree'], seed + tree.id)
    # Level-level sampling: the same subset for every node at this depth
    available = sample(available, params['colsample_bylevel'], seed + tree.id + node.depth)
    # Node-level sampling: a fresh subset for this node
    available = sample(available, params['colsample_bynode'], seed + tree.id + node.id)
    return available
```
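Usage with lightweight stand-ins for the tree and node objects (the `SimpleNamespace` stubs are purely illustrative):

```python
from types import SimpleNamespace

params = {'num_features': 100, 'colsample_bytree': 0.8,
          'colsample_bylevel': 1.0, 'colsample_bynode': 0.8}
tree = SimpleNamespace(id=3)
node = SimpleNamespace(id=7, depth=2)

features = get_features_for_node(tree, node, params, seed=42)
print(len(features))   # 100 × 0.8 × 1.0 × 0.8 = 64 features
```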
## Stochastic Gradient Boosting

Combining row and column sampling gives Stochastic Gradient Boosting:

```text
for each tree:
    rows = sample_rows(data, subsample)                  # or GOSS
    cols = sample_columns(features, colsample_bytree)
    for each node:
        node_cols = sample_columns(cols, colsample_bynode)
        histogram = build_histogram(rows, node_cols)
        split = find_best_split(histogram)
```

This provides multiple sources of randomness:

- which rows each tree sees
- which features each tree/level/node considers

The combined effect is a bootstrap-like reduction in ensemble variance.
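To make the sources of randomness concrete, here is a runnable sketch of just the per-tree sampling steps (no actual tree construction; the per-tree seeding is illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(1_000, 20))   # hypothetical dataset
subsample, colsample_bytree = 0.8, 0.8

for tree_id in range(3):
    # Row sampling: each tree sees a different random subset of rows
    rows = np.random.default_rng(tree_id).choice(
        len(X), size=int(len(X) * subsample), replace=False)
    # Column sampling: each tree considers a different subset of features
    cols = np.random.default_rng(100 + tree_id).choice(
        X.shape[1], size=int(X.shape[1] * colsample_bytree), replace=False)
    print(f"tree {tree_id}: {len(rows)} rows, features {sorted(cols)[:5]}...")
```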
## Performance Considerations

### Memory

| Strategy | Memory Impact |
|---|---|
| Random subsample | Indices only (n × 4 bytes) |
| GOSS | Indices + weights (n × 8 bytes) |
| Column sample | Minimal (feature mask) |

### Computation

| Strategy | Speedup |
|---|---|
| Random 50% | ~2× faster histogram building |
| GOSS 28% | ~3.5× faster histogram building |
| Column 50% | ~2× fewer features to evaluate |
### GOSS Overhead

GOSS has additional overhead compared to random sampling:

- computing gradient magnitudes: O(n)
- partial sort for top-k: O(n)
- weight multiplication: O(n × other_rate)

For large datasets, this overhead is small compared to the savings in histogram building.
## Practical Recommendations

### Starting Point

```python
# Moderate regularization
params = {
    'subsample': 0.8,          # 80% of rows
    'colsample_bytree': 0.8,   # 80% of features per tree
    'colsample_bylevel': 1.0,  # all features per level
    'colsample_bynode': 1.0,   # all features per node
}
```
### For Large Datasets (>1M rows)

```python
# Aggressive subsampling for speed (LightGBM GOSS)
params_lgbm = {
    'boosting_type': 'goss',
    'top_rate': 0.2,
    'other_rate': 0.1,
}

# or, for XGBoost, plain random subsampling
params_xgb = {
    'subsample': 0.5,
}
```
### For Overfitting Problems

```python
# More randomness = more regularization
params = {
    'subsample': 0.6,
    'colsample_bytree': 0.6,
    'colsample_bynode': 0.8,
}
```
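A minimal sketch of feeding such a parameter dictionary to XGBoost’s native API (assumes `xgboost` is installed; `X` and `y` are placeholder data):

```python
import numpy as np
import xgboost as xgb

X = np.random.default_rng(0).normal(size=(1_000, 20))
y = np.random.default_rng(1).normal(size=1_000)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({**params, 'tree_method': 'hist'}, dtrain, num_boost_round=100)
```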
## References

- LightGBM paper, Section 4 (GOSS)
- XGBoost: stochastic gradient boosting via the `subsample` parameter
- LightGBM: `src/boosting/goss.hpp`
- Stochastic Gradient Boosting (Friedman, 2002)