RFC-0006: Sampling Strategies#
Status: Implemented
Created: 2025-12-15
Updated: 2026-01-02
Scope: Row and column sampling for regularization
Summary#
Sampling strategies reduce computation and improve generalization. Row sampling (bagging, GOSS) selects training samples per tree. Column sampling selects features at tree/level/node granularity.
Why Sampling?#
| Goal | Mechanism |
|---|---|
| Reduce overfitting | Random subsets prevent memorization |
| Speed up training | Fewer samples/features to process |
| Diversity | Each tree sees different data |
Row Sampling#
Configuration#
```rust
pub enum RowSamplingParams {
    None,
    Uniform { subsample: f32 },
    Goss { top_rate: f32, other_rate: f32 },
}
```
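The `uniform` and `goss` constructors used in the snippets below are thin helpers over these variants; a minimal sketch, assuming basic range validation (the exact checks are not specified in this RFC):

```rust
impl RowSamplingParams {
    /// Uniform bagging with the given subsample fraction in (0, 1].
    pub fn uniform(subsample: f32) -> Self {
        assert!(subsample > 0.0 && subsample <= 1.0);
        RowSamplingParams::Uniform { subsample }
    }

    /// GOSS: keep `top_rate` of samples outright, sample `other_rate` more from the rest.
    pub fn goss(top_rate: f32, other_rate: f32) -> Self {
        assert!(top_rate > 0.0 && other_rate > 0.0 && top_rate + other_rate <= 1.0);
        RowSamplingParams::Goss { top_rate, other_rate }
    }
}
```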
Uniform Sampling#
Standard bagging: randomly select subsample fraction of samples.
```rust
// 80% of samples per tree
let sampling = RowSamplingParams::uniform(0.8);
```
Implementation: zero out gradients for unselected samples. No data copying.
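A minimal sketch of the zero-out approach, assuming gradients and hessians are exposed as parallel `f32` slices (the actual `Gradients` layout may differ):

```rust
use rand::{rngs::SmallRng, Rng};

/// Bernoulli row sampling: each row is kept independently with probability
/// `subsample`; unselected rows get zero gradient/hessian and therefore
/// contribute nothing to histograms or split gains.
fn apply_uniform(grad: &mut [f32], hess: &mut [f32], subsample: f32, rng: &mut SmallRng) {
    for i in 0..grad.len() {
        if rng.gen::<f32>() >= subsample {
            grad[i] = 0.0;
            hess[i] = 0.0;
        }
    }
}
```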
GOSS (Gradient-based One-Side Sampling)#
Introduced in “LightGBM: A Highly Efficient Gradient Boosting Decision Tree” (Ke et al., 2017). GOSS exploits the insight that samples with small gradients are already well modeled and contribute little to the information gain.
Key insight: Data instances with different gradients play different roles in computing information gain. Instances with larger gradients contribute more to the information gain. To maintain the accuracy of information gain estimation, we keep all instances with large gradients and only randomly sample from instances with small gradients.
Algorithm:
1. Sort samples by importance \(|g_i \cdot h_i|\) (absolute gradient × hessian)
2. Keep the top \(a\) fraction (large gradients, the informative samples)
3. Randomly sample \(b \cdot n\) instances from the remaining \((1-a) \cdot n\)
4. Amplify the sampled small-gradient instances by a factor of \(\frac{1-a}{b}\)
Mathematical formulation:
For a dataset with \(n\) instances, let \(A\) be the top-\(a \cdot n\) instances sorted by gradient magnitude, and \(B\) be a random sample of size \(b \cdot n\) from the remaining instances. The estimated variance gain is:
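\[
\tilde{V}_j(d) = \frac{1}{n}\left(
\frac{\left(\sum_{x_i \in A_l} g_i + \frac{1-a}{b}\sum_{x_i \in B_l} g_i\right)^2}{n_l^j(d)}
+ \frac{\left(\sum_{x_i \in A_r} g_i + \frac{1-a}{b}\sum_{x_i \in B_r} g_i\right)^2}{n_r^j(d)}
\right)
\]

where \(A_l, B_l\) (respectively \(A_r, B_r\)) are the instances of \(A\) and \(B\) that fall on the left (right) side of split point \(d\) on feature \(j\), and \(n_l^j(d)\), \(n_r^j(d)\) are the corresponding instance counts (Ke et al., 2017).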
The amplification factor \(\frac{1-a}{b}\) compensates for the underrepresentation of small-gradient samples, ensuring unbiased gradient sums.
Typical values:
- `top_rate = 0.2` (keep the top 20% of samples by gradient magnitude)
- `other_rate = 0.1` (sample another 10% of all samples from the remaining 80%)
- Effective data usage: 0.2 + 0.1 = 30% of samples
```rust
// Keep the top 20% by gradient magnitude, sample another 10% from the rest
let sampling = RowSamplingParams::goss(0.2, 0.1);
```
Warmup period: GOSS skips the first ⌊1/learning_rate⌋ rounds (e.g., 10 rounds at a learning rate of 0.1). Early iterations have unreliable gradients since predictions are far from the targets, so all samples are used during warmup.
Implementation details (see the sketch below):

- Gradient importance computed as `|grad × hess|` for proper weighting
- Partial sort (quickselect) used to find the top-\(a\) threshold efficiently: O(n) average
- Gradients modified in place (no data copying)
- Same RNG seed ensures reproducibility
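A minimal sketch of the selection step under the same slice-based assumptions as above; `select_nth_unstable_by` stands in for quickselect, and the warmup check is omitted:

```rust
use rand::{rngs::SmallRng, Rng};

/// GOSS: keep the top `top_rate` fraction of rows by |grad * hess|, keep each
/// remaining row with probability `other_rate / (1 - top_rate)` (amplified so
/// gradient/hessian sums stay unbiased), and zero out everything else.
fn apply_goss(grad: &mut [f32], hess: &mut [f32], top_rate: f32, other_rate: f32, rng: &mut SmallRng) {
    let n = grad.len();
    let n_top = ((n as f32) * top_rate).ceil() as usize;

    // Importance scores and their top-`n_top` threshold via quickselect: O(n) average.
    let importance: Vec<f32> = grad.iter().zip(hess.iter()).map(|(g, h)| (g * h).abs()).collect();
    let mut scratch = importance.clone();
    scratch.select_nth_unstable_by(n - n_top, |a, b| a.partial_cmp(b).unwrap());
    let threshold = scratch[n - n_top];

    let keep_prob = other_rate / (1.0 - top_rate);
    let amplify = (1.0 - top_rate) / other_rate;

    for i in 0..n {
        if importance[i] >= threshold {
            // Large-gradient row: always kept, unmodified.
        } else if rng.gen::<f32>() < keep_prob {
            // Sampled small-gradient row: amplify by (1 - a) / b.
            grad[i] *= amplify;
            hess[i] *= amplify;
        } else {
            // Unselected row: contributes nothing this tree.
            grad[i] = 0.0;
            hess[i] = 0.0;
        }
    }
}
```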
Sampler API#
```rust
pub struct RowSampler {
    config: RowSamplingParams,
    rng: SmallRng,
}

impl RowSampler {
    pub fn new(config: RowSamplingParams, seed: u64) -> Self;

    /// Apply sampling by modifying gradients in-place
    pub fn apply(&mut self, gradients: &mut Gradients, iteration: usize);

    /// Check if warmup period is active
    pub fn is_warmup(&self, iteration: usize) -> bool;
}
```
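A sketch of how the trainer might drive the sampler each round; `objective`, `grower`, and the surrounding loop are illustrative stand-ins, not the actual trainer API:

```rust
let mut row_sampler = RowSampler::new(RowSamplingParams::goss(0.2, 0.1), 42);

for tree_idx in 0..n_trees {
    let mut gradients = objective.compute_gradients(&predictions, &targets);

    // During the warmup period apply() leaves gradients untouched; afterwards
    // it zeroes/amplifies them in place according to the configured strategy.
    row_sampler.apply(&mut gradients, tree_idx);

    let tree = grower.grow(&binned_data, &gradients);
    predictions.add_tree(&tree);
}
```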
Column Sampling#
ColSamplingParams#
```rust
pub enum ColSamplingParams {
    None,
    ByTree { colsample: f32 },
    ByLevel { colsample: f32 },
    ByNode { colsample: f32 },
}
```
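Each resample call redraws a feature subset without replacement; a minimal sketch of that step using `rand::seq::index::sample` (the actual selection code may differ):

```rust
use rand::rngs::SmallRng;

/// Draw `ceil(colsample * n_features)` distinct feature indices, sorted so the
/// grower can iterate features in their natural order.
fn sample_features(n_features: u32, colsample: f32, rng: &mut SmallRng) -> Vec<u32> {
    let k = ((n_features as f32) * colsample).ceil().max(1.0) as usize;
    let mut picked: Vec<u32> = rand::seq::index::sample(rng, n_features as usize, k)
        .into_iter()
        .map(|i| i as u32)
        .collect();
    picked.sort_unstable();
    picked
}
```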
Granularity#
| Level | When Applied | Scope |
|---|---|---|
| ByTree | Once per tree | All nodes share the same features |
| ByLevel | When depth changes | Nodes at the same depth share features |
| ByNode | Every split finding | Each node has an independent feature set |
Cascading: In XGBoost, the levels cascade multiplicatively: ByTree × ByLevel × ByNode (e.g., 0.8 at every level leaves roughly 0.8³ ≈ 51% of features available at each node). We implement a single level only, for simplicity.
ColSampler API#
```rust
pub struct ColSampler {
    config: ColSamplingParams,
    n_features: u32,
    rng: SmallRng,
    active_features: Vec<u32>,
}

impl ColSampler {
    pub fn new(config: ColSamplingParams, n_features: u32, seed: u64) -> Self;

    /// Resample for new tree
    pub fn resample_tree(&mut self);

    /// Resample for new level (ByLevel only)
    pub fn resample_level(&mut self, depth: u32);

    /// Resample for new node (ByNode only)
    pub fn resample_node(&mut self);

    /// Get currently active features
    pub fn active_features(&self) -> &[u32];
}
```
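A sketch of the call pattern in a depth-wise grower; `nodes_at` and `find_best_split` are stand-ins (only the resample call matching the configured granularity changes the active set):

```rust
let mut col_sampler = ColSampler::new(ColSamplingParams::ByLevel { colsample: 0.8 }, n_features, 42);

col_sampler.resample_tree();                        // once per tree
for depth in 0..max_depth {
    col_sampler.resample_level(depth);              // effective only for ByLevel
    for node in nodes_at(depth) {
        col_sampler.resample_node();                // effective only for ByNode
        let active = col_sampler.active_features();
        let split = find_best_split(&node, active); // only active features are scored
    }
}
```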
Integration with Training#
Row sampling in trainer:
```rust
// Before each tree
row_sampler.apply(&mut gradients, tree_idx);
// gradients now has zeros for unsampled rows
```
Column sampling in grower:
```rust
// Before split finding at node
let active = col_sampler.active_features();
splitter.find_split(histogram, parent_stats, active);
```
Files#
| Path | Contents |
|---|---|
| | Module exports |
Design Decisions#
DD-1: Zero gradients, don’t skip. Zeroing unsampled gradients is simpler than maintaining sample indices. Split finding naturally ignores zero-gradient samples.
DD-2: GOSS warmup. Early gradients are noisy (all samples equally wrong). Skip GOSS until model has learned basic patterns.
DD-3: Single column sampling level. XGBoost’s cascading is complex. Most users pick one level. Keep it simple, add cascading if needed.
DD-4: SmallRng for speed. Sampling happens every tree. Use fast RNG (SmallRng) rather than cryptographic quality.
DD-5: Deterministic with seed. Same seed = same sampling sequence. Important for reproducibility.
Accuracy Impact#
| Strategy | Typical Accuracy Change | Training Speed |
|---|---|---|
| None (100%) | Baseline | 1.0× |
| Uniform 80% | -0.1% to -0.5% | 1.2× |
| Uniform 50% | -0.5% to -2% | 1.8× |
| GOSS (20%, 10%) | -0.1% to -0.5% | 1.5× |
GOSS typically matches uniform sampling quality with less data.
Recommended Settings#
| Use Case | Row Sampling | Column Sampling |
|---|---|---|
| Default | None | ByTree(0.8) |
| Large dataset | Uniform(0.8) | ByTree(0.8) |
| Overfitting | GOSS(0.2, 0.1) | ByNode(0.5) |
| Speed priority | Uniform(0.5) | ByTree(0.5) |
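For example, the “Large dataset” row corresponds to a configuration like the following (constructor forms follow the snippets above):

```rust
// Large dataset: 80% row bagging plus 80% of features per tree.
let row_sampling = RowSamplingParams::uniform(0.8);
let col_sampling = ColSamplingParams::ByTree { colsample: 0.8 };
```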
Performance#
Sampling overhead is negligible (<1% of training time):
- Uniform: O(n) gradient zeroing
- GOSS: O(n) average threshold selection (quickselect) + O(n) amplification
- Column: O(m) feature selection
Early Stopping Interaction#
Sampling applies per-tree, not per-round. Early stopping evaluates on validation set (not affected by training sampling). GOSS warmup may delay early stopping convergence slightly.
Testing Strategy#
| Category | Tests |
|---|---|
| Uniform correctness | Correct fraction selected |
| GOSS correctness | Top samples kept, amplification applied |
| Determinism | Same seed → same samples |
| Column sampling | Correct features selected per granularity |
| Warmup | GOSS skips correct number of rounds |
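A sketch of the determinism test against the `RowSampler` API; `make_gradients` is a hypothetical fixture:

```rust
#[test]
fn same_seed_produces_same_sampling() {
    let g0 = make_gradients(1_000);          // hypothetical fixture
    let (mut g1, mut g2) = (g0.clone(), g0.clone());

    let mut s1 = RowSampler::new(RowSamplingParams::uniform(0.8), 7);
    let mut s2 = RowSampler::new(RowSamplingParams::uniform(0.8), 7);

    s1.apply(&mut g1, 0);
    s2.apply(&mut g2, 0);

    // Same seed and config must yield an identical sampling pattern.
    assert_eq!(g1, g2);
}
```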