Building histograms is the dominant cost in gradient boosting. With a million samples, every tree must iterate over all of them to build gradient statistics. Can we do better?

The answer is sampling. Instead of using all samples to build each tree, we can train on a subset. This post explores two approaches: random subsampling and LightGBM’s smarter alternative, GOSS (Gradient-based One-Side Sampling).


Why Sample?

Sampling provides three benefits:

Speed: Fewer samples mean faster histogram building. With 50% sampling, each tree trains roughly 2× faster.

Regularization: Each tree sees different data, reducing the risk of overfitting to specific examples.

Variance reduction: The ensemble averages over trees trained on different subsets, similar to bagging.

The tradeoff is potential accuracy loss. But smart sampling can minimize this.


Random Subsampling

The simplest approach: randomly select a fraction of samples for each tree.

Algorithm: Random Subsampling

Input: Dataset of n samples, subsample_rate

For each tree:
    1. Select k = floor(n × subsample_rate) samples uniformly at random
    2. Train tree on selected samples only
    3. Use original gradients (no weight adjustment needed)

With subsample = 0.5, each tree trains on a random 50% of the data.
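
As a concrete illustration, here is a minimal NumPy sketch of the selection step. The random_subsample helper is hypothetical (not how XGBoost or LightGBM implement it internally), but it captures the per-tree behavior described above.

import numpy as np

def random_subsample(n_samples, subsample_rate=0.5, rng=None):
    """Pick floor(n_samples * subsample_rate) row indices uniformly at random, without replacement."""
    rng = np.random.default_rng(rng)
    k = int(n_samples * subsample_rate)
    return rng.choice(n_samples, size=k, replace=False)

# Each boosting round draws a fresh subset before building histograms:
rows = random_subsample(1_000_000, subsample_rate=0.5, rng=42)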

Advantages:

  • Simple to implement
  • Unbiased gradient estimates
  • Good regularization

Disadvantages:

  • May discard important samples
  • Treats all samples equally, ignoring information about which are hard to predict

When to Use

Random subsampling works well when:

  • You want a simple regularization technique
  • Dataset is large enough that missing some samples doesn’t hurt
  • Samples are roughly equally important

The Insight: Not All Samples Are Equal

Here’s the key observation: samples with large gradients are more informative than samples with small gradients.

Why? The gradient tells us how much the loss would decrease if we adjusted the prediction for sample i. A large gradient means the model is making a big mistake on that sample. A small gradient means the model is already accurate there.

If we must subsample, we should keep the high-gradient samples (they drive learning) and subsample the low-gradient ones (they contribute less).

Gradient Magnitudes

In a classification problem with logistic loss, the gradient for sample i is g_i = p_i - y_i, where p_i is the predicted probability and y_i the true label:

  • Hard positive: y = 1, p = 0.1 → g = -0.9 (large negative gradient)
  • Hard negative: y = 0, p = 0.9 → g = +0.9 (large positive gradient)
  • Easy sample: y = 1, p = 0.95 → g = -0.05 (small gradient)

Hard samples have 18× larger gradient magnitude than easy ones!
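
A quick check of these numbers, using the g = p - y gradient of the logistic loss (the probabilities are just the illustrative values above):

examples = {"hard positive": (1, 0.10),   # (label y, predicted probability p)
            "hard negative": (0, 0.90),
            "easy sample":   (1, 0.95)}
for name, (y, p) in examples.items():
    print(f"{name}: g = {p - y:+.2f}")
# hard positive: g = -0.90
# hard negative: g = +0.90
# easy sample:   g = -0.05   -> 0.90 / 0.05 = 18× larger magnitude for hard samples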


GOSS: Gradient-based One-Side Sampling

GOSS (introduced in the LightGBM paper) exploits this insight. The name “One-Side” refers to keeping one side of the gradient distribution (the high-magnitude tail) intact while sampling the other side:

  1. Keep all samples with large gradients (the “top” samples)
  2. Randomly subsample the rest (the “other” samples)
  3. Upweight the subsampled samples to maintain approximately unbiased estimates

Algorithm: GOSS Sampling

Input: Gradients g[], top_rate, other_rate
       e.g., top_rate = 0.2, other_rate = 0.1

1. Compute importance scores: score[i] = |g[i]|

2. Select top samples:
   k_top = floor(n × top_rate)
   top_indices = indices of k_top largest scores

3. Sample from the rest:
   other_indices = all indices not in top_indices
   k_other = floor(n × other_rate)
   sampled_other = random_sample(other_indices, k_other)

4. Upweight the sampled-other samples:
   weight = (1 - top_rate) / other_rate
   For each i in sampled_other:
       g[i] *= weight
       h[i] *= weight

5. Return top_indices + sampled_other
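
The same steps as a NumPy sketch. The goss_sample helper is hypothetical, for illustration only (not LightGBM's internal implementation); it assumes top_rate and other_rate are both fractions of the full dataset, as in the algorithm above.

import numpy as np

def goss_sample(g, h, top_rate=0.2, other_rate=0.1, rng=None):
    """Return (selected indices, upweighted gradients, upweighted hessians)."""
    rng = np.random.default_rng(rng)
    n = len(g)
    k_top = int(n * top_rate)
    k_other = int(n * other_rate)

    # Rank samples by gradient magnitude; keep the top k_top untouched.
    order = np.argsort(-np.abs(g))
    top_idx = order[:k_top]

    # Randomly sample k_other indices from the remaining samples.
    sampled_other = rng.choice(order[k_top:], size=k_other, replace=False)

    # Upweight the sampled "other" gradients and hessians.
    weight = (1.0 - top_rate) / other_rate
    g_out, h_out = g.copy(), h.copy()
    g_out[sampled_other] *= weight
    h_out[sampled_other] *= weight

    return np.concatenate([top_idx, sampled_other]), g_out, h_out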

Why Upweighting?

The “other” samples stand in for the entire low-gradient population. We keep only other_rate × n of them, so each one must “count for” several samples in the gradient sum.

The weight formula:

    weight = (1 - top_rate) / other_rate

With top_rate = 0.2 and other_rate = 0.1:

  • The “other” population is 1 - 0.2 = 80% of the data
  • From it we draw samples equal to 10% of the full dataset, i.e., one in eight of the “other” samples
  • Weight = (1 - 0.2) / 0.1 = 8

This ensures the total gradient contribution from “other” samples remains correct in expectation. The estimate is approximately unbiased (the approximation improves with larger datasets).
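
A quick Monte Carlo sanity check of this claim, reusing the goss_sample sketch from above: averaged over many random draws, the upweighted gradient sum should land close to the full-data gradient sum.

import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=1.0, size=10_000)   # synthetic gradients
h = np.ones_like(g)                    # synthetic hessians

full_sum = g.sum()
estimates = []
for trial in range(200):
    idx, g_w, _ = goss_sample(g, h, top_rate=0.2, other_rate=0.1, rng=trial)
    estimates.append(g_w[idx].sum())

print(full_sum, np.mean(estimates))    # the two sums should be close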

Multi-Class GOSS

For multi-class classification, each sample has multiple gradients (one per class). The importance score becomes the sum of absolute gradients across classes: score[i] = Σ_c |g[i, c]|.
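
With gradients stored as an (n_samples, n_classes) array, that score is one line of NumPy (a sketch of the scoring step only):

import numpy as np

g_multi = np.random.default_rng(0).normal(size=(1_000, 3))  # (n_samples, n_classes)
score = np.abs(g_multi).sum(axis=1)                         # one importance score per sample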

Effective Sample Rate

How much data does GOSS actually use? Since both rates are fractions of the full dataset, the fraction of samples used per tree is simply their sum:

    effective_rate = top_rate + other_rate

With top_rate = 0.2 and other_rate = 0.1:

    effective_rate = 0.2 + 0.1 = 0.30

GOSS uses 30% of the data per tree, yet keeps the most informative 20% intact.

Compare this to random 30% subsampling: GOSS gets the same speedup but with much less information loss.


GOSS vs Random: Visual Comparison

Dataset: 100 samples, sorted by gradient magnitude
 
Random Subsampling (30%):
| High gradient |  Medium gradient  | Low gradient |
| ●○●○●○●○●○    | ●○●○●○●○●○●○●○●○  | ○●○●○●○●○●   |
         (random selection across all samples)
 
GOSS (top=20%, other=10%):
| High gradient |  Medium gradient  | Low gradient |
| ●●●●●●●●●●    | ●○○○○○○○●○○○○○○○  | ○○○●○○○○○●   |
    (all top 20%)  (10 of the remaining 80, chosen at random)
 
● = selected, ○ = not selected

GOSS guarantees all high-gradient samples are included, while random sampling may miss some.


Column Sampling

In addition to row sampling, gradient boosting supports column (feature) sampling:

| Parameter         | When Applied     | Description                       |
|-------------------|------------------|-----------------------------------|
| colsample_bytree  | Once per tree    | Sample features at the tree level |
| colsample_bylevel | Each depth level | Sample features per tree depth    |
| colsample_bynode  | Each split       | Sample features per node          |

The rates multiply together: the fraction of features considered at any given node is colsample_bytree × colsample_bylevel × colsample_bynode.

Combined Column Sampling

With colsample_bytree = 0.8, colsample_bylevel = 0.8, colsample_bynode = 0.8:

At each node, you consider 0.8 × 0.8 × 0.8 = 0.512, i.e., about 51% of the features.
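
For reference, here is how those three knobs look with XGBoost's scikit-learn wrapper (a sketch; the 0.8 values are the illustrative rates from the example above, not recommended defaults):

from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    colsample_bytree=0.8,    # features sampled once per tree
    colsample_bylevel=0.8,   # resampled at each depth level
    colsample_bynode=0.8,    # resampled again at each split
)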

Column sampling provides:

  • Regularization: Prevents reliance on a few dominant features
  • Speed: Fewer features to evaluate per split
  • Diversity: Trees use different feature subsets

Stochastic Gradient Boosting

When you combine row and column sampling, you get Stochastic Gradient Boosting:

For each tree:
    rows = sample_rows(data, subsample_rate)    # or GOSS
    cols = sample_columns(features, colsample_bytree)
    
    For each node:
        node_cols = sample_columns(cols, colsample_bynode)
        histogram = build_histogram(rows, node_cols)
        split = find_best_split(histogram)

This introduces randomness at multiple levels:

  • Which rows each tree sees
  • Which features each tree/node considers
  • Combined with learning rate shrinkage

The result is a diverse ensemble that is less prone to overfitting.
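
In practice this is just a handful of parameters. A sketch with LightGBM's scikit-learn wrapper (parameter names are LightGBM's documented aliases; the values are illustrative, not tuned):

from lightgbm import LGBMClassifier

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.1,
    subsample=0.8,          # row sampling (alias of bagging_fraction)
    subsample_freq=1,       # re-draw the row sample at every iteration
    colsample_bytree=0.8,   # per-tree feature sampling (alias of feature_fraction)
)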


GOSS Warm-Up Period

LightGBM disables GOSS for the first few iterations:

warm_up_rounds = 1 / learning_rate
 
If iteration < warm_up_rounds:
    Use all data (no GOSS)
Else:
    Apply GOSS

Why? Early in training, the model is bad everywhere. Gradients don’t yet reflect true sample difficulty. Once the model has learned basic patterns, gradients become meaningful indicators of which samples are genuinely hard.

With learning_rate = 0.1, this means about 10 warm-up rounds.
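
The rule itself is one line of arithmetic. A small helper mirroring the pseudocode above (a sketch, not LightGBM's internal code):

def goss_enabled(iteration, learning_rate=0.1):
    warm_up_rounds = 1.0 / learning_rate   # e.g. 10 rounds at learning_rate = 0.1
    return iteration >= warm_up_rounds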


When to Use What

| Scenario                   | Recommendation                                           |
|----------------------------|----------------------------------------------------------|
| Baseline                   | Random subsampling (subsample = 0.8)                     |
| Large dataset (> 1M rows)  | GOSS for faster training                                 |
| Imbalanced classification  | GOSS (keeps minority-class samples with high gradients)  |
| Overfitting concerns       | Lower subsample, add colsample_bytree                    |
| Small dataset (< 10k)      | No sampling, or very mild (subsample = 0.9)              |
| Debugging                  | Random (simpler to reason about)                         |

Hyperparameters

Row Sampling

| Parameter               | Library  | Default | Typical Range |
|-------------------------|----------|---------|---------------|
| subsample               | Both     | 1.0     | 0.5 - 1.0     |
| boosting_type = 'goss'  | LightGBM | -       | Enable GOSS   |
| top_rate                | LightGBM | 0.2     | 0.1 - 0.3     |
| other_rate              | LightGBM | 0.1     | 0.05 - 0.2    |

Column Sampling

| Parameter          | Library | Default | Typical Range |
|--------------------|---------|---------|---------------|
| colsample_bytree   | Both    | 1.0     | 0.5 - 1.0     |
| colsample_bylevel  | XGBoost | 1.0     | 0.5 - 1.0     |
| colsample_bynode   | Both    | 1.0     | 0.5 - 1.0     |

Starting Configuration

For moderate regularization and speed:

subsample = 0.8
colsample_bytree = 0.8
colsample_bynode = 1.0

For large datasets (> 1M rows) with LightGBM:

boosting_type = 'goss'
top_rate = 0.2
other_rate = 0.1
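
As a Python sketch with LightGBM's scikit-learn wrapper (top_rate and other_rate are passed through to the native booster; values match the configuration above):

from lightgbm import LGBMClassifier

model = LGBMClassifier(
    boosting_type='goss',
    top_rate=0.2,        # keep the top 20% of samples by |gradient|
    other_rate=0.1,      # plus a random 10% of the data from the rest
    n_estimators=500,
    learning_rate=0.1,
)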

Performance Impact

| Strategy        | Effective Data     | Speedup | Accuracy Trade-off              |
|-----------------|--------------------|---------|---------------------------------|
| Random 50%      | 50%                | ~2×     | Moderate                        |
| Random 80%      | 80%                | ~1.25×  | Small                           |
| GOSS (default)  | 30%                | ~3.3×   | Small (keeps important samples) |
| + Column 80%    | 30% × 80% features | ~4×     | Moderate                        |

GOSS achieves better speedup-accuracy trade-offs than random sampling because it preserves the most informative samples.

GOSS Memory Overhead

GOSS requires storing sample weights, adding about 8 bytes per sample. For very large datasets, this is negligible compared to the histogram memory savings from using fewer samples.


What’s Next

We’ve covered how histogram-based training works and how sampling speeds it up. But what about categorical features and reducing the number of features?

The next post, Exclusive Feature Bundling and Categorical Features, explains LightGBM’s EFB optimization (bundling sparse features) and how both libraries handle categorical data natively.


Summary

Sampling speeds up training by using subsets of data:

  • Random subsampling: Simple, uniform selection. Good baseline.
  • GOSS: Keep high-gradient samples, subsample the rest. More efficient.

GOSS insight: samples with large gradients are more informative. Keeping them while subsampling easy samples preserves learning signal with less data.

Column sampling adds feature-level randomness, providing regularization and diversity.

| Method      | Key Parameters           | When to Use                        |
|-------------|--------------------------|------------------------------------|
| Random row  | subsample                | Baseline, small speedup needed     |
| GOSS        | top_rate, other_rate     | Large datasets, need speed         |
| Column      | colsample_bytree/bynode  | Regularization, feature diversity  |

References

  1. Ke, G. et al. (2017). “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”. NeurIPS 2017. Section 4: GOSS. PDF

  2. Friedman, J.H. (2002). “Stochastic Gradient Boosting”. Computational Statistics & Data Analysis, 38(4), 367-378. PDF

  3. Chen, T. & Guestrin, C. (2016). “XGBoost: A Scalable Tree Boosting System”. KDD 2016. arXiv