Quantized Feature Matrix#

A quantized feature matrix stores pre-computed bin indices for all (sample, feature) pairs, enabling O(1) histogram updates instead of O(log b) per-sample bin lookups during training.


Core Concept#

Given histogram cuts computed during preprocessing, the quantization function:

\[Q(x_{i,j}) = \text{SearchBin}(x_{i,j}, \text{cuts}_j)\]

maps each raw feature value to its bin index. This transformation is performed once during data loading, amortizing the O(log b) binary search cost.
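A minimal sketch of this mapping with NumPy (the `cuts` values and the `side` convention are illustrative; libraries differ on whether a value equal to a cut falls in the left or right bin):

```python
import numpy as np

# Sorted cut points for one feature, assumed precomputed during preprocessing.
cuts = np.array([0.5, 1.5, 3.0])

# Q(x) = SearchBin(x, cuts): binary search for the bin containing each value.
values = np.array([0.2, 1.5, 2.7, 9.0])
bins = np.searchsorted(cuts, values, side="left")
print(bins)  # [0 1 2 3] -- values beyond the last cut land in the overflow bin
```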

The Space-Time Trade-off#

Traditional approach:
  - Store raw floats (4 bytes each)
  - Every histogram build: O(n × log(bins)) for bin lookups

Quantized approach:
  - Pre-compute bins once: O(n × log(bins))
  - Store bin indices (1 byte each for ≤256 bins)
  - Every histogram build: O(n) direct indexing

Since histogram building happens many times (once per node, per tree, per iteration), the one-time quantization cost is quickly amortized.
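A rough operation count makes the amortization concrete (illustrative numbers, assuming 100 trees of depth 6, i.e. 6,300 histogram builds):

```python
n, builds = 1_000_000, 100 * 63   # samples; histogram builds (100 trees, 63 nodes each)
log_b = 8                         # log2(256 bins)

without_q = builds * n * log_b    # binary search inside every histogram build
with_q = n * log_b + builds * n   # one-time quantization, then O(n) per build
print(without_q / with_q)         # ~8x fewer lookup operations in this rough model
```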


Benefits#

| Benefit | Explanation |
| --- | --- |
| Memory reduction | 1 byte (uint8) vs 4 bytes (float32) per value, a 4× reduction |
| Cache efficiency | 4× more data fits in cache, reducing memory bandwidth pressure |
| Integer operations | Bin comparisons are integer ops, faster than float comparisons |
| Determinism | Integer comparisons are exact across platforms (no float rounding issues) |

Type Selection#

The storage type adapts to the maximum number of bins:

| max_bins | Storage Type | Bytes per Value |
| --- | --- | --- |
| ≤ 256 | uint8 | 1 |
| ≤ 65536 | uint16 | 2 |
| > 65536 | uint32 | 4 |

256 bins is the sweet spot: sufficient precision for most datasets while fitting in a single byte.
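A sketch of the selection rule in NumPy terms (the helper name is illustrative):

```python
import numpy as np

def bin_dtype(max_bins: int):
    # Narrowest unsigned integer type that can hold bin indices 0..max_bins-1.
    if max_bins <= 256:
        return np.uint8
    if max_bins <= 65536:
        return np.uint16
    return np.uint32

print(bin_dtype(256))   # uint8
print(bin_dtype(1024))  # uint16
```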


Memory Layouts#

Dense (Row-Major)#

For data without missing values, store bin indices in a contiguous 2D array:

Layout: bins[row × n_features + feature]

Row 0: [b₀₀, b₀₁, b₀₂, ..., b₀ₙ]
Row 1: [b₁₀, b₁₁, b₁₂, ..., b₁ₙ]
Row 2: [b₂₀, b₂₁, b₂₂, ..., b₂ₙ]
...

Access pattern: O(1) per cell, excellent cache locality for row-wise iteration.

Best for: Dense data, histogram building (iterates by row).
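In code, the layout is a flat buffer plus the index arithmetic above (a sketch; names are illustrative):

```python
import numpy as np

n_rows, n_features = 1000, 10
bins = np.zeros(n_rows * n_features, dtype=np.uint8)  # contiguous row-major buffer

def get_bin(row: int, feature: int) -> int:
    # O(1) direct indexing: no search, no pointer chasing.
    return bins[row * n_features + feature]
```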

Sparse (CSR)#

For sparse data or data with many missing values, Compressed Sparse Row format:

row_ptr: [0, 2, 3, 6]           ← Row boundaries  
col_idx: [0, 2, 1, 0, 1, 2]     ← Feature indices  
values:  [b₀, b₁, b₂, b₃, b₄, b₅]  ← Bin values

Row 0: 2 values at features 0, 2
Row 1: 1 value at feature 1
Row 2: 3 values at features 0, 1, 2

Key property: Missing values are implicit (not stored), saving memory.

Best for: >50% sparsity, memory-constrained environments.
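A sketch of random access against the three arrays above (the bin values are made up; the binary search over `col_idx` is what makes lookup O(log nnz) per row):

```python
import numpy as np

row_ptr = np.array([0, 2, 3, 6])
col_idx = np.array([0, 2, 1, 0, 1, 2])
values = np.array([7, 1, 4, 2, 0, 3], dtype=np.uint8)  # bin indices

def get_bin(row: int, feature: int):
    lo, hi = row_ptr[row], row_ptr[row + 1]
    # Binary search the row's stored feature indices.
    pos = lo + np.searchsorted(col_idx[lo:hi], feature)
    if pos < hi and col_idx[pos] == feature:
        return values[pos]
    return None  # implicit missing value

print(get_bin(1, 1))  # 4
print(get_bin(1, 0))  # None -- not stored
```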

GPU (ELLPACK)#

GPUs require uniform memory access patterns for coalesced reads:

ELLPACK (padded dense format):
  row_stride = max_features_per_row

  Row 0: [b₀, b₁, PAD]
  Row 1: [b₀, PAD, PAD]  
  Row 2: [b₀, b₁, b₂]

All rows padded to the same width, enabling predictable strided access.

Why GPUs need this: Threads in a warp (32 threads) must access adjacent memory addresses for efficient coalesced reads. Variable-length rows break this pattern.

Trade-off: Padding wastes memory for sparse data, but enables massive parallelism.
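A minimal sketch of the padding step (the PAD value and input layout are illustrative; real GPU formats store global bin indices, so the feature is recoverable from the bin value itself):

```python
import numpy as np

PAD = 255  # reserved sentinel for padding slots

# Per-row bin values from the sparse example above.
rows = [[7, 1], [4], [2, 0, 3]]

row_stride = max(len(r) for r in rows)  # widest row fixes the stride
ell = np.full((len(rows), row_stride), PAD, dtype=np.uint8)
for i, r in enumerate(rows):
    ell[i, :len(r)] = r                 # left-pack values, pad the rest

print(ell)
# [[  7   1 255]
#  [  4 255 255]
#  [  2   0   3]]
```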


Global Bin Indexing#

During histogram building, we need a global bin index across all features to use a single flat histogram array:

Feature 0: 3 bins → local indices {0, 1, 2}
Feature 1: 4 bins → local indices {0, 1, 2, 3}  
Feature 2: 3 bins → local indices {0, 1, 2}

feature_offsets = [0, 3, 7, 10]   // Cumulative sum

Global bin = feature_offsets[feature] + local_bin

Example: feature=1, local_bin=2 → global_bin = 3 + 2 = 5

This enables O(1) mapping and a single contiguous histogram array.
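The offsets are just an exclusive prefix sum over per-feature bin counts (a sketch matching the example above):

```python
import numpy as np

bins_per_feature = np.array([3, 4, 3])
feature_offsets = np.concatenate(([0], np.cumsum(bins_per_feature)))
print(feature_offsets)  # [ 0  3  7 10]

feature, local_bin = 1, 2
print(feature_offsets[feature] + local_bin)  # 5, as in the example
```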


Quantization Algorithm#

ALGORITHM: BuildQuantizedMatrix(data, max_bins)
───────────────────────────────────────────────
Input:  Raw feature matrix data[n_rows × n_features]
        Maximum bins per feature
Output: Quantized matrix with bin indices

1. cuts ← BuildHistogramCuts(data, max_bins)  // See histogram-cuts.md

2. SELECT storage type:
   IF max_bins ≤ 256:   use uint8
   ELIF max_bins ≤ 65536: use uint16
   ELSE: use uint32

3. PARALLEL FOR row ∈ [0, n_rows):
       FOR feature ∈ [0, n_features):
           value ← data[row, feature]
           IF is_missing(value):
               bin ← MISSING_BIN  // Special sentinel
           ELSE:
               bin ← BinarySearch(value, cuts[feature])
           index[row, feature] ← bin

4. RETURN QuantizedMatrix(index, cuts, feature_offsets)

Complexity: O(n × d × log(bins)) for n samples, d features.

Parallelization: Embarrassingly parallel — each cell independent.
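A runnable NumPy rendering of the algorithm (a sketch, assuming ≤256 bins, NaN for missing values, and a `missing_bin` sentinel; real implementations parallelize across rows or columns):

```python
import numpy as np

def build_quantized_matrix(data, cuts, missing_bin=255):
    """data: (n_rows, n_features) float array, NaN = missing.
    cuts[j]: sorted cut points for feature j."""
    n_rows, n_features = data.shape
    index = np.empty((n_rows, n_features), dtype=np.uint8)
    for j in range(n_features):
        col = data[:, j]
        bins = np.searchsorted(cuts[j], col, side="left")  # O(log b) per value
        index[:, j] = np.where(np.isnan(col), missing_bin, bins)
    return index

data = np.array([[0.2, 5.0], [np.nan, 1.0]])
cuts = [np.array([0.5, 1.5]), np.array([2.0])]
print(build_quantized_matrix(data, cuts))
# [[  0   1]
#  [255   0]]
```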


Access During Histogram Building#

The histogram building loop uses the quantized matrix for O(1) bin lookup:

ALGORITHM: BuildHistogram(node_samples, gradients, qmatrix)
───────────────────────────────────────────────────────────
histogram ← zeros(total_bins)

FOR row ∈ node_samples:
    grad ← gradients[row]
    FOR feature ∈ [0, n_features):
        local_bin ← qmatrix[row, feature]
        global_bin ← feature_offsets[feature] + local_bin
        histogram[global_bin] += grad

RETURN histogram

Key insight: The inner loop is just an array index + accumulate — no binary search.
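A vectorized NumPy sketch of the same loop (names are illustrative; `np.add.at` performs the scatter-add so repeated bins accumulate correctly):

```python
import numpy as np

def build_histogram(node_samples, gradients, index, feature_offsets):
    total_bins = feature_offsets[-1]
    histogram = np.zeros(total_bins)
    # Global bin for every (sample, feature) cell in the node.
    global_bins = index[node_samples] + feature_offsets[:-1]
    # Each sample's gradient is added once per feature: pure index + accumulate.
    grads = np.repeat(gradients[node_samples], index.shape[1])
    np.add.at(histogram, global_bins.ravel(), grads)
    return histogram
```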


Missing Value Handling#

Different strategies for representing missing values:

| Strategy | Description | Memory | Lookup |
| --- | --- | --- | --- |
| Sentinel bin | Assign bin_id = max_bins to missing values | Same as dense | O(1) |
| Separate mask | Boolean bitmask tracking missing entries | +12.5% (1 bit per 1-byte value) | O(1) |
| Implicit (CSR) | Missing values are simply not stored | Wins when sparse | O(log nnz) |

XGBoost uses sentinel bins. LightGBM uses a mix depending on sparsity.
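A sketch of the sentinel-bin strategy (illustrative, not any library's actual code; the histogram simply reserves one extra slot per feature so missing values accumulate like any other bin):

```python
import numpy as np

max_bins = 4
MISSING_BIN = max_bins  # one slot past the real bins 0..max_bins-1

cuts = np.array([0.5, 1.5, 3.0])

def quantize(v):
    return MISSING_BIN if np.isnan(v) else int(np.searchsorted(cuts, v))

print(quantize(2.0))     # 2
print(quantize(np.nan))  # 4 -- dedicated missing bin
```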


Memory Estimates#

For 1 million samples, 100 features:

| Format | Size | Notes |
| --- | --- | --- |
| Raw float32 | 400 MB | 4 bytes × 100M cells |
| Quantized uint8 | 100 MB | 1 byte × 100M cells |
| CSR (50% sparse) | ~75 MB | Saves on missing values, adds index arrays |
| ELLPACK (GPU) | Varies | Depends on row_stride and bit-packing |
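The dense figures follow directly from the cell count (decimal megabytes; a quick check):

```python
n_rows, n_features = 1_000_000, 100
cells = n_rows * n_features  # 100M cells

print(cells * 4 / 1e6)  # 400.0 MB -- raw float32
print(cells * 1 / 1e6)  # 100.0 MB -- quantized uint8
```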


Summary#

| Aspect | Description |
| --- | --- |
| Purpose | Pre-computed bin indices for O(1) histogram updates |
| Key trade-off | One-time O(log b) quantization instead of a bin search in every histogram build |
| Memory | 4× reduction with uint8 bins |
| CPU format | Dense or CSR depending on sparsity |
| GPU format | ELLPACK (padded, bit-packed) |
| Missing values | Sentinel bin or implicit (CSR) |