# RFC-0001: Dataset
- Status: Implemented
- Created: 2025-12-15
- Updated: 2026-01-02
- Scope: User-facing raw data container
## Summary
`Dataset` is the primary user-facing data type. It stores:

- Raw feature values (dense or sparse, one column per feature)
- Optional targets (shape `[n_outputs, n_samples]`)
- Optional sample weights (length `n_samples`)
- Immutable schema / feature metadata
The core principle (from RFC-0021): keep raw and binned data separate.

- `Dataset` is for raw-value access (prediction, SHAP, GBLinear, linear leaves, metrics).
- `BinnedDataset` is for training-only histogram building (bin indices + bin mappers).
## Goals
- Make the common access patterns fast and hard to misuse.
- Support sparse features without densifying the full matrix.
- Keep the API small and stable: add new access primitives only when they are clearly needed.
## Non-goals
- Store both feature-major and sample-major layouts permanently (storing both would double memory).
- Provide a “universal iterator” abstraction over every access mode.
## Data Model

### Dataset storage (conceptual)
```rust
pub struct Dataset {
    features: Vec<Feature>,
    n_samples: usize,
    schema: DatasetSchema,
    targets: Option<Array2<f32>>, // [n_outputs, n_samples]
    weights: Option<Array1<f32>>, // [n_samples]
}

pub enum Feature {
    Dense(Array1<f32>),
    Sparse {
        indices: Vec<u32>,
        values: Vec<f32>,
        n_samples: usize,
        default: f32,
    },
}
```
### Sparse feature semantics and invariants
A sparse feature is “column-sparse”: it stores a sorted list of (row index, value) pairs.
All unspecified rows implicitly have the value `default`.

Required invariants:

- `indices.len() == values.len()`
- `indices` are strictly increasing (sorted, no duplicates)
- each index is in bounds: `index < n_samples`

The builder validates these invariants and returns `DatasetError` on violations.
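A minimal standalone sketch of these checks, using `String` errors in place of the real `DatasetError`:

```rust
// Sketch only: the actual builder returns `DatasetError`; plain strings keep
// this example self-contained.
fn validate_sparse(indices: &[u32], values: &[f32], n_samples: usize) -> Result<(), String> {
    if indices.len() != values.len() {
        return Err("indices/values length mismatch".to_string());
    }
    // Strictly increasing implies sorted with no duplicates.
    if indices.windows(2).any(|w| w[0] >= w[1]) {
        return Err("indices must be strictly increasing".to_string());
    }
    if indices.iter().any(|&i| (i as usize) >= n_samples) {
        return Err("sparse index out of bounds".to_string());
    }
    Ok(())
}
```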
### Targets and weights
- Targets are optional; when present they are stored as `[n_outputs, n_samples]`.
- Weights are optional; when absent they are treated as uniform weight `1.0`.
## Why Feature-Major?
The library’s training loops are feature-dominant:
- GBDT histogram building iterates one feature across many samples.
- GBLinear / coordinate descent iterates one feature across many samples.

So we store features as columns (`[n_features, n_samples]` conceptually) to keep each feature contiguous.
Prediction and SHAP need sample-major access. Instead of storing both layouts, we use block transposition:
- `Dataset::buffer_samples()` fills a caller-owned buffer of shape `[block_size, n_features]`.
- The returned `SamplesView` is sample-major and cache-friendly for traversal.
## Access Patterns (the important part)
The key design choice is that we expose a few primitives that map directly to algorithmic needs.
### 1) Sparse-efficient iteration: `for_each_feature_value`
```rust
dataset.for_each_feature_value(feature_idx, |sample_idx, value| {
    // Dense: called n_samples times
    // Sparse: called nnz times (only stored non-default values)
});
```
Use when:
- You want to ignore implicit defaults (common for sparse-aware algorithms).
- You want “match once, loop tight” performance (no per-element branching).
Important semantics:
- For sparse features, this iterates only stored values.
- If you need a dense stream including defaults, use `for_each_feature_value_dense`.
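As an illustration (not part of the RFC surface), a sparse-aware accumulation that visits only stored entries; the closure arguments follow the snippet above:

```rust
// Hypothetical usage: accumulate statistics over stored (non-default) entries only.
let mut sum = 0.0_f64;
let mut nnz = 0_usize;
dataset.for_each_feature_value(feature_idx, |_sample_idx, value| {
    sum += f64::from(value);
    nnz += 1;
});
// For a sparse feature, defaults are never visited, so `nnz` is the stored count.
```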
### 2) Dense stream without materialization: `for_each_feature_value_dense`
```rust
dataset.for_each_feature_value_dense(feature_idx, |sample_idx, value| {
    // Always called exactly n_samples times.
});
```
Use when:
- Downstream code expects one value per sample (e.g., quantization/binning).
- You want to keep sparse storage but still “act dense” on demand.
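For example, a sketch of feeding bin-edge computation with per-feature min/max; the NaN filtering here is illustrative only:

```rust
// Hypothetical usage: per-feature min/max as input to bin-edge computation.
// Visits every sample exactly once, including implicit sparse defaults.
let (mut lo, mut hi) = (f32::INFINITY, f32::NEG_INFINITY);
dataset.for_each_feature_value_dense(feature_idx, |_sample_idx, value| {
    if !value.is_nan() {
        lo = lo.min(value);
        hi = hi.max(value);
    }
});
```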
### 3) Subset gather for leaves: `gather_feature_values`
```rust
dataset.gather_feature_values(feature_idx, sample_indices, buffer);
```
This is optimized for “give me feature values for a subset of rows”, and is used by linear leaves.
Critical requirement: `sample_indices` must be sorted ascending.

- Dense features still work either way.
- Sparse features use a merge-join; unsorted indices can produce incorrect results.

If you want to avoid writing into a buffer, use `for_each_gathered_value()`.
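A sketch of the intended call pattern; the index and buffer types are assumed here, and only the sorted-ascending requirement comes from this RFC:

```rust
// Hypothetical: gather raw values for the (sorted) rows assigned to one leaf.
let sample_indices: Vec<u32> = vec![2, 5, 9, 42]; // must be sorted ascending
let mut buffer = vec![0.0_f32; sample_indices.len()];
dataset.gather_feature_values(feature_idx, &sample_indices, &mut buffer);
// buffer[i] now holds the feature value for sample_indices[i].
```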
### 4) Sample-major blocks for traversal: `buffer_samples`
```rust
let mut block = ndarray::Array2::<f32>::zeros((block_size, dataset.n_features()));
let samples = dataset.buffer_samples(&mut block, start_sample);
// samples: SamplesView<'_> with shape [samples_filled, n_features]
```
Use when:
- You traverse trees row-by-row (GBDT prediction, TreeSHAP).
- You want to parallelize by sample blocks, with per-thread buffer reuse.
Semantics:
- The buffer must have shape `[block_size, n_features]`.
- The returned view covers only the filled prefix (the last block at the end of the dataset can be partial).
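A sketch of a blocked traversal loop under these semantics; `dataset.n_samples()`, the `SamplesView` accessors (`n_samples()`, `row()`), and `predict_row` are assumed here for illustration:

```rust
// Hypothetical blocked traversal: one reusable caller-owned buffer, advanced
// block by block; the last view may cover fewer than `block_size` rows.
let block_size = 256;
let mut block = ndarray::Array2::<f32>::zeros((block_size, dataset.n_features()));
let mut start = 0;
while start < dataset.n_samples() {
    let samples = dataset.buffer_samples(&mut block, start);
    for row in 0..samples.n_samples() {
        let _pred = predict_row(samples.row(row)); // one contiguous [n_features] row
    }
    start += samples.n_samples();
}
```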
### Summary table
| Pattern | Method | Intended use |
|---|---|---|
| Sparse-aware scan | `for_each_feature_value` | GBLinear, LinearSHAP, sparse feature stats |
| Dense stream without densifying | `for_each_feature_value_dense` | Binning/quantization |
| Subset gather | `gather_feature_values` | Linear leaves (per-leaf subsets) |
| Sample-major traversal | `buffer_samples` | Prediction, TreeSHAP |
## Missing Values
Missing values are represented as `f32::NAN`.

- Tree models handle missing via the split’s `default_left` direction.
- Sparse features can use `default = f32::NAN` to treat unspecified entries as missing (see the sketch below).
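For example, a builder call (signature as in the Construction section below, with illustrative values) where unspecified rows become missing:

```rust
// Hypothetical: 8-sample sparse feature; only rows 0 and 4 store values,
// every other row is implicitly NaN and therefore treated as missing.
let ds = Dataset::builder()
    .add_sparse("mostly_missing", vec![0, 4], vec![1.5, -2.0], 8, f32::NAN)
    .build()?;
```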
## Categorical Features
Categorical features are stored as `f32`, but they are conceptually integer category IDs:

- Valid values are non-negative integers encoded as floats (e.g., `0.0, 1.0, 2.0`).
- The schema marks them as `FeatureType::Categorical`.
- Binning casts to integer internally for categorical split logic.
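A small sketch of what “non-negative integer encoded as a float” means in practice (this helper is illustrative, not part of the API):

```rust
// Hypothetical helper: a valid category ID is a finite, non-negative whole number.
fn is_valid_category(v: f32) -> bool {
    v.is_finite() && v >= 0.0 && v.fract() == 0.0
}

assert!(is_valid_category(2.0));        // category 2
assert!(!is_valid_category(1.5));       // not an integer encoding
assert!(!is_valid_category(f32::NAN));  // missing, handled separately
```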
## Views
The data module provides lightweight read-only views with explicit semantics:
- `TargetsView`: target values with shape `[n_outputs, n_samples]`.
- `WeightsView`: either uniform weights or a slice of explicit weights.
- `SamplesView`: sample-major `[n_samples, n_features]` view, produced by `buffer_samples`.
Multi-output targets:
- `n_outputs = 1` for regression and binary classification.
- `n_outputs = K` for K-class classification (one row per class/logit).
## Construction

### `Dataset::from_array`
Convenience constructor for dense feature-major input:
- Features shape: `[n_features, n_samples]`
- Targets shape (optional): `[n_outputs, n_samples]`
- Weights shape (optional): `[n_samples]`
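A sketch of a call; the exact signature is not specified in this RFC, so the optional arguments are assumed here to be passed as `Option` views:

```rust
use ndarray::array;

// 2 features x 3 samples (feature-major) and 1 output x 3 samples.
let features = array![[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]];
let targets = array![[0.0, 1.0, 0.0]];

// Hypothetical argument order: features, targets, weights.
let ds = Dataset::from_array(features.view(), Some(targets.view()), None)?;
```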
### `Dataset::builder`
Use the builder for mixed dense/sparse and explicit metadata:
```rust
use boosters::data::Dataset;
use ndarray::array;

// All features and targets cover the same 5 samples.
let ds = Dataset::builder()
    .add_feature("age", array![25.0, 30.0, 35.0, 40.0, 45.0])
    .add_categorical("color", array![0.0, 1.0, 2.0, 1.0, 0.0])
    .add_sparse("rare", vec![1, 3], vec![10.0, 30.0], 5, 0.0)
    .targets_1d(array![0.0, 1.0, 0.0, 1.0, 0.0].view())
    .build()?;
```
Validation happens in `build()` (empty features, shape mismatches, sparse index invariants).
## Integration and Ownership
The split between raw and binned data is intentional:
- `Dataset` owns raw values and schema.
- `BinnedDataset::from_dataset(&Dataset, &BinningConfig)` derives training bins without retaining a reference.
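A sketch of the training-time handoff; `BinningConfig::default()` is assumed here for illustration:

```rust
// Hypothetical: derive training bins from the raw dataset. `binned` owns its
// bin indices and mappers and does not borrow from `dataset`, so both can be
// used independently (bins for training, raw values for prediction/SHAP).
let binned = BinnedDataset::from_dataset(&dataset, &BinningConfig::default());
```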
What needs what:
| Component | Needs bins | Needs raw | Needs targets | Needs weights |
|---|---|---|---|---|
| GBDT histogram building | ✅ | ❌ | ❌ (uses grads) | ❌ |
| GBDT split finding | ✅ | ❌ | ❌ | ❌ |
| GBDT prediction | ❌ | ✅ | ❌ | ❌ |
| GBLinear training | ❌ | ✅ | ✅ | ✅ |
| Linear leaves (per-leaf fit) | ❌ | ✅ | ✅ | ✅ |
| Tree SHAP / Linear SHAP | ❌ | ✅ | ❌ | ❌ |
## Files
| Path | Contents |
|---|---|
|  |  |
|  |  |
|  |  |
|  |  |
## Design Decisions
**DD-1: Separate raw from binned.** `Dataset` holds raw values; `BinnedDataset` holds bins only.
This keeps ownership clear and avoids raw-value duplication.

**DD-2: Sparse via enum (not traits).** Storage is represented as `Feature::{Dense, Sparse}`.
Matching once avoids per-element branching.

**DD-3: Caller-owned buffers.** Sample-major traversal is done via `buffer_samples()` into a caller buffer.
This enables per-thread buffer reuse without allocations.

**DD-4: Small set of access primitives.** The API exposes a few operations that map to actual algorithms (scan, dense stream, gather, sample blocks), rather than a large general-purpose accessor surface.
## Testing Strategy
| Category | Tests |
|---|---|
| Construction | Empty dataset, shape mismatch, mixed dense/sparse |
| Sparse invariants | Unsorted indices, duplicates, out-of-bounds |
| Access primitives | Scan semantics, dense stream semantics, sorted gather requirement |
| Block buffering | Dense/sparse buffering and last-partial-block correctness |