# RFC-0015: Evaluation Framework

- **Status**: Implemented
- **Created**: 2025-12-27
- **Updated**: 2026-01-02
- **Scope**: Quality benchmarking against XGBoost/LightGBM
## Summary

The boosters-eval package provides systematic quality comparison between boosters, XGBoost, and LightGBM across datasets and configurations.
## Why a Separate Evaluation Package?

| Approach | Limitation |
|---|---|
| Ad-hoc scripts | Not reproducible, incomplete coverage |
| In-crate tests | Slow CI, wrong dependencies |
| Notebooks | Hard to automate, version control issues |
A dedicated package instead provides:

- Clear dependency separation (XGBoost and LightGBM as dependencies of the eval package only)
- A CLI for easy invocation
- Structured reports for tracking quality over time
## Architecture
boosters-eval/
├── src/boosters_eval/
│ ├── cli.py # Typer CLI entry point
│ ├── suite.py # Benchmark suite definitions
│ ├── runners.py # Library-specific training wrappers
│ ├── datasets.py # Dataset loading (sklearn, synthetic)
│ ├── metrics.py # Metric computation
│ ├── results.py # Result aggregation
│ └── reports/ # Markdown report generation
## CLI Usage
# Full quality benchmark
uv run boosters-eval full -o docs/benchmarks/quality-report.md
# Quick iteration
uv run boosters-eval quick -o docs/benchmarks/quick-report.md
# Specific booster type
uv run boosters-eval full --booster gbdt
uv run boosters-eval full --booster gblinear
uv run boosters-eval full --booster linear_trees
# Minimal for CI
uv run boosters-eval minimal
## Suite Configuration
from dataclasses import dataclass, field

@dataclass
class SuiteConfig:
    name: str
    description: str
    datasets: list[str]  # Dataset keys
    n_estimators: int = 100
    max_depth: int = 6
    seeds: list[int] = field(default_factory=lambda: [42])
    libraries: list[str] = field(default_factory=lambda: ["boosters", "xgboost", "lightgbm"])
    booster_types: list[BoosterType] = field(default_factory=lambda: [BoosterType.GBDT])
Pre-defined suites:

- QUICK_SUITE: 3 seeds, 2 datasets, 50 trees
- FULL_SUITE: 5 seeds, all datasets, 100 trees, all boosters
- MINIMAL_SUITE: CI smoke test
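As an illustration, a custom suite can be assembled from the same dataclass. The dataset keys below come from the registry in the next section; the remaining values are arbitrary:

```python
# Hypothetical custom suite built from SuiteConfig (values are illustrative).
REGRESSION_SUITE = SuiteConfig(
    name="regression-only",
    description="Regression datasets with deeper trees",
    datasets=["california", "synthetic_reg_small"],
    n_estimators=200,
    max_depth=8,
    seeds=[42, 43, 44],
)
```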
## Datasets
DATASETS = {
    # Real-world
    "california": DatasetInfo(loader=load_california, task="regression"),
    "adult": DatasetInfo(loader=load_adult, task="binary"),
    "covertype": DatasetInfo(loader=load_covertype, task="multiclass"),
    # Synthetic
    "synthetic_reg_small": DatasetInfo(loader=synthetic_regression, n_samples=1000),
    "synthetic_bin_small": DatasetInfo(loader=synthetic_binary, n_samples=1000),
}
Large datasets use parquet files (generated by boosters-datagen).
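As a sketch of what a synthetic loader might look like (the real synthetic_regression and the DatasetInfo signature live in datasets.py and may differ), one option is to wrap scikit-learn's generators:

```python
import numpy as np
from sklearn.datasets import make_regression

# Sketch only; the package's actual synthetic_regression loader may differ.
def synthetic_regression(n_samples: int = 1000, seed: int = 42) -> tuple[np.ndarray, np.ndarray]:
    X, y = make_regression(n_samples=n_samples, n_features=20, noise=0.1, random_state=seed)
    return X, y
```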
## Runners
Each library has a runner wrapper:
class BoostersRunner:
    def train(self, X, y, config: BenchmarkConfig) -> Model:
        train = bst.Dataset(X, y)
        model = bst.GBDTModel(config=self._make_config(config))
        return model.fit(train)

    def predict(self, model, X) -> np.ndarray:
        return model.predict(bst.Dataset(X))

class XGBoostRunner:
    def train(self, X, y, config: BenchmarkConfig) -> xgb.Booster:
        dtrain = xgb.DMatrix(X, label=y)
        params = self._translate_params(config)
        return xgb.train(params, dtrain, num_boost_round=config.n_estimators)

class LightGBMRunner:
    ...  # Similar pattern
Config translation ensures fair comparison (same hyperparameters).
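A minimal sketch of such a translation for XGBoost (the learning_rate, n_threads, seed, and task fields on BenchmarkConfig are assumed here; the real mapping lives in runners.py and may differ):

```python
def translate_params_xgboost(config) -> dict:
    # Map shared hyperparameters onto XGBoost's native parameter names.
    # learning_rate / n_threads / seed / task are assumed BenchmarkConfig fields.
    return {
        "max_depth": config.max_depth,
        "eta": config.learning_rate,
        "nthread": config.n_threads,
        "seed": config.seed,
        "objective": "reg:squarederror" if config.task == "regression" else "binary:logistic",
    }
```

Note that n_estimators is not translated into the parameter dict; it is passed to xgb.train as num_boost_round, as shown above.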
## Metrics
import numpy as np
from sklearn.metrics import (
    accuracy_score, log_loss, mean_absolute_error,
    mean_squared_error, r2_score, roc_auc_score,
)

def compute_metrics(y_true, y_pred, task: str) -> dict[str, float]:
    if task == "regression":
        return {
            "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred),
        }
    elif task == "binary":
        return {
            "logloss": log_loss(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_pred),
            "accuracy": accuracy_score(y_true, y_pred > 0.5),
        }
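For example, with illustrative binary predictions:

```python
import numpy as np

metrics = compute_metrics(
    y_true=np.array([0, 1, 1, 0]),
    y_pred=np.array([0.2, 0.8, 0.6, 0.4]),  # predicted probabilities
    task="binary",
)
print(metrics)  # {'logloss': ..., 'auc': ..., 'accuracy': ...}
```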
## Report Generation

Generated reports are Markdown documents, for example:
# Quality Benchmark Report
**Date**: 2026-01-02
**Commit**: a1b2c3d
## Summary
| Dataset | Task | Boosters | XGBoost | LightGBM |
| ------- | ---- | -------- | ------- | -------- |
| california | regression | 0.452 RMSE | **0.448** | 0.455 |
| adult | binary | **0.372** logloss | 0.375 | 0.373 |
## Detailed Results
...
Best values are shown in bold. Detailed results include uncertainty (standard deviation across seeds).
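A sketch of how a summary cell could be rendered from per-seed scores (the helper below is illustrative, not the actual code in reports/):

```python
import numpy as np

def format_cell(scores: list[float], is_best: bool) -> str:
    # Aggregate per-seed scores into "mean ± std" and bold the best library.
    cell = f"{np.mean(scores):.3f} ± {np.std(scores):.3f}"
    return f"**{cell}**" if is_best else cell

print(format_cell([0.451, 0.452, 0.453], is_best=True))  # **0.452 ± 0.001**
```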
## Ablation Studies
Ablation suites compare settings within a single library:
from boosters_eval import create_ablation_suite, TrainingConfig

# Growth strategy comparison
ABLATION_GROWTH = create_ablation_suite(
    name="ablation-growth",
    dataset="california",
    library="boosters",
    variants={
        "depthwise": TrainingConfig(growth_strategy="depthwise"),
        "leafwise": TrainingConfig(growth_strategy="leafwise"),
    },
)

# Threading mode comparison
ABLATION_THREADING = create_ablation_suite(
    name="ablation-threading",
    dataset="california",
    library="boosters",
    variants={
        "single": TrainingConfig(n_threads=1),
        "multi-4": TrainingConfig(n_threads=4),
        "multi-auto": TrainingConfig(n_threads=-1),
    },
)
Use cases:

- Growth strategy comparison (leafwise vs depthwise)
- Threading modes (single vs multi-threaded)
- Algorithm variants (GBDT vs GBLinear vs linear_trees)
- Hyperparameter sensitivity (depth, learning rate); see the sketch below
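For instance, a learning-rate sensitivity study could reuse the same pattern. The learning_rate field on TrainingConfig is assumed here and may be named differently:

```python
from boosters_eval import create_ablation_suite, TrainingConfig

# Hypothetical hyperparameter-sensitivity ablation (learning_rate field assumed).
ABLATION_LR = create_ablation_suite(
    name="ablation-learning-rate",
    dataset="california",
    library="boosters",
    variants={
        "lr-0.05": TrainingConfig(learning_rate=0.05),
        "lr-0.1": TrainingConfig(learning_rate=0.1),
        "lr-0.3": TrainingConfig(learning_rate=0.3),
    },
)
```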
## Regression Testing

### Baseline Recording
Record baseline results for quality regression detection:
# Record baseline with 5 seeds
uv run boosters-eval baseline record \
--suite full \
--seeds 5 \
--output baselines/v0.2.0.json
### Baseline Checking
Compare current results against a recorded baseline:
# Check against baseline (2% tolerance)
uv run boosters-eval baseline check \
--baseline baselines/v0.1.0.json \
--tolerance 0.02 \
--fail-on-regression
Example failure output:
❌ Regression detected in 2 configs:
california/gbdt: RMSE 0.4821 > baseline 0.4521 (+6.6%, tolerance 2%)
breast_cancer/gbdt: LogLoss 0.1234 > baseline 0.1150 (+7.3%, tolerance 2%)
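Conceptually, the check is a relative comparison per metric. A sketch of the rule for lower-is-better metrics such as RMSE and log loss (the actual implementation may also handle higher-is-better metrics like AUC):

```python
def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    # Lower is better: flag a regression when the relative increase exceeds the tolerance.
    return (current - baseline) / baseline > tolerance

# Matches the first failure above: 0.4821 vs 0.4521 is +6.6%, beyond the 2% tolerance.
print(is_regression(0.4821, 0.4521))  # True
```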
### Exit Codes

| Code | Meaning |
|---|---|
| 0 | All checks passed |
| 1 | Regression detected (quality degraded beyond tolerance) |
| 2 | Execution error (library crash, missing dependency) |
| 3 | Configuration error (invalid baseline file, unknown dataset) |
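The exit codes make the check easy to consume from scripts. A minimal example that shells out to the CLI command shown above:

```python
import subprocess

# Run the baseline check and branch on its exit code.
result = subprocess.run([
    "uv", "run", "boosters-eval", "baseline", "check",
    "--baseline", "baselines/main.json",
    "--tolerance", "0.02",
    "--fail-on-regression",
])
if result.returncode == 1:
    print("Quality regression detected")
elif result.returncode in (2, 3):
    print("Benchmark could not run; see logs")
```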
### CI Regression Testing
# GitHub Actions workflow
- name: Record baseline (on main merge)
  if: github.ref == 'refs/heads/main'
  run: |
    uv run boosters-eval baseline record \
      --suite full \
      --output baselines/main.json

- name: Check for regressions (on PR)
  run: |
    uv run boosters-eval baseline check \
      --baseline baselines/main.json \
      --tolerance 0.02 \
      --fail-on-regression
## Files

| Path | Contents |
|---|---|
| cli.py | CLI commands (full, quick, minimal) |
| suite.py | SuiteConfig, pre-defined suites |
| runners.py | Library-specific wrappers |
| datasets.py | Dataset registry and loaders |
| metrics.py | Metric computation |
| results.py | ResultCollection, aggregation |
| reports/ | Markdown report generators |
## Design Decisions
DD-1: Separate package. Avoids heavy dependencies (XGBoost, LightGBM) in main library. Clean separation of concerns.
DD-2: Multiple seeds. Each config runs with multiple random seeds. Reports show mean ± std for statistical validity.
DD-3: Config translation. Runners translate to library-specific params. Ensures apples-to-apples comparison.
DD-4: Markdown output. Human-readable, version-controlled, easy to include in documentation.
DD-5: Ablation suites. Dedicated configurations for hyperparameter studies. Helps identify regime differences between libraries.
## Statistical Significance
Multiple seeds (default: 5) enable statistical comparison:
# Compute p-value for boosters vs xgboost
from scipy.stats import ttest_rel
p_value = ttest_rel(boosters_scores, xgboost_scores).pvalue
significant = p_value < 0.05
Reports include mean ± std and note when differences are significant.
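For example, with five seeds per library, the paired test operates on per-seed scores for the same dataset (numbers are illustrative):

```python
from scipy.stats import ttest_rel

# Per-seed RMSE for the same dataset and the same seeds (illustrative values).
boosters_scores = [0.452, 0.455, 0.450, 0.453, 0.451]
xgboost_scores = [0.448, 0.450, 0.447, 0.449, 0.448]

result = ttest_rel(boosters_scores, xgboost_scores)
print(f"p-value: {result.pvalue:.4f}, significant: {result.pvalue < 0.05}")
```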
## Custom Datasets
Add datasets to the registry:
# In datasets.py
DATASETS["my_dataset"] = DatasetInfo(
    loader=lambda: load_my_data(),  # Returns (X, y) tuple
    task="regression",  # or "binary", "multiclass"
    description="My custom dataset",
)
For large datasets, use parquet files in packages/boosters-datagen/data/.
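A loader for such a file might read the Parquet with pandas; the file name and target column below are placeholders:

```python
import numpy as np
import pandas as pd

def load_my_parquet_data() -> tuple[np.ndarray, np.ndarray]:
    # Placeholder path and column name; adjust to the actual generated file.
    df = pd.read_parquet("packages/boosters-datagen/data/my_dataset.parquet")
    y = df["target"].to_numpy()
    X = df.drop(columns=["target"]).to_numpy()
    return X, y
```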
## CI Integration
# In GitHub Actions
- name: Quality benchmark
  run: uv run boosters-eval minimal
The minimal suite is fast (~2 min) and suited to CI; the full suite runs nightly or on demand.
## Debugging Failures
# Verbose output
uv run boosters-eval quick --verbose
# Single dataset/library for isolation
uv run boosters-eval quick --dataset california --library boosters
# Drop to Python debugger
python -m pdb packages/boosters-eval/src/boosters_eval/cli.py quick