RFC-0015: Evaluation Framework#

Status: Implemented
Created: 2025-12-27
Updated: 2026-01-02
Scope: Quality benchmarking against XGBoost/LightGBM

Summary#

The boosters-eval package provides systematic quality comparison between boosters, XGBoost, and LightGBM across datasets and configurations.

Why a Separate Evaluation Package?#

| Approach | Limitation |
| -------- | ---------- |
| Ad-hoc scripts | Not reproducible, incomplete coverage |
| In-crate tests | Slow CI, wrong dependencies |
| Notebooks | Hard to automate, version control issues |

A dedicated package addresses these limitations:

  • Clear dependency separation (XGBoost and LightGBM are dependencies of this package only)

  • CLI for easy invocation

  • Structured reports for tracking quality

Architecture#

boosters-eval/
├── src/boosters_eval/
│   ├── cli.py          # Typer CLI entry point
│   ├── suite.py        # Benchmark suite definitions
│   ├── runners.py      # Library-specific training wrappers
│   ├── datasets.py     # Dataset loading (sklearn, synthetic)
│   ├── metrics.py      # Metric computation
│   ├── results.py      # Result aggregation
│   └── reports/        # Markdown report generation

CLI Usage#

# Full quality benchmark
uv run boosters-eval full -o docs/benchmarks/quality-report.md

# Quick iteration
uv run boosters-eval quick -o docs/benchmarks/quick-report.md

# Specific booster type
uv run boosters-eval full --booster gbdt
uv run boosters-eval full --booster gblinear
uv run boosters-eval full --booster linear_trees

# Minimal for CI
uv run boosters-eval minimal

Suite Configuration#

@dataclass
class SuiteConfig:
    name: str
    description: str
    datasets: list[str]          # Dataset keys
    n_estimators: int = 100
    max_depth: int = 6
    seeds: list[int] = field(default_factory=lambda: [42])
    libraries: list[str] = field(default_factory=lambda: ["boosters", "xgboost", "lightgbm"])
    booster_types: list[BoosterType] = field(default_factory=lambda: [BoosterType.GBDT])

Pre-defined suites:

  • QUICK_SUITE: 3 seeds, 2 datasets, 50 trees

  • FULL_SUITE: 5 seeds, all datasets, 100 trees, all boosters

  • MINIMAL_SUITE: CI smoke test
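
For illustration, the quick suite described above might be declared with SuiteConfig roughly as follows; the dataset choices, seed values, and description are placeholders, not the shipped definition.

# Hypothetical declaration of the quick suite; field values are illustrative.
QUICK_SUITE = SuiteConfig(
    name="quick",
    description="Fast iteration: two small datasets, 50 trees, 3 seeds",
    datasets=["california", "synthetic_bin_small"],
    n_estimators=50,
    seeds=[42, 43, 44],
)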

Datasets#

DATASETS = {
    # Real-world
    "california": DatasetInfo(loader=load_california, task="regression"),
    "adult": DatasetInfo(loader=load_adult, task="binary"),
    "covertype": DatasetInfo(loader=load_covertype, task="multiclass"),
    
    # Synthetic
    "synthetic_reg_small": DatasetInfo(loader=synthetic_regression, n_samples=1000),
    "synthetic_bin_small": DatasetInfo(loader=synthetic_binary, n_samples=1000),
}
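
The registry entries above reference a DatasetInfo record. A minimal sketch of what such a record might hold, inferred from the usages in this document (the exact field set and defaults are assumptions):

from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class DatasetInfo:
    # Loader returns an (X, y) pair of numpy arrays.
    loader: Callable[[], tuple[np.ndarray, np.ndarray]]
    task: str = "regression"      # "regression" | "binary" | "multiclass"
    n_samples: int | None = None  # optional size hint for synthetic loaders
    description: str = ""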

Large datasets use parquet files (generated by boosters-datagen).

Runners#

Each library has a runner wrapper:

class BoostersRunner:
    def train(self, X, y, config: BenchmarkConfig) -> Model:
        train = bst.Dataset(X, y)
        model = bst.GBDTModel(config=self._make_config(config))
        return model.fit(train)

    def predict(self, model, X) -> np.ndarray:
        return model.predict(bst.Dataset(X))

class XGBoostRunner:
    def train(self, X, y, config: BenchmarkConfig) -> xgb.Booster:
        dtrain = xgb.DMatrix(X, label=y)
        params = self._translate_params(config)
        return xgb.train(params, dtrain, num_boost_round=config.n_estimators)

class LightGBMRunner:
    ...  # same pattern, wrapping lgb.Dataset and lgb.train

Config translation ensures an apples-to-apples comparison: each runner maps the same shared hyperparameters onto its library's native parameter names.
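
A standalone sketch of what the XGBoost translation could look like; the config attributes read here (learning_rate, n_threads, seed, task, n_classes) are assumptions about the benchmark config, not its actual schema.

def translate_params_xgboost(config) -> dict:
    # Illustrative mapping from a shared benchmark config to native XGBoost
    # parameter names; the attributes read from `config` are assumed.
    params = {
        "max_depth": config.max_depth,
        "eta": config.learning_rate,
        "nthread": config.n_threads,
        "seed": config.seed,
    }
    if config.task == "regression":
        params["objective"] = "reg:squarederror"
    elif config.task == "binary":
        params["objective"] = "binary:logistic"
    else:
        params["objective"] = "multi:softprob"
        params["num_class"] = config.n_classes
    return params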

Metrics#

import numpy as np
from sklearn.metrics import (
    accuracy_score,
    log_loss,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)

def compute_metrics(y_true, y_pred, task: str) -> dict[str, float]:
    if task == "regression":
        return {
            "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred),
        }
    elif task == "binary":
        # y_pred holds positive-class probabilities.
        return {
            "logloss": log_loss(y_true, y_pred),
            "auc": roc_auc_score(y_true, y_pred),
            "accuracy": accuracy_score(y_true, y_pred > 0.5),
        }
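
The registry also contains a multiclass dataset (covertype), so the real function presumably has a third branch; a sketch of what it might return, with the metric choice assumed rather than taken from the implementation:

from sklearn.metrics import accuracy_score, log_loss

def compute_multiclass_metrics(y_true, y_proba) -> dict[str, float]:
    # y_proba: (n_samples, n_classes) predicted probabilities.
    # Metric names mirror the binary branch above and are an assumption.
    return {
        "mlogloss": log_loss(y_true, y_proba),
        "accuracy": accuracy_score(y_true, y_proba.argmax(axis=1)),
    }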

Report Generation#

# Quality Benchmark Report

**Date**: 2026-01-02
**Commit**: a1b2c3d

## Summary

| Dataset | Task | Boosters | XGBoost | LightGBM |
| ------- | ---- | -------- | ------- | -------- |
| california | regression | 0.452 RMSE | **0.448** | 0.455 |
| adult | binary | **0.372** logloss | 0.375 | 0.373 |

## Detailed Results
...

The best value in each row is bolded. Reports also include uncertainty (standard deviation across seeds).
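
A small sketch of how a summary row could be rendered with the best value bolded, here for lower-is-better metrics such as RMSE or logloss; the function name and signature are illustrative, not the report generator's actual API.

def format_row(dataset: str, task: str, scores: dict[str, float]) -> str:
    # Bold the lowest (best) score; assumes a lower-is-better metric.
    best = min(scores.values())
    cells = [f"**{v:.3f}**" if v == best else f"{v:.3f}" for v in scores.values()]
    return f"| {dataset} | {task} | " + " | ".join(cells) + " |"

# format_row("california", "regression",
#            {"boosters": 0.452, "xgboost": 0.448, "lightgbm": 0.455})
# -> "| california | regression | 0.452 | **0.448** | 0.455 |"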

Ablation Studies#

Ablation suites compare settings within a single library:

from boosters_eval import create_ablation_suite, TrainingConfig

# Growth strategy comparison
ABLATION_GROWTH = create_ablation_suite(
    name="ablation-growth",
    dataset="california",
    library="boosters",
    variants={
        "depthwise": TrainingConfig(growth_strategy="depthwise"),
        "leafwise": TrainingConfig(growth_strategy="leafwise"),
    },
)

# Threading mode comparison
ABLATION_THREADING = create_ablation_suite(
    name="ablation-threading",
    dataset="california",
    library="boosters",
    variants={
        "single": TrainingConfig(n_threads=1),
        "multi-4": TrainingConfig(n_threads=4),
        "multi-auto": TrainingConfig(n_threads=-1),
    },
)

Use cases:

  • Growth strategy comparison (leafwise vs depthwise)

  • Threading modes (single vs multi-threaded)

  • Algorithm variants (GBDT vs GBLinear vs linear_trees)

  • Hyperparameter sensitivity (depth, learning rate)
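
For example, the last use case could be expressed with the same ablation API shown above; the TrainingConfig field name max_depth is assumed here.

# Hypothetical hyperparameter-sensitivity ablation; max_depth is an assumed
# TrainingConfig field.
ABLATION_DEPTH = create_ablation_suite(
    name="ablation-depth",
    dataset="california",
    library="boosters",
    variants={
        "depth-4": TrainingConfig(max_depth=4),
        "depth-6": TrainingConfig(max_depth=6),
        "depth-8": TrainingConfig(max_depth=8),
    },
)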

Regression Testing#

Baseline Recording#

Record baseline results for quality regression detection:

# Record baseline with 5 seeds
uv run boosters-eval baseline record \
  --suite full \
  --seeds 5 \
  --output baselines/v0.2.0.json

Baseline Checking#

Compare current results against a recorded baseline:

# Check against baseline (2% tolerance)
uv run boosters-eval baseline check \
  --baseline baselines/v0.1.0.json \
  --tolerance 0.02 \
  --fail-on-regression

Example failure output:

❌ Regression detected in 2 configs:
  california/gbdt: RMSE 0.4821 > baseline 0.4521 (+6.6%, tolerance 2%)
  breast_cancer/gbdt: LogLoss 0.1234 > baseline 0.1150 (+7.3%, tolerance 2%)
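
Conceptually, the check flags a configuration when a lower-is-better metric exceeds its baseline by more than the relative tolerance; a minimal sketch of that comparison (the helper name is not part of the CLI):

def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    # Lower-is-better metrics (RMSE, logloss): regression when the current value
    # exceeds the baseline by more than the relative tolerance.
    return current > baseline * (1.0 + tolerance)

# is_regression(0.4821, 0.4521, 0.02) -> True (+6.6% exceeds the 2% tolerance)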

Exit Codes#

| Code | Meaning |
| ---- | ------- |
| 0 | All checks passed |
| 1 | Regression detected (quality degraded beyond tolerance) |
| 2 | Execution error (library crash, missing dependency) |
| 3 | Configuration error (invalid baseline file, unknown dataset) |

CI Regression Testing#

# GitHub Actions workflow
- name: Record baseline (on main merge)
  if: github.ref == 'refs/heads/main'
  run: |
    uv run boosters-eval baseline record \
      --suite full \
      --output baselines/main.json

- name: Check for regressions (on PR)
  run: |
    uv run boosters-eval baseline check \
      --baseline baselines/main.json \
      --tolerance 0.02 \
      --fail-on-regression

Files#

| Path | Contents |
| ---- | -------- |
| cli.py | CLI commands (full, quick, minimal) |
| suite.py | SuiteConfig, pre-defined suites |
| runners.py | Library-specific wrappers |
| datasets.py | Dataset registry and loaders |
| metrics.py | Metric computation |
| results.py | ResultCollection, aggregation |
| reports/ | Markdown report generators |

Design Decisions#

DD-1: Separate package. Avoids heavy dependencies (XGBoost, LightGBM) in main library. Clean separation of concerns.

DD-2: Multiple seeds. Each config runs with multiple random seeds. Reports show mean ± std for statistical validity.

DD-3: Config translation. Runners translate to library-specific params. Ensures apples-to-apples comparison.

DD-4: Markdown output. Human-readable, version-controlled, easy to include in documentation.

DD-5: Ablation suites. Dedicated configurations for hyperparameter studies. Helps identify regime differences between libraries.

Statistical Significance#

Multiple seeds (default: 5) enable statistical comparison:

# Compute p-value for boosters vs xgboost
from scipy.stats import ttest_rel
p_value = ttest_rel(boosters_scores, xgboost_scores).pvalue
significant = p_value < 0.05

Reports include mean ± std and note when differences are significant.
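
A sketch of the per-cell aggregation behind those tables; the score values are placeholders.

import numpy as np

# Placeholder per-seed scores for one (dataset, library, metric) cell.
scores = np.array([0.452, 0.449, 0.455, 0.451, 0.453])

# Render the "mean ± std" cell shown in the report tables (sample std).
cell = f"{scores.mean():.3f} ± {scores.std(ddof=1):.3f}"
print(cell)  # "0.452 ± 0.002"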

Custom Datasets#

Add datasets to the registry:

# In datasets.py
DATASETS["my_dataset"] = DatasetInfo(
    loader=lambda: load_my_data(),  # Returns (X, y) tuple
    task="regression",  # or "binary", "multiclass"
    description="My custom dataset",
)

For large datasets, use parquet files in packages/boosters-datagen/data/.
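
A parquet-backed entry might pair a pandas loader with the registry; the file name and target column below are assumptions.

import pandas as pd

def load_my_parquet():
    # Hypothetical loader for a parquet file produced by boosters-datagen;
    # the path and the "target" column name are assumptions.
    df = pd.read_parquet("packages/boosters-datagen/data/my_dataset.parquet")
    y = df["target"].to_numpy()
    X = df.drop(columns=["target"]).to_numpy()
    return X, y

DATASETS["my_parquet_dataset"] = DatasetInfo(
    loader=load_my_parquet,
    task="regression",
    description="Large synthetic dataset stored as parquet",
)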

CI Integration#

# In GitHub Actions
- name: Quality benchmark
  run: uv run boosters-eval minimal

The minimal suite is fast (~2 min), making it suitable for per-PR CI. The full suite runs nightly or on-demand.

Debugging Failures#

# Verbose output
uv run boosters-eval quick --verbose

# Single dataset/library for isolation
uv run boosters-eval quick --dataset california --library boosters

# Drop to Python debugger
python -m pdb packages/boosters-eval/src/boosters_eval/cli.py quick