# RFC-0014: Python Bindings
- **Status:** Implemented
- **Created:** 2025-12-01
- **Updated:** 2026-01-02
- **Scope:** PyO3 bindings for boosters
## Summary
Python bindings via PyO3 expose `GBDTModel`, `GBLinearModel`, `Dataset`, and
configuration types. The API is Pythonic, with scikit-learn compatibility.
## Why Python Bindings?
| Aspect | Benefit |
|---|---|
| Adoption | ML ecosystem is Python-dominated |
| Integration | Works with pandas, numpy, scikit-learn |
| Iteration | Fast prototyping, then production |
| Comparison | Benchmark against XGBoost/LightGBM |
## Package Structure
```text
packages/boosters-python/
├── src/                    # Rust PyO3 bindings
│   ├── lib.rs              # Module definition
│   ├── config.rs           # PyGBDTConfig, PyGBLinearConfig
│   ├── data/               # Dataset bindings
│   ├── model/              # Model bindings
│   └── ...
└── python/boosters/        # Python wrapper layer
    ├── __init__.py         # Public exports
    ├── dataset/            # Dataset + DatasetBuilder
    ├── model.py            # Model re-exports
    ├── sklearn/            # scikit-learn estimators
    └── ...
```
Two-layer design:

- **Rust layer** (`_boosters_rs`): raw bindings, efficient memory handling
- **Python layer** (`boosters`): Pythonic API, input validation, convenience
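To make the split concrete, here is a minimal sketch of the wrapper pattern, assuming the compiled module is importable as `_boosters_rs`; the validation shown is illustrative and not the actual source.

```python
# Illustrative sketch of the two-layer pattern; names other than
# `_boosters_rs` and the public types above are assumptions.
import _boosters_rs  # compiled PyO3 extension (Rust layer)


class GBDTModel:
    """Thin Python wrapper adding validation and ergonomics."""

    def __init__(self, config):
        self._inner = _boosters_rs.GBDTModel(config)

    def fit(self, train, val_set=None, n_threads=0):
        if n_threads < 0:
            raise ValueError("n_threads must be >= 0")
        self._inner.fit(train, val_set=val_set, n_threads=n_threads)
        return self
```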
## Core Types

### Dataset
The `Dataset` constructor accepts numpy arrays, pandas DataFrames, or scipy
sparse matrices. Internally, it delegates to `DatasetBuilder`:
```python
import boosters as bst
import numpy as np

# From numpy array (most common)
dataset = bst.Dataset(X, y, weights=None)

# From pandas DataFrame
dataset = bst.Dataset(df.drop("target", axis=1), df["target"])

# From scipy sparse matrix
dataset = bst.Dataset(sparse_X, y)

# With optional metadata
dataset = bst.Dataset(
    X, y,
    feature_names=["age", "income", "category"],
    categorical_features=[2],  # index of the categorical column
)
```
### DatasetBuilder
For advanced use cases with mixed feature types:
```python
from boosters.dataset import DatasetBuilder, Feature

# Builder pattern for complex datasets
dataset = (
    DatasetBuilder()
    .add_feature(Feature.from_dense(age_arr, name="age"))
    .add_feature(
        Feature.from_sparse(indices, values, n_samples, name="category", categorical=True)
    )
    .labels(y)
    .weights(sample_weights)
    .build()
)

# Or add features in bulk
builder = DatasetBuilder()
builder.add_features_from_array(X_dense, feature_names=["f0", "f1", "f2"])
builder.add_features_from_dataframe(df)
builder.add_features_from_sparse(csc_matrix)
builder.labels(y)
dataset = builder.build()
```
### Feature
Individual feature columns with type metadata:
```python
from boosters.dataset import Feature

# Dense feature from array-like
f1 = Feature.from_dense(np.array([1.0, 2.0, 3.0]), name="age")

# Sparse feature from indices + values
f2 = Feature.from_sparse(
    indices=np.array([0, 2]),     # Non-zero sample indices
    values=np.array([1.0, 3.0]),  # Non-zero values
    n_samples=3,
    name="sparse_col",
    categorical=True,
)
```
### Configuration
```python
config = bst.GBDTConfig(
    n_trees=100,
    learning_rate=0.1,
    max_depth=6,
    reg_lambda=1.0,
    objective=bst.Objective.squared_error(),
    metric=bst.Metric.rmse(),
)

linear_config = bst.GBLinearConfig(
    n_rounds=100,
    learning_rate=0.5,
    alpha=0.0,
    lambda_=1.0,
)
```
### Models
```python
# Create model with config
model = bst.GBDTModel(config)

# Train
model.fit(train_data, val_set=valid_data, n_threads=4)

# Predict (returns transformed predictions, e.g., probabilities)
predictions = model.predict(test_data, n_threads=4)

# Raw predictions (untransformed margin scores)
raw_scores = model.predict_raw(test_data)

# Feature importance
importances = model.feature_importance(bst.ImportanceType.Gain)

# SHAP values (shape: [n_samples, n_features+1, n_outputs])
shap_values = model.shap_values(test_data)
```
**Note:** There is no `model.predict_proba()` on the core model. The `predict()`
method already applies the appropriate transformation (sigmoid for binary,
softmax for multiclass). Use the sklearn wrappers for `predict_proba()`.
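To see the relationship between the two prediction methods, the transformed output can be recomputed from the raw margins; this sketch assumes a model trained with a binary objective.

```python
import numpy as np

raw = model.predict_raw(test_data)   # untransformed margin scores
proba = model.predict(test_data)     # transformed output (sigmoid for binary)

# For a binary objective, predict() should match sigmoid(predict_raw()).
np.testing.assert_allclose(proba, 1.0 / (1.0 + np.exp(-raw)), rtol=1e-5)
```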
## scikit-learn API
```python
from boosters.sklearn import GBDTRegressor, GBDTClassifier

# Standard sklearn interface
model = GBDTRegressor(n_trees=100, max_depth=6)
model.fit(X, y)
preds = model.predict(X_test)

# Classifiers have predict_proba
clf = GBDTClassifier(n_trees=100)
clf.fit(X, y)
proba = clf.predict_proba(X_test)  # [n_samples, n_classes]

# Compatible with pipelines, cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
```
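Because the estimators follow the standard fit/predict interface, they also compose with sklearn pipelines; the preprocessing step below is an arbitrary placeholder for illustration, not a recommendation for tree models.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from boosters.sklearn import GBDTRegressor

pipe = Pipeline([
    ("scale", StandardScaler()),                        # placeholder preprocessing step
    ("gbdt", GBDTRegressor(n_trees=200, max_depth=4)),  # boosters estimator
])
pipe.fit(X, y)
preds = pipe.predict(X_test)
```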
## Memory Model

### Zero-Copy Where Possible
```python
# NumPy array: borrowed if C-contiguous float32, else copied
dataset = bst.Dataset(X, y)  # Check dataset.was_converted

# Predictions: returns a new numpy array (owned by Python)
preds = model.predict(dataset)  # np.ndarray
```
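To stay on the zero-copy path, the input can be prepared up front; this sketch relies only on the `was_converted` flag mentioned above.

```python
import numpy as np
import boosters as bst

# Meet the zero-copy precondition: C-contiguous float32.
X32 = np.ascontiguousarray(X, dtype=np.float32)

dataset = bst.Dataset(X32, y)
assert not dataset.was_converted  # borrowed, not copied
```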
### Explicit Copies
```python
# Sparse: data is copied into internal CSC format
dataset = bst.Dataset(sparse_matrix, y)

# DataFrame: column data extracted and stored
dataset = bst.Dataset(df, y)
```
## Threading
```python
# Training parallelism
model.fit(train, n_threads=8)

# Prediction parallelism
preds = model.predict(test, n_threads=8)

# Default: use all available cores
preds = model.predict(test)  # n_threads=0 = auto
```
The Rayon thread pool is managed internally; `n_threads=1` forces sequential execution.

**GIL handling:** The GIL is released during training and prediction, so Python threads can run concurrently with the Rust computation.
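Because the GIL is released inside `predict()`, several prediction calls can overlap from Python threads; a sketch assuming `batches` is a list of `Dataset` objects.

```python
from concurrent.futures import ThreadPoolExecutor

# predict() releases the GIL, so these calls overlap instead of
# serializing on the interpreter.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(model.predict, batch, n_threads=4) for batch in batches]
    results = [f.result() for f in futures]
```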
## Error Handling
```python
try:
    model.fit(invalid_data)
except bst.BoostersError as e:
    print(f"Training failed: {e}")
```
Rust errors are converted to Python exceptions with context.
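As a concrete illustration, a length mismatch between features and labels surfaces as an exception at Dataset construction; whether it arrives as `BoostersError` or a standard `ValueError` depends on which layer performs the validation, so this sketch catches both.

```python
import numpy as np
import boosters as bst

X = np.random.rand(100, 5).astype(np.float32)
y = np.zeros(99)  # deliberately one label short

try:
    bst.Dataset(X, y)
except (bst.BoostersError, ValueError) as e:
    print(f"Dataset construction failed: {e}")
```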
## Stub Generation
Type stubs (`.pyi`) are auto-generated via `pyo3-stub-gen`:
```python
# IDE gets full type info
def fit(
    self,
    train: Dataset,
    val_set: Dataset | None = None,
    n_threads: int = 0,
) -> GBDTModel: ...
```
## Files
| Path | Contents |
|---|---|
| `src/lib.rs` | PyO3 module registration |
| `src/config.rs` | Config bindings |
| `src/data/` | Dataset, DatasetBuilder, Feature bindings |
| `src/model/` | GBDTModel, GBLinearModel bindings |
| `python/boosters/__init__.py` | Public API |
| `python/boosters/dataset/` | Dataset, DatasetBuilder, Feature |
| `python/boosters/sklearn/` | scikit-learn estimators |
## Design Decisions
- **DD-1: Two-layer design.** The Rust layer is minimal; the Python layer adds ergonomics. This makes it easier to iterate on the Python API without recompiling.
- **DD-2: Dataset constructor delegates to builder.** The `Dataset(X, y)` constructor internally uses `DatasetBuilder` for flexibility, but provides a simple interface for common cases (see the sketch after this list).
- **DD-3: Config objects, not kwargs.** Explicit config types provide IDE support and validation. Matches XGBoost's approach.
- **DD-4: Thread pool managed internally.** Users specify `n_threads`; Rayon handles the pool. No global state leaks between calls.
- **DD-5: sklearn wrappers for predict_proba.** The core `GBDTModel.predict()` already applies transformations; the sklearn `GBDTClassifier.predict_proba()` is provided separately for sklearn compatibility.
- **DD-6: Feature-major layout.** Datasets store features in column-major order for cache-friendly binning and histogram construction.
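To illustrate DD-2, the one-line constructor and the builder can be written side by side; this sketch uses only the APIs shown earlier, and the two datasets are intended to be equivalent.

```python
import boosters as bst
from boosters.dataset import DatasetBuilder

# Simple constructor: the common case.
ds_simple = bst.Dataset(X, y, feature_names=["f0", "f1", "f2"])

# Builder: the same dataset, assembled explicitly.
builder = DatasetBuilder()
builder.add_features_from_array(X, feature_names=["f0", "f1", "f2"])
builder.labels(y)
ds_built = builder.build()
```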
## Installation
```bash
# From source
cd packages/boosters-python
maturin develop --release
```
## Testing Strategy
| Category | Tests |
|---|---|
| Type conversion | numpy ↔ Rust array roundtrip |
| Memory safety | No leaks under repeated calls |
| sklearn compliance | sklearn estimator checks |
| Error handling | Rust errors → Python exceptions |
| Threading | No deadlocks, correct results |

Tests live in `packages/boosters-python/tests/`.