Note
This tutorial is available as a Jupyter notebook.
Tutorial 02: sklearn Integration#
🟢 Beginner — Familiarity with scikit-learn helpful
boosters provides scikit-learn-compatible estimators that work with pipelines, cross-validation, and other sklearn utilities.
What you’ll learn#
Use GBDTRegressor and GBDTClassifier estimators
Perform cross-validation
Build pipelines with preprocessing
Use GridSearchCV for hyperparameter tuning
[1]:
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from boosters.sklearn import GBDTRegressor, GBDTClassifier
Basic Usage#
The sklearn estimators follow the standard fit/predict API:
[2]:
# Generate data
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = GBDTRegressor(n_estimators=100, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
# Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score:.4f}")
R² Score: 0.8912
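Point predictions come from the standard predict method; a minimal sketch using the model fitted above:
# predict returns a NumPy array of shape (n_samples,)
preds = model.predict(X_test)
print(preds[:5])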
Cross-Validation#
Use sklearn’s cross-validation utilities:
[3]:
# Cross-validation
model = GBDTRegressor(n_estimators=100, max_depth=6)
# n_jobs=1 avoids pickling issues with the Rust-backed models (see Hyperparameter Search)
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
print(f"CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}")
CV RMSE: 38.0863 ± 4.1803
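For explicit control over how the folds are drawn, pass a splitter object instead of an integer; a sketch with KFold (shuffling and the seed are illustrative choices):
from sklearn.model_selection import KFold

# Shuffle rows before splitting so folds are not order-dependent
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=1)
print(f"CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}")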
Pipelines#
Combine preprocessing and models:
[4]:
# Create pipeline with preprocessing
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', GBDTRegressor(n_estimators=100))
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline R² Score: {score:.4f}")
Pipeline R² Score: 0.8912
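Because a Pipeline is itself an estimator, it drops into the same utilities as a bare model; a sketch cross-validating the whole pipeline (the R² scoring choice is illustrative):
# The scaler is refit inside each fold, so no information leaks across folds
scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2', n_jobs=1)
print(f"Pipeline CV R²: {scores.mean():.4f} ± {scores.std():.4f}")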
Classification#
Use GBDTClassifier for classification tasks:
[5]:
# Generate classification data
X_clf, y_clf = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
X_clf, y_clf, test_size=0.2, random_state=42
)
# Train classifier
clf = GBDTClassifier(n_estimators=100, max_depth=6)
clf.fit(X_train_clf, y_train_clf)
# Evaluate
accuracy = clf.score(X_test_clf, y_test_clf)
print(f"Accuracy: {accuracy:.4f}")
# Probability predictions
probas = clf.predict_proba(X_test_clf)
print(f"Probability shape: {probas.shape}")
Accuracy: 0.9000
Probability shape: (200, 2)
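The positive-class column of predict_proba plugs directly into sklearn's ranking metrics; a sketch computing ROC AUC (a standard sklearn metric, not specific to boosters):
from sklearn.metrics import roc_auc_score

# Column 1 holds the predicted probability of the positive class
auc = roc_auc_score(y_test_clf, probas[:, 1])
print(f"ROC AUC: {auc:.4f}")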
Hyperparameter Search#
Use GridSearchCV to search over hyperparameter combinations:
[6]:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100],
'max_depth': [4, 6],
'learning_rate': [0.05, 0.1],
}
# Grid search (n_jobs=1 to avoid pickling issues with Rust models)
search = GridSearchCV(
GBDTRegressor(),
param_grid,
cv=3,
scoring='neg_root_mean_squared_error',
n_jobs=1,
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {-search.best_score_:.4f}")
Best params: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100}
Best CV score: 37.6068
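With the default refit=True, GridSearchCV refits the best configuration on all of X_train and exposes it as best_estimator_; a sketch evaluating it on the held-out test set:
# best_estimator_ was refit on the full training split
best_model = search.best_estimator_
print(f"Test R²: {best_model.score(X_test, y_test):.4f}")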
Summary#
In this tutorial, you learned how to:
✅ Use sklearn-compatible estimators
✅ Perform cross-validation
✅ Build preprocessing pipelines
✅ Search hyperparameters with GridSearchCV
Next Steps#
Tutorial 03: Classification — Binary classification with ROC/AUC
Tutorial 05: Early Stopping — Prevent overfitting