Note

This tutorial is available as a Jupyter notebook.

Tutorial 02: sklearn Integration

🟢 Beginner — Familiarity with scikit-learn helpful

The boosters library provides scikit-learn-compatible estimators, GBDTRegressor and GBDTClassifier, that work with pipelines, cross-validation, and other sklearn utilities.

What you’ll learn

  1. Use GBDTRegressor and GBDTClassifier estimators

  2. Perform cross-validation

  3. Build pipelines with preprocessing

  4. Use GridSearchCV for hyperparameter tuning

[1]:
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from boosters.sklearn import GBDTRegressor, GBDTClassifier

Basic Usage

The estimators follow the standard sklearn fit/predict/score API. For the regressor, score returns R² on the data you pass in:

[2]:
# Generate data
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = GBDTRegressor(n_estimators=100, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)

# Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score:.4f}")
R² Score: 0.8912
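
To make the relationship between predict and score concrete, the sketch below calls predict directly and compares r2_score on those predictions with model.score. It uses a smaller dataset and fewer trees for speed. The try/except import is an assumption added for illustration: it falls back to sklearn's GradientBoostingRegressor so the snippet runs without boosters installed, and the alias Regressor is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

try:
    from boosters.sklearn import GBDTRegressor as Regressor
except ImportError:
    # Assumption for illustration only: fall back to sklearn's own
    # gradient-boosting regressor when boosters is not installed.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

X, y = make_regression(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Regressor(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

# predict returns one value per test row
y_pred = np.asarray(model.predict(X_test))

# score should agree with R² computed by hand from the predictions
manual_r2 = r2_score(y_test, y_pred)
builtin_r2 = model.score(X_test, y_test)
print(f"r2_score: {manual_r2:.4f}  model.score: {builtin_r2:.4f}")
```

If the two numbers ever differ, the estimator's score is not plain R², which is worth knowing before comparing models.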

Cross-Validation

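Because the estimators are sklearn-compatible, they can be passed directly to cross_val_score. sklearn scorers follow a higher-is-better convention, so RMSE comes back negated and the cell flips the sign before printing: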

[3]:
# Cross-validation
model = GBDTRegressor(n_estimators=100, max_depth=6)
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)

print(f"CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}")
CV RMSE: 38.0863 ± 4.1803
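
With an integer cv and a regressor, sklearn uses KFold without shuffling. Passing an explicit KFold object gives control over shuffling and seeding, which can matter when the rows are ordered. The following is a minimal sketch under the same assumption as elsewhere in this tutorial's spirit: the import falls back to sklearn's GradientBoostingRegressor if boosters is not installed, and Regressor is a hypothetical alias.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score

try:
    from boosters.sklearn import GBDTRegressor as Regressor
except ImportError:
    # Assumption for illustration only: stand-in estimator so the
    # sketch runs without boosters installed.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

X, y = make_regression(n_samples=300, n_features=10, random_state=42)

# Explicit splitter: shuffled folds with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)

model = Regressor(n_estimators=50, max_depth=3)
neg_rmse = cross_val_score(
    model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
)

# Scores are negated RMSE values, so flip the sign to report RMSE per fold
rmse = -neg_rmse
for fold, value in enumerate(rmse, start=1):
    print(f"Fold {fold}: RMSE = {value:.4f}")
print(f"Mean RMSE: {rmse.mean():.4f} ± {rmse.std():.4f}")
```

Printing the per-fold values alongside the mean makes it easier to spot a single unstable fold.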

Pipelines

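Combine preprocessing and a model into a single estimator. Tree-based models are largely insensitive to feature scaling, so StandardScaler here mainly illustrates the mechanics: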

[4]:
# Create pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', GBDTRegressor(n_estimators=100))
])

# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline R² Score: {score:.4f}")
Pipeline R² Score: 0.8912
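
A pipeline can also be tuned with GridSearchCV by addressing each step's parameters as step__parameter. The sketch below searches over max_depth and learning_rate, the constructor arguments used earlier in this tutorial; the grid values themselves are arbitrary. It assumes the estimator supports sklearn's get_params/set_params protocol, and the import falls back to sklearn's GradientBoostingRegressor (under the hypothetical alias Regressor) if boosters is not installed.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from boosters.sklearn import GBDTRegressor as Regressor
except ImportError:
    # Assumption for illustration only: stand-in estimator so the
    # sketch runs without boosters installed.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

X, y = make_regression(n_samples=300, n_features=10, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Regressor(n_estimators=50)),
])

# Keys are "<step name>__<parameter name>"; values are illustrative
param_grid = {
    "model__max_depth": [2, 4],
    "model__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    pipeline, param_grid, cv=3, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.4f}")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no information from the validation fold leaks into preprocessing. After fitting, search.best_estimator_ is the pipeline refit on all the data with the best parameters.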

Classification

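Use GBDTClassifier for classification tasks. In addition to predict and score, it offers predict_proba, which returns one probability column per class: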

[5]:
# Generate classification data
X_clf, y_clf = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)

# Train classifier
clf = GBDTClassifier(n_estimators=100, max_depth=6)
clf.fit(X_train_clf, y_train_clf)

# Evaluate
accuracy = clf.score(X_test_clf, y_test_clf)
print(f"Accuracy: {accuracy:.4f}")

# Probability predictions
probas = clf.predict_proba(X_test_clf)
print(f"Probability shape: {probas.shape}")
Accuracy: 0.9000
Probability shape: (200, 2)
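
To show how the probability matrix relates to hard labels, the sketch below checks that each row sums to 1 and recovers labels by taking the most probable column. It assumes the classifier exposes a classes_ attribute after fitting, as sklearn classifiers do, and the import falls back to sklearn's GradientBoostingClassifier (under the hypothetical alias Classifier) if boosters is not installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

try:
    from boosters.sklearn import GBDTClassifier as Classifier
except ImportError:
    # Assumption for illustration only: stand-in estimator so the
    # sketch runs without boosters installed.
    from sklearn.ensemble import GradientBoostingClassifier as Classifier

X, y = make_classification(
    n_samples=300, n_features=10, n_classes=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = Classifier(n_estimators=50, max_depth=3)
clf.fit(X_train, y_train)

probas = np.asarray(clf.predict_proba(X_test))
labels = np.asarray(clf.predict(X_test))

# One column per class, so every row should sum to (approximately) 1
row_sums = probas.sum(axis=1)

# Map the most probable column back to a class label via classes_
labels_from_proba = clf.classes_[np.argmax(probas, axis=1)]
agreement = float(np.mean(labels_from_proba == labels))
print(f"Rows sum to 1: {np.allclose(row_sums, 1.0, atol=1e-5)}")
print(f"Agreement with predict: {agreement:.2%}")
```

Working from probas directly is useful when a decision threshold other than the default is needed, for example when false positives and false negatives have different costs.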

Summary

In this tutorial, you learned how to:

  1. ✅ Use sklearn-compatible estimators

  2. ✅ Perform cross-validation

  3. ✅ Build preprocessing pipelines

  4. ✅ Search hyperparameters with GridSearchCV

Next Steps