Note
This tutorial is available as a Jupyter notebook.
Tutorial 02: sklearn Integration#
🟢 Beginner — Familiarity with scikit-learn helpful
boosters provides scikit-learn-compatible estimators that work with pipelines, cross-validation, and other sklearn utilities.
What you’ll learn#
Use GBDTRegressor and GBDTClassifier estimators
Perform cross-validation
Build pipelines with preprocessing
Use GridSearchCV for hyperparameter tuning
[1]:
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from boosters.sklearn import GBDTRegressor, GBDTClassifier
Basic Usage#
The sklearn estimators follow the standard fit/predict API:
[2]:
# Generate data
X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train model
model = GBDTRegressor(n_estimators=100, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
# Evaluate
score = model.score(X_test, y_test)
print(f"R² Score: {score:.4f}")
R² Score: 0.8912
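Point predictions come from the standard predict method; a minimal sketch using the model fitted above:
# predict returns a NumPy array of shape (n_samples,)
preds = model.predict(X_test)
print(preds[:5])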
Cross-Validation#
Use sklearn’s cross-validation utilities:
[3]:
# Cross-validation
model = GBDTRegressor(n_estimators=100, max_depth=6)
# n_jobs=1 avoids pickling issues with the Rust-backed models (see Hyperparameter Search)
scores = cross_val_score(model, X, y, cv=5, scoring='neg_root_mean_squared_error', n_jobs=1)
print(f"CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}")
CV RMSE: 38.0863 ± 4.1803
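For explicit control over how the folds are drawn, pass a splitter object instead of an integer; a sketch with KFold (shuffling and the seed are illustrative choices):
from sklearn.model_selection import KFold

# Shuffle rows before splitting so folds are not order-dependent
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_root_mean_squared_error', n_jobs=1)
print(f"CV RMSE: {-scores.mean():.4f} ± {scores.std():.4f}")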
Pipelines#
Combine preprocessing and models:
[4]:
# Create pipeline with preprocessing
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', GBDTRegressor(n_estimators=100))
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline R² Score: {score:.4f}")
Pipeline R² Score: 0.8912
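Because a Pipeline is itself an estimator, it drops into the same utilities as a bare model; a sketch cross-validating the whole pipeline (the R² scoring choice is illustrative):
# The scaler is refit inside each fold, so no information leaks across folds
scores = cross_val_score(pipeline, X, y, cv=5, scoring='r2', n_jobs=1)
print(f"Pipeline CV R²: {scores.mean():.4f} ± {scores.std():.4f}")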
Classification#
Use GBDTClassifier for classification tasks:
[5]:
# Generate classification data
X_clf, y_clf = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
X_clf, y_clf, test_size=0.2, random_state=42
)
# Train classifier
clf = GBDTClassifier(n_estimators=100, max_depth=6)
clf.fit(X_train_clf, y_train_clf)
# Evaluate
accuracy = clf.score(X_test_clf, y_test_clf)
print(f"Accuracy: {accuracy:.4f}")
# Probability predictions
probas = clf.predict_proba(X_test_clf)
print(f"Probability shape: {probas.shape}")
Accuracy: 0.9000
Probability shape: (200, 2)
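The positive-class column of predict_proba plugs directly into sklearn's ranking metrics; a sketch computing ROC AUC (a standard sklearn metric, not specific to boosters):
from sklearn.metrics import roc_auc_score

# Column 1 holds the predicted probability of the positive class
auc = roc_auc_score(y_test_clf, probas[:, 1])
print(f"ROC AUC: {auc:.4f}")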
Hyperparameter Search#
Use GridSearchCV to search over hyperparameter combinations:
[6]:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100],
'max_depth': [4, 6],
'learning_rate': [0.05, 0.1],
}
# Grid search (n_jobs=1 to avoid pickling issues with Rust models)
search = GridSearchCV(
GBDTRegressor(),
param_grid,
cv=3,
scoring='neg_root_mean_squared_error',
n_jobs=1,
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {-search.best_score_:.4f}")
Best params: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100}
Best CV score: 37.6068
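With the default refit=True, GridSearchCV refits the best configuration on all of X_train and exposes it as best_estimator_; a sketch evaluating it on the held-out test set:
# best_estimator_ was refit on the full training split
best_model = search.best_estimator_
print(f"Test R²: {best_model.score(X_test, y_test):.4f}")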
Summary#
In this tutorial, you learned how to:
✅ Use sklearn-compatible estimators
✅ Perform cross-validation
✅ Build preprocessing pipelines
✅ Search hyperparameters with GridSearchCV
Next Steps#
Tutorial 03: Classification — Binary classification with ROC/AUC
Tutorial 05: Early Stopping — Prevent overfitting