Note

This tutorial is available as a Jupyter notebook.

Tutorial 01: Basic GBDT Training#

🟢 Beginner — No prior boosting experience needed

In this tutorial, you’ll train your first Gradient Boosted Decision Tree (GBDT) model with the boosters library.

What you’ll learn#

  1. Create a dataset from NumPy arrays

  2. Configure and train a GBDT model

  3. Make predictions

  4. Evaluate model performance

Setup#

First, import the required packages. This assumes boosters, NumPy, and scikit-learn are already installed in your environment:

[1]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

import boosters

Generate Sample Data#

We’ll use scikit-learn to generate a synthetic regression dataset:

[2]:
# Generate synthetic regression data
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_train.shape[1]}")
Training samples: 800
Test samples: 200
Features: 10

Create a Dataset#

boosters uses a Dataset object to wrap your data for efficient training:

[3]:
# Create boosters Dataset objects
train_data = boosters.Dataset(X_train, y_train)
test_data = boosters.Dataset(X_test, y_test)

print(f"Train dataset: {train_data}")
print(f"Test dataset: {test_data}")
Train dataset: Dataset(n_samples=800, n_features=10, has_labels=true, categorical_features=0)
Test dataset: Dataset(n_samples=200, n_features=10, has_labels=true, categorical_features=0)

Configure the Model#

Create a configuration for your GBDT model:

[4]:
# Configure the GBDT model
config = boosters.GBDTConfig(
    n_estimators=100,      # Number of trees
    max_depth=6,           # Maximum tree depth
    learning_rate=0.1,     # Learning rate (shrinkage)
    objective=boosters.Objective.squared(),  # Regression objective (L2 loss)
)

print("Configuration created!")
print(config)
Configuration created!
GBDTConfig(n_estimators=100, learning_rate=0.1, objective=Squared)
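
To build intuition for what n_estimators and learning_rate control, here is a minimal sketch of gradient boosting for squared loss. It is an illustration only, not how boosters implements training: it uses pure Python, a single feature, and depth-1 trees (stumps) instead of max_depth=6. The function names fit_stump, fit_boosted_stumps, and training_mse are hypothetical helpers written for this example.

```python
def fit_stump(x, residuals):
    """Pick the threshold that best splits the residuals into two means."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(x, y, n_estimators, learning_rate):
    """Fit stumps sequentially, each one on the current residuals."""
    base = sum(y) / len(y)          # start from the mean of the targets
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        # Shrinkage: add only a fraction of each new tree's output.
        preds = [
            pi + learning_rate * (left_mean if xi <= threshold else right_mean)
            for xi, pi in zip(x, preds)
        ]
    return base, stumps, preds


def training_mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy data: y = x^2 on ten points.
x = [float(i) for i in range(10)]
y = [xi ** 2 for xi in x]

for n in (0, 1, 100):
    _, _, preds = fit_boosted_stumps(x, y, n_estimators=n, learning_rate=0.1)
    print(f"n_estimators={n}: training MSE = {training_mse(y, preds):.3f}")
```

With a small learning rate, each tree corrects only part of the remaining error, so more trees are needed to reach a given training error. This is the usual trade-off between n_estimators and learning_rate.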

Train the Model#

Train the model using the GBDTModel.train() method:

[5]:
# Train the model
model = boosters.GBDTModel.train(train_data, config=config)

print("Model trained!")
print(f"Number of trees: {model.n_trees}")
print(f"Number of features: {model.n_features}")
Model trained!
Number of trees: 100
Number of features: 10

Make Predictions#

Use the trained model to predict on the test set:

[6]:
# The core API expects a Dataset, so wrap the test features (no labels needed for prediction)
y_pred = model.predict(boosters.Dataset(X_test))

print(f"Predictions shape: {y_pred.shape}")
print(f"First 5 predictions: {y_pred[:5].flatten()}")
Predictions shape: (200, 1)
First 5 predictions: [ 2.1480764e+01  6.7458168e+01 -2.0529093e-01 -2.7172467e+02
  3.0354942e+01]
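
The predictions above have shape (200, 1), while y_test has shape (200,). If you ever compute errors by hand with NumPy, that mismatch matters: subtracting the two broadcasts to a 2-D matrix instead of raising an error. The small sketch below uses made-up three-element arrays, not the tutorial's data, to show the pitfall and the fix.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])           # shape (3,), like y_test
y_pred = np.array([[1.5], [2.0], [2.5]])     # shape (3, 1), like the model output

# Pitfall: (3,) minus (3, 1) broadcasts to every pairwise difference.
wrong = y_true - y_pred
print(wrong.shape)   # (3, 3)

# Fix: flatten the predictions so both arrays are 1-D.
right = y_true - y_pred.ravel()
print(right.shape)   # (3,)

# The two "mean squared errors" differ, and only the second one is correct.
wrong_mse = np.mean(wrong ** 2)
right_mse = np.mean(right ** 2)
print(wrong_mse, right_mse)
```

Either ravel() or flatten() works here; flatten() always copies, while ravel() returns a view when it can.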

Evaluate Performance#

Calculate standard regression metrics. The predictions have shape (200, 1), so flatten them first to match the 1-D y_test:

[7]:
# Calculate metrics - flatten predictions for sklearn metrics
y_pred_flat = y_pred.flatten()
mse = mean_squared_error(y_test, y_pred_flat)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred_flat)

print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
Mean Squared Error: 1243.9708
Root Mean Squared Error: 35.2700
R² Score: 0.9263
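
To make the three metrics concrete, the sketch below computes them from their definitions using only the standard library. The helper regression_metrics and the four sample values are illustrative and are not part of boosters or of this tutorial's dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R² directly from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Baseline: predicting the mean of y_true for every sample gives R² = 0.
mean_value = sum(y_true) / len(y_true)
baseline = regression_metrics(y_true, [mean_value] * len(y_true))
print(baseline["r2"])
```

RMSE is the square root of MSE and is in the same units as the target, which is why the output above reports an RMSE of 35.2700 for an MSE of 1243.9708. R² compares the model against the mean-only baseline, so 0.9263 means the model removes about 93% of the baseline's squared error on the test set.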

Summary#

In this tutorial, you learned how to:

  1. ✅ Create datasets from NumPy arrays

  2. ✅ Configure a GBDT model with basic hyperparameters

  3. ✅ Train the model

  4. ✅ Make predictions and evaluate performance

Next Steps#