7 XGBoost Tricks for More Accurate Predictive Models


# Introduction

Ensemble methods such as XGBoost (Extreme Gradient Boosting) are powerful implementations of gradient-boosted decision trees that aggregate many weak estimators into a strong predictive model. These ensembles are very popular due to their accuracy, efficiency, and strong performance on structured (tabular) data. scikit-learn does not provide a native implementation of XGBoost, but a separate, widely used library, appropriately called XGBoost, provides an API compatible with scikit-learn.

Just import it like this:

from xgboost import XGBClassifier

Below, we outline seven Python tricks that will help you get the most out of this standalone implementation of XGBoost, especially when aiming to build more accurate predictive models.

To demonstrate these tricks, we’ll use the breast cancer dataset bundled with scikit-learn and define a baseline model with almost default settings. Make sure to run this code first before trying the seven tricks below.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model
model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 1. Adjust learning rate and number of estimators

Although not a universal rule, explicitly lowering the learning rate while increasing the number of estimators (trees) in the XGBoost ensemble often improves accuracy. The smaller learning rate allows the model to learn more gradually, and the additional trees compensate for the reduced step size.

Here is an example. Try it yourself and compare the accuracy of your results to your initial baseline.

model = XGBClassifier(
    learning_rate=0.01,
    n_estimators=5000,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)
print("Model accuracy:", accuracy_score(y_test, model.predict(X_test)))

For brevity, the final print() statement is omitted in the remaining examples. If you want to test them yourself, just add it to any of the snippets below.

# 2. Adjust the maximum tree depth

The max_depth argument is an important hyperparameter inherited from classical decision trees. It limits how deep each tree in the ensemble can grow. Limiting tree depth may seem simplistic, but shallower trees often generalize better than deeper ones.

This example limits each tree to a maximum depth of 2.

model = XGBClassifier(
    max_depth=2,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

# 3. Reducing overfitting by subsampling

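The subsample argument randomly samples a portion (e.g., 80%) of the training data before growing each tree in the ensemble. The related colsample_bytree argument, also used in the example below, does the same for the features (columns) available to each tree. This simple technique serves as an effective regularization strategy and helps prevent overfitting.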

If not specified, this hyperparameter defaults to 1.0, meaning 100% of the training samples are used.

model = XGBClassifier(
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

Note that this approach is most effective for moderately sized datasets. If the dataset is already small, aggressive subsampling can lead to underfitting.
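To make the idea concrete, here is a purely conceptual NumPy illustration of what row and column sampling at 80% looks like for a single tree. XGBoost performs this sampling internally; the code below is not its implementation, and the names `row_idx`, `col_idx`, and `X_sub` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(42)
n_rows, n_cols = X.shape

# Draw 80% of the rows and 80% of the columns without replacement
row_idx = rng.choice(n_rows, size=int(0.8 * n_rows), replace=False)
col_idx = rng.choice(n_cols, size=int(0.8 * n_cols), replace=False)

# The reduced matrix a single tree would be grown on in this illustration
X_sub = X[np.ix_(row_idx, col_idx)]
print("Full data:", X.shape)
print("Sampled view for one tree:", X_sub.shape)
```

Because each tree sees a different random slice of the data, the trees become less correlated with one another, which is where the regularizing effect comes from.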

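# 4. Adding regularization terms

To further control overfitting, traditional regularization strategies such as L1 (lasso) and L2 (ridge) can be used to penalize complex trees. In XGBoost, these are specified with the reg_alpha and reg_lambda parameters, respectively. Keep in mind that reg_lambda defaults to 1 and reg_alpha to 0, so a reg_lambda value below 1 applies less L2 regularization than the default.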

model = XGBClassifier(
    reg_alpha=0.2,   # L1
    reg_lambda=0.5,  # L2
    eval_metric="logloss",
    random_state=42
)
model.fit(X_train, y_train)

# 5. Using early stopping

Early stopping is an efficiency-oriented mechanism that stops training when performance on the validation set stops improving for a specified number of rounds.

Depending on your coding environment and the version of the XGBoost library you are using, you may need to upgrade to a newer version to use the implementation shown below. Also note that early_stopping_rounds is specified during model initialization rather than being passed to the fit() method.

model = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42
)

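# Hold out part of the training data for validation so the test set stays unseen
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False
)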

To upgrade the library from a notebook environment, run:

!pip uninstall -y xgboost
!pip install xgboost --upgrade

# 6. Performing a hyperparameter search

For a more systematic approach, hyperparameter search can help you identify combinations of settings that maximize model performance. Below is an example of using grid search to explore combinations of the three main hyperparameters introduced earlier.

param_grid = {
    "max_depth": [3, 4, 5],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 500]
}

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy"
)

grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)

best_model = XGBClassifier(
    **grid.best_params_,
    eval_metric="logloss",
    random_state=42
)

best_model.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, best_model.predict(X_test)))

# 7. Adjusting for class imbalance

This last trick is especially useful when dealing with strongly class-imbalanced datasets (the breast cancer dataset is relatively balanced, so don’t worry if you observe minimal changes). The scale_pos_weight parameter reweights the positive class and is particularly helpful when the class ratios are highly skewed, such as 90/10, 95/5, or 99/1.

Here’s how to calculate and apply it based on your training data.

ratio = np.sum(y_train == 0) / np.sum(y_train == 1)

model = XGBClassifier(
    scale_pos_weight=ratio,
    eval_metric="logloss",
    random_state=42
)

model.fit(X_train, y_train)

# Summary

In this article, we reviewed seven practical tricks to enhance your XGBoost ensemble models using specialized Python libraries. Careful tuning of learning rate, tree depth, sampling strategy, regularization, and class weighting, combined with systematic hyperparameter search, can often make the difference between a decent model and a high-accuracy model.

Ivan Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning, and LLMs. He trains and guides others in harnessing AI in the real world.