Introduction
In my previous posts I looked at bog-standard decision trees and the wonders of random forests. Now, to complete the trilogy, let's visually explore gradient boosted trees!
There are plenty of gradient boosted tree libraries, including XGBoost, CatBoost, LightGBM, and more. However, I'll use sklearn's for this. Why? Simply because, compared with the others, it's easier to visualize. In practice, people tend to use the other libraries rather than sklearn's – but this project is about visual learning, not pure performance.
Fundamentally, a GBT is a combination of trees that only works as a whole. A single decision tree (including one extracted from a random forest) can make decent predictions on its own, but an individual tree taken from a GBT is unlikely to give you anything useful.
Beyond that, as always, there'll be no theory or maths – just plots and hyperparameters. As before, I'm using the California housing dataset via scikit-learn (CC-BY), following the same general process described in the previous posts. The code is at https://github.com/jamesdeluk/data-projects/tree/main/visualising-tree, and all the images (and GIFs) below were created by me.
Basic gradient boosted tree
Let's start with a basic GBT: `gb = GradientBoostingRegressor(random_state=42)`. As with the other tree types, the defaults are min_samples_split=2, min_samples_leaf=1, and max_leaf_nodes=None. Interestingly, the default max_depth is 3, not None as with decision trees and random forests. The hyperparameters to pay most attention to, which we'll look at in more detail later, are learning_rate (how steep the correction steps are, default 0.1) and n_estimators (as with random forests, the number of trees, default 100).
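For reference, here's a minimal sketch of fitting and scoring this default model. The data loading and split below are an assumption based on the previous posts, not necessarily the exact setup used there:

```python
import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed data prep, following the general approach of the previous posts
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

gb = GradientBoostingRegressor(random_state=42)  # defaults: max_depth=3, learning_rate=0.1, n_estimators=100

start = time.time()
gb.fit(X_train, y_train)
print(f"Fit time: {time.time() - start:.3f}s")

y_pred = gb.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
```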
The fitting takes 2.2 seconds, the prediction takes 0.005 seconds, and the results are as follows:
| Metric | Default |
|---|---|
| MAE | 0.369 |
| MAPE | 0.216 |
| MSE | 0.289 |
| RMSE | 0.538 |
| R² | 0.779 |
So it's faster than the default random forest, but performs slightly worse. For the block I selected, it predicted 0.803 (actual 0.894).
Visualization
This is why you're here, right?
Tree
As before, you can plot a single tree. This is the first one, `gb.estimators_[0, 0]`:

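For anyone following along, a sketch of how a plot like this can be produced with sklearn's `plot_tree` (assuming `gb` is the fitted model from above, trained on a DataFrame so `feature_names_in_` is available):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# estimators_ is a 2D array of trees, hence the [0, 0] indexing for the first one
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(gb.estimators_[0, 0], feature_names=list(gb.feature_names_in_), filled=True, ax=ax)
plt.show()
```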
I've explained these plots in a previous post, so I won't do so again here. One thing that should catch your attention is how terrible the values are! Three of the leaves even have negative values, which we know no block has. This is why a GBT only works as a combined ensemble, rather than providing independent standalone trees like a random forest does.
Prediction and Errors
My favorite way to visualize a GBT is to plot its predictions across iterations, using `gb.staged_predict`. For the block I chose:

Remember how the default model has 100 estimators? Here they are. The initial prediction was quite far off, at around 2! But with each iteration it learned (remember learning_rate?) and moved closer to the actual value. Of course, the final value was still off (0.803, so about 10% out), as the model was trained on the training data rather than this particular block, but you can clearly see the process.
In this case, it reached a fairly steady state after about 50 iterations. Later, we'll see how to stop iterating at this point, so you don't waste time and money.
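A sketch of how this kind of plot can be built with `staged_predict`. Here `block` (a single-row slice of the test set) and `actual` (its true value) are placeholders for whichever block was selected:

```python
import matplotlib.pyplot as plt

block = X_test.iloc[[0]]   # placeholder: the selected block
actual = y_test.iloc[0]    # placeholder: its true value

# staged_predict yields the ensemble's prediction after each iteration
staged = [pred[0] for pred in gb.staged_predict(block)]

fig, ax = plt.subplots()
ax.plot(range(1, len(staged) + 1), staged, label="Prediction")
ax.axhline(actual, linestyle="--", label="Actual")
ax.set_xlabel("Iteration")
ax.set_ylabel("Predicted value")
ax.legend()
plt.show()
```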
Similarly, you can plot the errors (i.e. the predictions subtracted from the true values). Of course, this simply gives the same plot with different y-axis values.

Let's take this a step further! The test data has over 5,000 blocks to predict. For each iteration, you can loop through and predict all of them!

I love this plot.

They all start at around 2, then fan out across the iterations. We know the true values range from about 0.15 to 5, with a mean of around 2.1 (check the first post), so the spread of predictions (from ~0.3 to ~5.5) is as expected.
You can also plot errors.

At first glance this seems a little strange; for example, we'd expect the errors to start at around ±2 and converge towards 0. That is what happens in most cases – you can see it on the left side of the plot, in the first 10 or so iterations – but there are over 5,000 lines in this plot, so there's plenty of overlap and the outliers stand out more. Perhaps there's a better way to visualize these? How about…

The median error is 0.05, which is very good! The IQR is below 0.5, which is also decent. So there are some terrible predictions, but most are fine.
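A sketch of how such a summary could be computed, using the full matrix of staged predictions and plotting the median error with an IQR band (the exact styling of the original figure may differ):

```python
import matplotlib.pyplot as plt
import numpy as np

# Shape: (n_iterations, n_test_blocks)
staged_preds = np.array(list(gb.staged_predict(X_test)))
errors = y_test.to_numpy() - staged_preds  # actual minus predicted, per iteration

p25, p50, p75 = np.percentile(errors, [25, 50, 75], axis=1)
iterations = range(1, errors.shape[0] + 1)

fig, ax = plt.subplots()
ax.plot(iterations, p50, label="Median error")
ax.fill_between(iterations, p25, p75, alpha=0.3, label="IQR")
ax.set_xlabel("Iteration")
ax.set_ylabel("Error (actual - predicted)")
ax.legend()
plt.show()
```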
Hyperparameter tuning
Decision Tree Hyperparameters
Just like before, let's compare how the hyperparameters discussed in the original decision tree post apply to GBTs, keeping the default learning_rate = 0.1 and n_estimators = 100. The min_samples_leaf, min_samples_split, and max_leaf_nodes models also have max_depth = 10, to make a fair comparison with the previous posts and with one another.
| Model | max_depth = None | max_depth = 10 | min_samples_leaf = 10 | min_samples_split = 10 | max_leaf_nodes = 100 |
|---|---|---|---|---|---|
| Fit time | 10.889 | 7.009 | 7.101 | 7.015 | 6.167 |
| Predict time | 0.089 | 0.019 | 0.015 | 0.018 | 0.013 |
| MAE | 0.454 | 0.304 | 0.301 | 0.302 | 0.301 |
| MAPE | 0.253 | 0.177 | 0.174 | 0.174 | 0.175 |
| MSE | 0.496 | 0.222 | 0.212 | 0.217 | 0.210 |
| RMSE | 0.704 | 0.471 | 0.46 | 0.466 | 0.458 |
| R² | 0.621 | 0.830 | 0.838 | 0.834 | 0.840 |
| Selected predictions | 0.885 | 0.906 | 0.962 | 0.918 | 0.923 |
| Selected error | 0.009 | 0.012 | 0.068 | 0.024 | 0.029 |
Unlike with decision trees and random forests, deeper trees did much worse – and took far longer to fit. However, increasing the depth from 3 (the default) to 10 improved the scores, and the other constraints provided further improvements. This again shows how all the hyperparameters can play a role.
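A sketch of how such a comparison could be generated; each variant keeps the default learning_rate and n_estimators, and the three constrained variants also set max_depth=10 as described above (only MAE and R² are printed here for brevity):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

variants = {
    "max_depth=None": {"max_depth": None},
    "max_depth=10": {"max_depth": 10},
    "min_samples_leaf=10": {"max_depth": 10, "min_samples_leaf": 10},
    "min_samples_split=10": {"max_depth": 10, "min_samples_split": 10},
    "max_leaf_nodes=100": {"max_depth": 10, "max_leaf_nodes": 100},
}

for name, params in variants.items():
    model = GradientBoostingRegressor(random_state=42, **params)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, y_pred):.3f}, R²={r2_score(y_test, y_pred):.3f}")
```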
learning_rate
A GBT works by adjusting its predictions after each iteration, based on the errors. The larger the adjustment (a.k.a. the gradient step, a.k.a. the learning rate), the more the prediction changes between iterations.
There's a clear trade-off with learning rates. Comparing learning rates of 0.01 (slow), 0.1 (default), and 0.5 (fast), over 100 iterations:

Faster learning rates can reach the right value sooner, but they're more likely to overshoot the true value (think of a fishtailing car), leading to oscillations. Slower learning rates may never reach the correct value at all (think… barely turning the steering wheel and driving straight into a tree). As for the stats:
| Model | Default | Fast | Slow |
|---|---|---|---|
| Fit time | 2.159 | 2.288 | 2.166 |
| Predict time | 0.005 | 0.004 | 0.015 |
| MAE | 0.370 | 0.338 | 0.629 |
| MAPE | 0.216 | 0.197 | 0.427 |
| MSE | 0.289 | 0.247 | 0.661 |
| RMSE | 0.538 | 0.497 | 0.813 |
| R² | 0.779 | 0.811 | 0.495 |
| Selected predictions | 0.803 | 0.949 | 1.44 |
| Selected error | 0.091 | 0.055 | 0.546 |
Naturally, the slow-learning model was awful. Overall, and for this block, fast was slightly better than the default. However, you can see how, at least for the selected block, it had more or less converged by around 40 iterations, so the later iterations added little. The joys of visualization!
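A sketch of how the three learning-rate curves above could be reproduced for the selected block (`block` and `actual` as in the earlier sketch):

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

fig, ax = plt.subplots()
for label, lr in [("slow", 0.01), ("default", 0.1), ("fast", 0.5)]:
    model = GradientBoostingRegressor(learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    staged = [pred[0] for pred in model.staged_predict(block)]
    ax.plot(range(1, len(staged) + 1), staged, label=f"{label} ({lr})")
ax.axhline(actual, linestyle="--", label="Actual")
ax.set_xlabel("Iteration")
ax.set_ylabel("Predicted value")
ax.legend()
plt.show()
```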
n_estimators
As mentioned above, the number of estimators is closely linked to the learning rate. Generally, the more estimators, the more iterations there are to measure and adjust for the error – but this costs additional time.
A large number of estimators is particularly important for a low learning rate to reach the correct value. Increasing the number of estimators to 500:

With enough iterations, the slow-learning GBT did reach the true value. In fact, they all got much closer. The stats confirm this:
| Model | Default, more | Fast, more | Slow, more |
|---|---|---|---|
| Fit time | 12.254 | 12.489 | 11.918 |
| Predict time | 0.018 | 0.014 | 0.022 |
| MAE | 0.323 | 0.319 | 0.410 |
| MAPE | 0.187 | 0.185 | 0.248 |
| MSE | 0.232 | 0.228 | 0.338 |
| RMSE | 0.482 | 0.477 | 0.581 |
| R² | 0.823 | 0.826 | 0.742 |
| Selected predictions | 0.841 | 0.921 | 0.858 |
| Selected error | 0.053 | 0.027 | 0.036 |
Naturally, increasing the number of estimators fivefold significantly increased the fit time (in this case roughly sixfold, although that could just be a one-off). However, it still hasn't beaten the constrained-tree scores above; a hyperparameter search would be needed to see if they can be beaten. Also, for the selected block, none of the models actually improved after about 300 iterations, as seen in the plot; if that were consistent across all the data, the later iterations weren't needed. I mentioned earlier that it's possible to avoid wasting time on iterations that bring no improvement – now is the time to look into it.
n_iter_no_change, validation_fraction, and tol
Additional iterations may not improve the final result, but they still take time to run. This is where early stopping comes in.
There are three related hyperparameters. First, n_iter_no_change: the number of consecutive iterations with "no change" before no further iterations are run. tol[erance] is how small a change in the validation score must be to count as "no change". And validation_fraction is how much of the training data is used as the validation set that generates the validation score (this is separate from the test data).
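A sketch of what this looks like in code, using the values compared below; after fitting, the `n_estimators_` attribute reports how many estimators were actually built:

```python
from sklearn.ensemble import GradientBoostingRegressor

gb_early = GradientBoostingRegressor(
    n_estimators=1000,
    n_iter_no_change=5,       # stop after 5 consecutive iterations with "no change"
    validation_fraction=0.1,  # hold out 10% of the training data for the validation score
    tol=0.005,                # improvements smaller than this count as "no change"
    random_state=42,
)
gb_early.fit(X_train, y_train)
print(gb_early.n_estimators_)  # how many estimators were actually fitted
```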
Comparing a 1000-estimator GBT with and without rather aggressive early stopping – n_iter_no_change=5, validation_fraction=0.1, tol=0.005 – the latter stopped after only 61 estimators (so it took only 5–6% of the time to fit):

As expected, the results were worse:
| Model | Default | Early stopping |
|---|---|---|
| Fit time | 24.843 | 1.304 |
| Predict time | 0.042 | 0.003 |
| MAE | 0.313 | 0.396 |
| MAPE | 0.181 | 0.236 |
| MSE | 0.222 | 0.321 |
| RMSE | 0.471 | 0.566 |
| R² | 0.830 | 0.755 |
| Selected predictions | 0.837 | 0.805 |
| Selected error | 0.057 | 0.089 |
But, as always, there's a question to ask: is it worth investing 20 times the time to improve R² by 10%, or to reduce the errors by around 20%?
Bayes search
You were probably expecting this. The search space:
search_spaces = {
'learning_rate': (0.01, 0.5),
'max_depth': (1, 100),
'max_features': (0.1, 1.0, 'uniform'),
'max_leaf_nodes': (2, 20000),
'min_samples_leaf': (1, 100),
'min_samples_split': (2, 100),
'n_estimators': (50, 1000),
}
Mostly similar to my previous posts. The only additional hyperparameter is learning_rate.
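A sketch of how the search might be run with skopt's BayesSearchCV, using the search space above; the n_iter, cv, and scoring values here are assumptions rather than the exact settings used:

```python
from skopt import BayesSearchCV
from sklearn.ensemble import GradientBoostingRegressor

opt = BayesSearchCV(
    GradientBoostingRegressor(random_state=42),
    search_spaces,                     # the dictionary defined above
    n_iter=50,                         # assumption: number of parameter settings sampled
    cv=5,                              # assumption: 5-fold cross-validation
    scoring="neg_mean_squared_error",  # assumption: one of several possible metrics
    random_state=42,
    n_jobs=-1,
)
opt.fit(X_train, y_train)
print(opt.best_params_)
```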
So far this took the longest, at 96 minutes (about 50% more than the random forest!). The best hyperparameters are:
best_parameters = OrderedDict({
'learning_rate': 0.04345459461297153,
'max_depth': 13,
'max_features': 0.4993693929975871,
'max_leaf_nodes': 20000,
'min_samples_leaf': 1,
'min_samples_split': 83,
'n_estimators': 325,
})
max_features, max_leaf_nodes, and min_samples_leaf are very similar to the tuned random forest. n_estimators is also consistent with what the selected-block plot above suggested: the extra ~700 iterations allowed by the search space weren't really needed. However, compared with the tuned random forest, the trees are only around a third as deep, and min_samples_split is far higher than we've seen before. The learning_rate value isn't too surprising given what we saw above.
And cross-validated scores:
| Metric | Mean | Std |
|---|---|---|
| MAE | -0.289 | 0.005 |
| MAPE | -0.161 | 0.004 |
| MSE | -0.200 | 0.008 |
| RMSE | -0.448 | 0.009 |
| R² | 0.849 | 0.006 |
Of all the models so far, this is the best, with low errors, high R² and low variance!
Finally, our old friend, the box plot:

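For completeness, a sketch of how an error box plot could be produced, assuming `best_gb` is the model refit with the best hyperparameters found above (the actual figure may compare several models side by side):

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor

# Refit with the tuned hyperparameters found by the search (best_parameters above)
best_gb = GradientBoostingRegressor(random_state=42, **best_parameters)
best_gb.fit(X_train, y_train)

errors = y_test - best_gb.predict(X_test)

fig, ax = plt.subplots()
ax.boxplot(errors, vert=False)
ax.set_xlabel("Error (actual - predicted)")
plt.show()
```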
Conclusion
And so we reach the end of my mini-series on the three most common types of tree-based models.
My hope is that, by looking at different ways of visualizing trees, you can now (a) better understand how the different models work without wading through equations, and (b) use plots like these to tune your own models. It could also help with managing stakeholders: execs tend to prefer a clean picture to a table of numbers, so showing them a tree plot may help them understand why what they're asking of you isn't possible.
Based on this dataset and these models, the gradient boosted model was slightly better than the random forest, and both were far better than the lone decision tree. However, this could be because the GBT had 50% more time to search for better hyperparameters (it was the same number of search iterations, after all – GBTs are simply more computationally expensive). It's also worth noting that GBTs tend to be more prone to overfitting than random forests. And although the decision tree performed worst, it was far faster – and in some use cases that matters more. Plus, as mentioned earlier, there are other libraries with their own pros and cons. For example, CatBoost handles categorical data out of the box, whereas other GBT libraries usually need it preprocessed (for example, one-hot or label encoded). Or, if you're feeling really brave, try stacking different tree types into one ensemble for even better performance…
Anyway, until next time!
