A visual guide to adjusting random forest hyperparameters

In a previous post, we looked at how various hyperparameters affect a decision tree's performance, and how those effects appear visually.

So the natural next step is the random forest: sklearn.ensemble.RandomForestRegressor.

Again, I won't explain how random forests work internally (bootstrapping, feature selection, majority voting, and so on). Essentially, a random forest is a large number of decision trees working together (hence "forest"), and that's all we need to care about here.

This post follows the same general process and uses the same data (scikit-learn's California housing dataset, CC-BY). So, if you haven't read the previous posts, it's worth doing so first, as they cover some of the functions and metrics I use here.

The code for this is in the same repository as before: https://github.com/jamesdeluk/data-projects/tree/main/visualising-trees

Just like before, all the images below have been created by me.

Basic forest

First, let's take a look at how a basic random forest performs: rf = RandomForestRegressor(random_state=42). The default model has an unlimited maximum depth and 100 trees. On average, it took about 6 seconds to fit and about 0.1 seconds to predict. Being a forest of trees rather than a single tree, it's no surprise that this took 50-150 times longer than the deep decision tree from before. And the scores?

metric   max_depth=None
MAE      0.33
MAPE     0.19
MSE      0.26
RMSE     0.51
R²       0.80

For my chosen row, it predicted 0.954, compared to the actual value of 0.894.
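For reference, here's a minimal sketch of the setup above. The data is scikit-learn's fetch_california_housing; the exact train/test split is my assumption here (the previous post has the details).

import time

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# California housing data, kept as a DataFrame so we retain feature names
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(random_state=42)  # defaults: 100 trees, unlimited depth

start = time.perf_counter()
rf.fit(X_train, y_train)
print(f"Fit: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
preds = rf.predict(X_test)
print(f"Predict: {time.perf_counter() - start:.2f}s")

print(f"MAE:  {mean_absolute_error(y_test, preds):.2f}")
print(f"RMSE: {mean_squared_error(y_test, preds) ** 0.5:.2f}")
print(f"R²:   {r2_score(y_test, preds):.2f}")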

Yes, an out-of-the-box random forest performed better than the Bayes-search-tuned decision tree from the previous post!

Visualization

There are several ways to visualize random forests: individual trees, per-tree predictions, and per-tree errors. Feature importances can also be used to compare individual trees within the forest.

Individual Tree Plots

The individual decision trees can, of course, be plotted. You can access them through rf.estimators_. For example, this is the first one:

It has a depth of 34, 9,432 leaves, and 18,863 nodes. And there are 100 similar trees in this random forest!
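Here's a sketch of how to pull those numbers out and plot the tree (rf from the setup sketch above); I'm only plotting the top few levels, since a ~19,000-node tree is unreadable in full:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

first_tree = rf.estimators_[0]
print(f"Depth:  {first_tree.get_depth()}")
print(f"Leaves: {first_tree.get_n_leaves()}")
print(f"Nodes:  {first_tree.tree_.node_count}")

# Plot only the top of the tree; the full thing is far too big to read
plt.figure(figsize=(20, 10))
plot_tree(first_tree, max_depth=3, feature_names=list(X.columns), filled=True)
plt.show()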

Individual predictions

One way I like to visualize random forests is to plot each tree's individual prediction. For example, for the chosen row, I can run [tree.predict(chosen[features].values) for tree in rf.estimators_] and plot the results on a scatter plot.
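Fleshed out slightly, as a sketch (chosen is assumed to be a one-row DataFrame and features the list of feature column names, as in the previous post):

import matplotlib.pyplot as plt

# One prediction per tree for the chosen row
tree_preds = [tree.predict(chosen[features].values)[0] for tree in rf.estimators_]

plt.scatter(range(len(tree_preds)), tree_preds, alpha=0.6, label="Individual trees")
plt.axhline(rf.predict(chosen[features].values)[0], color="red", label="Forest average")
plt.axhline(0.894, color="green", linestyle="--", label="True value")
plt.xlabel("Tree")
plt.ylabel("Predicted value")
plt.legend()
plt.show()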

As a reminder, the true value is 0.894. While some trees are way off, it's easy to see how the average of all the predictions ends up pretty close. This is my favorite way to see the magic of random forests.

Individual errors

You can take this a step further: iterate through all the trees, predict on the entire dataset, and calculate error statistics for each. In this case, for MSE:
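A sketch of that loop, reusing rf, X_test, and y_test from the setup sketch:

import numpy as np
from sklearn.metrics import mean_squared_error

# MSE of each individual tree over the test set
tree_mses = np.array([
    mean_squared_error(y_test, tree.predict(X_test.values)) for tree in rf.estimators_
])

print(f"Average tree MSE: {tree_mses.mean():.2f}")
print(f"Best tree:  #{tree_mses.argmin()} (MSE {tree_mses.min():.2f})")
print(f"Worst tree: #{tree_mses.argmax()} (MSE {tree_mses.max():.2f})")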

The average per-tree MSE was ~0.30, slightly higher than that of the overall random forest; again, this shows the benefit of a forest over a single tree. The best tree was number 32, with an MSE of 0.27; the worst, number 74, with 0.34, which is still pretty decent. Both have a depth of 34±1, with ~9,400 leaves and ~18,000 nodes, so structurally they are very similar.

Feature importances

Obviously, a plot with all the trees would be hard to read, so here are the feature importances for the whole forest alongside the best and worst trees.
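A sketch of that comparison, using tree_mses from the loop above to pick out the best and worst trees:

import pandas as pd

importances = pd.DataFrame({
    "forest": rf.feature_importances_,
    "best tree": rf.estimators_[tree_mses.argmin()].feature_importances_,
    "worst tree": rf.estimators_[tree_mses.argmax()].feature_importances_,
}, index=X.columns)

# Horizontal bars, most important feature at the top
importances.sort_values("forest").plot.barh(figsize=(8, 6))
plt.show()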

The best and worst trees give similar importances to the various features, although the ordering is not exactly the same. Based on this analysis, median income is the most important feature.

Hyperparameter tuning

Of course, the same hyperparameters that apply to individual decision trees also apply to random forests, since they are made up of decision trees. For comparison, I created some RFs with the values I used in the previous post.

metric               max_depth=3   ccp_alpha=0.005   min_samples_split=10   min_samples_leaf=10   max_leaf_nodes=100
Time to fit (s)      1.43          25.04             3.84                   3.77                  3.32
Time to predict (s)  0.006         0.013             0.028                  0.029                 0.020
MAE                  0.58          0.49              0.37                   0.37                  0.41
MAPE                 0.37          0.30              0.22                   0.22                  0.25
MSE                  0.60          0.45              0.29                   0.30                  0.34
RMSE                 0.78          0.67              0.54                   0.55                  0.58
R²                   0.54          0.66              0.78                   0.77                  0.74
Chosen prediction    1.208         1.024             0.935                  0.920                 0.969
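A sketch of how a comparison like this can be generated, reusing the setup sketch from earlier (the timing columns are omitted here for brevity):

from sklearn.metrics import mean_squared_error

# One constraint at a time, values from the table above
constraints = [
    {"max_depth": 3},
    {"ccp_alpha": 0.005},
    {"min_samples_split": 10},
    {"min_samples_leaf": 10},
    {"max_leaf_nodes": 100},
]

for params in constraints:
    model = RandomForestRegressor(random_state=42, **params)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(params, f"MSE: {mse:.2f}")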

The first thing we see is that none of these performed better than the default forest (max_depth=None). This is different from the individual decision trees, where constraining them improved performance; again, it shows how a forest of imperfect trees, with errors averaging out (think central limit theorem), beats a single "perfect" tree. As before, ccp_alpha takes a long time to fit, and the shallow trees are pretty garbage.

Beyond these, there are some hyperparameters that RFs have and individual decision trees do not. The most important is n_estimators, in other words, the number of trees!

n_jobs

But first, n_jobs. This is the number of jobs to run in parallel, and doing things in parallel is usually faster than doing them serially/sequentially. The resulting RF is the same, with the same error scores (assuming random_state is set), but it should be built faster! To test this, I added n_jobs=-1 to the default RF; here, -1 means "use all cores".
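A sketch of the parallelized version:

import time

# Same forest as before, but fitting and predicting on every available core
rf_parallel = RandomForestRegressor(random_state=42, n_jobs=-1)

start = time.perf_counter()
rf_parallel.fit(X_train, y_train)
print(f"Fit: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
rf_parallel.predict(X_test)
print(f"Predict: {time.perf_counter() - start:.2f}s")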

Remember how the default took almost 6 seconds to fit and 0.1 seconds to predict? Parallelized, fitting took only 1.1 seconds and prediction 0.03 seconds, a 3-6x improvement. I'm definitely doing this from now on!

n_estimators

OK, back to the number of trees. The default RF has 100 estimators; let's try 1000. As expected, it took roughly ten times as long (9.7 seconds to fit and 0.3 seconds to predict, parallelized). The scores?

metric   n_estimators=1000
MAE      0.328
MAPE     0.191
MSE      0.252
RMSE     0.502
R²       0.807

There is almost no difference: MSE and RMSE are about 0.01 lower and R² about 0.01 higher. That's nice, but is it worth investing ten times the time? Let's cross-validate and check.

Instead of using a custom loop, this time I used sklearn.model_selection.cross_validate, as mentioned in the previous post:

import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import RepeatedKFold, cross_validate

cv_results = cross_validate(
    rf, X, y,
    cv=RepeatedKFold(n_splits=5, n_repeats=20, random_state=42),
    n_jobs=-1,
    scoring={
        "neg_mean_absolute_error": "neg_mean_absolute_error",
        "neg_mean_absolute_percentage_error": "neg_mean_absolute_percentage_error",
        "neg_mean_squared_error": "neg_mean_squared_error",
        # No built-in RMSE scorer, so build one from MSE
        "root_mean_squared_error": make_scorer(
            lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)),
            greater_is_better=False,
        ),
        "r2": "r2",
    },
)

I'm using RepeatedKFold as a more stable, but slower, split strategy than plain KFold; the dataset is not that big, so I'm not too worried about the additional time it takes.
There is no standard RMSE scorer, so I had to create one with sklearn.metrics.make_scorer and a lambda function.

For the decision tree, I did 1000 loops. However, given that the default random forest contains 100 trees, 1000 loops means a lot of trees, and therefore a lot of time. I went with 100 loops (20 repeats of 5 splits), which is still a lot, but thanks to parallelization it wasn't too bad. The 100-tree version took about 2 minutes (versus an unparallelized 1,304 seconds), while the 1000-tree version pushed every core to almost 100% CPU (10,254 seconds of compute!) and got quite toasty. My MacBook's fans don't often come on, but this maxed them out!
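To turn the cross_validate output into the tables below, something like this works (a sketch, assuming the result of the call above was stored in cv_results):

import pandas as pd

scores = pd.DataFrame(cv_results)
test_cols = [c for c in scores.columns if c.startswith("test_")]
print(scores[test_cols].agg(["mean", "std"]).T.round(3))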

How do they compare? The 100-tree one:

metric   average   std
MAE      -0.328    0.006
MAPE     -0.184    0.005
MSE      -0.253    0.010
RMSE     -0.503    0.009
R²       0.810     0.007

And the 1000-tree one:

metric   average   std
MAE      -0.325    0.006
MAPE     -0.183    0.005
MSE      -0.250    0.010
RMSE     -0.500    0.010
R²       0.812     0.006

There is little difference. Probably not worth the extra time/power.

Bayes search

Finally, let's do a Bayes search. I used wide hyperparameter ranges:

search_spaces = {
    'n_estimators': (50, 500),
    'max_depth': (1, 100),
    'min_samples_split': (2, 100),
    'min_samples_leaf': (1, 100),
    'max_leaf_nodes': (2, 20000),
    'max_features': (0.1, 1.0, 'uniform'),
    'bootstrap': [True, False],
    'ccp_alpha': (0.0, 1.0, 'uniform'),
}
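For reference, here's roughly how the search can be run with skopt's BayesSearchCV. This is a sketch: the cv and scoring settings are illustrative assumptions, not necessarily what I used; the 200 iterations matches what's mentioned below.

from skopt import BayesSearchCV

opt = BayesSearchCV(
    RandomForestRegressor(random_state=42),
    search_spaces,
    n_iter=200,                        # 200 iterations, as below
    cv=5,                              # assumed split strategy
    scoring="neg_mean_squared_error",  # assumed objective
    n_jobs=-1,
    random_state=42,
)
opt.fit(X_train, y_train)
print(opt.best_params_)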

The only hyperparameter here we haven't seen before is bootstrap. This determines whether each tree is built on the entire dataset or on a bootstrapped (sampled with replacement) subset. It's most commonly set to True, but let's let the search try False anyway.

I did 200 iterations and it took me 66 minutes. It gave:

Best Parameters: OrderedDict({
    'bootstrap': False,
    'ccp_alpha': 0.0,
    'criterion': 'squared_error',
    'max_depth': 39,
    'max_features': 0.4863711682589259,
    'max_leaf_nodes': 20000,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'n_estimators': 380
})

Note how max_depth ended up similar to the plain forest above, while n_estimators and max_leaf_nodes were both very high (bear in mind max_leaf_nodes is not the actual number of leaf nodes, only the maximum allowed; the average number of leaves was 14,954). Both min_samples_ parameters were at their minimum, just as when comparing the constrained forests with the unconstrained one. It's also interesting that it chose not to bootstrap.

So what does it give us (as a quick check, not a clean test)?

metric   value
MAE      0.313
MAPE     0.181
MSE      0.229
RMSE     0.478
R²       0.825

It's the best so far, although not for free. For consistency, I also cross-validated it:

metric   average   std
MAE      -0.309    0.005
MAPE     -0.174    0.005
MSE      -0.227    0.009
RMSE     -0.476    0.010
R²       0.830     0.006

It performs very well. Comparing the absolute errors for the best decision tree (the Bayes-searched one), the default RF, and the Bayes-searched RF, we can see:
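A sketch of that comparison (best_dt is assumed to be the Bayes-searched decision tree from the previous post, and best_rf the best estimator from the search above, e.g. opt.best_estimator_):

import matplotlib.pyplot as plt
import pandas as pd

abs_errors = pd.DataFrame({
    "Best decision tree": (y_test - best_dt.predict(X_test)).abs(),
    "Default RF": (y_test - rf.predict(X_test)).abs(),
    "Bayes search RF": (y_test - best_rf.predict(X_test)).abs(),
})
abs_errors.plot.box(figsize=(8, 5))
plt.ylabel("Absolute error")
plt.show()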

Conclusion

In the last post, the Bayes-searched decision tree looked good, especially compared to the basic decision tree. Now it looks terrible: higher errors, lower R², wider variance! So why not always use random forests?

Well, random forests take longer to fit (and to predict). This becomes even more extreme with larger datasets: imagine thousands of tuning iterations over forests of hundreds of trees, on datasets with hundreds of features. It becomes quite clear why GPUs, which specialize in parallel processing, have become indispensable for machine learning. Still, you have to ask yourself: what is good enough? Does a ~0.05 improvement in MAE actually matter for your use case?

When it comes to visualization, plotting individual trees, just as with decision trees, is a good way to get an idea of the overall structure. Plotting the individual trees' predictions and errors is also a great way to see the variance within a random forest and to better understand how it works.

But there are more tree variants to come! Next up: gradient boosting.


