What if we told you that many ML teams are evaluating their models incorrectly?
Many teams invest heavily in training their models but set aside a much smaller budget for evaluation. The result is inaccurate assessments and flimsy benchmarks. In the worst case, the team ends up choosing the wrong model for the task.
The publication Accounting for Variance in Machine Learning Benchmarks provides excellent recommendations for addressing these issues. Here are some of its suggestions for improving your ML model evaluation and strengthening your benchmarks.
Good model comparisons vary many randomized choices. Recall the many arbitrary choices we make during the machine learning process: random seeds for initialization, the order of the data, how the learner is initialized, and so on. Varying these choices in your benchmark gives a more reliable estimate of the model's expected performance. Citing the paper,
“Benchmarks that vary these arbitrary choices not only assess the associated variance (Section 2), but also allow for the measurement of performance on uncorrelated test sets, thereby reducing the expected performance error. This counterintuitive phenomenon is associated with variance reduction in bagging (Breiman, 1996a; Bühlmann et al., 2002), as opposed to specific adaptation of machine learning pipelines. It helps better characterize expected behavior.”
I found the comparison with bagging particularly interesting. This is why I recommend taking the time to explore a wide range of ML concepts: it helps you understand things more deeply and surfaces ideas and connections that make you more innovative.
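To make this concrete, here is a minimal sketch of a benchmark loop that varies those arbitrary random choices. This is my own illustration, not code from the paper; the dataset, the random-forest model, and the number of runs are all placeholders, assuming a scikit-learn-style pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder dataset standing in for your real task.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

scores = []
for seed in range(20):
    rng = np.random.default_rng(seed)

    # Randomize the train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Randomize the order of the training data
    # (matters for order-sensitive learners).
    order = rng.permutation(len(X_train))
    X_train, y_train = X_train[order], y_train[order]

    # Randomize the learner's own initialization.
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# Report the spread of scores rather than a single number.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} over {len(scores)} runs")
```

The point of the loop is that the reported number reflects the variance induced by all the arbitrary choices, not just one lucky (or unlucky) configuration.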
Most people use a single train, validation, test split: split the data once and you're done. The more diligent may also perform cross-validation. You may also want to experiment with the proportions used to construct your splits. In the team's words: “For machine-learning pipeline comparisons that better account for benchmark variance, it is useful to run multiple tests, for instance generating random splits with an out-of-bootstrap scheme (see Appendix B).”
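Here is one way such randomized resampling could look in practice. This is a sketch assuming scikit-learn and a simple logistic-regression placeholder; the exact schemes in the paper's Appendix B may differ in detail.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Out-of-bootstrap-style resampling (a sketch, not necessarily the exact
# scheme from the paper's Appendix B): draw the training set with
# replacement and evaluate on the samples that were left out.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
n = len(X)

scores = []
for seed in range(30):
    rng = np.random.default_rng(seed)
    train_idx = rng.choice(n, size=n, replace=True)      # bootstrap sample
    test_idx = np.setdiff1d(np.arange(n), train_idx)     # left-out samples

    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy: {np.mean(scores):.3f}, std: {np.std(scores):.3f}")
```

Each iteration produces a different train/test partition, so the collection of scores tells you how much the benchmark itself moves when the split changes.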
It is important to always remember that there is some randomness in the results. Running multiple tests reduces its impact, but it will never disappear entirely unless you exhaust every possible permutation (which may not be feasible and would certainly be unnecessarily expensive). Small improvements may simply be the result of chance, so when comparing models, keep in mind that several of them may have effectively comparable performance.
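As a simple illustration of why a small gap should not be trusted on its own, the sketch below compares two placeholder models over the same randomized splits and checks whether the average gap stands out from the run-to-run noise. This is my own hedged example, not the specific statistical procedure the paper recommends.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder dataset and models; swap in your own task and candidates.
X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

diffs = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    # Score both candidates on the same split so the comparison is paired.
    a = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)
    b = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)
    diffs.append(a - b)

diffs = np.array(diffs)
# Ask whether the mean gap is distinguishable from zero given the noise.
t_stat, p_value = stats.ttest_1samp(diffs, 0.0)
print(f"mean gap: {diffs.mean():.3f} +/- {diffs.std():.3f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

If the gap is small relative to its spread across runs, the honest conclusion is that the two models are comparable, not that one "wins".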
For more details from this publication on how to properly compare machine learning models, please see:
