When 50/50 Is Not Optimal: Rethinking Dataset Rebalancing

Machine Learning


A fresh answer to an old challenge

Imagine training a model for spam detection. There are far more positives in the dataset than negatives, so you invest countless working hours rebalancing it to a 50/50 ratio. You are happy because you have dealt with the class imbalance. Now, what if I told you that 60/40 is not only good enough, but possibly even better?

In most machine-learning classification applications, the number of instances of one class exceeds that of the other. This slows down learning [1] and can induce bias in the trained model [2]. The most widely used methods to address this rely on a simple prescription: give all classes the same weight. In most cases, this is done in straightforward ways, such as making minority-class examples count more in the loss (reweighting), removing majority-class examples from the dataset (undersampling), or duplicating minority-class instances (oversampling).
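As a concrete sketch of those three prescriptions, here is how they can be implemented on a toy NumPy dataset (the class sizes and the 90/10 split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 90 majority-class (0) and 10 minority-class (1) points.
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 2))

# Reweighting: weight n / (2 * n_c) per example, so each class
# contributes the same total weight to the loss.
weights = np.where(y == 1,
                   len(y) / (2 * (y == 1).sum()),
                   len(y) / (2 * (y == 0).sum()))

# Undersampling: randomly drop majority examples until the classes match.
minority_idx = np.flatnonzero(y == 1)
majority_idx = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority_idx), replace=False)
under_idx = np.concatenate([majority_idx, minority_idx])

# Oversampling: duplicate minority examples until the classes match.
over_minority = rng.choice(minority_idx, size=(y == 0).sum(), replace=True)
over_idx = np.concatenate([np.flatnonzero(y == 0), over_minority])
```

All three tricks enforce the same target: a 50/50 effective class ratio. Whether that target is the right one is exactly the question this article raises.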

The validity of these methods is often debated, with both theoretical and empirical studies indicating that which solution works best depends on the particular application [3]. However, there is a hidden assumption that is rarely discussed and too often taken for granted. Is rebalancing a good idea at all? To some extent these methods work, so the answer is yes. But should we rebalance the dataset *completely*? To keep it simple, take binary classification: should I rebalance my training data so that it contains 50% of each class? Intuition says yes, and intuition has guided practice up to now. In this case, intuition is wrong, for reasons that are themselves intuitive.

What does “training imbalance” mean?

Before we dig into the methods and the reasons why 50% is not the best training imbalance in binary classification, let's define the relevant quantities. I'll call n₀ the number of instances of one class (usually the minority class) and n₁ that of the other class. The total number of data instances in the training set is then n = n₀ + n₁. The quantity we analyze today is the training imbalance,

ρ⁽ᵗʳᵃⁱⁿ⁾ = n₀ / n .
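In code, ρ⁽ᵗʳᵃⁱⁿ⁾ is just a ratio of label counts. A minimal helper (the function name is my own) could look like this:

```python
import numpy as np

def training_imbalance(y, minority_label=None):
    """Return rho_train = n0 / n, where n0 is the minority-class count
    by default, or the count of `minority_label` if given."""
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    if minority_label is None:
        n0 = counts.min()
    else:
        n0 = counts[labels == minority_label][0]
    return n0 / len(y)
```

For a dataset with 90 negatives and 10 positives, `training_imbalance` returns 0.1; a perfectly rebalanced dataset gives 0.5.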

Proof that 50% is not optimal

The first evidence comes from empirical studies on random forests. Kamalov and collaborators measured the optimal training imbalance, ρ⁽ᵒᵖᵗ⁾, on 20 datasets [4]. They found that its value varies from problem to problem, but concluded that a good rule of thumb is ρ⁽ᵒᵖᵗ⁾ = 43%. This means that, according to their experiments, you want slightly more majority-class examples than minority-class ones. But this is not the whole story. If you are aiming for the best possible model, don't stop here and immediately set ρ⁽ᵗʳᵃⁱⁿ⁾ ≈ 43%.

In fact, theoretical work published this year by Pezzicoli et al. [5] demonstrated that the optimal training imbalance is not a universal value valid for all applications. It is not 50%, and it is not 43%. The optimal imbalance changes from problem to problem: it can be below 50% (as measured by Kamalov and collaborators) or above it. The specific value of ρ⁽ᵒᵖᵗ⁾ depends on the details of each particular classification problem. One way to find ρ⁽ᵒᵖᵗ⁾ is to train the model at several values of ρ⁽ᵗʳᵃⁱⁿ⁾ and measure the resulting performance. That could look, for example, like this:

Image by the author
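A hedged sketch of such a sweep, using synthetic two-Gaussian data and scikit-learn's `LogisticRegression` as stand-ins for whatever data and model you actually have (the candidate ρ values and sample sizes are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(0)

def sample_class(label, n):
    # Two overlapping 2D Gaussians -- a toy stand-in for a real dataset.
    center = 1.0 if label == 1 else -1.0
    return rng.normal(center, 1.0, size=(n, 2)), np.full(n, label)

# Balanced held-out test set, so the evaluation itself is imbalance-neutral.
X_test = np.vstack([sample_class(0, 500)[0], sample_class(1, 500)[0]])
y_test = np.array([0] * 500 + [1] * 500)

scores = {}
for rho in (0.2, 0.3, 0.4, 0.5):        # candidate training imbalances
    n1 = int(rho * 1000)                # minority class gets rho * n points
    X0, y0 = sample_class(0, 1000 - n1)
    X1, y1 = sample_class(1, n1)
    model = LogisticRegression().fit(np.vstack([X0, X1]),
                                     np.concatenate([y0, y1]))
    scores[rho] = balanced_accuracy_score(y_test, model.predict(X_test))
```

Plotting `scores` against ρ gives a curve like the one above; the ρ at its maximum is your empirical estimate of ρ⁽ᵒᵖᵗ⁾.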

However, the exact shape of this curve depends on the problem. As in Kamalov's experiments, when the data is abundant compared to the model size, the optimal imbalance appears to lie below 50%. But many other factors, from how intrinsically rare minority instances are to how noisy the training dynamics are, set the optimal value of the training imbalance and determine how much performance is lost when training away from ρ⁽ᵒᵖᵗ⁾.

Why is perfect balance not always optimal?

As anticipated, the answer is actually intuitive: there is no reason why both classes should carry the same amount of information, precisely because they are different classes. In fact, the Pezzicoli team proved that they normally do not. Consequently, inferring the best decision boundary may require more instances of one class than of the other. Pezzicoli's work, set in the context of anomaly detection, provides a simple and insightful example.

Assume that the data come from a multivariate Gaussian distribution, and that all points to the right of the decision boundary are labeled as anomalies. In 2D, it looks like this:

Image by the author, inspired by [5]

The dashed line is our decision boundary, and the n₀ points to its right are the anomalies. Now let's rebalance the dataset to ρ⁽ᵗʳᵃⁱⁿ⁾ = 0.5. To do so, we need to find more anomalies. Since anomalies are rare, the ones we find are most likely close to the decision boundary. Even by eye, the picture is now strikingly clear:

Image by the author, inspired by [5]

The new anomalies (yellow in the figure) pile up along the decision boundary, making them more informative about its position than the blue points. This might lead one to think that it is best to privilege minority-class points. On the other hand, the anomalies cover only one side of the decision boundary, so once there are enough minority-class points it becomes convenient to invest in more majority-class points instead, to better cover the other side of the boundary. As a result of these two competing effects, ρ⁽ᵒᵖᵗ⁾ is generally not 50%, and its exact value depends on the problem.
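This concentration effect is easy to check numerically. The sketch below is a 1D simplification of the example above (the boundary at x = 2 is an arbitrary choice of mine): anomalies are the points in the right tail of a standard Gaussian, and because the tail density decays fast, most of them sit within half a unit of the boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
threshold = 2.0                     # decision boundary: x > 2 is an anomaly

# Draw many Gaussian points and keep only the anomalies (the right tail).
x = rng.normal(size=1_000_000)
anomalies = x[x > threshold]

# Distance of each anomaly from the decision boundary.
dist = anomalies - threshold

# Fraction of anomalies hugging the boundary (within 0.5 of it).
frac_near = (dist < 0.5).mean()     # analytically about 0.73 for a Gaussian
```

Roughly three quarters of the sampled anomalies land within 0.5 of the boundary, which is exactly why extra minority points are so informative about where the boundary lies.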

The root cause is class asymmetry

Pezzicoli's theory shows that the optimal imbalance is generally different from 50% because different classes have different properties. Yet it analyzes only one source of diversity between classes: the behavior of outliers. Other sources, such as the presence of subgroups within a class, can produce similar effects, as shown by Sarao Mannelli and co-authors [6]. It is the interplay of all these effects that determines the diversity between classes, and with it the optimal imbalance for a given problem. Until we have a theory that treats all the sources of asymmetry in our data together (including those induced by how the model architecture processes each class), we cannot know the optimal training imbalance of a dataset in advance.

Key takeaways and what you can do differently

Up until now, if you rebalanced your binary dataset to 50%, you were doing fine, but probably not the best you could. There is no theory yet that can tell us what the optimal training imbalance should be, but we now know that it is generally not 50%. The good news is that such a theory is on the way: machine-learning theorists are actively working on the topic. In the meantime, you can treat ρ⁽ᵗʳᵃⁱⁿ⁾ as what it is: a hyperparameter that you can tune like any other. Before your next model training, ask yourself: is 50/50 really the best? Try tuning the class imbalance. The model's performance may surprise you.

References

[1] E. Francazi, M. Baity-Jesi, and A. Lucchi, Theoretical Analysis of Learning Dynamics under Class Imbalance (2023), ICML 2023

[2] K. Ghosh, C. Bellinger, R. Corizzo, P. Branco, B. Krawczyk, and N. Japkowicz, The class imbalance problem in deep learning (2024), Machine Learning, 113(7), 4845–4901

[3] E. Loffredo, M. Pastore, S. Cocco, and R. Monasson, Restoring balance: principled under/oversampling of data for optimal classification (2024), ICML 2024

[4] F. Kamalov, A. F. Atiya, and D. Elreedy, Partial resampling of imbalanced data (2022), arXiv preprint arXiv:2207.04631

[5] F. S. Pezzicoli, V. Ros, F. P. Landes, and M. Baity-Jesi, Class imbalance in anomaly detection: learning from an exactly solvable model (2025), AISTATS 2025

[6] S. Sarao Mannelli, F. Gerace, N. Rostamzadeh, and L. Saglietti, Bias-inducing geometries: an exactly solvable data model with fairness implications (2022), arXiv preprint arXiv:2205.15935


