CGAN: Handling Class Imbalance Using Generative AI | By Deepanshu

If you've been doing machine learning for long enough, you have probably come across class imbalance, that is, a scenario in which one class has considerably more examples than the other class or classes. Such scenarios come up often in applications such as anomaly detection and outlier detection. In this blog, I will begin by talking about class imbalance and why we need to address it, then cover GAN fundamentals, and finally show how to use generative AI to deal with class imbalance.

Class imbalance issues

Imagine you're working on a binary classification problem. 99.5% of the training examples belong to class A, while the rest belong to class B. You can spend days or weeks working on it and be rewarded with a model that achieves 99% accuracy even on the test set. But is that actually good? If you do nothing about the imbalance, the model can learn that everything it sees belongs to class A and fail to detect class B at all. Note that a model that always predicts class A would already score about 99.5% accuracy on similarly distributed data.

Traditional metrics like accuracy become misleading. Instead, you should focus on metrics like:

Precision: Of all the instances predicted as class B, how many actually belong to class B?
Recall: Of all actual class B instances, how many did you correctly identify?
F1 score: The harmonic mean of precision and recall, which gives a balanced measure of the two.
AUC-ROC: Area under the receiver operating characteristic curve. It measures the ability of a classifier to distinguish between classes.
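
To make the gap between accuracy and the other metrics concrete, here is a minimal sketch in plain Python. The test set (995 examples of class A, 5 of class B) and the always-predict-A model are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not a library function.

```python
def precision_recall_f1(y_true, y_pred, positive="B"):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero (class never predicted or never present).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test set: 995 examples of class A, 5 of class B.
y_true = ["A"] * 995 + ["B"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["A"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.995 0.0 0.0 0.0
```

The model looks excellent on accuracy while never finding a single class B example, which is exactly what precision, recall, and F1 expose.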

If your model doesn't perform well on these metrics, there's more to do.

So how do you deal with such a scenario? The idea is simple. If we don't address the class imbalance, we are training our model on highly skewed data, which leads to the outcome described above. So one way to fix this is to correct the skew in the data. How do you fix the skew? By adding more examples that belong to class B. And how do you get more data, especially data belonging to class B? One option is to generate synthetic data.

Understanding Generative Adversarial Networks (GANs)

Before we talk about CGAN, let's cover the basics of Generative Adversarial Networks (GANs) [1]. Introduced by Ian Goodfellow et al. in 2014, GANs are a framework for training generative models. A GAN consists of two neural networks:

Generator
Discriminator

The generator is a neural network responsible for generating data. It takes random noise as input and converts it into a synthetic sample. This synthetic data is passed along with real data to the discriminator, which predicts which samples are real and which are fake. The discriminator updates its weights to get better at telling the two apart, while the generator updates its weights with the aim of fooling the discriminator. This competition pushes the generator to produce increasingly realistic data.

Conditional Generative Adversarial Network (CGAN)

A standard GAN can already help with class imbalance, but CGAN offers some improvements. A standard GAN generates data from random noise alone, with no direct way to ask for a particular class, whereas a CGAN gives you much more control over this process.

CGAN is an extension of the GAN architecture. It adds a conditional input to both the generator and the discriminator. This lets you guide the generation process, which can be used to specifically target the minority class.

How CGAN works

As mentioned before, both the generator and the discriminator receive an additional input in a CGAN: the class label, which is one-hot encoded. The generator takes this class label together with random noise and generates an output that belongs to the specified class and resembles the training data.

As in a standard GAN, the discriminator receives both the fakes produced by the generator and real data. In addition, it receives the corresponding class labels. It learns to distinguish real samples from fake ones and to check whether each sample matches the provided label.
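
The conditioning described above can be sketched in plain Python as a simple concatenation. This is only an illustration under assumptions: the label mapping (class A = 0, class B = 1), the noise size, the placeholder sample values, and the helper names `one_hot` and `with_condition` are all made up here, and a real implementation would feed these vectors into neural networks.

```python
import random

NUM_CLASSES = 2   # assumption: class A = 0, class B = 1
NOISE_DIM = 4     # assumption: size of the noise vector z

def one_hot(label, num_classes=NUM_CLASSES):
    """Return a one-hot list, e.g. [0.0, 1.0] for label 1 with two classes."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def with_condition(vector, label):
    """Concatenate a noise or feature vector with its one-hot class label."""
    return list(vector) + one_hot(label)

rng = random.Random(0)
minority_label = 1

# Generator input: random noise z plus the label of the class to generate.
z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
generator_input = with_condition(z, minority_label)

# Discriminator input: a sample (real or generated) plus its label.
x = [0.2, -1.3, 0.7]  # placeholder values standing in for one real minority sample
discriminator_input = with_condition(x, minority_label)

print(len(generator_input), len(discriminator_input))  # 6 5
```

Because the label travels with every input, the generator can be asked for a specific class and the discriminator can penalize samples that do not fit the label they came with.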

CGAN Training

The core difference between training a standard GAN and training a CGAN is the addition of the conditional inputs.

First, randomly initialize the weights and biases of the generator (G) and the discriminator (D).

Next, sample a batch of real minority class samples x from the training dataset along with their corresponding labels y. Then sample a batch of random noise vectors z and minority class labels y. The generator uses these noise vectors and labels to produce fake data.

Now that we have both real and fake samples for the discriminator, we concatenate each data sample with its corresponding one-hot encoded conditional input. This combined input is used to obtain predictions from the discriminator, which are then used to calculate the discriminator loss L_D:

L_D = -E[log(D(x, y))] - E[log(1 - D(G(z, y), y))]

where:

E denotes the expected value (the average over the batch).
D(x, y) is the discriminator's output for the real sample x and its label y. Ideally, this should be close to 1.
D(G(z, y), y) is the discriminator's output for the generated sample G(z, y) and its label y. Ideally, this should be close to 0.
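
As a rough numerical illustration of the discriminator loss, the sketch below evaluates it on made-up discriminator outputs. The function name `discriminator_loss`, the small `eps` added for numerical safety, and the probability values are assumptions for this example only.

```python
import math

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """-mean(log D(x, y)) - mean(log(1 - D(G(z, y), y))) over a batch."""
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return -real_term - fake_term

# Hypothetical discriminator outputs (probability that a sample is real).
# A discriminator that separates real from fake well:
confident = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# A discriminator that cannot tell them apart:
confused = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident < confused)  # True
```

The loss shrinks as real samples score near 1 and generated samples score near 0, which matches the two ideal cases listed above.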

After calculating the loss, use gradient descent to update the discriminator weights to minimize L_D.

Using the fakes produced by the generator in this step, we also calculate the generator loss L_G:

L_G = -E[log(D(G(z, y), y))]

Minimizing this loss lowers the probability that the discriminator correctly identifies generated samples as fake. In other words, the generator wants D(G(z, y), y) to be close to 1 (which means the discriminator judges the generated sample to be real).
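
The same kind of check works for the generator loss. Again, `generator_loss`, `eps`, and the discriminator outputs below are hypothetical values chosen for illustration.

```python
import math

def generator_loss(d_fake, eps=1e-12):
    """-mean(log D(G(z, y), y)) over a batch of generated samples."""
    return -sum(math.log(p + eps) for p in d_fake) / len(d_fake)

# Hypothetical discriminator outputs on generated samples.
fooled = generator_loss([0.9, 0.8])   # discriminator mostly believes the fakes
caught = generator_loss([0.1, 0.2])   # discriminator mostly rejects the fakes

print(fooled < caught)  # True
```

The generator's loss is low when the discriminator is fooled and high when its fakes are caught, so the two networks pull in opposite directions.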

Based on this loss, we update the generator weights using gradient descent to minimize L_G.

Once CGAN training is complete, the generator can be used to produce synthetic data to add to the training data of the original classifier. Just as in training, sample a batch of random noise vectors z and minority class labels y and ask the generator to generate fakes. These generated fakes become the synthetic minority class data for the original classifier.
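
A minimal sketch of this generation step is shown below. `placeholder_generator` is a stand-in for a trained generator and does not produce realistic data, `synthesize_minority` is a hypothetical helper, and topping the minority class up to the size of the majority class is just one possible choice rather than a requirement.

```python
import random

def placeholder_generator(z, label):
    """Stand-in for a trained CGAN generator G(z, y); not a real model."""
    return [0.5 * value + label for value in z]

def synthesize_minority(generator, n_majority, n_minority, minority_label,
                        noise_dim=4, seed=0):
    """Generate enough labeled minority samples to match the majority count."""
    rng = random.Random(seed)
    n_needed = max(0, n_majority - n_minority)
    samples = []
    for _ in range(n_needed):
        # Fresh noise for every sample, always paired with the minority label.
        z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        samples.append((generator(z, minority_label), minority_label))
    return samples

# Hypothetical class counts: 995 majority samples, 5 minority samples.
synthetic = synthesize_minority(placeholder_generator, n_majority=995,
                                n_minority=5, minority_label=1)
print(len(synthetic))  # 990
```

The synthetic samples would then be appended to the real training set before retraining the original classifier, while the test set should stay purely real so the metrics remain meaningful.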

Conclusion

CGAN provides a powerful and flexible approach to class imbalance, combining the generative capabilities of GANs with the control provided by the conditional input. By specifically targeting the generation of minority class samples, CGAN can significantly improve classifier performance. CGANs do need careful tuning during training, but the potential benefits make them a valuable tool for machine learning practitioners facing imbalanced datasets.