The next AI revolution: a tutorial on generating high-quality synthetic data with a VAE



What is synthetic data?

Computer-generated data intended to replicate or extend existing data.

Why is it useful?

We have all experienced the success of ChatGPT, Llama and, more recently, DeepSeek. These language models are used ubiquitously throughout society and have sparked many claims that they are rapidly approaching artificial general intelligence.

Before you get too excited or scared, however, these language models are also rapidly approaching a hurdle to further advancement: a lack of data. According to a paper published by the research group Epoch [1], we are running out of data. They estimate that by 2028 we will have reached the upper limit of data available to train language models.

Image by the author. Graph based on estimated dataset projections; a reconstructed visualization inspired by the Epoch research group [1].

What happens when the data runs out?

If we run out of data, there is nothing new left to train language models on, and these models stop improving. If we want to pursue artificial general intelligence, we need to find new ways of improving AI without increasing the amount of real training data.

One potential savior is synthetic data, which can be generated to mimic existing data and is already used to improve the performance of models such as Gemini and DBRX.

Synthetic data beyond LLMs

In addition to overcoming the data shortage for large language models, synthetic data can be used in the following situations:

  • Confidential data – If you cannot share or use sensitive attributes, you can generate synthetic data that mimics the properties of these features while maintaining anonymity.
  • Expensive data – If data collection is expensive, you can generate a large amount of synthetic data from a small amount of real data.
  • Lack of data – If the number of data points from a particular group is disproportionately small, the dataset is biased. Synthetic data can be used to balance the dataset.

Imbalanced datasets

An imbalanced dataset can be a problem because it may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognizing men and misclassify future female samples as men.
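As a quick, hypothetical illustration of why this matters (a toy sketch with made-up labels, not the Adult data): a classifier that always predicts the majority class can look deceptively accurate while never identifying a single minority sample.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy dataset: 95 majority-class labels (0) and 5 minority-class labels (1)
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))  # features are irrelevant for this illustration

# A baseline that always predicts the most frequent class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))       # 0.95 - looks impressive
print("Minority recall:", recall_score(y, y_pred))  # 0.0 - minority class never found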

This article demonstrates the imbalance in the popular UCI Adult dataset [2] and shows how a variational autoencoder can be used to generate synthetic data that improves classification on this example.

First, download the Adult dataset. It includes features such as age, education, and occupation, which can be used to predict the target outcome, "income".

# Imports for loading and plotting the data
import pandas as pd
import matplotlib.pyplot as plt

# Download dataset into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
   "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
   "occupation", "relationship", "race", "sex", "capital-gain",
   "capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)

# Drop rows with missing values
data = data.dropna()

# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values

# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor='black')
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

In the Adult dataset, income is a binary variable indicating whether an individual earns above or below $50,000. Plotting the income distribution across the dataset below shows that individuals earning under $50,000 are heavily over-represented.

Image by the author. Original dataset: number of data instances with labels ≤50K and >50K. Individuals earning less than $50,000 are disproportionately over-represented in the dataset.
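To quantify the imbalance exactly, you can also print the raw class counts and their proportions (an optional check on the same dataframe):

# Inspect the class balance numerically
print(data['income'].value_counts())
print(data['income'].value_counts(normalize=True).round(3))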

Despite this imbalance, we can train a machine learning classifier on the Adult dataset and use it to predict the income class of unseen, or test, individuals.

# Imports for preprocessing and classification
import numpy as np
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Preprocessing: One-hot encode categorical features, scale numerical features
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
   "workclass", "education", "marital-status", "occupation", "relationship",
   "race", "sex", "native-country"
]

preprocessor = ColumnTransformer(
   transformers=[
       ("num", StandardScaler(), numerical_features),
       ("cat", OneHotEncoder(), categorical_features)
   ]
)

X_processed = preprocessor.fit_transform(X)

# Convert to numpy array for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset in train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)


rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)

# Make predictions
y_pred = rf_classifier.predict(X_model_test)

# Compute and display the confusion matrix
cm = confusion_matrix(y_model_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

Printing the classifier's confusion matrix shows that the model performs fairly well despite the imbalance. The overall error rate is 16%, while the error rate for the positive class (income > 50K) is 36% and the error rate for the negative class (income < 50K) is 8%.

This discrepancy indicates that the model is biased towards the negative class: it often misclassifies individuals who earn more than $50,000 as earning less than $50,000.
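For reference, these per-class error rates can be read directly off the confusion matrix computed above (cm from the previous code block; exact numbers will vary slightly with the train/test split):

# Derive error rates from the confusion matrix (labels: 0 = <=50K, 1 = >50K)
tn, fp, fn, tp = cm.ravel()

overall_error = (fp + fn) / cm.sum()
positive_error = fn / (fn + tp)   # >50K individuals predicted as <=50K
negative_error = fp / (fp + tn)   # <=50K individuals predicted as >50K

print(f"Overall error rate:   {overall_error:.2%}")
print(f"Positive class error: {positive_error:.2%}")
print(f"Negative class error: {negative_error:.2%}")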

Below, we show how to use a variational autoencoder to generate synthetic data for the positive class and balance the dataset. We then train the same model on the synthetically balanced dataset to reduce model errors on the test set.

Image by the author. Confusion matrix of the predictive model trained on the original dataset.

How can you generate synthetic data?

There are various ways to generate synthetic data. These include traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as variational autoencoders and generative adversarial networks are well suited to generating new data, as their architectures learn the distribution of real data and use it to generate synthetic samples.
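For comparison, a traditional method such as SMOTE can be applied in a couple of lines via the imbalanced-learn package (a sketch only, assuming imbalanced-learn is installed; it is not used in the rest of this tutorial):

# Oversample the minority class with SMOTE (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_processed, y_processed)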

In this tutorial, we will use a variational autoencoder to generate synthetic data.

Variational autoencoder

A variational autoencoder (VAE) is ideal for generating synthetic data because it uses real data to learn a continuous latent space. This latent space can be thought of as a magic bucket from which you can sample synthetic data that is very similar to the existing data. The continuity of this space is one of its big selling points, as it means the model generalizes well rather than just memorizing the latent representations of particular inputs.

A VAE consists of an encoder, which maps input data to a probability distribution (mean and variance) in latent space, and a decoder, which reconstructs data from the latent space.

To achieve that continuous latent space, VAEs use the reparameterization trick: a random noise vector is scaled and shifted using the learned mean and variance, ensuring a smooth, continuous representation in latent space.
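In symbols, this is the same operation as the reparameterize method in the code below: the encoder's mean and log-variance define a Gaussian, and a standard-normal noise vector is scaled and shifted accordingly:

$$z = \mu + \sigma \odot \varepsilon, \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^{2}\right), \qquad \varepsilon \sim \mathcal{N}(0, I)$$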

Below, I build a BasicVAE class that implements this process with a simple architecture.

  • The encoder compresses the input into a small hidden representation and produces both the mean and the log-variance that define the Gaussian distribution, creating the magic sampling bucket. Instead of sampling directly, the model applies the reparameterization trick to generate the latent variables, which are passed to the decoder.
  • The decoder reconstructs the original data from these latent variables, so that the generated data preserves the characteristics of the original dataset.

# Imports for building the VAE
import torch
import torch.nn as nn

class BasicVAE(nn.Module):
   def __init__(self, input_dim, latent_dim):
       super(BasicVAE, self).__init__()
       # Encoder: Single small layer
       self.encoder = nn.Sequential(
           nn.Linear(input_dim, 8),
           nn.ReLU()
       )
       self.fc_mu = nn.Linear(8, latent_dim)
       self.fc_logvar = nn.Linear(8, latent_dim)
      
       # Decoder: Single small layer
       self.decoder = nn.Sequential(
           nn.Linear(latent_dim, 8),
           nn.ReLU(),
           nn.Linear(8, input_dim),
           nn.Sigmoid()  # Outputs values in range [0, 1]
       )

   def encode(self, x):
       h = self.encoder(x)
       mu = self.fc_mu(h)
       logvar = self.fc_logvar(h)
       return mu, logvar

   def reparameterize(self, mu, logvar):
       std = torch.exp(0.5 * logvar)
       eps = torch.randn_like(std)
       return mu + eps * std

   def decode(self, z):
       return self.decoder(z)

   def forward(self, x):
       mu, logvar = self.encode(x)
       z = self.reparameterize(mu, logvar)
       return self.decode(z), mu, logvar

Given the BasicVAE architecture, we build the loss function and model training below.

# Additional imports for training the VAE
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

def vae_loss(recon_x, x, mu, logvar):
   recon_loss = nn.MSELoss()(recon_x, x)
 
   # KL Divergence Loss
   kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
   return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
   optimizer = optim.Adam(model.parameters(), lr=learning_rate)
   model.train()
   losses = []
   reconstruction_mse = []

   for epoch in range(epochs):
       total_loss = 0
       total_mse = 0
       for batch in data_loader:
           batch_data = batch[0]
           optimizer.zero_grad()
           reconstructed, mu, logvar = model(batch_data)
           loss = vae_loss(reconstructed, batch_data, mu, logvar)
           loss.backward()
           optimizer.step()
           total_loss += loss.item()

           # Compute batch-wise MSE for comparison
           mse = nn.MSELoss()(reconstructed, batch_data).item()
           total_mse += mse

       losses.append(total_loss / len(data_loader))
       reconstruction_mse.append(total_mse / len(data_loader))
       print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
   return losses, reconstruction_mse

combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)

# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)

batch_size = 128

# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)

basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)

basic_losses, basic_mse = train_vae(
   basic_vae, train_loader, epochs=50, learning_rate=0.001,
)

# Visualize results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()

vae_loss consists of two components: the reconstruction loss, which measures how well the generated data matches the original input using mean squared error (MSE), and the KL divergence loss, which ensures that the learned latent space follows a normal distribution.
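For reference, the KL term implemented above is the closed-form divergence between the learned Gaussian and a standard normal, summed over the latent dimensions (and averaged over the batch in the code):

$$\mathrm{KL}\!\left(\mathcal{N}(\mu,\sigma^{2})\,\|\,\mathcal{N}(0,1)\right) = -\frac{1}{2}\sum_{j}\left(1 + \log\sigma_{j}^{2} - \mu_{j}^{2} - \sigma_{j}^{2}\right)$$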

train_vae optimizes the VAE over multiple epochs using the Adam optimizer. During training, the model receives mini-batches of data, reconstructs them, and computes the loss with vae_loss. These errors are then backpropagated to update the model's weights. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases during training.

We see that our model quickly learns how to reconstruct the data, demonstrating efficient learning.

Image by the author. Reconstruction MSE of the BasicVAE on the Adult dataset during training.

Now that the trained BasicVAE can accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate samples of the positive class (individuals earning more than 50K) to balance the classes and remove the bias from the model.

To do this, we select all samples from the VAE training data where income is the positive class (earning more than 50K) and encode these samples into the latent space. Because we only select and encode positive-class samples, this latent space reflects the characteristics of the positive class and can be sampled from to create synthetic data.

We draw 15,000 new samples from this latent space and decode these latent vectors back into the input data space as synthetic data points.

# Build a dataframe from the VAE training data (features plus the income label in the last column)
sample_df = pd.DataFrame(X_train)

# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names

# Define the feature value to filter on - here income == 1 (the positive class, >50K)
feature_value = 1.0

# Select all samples belonging to the positive class
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)

basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
   mu, logvar = basic_vae.encode(selected_samples_tensor)
   latent_vectors = basic_vae.reparameterize(mu, logvar)

# Compute the mean latent vector for this feature
mean_latent_vector = latent_vectors.mean(dim=0)


# Sample new latent vectors around the mean positive-class latent vector
num_samples = 15000  # Number of new samples
latent_dim = 8
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)

with torch.no_grad():
   generated_samples = basic_vae.decode(latent_samples)

We have now generated synthetic data for the positive class, which can be combined with the original training data to create a balanced, synthetic training dataset.

# Convert the generated samples (tensor) into a dataframe
new_data = pd.DataFrame(generated_samples.numpy())

# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names

X_synthetic = new_data.drop(col_names[-1],axis=1)
y_synthetic = np.asarray([1 for _ in range(0,X_synthetic.shape[0])])

X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)

mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
# Apply mapping
y_mapped = map_function(y_synthetic_train)

plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor='black')
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

Image by the author. Synthetic dataset: number of data instances with labels ≤50K and >50K. The classes of individuals earning below and above 50K are now well balanced.

The random forest classifier can now be retrained on the balanced synthetic training dataset. We then evaluate this new model on the original test data to see how effective the synthetic data is at reducing model bias.

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)

# Make predictions
y_pred = rf_classifier.predict(X_model_test)

cm = confusion_matrix(y_model_test, y_pred)

# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

The new classifier trained on the balanced synthetic dataset makes fewer errors on the original test set than the original classifier trained on the imbalanced dataset, reducing the error rate to 14%.

Image by the author. Confusion matrix of the predictive model trained on the synthetic dataset.

However, we were unable to significantly reduce the disparity in errors: the error rate for the positive class is still 36%. This could be due to the following reasons:

  • We discussed that one of the advantages of VAEs is learning a continuous latent space. However, if the majority class dominates, the latent space can be skewed towards the majority class.
  • Due to the lack of data, the model may not learn a clear representation of the minority class, making it difficult to sample accurately from that region.

In this tutorial, we introduced and built a BasicVAE architecture that can be used to generate synthetic data and improve classification accuracy on imbalanced datasets.

A more sophisticated VAE architecture could be constructed to address the issues above, for example by using sampling strategies that account for the class imbalance.
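One simple direction, sketched below: oversample the minority class during VAE training so the latent space is not dominated by the majority class. This is an illustrative sketch only (not part of the tutorial's code); it assumes combined_data holds the features with the income label in its last column, as constructed above, and uses PyTorch's WeightedRandomSampler.

# Sketch: class-balanced mini-batches for VAE training
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = combined_data[:, -1].astype(int)        # 0 = <=50K, 1 = >50K
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]      # rarer class -> higher weight

sampler = WeightedRandomSampler(
    weights=torch.tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)

# Note: a sampler replaces shuffle=True in the DataLoader
balanced_loader = DataLoader(
    TensorDataset(torch.tensor(combined_data)),
    batch_size=128,
    sampler=sampler,
)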

[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325.

[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/c5xw20.


