A step-by-step guide to creating simulation data in Python | By Marcus Sena | July 2024

Table of Contents
1. Using NumPy
2. Using Scikit-learn
3. Using SciPy
4. Using Faker
5. Using Synthetic Data Vault (SDV)
Conclusion and next steps

1. Using NumPy

The best-known Python libraries for linear algebra and numerical computation are also useful for data generation.

This example shows how to create a dataset whose target values are a noisy linear function of the features, which is useful for testing a linear regression model.

# Importing modules
from matplotlib import pyplot as plt
import numpy as np

def create_data(N, w):
    """
    Creates a dataset whose target values are a noisy linear function of the features.
    N: number of samples
    w: coefficients [slope, intercept]
    """
    # Feature matrix with random data
    X = np.random.rand(N, 1) * 10
    # Target values with normally distributed noise
    y = w[0] * X + w[1] + np.random.randn(N, 1)
    return X, y

# Visualize the data
X, y = create_data(200, [2, 1])

plt.figure(figsize=(10, 6))
plt.title('Simulated Linear Data')
plt.xlabel('X')
plt.ylabel('y')
plt.scatter(X, y)
plt.show()

Simulated linear data (image by author).
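As a quick sanity check (our addition, not part of the original example), we can fit scikit-learn's LinearRegression on the simulated data and confirm that the estimated parameters come out close to the w = [2, 1] we passed in; a minimal sketch:

from sklearn.linear_model import LinearRegression

# Fit a linear model to the simulated data
reg = LinearRegression()
reg.fit(X, y)
# The slope and intercept should be close to 2 and 1
print(reg.coef_, reg.intercept_)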

This example uses NumPy to generate synthetic time series data with a linear trend and seasonal components, which is useful for financial modeling and stock market forecasting.

def create_time_series(N, w):
    """
    Creates time series data with a linear trend and a seasonal component.
    N: number of samples
    w: [trend slope, seasonal frequency]
    """
    # Time values
    time = np.arange(0, N)
    # Linear trend
    trend = time * w[0]
    # Seasonal component
    seasonal = np.sin(time * w[1])
    # Noise
    noise = np.random.randn(N)
    # Target values
    y = trend + seasonal + noise
    return time, y

# Visualize the data
time, y = create_time_series(100, [0.25, 0.2])

plt.figure(figsize=(10, 6))
plt.title('Simulated Time Series Data')
plt.xlabel('Time')
plt.ylabel('y')

plt.plot(time, y)
plt.show()
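Since the trend is linear, a simple way to verify the simulated series (an addition, not from the original article) is to estimate the slope back with np.polyfit; the seasonal component and the noise average out, so the estimate should land near the 0.25 we passed in:

# Estimate the linear trend back from the noisy series
slope, intercept = np.polyfit(time, y, deg=1)
print(f"Estimated trend slope: {slope:.3f}")  # close to 0.25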

Sometimes you need data with specific characteristics. For example, for a dimensionality reduction task, you may need a high-dimensional dataset with only a few informative dimensions. If so, the example below shows a suitable way to generate such a dataset.

# create simulated data for analysis
np.random.seed(42)
# Generate a low-dimensional signal
low_dim_data = np.random.randn(100, 3)

# Create a random projection matrix to project into higher dimensions
projection_matrix = np.random.randn(3, 6)

# Project the low-dimensional data to higher dimensions
high_dim_data = np.dot(low_dim_data, projection_matrix)

# Add some noise to the high-dimensional data
noise = np.random.normal(loc=0, scale=0.5, size=(100, 6))
data_with_noise = high_dim_data + noise

X = data_with_noise

The code snippet above creates a dataset of 100 observations and 6 features, built from a low-dimensional array with only 3 informative dimensions.
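To verify the data behaves as intended, here is a minimal sketch (our addition) that runs scikit-learn's PCA on X; most of the explained variance should concentrate in the first three components, matching the underlying 3-dimensional signal:

from sklearn.decomposition import PCA

# Fit PCA on the 6-dimensional data
pca = PCA()
pca.fit(X)
# The first three ratios should dominate
print(pca.explained_variance_ratio_.round(3))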

2. Using Scikit-learn

In addition to machine learning models, Scikit-learn provides data generators that help build artificial datasets of controlled size and complexity.

You can create a random n-class dataset with the make_classification method, which lets you choose the number of observations, features, and classes.

It helps in testing and debugging classification models such as Support Vector Machines, Decision Trees, and Naive Bayes.

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=5, n_classes=2)

# Visualize the first rows of the synthetic dataset
import pandas as pd
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
df['target'] = y
df.head()

First rows of the dataset (image by author).
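As a minimal sketch of the testing workflow mentioned above (the train/test split and the choice of a decision tree are our assumptions, not from the original), you can fit a classifier directly on the generated data:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out a test set and fit a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")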

Similarly, the make_regression method is useful for creating datasets for regression analysis; it lets you set the number of observations, the number of features, the bias, and the noise of the resulting dataset.

from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100,   # number of observations
                             n_features=1,    # number of features
                             bias=10,         # bias term
                             noise=50,        # noise level
                             n_targets=1,     # number of target values
                             random_state=0,  # random seed
                             coef=True)       # return coefficients

Data simulated with make_regression (image by author).
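Because make_regression returns the true coefficients when coef=True, a quick follow-up (our addition) is to fit a linear model and compare the estimate against the ground truth:

from sklearn.linear_model import LinearRegression

# Fit a linear model and compare against the true coefficient
reg = LinearRegression().fit(X, y)
print(f"True coefficient: {coef[0]:.2f}, estimated: {reg.coef_[0]:.2f}")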

The make_blobs method allows you to create artificial “blobs” containing data that can be used for clustering tasks. You can set the total number of points in the dataset, the number of clusters, and the standard deviation within the clusters.

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300,    # number of observations
                  n_features=2,     # number of features
                  centers=3,        # number of clusters
                  cluster_std=0.5,  # standard deviation of the clusters
                  random_state=0)

Simulated cluster data (image by author).
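To show the clustering use case end to end (a sketch under our assumptions, not part of the original), you can run KMeans on the blobs and inspect the recovered centers:

from sklearn.cluster import KMeans

# Cluster the blobs and recover the three centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)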

3. Using SciPy

The SciPy (short for Scientific Python) library, along with NumPy, is one of the best libraries for numerical computation, optimization, statistical analysis, and many other mathematical tasks. SciPy's stats module can generate simulated data from many statistical distributions, including the normal, binomial, and exponential distributions.

from scipy.stats import norm, binom, expon

# Normal distribution
norm_data = norm.rvs(size=1000)
# Binomial distribution
binom_data = binom.rvs(n=50, p=0.8, size=1000)
# Exponential distribution
exp_data = expon.rvs(scale=.2, size=10000)

Histograms of the simulated distributions (images by author).
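A minimal plotting sketch (our reconstruction of the figures referenced above) that draws a histogram for each sample:

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, data, title in zip(axes,
                           [norm_data, binom_data, exp_data],
                           ['Normal', 'Binomial', 'Exponential']):
    # Histogram of each simulated sample
    ax.hist(data, bins=30)
    ax.set_title(title)
plt.show()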

4. Using Faker

Often you need to train a model on non-numeric or user data, such as names, addresses, and emails. One solution for creating realistic data that resembles user information is the Faker Python library.

The Faker library allows you to generate convincing data that can be used to test your applications and machine learning classifiers. In the example below, we show how to create a fake dataset that contains names, addresses, phone numbers, and email information.

from faker import Faker

def create_fake_data(N):
    """
    Creates a dataset with fake user data.
    N: number of samples
    """
    fake = Faker()
    names = [fake.name() for _ in range(N)]
    addresses = [fake.address() for _ in range(N)]
    emails = [fake.email() for _ in range(N)]
    phone_numbers = [fake.phone_number() for _ in range(N)]
    fake_df = pd.DataFrame({'Name': names, 'Address': addresses,
                            'Email': emails, 'Phone Number': phone_numbers})
    return fake_df

fake_users = create_fake_data(100)
fake_users.head()

Fake user data by Faker (Image by author).
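Two Faker features worth knowing (our addition; see the Faker documentation for details): Faker.seed makes the output reproducible, and passing a locale generates region-specific records:

# Reproducible output across runs
Faker.seed(42)
# Locale-specific data, e.g. German names and addresses
fake_de = Faker('de_DE')
print(fake_de.name(), '|', fake_de.address())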

5. Using Synthetic Data Vault (SDV)

What if you have a dataset that doesn't have enough observations, or you need more data similar to an existing dataset to supplement the training of your machine learning model? Synthetic Data Vault (SDV) is a Python library that creates synthetic datasets using statistical models.

In the example below, we use SDV to extend the demo dataset.

from sdv.datasets.demo import download_demo

# Load the 'adult' dataset
adult_data, metadata = download_demo(dataset_name='adult', modality='single_table')
adult_data.head()

Adult demo dataset.
from sdv.single_table import GaussianCopulaSynthesizer
# Use GaussianCopulaSynthesizer to train on the data
model = GaussianCopulaSynthesizer(metadata)
model.fit(adult_data)

# Generate Synthetic data
simulated_data = model.sample(100)
simulated_data.head()

Simulation sample (image by author).

We can see that the data is very similar to the original dataset, but it is synthetic.
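To quantify that similarity, recent SDV versions ship an evaluation helper; here is a minimal sketch, assuming the sdv.evaluation single-table API:

from sdv.evaluation.single_table import evaluate_quality

# Score how closely the synthetic rows match the real ones
quality_report = evaluate_quality(real_data=adult_data,
                                  synthetic_data=simulated_data,
                                  metadata=metadata)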

Conclusion and next steps

In this article, we presented five ways to create simulated and synthetic datasets for machine learning projects, statistical modeling, and other data-related tasks. The examples shown are easy to follow, and we encourage you to explore the code, read the documentation, and develop other data generation methods to suit any need.

As mentioned earlier, synthetic datasets allow data scientists, machine learning experts, and developers to improve model performance and reduce costs in production and application testing.

Check out the notebook that lists all the methods mentioned in this article.


