Today, we are announcing privacy-enhancing synthetic dataset generation for AWS Clean Rooms, a new capability that organizations and their partners can use to generate privacy-enhancing synthetic datasets from their collective data and train regression and classification machine learning (ML) models. With this capability, models can be trained on synthetic datasets that preserve the statistical patterns of the original data without accessing the original records, opening up model training opportunities that were previously off limits due to privacy concerns.
When building ML models, data scientists and analysts typically face a fundamental tension between data utility and privacy protection. Access to high-quality, detailed data is essential to training accurate models that can recognize trends, personalize experiences, and drive business outcomes. However, the use of granular data such as user-level event data from multiple parties raises significant privacy concerns and compliance challenges. Organizations want to answer questions like “What characteristics indicate a customer is likely to convert?”, but training on individual-level signals often conflicts with privacy policies and regulatory requirements.
Generating privacy-enhancing synthetic datasets for custom ML
To address this challenge, we are introducing privacy-enhancing synthetic dataset generation with AWS Clean Rooms ML. Organizations can use it to create synthetic versions of sensitive datasets that can be used more safely to train ML models. The capability uses advanced ML techniques to generate new datasets that maintain the statistical properties of the original data without exposing information about the individuals in the original source data.
Traditional anonymization techniques such as masking still carry a risk of re-identifying individuals within a dataset. Knowing a few attributes about a person, such as their postal code and date of birth, can be enough to identify them in census data. Privacy-enhancing synthetic dataset generation addresses this risk with a fundamentally different approach. The system trains a model that learns the important statistical patterns in the original dataset, then uses that model to generate entirely new synthetic records, including predicted values for the target column. Rather than copying or perturbing the original rows, the system applies model capacity reduction techniques that lower the risk of the model memorizing information about individuals in the training data. The resulting synthetic dataset has the same schema and statistical properties as the original data, making it suitable for training classification and regression models, while measurably reducing the risk of re-identification.
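Conceptually, generating synthetic records from learned statistics looks like the following sketch. This is an illustration only, not the Clean Rooms implementation: the column names are made up, and the real system uses far more sophisticated generative models plus capacity reduction and noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "original" dataset: two categorical features and a binary label.
original = {
    "region": rng.choice(["east", "west"], size=1000, p=[0.7, 0.3]),
    "device": rng.choice(["mobile", "desktop"], size=1000, p=[0.6, 0.4]),
}
original["converted"] = (
    (original["region"] == "east") & (rng.random(1000) < 0.4)
) | ((original["region"] == "west") & (rng.random(1000) < 0.1))

# Step 1: learn statistical patterns. Here, simple marginal and conditional
# frequencies stand in for the generative model the service trains.
def fit_marginal(col):
    values, counts = np.unique(col, return_counts=True)
    return values, counts / counts.sum()

models = {k: fit_marginal(v) for k, v in original.items() if k != "converted"}
p_convert = {
    r: original["converted"][original["region"] == r].mean()
    for r in ("east", "west")
}

# Step 2: generate brand-new synthetic rows by sampling from the learned
# distributions -- no original record is copied, and the target column
# ("converted") is predicted rather than taken from the source data.
n = 1000
synthetic = {
    name: rng.choice(values, size=n, p=probs)
    for name, (values, probs) in models.items()
}
synthetic["converted"] = np.array(
    [rng.random() < p_convert[r] for r in synthetic["region"]]
)
```

The synthetic table has the same schema and similar column distributions as the original, which is what makes it usable for downstream model training.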
Using this capability, organizations can control privacy parameters such as the amount of noise applied and the required level of protection against membership inference attacks, in which an attacker attempts to determine whether a particular individual’s data was included in the training set. After generating a synthetic dataset, AWS Clean Rooms provides detailed metrics to help customers and their compliance teams understand its quality along two important dimensions: fidelity to the original data and privacy protection. The fidelity score uses KL divergence to measure how closely the synthetic data matches the original distribution, and the privacy score quantifies how well the dataset is protected against membership inference attacks.
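To build intuition for the fidelity score, here is a minimal sketch of comparing one column of original and synthetic data with KL divergence. The exact metric Clean Rooms computes is not public, so treat the binning and scoring below as an assumption for illustration.

```python
import numpy as np
from scipy.stats import entropy

def fidelity_kl(original_col, synthetic_col, bins=10):
    """KL divergence between binned distributions of one numeric column.
    0.0 means the histograms match exactly; larger means lower fidelity."""
    lo = min(original_col.min(), synthetic_col.min())
    hi = max(original_col.max(), synthetic_col.max())
    p, _ = np.histogram(original_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synthetic_col, bins=bins, range=(lo, hi))
    eps = 1e-9  # smooth empty bins so the divergence stays finite
    return entropy(p + eps, q + eps)  # scipy normalizes p and q internally

rng = np.random.default_rng(1)
orig = rng.normal(50, 10, 5000)   # e.g. a numeric customer attribute
good = rng.normal(50, 10, 5000)   # faithful synthetic column
bad = rng.normal(70, 25, 5000)    # poorly matched synthetic column
```

A faithful synthetic column yields a much smaller divergence than a mismatched one, which is the signal a fidelity score summarizes.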
Working with synthetic data in AWS Clean Rooms
To start generating privacy-enhancing synthetic datasets, follow the established AWS Clean Rooms ML custom model workflow, with new steps to specify privacy requirements and review quality metrics. An organization first creates a configured table with analysis rules from its preferred data source, then creates or joins a collaboration with a partner and associates the table with that collaboration.
The new capability introduces enhanced analysis templates that allow data owners to not only define the SQL query that creates the dataset, but also specify that the resulting dataset must be synthetic. Within this template, organizations categorize columns to indicate which column the ML model will predict and which columns contain categorical or numeric values. Importantly, the template also includes privacy thresholds that must be met before the generated synthetic data is made available for training: an epsilon value that bounds the privacy loss (lower values require more noise in the synthetic data, protecting against re-identification) and a minimum protection score against membership inference attacks. Setting these thresholds appropriately requires understanding your organization’s specific privacy and compliance requirements, so we recommend working with your legal and compliance teams during this process.
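The epsilon parameter follows the usual differential-privacy convention: smaller epsilon forces more noise and stronger protection. A minimal sketch of that trade-off using the classic Laplace mechanism (illustrative only; this is not how Clean Rooms internally applies noise):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Add Laplace noise calibrated to sensitivity / epsilon.
    Lower epsilon => larger noise scale => stronger privacy."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(0.0, scale)

rng = np.random.default_rng(42)
true_count = 1000  # e.g. number of converting customers in a segment

# Repeat the mechanism many times to see the noise spread at each epsilon.
strict = [laplace_mechanism(true_count, 1, 0.1, rng) for _ in range(10_000)]
loose = [laplace_mechanism(true_count, 1, 5.0, rng) for _ in range(10_000)]
```

The spread of the noisy values is roughly 50x larger at epsilon 0.1 than at epsilon 5.0, which is why choosing the threshold is a policy decision as much as a technical one.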
After all data owners review and approve the analysis template, a collaboration member creates an ML input channel that references the template, and AWS Clean Rooms begins generating the synthetic dataset. The process typically completes within a few hours, depending on the size and complexity of your dataset. If the generated synthetic dataset meets the privacy thresholds defined in the analysis template, the synthetic ML input channel becomes available along with detailed quality metrics, so data scientists can see the actual protection score achieved against simulated membership inference attacks.
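To see what a simulated membership inference attack measures, here is a toy threshold attack: guess "member of the training set" whenever a model is unusually confident about a record. This is a generic textbook attack with made-up data and a made-up scoring formula, not the actual simulation or score Clean Rooms runs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data: "members" were used to train the model, "non-members" were not.
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)) > 0

X_mem, X_non, y_mem, y_non = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression().fit(X_mem, y_mem)

def confidences(model, X, y):
    # Probability the model assigns to each record's true label.
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y.astype(int)]

c_mem = confidences(model, X_mem, y_mem)
c_non = confidences(model, X_non, y_non)

# Threshold attack: records above the median confidence are guessed "member".
threshold = np.median(np.concatenate([c_mem, c_non]))
attack_acc = 0.5 * ((c_mem > threshold).mean() + (c_non <= threshold).mean())

# Illustrative protection score: 1.0 when the attack is no better than a
# coin flip (accuracy 0.5), falling toward 0.0 as the attack improves.
protection_score = 1 - 2 * abs(attack_acc - 0.5)
```

A model that generalizes well gives the attacker little to work with, so the attack accuracy stays near 0.5 and the protection score stays high.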
Once your organization is satisfied with the quality metrics, you can train your ML model on the synthetic dataset within the AWS Clean Rooms collaboration. Depending on your use case, you can export the trained model weights or continue running inference jobs within the collaboration itself.
Let’s try it
When you create a new AWS Clean Rooms collaboration, you can now configure who pays for synthetic dataset generation.

After configuring the collaboration, you can select Analysis template output must be synthetic when creating a new analysis template.

Once your synthetic analysis template is ready, you can use it when running protected queries and view the details of all associated ML input channels.

Now available
You can start using privacy-enhancing synthetic dataset generation today through AWS Clean Rooms. The capability is available in all commercial AWS Regions where AWS Clean Rooms is available. For more information, see the AWS Clean Rooms documentation.
Privacy-enhancing synthetic dataset generation is billed separately based on usage. You pay only for the compute used to generate your synthetic datasets, billed in synthetic data generation units (SDGUs). The number of SDGUs consumed depends on the size and complexity of the original dataset. As with other collaboration costs, this charge can be assigned through the payer configuration, so collaboration members can agree on who pays. For pricing details, see the AWS Clean Rooms pricing page.
The initial release supports training classification and regression models on tabular data. Synthetic datasets work with standard ML frameworks and can be integrated into existing model development pipelines without changing your workflow.
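Because the synthetic dataset keeps the original schema, it drops straight into standard tooling. Here is a sketch of training a classifier with scikit-learn; the DataFrame below is a stand-in with invented column names, where in practice you would load the generated table from the collaboration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for a synthetic dataset produced by the collaboration.
rng = np.random.default_rng(7)
synthetic = pd.DataFrame({
    "page_views": rng.poisson(5, 5000),
    "days_since_visit": rng.integers(0, 60, 5000),
    "is_mobile": rng.integers(0, 2, 5000),
})
synthetic["converted"] = (
    (synthetic["page_views"] > 5) & (synthetic["days_since_visit"] < 14)
).astype(int)

# Standard workflow: split, fit, evaluate -- nothing changes because the
# data is synthetic rather than original.
X = synthetic.drop(columns="converted")
y = synthetic["converted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

The same pattern applies to any framework that accepts tabular input, since the synthetic data is just another table with the agreed schema.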
This capability represents a significant advance in privacy-enhancing machine learning. Organizations can unlock the value of sensitive user-level data for model training while mitigating the risk of exposing information about individual users. Whether it’s optimizing advertising campaigns, personalizing insurance quotes, or powering fraud detection systems, privacy-enhancing synthetic dataset generation makes it possible to train more accurate models through data collaboration while respecting individual privacy.
