Google AI describes new machine learning method for generating differentially private synthetic data



https://arxiv.org/abs/2306.01684

Google AI researchers describe a new approach to producing high-quality synthetic datasets that protect user privacy, an essential step for training predictive models without compromising sensitive information. As machine learning models increasingly depend on large datasets, ensuring the privacy of the individuals whose data contributes to those models becomes critical. Differentially private synthetic data addresses this by creating entirely artificial datasets that reflect the key characteristics of the original data, protecting user privacy while still enabling robust model training.

Current methods for privacy-preserving data generation include directly training models with differentially private machine learning (DP-ML) algorithms, which provide strong privacy guarantees. However, when working with high-dimensional datasets used for a variety of tasks, this approach is computationally intensive and often fails to produce high-quality results. Prior work therefore combined large language models (LLMs) with differentially private stochastic gradient descent (DP-SGD) to generate private synthetic data: an LLM pretrained on public data is fine-tuned with DP-SGD on the sensitive dataset, ensuring that the synthetic data it generates does not reveal specific information about any individual in that dataset.
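
To make the mechanism concrete, here is a minimal, schematic sketch of one DP-SGD update in PyTorch: each example's gradient is clipped to a fixed norm, and calibrated Gaussian noise is added before the optimizer step. This illustrates the general algorithm rather than Google's implementation; in practice one would use a library such as Opacus together with a privacy accountant, and the hyperparameters below are placeholders.

```python
# Schematic DP-SGD step: per-example gradient clipping + Gaussian noise.
# clip_norm and noise_multiplier are illustrative placeholders, not the
# paper's settings. The per-example loop is for clarity, not speed.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Clip each example's gradient so no single record dominates the update.
    for x, y in zip(batch_x, batch_y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (norm + 1e-6)).clamp(max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale

    # Add noise calibrated to the clipping norm, then average over the batch.
    batch_size = len(batch_x)
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / batch_size

    optimizer.step()
```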

Google researchers propose an enhanced approach to generating differentially private synthetic data by leveraging parameter-efficient fine-tuning techniques such as LoRA (Low-Rank Adaptation) and prompt fine-tuning. These techniques update only a small number of parameters during private training, which reduces computational overhead and can improve the quality of the synthetic data.

The first step in this approach is to train the LLM on a large corpus of public data. The LLM is then fine-tuned with DP-SGD on the sensitive dataset, with the fine-tuning restricted to a subset of the model's parameters. LoRA fine-tuning replaces each weight matrix W in the model with W + LR, where L and R are low-rank matrices, and trains only L and R. Prompt fine-tuning, in contrast, inserts a "prompt tensor" at the beginning of the network and trains only its weights, effectively modifying only the input prompt fed to the LLM.
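
The sketch below shows what these two parameter-efficient schemes look like in PyTorch. It is a minimal illustration, not the paper's code: the rank, initialization, and prompt length are assumed placeholder values.

```python
# Minimal sketches of the two parameter-efficient schemes described above.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a pretrained linear layer: W is frozen and the effective
    weight becomes W + L @ R, with only the low-rank factors trained."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained W (and bias)
        out_f, in_f = base.weight.shape
        # L starts at zero so training begins exactly at the pretrained W.
        self.L = nn.Parameter(torch.zeros(out_f, rank))
        self.R = nn.Parameter(torch.randn(rank, in_f) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ (self.L @ self.R).T

class PromptTuning(nn.Module):
    """Prepends a trainable 'prompt tensor' to the input embeddings;
    only the prompt weights are trained, the LLM itself stays frozen."""
    def __init__(self, num_tokens: int = 20, embed_dim: int = 1024):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.01)

    def forward(self, embeds):  # embeds: (batch, seq, embed_dim)
        batch = embeds.shape[0]
        return torch.cat([self.prompt.expand(batch, -1, -1), embeds], dim=1)
```

Because only L, R, or the prompt tensor receive gradients, the DP-SGD clipping and noising apply to far fewer parameters, which is exactly where the computational savings come from.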

Experimental results show that LoRA fine-tuning, which updates roughly 20 million parameters, outperforms both full-parameter fine-tuning and prompt-based tuning, which updates only about 41,000 parameters. This suggests there is a sweet spot in the number of trainable parameters that balances computational efficiency against data quality. Classifiers trained on synthetic data produced by the LoRA fine-tuned LLM outperformed classifiers trained on synthetic data from the other fine-tuning methods, and in some cases even outperformed classifiers trained directly on the original sensitive data with DP-SGD. In the experiments evaluating the proposed approach, a decoder-only LLM (LaMDA 8B) was trained on public data and then privately fine-tuned on three publicly available datasets (IMDB, Yelp, AG News) treated as sensitive data. The generated synthetic data was used to train classifiers for tasks such as sentiment analysis and topic classification, and the classifiers' performance on held-out subsets of the original data demonstrated the effectiveness of the proposed method.
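
The evaluation protocol itself is straightforward and can be sketched as follows: train a classifier only on the synthetic text, then measure its accuracy on held-out real examples. The variable names and the TF-IDF/logistic-regression classifier here are hypothetical stand-ins for illustration; the paper's actual classifiers and data splits differ.

```python
# Schematic downstream evaluation: a classifier trained purely on DP
# synthetic data is scored on held-out real data. All inputs are
# hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_synthetic(synthetic_texts, synthetic_labels,
                       real_texts, real_labels):
    vec = TfidfVectorizer(max_features=20_000)
    clf = LogisticRegression(max_iter=1000)

    # Train exclusively on the generated synthetic data...
    clf.fit(vec.fit_transform(synthetic_texts), synthetic_labels)

    # ...and evaluate on real, held-out examples. High accuracy means the
    # synthetic data preserved the task-relevant signal of the original.
    preds = clf.predict(vec.transform(real_texts))
    return accuracy_score(real_labels, preds)
```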

In conclusion, the proposed approach of generating differentially private synthetic data with parameter-efficient fine-tuning techniques outperforms existing methods. By fine-tuning a smaller subset of parameters, it reduces computational requirements while improving the quality of the synthetic data. The approach not only protects privacy but also retains high utility for training predictive models, making it a valuable tool for organizations that want to leverage sensitive data without compromising user privacy. The experimental results demonstrate the method's effectiveness and suggest its potential for broader applications in privacy-preserving machine learning.


Check out the paper. All credit for this research goes to the researchers of this project.



Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her bachelor's degree at the Indian Institute of Technology (IIT), Kharagpur. She is a technology enthusiast with a keen interest in software, data, and a range of science applications, and she constantly reads about developments in various areas of AI and ML.






