How machine learning violates privacy


Machine learning has pushed the boundaries in several areas, including personalized medicine, self-driving cars, and customized advertising. But studies have found that these systems memorize aspects of the data they were trained on in the course of learning patterns, which raises privacy concerns.

The goal of statistics and machine learning is to learn from past data to make new predictions or inferences about future data. To achieve this goal, a statistician or machine learning expert selects a model that captures the patterns they suspect exist in the data. The model applies a simplifying structure to the data, allowing it to learn the patterns and make predictions.

Complex machine learning models have their own strengths and weaknesses. On the positive side, they can learn more complex patterns and handle richer data sets for tasks like image recognition or predicting how a particular person will respond to a treatment.

However, complex models also carry the risk of overfitting the data – that is, the model makes accurate predictions on the data it was trained on, but also learns additional aspects of that data that are not directly relevant to the task at hand. An overfit model fails to generalize: it performs poorly on new data that is of the same type as the training data, but not exactly the same.
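Overfitting can be seen in a small illustration (the data here is synthetic, invented for this sketch): fit the same noisy points once with a simple line and once with a far more flexible polynomial, then compare errors on the training points versus fresh points from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple linear relationship y = 2x + 1.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(0, 0.2, size=10)
x_new = np.linspace(0, 1, 100)
y_new = 2 * x_new + 1 + rng.normal(0, 0.2, size=100)

# A simple model (degree 1) vs. an overly flexible one (degree 9).
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=9)

def mse(coeffs, x, y):
    """Mean squared prediction error of a polynomial on data (x, y)."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# The flexible model fits its own training points almost perfectly...
print(mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
# ...but does worse on new data drawn from the same process.
print(mse(simple, x_new, y_new), mse(flexible, x_new, y_new))
```

The degree-9 polynomial threads through every training point, including the noise, which is exactly the "additional aspects not relevant to the task" described above.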

While there are techniques to address the prediction errors associated with overfitting, the fact that models can learn so much from their training data also raises privacy concerns.

How machine learning algorithms make inferences

Every model has a certain number of parameters. A parameter is an aspect of the model that can be changed. Each parameter has a value or setting that the model derives from the training data. You can think of parameters as different knobs that you can turn to affect the performance of the algorithm. A linear model has only two knobs, slope and intercept, but machine learning models can have a huge number of parameters. For example, the language model GPT-3 has 175 billion parameters.
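The two-knob case can be written out directly (the function name here is invented for illustration): the same input produces different predictions depending on how the slope and intercept knobs are set.

```python
# A linear model has just two "knobs": slope and intercept.
def linear_model(x, slope, intercept):
    """Predict y from x using the two parameters of a line."""
    return slope * x + intercept

# Different knob settings give different predictions for the same input.
print(linear_model(3.0, slope=2.0, intercept=1.0))  # 7.0
print(linear_model(3.0, slope=0.5, intercept=4.0))  # 5.5
```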

To select parameters, machine learning methods use training data with the goal of minimizing the prediction error on the training data. For example, if the goal is to predict whether a person will respond well to a certain treatment based on their medical history, the machine learning model makes predictions on data where the model developer knows whether the patient responded well or poorly. The model is rewarded for correct predictions and penalized for inaccurate predictions. This causes the algorithm to tune the parameters, i.e. turn some “knobs” and try again.
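The tune-and-try-again loop can be sketched with the two-knob linear model from above, using gradient descent on squared prediction error (a common way to implement the reward/penalty idea, though not the only one; the data here is made up):

```python
# Minimal sketch: tune the two knobs of a linear model so that the
# squared prediction error on the training data shrinks.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated by y = 2x + 1

slope, intercept = 0.0, 0.0  # start with arbitrary knob settings
lr = 0.05                    # how far to turn the knobs each step

for _ in range(2000):
    # Gradient of mean squared error with respect to each knob.
    grad_s = sum(2 * (slope * x + intercept - y) * x
                 for x, y in zip(xs, ys)) / len(xs)
    grad_i = sum(2 * (slope * x + intercept - y)
                 for x, y in zip(xs, ys)) / len(xs)
    # Turn each knob slightly against its error gradient, then try again.
    slope -= lr * grad_s
    intercept -= lr * grad_i

print(round(slope, 2), round(intercept, 2))  # approaches 2.0 and 1.0
```

Each pass through the loop is one round of "penalize inaccurate predictions, turn the knobs, try again."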


To avoid overfitting on the training data, the machine learning model is also checked against a validation dataset. A validation dataset is a separate dataset that is not used in the training process. By checking the performance of the machine learning model on this validation dataset, developers can ensure that the model can generalize beyond the training data and avoid overfitting.
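One common use of a validation set is to pick among models of different complexity: fit each candidate on the training portion only, then keep the one with the lowest error on the held-out portion. A minimal sketch with synthetic data (all names and numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = 2 * x + 1 + rng.normal(0, 0.2, size=40)

# Hold out a validation set that the fitting step never sees.
train_idx = rng.permutation(40)[:30]
val_idx = np.setdiff1d(np.arange(40), train_idx)

def val_error(deg):
    """Fit a degree-`deg` polynomial on training data only,
    then measure its error on the held-out validation data."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg)
    resid = np.polyval(coeffs, x[val_idx]) - y[val_idx]
    return np.mean(resid ** 2)

# Choose the model complexity that generalizes best to held-out data.
errors = {deg: val_error(deg) for deg in range(1, 10)}
best = min(errors, key=errors.get)
print(best, errors[best])
```

Because the validation points played no role in fitting, a model that merely memorized the training points would score badly here, which is how this check guards against overfitting.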

Although this process is successful in ensuring good performance of machine learning models, it does not directly prevent machine learning models from memorizing information in the training data.

Privacy issues

Because machine learning models have a large number of parameters, it is possible that a machine learning method will memorize some of the data it was trained on. In fact, this is a widespread phenomenon, and users can extract memorized data from machine learning models using queries tailored to retrieve that data.

If the training data contains sensitive information, such as medical or genomic data, this could violate the privacy of the owners of the data used to train the model. Recent research has shown that machine learning models are indeed required to memorize aspects of the training data for optimal performance in solving a particular problem. This suggests that there may be a fundamental trade-off between performance and privacy for machine learning methods.

Machine learning models can also make it possible to predict sensitive information using seemingly non-sensitive data. For example, Target was able to predict which customers were likely to be pregnant by analyzing the purchasing habits of customers who signed up for the Target Baby Registry. After training a model on this dataset, they were able to send pregnancy-related ads to customers who were suspected to be pregnant because they had purchased products such as supplements and fragrance-free lotions.

Is privacy possible?

Many methods have been proposed to reduce memorization in machine learning models, but most have been largely ineffective. Currently, the most promising solution to this problem involves placing mathematical bounds on privacy risks.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model does not change significantly when a single individual's data changes in the training dataset. Differential privacy methods achieve this guarantee by adding randomness to the learning algorithm that "hides" the contributions of specific individuals. Once a method is protected with differential privacy, no attack can violate its privacy guarantees.
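A classic way to realize this guarantee for a simple statistic is the Laplace mechanism: add noise calibrated to how much one person's data can change the answer. The sketch below (function name and data invented for illustration, not a production implementation) releases a differentially private mean:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_mean(data, lower, upper, epsilon):
    """Release the mean of bounded data with epsilon-differential
    privacy via the Laplace mechanism."""
    data = np.clip(data, lower, upper)
    # Changing one person's value moves the mean by at most this much,
    # so noise of this scale "hides" any single individual.
    sensitivity = (upper - lower) / len(data)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(np.mean(data)) + noise

ages = [34, 29, 41, 55, 23, 38, 47, 31]
print(dp_mean(ages, lower=18, upper=90, epsilon=1.0))
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means less noise and weaker privacy, which is the performance-privacy trade-off discussed below.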

However, even if a machine learning model was trained with differential privacy, that does not prevent it from making sensitive inferences like those in the Target example. To prevent such privacy violations, all data must be protected before it ever leaves the individual and reaches the organization. This approach is called local differential privacy, and it has been implemented by Apple and Google.
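The oldest local differential privacy technique is randomized response: each person randomizes their own answer before reporting it, yet the aggregator can still estimate population-level rates. A minimal sketch with simulated people (all numbers invented for illustration):

```python
import random

random.seed(0)

def randomized_response(truth: bool) -> bool:
    """Each person randomizes their own answer before it leaves
    their device: honest half the time, a coin flip otherwise."""
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

# Simulate 100,000 people, 30% of whom hold the sensitive attribute.
n = 100_000
true_answers = [random.random() < 0.3 for _ in range(n)]
reported = [randomized_response(t) for t in true_answers]

# The aggregator never sees true answers, but can still estimate the
# true rate p, since P(report True) = 0.5 * p + 0.25.
rate = sum(reported) / n
estimate = 2 * (rate - 0.25)
print(round(estimate, 2))  # close to 0.3
```

No single report reveals anyone's true answer, yet the aggregate estimate recovers the population rate, which is the sense in which the data is "protected before it is sent."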

Differential privacy is a method of protecting individual privacy when their data is included in large datasets.

Differential privacy limits how much a machine learning model can rely on any one individual's data, thus preventing memorization. Unfortunately, this also limits the performance of machine learning methods. This trade-off, which often amounts to a significant degradation of performance, has led some to question the usefulness of differential privacy.

Looking ahead

The tension between inferential learning and privacy concerns ultimately raises a societal question of which matters more in which situations. When the data does not contain sensitive information, it is easy to recommend using the most powerful machine learning techniques available.

However, when dealing with sensitive data, it is important to consider the impact of privacy leakage, and some machine learning performance may need to be sacrificed in order to protect the privacy of the owners of the data that trained the model.

Jordan Awan is an assistant professor of statistics at Purdue University.

