How machine learning violates privacy

Machine learning has pushed the boundaries in several areas, including personalized medicine, self-driving cars, and customized advertising. But research has shown that these systems memorize aspects of the data they were trained on in order to learn patterns, which raises privacy concerns.

The goal of statistics and machine learning is to learn from past data in order to make new predictions or inferences about future data. To achieve this goal, a statistician or machine learning expert selects a model that captures the patterns they suspect are present in the data. The model applies simplifying structure to the data so that it can learn those patterns and make predictions.

Complex machine learning models have some inherent strengths and weaknesses. On the positive side, they can learn more complex patterns and use richer data sets for tasks like image recognition or predicting how a particular person will respond to a treatment.

However, they also run the risk of overfitting the training data. This means the model makes accurate predictions on the data it was trained on, but starts to learn incidental aspects of that data that are not directly relevant to the task at hand. The result is a model that fails to generalize: it performs worse on new data that is of the same type as, but not identical to, the training data.
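A toy sketch of what overfitting-by-memorization looks like: a "model" that simply memorizes every training point achieves zero error on the training data, while a simpler model that captures only the underlying trend does not. (All data and function names here are illustrative.)

```python
import random

random.seed(0)

# Toy task: learn the trend y = 2x from noisy observations.
def make_data(n):
    return [(x, 2 * x + random.gauss(0, 0.5))
            for x in (random.uniform(0, 10) for _ in range(n))]

train = make_data(20)

# An "overfit" model: a lookup table that memorizes every training point.
memorized = {x: y for x, y in train}

def overfit_predict(x):
    # Perfect recall on training inputs; otherwise fall back to the
    # nearest memorized point.
    if x in memorized:
        return memorized[x]
    nearest = min(memorized, key=lambda k: abs(k - x))
    return memorized[nearest]

# A simple generalizing model: just the true trend, with no memorized noise.
def simple_predict(x):
    return 2 * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(overfit_predict, train))  # exactly 0: even the noise was memorized
print(mse(simple_predict, train))   # small but nonzero
```

On unseen data drawn from the same trend, the memorizing model typically does worse, because the noise it memorized carries no information about new points.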

There are techniques to address the prediction errors associated with overfitting, but the fact that models can learn so much from their training data also raises privacy concerns.

How machine learning algorithms make inferences

Each model has a certain number of parameters. A parameter is an element of a model that can be changed, and each parameter has a value, or setting, that the model derives from training data. Parameters can be thought of as knobs you can turn to affect the performance of the algorithm. While a straight-line model has only two knobs, slope and intercept, machine learning models can have an enormous number of parameters. The language model GPT-3, for example, has 175 billion.
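As a minimal sketch, the two-knob straight-line model mentioned above can be written as a function whose behavior changes as the knobs are turned:

```python
# A minimal two-parameter model: slope and intercept are the "knobs".
def linear_model(x, slope, intercept):
    return slope * x + intercept

# Two different knob settings give different predictions for the same input.
print(linear_model(3.0, slope=2.0, intercept=1.0))  # 7.0
print(linear_model(3.0, slope=0.5, intercept=4.0))  # 5.5
```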

To select parameter values, machine learning methods use training data, with the goal of minimizing prediction error on that data. For example, if the goal is to predict whether a person will respond well to a certain medical procedure based on their medical history, the machine learning model makes predictions on data where the developer already knows whether each patient responded well or poorly. The model is rewarded for correct predictions and penalized for incorrect ones, which leads the algorithm to adjust its parameters, that is, to turn some of the "knobs", and try again.
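The "turn the knobs and try again" loop can be sketched as gradient descent on a squared-error loss for the two-knob line. The data and learning rate below are illustrative:

```python
# Noiseless toy data from the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

slope, intercept = 0.0, 0.0  # start with both knobs at zero
lr = 0.05                    # how far to turn the knobs each step

for _ in range(2000):
    # The "penalty": gradient of mean squared error w.r.t. each knob.
    g_slope = sum(2 * (slope * x + intercept - y) * x for x, y in data) / len(data)
    g_int = sum(2 * (slope * x + intercept - y) for x, y in data) / len(data)
    # Turn each knob slightly in the direction that reduces the error.
    slope -= lr * g_slope
    intercept -= lr * g_int

print(round(slope, 2), round(intercept, 2))  # converges close to 2.0 and 1.0
```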


To avoid overfitting on the training data, the machine learning model is also checked against a validation dataset. A validation dataset is a separate dataset that is not used in the training process. By checking the performance of the machine learning model on this validation dataset, developers can ensure that the model can generalize beyond the training data and avoid overfitting.
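A common way to set this up, sketched here with illustrative numbers, is to hold out part of the data as a validation set that the training loop never sees, then compare errors on the two splits:

```python
import random

random.seed(1)

# Toy data from the noisy trend y = 2x.
points = [(x, 2 * x + random.gauss(0, 0.3)) for x in range(30)]
random.shuffle(points)
train, validation = points[:24], points[24:]  # e.g. an 80/20 split

def mse(slope, intercept, data):
    return sum((slope * x + intercept - y) ** 2 for x, y in data) / len(data)

# Suppose training produced these knob settings (illustrative values):
slope, intercept = 2.0, 0.0

train_err = mse(slope, intercept, train)
val_err = mse(slope, intercept, validation)

# A large gap between the two errors is the classic symptom of overfitting;
# here the errors should be of similar size, since the model was not
# allowed to memorize the training noise.
print(train_err, val_err)
```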

While this process succeeds in ensuring good performance of machine learning models, it does not directly prevent the machine learning model from memorizing the information in the training data.

Privacy issues

Because machine learning models have so many parameters, a model can end up memorizing some of the data it was trained on. This is in fact a widespread phenomenon, and attackers can extract that memorized data from a model by issuing carefully tailored queries.

If the training data contains sensitive information, such as medical or genomic data, this can violate the privacy of the people whose data was used to train the model. Recent research has shown that machine learning models actually need to memorize aspects of the training data to achieve optimal performance on some problems, which suggests there may be a fundamental trade-off between performance and privacy in machine learning.

Machine learning models can also be used to predict sensitive information from seemingly non-sensitive data. For example, Target was able to predict which customers were likely pregnant by analyzing the shopping habits of customers enrolled in the Target baby registry. Once the model was trained on this dataset, it could send pregnancy-related ads to customers it suspected were pregnant because they had purchased products such as supplements or unscented lotions.

Is privacy possible?

Many methods have been proposed to reduce memorization in machine learning, but most have proven ineffective. Currently, the most promising solution to this problem is to guarantee mathematical bounds on the privacy risk.

The state-of-the-art method for formal privacy protection is differential privacy. Differential privacy requires that a machine learning model does not change significantly when a single individual's data in the training dataset changes. Differential privacy techniques achieve this guarantee by injecting additional randomness into the learning algorithm that "hides" any particular individual's contribution. Once a method is protected by differential privacy, no attack can break that privacy guarantee.

However, even if a machine learning model is trained with differential privacy, that does not prevent it from making sensitive inferences, as in the Target example. To prevent such privacy breaches, all data sent to the organization must be protected before it leaves the user's hands. This approach is called local differential privacy, and it has been deployed by Apple and Google.
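The classic local-differential-privacy technique is randomized response, sketched below: each person randomizes their own answer before sending it, so the organization never sees anyone's true value, yet it can still estimate population-level statistics. The numbers are illustrative:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    if random.random() < 0.5:
        return true_answer           # half the time: report truthfully
    return random.random() < 0.5     # otherwise: report a fair coin flip

def estimate_fraction(reports):
    # A true "yes" is reported as "yes" with probability 3/4, a true "no"
    # with probability 1/4, so the bias can be inverted in aggregate.
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

random.seed(42)
true_answers = [True] * 300 + [False] * 700  # 30% true "yes"
reports = [randomized_response(a) for a in true_answers]
print(estimate_fraction(reports))  # close to 0.3
```

Any single report is deniable (it may just be a coin flip), but the aggregate estimate is still accurate when enough people participate.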


Differential privacy limits how much a machine learning model can depend on any single individual's data, which hinders memorization. Unfortunately, it also limits the performance of machine learning methods. This trade-off has drawn criticism of differential privacy's usefulness, because it often leads to a significant drop in performance.

Where to go from here

The tension between learning from data and protecting privacy ultimately raises a societal question of which matters more in which situations. When the data contains no sensitive information, it is easy to recommend using the most powerful machine learning techniques available.

However, when working with sensitive data, it is important to weigh the consequences of privacy leakage, and there may be cases where some machine learning performance must be sacrificed to protect the privacy of the people whose data trained the model.


