Inside Machine Learning: Duke Professor Touts Potential ‘Federated’ Approach



Think federated learning and consider an octopus, says Xu. It has nine brains: a donut-shaped central brain occupies the head, and a mini-brain sits at the base of each arm. Together, those mini-brains hold about two-thirds of the octopus’s thinking capacity. The eight-armed cephalopod’s intelligence is distributed: each arm makes independent decisions and exchanges information with the central brain. The big brain doesn’t always know what the arms are doing, but the back-and-forth trains each limb to work more efficiently.

“Federated learning is a very new concept and the theory behind it will definitely be very important in the future,” says Xu.

Xu is the recipient of a National Science Foundation CAREER Award for his project “Federated Learning: Statistical Optimality and Provable Security.”

According to Xu, traditional machine learning requires all data to be centralized in one data center. Federated learning, also known as collaborative learning, instead trains a central model using inputs received from distributed sources. Edge devices play a key role: smartphones, climate sensors, semi-autonomous vehicles, satellites, bank fraud-detection systems, medical wearables, or anything else that collects data. All of them contribute remotely to the central model in repeated training cycles, run as needed.

Xu says what makes federated learning so attractive to scientists, doctors, and businesses is that the data itself never actually leaves the edge device. That appeals to many industries, especially healthcare, because of privacy laws such as HIPAA, the Health Insurance Portability and Accountability Act, and the constant threat of hacking.

Researchers like Xu point to other benefits as well: federated learning requires less communication and power, and it can run while a device sits idle. Because mobile phones and similar devices now have far more computing power than in the past, he says, federated learning is in the early stages of finding real-world applications.

Still, Xu says federated learning isn’t a perfect solution. The potential for hacking remains whenever servers and edge devices communicate, and it is still possible for an eavesdropper to infer personal data from the transmitted parameters.

To help find privacy solutions, Xu has developed query strategies and analysis techniques as part of an anti-theft framework for federated learning. He shares the findings in two papers: “Learner-Private Convex Optimization,” published in IEEE Transactions on Information Theory, and “Optimal Query Complexity for Private Sequential Learning Against Eavesdropping,” presented at the 24th International Conference on Artificial Intelligence and Statistics. Both are co-authored with Kuang Xu, associate professor of operations, information and technology at the Stanford Graduate School of Business, and Dana Yang, assistant professor of statistics and data science at Cornell University.

“A lot of research is being done on federated learning right now,” says Xu. “Companies are studying it, but there are many obstacles to overcome before these systems can really work.”

Stopping nefarious eavesdroppers

Xu notes that when Google coined the term federated learning, the idea wasn’t completely new. To speed up AI training, companies had already begun distributing computational load across servers.

Federated learning takes this a step further, according to Xu. Here’s how it works: initially, a local copy of the central server’s model exists on every edge device. Over time, each device gains experience, trains its copy, and gets smarter. When queried by the central server, a device transmits its training results, not the raw data itself, to the server, which averages and aggregates the results and updates the central model. Devices then download the newer, smarter version, built in part from their own data, and the cycle repeats as needed. In other words, with federated learning, the model travels to the remote device while sensitive information such as emails, photos, financial records, and health data stays securely where it was collected.
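
The cycle Xu describes is often implemented as federated averaging. Below is a minimal toy sketch of that idea; all names, numbers, and the linear model are illustrative assumptions, not taken from Xu’s work. Five simulated devices each fit a local model on private data and send back only the trained weight, which the server averages:

```python
import numpy as np

# Toy federated-averaging sketch: each "edge device" fits a local linear
# model y = w * x on its own data, then shares only the trained weight
# (never the raw data) with the server for averaging.

rng = np.random.default_rng(0)
TRUE_W = 3.0  # the relationship hidden in every device's private data

def make_client_data(n=50):
    x = rng.uniform(-1, 1, n)
    y = TRUE_W * x + rng.normal(0, 0.1, n)
    return x, y

def local_update(w, x, y, lr=0.5, steps=10):
    """Run a few gradient steps on the device's private data."""
    for _ in range(steps):
        grad = np.mean(2 * (w * x - y) * x)  # gradient of mean squared error
        w -= lr * grad
    return w

clients = [make_client_data() for _ in range(5)]
w_global = 0.0
for _ in range(3):
    # Each device trains locally, starting from the current global model.
    local_ws = [local_update(w_global, x, y) for x, y in clients]
    # The server sees only model parameters, never the raw (x, y) data.
    w_global = float(np.mean(local_ws))

print(w_global)  # converges close to TRUE_W
```

The server’s only job here is aggregation; every gradient computation happens on data that never leaves its device.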

Xu and co-authors discuss the possibility of malicious eavesdropping in their paper on optimal query complexity.

“Because the learner [the central computer] requires frequent contact with the data owners [the edge devices], the queries can be eavesdropped by third-party attackers,” the authors write. “The adversary could use the observed queries to reconstruct the trained model, allowing it to free-ride at the learner’s expense, or worse, to use such information for future sabotage.”

The challenge for Xu and his co-authors was how to prevent third parties from seeing edge device responses.

“We have developed a strategy that lets the learner query the number as efficiently as possible while never exposing the information to the adversary,” said Xu. “This makes it impossible for the eavesdropper to determine the true value in the responses with any degree of accuracy.”

Xu and his collaborators envisioned a private sequential learning problem (plainly speaking, a guessing game) using a binary search model. Party A asks Party B to guess a number between 0.0 and 1.0. B asks, “Is the number greater than 0.3?” A replies, “Yes, the number is between 0.3 and 1.0.” So B asks, “Is the number between 0.3 and 0.4?” But at the same time, under Xu’s proposed solution, B also fires off a barrage of other questions, such as “Is the number between 0.4 and 0.5?” and “Is the number between 0.6 and 0.7?” As a result, an eavesdropper cannot tell which query is leading the questioner to the correct answer.
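
The decoy-question idea can be sketched in a few lines. This is an illustrative toy, not the paper’s exact scheme: a learner binary-searches for a secret number, but pads each round with random decoy thresholds so an eavesdropper who sees only the query stream cannot tell which question bisects the learner’s current interval.

```python
import random

# Toy "guessing game with a smokescreen": binary search for a secret in
# [0, 1], where every round mixes the true bisection query with decoys.

random.seed(1)
secret = 0.637          # known only to the data owner
lo, hi = 0.0, 1.0
transcript = []         # what the eavesdropper observes: shuffled batches

for _ in range(20):
    true_query = (lo + hi) / 2
    decoys = [random.uniform(0, 1) for _ in range(3)]
    batch = decoys + [true_query]
    random.shuffle(batch)       # eavesdropper can't single out the real query
    transcript.append(batch)
    # Only the true query's answer actually drives the search.
    if secret > true_query:
        lo = true_query
    else:
        hi = true_query

estimate = (lo + hi) / 2
print(abs(estimate - secret))   # the learner still converges
```

After 20 rounds the interval has width 2⁻²⁰, so the learner pins the secret down to about a millionth, while the transcript shows 80 queries with no indication of which 20 mattered.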

To explain how it works, Xu proposes the analogy of an oil company that drills many wells but extracts oil from only one, preventing rival companies from knowing which.

“To confuse your competitors, they see you dig many wells but cannot tell which ones are successful,” Xu said.

The guessing game holds more challenges. Besides creating a smokescreen, Xu and his co-authors had another, equally important goal: they wanted the training to require as few queries as possible.

“In federated learning, communication bandwidth is a scarce resource, so efficient use of queries is of fundamental importance,” write Xu and his co-authors. “Studying the trade-offs between accuracy, privacy, and query complexity under the binary search model provides valuable insight into algorithm design for federated learning.”

One driving insight is that the most privacy-sensitive part of querying happens after the learner has already reached a reasonably accurate guess. The optimal query strategy is thus divided into two phases. First, in the pure-learning phase, the main goal is to narrow the search to a small interval containing the true number; privacy is not yet a priority. Then, in the private refinement phase, the learner refines the guess within that interval and devotes significantly more queries to obfuscation.

How to “optimally obfuscate” a learner’s queries is also the subject of the learner-private convex optimization paper. There, the authors explore a solution using convex optimization, a mathematical methodology for making the best choice in the face of conflicting requirements and a commonly used framework in federated learning.

The setup resembles the guessing game, but the key here is to construct a number of intervals that are sufficiently far apart yet, from the eavesdropper’s point of view, equally likely to contain the best choice. Only one interval contains the true best choice; in all the others, the learner generates fake proxy queries at random. An eavesdropper therefore cannot distinguish the true best choice from the many bogus proxies.
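
The masking idea can be sketched as follows. This is an assumption-laden illustration, not the paper’s algorithm: the learner lays out K well-separated intervals, one of which secretly contains the true best choice, and issues the same number of identically distributed queries in each.

```python
import random

# Toy interval-masking sketch: genuine refinement queries and random fake
# proxies are drawn the same way, so every interval looks alike from outside.

random.seed(7)
K = 5
intervals = [(i / K, i / K + 0.05) for i in range(K)]  # well-separated slots
true_slot = random.randrange(K)   # only the learner knows which slot is real

queries = []
for a, b in intervals:
    # Queries in the true slot refine the answer; the rest are decoys.
    # Statistically they are indistinguishable to an eavesdropper.
    queries += [random.uniform(a, b) for _ in range(10)]
random.shuffle(queries)

# Every interval received exactly 10 queries, so from the eavesdropper's
# viewpoint each is equally likely (probability 1/K) to hide the optimum.
per_slot = [sum(a <= q <= b for q in queries) for a, b in intervals]
print(per_slot)
```

Because the query counts and distributions match across intervals, the eavesdropper’s best guess is no better than picking one of the K intervals uniformly at random.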

Xu and his co-authors use the example of autonomous driving to illustrate how private convex optimization can benefit companies. “The goal is to protect the privacy of the leading manufacturer (the learner) against model-plagiarism attacks from competitors (the eavesdroppers)… Without a worst-case guarantee, an attacker cannot act on a stolen model… [and the] strategy renders the adversary powerless,” they write. As with the paper on optimal query complexity, the result is the same: private data is protected. The mathematical strategy for reaching that goal, however, is different.

Cross-device dilemma

There are two types of federated learning: cross-device and cross-silo. Cross-device learning typically happens on consumer devices and can involve millions of users. Cross-silo learning usually has far fewer participants, each holding vast amounts of data, such as financial institutions or pharmaceutical companies.

A common model might work well in cross-silo settings, but it becomes difficult to implement across millions of smartphones owned by users with different habits. Gboard is an example: this Android keyboard uses federated learning to predict the next word you’ll type when searching or composing a message. The phone learns new phrases and words, stores the information and its context, and makes it available for federated training.

However, according to Xu, there are issues related to personalization.

“Your habits of typing certain words may differ from mine; a single common model probably won’t work for everyone,” says Xu. “We want to create personalized models for every individual user.”

How to split users into appropriate training groups is the subject of a paper submitted to the 36th Conference on Neural Information Processing Systems, “Global Convergence of Federated Learning for Mixed Regression.” Xu co-authored the paper with Lili Su, assistant professor of electrical and computer engineering at Northeastern University, and Pengkun Yang, assistant professor at the Center for Statistical Science at Tsinghua University.

To solve this problem, they turn to a concept called clustering. It assumes that clients are not all alike (some cars always drive in snow, others always in rain) and that they fall into a defined number of groups based on such characteristics. The server doesn’t know which group a particular car belongs to, so it must train different models while accounting for that uncertainty.
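
The alternating assign-then-refit idea can be sketched with a toy mixture of two client groups. The code below is an illustrative EM-style loop under assumed data, not the authors’ algorithm: clients come from two hidden groups with different true relationships, and the server alternates between assigning each client to the model that fits its data best and re-fitting each model on its assigned clients.

```python
import numpy as np

# Toy clustered-learning sketch: two hidden client groups ("snow" vs "rain"
# drivers) with different true weights; the server must discover the split
# while training a separate model for each group.

rng = np.random.default_rng(3)
TRUE = [2.0, -1.0]  # the two hidden group weights

def client(group, n=40):
    x = rng.uniform(-1, 1, n)
    return x, TRUE[group] * x + rng.normal(0, 0.1, n)

clients = [client(i % 2) for i in range(10)]
models = [0.5, -0.5]  # server's initial guesses for the two group models

for _ in range(5):
    # Assignment step: each client picks the model with the smallest error
    # on its own data.
    assign = [int(np.argmin([np.mean((w * x - y) ** 2) for w in models]))
              for x, y in clients]
    # Update step: each model is re-fit (least squares) on its clients.
    for k in range(2):
        xs = np.concatenate([clients[i][0] for i in range(10) if assign[i] == k])
        ys = np.concatenate([clients[i][1] for i in range(10) if assign[i] == k])
        models[k] = float(xs @ ys / (xs @ xs))

print(models)  # recovers the two hidden group weights
```

Once the assignments stabilize, each group effectively runs its own federated training, which is exactly the escape from the chicken-and-egg problem: the models and the group split improve each other in alternation.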

“It’s a chicken-and-egg problem,” Xu said. “Once we know the true group split (which clients belong to which group), we can train separate models for each group. But we don’t know the split to begin with.”

To get out of this predicament, Xu and his co-authors designed a new algorithmic approach that lets a server estimate which group an individual client belongs to and train a federated learning model for that group accordingly.

Xu believes companies now need to pay attention to the implications of federated learning.

“If your company is not investing in privacy-preserving technology, customers may go to competitors who are,” Xu said. “Over the next 10 years, it may become increasingly difficult to collect the data used for internal machine learning, so privacy-preserving techniques will only grow in importance.”


