Differentially private knowledge transfer for federated learning

Next, we present the differentially private knowledge transfer method for federated learning, named PrivateKT. We first give formal definitions of local differential privacy and of the research problem studied in this paper, and then introduce the details of PrivateKT.

Preliminary

The local differential privacy method (LDP)56 aims to protect user privacy with theoretical guarantees. The core idea of LDP is to perturb the shared data via a randomized mechanism. Formally, LDP is defined as follows: a randomized mechanism \(\mathcal{M}(\cdot)\) protects its input data under ϵ-LDP if and only if, for any two inputs X and \({X}^{\prime}\) and any output \(Y\in range(\mathcal{M})\), the following inequality holds:

$$\Pr[\mathcal{M}(X)=Y]\le e^{\epsilon }\cdot \Pr[\mathcal{M}(X^{\prime })=Y],$$

(1)

where Pr[·] denotes the probability of the event in brackets, and ϵ is the privacy budget. The privacy budget ϵ quantifies the privacy guarantee: a smaller budget means stronger privacy protection.
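To make the definition concrete, the sketch below computes the smallest ϵ for which a mechanism with finitely many inputs and outputs satisfies Eq. (1). Because the inequality must hold for every pair of inputs, it is enough to compare the largest and smallest probability of each output. The function name `smallest_epsilon` and the probability-table representation are illustrative assumptions, not part of PrivateKT.

```python
import math

def smallest_epsilon(channel):
    """Return the smallest eps for which a finite mechanism is eps-LDP.

    channel[x][y] holds Pr[M(x) = y]; all rows have the same length.
    """
    eps = 0.0
    n_outputs = len(channel[0])
    for y in range(n_outputs):
        column = [row[y] for row in channel]
        hi, lo = max(column), min(column)
        if hi == 0.0:
            continue              # output never produced: no constraint
        if lo == 0.0:
            return math.inf       # some input can never yield y: no finite eps
        eps = max(eps, math.log(hi / lo))
    return eps

# Binary randomized response that reports the truth with probability 0.75.
binary_rr = [[0.75, 0.25],
             [0.25, 0.75]]
print(smallest_epsilon(binary_rr))  # ln(3), roughly 1.0986
```

In the binary example the worst-case probability ratio is 0.75 / 0.25 = 3, so the mechanism is ϵ-LDP for every ϵ ≥ ln 3.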

Problem definition

Following popular federated learning settings, PrivateKT includes N clients and a central server. Each client keeps its local dataset private and never shares it externally; the local dataset of the i-th client is denoted as \(\mathcal{D}_{l}^{i}\). The global model is maintained by the central server and has a local copy on each client. The central server is also responsible for coordinating the clients that participate in knowledge transfer. In addition, we assume that there is an unlabeled public dataset \(\mathcal{D}_{p}\) that is not privacy-sensitive and can be shared across different parties for knowledge transfer, where the i-th sample in \(\mathcal{D}_{p}\) is denoted as \(x_{i}^{p}\). To guarantee privacy during knowledge transfer, any communicated variable correlated with the local private data must be protected by the LDP method. The research problem studied in this paper is to design a knowledge transfer method for federated learning that is both private and effective.

Differentially private knowledge transfer

The core of private knowledge transfer is communicating perturbed local model predictions on a small amount of actively selected public data. By drastically reducing the size of the communicated variables, PrivateKT can effectively mitigate the damage of LDP noise to model performance. Nevertheless, a small randomly sampled dataset may be insufficient to transfer high-quality knowledge from local data to the global model. Thus, we further propose several mechanisms to improve the effectiveness of knowledge transfer based on small data. Next, we introduce the details of differentially private knowledge transfer in PrivateKT (Fig. 5).

Fig. 5: The framework of our PrivateKT method.

Taking the t-th knowledge transfer round as an example, PrivateKT includes three core steps: knowledge extraction, knowledge exchange, and knowledge aggregation. The knowledge extraction step extracts knowledge from local data and encodes it into local predictions on a small set of actively sampled data. Specifically, the server first distributes the global model of the t-th round (denoted as Θt) and K knowledge transfer (KT) samples to each client, and selects a subset of clients for model training, denoted as \(\mathcal{G}_{t}\). (The sampling mechanism for KT data is introduced later in this section.) Each client \(c\in \mathcal{G}_{t}\) first trains the latest model Θt on its local dataset \(\mathcal{D}_{l}^{c}\). The client then computes the predictions of the locally trained model on the KT data, where \(x_{i}^{t}\) denotes the i-th KT sample and \(\mathbf{y}_{c,i}^{t}\) denotes the prediction of the local model of client c on \(x_{i}^{t}\). In this way, knowledge is extracted from local data into local model predictions, and exchanging these predictions transfers local knowledge to the central server.

However, the local model predictions are correlated with the private data, so disclosing them still risks leaking raw data. Thus, to guarantee user privacy under LDP, each client perturbs its local predictions via the randomized response mechanism43. Specifically, for each local model prediction y, client c randomly chooses whether to replace it with a randomly generated category label f before uploading it to the server:

$$\hat{\mathbf{y}}=\left\{\begin{array}{ll}\mathbf{y}, & R=1\\ \mathbf{f}, & R=0\end{array}\right.,\quad R \sim \mathcal{B}(\beta ),\quad \mathbf{f} \sim \mathcal{P}(C),$$

(2)

where \(\mathbf{y}\in \{0,1\}^{C}\) is the one-hot category vector predicted by the local model, \(\mathbf{f}\in \{0,1\}^{C}\) is a random one-hot vector drawn from a uniform multinomial distribution \(\mathcal{P}(C)\), \(\hat{\mathbf{y}}\) is the perturbed local prediction, R is a random variable drawn from a Bernoulli distribution \(\mathcal{B}(\beta )\), C is the number of classification categories, and β is the probability that R equals 1. Based on the randomized response mechanism, client c builds the perturbed local predictions \(\{\hat{\mathbf{y}}_{c,i}^{t}|i=1,2,…,K\}\) for the knowledge transfer data. By uploading the perturbed predictions to the server, clients exchange local knowledge under differential privacy guarantees. (Discussions on privacy guarantees are in the next section.)
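The perturbation in Eq. (2) can be sketched in a few lines. The helper names (`perturb_prediction`, `perturb_all`) and the example labels are hypothetical, and predictions are handled as class indices that are converted to one-hot vectors only at the end. Because f is uniform over all C categories, it may coincide with y, so the uploaded label equals the true prediction with probability β + (1 − β)/C.

```python
import random

def one_hot(label, num_classes):
    """Return the one-hot vector in {0, 1}^C for a class index."""
    return [1 if j == label else 0 for j in range(num_classes)]

def perturb_prediction(label, num_classes, beta, rng=random):
    """Randomized response of Eq. (2), applied to a class index."""
    if rng.random() < beta:             # R = 1: keep the local prediction y
        return label
    return rng.randrange(num_classes)   # R = 0: uniform random label f

def perturb_all(labels, num_classes, beta, rng=random):
    """Perturb the K local predictions of one client independently."""
    return [one_hot(perturb_prediction(y, num_classes, beta, rng), num_classes)
            for y in labels]

# Hypothetical client with K = 5 predictions over C = 10 categories.
rng = random.Random(0)
uploads = perturb_all([3, 3, 7, 1, 0], num_classes=10, beta=0.3, rng=rng)
```

Each uploaded vector is still a valid one-hot vector, so the server cannot tell from a single upload whether it is a genuine prediction or a random replacement.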

After the server collects the perturbed predictions from the selected clients \(\mathcal{G}_{t}\), the knowledge aggregation step is executed to update the global model. The server first aggregates the local predictions on the same KT sample to estimate the average prediction of the different local models on it. Taking the i-th KT sample \(x_{i}^{t}\) as an example, the average prediction \(\mathbf{y}_{i}^{t}=\frac{1}{|\mathcal{G}_{t}|}\sum_{c\in \mathcal{G}_{t}}\mathbf{y}_{c,i}^{t}\) on \(x_{i}^{t}\) is estimated as follows:

$$\hat{\mathbf{y}}_{i}^{t}=\frac{1}{\beta }\left(\frac{1}{|\mathcal{G}_{t}|}\sum\limits_{c\in \mathcal{G}_{t}}\hat{\mathbf{y}}_{c,i}^{t}-\frac{1-\beta }{C}\mathbf{1}\right),$$

(3)

where \(\hat{\mathbf{y}}_{i}^{t}\) is an unbiased estimate of \(\mathbf{y}_{i}^{t}\) whose mean squared error converges asymptotically to 0. (The proof is in the Supplementary Information.) In this way, the LDP noise is reduced in the aggregated knowledge, and fine-tuning the global model on the aggregated knowledge effectively mitigates the damage of LDP noise to knowledge transfer.
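The correction in Eq. (3) follows from the expectation of a single perturbed vector under Eq. (2), which is β·y + ((1 − β)/C)·1: subtracting the uniform offset and dividing by β recovers y on average. The sketch below applies this to simulated uploads; the function names, the client counts, and the choice β = 0.5 are illustrative assumptions.

```python
import random

def debiased_average(perturbed, beta, num_classes):
    """Eq. (3): average the perturbed one-hot vectors, remove the
    uniform-noise offset (1 - beta) / C, and rescale by 1 / beta."""
    n = len(perturbed)
    offset = (1.0 - beta) / num_classes
    return [(sum(v[j] for v in perturbed) / n - offset) / beta
            for j in range(num_classes)]

def simulate_uploads(true_labels, beta, num_classes, rng):
    """Perturb each client's label as in Eq. (2); return one-hot uploads."""
    uploads = []
    for y in true_labels:
        label = y if rng.random() < beta else rng.randrange(num_classes)
        uploads.append([1 if j == label else 0 for j in range(num_classes)])
    return uploads

# Hypothetical round: 6000 clients predict class 0 and 4000 predict class 2.
rng = random.Random(1)
labels = [0] * 6000 + [2] * 4000
uploads = simulate_uploads(labels, beta=0.5, num_classes=4, rng=rng)
estimate = debiased_average(uploads, beta=0.5, num_classes=4)
# estimate should be close to the true average [0.6, 0.0, 0.4, 0.0]
```

Note that the entries of the estimate always sum to 1, but individual entries can fall slightly below 0 or above 1 when few clients participate, since the correction is applied to a noisy average.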

Recall that the LDP noise intensity is proportional to the communicated data volume, so PrivateKT uses only a small amount of public data for knowledge transfer to mitigate the damage of LDP noise. However, a small dataset may be an insufficient carrier for knowledge, which may lead to suboptimal model performance. To tackle this challenge, we propose two mechanisms that enhance knowledge transfer from different aspects. First, we propose an importance sampling mechanism to maximize the knowledge capacity of the KT data for training the global model Θt. In this mechanism, we measure the uncertainty of the global model Θt on each unlabeled sample in \(\mathcal{D}_{p}\) by its information entropy, and assign a higher sampling probability to samples with higher model uncertainty. The model uncertainty \(u_{i}^{d}\) and the sampling weight \(w_{i}^{d}\) of the i-th unlabeled sample \(x_{i}^{p}\) in \(\mathcal{D}_{p}\) are computed as follows:

$$w_{i}^{d}=\frac{\exp (u_{i}^{d})}{\sum_{j=1}^{|\mathcal{D}_{p}|}\exp (u_{j}^{d})},\quad u_{i}^{d}=-\sum\limits_{j=1}^{C}p(x_{i}^{p},j;\Theta_{t})\log p(x_{i}^{p},j;\Theta_{t}),$$

(4)

where \(p(x_{i}^{p},j;\Theta_{t})\) is the probability that model Θt classifies \(x_{i}^{p}\) into the j-th category. Second, we propose a knowledge buffer that stores historical aggregated knowledge, aiming to encode more useful knowledge into the global model. The server first stores the aggregated knowledge of the current round in the knowledge buffer and then uses the knowledge in the buffer to fine-tune the global model Θt. (The updated global model is denoted as \(\Theta_{t}^{\prime}\).) The knowledge buffer has size B and maintains the stored knowledge in a first-in-first-out manner.
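Both mechanisms are simple to prototype. The following is a minimal sketch, assuming the global model's class probabilities on the public data are already available as lists of floats. Drawing KT samples with replacement, the natural logarithm, and all function names are assumptions made for illustration; the text above does not specify them.

```python
import math
import random
from collections import deque

def entropy(probs):
    """Model uncertainty u of Eq. (4): Shannon entropy, with 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def importance_weights(predictions):
    """Sampling weights w of Eq. (4): softmax of the per-sample entropies."""
    exp_u = [math.exp(entropy(p)) for p in predictions]
    total = sum(exp_u)
    return [e / total for e in exp_u]

def sample_kt_indices(predictions, k, rng=random):
    """Draw k indices into the public set. Sampling with replacement is an
    assumption made here for brevity."""
    weights = importance_weights(predictions)
    return rng.choices(range(len(predictions)), weights=weights, k=k)

# Knowledge buffer: a deque with maxlen=B evicts the oldest entry first (FIFO).
B = 3
knowledge_buffer = deque(maxlen=B)
for round_id in range(5):
    knowledge_buffer.append(("aggregated knowledge of round", round_id))
# knowledge_buffer now holds the entries of rounds 2, 3 and 4
```

Since the entropy is bounded by log C, the exponentials stay small and no extra numerical stabilization is needed in the softmax.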

Moreover, to accelerate model convergence, we employ the self-training technique44 to further fine-tune the global model \(\Theta_{t}^{\prime}\). We randomly select M samples from \(\mathcal{D}_{p}\), favoring those with low model uncertainty, and use them to self-train the model \(\Theta_{t}^{\prime}\):

$$w_{i}^{s}=\frac{\exp (-u_{i}^{s})}{\sum_{j=1}^{|\mathcal{D}_{p}|}\exp (-u_{j}^{s})},\quad u_{i}^{s}=-\sum\limits_{j=1}^{C}p(x_{i}^{p},j;\Theta_{t}^{\prime })\log p(x_{i}^{p},j;\Theta_{t}^{\prime }),$$

(5)

where \(u_{i}^{s}\) is the uncertainty of model \(\Theta_{t}^{\prime}\) on \(x_{i}^{p}\) and \(w_{i}^{s}\) is the sampling probability of \(x_{i}^{p}\) for self-training. At this point, a knowledge transfer round in PrivateKT is complete: knowledge has been privately transferred from decentralized data to the global model, and the updated model is denoted as Θt+1. The next knowledge transfer round begins after the server distributes the latest global model Θt+1 and the corresponding KT data to the clients. By repeating this process, we transfer knowledge from decentralized data to collaboratively learn an intelligent model in an effective and privacy-preserving way. The workflow of PrivateKT is summarized in Algorithm 1.
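The self-training selection of Eq. (5) mirrors the importance sampling of Eq. (4) with the sign of the uncertainty flipped, so confident samples are favored. A minimal sketch follows. Using the arg-max class as the pseudo-label and sampling with replacement are assumptions based on common self-training practice rather than details given above, and `self_training_batch` is a hypothetical name.

```python
import math
import random

def self_training_batch(predictions, m, rng=random):
    """Sample m public samples with the weights of Eq. (5) and pseudo-label
    them. Returns (list of (index, pseudo_label), sampling weights)."""
    u = [-sum(p * math.log(p) for p in probs if p > 0.0)
         for probs in predictions]
    exp_neg_u = [math.exp(-v) for v in u]
    total = sum(exp_neg_u)
    weights = [e / total for e in exp_neg_u]
    chosen = rng.choices(range(len(predictions)), weights=weights, k=m)
    # Assumption: the pseudo-label is the model's own arg-max class.
    batch = [(i, max(range(len(predictions[i])),
                     key=predictions[i].__getitem__))
             for i in chosen]
    return batch, weights

# Hypothetical predictions of the fine-tuned model on three public samples.
preds = [[0.5, 0.5], [0.9, 0.1], [0.02, 0.98]]
batch, weights = self_training_batch(preds, m=4, rng=random.Random(0))
```

Here the third sample is the most confident one and therefore receives the largest sampling weight, while the first (maximally uncertain) sample receives the smallest.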

Algorithm 1: Workflow of PrivateKT

1: Set the hyperparameters ϵ, K, β, B, M and T

2: Server randomly initializes the model parameters Θ1

3: Server randomly selects K knowledge transfer data \(\mathcal{D}_{d}^{1}=\{x_{i}^{1}|i=1,…,K\}\) from \(\mathcal{D}_{p}\)

4: for t in 1, 2, . . . , T do

5:    Server distributes Θt and \(\mathcal{D}_{d}^{t}\) to each client

6:    Server randomly selects a group of clients \(\mathcal{G}_{t}\)

7:    for each client \(c\in \mathcal{G}_{t}\) (in parallel) do

8:       Locally train model Θt on the local dataset \(\mathcal{D}_{l}^{c}\)

9:       for i in 1, 2, . . . , K do

10:         Compute local model prediction \(\mathbf{y}_{c,i}^{t}\) on the KT data \(x_{i}^{t}\)

11:         Randomly draw \(R \sim \mathcal{B}(\beta )\) and \(\mathbf{f} \sim \mathcal{P}(C)\)

12:         Compute perturbed local model prediction \(\hat{\mathbf{y}}_{c,i}^{t}\) via Eq. (2)

13:      end for

14:      Upload perturbed local model predictions to the server

15:   end for

16:   Server aggregates local knowledge via Eq. (3) and stores it in the knowledge buffer of size B

17:   Server fine-tunes the global model Θt on the knowledge buffer

18:   Server self-trains the global model and builds the updated model Θt+1

19:   Server samples knowledge transfer data \(\mathcal{D}_{d}^{t+1}\) via the importance sampling mechanism

20: end for

Discussion on privacy protection

Next, we discuss the privacy guarantees of knowledge transfer in PrivateKT. In PrivateKT, the local private data is kept by each client and never shared externally. To transfer knowledge from decentralized data to an intelligent model, PrivateKT extracts knowledge from local data into predictions on a small set of KT data and shares them with a central server for knowledge aggregation. Thus, among all local variables correlated with the private data, only local predictions are shared with the server. Since communicating local predictions may leak raw data, each local prediction is perturbed before it is sent to the central server. By Lemma 1, a single knowledge transfer round in PrivateKT satisfies ϵ-LDP. (The proof is in the Supplementary Information.)

Lemma 1

Given the number of knowledge transfer samples K, knowledge transfer in PrivateKT is guaranteed to satisfy ϵ-LDP if the following equation holds:

$$\beta =\frac{\exp (\frac{\epsilon }{K})-1}{\exp (\frac{\epsilon }{K})-1+C}.$$

(6)

Moreover, PrivateKT can avoid accumulating privacy budgets across knowledge transfer rounds by using the model shuffling method41,57. Thus, the whole knowledge transfer process in PrivateKT also satisfies ϵ-LDP, provided that the condition in Lemma 1 holds.
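Eq. (6) can be checked numerically against the perturbation of Eq. (2). Under Eq. (2), the worst-case probability ratio for a single prediction is (β + (1 − β)/C) / ((1 − β)/C), and substituting Eq. (6) makes this ratio equal to exp(ϵ/K). The sketch below assumes that the K per-prediction budgets simply add up (basic sequential composition), which is our reading of the ϵ/K term rather than a statement taken from the proof; the function names and the example values of ϵ, K and C are illustrative.

```python
import math

def keep_probability(epsilon, k, num_classes):
    """Eq. (6): beta for a round budget epsilon spread over k predictions."""
    e = math.exp(epsilon / k)
    return (e - 1.0) / (e - 1.0 + num_classes)

def per_prediction_epsilon(beta, num_classes):
    """Worst-case log probability ratio of Eq. (2) for one prediction:
    the output equals the true label vs. it equals some other label."""
    p_same = beta + (1.0 - beta) / num_classes
    p_other = (1.0 - beta) / num_classes
    return math.log(p_same / p_other)

epsilon, num_classes = 5.0, 10
for k in (10, 100, 1000):
    beta = keep_probability(epsilon, k, num_classes)
    total = k * per_prediction_epsilon(beta, num_classes)
    print(k, beta, total)   # total matches epsilon up to rounding error
```

For a fixed ϵ, β shrinks as K grows, so each uploaded prediction is more likely to be random. This is consistent with the motivation for keeping the KT set small.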

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


