Setup of notation and problems
This study employs the notation introduced by Berk et al.53. Let \(y\in {\mathscr{Y}}=\{-1,1\}\) denote the binary outcome and \({\boldsymbol{x}}\in {\mathscr{X}}={\mathbb{R}}^{d}\) denote the feature vector. Each instance belongs to one of two groups defined by the sensitive attribute \({x}_{d+1}\). The joint distribution of \({\boldsymbol{x}}\) and \(y\) is denoted by \({\mathscr{P}}\). Consider a training set \(s={\{({{\boldsymbol{x}}}_{i},{y}_{i})\}}_{i=1}^{n}\) consisting of \(n\) independent and identically distributed (IID) samples drawn from \({\mathscr{P}}\). The training set is partitioned into two groups, \({s}_{1}\) and \({s}_{2}\), according to the sensitive attribute, with \({n}_{1}\) and \({n}_{2}\) denoting the respective group sizes, so that \({n}_{1}+{n}_{2}=n\).
The \(\lambda\)-weighted fair loss for a given model is \({\mathscr{l}}\left(w,s\right)+\lambda f\left(w,s\right)\), where \({\mathscr{l}}\) denotes the standard model loss function, \(w\) denotes the model parameters, and \(\lambda\) is the regularization parameter for the fairness penalty. Following Berk et al.53, we focus on the group fairness penalty defined as
$$f\left({\boldsymbol{w}},s\right)={\left(\frac{1}{{n}_{1}{n}_{2}}\mathop{\sum}\limits_{\begin{array}{c}\left({{\boldsymbol{x}}}_{i},{y}_{i}\right)\in {s}_{1}\\ \left({{\boldsymbol{x}}}_{j},{y}_{j}\right)\in {s}_{2}\end{array}}d\left({y}_{i},{y}_{j}\right)\left({{\boldsymbol{w}}}^{\top }{{\boldsymbol{x}}}_{i}-{{\boldsymbol{w}}}^{\top }{{\boldsymbol{x}}}_{j}\right)\right)}^{2}$$
(1)
Here, \(d({y}_{i},{y}_{j})={\mathbb{1}}[{y}_{i}={y}_{j}]\) serves as the cross-group fairness weight, so that only pairs with the same label contribute to the penalty.
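As an illustration, the group fairness penalty can be computed directly from its definition. The following NumPy sketch assumes the squared-average group penalty of Berk et al.53 with \(d({y}_{i},{y}_{j})={\mathbb{1}}[{y}_{i}={y}_{j}]\); the function and variable names are ours, not part of FairFML:

```python
import numpy as np

def group_fairness_penalty(w, X1, y1, X2, y2):
    """Group fairness penalty f(w, s): the squared average of
    cross-group score differences w.x_i - w.x_j over pairs drawn
    from s1 and s2, weighted by d(y_i, y_j) = 1[y_i == y_j]."""
    n1, n2 = len(y1), len(y2)
    s1 = X1 @ w                                # scores for group s1
    s2 = X2 @ w                                # scores for group s2
    diff = s1[:, None] - s2[None, :]           # (n1, n2) pairwise differences
    d = (y1[:, None] == y2[None, :]).astype(float)  # same-label weights
    return (np.sum(d * diff) / (n1 * n2)) ** 2
```

Because the quantity inside the square is linear in \(w\), its square is convex, which is the property the optimization discussion below relies on.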
Group fairness metrics
Demographic parity (DP), also known as statistical parity, and equalized odds (EO) are two widely used algorithmic fairness definitions for binary classification.
-
A model satisfies DP over the distribution \({\mathscr{P}}\) if its prediction \(\hat{y}\) is statistically independent of the sensitive feature:
$$P\left[\hat{Y}=1|{x}_{d+1}=a\right]=P\left[\hat{Y}=1\right],\,\forall a$$
(2)
-
A model satisfies EO over the distribution \({\mathscr{P}}\) if its prediction \(\hat{y}\) is conditionally independent of the sensitive feature given the true outcome label:
$$P\left[\hat{Y}=1|{x}_{d+1}=a,Y=y\right]=P\left[\hat{Y}=1|Y=y\right],\,\forall a,y$$
(3)
This study focuses on four fairness metrics: the demographic parity difference (DPD), demographic parity ratio (DPR), equalized odds difference (EOD), and equalized odds ratio (EOR), which are computed from the definitions of DP and EO as follows:
-
\({\text{DPD}}=\mathop{\max }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a\right]-\mathop{\min }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a\right]\) measures the maximum difference in predicted outcome rates between groups. A DPD near 0 indicates more equal predictions across groups.
-
\({\text{DPR}}=\frac{\mathop{\min }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a\right]}{\mathop{\max }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a\right]}\) measures the ratio between the minimum and maximum predicted outcome rates across groups. A DPR near 1 indicates more balanced prediction rates.
-
\({\text{EOD}}=\mathop{\max }\nolimits_{y\in \{-1,1\}}\left(\mathop{\max }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a,Y=y\right]-\mathop{\min }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a,Y=y\right]\right)\) measures the difference in prediction error rates (false positive/false negative rates) between groups. An EOD near 0 indicates more equal predictions across groups.
\({\text{EOR}}=\mathop{\min }\nolimits_{y\in \{-1,1\}}\frac{\mathop{\min }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a,Y=y\right]}{\mathop{\max }\nolimits_{a}E\left[\hat{Y}|{x}_{d+1}=a,Y=y\right]}\) measures the ratio of error rates between groups. An EOR near 1 indicates more balanced prediction rates.
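The four metrics above follow directly from group-conditional prediction rates. A minimal NumPy sketch (function and variable names are illustrative; in practice these metrics are provided by packages such as Fairlearn):

```python
import numpy as np

def fairness_metrics(y_true, y_pred, group):
    """Compute DPD, DPR, EOD, EOR from binary predictions in {-1, 1}.
    `group` holds the sensitive-attribute value of each instance.
    Assumes every group (and group-by-label cell) is nonempty and the
    maximum rate in each comparison is nonzero."""
    groups = np.unique(group)
    # E[Y_hat = 1 | group = a] for each group a
    sel_rates = np.array([np.mean(y_pred[group == a] == 1) for a in groups])
    dpd = sel_rates.max() - sel_rates.min()
    dpr = sel_rates.min() / sel_rates.max()
    # group-conditional positive-prediction rates for each true label y
    diffs, ratios = [], []
    for y in (-1, 1):
        rates = np.array([np.mean(y_pred[(group == a) & (y_true == y)] == 1)
                          for a in groups])
        diffs.append(rates.max() - rates.min())
        ratios.append(rates.min() / rates.max())
    eod, eor = max(diffs), min(ratios)
    return dpd, dpr, eod, eor
```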
FairFML
We integrate the \(\lambda\)-weighted fair loss described in "Setup of notation and problems" into FL model training; the proposed FairFML workflow is shown in Fig. 2. FairFML can be incorporated into any FL framework, improving the fairness of existing FL solutions by replacing the standard model loss function \({\mathscr{l}}\) with the \(\lambda\)-weighted fair loss during FL model training. The fairness regularizer \(f\) is convex53, i.e., it has a single global minimum and no local minima. This property is essential for optimization because it guarantees that the overall objective \({\mathscr{l}}\left(w,s\right)+\lambda f\left(w,s\right)\) can be minimized efficiently without the risk of converging to a suboptimal solution. Convexity also guarantees that, as \(\lambda\) is adjusted, the trade-off between fairness and model accuracy remains stable and predictable, which is essential for effective optimization in typical FL frameworks such as FedAvg55. To prevent overfitting, we incorporate \({L}_{2}\) regularization, which yields the final loss function: \({\mathscr{l}}\left(w,s\right)+\lambda f\left(w,s\right)+\gamma {||w||}_{2}^{2}\).
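A minimal sketch of this final objective, assuming logistic loss as the standard model loss \({\mathscr{l}}\) (consistent with the logistic regression used in our experiments) and taking the fairness penalty as a user-supplied callable; names are illustrative:

```python
import numpy as np

def fair_loss(w, X, y, f, lam, gamma):
    """Final objective: l(w,s) + lam * f(w,s) + gamma * ||w||_2^2,
    where l is the logistic loss for labels y in {-1, 1} and f is a
    convex fairness penalty evaluated at the parameters w."""
    margins = y * (X @ w)                         # y_i * w.x_i
    logistic = np.mean(np.log1p(np.exp(-margins)))  # standard loss l(w,s)
    return logistic + lam * f(w) + gamma * np.dot(w, w)
```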
The trade-off between model accuracy and fairness is governed by \(\lambda\), whose appropriate value varies widely across datasets53,56. Higher \(\lambda\) values impose a greater fairness penalty, and as \(\lambda\) increases from 0 toward \(\infty\), model accuracy tends to decrease. Users must therefore choose a \(\lambda\) for each dataset that improves fairness with an acceptable reduction in model accuracy. To address this challenge, we propose a data-driven approach for efficiently selecting \(\lambda\) while minimizing computational cost. As outlined in the pseudocode (Supplementary Fig. 1, Supplementary Material), a candidate \({\lambda}_{k}\) is first selected independently by each client \(k\) by plotting prediction metrics (e.g., accuracy or mean squared error (MSE)) against \({\lambda}_{k}\). A practical method increments \({\lambda}_{k}\) in fixed steps until the prediction metric degrades relative to the unpenalized model (for example, until accuracy falls below 0.995*\({\text{acc}}_{0}\), where \({\text{acc}}_{0}\) is the accuracy of the model without the fairness penalty). The maximum \({\lambda}_{k}\) across clients is then used to define the range from which \(\lambda\) is selected for FL training across all clients.
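The client-side selection of \({\lambda}_{k}\) can be sketched as follows; the stopping rule mirrors the 0.995*\({\text{acc}}_{0}\) example above, while the step size, upper bound, and `train_and_score` callback are illustrative assumptions rather than parts of FairFML:

```python
def select_lambda_k(train_and_score, lam_step=0.1, lam_max=10.0, tol=0.995):
    """Increase lambda_k in fixed increments until the prediction
    metric (e.g., accuracy) of the penalized model falls below
    tol * acc_0, where acc_0 is the metric of the model trained
    without the fairness penalty. train_and_score(lam) is assumed
    to train a local model with penalty weight lam and return the
    chosen prediction metric."""
    acc0 = train_and_score(0.0)          # unpenalized baseline
    lam = 0.0
    while lam + lam_step <= lam_max:
        if train_and_score(lam + lam_step) < tol * acc0:
            break                        # metric degraded too far
        lam += lam_step
    return lam   # largest acceptable lambda_k for this client
```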
For each \(\lambda\) value, we use a two-stage strategy to determine the optimal \(\gamma\). First, we search over a wide, equally spaced grid of \(\gamma\) values starting from zero, and users choose the best \(\gamma\) based on the resulting changes in predictive performance and fairness metrics. Next, we narrow the search range around that value and repeat the process to finalize \(\gamma\) for the given \(\lambda\). Detailed pseudocode for selecting \(\gamma\) is provided in Supplementary Fig. 1.
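A sketch of the two-stage \(\gamma\) search, with the grid sizes and the scalar `score` callback (standing in for the user's joint assessment of predictive performance and fairness) as illustrative assumptions:

```python
import numpy as np

def select_gamma(score, gamma_max=1.0, n_grid=11):
    """Two-stage search for gamma. score(gamma) returns a scalar to
    maximize; in FairFML this judgment is made by the user from the
    predictive-performance and fairness curves."""
    # Stage 1: wide, equally spaced grid starting from zero
    coarse = np.linspace(0.0, gamma_max, n_grid)
    best = coarse[np.argmax([score(g) for g in coarse])]
    # Stage 2: narrow the search around the best coarse value
    step = gamma_max / (n_grid - 1)
    fine = np.linspace(max(0.0, best - step), best + step, n_grid)
    return fine[np.argmax([score(g) for g in fine])]
```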
Datasets and experiments
Our study population consisted of OHCA patients treated by emergency medical services (EMS) providers, as recorded in the Resuscitation Outcomes Consortium (ROC) Cardiac Epidemiologic Registry (Epistry) (version 3, covering the period from April 1, 2011 to June 30, 2015). The ROC, a North American database established in 2004, aims to advance clinical research on cardiopulmonary arrest57. Ethical approval was obtained from the National University of Singapore Institutional Review Board (IRB), and this study was granted an exemption (IRB reference number: NUS-IRB-2023-451).
Cohort formation and predictor selection followed established methodologies from out-of-hospital cardiac arrest (OHCA) studies57,58. We included patients over 18 years of age transported by EMS who achieved return of spontaneous circulation (ROSC) at any time and had complete data on gender, race, etiology, initial rhythm, witness status, response time, adrenaline use, and neurological status. The primary outcome was neurological status at discharge, measured by the modified Rankin Scale (MRS), with scores of 0, 1, or 2 classified as a good outcome. Variables used for outcome prediction included age (years), etiology of arrest (cardiac/non-cardiac), witnessed status (yes/no), initial rhythm (shockable/non-shockable), bystander cardiopulmonary resuscitation (CPR) (yes/no), response time (minutes), and adrenaline use (yes/no).
Four sets of experiments were conducted to simulate real-world cross-site data heterogeneity: (i) division by race/ethnicity into four sites, (ii) division by age into four sites, (iii) division by race/ethnicity into six sites, and (iv) division by age into six sites. Specifically, the probability that an observation was assigned to each site depended on the variable used for the division (age or race/ethnicity). As a result, the marginal distributions of predictors and outcomes were heterogeneous across sites. Continuous variables were standardized using the mean and standard deviation from the complete cohort, and logistic regression was employed for outcome prediction. We focused on two representative FL frameworks, FedAvg and Per-FedAvg33. FedAvg is a foundational FL framework, among the first proposed in the FL domain32,54, while Per-FedAvg is a widely adopted solution in personalized FL. The latter is particularly relevant for medical data analysis because it allows researchers to provide localized benefits that improve the performance of existing models at individual institutions25. Its effectiveness for personalized improvements on local datasets has also been demonstrated on healthcare data59.
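Such covariate-dependent site assignment can be sketched as follows (a hypothetical softmax weighting by age; the actual assignment probabilities used in our experiments are not reproduced here):

```python
import numpy as np

def assign_sites(age, n_sites=4, seed=0):
    """Probabilistically assign observations to sites, with the
    assignment probability depending on a splitting variable (age
    here), so that predictor (and hence outcome) distributions
    differ across sites."""
    rng = np.random.default_rng(seed)
    z = (age - age.mean()) / age.std()
    # higher-index sites increasingly favor older patients
    weights = np.exp(np.outer(z, np.linspace(-1.0, 1.0, n_sites)))
    probs = weights / weights.sum(axis=1, keepdims=True)
    return np.array([rng.choice(n_sites, p=p) for p in probs])
```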
Three types of analyses were conducted for each scenario: (1) a central model trained on the complete cohort and local models trained independently at each site, (2) federated logistic regression using FedAvg and Per-FedAvg, and (3) fairness-enhanced federated logistic regression using the proposed FairFML method applied to FedAvg and Per-FedAvg. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) and the four fairness metrics described in "Group fairness metrics", computed with the "Fairlearn" package60.
