This systematic review provides the most comprehensive analysis to date comparing decentralized learning approaches with traditional methods in healthcare, examining 160 studies comprising 710 decentralized models and 8149 performance comparisons.
The rapid growth in research output, particularly since 2020, and its multidisciplinary scope reflect increasing recognition of decentralized learning’s potential in healthcare applications.
Considering the paired comparisons between decentralized and centralized methods, performance differences show low-magnitude medians and narrow interquartile ranges. This indicates that decentralized learning (DL) approaches can broadly achieve comparable, although moderately inferior, performance. In particular, the strong relative performance in AUROC (51% centralized favourability, small effect size) suggests that the ability to rank observations is preserved through decentralized learning processes. In turn, threshold-dependent metrics, such as accuracy and Dice score, show larger centralized advantages, with mostly moderate or large effect sizes. These findings point to calibration challenges and spatial feature averaging difficulties, respectively. However, DL seems to perform comparatively well in terms of specificity (54% centralized favourability, small effect size), suggesting that aggregation processes affect error types differently. In particular, multi-site learning may filter site-specific false positive patterns while simultaneously diluting rare positive case signals, given variation in case presentation and the uneven distribution of rare cases across sites.
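As a minimal sketch of how such paired summaries could be computed for one metric, the snippet below derives a centralized favourability ratio together with the median and interquartile range of the paired differences. The function name `summarize_pairs`, the exclusion of ties from the ratio, and the example scores are assumptions made for illustration; they are not the review's actual procedure or data.

```python
from statistics import median, quantiles


def summarize_pairs(pairs):
    """pairs: list of (centralized_score, decentralized_score) for one metric."""
    diffs = [c - d for c, d in pairs]
    # Assumption: tied pairs are left out of the favourability ratio.
    decided = [x for x in diffs if x != 0]
    centralized_fav = sum(x > 0 for x in decided) / len(decided)
    q1, _, q3 = quantiles(diffs, n=4)  # quartiles of the paired differences
    return {
        "centralized_favourability": centralized_fav,
        "median_diff_pp": 100 * median(diffs),  # percentage points
        "iqr_pp": 100 * (q3 - q1),
    }


# Illustrative values only, not data from the review.
example_pairs = [(0.91, 0.90), (0.85, 0.86), (0.78, 0.75), (0.88, 0.88), (0.93, 0.92)]
summary = summarize_pairs(example_pairs)
```

A favourability ratio close to 50% combined with a small median difference would correspond to the "comparable performance" pattern described for AUROC.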
Focusing on the application viability of these models, centralized models can offer clinically useful alternatives to underperforming decentralized counterparts in up to 18% of cases. Sensitivity and accuracy benefit most from the centralized approach, consistent with DL’s limitations in identifying true positive cases.
Regarding the differences with local approaches, decentralized performance is dominant across all metrics, despite some heterogeneity. While decentralized models benefit from more, and often more diverse, data, they are less tuned to the specific distribution of a local dataset. In particular, DL demonstrates the strongest advantage in precision (86% favourability), substantially exceeding gains in other metrics.
This likely reflects multi-site models’ ability to filter out site-specific artifacts (e.g., differences in imaging protocols, scanner calibration). Local models overfit to these artifacts, leading to overconfident predictions that inflate false positive rates when encountering variation. In turn, sensitivity shows the smallest DL advantage (70%), with only a moderate effect size, likely reflecting challenges in aggregating rare or subtle pathological patterns across heterogeneous sites. Specificity shows greater improvement (76% favourability), as normal imaging features are more consistent across sites than disease presentations, and DL models learn to avoid falsely flagging benign site-specific variations. This asymmetry reflects a fundamental trade-off: local models can optimize to site-specific patterns—potentially overfitting—at the expense of external validity, whereas DL prioritizes features robust across heterogeneous sites. A similar pattern emerges when comparing DL to centralized models, where challenges in aggregating rare signals similarly constrain sensitivity improvements.
In our sensitivity analysis, which excluded observations from the articles with the most comparisons, favourability ratios generally varied by only single-digit percentage points. This supports the robustness of the data presented and of our conclusions.
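The logic of this kind of sensitivity analysis can be sketched as follows: recompute the favourability ratio after dropping the articles that contribute the most comparisons, then inspect the shift in percentage points. The record layout, the parameter `k`, the handling of ties and the example rows are all hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def favourability(rows):
    """Share of non-tied paired comparisons in which the DL model scores higher."""
    decided = [r for r in rows if r["dl"] != r["other"]]
    return sum(r["dl"] > r["other"] for r in decided) / len(decided)


def leave_out_top_articles(rows, k=1):
    """Recompute favourability after dropping the k articles with the most comparisons."""
    counts = Counter(r["article"] for r in rows)
    top = {article for article, _ in counts.most_common(k)}
    kept = [r for r in rows if r["article"] not in top]
    return favourability(kept)


# Illustrative rows only, not data from the review.
rows = [
    {"article": "A", "dl": 0.90, "other": 0.85},
    {"article": "A", "dl": 0.88, "other": 0.80},
    {"article": "A", "dl": 0.91, "other": 0.89},
    {"article": "B", "dl": 0.70, "other": 0.75},
    {"article": "C", "dl": 0.82, "other": 0.78},
]
full = favourability(rows)
reduced = leave_out_top_articles(rows, k=1)
shift_pp = 100 * (reduced - full)  # change in favourability, in percentage points
```

In this toy example a single article dominates the result, so the shift is large; a shift of only a few percentage points, as reported above, would indicate that no single article drives the pooled ratio.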
Focusing on clinical applicability, the threshold-stratified analysis (≥0.80) reveals important patterns for implementation decisions. Centralized models can rescue clinical viability from underperforming DL in up to 18% of cases, primarily for sensitivity and accuracy. This aligns with DL’s documented limitations in identifying true positive cases, particularly rare or subtle pathological patterns across heterogeneous sites.
Importantly, the clinical threshold analysis demonstrates that when both centralized and DL approaches achieve clinical viability, centralized superiority typically represents “excellent versus acceptable” performance rather than “acceptable versus inadequate.” While centralized improvements occur frequently, their magnitude is limited (median difference ranging from 0.7pp to 1.5pp). This suggests that when DL models achieve clinically acceptable performance, centralized alternatives provide only modest incremental gains. This positions DL as a viable alternative for contexts where centralized approaches are prohibited by privacy regulations or data sharing constraints.
Regarding differences with local approaches, DL demonstrates dominant performance across all metrics. The clinical rescue effect is substantial, with median improvements of 7.6–27pp depending on metric. The disproportionate improvement in threshold-dependent metrics (27pp for sensitivity) compared to ranking metrics (7.6pp for AUROC) reveals that local models suffer from overfitting to site-specific patterns and class distributions. Decentralized learning mitigates this by learning features robust across heterogeneous clinical settings, resulting in more generalizable decision boundaries. Notably, even when local models achieve clinical viability, DL frequently offers performance increases that should be considered alongside potentially superior external validity.
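The threshold-stratified reading of paired comparisons can be illustrated with a short sketch. Each pair is assigned to one of four groups according to whether the DL model and its comparator (centralized or local) reach the 0.80 benchmark, and the median difference is reported per group. The group names, the function `stratify` and the example scores are assumptions for illustration, not the review's code or data.

```python
from statistics import median

THRESHOLD = 0.80  # clinical viability benchmark stated in the text


def stratify(pairs, threshold=THRESHOLD):
    """pairs: list of (dl_score, comparator_score) for one metric."""
    groups = {"both_viable": [], "dl_rescue": [], "comparator_rescue": [], "neither": []}
    for dl, other in pairs:
        if dl >= threshold and other >= threshold:
            key = "both_viable"        # "excellent versus acceptable" territory
        elif dl >= threshold:
            key = "dl_rescue"          # only the DL model is viable
        elif other >= threshold:
            key = "comparator_rescue"  # only the comparator is viable
        else:
            key = "neither"
        groups[key].append(dl - other)
    shares = {name: len(diffs) / len(pairs) for name, diffs in groups.items()}
    # Median DL-minus-comparator difference per non-empty group, in percentage points.
    medians_pp = {name: 100 * median(diffs) for name, diffs in groups.items() if diffs}
    return shares, medians_pp


# Illustrative values only, not data from the review.
example = [(0.91, 0.92), (0.84, 0.85), (0.86, 0.70), (0.78, 0.83), (0.65, 0.60)]
shares, medians_pp = stratify(example)
```

Separating the "both viable" group from the rescue groups is what allows small differences among already acceptable models to be distinguished from differences that change whether a model is usable at all.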
Regarding additional privacy-preserving techniques and the secondary aims of this study, these data points are reported infrequently and not in a standardized fashion. Even when reported, key variables (e.g., noise levels) are often fixed, making it impossible to assess their impact in each study. Due to differences in datasets, clinical domains, clinical applications and computational set-ups, cross-study comparisons would not provide reliable insights. Overall, decentralized models are more resource-demanding than their counterparts, especially when privacy-preserving methodologies are added.
A qualitative synthesis of the evidence presents some notable patterns. Noise levels of 0.001 provide a superficial level of protection with negligible impact on performance. Memory and data transmission requirements, outside of resource-scarce environments, should not cause significant hardship for model development. While some techniques can increase development time, they rarely double the duration relative to their standard counterparts. In real-world settings, inference time may be a more relevant constraint. Depending on the techniques used, this can lead to compounded increases and may function as an effective bottleneck to the deployment of larger and more complex models.
The findings from this systematic review enable evidence-based decision-making for healthcare AI implementations that must balance privacy preservation with clinical performance requirements. To make these insights actionable, we propose a simple decision framework.
We start by highlighting when decentralized learning can be recommended. DL represents the optimal approach in three primary scenarios. First, when data sharing is legally prohibited or institutionally restricted (e.g., under GDPR constraints, cross-border regulations or institutional data governance policies), DL enables model development that would otherwise be impossible. Our analysis demonstrates DL achieves clinically acceptable performance (≥0.80) in the majority of applications, with 83% favourability over local approaches for accuracy and 82% for AUROC. Second, when local data alone yields insufficient performance, DL rescues clinical viability in 12% to 15% of cases with substantial improvements (median difference of 7.6–27pp depending on metric). Third, when external validity is prioritized over maximal performance, DL’s multi-site learning reduces site-specific overfitting, particularly valuable for precision metrics where DL shows 86% favourability over local models.
In turn, centralized approaches should be selected when privacy constraints are manageable and maximal performance is required. Centralized models demonstrate advantages in threshold-dependent metrics, particularly accuracy (78% favourability) and Dice score (78% favourability), with large effect sizes. Clinical threshold analysis reveals that these advantages typically represent “excellent versus acceptable” rather than “acceptable versus inadequate” performance. When both approaches achieve clinical viability, centralized improvements occur in 16% to 44% of comparisons, with a median difference of only 0.7–1.5pp. However, centralized approaches still provide clinically meaningful rescue in 6% to 17% of comparisons. Therefore, centralized learning is justified primarily when: (1) marginal performance improvements are clinically critical, (2) working with rare pathological patterns requiring maximum sensitivity or (3) privacy-preserving infrastructure is unavailable.
Alternatively, local-only approaches should be avoided for deployment across multiple sites or generalizable applications. Local models systematically underperform DL across all metrics, with particularly poor precision (14% favourability) due to overfitting to site-specific artifacts. The 27pp sensitivity improvement from DL versus local models in rescue scenarios indicates local approaches risk missing true positive cases when applied beyond their training environment. Local models may only be appropriate for strictly site-specific applications where external validity is not required and privacy or technical constraints prevent any data collaboration.
Decision-makers should consider that DL’s primary trade-off is not clinical inadequacy but rather marginal performance concessions (typically 1–2pp) for privacy preservation. The resource overhead—while measurable—rarely doubles development time, though inference latency may constrain deployment of complex models. Organizations should prioritize DL when regulatory compliance, institutional policies, or ethical considerations prohibit centralized data aggregation, accepting that performance will be clinically acceptable rather than optimal in most scenarios.
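The decision framework above can be condensed into a short sketch. This is a deliberate simplification under stated assumptions: the three boolean inputs, the class name `Context` and the function `recommend` are hypothetical, and real decisions would also weigh infrastructure availability, metric-specific requirements and the clinical use case.

```python
from dataclasses import dataclass


@dataclass
class Context:
    can_collaborate: bool           # other sites can take part in joint model development
    data_sharing_allowed: bool      # no legal or institutional barrier to pooling data
    max_performance_required: bool  # marginal gains (about 1-2pp) are clinically critical


def recommend(ctx: Context) -> str:
    """Return a coarse recommendation following the framework described in the text."""
    if not ctx.can_collaborate:
        # Local-only models: strictly site-specific use where no collaboration is possible.
        return "local"
    if ctx.data_sharing_allowed and ctx.max_performance_required:
        # Centralized training: privacy constraints manageable, maximal performance needed.
        return "centralized"
    # Otherwise, accept small performance concessions in exchange for privacy preservation.
    return "decentralized"


# Example: data sharing is prohibited, but sites can train collaboratively.
choice = recommend(Context(can_collaborate=True, data_sharing_allowed=False,
                           max_performance_required=True))
```

The ordering of the checks mirrors the text: local-only training is treated as a last resort, and centralized training is chosen only when pooling data is both permitted and clinically justified.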
Despite the robustness of this work, some limitations may have affected these results. Publication bias, reporting bias, and selection bias could influence which results are available for inclusion, potentially skewing the aggregated findings. No specific efforts were made to assess or address these. In addition, gray literature and publications other than primary scientific articles were not examined. Our focus on peer-reviewed publications prioritized methodological rigor and clinical applicability, although this approach may have introduced a temporal lag in capturing the most recent developments and reduced the breadth of included results. We mitigated this by searching for published versions of identified preprints and conducting updated searches through March 2024, to balance evidence quality with timeliness. However, a single moment for evidence retrieval and classification would have been preferable. While we aimed for a clear selection and definition of the decentralized learning approaches considered, we recognize that other interpretations may be valid. However, the majority of the data concerns well-established methods (e.g., Federated Learning, Swarm Learning). In addition, we recognize that some mistakes (i.e., random errors) may have occurred during our extensive process. During our peer-review process, a small number of otherwise eligible papers [33, 34] were mistakenly not considered.
Regarding the data quality of the included studies, many articles relied on secondary data or on inadequately detailed primary data collection. Both private and public datasets featured instances of insufficient numbers of participants, observations or predictors, as well as poor reporting of eligibility criteria, outcome definitions and methods used. In practice, these challenges, alongside inconsistent reporting formats, made it more difficult to identify the different health data models, their characteristics, and their performance comparisons.
Therefore, our evidence appraisal documents issues related to the primary studies used. Due to the broad scope of our research question and the comparability of the decentralized and non-decentralized model development and evaluation processes, we believe the evidence used to be of low concern for this purpose. Additionally, while clinical applicability performance thresholds vary by application and context, 0.80 provides a standardized benchmark across heterogeneous domains.
Considering the main implications of the study, this systematic review makes three novel contributions to the field: (1) quantification of favourability ratios between traditional and decentralized learning approaches across performance metrics, (2) identification of performance ranges where variations are most pronounced, and (3) clinical significance assessment through threshold-stratified analysis.
This is the first study to present a quantitative evaluation of the difference between decentralized and non-decentralized approaches at a paired comparison level and grouped by clinical application characteristics. This work demonstrates the ability of DL to produce robust ranking assessments, while still struggling to retain positive and rare signals, especially when compared to its centralized counterparts. When considering clinically relevant performance ranges, the superiority of centralized learning becomes more pronounced. Compared to local learning, DL advantages are significant, especially in AUROC, accuracy and precision, and translate into sizable performance increases when considering clinical applicability. Therefore, while decentralized learning represents a clearly superior alternative to local-only approaches, centralized learning continues to be the gold standard. However, DL offers a viable alternative for contexts in which centralized learning is not possible.
As the AI Act advocates for performance parity between traditional and privacy-preserving techniques, the quantitative synthesis of the evidence provides an objective insight for monitoring the state of the art and the evolution of these approaches. In parallel, our limited findings on the privacy-performance trade-off support the need for increased adoption of standardized privacy evaluation metrics. In particular, we recommend more rigorous comparative studies, better documentation of implementation details and a focus on practical deployment in healthcare settings. Heterogeneous and infrequent reporting does not allow for an adequate study of the dynamics between privacy-preserving guarantees and performance cost.
Considering the issues raised during the evidence appraisal of the most cited articles, and the variety of specific clinical use cases, these results cannot validate particular implementations for widespread deployment. Problems related to the reporting of sampling processes, target population definitions and data collection methods compromise the external validity of the studies considered. In addition, small variations in performance metrics, even for a specific disease, can have different clinical and operational impacts (e.g., screening versus diagnosis application). Nonetheless, we encourage the exploration of different sub-analyses in our online dashboard to identify promising research fields.
Comparing this study with similar recent reviews, this work provides a detailed and quantitative assessment of the results from the primary articles. Contemporary research mostly focuses on reporting the article and model characteristics, commonly using narrative syntheses of the primary articles [35, 36, 37, 38]. In addition, these works do not provide actionable information on the added benefit of using decentralized approaches in contrast with traditional methods already being used, nor valuable syntheses of the evidence. Moreover, to the best of our knowledge, no published review on the topic was preceded by the respective protocol publication or registry.
In this domain, future research should focus on the impact of local adaptation processes on decentralized learning performance. A two-step paradigm consisting of decentralized learning followed by local calibration may balance privacy preservation, feature generalization and clinically relevant performance. New studies on the topic should present higher methodological quality, with clearer reporting of eligibility criteria, data collection strategies, outcome definitions and model performance comparisons. For privacy-preserving reporting, guiding references (including quantitative and qualitative dimensions) are needed for comparability. While the GDPR and the AI Act intentionally do not offer specific metrics, there are alternatives [39, 40] from experts in the field.
Other topics regarding the adoption of decentralized learning methods require further discussion. From data distribution challenges to considerable technical overheads and machine “unlearning” [41] requirements, data collaboration still faces foundational constraints that may limit its widespread adoption. Meanwhile, novel methods such as local fine-tuning of pre-trained models, the advent of AI-capable personal devices and normative AI approaches [42] can help advance the development of decentralized learning models.
