Cohort characteristics
In this work, we use data from the publicly available Alzheimer’s Disease Neuroimaging Initiative (ADNI)31,32 for model development and the Australian Imaging, Biomarker & Lifestyle Flagship Study of Ageing (AIBL)33 for external validation. We selected all participants with a baseline diagnosis of Cognitively Normal (CN), at least one follow-up visit, and a documented medical history. ADNI contains 389 censored and 105 transitioned individuals, while AIBL consists of 290 censored and 30 transitioned individuals. Their baseline variables are summarized in Table 1. In ADNI, we observe a statistically significant difference in age between the censored (mean = 73.84, SD = 5.84) and transitioned (mean = 75.80, SD = 5.65) groups (p = 0.002). Additionally, although the difference was not significant, a higher percentage of transitioned than censored individuals were male (56.2% vs. 47.0%, p = 0.120). Several cognitive measures differed significantly between the censored and transitioned groups, including the Clinical Dementia Rating Scale Sum of Boxes (CDRSB), Alzheimer’s Disease Assessment Scale (ADAS), Rey Auditory Verbal Learning Test (RAVLT), and Functional Activities Questionnaire (FAQ). Regarding comorbidity features, Endocrine & Metabolic conditions were more prevalent in transitioned than in censored individuals (57.1% vs. 46.3%), although this difference did not reach significance (p = 0.062), whereas Renal & Genitourinary conditions were significantly more prevalent in transitioned individuals (57.1% vs. 42.2%, p = 0.009). Compared with ADNI, AIBL participants tend to be slightly younger (mean age: 72.32 vs. 73.84 years) but exhibit comparable levels of educational attainment and racial distribution. Notably, AIBL has a higher proportion of males in the censored group than ADNI does (55.9% vs. 47.0%).
AIBL also demonstrates similar trends in cognitive scores, albeit with variations in specific measures. Regarding comorbidities, AIBL differs from ADNI in the prevalence of certain conditions, suggesting potential variations in the health profiles of the two cohorts. For Endocrine & Metabolic conditions, for instance, prevalence exceeded 45% in both the censored and transitioned ADNI groups, whereas it was below 17% in both AIBL groups.
Performance of the machine and deep learning models
In this study, we conducted an extensive evaluation of machine learning and deep learning survival analysis models to predict early-stage Alzheimer’s disease progression. The models—Cox proportional hazards (CoxPH), recursive partitioning for survival trees (Rpart), random survival forest (RSF), fast random survival forest (Fast RSF), cross-validated generalized linear model via penalized maximum likelihood (CVGlmnet), DeepSurv, DeepHit, and CoxTime—were compared across four distinct feature sets (FS1, FS2, FS3, and FS4), each combining demographics, cognitive scores, and comorbidities in varying ways. The workflow of our models is detailed in the Methods section.
The performance of our machine learning models on the ADNI dataset across the four feature sets (FS1, FS2, FS3, and FS4) is presented in Fig. 1 as a heatmap of mean C-index values. All reported results are based on the unseen testing data. As a benchmark, the Cox proportional hazards model (first column) is included. Among the evaluated models, both the Fast RSF machine learning model and the DeepSurv deep learning model excelled, achieving a C-index of 0.84 when applied to FS1. The CoxPH, Rpart, and RSF models achieved C-indices of 0.82, 0.75, and 0.76, respectively. CVGlmnet and CoxTime yielded moderate results, with C-indices of 0.67 and 0.75, while DeepHit demonstrated a slightly lower C-index of 0.66. FS1 incorporates all data modalities, resulting in superior performance compared to the other feature sets. To complement these findings, we conducted bootstrapping (1,000 resamples) on FS1 to obtain 95% confidence intervals for the C-index (Table 3). Fast RSF achieved the highest mean C-index of 0.8607 [0.8107–0.9107], confirming its strong and stable predictive performance. DeepSurv and RSF followed closely with 0.8226 [0.7726–0.8726] and 0.7844 [0.7344–0.8344], respectively. A Kruskal–Wallis test revealed significant global differences (\(\chi^2 = 60.56\), df = 7, \(p < 0.0001\)), and Dunn’s post-hoc comparisons with Holm correction confirmed that Fast RSF significantly outperformed DeepSurv (p = 0.041), as well as other baseline models such as CoxPH (p = 0.002) and DeepHit (\(p < 0.001\)). These findings provide statistical support for selecting Fast RSF as the top-performing model.
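The evaluation metric used throughout can be illustrated with a minimal pure-Python sketch of Harrell’s concordance index and a percentile-bootstrap confidence interval. This is an illustrative reimplementation on synthetic inputs, not the study’s actual pipeline; the function names are our own:

```python
import random

def harrell_c_index(times, events, risks):
    """Harrell's concordance index: among comparable pairs, the
    fraction where the higher-risk subject fails earlier.
    Risk ties count as 0.5; returns 0.5 if no pair is comparable."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if i's observed event
            # precedes j's event-or-censoring time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else 0.5

def bootstrap_ci(times, events, risks, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the C-index:
    resample subjects with replacement and take empirical quantiles."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(harrell_c_index([times[k] for k in idx],
                                     [events[k] for k in idx],
                                     [risks[k] for k in idx]))
    stats.sort()
    return (stats[int(alpha / 2 * n_boot)],
            stats[int((1 - alpha / 2) * n_boot) - 1])
```

A perfectly ranked cohort yields a C-index of 1.0, a perfectly inverted ranking 0.0, and a random ranking about 0.5.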

Heatmap displaying the mean concordance index-measured performance of each machine learning algorithm with each feature set (FS) on the ADNI dataset. Mean C-index values are computed over outer CV folds, without bootstrapping. Confidence intervals are reported separately for final model comparisons. Abbreviations: Dem = demographics, CS = cognitive scores, Com = comorbidities.
Excluding comorbidities (FS3) led to a significant performance drop, with the C-index falling to 0.76 for the fast random survival forest and 0.33 for DeepSurv. This decrease highlights the pivotal role of comorbidity information in enhancing predictive power, which is especially evident in complex models like DeepSurv. The CoxPH and RSF models attained C-indices of 0.75 and 0.76, respectively. Rpart and CVGlmnet achieved scores of 0.50 and 0.68, DeepHit showed a moderate value of 0.61, and CoxTime delivered 0.66. Similarly, excluding cognitive scores (FS2) resulted in a C-index drop to 0.59 for the fast random survival forest and 0.74 for DeepSurv, an expected decrease given the importance of cognitive scores in predicting outcomes. CoxPH and CoxTime dropped to 0.62 and 0.59, respectively, Rpart and RSF showed scores of 0.50 and 0.59, while CVGlmnet and DeepHit reached 0.58 and 0.53. When using FS4, which includes only demographics, the performance of most models declined. The fast random survival forest’s C-index dropped sharply to 0.48, indicating a significant loss in predictive accuracy. Other models, such as CVGlmnet and Rpart, also showed lower C-index values of 0.50 and 0.56, respectively. Interestingly, the Cox model maintained a more stable performance than on the other reduced feature sets, and DeepHit also remained relatively stable, with a slight decrease to 0.65. Overall, including comorbidities and cognitive scores (FS1) significantly enhances the predictive accuracy of most models compared to using only demographic features (FS4).
The average C-index across all eight models was 0.76 for FS1, decreasing to 0.67 for FS3 and even lower for FS2. Notably, the fast random survival forest consistently outperformed other models when all data modalities were included.
Our top models, achieving a C-index of 0.84, surpass previous survival analysis studies conducted on the same dataset (ADNI) and cohort (CN to MCI), which attained C-index scores of 0.66 (ref. 14) and 0.68 (ref. 16), respectively. Further details of the comparison, including the features and models used in each study, are provided in Table 2. Our approach yields a significant enhancement in the early prediction of Alzheimer’s disease progression. By leveraging three cost-effective and non-invasive modalities, it compares favorably with previous approaches that relied on expensive and invasive techniques such as MRI, PET scans, and blood biomarkers. This deliberate selection of features and models demonstrates that readily available clinical data can be sufficient for accurate prediction.
Predictive features
Having identified the two top-performing models, each with a C-index of 0.84 on FS1 (demographic information, comorbidities, and cognitive scores), we next tested whether these two models differ significantly in performance from the other models, and from each other. As described in the Methods, we employed the Kruskal–Wallis test to assess whether the predicted risk scores differed significantly across the eight models trained on FS1. This non-parametric test evaluates global differences in model predictions without assuming normality. The resulting chi-squared statistic was 5.623 (df = 7, p = 0.5844), indicating no significant difference in median risk rankings among the models.
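The Kruskal–Wallis statistic used in these comparisons can be sketched in pure Python. This illustration computes only the tie-corrected H statistic; in practice a statistics library also supplies the chi-squared p-value (for df = 7, H must exceed roughly 14.07 to reach p < 0.05):

```python
def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for k groups of
    observations; under H0, H is ~ chi-squared with k - 1 df."""
    data = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(data)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # find the run of tied values starting at i
        j = i
        while j < n and data[j][0] == data[i][0]:
            j += 1
        mid_rank = (i + j + 1) / 2.0  # 1-based mid-rank shared by ties
        for k in range(i, j):
            ranks[k] = mid_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    # sum of ranks within each group
    rank_sums = [0.0] * len(groups)
    for (v, gi), r in zip(data, ranks):
        rank_sums[gi] += r
    h = (12.0 / (n * (n + 1))
         * sum(rs ** 2 / len(g) for rs, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    correction = 1 - tie_term / (n ** 3 - n)  # 0 only if all values tie
    return h / correction if correction > 0 else 0.0
```

For two fully separated groups of three, the statistic equals 27/7 ≈ 3.86, well below the df = 1 critical value of 3.84 only by the narrowest margin, which illustrates how small samples limit power.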
Despite the lack of statistical significance, we proceeded with model selection based on both predictive performance and practical relevance. Fast Random Survival Forest and DeepSurv emerged as the top-performing models by C-index, with Fast RSF achieving the highest overall score (0.84). To further support this choice, we conducted a second Kruskal–Wallis test on the bootstrapped C-index distributions across models, followed by Dunn’s post-hoc test with Holm correction. These additional analyses revealed that Fast RSF performed significantly better than most baseline models, reinforcing its suitability for downstream interpretation. Ultimately, we selected Fast RSF as the model of focus owing to its strong performance, stability across imputations, and interpretability in clinical contexts. This selection balances statistical evidence against the practical need for transparent decision-making in medical applications.
The features influencing the outcomes of the top-performing model (fast random survival forest on FS1) were analyzed further. The model first ranks features using a “permutation” method; we then select the top 10 features from this ranking: ADAS13, AGE, RAVLT learning, FAQ, ADAS11, RAVLT immediate, Renal & Genitourinary comorbidity, CDRSB, ADASQ4, and Endocrine & Metabolic comorbidity. The selected features comprise one demographic feature (age), seven cognitive scores (ADAS13, RAVLT learning, FAQ, ADAS11, RAVLT immediate, CDRSB, ADASQ4), and two comorbidities (Endocrine & Metabolic and Renal & Genitourinary). This selection aligns with the existing literature, where age emerges as the most significant risk factor in Alzheimer’s disease, reflecting its well-established association with disease progression34. Additionally, cognitive scores serve as crucial indicators of AD, further supporting their inclusion in the predictive feature set.
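Permutation-based feature ranking of the kind used here can be sketched generically: a feature’s importance is the average drop in a score when that feature’s column is shuffled. This is an illustrative reimplementation, not the survival-forest library’s own routine; `model_predict` and `score_fn` are placeholder callables:

```python
import random

def permutation_importance(model_predict, score_fn, X, y,
                           n_repeats=5, seed=0):
    """Importance of each feature = mean drop in score_fn(y, preds)
    when that feature's column is shuffled across rows of X."""
    rng = random.Random(seed)
    base = score_fn(y, [model_predict(row) for row in X])
    n_features = len(X[0])
    importances = []
    for f in range(n_features):
        drops = []
        for _ in range(n_repeats):
            col = [row[f] for row in X]
            rng.shuffle(col)  # break the feature-outcome association
            Xp = [row[:f] + [v] + row[f + 1:] for row, v in zip(X, col)]
            drops.append(base - score_fn(y, [model_predict(r) for r in Xp]))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never uses yields an importance of exactly zero, since shuffling it leaves the predictions unchanged.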
For our exploratory analysis, we employ Partial Dependence Plots (PDPs) to provide a visual representation of the relationship between each individual feature and the target variable while holding the other features constant. The PDPs for the top 10 features selected by our fast random survival forest model are shown in Fig. 2: Fig. 2a corresponds to ADAS13, Fig. 2b to AGE, Fig. 2c to RAVLT learning, Fig. 2d to FAQ, Fig. 2e to ADAS11, Fig. 2f to RAVLT immediate, Fig. 2g to Comorbidity (Renal & Genitourinary), Fig. 2h to CDRSB, Fig. 2i to ADASQ4, and Fig. 2j to Comorbidity (Endocrine & Metabolic). The blue shading in each plot, darker for higher feature values and lighter for lower ones, indicates the varying impact of the feature on the predicted survival function. For the “AGE” feature, for example, the survival curves darken and shift downward as age increases: the survival function value decreases, indicating a higher predicted probability of experiencing the event of interest (conversion to MCI). Conversely, for younger ages the curves are lighter and elevated, indicating a lower predicted risk of experiencing the event.
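The partial dependence computation underlying such plots can be sketched in a few lines: fix one feature at each grid value across all rows and average the model’s predictions. For a survival model the prediction would be a survival curve over time; this minimal sketch uses a scalar prediction for clarity:

```python
def partial_dependence(model_predict, X, feature_idx, grid):
    """1-D partial dependence: for each grid value, overwrite one
    feature in every row of X with that value and average the
    model's predictions, marginalizing over the other features."""
    pd_values = []
    for v in grid:
        preds = []
        for row in X:
            row_mod = list(row)          # copy so X is untouched
            row_mod[feature_idx] = v     # pin the feature of interest
            preds.append(model_predict(row_mod))
        pd_values.append(sum(preds) / len(preds))
    return pd_values
```

For an additive model the resulting curve recovers the feature’s own effect shifted by the average contribution of the remaining features, which is exactly what the shaded curves in Fig. 2 convey qualitatively.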

Partial Dependence Plots of top 10 selected features by the best performing machine learning model. Time x-axis is in months.

Kaplan–Meier survival curves stratified by (a) presence of endocrine/metabolic comorbidities (e.g., diabetes), (b) renal/genitourinary comorbidities, and (c) age group (\(\le 70\) vs. \(>70\) years).
When examining the partial dependence survival profiles of the comorbidity features “Renal & Genitourinary” and “Endocrine & Metabolic”, we observe two distinct survival function lines: a light blue line representing the feature’s absence (0) and a dark blue line representing its presence (1). For both features, the darker line consistently lies below the lighter line, suggesting that the presence of these comorbidities is associated with decreased survival probabilities.
To further evaluate the clinical relevance of these observations, Kaplan–Meier curves were generated for subgroups defined by the key features. Individuals with renal/genitourinary comorbidities exhibited significantly poorer survival compared to those without (Fig. 3b, log-rank p = 0.04), supporting the trend observed in the partial dependence plot. In contrast, while individuals with endocrine/metabolic comorbidities, including diabetes, also showed a lower survival trend, this difference was not statistically significant (Fig. 3a, log-rank p = 0.06). Additionally, age-stratified survival curves revealed that individuals aged above 70 years had significantly worse outcomes compared to those aged 70 or younger (Fig. 3c, log-rank p = 0.003). These results emphasize the predictive value of age and renal health in survival outcomes and reinforce the interpretability of the model’s stratification capabilities.
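The Kaplan–Meier estimator behind these curves can be sketched directly: at each distinct event time, the survival probability is multiplied by the fraction of at-risk subjects who do not experience the event, while censored subjects leave the risk set without contributing an event. An illustrative pure-Python version, not the plotting library actually used:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator.
    times  : observation times (event or censoring)
    events : 1 if the event was observed at that time, 0 if censored
    Returns (event_times, survival_probs) as step-function points."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s = 1.0
    curve_t, curve_s = [], []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n_at_t = 0
        # gather every subject observed at time t (events and censorings)
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            s *= 1 - deaths / at_risk  # multiply by conditional survival
            curve_t.append(t)
            curve_s.append(s)
        at_risk -= n_at_t              # everyone at t leaves the risk set
    return curve_t, curve_s
```

For four subjects with times 1, 2, 3, 4 and a censoring at time 3, the curve steps to 0.75, then 0.5, then 0.0; the censored subject shrinks the risk set without producing a step, which is why heavily censored cohorts such as AIBL retain usable curves.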
External validation
Our best performing model, the fast random survival forest, underwent external validation against the AIBL dataset to assess its generalizability. To strengthen this validation strategy, we adopted a two-stage approach. In the first stage, we applied the original top-performing model—trained on the full ADNI feature set (FS1)—to the AIBL dataset after aligning feature dimensions by adding missing variables with neutral values. This resulted in a C-index of 0.73, demonstrating robust predictive performance and highlighting the model’s consistent capabilities across datasets. In the second stage, we repeated the process using only the features common to both ADNI and AIBL. The fast random survival forest model was retrained on these shared features and achieved a C-index of 0.79 on the ADNI test set and 0.75 on the AIBL dataset. Finally, because zero-imputation may introduce bias, we performed a sensitivity analysis using multiple imputation on the AIBL dataset. Across five imputed versions, the model achieved a pooled C-index of 0.77, indicating improved generalizability under less biased conditions and highlighting the variability attributable to missing data. Internal performance remained stable, reinforcing the model’s reliability. The outcomes of all three validation steps are summarized in Table 4.
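The feature-alignment step in stage one (filling variables absent from AIBL with a neutral value) and the shared-feature selection in stage two can be sketched as simple helpers. The function names and the zero fill value here are illustrative assumptions, not the study’s actual code:

```python
def align_features(record, train_features, fill_value=0.0):
    """Project an external-cohort record (dict of feature -> value)
    onto the training feature order; features the external cohort
    lacks are filled with a neutral value (stage-one zero-fill)."""
    return [record.get(f, fill_value) for f in train_features]

def shared_features(train_features, external_features):
    """Features present in both cohorts, kept in training order
    (stage-two retraining on the common feature subset)."""
    external = set(external_features)
    return [f for f in train_features if f in external]
```

Stage one preserves the trained model unchanged at the cost of neutral-fill bias, while stage two trades a retraining step for a cleaner match between cohorts; the sensitivity analysis with multiple imputation probes the residual bias of the first choice.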
