Identification and validation of an explainable machine learning model for vascular depression diagnosis in the older adults: a multicenter cohort study

Participants

Study population

A total of 602 individuals participated in this study, comprising 236 patients diagnosed with VaDep and 366 controls without depression who met the inclusion criteria. This cohort served as the internal dataset for building the ML model. Participants were recruited from the Department of Neurology and the Physical Examination Center at Zhongnan Hospital of Wuhan University, along with the Shuiguohu Street Community Health Service Center, between July 2020 and October 2023. Participants formed a consecutive series. All eligible participants identified during the study period were enrolled without selection.

Diagnostic criteria and clinical assessments

Clinical assessments and the diagnosis of VaDep were conducted by at least two experienced neurologists according to clinical features, medical history, MRI scanning, and neuropsychological tests. The diagnostic criteria for VaDep followed the Vascular Depression Consensus Report [1], which includes the following: (1) evidence of vascular pathology confirmed by MRI in older adults with or without cognitive impairment; (2) absence of previous depressive episodes preceding obvious cerebrovascular disease, based on clinical interviews; (3) presence of at least one cerebrovascular risk factor (smoking, hypertension, diabetes mellitus, cardiovascular disease, or hyperlipidemia); (4) co-incidence of depression with cerebrovascular risk factors; (5) clinical symptoms characteristic of VaDep such as depression, executive dysfunction, and decrease in processing speed, evidenced by a Geriatric Depression Scale (GDS) score ≥ 10, and potentially prolonged performance on the Trail Making Test-B (TMT-B) or Victoria Stroop Test (VST); (6) neuroimaging data confirming cerebral vessel disease (CVD), such as WMH with a Fazekas score ≥ 2 and/or the presence of at least 2 typical lacunes.

Inclusion and exclusion criteria

Inclusion criteria for patients were as follows: (1) age 55 to 80 years; (2) met the diagnostic criteria for VaDep described above; (3) could provide voluntary written informed consent. The controls were within the same age range and had no depression. Participants were excluded if they (1) had a history of depressive episodes prior to CVD symptoms or before the age of 55; (2) had a family history of depression or a history of psychoactive substance abuse; (3) had experienced major life events in recent years, such as divorce or bereavement; (4) had obvious auditory or visual impairments that interfered with neuropsychological tests; (5) could not complete MRI scanning or other assessment procedures; or (6) suffered from other severe illnesses that could significantly affect emotional state, such as thyroid disease, infections, tumors, or systemic diseases.

Ethical approval and informed consent

This study was registered at ClinicalTrials.gov (ID: NCT04999813) and approved by the Medical Ethics Committee, Zhongnan Hospital of Wuhan University (ID: 2020124). All participants provided written informed consent in line with the Declaration of Helsinki.

Flow diagram of study participants

The flow diagram is shown in Fig. 1.

Data collection and processing

This was a cross-sectional diagnostic accuracy study. Data collection and testing procedures were planned and conducted before performing the index test and reference standard. Every participant underwent standardized procedures for collecting clinical data, blood samples, and MRI scans. All steps were conducted in a blinded manner to minimize bias and ensure reproducibility.

Clinical information

First, face-to-face interviews were conducted by two neurologists to obtain detailed clinical information, including demographic and health-related factors such as sex, age, body mass index (BMI), education, smoking status, physical activity levels, and medical history of hypertension, diabetes, hyperlipidemia, and heart disease. Additionally, the Framingham Stroke Risk Profile (FSRP) was assessed. Laboratory tests conducted at the hospital’s laboratory department measured fasting blood glucose, triglycerides (TG), total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), and high-density lipoprotein cholesterol (HDL-C).

Clinical variables were defined as follows. BMI was calculated as weight in kilograms divided by the square of height in meters (kg/m²). Smoking status was categorized based on participant self-report: individuals who reported smoking at least one cigarette per day within the past 30 days were classified as “Yes,” and those who reported never smoking or having quit smoking more than 30 days prior to the assessment were classified as “No.” Physical activity levels were determined through self-report and classified according to current WHO guidelines: individuals engaging in moderate-to-vigorous physical activity for at least 150 min per week were categorized as “active,” whereas those with less than 150 min per week were categorized as “inactive.” Moderate-to-vigorous activities were defined as those causing noticeable increases in heart rate and breathing, such as brisk walking, cycling, or swimming. Medical history was assessed using a combination of participant self-report and objective clinical evaluations. Hypertension was defined as either two consecutive blood pressure measurements with systolic pressure ≥ 140 mmHg or diastolic pressure ≥ 90 mmHg, or a prior diagnosis of hypertension, or current use of anti-hypertensive medication. Diabetes mellitus was defined as fasting plasma glucose ≥ 7.0 mmol/L, a prior diagnosis of diabetes, or current use of anti-diabetic medication. Hyperlipidemia was defined as total cholesterol ≥ 5.2 mmol/L, triglycerides ≥ 1.7 mmol/L, a prior diagnosis of hyperlipidemia, or current use of lipid-lowering drugs. Heart disease was broadly defined as a history of clinically diagnosed coronary artery disease, myocardial infarction, heart failure, atrial fibrillation, or other major cardiac conditions as self-reported by the participant and/or documented in medical records.
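The variable definitions above can be written as simple rules. The following is a minimal sketch rather than the authors' code: all function and argument names are hypothetical, glucose and lipid values are assumed to be in mmol/L, blood pressure in mmHg, and a blood pressure reading is treated as elevated when either the systolic or the diastolic threshold is met.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def has_hypertension(bp_readings, prior_diagnosis=False, on_medication=False) -> bool:
    """bp_readings: list of (systolic, diastolic) tuples in measurement order."""
    elevated = [s >= 140 or d >= 90 for s, d in bp_readings]
    # Two consecutive elevated measurements are required.
    two_consecutive = any(a and b for a, b in zip(elevated, elevated[1:]))
    return two_consecutive or prior_diagnosis or on_medication


def has_diabetes(fasting_glucose, prior_diagnosis=False, on_medication=False) -> bool:
    return fasting_glucose >= 7.0 or prior_diagnosis or on_medication


def has_hyperlipidemia(total_chol, triglycerides,
                       prior_diagnosis=False, on_medication=False) -> bool:
    return (total_chol >= 5.2 or triglycerides >= 1.7
            or prior_diagnosis or on_medication)


def activity_level(mvpa_minutes_per_week: float) -> str:
    """Moderate-to-vigorous activity of at least 150 min/week counts as active."""
    return "active" if mvpa_minutes_per_week >= 150 else "inactive"
```

For example, `has_hypertension([(150, 80), (120, 70)])` is false under this reading, because only one of the two measurements is elevated.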

Subsequently, cognitive and depressive assessments were conducted by a certified neuropsychologist, and the results were independently quality-controlled by a second assessor. Depressive symptoms were evaluated using the Geriatric Depression Scale-30 (GDS-30): scores below 10 indicate no depressive symptoms, scores from 10 to 20 suggest possible depression, and scores of 21 or higher indicate definite depression. Executive function was measured using the Trail Making Test-B (TMT-B) and the Victoria Stroop Test (VST).
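The GDS-30 bands can be illustrated with a short lookup. This is a sketch with a hypothetical function name; it assumes integer scores from 0 to 30 and assigns a score of 21 to the upper band.

```python
def gds30_category(score: int) -> str:
    """Map a GDS-30 total score to the bands described in the text."""
    if not 0 <= score <= 30:
        raise ValueError("GDS-30 scores range from 0 to 30")
    if score < 10:
        return "no depressive symptoms"
    if score <= 20:
        return "possible depression"
    return "definite depression"  # assumption: 21 and above
```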

MRI data acquisition and processing

Neuroimaging was performed using a 3.0 T MRI scanner equipped with a 32-channel array coil (Siemens Healthcare, Erlangen, Germany). To minimize head movement during scans, appropriate padding was used. The MRI protocol included whole-brain T1-weighted, T2-weighted, and fluid-attenuated inversion recovery (FLAIR) sequences. T1-weighted images were acquired using a sagittal three-dimensional magnetization-prepared rapid gradient echo sequence with the following parameters: repetition time (TR) = 2250 ms, echo time (TE) = 2.26 ms, inversion time (TI) = 900 ms, flip angle = 9°, field of view (FOV) = 224 × 256 mm, voxel size = 1 × 1 × 1 mm, number of sagittal slices = 176. T2-FLAIR images were acquired using the inversion recovery MATRIX sequence with the following parameters: TR = 6000 ms, TE = 388 ms, TI = 2200 ms, flip angle = 120°, echo sequence length = 848, bandwidth = 781 Hz/pixel, FOV = 512 × 512 mm, voxel size = 0.5 × 0.5 × 1 mm, number of sagittal slices = 160.

To extract MRI features as textual data, WMH was visually assessed on T2-FLAIR images using Fazekas scores ranging from 0 to 6 [14]. A lacune was identified as a subcortical, round or oval fluid-filled cavity resembling cerebrospinal fluid on imaging, with a hyperintense rim on T2-FLAIR images and a diameter ranging from 3 to 15 mm [15]. The number of lacunes was recorded. All ratings were made by two radiologists who were trained before reading scans and were blinded to clinical data and study objectives.

During model training, WMH scores and lacune counts were entered into the model as continuous variables. We also converted them into categorical variables to compare their distributions between groups. Specifically, WMH burden was categorized as none-to-mild (total Fazekas score 0–2) versus moderate-to-severe (total Fazekas score 3–6). Lacune counts were categorized as none-to-isolated (0–1 lacunes) versus multiple (≥ 2 lacunes).
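The two dichotomizations can be sketched as follows. The function names and category labels are illustrative only; the cut-offs are those given in the text.

```python
def wmh_category(fazekas_total: int) -> str:
    """Dichotomize the total Fazekas score (0-6)."""
    if not 0 <= fazekas_total <= 6:
        raise ValueError("total Fazekas score ranges from 0 to 6")
    return "none-to-mild" if fazekas_total <= 2 else "moderate-to-severe"


def lacune_category(n_lacunes: int) -> str:
    """Dichotomize the lacune count."""
    if n_lacunes < 0:
        raise ValueError("lacune count cannot be negative")
    return "none-to-isolated" if n_lacunes <= 1 else "multiple"
```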

Collection and detection of blood sample

Blood samples were collected from each participant after an overnight fast, with serum stored at − 80 °C after centrifugation and sub-packaging. Biomarkers were quantified using enzyme-linked immunosorbent assay (ELISA) or biochemical methods. Levels of interleukin-6 (IL-6), tumor necrosis factor-alpha (TNF-α), C-reactive protein (CRP), glutamate, vascular endothelial growth factor (VEGF), and platelet-derived growth factor receptor-beta (PDGFR-β) were measured using ELISA or biochemical kits from Nanjing Jiancheng (Nanjing, China). Levels of gamma-aminobutyric acid (GABA), serotonin (5-HT), and dopamine were assessed using ELISA kits from LDN (Nordhorn, Germany). The level of brain-derived neurotrophic factor (BDNF) was determined using ELISA kits from Boster (Pleasanton, USA). All measurements were conducted by independent laboratory technicians to ensure objectivity and reliability of the results.

Construction of machine learning model

For model construction, we employed a comprehensive internal dataset comprising demographics, medical histories, lifestyle factors, imaging markers, blood test results, and executive function tests. We selected the variables for their accessibility, objectivity in clinical assessment, and strong associations with VaDep as established by prior research. For categorical variables (e.g., sex), we applied one-hot encoding to convert them into binary numerical representations suitable for machine learning algorithms. For continuous variables (e.g., biomarker levels), we performed Z-score standardization so that each had a mean of 0 and a standard deviation of 1. Features with more than 20% missing values were excluded from subsequent analyses (see Additional file 1: Table S1–S2). Missing values in the remaining features were imputed using the median for continuous variables and the mode for categorical variables. Next, we tested these variables for multicollinearity; the results are provided in Additional file 1: Fig. S1. Initially, the diagnostic model included 30 clinical textual variables.
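The preprocessing steps above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the authors' pipeline: the column layout (a dict of lists with `None` for missing values), all function names, and the use of the population standard deviation for scaling are assumptions.

```python
from statistics import mean, median, mode, pstdev

MISSING_THRESHOLD = 0.20  # features with more than 20% missing are dropped


def missing_fraction(values):
    return sum(v is None for v in values) / len(values)


def impute(values, categorical):
    """Fill missing entries with the mode (categorical) or median (continuous)."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if categorical else median(observed)
    return [fill if v is None else v for v in values]


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return one 0/1 indicator column per observed level."""
    return {level: [int(v == level) for v in values]
            for level in sorted(set(values))}


def preprocess(columns, categorical_names):
    """columns: dict mapping feature name -> list of values (None = missing)."""
    out = {}
    for name, values in columns.items():
        if missing_fraction(values) > MISSING_THRESHOLD:
            continue
        is_categorical = name in categorical_names
        filled = impute(values, is_categorical)
        if is_categorical:
            for level, indicator in one_hot(filled).items():
                out[f"{name}_{level}"] = indicator
        else:
            out[name] = zscore(filled)
    return out
```

Whether the imputation and scaling statistics were estimated on the full internal dataset or on the training split only is not stated in the text; the sketch simply applies them to whatever columns it is given.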

The dataset used for model construction was randomly divided, allocating 80% for training and the remaining 20% for the test set, with consistent random seeds across all models. To mitigate overfitting and enhance the generalizability of the model, the training set was further randomly split into five equal-sized subsets, and a fivefold cross-validation was conducted ten times. A grid search technique was utilized to identify the optimal hyperparameters. The test set was then used to evaluate the model’s performance with these hyperparameters, outputting the probability of VaDep diagnosis for further analysis.
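The resampling scheme can be outlined as below. This is a sketch under assumptions: the seed value, the function names, and the `score_fn` callback (which would fit a model with the given hyperparameters and return a validation score such as AUROC) are hypothetical, and the split shown is a simple random split because the text does not say whether it was stratified.

```python
import random
from itertools import product


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Random 80/20 split of row indices with a fixed seed."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def repeated_kfold(train_idx, k=5, repeats=10, seed=42):
    """Yield (fit_idx, val_idx) pairs for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(train_idx)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, val in enumerate(folds):
            fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield fit, val


def grid_search(param_grid, score_fn, splits):
    """Return the hyperparameters with the best mean validation score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        avg = sum(score_fn(params, fit, val) for fit, val in splits) / len(splits)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

With 602 participants this yields 482 training and 120 test indices, and `list(repeated_kfold(train_idx))` contains 50 fit/validation pairs.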

Initial screening of ML models was conducted using the lazypredict package. Based on multiple performance metrics (accuracy, balanced accuracy, AUROC, and F1 score), the top 6 of 27 algorithms were selected for further training and evaluation: Extreme Gradient Boosting (XGB), Bagging, Light Gradient Boosting Machine (LGBM), Extra Trees (ET), Adaptive Boosting (AdaBoost), and Random Forest (RF). These models were subsequently developed in Python. Model performance was assessed using several metrics, including the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPR), accuracy, specificity, recall, precision, and F1 score.
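To make the two ranking metrics concrete, the sketch below computes AUROC as the probability that a randomly chosen case scores higher than a randomly chosen control (ties count one half), and estimates AUPR by average precision. Average precision is one common estimator of the area under the precision-recall curve; the text does not say which estimator was used, and the function names are illustrative.

```python
def auroc(y_true, y_score):
    """Pairwise (Mann-Whitney) formulation of the area under the ROC curve."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(y_true)
    pairs = sorted(zip(y_score, y_true), reverse=True)
    tp = fp = 0
    ap = prev_recall = 0.0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before updating the curve.
        while i < len(pairs) and pairs[i][0] == threshold:
            tp += pairs[i][1]
            fp += 1 - pairs[i][1]
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```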

Feature selection and model explanation

To enhance the interpretability of our models, we employed the SHapley Additive exPlanations (SHAP) algorithm [16] to gain a deeper understanding of the factors influencing model performance. Additionally, we aimed to identify the minimum set of necessary variables to increase clinical applicability and simplify the model’s interpretation. Therefore, for each of the 6 ML algorithms, we performed feature reduction separately. SHAP values were calculated for each feature to assess its contribution to model decision-making, and features were ranked accordingly. A sequential forward selection strategy [17] was then employed, starting from the feature with the highest contribution, incrementally adding features into the model and evaluating performance at each step. More specifically, for each algorithm, 30 consecutive models with varying feature subsets were constructed, and the performance metric (AUROC) for each model was recorded. Then, the algorithm demonstrating the best overall performance across these 30 models was selected.
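The ranking and incremental evaluation can be sketched as follows. This outline assumes features are ranked by mean absolute SHAP value (a common choice; the text does not name the aggregate) and takes a hypothetical `evaluate` callback that would train a model on a feature subset and return its AUROC. The SHAP values themselves are assumed to be supplied as one row per sample.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """shap_values: list of per-sample rows, one SHAP value per feature."""
    n = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(feature_names, key=lambda f: importance[f], reverse=True)


def forward_selection_curve(ranked_features, evaluate):
    """Evaluate nested subsets: top 1, top 2, ..., all ranked features."""
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        curve.append((k, subset, evaluate(subset)))
    return curve
```

With 30 candidate features, `forward_selection_curve` returns 30 entries per algorithm, matching the 30 consecutive models described above.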

After selecting the algorithm, we observed that as the number of features increased, model performance gradually stabilized. We determined the final stopping point when adding more features no longer significantly improved performance, thereby achieving the optimal balance between model complexity and diagnostic accuracy. This process of feature reduction resulted in a final model.
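One way to make the stopping point explicit is a plateau rule: choose the smallest feature count whose AUROC is within a tolerance of the best AUROC reached by any larger model. The text does not report the criterion actually used, so the rule, the function name, and the tolerance of 0.005 are all assumptions for illustration.

```python
def stopping_point(aurocs, tol=0.005):
    """aurocs[i] is the AUROC of the model using the top i+1 features.

    Returns the smallest feature count after which no larger model
    improves AUROC by `tol` or more.
    """
    for i, score in enumerate(aurocs):
        if max(aurocs[i:]) - score < tol:
            return i + 1
    return len(aurocs)
```

For instance, with AUROC values of 0.70, 0.80, 0.86, 0.90, 0.902, 0.903, and 0.901, the rule stops at four features, since no later model gains 0.005 or more.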

Finally, SHAP analysis was applied again to re-interpret the contribution of each feature in the final model, providing a more comprehensive understanding of the features driving the model's classification and enhancing its interpretability.

External validation

For external validation, an independent dataset comprising 75 VaDep patients and 96 controls, who met identical inclusion and exclusion criteria, was compiled. These participants were recruited from the General Hospital of the Yangtze River Shipping and Third People’s Hospital of Hubei Province between August 2022 and April 2024. All pre-processed variables were input into the trained final model, and model performance was comprehensively evaluated by calculating AUROC, AUPR, and other relevant metrics.

The detailed ML workflow used in this study, from model construction to external validation, is shown as a flowchart in Additional file 1: Fig. S2.

Expert clinical validation

To compare the ML model with conventional diagnostic methods for VaDep, we invited six neurologists to independently assess 100 participants. In clinical practice, the diagnosis of VaDep primarily relies on clinicians’ comprehensive evaluation of a patient’s clinical data. For this purpose, we employed a dedicated dataset consisting of 100 participants selected from the internal test set using a stratified random sampling approach. Specifically, the internal test set was first stratified into two strata: a case group (VaDep patients) and a control group (50 participants each, maintaining a 1:1 balance). Within each stratum, simple random sampling was performed using Python’s random number generator (the random module) to select 50 unique participants.
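The sampling step can be sketched with Python's `random` module. The participant identifiers, the seed, and the function name are hypothetical; `random.Random.sample` draws without replacement, so the selected participants within each stratum are unique.

```python
import random


def stratified_sample(labels, n_per_stratum=50, seed=2024):
    """labels: dict mapping participant ID -> group label.

    Draws `n_per_stratum` unique participants from each group.
    """
    rng = random.Random(seed)
    selected = []
    for stratum in sorted(set(labels.values())):
        ids = sorted(pid for pid, lab in labels.items() if lab == stratum)
        selected.extend(rng.sample(ids, n_per_stratum))
    return selected
```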

To closely simulate real-world clinical scenarios, each neurologist was provided with a complete dataset for every participant, encompassing demographic information, medical history, neuroimaging data and scans, laboratory test results, and cognitive assessment outcomes. The neurologists were instructed to assign a confidence score ranging from 0 to 100 for each case [18], with higher scores indicating a greater likelihood of a VaDep diagnosis. For statistical analysis, these confidence scores were normalized by dividing by 100, thereby converting them into probability values ranging from 0 to 1, consistent with the output scale of the ML model. The average confidence score across the six neurologists was then calculated and used to evaluate the overall diagnostic performance of the clinicians.
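The rescaling and averaging can be written in a few lines. This is a minimal sketch with a hypothetical function name and input layout (one list of 0 to 100 scores per neurologist, with cases in the same order).

```python
def mean_clinician_probability(scores_by_rater):
    """Return the per-case mean confidence, rescaled from 0-100 to 0-1."""
    n_raters = len(scores_by_rater)
    return [
        sum(case_scores) / (100 * n_raters)
        for case_scores in zip(*scores_by_rater)
    ]
```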

In the statistical analysis, we first evaluated the inter-rater agreement among the six neurologists. Pairwise Pearson correlation analyses were conducted between the confidence scores assigned by each physician to calculate correlation coefficients for each pair. Subsequently, the intraclass correlation coefficient (ICC) and its 95% confidence interval were derived to quantify overall consistency. Finally, the diagnostic performance of the neurologists was compared with that of the model based on the AUROC and AUPR.

Statistical analysis

Continuous variables were described as mean ± standard deviation for normally distributed data, and median [interquartile range] for skewed data, with categorical variables reported as counts (percentage). The comparison between VaDep patients and controls involved using the independent samples t-test for normally distributed continuous variables, the Mann–Whitney U test for non-normally distributed variables, and the chi-square test for categorical variables. Normality was assessed using the Shapiro–Wilk test and visual inspection of histograms. Missing values were imputed using the median for continuous variables and mode for categorical variables.

To assess model performance, we calculated accuracy, specificity, and F1 score (the harmonic mean of recall and precision). These were computed using the formulas \(\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}\), \(\frac{\text{TN}}{\text{FP}+\text{TN}}\), and \(\frac{2\,\text{TP}}{2\,\text{TP}+\text{FP}+\text{FN}}\), respectively, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. Receiver operating characteristic (ROC) and precision-recall (PR) curves were generated, and the areas under these curves (AUROC and AUPR, respectively) were calculated to serve as the primary performance metrics. The 95% confidence intervals for these metrics were established using 1000 bootstrap samples. ROC comparisons were performed using the non-parametric DeLong method in MedCalc software, version 20.
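The threshold-based metrics and the bootstrap interval can be sketched as follows. The percentile method, the seed, and the function names are assumptions (the text gives only the number of bootstrap samples), and `metric` is any callable taking true labels and scores.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def specificity(tp, tn, fp, fn):
    return tn / (fp + tn)


def f1_score(tp, tn, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(y_true, y_score)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        if len(set(resampled_true)) < 2:
            continue  # skip resamples missing one class
        stats.append(metric(resampled_true, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```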

Spearman correlation analysis was conducted to evaluate the relationships between features of the final model and depression scores (GDS) within the internal dataset, adjusting for sex, age, and education as covariates. The Benjamini–Hochberg false discovery rate (FDR) was applied for multiple comparisons, with statistical significance set at p < 0.05. The model construction and visualization processes were implemented using Python (v3.10.9) with libraries such as scikit-learn (v1.0.2), xgboost (v2.0.3), numpy (v1.24.3), pandas (v2.0.1), and scipy (v1.14.0). Additional statistical analyses were carried out using R (v4.4.1) and IBM SPSS Statistics 27.
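The Benjamini–Hochberg adjustment mentioned above can be illustrated with a short standard-library sketch. The function name is hypothetical; the returned values are adjusted p-values in the original order, to be compared against the 0.05 threshold.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```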