Data collection, label generation, and robustness
Panel a of Fig. 1 illustrates our data generation workflow. We collected stress test ECG data from 3522 consecutive adult patients who underwent a standard25 rest/stress myocardial perfusion single-photon emission computed tomography (SPECT) protocol at a tertiary hospital as part of the BASEL VIII study (NCT01838148). Patients were referred with symptoms possibly related to inducible myocardial ischaemia and clinical suspicion of stable coronary heart disease. If a patient was not able to reach their target heart rate, a pharmacological protocol with either adenosine or dobutamine was initiated by the treating clinician. Individuals for whom stress test by bicycle ergometry was not possible were put on a pharmacological protocol from the start. To compare the algorithmic approaches with expert judgement, the treating cardiologist performed a clinical assessment before and after stress testing: considering all available medical information such as (cardiac) history, relevant symptoms, risk factors, (stress) ECG, prior imaging and more, they indicated the probability of the presence of fCAD on a visual analogue scale (VAS) from 0% to 100%26,27,28,29. Representing clinical practice, adjudication of functionally relevant CAD was not formally blinded for stress ECG results or demographics and was performed centrally by an expert team composed of a nuclear medicine physician and a cardiologist assessing myocardial perfusion scans. Furthermore, whenever available, adjudication was refined with coronary angiography and fractional flow reserve assessment. Of the 3522 eligible patients who provided written informed consent, 701 (20%) patients underwent coronary angiography within 3 months, with 30 (0.9%) patients being reclassified to the fCAD group and 74 (2.1%) being reclassified to the non-ischaemic group. 
The VAS score the treating cardiologist provides after the stress test but before they get access to the imaging results represents the cardiologist baseline in our study. In practice, this can be interpreted as an indicator as to whether the cardiologist would recommend a follow-up examination with advanced imaging.

a Data acquisition: We highlight the three primary subgroups of exercise stress testing: ① patients who complete the bicycle exercise stress test, ② patients who are not able to exercise on the bicycle and for whom a pharmacological protocol is used from the beginning of the stress test, and ③ patients who start on the bicycle but need pharmacological support to reach their target heart rate. Doctors perform myocardial perfusion scans at rest (rest MPS), and at the target heart rate (stress MPS). Myocardial perfusion is quantified by the myocardial perfusion scan summed rest score (MPSSR score), and the MPS summed stress score (MPSSS score). The cardiologist estimates the probability of a functionally relevant CAD (fCAD) before and after the stress test (Pre/Post-Test CAD Probability in the figure). The binary label indicating the presence of fCAD (yellow box) is adjudicated by considering the stress test results and additional relevant clinical parameters. b Data Preprocessing: Following smoothing and outlier removal, time series that serve as input to the neural network are constructed by joining short subsequences from different phases of the stress test. For this, 2 s from the pre-stress phase, 6 s from the stress phase, and 2 s from the recovery phase are sampled and concatenated multiple times for a single patient (green, orange, and purple sequences). x-axes represent time in seconds. c Machine Learning: For our neural network approach (CARPEECG), these 2-6-2 sequences are fed into a residual neural network (ResNet). In parallel, the patient’s static clinical data are processed by a 2-layer feedforward network. Four subnetworks are trained on three auxiliary tasks (i.e., MPSSR & MPSSS score as well as stress type prediction) and one main task (fCAD prediction). We average predictions of the main task over all 2-6-2 sequences per patient. Purple arrows in front of each task indicate the direction of the learning signal.
The same clinical variables as for CARPEECG are used to train a random forest classifier (CARPEClin.); nodes are coloured to enhance legibility. We combine both predictions with the cardiologist’s judgement in a logistic regression model (CARPEColl.).
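The 2-6-2 construction and the per-patient averaging described in panels b and c can be sketched as follows. This is a minimal illustration and not the study's preprocessing code: the function names, the sampling rate `fs`, the random placement of each window within its phase, and the number of sequences per patient are assumptions made for the example.

```python
import random

def sample_262(pre, stress, recovery, fs, rng):
    """Join 2 s of pre-stress, 6 s of stress and 2 s of recovery signal.

    `fs` is the sampling rate in Hz. Window positions are drawn uniformly
    at random within each phase (an assumption; the source only states
    that the segments are sampled).
    """
    def window(signal, seconds):
        n = int(seconds * fs)
        start = rng.randrange(len(signal) - n + 1)
        return signal[start:start + n]

    return window(pre, 2) + window(stress, 6) + window(recovery, 2)

def patient_score(pre, stress, recovery, fs, model, n_sequences=8, seed=0):
    """Average the main-task prediction over several 2-6-2 sequences."""
    rng = random.Random(seed)
    scores = [model(sample_262(pre, stress, recovery, fs, rng))
              for _ in range(n_sequences)]
    return sum(scores) / len(scores)
```

Here `model` stands for any callable that maps one 2-6-2 sequence to a score; in the study this role is played by the trained network.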
The data set was split into a development (75%) and a held-out test set (25%). All patients in the development set enrolled in the study from Jan. 2010 through Dec. 2014; the held-out test set contains patients who enrolled from Dec. 2014 through May 2016. The held-out test set was only released and used once the models’ parameters were fixed. Thus, high predictive performance on the held-out test set indicates the robustness of our system’s generalisation capability with respect to a temporal shift of the data30, paving the path towards subsequent real-world applications. Lastly, we use external data from two Israeli medical centres to validate our system on 916 consecutive patients referred for SPECT MPI testing, whose ECG signals were obtained by treadmill stress test. This evaluation scenario is designed to exemplify the ability of both computational approaches to generalise to patients from unseen institutions and new modalities, and to highlight their behaviour under distributional shifts. Given an fCAD prevalence of 7.5% in the external data set, our approach based on clinical data alone (AUROC: 0.75 ± 0.004, AUPRC: 0.19 ± 0.01) is outperformed by our deep neural net using ECG time series and clinical variables (AUROC: 0.80 ± 0.01, AUPRC: 0.28 ± 0.01). Please refer to the Methods section for more details on data splitting and distributional shifts in the external validation data.
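A temporal split of this kind can be sketched in a few lines. The record layout (a dictionary with an `enrolled` date) and the cutoff date used in the usage note are hypothetical; the text only states that the boundary falls in Dec. 2014.

```python
from datetime import date

def temporal_split(patients, cutoff):
    """Split records by enrolment date instead of at random.

    Patients enrolled before `cutoff` form the development set; all later
    patients form the held-out test set, so the test set is shifted in time.
    """
    development = [p for p in patients if p["enrolled"] < cutoff]
    held_out = [p for p in patients if p["enrolled"] >= cutoff]
    return development, held_out
```

For example, `temporal_split(records, date(2014, 12, 15))` would separate records at an illustrative mid-December cutoff. Unlike a random split, no patient in the test set was enrolled before any patient in the development set.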
Development of an ensemble predictor and a multi-task neural network for functionally relevant CAD prediction
The ability to learn from raw sequential data (i.e., time series) makes neural networks a popular approach for healthcare applications. However, conventional machine learning (ML) has been shown to be at least as powerful as deep learning in the clinical context22,23, thus creating opportunities for low-cost deployments that do not require specialised hardware. Therefore, we also compare conventional ML methods with deep learning models. To this end, we select a small set of eight non-sequential, easy-to-access variables on which we train four conventional ML methods (i.e., decision trees, random forests31, logistic regression, support vector machines32). These variables include age, weight, biological sex, height, heart rate at rest, systolic and diastolic blood pressure, and presence of a previous CAD. The best-performing approach (a random forest) was selected via 5-fold cross-validation. We refer to all developed predictors as Coronary ARtery disease PrEdictor (CARPE). Based exclusively on clinical data, we refer to the random forest model as CARPEClin.. Additionally, we develop a neural network approach, CARPEECG, that uses the aforementioned non-sequential variables and the ECG signal, as illustrated in panel c of Fig. 1. We trained CARPEECG via a multi-task learning33 (MTL) architecture with residual layers15,34 at its core using the torchmtl35 package. MTL uses so-called auxiliary tasks (i.e., prediction targets) related to a main task (e.g., fCAD prediction). These domain-specific inductive biases ensure improved and robust predictive performance on the main task36. As shown in Fig. 1, we train CARPEECG on three auxiliary tasks (blue boxes), two of which (MPSSR and MPSSS) quantify the heart’s perfusion capabilities without and under stress, respectively. The third auxiliary task is to predict whether a patient received any pharmacological support to perform the stress test. Each auxiliary task impacts performance on the main task differently. 
Their respective importance weights were selected in a grid search on the three best-performing leads (see Supplementary Fig. 5 and Supplementary Table 5). To gain insights into the importance of static features and ECG segments, we used SHAP (SHapley37 Additive exPlanation) values38.
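To make the role of the importance weights concrete, the sketch below combines a main-task loss with weighted auxiliary losses and enumerates a weight grid. The weighted-sum form is a common multi-task formulation and is assumed here; the text does not spell out the exact objective, and the task names and candidate weights are illustrative.

```python
from itertools import product

def multitask_loss(main_loss, aux_losses, aux_weights):
    """Main-task loss plus a weighted sum of auxiliary-task losses.

    `aux_losses` and `aux_weights` are dictionaries keyed by task name.
    A weight of 0 switches the corresponding auxiliary task off.
    """
    return main_loss + sum(aux_weights[name] * loss
                           for name, loss in aux_losses.items())

def weight_grid(task_names, candidates):
    """Yield every assignment of candidate weights to the auxiliary tasks."""
    for combo in product(candidates, repeat=len(task_names)):
        yield dict(zip(task_names, combo))
```

In a grid search, each weight assignment would be used to train a model, and the assignment with the best validation performance on the main task would be kept.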
Finally, we combine predictions from the ensemble model and deep learning approach with the cardiologist’s post-test judgement by training a new logistic regression model on all three scores from the training set. This way, we leverage the experience and domain knowledge of the cardiologist while adding the potential to benefit from supervised learning techniques. We believe that, in practice, such a collaborative approach has the highest chances of being accepted in a clinical setting not only because it reaches the highest diagnostic performance (see Supplementary Table 6) but also because the cardiologist is an integral part of the score generation (in fact they are required to provide a VAS score after stress testing). Nevertheless, the use of logistic regression to enhance diagnostic accuracy does not ensure or directly translate to clinical utility. The precise impact on patient risk stratification needs to be assessed separately.
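The collaborative model can be illustrated with a small logistic regression that takes the three scores as its only inputs. The sketch below uses plain gradient descent so that it is self-contained. The optimiser, learning rate, and toy scores are assumptions for illustration and do not reproduce the study's fitting procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log-loss.

    Each row holds the three scores for one patient, for example
    (random forest score, neural network score, cardiologist VAS / 100).
    """
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(weights, bias, x):
    """Combined probability of fCAD for one patient's three scores."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
```

The fitted weights indicate how strongly each of the three inputs contributes to the combined score, which keeps the cardiologist's judgement an explicit part of the prediction.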
Machine learning can be used to reduce unnecessary perfusion imaging
The prevalence of centrally adjudicated fCAD was 32.9% in the full study cohort and 28% in the held-out test split. Figure 2 depicts the diagnostic performance of our machine learning approaches, the cardiologist’s assessment after stress testing, and a computational approach that uses the ECG’s ST-segment depression39,40 as an indication of the presence of fCAD (see Methods for a detailed description) on the held-out data set. We show receiver operating characteristic (ROC) and precision-recall curves in the first row. Standard deviations shown as envelopes were obtained using bootstrapping, as detailed in the Methods section. Regarding the mean area under the ROC curve, we observe that CARPEECG (0.71) and CARPEClin. (0.70) outperform both the ST-depression algorithm (0.58) and the cardiologist (0.64). In regions of high specificity, CARPEClin. drops below the sensitivity of the cardiologist, while CARPEECG reaches comparable predictive performance (see inline plot). At the other extreme of the ROC curve, i.e., at high sensitivity values, both machine learning approaches consistently lead to a higher specificity than the cardiologist’s judgement (see inline plot).
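The mean ± standard deviation of an AUROC over bootstrap draws can be computed as sketched below. The exact resampling scheme is described in the Methods; this example assumes patients are resampled with replacement and that labels are coded as 0 and 1.

```python
import random
import statistics

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc(labels, scores, n_draws=25, seed=0):
    """Mean and standard deviation of the AUROC over bootstrap resamples."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    idx = range(len(labels))
    values = []
    while len(values) < n_draws:
        draw = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in draw]
        if 0 < sum(ys) < len(ys):  # skip resamples containing one class only
            values.append(auroc(ys, [scores[i] for i in draw]))
    return statistics.mean(values), statistics.stdev(values)
```

The pairwise formulation is quadratic in the number of patients, which is acceptable for a sketch; a rank-based implementation would be preferable for large cohorts.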

ROC and PR-curve. Predictive performance of our deep learning-based approach (CARPEECG), a random forest based on clinical data (CARPEClin.), the cardiologist, and ST depression in terms of mean performance ± standard deviation (envelopes) over n = 25 bootstrap draws. The upper plots show that both machine learning approaches outperform the cardiologist in terms of area under the receiver operating characteristic and precision-recall curves. In regions of high specificity (inline plot), the neural network is on par with the cardiologist while CARPEClin. exhibits worse performance. Both machine learning methods outperform the cardiologist’s judgement in regions of high sensitivity (inline plot). Decision Curve: First row: Net benefit43 plot for CARPEECG (green), CARPEClin. (orange), the cardiologist (purple), a myocardial perfusion scan (MPS) for no patient (black), and MPS for all patients (dashed grey). CARPEColl. is not shown as it is visually indistinguishable from CARPEECG. Net benefit puts both benefit and harm on the same scale. In our case, we consider harm to be inflicted by performing an unnecessary MPS. At a decision threshold of 5%, all approaches lead to a similar net benefit. At the second threshold of 15%, CARPEClin. and the cardiologist demonstrate a net benefit similar to performing MPS on all patients, with CARPEECG leading to a higher net benefit. Second row: Potential MPSs avoided compared to the cardiologist’s strategy: While the conventional ML model and deep learning avoid approximately the same number of MPSs at the decision threshold of 5% (11.5% and 12.8%, respectively), the gap increases at the pre-MPS threshold of 15% (15.3% and 5.3%, respectively). Envelopes in both rows show 95% confidence intervals around the mean over n = 25 bootstrap draws. Source data are provided as a Source Data file.
Decision curves41 (rows two and three in Fig. 2) overcome the drawbacks of conventional performance evaluations and calibration analyses42 by focusing on a predictor’s clinical value. The concept of net benefit quantifies the trade-off between diagnosing sick patients and preventing healthy patients from being exposed to harmful testing procedures43. For a specific decision threshold probability of a diagnostic tool, a larger net benefit indicates a greater number of true positive predictions without an increase in the rate of false positives and, conversely, a greater number of true negative predictions without an increase in false negatives. Figure 2 shows a decision curve analysis in the second and third row with pre-test rule-out cutoffs (dotted red) as advocated for in European and US-American guidelines8,21, which consider probability thresholds between 5% and 15% for further non-invasive imaging. The European guideline, for instance, considers non-invasive testing in patients with a probability >15% as most beneficial and testing in patients with 5–15% as potentially beneficial. Our machine learning models lead to a higher net benefit than the cardiologist’s assessment at all thresholds. Notably, at the threshold of 15%, relying on the cardiologist’s judgement is worse (in terms of net benefit) than performing myocardial perfusion imaging on all patients, which demonstrates the value of an ML-based method.
Table 1 offers a detailed decision curve analysis, showing sensitivity, negative predictive value (NPV), and percentage of avoided myocardial perfusion imaging compared to the cardiologist’s judgement at three probability cut-off values. We also show the percentage of patients who received a score below the cutoff threshold to enable a meaningful interpretation of sensitivity values. The highest fraction of MPIs, i.e., almost 25 per 100 patients, could be avoided at a decision threshold of 10% by using CARPEColl. as a risk stratification method due to risk-overestimation by the cardiologist. That being said, cutoff thresholds should not be chosen to optimise diagnostic performance, but they represent the cardiologist’s minimum probability of disease at which an intervention would be warranted42. In other words, if a cardiologist holds the belief that missing a patient who suffers from fCAD is nine times worse than performing an unnecessary MPI, a model’s performance should be assessed at the 10% cutoff.
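The link between a cutoff and the "nine times worse" statement follows from the standard definition of net benefit in decision curve analysis: a threshold probability p_t corresponds to a harm-to-benefit ratio of (1 − p_t)/p_t, and each false positive is weighted by p_t/(1 − p_t). The sketch below implements these textbook formulas. The function names are illustrative and the code is not taken from the study.

```python
def harm_to_benefit_odds(threshold):
    """How many times worse a missed case is than an unnecessary test.

    For a threshold probability p_t this ratio is (1 - p_t) / p_t.
    """
    return (1.0 - threshold) / threshold

def net_benefit(labels, scores, threshold):
    """Net benefit of testing every patient with a score >= threshold."""
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_test_all(labels, threshold):
    """Reference strategy: perform the test on all patients."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Evaluating `harm_to_benefit_odds` at thresholds of 0.10 and 0.05 gives ratios of approximately 9 and 19, matching the interpretation of the 10% and 5% cutoffs. The strategy of testing no patient has a net benefit of zero by definition.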
Evaluating CARPEECG as a predictive model on all patients of the held-out test set (at the 15% decision threshold) shows the potential to reduce perfusion imaging by 15.3% (see Table 1) without increasing the rate of false negatives. This number increases to 17.3% when using CARPEColl.. We observe similar behaviour in patients without a CAD history. At the 5% threshold (i.e., if a physician considers it 19 times worse to miss an fCAD diagnosis than to perform an unwarranted MPI), ML can be used to avoid 10.8% of the imaging ordered by a cardiologist. For patients with CAD history, the decision thresholds of 5% and 10% lead to a particularly small number (<1% or none) of patients for which fCAD can be ruled out, which inflates sensitivity and NPV of CARPEECG and CARPEColl.. This inflation is particularly pronounced in CARPEClin. (see Supplementary Fig. 6) which is therefore not shown here. Overall, these results demonstrate the potential clinical utility of the proposed methods to reduce potentially unwarranted MPIs.
Subcohort analysis: machine learning performs particularly well on younger patients
Trustworthiness and interpretability are of fundamental importance in the development of risk stratification models in cardiology44. Identifying (sub)cohorts of the population for which the model performs particularly well or poorly is crucial. To address the issue of trust, we evaluate our models’ performances on a variety of subcohorts that are important in the context of exercise stress testing. Regarding interpretability, we perform an analysis of SHAP values38 on the population level, and a case study to better understand the impact feature values and ECG segments have on the predicted scores.
Clinically significant subgroups include patients who underwent exercise stress testing versus patients who required pharmacological testing as well as patients without a prior history of CAD versus patients with a known history of CAD; the odds of suffering from fCAD are significantly increased (p = 2.26E-40, two-sided Fisher’s exact test, test statistic = 2.64) for patients with previous CAD (OR: 2.64, 95% CI: 2.28–3.05) over the whole cohort. To obtain a more detailed performance breakdown, we also stratify the data by sex and age. Diagnostic performances of all approaches and subcohorts of the held-out test set are shown in Fig. 3 and Supplementary Table 6. For comparison, the performance of the CAD consortium model45 and the currently used ESC pre-test probabilities for obstructive coronary artery disease8,9, both based on age, sex, and the nature of symptoms, is shown in patients without known coronary artery disease. First, we assess the performance of individual machine learning methods before discussing their combination with the cardiologist’s judgement. Deep learning outperforms the cardiologist in terms of both AUROC (significant performance increase in 6/10 subcohorts) and AUPRC (significant performance increase in 4/10 subcohorts), while CARPEClin. exceeds the human baseline in 5/10 strata in terms of AUROC and 2/10 subcohorts in terms of AUPRC. The central plot in panel a of Fig. 4 helps explain this performance discrepancy: The conventional ML model relies more than the neural network on the CAD history and sex variable as visually observable by the large gap between the highest negative and the lowest positive SHAP value for each variable. Strong reliance on a given variable pushes the predictor too strongly in one direction such that other features cannot compensate for this influence on the final score. This SHAP analysis confirms the importance of the “CAD history”, “sex”, and “age” variables as observed in other studies19,20.

Performance breakdown over different subcohorts and n = 25 bootstrap draws. The dashed black line indicates the AUROC of a random classifier. Over the full cohort (All Patients), both CARPEClin. and CARPEECG reach a statistically significantly higher AUROC than the cardiologist. Additionally, the collaborative approach (CARPEColl.) significantly increases predictive performance over CARPEECG. Please refer to Supplementary Table 6 and Supplementary Fig. 4 for more details. Box plots indicate median (middle line), 25th, and 75th percentile (box). Whiskers extend to points that lie within 1.5 IQRs of the lower and upper quartile. Diamonds are outliers. Error bars in the bar plots indicate 95% confidence intervals. Source data are provided as a Source Data file.

a Bar plots show the mean absolute SHAP value for all clinical variables used by our predictors. Purple scatter plots show individual data points. CAD history and sex are the most important clinical features for both classifiers. The central scatter plots show the impact individual feature values have on the prediction score. High feature values are depicted in dark blue, low values in light green. SHAP values for an existing CAD history are always positive. Similarly, SHAP values of the “sex” feature are always positive for male patients. We depict SHAP value distributions over all ages in the scatter plots on the right-hand side. b SHAP values for clinical variables and one 2-6-2 sequence of a patient. The first row shows the feature distribution of the development data set (n = 2648) in green. The blue cross marks where in the distribution the patient lies. Second row: SHAP values for the specific patient for each feature over n = 5 splits. The absence of a CAD history and the resting heart rate of 67 BPM result in negative SHAP values. The patient’s sex (male), his age, and systolic blood pressure at rest are associated with higher SHAP values. Last row: One of the patient’s 2-6-2 sequences (black) with the SHAP values of each individual measurement in the background. We show negative SHAP values in dark purple and positive ones in yellow. Dashed black lines mark the borders of pre-stress, stress, and recovery samples. The largest areas of high SHAP values concentrate in the stress phase around the ST-segment. Error bars in all plots indicate 95% confidence intervals over all models from all five splits. Box plots indicate median (middle line), 25th, and 75th percentile (box). Whiskers extend to points that lie within 1.5 IQRs of the lower and upper quartile. Diamonds are outliers. Bar plots show the mean over n = 5 test splits with error bars indicating 95% confidence intervals. Source data are provided as a Source Data file.
Overall, the discriminative performance was highest (excluding CARPEColl.) in younger patients (CARPEECG AUROC: 0.78 ± 0.04) in general and in younger patients who did not require pharmacological support specifically (CARPEECG AUROC: 0.79 ± 0.04). The former cohort also represents the stratum in which the increase over the cardiologist is the highest, namely 0.19 in AUROC and 0.15 in AUPRC. We hypothesise that, similar to the conventional ML model (i.e., a random forest), a cardiologist might be more biased towards a negative diagnosis in younger patients. In contrast, the DL model is more robust to such behaviour (as shown by the SHAP distribution of the age variable in Fig. 4). We show a more detailed assessment of diagnostic performance in different age groups in Fig. 5.

On the x-axes, we show different age groups in the held-out test and external validation set. Left y-axes: area under the receiver operating characteristic curve (AUROC). Right y-axes: percentage of patients who comprise the respective subgroup of the x-axis. No cardiologist’s judgement is available in the external validation set, hence CARPEColl. cannot be evaluated. The performance difference between random forest and CARPEECG is strongest in the external validation set due to the conventional ML model relying (too) strongly on the “age” variable. Error bars indicate 95% confidence intervals around the mean over all models of all five splits. The numbers of individuals in each bin are 53, 143, 219, 248, and 140 for the held-out test set and 281, 341, 208, and 86 for the external validation set. Source data are provided as a Source Data file.
On the male subpopulation, CARPEClin. is outperformed by the cardiologist, indicating that the conventional ML model relies too strongly on the sex feature as an indicator for the presence of fCAD, whereas the DL model and the cardiologist use this feature more effectively. This is underlined by the observation that the performance gap between CARPEECG and CARPEClin. is highest in the male subgroup. In female patients, both CARPEClin. and CARPEECG perform comparably and better than the cardiologist in terms of AUROC.
In patients of at least 65 years of age, it becomes apparent that human judgement and ML might benefit from each other: while individually, both CARPEClin. and CARPEECG perform equally or worse than the cardiologist, combining all predictions in CARPEColl. results in a statistically significant performance increase over the DL model. It appears that the ML models’ biases are mitigated by the cardiologist’s expertise and vice versa. Augmenting the machine and deep learning output by the cardiologist’s judgement also increases diagnostic performance significantly in the full population. While CARPEColl. obtains its maximal mean AUROC in the same cohorts as CARPEECG, the highest mean increase over CARPEECG can be observed in patients with a CAD history, making it the group in which ML and cardiologists could complement each other most effectively.
Conventional machine learning relies on age, and ST-segment depressions contribute to high risk scores
For the cardiologist who interacts with a risk-stratification tool, it is critical to understand the model’s operations46 and whether it is consistent with the clinical knowledge about the phenotype. To develop such an understanding, post-hoc explanations47 can be used to make predictions more interpretable. We use SHAP38 values, a game-theoretic approach, to explain the outputs of machine learning models. SHAP values provide a score that quantifies the impact an individual feature value has on the model’s prediction. A positive SHAP value is associated with the prediction of the positive class/the presence of fCAD. Conversely, a feature with a negative SHAP value influences the model towards predicting the negative class/the absence of fCAD.
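The game-theoretic idea behind these scores can be shown on a toy model with exact Shapley values. The sketch below enumerates all feature subsets and replaces absent features with a fixed baseline. Practical SHAP implementations rely on approximations and background data sets instead, so this is an illustration of the concept and not the procedure used in the study.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at input `x`.

    Features outside a coalition are set to their `baseline` value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(subset):
        return model([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi
```

By construction, the values sum to the difference between the model output at `x` and at the baseline, so positive entries push the score up and negative entries push it down.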
Panel a of Fig. 4 shows mean absolute SHAP values and SHAP value distributions for all clinical variables for CARPEECG and CARPEClin. on the left-hand side. On the right-hand side, we show the SHAP values for the “age” feature. For both classifiers, “CAD history” and “sex” are the most influential predictive features (i.e., highest mean absolute value). However, CAD history is only significantly more relevant than the patient’s “sex” in the random forest (p = 7.9E-05, test statistic = 7.36 (CARPEClin.)) and not in CARPEECG (p = 0.055, Welch’s t-test for independent samples, test statistic = 2.24). Furthermore, the SHAP distribution of these variables around the value of zero is strikingly different. While CARPEECG exhibits many values comparatively close to zero (i.e., there are patients for which the respective features have no significant impact on the model’s final prediction), both CAD history and “sex” have a large impact on the model’s prediction in all patients for the conventional ML model (i.e., the distance to zero for both positive and negative SHAP values is substantial). Additionally, both features show a distinctive separation: each variable instance always leads to either a positive (male and presence of CAD history) or negative (female and absence of CAD history) SHAP value. We observe another distinctively different behaviour in the distribution of SHAP values for the “age” feature. The conventional ML model has learnt an age threshold of 70 years, which, when exceeded, leads to mostly positive SHAP values (i.e., it contributes to predicting the presence of fCAD) and vice versa. CARPEECG, on the other hand, exhibits a distinctive bell shape around zero, indicating the reduced impact of this variable. While this bias is likely due to the reduced fCAD prevalence of younger patients, the DL model exhibits a more stable behaviour with respect to this variable. 
The conventional ML model’s reliance on young age as a strong indicator of the absence of fCAD turns out to be detrimental when evaluated on external data, which consists of significantly more young patients (see Fig. 5). This underscores the need for explainability and trustworthiness in assessing ML models; if unaddressed, these aspects may preclude clinical applicability.
In addition to performing a population-wide feature relevance analysis, SHAP values allow for sample-specific analyses. In panel b of Fig. 4, we show a case study of an 83-year-old male patient with no previous CAD. We envision that in a future clinical implementation of our risk assessment tool, such a dashboard will help the cardiologist to better understand on which basis the model arrived at its prediction (e.g., whether the ECG signal is disturbed or noisy) and the influence of each feature (e.g., SHAP values).
The first row of panel b depicts the distributions of the values of all clinical features from the training population. Blue crosses indicate where the patient lies in that distribution. The centre row shows the distribution of SHAP values over five iterations. Moreover, we show the SHAP values of individual measurements in the background of the input ECG in the last row. The mean risk-score CARPEECG provides for this patient, who was later diagnosed with the presence of fCAD, is 0.77. We show positive SHAP values in yellow, negative ones in dark purple.
Notwithstanding their opposing signs, among the clinical variables, both the absence of a previous CAD and the patient’s age contribute most to the model’s prediction (-0.1 and 0.1, respectively). The normal resting heart rate of 67 BPM is associated with a lower risk score (mean SHAP value: -0.07). While weight, height, and diastolic blood pressure influence the model only marginally, the fact that the patient is male contributes most towards a higher risk score. Similarly, the patient’s age lies above the upper quartile of the training distribution, pushing the model toward predicting a higher score. Lastly, the systolic blood pressure (129 mmHg) also contributes to the prediction of the positive class. The largest contribution that increases the model’s output comes from the ECG. The SHAP values attributed to certain measurements and segments in the ECG might change throughout the different phases of stress testing. In sum, the mean SHAP value for the whole signal is 2.31. The highest SHAP values can be observed in the part of the input signal that comes from the stress phase of the examination. Measurements around the R-peak during rest and, more strikingly, around the ST-segment in the stress and partially in the recovery phase are associated with higher SHAP values than other segments of the ECG. The latter observation is a data-driven and a priori domain-agnostic confirmation of the relevance of ST-segment depression in the diagnosis of fCAD. This is underlined by the fact that in the pre-stress phase, where almost no ST-segment depression is visible, SHAP values around the ST-segment are close to zero. Conversely, negative SHAP values, in line with conventional medical understanding, are observed in the T-wave region during rest, the PR interval during stress, and prominently at the ventricular activation or R-wave peak time. 
This case study and the relevance of ST-segment depression for the prediction of higher risk scores is supported by a population-wide SHAP analysis in Supplementary Figs. 7 and 8.
CARPEECG generalises to unseen data across countries and modalities
To validate our neural network’s generalisation capabilities, we compute its predictive performance on an external validation data set containing 916 consecutive patients referred for exercise myocardial perfusion single-photon emission computed tomography. Referral reasons included non-anginal chest pain, atypical angina, presence of risk factors, or shortness of breath. This data set was retrieved through the THEW data repository48 (SUI: E-OTH-12-0927-015); it differs from the development data in several key characteristics: First, instead of recording the stress test ECG using bicycle ergometry, it was captured by a treadmill exercise test. Therefore, the resulting signal is subject to noise from walking movements rather than the cycling activity. Second, with a mean age of 55 years, the population in the external data set is significantly younger (p = 1.5E-121, one-sided Welch’s t-test, test statistic = 25.39) than the internal study cohort (held-out test set) whose patients are on average 68 years old (see Supplementary Fig. 9 for a complete comparison of all clinical variables). Lastly, the prevalence of ischaemia in the internal cohort is significantly higher compared to the external validation set (7.5%).
As shown in Supplementary Table 7, both approaches reach a good overall diagnostic performance and perform better on the external data set than on the internal held-out test set. CARPEECG outperforms the conventional ML model in both AUROC (0.80 ± 0.01 vs. 0.75 ± 0.004) and AUPRC (0.28 ± 0.02 vs. 0.19 ± 0.01). We attribute the higher predictive performance of the DL model to the fact that despite coming from a different modality, ECG signals are not fundamentally different among different populations, making it a robust and reliable input signal.
In Fig. 5, we contrast predictive performance on different age groups in both internal and external validation data. In patients who are younger than 70, both computational approaches consistently outperform the cardiologist in terms of diagnostic accuracy. For the stratum that makes up the majority of the data set (ages 70–79), however, pure computational prediction and human judgement individually perform comparably. Their combination (CARPEColl.), in contrast, significantly (p = 8.1E-04, one-sided Welch’s t-test, test statistic = 7.58) increases diagnostic performance over the cardiologist’s judgement and over CARPEECG (p = 0.001, test statistic = 4.73). The two extremes of the age distribution exemplify how the random forest’s cutoff of 70 years (see SHAP analysis) leads to detrimental performance: The further away a patient group lies from the cutoff, the bigger the performance difference between CARPEECG and CARPEClin. becomes. This is even more pronounced in the external validation cohort, where the differences in mean AUROCs (i.e., 10.3 percentage points) are the largest in patients between 26 and 49 years of age.
