Enhancing the diagnosis of functionally relevant coronary artery disease with machine learning

Data collection, label generation, and robustness

Panel a of Fig. 1 illustrates our data generation workflow. We collected stress test ECG data from 3522 consecutive adult patients who underwent a standard²⁵ rest/stress myocardial perfusion single-photon emission computed tomography (SPECT) protocol at a tertiary hospital as part of the BASEL VIII study (NCT01838148). Patients were referred with symptoms possibly related to inducible myocardial ischaemia and clinical suspicion of stable coronary heart disease. If a patient was not able to reach their target heart rate, a pharmacological protocol with either adenosine or dobutamine was initiated by the treating clinician. Individuals for whom stress test by bicycle ergometry was not possible were put on a pharmacological protocol from the start. To compare the algorithmic approaches with expert judgement, the treating cardiologist performed a clinical assessment before and after stress testing: considering all available medical information such as (cardiac) history, relevant symptoms, risk factors, (stress) ECG, prior imaging and more, they indicated the probability of the presence of fCAD on a visual analogue scale (VAS) from 0% to 100%^26,27,28,29. Representing clinical practice, adjudication of functionally relevant CAD was not formally blinded for stress ECG results or demographics and was performed centrally by an expert team composed of a nuclear medicine physician and a cardiologist assessing myocardial perfusion scans. Furthermore, whenever available, adjudication was refined with coronary angiography and fractional flow reserve assessment. Of the 3522 eligible patients who provided written informed consent, 701 (20%) patients underwent coronary angiography within 3 months, with 30 (0.9%) patients being reclassified to the fCAD group and 74 (2.1%) being reclassified to the non-ischaemic group. 
The VAS score the treating cardiologist provides after the stress test but before they get access to the imaging results represents the cardiologist baseline in our study. In practice, this can be interpreted as an indicator as to whether the cardiologist would recommend a follow-up examination with advanced imaging.

The data set was split into a development (75%) and a held-out test set (25%). All patients in the development set enrolled in the study from Jan. 2010 through Dec. 2014; the held-out test set contains patients who enrolled from Dec. 2014 through May 2016. The held-out test set was only released and used once the models’ parameters were fixed. Thus, high predictive performance on the held-out test set indicates the robustness of our system’s generalisation capability with respect to a temporal shift of the data³⁰, paving the path towards subsequent real-world applications. Lastly, we use external data from two Israeli medical centres to validate our system on 916 consecutive patients referred for SPECT MPI testing, whose ECG signals were obtained by treadmill stress test. This evaluation scenario is designed to exemplify the ability of both computational approaches to generalise to patients from unseen institutions and new modalities, and to highlight their behaviour under distributional shifts. Given an fCAD prevalence of 7.5% in the external data set, our approach based on clinical data alone (AUROC: 0.75 ± 0.004, AUPRC: 0.19 ± 0.01) is outperformed by our deep neural net using ECG time series and clinical variables (AUROC: 0.80 ± 0.01, AUPRC: 0.28 ± 0.01). Please refer to the Methods section for more details on data splitting and distributional shifts in the external validation data.
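
The temporal split described above can be illustrated with a minimal sketch. The record layout, the `temporal_split` helper, and the exact cutoff date are assumptions made for illustration only; the source states only that the boundary falls in Dec. 2014.

```python
from datetime import date


def temporal_split(patients, cutoff):
    """Assign patients enrolled before `cutoff` to development, the rest to held-out."""
    development = [p for p in patients if p["enrolled"] < cutoff]
    held_out = [p for p in patients if p["enrolled"] >= cutoff]
    return development, held_out


# Hypothetical toy records; real enrolment dates are not given in the text.
patients = [
    {"id": 1, "enrolled": date(2010, 3, 14)},
    {"id": 2, "enrolled": date(2013, 7, 2)},
    {"id": 3, "enrolled": date(2014, 11, 30)},
    {"id": 4, "enrolled": date(2015, 6, 9)},
]
cutoff = date(2014, 12, 15)  # assumed boundary inside Dec. 2014

development, held_out = temporal_split(patients, cutoff)
development_ids = [p["id"] for p in development]
held_out_ids = [p["id"] for p in held_out]
```

Splitting by enrolment date rather than at random means that every held-out patient was enrolled after every development patient, which is what allows the held-out performance to be read as robustness to temporal shift.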

Development of an ensemble predictor and a multi-task neural network for functionally relevant CAD prediction

The ability to learn from raw sequential data (i.e., time series) makes neural networks a popular approach for healthcare applications. However, conventional machine learning (ML) has been shown to be at least as powerful as deep learning in the clinical context^22,23, thus creating opportunities for low-cost deployments that do not require specialised hardware. Therefore, we also compare the performance of conventional ML methods with that of deep learning models. To this end, we select a small set of eight non-sequential, easy-to-access variables on which we train four conventional ML methods (i.e., decision trees, random forests³¹, logistic regression, support vector machines³²). These variables include age, weight, biological sex, height, heart rate at rest, systolic and diastolic blood pressure, and presence of a previous CAD. The best-performing approach (a random forest) was selected via 5-fold cross-validation. We refer to all developed predictors as Coronary ARtery disease PrEdictor (CARPE). We refer to the random forest model, which is based exclusively on clinical data, as CARPE_Clin. Additionally, we develop a neural network approach, CARPE_ECG, that uses the aforementioned non-sequential variables and the ECG signal, as illustrated in panel c of Fig. 1. We trained CARPE_ECG via a multi-task learning³³ (MTL) architecture with residual layers^15,34 at its core using the torchmtl³⁵ package. MTL uses so-called auxiliary tasks (i.e., prediction targets) related to a main task (e.g., fCAD prediction). These domain-specific inductive biases ensure improved and robust predictive performance on the main task³⁶. As shown in Fig. 1, we train CARPE_ECG on three auxiliary tasks (blue boxes), two of which (MPSSS and MPSRS) quantify the heart’s perfusion capabilities without and under stress, respectively. The third auxiliary task is to predict whether a patient received any pharmacological support to perform the stress test. Each auxiliary task impacts performance on the main task differently. 
Their respective importance weights were selected in a grid search on the three best-performing leads (see Supplementary Fig. 5 and Supplementary Table 5). To gain insights into the importance of static features and ECG segments, we used SHAP (SHapley³⁷ Additive exPlanation) values³⁸.
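
To make the role of the importance weights concrete, the sketch below combines a main-task loss with weighted auxiliary losses and enumerates a small weight grid. The text does not state the exact form of the combined loss; a weighted sum is a common choice and is assumed here. The loss values, candidate weights, and the `multitask_loss` helper are hypothetical, and this is not the torchmtl API.

```python
from itertools import product


def multitask_loss(main_loss, aux_losses, aux_weights):
    """Main-task loss plus the weighted sum of auxiliary-task losses (assumed form)."""
    if aux_losses.keys() != aux_weights.keys():
        raise ValueError("every auxiliary task needs exactly one weight")
    return main_loss + sum(aux_weights[name] * aux_losses[name] for name in aux_losses)


aux_names = ["mpsss", "mpsrs", "pharma_support"]

# Hypothetical per-task losses for one training step.
aux_losses = {"mpsss": 0.8, "mpsrs": 0.6, "pharma_support": 0.4}
aux_weights = {"mpsss": 0.5, "mpsrs": 0.5, "pharma_support": 0.25}
total = multitask_loss(0.7, aux_losses, aux_weights)

# Grid of candidate weight settings (values are illustrative, not from the paper).
candidate_weights = [0.1, 0.5, 1.0]
grid = [
    dict(zip(aux_names, combo))
    for combo in product(candidate_weights, repeat=len(aux_names))
]
```

In a grid search, each entry of `grid` would be used to train a model, and the setting with the best validation performance on the main task would be kept.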

Finally, we combine predictions from the ensemble model and deep learning approach with the cardiologist’s post-test judgement by training a new logistic regression model on all three scores from the training set. This way, we leverage the experience and domain knowledge of the cardiologist while adding the potential to benefit from supervised learning techniques. We believe that, in practice, such a collaborative approach has the highest chances of being accepted in a clinical setting, not only because it reaches the highest diagnostic performance (see Supplementary Table 6) but also because the cardiologist is an integral part of the score generation (in fact, they are required to provide a VAS score after stress testing). Nevertheless, the use of logistic regression to enhance diagnostic accuracy does not ensure or directly translate to clinical utility. The precise impact on patient risk stratification needs to be assessed separately.
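
The combination step can be sketched as a small logistic regression fitted on three scores per patient. Everything below is illustrative: the toy scores, the plain gradient-descent fit, and the helper names are assumptions, and the authors' actual fitting procedure is not described in this section.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Combined probability from one patient's three scores."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with batch gradient descent on the log-loss."""
    n, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = predict(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical rows: (clinical-model score, ECG-model score, cardiologist VAS / 100).
rows = [
    (0.2, 0.1, 0.3), (0.3, 0.2, 0.1), (0.4, 0.3, 0.5), (0.1, 0.4, 0.2),
    (0.7, 0.8, 0.6), (0.6, 0.9, 0.8), (0.8, 0.6, 0.4), (0.5, 0.7, 0.9),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_logistic(rows, labels)
high = predict(weights, bias, (0.8, 0.8, 0.8))
low = predict(weights, bias, (0.1, 0.1, 0.1))
```

Because the cardiologist's VAS score is one of the three inputs, the combined score cannot be produced without the clinician's post-test assessment.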

Machine learning can be used to reduce unnecessary perfusion imaging

The prevalence of centrally adjudicated fCAD was 32.9% in the full study cohort and 28% in the held-out test split. Figure 2 depicts the diagnostic performance of our machine learning approaches, the cardiologist’s assessment after stress testing, and a computational approach that uses the ECG’s ST-segment depression^39,40 as an indication of the presence of fCAD (see Methods for a detailed description) on the held-out data set. We show receiver operating characteristic (ROC) and precision-recall curves in the first row. Standard deviations shown as envelopes were obtained using bootstrapping, as detailed in the Methods section. Regarding the mean area under the ROC curve, we observe that CARPE_ECG (0.71) and CARPE_Clin. (0.70) outperform both the ST-depression algorithm (0.58) and the cardiologist (0.64). In regions of high specificity, CARPE_Clin. drops below the sensitivity of the cardiologist, while CARPE_ECG reaches comparable predictive performance (see inline plot). At the other extreme of the ROC curve, i.e., at high sensitivity values, both machine learning approaches consistently lead to a higher specificity than the cardiologist’s judgement (see inline plot).
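
For readers unfamiliar with how a mean AUROC and its bootstrap spread are obtained, a generic sketch follows. It is not the procedure from the Methods section: the number of resamples, the seed, the toy labels and scores, and both helper functions are assumptions.

```python
import random
import statistics


def auroc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=200, seed=0):
    """Mean and standard deviation of AUROC over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            estimates.append(auroc(ys, [scores[i] for i in idx]))
    return statistics.mean(estimates), statistics.stdev(estimates)


# Hypothetical labels (1 = fCAD) and predicted scores.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.5, 0.9]

point_estimate = auroc(labels, scores)
boot_mean, boot_std = bootstrap_auroc(labels, scores)
```

The pairwise definition used in `auroc` is quadratic in the number of patients; for large cohorts a rank-based implementation would be preferable.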

**Fig. 2: Diagnostic performance overview.**

Decision curves⁴¹ (rows two and three in Fig. 2) overcome the drawbacks of conventional performance evaluations and calibration analyses⁴² by focusing on a predictor’s clinical value. The concept of net benefit quantifies the trade-off between diagnosing sick patients and preventing healthy patients from being exposed to harmful testing procedures⁴³. For a specific decision threshold probability of a diagnostic tool, a larger net benefit indicates a greater number of true positive predictions without an increase in the rate of false positives and, conversely, a greater number of true negative predictions without an increase in false negatives. Figure 2 shows a decision curve analysis in the second and third row with pre-test rule-out cutoffs (dotted red) as advocated for in European and US-American guidelines^8,21, which consider probability thresholds between 5% and 15% for further non-invasive imaging. The European guideline, for instance, considers non-invasive testing in patients with a probability >15% as most beneficial and testing in patients with 5–15% as potentially beneficial. Our machine learning models lead to a higher net benefit than the cardiologist’s assessment at all thresholds. Notably, at the threshold of 15%, relying on the cardiologist’s judgement is worse (in terms of net benefit) than performing myocardial perfusion imaging on all patients, demonstrating the value of an ML-based method.
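
Net benefit at a threshold probability p_t is commonly computed as TP/N − (FP/N) · p_t/(1 − p_t), and the "image everyone" strategy is the special case in which every patient is flagged. The sketch below applies this standard formula to hypothetical data; the labels, scores, and function names are assumptions for illustration.

```python
def net_benefit(labels, scores, threshold):
    """Net benefit of flagging every patient whose score is at least `threshold`."""
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of sending all patients to imaging, regardless of score."""
    return net_benefit(labels, [1.0] * len(labels), threshold)


# Hypothetical cohort of ten patients (1 = fCAD) with model scores.
labels = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.05, 0.1, 0.4, 0.3, 0.08, 0.12, 0.02]

nb_model = net_benefit(labels, scores, 0.15)
nb_all = net_benefit_treat_all(labels, 0.15)
```

A predictor is only useful at a given threshold if its net benefit exceeds that of the treat-all strategy, which is the comparison the decision curves draw for the cardiologist and the ML models.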

Table 1 offers a detailed decision curve analysis, showing sensitivity, negative predictive value (NPV), and percentage of avoided myocardial perfusion imaging compared to the cardiologist’s judgement at three probability cut-off values. We also show the percentage of patients who received a score below the cutoff threshold to enable a meaningful interpretation of sensitivity values. The highest fraction of MPIs, i.e., almost 25 per 100 patients, could be avoided at a decision threshold of 10% by using CARPE_Coll. as a risk stratification method due to risk-overestimation by the cardiologist. That being said, cutoff thresholds should not be chosen to optimise diagnostic performance, but they represent the cardiologist’s minimum probability of disease at which an intervention would be warranted⁴². In other words, if a cardiologist holds the belief that missing a patient who suffers from fCAD is nine times worse than performing an unnecessary MPI, a model’s performance should be assessed at the 10% cutoff.
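
The link between a harm ratio and a cutoff follows from the threshold odds: if a missed fCAD case is judged r times worse than an unnecessary MPI, the matching threshold probability is 1/(1 + r). A short sketch (function names are illustrative):

```python
def threshold_from_harm_ratio(ratio):
    """Threshold probability when a miss is `ratio` times worse than an unneeded test."""
    return 1.0 / (1.0 + ratio)


def harm_ratio_from_threshold(threshold):
    """Inverse mapping: implied harm ratio for a given threshold probability."""
    return (1.0 - threshold) / threshold


t_nine = threshold_from_harm_ratio(9)       # the 10% cutoff
t_nineteen = threshold_from_harm_ratio(19)  # the 5% cutoff
ratio_at_15 = harm_ratio_from_threshold(0.15)
```

By the same mapping, the 15% cutoff corresponds to judging a missed diagnosis roughly 5.7 times worse than an unnecessary MPI.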

**Table 1: Detailed decision curve analysis.**

Evaluating CARPE_ECG as a predictive model on all patients of the held-out test set (at the 15% decision threshold) shows the potential to reduce perfusion imaging by 15.3% (see Table 1) without increasing the rate of false negatives. This number increases to 17.3% when using CARPE_Coll. We observe similar behaviour in patients without a CAD history. At the 5% threshold (i.e., if a physician considers it 19 times worse to miss an fCAD diagnosis than to perform an unwarranted MPI), ML can be used to avoid 10.8% of the imaging ordered by a cardiologist. For patients with CAD history, the decision thresholds of 5% and 10% lead to a particularly small number (<1% or none) of patients for which fCAD can be ruled out, which inflates sensitivity and NPV of CARPE_ECG and CARPE_Coll. This inflation is particularly pronounced in CARPE_Clin. (see Supplementary Fig. 6), which is therefore not shown here. Overall, these results demonstrate the potential clinical utility of the proposed methods to reduce potentially unwarranted MPIs.
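
The quantities reported at each cutoff (sensitivity, NPV, and the share of patients scoring below the cutoff) can be sketched as follows. The data and the `rule_out_metrics` helper are hypothetical; the exact definitions used for Table 1 are given in the paper's Methods, not here.

```python
def rule_out_metrics(labels, scores, cutoff):
    """Sensitivity, NPV, and ruled-out fraction when scores below `cutoff` skip imaging."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        if s >= cutoff:
            if y == 1:
                tp += 1
            else:
                fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    ruled_out = tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "npv": tn / ruled_out if ruled_out else float("nan"),
        "ruled_out_fraction": ruled_out / (tp + fp + tn + fn),
    }


# Hypothetical cohort (1 = fCAD); one diseased patient scores below the cutoff.
labels = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.05, 0.1, 0.4, 0.12, 0.08, 0.3, 0.02]

metrics = rule_out_metrics(labels, scores, 0.15)
```

When almost no patient falls below the cutoff, the ruled-out group is tiny and false negatives become nearly impossible, which is why sensitivity and NPV look inflated in such strata.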

Subcohort analysis: machine learning performs particularly well on younger patients

Trustworthiness and interpretability are of fundamental importance in the development of risk stratification models in cardiology⁴⁴. Identifying (sub)cohorts of the population for which the model performs particularly well or poorly is crucial. To address the issue of trust, we evaluate our models’ performances on a variety of subcohorts that are important in the context of exercise stress testing. Regarding interpretability, we perform an analysis of SHAP values³⁸ on the population level, and a case study to better understand the impact feature values and ECG segments have on the predicted scores.

Clinically significant subgroups include patients who underwent exercise stress testing versus patients who required pharmacological testing, as well as patients without a prior history of CAD versus patients with a known history of CAD; the odds of suffering from fCAD are significantly increased (p = 2.26E-40, two-sided Fisher’s exact test, test statistic = 2.64) for patients with previous CAD (OR: 2.64, 95% CI: 2.28–3.05) over the whole cohort. To obtain a more detailed performance breakdown, we also stratify the data by sex and age. Diagnostic performances of all approaches and subcohorts of the held-out test set are shown in Fig. 3 and Supplementary Table 6. For comparison, the performance of the CAD consortium model⁴⁵ and the currently used ESC pre-test probabilities for obstructive coronary artery disease^8,9, both based on age, sex, and the nature of symptoms, is shown in patients without known coronary artery disease. First, we assess the performance of individual machine learning methods before discussing their combination with the cardiologist’s judgement. Deep learning outperforms the cardiologist in terms of both AUROC (significant performance increase in 6/10 subcohorts) and AUPRC (significant performance increase in 4/10 subcohorts), while CARPE_Clin. exceeds the human baseline in 5/10 strata in terms of AUROC and 2/10 subcohorts in terms of AUPRC. The central plot in panel a of Fig. 4 helps explain this performance discrepancy: the conventional ML model relies more heavily than the neural network on the CAD history and sex variables, as can be seen from the large gap between the highest negative and the lowest positive SHAP value for each variable. Strong reliance on a given variable pushes the predictor too strongly in one direction such that other features cannot compensate for this influence on the final score. This SHAP analysis confirms the importance of the “CAD history”, “sex”, and “age” variables as observed in other studies^19,20.
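
An odds ratio with a confidence interval can be derived from a 2×2 table as sketched below. The counts are hypothetical, and the Woolf (log-scale) interval is an assumption: the text reports Fisher's exact test for the p-value but does not say how the interval was computed.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI for a 2x2 table.

    a: prior CAD with fCAD      b: prior CAD without fCAD
    c: no prior CAD with fCAD   d: no prior CAD without fCAD
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts, not the study's data.
odds_ratio, lower, upper = odds_ratio_ci(60, 40, 30, 70)
```

An interval that excludes 1, as the one reported above does, indicates that the odds of fCAD differ between patients with and without previous CAD.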

**Fig. 3: Diagnostic performance subcohort analysis.**

Overall, the discriminative performance was highest (excluding CARPE_Coll.) in younger patients (CARPE_ECG AUROC: 0.78 ± 0.04) in general and in younger patients who did not require pharmacological support specifically (CARPE_ECG AUROC: 0.79 ± 0.04). The former cohort also represents the stratum in which the increase over the cardiologist is the highest, namely 0.19 in AUROC and 0.15 in AUPRC. We hypothesise that, similar to the conventional ML model (i.e., a random forest), a cardiologist might be more biased towards a negative diagnosis in younger patients. In contrast, the DL model is more robust to such behaviour (as shown by the SHAP distribution of the age variable in Fig. 4). We show a more detailed assessment of diagnostic performance in different age groups in Fig. 5.

**Fig. 5: Diagnostic performance over age groups.**

On the male subpopulation, CARPE_Clin. is outperformed by the cardiologist, indicating that the conventional ML model relies too strongly on the sex feature as an indicator for the presence of fCAD, whereas the DL model and the cardiologist use this feature more effectively. This is underlined by the observation that the performance gap between CARPE_ECG and CARPE_Clin. is highest in the male subgroup. In female patients, both CARPE_Clin. and CARPE_ECG perform comparably and better than the cardiologist in terms of AUROC.

In patients of at least 65 years of age, it becomes apparent that human judgement and ML might benefit from each other: while individually, both CARPE_Clin. and CARPE_ECG perform as well as or worse than the cardiologist, combining all predictions in CARPE_Coll. results in a statistically significant performance increase over the DL model. It appears that the ML models’ biases are mitigated by the cardiologist’s expertise and vice versa. Augmenting the machine and deep learning output by the cardiologist’s judgement also increases diagnostic performance significantly in the full population. While CARPE_Coll. obtains its maximal mean AUROC in the same cohorts as CARPE_ECG, the highest mean increase over CARPE_ECG can be observed in patients with a CAD history, making it the group in which ML and cardiologists could complement each other most effectively.

Conventional machine learning relies on age, and ST-segment depressions contribute to high risk scores

For the cardiologist who interacts with a risk-stratification tool, it is critical to understand the model’s operations⁴⁶ and whether it is consistent with the clinical knowledge about the phenotype. To develop such an understanding, post-hoc explanations⁴⁷ can be used to make predictions more interpretable. We use SHAP³⁸ values, a game-theoretic approach, to explain the outputs of machine learning models. SHAP values provide a score that quantifies the impact an individual feature value has on the model’s prediction. A positive SHAP value is associated with the prediction of the positive class/the presence of fCAD. Conversely, a feature with a negative SHAP value influences the model towards predicting the negative class/the absence of fCAD.
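
The game-theoretic idea behind SHAP can be shown with an exact Shapley computation on a toy model. This is a sketch only: the three-feature model, the all-zero baseline, and the `shapley_values` helper are assumptions, and the SHAP library relies on approximations rather than this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        return model([x[i] if i in coalition else baseline[i] for i in range(n)])

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions


def toy_model(z):
    """Hypothetical risk score with one interaction between features 0 and 2."""
    return 0.5 * z[0] + 0.3 * z[1] + 0.2 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between the two features involved. Positive entries push the score towards the positive class and negative entries towards the negative class.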

Panel a of Fig. 4 shows mean absolute SHAP values and SHAP value distributions for all clinical variables for CARPE_ECG and CARPE_Clin. on the left-hand side. On the right-hand side, we show the SHAP values for the “age” feature. For both classifiers, “CAD history” and “sex” are the most influential predictive features (i.e., highest mean absolute value). However, CAD history is only significantly more relevant than the patient’s “sex” in the random forest (p = 7.9E-05, test statistic = 7.36 (CARPE_Clin.)) and not in CARPE_ECG (p = 0.055, Welch’s t-test for independent samples, test statistic = 2.24). Furthermore, the SHAP distribution of these variables around the value of zero is strikingly different. While CARPE_ECG exhibits many values comparatively close to zero (i.e., there are patients for which the respective features have no significant impact on the model’s final prediction), both CAD history and “sex” have a large impact on the model’s prediction in all patients for the conventional ML model (i.e., the distance to zero for both positive and negative SHAP values is substantial). Additionally, both features show a distinctive separation: each variable instance always leads to either a positive (male and presence of CAD history) or negative (female and absence of CAD history) SHAP value. We observe another distinctively different behaviour in the distribution of SHAP values for the “age” feature. The conventional ML model has learnt an age threshold of 70 years, which, when exceeded, leads to mostly positive SHAP values (i.e., it contributes to predicting the presence of fCAD) and vice versa. CARPE_ECG, on the other hand, exhibits a distinctive bell shape around zero, indicating the reduced impact of this variable. While this bias is likely due to the reduced fCAD prevalence of younger patients, the DL model exhibits a more stable behaviour with respect to this variable. 
The conventional ML model’s reliance on young age as a strong indicator of the absence of fCAD turns out to be detrimental when evaluated on external data, which consists of significantly more young patients (see Fig. 5). This underscores the need for explainability and trustworthiness in assessing ML models; if unaddressed, these aspects may preclude clinical applicability.

In addition to performing a population-wide feature relevance analysis, SHAP values allow for sample-specific analyses. In panel b of Fig. 4, we show a case study of an 83-year-old male patient with no previous CAD. We envision that in a future clinical implementation of our risk assessment tool, such a dashboard will help the cardiologist better understand on what basis the model arrived at its prediction (e.g., whether the ECG signal is disturbed or noisy) and the influence of each feature (e.g., SHAP values).

The first row of panel b depicts the distributions of the values of all clinical features from the training population. Blue crosses indicate where the patient lies in that distribution. The centre row shows the distribution of SHAP values over five iterations. Moreover, we show the SHAP values of individual measurements in the background of the input ECG in the last row. The mean risk score that CARPE_ECG provides for this patient, who was later diagnosed with fCAD, is 0.77. We show positive SHAP values in yellow and negative ones in dark purple.

Notwithstanding their opposing signs, among the clinical variables, both the absence of a previous CAD and the patient’s age contribute most to the model’s prediction (-0.1 and 0.1, respectively). The normal resting heart rate of 67 is associated with a lower risk score (mean SHAP value: 0.07). While weight, height, and diastolic blood pressure influence the model only marginally, the fact that the patient is male contributes most towards a higher risk score. Similarly, the patient’s age lies above the upper quartile of the training distribution, pushing the model toward predicting a higher score. Lastly, the systolic blood pressure (129 mmHg) also contributes to the prediction of the positive class. The largest contribution that increases the model’s output comes from the ECG. The SHAP values attributed to certain measurements and segments in the ECG might change throughout the different phases of stress testing. In sum, the mean SHAP value for the whole signal is 2.31. The highest SHAP values can be observed in the part of the input signal that comes from the stress phase of the examination. Measurements around the R-peak during rest and, more strikingly, around the ST-segment in the stress and partially in the recovery phase are associated with higher SHAP values than other segments of the ECG. The latter observation is a data-driven and a priori domain-agnostic confirmation of the relevance of ST-segment depression in the diagnosis of fCAD. This is underlined by the fact that in the pre-stress phase, where almost no ST-segment depression is visible, SHAP values around the ST-segment are close to zero. Conversely, negative SHAP values, in line with conventional medical understanding, are observed in the T-wave region during rest, the PR interval during stress, and prominently at the ventricular activation or R-wave peak time. 
This case study and the relevance of ST-segment depression for the prediction of higher risk scores is supported by a population-wide SHAP analysis in Supplementary Figs. 7 and 8.

CARPE_ECG generalises to unseen data across countries and modalities

To validate our neural network’s generalisation capabilities, we compute its predictive performance on an external validation data set containing 916 consecutive patients referred for exercise myocardial perfusion single-photon emission computed tomography (SPECT). Referral reasons included non-anginal chest pain, atypical angina, presence of risk factors, or shortness of breath. This data set was retrieved through the THEW data repository⁴⁸ (SUI: E-OTH-12-0927-015); it differs from the development data in several key characteristics. First, instead of recording the stress test ECG using bicycle ergometry, it was captured by a treadmill exercise test. Therefore, the resulting signal is subject to noise from walking movements rather than the cycling activity. Second, with a mean age of 55 years, the population in the external data set is significantly younger (p = 1.5E-121, one-sided Welch’s t-test, test statistic = 25.39) than the internal study cohort (held-out test set) whose patients are on average 68 years old (see Supplementary Fig. 9 for a complete comparison of all clinical variables). Lastly, the prevalence of ischaemia in the internal cohort is significantly higher compared to the external validation set (7.5%).
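
The Welch statistic used for the age comparison can be sketched with the standard library alone. The two small samples below are hypothetical, and the helper only returns the statistic and the Welch-Satterthwaite degrees of freedom; converting these into a p-value requires a t-distribution CDF, which is left out here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples with unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    se_sq = var_a / n_a + var_b / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_sq)
    dof = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical ages, not the study data.
internal_ages = [72, 65, 70, 68, 61, 74]
external_ages = [55, 49, 60, 52, 58, 56]

t_stat, dof = welch_t(internal_ages, external_ages)
```

Unlike Student's t-test, the Welch variant does not assume equal variances in the two cohorts, which suits a comparison between populations collected at different institutions.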

As shown in Supplementary Table 7, both approaches reach a good overall diagnostic performance and perform better on the external data set than on the internal held-out test set. CARPE_ECG outperforms the conventional ML model in both AUROC (0.80 ± 0.01 vs. 0.75 ± 0.004) and AUPRC (0.28 ± 0.02 vs. 0.19 ± 0.01). We attribute the higher predictive performance of the DL model to the fact that despite coming from a different modality, ECG signals are not fundamentally different among different populations, making it a robust and reliable input signal.

In Fig. 5, we contrast predictive performance on different age groups in both internal and external validation data. In patients who are younger than 70, both computational approaches consistently outperform the cardiologist in terms of diagnostic accuracy. For the stratum that makes up the majority of the data set (ages 70–79), however, pure computational prediction and human judgement individually perform comparably. Their combination (CARPE_Coll.) nevertheless significantly (p = 8.1E-04, one-sided Welch’s t-test, test statistic = 7.58) increases diagnostic performance over the cardiologist’s judgement and over CARPE_ECG (p = 0.001, test statistic = 4.73). The two extremes of the age distribution exemplify how the random forest’s cutoff of 70 years (see SHAP analysis) leads to detrimental performance: the further away a patient group lies from the cutoff, the bigger the performance difference between CARPE_ECG and CARPE_Clin. becomes. This is even more pronounced in the external validation cohort, where the differences in mean AUROCs (i.e., 10.3 percentage points) are the largest in patients between 26 and 49 years of age.
