Introduction
Stroke remains a leading cause of morbidity and mortality worldwide, often resulting in severe disability and imposing a substantial burden on healthcare systems. Patients with pre-existing cardiovascular conditions, particularly coronary artery disease (CAD), are at significantly elevated risk of stroke, making early detection and risk stratification crucial for timely intervention and improved clinical outcomes [1–5]. Traditional risk assessment tools, including logistic regression and other statistical models, are widely used but frequently fail to capture nonlinear interactions among patient variables, limiting their precision for individualized risk prediction [6–9].
In recent years, artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), has shown promise in overcoming these limitations. ML models such as random forests, gradient boosting, and deep neural networks have demonstrated superior predictive accuracy compared to conventional statistical approaches in various stroke-related contexts, including predicting hidden CAD in acute ischemic stroke patients [4, 10–13]. Furthermore, ML-based risk prediction has been successfully applied to diverse patient populations, such as those undergoing coronary revascularization [5], patients with atrial fibrillation and cancer [13, 14], and older adults at risk of frailty following ischemic stroke [15].
Despite their predictive power, the “black-box” nature of many AI models has hindered their clinical adoption, as clinicians require interpretable and actionable outputs to guide decision-making [16–18]. Explainable AI (XAI) techniques, including SHAP, LIME, and attention mechanisms, have been proposed to address this challenge by elucidating model predictions and highlighting individual risk factors [17, 19–25]. Studies have shown that XAI can enhance trust and facilitate informed clinical interventions while maintaining high predictive accuracy, for instance in predicting depression risk in stroke patients or fair 2-year stroke risk in atrial fibrillation patients [14, 25–27].
Nevertheless, most existing studies have focused on single ML models or small patient cohorts, limiting generalizability and clinical utility [28–31]. Additionally, hybrid approaches combining ensemble methods and deep learning, which can leverage the strengths of multiple algorithms to improve accuracy and robustness, remain underexplored in the context of stroke risk prediction among high-risk cardiovascular patients [32].
To address these gaps, the present study proposes a novel, interpretable hybrid framework that integrates ensemble machine learning with deep learning techniques for predicting stroke in patients with coronary artery disease. The approach combines feature selection, model ensembling, and XAI techniques to deliver both high predictive performance and explainability. By providing actionable, patient-specific insights, this framework aims to enhance early intervention strategies, improve clinical outcomes, and advance precision medicine in high-risk cardiovascular populations. This study contributes to the growing field of explainable AI in healthcare by offering a scalable and interpretable solution that addresses the limitations of prior research and maximizes clinical utility.
Materials and methods
Data Source and Preprocessing
This study employed secondary data on coronary heart disease obtained from the IEEE DataPort repository, one of the most well-known platforms providing high-quality datasets for cardiovascular and epidemiological research [33]. The dataset is derived from the Behavioral Risk Factor Surveillance System (BRFSS) survey and comprises responses from 253,680 individuals, collected across the United States. The data were accessed for research purposes in April 2025.
The dataset contains 22 variables, including demographic, lifestyle, behavioral, clinical, healthcare access, and overall health status variables. Among the total respondents, 248,960 participants reported no history of stroke, whereas 4,720 participants self-reported a positive stroke status. Although the data were collected in the United States, the included risk factors—such as smoking behavior, cholesterol levels, high blood pressure, and diabetes—represent stroke determinants that are prevalent across both high-income and low- and middle-income settings. This makes the dataset suitable for developing generalized stroke prediction models that can later be adapted and validated using region-specific datasets to improve local applicability.
The dataset is publicly available and fully anonymized. The authors did not have access to any information that could identify individual participants during or after data collection.
To model the data, a systematic preprocessing pipeline was implemented to ensure data quality and enhance model performance. Missing values were handled using appropriate techniques, including multiple imputation or a k-nearest neighbors approach, depending on the nature and distribution of the missingness. Outlier detection was conducted using the interquartile range method and the Mahalanobis distance. Noise reduction involved smoothing procedures, feature correlation analysis, and the elimination of low-variance attributes. In addition, feature scaling techniques such as normalization and standardization were applied to harmonize feature ranges and enable stable and efficient model training.
Addressing Class Imbalance
Given the significant class imbalance in the dataset (with stroke cases representing only approximately 1.86% of the total sample), we employed the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority class. This approach helps prevent model bias toward the majority class and improves the detection of rare but clinically important stroke events.
SMOTE operates by creating synthetic examples along line segments connecting minority class instances in feature space. For each minority class instance xi, we:
- Find its k nearest neighbors from the minority class (typically k = 5)
- Randomly select one of these neighbors, xzi
- Generate a synthetic sample x̃i along the line segment between xi and xzi:

x̃i = xi + λ(xzi − xi)

where λ is a random number uniformly distributed between 0 and 1. This interpolation creates new instances that lie in the feature space region between existing minority class samples.
To ensure robustness and prevent overfitting, we applied SMOTE in conjunction with stratified k-fold cross-validation during model training. Specifically, we:
- Used stratified sampling to maintain the original class distribution in training and test splits
- Applied SMOTE only to the training folds during cross-validation, leaving the test folds untouched to provide unbiased performance evaluation
- Generated synthetic samples until the minority class reached 50% of the majority class size, balancing detection sensitivity with computational efficiency
This combined approach of SMOTE with stratified cross-validation ensured that our models could learn meaningful patterns from the minority class while maintaining generalizability to unseen, naturally imbalanced clinical data.
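The interpolation step described above can be sketched in a few lines of NumPy. This is an illustrative implementation of ours, not the study's pipeline (which would typically use imbalanced-learn's `SMOTE` inside the training folds only); the function name `smote_oversample` is our own.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point is x_i + lambda * (x_zi - x_i), where x_zi is one of
    the k nearest minority neighbors of a randomly chosen minority point x_i
    and lambda ~ U(0, 1).
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances among minority instances (small n assumed here).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbors
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)
        zi = nn[i, rng.integers(min(k, n - 1))]
        lam = rng.random()
        synthetic.append(X_min[i] + lam * (X_min[zi] - X_min[i]))
    return np.array(synthetic)

# Toy minority class: 10 points in 2-D; generate 15 synthetic samples.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
X_syn = smote_oversample(X_min, n_synthetic=15, k=5, rng=1)
print(X_syn.shape)  # (15, 2)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region spanned by the original minority class.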
Feature Selection
Feature selection is critical for improving model accuracy and interpretability. We applied a hybrid approach combining statistical and machine learning-based methods. First, univariate feature selection based on mutual information I(X; Y) was applied:

I(X; Y) = Σx Σy p(x, y) log [p(x, y) / (p(x) p(y))]
This quantifies the dependency between each feature X and the target Y. Next, recursive feature elimination with cross-validation (RFECV) was applied using a base classifier to iteratively remove the least important features until optimal performance was achieved.
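For discrete features, the mutual information score above can be computed directly from empirical cell frequencies. The sketch below is ours for illustration (in practice one would use scikit-learn's `mutual_info_classif` and `RFECV`):

```python
import numpy as np

def mutual_information(x, y):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y)))."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))   # joint probability
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

y = np.array([0, 0, 1, 1] * 25)      # balanced binary target
x_indep = np.array([0, 1] * 50)      # independent of y

# A feature identical to the target carries maximal information (log 2);
# an independent feature carries none.
assert mutual_information(y, y) > 0.6
assert abs(mutual_information(x_indep, y)) < 1e-9
```

Features whose mutual information with the stroke label is near zero are candidates for removal before the RFECV stage.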
Machine Learning Algorithms
This study implements a multi-model machine learning framework designed to predict stroke risk among individuals diagnosed with coronary heart disease. The framework integrates deep learning architectures with tree-based and kernel-based ensemble methods. A stacked meta-learner is ultimately used to combine model outputs for improved robustness and predictive stability.
Deep Learning Components
Keras Multilayer Perceptron (Keras-MLP)
A fully connected feedforward neural network was constructed to model nonlinear interactions among clinical and behavioral risk factors.
Let X ∈ ℝn×d denote the input dataset with d predictors and binary labels Y ∈ {0, 1}n. The computations in the MLP are:

hl = f(Wl hl−1 + bl),  l = 1, …, L − 1,  with h0 = x
ŷ = σ(WL hL−1 + bL)

where:
- hl is the hidden state at layer l
- Wl and bl are trainable weight matrices and biases
- f(·) is the ReLU activation
- σ(·) is the sigmoid function for binary classification
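The layer-by-layer computation can be mirrored directly in NumPy. This forward-pass sketch is ours (the study trained the network in Keras); layer widths and weights below are arbitrary illustrations:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, weights, biases):
    """h_l = ReLU(W_l h_{l-1} + b_l) for hidden layers; sigmoid output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return sigmoid(weights[-1] @ h + biases[-1])

rng = np.random.default_rng(0)
d, h1, h2 = 8, 16, 8                                # illustrative layer widths
Ws = [0.1 * rng.normal(size=(h1, d)),               # small random weights
      0.1 * rng.normal(size=(h2, h1)),
      0.1 * rng.normal(size=(1, h2))]
bs = [np.zeros(h1), np.zeros(h2), np.zeros(1)]
p = mlp_forward(rng.normal(size=d), Ws, bs)
print(p[0])  # a stroke-risk probability in (0, 1)
```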
Keras 1D Convolutional Neural Network (Keras-CNN1D)
Although CNNs are traditionally used for sequential data, they can also extract feature interactions from tabular data through 1D convolutions. Input features are reshaped to a 1-dimensional sequence:

x ∈ ℝd → x′ ∈ ℝd×1
The convolutional filters learn local patterns and interactions that fully connected models may miss.
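A 1D convolution over a tabular row is simply a sliding dot product of a learned filter with adjacent feature values. The minimal sketch below is ours (in Keras this is `Conv1D` applied to inputs reshaped to `(n, d, 1)`); the filter values are an illustrative assumption:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1-D convolution (cross-correlation) of sequence x with a filter."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

# A tabular row with d = 6 features, treated as a length-6 sequence.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
edge = np.array([-1.0, 1.0])          # filter sensitive to local differences
out = conv1d_valid(x, edge)
print(out)                            # [1. 1. 1. 1. 1.]
```

Each output element here is the difference between adjacent feature values, one simple example of the local interaction patterns such filters can learn.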
Scikit-Learn MLP Classifier (SKLearn MLP)
As an additional neural architecture, the Scikit-Learn MLP classifier is included as a lightweight alternative to Keras models. The feedforward computations follow the same formulation as the Keras MLP but without GPU-accelerated optimization.
Stacked Ensemble Framework
To leverage the complementary strengths of all models, a stacked ensemble is used as the final estimator. Each base model k produces an output probability ŷk. These are combined as features for a meta-learner (Logistic Regression):

ŷ = σ(w1ŷ1 + w2ŷ2 + ⋯ + wKŷK)

where:
- wk are learnable combination weights
- K is the number of base models
This approach reduces variance, stabilizes predictions, and strengthens generalization.
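The meta-learning step can be sketched as a logistic regression fitted on the matrix of base-model probabilities. This is our toy NumPy illustration, not the study's implementation (which would use scikit-learn's `StackingClassifier`); the two simulated "base models" are assumptions for the demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_meta_learner(P, y, lr=0.5, epochs=2000):
    """Logistic-regression meta-learner over base-model probabilities.

    P: (n_samples, K) matrix of base-model predicted probabilities.
    Returns weights w (K,) and intercept b fitted by gradient descent
    on the binary cross-entropy loss.
    """
    n, K = P.shape
    w, b = np.zeros(K), 0.0
    for _ in range(epochs):
        p = sigmoid(P @ w + b)
        g = p - y                      # gradient of BCE w.r.t. the logits
        w -= lr * (P.T @ g) / n
        b -= lr * g.mean()
    return w, b

# Toy setup: base model 0 tracks the label, base model 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
P = np.column_stack([0.7 * y + 0.15 + 0.05 * rng.random(200),
                     rng.random(200)])
w, b = fit_meta_learner(P, y)
print(w)  # the informative base model receives the larger weight
```

The meta-learner automatically down-weights uninformative base models, which is one mechanism behind the variance reduction claimed above.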
Model Training and Validation
The dataset was split into training (70%) and testing (30%) sets with stratification to preserve class balance. K-fold cross-validation (k = 5) was used during training to tune hyperparameters for all models. Binary cross-entropy loss was used for neural networks:

L = −(1/N) Σi [yi log ŷi + (1 − yi) log(1 − ŷi)]
Ensemble models were optimized using grid search over hyperparameters such as number of trees, learning rate, and maximum depth. Early stopping was applied to prevent overfitting.
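The binary cross-entropy loss above can be implemented and sanity-checked in a few lines. This is our NumPy sketch, not the study's training code:

```python
import numpy as np

def bce_loss(y, p, eps=1e-12):
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p)).

    Probabilities are clipped away from 0 and 1 for numerical stability.
    """
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
# Perfect predictions give (essentially) zero loss.
assert np.isclose(bce_loss(y, y.astype(float)), 0.0, atol=1e-9)
# A constant 0.5 prediction gives ln 2, the loss of an uninformative model.
print(round(bce_loss(y, np.full(4, 0.5)), 4))  # 0.6931
```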
Model Comparison and Evaluation
To rigorously assess the performance of the proposed hybrid framework, a comprehensive evaluation strategy was implemented, encompassing both standard classification metrics and graphical diagnostic tools. Model comparison was conducted across the deep learning network, individual ensemble models such as Random Forest and XGBoost, and the stacked ensemble, ensuring a robust assessment of predictive accuracy, reliability, and generalization.
Accuracy measures the proportion of correct predictions among all predictions and is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP and TN are true positives and true negatives, and FP and FN are false positives and false negatives. While accuracy is intuitive, it may not fully capture performance in imbalanced datasets.
Precision quantifies the proportion of correctly predicted positive instances among all predicted positives:

Precision = TP / (TP + FP)
Recall (sensitivity) measures the proportion of actual positive instances that are correctly identified:

Recall = TP / (TP + FN)
Specificity assesses the proportion of actual negative instances correctly identified by the model:

Specificity = TN / (TN + FP)
F1 Score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Matthews Correlation Coefficient (MCC) offers a robust measure for imbalanced datasets:

MCC = (TP × TN − FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
AUC-ROC provides a threshold-independent measure of discrimination by plotting the true positive rate against the false positive rate:

AUC = ∫01 TPR d(FPR)
Higher AUC values indicate superior separation between positive and negative classes. Complementary to AUC-ROC, Precision-Recall (PR) curves evaluate performance in imbalanced datasets by plotting precision against recall at different thresholds. The area under the PR curve (AUC-PR) is particularly informative when the positive class is rare.
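The count-based metrics above can be computed directly from a confusion matrix. The sketch below is ours; as a worked example it plugs in the counts later reported for the stacked ensemble (Fig 2: TN = 977, TP = 958, FP = 2, FN = 21):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the report's metrics directly from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(accuracy=accuracy, precision=precision, recall=recall,
                specificity=specificity, f1=f1, mcc=mcc)

m = classification_metrics(tp=958, tn=977, fp=2, fn=21)
print(round(m["accuracy"], 4), round(m["recall"], 4))  # 0.9883 0.9785
```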
Calibration curves were used to assess agreement between predicted probabilities and observed outcomes; ideal calibration aligns predicted risk probabilities with observed event frequencies. Deviations from the diagonal indicate over-confidence or under-confidence in model predictions.
Finally, learning curves were plotted to monitor model performance as the training set size increased. These curves provide insights into underfitting, overfitting, and model generalization, helping identify whether additional training data or regularization adjustments are needed for optimal performance.
Scalability Analysis
A detailed scalability analysis was conducted to ensure that the proposed hybrid framework can be implemented in low-resource clinical settings. Scalability was assessed in terms of computational efficiency, memory requirements, and deployability, which are pivotal in environments with limited hardware, limited memory, and low-latency requirements. Real-time clinical decision support and deployment in resource-constrained healthcare systems depend on understanding these constraints.
The training time was recorded as the total time each model took to learn from the training dataset. Models that train quickly benefit from rapid retraining on new datasets or during incremental learning. The training time, denoted Ttrain, was measured on identical hardware (the same GPU and CPU configuration) to enable a fair comparison across all models.
Inference latency is the average time required to produce a prediction for a single input sample, which is of special interest for real-time decision-making. Low-latency models facilitate expedited clinical review and optimize workflow. The inference latency, denoted Linference, was computed as:

Linference = (1/N) Σi ti

where ti is the time required to score sample i and N is the number of evaluated samples.
This measure provides a direct indication of the model’s responsiveness, showing whether it is appropriate for deployment in a time-sensitive environment given time constraints. Model size is the amount of storage space required to store the trained model, measured in kilobytes. Smaller models can be deployed to devices with limited storage, e.g., edge devices, mobile apps, or local clinical servers, without impacting predictive performance. The maximum memory used during model training (i.e., loading the data and storing the parameters) is captured by training memory usage. System profiling tools were used to monitor this, identify potential bottlenecks, and ensure the training could run on a machine with limited RAM. Equally, the usage of inference memory records the memory footprint in making predictions on new patient data. To be integrated into real-world clinical processes, particularly in low-resource environments, it is imperative to use memory efficiently during inference.
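Latency of this kind can be measured with a simple wall-clock harness. The sketch below is ours; the stand-in `predict` function is a hypothetical placeholder for any trained model's scoring call:

```python
import time

def mean_inference_latency(predict, samples, repeats=3):
    """Average wall-clock time (seconds) to score one input sample."""
    times = []
    for _ in range(repeats):
        for s in samples:
            t0 = time.perf_counter()
            predict(s)
            times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Stand-in model: a trivial linear scorer (a real model plugs in the same way).
predict = lambda s: 0.3 * s + 0.1
lat = mean_inference_latency(predict, samples=[0.1 * i for i in range(100)])
print(f"{lat * 1e6:.2f} microseconds per sample")
```

Repeating the measurement and averaging smooths out scheduler jitter, which matters when comparing models whose per-sample latencies differ by fractions of a millisecond.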
This scalability evaluation, based on training time, inference latency, model size, and memory demands, ensures that the proposed hybrid framework is not only accurate and interpretable but also implementable in settings with limited computational resources. This enables clinically viable, resource-efficient prediction of stroke risk among patients with coronary heart disease, adoptable across diverse clinical environments.
Explainability and Interpretability
To enhance clinical trust and adoption of the predictive models, we employed two complementary explainable artificial intelligence (XAI) techniques: Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). These methods were applied to provide both global and local explanations of model predictions, enabling clinicians to understand which features contribute most to individual stroke risk assessments.
SHAP (SHapley Additive exPlanations)
SHAP provides a unified framework for interpreting model predictions by computing the contribution of each feature to the prediction. Based on cooperative game theory, SHAP values were calculated as the average marginal contribution of a feature across all possible feature combinations.
The SHAP value ϕi for feature i is defined as:

ϕi = Σ S⊆N\{i} [|S|! (|N| − |S| − 1)! / |N|!] [fx(S ∪ {i}) − fx(S)]

where:
- N is the set of all features
- S is a subset of features not including i
- fx(S) is the model prediction using only the feature subset S
- |S|!(|N| − |S| − 1)!/|N|! is the weighting factor that accounts for different subset sizes
The final model prediction f(x) for an instance x can be expressed as the sum of SHAP values:

f(x) = ϕ0 + Σi ϕi(x)
where ϕ0 is the base value (expected model output over the training dataset) and ϕi(x) is the SHAP value for feature i for instance x.
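For a small number of features, the Shapley formula can be evaluated exactly by enumerating every subset S. The brute-force sketch below is ours (practical SHAP implementations such as TreeSHAP avoid this exponential enumeration); the additive toy model and baseline are assumptions for the demo:

```python
from itertools import combinations
from math import factorial

def shap_values(f, x, baseline):
    """Exact Shapley values by enumerating all subsets S of N \\ {i}.

    f takes a full feature vector; absent features are set to the baseline.
    Feasible only for small |N|, since cost grows as 2^|N|.
    """
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if j in S or j == i else baseline[j] for j in range(n)]
                without = [x[j] if j in S else baseline[j] for j in range(n)]
                phi += w * (f(with_i) - f(without))
        phis.append(phi)
    return phis

# For an additive model f(x) = sum(x) with a zero baseline, each
# phi_i equals x_i, and phi_0 + sum(phi_i) recovers f(x) (local accuracy).
f = sum
x, base = [3.0, 1.0, 2.0], [0.0, 0.0, 0.0]
phis = shap_values(f, x, base)
print([round(p, 6) for p in phis])  # [3.0, 1.0, 2.0]
```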
In this study, we computed SHAP values for our best-performing models (LightGBM and XGBoost) to identify global feature importance and provide local explanations for individual predictions. We generated SHAP summary plots to visualize feature importance across the entire dataset and force plots to explain individual predictions.
LIME (Local Interpretable Model-agnostic Explanations)
LIME explains individual predictions by approximating the complex model locally with an interpretable model. For a given instance x, LIME generates perturbed samples around x and trains a simple, interpretable model (typically linear) on these samples weighted by their proximity to x.
The LIME explanation is obtained by solving:

ξ(x) = argmin g∈G [L(f, g, πx) + Ω(g)]

where:
- g is an interpretable model from class G (e.g., a linear model)
- L is a loss function measuring how well g approximates f in the locality defined by πx
- πx(z) is a proximity measure between instance z and x
- Ω(g) penalizes the complexity of g to ensure interpretability
In practice, for tabular data, we used the following weighted linear regression:

ŵ = argmin w Σz∈Z πx(z) [f(z) − w · z′]² + λ‖w‖1

where:
- Z is the set of perturbed samples around x
- πx(z) = exp(−D(x, z)²/σ²) is the exponential kernel with width σ
- D(x, z) is the distance between x and z (Euclidean distance for continuous features, Hamming distance for categorical features)
- λ controls the L1 regularization for sparsity
We applied LIME to the LightGBM model to generate local explanations for both high-risk and low-risk predictions. These explanations identified the most influential features for specific predictions and quantified their contributions, providing clinicians with intuitive, case-specific reasoning.
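The local surrogate idea can be sketched end to end with NumPy: perturb around x, weight samples by the exponential kernel, and fit a weighted linear model. This is our simplified illustration (it omits the L1 step for brevity; the real analysis used the `lime` package), and the quadratic "black box" is an assumption for the demo:

```python
import numpy as np

def lime_explain(f, x, n_samples=500, sigma=1.0, seed=0):
    """Fit a locally weighted linear surrogate around x.

    Perturbs x with Gaussian noise, weights samples by the exponential kernel
    pi_x(z) = exp(-||x - z||^2 / sigma^2), and solves weighted least squares.
    """
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / sigma ** 2)   # proximity weights
    A = np.column_stack([Z, np.ones(n_samples)])             # features + intercept
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], f(Z) * sw, rcond=None)
    return coef[:-1]                                         # local feature weights

# Nonlinear black box: only feature 0 matters near x = (1, 1).
f = lambda Z: Z[:, 0] ** 2
coef = lime_explain(f, np.array([1.0, 1.0]))
print(coef)  # approximately [2.0, 0.0]: the local gradient of x0^2 at x0 = 1
```

The surrogate's coefficients recover the black box's local behavior: a large weight on the influential feature and a near-zero weight on the irrelevant one, which is exactly the case-specific reasoning presented to clinicians.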
Comparative Application of XAI Methods
The combined use of SHAP and LIME was implemented to offer complementary insights:
- SHAP provided theoretically grounded, consistent feature attributions that satisfy important properties like local accuracy, missingness, and consistency. We used SHAP for both population-level feature importance analysis and detailed local explanations.
- LIME offered simpler, model-agnostic explanations that are particularly intuitive for clinicians. By fitting local linear models, LIME generated rule-based explanations that align with clinical reasoning patterns.
- Consistency validation: We systematically compared explanations from both methods for the same predictions to validate the robustness of feature attributions and enhance confidence in the model’s decision-making process.
The integration of these XAI techniques was designed to bridge the gap between high predictive accuracy and clinical interpretability, making the models not only powerful predictors but also transparent decision-support tools that clinicians can interrogate and understand.
Ethical Considerations
This study was conducted using secondary data on stroke and heart disease obtained from the IEEE DataPort repository and did not involve direct interaction with human participants. The dataset is publicly available for research purposes and was originally collected under appropriate ethical oversight as part of the Behavioral Risk Factor Surveillance System (BRFSS).
All data used in this study were fully anonymized prior to public release. The authors did not have access to any personally identifiable information during or after data collection. As a result, informed consent from individual participants was not required for this secondary analysis, and additional institutional ethical approval was not necessary in accordance with applicable research ethics guidelines for the use of publicly available, de-identified data.
Results and Discussion
Addressing Class Imbalance with SMOTE
The original dataset exhibited severe class imbalance, with stroke cases representing only 1.86% of the total population. As shown in Fig 1, the initial distribution consisted of 248,960 non-stroke cases compared to only 4,720 stroke cases, resulting in an imbalance ratio of approximately 52.7:1. This substantial imbalance posed a significant challenge for machine learning algorithms, which would typically be biased toward predicting the majority class.


To address this issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples for the minority class by interpolating between existing instances in feature space. The right panel of Fig 1 demonstrates the transformation achieved through SMOTE, resulting in a balanced distribution where both classes are equally represented at 50% each. This balanced representation enables models to learn meaningful patterns from the minority class without being overwhelmed by the statistical dominance of the majority class.
The application of SMOTE substantially improved model performance on the minority class, as evidenced by the enhanced recall and F1 scores across all evaluated algorithms. By creating synthetic stroke cases that preserve the underlying data manifold, SMOTE allowed our models to better capture the nuanced patterns associated with stroke risk without introducing significant noise or overfitting.
Predictors of Stroke Among Patients with Coronary Heart Disease
The logistic regression model evaluating determinants of stroke among patients with coronary heart disease showed excellent overall model fit, as indicated by a statistically significant likelihood ratio test (p < .001). The model achieved a pseudo-R2 of .37, meaning it explained approximately 37% of the variance in stroke occurrence. Several predictors were statistically significant and clinically meaningful, demonstrating strong associations with stroke risk.
Prior Heart Attack
As shown in Table 1, a history of heart disease or myocardial infarction emerged as the strongest positive predictor (β = 2.45, OR = 11.56, p < .001). Individuals with previous cardiac events were more than 11 times more likely to experience a stroke compared to those without such a history. This finding aligns with prior studies showing that coronary artery disease and cerebrovascular disease share common pathophysiological mechanisms, including endothelial dysfunction, atherosclerotic plaque instability, and systemic inflammation.
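As a quick arithmetic check, the odds ratios in Table 1 follow from exponentiating the logistic regression coefficients:

```python
import math

# OR = exp(beta): a coefficient of 2.45 corresponds to an odds ratio of ~11.59,
# in line with the reported OR of 11.56 (small differences reflect rounding).
print(round(math.exp(2.45), 2))    # 11.59
# Negative coefficients give protective odds ratios below 1.
print(round(math.exp(-1.41), 2))   # 0.24
```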
Lifestyle and Behavioral Factors
Physical activity emerged as a significant protective factor (β = –1.41, OR = 0.24, p < .001), with physically active patients having 76% lower odds of stroke. Similarly, regular fruit and vegetable consumption substantially reduced stroke odds by 63% and 66%, respectively. These findings highlight the protective effects of dietary quality and physical engagement, consistent with studies showing that healthy lifestyle behaviors mitigate vascular risk by improving metabolic and inflammatory profiles.
Smoking was also associated with reduced odds of stroke (β = –1.20, OR = 0.30, p < .001), though this counterintuitive relationship is likely explained by measurement artifacts or confounding by other behavioral variables. In BRFSS-type datasets, individuals with severe disease often quit smoking, so the categorical “smoker” label may capture former smokers who have recently changed their health behavior. This phenomenon is well documented in large national surveys.
Healthcare Access and Preventive Care
Healthcare-related variables were among the most influential predictors. Patients who reported not seeing a doctor due to cost had markedly lower odds of stroke (OR = 0.13, p < .001), and healthcare coverage was also associated with reduced odds (OR = 0.48, p < .001). These inverse directions likely reflect reverse causality: individuals with stroke or severe symptoms disproportionately seek healthcare, whereas those reporting cost barriers may have fewer severe acute conditions.
Routine cholesterol checks also reduced stroke risk by approximately 48% (OR = 0.52, p < .001), consistent with guidelines emphasizing early detection and aggressive management of dyslipidemia in patients with cardiovascular comorbidities.
Clinical Predictors
Hypertension, high cholesterol, diabetes, and BMI all exhibited significant protective directions, an unexpected finding given the extensive literature linking these conditions with stroke. The directionality may reflect strong confounding from treatment adherence, lifestyle modification, and disease awareness: individuals diagnosed with these conditions often receive more intensive medical follow-up and preventive therapy. Difficulty walking (OR = 0.38, p < .001) and poorer self-rated general health (OR = 0.87, p < .001) were also significantly associated with stroke status; the broader literature indicates that functional limitations and poor overall health elevate cerebrovascular vulnerability.
Demographic Factors
Sex was a significant predictor, with females exhibiting lower stroke risk (OR = 0.35, p < .001). Age showed a positive but marginal effect (p = .081), which may reflect the narrow age distribution of CHD patients or the dominating influence of other stronger clinical predictors. Education and income both demonstrated protective associations, suggesting that socioeconomic advantage may contribute to lower stroke risk by increasing health literacy, improving healthcare access, and promoting healthier lifestyle patterns.
Confusion Matrix for Machine Learning Algorithms
The confusion matrix for the Stacked Ensemble model, presented in Fig 2, demonstrates exceptional predictive performance for stroke risk among patients with coronary heart disease. The model correctly identified 977 true negatives and 958 true positives, resulting in a very high overall accuracy. With only 21 false negatives and 2 false positives, the model achieves an outstanding balance between sensitivity and specificity. This indicates a strong capability to discriminate between individuals at high and low risk of stroke, minimizing both missed diagnoses and unnecessary alarms. The low false negative rate is particularly critical in a clinical context, as it ensures that very few high-risk patients are incorrectly classified as safe, thereby supporting reliable early intervention strategies.


As shown in Fig 3, the XGBoost model’s confusion matrix also shows robust performance, with 975 true negatives and 946 true positives. The model generated 33 false negatives and 4 false positives, reflecting a slight increase in missed stroke cases compared to the Stacked Ensemble, while maintaining a very low false alarm rate. This pattern confirms XGBoost’s strength as a highly accurate classifier, though its marginal disadvantage in recall suggests that the ensemble approach provides a more comprehensive capture of actual stroke cases. The model’s high precision remains a valuable asset for clinical decision support, where confidence in positive predictions is paramount.


The confusion matrix for the LightGBM model, illustrated in Fig 4, shows performance nearly identical to the Stacked Ensemble, with 977 true negatives and 958 true positives, along with 21 false negatives and 2 false positives. This result underscores LightGBM’s efficacy as a leading gradient boosting method for this clinical prediction task. Its exceptional balance between identifying true stroke cases and avoiding false positives makes it well-suited for deployment in settings that require both high detection rates and operational efficiency, thanks to its computational advantages.


Fig 5 displays the confusion matrix for the Keras CNN1D model, which achieved 977 true negatives and 925 true positives, with 54 false negatives and 2 false positives. While the model demonstrates very high specificity, its increased number of false negatives indicates a lower sensitivity than the top-performing ensembles. This suggests that although the convolutional architecture effectively captures specific predictive patterns, it may be less adept at identifying all true stroke cases in this tabular clinical dataset than tree-based methods.


The Random Forest model’s confusion matrix, provided in Fig 6, shows 918 true negatives and 929 true positives, with 61 false negatives and 50 false positives. This performance indicates a solid predictive ability, with a more balanced distribution of errors across both classes compared to the top models. The higher counts of both false negatives and false positives reflect the model’s moderate precision and recall, positioning it as a reliable but not optimal choice for highly imbalanced clinical risk prediction, where minimizing false negatives is a priority.


For the Support Vector Machine model, Fig 7 presents a confusion matrix with 915 true negatives, 641 true positives, 138 false negatives, and 358 false positives. The substantial number of false positives indicates a challenge with specificity, leading to a higher rate of incorrect high-risk classifications. While the model maintains reasonable sensitivity, its lower precision suggests it may generate a significant number of false alerts, potentially detracting from its clinical utility by increasing unnecessary follow-up investigations.


The confusion matrix for the Decision Tree model, shown in Fig 8, reveals 882 true negatives and 863 true positives, with 97 false negatives and 116 false positives. This performance profile indicates a model with fair discriminative power but notable overfitting vulnerability, as evidenced by appreciable errors in both directions. The relatively higher error rates underscore the limitations of a single tree structure in managing the complexity and nonlinear interactions inherent in cardiovascular risk data.


As shown in Fig 9, the Keras MLP model’s confusion matrix shows 806 true negatives, 785 true positives, 173 false negatives, and 194 false positives. The elevated counts of both error types indicate suboptimal learning from the tabular data, likely due to the model’s architectural simplicity or insufficient feature representation capacity; performance might improve with more extensive tuning or larger datasets. As it stands, this limits the model’s reliability for precise clinical stratification.


The Logistic Regression model’s performance, depicted in Fig 10, yields 903 true negatives, 476 true positives, 503 false negatives, and 76 false positives. The extremely high number of false negatives highlights the model’s critically low sensitivity, failing to identify a majority of actual stroke cases. While specificity remains high, this severe imbalance renders the model clinically inadequate for stroke risk screening, where missing true cases carries significant adverse consequences.


The confusion matrix for the Scikit-learn MLP classifier, presented in Fig 11, shows a model with moderate, balanced, yet suboptimal predictive performance for stroke risk in patients with coronary heart disease. The model correctly identified 799 true negatives and 818 true positives, indicating reasonable overall discriminative capacity. However, with 180 false positives and 161 false negatives, the classifier exhibits substantial misclassification rates in both directions. The substantial number of false positives suggests reduced specificity, potentially leading to unnecessary clinical investigations and increased healthcare resource utilization. Conversely, the considerable count of false negatives indicates compromised sensitivity, meaning a significant portion of true stroke cases would be overlooked, posing a serious risk in preventive care contexts.


Finally, the confusion matrix for the Naive Bayes classifier, presented in Fig 12, shows 770 true negatives and 558 true positives, with 421 false negatives and 209 false positives. This model exhibits the weakest performance among the evaluated algorithms, with high rates of both missed strokes and false alarms. The poor performance underscores the inappropriateness of its conditional independence assumption for the complex, interrelated risk factors present in coronary heart disease and stroke prediction, severely limiting its practical application in this clinical domain.


Comparison of Machine Learning Algorithms for Predicting Stroke Among Patients with Heart Disease
The machine learning results demonstrate clear performance differences across the evaluated algorithms, with the stacked ensemble model providing the strongest and most stable predictive ability for stroke among patients with coronary heart disease, as shown in Table 2. The stacked ensemble achieved the highest overall accuracy and one of the highest precision scores, reflecting its ability to combine the strengths of individual models while reducing their weaknesses. Its high recall and specificity further confirm that it effectively identifies both stroke and non-stroke cases with minimal error. The Matthews correlation coefficient, one of the most reliable measures for imbalanced clinical datasets, was also highest for the stacked ensemble model, indicating consistent performance across all cells of the confusion matrix. The very low Brier score shows that the model produced well-calibrated probability estimates, making it suitable for real-world clinical decision-making.
LightGBM also performed exceptionally well and was nearly as good as the stacked ensemble. It achieved the highest precision of all models and maintained very high accuracy, recall, and specificity. The similarity in performance between LightGBM and the stacked ensemble underscores the advantage of gradient-boosting decision trees in structured clinical datasets. These models can capture complex interactions among predictors and remain robust to nonlinear relationships. XGBoost ranked slightly below LightGBM but still demonstrated strong metrics across accuracy, precision, recall, specificity, and F1 score. Its high Matthews correlation coefficient and relatively low Brier score indicate that it remained a reliable and stable model, although not as strong as the top two boosting methods.
Deep learning models showed moderate to strong performance. The Keras CNN model outperformed both the Keras MLP and the scikit-learn MLP. It achieved high levels of accuracy, precision, and recall, suggesting that convolutional layers were effective in extracting meaningful representations from the tabular data. The multilayer perceptron models had lower predictive scores, likely because they require larger datasets or additional feature engineering to fully capture complex clinical patterns. Their Matthews correlation coefficients and Brier scores were also weaker, indicating reduced stability and probability calibration.
Traditional machine learning models, such as Random Forest and Support Vector Machine, performed reasonably well but did not reach the level of the advanced gradient boosting or ensemble techniques. Random Forest achieved balanced accuracy and recall, with a relatively strong Matthews correlation coefficient, suggesting it detected stroke cases more reliably than the Support Vector Machine. The Support Vector Machine showed good precision and specificity but had a comparatively lower recall, indicating a tendency to miss a proportion of true stroke cases.
Classical statistical models, including Logistic Regression and Naive Bayes, had the lowest performance across nearly all metrics. Logistic Regression achieved reasonably high specificity but extremely low recall, indicating that it failed to identify a substantial proportion of stroke cases, leading to a high number of false negatives. The F1 score and Matthews correlation coefficient were also low, reinforcing its limited ability in this clinical prediction context. Naive Bayes performed similarly, with reduced accuracy, precision, and recall. Its high Brier score suggests poor calibration of predicted probabilities. These results align with established limitations of linear and distribution-dependent models when applied to the complex nonlinear relationships common in cardiovascular health data.
Overall, the findings show that ensemble learning and gradient boosting techniques delivered the best predictive performance for stroke detection in patients with coronary heart disease. Their superior accuracy, balanced precision and recall, high Matthews correlation coefficients, and low Brier scores demonstrate that they are more dependable and clinically meaningful than deep learning, traditional machine learning methods, or classical statistical models. Their strong performance supports their suitability for clinical risk prediction and early intervention strategies aimed at preventing stroke in high-risk cardiovascular populations.
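The stacking approach evaluated above can be sketched with scikit-learn’s `StackingClassifier`, in which base learners pass out-of-fold probability estimates to a meta-learner. The base models and synthetic data below are illustrative stand-ins, not the study’s actual configuration:

```python
# Minimal sketch of a stacked ensemble: base learners feed out-of-fold
# probabilities to a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data as a stand-in for the clinical dataset.
X, y = make_classification(n_samples=1500, n_features=20,
                           weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(n_estimators=50,
                                                  random_state=42)),
                ("rf", RandomForestClassifier(n_estimators=50,
                                              random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold predictions for the meta-learner
    stack_method="predict_proba")  # stack probabilities, not hard labels
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(f"held-out accuracy: {acc:.3f}")
```

Using `cv=5` with `stack_method="predict_proba"` is what lets the meta-learner correct systematic errors of the base models without leaking training labels.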
ROC Curve Analysis for Stroke Prediction Models
The receiver operating characteristic curves presented in Fig 13 show clear differences in the discriminative ability of the machine learning algorithms used to predict stroke among patients with coronary heart disease. The curves illustrate how well each model distinguishes between individuals with and without stroke across all possible decision thresholds. The area under the curve values summarize this performance, with larger values reflecting better separation between positive and negative cases.


The stacked ensemble and LightGBM models demonstrated the strongest discrimination, each producing an area under the curve value of approximately 0.994. Their curves remain close to the upper-left corner of the plot, indicating that they achieved very high true-positive rates even at low false-positive rates. XGBoost closely followed, with an area under the curve of 0.991, reflecting similarly strong classification performance. The Keras CNN model achieved an AUC of 0.988, and the Random Forest model reached 0.986. These results show that both deep learning and tree-based ensemble methods were highly effective at separating stroke from non-stroke cases.
Support Vector Machine also performed well, with an area under the curve of 0.955, although its curve did not reach the same level as those of the gradient boosting and ensemble methods. The multilayer perceptron models showed moderate performance. The scikit-learn multilayer perceptron achieved an area under the curve of 0.904, while the Keras multilayer perceptron achieved 0.883. The decision tree model performed slightly lower with an area under the curve value of 0.891. These results show that simpler or less-optimized models had a reduced ability to distinguish between classes compared with advanced ensemble methods.
Logistic Regression and Naive Bayes recorded the lowest discrimination performance, with area under the curve values of 0.740 and 0.739, respectively. Their curves remain close to the diagonal reference line shown in Fig 13, indicating limited capability to separate stroke and non-stroke cases. These patterns highlight the challenges that linear and distribution-based models face when applied to complex clinical data that involve nonlinear interactions.
Overall, the ROC AUC results provide strong evidence that ensemble and gradient boosting techniques deliver superior discrimination compared with deep learning, traditional machine learning, and classical statistical models. Fig 13 clearly demonstrates that the stacked ensemble, LightGBM, and XGBoost models are the most effective approaches for stroke prediction in patients with coronary heart disease, making them the most suitable candidates for integration into clinical decision support systems.
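The AUC values reported above are typically computed by ranking predicted probabilities against the true labels across all thresholds, as in this minimal sketch on synthetic data (not the study’s dataset):

```python
# Sketch of ROC computation: sweep thresholds over predicted
# probabilities and summarize with the area under the curve.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = (LogisticRegression(max_iter=1000)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
fpr, tpr, thresholds = roc_curve(y_te, proba)  # one (FPR, TPR) per threshold
auc = roc_auc_score(y_te, proba)
print(f"AUC = {auc:.3f}")
```

Plotting `fpr` against `tpr` reproduces a curve of the kind shown in Fig 13; a curve hugging the upper-left corner corresponds to an AUC near 1.0.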
Precision-Recall Curve Analysis for Stroke Prediction Models
The precision-recall curves in Fig 14 provide further insight into the performance of the machine learning models in predicting stroke among patients with coronary heart disease. Unlike the receiver operating characteristic curve, which can present an optimistic picture when classes are imbalanced, the precision-recall curve emphasizes performance on the positive class. This characteristic makes it a particularly useful tool for clinical datasets in which stroke cases represent a small proportion of the population. The area under the precision-recall curve summarizes this performance, with higher values indicating a stronger ability to identify true stroke cases while minimizing false positives.


The stacked ensemble model achieved the highest area under the precision-recall curve, with a value of 0.996. Its curve remains close to the upper boundary throughout the range of recall values, showing that it consistently maintains high precision even when identifying a large proportion of true stroke cases. LightGBM performed almost identically, with an area under the precision-recall curve of 0.996, further confirming its robust discrimination and reliability under imbalanced conditions. XGBoost achieved an area under the precision-recall curve of 0.995, demonstrating similarly strong performance. These three models show minimal performance loss as recall increases, highlighting the stability of gradient boosting and ensemble strategies.
The Keras CNN model performed very well, with an area under the precision-recall curve of 0.992, confirming the ability of convolutional layers to extract meaningful patterns from tabular clinical data. Random Forest also achieved strong performance with a value of 0.987. Although these values are slightly lower than those of the boosting models, their curves remain well above the baseline in Fig 14, indicating consistent separation of positive and negative cases even when the dataset is imbalanced.
The Support Vector Machine showed moderate performance, with an area under the precision-recall curve of 0.958. Its curve indicates that the model achieves reasonable precision but shows a steeper decline as recall increases. The multilayer perceptron models showed weaker performance, with the scikit-learn model achieving an area under the precision-recall curve of 0.905 and the Keras model achieving 0.886. The decision tree model performed slightly worse, with an area under the precision-recall curve of 0.852. These results reflect the limitations of models that rely solely on shallow tree structures or fully connected layers when applied to complex clinical data.
Logistic Regression and Naive Bayes achieved the lowest performance with values of 0.741 and 0.761, respectively. Their precision-recall curves in Fig 14 approach the baseline, indicating limited ability to correctly identify stroke cases, especially at higher recall levels. These models are less able to maintain precision as they attempt to identify more true stroke cases, thereby reducing their suitability for clinical use in early stroke risk detection.
Overall, the area under the precision-recall curves in Fig 14 confirms that the stacked ensemble, LightGBM, and XGBoost models deliver the strongest performance in identifying stroke cases despite class imbalance. Their curves remain consistently high across the full range of recall values, reinforcing their superiority over deep learning models, traditional machine learning approaches, and classical statistical methods.
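An area under the precision-recall curve like those in Fig 14 can be computed as the average precision over all thresholds. The sketch below uses an imbalanced synthetic dataset as a stand-in for the study’s data:

```python
# Sketch of precision-recall analysis under class imbalance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# ~15% positive prevalence, loosely mimicking a minority stroke class.
X, y = make_classification(n_samples=2000, n_features=15,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = (RandomForestClassifier(n_estimators=100, random_state=0)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
precision, recall, _ = precision_recall_curve(y_te, proba)
ap = average_precision_score(y_te, proba)  # area under the PR curve
print(f"average precision = {ap:.3f}")
```

Note that the no-skill baseline of a precision-recall curve is the positive prevalence (here 0.15), not 0.5 as in ROC analysis, which is why the curve is the more demanding summary for rare outcomes.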
Calibration Curve Analysis for Stroke Prediction Models
The calibration curves for all evaluated machine learning models provide a critical assessment of the reliability of predicted stroke risk probabilities among patients with coronary heart disease. Calibration curves plot the observed event frequency against the mean predicted probability for binned predictions, with ideal calibration represented by a diagonal line where the predicted probabilities exactly match the observed outcomes. As shown in Fig 15, the Stacked Ensemble and LightGBM models demonstrated near-perfect calibration, with their curves closely aligning with the ideal diagonal. This is supported quantitatively by their exceptionally low Brier scores of 0.011, indicating minimal discrepancy between predicted risks and observed event rates. Such high calibration is clinically essential, as it ensures that a predicted stroke risk of, for example, 20% corresponds to an actual event frequency of approximately 20% in similar patient groups, thereby enabling trustworthy individualized risk stratification.


Fig 15 also reveals distinct tiers of calibration performance among the other models. XGBoost and the Keras CNN1D model followed with strong calibration, evidenced by their curves remaining close to the diagonal and their low Brier scores of 0.016 and 0.023, respectively. Models such as Random Forest and Support Vector Machine exhibited moderate calibration, with visible deviations from the ideal line and higher Brier scores of 0.055 and 0.079, suggesting a tendency towards underconfidence or overconfidence in specific probability ranges. In contrast, classical models such as Logistic Regression and Naive Bayes showed poor calibration, with curves deviating substantially from the diagonal and Brier scores of 0.197 and 0.274, respectively. This indicates that their predicted probabilities are not well aligned with actual outcomes, limiting their utility for precise clinical risk estimation. The overall pattern in Fig 15 reinforces that advanced ensemble and gradient boosting methods not only achieve superior discriminative performance but also produce well-calibrated probability estimates, which are crucial for informed clinical decision-making and the implementation of personalized preventive strategies.
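Calibration curves and Brier scores of the kind shown in Fig 15 can be reproduced by binning predictions and comparing each bin’s mean predicted probability with its observed event frequency, as in this sketch on synthetic data:

```python
# Sketch of calibration assessment: binned reliability curve plus
# the Brier score (mean squared error of predicted probabilities).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = (LogisticRegression(max_iter=1000)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# Each point pairs a bin's observed event rate with its mean prediction;
# a perfectly calibrated model lies on the diagonal.
prob_true, prob_pred = calibration_curve(y_te, proba, n_bins=10)
brier = brier_score_loss(y_te, proba)  # 0 would be perfect
print(f"Brier score = {brier:.3f}")
```

Plotting `prob_pred` against `prob_true` with a diagonal reference line reproduces a reliability diagram of the form shown in Fig 15.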
Learning Curve Analysis for Stroke Prediction Models
The learning curve for the LightGBM model shown in Fig 16 demonstrates excellent generalization capability. Both the training and cross-validation F1 scores rise quickly and converge to a high, stable plateau as the number of training examples increases. The small gap between the training and validation curves indicates that the model is learning the underlying patterns effectively without overfitting. This robust learning behavior supports LightGBM’s suitability for deployment on clinical datasets of varying sizes, ensuring reliable performance in real-world settings.


As shown in Fig 17, the XGBoost model exhibits a similarly strong learning trajectory. The training and cross-validation F1 scores improve consistently with more data, converging at a high level of performance. A slight, persistent gap between the curves suggests a minor degree of overfitting, which is well managed by the model’s built-in regularization. This pattern confirms XGBoost’s ability to leverage additional data to improve its predictive stability and accuracy in stroke risk prediction.


Fig 18 displays the learning curve for the Random Forest model. While both curves achieve high final F1 scores, a consistent and noticeable gap persists between the training and validation performance. This indicates that the model is overfitting the training data to some extent, a common trait of bagging ensembles that memorize noise. Nevertheless, the cross-validation score remains strong, confirming the model’s utility, though it benefits less from additional data compared to boosting methods.


The training history for the Keras CNN1D model in Fig 19 shows a smooth, rapid decline in both training and validation loss over epochs. The two curves decrease in close concordance and stabilize at a low loss. This convergence indicates that the convolutional architecture learned meaningful feature representations from the tabular data effectively, without signs of overfitting, and that the training process was well regularized.


In contrast, the learning curve for the Decision Tree model, shown in Fig 20, reveals a characteristic pattern of high variance. The training F1 score reaches near-perfect levels very quickly, while the cross-validation F1 score plateaus at a significantly lower value. This large and widening gap is a clear sign of severe overfitting: the single tree structure memorizes the training data but fails to generalize well to unseen cases, limiting its clinical reliability.


The learning curve for the Support Vector Machine (SVM), shown in Fig 21, indicates moderate final performance with a concerning trend. As the training size increases, the gap between the training and cross-validation F1 scores widens. This suggests that the SVM model is becoming increasingly overfit as more data are added, likely due to the complexity of the kernelized decision boundary in the high-dimensional space of clinical features.


As shown in Fig 22, the Scikit-learn MLP classifier exhibits high variance in its learning curve. The training F1 score is consistently and significantly higher than the cross-validation score, which plateaus at a moderate level. This substantial gap indicates that the multilayer perceptron is overfitting to the training set and may require more aggressive regularization, architectural adjustments, or substantially more data to improve its generalization.


The training history for the Keras MLP model shown in Fig 23 indicates that the training loss decreases rapidly. However, the validation loss plateaus after a few epochs and even shows a slight upward trend later in training. This divergence is a classic indicator of overfitting, suggesting that the model would benefit from techniques such as early stopping, dropout, or reduced model complexity to improve its validation performance.


Fig 24 illustrates the learning curve for Logistic Regression. Both the training and cross-validation F1 scores remain low and stable across all dataset sizes, with almost no gap between them. This flat, low-performance profile is a definitive sign of underfitting, indicating that the linear model is fundamentally too simple to capture the complex, nonlinear relationships governing stroke risk in this patient population.


Finally, the learning curve for the Naive Bayes classifier, shown in Fig 25, confirms its severe limitations. The training and cross-validation F1 scores are low, stable, and virtually identical across the smallest and largest training sets. This complete lack of improvement with additional data, coupled with the model’s poor asymptotic performance, underscores that its conditional independence assumption is too restrictive for this complex clinical prediction task, leading to persistent underfitting.
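Learning curves like those in Figs 16 to 25 plot training-fold against cross-validation F1 at increasing training-set sizes. The sketch below uses a single decision tree on noisy synthetic data, which reproduces the high-variance gap described for Fig 20; the data and model settings are illustrative assumptions, not the study’s configuration:

```python
# Sketch of a learning-curve diagnosis: a persistent gap between the
# training and cross-validation F1 curves signals overfitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# flip_y injects label noise the tree will memorize, widening the gap.
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                           random_state=1)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="f1")

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean[-1] - val_mean[-1]   # high variance if this stays large
print(f"final train F1 {train_mean[-1]:.3f}, val F1 {val_mean[-1]:.3f}")
```

Plotting `train_mean` and `val_mean` against `sizes` yields curves of the form discussed above: converging curves indicate good generalization, a flat pair of low curves indicates underfitting, and a wide persistent gap indicates overfitting.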


Scalability Analysis for Machine Learning Model Deployment
Scalability was evaluated to determine the feasibility of deploying the trained models in real-world clinical environments, particularly those with constrained computational resources such as primary health facilities, mobile health systems, and edge devices. The assessment focused on four dimensions of computational efficiency: inference latency, training time, memory requirements during model training and inference, and final model size, as shown in Table 3. These metrics collectively indicate whether each model can operate effectively in settings with limited hardware capacity, response time, and storage.
Inference latency refers to the average time required for a model to generate a prediction for a single patient. Low latency is essential for use cases such as point-of-care screening, triage support, and mobile health applications that require real-time responses. The results show that most models have microsecond-level inference times, with the Decision Tree and Scikit-learn MLP emerging as the fastest at inference. In contrast, the Keras CNN1D and Support Vector Machine models demonstrated higher inference times, reflecting the computational intensity of convolutional and kernel-based operations.
Training time measures how long each model requires to learn from the dataset. Models with shorter training times are advantageous for scenarios requiring frequent retraining, such as incorporating new patient data or adapting to population-specific trends. Logistic Regression and Naive Bayes trained the fastest, completing in under one second, while deep learning models such as Keras CNN1D required substantially longer training durations. These differences reflect the greater parameter complexity and optimization demands of neural architectures.
Training memory and inference memory quantify peak RAM usage during model learning and prediction generation, respectively. Naive Bayes and Decision Tree models consumed the least memory, making them suitable for budget-constrained environments, whereas neural networks, particularly Keras MLP and Keras CNN1D, required significantly more memory during training. Despite this, their inference memory remained modest, indicating that deployment is still feasible on mid-range systems.
Model size reflects storage requirements for saving and loading the trained model. This metric is critical for implementation on devices with limited disk capacity, such as mobile phones or embedded systems. Naive Bayes and Logistic Regression yielded the smallest models, occupying only a few kilobytes, while Random Forest produced the largest model, exceeding 30 megabytes due to its large number of trees and the storage of split parameters.
Overall, the scalability results demonstrate that lightweight models such as Logistic Regression, Decision Tree, and Naive Bayes are the most efficient for low-resource deployment. However, more complex models such as XGBoost and LightGBM maintain a strong balance between computational efficiency and predictive performance, making them suitable for both resource-limited and high-performance clinical environments. Deep learning models deliver strong predictive performance but require substantial computational resources during training, though their inference demands remain manageable.
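Scalability metrics of the kind summarized in Table 3 can be measured with standard-library tooling: wall-clock training time, mean single-patient inference latency, and serialized model size. The sketch below benchmarks a logistic regression on synthetic data; the specific model and data are stand-ins, not the study’s deployment pipeline:

```python
# Micro-benchmark sketch: training time, per-sample inference latency,
# and serialized model size for one candidate model.
import pickle
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

t0 = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X, y)
train_time = time.perf_counter() - t0          # seconds to train

t0 = time.perf_counter()
for row in X[:200]:
    model.predict(row.reshape(1, -1))          # one patient at a time
latency = (time.perf_counter() - t0) / 200     # seconds per prediction

model_size = len(pickle.dumps(model))          # bytes when serialized
print(f"train {train_time:.3f}s, latency {latency*1e3:.3f} ms, "
      f"size {model_size} bytes")
```

Running the same loop over each trained model produces a table comparable in structure to Table 3; peak memory could additionally be tracked with the standard-library `tracemalloc` module.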
Explainability and Interpretability
The inclusion of explainable artificial intelligence (XAI) techniques, specifically SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), significantly strengthens the interpretability of the machine learning models developed in this study. The following figures provide both global and local insights into model predictions, enabling clinicians to understand and trust the model’s decisions.
Global Model Output Distributions
Fig 29 provides a beeswarm plot that aggregates SHAP values across all samples for each model. The spread and density of points show how each feature influences predictions across the entire dataset. Features such as BMI and PhysHlth show a wide distribution of SHAP values, indicating their complex, non-linear relationship with stroke risk. The plot confirms that the top models (LightGBM, XGBoost) produce more concentrated and interpretable effect patterns compared to models with poorer performance, such as SVM or Decision Tree.


Fig 30 offers a detailed view for the LightGBM model. It clearly shows that high values of HeartDisease 0 (no history of heart disease) push the prediction towards a lower stroke risk (negative SHAP values), while high BMI values push predictions towards a higher risk (positive SHAP values). This level of detail is invaluable for understanding the direction and magnitude of each feature’s effect.


Local Explanations for Individual Predictions
The local explanation plots are essential for interpreting individual case predictions, a key requirement for clinical deployment.
Figs 31 and 32 show SHAP force plots for negative predictions (low stroke risk) from the XGBoost and LightGBM models. They decompose how each feature contributes to pulling the base value E[f(X)] down to the final prediction f(x). For example, in Fig 31, features like HighChol 0 and Veggies 0 decrease the risk score, while a high BMI increases it, but the net effect remains negative.




Figs 33 and 34 show force plots for positive predictions (high stroke risk). Here, features such as HeartDisease 1, Age 10, and Education 6 are the primary drivers increasing the risk score above the base value. The clear additive nature of SHAP values makes it easy to audit why a specific patient was flagged as high-risk.
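The additive property that makes force plots auditable, namely that the per-feature contributions sum from the base value E[f(X)] to the individual prediction f(x), can be verified exactly for a linear model, where the Shapley attribution of feature i reduces to coef_i * (x_i - mean_i). The toy example below (not the study’s model or features) checks this numerically:

```python
# Toy check of SHAP additivity for a linear risk score:
# base value E[f(X)] plus per-feature contributions equals f(x).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # 500 synthetic patients
coef = np.array([0.8, -1.2, 0.5, 0.0])
f = X @ coef                                   # linear "risk score"

base_value = f.mean()                          # E[f(X)] over the cohort
x = X[0]                                       # one individual prediction
contributions = coef * (x - X.mean(axis=0))    # exact Shapley values here
reconstructed = base_value + contributions.sum()
print(f"f(x)={f[0]:.4f}  base+sum={reconstructed:.4f}")
```

For tree ensembles the attributions are no longer a closed-form product, but the same additive identity holds, which is what makes it possible to audit exactly why a patient was flagged as high-risk.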




Complementing the SHAP explanations, Figs 35 and 36 present LIME explanations for LightGBM predictions. LIME creates a locally faithful interpretable model (e.g., linear) around a single prediction. For the negative case (Fig 35), conditions like Age 8 > 0.00 and Diabetes 2 > 0.00 are shown to support the low-risk classification. For the positive case (Fig 36), conditions like Age 10 > 0.00 and Income 4 > 0.00 justify the high-risk prediction. The use of both SHAP and LIME provides a multi-faceted view: SHAP offers a consistent, theoretically grounded attribution, while LIME provides a simple, rule-based explanation that may be more intuitive for some end-users.




Synthesis and Clinical Utility
Collectively, these visualizations move the model from a “black box” to a transparent and auditable tool. The global plots (Figs 26, 27, 28, 29) validate the model at a population level, ensuring that the most influential features are clinically plausible. The local plots (Figs 31–36) empower clinicians at the point of care.
Discussion
The current research examined how various machine learning models can predict stroke risk in patients with coronary heart disease and yielded meaningful results that align with the general literature while introducing new insights. Overall, the findings indicate that the best predictive performance was obtained with advanced ensemble learning techniques and gradient boosting algorithms, compared to deep learning, traditional machine learning, and classical statistical techniques. This result aligns with the growing body of cardiovascular and cerebrovascular literature indicating that ensemble techniques are more successful at identifying complex trends in structured clinical data than individual models [2–5].
The robust performance of stacked ensemble, LightGBM, and XGBoost models in this study supports earlier findings that such approaches can outperform individual ML and deep learning models in stroke and cardiac risk prediction [6, 8, 9]. The effectiveness of gradient boosting, in particular, underscores the ability of tree-based ensembles to model nonlinear interactions, which are essential for accurately capturing cardiovascular risk profiles. These findings complement prior studies demonstrating that ensemble-based structures yield superior discrimination and calibration relative to single models [2, 3, 31].
Deep learning models showed moderate to high predictive value, with convolutional neural networks outperforming fully connected multilayer perceptrons, consistent with prior work showing that convolutional models can learn meaningful representations from tabular clinical data but may require very large datasets to surpass ensemble methods [2, 26]. Similarly, conventional ML algorithms such as support vector machines and random forests delivered reasonable performance but were generally less discriminative than gradient-boosted ensembles, echoing earlier observations [10, 13, 19]. Decision trees exhibited clear signs of overfitting, reflecting the known limitations of single-tree models in complex medical prediction tasks.
Classical statistical models, including logistic regression and naive Bayes, performed the worst, consistent with research indicating that linear or distribution-dependent models often underfit high-interaction and mixed-type features in clinical data [15–17]. The low recall rates of logistic regression in predicting stroke cases among coronary heart disease patients highlight the need for more flexible modelling approaches, as reinforced by studies incorporating ML-based prognostication [31].
Scalability analysis further demonstrates that ensemble models like LightGBM and XGBoost are not only highly accurate but also computationally efficient, making them suitable for integration into clinical decision support systems, including in resource-limited settings. Deep learning models, while more memory-intensive and slower to train, offer reasonable inference efficiency, suggesting potential utility in cloud-hosted clinical applications [24, 27, 32].
Explainability analyses revealed that SHAP-based global explanations consistently highlighted age, prior cardiovascular events, BMI, and general health indicators as dominant contributors to stroke risk, corroborating prior studies [2, 6, 10, 14]. The observed consistency of feature rankings across LightGBM and XGBoost aligns with [3, 9]. SHAP beeswarm plots further illustrated nonlinear and interaction-driven effects for features such as BMI and physical health indicators, reflecting the complex relationships identified in other XAI-driven stroke risk studies [19, 25].
At the individual patient level, SHAP and LIME local explanations provided actionable insights consistent with previous literature [7, 16, 17, 25]. High-risk predictions were predominantly driven by advanced age, prior heart disease, and poor general health, while protective lifestyle factors mitigated risk in lower-risk cases. The complementary use of SHAP and LIME adheres to best practices in explainable AI, providing both model-faithful and intuitive rule-based explanations [24, 27, 32].
The integration of traditional logistic regression with XAI methods enabled a more nuanced understanding of stroke risk factors. Logistic regression identified statistically significant predictors, such as prior heart attack, while SHAP and LIME revealed non-linear relationships and interactions that logistic regression could not capture. For instance, BMI, which appeared protective in logistic regression, showed heterogeneous effects in tree-based models, demonstrating increased risk for specific patient subgroups. These findings underscore the value of combining classical statistical models with explainable ML to inform precision medicine.
Finally, this work complements existing studies on stroke risk prediction by addressing mortality and prognostic outcomes in stroke patients [13, 26, 31], providing a more comprehensive perspective on patient trajectories beyond risk classification.
Incorporating explainable hybrid models allows for both accurate prediction and interpretability, bridging the gap between computational sophistication and clinical applicability.
Despite these contributions, several limitations remain. Data were derived from self-reported surveys, introducing recall bias and misclassification for behavioral and lifestyle factors. The cross-sectional design precludes causal inference. High-performing models require external validation using clinical registry or EHR data to confirm generalizability, and the absence of temporal clinical information limits the ability to capture dynamic risk changes. Future work should consider integrating longitudinal and multimodal data to further enhance predictive performance and clinical relevance.
Conclusion
This study successfully designed and tested a comprehensive suite of machine learning models to predict stroke risk in patients with coronary heart disease. Our results demonstrate that contemporary ensemble-based methods, particularly stacked ensembles, LightGBM, and XGBoost, deliver outstanding predictive performance, strong generalization capability, and desirable computational efficiency. These models consistently outperformed deep learning architectures, traditional machine learning approaches, and classical statistical methods, proving particularly effective for clinical risk stratification across diverse patient populations. The superiority of ensemble-based models stems from their ability to capture complex, non-linear interactions among behavioral, demographic, and clinical variables that underlie the intricate relationship between coronary and cerebrovascular diseases.
A key contribution of this work is the integration of explainable artificial intelligence (XAI) techniques, specifically SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations), which provided transparent and interpretable insights into model predictions. By elucidating both global feature importance and local, case-specific reasoning, these methods address the critical “black box” concern that often impedes clinical adoption of sophisticated machine learning models. The convergence of findings between traditional logistic regression and modern XAI techniques strengthens confidence in the identified risk factors, while the additional nuanced relationships revealed by SHAP and LIME enable more personalized risk assessment.
The findings carry significant implications for clinical practice. Given that stroke remains a leading cause of global mortality and long-term disability, the timely identification of high-risk patients with coronary heart disease offers substantial opportunities for targeted prevention. The combination of high accuracy, low inference latency, and inherent interpretability in our best-performing models makes them suitable for integration into real-time clinical decision support systems in both primary care and specialty cardiology settings, as well as population-level screening initiatives. These models thus provide a viable pathway for developing more accurate and individualized prevention programs aligned with global health priorities focused on early diagnosis and personalized management of cardiovascular and neurological diseases.
Future research should focus on further refinement and external validation of these predictive tools across diverse clinical environments. While our models demonstrated strong performance, additional validation using multi-center cohorts, clinical registries, and longitudinal datasets is necessary to enhance generalizability and ensure equitable performance across patient subgroups. Expanding the feature space to include laboratory results, imaging data, and longitudinal clinical markers could further improve model performance and provide deeper insights into the dynamic interplay between coronary disease and stroke risk over time.
