Data preprocessing and feature selection
We initially collected data on 40 variables from patients admitted to the cardiology department. To ensure data quality, we applied the following preprocessing steps. Figure 1 shows the percentage of missing values for each variable in the dataset. Variables with more than 30% missing data were excluded from further analysis. This criterion led to the removal of five variables: Glu_2h (89.99% missing), HbA1c (56.06%), renal anemia (47.54%), serum albumin (47.70%), and PO2 (48.80%) (Fig. 2). For the remaining variables, missing data were imputed using the random forest method, which preserves relationships among variables when estimating missing values. This method was selected for its ability to handle mixed data types and capture complex interactions, which is important in clinical datasets. After preprocessing, 35 variables were retained for subsequent analysis. These included demographic factors (sex, age), clinical measurements (body weight, CRP, WBC, cTnT, creatinine, NT-proBNP, urea, glucose, hemoglobin, D-dimer), blood gas parameters (PCO2, pH), and various comorbidities (hypertension, coronary heart disease, hyperlipidemia, diabetes, angina pectoris, renal insufficiency, lung infection, cardiac insufficiency, cerebral infarction, cerebral hemorrhage, gallbladder stone, gastrointestinal hemorrhage, atherosclerosis, atrial fibrillation, peritoneal dialysis, renal artery stenosis, cirrhosis, dyslipidemia, acute myocardial infarction, transient ischemic attack).

Bar plot showing the percentage of missing values for each variable. Variables are ordered by the percentage of missing data, with those exceeding 30% highlighted for exclusion.
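The 30% exclusion rule described above can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: the table layout (a dictionary of columns with `None` marking a missing entry), the helper names, and all toy values are assumptions made for the example. The subsequent random forest imputation step is not shown.

```python
def missing_fraction(values):
    """Fraction of entries that are None (treated here as missing)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_columns(table, threshold=0.30):
    """Split columns into those kept and those dropped for exceeding the threshold."""
    kept, dropped = {}, {}
    for name, values in table.items():
        frac = missing_fraction(values)
        if frac > threshold:
            dropped[name] = frac   # excluded, remember how sparse it was
        else:
            kept[name] = values    # retained for later imputation
    return kept, dropped


# Hypothetical toy table: 10 patients, 3 variables (all values are made up).
toy = {
    "Hb": [130, 92, None, 141, 120, 118, 99, 135, 127, 110],          # 10% missing
    "HbA1c": [None, 6.1, None, None, 5.8, None, None, 7.2, None, 6.4],  # 60% missing
    "Glu_2h": [None] * 9 + [7.8],                                       # 90% missing
}

kept, dropped = drop_sparse_columns(toy)
print(sorted(kept))   # columns retained
print(dropped)        # columns excluded, with their missing fraction
```

In this toy table only `Hb` survives the filter, mirroring how sparsely recorded laboratory values were removed before imputation.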
Baseline characteristics of the study population
Our study included a total of 10,706 patients admitted to the cardiology department. We compared the characteristics of patients who experienced gastrointestinal hemorrhage (n = 110) with those who did not (n = 10,596). To address this class imbalance, we used the SMOTE method to increase the number of positive cases (i.e., gastrointestinal hemorrhage events) from the original 110 to 1,100 by generating synthetic samples. The proportion of positive events thereby rose from 110/10,706 (about 1.0%) to 1,100/11,696 (about 9.4%). Subsequently, the cohort was randomly divided into a training set (n = 9,356) and a test set (n = 2,340) at an 8:2 ratio.
Table 1 presents the baseline characteristics of patients with and without gastrointestinal hemorrhage (left column) and the baseline characteristics of the training and test sets after SMOTE balancing (right column).
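The oversampling arithmetic and the interpolation idea behind SMOTE can be illustrated as follows. This is a simplified sketch, assuming Euclidean distance on unscaled features and a hypothetical helper named `smote`; it is not the implementation used in the study, and the minority points are made up.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points by interpolating between a minority sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Other minority samples, sorted by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Counts taken from the text.
n_total, n_pos, n_pos_target = 10_706, 110, 1_100
n_new = n_pos_target - n_pos          # synthetic samples that must be generated
print(n_new, n_total + n_new)         # 990 11696
print(f"{n_pos / n_total:.2%} -> {n_pos_target / (n_total + n_new):.2%}")

# Hypothetical minority-class points (Hb in g/L, D-dimer in mg/L); values are made up.
minority = [(92.0, 3.4), (88.0, 4.1), (101.0, 2.9), (95.0, 3.8), (79.0, 5.0), (97.0, 3.1)]
new_points = smote(minority, n_synthetic=12, k=3)
print(len(new_points))
```

Each synthetic point lies on a line segment between two real positive cases, so it stays within the range of the observed minority data.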
Comparison of patients with and without gastrointestinal hemorrhage
Patients who experienced gastrointestinal hemorrhage were significantly older (72.01 ± 14.24 vs. 59.88 ± 14.88 years, p < 0.001) and had a higher number of emergency treatments (0.32 ± 0.93 vs. 0.04 ± 0.34, p = 0.002) compared to those without bleeding. They also exhibited significantly elevated inflammatory markers, including C-reactive protein (CRP: 2.86 ± 3.69 vs. 1.13 ± 4.52 mg/L, p < 0.001) and white blood cell count (WBC: 8.14 ± 5.37 vs. 6.93 ± 2.75 × 10^9/L, p = 0.019), as well as increased cardiac stress indicators (NT-proBNP: 4873.24 ± 9806.96 vs. 936.60 ± 3835.61 pg/mL, p < 0.001). Notably, patients with gastrointestinal hemorrhage had significantly lower hemoglobin levels (92.48 ± 20.75 vs. 130.30 ± 21.96 g/L, p < 0.001) and serum albumin levels (33.70 ± 5.16 vs. 40.21 ± 5.00 g/L, p < 0.001). They also showed elevated levels of D-dimer (3.44 ± 3.83 vs. 1.09 ± 2.31 mg/L, p < 0.001), urea (9.32 ± 8.04 vs. 6.39 ± 4.72 mmol/L, p < 0.001), and glucose (8.39 ± 3.73 vs. 6.28 ± 2.68 mmol/L, p < 0.001).
Regarding comorbidities, patients with gastrointestinal hemorrhage had a significantly higher prevalence of cirrhosis (6.36% vs. 0.55%, p < 0.001), cardiac insufficiency (6.36% vs. 1.79%, p = 0.001), and cerebral infarction (8.18% vs. 2.47%, p < 0.001). Conversely, they had a lower prevalence of hyperlipidemia (0% vs. 7.91%, p = 0.002) and angina pectoris (0.91% vs. 12.21%, p < 0.001) (Table 1).
Comparison of training and test sets after SMOTE
The cohort was divided into a training set (n = 9,356) and a test set (n = 2,340) at an 8:2 ratio. The two sets were well balanced, with no significant differences in most variables. Age (61.01 ± 15.10 vs. 61.25 ± 15.11 years, p = 0.49), sex (male: 50.60% vs. 50.94%, p = 0.768), and key clinical parameters (including CRP, WBC, NT-proBNP, and hemoglobin) were similar between the sets (all p > 0.05). The prevalence of comorbidities such as hypertension, coronary heart disease, diabetes, and renal insufficiency also showed no significant differences (all p > 0.05). The only exception was lung infection, which was slightly more frequent in the test set (2.56% vs. 1.89%, p = 0.039); this minor difference is unlikely to affect model performance (Table 1).
Feature selection using AdaBoost algorithm
To identify the most influential predictors of gastrointestinal hemorrhage in cardiac patients, we applied the AdaBoost algorithm for feature selection. This approach ranked variables according to their contribution to the predictive model, allowing us to focus on the top 10 features for further analysis. Figure 3 displays the relative importance of these top 10 variables as horizontal bars, with bar length representing each variable’s weight in the model. Hemoglobin (Hb) was the most important predictor, followed by creatinine (Cr) and D-dimer. The remaining variables are shown in descending order of importance.

Feature Importance of Top 10 Variables Identified by AdaBoost Algorithm. The x-axis represents the weight importance, while the y-axis lists the variables. Longer bars indicate higher importance in the predictive model.
Table 2 presents the weight importance scores for these top 10 variables, providing a quantitative measure of their relative contribution to the predictive model. Hemoglobin demonstrated the highest importance score (0.16), followed by creatinine (0.12) and D-dimer (0.10). NT-proBNP, glucose, white blood cell count, and body weight each contributed equally with a score of 0.06. Serum albumin, urea, and age completed the top 10, each with an importance score of 0.04.
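One way weight-based importance scores of this kind can arise is sketched here with a from-scratch AdaBoost over single-feature decision stumps. Each feature's score is taken as the sum of the weights (alpha) of the stumps that split on it, normalized to 1. This convention is an assumption made for illustration; library implementations may define importance differently, and the toy data are made up.

```python
import math


def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump; y in {-1, +1}."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def adaboost_importance(X, y, n_rounds=10):
    """Normalized sum of stump weights per feature."""
    n = len(X)
    w = [1.0 / n] * n
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this stump
        importance[j] += alpha
        pred = [sign if row[j] <= thr else -sign for row in X]
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    s = sum(importance)
    return [v / s for v in importance]


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2], [7, 6], [8, 4]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
imp = adaboost_importance(X, y, n_rounds=5)
print(imp)
```

Because the scores are normalized, they can be read as shares of the ensemble's total weight, which is how the values in Table 2 would be interpreted under this convention.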
Multicollinearity analysis of selected variables
We conducted a multicollinearity analysis on the top 10 variables identified by the AdaBoost algorithm. Table 3 presents the results of variance inflation factor (VIF) calculations. The highest VIF value was 3.119 for urea, followed by 2.829 for creatinine. All other variables had VIF values below 2. Figure 4 displays the correlation matrix of the top 10 variables. The correlation matrix identified several key associations among variables: urea and creatinine demonstrated the strongest positive correlation (r = 0.78); hemoglobin positively correlated with serum albumin (r = 0.57) but negatively correlated with D-dimer (r = -0.35), NT-proBNP (r = -0.31), and urea (r = -0.37). Additionally, NT-proBNP showed moderate positive correlations with creatinine (r = 0.49) and urea (r = 0.54). Importantly, no correlation coefficient exceeded 0.8 and all VIF values were below 5, suggesting no severe multicollinearity among the analyzed features.

Correlation Matrix of the Top 10 Variables. The color intensity and size of the circles are proportional to the correlation coefficients. Blue circles indicate positive correlations, while red circles indicate negative correlations.
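The link between the correlation matrix and the VIF values can be made concrete: the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix, and with only two predictors this reduces to 1 / (1 - r^2). The sketch below uses hypothetical helper names and made-up toy measurements; it is not the study's code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in columns] for a in columns]
    inv = invert(corr)
    return [inv[i][i] for i in range(len(columns))]


# Two-variable case with the reported urea-creatinine correlation.
print(round(1 / (1 - 0.78 ** 2), 2))  # 2.55

# Hypothetical toy measurements (values are made up).
urea = [5.1, 6.3, 9.8, 7.2, 12.4, 4.9, 8.8]
creatinine = [70, 95, 120, 82, 140, 88, 76]
hb = [135, 128, 96, 120, 88, 140, 112]
print(vif([urea, creatinine, hb]))
```

A pairwise correlation of 0.78 alone corresponds to a VIF of about 2.55, which is of the same order as the values reported for urea and creatinine once the other predictors are included.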
Comparison of multiple machine learning models
We evaluated seven machine learning models for their performance in predicting gastrointestinal hemorrhage in cardiac patients. Figure 5 provides a comprehensive comparison of these models across multiple performance metrics. Tables 4 and 5 summarize the performance metrics of all models in the training and validation sets, respectively.

Performance comparison of machine learning models. (A) ROC curves for training set; (B) ROC curves for validation set; (C) Calibration curves for validation set; (D) Decision curve analysis for validation set; (E) PR curves for training set; (F) PR curves for validation set.
In the training set (Fig. 5A), XGBoost, LightGBM, and Random Forest all achieved perfect AUC scores of 1.000 (SD: 0.000). However, their performance varied in the validation set (Fig. 5B). XGBoost demonstrated the highest AUC at 0.995 (SD: 0.001), followed closely by LightGBM (AUC: 0.993, SD: 0.002) and Random Forest (AUC: 0.987, SD: 0.005). In terms of accuracy, XGBoost achieved 0.975 (SD: 0.003), with a sensitivity of 0.769 (SD: 0.044) and a specificity of 0.996 (SD: 0.002). LightGBM and Random Forest showed comparable accuracies of 0.979 (SD: 0.003) and 0.976 (SD: 0.002), respectively.
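For readers less familiar with the AUC, it can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition; the function name, labels, and scores are illustrative assumptions rather than study data.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted scores (made up).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]
print(auc(y_true, scores))  # 8 of 9 pairs are ranked correctly
```

An AUC of 1.000, as reported for the training set, means every positive case was scored above every negative case.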
Logistic Regression exhibited consistent but lower performance, achieving an AUC of 0.932 (SD: 0.007) in the validation set. It had the highest sensitivity (0.915, SD: 0.011) among all models but at the expense of lower specificity (0.840, SD: 0.006). Calibration curves (Fig. 5C) indicated that XGBoost, LightGBM, and Random Forest were well-calibrated, closely following the ideal line, whereas Gaussian Naive Bayes deviated notably. Decision curve analysis (Fig. 5D) further demonstrated the superior clinical utility of XGBoost, LightGBM, and Random Forest across a broad range of threshold probabilities (0.1–0.8), consistently yielding higher net benefits than “treat all” or “treat none” approaches. Precision-Recall curves (Fig. 5E and F) highlighted the robust performance of these three models in both training and validation sets, maintaining high precision over varying recall despite class imbalance. The forest plot (Fig. 6) visually confirmed their superior performance with non-overlapping confidence intervals compared to other models. DeLong’s test results (Supplementary Tables 1 and 2) showed that XGBoost significantly outperformed all models (p < 0.05), except for LightGBM (p = 0.379) and Random Forest (p = 0.035).

Multi-model forest plot for validation set.
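The net benefit plotted in a decision curve analysis can be computed as TP/n - (FP/n) x pt/(1 - pt), where pt is the threshold probability. The "treat all" strategy is obtained by classifying every patient as positive, and "treat none" has a net benefit of zero. The sketch below assumes this standard formula; the function names and toy data are hypothetical.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of treating patients whose predicted risk is >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is treated regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy cohort (made up): 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.90, 0.60, 0.70, 0.20, 0.10, 0.05, 0.30, 0.15, 0.10, 0.05]

for pt in (0.1, 0.3, 0.5, 0.8):
    model_nb = net_benefit(y_true, probs, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    print(pt, round(model_nb, 3), round(all_nb, 3))
```

A model is considered useful at a given threshold when its net benefit exceeds both the "treat all" value and zero.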
XGBoost as the optimal predictive model
After a comprehensive evaluation of multiple machine learning algorithms, XGBoost was identified as the most effective model for predicting gastrointestinal hemorrhage in cardiac patients. Figure 7 illustrates its performance across various evaluation metrics. A detailed summary of the model’s performance in the training, validation, and test sets is provided in Table 6.

Performance evaluation of the XGBoost model. (A) ROC curve for the training set; (B) ROC curve for the validation set; (C) ROC curve for the test set; (D) Learning curve; (E) Decision curve; (F) Confusion matrix for the training set; (G) Confusion matrix for the validation set; (H) Calibration curve.
The XGBoost model demonstrated excellent discriminative performance across all datasets. In the training set, it achieved an AUC of 1.000 (SD: 0.000) and an accuracy of 1.000 (SD: 0.000) (Fig. 7A; Table 6). This high performance was largely sustained in the validation set, with an AUC of 0.994 (SD: 0.003) and an accuracy of 0.973 (SD: 0.005) (Fig. 7B; Table 6). The model also maintained robust performance in the independent test set, reaching an AUC of 0.995 and an accuracy of 0.974 (Fig. 7C; Table 6). The comparative analysis in Supplementary Table 3 shows that XGBoost outperformed the other algorithms in the independent test set: its AUC (0.995) was higher than that of AdaBoost (0.962) and Decision Tree (0.896). XGBoost combined high specificity (0.991) with a sensitivity of 0.790 and an F1-score of 0.839, compared with 0.819 and 0.786, respectively, for Decision Tree. Although AdaBoost showed higher sensitivity (0.896), its specificity (0.901) and positive predictive value (0.513) were markedly lower. Taken together, these results indicate an advantage for XGBoost in overall performance.
The learning curve (Fig. 7D) indicates that training and validation scores converged as sample size increased, suggesting effective learning and minimal overfitting. Decision curve analysis (Fig. 7E) showed that XGBoost provided consistently greater net benefit than the “treat all” or “treat none” strategies across a wide range of threshold probabilities (approximately 0.1 to 0.8), highlighting its clinical utility.
Confusion matrices (Fig. 7F–G) revealed strong classification performance. In the validation set, the model achieved high overall accuracy (0.973, SD: 0.005), with a sensitivity of 0.767 (SD: 0.061) and specificity of 0.995 (SD: 0.002). The calibration curve (Fig. 7H) closely aligned with the ideal diagonal, confirming good model calibration. In the test set, XGBoost continued to perform well, with a sensitivity of 0.790, specificity of 0.991, PPV of 0.895, and NPV of 0.981, reflecting excellent predictive reliability for both positive and negative cases.
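The metrics quoted above all derive from the four cells of a confusion matrix. The following sketch shows the definitions; the counts are hypothetical and chosen only to give round numbers, so they are not the study's confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true cases detected
    specificity = tn / (tn + fp)   # share of non-cases correctly ruled out
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for 1,000 patients with 100 events (made up).
metrics = classification_metrics(tp=79, fp=9, tn=891, fn=21)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a low event rate, accuracy and specificity can both be high even when a noticeable share of true cases is missed, which is why sensitivity and PPV are reported alongside them.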
The optimal cutoff value, determined by the Youden index, remained stable across datasets: 0.840 (SD: 0.015) in both training and validation sets, and 0.816 in the test set. This consistency further supports the model’s robustness.
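Cutoff selection by the Youden index amounts to choosing the threshold that maximizes J = sensitivity + specificity - 1. The sketch below scans every observed predicted probability as a candidate threshold; the function name and the toy labels and probabilities are assumptions for illustration.

```python
def youden_cutoff(y_true, probs):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_j, best_t = -1.0, None
    for t in sorted(set(probs)):
        tp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 1)
        fn = sum(1 for y, p in zip(y_true, probs) if p < t and y == 1)
        tn = sum(1 for y, p in zip(y_true, probs) if p < t and y == 0)
        fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
        j = tp / (tp + fn) + tn / (tn + fp) - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


# Toy labels and predicted probabilities (made up).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.90]
cutoff, j = youden_cutoff(y_true, probs)
print(cutoff, round(j, 3))
```

In this toy example the selected cutoff is 0.45, where all three events are detected at the cost of one false positive.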
Superior performance of XGBoost in male subgroup
The XGBoost model demonstrated robust predictive performance in the male subgroup. The ROC curve (Fig. 8A) showed excellent discriminative ability, with an AUC of 0.96. At the optimal cutoff threshold of 0.823, the model achieved a sensitivity of 0.517 and a specificity of 0.995, reflecting high precision in ruling out non-cases but only moderate case detection. The calibration curve (Fig. 8B) showed close alignment between predicted probabilities and observed outcomes (Brier score = 0.030), suggesting reliable risk estimation across the range of predicted probabilities. Minor deviations at higher risk ranges may reflect limited sample size at the extremes of predicted risk. Decision curve analysis (DCA; Fig. 8C) further supported clinical utility, with the model providing net benefit over the “Treat All” and “Treat None” strategies across threshold probabilities of 10–80%. This supports its potential for guiding individualized decision-making in male patients.

Performance evaluation of the XGBoost model in the male subgroup (n = 5,926). (A) ROC curve. (B) Calibration plot. (C) DCA curve.
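The Brier score reported with the calibration curve is the mean squared difference between predicted probability and observed outcome (0 or 1). A useful reference point is the score of a model that always predicts the event prevalence p, which equals p(1 - p). The sketch below uses hypothetical function names and made-up values.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


def brier_reference(y_true):
    """Brier score of a constant prediction equal to the event prevalence."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence * (1 - prevalence)


# Toy outcomes and predicted probabilities (made up).
y_true = [0, 0, 1, 0]
probs = [0.1, 0.2, 0.7, 0.0]
print(round(brier_score(y_true, probs), 4))   # model
print(round(brier_reference(y_true), 4))      # constant-prevalence reference
```

Lower values are better, and a score is most informative when compared against the reference value for the same event rate.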
Interpretability and feature importance of the XGBoost model
To quantify the contribution of individual features to the predictions of the XGBoost model, we performed SHAP analysis. The results are illustrated in Fig. 9.

SHAP analysis of the XGBoost model. (A) SHAP summary plot; (B) Feature importance ranking; (C) SHAP force plot illustrating feature contributions for a representative negative prediction; (D) SHAP force plot illustrating feature contributions for a representative positive prediction.
SHAP analysis of the XGBoost model (Fig. 9) revealed key feature contributions to predictions. The SHAP summary plot (Fig. 9A) showed that hemoglobin (Hb) had the strongest influence on model output (SHAP range: -0.2 to 0.4): lower values (blue) were associated with increased risk and higher values (red) with decreased risk. D-dimer and NT-proBNP showed moderate positive associations with risk (SHAP peaks ≈ 0.2), while other features such as serum albumin and creatinine had minimal effects (|SHAP| < 0.1). Feature importance ranking by mean absolute SHAP value (Fig. 9B) confirmed Hb as the most predictive feature (mean |SHAP| = 0.035), followed by D-dimer (0.025) and NT-proBNP (0.015). Body weight, age, and urea showed negligible contributions (all < 0.01). Individual prediction analysis (Figs. 9C–D) illustrated these mechanisms. A negative case (Fig. 9C) showed protective effects from high Hb (144) and serum albumin (45.6), which counterbalanced mild risk factors and kept the prediction near baseline (-0.0001). Conversely, in a positive case (Fig. 9D), critically low Hb (87), elevated D-dimer (3.71), and high NT-proBNP (231) drove the prediction to 0.6 despite some protective factors.
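Two properties underlie the plots described above: the global ranking averages the absolute SHAP value of each feature over all patients, and each individual prediction equals a base value plus the sum of that patient's SHAP values. The sketch below illustrates both with a made-up SHAP matrix and a hypothetical base value; it does not call the SHAP library and the numbers are not from the study.

```python
def rank_by_mean_abs_shap(feature_names, shap_rows):
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[i]) for row in shap_rows) / n
        for i, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)


def explained_output(base_value, shap_row):
    """Additivity: model output = base value + sum of per-feature contributions."""
    return base_value + sum(shap_row)


# Hypothetical SHAP values for three patients (rows) and three features (made up).
features = ["Hb", "D-dimer", "NT-proBNP"]
shap_rows = [
    [-0.03, 0.01, 0.00],   # high Hb pulls the prediction down
    [0.30, 0.15, 0.05],    # low Hb and high D-dimer push it up
    [-0.02, -0.01, 0.01],
]
base_value = 0.10          # hypothetical average model output

ranking = rank_by_mean_abs_shap(features, shap_rows)
print([name for name, _ in ranking])
print(round(explained_output(base_value, shap_rows[1]), 2))
```

The second patient's output of 0.60 is reached by adding contributions of 0.30, 0.15, and 0.05 to the base value, which is the same bookkeeping a force plot displays graphically.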
