Exploration and analysis of risk factors for coronary artery disease with type 2 diabetes based on SHAP explainable machine learning algorithm


Study population

A retrospective collection of clinical data was conducted on 29,960 cardiovascular disease patients admitted to the First Affiliated Hospital of Xinjiang Medical University between January 1, 2001, and December 31, 2018. The collected data included: basic demographic information (gender, age, education level, occupation); personal lifestyle history (smoking, alcohol consumption); family history (presence of diabetes, hypertension, hyperlipidemia); and laboratory tests such as complete blood count (WBC, NE, LY, MO, EO, BA, NE1, LY1, etc.).

Inclusion criteria were as follows. Coronary heart disease (CHD) patients: diagnosed with CHD by coronary angiography (CAG) or computed tomography angiography (CTA), with clear clinical manifestations such as angina pectoris or other ischemic symptoms; age ≥ 18 years. CHD-DM2 patients: met all inclusion criteria for CHD; diagnosed with type 2 diabetes mellitus (T2DM) based on indicators such as C-peptide level, islet autoantibody testing, or age at diabetes onset; complete glycemic control records available, including glycated hemoglobin (HbA1c).

Exclusion criteria were as follows. Incomplete or erroneous data: patients with missing key clinical information, such as diagnostic records, or with significant data inconsistencies. Severe comorbidities: patients with serious hepatic or renal dysfunction, or other systemic diseases that could interfere with study validity. Active malignancy: individuals receiving chemotherapy or radiotherapy for active malignancies were also excluded. Based on the above inclusion and exclusion criteria, a total of 12,400 eligible patients with CHD or CHD-DM2 were ultimately included in the study.

This study is a retrospective analysis, with data sourced from the medical records of cardiovascular disease patients admitted to the First Affiliated Hospital of Xinjiang Medical University from January 1, 2001, to December 31, 2018. The study protocol was reviewed and approved by the Ethics Committee of Xinjiang Medical University (Approval Number: XJYKDXR20250515001), and exemption from informed consent was granted. The decision to exempt informed consent was based on the following criteria.

I. International ethical guidelines. In accordance with Article 32 of the Declaration of Helsinki (2013 revised edition), informed consent may be waived in the following circumstances:

Extremely low research risk: This study only involves statistical analysis of anonymized medical record data and does not involve any intervention measures. According to the ethics committee’s risk assessment, the risk level of this study is “lowest,” which complies with the core principle of the Declaration that “the research risk is no greater than the minimum risk.”

Reasonableness of secondary use of data: The research data is derived from medical records generated during the diagnostic and treatment process, constituting lawful and compliant secondary use. The data has undergone double anonymization (removal of direct and indirect identifiers such as names, ID numbers, and hospital admission numbers), ensuring it cannot be traced back to individuals, in accordance with the Declaration’s requirement that “exemption from informed consent shall not adversely affect the rights and health of research participants.”

Objective limitations in contacting participants: Because the study spans 18 years (2001–2018), verification by the hospital’s medical records department found that the contact information of over 85% of patients was invalid or had changed, meeting the exception in the Declaration that “if requiring informed consent would prevent the study from being conducted.”

II. Legal basis in China. According to Article 23 of the “Ethical Review Measures for Life Sciences and Medical Research Involving Human Subjects” jointly issued by the National Health Commission and three other ministries in 2023 (National Health Commission Science and Education Development [2023] No. 4), an ethics committee may approve research exempt from informed consent if the following conditions are met:

Risk controllability: The research poses extremely low risk and participants cannot be contacted (e.g., in this study, patients became unreachable due to the time span), meeting the criteria of Item (i) of this provision.

Anonymization standards: Data has been thoroughly anonymized in accordance with the requirements of the Personal Information Protection Law, with all identifiable information removed, meeting the standard of “cannot be traced back to an individual” specified in Item (ii) of this provision.

Protective measures for rights and interests: The study does not involve risks of personal privacy breaches or conflicts of commercial interest, meeting the requirements of subparagraph (3) of this clause.

Therefore, this study strictly adheres to the principle of “minimizing risks and maximizing rights and interests,” and the decision to waive informed consent complies with international ethical guidelines and Chinese laws and regulations.

Data preprocessing

To ensure the integrity and accuracy of the analysis, we used the dplyr package in R (version 3.6.1) to identify variables with more than 30% missing values (e.g., Age, Educational Level), which were excluded from the final dataset. For variables with a missing rate below 30%, the mice package was employed to perform multiple imputation, estimating and replacing missing data to retain dataset completeness and continuity (Fig. 1).

The dummyVars function from the caret package in R was used to generate dummy variables for categorical data. For instance: Male (female = 0, male = 1); Educational Level (below high school = 1, high school or GED = 2, vocational school = 3, university = 4); Professional (mental worker = 0, manual worker = 1); Current Smoker (no = 0, yes = 1); Current Drinker (no = 0, yes = 1); Hypertensive History (no = 0, yes = 1); Diabetes History (no = 0, yes = 1); Pro (negative = 0, positive = 1); and Glu (negative = 0, positive = 1).

In clinical data research, missing values can degrade model accuracy and may even result in misleading conclusions. Moreover, due to objective differences in disease prevalence, imbalanced distributions between positive and negative cases are common in medical datasets11, leading to poor classification performance for minority class samples12. To address the issue of data imbalance, this study applied the SMOTENC algorithm, using the themis package in R, during data preprocessing. SMOTENC generates synthetic samples for the minority class in order to balance the class distribution within the dataset.

Fig. 1
Visualization of missing value patterns.
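The 30% missing-rate screen described above can be illustrated with a short sketch. The study itself used dplyr in R; the plain-Python version below is only an illustration under assumed inputs, and the column names `lab_a` and `lab_b` and their values are hypothetical, not taken from the dataset.

```python
# Illustrative sketch only: not the authors' R pipeline.
# Missing entries are represented as None; column names and values are hypothetical.

def missing_rate(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def split_by_missingness(columns, threshold=0.30):
    """Split column names into (kept, dropped).

    A column is dropped when its missing rate exceeds the threshold;
    kept columns would then go on to imputation.
    """
    kept, dropped = [], []
    for name, values in columns.items():
        if missing_rate(values) > threshold:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped

toy = {
    "lab_a": [6.1, None, 7.4, 5.9, 8.2, 6.6, 7.0, 5.5, 6.3, 7.7],       # 10% missing
    "lab_b": [None, None, None, None, 6.9, 7.2, None, 8.1, 6.5, 7.0],  # 50% missing
}
kept, dropped = split_by_missingness(toy)
```

In this toy input, `lab_a` would be retained for imputation and `lab_b` would be excluded.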

Feature selection

Feature selection is an important and commonly used dimensionality reduction technique that aims to identify an optimal subset of features by removing irrelevant and redundant information from the dataset13,14. By interpreting the most relevant features, deeper insights into the problem can be obtained15. The LASSO regression algorithm enables dimensionality reduction and variable selection for high-dimensional data16. In this study, a combined approach of univariate analysis and LASSO regression was employed. Potential candidate features were initially identified using univariate analysis with the nortest package in R, selecting variables with statistical significance (P < 0.05). These candidates were further refined using LASSO regression via the glmnet package in R, which introduces a penalty term to reduce model complexity and ultimately selects the most predictive feature set.
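How the LASSO penalty removes variables can be seen from the soft-thresholding operator, which is the closed-form LASSO solution for a single standardized predictor and the update applied repeatedly by coordinate-descent solvers. The sketch below is plain Python for illustration, not the glmnet call used in the study; the feature names, unpenalized coefficients, and the penalty value `lam` are all hypothetical.

```python
# Illustrative sketch only: soft-thresholding, the building block of LASSO.
# Coefficients and the penalty value are hypothetical.

def soft_threshold(b, lam):
    """Shrink b toward zero by lam; values within [-lam, lam] become exactly 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

# Hypothetical unpenalized coefficients for four standardized features
unpenalized = {"feat_a": 0.80, "feat_b": -0.05, "feat_c": 0.12, "feat_d": -0.60}
lam = 0.10

penalized = {name: soft_threshold(b, lam) for name, b in unpenalized.items()}
selected = [name for name, b in penalized.items() if b != 0.0]
```

Here `feat_b` is shrunk to exactly zero and drops out, while the remaining coefficients are pulled toward zero by `lam`; larger penalties would remove more features.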

Model construction

In recent years, Machine Learning (ML) techniques have been widely applied in medical research, leveraging large datasets to uncover complex patterns that may not be readily identifiable by human observers, thereby offering a promising alternative approach17. In this study, seven machine learning models—XGBoost, Random Forest (RF), LightGBM, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Logistic Regression, and Logistic_Lasso—were used to construct predictive models. These algorithms are commonly applied to binary classification tasks in coronary heart disease and diabetes-related research18.

XGBoost algorithm

XGBoost is an extended implementation of the Boosting ensemble algorithm. By integrating multiple weak classifiers, it constructs an efficient decision tree–based ensemble learning framework, significantly reducing computational complexity and runtime while improving algorithmic efficiency. The objective function consists of two key components: a loss term that measures the difference between predicted and actual values, and a regularization term that controls model complexity and prevents overfitting. The objective function is expressed as:

$${\text{Obj}}\left( {{\uptheta }} \right) = \sum\limits_{{{\text{i}} = 1}}^{{\text{n}}} {\text{l}} \left( {{\text{y}}_{{\text{i}}} ,{{\hat{\text{y}}}}_{{\text{i}}} } \right) + \sum\limits_{{{\text{k}} = 1}}^{{\text{K}}} \Omega \left( {{\text{f}}_{{\text{k}}} } \right)$$

(1)

Here, \({\text{y}}_{\text{i}}\) denotes the true value, and \({\hat{\text{y}}}_{\text{i}}\) represents the predicted value. The function l refers to the loss function, commonly mean squared error (MSE) or log loss. \({\text{f}}_{\text{k}}\) denotes the k-th decision tree, and \(\Omega\) is the regularization term used to control model complexity.
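A small numerical sketch may make the two parts of Eq. (1) concrete. The text does not spell out the form of \(\Omega\); the code assumes the form given in the original XGBoost paper, \(\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_j w_j^2\), where T is the number of leaves and \(w_j\) the leaf weights. The labels, predictions, leaf weights, and the values of `gamma` and `lam` are hypothetical.

```python
import math

# Illustrative sketch only: evaluates Obj = sum of losses + sum of tree penalties.
# Assumes Omega(f) = gamma * T + 0.5 * lam * sum(w_j^2); all inputs are hypothetical.

def log_loss_sum(y_true, p_pred):
    """Summed binary log loss over all samples (the first term of Eq. 1)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))

def tree_penalty(leaf_weights, gamma, lam):
    """Complexity penalty for one tree described by its leaf weights."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, p_pred, trees, gamma=1.0, lam=1.0):
    """Loss term plus the penalty summed over all K trees."""
    return log_loss_sum(y_true, p_pred) + sum(
        tree_penalty(t, gamma, lam) for t in trees)

y = [1, 0]
p = [0.5, 0.5]
trees = [[1.0, -1.0]]          # one tree with two leaves
obj = objective(y, p, trees)   # 2*ln(2) from the loss + 3.0 from the penalty
```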

RF algorithm

RF is an ensemble learning algorithm composed of multiple decision trees. It enhances model stability and predictive performance by constructing a multitude of trees. The core idea is to aggregate several weak classifiers (decision trees) into a strong classifier, improving prediction accuracy through majority voting or averaging. It can handle various types of features and reduces overfitting by randomly selecting features and training data subsets for each tree19. The RF model can be expressed as:

$$\text{f}\left(\text{x}\right)=\frac{1}{\text{B}}{\sum \limits_{\text{i}=1}^{\text{B}}}{\text{h}}_{\text{i}}\left(\text{x}\right)$$

(2)

Here, B is the number of decision trees in the forest, and \({h}_{i}\)(x) indicates the output of the i-th tree for a given input x.
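Equation (2) amounts to averaging the outputs of the individual trees. The following plain-Python sketch illustrates this with three hand-written decision stumps standing in for trained trees; the feature names and cut-offs are hypothetical and carry no clinical meaning.

```python
# Illustrative sketch only: averaging B tree outputs as in Eq. (2).
# The stumps, feature names, and thresholds are hypothetical.

def forest_predict(trees, x):
    """Average the outputs h_i(x) of all trees in the forest."""
    return sum(h(x) for h in trees) / len(trees)

def stump_age(x):
    return 1 if x["age"] > 60 else 0

def stump_glucose(x):
    return 1 if x["glucose"] > 7.0 else 0

def stump_sbp(x):
    return 1 if x["sbp"] > 140 else 0

trees = [stump_age, stump_glucose, stump_sbp]
patient = {"age": 65, "glucose": 6.2, "sbp": 150}

score = forest_predict(trees, patient)   # two of three stumps vote 1
label = 1 if score >= 0.5 else 0         # majority vote for classification
```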

LightGBM algorithm

LightGBM is based on the Gradient Boosting Decision Tree (GBDT) framework, an ensemble learning method that iteratively adds new trees to correct errors made by the previous ones, thereby enhancing predictive performance. By leveraging histogram-based algorithms, leaf-wise growth strategies with depth constraints, and parallel optimization, LightGBM achieves significant advantages in training speed, memory efficiency, and scalability for large-scale distributed data processing20. The objective function is defined as:

$${\text{L}}\left( {\uptheta } \right) = \mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{N}}} {\text{l}}\left( {{\text{y}}_{{\text{i}}} ,{\hat{\text{y}}}_{{\text{i}}} } \right) + \Omega \;({\text{T}})$$

(3)

In this expression, \({\text{l}}\left( {{\text{y}}_{{\text{i}}} ,{{\hat{\text{y}}}}_{{\text{i}}} } \right)\) refers to the individual sample loss, while \(\Omega \left(\text{T}\right)\) serves as a regularization component to control model complexity and reduce the risk of overfitting.

SVM algorithm

SVM is a classification method based on the principle of structural risk minimization. Its core idea is to find the optimal hyperplane that maximizes the margin between different classes, effectively separating them while maximizing the distance from the hyperplane to the nearest data points—known as support vectors. For datasets that are not linearly separable in the original feature space, SVM employs kernel functions to map the data into a higher-dimensional space where linear separation becomes feasible21. The SVM model can be expressed as:

$$\text{y}\left(\text{x}\right)={\text{W}}^{\text{T}}\phi \left(\text{x}\right)+\text{b}$$

(4)

Here, W is the weight vector, \(\phi \left(\text{x}\right)\) denotes the mapping function that transforms the input sample into a high-dimensional feature space, b is the bias term, and \(\text{y}\left(\text{x}\right)\) represents the predicted value for sample x.
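The role of the mapping function in Eq. (4) can be shown with a one-dimensional toy problem that is not linearly separable until a squared term is added. In the sketch below the feature map, the weight vector `w`, and the bias `b` are hand-picked for illustration; they are not learned by margin maximization and are not taken from the study.

```python
# Illustrative sketch only: a decision function of the form w^T phi(x) + b.
# The feature map, weights, and bias are hand-picked, not fitted.

def phi(x):
    """Map a scalar x to the two-dimensional feature space (x, x^2)."""
    return (x, x * x)

def decision(x, w, b):
    """Evaluate w^T phi(x) + b for a scalar input x."""
    return sum(wi * fi for wi, fi in zip(w, phi(x))) + b

# Outer points are class +1, inner points class -1:
# no single threshold on x separates them, but x^2 does.
data = [(-2.0, 1), (-1.0, -1), (1.0, -1), (2.0, 1)]
w, b = (0.0, 1.0), -2.5

preds = [1 if decision(x, w, b) > 0 else -1 for x, _ in data]
```

A kernel function lets an SVM work in such a mapped space without computing the mapping explicitly.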

KNN algorithm

The KNN algorithm is a non-parametric and intuitive supervised learning method. Its core idea is that if the majority of a sample’s k nearest neighbors in the feature space belong to a particular class, the sample is also assigned to that class and is expected to share its characteristics. It determines the closest instances by calculating the distance between the query sample and all samples in the training dataset22. The distance between test and training samples is computed using the following formula:

$$\text{d}\left(\text{X},{\text{X}}_{\text{i}}\right)=\sqrt{{\sum }_{\text{j}=1}^{\text{m}}{\left({\text{x}}_{\text{j}}-{\text{x}}_{\text{ij}}\right)}^{2}}$$

(5)

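Here, m denotes the number of features, \({\text{x}}_{\text{j}}\) is the j-th feature of the test sample X, and \({\text{x}}_{\text{ij}}\) is the j-th feature of the i-th training sample \({\text{X}}_{\text{i}}\). After computing the distances, the k samples with the smallest distances are selected, and their labels or values are used to predict the outcome of the test sample.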
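The distance in Eq. (5) and the subsequent vote can be sketched in a few lines of plain Python. The training points, labels, and the choice k = 3 below are hypothetical and serve only to illustrate the procedure.

```python
import math
from collections import Counter

# Illustrative sketch only: Euclidean distance (Eq. 5) followed by a majority vote.
# Training points, labels, and k are hypothetical.

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))

def knn_predict(train, query, k=3):
    """Return the most common label among the k training samples nearest to query."""
    neighbors = sorted(train, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
    ((3.0, 3.2), 1), ((3.1, 2.9), 1),
]
pred = knn_predict(train, (2.9, 3.0), k=3)   # two of the three nearest are class 1
```

Because the distance sums raw feature differences, features on larger scales dominate unless the inputs are standardized first.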

Logistic regression algorithm

Logistic regression is a widely used statistical model for binary classification tasks. Its core concept involves applying the sigmoid function to a linear combination of input features, thereby transforming regression outputs into probability values between 0 and 1 and effectively converting a regression problem into a classification task23.

$${h}_{\theta }\left(x\right)=\frac{1}{1+{e}^{-{\theta }^{T}x}}$$

(6)

In this expression, \({h}_{\theta }\left(x\right)\) represents the predicted probability for the input sample, x is the feature vector, and \(\theta\) is the parameter vector of the model, where each parameter corresponds to the weight of an input feature.
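Equation (6) can be evaluated directly, as in the minimal sketch below. The parameter vector and the input are hypothetical; the leading 1.0 in the feature vector is an assumed convention for carrying the intercept.

```python
import math

# Illustrative sketch only: the logistic model of Eq. (6).
# The parameters and the input vector are hypothetical.

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Apply the sigmoid to the linear combination theta^T x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

theta = (-1.0, 0.5, 2.0)   # intercept followed by two feature weights
x = (1.0, 2.0, 0.0)        # leading 1.0 multiplies the intercept

p = predict_proba(theta, x)      # theta^T x = -1 + 1 + 0 = 0, so p = 0.5
label = 1 if p >= 0.5 else 0     # a 0.5 cut-off turns the probability into a class
```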

Model performance assessment

To ensure optimal performance and robustness, tenfold cross-validation was employed on the training dataset. This method averages performance metrics across the ten folds to provide a more reliable assessment of model performance. The model’s predictive ability was evaluated using the confusion matrix, the receiver operating characteristic (ROC) curve and the area under it (AUC), sensitivity, specificity, and precision. Clinical utility was further assessed using decision curve analysis (DCA), which quantifies net benefit. Feature importance analysis was conducted for the selected models to determine the contribution of each variable to the prediction24, thereby evaluating model reliability and clinical applicability.
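The evaluation metrics named above can be computed from labels and predicted probabilities as in the following plain-Python sketch. It is an illustration rather than the study's code: the labels and scores are hypothetical, AUC is computed by its pairwise-ranking definition, and net benefit uses the standard decision-curve formula TP/n − (FP/n) × pt/(1 − pt) at an assumed threshold probability pt.

```python
# Illustrative sketch only: confusion-matrix metrics, AUC, and net benefit.
# Labels, scores, and the threshold are hypothetical.

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def net_benefit(y_true, scores, pt):
    """Decision-curve net benefit when treating everyone with score >= pt."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= pt)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= pt)
    return tp / n - (fp / n) * (pt / (1 - pt))

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
area = auc(y_true, scores)
nb = net_benefit(y_true, scores, pt=0.5)
```

Under tenfold cross-validation, each of these quantities would be computed on every held-out fold and then averaged.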

Model interpretability

SHAP (SHapley Additive exPlanations) was used to analyze and interpret the results of the machine learning models. As an advanced interpretable machine learning framework, SHAP provides detailed explanations for individual model predictions, enhancing the transparency of ML models and facilitating the adoption of AI technologies in clinical practice25. Its capabilities include quantifying the overall contribution of each feature, illustrating their specific influence on individual predictions, examining feature interactions, and analyzing the combined effects of feature dependencies26. It enables the visualization of feature importance relationships and supports comprehensive interpretation of model behavior. In R, the average absolute SHAP values were visualized to rank the relative importance of each variable in the model, providing a comprehensive understanding of their individual contributions to the predictions27. The SHAP beeswarm plot offers an intuitive visualization of how all variables influence model predictions. The SHAP waterfall plot illustrates the direction and magnitude of each feature’s contribution to the final prediction for an individual case. The SHAP dependence plot allows exploration of the relationship between a given variable and its SHAP value, as well as interactions between variables. These visualizations enable unbiased evaluation of each variable’s contribution within the system, allowing the impact of individual variable values on model output to be considered independently28.
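The quantity behind these plots is the Shapley value, which for a handful of features can be computed exactly by enumerating feature coalitions. The sketch below is an illustration under simplifying assumptions, not the SHAP implementation used in the study: absent features are replaced by a single baseline value (SHAP tools typically average over a background dataset), and the toy model, input, and baseline are hypothetical.

```python
from itertools import combinations
from math import factorial

# Illustrative sketch only: exact Shapley values by enumerating coalitions.
# Assumes absent features are set to a fixed baseline; model and inputs are hypothetical.

def shapley_values(model, x, baseline):
    """Return one Shapley value per feature for the prediction model(x)."""
    n = len(x)

    def value(subset):
        # Features in the coalition take the sample's values; the rest stay at baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                coalition = set(s)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical model with two linear terms and one interaction."""
    return 0.5 * z[0] + 2.0 * z[1] + z[0] * z[2]

x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The values sum to the difference between the prediction for `x` and the prediction at the baseline, which is the additivity property that waterfall plots display; the interaction term is split equally between the two features involved.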


