Explainable artificial intelligence (XAI) for predicting the need for intubation in methanol-poisoned patients: a study comparing deep and machine learning models


Study design and setting

This retrospective observational study was conducted in 2024 in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guideline (Appendix A). It used a dataset of 897 methanol-poisoned patients, both those who needed intubation and those who did not, from Loghman Hakim Hospital in Tehran, Iran. The primary objective was to predict the need for intubation in methanol-poisoned patients. To this end, eight established ML and DL models were trained on clinical and demographic features from the dataset. To mitigate the risk of overfitting, all models were trained with tenfold cross-validation, which also gives a more reliable estimate of how well they generalize.

Data set description and participants

The dataset covers admissions of methanol-poisoned patients from March 17, 2020, to March 20, 2024. It comprises 897 records from Loghman Hakim Hospital, which serves as the primary referral center for individuals affected by poisoning. Of these patients, 202 required intubation and 695 did not. Methanol poisoning was confirmed by reviewing medical records for evidence of methanol exposure, serum methanol levels surpassing 6.25 mmol/L (20 mg/dL), or the manifestation of clinical symptoms such as visual disturbances, abdominal pain, breathing difficulties, and neurological symptoms, alongside a pH level below 7.3 and serum bicarbonate levels below 20 mmol/L upon admission.

The patient selection process is depicted in Fig. 1. The study included individuals aged 12 and above who were hospitalized within 24 h of confirmed methanol poisoning. The criteria for exclusion included the simultaneous ingestion of substances besides ethanol, the administration of any pre-admission therapies that might influence the analysis, severe chronic conditions (such as cardiovascular disease, chronic kidney disease, chronic liver disease, diabetes, chronic obstructive pulmonary disease, blood disorders, malignancy, etc.), mortality before assessment, and incomplete medical documentation. Patients were categorized into two groups: those necessitating intubation and those not requiring it.

Figure 1 Patients’ selection flowchart.

It should be noted that there is no special classification for intubating patients poisoned with methanol. Typically, these patients are intubated based on clinical symptoms, laboratory findings (such as blood methanol levels and metabolic acidosis), and overall health status. Consequently, in this study, only the patients who met the established clinical criteria for intubation were included.

Data collection

Six individual researchers conducted a thorough review of the patients’ medical records. The original questionnaire, obtained from the electronic databases of Loghman Hakim Hospital (Sabara and Shafa databases), was utilized to gather clinical data.

This questionnaire encompassed details regarding age, gender, vital signs (including respiratory rate, blood pressure, body temperature, and pulse rate), medical history (including underlying conditions), mental status (including agitation, confusion, seizures, and GCS score), visual symptoms upon admission, ingested dose, antidote therapy, and laboratory test results (including hemoglobin, platelet count, white cell count, serum creatinine (sCr), blood glucose, alanine transaminase (ALT), creatine phosphokinase (CPK), aspartate transaminase (AST), sodium, potassium, alkaline phosphatase (ALP), venous blood gas analysis (pH, PCO2, and HCO3), and blood urea nitrogen (BUN)). Furthermore, hospital-related factors such as the requirement for intubation and duration of hospitalization were recorded.

Pre-processing of the data

To prepare the data for analysis, we began by importing the dataset into Jupyter Notebook within the Anaconda environment, utilizing Python version 13.1. For effective data preprocessing, we relied on the NumPy library, essential for handling array operations and mathematical functions. NumPy facilitates various preprocessing tasks, such as managing missing values and conducting statistical calculations, which streamline data manipulation and pave the way for deeper analysis. Initially, missing values of certain variables were replaced with mean and mode statistical measures. Subsequently, nominal values of variables in the columns were converted to numerical values for improved results, employing variable encoding to enhance algorithm learning. Finally, incomplete rows of data (with missing values exceeding 70%) were removed. This systematic approach ensured that our dataset was clean and ready for analysis, minimizing the potential for biased results and enhancing the reliability of our findings.
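The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the toy records and column names are hypothetical, and Pandas is used for convenience. The sketch drops sparse rows before imputing, because after imputation no values would remain missing; that ordering is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; column names are illustrative, not the study's schema.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 27.0, np.nan],
    "ph": [7.10, 7.25, np.nan, 6.95, np.nan],
    "gender": ["male", "female", None, "male", None],
    "visual_symptoms": ["yes", "no", "yes", None, None],
})

# Drop rows in which more than 70% of the values are missing.
row_missing_fraction = df.isna().mean(axis=1)
df = df.loc[row_missing_fraction <= 0.70].copy()

# Impute numeric columns with the mean and nominal columns with the mode.
numeric_cols = df.select_dtypes(include="number").columns
nominal_cols = df.columns.difference(numeric_cols)
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].mean())
for col in nominal_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Encode nominal values as integer codes (categories are sorted alphabetically).
for col in nominal_cols:
    df[col] = df[col].astype("category").cat.codes

print(df)
```

Integer codes are one simple encoding choice; the source does not state which encoding scheme was used.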

Afterwards, outliers were identified and removed from the dataset to enhance its quality, and Min–max normalization was applied. Min–max normalization rescales each numerical feature to a predetermined range, usually 0 to 1, while preserving the relative relationships between values. Placing all features on a common scale prevents features with large numeric ranges from dominating others during analysis and allows values to be compared across features.
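For a feature x, Min–max normalization computes x' = (x - min) / (max - min). The NumPy sketch below applies this column by column; the function name and the example values are hypothetical, and the handling of constant columns is an assumption added to avoid division by zero.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant columns have zero range; map them to 0 instead of dividing by zero.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Hypothetical values: respiratory rate (breaths/min) and venous pH.
X = np.array([[12.0, 7.30],
              [20.0, 7.10],
              [28.0, 6.90]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

In a cross-validated workflow, the minimum and maximum would normally be computed on the training folds only and then applied to the validation fold, so that no information leaks from the held-out data.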

Feature selection

The main purpose of feature selection in machine learning is to identify the features that most improve model performance. Among 187 evaluated features, 110 were excluded due to incomplete medical records and missing data. The remaining 77 features underwent Pearson’s correlation coefficient analysis, which identified 43 significant features for predicting methanol poisoning prognosis; features with near-zero correlation and linear data representation were removed. These 43 features were then integrated into machine learning models. Afterwards, the FeatureWiz library was used for another round of feature selection, leading to 23 selected features. Feature selection with FeatureWiz involves two main stages. First, the Searching for the Uncorrelated List of Variables (SULOV) method identifies variable pairs outside the correlation threshold. The Mutual Information Score (MIS) of these pairs is then calculated, and the pair with the lowest correlation and highest MIS is chosen for further analysis. Second, the variables selected through SULOV are iteratively processed through XGBoost to identify optimal features based on the target variable, thus reducing the dataset size. This method helps select the most impactful predictive features from the dataset.
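The first filtering stage, discarding features whose Pearson correlation with the outcome is near zero, can be illustrated with NumPy. The data, the feature names, and the 0.1 threshold are all hypothetical (the source does not report a threshold), and the SULOV and XGBoost stages of FeatureWiz are not reproduced here.

```python
import numpy as np

# Hypothetical outcome (1 = intubated) and two toy features.
target = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
features = {
    "ph": np.array([7.3, 7.2, 7.3, 7.2, 7.0, 6.9, 7.0, 6.9]),
    "uninformative": np.array([1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0]),
}

threshold = 0.1  # assumed cut-off for "near-zero" correlation

# Pearson correlation of each feature with the outcome.
correlations = {
    name: float(np.corrcoef(values, target)[0, 1])
    for name, values in features.items()
}

# Keep features whose absolute correlation reaches the threshold.
selected = [name for name, r in correlations.items() if abs(r) >= threshold]
print(correlations)
print(selected)
```

Here the toy pH feature is strongly negatively correlated with the outcome and is kept, while the alternating feature is uncorrelated with it and is dropped.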

Data analysis software

In this study, we used the Python programming language (version 13.1) with several associated libraries, working in Jupyter Notebook within the Anaconda environment. Matplotlib, NumPy, Seaborn, and Pandas were used for data analysis and visualization, while the scikit-learn library facilitated the development and evaluation of machine learning models. Deep learning architectures were constructed and trained using TensorFlow. Furthermore, model interpretability and feature importance analyses were conducted using SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations).

Machine learning and deep learning models development

In total, we used eight well-known models from the deep learning (DL) and machine learning (ML) domains to predict the need for intubation in methanol-poisoned patients. The DL models were the Deep Neural Network (DNN), Feedforward Neural Network (FNN), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). The ML models were Extreme Gradient Boosting (XGB), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF).

These selected models offer a diverse array of methodologies suitable for disease prediction, encompassing both deep learning and machine learning approaches. Deep learning models like DNN, FNN, LSTM, and CNN excel in capturing intricate patterns within the data, whereas ML models such as XGB, SVM, DT, and RF provide robust and easily interpretable predictions. By using this range of models, our objective was to enhance prediction accuracy and gain insights into the complex factors influencing the need for intubation in methanol-poisoned patients.

It should be noted that, while it is true that CNNs are predominantly used for image data due to their ability to capture spatial hierarchies, recent studies9,10 have demonstrated their potential in handling tabular data as well. CNNs can effectively learn local dependencies and patterns within tabular data, similar to how they detect features in images. The findings of Buturović et al.’s study9 showed that CNNs can perform accurately in predicting diseases using tabular data.

Cross-validation and hyperparameter tuning

To mitigate the risk of overfitting, we incorporated tenfold cross-validation during the training of all proposed models. This approach entails partitioning the dataset into 10 equally sized folds, with the model trained on 9 folds and validated on the remaining fold in each iteration. This iterative process is repeated 10 times to ensure comprehensive validation. The ultimate performance metric is computed by averaging the outcomes from these iterations, offering a dependable evaluation of the model’s effectiveness11.
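A minimal scikit-learn sketch of this procedure is shown below. The data are synthetic stand-ins generated with `make_classification`, and the decision tree, its depth, the shuffling, and the random seeds are illustrative assumptions rather than the study's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data; not the study's records.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.77, 0.23], random_state=0
)

# Ten folds: each iteration trains on nine folds and validates on the tenth.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# One F1-score per fold; the reported figure is their average.
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
mean_f1 = fold_scores.mean()
print(f"F1 per fold: {fold_scores.round(2)}")
print(f"Mean F1: {mean_f1:.3f}")
```

With an imbalanced outcome such as 202 intubated versus 695 non-intubated patients, `StratifiedKFold` is a common alternative that keeps the class ratio similar in every fold; the source does not say whether stratification was used.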

The process of optimizing models for a specific dataset involves the careful selection and adjustment of hyperparameters to create the most effective model. The selection of hyperparameters plays a crucial role in determining the overall performance of a specific machine learning algorithm. After completing the preprocessing phase, a sequence of machine learning (ML) and deep learning (DL) modeling tasks were initiated to fine-tune and optimize these hyperparameters. This iterative approach was geared towards pinpointing the ideal hyperparameter configurations necessary for developing models with the highest F-score. In this investigation, we employed the GridSearchCV technique to pinpoint the most precise and resilient models. The hyperparameters of the optimal model, the Gradient Boosting Classifier, were adjusted as follows: (learning_rate = 0.2, max_depth = 5, n_estimators = 10, min_samples_leaf = 30, subsample = 0.8, min_samples_split = 400, random_state = 10, max_features = 9)12.
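A grid search of this kind might look as follows with scikit-learn's `GridSearchCV`. The candidate grid and the synthetic data are hypothetical; the source reports only the final hyperparameter values, not the grid that was searched. The `subsample` and `random_state` settings mirror the reported values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the study's records.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.77, 0.23], random_state=0
)

# Hypothetical candidate values around the reported settings.
param_grid = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [3, 5],
    "n_estimators": [10, 30],
}

# Every combination is scored by tenfold cross-validated F1-score.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(subsample=0.8, random_state=10),
    param_grid=param_grid,
    scoring="f1",
    cv=10,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean F1: {search.best_score_:.3f}")
```

Because the best score is chosen on the same folds used for tuning, it tends to be optimistic; a held-out test set or nested cross-validation gives a less biased estimate.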

Explanation and justification of the output of ML and DL models

ML and DL methods are often regarded as “black box” models due to their intricate inner workings, posing challenges for interpretation13,14. This lack of interpretability can be particularly problematic in critical fields like healthcare, where understanding prediction rationales is vital. To tackle this issue, researchers have focused on enhancing model interpretability. Two notable techniques are Shapley Additive Explanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), which offer insights into ML model predictions15,16. In our study, both SHAP and LIME were used as interpretability methods. While both methods serve the purpose of explaining model predictions, they have distinct characteristics and can provide complementary insights.

SHAP, drawing from Shapley values in cooperative game theory, has garnered attention across various fields, including clinical studies17,18. It assigns contribution values to dataset features, showing their impact on predicted outcomes. These values are derived by comparing predictions with and without specific features. Through examining all feature combinations, SHAP provides a holistic understanding of each feature’s contribution, aiding researchers in identifying their impact on outcomes17. Moreover, SHAP offers a theoretical framework rooted in cooperative game theory, providing globally consistent explanations by assigning each feature an importance value based on its contribution to the model’s output. This method offers a comprehensive understanding of feature importance across the entire dataset19.
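The idea of averaging a feature's marginal contribution over all feature combinations can be made concrete with a brute-force computation of exact Shapley values. This is an illustrative sketch only: the toy model, the instance, and the all-zero baseline for "absent" features are assumptions, and the SHAP library relies on far more efficient algorithms than this enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function over n_features features."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical model with one interaction term, and one instance to explain.
def model(x):
    return 2 * x[0] + x[1] + x[0] * x[2]

instance = (1.0, 3.0, 2.0)
baseline = (0.0, 0.0, 0.0)  # assumed reference values for absent features

def value_fn(present):
    """Model output when only the 'present' features take the instance's values."""
    x = [instance[j] if j in present else baseline[j] for j in range(3)]
    return model(x)

phi = shapley_values(value_fn, 3)
print(phi)
```

The contributions sum to the difference between the prediction for the instance (7) and the prediction for the baseline (0), and the interaction term is split evenly between the two features involved in it.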

LIME is an algorithm aimed at clarifying predictions made by any classifier or regressor by creating a local interpretable model. It prioritizes interpretability and local fidelity, facilitating a qualitative understanding of the input–output relationship and ensuring the model’s reliability near the predicted instance. As a model-agnostic tool, LIME can elucidate any model’s predictions, treating it as a black box. It demonstrates versatility by interpreting image classifications, providing insights into text-based models, and explaining tabular datasets in various formats: textual, numeric, or visual16.
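The local-surrogate idea behind LIME can be sketched in NumPy: perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear model to the black box's outputs. Everything below is a simplified assumption (Gaussian perturbations, an exponential kernel, ordinary weighted least squares, a hand-written stand-in classifier) and does not reproduce the LIME library's actual sampling or feature selection.

```python
import numpy as np

def black_box(X):
    """Stand-in classifier: probability from a fixed logistic rule (ignores feature 2)."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))

def local_surrogate(predict_fn, x, n_samples=1000, scale=0.3,
                    kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model to predict_fn around x."""
    rng = np.random.default_rng(seed)
    # Sample perturbed points around the instance being explained.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    y = predict_fn(Z)
    # Closer samples receive larger weights (exponential kernel).
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted least squares: scale rows by sqrt(w), then solve.
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]

x = np.array([0.5, 0.2, 0.1])
intercept, weights = local_surrogate(black_box, x)
print("Local feature weights:", weights.round(3))
```

The signs of the fitted weights indicate the local direction of each feature's effect: positive for the first feature, negative for the second, and close to zero for the third, which the stand-in classifier ignores.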

In essence, SHAP and LIME are invaluable for interpreting ML and DL model predictions, boosting transparency and trust in decision-making processes. Their use in healthcare settings aids clinicians in understanding and validating AI predictions, facilitating informed decisions16,20. In this study, SHAP and LIME shed light on feature influences on predicted outcomes for both ML and DL models. Consequently, SHAP and LIME diagrams were created for the top-performing model across sensitivity, specificity, accuracy, ROC, and F1-score indices.

Therefore, by employing both SHAP and LIME, we aimed to leverage the strengths of each method to obtain a more holistic understanding of our model’s behavior. While one method may suffice in certain scenarios, the combined use of SHAP and LIME allowed us to validate and cross-reference the interpretability of our model across different scales and contexts.

Performance evaluation of models

The ML and DL models’ performance underwent a thorough evaluation utilizing performance metrics obtained from the confusion matrix, as detailed in Table 1. The assessment of predictive models encompassed a range of essential metrics including accuracy, specificity, sensitivity, F1-score, and the receiver operating characteristic (ROC) curve, all presented in Table 2.

Table 1 Confusion matrix.
Table 2 The performance evaluation measures.
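The threshold-based measures follow directly from the four cells of the confusion matrix. The sketch below uses the standard definitions, which are assumed to match those in Table 2; the function name and the counts are hypothetical and are not results from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, where "positive" means the patient needed intubation.
metrics = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The ROC curve is not derived from a single confusion matrix; it is obtained by varying the decision threshold and plotting sensitivity against 1 - specificity at each threshold.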

Ethical considerations

The study received approval from the ethics committee of Shahid Beheshti University of Medical Sciences, identified by reference number IR.SBMU.RETECH.REC.1402.826. All methods were performed in accordance with the relevant guidelines and regulations of the ethics committee of Shahid Beheshti University of Medical Sciences. Informed consent was obtained from participants or, where participants were unable to provide consent themselves, from their families. The informed consent obtained at our institutions also included authorization for potential future retrospective analyses.


