Predicting in-hospital mortality after transcatheter aortic valve replacement using administrative data and machine learning

Machine Learning


Data source

The dataset used was obtained from the NIS/HCUP database5. The unit of analysis was the hospital discharge record. ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification) codes 3505 and 3506 were used to identify all patients aged 18 years or older who underwent TAVR between January 1, 2012 and September 30, 2015. ICD-10-CM codes 02RF3xx and 02RF4xx were used to identify all patients aged 18 years or older who underwent TAVR between October 1, 2015 and December 31, 2019.

A total of 54,739 TAVR records were obtained after filtering out non-adult patients using the aforementioned ICD codes and removing records with missing data for age, race, sex, income, elective status, or in-hospital mortality. The data were divided into two groups: patients who survived the procedure (alive; n = 53,626) and those who died during the same hospitalization (deceased; n = 1113). For each procedure, ICD-9-CM (before October 1, 2015) or ICD-10-CM (on or after October 1, 2015) codes were used to identify comorbidities and the TAVR approach (see Supplementary Table S5 for the codes used).
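As an illustration of this cohort-construction step, the sketch below shows how adult TAVR discharges could be identified from an NIS-style table and split by in-hospital mortality. The column names (AGE, DIED, the procedure-code columns, etc.) and helper functions are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch (not the authors' code): identify adult TAVR discharges
# in an NIS-style DataFrame and split them by in-hospital mortality.
# Column names are assumptions for illustration.
import pandas as pd

ICD9_TAVR = {"3505", "3506"}              # ICD-9-CM codes, through Sep 30, 2015
ICD10_TAVR_PREFIXES = ("02RF3", "02RF4")  # ICD-10 02RF3xx / 02RF4xx, from Oct 1, 2015

def is_tavr(row, proc_cols):
    """Return True if any listed procedure code matches a TAVR code."""
    codes = [str(row[c]) for c in proc_cols if pd.notna(row[c])]
    return any(c in ICD9_TAVR or c.startswith(ICD10_TAVR_PREFIXES) for c in codes)

def extract_tavr(nis: pd.DataFrame, proc_cols: list) -> pd.DataFrame:
    tavr = nis[nis.apply(is_tavr, axis=1, proc_cols=proc_cols)]
    tavr = tavr[tavr["AGE"] >= 18]                                   # adults only
    required = ["AGE", "RACE", "FEMALE", "ZIPINC_QRTL", "ELECTIVE", "DIED"]
    return tavr.dropna(subset=required)                              # drop missing key fields

# Split into alive (DIED == 0) and deceased (DIED == 1) groups:
# alive, deceased = tavr[tavr["DIED"] == 0], tavr[tavr["DIED"] == 1]
```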

Ethical approval

According to the HCUP site5: "The HCUP databases conform to the definition of a limited data set. A limited data set is health care data in which the 16 direct identifiers specified in the Privacy Rule have been removed. Under HIPAA [the Health Insurance Portability and Accountability Act], review by an Institutional Review Board (IRB) is not required for use of a limited data set."

Research design

Figure 3 shows the workflow of this study, from data extraction to the use of machine learning techniques to address the two research questions: (a) the usefulness of NIS preoperative variables alone in predicting TAVR survival, and (b) the performance of such predictive models in a deployment-like setting without frequent retraining (i.e., trained on earlier years and tested on a later year). The workflow consists of five main steps. First, all TAVR procedures occurring between 2012 and 2019 were extracted from the NIS database using SAS software (version 9.4, SAS Institute Inc., USA). Then, Python 3.9 was used to prepare the data as a tabular dataset for machine learning; that is, a set of predictors was generated for predicting the outcome of TAVR. Predictors were categorized into patient demographics (age, sex, race, income quartile by ZIP code), hospital information (region, bed size, urban/rural/teaching status, etc.), and comorbidity indicators. The dataset had 54,739 rows (procedures) and 45 columns (variables). The final three steps of machine learning model training, evaluation, and interpretation were performed separately for each research question.
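A minimal sketch of how the 45 predictors might be organized into the three categories above; the column names are illustrative placeholders, not the actual NIS variable list.

```python
# Illustrative sketch: grouping predictors into the three categories described
# in the text. Column names are placeholders, not the NIS schema.
demographics = ["AGE", "FEMALE", "RACE", "ZIPINC_QRTL"]
hospital_info = ["HOSP_REGION", "HOSP_BEDSIZE", "HOSP_LOCTEACH"]
comorbidities = ["CHF", "DIABETES", "RENLFAIL", "CHRNLUNG"]  # ... comorbidity flags

predictors = demographics + hospital_info + comorbidities
X = tavr[predictors]   # 54,739 rows x 45 columns in the actual study
y = tavr["DIED"]       # binary outcome: in-hospital death
```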

Figure 3. An overview of the modeling workflow for this study.

For the first research question, the TAVR dataset was randomly split into a training dataset containing 80% of the records (n = 43,791; 42,906 survived, 885 died) and a test dataset containing the remaining 20% (n = 10,948; 10,720 survived, 228 died). For the second research question, the training set included 40,757 procedures from 2012 to 2018 (39,820 survivors, 937 deaths), and the test dataset included 13,982 procedures from 2019 (13,806 survivors, 176 deaths). The training-to-test ratio for the second question was 74.5% to 25.5%.
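The two evaluation designs could be reproduced along the following lines (a sketch assuming a DataFrame `tavr` with YEAR and DIED columns; not the authors' code):

```python
# Illustrative sketch of the two data-splitting designs.
from sklearn.model_selection import train_test_split

# Question 1: random 80/20 split of all records.
train_q1, test_q1 = train_test_split(tavr, test_size=0.20, random_state=42)

# Question 2: temporal split -- train on 2012-2018, test on 2019.
train_q2 = tavr[tavr["YEAR"] <= 2018]
test_q2 = tavr[tavr["YEAR"] == 2019]
```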

Considering the imbalance between alive and deceased patients in both training samples, we considered random undersampling, random oversampling, and combined resampling to create balanced training datasets, using the imbalanced-learn Python library (version 0.9.1)29. Based on preliminary analysis, random oversampling yielded the best predictive performance and was therefore used. The resulting training set sizes for questions 1 and 2 were 85,812 and 79,640, respectively, each containing equal numbers of alive and deceased patients.
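A minimal sketch of the oversampling step with imbalanced-learn (the library cited above); the DataFrame and column names are carried over from the earlier illustrative sketches.

```python
# Illustrative sketch: balance the question 1 training data by randomly
# duplicating minority-class (deceased) rows with imbalanced-learn.
from imblearn.over_sampling import RandomOverSampler

X_train = train_q1.drop(columns=["DIED"])
y_train = train_q1["DIED"]

ros = RandomOverSampler(random_state=42)
X_bal, y_bal = ros.fit_resample(X_train, y_train)   # e.g., 85,812 rows for question 1
train_bal = X_bal.assign(DIED=y_bal.values)         # balanced table reused below
```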

Similar to ref.3, we used a feature ranking approach to examine the top 5/10/20/30/40/all features as inputs to the machine learning models. Using external variable/feature selection is not optimal for machine learning models with built-in feature selection30; we used this approach for consistency with ref.3 and because some of the models investigated (e.g., support vector machines) do not incorporate feature selection techniques. The external feature selection used was the "classic" method31 with a threshold of 0.80, from the PyCaret (version 2.3.6) Python library26.
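The feature-selection configuration described above could be specified in PyCaret roughly as follows; the parameter names follow the PyCaret 2.x classification API and should be treated as an assumption rather than the authors' exact call:

```python
# Illustrative sketch: PyCaret 2.x setup with the "classic" feature-selection
# method and a 0.80 threshold, plus stratified 5-fold cross-validation.
from pycaret.classification import setup

clf_session = setup(
    data=train_bal,                      # balanced training data incl. the DIED column
    target="DIED",
    feature_selection=True,
    feature_selection_method="classic",
    feature_selection_threshold=0.80,
    fold=5,                              # stratified 5-fold CV (PyCaret default strategy)
    silent=True,                         # skip the interactive confirmation prompt
)
```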

Using PyCaret, 15 common binary classification models were trained with stratified 5-fold cross-validation on the question 1 and question 2 training datasets, each with the 5/10/20/30/40/all top features mentioned above. The 15 models included: (a) traditional statistical models: logistic regression with an L2 penalty (hereafter LR for brevity), ridge regression, linear discriminant analysis (LDA), quadratic discriminant analysis, and naive Bayes; (b) single machine learning classifiers: a support vector machine with a linear kernel, a k-nearest neighbors classifier, and a decision tree; and (c) ensemble classifiers: gradient boosting classifier (GBC), light gradient boosting machine (LightGBM), CatBoost, AdaBoost classifier, extreme gradient boosting, random forest, and extra trees classifier. Five-fold cross-validation was used to select and calibrate the top five performing models for each question based on mean AUC. The top five models for both research questions were LR, LDA, GBC, LightGBM, and CatBoost. LR32 and LDA32 are traditional statistical methods/single classifiers, whereas GBC32, LightGBM33, and CatBoost34 are tree-based ensemble methods for binary classification, in which the predicted class is computed as the mode of the predictions from all generated trees. The predictive performance of the top five models was benchmarked against a dummy classifier in PyCaret/scikit-learn, which captures the performance of a classifier when no features/predictors are used. We used the dummy classifier's default strategy, i.e., the prior strategy, which predicts the most frequent class in the training set for all test samples regardless of their features. This allowed us to understand the predictive gains from using the selected features and machine learning models compared to a dummy classifier. Note that this baseline is somewhat analogous to the baseline for regression problems, since the R² metric captures the improvement in predictive performance relative to a dummy model that predicts the mean response regardless of the feature values.
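A sketch of the model-comparison and baseline step, assuming the PyCaret 2.x API for compare_models/calibrate_model and using scikit-learn's DummyClassifier with the prior strategy as the feature-free baseline:

```python
# Illustrative sketch: rank candidate classifiers by cross-validated AUC,
# calibrate the best five, and fit a majority-class (prior) dummy baseline.
from pycaret.classification import compare_models, calibrate_model
from sklearn.dummy import DummyClassifier

top5 = compare_models(sort="AUC", n_select=5)          # 5-fold CV ranking by mean AUC
top5_calibrated = [calibrate_model(m) for m in top5]   # probability calibration

# Feature-free baseline: always predicts the most frequent training class.
dummy = DummyClassifier(strategy="prior").fit(X_bal, y_bal)
```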

Five classification models were trained in PyCaret for each set of features and each question. The parameters of the tuned classification models are given in Supplementary Table S6. Additionally, since the dummy model predicts the majority class (i.e., post-TAVR survival for all patients), it was trained only once per question. All models were evaluated on a separate test set (i.e., in step 4; not part of training) for questions 1 and 2 using the following performance measures35,36: accuracy, AUC, balanced accuracy, sensitivity (recall), specificity, precision (i.e., positive predictive value, PPV), negative predictive value (NPV), and F1 score. For brevity, we do not discuss these models further; readers are referred to the scikit-learn documentation36 for a detailed introduction to LR, LDA, and GBC. Similarly, documentation for LightGBM and CatBoost is available from their respective frameworks37,38.
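For concreteness, the reported test-set measures can be computed from a confusion matrix and predicted probabilities as in the following scikit-learn sketch (applicable to any fitted classifier exposing predict and predict_proba):

```python
# Illustrative sketch: compute the evaluation metrics listed above on a test set.
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, roc_auc_score)

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, y_prob),
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),           # precision
        "NPV": tn / (tn + fn),
        "F1": f1_score(y_test, y_pred),
    }
```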

The fifth step in the workflow utilized PyCaret to create diagnostic plots of each model’s performance. Due to space limitations, only feature importance plots are shown in this paper.
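A hedged sketch of this plotting step, assuming PyCaret's plot_model interface; the saved feature importance plots correspond to the figures reported in the paper:

```python
# Illustrative sketch: generate diagnostic plots for each top model in PyCaret.
from pycaret.classification import plot_model

for model in top5_calibrated:
    plot_model(model, plot="feature", save=True)   # feature importance plot
    # other diagnostics could include plot="auc" or plot="confusion_matrix"
```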

Statistical analysis

Following the approach of ref.3, two-tailed t-tests were used to compare differences in continuous variables, and chi-square tests were used for categorical data. These tests were performed using Minitab software (version 19, Minitab Inc., USA). p < 0.05 was considered statistically significant. Classification model performance was evaluated using AUC; however, we also reported other measures such as accuracy, balanced accuracy, sensitivity/recall, specificity, and precision, as is customary in the literature29,34. Model training and evaluation were performed using the PyCaret library26 in Python.
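The univariate tests were run in Minitab; purely for illustration, an equivalent computation in Python with SciPy might look like the following (variable and column names are placeholders):

```python
# Illustrative sketch: two-tailed t-test and chi-square test between the
# alive and deceased groups, mirroring the Minitab analyses described above.
import pandas as pd
from scipy.stats import ttest_ind, chi2_contingency

alive = tavr[tavr["DIED"] == 0]
deceased = tavr[tavr["DIED"] == 1]

# Two-tailed t-test for a continuous variable (e.g., age).
t_stat, p_age = ttest_ind(alive["AGE"], deceased["AGE"])

# Chi-square test for a categorical variable (e.g., sex) vs. in-hospital mortality.
chi2, p_sex, dof, expected = chi2_contingency(pd.crosstab(tavr["DIED"], tavr["FEMALE"]))
```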


