Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms



Data collection and preparation

The present study was a retrospective cohort study of 126 trauma patients admitted to the intensive care unit (ICU) of Besat Hospital in Hamadan province, western Iran, from March 2020 to March 2021. The data were extracted from the patients’ medical records. Our focus was on trauma patients’ status (alive/dead) as the response and on trauma-related risk factors as predictors. Patients were followed from ICU admission until death or discharge, and the mean follow-up time from the date of trauma to the date of outcome was 3.98 days. We chose six risk factors associated with trauma outcome, including age, sex (male, female), type of trauma (blunt, penetrating), location of injuries (head and neck, thorax, abdomen and pelvis, spinal, extremities, multiple injuries), Glasgow Coma Scale (severe, moderate, minor), and white blood cell count (k/mm³), to evaluate the performance of the ML methods.

Decision tree

The decision tree (DT) is one of the simplest and most popular algorithms for classification and regression problems. The main goal of a DT is to construct a model that predicts the value of a target variable by learning simple decision rules deduced from the data features. Nodes and branches are the two main components of a DT model, and the three essential steps in building one are splitting, stopping, and pruning. Tree construction starts with all training data in the first node; the first partition then splits the data into two or more daughter nodes based on a predictor variable [30].

A DT contains three types of nodes. (a) The root node, or decision node, represents a decision that results in the subdivision of all features into two or more mutually exclusive subsets; it has no input branch, and the number of its output branches can be zero or more. (b) Internal nodes represent one of the possible choices available at that point in the tree structure; the input branch of such a node is linked to its parent node, and its output branches are linked to its child or leaf nodes. (c) Leaf nodes, or terminal nodes, represent the final outcome of a combination of decisions or events; they have one input branch and no output branch [31].

The benefits of DT include simplicity of interpretation, the ability to handle both categorical and quantitative values, the ability to fill missing feature values with the most probable value, and robustness to outliers. The main drawback of the decision tree is that it is exposed to overfitting and underfitting, especially when using a small data set [32].
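As an illustration only (the study itself fitted its DT with the R package “rpart”; see Software packages), a minimal scikit-learn sketch on synthetic placeholder data, with hypothetical hyperparameter values, might look like this:

```python
# Minimal decision tree sketch (illustrative; the study used R's "rpart").
# X: feature matrix, y: binary outcome (0 = alive, 1 = dead); data are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=126, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# max_depth and min_samples_leaf control the stopping/pruning steps described above.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```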

Random forest

The RF method was first proposed by Leo Breiman [33]. This algorithm is an ensemble learning method widely used for classification and regression problems. It produces a large number of decision trees from subsamples of the dataset; each decision tree generates an output, and the final output is obtained by majority vote for classification and by averaging for regression. First, bootstrap samples are drawn by resampling the original data. Approximately 37% of the data is excluded from each bootstrap sample; these excluded observations are named out-of-bag (OOB) data. For each bootstrap sample, RF then grows an unpruned tree as follows: at each tree node, a subset of variables is randomly selected from all variables, and the best split among those variables is chosen. All the decision trees created from the bootstrap samples are combined and analyzed to obtain the final RF model [13, 33].

The performance of the random forest can be estimated by internal validation using the OOB data. For classification problems, the RF’s classification error rate, named the out-of-bag (OOB) error, is calculated from the OOB data: at each bootstrap iteration, the OOB data are predicted using the tree grown with that bootstrap sample, and the OOB predictions are then accumulated and the error rate (OOB error) computed [34]. One benefit of the OOB error is that it is estimated from the original data; another is its high computational speed [35]. Many studies have shown that the RF algorithm has higher stability, robustness, and classification performance than other ML algorithms, and that it can preserve high classification performance when missing data exist [18]. Another property of the RF method is the generation of prediction rules, and it can identify important variables [13].
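To make the OOB idea concrete, here is a hedged scikit-learn sketch on synthetic data (the study used R’s “randomForest” package; n_estimators is a placeholder value):

```python
# Random forest with out-of-bag (OOB) error, as described above.
# Illustrative sketch only; the study used R's "randomForest" package.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=126, n_features=6, random_state=0)

# oob_score=True evaluates each tree on the ~37% of samples left out of
# its bootstrap sample, yielding the internal OOB estimate.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB error:", 1 - rf.oob_score_)
print("variable importance:", rf.feature_importances_)
```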

Naïve bayes

The NB classifier is a simple algorithm that applies the well-known Bayes’ theorem with strong independence assumptions: it supposes that all predictor variables are conditionally independent of one another given the class. The NB method aims for a clear, simple, and very fast classifier. The NB model categorizes samples by computing the probability that an object belongs to a specific category: following the Bayesian formula, the posterior probability is computed from the prior probability of an object, and the class with the maximum posterior probability is chosen as the object’s class. Easy implementation, good performance, the ability to work with little training data, and probabilistic predictions are advantages of the NB method. It is also insensitive to irrelevant features, and it performs well even when the independence assumption is violated. However, it can be computationally intensive, especially for models involving many variables [15, 32].
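A minimal sketch of the posterior-probability mechanics, assuming Gaussian likelihoods and synthetic placeholder data (the study used R’s “naivebayes” package):

```python
# Gaussian naive Bayes sketch (illustrative; the study used R's "naivebayes").
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=126, n_features=6, random_state=0)

nb = GaussianNB()
nb.fit(X, y)
# predict_proba returns the posterior probability of each class; the class
# with the maximum posterior is chosen, exactly as described above.
print(nb.predict_proba(X[:3]))
print(nb.predict(X[:3]))
```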

Artificial neural network

An artificial neural network, inspired by the operation of neurons in the human brain, is a widely used machine learning method that performs strongly in classification and pattern recognition. Learning in this method proceeds by detecting patterns and relationships in data and improving through experience. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer. Hidden layers sit between the input and output layers, and their number is commonly chosen by cross-validation. Each layer is made up of units called neurons (nodes). Neurons in two adjacent layers are fully connected, with a weight associated with each connection, while neurons within the same layer are not connected. In a feed-forward neural network, information proceeds unidirectionally: it enters through the input layer neurons and passes through the hidden layers’ neurons to the output neurons. Furthermore, complex non-linear mappings between input and output are learned through activation functions [13, 32]. In this study, we used the sigmoid activation function, a non-linear activation function commonly used before the output layer in binary classification.
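As a hedged sketch of such a network (the study used R’s “nnet” package; the hidden-layer size here is a placeholder, and activation="logistic" is scikit-learn’s name for the sigmoid):

```python
# Feed-forward network with a sigmoid (logistic) activation, mirroring the
# description above. Illustrative only; the study used R's "nnet" package.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=126, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)  # neural networks expect scaled inputs

# One hidden layer of 5 neurons; activation="logistic" is the sigmoid function.
ann = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                    max_iter=2000, random_state=0)
ann.fit(X, y)
print("training accuracy:", ann.score(X, y))
```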

Support vector machine

The SVM is based on statistical learning theory and was first suggested by Vapnik [36]. The main aim of SVM is to find the linear model that maximizes the hyperplane margin; maximizing the margin maximizes the distance between classes. The training points nearest the maximum-margin hyperplane are the support vectors. Classification is thus performed by mapping a vector of variables into a high-dimensional space and maximizing the margin between the two data classes. The SVM algorithm can handle both linearly and nonlinearly separable observations. When the data are not linearly separable, SVM uses a kernel function to transform the nonlinear input into a linearly separable representation in a high-dimensional feature space and carries out the linear separation in this new space. Several kernel functions have been proposed and adopted for SVM, such as the linear, radial, polynomial, and sigmoid kernels [13]; the choice of kernel function makes SVM a flexible method [9]. In the present study, we employed the radial basis kernel function for its better performance.
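A minimal sketch of an RBF-kernel SVM on synthetic data (the study used R’s “e1071” package; C and gamma here are placeholder values that would normally be tuned):

```python
# SVM with a radial basis function (RBF) kernel, the kernel used in the study.
# Illustrative sketch; the study used R's "e1071" package.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=126, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)  # SVMs are sensitive to feature scales

# C and gamma are the usual hyperparameters tuned by cross-validation.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", probability=True, random_state=0)
svm.fit(X, y)
print("support vectors per class:", svm.n_support_)
```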

Extreme gradient boosting

The XGBoost algorithm has gradient boosting at its core but is an enhanced version of the gradient-boosted decision tree algorithm. It is a scalable tree-boosting system developed by Chen and Guestrin in 2016 to overcome the long learning times and overfitting of traditional boosting algorithms [37]. The XGBoost classifier combines weak base classifiers into a strong classifier: at each step of the training process, a base classifier’s residual error is used by the next classifier to optimize the objective function [38]. Moreover, this algorithm can restrain overfitting, decrease classification errors, handle missing values, and minimize learning time while developing the final model [39].
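A hedged sketch using the Python XGBoost API (the study used R’s “xgboost” package; learning_rate, max_depth, and n_estimators are placeholder values):

```python
# XGBoost sketch (illustrative; the study used R's "xgboost" package).
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=126, n_features=6, random_state=0)

# Each new tree fits the residual error of the current ensemble; learning_rate
# and max_depth are typical knobs for restraining overfitting.
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                    eval_metric="logloss", random_state=0)
xgb.fit(X, y)
print(xgb.predict_proba(X[:3]))
```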

SHAP value

Machine learning models have great potential for prediction and classification. However, the results of complex predictive models can be difficult to interpret, which is a barrier to the adoption of ML models. To overcome this problem, Lundberg and Lee proposed the Shapley additive explanations (SHAP) approach for interpreting the predictions of different techniques, including XGBoost. It describes the prediction for a specific input by calculating the impact of each feature on that prediction. SHAP values provide interpretability through summary plots and the global importance of each variable [19].
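A minimal sketch using the Python “shap” library (the study used R’s “SHAPforxgboost” package; the model and data here are illustrative placeholders):

```python
# SHAP values for an XGBoost model, following Lundberg and Lee's approach.
# Illustrative sketch; the study used R's "SHAPforxgboost" package.
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=126, n_features=6, random_state=0)
model = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss").fit(X, y)

# TreeExplainer computes the contribution of each feature to each prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)  # summary plot / global variable importance
```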

Synthetic Minority Over-Sampling Technique (SMOTE)

The imbalanced-dataset classification problem occurs when the number of instances of one class is notably greater than that of the other class. In two-class classification problems, the class with more samples is named the majority class, and the class with fewer samples is called the minority class [20]. The level of class imbalance of a dataset is measured by the imbalance ratio (IR), defined as the ratio of the number of samples in the majority class to the number of samples in the minority class; the higher the IR, the greater the imbalance [40]. For example, a dataset with 100 majority-class and 25 minority-class samples has an IR of 4. In such cases, reporting prediction accuracy as an evaluation criterion is inappropriate, as it usually leads to a bias in favor of the majority class [21].

Two main approaches have been proposed to solve the class imbalance problem: a data-level approach and an algorithm-based approach. The data-level approach aims to change or modify the class distribution in the dataset before training a classifier, which is usually done in the preprocessing phase. The algorithm-level approach focuses on improving the current classifier by adapting the algorithms to learn minority classes [41].

The data-level approach is usually preferred for dealing with unbalanced classes in classification problems, because the class composition of the data can be adjusted to a relatively balanced ratio by adding or removing class instances as the situation requires [42].

Other reasons that can be given are: 1) The samples generated by these methods represent the right trade-off between introducing variance and approximating the original distribution. 2) These techniques are easier to apply compared to algorithm-level methods because the datasets are cleaned before they are used to train different classifiers. 3) Data-level techniques can be flexibly combined with other methods [26,27,28].

Re-sampling, or data synthesis, is the most popular data-level method for processing unbalanced datasets. Re-sampling approaches can be divided into three categories: (i) over-sampling, (ii) under-sampling, and (iii) hybrid sampling [43]. In over-sampling, the weight of the minority class is increased by repeating minority-class samples or generating new ones. Under-sampling randomly deletes instances from the majority class to balance it with the minority class. Hybrid sampling combines the two methods to exploit the benefits and mitigate the drawbacks of both approaches [43]. The over-sampling approach is applied more frequently than the others; the collection of numerous over-sampling techniques (85 variants) that evolved from SMOTE is called the SMOTE family [26]. One of the first over-sampling methods, SMOTE, is a powerful tool for dealing with imbalanced data sets, suggested by Chawla et al. [21]. SMOTE generates synthetic data for the minority class based on its k-nearest neighbors until the ratio of minority to majority classes becomes more balanced. The new synthetic data are very similar to the actual data because they are produced from the original features [21].

The main advantage of SMOTE is that it prevents overfitting by synthesizing new samples from the minority class instead of duplicating existing ones [44].
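A minimal sketch of SMOTE with the “imbalanced-learn” package used in the study, on synthetic placeholder data with an assumed (not the study’s) class ratio:

```python
# SMOTE balancing with imbalanced-learn, the package used in the study.
# Data and the 4:1 imbalance ratio are illustrative placeholders.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=126, n_features=6, weights=[0.8, 0.2],
                           random_state=0)
print("before:", Counter(y))

# k_neighbors is the k of the k-nearest-neighbour interpolation described above.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```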

However, SMOTE also has some disadvantages: it can oversample noisy samples and borderline samples [28]. To overcome these problems, many strategies have been employed in the literature, including [28]:

  • Extensions of SMOTE by combining it with other techniques such as noise filtering, e.g., SMOTE-IPF and SMOTE-LOF

  • Modifications of SMOTE, e.g., borderline SMOTE (B1-SMOTE and B2-SMOTE) and SVM-SMOTE.

Borderline-SMOTE is an extension of SMOTE, with more powerful performance, proposed by Han et al. in 2005. In this method, only the borderline examples of the minority class are over-sampled; the borderline is the region where minority-class samples lie near the majority class. First, the number of majority-class neighbors of each minority instance is used to split the minority instances into three groups, safe, noise, and danger, and new instances are then generated. If the neighbors considered for points in the danger region are taken only from the minority class, the method is called Borderline-SMOTE1; when the neighbors in the danger region are taken from both the minority and majority classes, it is called Borderline-SMOTE2 [45]. Support vector machine SMOTE (SVM-SMOTE) is another extension of SMOTE that generates new synthetic samples near the decision boundary; it uses an SVM to detect that boundary [46]. SMOTE-Nominal Continuous (SMOTE-NC) is an over-sampling method for mixed data that uses k-nearest neighbors with a modified Euclidean distance to generate new synthetic samples [21]. These SMOTE techniques were applied in the initial data-preparation stage of this study, after which the ML algorithms were trained.
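These variants are also available in “imbalanced-learn”; a hedged sketch on the same synthetic placeholder data (hyperparameter defaults are illustrative, not those of the study):

```python
# Borderline-SMOTE and SVM-SMOTE variants from imbalanced-learn.
from imblearn.over_sampling import BorderlineSMOTE, SVMSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=126, n_features=6, weights=[0.8, 0.2],
                           random_state=0)

# kind="borderline-1" interpolates using minority neighbours only (B1-SMOTE);
# kind="borderline-2" also uses majority neighbours (B2-SMOTE).
X_b1, y_b1 = BorderlineSMOTE(kind="borderline-1", random_state=0).fit_resample(X, y)
X_b2, y_b2 = BorderlineSMOTE(kind="borderline-2", random_state=0).fit_resample(X, y)

# SVM-SMOTE generates synthetic samples near the SVM decision boundary.
X_svm, y_svm = SVMSMOTE(random_state=0).fit_resample(X, y)
```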

Performance criteria

The predictive performance of the ML algorithms was evaluated using several criteria: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, area under the curve (AUC), geometric mean (G-mean), F1 score, and the P-value of the McNemar test. We evaluated the predictive performance of the ML methods using a cross-validation approach in which both the original imbalanced dataset and the SMOTE-balanced datasets were randomly split into training (70%) and test (30%) sets. This process was repeated 100 times, and the mean value of each evaluation criterion was calculated over the 100 repetitions. Moreover, to prevent over-fitting, fivefold cross-validation was performed for each ML algorithm to select the optimal hyperparameters: different values of each hyperparameter were examined and the optimum value determined. The optimal hyperparameter values selected for each ML model are shown in Table 1.

Table 1 The tuning parameter values of SMOTE-based machine learning methods
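As an illustration of one iteration of this evaluation loop, a hedged Python sketch on synthetic placeholder data (in the study the split was repeated 100 times and the criteria averaged; the classifier here is an arbitrary stand-in):

```python
# One iteration of the evaluation loop: a 70/30 split and several of the
# reported criteria. Illustrative sketch only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=126, n_features=6, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
g_mean = np.sqrt(sensitivity * specificity)  # G-mean = sqrt(sens * spec)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
f1 = f1_score(y_te, y_hat)
print(sensitivity, specificity, ppv, npv, g_mean, auc, f1)
```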

Software packages

In the present study, all SMOTE-based balancing methods were implemented in Python version 3.10.6 using the “imbalanced-learn” package. All analyses of the ML methods were implemented in R version 4.1.1 using the following packages: “e1071” for SVM; “nnet” for NN; “naivebayes” for NB; “randomForest” for RF and variable importance (VIMP) in the RF; “rpart” for DT; “xgboost” for XGBoost; and “SHAPforxgboost” for SHAP values.


