Predicting patient risk of leaving without being seen using machine learning: a retrospective study in a single overcrowded emergency department

This retrospective analysis was conducted, in accordance with the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for prediction model studies [29], by examining records from the ED database of “Maresca” Hospital in Torre del Greco, Italy. To enhance the reliability of estimates, data were extracted for patients who visited the ED between 2019 and 2023. The study considered daily records of all individuals who accessed the ED, those who were admitted to the hospital, and those who left without being seen during the observation period. The main steps of the analysis are shown in Fig. 1.

To define our cohorts, we assigned a binary indicator reflecting the mode of discharge, i.e. whether or not the patient left without being seen (LWBS). This classification applied to all ED patients, since every patient underwent triage even though some left before receiving a medical evaluation. The initial dataset included 80,614 ED registrations, which were then processed and refined to ensure compatibility with the ML algorithms used in the study. This sample size (n = 80,614) was considered adequate based on the number of events per predictor variable, in line with recommendations for predictive modeling studies.

The variables analyzed included:

Gender (Male / Female),
Age,
Mode of access to the ED (autonomous or by ambulance),
Triage score,
Time of arrival categorized into five time intervals,
Day of the week of arrival,
Waiting time for medical evaluation, and
Mode of discharge.

Regarding the triage score, patients were categorized into four urgency levels: Red (critical condition requiring immediate care), Yellow (serious condition requiring rapid medical attention), Green (less urgent conditions), and White (non-urgent cases). Patients classified under the Black category (deceased on arrival) were excluded from the analysis. A summary of the cleaning operations performed on the database is shown in Fig. 2.

The day of the week was included because variations in ED crowding and staffing across different days could influence patient waiting times and the likelihood of LWBS events.

The “waiting time for take-over” refers to the time elapsed between patient registration and the first medical evaluation or assignment to a physician. Longer waiting times may increase the risk of patients leaving before being seen.

Patients who arrived at the ED already deceased (triage category: Black) and those who passed away while in the ED were excluded from the analysis. Additionally, records lacking essential demographic or clinical information—such as age, gender, or triage color at admission—were removed. Patients whose discharge records were later annulled or administratively closed by the case manager after several months were also excluded from the dataset. We did not perform imputation for missing data; instead, records with missing or invalid fields were excluded to ensure model integrity.

Figure 3 illustrates the yearly distribution of ED visits categorized by triage color code. The highest number of visits occurred in 2019, totaling 20,186 cases, spread across all levels of severity.

From a methodological perspective, variables such as Gender, Access Mode, Time of Arrival, and Day of the Week were transformed into dummy variables. This process involved breaking each categorical variable into n binary indicators (0/1), where n represents the number of distinct categories within each feature. The outcome variable was derived from the Mode of Discharge, assigning a value of 1 to LWBS patients and 0 to all others.
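The dummy-variable transformation and the derivation of the binary outcome can be illustrated with a minimal sketch in plain Python. The column values and the helper name `one_hot` are hypothetical and chosen only for illustration; they are not taken from the study's code.

```python
def one_hot(values):
    """Expand one categorical column into n binary (0/1) indicator columns,
    where n is the number of distinct categories observed in the column."""
    categories = sorted(set(values))
    return {
        f"is_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Hypothetical access-mode column for five ED registrations.
access_mode = ["autonomous", "ambulance", "autonomous", "autonomous", "ambulance"]
dummies = one_hot(access_mode)

# Hypothetical discharge labels mapped to the binary outcome (1 = LWBS, 0 = all others).
discharge = ["home", "LWBS", "admitted", "LWBS", "home"]
outcome = [1 if d == "LWBS" else 0 for d in discharge]

for name, column in dummies.items():
    print(name, column)
print("outcome", outcome)
```

In practice the same result is usually obtained with a library routine such as `pandas.get_dummies`; the hand-written version is shown only to make the 0/1 expansion explicit.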

To examine differences between LWBS and non-LWBS patients, a correlation analysis was conducted on patient characteristics. Pearson’s correlation coefficient was used to identify relationships among independent variables before further analysis. Leveraging these patient-related features, an ML model was developed to predict LWBS occurrences. Data were extracted separately for each year under study and stored in separate Excel sheets. For the analysis, these were merged into a single dataset that retained the year of discharge as an independent variable.
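For reference, Pearson's correlation coefficient can be computed directly from its definition. The sketch below uses only the standard library; the waiting times and outcome labels are hypothetical values for illustration, not study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical waiting times (minutes) and LWBS indicator for six patients.
waiting_minutes = [10, 35, 60, 90, 120, 150]
lwbs = [0, 0, 0, 1, 1, 1]

r = pearson_r(waiting_minutes, lwbs)
print(f"Pearson r = {r:.3f}")
```

Values close to +1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear association.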

Data pre-processing

Once the relevant records were retrieved from the hospital’s information system, the dataset was analyzed to investigate the factors influencing patient dropout in the ED. The analysis focused on identifying key variables associated with patients and their movement through the ED.

Initially, the dataset was divided into two groups: patients who left without being seen (LWBS) and all other patients (non-LWBS).

A comparative statistical analysis was performed between these two groups. For continuous variables such as age and waiting time, the Mann-Whitney U test was applied, while categorical variables, including gender and access mode, were analyzed using the Chi-squared test. A p-value < 0.05 was considered statistically significant. Before implementing the models, the statistical relationships between the selected independent variables and the target variable were assessed. Of the various alternatives, Pearson’s correlation coefficient, the most commonly used in the literature, was implemented [30].
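As an illustration of the categorical comparison, the Chi-squared test of independence for a 2x2 table can be sketched with the standard library alone. The counts are hypothetical, and the sketch assumes no continuity correction, which the text does not specify.

```python
import math

def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table
    (assumption: no continuity correction is applied)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = access mode (autonomous, ambulance),
# columns = (LWBS, non-LWBS).
table = [[90, 910], [10, 490]]
stat, p_value = chi_squared_2x2(table)
significant = p_value < 0.05
print(f"chi2 = {stat:.2f}, p = {p_value:.2e}, significant = {significant}")
```

The Mann-Whitney U test for the continuous variables follows the same decision rule (p < 0.05) but is based on ranks rather than counts.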

The analyses were conducted using Python in the Google Colab cloud computing environment. The data were provided as input to the code via an Excel spreadsheet (Microsoft Office, Excel, Microsoft Corporation, Redmond, Washington), which was also used to produce the graphics for this phase of the study.

Classification algorithms

ML algorithms were employed to predict which patients would leave the ED before being seen. These models function by learning a mapping between input features and an output variable, enabling future predictions based on new patient data.

For this study, supervised classification models were implemented, with multiple independent variables serving as inputs for the ML algorithms. The LWBS status was transformed into a binary output variable, distinguishing between patients who left and those who remained. The classification analysis aimed to differentiate and predict LWBS cases by leveraging multiple ML techniques.

The following four classification algorithms were selected and implemented using the scikit-learn library [31]:

1. Random Forest (RF): A machine learning technique widely used for classification and prediction tasks. It builds an ensemble of decision trees, each trained on a random subset of the data, and aggregates their outputs to produce the final result. This approach enhances model generalization, reduces overfitting, and effectively handles missing values and imbalanced datasets. By combining multiple trees, random forests improve predictive performance and robustness, making them one of the most powerful and versatile methods in supervised learning [32].
2. Naïve Bayes (NB): A simple yet powerful probabilistic classification algorithm based on Bayes’ theorem. Despite its assumption of feature independence often being violated in real-world datasets, Naïve Bayes remains effective across various domains such as product recommendation, medical diagnosis, and autonomous systems. Different variations of NB exist to adapt to diverse data characteristics, offering varying levels of accuracy. Its main advantages include computational efficiency, robustness to irrelevant features, and ease of implementation [33].
3. Decision Tree (DT): A non-parametric supervised learning method used for classification and regression. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features [34].
4. Logistic Regression (LR): A supervised machine learning algorithm specifically designed for classification tasks, where the target variable is categorical. Logistic regression models the relationship between input features and the probability of belonging to a particular class. It estimates these probabilities using a logistic function, making it a standard method for binary classification problems. Its simplicity, interpretability, and effectiveness in linearly separable datasets make it widely adopted across various domains [35].

Decision trees (DT), random forest (RF) and logistic regression (LR) were chosen for their proven effectiveness in similar scenarios, as documented in previous studies [36, 37]. Furthermore, their scikit-learn implementations allow the class_weight parameter to be set to “balanced”, which automatically assigns class weights inversely proportional to the class frequencies in the dataset. Naïve Bayes (NB) served as a baseline against which to compare the performance of these approaches. Given the correlations present among the predictors, the chosen algorithms also offer greater robustness to multicollinearity, even in the case of LR, owing to the optimisation of hyperparameters [38, 39].
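The effect of the “balanced” setting can be made concrete with a short sketch. According to the scikit-learn documentation, the weight of each class is n_samples / (n_classes * class count); the label distribution below is hypothetical and the helper name `balanced_class_weights` is introduced here only for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector: 95 non-LWBS (0) and 5 LWBS (1) patients.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

Under this scheme, errors on the rare LWBS class are penalised more heavily during training than errors on the majority class.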

To ensure robust evaluation, the dataset was split into a training set (80%) and a test set (20%). After splitting, the Synthetic Minority Oversampling Technique (SMOTE) [40] was applied to the training set only, to generate synthetic instances of the underrepresented class (since LWBS patients constitute a small percentage of the total). Although these synthetic instances are interpolated from existing minority-class records and therefore do not introduce genuinely new information, they help mitigate class imbalance.
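The idea behind SMOTE can be sketched as follows. This is a deliberately simplified, standard-library illustration of the interpolation step, not the implementation used in the study; the function name `smote_sketch`, the feature pairs, and the choice k = 3 are all assumptions made for the example.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating between a randomly
    chosen minority point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class (LWBS) training records: (age, waiting time in minutes).
minority_train = [(25.0, 180.0), (31.0, 210.0), (40.0, 150.0), (52.0, 240.0)]
new_samples = smote_sketch(minority_train, n_synthetic=6)
print(len(new_samples), "synthetic samples generated")
```

Because the oversampling is applied only after the split and only to the training portion, the test set keeps the original class distribution and no synthetic record can leak into the evaluation.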

The performance of the algorithms was evaluated according to the following metrics: accuracy, precision, recall, F1-score and balanced accuracy. In addition, given the imbalance in the dataset, AUC-ROC (Area Under the Receiver Operating Characteristic Curve) was also used. The ROC curve evaluates classification performance by plotting Recall, defined as Recall = TP / (TP + FN), against the false positive rate (FP / (FP + TN)), where TP, FN, FP and TN denote true positives, false negatives, false positives and true negatives, respectively. The AUC represents the total area under this curve, providing an overall measure of the model’s ability to distinguish between LWBS and non-LWBS patients. For each metric, the 95% confidence interval was reported using the Normal Approximation Interval method [41].
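The threshold-based metrics and the normal approximation interval can be sketched from confusion-matrix counts. The counts below are hypothetical, and taking n as the total number of test instances for the interval is an assumption of this example rather than a detail stated in the text.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
        "false_positive_rate": fp / (fp + tn),
    }

def normal_approx_ci(p, n, z=1.96):
    """95% normal approximation interval: p +/- z * sqrt(p * (1 - p) / n)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical test-set counts (LWBS is the positive class).
tp, fp, tn, fn = 80, 120, 780, 20
metrics = classification_metrics(tp, fp, tn, fn)
ci_low, ci_high = normal_approx_ci(metrics["accuracy"], tp + fp + tn + fn)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"accuracy 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

With these counts, accuracy is high while precision is low, which illustrates why accuracy alone is not sufficient on an imbalanced dataset.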

To enhance the reliability of results, 10-fold Cross Validation was performed [42], coupled with GridSearchCV [43] to optimize hyperparameters. GridSearchCV systematically tests all possible combinations of hyperparameter values, assessing model performance through cross-validation (Table 1). This process ensures the selection of the optimal configuration for each algorithm.
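The logic of an exhaustive grid search combined with k-fold cross-validation can be sketched without any library. This is an illustration of the procedure, not of GridSearchCV itself: the grid values, the helper names, and the placeholder scoring function `toy_score` are hypothetical, and the folds here are contiguous and unshuffled for simplicity.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal validation folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

def grid_search(param_grid, score_fn, n_samples, k=10):
    """Evaluate every parameter combination by its mean score across k folds."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical grid; the values actually explored are those reported in Table 1.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]}

def toy_score(params, train_idx, val_idx):
    # Placeholder for "fit the model on train_idx, score it on val_idx".
    return 1.0 - abs(params["max_depth"] - 5) / 10 - params["min_samples_leaf"] / 100

best_params, best_score = grid_search(param_grid, toy_score, n_samples=100)
print(best_params, round(best_score, 3))
```

The number of model fits grows as the product of the grid sizes times the number of folds (here 3 x 2 x 10 = 60), which is the main cost of an exhaustive search.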

Table 1 Key parameter values chosen for the model

To ensure model stability, a statistical test was performed by randomly shuffling a subset of the test data 100 times and evaluating whether accuracy exhibited significant fluctuations. McNemar’s test was then applied to compare the confusion matrix of the classifier with the highest AUC against the others, using a significance threshold of 0.05. This test was based on four computed values:

k₀₀: Instances misclassified by both classifiers.
k₀₁: Instances misclassified by the first classifier but correctly identified by the second.
k₁₀: Instances misclassified by the second classifier but correctly identified by the first.
k₁₁: Instances correctly classified by both classifiers.
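Only the two discordant counts, k₀₁ and k₁₀, enter the McNemar statistic. The sketch below assumes the continuity-corrected form of the test, which the text does not specify, and uses hypothetical counts; the function names are introduced only for illustration.

```python
import math

def disagreement_counts(y_true, pred_first, pred_second):
    """Count k01 (first wrong, second right) and k10 (first right, second wrong)."""
    k01 = k10 = 0
    for y, a, b in zip(y_true, pred_first, pred_second):
        if a != y and b == y:
            k01 += 1
        elif a == y and b != y:
            k10 += 1
    return k01, k10

def mcnemar_test(k01, k10):
    """McNemar's chi-squared statistic (assumed continuity-corrected) and p-value."""
    stat = (abs(k01 - k10) - 1) ** 2 / (k01 + k10)
    # Chi-squared survival function for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical discordant counts for two classifiers on the same test set.
stat, p_value = mcnemar_test(k01=40, k10=15)
print(f"statistic = {stat:.3f}, p = {p_value:.4f}, differ = {p_value < 0.05}")
```

A p-value below the 0.05 threshold suggests that the two classifiers make errors on systematically different instances.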

In addition to class evaluation, performance was also evaluated in terms of probability by implementing the Brier Score (BS). This metric is the mean squared error between the probabilistic prediction and the corresponding event expressed as 0/1, according to the formula \(BS = \frac{1}{N}\sum_{t=1}^{N}(p_t - o_t)^2\), where \(p_t\) is the predicted probability and \(o_t\) the observed outcome for instance \(t\). A marked difference between the predicted probability and the observed event (BS close to 1) reflects a larger error, while a small value (BS close to 0) indicates high accuracy of the predicted probabilities.

Once the most effective algorithm was determined, Feature Importance analysis was conducted to interpret the classification process. Specifically, Permutation Feature Importance was used, where individual predictors were randomly shuffled one at a time. The corresponding decline in model performance was quantified by measuring the AUC, illustrating how crucial each feature was. The final results, displayed graphically, highlight the AUC reduction for each variable, shedding light on their relative significance.

Finally, to increase the interpretability of the best model, a local technique, SHapley Additive exPlanations (SHAP), was implemented in addition to a global technique such as Feature Importance. This technique treats the prediction as a payoff to be distributed among the various features (SHAP values). This requires re-evaluating the model output with each feature excluded in turn, across all possible combinations of the remaining features. Although highly informative, this technique is computationally expensive [44].
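The Brier score and the permutation-based AUC drop described above can be illustrated with a minimal standard-library sketch. The toy model, which scores patients using waiting time only, and all data values are hypothetical; the function names are not taken from the study's code.

```python
import random

def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)

def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance_auc(predict, rows, labels, n_repeats=20, seed=0):
    """Mean AUC drop when each feature column is shuffled on its own."""
    rng = random.Random(seed)
    baseline = auc([predict(r) for r in rows], labels)
    drops = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            total += baseline - auc([predict(r) for r in shuffled], labels)
        drops.append(total / n_repeats)
    return baseline, drops

# Hypothetical records: (waiting time in minutes, age); label 1 = LWBS.
rows = [(30.0, 70.0), (45.0, 22.0), (60.0, 35.0), (90.0, 51.0),
        (150.0, 64.0), (180.0, 29.0), (210.0, 43.0), (240.0, 58.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

def predict(row):
    # Toy model: predicted LWBS probability depends on waiting time only.
    return min(1.0, row[0] / 300)

bs = brier_score([predict(r) for r in rows], labels)
baseline, drops = permutation_importance_auc(predict, rows, labels)
print(f"Brier score = {bs:.4f}, baseline AUC = {baseline:.2f}, AUC drops = {drops}")
```

Because the toy model ignores age, shuffling that column leaves the AUC unchanged, whereas shuffling waiting time degrades it; this is the pattern that permutation importance is designed to expose.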
