Machine Learning for Risk Stratification in the Emergency Department (MARS-ED): A Randomized Controlled Trial

Study design and setting

This study was designed as an investigator-initiated, open-label, randomized, non-inferiority clinical trial. The protocol for the Machine Learning for Risk Stratification in the Emergency Department (MARS-ED) study has been published previously^{22}. Briefly, the study was conducted in the ED of Maastricht University Medical Center+ (MUMC+), a secondary/tertiary care medical center in the Netherlands, where 6,800 patients visit the ED for evaluation by internal medicine specialists each year.

This study was approved by the Medical Ethics Committee (METC) of MUMC+ (METC 21-068) and registered at clinicaltrials.gov (NCT05497830). The study was conducted in accordance with the principles of the Declaration of Helsinki and reported in accordance with the Consolidated Standards of Reporting Trials-Artificial Intelligence (CONSORT-AI) guidelines (Supplementary Table 5)³³. The first participant was enrolled on 16 September 2022, and the last patient was enrolled on 17 July 2024.

Participants

Adult patients (age 18 and older) presenting to the emergency department, who were evaluated and treated by an internal medicine specialist, and who had at least four laboratory test results were eligible. Participants provided written informed consent. Patients who returned to the ED within 1 month of the index ED presentation were excluded, as return visits were included in the follow-up period of the index visit.

Randomization

We conducted a randomized clinical trial to assess the prognostic accuracy of the RISK^INDEX for 31-day mortality, its clinical impact (policy changes), and its impact on secondary clinical outcomes (hospitalization, length of stay, ICU admission). When patients entered the emergency department for evaluation and treatment by an internal medicine specialist, they were assessed for eligibility and randomized as soon as informed consent was obtained. Patients were randomly assigned to either the intervention group or the control group (standard treatment) using computer-generated permuted block randomization with an allocation ratio of 1:1. The study was not blinded, because physicians needed to be informed of the RISK^INDEX in order to assess the magnitude of its clinical impact.
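
As an illustration of permuted block randomization with a 1:1 ratio, the following Python sketch builds an allocation schedule from randomly shuffled blocks. The block size, the seed, and the function name `permuted_block_schedule` are assumptions made for this example; the text does not report the block size or the software used to generate the sequence.

```python
import random


def permuted_block_schedule(n_patients, block_size=4, seed=2022):
    """Return a 1:1 allocation list built from randomly permuted blocks.

    block_size and seed are illustrative assumptions, not trial settings.
    """
    if block_size % 2:
        raise ValueError("block_size must be even for a 1:1 ratio")
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    schedule = []
    while len(schedule) < n_patients:
        # Each block holds equal numbers of both arms, in random order.
        block = ["intervention", "control"] * (block_size // 2)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_patients]


schedule = permuted_block_schedule(1300)
print(schedule[:8])
print(schedule.count("intervention"), schedule.count("control"))
```

Because every complete block is balanced, the two arms can never differ by more than half a block at any point during enrollment, which is the usual reason for preferring this scheme over simple randomization.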

Study preparation

In the months before the trial began, ED physicians were prepared for use of the RISK^INDEX through briefings held during regular educational and academic meetings (n = 4). These briefings covered in detail the model inputs (age, gender, routine laboratory tests), interpretation of the 0-100 probabilities, key limitations and appropriate use (as a decision aid rather than an instruction), and previously obtained results, including comparisons with physicians, predictive accuracy, and explainability in multicenter settings. In addition, we conducted two pilot projects. First, we surveyed ED physicians (n = 17) to compare alternative display formats for preference and clarity (Likert scale), including calibrated probability displays, color-coded categorical displays, and gauge/decile formats (Supplementary Figure 7). Based on these assessments, we implemented a calibrated probability display that reports a single 0–100 value corresponding to the estimated probability of mortality within 31 days. Second, physicians practiced with virtual cases to rehearse their interpretation of the RISK^INDEX.

Research interventions and procedures

A summary of the patient timeline is shown in Supplementary Figure 8. After a complete assessment of the patient, the ED physician was asked questions regarding clinical intuition (Table 2). In the intervention group only, the RISK^INDEX was presented privately to the attending physician by a member of the study team after the patient had been fully assessed and a preliminary treatment plan had been formulated. Study team members informed ED physicians about the study and explained the RISK^INDEX: the meaning of its variables, the calculated probability, and the high predictive accuracy for 31-day mortality found in previous multicenter studies. To ensure that presentation of the RISK^INDEX did not interfere with the clinical intuition of ED physicians, the electronic case record form was designed so that the RISK^INDEX could not be calculated until the clinical intuition questions had been answered. These physicians were then asked whether the RISK^INDEX led them to adjust their initial clinical intuition and whether it resulted in changes to the treatment plan (Table 2). Finally, in a subgroup of 121 patients, we asked ED physicians about the perceived added value of the RISK^INDEX (using a Likert scale ranging from 1 to 10).

Table 2 Questionnaire on clinical intuition and medical changes

RISK^INDEX

The RISK^INDEX is a machine learning (ML)-derived risk score that predicts 31-day all-cause mortality using routine laboratory tests ordered by the attending physician and basic patient characteristics (age and gender)^{21}. The calculated RISK^INDEX (values 0 to 100) corresponds to an individual's 31-day probability of death.

Clinical prediction tools

To compare the prognostic accuracy of the RISK^INDEX for 31-day mortality with that of clinical prediction tools, we selected commonly used tools based on their prevalence and global usage: the National Early Warning Score (NEWS), the Acute Physiology and Chronic Health Evaluation II (APACHE II), and the Sepsis-related Organ Failure Assessment (SOFA) scores^{6, 7, 8}. Although originally derived for inpatient and intensive care cohorts, these three scores were used as comparators in this study because they are among the few scores that have been validated and are clinically used in the ED population^{29, 30, 31, 32}.

Data collection

Briefly, we collected data on patient characteristics, comorbidities, triage category (based on the Manchester Triage System (MTS)³⁴), reason for emergency department visit, vital signs, laboratory tests, and clinical endpoints (e.g., hospitalization, intensive care unit (ICU) admission within 31 days, mortality within 31 days). Responses to all survey questions were recorded immediately on the electronic case record form. Retrieval of laboratory test results and RISK^INDEX results was automated. To ensure data quality, all data on outcomes and all data samples were double-checked by another member of the research team and/or the study monitor, and discrepancies were resolved through discussion with a second reviewer. Data monitoring was performed by the Maastricht Clinical Trials Center (CTCM).

Outcomes

The primary outcomes of this study were the prognostic accuracy of the RISK^INDEX for 31-day mortality and its clinical impact. Secondary outcomes included the prognostic accuracy for 31-day mortality compared with clinical prediction tools, differences in clinical outcomes (hospitalization, length of stay, and ICU admission), and the feasibility of the RISK^INDEX.

Statistical analysis

Assuming a 31-day mortality rate of 8%, we calculated the required sample size as 1,250 patients based on the number of ED patients in MUMC+ and expected to include 1,300 patients during the study period. Regarding policy changes, post hoc power calculations showed that 388 patients in each group were needed to detect a 2% policy change (80% power, 5% significance level). The current study included a total of 1,303 patients, which allows detection of a 1.2% difference in policy change (80% power). Baseline characteristics were analyzed using descriptive statistics. Categorical variables were reported as frequency counts and percentages, and continuous variables were reported as median with interquartile range (IQR) or mean with standard deviation (SD), depending on the distribution.
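
The text does not state which formula was used for the post hoc power calculation. As a hedged illustration only, the Python sketch below applies the standard normal-approximation formula for comparing two independent proportions (two-sided test, pooled variance under the null hypothesis), under the assumption that the policy change rate in the control group is 0% because those physicians were not shown the score. Under that assumption the formula gives 388 patients per group for a 2% difference. The function name `n_per_group` and the 0% control rate are assumptions introduced for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p_control, p_intervention, alpha=0.05, power=0.80):
    """Sample size per arm for a two-sided comparison of two proportions.

    Normal approximation with pooled variance under H0; this is one common
    textbook formula, not necessarily the one used by the trial team.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_intervention) / 2
    delta = abs(p_intervention - p_control)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control)
                        + p_intervention * (1 - p_intervention))
    ) ** 2
    return ceil(numerator / delta ** 2)


# Assumed control rate of 0% versus a 2% policy change rate.
print(n_per_group(0.0, 0.02))  # 388
```

Under the same assumptions, a 1.2% difference requires fewer patients in total than the 1,303 included, which is in line with the detectable difference reported above.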

Our primary analysis evaluated the prognostic accuracy and clinical impact of the RISK^INDEX. We compared its prognostic accuracy for 31-day mortality with physicians' clinical intuition by calculating the area under the receiver operating characteristic curve (AUROC) in the intervention group. AUROCs were compared using DeLong's method³⁵. The precision-recall curve (area under the precision-recall curve (AUPRC)) was used to assess the balance between sensitivity and positive predictive value³⁶. As a post hoc analysis, the prognostic accuracy of the RISK^INDEX was also compared with physicians' clinical intuition for predicting 7-day mortality. As a further post hoc analysis, clinical intuition was analyzed after stratifying physicians' experience into two categories: 0–6 years (residents in specialty training) and more than 6 years (specialists). Assessing the prognostic accuracy of a risk score within an intervention group is inherently complicated, because clinical actions taken on the basis of that score may influence the primary outcome of 31-day mortality and thereby obscure its true performance. Therefore, a sensitivity analysis was performed within the control group to obtain an unbiased assessment, uninfluenced by clinical actions based on the RISK^INDEX.

Clinical impact was assessed by the number and type of policy changes in the ED following presentation of the RISK^INDEX. Although the study was not designed to address clinical endpoints, the randomized controlled design allowed a direct exploratory assessment of the impact of the RISK^INDEX on secondary clinical endpoints. Feasibility of the RISK^INDEX was evaluated based on agreement with physician expectations and its perceived added value. Agreement between the RISK^INDEX and ED physician expectations was assessed by asking physicians whether the RISK^INDEX result matched, exceeded, or fell short of their expectations. The perceived added value of the RISK^INDEX was evaluated by asking ED physicians to rate its usefulness on a Likert scale from 1 to 10. This evaluation was performed in a subgroup of 114 patients.
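
To make the accuracy metrics concrete, the following Python sketch computes an AUROC, a step-wise estimate of the AUPRC (average precision), and a DeLong comparison of two correlated AUROCs on toy data. The study's own analyses were performed in R; this is an independent minimal illustration, and the function names (`placements`, `auroc`, `average_precision`, `delong_test`) and all example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def placements(scores, labels):
    """Per-case (V10) and per-control (V01) placement values (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    def psi(x, y):
        return 1.0 if x > y else 0.5 if x == y else 0.0

    v10 = [mean(psi(x, y) for y in neg) for x in pos]
    v01 = [mean(psi(x, y) for x in pos) for y in neg]
    return v10, v01


def auroc(scores, labels):
    """AUROC as the mean placement value of the cases."""
    return mean(placements(scores, labels)[0])


def average_precision(scores, labels):
    """Step-wise AUPRC estimate; tied scores keep their input order."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / hits


def _cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(scores_a, scores_b, labels):
    """Return (AUROC difference, z, two-sided p) for two paired predictors."""
    a10, a01 = placements(scores_a, labels)
    b10, b01 = placements(scores_b, labels)
    diff = mean(a10) - mean(b10)
    var = ((_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / len(a10)
           + (_cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)) / len(a01))
    if var <= 0:
        raise ValueError("variance of the AUROC difference is zero")
    z = diff / sqrt(var)
    return diff, z, 2 * (1 - NormalDist().cdf(abs(z)))


# Toy data: 1 = died within 31 days, 0 = survived (hypothetical values).
died = [1, 1, 1, 0, 0, 0, 0, 0]
score = [0.90, 0.80, 0.30, 0.40, 0.20, 0.10, 0.05, 0.60]
intuition = [0.70, 0.20, 0.40, 0.50, 0.30, 0.10, 0.60, 0.05]
print(auroc(score, died), average_precision(score, died))
print(delong_test(score, intuition, died))
```

The DeLong variance uses the covariance of the placement values because both predictors are evaluated on the same patients; treating the two AUROCs as independent would ignore that pairing.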

All analyses were performed in R version 4.1.3 (The R Foundation for Statistical Computing). Source code for the data analysis and the in-house developed interfaces is available in a public repository (https://github.com/wptmdoorn/marsedstudy).