Research Design, Setting, and Ethical Considerations
This study was approved by the Nara Medical University Ethics Committee (No. 3753) and was conducted in accordance with the principles of the Declaration of Helsinki. Owing to the retrospective nature of the study, the requirement for informed consent was waived by the same committee. This single-center retrospective observational study used electronic medical records from the Advanced Emergency and Critical Care Center at Nara Medical University. Data analyzed in this study were collected between April 2015 and March 2024. The facility provides specialized services in advanced resuscitation, post-arrest care, targeted temperature management, and neurocritical care, making it well suited to investigating early neurological prognosis after out-of-hospital cardiac arrest (OHCA) (for details, see https://necm.naramed-u.ac.ac.c.acp/). All collected data were anonymized, and patient privacy was strictly protected. The study was conducted and reported in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines.
Study Population
We enrolled patients aged 18 years or older who were admitted directly from outside the hospital following OHCA and who remained comatose for more than 3 hours after hospital arrival. Head CT was deemed evaluable if it showed no severe motion or metal artifacts that would compromise interpretation. We excluded patients whose cardiac arrest was caused by head trauma; patients whose admission CT demonstrated intracranial pathology, such as ischemic or hemorrhagic stroke, intracranial tumor, traumatic brain injury, or chronic subdural hematoma, regardless of whether the lesion had previously been documented; and patients transferred from another hospital. These exclusions were intended to reduce confounding from differing injury mechanisms, altered imaging findings, and inconsistent early care.
Data collection and feature selection
The following information was extracted from the electronic medical records: patient background (age, sex, witnessed status, bystander CPR, initial rhythm, defibrillation, and total epinephrine dose), time metrics (no-flow and low-flow times), laboratory values (serum creatinine and lactate), pupillary findings [bilateral pupil diameters measured in 0.5-mm increments immediately after return of circulation (spontaneous or extracorporeal)], and imaging findings (head CT scans obtained within 3 hours of resuscitation). Neurological outcome at discharge or one month after cardiac arrest was assessed using the Cerebral Performance Category (CPC) scale.20 CPC 1–2 was classified as favorable and CPC 3–5 as unfavorable. Patients who died before discharge were recorded as CPC 5, and deaths occurring after day 30 did not change the assigned CPC. Details of feature preprocessing, missing-data handling, and specific CT conditions are provided in Supplementary Table S1.
Model development and evaluation
We first constructed three base models, generated metafeatures from their out-of-fold predictions, and then developed the final stacking ensemble model. We adopted a nested cross-validation design with five outer folds and four inner folds (Figure 1). Within each outer fold, hyperparameters were tuned in the inner loop using Optuna,21 a Bayesian optimization library that adaptively searches promising regions of the parameter space, by maximizing the area under the receiver operating characteristic curve (AUC) across the four inner folds. Model selection was performed among logistic regression, random forest, and LightGBM, and the best-performing algorithm was selected. The three base models were developed as follows:
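The nested cross-validation scheme above can be sketched in a few lines. This is a minimal, self-contained illustration on synthetic data: GridSearchCV stands in for the Optuna Bayesian search used in the study, and the classifier and parameter grid are assumptions for demonstration only.

```python
# Minimal sketch of nested cross-validation (5 outer folds, 4 inner folds).
# GridSearchCV stands in for the study's Optuna search; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop: tune hyperparameters by maximizing AUC across the 4 inner folds.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: each of the 5 outer test folds scores a model that was tuned
# only on the corresponding outer training subset (no leakage).
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
```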

A schematic diagram of model development and validation. The diagram illustrates the methodology for constructing and validating the ensemble model. The process involves repeated stratified 5-fold cross-validation, training of three independent models (pupil model, OHCA model, and CT model), and stacking of their predictions in the ensemble model. Metrics such as receiver operating characteristic curves, AUC, and confusion matrices were computed on the outer test data.
OHCA model (LightGBM)
Conventional OHCA scores are linear models that predict neurological outcome from patient-specific parameters such as initial cardiac arrest rhythm, no-flow time, low-flow time, serum creatinine level, and lactate level.11,12,13 The OHCA score is considered one of the most thoroughly validated prognostic tools after cardiac arrest. However, the OHCA score in its original form cannot handle missing data, particularly when the no-flow time is unknown (e.g., unwitnessed arrests).22 Therefore, rather than applying the original score coefficients, we entered its component variables (initial rhythm, no-flow time, low-flow time, serum creatinine, and lactate) into LightGBM, which can natively handle missing values by learning the optimal split direction at each node. LightGBM is a gradient boosting framework that efficiently combines many small decision trees and often achieves high accuracy.23 Optuna searched parameters such as the learning rate, number of leaves, and number of boosting iterations, while other settings were kept at the LightGBM defaults. Model settings are detailed in Supplementary Table S2.
Pupil Model (Logistic Regression)
Bilateral pupil diameters (measured in increments of 0.5 mm) were standardized and entered into a logistic regression classifier. Optuna was used to adjust parameters such as regularization strength and penalty type. The optimal setting was chosen based on the AUC values of the internal loop (Supplementary Table S3).
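A minimal sketch of this model structure, standardization followed by logistic regression, is shown below. The data, labels, and regularization setting are illustrative assumptions; in the study these hyperparameters were tuned with Optuna on the inner folds.

```python
# Sketch of the pupil model: bilateral pupil diameters standardized and
# fed to a logistic regression classifier. Data and C are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Columns: right and left pupil diameter in mm, rounded to 0.5-mm steps.
X = np.round(rng.uniform(1.0, 8.0, size=(200, 2)) * 2) / 2
y = (X.mean(axis=1) > 4.5).astype(int)  # synthetic outcome labels

pupil_model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
pupil_model.fit(X, y)
risk = pupil_model.predict_proba([[6.0, 6.5]])[0, 1]
```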
CT Image Model (ResNet50)
Based on previous research,24 single slices of head CT images aligned along the orbitomeatal baseline were used to reduce variation in scan conditions. Single-slice images were extracted at the levels of the foramen of Monro and the pineal gland; these levels are commonly used in gray-white matter ratio analysis to assess brain damage, including the basal ganglia.24,25,26 The CT imaging protocol is detailed in the Supplementary Methods. Each selected slice (255 × 255 grayscale) was resized to 224 × 224 and converted to RGB to match the input requirements of the pretrained ResNet50 architecture,27 a popular deep neural network that effectively trains very deep layers using skip connections. Next, data augmentation techniques (rotation, shift, zoom, etc.) were applied as detailed in Supplementary Table S4. Optuna was used to optimize hyperparameters such as the dropout rate and batch size. Owing to computational constraints, the convolutional neural network (CNN) was retrained at each inner fold during hyperparameter tuning. Once the optimal hyperparameters were determined, the model was retrained on the full outer training subset. Platt scaling (logistic regression) was then applied to the CNN output probabilities to calibrate the predictions without retraining the network. Details are listed in Supplementary Table S5.
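The final calibration step, Platt scaling of frozen CNN outputs, can be sketched independently of the network itself: a logistic regression is fitted on the (log-odds of the) raw probabilities, so the network is never retrained. The "CNN scores" below are simulated, deliberately overconfident probabilities.

```python
# Sketch of Platt scaling applied to frozen CNN output probabilities:
# a logistic regression on the logit of the raw scores recalibrates the
# predictions without touching the network. Scores here are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
y_val = rng.integers(0, 2, size=300)
# Simulated, systematically miscalibrated CNN probabilities.
cnn_scores = np.clip(y_val * 0.9 + rng.normal(0.05, 0.2, size=300), 0.001, 0.999)

logit = np.log(cnn_scores / (1 - cnn_scores)).reshape(-1, 1)
platt = LogisticRegression()
platt.fit(logit, y_val)
calibrated = platt.predict_proba(logit)[:, 1]
```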
Creating Metafeatures (Out-of-Fold Predictions)
After determining the optimal hyperparameters for each base model within each outer fold, the entire outer training subset was used to refit the model. We then generated out-of-fold predictions in the inner loop. Specifically, for each inner fold, a model trained on the inner-training data was applied to the inner-validation data to obtain predicted probabilities. This procedure ensured that every sample in the outer training subset received a predicted probability from each base model without data leakage. After assembling these out-of-fold probabilities (one column per base model), a quantile transformer was applied to address potential distributional skew among the metafeatures. The transformed metafeatures were then used for ensemble model training.
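The out-of-fold procedure above maps directly onto scikit-learn's cross_val_predict, sketched here with synthetic data and a single base model (the study used three); the fold count and transformer settings are illustrative.

```python
# Sketch of out-of-fold metafeature generation: cross_val_predict gives
# each sample a prediction from a model that never saw it (no leakage),
# and a QuantileTransformer then reduces distributional skew.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import QuantileTransformer

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# One metafeature column per base model (a single model shown here).
oof = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=4, method="predict_proba"
)[:, 1].reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
meta_X = qt.fit_transform(oof)  # input for the stacking metaclassifier
```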
Ensemble model (stacking) and calibration
We first collected these out-of-fold predicted probabilities (hereinafter referred to as metafeatures; the distribution for each model is shown in Supplementary Figure S1). Second, we applied a quantile transformer to scale these metafeatures before training the metaclassifier (random forest),28 with Optuna used for hyperparameter tuning. After constructing the final stacking model on the outer training subset, Platt scaling (via CalibratedClassifierCV with method="sigmoid") was applied to calibrate the predicted probabilities. Finally, the ensemble model was evaluated on the held-out outer fold. The entire procedure (hyperparameter tuning for each base model, out-of-fold metafeature creation, quantile-transformer scaling, stacking model training, and calibration) was repeated for all five outer folds. Performance metrics (AUC, Brier score, etc.) were then averaged over the outer folds to estimate overall predictive accuracy.
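The stacking-plus-calibration stage can be condensed as below. The metafeatures are random stand-ins for the three base-model probabilities, and the forest size and fold count are assumptions; only the structure (random-forest metaclassifier wrapped in sigmoid-calibrated CalibratedClassifierCV) mirrors the description above.

```python
# Condensed sketch of the stacking and calibration stage: a random-forest
# metaclassifier trained on metafeatures, wrapped in CalibratedClassifierCV
# with method="sigmoid" (Platt scaling). Metafeatures here are synthetic.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
meta_X = rng.random((200, 3))              # 3 columns: pupil, OHCA, CT models
y = (meta_X.mean(axis=1) > 0.5).astype(int)

stacker = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="sigmoid",
    cv=4,
)
stacker.fit(meta_X, y)
ensemble_proba = stacker.predict_proba(meta_X)[:, 1]
```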
Statistical analysis
Data distributions were evaluated using the Shapiro–Wilk test. Because most continuous variables were non-normally distributed, all continuous variables are presented as medians, and categorical variables as n (%). Group comparisons were performed using the Mann–Whitney U test or Fisher's exact test (p < 0.05 was considered statistically significant). We evaluated predictive performance primarily by AUC, supplemented by the F1 and Brier scores. Calibration was examined before and after Platt scaling, and different cutoff values (F1-max, F2-max, Youden index) were investigated to explore the trade-off between sensitivity and specificity. To reflect clinical priorities, we further evaluated Spec-Max, defined as the threshold that maximizes specificity while keeping the false negative rate (FNR) below 5%. This follows the ERC/ESICM recommendation that the upper 95% confidence interval (CI) for false prediction of poor outcome should not exceed 5%.
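The threshold-selection rules described above can be sketched with a simple scan over the predicted probabilities. Data are synthetic; the helper function and the 0.01-step grid mirror the description but are illustrative assumptions.

```python
# Illustrative threshold search: Youden's index (sens + spec - 1) and a
# "Spec-Max"-style rule that maximizes specificity subject to FNR < 5%.
import numpy as np

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=1000)
p = np.clip(y * 0.4 + rng.random(1000) * 0.6, 0, 1)  # synthetic scores

def rates(y_true, y_prob, thr):
    """Sensitivity and specificity at a given probability threshold."""
    pred = (y_prob >= thr).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

thresholds = np.arange(0.0, 1.0, 0.01)
stats = [(t, *rates(y, p, t)) for t in thresholds]
youden_thr = max(stats, key=lambda s: s[1] + s[2] - 1)[0]
# Spec-Max: highest specificity among thresholds with FNR (1 - sens) < 5%.
eligible = [s for s in stats if (1 - s[1]) < 0.05]
spec_max_thr = max(eligible, key=lambda s: s[2])[0]
```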
Performance metrics for every 0.01 increment of predicted probability are summarized in Supplementary Table S6, and all thresholds achieving an FNR <5% are listed in Supplementary Table S7. The 95% CI for each metric was estimated using the bootstrap method with 1000 iterations.
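The bootstrap CI computation can be sketched as below: the AUC is recomputed on 1000 resamples drawn with replacement, and the 2.5th/97.5th percentiles form the 95% CI. Labels and scores are synthetic stand-ins.

```python
# Sketch of a 1000-iteration bootstrap 95% CI for the AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=300)
p = np.clip(y * 0.3 + rng.random(300) * 0.7, 0, 1)  # synthetic scores

boot_aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), size=len(y))      # resample with replacement
    if len(np.unique(y[idx])) < 2:                  # AUC needs both classes
        continue
    boot_aucs.append(roc_auc_score(y[idx], p[idx]))

ci_low, ci_high = np.percentile(boot_aucs, [2.5, 97.5])
```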
All statistical analyses and machine learning procedures were performed in Python 3.10.12 (Python Software Foundation, Wilmington, DE, USA). The logistic regression model was implemented using scikit-learn (version 1.6.0), LightGBM (version 4.5.0) was used for gradient boosting, and hyperparameter tuning was performed with Optuna (version 4.1.0). Data augmentation was carried out in Keras (version 2.8.0) and TensorFlow (version 2.18.0), and the deep learning model (ResNet50) was implemented using the same frameworks.
