Monitoring for early prediction of gram-negative bacteremia using machine learning and hematological data in the emergency department



Study designs and cohorts

This study collected data from the EDs of three Taiwanese medical institutions. The model was derived using data from China Medical University Hospital (CMUH) and validated at CMUH, Wei-Gong Memorial Hospital (WMH), and An-Nan Hospital (ANH). CMUH, located in Taichung, is an urban, academic tertiary care facility with 1700 beds and 150,000–160,000 annual ED visits. WMH, located in Miaoli, is a regional teaching hospital with 872 beds and 55,000 annual ED visits. It established a strategic alliance with China Medical University (CMU) in July 2022. ANH, located in Tainan, is a public hospital with 925 beds and 50,000 annual ED visits. It is now operated by CMU under a contractual partnership. Although these hospitals collaborate with CMU, each institution maintains independent electronic medical record and laboratory information systems, and no centralized data repository was used in this study. This study was approved by the Institutional Ethics Committee of CMUH (Reference No. CMUH112-REC3-043), and the requirement for informed consent was waived because of the minimal risk involved in the study.

Emergency physicians typically order CBC, DC, and blood cultures for patients with suspected infections. The number of blood culture sets is based on clinical judgment and the institution’s protocols. In the current study, CBC and DC analyses were performed using the Beckman Coulter DxH 900 (Beckman Coulter, Miami, FL, USA), and blood cultures were performed in accordance with the Clinical and Laboratory Standards Institute (CLSI) guidelines. The BACTEC FX system (Becton Dickinson Microbiology Systems, Sparks, MD, USA) was used for blood sample detection and identification.

For the derivation cohort, we retrospectively enrolled adult patients (aged ≥20 years) who underwent CBC testing in the ED at CMUH between August 1, 2021, and December 31, 2022. Patients were excluded if they did not undergo blood culture testing, lacked DC data, or had missing CPD data for all four WBC subtypes due to poor sample quality, technical errors, or extremely low total white blood cell counts. We also excluded cases in which blood culture sampling occurred more than 12 h after CBC and DC sampling, as well as those in which Gram staining results did not show a consistent pathogen. The validation cohort included data from CMUH (January 1, 2023, to July 31, 2023), WMH (March 1, 2023, to August 31, 2023), and ANH (April 1, 2023, to August 31, 2023). For WMH, patients were directly included if CBC and DC data were available.

Data sources

Clinical data were extracted from the electronic health records of three hospitals and included demographic details such as age and sex. Comorbidity data, however, were available only from CMUH. Figure 1 illustrates the data sources used as inputs for the model, including CBC, DC, and CPD. All input features were treated as continuous numerical variables. Blood cultures collected within 12 h before or after the CBC sampling time were used for labeling. The CBC examination included various parameters: white blood cell count, hemoglobin level, hematocrit, red blood cell count, red cell distribution width, platelet count, platelet distribution width, neutrophil-to-lymphocyte ratio, platelet-to-lymphocyte ratio, mean corpuscular volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, and monocyte distribution width (MDW). DC analysis included the proportions of lymphocytes, monocytes, segmented neutrophils, eosinophils, basophils, band cells, blast cells, myelocytes, metamyelocytes, promyelocytes, and atypical lymphocytes. The sample collection time was recorded for each examination.
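
The 12 h labeling window described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function name `cultures_in_window`, the example timestamps, and the choice to treat a culture drawn exactly 12 h away as inside the window are all assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=12)  # window width stated in the text


def cultures_in_window(cbc_time, culture_times, window=WINDOW):
    """Return the blood culture times within +/- window of the CBC time.

    Assumption: the boundary is inclusive (exactly 12 h counts as inside).
    """
    return [t for t in culture_times if abs(t - cbc_time) <= window]


# Toy timestamps, invented for illustration only.
cbc = datetime(2022, 3, 1, 10, 0)
cultures = [
    datetime(2022, 3, 1, 9, 30),   # 30 min before the CBC: inside
    datetime(2022, 3, 1, 21, 45),  # 11 h 45 min after the CBC: inside
    datetime(2022, 3, 2, 0, 30),   # 14 h 30 min after the CBC: outside
]
matched = cultures_in_window(cbc, cultures)
```

Only the cultures returned in `matched` would contribute to the outcome label for that CBC sample.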

Fig. 1: Data sources used in the study.

There are three main sections: Complete Blood Count, Cell Population Data, and Differential Count. The Complete Blood Count includes erythrocytes (red blood cells), leukocytes (white blood cells), and platelets. The Cell Population Data measures cell conductivity, volume, and scatter. The Differential Count details the different types of leukocytes, including neutrophils, monocytes, basophils, eosinophils, and lymphocytes. BA, basophils; EO, eosinophils; LY, lymphocytes; NE, neutrophils; MO, monocytes.

During DC analysis, CPD were routinely obtained using the Beckman Coulter DxH 900 analyzer, which uses volume, conductivity, and scatter (VCS) technology for automated hematology analysis. However, when the count of a specific WBC subtype is too low, the analyzer may be unable to generate CPD values for that subtype. VCS technology evaluates cells on the basis of 14 parameters: the mean and standard deviation of each of seven measurements, namely volume, conductivity, and five light-scattering measurements (median-angle, upper median-angle, lower median-angle, and low-angle light scatter, and axial light loss). These parameters provide insights into cellular granularity, complexity, and transparency. CPD for leukocytes comprise these 14 parameters for each of 4 cell types (neutrophils, eosinophils, monocytes, and lymphocytes), giving a total of 56 parameters15. Although CPD remain primarily a research-oriented feature and are not routinely used by clinicians in healthcare settings, all CPD in this study were obtained directly from the hematology analyzer.
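
The count of 56 CPD parameters follows from 4 cell types × 7 measurements × 2 summary statistics. The short sketch below enumerates them. The feature names are hypothetical labels chosen for illustration and are not the analyzer's or the authors' actual column names.

```python
from itertools import product

# Cell-type abbreviations follow the Fig. 1 legend.
CELL_TYPES = ["NE", "LY", "MO", "EO"]

# Seven measurements per cell: volume, conductivity, five scatter channels.
# These identifiers are hypothetical names, not analyzer output fields.
MEASUREMENTS = [
    "volume",
    "conductivity",
    "median_angle_scatter",
    "upper_median_angle_scatter",
    "lower_median_angle_scatter",
    "low_angle_scatter",
    "axial_light_loss",
]

# Each measurement is summarized by its mean and standard deviation.
STATISTICS = ["mean", "sd"]

cpd_features = [
    f"{cell}_{measurement}_{stat}"
    for cell, measurement, stat in product(CELL_TYPES, MEASUREMENTS, STATISTICS)
]
```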

Preprocessing

We categorized samples as cases of Gram-positive bacteremia if blood culture results indicated Gram-positive bacteria, Gram-negative bacteremia if Gram-negative bacteria were identified, and no bacteremia if cultures exhibited signs of contamination or no microbial growth. Based on the CLSI guideline, organisms considered potential contaminants when identified from only one of a series of blood cultures included: coagulase-negative Staphylococcus spp., Cutibacterium (Propionibacterium) acnes, Micrococcus spp., viridans group streptococci, Corynebacterium spp., Aerococcus spp., and Bacillus spp19.
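
The labeling rule above can be written as a small function. This is a sketch under stated assumptions, not the authors' implementation: the function name, the label strings, and the inputs (an organism name, its Gram reaction, and the number of culture sets that grew it) are hypothetical, and "only one of a series" is read here as "at most one positive set".

```python
# Organisms the text lists as potential contaminants (per the CLSI guideline).
POTENTIAL_CONTAMINANTS = {
    "coagulase-negative Staphylococcus spp.",
    "Cutibacterium acnes",
    "Micrococcus spp.",
    "viridans group streptococci",
    "Corynebacterium spp.",
    "Aerococcus spp.",
    "Bacillus spp.",
}


def label_sample(organism, gram, positive_sets):
    """Map one culture result to an outcome class.

    organism: organism name, or None when there was no growth.
    gram: "positive" or "negative" (ignored when organism is None).
    positive_sets: number of culture sets that grew the organism.
    """
    if organism is None:
        return "no_bacteremia"
    # Assumption: a listed organism in at most one set counts as contamination.
    if organism in POTENTIAL_CONTAMINANTS and positive_sets <= 1:
        return "no_bacteremia"
    return "gram_negative" if gram == "negative" else "gram_positive"
```

For example, `label_sample("Micrococcus spp.", "positive", 1)` is treated as contamination, whereas the same organism recovered from two sets is labeled Gram-positive bacteremia.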

Training pipeline

We developed three-class classifiers using the CatBoost model (Fig. 2), with Gram-negative bacteremia, Gram-positive bacteremia, and non-bacteremia as the outcome classes. CatBoost incorporates overfitting-prevention techniques that are crucial for high-dimensional, limited-sample datasets, and it often outperforms other gradient boosting methods20. Its ordered boosting method reduces prediction variance, enhancing generalization to unseen data. Previous studies have shown that CatBoost outperformed other models in clinical research applications17,21. The CMUH development cohort was split 80:20 into a training set and a test set. Missing values were imputed using the mean value from the training set. All input features were then standardized to zero mean and unit variance before model training. To address potential class imbalance among the three outcome categories, we initially explored adjusting class weights during model training. However, applying differential class weights led to a decline in AUROC, AUPRC, and F1 score (Supplementary Table 2); thus, we adopted equal class weights in the final models. Feature selection used a wrapper-based backward elimination approach, evaluated solely on the development set. In each iteration, one feature was removed based on its impact on the model’s macro-AUROC, and the process continued until the optimal subset was achieved. This approach reduces model complexity and improves interpretability while avoiding information leakage from the validation set. Hyperparameter tuning was performed using random search: each parameter was assigned a range, and multiple trials were run within these ranges to identify the optimal parameter set (Supplementary Table 3)22. Finally, external validation was performed by applying the fully trained and tuned model from the CMUH development cohort to datasets from CMUH, WMH, and ANH collected in 2023.
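
The backward elimination step can be illustrated with a generic greedy loop. This is a sketch, not the study's code: `backward_elimination` and `toy_score` are hypothetical names, the scoring function stands in for "train CatBoost on this subset and return macro-AUROC on the development data", and the stopping rule (stop when every single-feature removal lowers the score) is an assumption, since the text only says the process ran until the optimal subset was reached.

```python
def backward_elimination(features, score_fn):
    """Greedy wrapper-based backward elimination.

    score_fn(subset) returns a score where higher is better.
    Returns the selected subset and its score.
    """
    current = list(features)
    best_score = score_fn(current)
    while len(current) > 1:
        # Score each subset obtained by dropping exactly one feature.
        candidates = [
            (score_fn([f for f in current if f != drop]), drop)
            for drop in current
        ]
        cand_score, drop = max(candidates, key=lambda c: c[0])
        if cand_score < best_score:
            break  # assumed stopping rule: every removal hurts the score
        best_score = cand_score
        current.remove(drop)
    return current, best_score


# Toy scoring function, invented for illustration: two informative
# features add to the score, and every extra feature costs a little.
INFORMATIVE = {"wbc", "mdw"}


def toy_score(subset):
    return 0.5 + 0.2 * len(INFORMATIVE & set(subset)) - 0.01 * len(subset)


selected, score = backward_elimination(
    ["wbc", "mdw", "noise_a", "noise_b"], toy_score
)
```

In this toy run the two noise features are removed one at a time and the loop stops once only the informative features remain.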

Fig. 2: Overall training process.

This flow chart illustrates the methodology for establishing a predictive model. It involves data splitting, feature selection, hyperparameter tuning, and validation phases. CatBoost, Categorical boosting; CMUH, China Medical University Hospital; WMH, Wei-Gong Memorial Hospital; ANH, An-Nan Hospital.

Classification thresholds were selected using the Youden index to balance sensitivity and specificity while ensuring consistency in evaluation. For the final model, we aimed to enhance the detection of Gram-negative bacteremia due to its higher clinical urgency. Therefore, after initially applying the Youden index to determine thresholds for all three classes, we further refined the threshold for the Gram-negative class to enhance sensitivity. Specifically, the threshold was adjusted in the development cohort to ensure a sensitivity of at least 80%, while maintaining reasonable specificity.
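
The two-stage threshold selection can be sketched for a single class scored one-vs-rest. The helper names and the toy scores below are assumptions for illustration. The second function picks the highest threshold that still meets the sensitivity floor, which is one plausible reading of "maintaining reasonable specificity" and not a procedure confirmed by the text.

```python
def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive if score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(y_true, scores):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    return max(
        sorted(set(scores)),
        key=lambda t: sum(sens_spec(y_true, scores, t)) - 1,
    )


def threshold_for_sensitivity(y_true, scores, min_sensitivity=0.80):
    """Highest observed threshold whose sensitivity meets the floor."""
    eligible = [
        t for t in set(scores)
        if sens_spec(y_true, scores, t)[0] >= min_sensitivity
    ]
    return max(eligible)


# Toy data, invented for illustration: 1 = Gram-negative bacteremia.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.25, 0.40, 0.50, 0.20, 0.35, 0.70, 0.80, 0.90]

t_youden = youden_threshold(y_true, scores)
t_adjusted = threshold_for_sensitivity(y_true, scores)
```

On these toy scores the Youden threshold gives a sensitivity below 80%, so the adjusted threshold is lower, trading specificity for sensitivity.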

To provide a performance benchmark, we included the Systemic Inflammatory Response Syndrome (SIRS) score as a rule-based baseline comparator. SIRS assigns one point for each of the following criteria: body temperature >38 °C or <36 °C, heart rate >90 beats/min, respiratory rate >20 breaths/min, and white blood cell count >12,000 or <4000 cells/μL. The total score ranges from 0 to 4. Using the same outcome labels, we evaluated its predictive performance alongside the CatBoost models.
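
The SIRS criteria listed above translate directly into a scoring function. The function and argument names are hypothetical; the cut-offs are those given in the text.

```python
def sirs_score(temp_c, heart_rate, resp_rate, wbc_per_ul):
    """Return the SIRS score (0 to 4), one point per criterion met."""
    criteria = [
        temp_c > 38 or temp_c < 36,               # body temperature, deg C
        heart_rate > 90,                          # beats/min
        resp_rate > 20,                           # breaths/min
        wbc_per_ul > 12000 or wbc_per_ul < 4000,  # cells/uL
    ]
    return sum(criteria)
```

For instance, `sirs_score(38.5, 110, 24, 15000)` meets all four criteria, while a patient with normal vital signs and a WBC count of 7000 cells/μL scores 0.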

Sensitivity test

To evaluate the individual contribution of different feature groups and assess model robustness, we conducted a feature-level sensitivity analysis. Specifically, we retrained the model using three distinct input sets: (1) CBC/DC variables only, (2) CPD variables only, and (3) a combination of CBC/DC and CPD. Model performance was then compared to determine the incremental value of CPD features in predicting bacteremia.

To assess the impact of different missing value imputation strategies on model performance, we conducted a sensitivity analysis using four commonly adopted methods: mean, median, zero-fill, and k-nearest neighbors (KNN). The CatBoost model was trained separately on each imputed dataset using default hyperparameters and a feature set comprising CBC/DC and CPD variables. Performance was evaluated using AUROC and AUPRC metrics.
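
Three of the four imputation strategies can be sketched with the standard library. The essential point is that fill values are computed on the training data and then reused unchanged on other data. The function names, the column layout, and the toy values are assumptions; KNN imputation is omitted because it needs a neighbor search that is beyond a short sketch.

```python
from statistics import mean, median


def fit_fill_values(train_columns, strategy="mean"):
    """Compute one fill value per feature from the training data only."""
    fills = {}
    for name, values in train_columns.items():
        observed = [v for v in values if v is not None]
        if strategy == "mean":
            fills[name] = mean(observed)
        elif strategy == "median":
            fills[name] = median(observed)
        elif strategy == "zero":
            fills[name] = 0.0
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return fills


def apply_fill(columns, fills):
    """Replace missing entries (None) with the fitted fill values."""
    return {
        name: [fills[name] if v is None else v for v in values]
        for name, values in columns.items()
    }


# Toy values, invented for illustration.
train = {"mdw": [18.0, None, 20.0, 28.0]}
test = {"mdw": [None, 20.0]}

filled = {
    strategy: apply_fill(test, fit_fill_values(train, strategy))
    for strategy in ("mean", "median", "zero")
}
```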

We further performed an analysis using a one-vs-rest strategy to assess class-specific model performance. The original three-class classification problem was reformulated into three individual binary classification tasks. For each binary task, a separate CatBoost classifier without hyperparameter tuning was trained and evaluated on the development set using AUROC, AUPRC, F1 score, and thresholds optimized by the Youden index. This approach enabled us to examine whether individual classes could be more accurately distinguished when modeled independently, without the influence of other classes.
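
Reformulating the three-class problem into three binary tasks amounts to building one binary target vector per class. A minimal sketch, with hypothetical label strings and function name:

```python
CLASSES = ["no_bacteremia", "gram_positive", "gram_negative"]


def one_vs_rest_targets(labels, classes=CLASSES):
    """Return one binary target list per class (1 = that class, 0 = rest)."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Toy labels, invented for illustration.
labels = ["no_bacteremia", "gram_negative", "gram_positive", "no_bacteremia"]
targets = one_vs_rest_targets(labels)
```

Each entry of `targets` would then serve as the outcome for its own separately trained binary classifier.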

To assess the impact of blood culture sampling practices on model performance, we conducted a sensitivity analysis across two cohorts: (1) the entire population and (2) a subgroup restricted to patients who had two or more blood culture sets collected. This analysis was motivated by the fact that using only one set of blood cultures is associated with a higher risk of false-negative results, which may confound model training and evaluation.

Lastly, to evaluate the potential bias introduced by including multiple CBCs from the same ED visit, we performed a sensitivity analysis that retained only the first CBC sample per ED visit, excluding any subsequent tests from the same encounter. This aimed to ensure that model predictions were not disproportionately influenced by repeated measurements from individual patients.
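
Keeping only the first CBC per ED visit can be sketched as below. The record layout (`visit_id`, `collected_at`) and the function name are assumptions for illustration, as are the toy values.

```python
from datetime import datetime


def first_cbc_per_visit(samples):
    """Keep the earliest sample of each visit, ordered by collection time."""
    first = {}
    for sample in samples:
        kept = first.get(sample["visit_id"])
        if kept is None or sample["collected_at"] < kept["collected_at"]:
            first[sample["visit_id"]] = sample
    return sorted(first.values(), key=lambda s: s["collected_at"])


# Toy records, invented for illustration.
samples = [
    {"visit_id": "V1", "collected_at": datetime(2022, 5, 1, 13, 0), "wbc": 11.0},
    {"visit_id": "V1", "collected_at": datetime(2022, 5, 1, 8, 0), "wbc": 14.2},
    {"visit_id": "V2", "collected_at": datetime(2022, 5, 2, 9, 30), "wbc": 6.5},
]
first_samples = first_cbc_per_visit(samples)
```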

Statistics and reproducibility

We analyzed the demographic and laboratory data from patients in the three hospitals and calculated median values and standard deviations. A detailed analysis of the CMUH development set was performed to compare data distributions among three groups: nonbacteremia, Gram-positive bacteremia, and Gram-negative bacteremia. The Kruskal–Wallis test was used to identify significant differences among the groups on the basis of ranked data23. For significant results (p value < 0.05), Dunn’s post-hoc test was conducted to identify specific group differences24. The distributions of key features were illustrated using box plots across different outcomes and hospitals. A box plot visually represents the spread and skewness of data. The box shows the interquartile range (IQR), where the middle 50% of values lie, with the line inside representing the median. The “whiskers” extend to the smallest and largest values within 1.5 times the IQR from the quartiles.
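
For reference, the Kruskal–Wallis statistic for the three outcome groups takes the standard form below (shown without the correction for tied ranks). The symbols are introduced here for illustration: N is the total number of samples, n_i is the size of group i, and R_i is the sum of the ranks in group i after ranking all samples together.

```latex
H = \frac{12}{N(N+1)} \sum_{i=1}^{3} \frac{R_i^{2}}{n_i} - 3(N+1)
```

Under the null hypothesis that the groups share the same distribution, H approximately follows a chi-square distribution with 3 − 1 = 2 degrees of freedom.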

We evaluated model performance using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), precision (positive predictive value), recall (sensitivity), F1 score, negative predictive value, and specificity. For model interpretation, we used the SHapley Additive exPlanations (SHAP) method, implemented in the SHAP Python package (version 0.41.0)25. The SHAP method provides local explanations to enhance the global understanding of a model through the interpretation of individual feature contributions. This level of explainability is crucial for building trust in ML models, particularly in medical contexts, where transparency is vital for understanding the rationale behind predictions26. In addition, models must be interpretable to meet regulatory requirements because transparency is often mandated in health care to ensure traceability and accountability in decision-making27.
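
The threshold-dependent metrics listed above all derive from the four cells of a binary confusion matrix. A minimal sketch follows; the function name is hypothetical and the counts are toy numbers invented for illustration, not results from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Threshold-dependent metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # positive predictive value
    recall = tp / (tp + fn)     # sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),          # negative predictive value
        "specificity": tn / (tn + fp),
    }


# Toy counts, not study results.
metrics = binary_metrics(tp=80, fp=120, tn=780, fn=20)
```

With a rare positive class, as in this toy example, recall can be high while precision stays modest, which is one reason to report AUPRC alongside AUROC.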

To further assess the model’s performance, we generated calibration plots and Brier scores. Calibration plots were used to evaluate the alignment between predicted probabilities and actual outcomes, offering a visual representation of the model’s calibration28. The Brier score served as a quantitative measure to assess the accuracy of the probabilistic predictions, with lower scores reflecting better overall model performance by accounting for both discrimination and calibration29.
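
For a single class treated as a binary outcome, the Brier score is the mean squared difference between the predicted probability and the observed outcome. The sketch below assumes this binary, one-vs-rest form; the text does not state whether a binary or multiclass formulation was used, and the values are toy numbers.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Toy values, invented for illustration.
y_true = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]
score = brier_score(y_true, probs)
```

A score of 0 corresponds to perfect probabilistic predictions, and always predicting 0.5 yields 0.25.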

The study included 28,503 samples in the CMUH development cohort, 15,801 in the CMUH validation cohort, 2632 in the WMH cohort, and 3811 in the ANH cohort. Each final laboratory report of CBC, DC, and CPD from a single specimen was considered one independent replicate. To minimize the risk of data leakage, samples from the same patient were assigned exclusively to either the training or the test set, regardless of whether they were collected during a single ED visit or across multiple visits. This approach mitigates bias from repeated measurements. Model performance was further validated across three independent hospital cohorts to demonstrate reproducibility.
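
Assigning all samples of a patient to one side of the split can be sketched as follows. The function name, record layout, and seed are assumptions. Applying the 80:20 ratio to patients rather than to samples is also an assumption, since the text does not say at which level the ratio was applied.

```python
import random


def split_by_patient(samples, test_fraction=0.2, seed=0):
    """Split samples so that each patient appears in only one set."""
    patients = sorted({s["patient_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Assumption: the fraction is applied to patients, not to samples.
    n_test = max(1, round(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [s for s in samples if s["patient_id"] not in test_ids]
    test = [s for s in samples if s["patient_id"] in test_ids]
    return train, test


# Toy records: 25 samples from 10 patients, invented for illustration.
samples = [{"patient_id": f"P{i % 10}", "sample_id": i} for i in range(25)]
train_set, test_set = split_by_patient(samples)
```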

The implementation was carried out in Python 3.9.18, using well-established machine learning frameworks, such as scikit-learn and CatBoost, for all modeling and evaluation tasks. To enhance reproducibility and openness, we have shared a curated portion of the code on GitHub, covering the full analytical pipeline, from data imputation and scaling to model training, tuning, and statistical comparison. A requirements.txt file specifying the software environment and package versions is also included in the repository: https://github.com/YuHsin-Chang/Gram-Stain-Bacteremia-Prediction.


