Development and evaluation of machine learning training strategies for neonatal mortality prediction using multi-country data

Descriptive data analysis

After data preprocessing, a total of 575,664 pregnancies were included in this study. By place of birth, 31.3% of births occurred in clinics or health centers, 24.2% at home or other locations, and 44.4% in hospitals. By type of delivery, 14.4% of neonates were delivered by caesarean section, 84.7% by vaginal delivery, and 0.9% by assisted vaginal delivery. As with delivery location, there were few missing values for delivery type.

The dataset includes births from different countries, each contributing a different proportion of neonatal patients. Specifically, Argentina accounted for 1.7% of the total number of births, Bangladesh 0.2%, Democratic Republic of the Congo 6.3%, Guatemala 15.0%, India's Belagavi province 21.4%, India's Nagpur province 14.6%, Kenya 13.5%, Pakistan 15.8%, and Zambia 11.6%. In the training set, India-Belagavi had the largest sample size (n = 107,076) and was therefore used to train the reference model whose performance was evaluated in the other countries.

Regarding outcomes, 2.5% of newborns experienced neonatal death. After splitting the dataset for training and testing, the proportions remained similar, at 2.6% and 2.2%, respectively. A comprehensive analysis of neonatal patient characteristics categorized by place of delivery, type of delivery, country of delivery, and outcome is shown in Table 1.

Table 1. Summary of the complete dataset, training dataset, and testing dataset.

Variance inflation factor (VIF) analysis was performed on the training set to assess multicollinearity between predictor variables. The VIF values for maternal age (VIF = 1.012), gestational age (VIF = 1.008), and birth weight (VIF = 1.016) were all close to 1, suggesting that multicollinearity was negligible. Furthermore, the correlation plot in Figure A.1 (Supplementary Appendix) confirms the low correlation between these three variables, supporting the view that they carry largely non-redundant information.
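To make the VIF figures concrete, the following is a minimal sketch of how such values can be computed. It assumes NumPy is available and uses the identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix, which is equivalent to 1 / (1 - R_j^2). The data are synthetic stand-ins generated for illustration only; the means, spreads, and the weak link between gestational age and birth weight are assumptions, not values from the study.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = observations).

    Uses VIF_j = [R^{-1}]_jj, where R is the predictor correlation matrix.
    This equals 1 / (1 - R_j^2), with R_j^2 taken from regressing column j
    on the remaining columns (intercept included).
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic, illustrative predictors (NOT the study data).
rng = np.random.default_rng(0)
n = 5000
maternal_age = rng.normal(26, 5, n)        # years (assumed distribution)
gestational_age = rng.normal(39, 2, n)     # weeks (assumed distribution)
# Assumed weak dependence of birth weight (grams) on gestational age.
birth_weight = rng.normal(3000, 450, n) + 20 * (gestational_age - 39)

X = np.column_stack([maternal_age, gestational_age, birth_weight])
vifs = vif(X)
for name, value in zip(["maternal_age", "gestational_age", "birth_weight"], vifs):
    print(f"{name}: VIF = {value:.3f}")
```

Values near 1 indicate that a predictor is almost uncorrelated with the others; a common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.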

Algorithm performance

The performance of different ML algorithms was evaluated within each country under three modeling approaches: general, country-specific, and maximum train size. Models were trained using LightGBM (LGBM), XGBoost (XGB), AdaBoost, CatBoost, and random forest, and hyperparameters were optimized by random search.

In the case of Kenya (Table 2), the LGBM Tuned algorithm achieved an AUC-ROC of 0.808 [0.777, 0.839] under the general model approach. By comparison, the country-specific approach, using the XGB Tuned algorithm, resulted in a slightly lower AUC-ROC of 0.805 [0.774, 0.836]. These results suggest that the general model performs at least as well as the country-specific model in Kenya, although the confidence intervals overlap almost entirely. Similar trends were observed in other countries. In the Democratic Republic of the Congo (DRC), the general model (LGBM Tuned) achieved an AUC-ROC of 0.797 [0.770, 0.825], compared with 0.793 [0.765, 0.819] for the country-specific approach (XGB Tuned). The general model approach tended to outperform the country-specific approach, although the differences were small.
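The bracketed values above are interval estimates around each AUC-ROC. This section does not state how they were obtained, so the sketch below shows one common option, a percentile bootstrap, purely as an assumption. It uses only the Python standard library; the data are synthetic, and the function names `auc_roc` and `bootstrap_auc_ci` are hypothetical helpers introduced for illustration.

```python
import random

def auc_roc(y_true, y_score):
    """AUC-ROC via the Mann-Whitney U statistic, with average ranks for ties."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(y_true, y_score, n_boot=300, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC-ROC (an assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes
            stats.append(auc_roc(ys, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Synthetic rare-outcome data (NOT the study data): about 5% positives,
# with scores shifted upward for positive cases.
rng = random.Random(42)
y = [1 if rng.random() < 0.05 else 0 for _ in range(1500)]
scores = [rng.gauss(1.0 if label else 0.0, 1.0) for label in y]

auc = auc_roc(y, scores)
ci_low, ci_high = bootstrap_auc_ci(y, scores)
print(f"AUC-ROC = {auc:.3f} [{ci_low:.3f}, {ci_high:.3f}]")
```

With a rare outcome such as neonatal death, the interval width is driven mainly by the number of positive cases, which is one reason intervals for smaller sites tend to be wider.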

Table 2. Test results of predictive models for neonatal mortality risk.

For Guatemala, the LGBM Tuned algorithm achieved an AUC-ROC of 0.795 [0.772, 0.819] for the general model and 0.796 [0.774, 0.820] for the country-specific model. For Zambia, the general model approach using the LGBM Tuned algorithm achieved an AUC-ROC of 0.785 [0.745, 0.826]. The country-specific approach using the XGB Tuned algorithm, on the other hand, showed higher performance, with an AUC-ROC of 0.801 [0.761, 0.839]. Nevertheless, we observed that the general model had higher recall.

For India-Belagavi, the general and country-specific models, using the LGBM Tuned and LGBM algorithms, yielded comparable AUC-ROCs of 0.784 [0.751, 0.814] and 0.781 [0.750, 0.811], respectively. These findings suggest that the choice of algorithm did not substantially affect prediction performance at this site. Overall, the results show that the choice of algorithm and model approach can influence predictive performance within each country (Figure 2). Although the general model approach generally yielded good results, there may be cases where a country-specific approach or a different algorithm would be more effective. The specific context and data characteristics should therefore be considered carefully when choosing the algorithm and model approach for each country.

Regarding the calibration of the general approach (Figure 3), the blue line representing the LGBM model closely follows the diagonal in the lower probability range (0 to 0.4), indicating good calibration. In the intermediate range (0.4 to 0.7), the model deviates slightly from the diagonal, suggesting that the predictions are somewhat overconfident. In the higher probability range (0.7 to 1.0), the model gradually realigns with the diagonal, indicating improved calibration at higher confidence levels.
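A calibration curve of the kind described for Figure 3 compares, within bins of predicted probability, the mean predicted risk against the observed event rate. The sketch below is a minimal standard-library illustration of that construction with equal-width bins; the number of bins and the synthetic, perfectly calibrated data are assumptions, and `calibration_curve` here is a hypothetical helper, not the routine used in the study.

```python
import random

def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed rate, count) per non-empty bin.

    Bins are equal-width over [0, 1]; a probability of exactly 1.0 falls
    into the last bin.
    """
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, sum of y, count]
    for y, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    return [(s / c, e / c, c) for s, e, c in bins if c > 0]

# Synthetic, perfectly calibrated predictions (NOT the study data):
# each outcome is drawn with exactly its predicted probability.
rng = random.Random(0)
probs = [rng.random() for _ in range(20000)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

curve = calibration_curve(outcomes, probs)
for mean_pred, observed, count in curve:
    # Points on the diagonal have observed == mean_pred.
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n = {count}")
```

When reading such a curve, bins where the observed rate falls below the mean predicted probability indicate overestimated risk, and bins above the diagonal indicate underestimated risk. With rare outcomes, the upper bins usually contain few cases, so deviations there are estimated with more noise.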

Figure 4 shows the most important predictors, ranked by Shapley value, for the general model, in which the LGBM Tuned algorithm performed best. In this model, birth weight and gestational age contributed most to the predicted risk of neonatal death.

Figures A.2 to A.8 (Supplementary Appendix) show that birth weight was consistently the most important predictor by Shapley value. With the exception of India-Belagavi, the second most relevant predictor was gestational age. The results of the country-specific approach thus support those of the general model.
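For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy three-predictor risk function by averaging each predictor's marginal contribution over all subsets of the other predictors. Everything in it is an assumption made for illustration: the logistic form, the coefficients, the baseline, and the example newborn are invented and are not estimates from the study, which would typically rely on a dedicated library rather than brute-force enumeration.

```python
from itertools import combinations
from math import exp, factorial

def toy_risk(z):
    """Hypothetical risk model with made-up coefficients (illustration only)."""
    birth_weight, gestational_age, maternal_age = z
    logit = (-3.5
             - 0.0015 * (birth_weight - 3000)   # grams
             - 0.25 * (gestational_age - 39)    # weeks
             + 0.01 * (maternal_age - 26))      # years
    return 1.0 / (1.0 + exp(-logit))

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input.

    Features outside the coalition are set to their baseline value.
    Cost grows as 2^n, so this is only practical for a few features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi

baseline = [3000, 39, 26]   # assumed reference newborn
x = [1800, 33, 30]          # assumed low-birth-weight, preterm newborn

phi = shapley_values(toy_risk, x, baseline)
for name, value in zip(["birth_weight", "gestational_age", "maternal_age"], phi):
    print(f"{name}: {value:+.4f}")
```

The attributions sum to the difference between the predicted risk for the example newborn and the baseline risk, which is the property that makes Shapley values convenient for ranking predictors.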

A complete report on the performance of all models tested across different algorithms is available in the Supplementary Tables (A.2-A.8).

Model variations

We considered the general model as a baseline and performed further analyses to evaluate model performance across different outcome variations. Figure 5 shows the results when the outcome is changed to mortality within 7 and 42 days postpartum. The models show considerable variation in performance across national contexts, which may be due to the characteristics of the predictors in each country. In particular, the “general” model, in which all countries were trained together, showed consistently robust performance across the three temporal variations of the outcome, suggesting that it could serve as a baseline model for neonatal mortality. However, it does not always outperform the other models, indicating that local factors influence model effectiveness.

This result suggests that “country-specific” models may outperform “general” models in certain sites, such as India-Nagpur and Bangladesh. This superior performance may be related to the ability of these models to accommodate localized medical data and demographic nuances that are less pronounced in global datasets. These findings support customizing predictive models to increase their accuracy and relevance in specific regional contexts.

The stability of model performance across different time periods within the same country and model type is noteworthy. This consistency matters for the practical application of these models in medical settings, because it supports the reliability of their predictions over time. In general, the models showed higher AUC-ROC values for deaths within 7 days postpartum, and overall performance decreased as the period lengthened. The model predicting death within 42 days postpartum outperformed the others only for the DRC and Bangladesh under the maximum train size strategy. The largest differences in model performance were observed in Guatemala. Supplementary Table A.1 shows that Guatemala had the lowest concentration of deaths in the 0–7 day period, with approximately 59.3% of deaths occurring in this period. In the DRC, approximately 89% of deaths occurred within 7 days postpartum, which made model performance more balanced across the outcome periods.

In addition to comparing different time periods, we also evaluated the impact of including additional variables beyond the five recommended by the WHO. At this stage, variables such as ultrasound modality, maternal education, prenatal visits, tetanus vaccination during pregnancy, infant gender, and gestational trimester at the first prenatal visit were incorporated. Overall, AUC-ROC values remained similar to those obtained with the original predictors.

As shown in Table 3, the largest improvement was observed in India-Nagpur, where the AUC-ROC increased from 0.812 to 0.823. In the DRC, however, the AUC-ROC decreased.

Table 3. Testing results of the predictive model for neonatal mortality risk with the addition of new predictors.
