Calibrating and assessing bias in a machine learning malnutrition prediction model deployed within a large health system

The main training cohort used to refine the models included 49,652 patients (median [IQR] age = 66.0 [26.0]), of whom 49.9% were women, 29.6% were black or African American, 54.8% had Medicare, and 27.8% had Medicaid. A total of 11,664 (24%) malnutrition cases were identified. Baseline characteristics are summarized in Table 1, and malnutrition event rates are summarized in Supplementary Table 2. The validation cohort used to test the models included 17,278 patients (median [IQR] age = 66.0 [27.0]). Of the 1,000 cases (23%), 49.8% were women, 27.1% were black or African American, 52.9% were Medicare recipients, and 28.2% were Medicaid recipients.

Table 1. Overview of baseline characteristics

Calibration and Discrimination

The overall model c-index was 0.81 (95% CI: 0.80, 0.81), but miscalibration was found for both weak and moderate calibration criteria, with a Brier score of 0.26 (95% CI: 0.25, 0.26) (Table 2), indicating that the model was relatively inaccurate.¹⁷ The model was also overfitted to the risk estimate distribution, as evidenced by the calibration curve (Supplementary Figure 1). Logistic recalibration of the model improved the calibration, with a calibration intercept of −0.07 (95% CI: −0.11, −0.03) and a calibration slope of 0.88 (95% CI: 0.86, 0.91), and significant decreases in Brier score (0.21, 95% CI: 0.20, 0.22), Emax (0.03, 95% CI: 0.01, 0.05), and Eavg (0.01, 95% CI: 0.01, 0.02). Recalibrating the model improved specificity (from 0.74 to 0.93), PPV (from 0.47 to 0.60), and accuracy (from 0.74 to 0.80), but decreased sensitivity (from 0.75 to 0.35) and NPV (from 0.91 to 0.83) (Supplementary Tables 2 and 3).
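To make the quantities in this paragraph concrete, the sketch below shows one way a Brier score and a logistic recalibration could be computed. It is a minimal pure-Python illustration, not the study's code: the function names (`brier_score`, `fit_logistic_recalibration`, `recalibrate`), the Newton-Raphson fitting routine, and the toy numbers are all assumptions introduced here. The fitted slope `b` plays the role of a calibration slope; a calibration intercept is often estimated separately with the slope fixed at 1.

```python
import math

def logit(p, eps=1e-12):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Numerically stable inverse of logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def brier_score(probs, outcomes):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def fit_logistic_recalibration(probs, outcomes, max_iter=50, tol=1e-10):
    """Fit outcome ~ a + b * logit(p) by Newton-Raphson and return (a, b)."""
    x = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # start from "no recalibration needed"
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, outcomes):
            mu = sigmoid(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu          # gradient with respect to a
            g1 += (yi - mu) * xi   # gradient with respect to b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:       # degenerate data, e.g. one distinct risk value
            break
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

def recalibrate(probs, a, b):
    """Map original risks to recalibrated risks using the fitted (a, b)."""
    return [sigmoid(a + b * logit(p)) for p in probs]

# Made-up toy data: predicted risks of 0.2, 0.5, 0.8 whose observed
# event rates (0.10, 0.27, 0.60) are lower, i.e. risk is overestimated.
probs = [0.2] * 100 + [0.5] * 100 + [0.8] * 100
outcomes = ([1] * 10 + [0] * 90) + ([1] * 27 + [0] * 73) + ([1] * 60 + [0] * 40)
a, b = fit_logistic_recalibration(probs, outcomes)
before = brier_score(probs, outcomes)
after = brier_score(recalibrate(probs, a, b), outcomes)
print(f"a={a:.2f} b={b:.2f} Brier before={before:.3f} after={after:.3f}")
```

In practice the coefficients would be fitted on one dataset and the recalibrated risks evaluated on a held-out set, as the text describes; the toy example fits and evaluates on the same data only to keep the sketch short.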

Table 2. Overall calibration statistics for the MUST-Plus model

Weak and moderate calibration indices between black and white patients were significantly different before recalibration (Table 3, Supplementary Fig. 2A, B). In the model, white patients had a more negative calibration intercept on average than black patients (−1.17 vs. −1.07), and black patients had a higher calibration slope than white patients (1.43 vs. 1.29). Black patients had a higher Brier score of 0.30 (95% CI: 0.29, 0.31) compared with 0.24 (95% CI: 0.23, 0.24) for white patients. Logistic recalibration significantly improved calibration in both black and white patients (Table 4, Fig. 1a–c). For black patients in the holdout set, the recalibrated calibration intercept was 0 (95% CI: −0.07, 0.05), the calibration slope was 0.91 (95% CI: 0.87, 0.95), and the Brier score improved from 0.30 to 0.23 (95% CI: 0.21, 0.25). For white patients in the holdout set, the recalibrated calibration intercept was −0.15 (95% CI: −0.20, −0.10), the calibration slope was 0.82 (95% CI: 0.78, 0.85), and the Brier score improved from 0.24 to 0.19 (95% CI: 0.18, 0.21). After recalibration, calibration still differed significantly between black and white patients according to weak calibration metrics, but the difference was less significant according to moderate calibration metrics and strong calibration curves (Table 4, Figure 1). The calibration curves of the recalibrated models showed good agreement between the actual and predicted event probabilities, but the predicted risks for black and white patients differed between the 30th and 60th percentiles. Logistic recalibration also improved the specificity, PPV, and accuracy, but reduced the sensitivity and NPV of the models in both white and black patients (Supplementary Tables 2 and 3). The model's discriminatory ability did not differ significantly between white and black patients, either before or after recalibration. We also found that the calibration statistics for Asian patients were relatively similar (Supplementary Table 4).
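The subgroup comparison above rests on bootstrapped differences in calibration statistics. As a rough sketch of how such a comparison might be set up, the code below estimates the difference in calibration intercept (slope fixed at 1) between two groups and attaches a basic ("empirical") bootstrap interval. The function names, the number of resamples, the 95% level, and the toy groups are assumptions made for illustration; the study's exact procedure is not given in this text.

```python
import math
import random

def logit(p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def calibration_intercept(probs, outcomes, max_iter=50, tol=1e-10):
    """Intercept a in outcome ~ a + 1 * logit(p), i.e. with the slope fixed at 1."""
    offsets = [logit(p) for p in probs]
    a = 0.0
    for _ in range(max_iter):
        grad = info = 0.0
        for off, y in zip(offsets, outcomes):
            mu = sigmoid(a + off)
            grad += y - mu
            info += mu * (1.0 - mu)
        if info < 1e-12:  # resample with (almost) no variation in outcomes
            break
        step = grad / info
        a += step
        if abs(step) < tol:
            break
    return a

def bootstrap_intercept_difference(group_a, group_b, n_boot=300, alpha=0.05, seed=0):
    """Each group is a (probs, outcomes) pair. Returns (difference, lower, upper)."""
    rng = random.Random(seed)

    def resampled_intercept(probs, outcomes):
        n = len(probs)
        idx = [rng.randrange(n) for _ in range(n)]  # sample patients with replacement
        return calibration_intercept([probs[i] for i in idx],
                                     [outcomes[i] for i in idx])

    observed = calibration_intercept(*group_a) - calibration_intercept(*group_b)
    diffs = sorted(resampled_intercept(*group_a) - resampled_intercept(*group_b)
                   for _ in range(n_boot))
    q_lo = diffs[int((alpha / 2) * (n_boot - 1))]
    q_hi = diffs[int((1 - alpha / 2) * (n_boot - 1))]
    # Basic bootstrap interval: reflect the bootstrap percentiles around the estimate.
    return observed, 2 * observed - q_hi, 2 * observed - q_lo

# Made-up groups: both are assigned a risk of 0.5, but group A's observed
# event rate is 0.2 (risk overestimated) while group B's is 0.5 (calibrated).
group_a = ([0.5] * 400, [1] * 80 + [0] * 320)
group_b = ([0.5] * 400, [1] * 200 + [0] * 200)
diff, lower, upper = bootstrap_intercept_difference(group_a, group_b, n_boot=200)
print(f"difference={diff:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

An interval that excludes zero would be read as a significant difference between the groups. The same resampling loop could be reused for calibration slopes or Brier scores by swapping the statistic.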

Table 3. Empirical bootstrap differences in calibration intercepts and slopes before recalibration

Table 4. Calibration statistics for the MUST-Plus model by race and gender

Calibration indices between male and female patients were also significantly different before recalibration (Table 3, Supplementary Fig. 2C,D). The models had, on average, more negative calibration intercepts in female patients compared to male patients (−1.49 vs. −0.88). Logistic recalibration significantly improved the calibration in both male and female patients (Table 4, Fig. 1d–f). For male patients in the holdout set, the recalibrated calibration intercept was 0 (95% CI: −0.05, 0.03), the calibration slope was 0.88 (95% CI: 0.85, 0.90), and the Brier score improved from 0.29 to 0.23 (95% CI: 0.22, 0.24). For female patients in the holdout set, the recalibrated calibration intercept was −0.11 (95% CI: −0.16, −0.06) and the calibration slope was 0.91 (95% CI: 0.87, 0.94), but the Brier score did not improve significantly. After logistic recalibration, only the calibration intercept differed between male and female patients. The calibration curves of the recalibrated models showed good agreement, but the predicted risks for men and women differed between the 10th and 30th risk percentiles. The discrimination indices for male and female patients were significantly different before recalibration. The model had higher sensitivity and NPV for women than for men, but lower specificity, PPV, and accuracy (Supplementary Table 2). The recalibrated model showed the highest specificity (0.95, 95% CI: 0.94, 0.96), NPV (0.84, 95% CI: 0.83, 0.85), and accuracy (0.82, 95% CI: 0.81, 0.83) for female patients, but at the cost of a significant decrease in sensitivity (0.27, 95% CI: 0.25, 0.30) (Supplementary Table 3).
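The trade-off reported here (specificity and PPV rising while sensitivity and NPV fall) follows from lowering predicted risks while holding the decision threshold fixed. The sketch below illustrates that mechanism with made-up numbers. The `classification_metrics` helper, the 0.5 threshold, and the 0.8 scaling used as a stand-in for recalibration are all assumptions made for illustration; the threshold used in the study is not stated in this text.

```python
def classification_metrics(probs, outcomes, threshold=0.5):
    """Confusion-matrix metrics for risks dichotomized at a fixed threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        flagged = p >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }

# Made-up example: six patients, three with malnutrition.
original_risks = [0.9, 0.7, 0.6, 0.55, 0.4, 0.3]
outcomes = [1, 1, 0, 1, 0, 0]
# Crude stand-in for a recalibration that corrects overestimated risk.
lowered_risks = [p * 0.8 for p in original_risks]

before = classification_metrics(original_risks, outcomes)
after = classification_metrics(lowered_risks, outcomes)
for name in before:
    print(f"{name}: {before[name]:.2f} -> {after[name]:.2f}")
```

With fewer patients crossing the threshold, false positives drop (higher specificity and PPV) but some true cases are missed (lower sensitivity and NPV), which mirrors the direction of the changes reported for the recalibrated model.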

As a sensitivity analysis, we also evaluated calibration by payer type and hospital type. In the analysis of payer type, we found that patients with private insurance were more likely to have miscalibrated predicted risk of malnutrition, with more extreme calibration intercepts, Emax, and Eavg, suggesting overestimation of risk (Supplementary Tables 5 and 6, Supplementary Figure 3A,B). Weak and moderate calibration did not differ significantly between hospital types (regional, tertiary, and quaternary), but tertiary acute care centers had more extreme calibration intercepts, suggesting overestimation of risk (Supplementary Tables 7 and 8, Supplementary Figure 3C,D). In both analyses, logistic recalibration significantly improved calibration at the weak, moderate, and strong levels (Supplementary Table 5, Supplementary Table 7, Supplementary Figures 4,5).
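Emax and Eavg are commonly defined as the maximum and average absolute difference between each predicted risk and a smoothed estimate of the observed risk. The sketch below computes a simplified, binned analogue per subgroup: it replaces the smoothed calibration curve with the observed event rate in equal-sized risk bins. That substitution, the function names, the bin count, and the payer labels and numbers are assumptions made for illustration and will not reproduce the values in the supplementary tables.

```python
def binned_calibration_errors(probs, outcomes, n_bins=10):
    """Return (Eavg, Emax) using per-bin event rates as the 'observed' risk."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])  # patients ordered by risk
    abs_errors = []
    for k in range(n_bins):
        members = order[k * n // n_bins:(k + 1) * n // n_bins]
        if not members:  # more bins than patients
            continue
        observed_rate = sum(outcomes[i] for i in members) / len(members)
        abs_errors.extend(abs(probs[i] - observed_rate) for i in members)
    return sum(abs_errors) / len(abs_errors), max(abs_errors)

def errors_by_subgroup(probs, outcomes, groups, n_bins=10):
    """Apply binned_calibration_errors separately within each subgroup label."""
    result = {}
    for label in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == label]
        result[label] = binned_calibration_errors(
            [probs[i] for i in idx], [outcomes[i] for i in idx], n_bins)
    return result

# Made-up example: the "private" group is assigned a risk of 0.5 but has an
# observed rate of 0.2 (overestimated), while "medicare" is well calibrated.
probs = [0.5] * 10 + [0.2] * 10
outcomes = ([1] * 2 + [0] * 8) + ([1] * 2 + [0] * 8)
groups = ["private"] * 10 + ["medicare"] * 10
by_payer = errors_by_subgroup(probs, outcomes, groups, n_bins=1)
for label, (eavg, emax) in by_payer.items():
    print(f"{label}: Eavg={eavg:.2f}, Emax={emax:.2f}")
```

Larger values for one subgroup than another would point to the kind of subgroup-specific miscalibration described for privately insured patients.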
