Population-level predictive variation in machine learning diagnosis of symptomatic bacterial vaginosis

The dataset used in this work was created by Srinivasan et al.³³ It consists of 220 women with or without bacterial vaginosis (BV). BV was diagnosed using Nugent scoring based on Gram staining of vaginal smears. Patients with a Nugent score of 7 or higher were identified as BV positive, while patients with a Nugent score below 7 were identified as BV negative. To predict BV, we used four machine learning (ML) models: random forest (RF), logistic regression (LR), support vector machine (SVM), and multi-layer perceptron (MLP). The hyperparameters used to optimize each classifier are listed in Supplementary Table 1. Four metrics were used to evaluate the performance of the ML models in predicting BV: balanced accuracy (BACC), area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR).
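As an illustration of how three of these metrics relate to the confusion matrix, the following minimal sketch labels samples by Nugent score and computes BACC, FPR, and FNR. It assumes that a score of 7 or higher is BV positive; the function names, scores, and predictions are hypothetical and are not taken from the study.

```python
def nugent_label(score: int) -> int:
    """Return 1 (BV positive) for a Nugent score of 7 or higher, else 0."""
    return 1 if score >= 7 else 0


def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def rates(y_true, y_pred):
    """Compute BACC, FPR, and FNR; assumes both classes occur in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    fpr = fp / (fp + tn)                   # share of BV-negative samples flagged positive
    fnr = fn / (fn + tp)                   # share of BV-positive samples missed
    bacc = ((1 - fnr) + (1 - fpr)) / 2     # mean of sensitivity and specificity
    return {"BACC": bacc, "FPR": fpr, "FNR": fnr}


# Hypothetical Nugent scores and model predictions (illustration only).
scores = [0, 3, 7, 9, 5, 8, 2, 10]
y_true = [nugent_label(s) for s in scores]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
metrics = rates(y_true, y_pred)
print(metrics)
```

AUPRC is omitted here because it requires predicted probabilities rather than hard labels.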

Descriptive statistics

Within the dataset there were 220 women, of whom 97 (44%) were white, 75 (34%) were Black, and 48 (22%) were of other ethnic groups (i.e., Asian, Hawaiian/Pacific Islander, American Indian/Alaska Native, mixed, or did not disclose ethnicity or race). All ethnic categories were self-reported. Figure 1 shows the percentage of BV diagnoses based on Nugent scoring, overall and by ethnicity. Overall, 53% of women had a positive BV diagnosis, and Black women and women of other ethnic groups had a higher prevalence of BV than white women (Figure 1). The chi-square test showed a significant association between ethnicity and BV outcome (p = 0.0001 < 0.05). This work examines how this association between ethnicity and BV outcome affects ML performance across ethnic groups.
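The chi-square test of independence used here can be sketched as follows. This is a minimal illustration, not the authors' analysis: the row totals (97, 75, 48) come from the text, but the split into BV-positive and BV-negative counts within each group is hypothetical, so the resulting statistic and p-value are not the reported ones. With three groups and two outcomes there are 2 degrees of freedom, for which the chi-square survival function has the closed form exp(-x/2).

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def p_value_dof2(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Rows: white, Black, other. Columns: BV positive, BV negative.
# The within-row splits are HYPOTHETICAL; only the row totals come from the text.
table = [
    [35, 62],
    [50, 25],
    [32, 16],
]
stat, dof = chi_square_independence(table)
p = p_value_dof2(stat)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.2g}")
```

For tables with other degrees of freedom, a statistics library would be needed to evaluate the p-value.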

Figure 2A shows a two-dimensional t-distributed stochastic neighbor embedding (t-SNE) projection of the operational taxonomic unit (OTU) variables mapped to BV diagnosis based on Nugent scoring. In the t-SNE projection, most samples can be separated by BV diagnosis. However, some samples are not well separated, which adds to the challenge of diagnosis using AI/ML models. To further explore the effect of dominant bacterial species on BV diagnosis, the t-SNE projection mapped to community state type (CST) classifications is also shown (Figure 2B). The plot is well separated by CST, with most of CST I in the BV-negative cluster and most of CST IV in the BV-positive cluster. The cluster with mixed BV diagnoses consists largely of CST III, an L. iners-dominated microbiota.

**Figure 2: Visualization of sequence data in two-dimensional space.**

Figure 3 shows the percentage and counts of women in each CST across ethnic groups. CST IV is the primary CST for Black women (56%) and women of other ethnic groups (50%). CST III, an L. iners-dominated microbiota, is the second most common community state type for women in these two groups (34.7% of Black women and 25% of other women). CST I, an L. crispatus-dominated microbiota, is the third most common CST among Black women (8%) and among women of other ethnic groups (22.9%). In contrast, CST III is the most common community state type among Caucasian women in this cohort, followed by CST IV (33%) and CST I (26.8%). All three ethnic groups had only one CST V (L. jensenii) patient. No group had patients classified as CST II (L. gasseri).

**Figure 3: Distribution of community state types across ethnic groups.**

Model performance in BV diagnosis varies by ethnicity

Table 1 shows the average balanced accuracy (BACC), area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR) of the four ML models in predicting BV. Overall, the ML models performed well (BACC: 0.90–0.92; AUPRC: 0.93–0.96; FPR: 0.07–0.10; FNR: 0.10–0.10). Random forest (RF) and logistic regression (LR) had better BV prediction performance than the other models, depending on the metric. However, the differences in performance metrics among the models were not statistically significant (Table 1).

Table 1. Overall performance of the RF, LR, SVM, and MLP models in terms of balanced accuracy (BACC), AUPRC, false positive rate (FPR), and false negative rate (FNR), with 95% confidence intervals.

When examining the performance of the ML models by ethnic group, differences in predictive results were found (Figure 4, Supplementary Table 2). Overall, Black women had the lowest balanced accuracy (BACC) (Figure 4A) and the highest FPRs (Figure 4C) across all models. In contrast, FNR tended to be lower in Caucasian women except when using the multi-layer perceptron (MLP) model (Figure 4D).

**Figure 4: Model performance by ML architecture type after 10 train-test runs (using nested grid search cross-validation in each run).**

Asterisks show group pairs with statistically significant differences in model performance.

In summary, most models except the MLP tended to perform worse for Black women than for white women and women of other ethnic groups. The MLP, however, tended to perform most evenly across all ethnic groups.
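A stratified evaluation of this kind can be sketched by grouping test-set predictions by ethnicity before scoring. The sketch below is a hedged illustration with hypothetical labels and predictions; the helper names are assumptions and do not reproduce the study's pipeline or results.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return (sensitivity + specificity) / 2


def balanced_accuracy_by_group(groups, y_true, y_pred):
    """Score each ethnic group separately on its own subset of predictions."""
    buckets = defaultdict(lambda: ([], []))
    for g, t, p in zip(groups, y_true, y_pred):
        buckets[g][0].append(t)
        buckets[g][1].append(p)
    return {g: balanced_accuracy(t, p) for g, (t, p) in buckets.items()}


# Hypothetical test-set labels and predictions (illustration only).
groups = ["white"] * 4 + ["Black"] * 4 + ["other"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
per_group = balanced_accuracy_by_group(groups, y_true, y_pred)
print(per_group)
```

A single pooled score would hide the spread between groups that this per-group breakdown exposes.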

Use paired ethnicity training to improve model performance

This subsequent analysis sought to determine whether training and testing on data from the same ethnicity (i.e., paired ethnicity training) reduces ethnic disparities in model performance. Only logistic regression (LR) results are shown because LR had the highest overall balanced accuracy (Table 1). For white and Black women, paired ethnicity training (Figure 5, Supplementary Table 3) resulted in comparable or better performance than training on a sample of all ethnic groups. However, these improvements were not statistically significant. In contrast, for women of other ethnic groups, performance on all measures except FNR was significantly reduced (balanced accuracy: p = 0.002; AUPRC: p = 0.037; FPR: p = 0.004).

**Figure 5: Model performance by ethnicity with or without ethnicity-specific training (i.e., paired ethnicity training and cross-training) using the LR model.**

Asterisks show group pairs with statistically significant differences in model performance.

We also examined whether these models generalize to ethnic groups not used in the training process (i.e., cross-training). Overall, cross-training tended to improve predictive performance among women of other ethnic groups (Figure 5, Supplementary Table 3), particularly for balanced accuracy (white: p = 0.048), FPR (white: p = 0.005; Black: p = 0.012), and FNR (white: p = 0.046; Black: p = 0.039). In contrast, paired ethnicity training tended to improve predictive outcomes for Black women compared to cross-training with data from women of other ethnic groups (BACC: p = 0.003; FPR: p = 0.004; FNR: p = 0.01). Similarly, paired ethnicity training often gave higher predictive performance for white women than cross-training with data from women of other ethnic groups (balanced accuracy: p = 0.006; AUPRC: p = 0.006; FPR: p = 0.006).

Bacterial taxa identified as important for predicting BV

Feature selection methods were used to identify bacterial taxa that contributed to accurate BV diagnosis. Important bacterial taxa were extracted using the following feature selection methods: Gini index, t-test, F-test, and point-biserial (PB) correlation.

For the point-biserial method, important features were determined using both the p-value (PBSIG) and the correlation coefficient (PBCORR). Results are shown only for the LR classifier because it was the best-performing model overall (Table 1).

Overall, the Gini index method performed best compared to the other feature selection methods (Table 2). When examining model performance by ethnicity (Figure 6, Supplementary Table 4), the improvements differed across ethnic groups. For white and Black women, feature selection improved most predictive measures, but the improvement was not statistically significant in most cases. In contrast, performance measures, particularly balanced accuracy and FNR, tended to degrade for women of other ethnic groups across all feature selection approaches. The PBCORR method tended to reduce performance measures for most ethnic groups.
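To make the point-biserial criterion concrete, the sketch below computes the point-biserial correlation between a binary BV label and a continuous feature such as a taxon's relative abundance, then ranks features by absolute correlation, in the spirit of PBCORR. The taxon names and abundance values are hypothetical, and the p-value used by PBSIG is not computed here because it needs a t-distribution from a statistics library.

```python
import math


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 label and a continuous feature."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(values) / n
    # Population standard deviation of the feature over all samples.
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / sd * math.sqrt(len(group1) * len(group0) / n ** 2)


# Hypothetical BV labels and relative abundances (illustration only).
bv = [0, 0, 0, 1, 1, 1]
features = {
    "taxon_a": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # higher in BV-positive samples
    "taxon_b": [0.9, 0.7, 0.8, 0.2, 0.1, 0.3],  # lower in BV-positive samples
    "taxon_c": [0.4, 0.6, 0.5, 0.5, 0.4, 0.6],  # unrelated to the label
}
correlations = {name: point_biserial(bv, vals) for name, vals in features.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
print(correlations)
print(ranked)
```

Ranking by absolute value keeps taxa that are depleted in BV as well as those that are enriched.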

Boxplots showing the median, upper quartile, lower quartile, and outliers of balanced accuracy, area under the precision-recall curve (AUPRC), false positive rate (FPR), and false negative rate (FNR).

Table 2. Overall performance of the LR model in terms of balanced accuracy (BACC), AUPRC, false positive rate (FPR), and false negative rate (FNR), with 95% confidence intervals.

To further explore ways to improve the equity of model performance, ML models were trained independently for each ethnic group using the features identified as important for BV diagnosis in that group by the Gini index method. Unique bacterial taxa were found in each ethnic group-specific subset (Fig. 7). Eggerthella sp. type 1 and Atopobium vaginae (Fannyhessea vaginae) were identified as the most important taxa for BV diagnosis in Caucasian women in this cohort, which corresponds to the important bacterial taxa identified across the whole cohort. In contrast, Gardnerella vaginalis and L. crispatus were found to be important predictors of BV for women of other ethnic groups. Dialister sp. type 2 and