Research Design
This pilot study was approved by the Public General Brigham Human Subjects Committee with IRB number 2015p000085. We enrolled participants under 40 years of age with infertility of unknown cause or MFI who received IVF between October 10, 2019 and 04/2021 at the Fertility Center at Massachusetts General Hospital. All study participants provided written consent prior to registration. Participants were not given incentives or compensation for participation. Unexplained infertility is defined as a couple of normal semen analysis, normal assessment of uterine, tube, and ovarian function (AMH > 0.8) and are about to conceive for 6-12 months (depending on age). Male factor infertility was defined as a normal assessment of uterine, tube, and ovarian function (AMH>0.8) (AMH>0.8) and one or more abnormal semen analysis parameters in two separate samples at least 2 weeks apart. Normal semen analysis values were based on the World Health Organization 5th edition. Guidelines: Concentration > 20 million/ml, Motility > 40%, Anterior Progression > 3, Total Motility Count > 15 million Motility Sperm/Sample27.
Both fresh and frozen embryo transfer cycles were included. For cycles with planned fresh embryo transfer, participants provided vaginal swabs on the third day of the stimulation cycle, day of egg collection and day of embryo transfer. For the CryOthaw cycle, participants provided swabs on baseline ultrasound (days 3–5), second ultrasound day, and embryo transfer day. Swabs were collected prior to clinical procedures. The swabs were stored at -80°C until processed.
Data on age, race, peak serum estrogen levels, antimulahormonal (AMH) levels, follicle-stimulating hormone (FSH) levels, embryo quality, number of embryos migrated, stimulation protocols, pregnancy history, and medical history were obtained through chart reviews. Information regarding treatment outcomes (i.e., confirmed pregnancy, biochemical pregnancy, ultrasound intrauterine pregnancy) was obtained from the medical records of the ART cycle where vaginal samples were collected.
Laboratory analysis
The swabs were eluted with 400 μL of sterile saline, centrifuged, and pellets were submitted for DNA extraction, but the supernatants were aliquoted and frozen for luminex analysis. Genomic DNA (GDNA) was extracted using a plate-based protocol that includes a bead beating process combining the QIAAMP 96 DNA QAICUBE HT kit (QIAGEN) procedure with phenol chloroform separation. Amplicon libraries for the bacterial 16S RRNA gene V4 region were prepared and sequenced in Illumina Miseq with a 300-cycle sequencing kit28. Taxonomic assignments were performed using GTDB and microbial composition analysis package DADA229,30. In stacked bar plots, only the 20 most common taxa are represented. Microbial composition analysis was performed using R (version 4.2.2). Classification data were aggregated at the genus level, excluded low-accumulation taxa (prevalence of <0.5%). Samples were assigned to Microbial Community Status Type (CST) Vaginal Community Status Type Nearest Center of Gravity Classifier– Valencia31:cst i =Lactobacillus crispatus Dominant, CST II = lactobacillus gasserii Dominant, CST III = Lactobacillus dominates,cst v = lactobacillus jensenii Dominant, CST IV = non-non-nonlactobacillus Dominant.
Concentrations of 20 cytokines/chemokines (MIG, IP10, IFN-γ, ITAC, IL1α, IL1β, TNFα, IL6, IL8, MIP-3α, IL12, MIP1α, MIP3β, IL13, IL12, IL21, IL4, IL23, IL5, IL10) were used in vagina. As mentioned above, multiplexed ELISA assays (Luminex)29,32. Values below the lower limit of detection within the assay were recorded as half the lowest standard concentration of that analyte. Similarly, concentrations above the detectable limit were recorded as 1.1 times the highest standard concentration. To identify participants with overall mucosal inflammation, each sample was assigned a “inflammatory score.” This corresponded to the number of any of nine inflammatory markers with the highest quartile values in this cohort. These markers were selected based on panels associated with increased risk of STI acquisition32(Interleukin (IL)-1α, IL-1β, IL-6, tumor necrosis factor (TNF)-α, IL-8, CXC motif chemokine 10 (CXCL10; IP-10), IL-17, macrophage inflammatory protein (MIP)-1α, and MIP-1β). For each marker, we assigned a score of 1 if the concentration value was >=75th percentile (in a cohort), and a score of 0 if it was below that value. Scores for all markers were aggregated across 79 samples to provide an inflammatory score for each sample (scores ranging from 0 to 9 for each sample).
Statistical analysis
This was a useful sample for the pilot. Metadata were compared between those who did not have an intrauterine pregnancy (IUP) and patients with unknown causes and those who were using MFI using Chi Square.t-Test or Mann -Whitneyu– Test if necessary. When assessing the association between microbiome or inflammation and outcomes, we control for multiple samples and different time points per participant, including data from all time points for all participants in the mixed-effects linear regression model. Statistical analyses were performed using either STATA or R.pValues below -0.05 are considered statistically significant.
Alpha Diversity was evaluated using the Shannon Diversity Index and calculated using the Diversity function, which is part of the vegan package. Alpha diversity was compared across groups using mixed-effects linear regression analysis to describe multiple samples from each participant.
Machine Learning Models
Data collection, preprocessing, and target definition
The dataset used in this study included subjects with both cytokine and bacterial abundance data at each of the three time points: 1 A (27 subjects), 2A (25 subjects), 3A (27 subjects), and binary pregnancy outcome labels (“results”). As a pre-treatment step to ensure a high quality feature matrix (x) Subsequent modeling steps excluded feature columns with values below the 0.01 threshold, confirming that only important features were included. Columns with missing or low potential data, defined as less than 50% of non-zero values, have also been deleted. Target variables (y) Binary pregnancy results for each subject.
Data Balancing Using Small
To address class imbalances in the dataset (between pregnancy and non-pregnant), synthetic minority oversampling techniques (SMOTE) were applied in any consistent random state.33. This technique produced synthetic samples of minority classes and ensured a balanced representation of both classes to train classifiers. Using Small Over-Sampling, 34 data points were used for time points 1A and 3A, and 32 was used for time points 2A.
Machine Learning Models
For simplicity and excellent performance, a support vector machine (SVM) with C-Support vector classification using a linear kernel was selected. Classifiers were trained with scaled training data and predictions were generated in left-wing test samples34,35,36. This model was used with the default parameters provided by the Scikit-Learn package. The predicted probability threshold considered positive was above 0.5 and was considered negative.
Model Evaluation Metrics
Due to the unbalanced nature of the data, we used F1 scores to assess model performance. This is the harmonic mean of accuracy and recall, providing a balanced measure of predictive performance of the model, which explains both false positives and false negatives.
$$ f1 = 2 \cdot \frac {{precision} \cdot {recall}} {{precision}+{recall}} $$
where
$$ {precision} = \frac {{true \;positive}} {{true \;positive}+{false \;positive}} $$
$$ {recall} = \frac {{true \;positive}} {{true \;positive}+{false \;negative}} $$
Accuracy, the number of correct predictions made by the model was measured as follows:
$$ {precision} = \frac {{true \;positive}+{true \;negative}} {{true \;positive}+{false \;positive}+{true \;negative}+{false \;negative}} $$
The training set contained both the original and Small-Sysesized data, but the model evaluation was calculated only for the original dataset. A confusion matrix was generated to assess classification performance.
Leave a vacation
To rigorously evaluate the performance of the model, we adopted Leave-One-Out Cross-Validation (LOO-CV). For each iteration, one sample was presented as a test set, and the remaining data was used as a training set. The number of iterations is equal to the number of samples. This is because it was repeated for all samples (Figure 1). Evaluations were performed using F1 and accuracy metrics. The functional matrix was standardized using the “StandardScaler” method, and standardized each feature by subtracting the average and scaling it to unit variance, ensuring that all features contributed equally to the prediction of the model.
The importance of features
Due to the importance of functionality, we relied on shap from the SVC linear model (Shapley Additive Description)37,38. Characteristic significance values were calculated after each loo-cv fold. These importance scores were averaged at the fold to represent the overall contribution of each feature.
Software and Tools
Data analysis and statistics were performed using R. Modeling was performed using machine learning analytics and Python. I used the following open source packages: Pandas for data manipulation, Imblearn (Smote) for handling class imbalances, Scikit-Learn for machine learning, and Matplotlib for plots.
