Machine learning models for predicting blood pressure phenotypes by combining multiple polygenic risk scores

We used two datasets. The primary dataset is from the Trans-Omics in Precision Medicine (TOPMed) consortium, which was used to train and evaluate multiple models. The second dataset is from the Mass General Brigham (MGB) Biobank and was used to evaluate selected model performance in an independent healthcare-system dataset. Below we describe the datasets and the steps for constructing and evaluating models. Figure 1 describes the framework for the development of multi-PRS non-linear ML model allowing for non-genetic components to be calibrated across datasets.

The TOPMed dataset

The TOPMed study population included 62,295 unrelated (3rd degree or less) participants from 15 studies based in the U.S. and in Taiwan. Data was extracted from the freeze 8 TOPMed dataset release. Information about the studies including ethics statements is provided in Supplementary Note 1. Blood pressure phenotypes were harmonized by the TOPMed Data Coordinating Center (DCC)⁵⁰ and included systolic blood pressure (SBP), diastolic blood pressure (DBP), and status of antihypertensive medication use (HTNMED_V1). Medication status was used to increase values of SBP and DBP by 15 and 10mmHg, respectively, to account for the expectations that their values would be higher if the corresponding individuals did not use antihypertensive medications.

Genotype data

We used whole genome sequencing (WGS) data from Trans-Omics for Prevision Medicine (TOPMed) program⁵¹ Freeze 8 dataset. Genome samples were sequenced through TOPMed and the National Human Genome Research Institute’s Centers for Common Disease Genomics (CCDG) program and harmonized together via joint allele calling. The methods for TOPMed WGS data acquisition and quality control (QC) are described in https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-methods-freeze-8. Genetic Principal Components (PCs) and kinship coefficients were computed for the genetic data by the TOPMed DCC using the PC-Relate and PR-AiR algorithms^52,53 implemented in the GENESIS R package⁵⁴. Based on the kinship coefficients, we identified related individuals and generated a dataset in which all individuals were degree-3 unrelated, i.e., all kinship coefficients were < 2^(−9/2) (approximately 0.04). We extracted allele counts of variants that passed TOPMed quality control flags from GDS files using the SeqArray package version 1.28.1 and then further processed the genetic data using custom R version 3.6.2 and Python version 3.6.15 scripts. Only variants with minor allele frequency $\ge$ 0.01 in the TOPMed dataset and that passed TOPMed QC were used in this study. There were 63,093 participants and 12,341,303 variants available for PRS development.

Blood pressure phenotypes

We trained non-linear ML models to predict two phenotypes: systolic blood pressure (SBP) and diastolic blood pressure (DBP). SBP and DBP values were extracted, when available, from a harmonized datasets created by TOPMed⁵⁰, and for some TOPMed-parent studies they were prepared by study researchers. To address the existence of unrealistic values in the dataset, but without forcing a specific threshold, we removed individuals with outlying values defined by phenotypic values above the 99th quantile and values below the 1st quantile for each of the phenotypes. This amounted to 1047 and 1403 individuals from the analyses of SBP and DBP, respectively. The quantiles were computed over the complete dataset. Values for SBP and DBP for individuals that were taking antihypertensive medications were adjusted by increasing SBP and DBP values by 15 and 10 mmHg, respectively.

Summary statistics from published GWAS

We used summary statistics from published GWAS of SBP and DBP provided by BioBank Japan Project (BBJ), Million Veteran Program (MVP), as well as UK BioBank and the International Consortium of Blood-Pressure Genome Wide Associations Studies (UKBB + ICBP). Details are provided in Supplementary Table 12. Because some of TOPMed European ancestry participants also participated in the UKBB + ICBP meta-analysis, we performed GWAS of SBP and DBP using all self-reported White individuals in the TOPMed dataset (i.e. participants included in both TOPMed training and testing datasets), and then applied a procedure to “remove” the contribution of this GWAS from the overall UKBB + ICBP GWAS summary statistics⁵⁵ as described in Supplementary Note 2.

Polygenic risk scores

We developed two types of PRS—“global PRS”, using SNPs from the entire genome, and, in secondary analysis, “local PRS”, calculated from SNPs within LD-regions. In both cases SNPs were selected using the clump & threshold approach. We developed and constructed global PRSs using PRSice2 version 2.3.5⁵⁶, from the BP GWAS summary statistics described above. As tuning parameters, we set R² = 0.1, distance = 1000 kB, and several p-value thresholds: 5 × 10^–8, 10^–7, 10^–6,…, 10^–2. We used the TOPMed data set as a reference panel for LD (used for clumping). In secondary analysis, we developed local PRSs. We used previously computed LD-regions⁵⁷ provided in BED files defining chromosomal segments (see Data Availability) based on a European reference panel to subset the UKBB + ICBP GWAS summary statistics to files consisting of variants falling within each LD-region (chromosomal segment). We developed local PRS based for each of these regions using the same clump & threshold approach using PRSice2, but now, due to the large number of segments and thus features, we used a single p-value threshold of 10^–2. For this secondary analysis we only used the UKBB + ICBP GWAS because we saw before that, although it is based on single genetic ancestry (European) PRS based on it perform well when evaluated in various self-reported race/ethnic groups¹³. In a Supplementary Note 3 we also describe the comparison of the models’ performance using global SBP and DBP developed using PRS-CSx⁵⁸, which is an extension of the Bayesian polygenic prediction method PRS-CS⁵⁹, from the three GWAS summary statistics used in this study.

Non-linear ML model training and hyperparameter tuning

We used the python version 3.6.15 library xgboost version 1.5.2⁶⁰ to fit ensembles of gradient boosted trees. We performed hyperparameter tuning using Optuna⁶¹ library version 3.0.6. Specifically, we split the training dataset at random into 5 independent datasets and performed a fivefold cross-validation procedure to select optimal values of relevant tuning parameters.

Ensemble model

As described in Fig. 1, the ensemble non-linear ML model consisted of two consecutive components—“baseline model” and “genetic model”. The goal behind the two-model construction was (1) separately assess the benefit of non-linear modelling of non-genetic measures and genetic measures, (2) allow for flexible combination of the two models (e.g., linear model for covariates and non-linear model for PRSs, or the other way around), and (3) facilitate model calibration and generalizability between datasets while acknowledging that some covariates are potentially dataset-specific, and these are included only in the baseline model. The genetic model is expected to be fully transferable to a new dataset by using features that are comparable (or harmonized) across datasets.

We compared two potential constructions of both the baseline and the genetic models: non-linear ML (allowing for data-driven incorporation of interactions), and linear regression (without modeling interactions). We divided the dataset such that 70% of the data was used as training data and 30% of the data was held out as a validation set. First, we trained the baseline model using covariates only, including age, sex, self-reported race/ethnicity, BMI, and study. The genetic model was trained on the same set of features as in the baseline model, other than study, which is dataset-specific, and it also included genetic components, i.e., PRSs, where we used the global PRSs described above, the local PRSs in secondary analysis (described later). The genetic model was trained to predict the residuals from the baseline model.

Model development using multiple PRSs

In primary analysis, we studied the use of multiple PRSs in the genetic model via multiple models of increased complexity (in the sense that they include higher number of PRSs). We refer to these models as Level 1, Level 2 and Level 3 with Level X being shorthand for Model complexity levels (Fig. 1c). Note that all genetic models included the same non-genetic covariates as described above, and they only differed by the inclusion of additional PRSs. Level 1 of the genetic model included a single PRS developed from the UKBB + ICBP GWAS summary statistics using the p-value threshold 10^–2. Level 2 included 7 PRSs, all from the UKBB + ICBP GWAS, and based on all considered p-value threshold: 5 × 10^–8, 10^–7, 10^–6,…, 10^–2. Level 3 included PRSs constructed based on all considered GWAS (UKBB + ICBP, MVP, and BBJ) GWASs and using all p-value thresholds, i.e., it included 21 PRSs. Level 1 models included only a single PRS based on the 10^–2 p-value threshold because we expected this PRS to have the best performance based on past work studying BP PRSs¹³. In secondary analysis, we also evaluated model performance, based on the training dataset only, when using each of the other UKBB + ICBP based PRSs (i.e. based on each of the 7 thresholds). The models were fit using the combined, multi-ethnic dataset, with training dataset cross-validation performance reported on the combined group and stratified by self-reported race/ethnicity. The goal was to assess whether the same or different thresholds are useful across groups, and thus whether multi-PRS models are useful because they potentially allow for different utilization of PRSs across individuals (i.e. due to interactions).

Secondary analysis using local PRSs

We considered using local PRSs instead of global PRSs because they may result in more interpretable models, i.e., where one could hopefully explain why different genomic regions may have different potential contributions to the BP model. Due to the large number of PRSs when computed over all LD-regions, potentially resulting in model overfitting, we augmented the ensemble model approach with a variable selection step. Here, we applied LASSO regression from the python library scikit-learn⁶² version 0.24.2 using, as before, cross-validation for tuning parameter selection, on a linear model predicting the residuals from the baseline model using all constructed local PRSs. We next use the selected PRSs in the genetic model (see Fig. 4a for visualization of the Ensemble model with integrated LASSO step). We evaluated this model performance on the test dataset, and also, for comparison, of the model that uses only the LASSO (without the following non-linear XGBoost ML genetic model).

Assessment of computational needs

We assessed computational runtime and memory (RAM) usage of the non-linear ML models in comparison with the linear regression models using python timeit and memit modules. These were measured on a 13-inch 2020 MacBook Pro machine, with the Apple M1-chip, macOS Sonoma version 14.4.1, 16 Gb of RAM and 8 vCPUs. As the non-linear ML models require tuning parameters fitting, we measured their runtime and memory over the fivefold cross validation process. We also measured the runtime and memory for applying the various models for prediction over the test dataset. We performed each computing task 10 times and provide the average runtime and RAM measures, other than for local PRS models, which took longer time, and we therefore fit them once for this assessment. We performed this analysis only using SBP models, that used a slightly larger sample size compared to DBP.

Model performance evaluation

Models’ performance was assessed on the test dataset (30% of the TOPMed dataset of unrelated individuals). We report two performance measures: percent variance explained (PVE) at both the residual (of the baseline model) and at the phenotypic (original BP phenotypes) level. To explain how each is computed we introduce some notation. Including, we distinguish between the predicted values of the baseline model, the genetic model, and the combined ensemble, and between the residuals of the baseline and the genetic models.

Let ${M}_{b}$ and ${M}_{g}$ denote a baseline and a genetic model, ${{\varvec{x}}}_{{\varvec{b}}}$ and ${{\varvec{x}}}_{{\varvec{g}}}$ the sets of covariates used by ${M}_{b}$, and ${M}_{g}$ respectively, and ${\varvec{g}}$ the set of PRSs used by ${M}_{g}$. Let $y$ denote a BP outcome. A trained model ${M}_{b}$ uses ${{\varvec{x}}}_{{\varvec{b}}}$ to predict $y$, as:

$${M}_{b}\left({{\varvec{x}}}_{{\varvec{b}}}\right)={\widehat{y}}_{b}.$$

(1)

With ${\widehat{y}}_{b}$ being the prediction of ${M}_{b}$ applied on ${{\varvec{x}}}_{{\varvec{b}}}$. The residuals of model ${M}_{b}$ are obtained as the difference between the observed and the predicted BP value, and are denoted by ${r}_{b}$:

$${r}_{b}=y-{\widehat{y}}_{b}.$$

(2)

The genetic model is trained to predict ${r}_{b}$ using ${{\varvec{x}}}_{{\varvec{g}}}$ and PRSs ${\varvec{g}}$. Thus

$${M}_{g}\left({{\varvec{x}}}_{{\varvec{g}}},\boldsymbol{ }{\varvec{g}}\right) ={\widehat{r}}_{b}.$$

(3)

The residuals of ${M}_{g}$ are given as the difference between the value it attempts to predict and its prediction:

$${r}_{g}={r}_{b}-{\widehat{r}}_{b}.$$

(4)

Noting based on Eqs. (2) and (4) that the observed BP measure $y$ can be decomposed as:

$$y={\widehat{y}}_{b}+{r}_{b}={\widehat{y}}_{b}+{\widehat{r}}_{b}+{r}_{g}.$$

(5)

We denote the prediction BP based on the ensemble model ${M}_{e}$ by ${\widehat{y}}_{ensemble}$ with:

$${\widehat{y}}_{ensemble}={\widehat{y}}_{b}+{\widehat{r}}_{b}.$$

(6)

With the residuals ${r}_{g}$ of the genetic models being also the residuals of the ensemble model. For a given outcome $out$ the PVE by the predicted outcome $\widehat{out}$ is defined as the percent reduction in the variance of $out$ when accounting for $\widehat{out}$, using this formula:

$$PVE\left(out, \widehat{out}\right)=\left(1-\frac{var\left(out\right)-var\left(\widehat{out}\right)}{var\left(out\right)}\right)\times 100\%.$$

(7)

Finally, we define the performance measures of the various models. To assess the performance of baseline models we compute phenotyping PVE, $PVE\left(y, {\widehat{y}}_{b}\right)$. To assess the performance of the genetic models we compute the PVE at the level of the baseline model residuals, i.e., $PVE({r}_{b}, {\widehat{r}}_{b})$. To assess the performance of the ensemble model we compute again PVE at the phenotypic level, i.e., $PVE\left(y, {\widehat{y}}_{ensemble}\right).$

To further interpret results, we computed 95% confidence intervals for the estimated PVEs using bootstrap sampling from the test dataset, with 100 repetitions. Specifically, we sampled with replacement individuals from the test dataset using the same sample size of the test dataset relevant for each analysis, and applied the prediction models that were trained over the train dataset. When assessing genetic model, we applied both the baseline and the genetic model over the sampled individuals. The PVE was computed for each bootstrap sample, and the 95% confidence interval was derived using the percentile method, i.e. using the 2.5 and 97.5 percentiles of the bootstrap distributions.

The Mass General Brigham (MGB) Biobank dataset

To assess how the patterns observed in the TOPMed dataset generalize to a healthcare-based medical system, we implemented all fitted models on the MGB Biobank dataset (MGB Biobank). In MGB Biobank, phenotypes are available from electronic health records (EHR). Because blood pressure, BMI, and medication data are available sporadically depending on patient visits, we restricted the datasets health records from two years, 5/25/2021 to 5/25/2023, so that all time-varying variable correspond to approximately the same age and time. The initial MGB Biobank dataset included data for 142,476 individuals. Of these, N = 108,389 individuals did not take antihypertensive medications (dataset codes “antihypertensive medications”, “antihypertensive-other”, “beta blockers/related”, “calcium channel blockers”, “diuretics”, “direct renin inhibitor”, “antihypertensive combinations”). Next, we further filtered the dataset to only include individuals with available systolic or diastolic reading (N = 29,282). We then only included individuals with genetic data, who also are genetically unrelated to each other, resulting in 9494 individuals in the final dataset. For each individual, we extracted the median SBP, DBP, and BMI values from these years. Individuals self-identified with categories of race and ethnicity. Individuals with Hispanic ethnicity were set to the “Hispanic” category, and otherwise individuals with non-Hispanic ethnicity and with Black or African American race to Black, and those with non-Hispanic White or Asian race were set to White or Asian, respectively, and non-Hispanic individuals with more than one self-identified race or with “other” or “unknown” were set to “other”. More details about the MGB Biobank dataset are provided in Supplementary Note 4.

To calibrate the ensemble model to the MGB Biobank dataset, we trained the baseline model on the set of covariates from MGB Biobank for prediction of the phenotype (SBP/DBP) and calculated the residuals. Next, we evaluated, separately, the genetic model (trained on the TOPMed training dataset) and the ensemble (MGB Biobank-trained baseline model + TOPMed trained genetic model) models. The genetic models included the three model complexity levels trained on the TOPMed dataset. To assess this calibration approach, we assessed the performance of the TOPMed-trained genetic model when applied over residuals from the MGB-trained and residuals from the TOPMed-trained baseline models.

Ethics statement

All methods were carried out in accordance with relevant guidelines and regulations. The presented analysis relies on observational data only, the experimental protocol is purely computational. This work was approved by the Mass General Brigham IRB (protocol #2021P001928) and by the Beth Israel Deaconess Medical Center Committee on Clinical Investigators (protocol #2023P000541). Informed consent was obtained from all participants, as described in Supplementary Note 1 (TOPMed participants) and Supplementary Note 4 (MGB Biobank participants).

Source link