Study design and participants
The UKB is a large-scale population-based prospective cohort with detailed phenotypic and genetic data from over half a million participants recruited between 2006 and 2010 across 22 assessment centres in England, Wales and Scotland34. The baseline data were collected in person via questionnaires, verbal interview with a trained nurse, physical examinations and biological samples. Follow-up information was obtained through linkage to electronic medical records of death and cancer registries and hospital inpatient records. The UKB study received ethical approval from the North West Multi-center Research Ethics Committee (REC reference: 11/NW/03820). All participants gave written informed consent before enrolment in the study, which was conducted in accordance with the principles of the Declaration of Helsinki.
In this study, we focused only on post-menopausal women due to the etiological heterogeneity of breast cancer by menopause status11,12,13,14,15. We restricted our study population to a sub-cohort of UKB female participants who were post-menopausal with age 40–69 at baseline, met UKB internal genetic quality control (UKB field 22020), were of genetic White ancestry (UKB field 22006), and had no history of breast cancer, breast carcinoma in situ or mastectomy at baseline (Fig. 1).

Flowchart illustrating the selection process for our study population.
Our final study population was further randomly divided into training (80%) and test (20%) sets (i.e. training-test split) for subsequent ML analyses.
Prevalent and incident post-menopausal breast cancer
Prevalent breast cancer cases were identified using International Classification of Diseases codes (ICD-9: 174.X and ICD-10: C50.X) from the linked cancer registry data, with the date of breast cancer diagnosis preceding or on the date of baseline assessment.
Incident cases were ascertained longitudinally using cancer registry data, supplemented by record-level hospital inpatient data due to the reporting delay in registries. The follow-up time for each participant was calculated as the number of years from the date of baseline assessment until the earliest of the following: breast cancer diagnosis, date of death from other causes, date of loss to follow-up, date of mastectomy or last date of medical record availability in UKB: 28th February 2021 in England, 28th February 2021 in Scotland, and 28th February 2018 in Wales.
Polygenic risk scores
Due to the substantial discordance in individual-level risk categorisation between different PRS for the same disease35, we included two breast cancer PRS as potential genetic features: PRS3139 and PRS120k36. Neither PRS used UKB data in its derivation stage, hence both are suitable for calculation within the UKB population without the concern of inflated effect estimates due to sample overlap. PRS313 consisted of 313 (pre-Quality control) independent (correlation \(<0.9\)) genetic variants associated with breast cancer, developed using hard-thresholding and stepwise forward regression with \({p<10}^{-5}\) in Breast Cancer Association Consortium (BCAC) data. PRS120k consisted of 118,388 (pre-Quality control) variants, developed using the lassosum method37 from the same BCAC data.
We used the imputed genetic data from UKB (version 3, March 2018 release). Full details of genotyping, imputation, genetic array, and principal components (PCs) are described elsewhere38 and Supplementary Materials. We performed further variant Quality control (QC) checks across the whole cohort using a published pipeline39, excluding variants that were not available in UKB, variants poorly imputed in UKB (imputation information \(<\) 0.4), ambiguous variants (A/T or C/G single nucleotide polymorphisms (SNPs) with minor allele frequency \(>\) 0.49) and variants with minor allele frequency (MAF) \(<\) 0.005. This led to 305 variants remaining in PRS313 and 115,300 in PRS120k (Supplementary Table 1).
We then performed sample QC, excluding participants who were related (third degree or higher), sex discordant, or identified as outliers for genotype missingness or heterozygosity (as these could indicate poor sample quality), using sample QC data provided centrally by UKB (UKB field 22020) that retained a maximal set of unrelated individuals. Finally, we calculated the PRS as the weighted sum of effect allele dosages, and divided by the number of alleles using PLINK240.
Phenotypic features
In a phenome-wide scan of predictors for breast cancer, we considered 2,315 cross-sectional features from the UKB (Supplementary Table 2), reflecting socio-demographics, lifestyle, family history, early life and reproductive factors, blood and urine assays, physical measures, cognitive function, medication use and health conditions at baseline.
We mapped the 6,745 unique self-reported medications (UKB field 20003) to 411 distinct codes at level 4 of the Anatomical Therapeutic Chemical (ATC) classification system33,41. For example, if participants self-reported taking “kliofem tablet” or “kliovance 1 mg/0.5 mg tablet”, they would be categorised into the “G03FA” ATC group (i.e. “contraceptive and hormone replacement therapy (HRT) related medication” group). Since our study population is post-menopausal women only, any G03FA medication is assumed to be HRT, and is referred to as such throughout this paper.
We identified prevalent cancer diagnoses (level 2 ICD-10 codes under “Chapter II Neoplasms”, “Chapter XV Pregnancy, childbirth, and the puerperium”) from cancer registry data, and prevalent non-cancer diagnoses (level 2 ICD-10 codes except Chapters U,V,W,X,Y, and Z) from hospital inpatient data.
We did not include administrative variables, inapplicable pilot fields, male-specific factors, family history of family members of adoptees, history of surgical operations (because they likely reflect existing diagnoses that were already included as input features), and fields collected exclusively during follow-up (e.g. imaging data).
For biomarker assays, UKB has performed internal quality checks and excluded values where no data or an error was returned from the analyser, the values were outside the reportable range of the assay at the time of measurement or there was an aliquot problem. Full details of quality control were described in UKB Resource 5636. Among biomarkers, we excluded oestradiol and rheumatoid factor due to their high proportions of values below the lower reportable range.
The main pre-processing we performed on training data prior to ML analysis was assigning the following three categories as missing: “Prefer not to answer”, “Do not know”, and empty entries. We then removed features with missing rate \(>\) 30%, and those where all participants had the same value (such as rare diseases which no participants were affected by at baseline) which were of no discriminative utility, yielding 1,737 input features for ML models. All features were fitted in original scale from UKB without transformations.
Analysis pipeline
We adapted an analysis pipeline16 for combining ML and statistical approaches (Fig. 2), but with a fundamental difference in the definition of “test data”. We pre-specified the 20% test data as being unseen hold-out data that were not used in model construction, for the purpose of testing the performance of the Cox models obtained using the 80% training data (i.e. guarding against model overfitting). In contrast, Madakkatel et al. did not have such unseen hold-out test data; Cox models in the latter were constructed using their “test data” which were not held-out, whereas ours were built using the 80% training data. We believe that this separation between training and test enforces the avoidance of overfitting, and of reporting results using the same data that were visible to the construction of the model, in the standard manner for separating training and test sets.

Analysis pipeline. ML models were used for predictor discovery, followed by classical Cox modelling for further investigation. ML: Machine learning. XGBoost: extreme gradient boosting machine. SHAP: Shapley Additive Explanation. PRS: Polygenic risk scores.
Our further enhancements included grid search for hyper-parameter tuning and SHAP dependence plots for exploring PRS-predictor interactions. The tree-based eXtreme Gradient Boosting (XGBoost) machine learning algorithm16 was used to discover novel features among ≈1.7 k variables. Features of high importance according to the SHAP measure were regarded as potentially novel predictors, and were subsequently investigated by classical Cox models.
Model-based feature selection
Feature selection is necessary prior to constructing an analytical model for risk prediction. This task can be undertaken in many ways (e.g. backwards selection, correlation analysis, Delphi process). We used model-based feature selection (via a tree-based XGBoost machine) due to its ability to rank large number of features, prior to risk prediction by classical Cox models. XGBoost machine is also able to handle missing data, and reveal non-linear relationship between features and the outcome. The intention was to obtain the “best of both worlds”, by using non-linear methods to identify candidate associated inputs and then using conventional medical statistical models for maximum interpretability of the results.
Tree-based XGBoost is an ensemble learning method in which decision trees are built in a sequential manner. After initialising the model prediction by minimising a regularised loss function, the algorithm builds each tree by minimising the residuals from prior ones. The final model prediction is the weighted sum of the predictions from these sequential trees. It is capable of revealing non-linear relationships among correlated features from large datasets in a memory-efficient manner.
Our outcome (i.e. response variable) was the time-to-event outcome of incident breast cancer which consists of a binary indicator (present/absent) and the follow-up time. The XGBoost model was trained with negative partial log-likelihood of Cox proportional hazard model (i.e. “cox-loss”) as the loss function. Missing data was regarded as containing information (i.e. missing not at random, MNAR). During the training of the model, samples with missing values were assigned a default direction in each branch to either the left or the right child node, based on the gain.
For hyper-parameter tuning, we performed grid search (Supplementary Table 3) with five-fold cross validation (CV) on training data using cox-loss as the evaluation metric. The optimal set of hyper-parameters for XGBoost were found to be: maximum depth \(=\) 2, number of trees \(=\) 1,635, learning rate \(=\) 0.01, minimum of child weights \(=\) 3, gamma \(=\) 1, subsample \(=\) 0.8, column sample by tree \(=\) 0.8, and lambda for regularisation \(=\) 20. The graphical illustration of the optimal XGBoost model is shown in Supplementary Fig. 1. The Harrell’s C-index obtained from the five-fold CV was 0.717 on training data and 0.667 on test data, indicating that the model was not over-fitted. The reduction in C-index of approximately 0.05 is in keeping with expectations for evaluating a model in-sample (on training data) to out-of-sample (on hold-out test data). To investigate the impact of different loss functions, we trained an additional XGBoost model with a binary indicator of incident breast cancer using log-loss as the loss function (Supplementary Materials).
A variety of measures exist for obtaining features of high importance, such as the XGBoost built-in methods42,43, permutation based feature importance44, and SHAP values [arXiv:1705.07874]. Existing literature16 [arXiv:1802.03888] suggests that SHAP values are the most consistent and stable among the above methods. These properties are vital aspects of feature selection as they provide assurance that the selected features are robust to the perturbation of input data45. SHAP values also have the advantages of faster computation and better visualisation compared to permutation-based methods (Supplementary Materials). Despite those advantages, SHAP values have its limitations (e.g. not suitable for causal inference) and could potentially generate misleading interpretations, which could hide biases46,47.
We therefore chose SHAP values as our main feature importance measure, but also implemented the XGBoost default feature importance (“weight”) and permutation-based methods for comparison. The SHAP value of each feature was first computed using one sample at a time to reflect the local effect on the sample, and then aggregated by taking the mean of absolute SHAP values (SHAPma) across all samples to summarise the global attribution of this feature. In addition to the global SHAP values shown in a summary bar plot (Fig. 4), we presented the local SHAP values in an information-rich “beeswarm” plot, which shows both the relative ranking of features and the relationship of each feature with the outcome. The local SHAP values were further visualised in SHAP dependence plots48 to explore the potential relationship between PRS and phenotypic features.
Statistical model
Following the ML analysis, we further examined the extracted features as follows:
We regarded the top 20 features ranked by SHAPma as “important”. The union of these 20 features with the established risk factors forms the set of potentially “important” features. We then computed different forms of pairwise correlation \(r\) among these different types of features from the training data. Spearman’s rank coefficient was computed for pairs of numeric features, and Cramer’s V (computed using the Chi-squared statistic) for pairs of categorical features. The correlation between a numeric and a categorical feature was computed by regressing the numeric feature on the categorical feature and then taking the square root of the proportion of variation explained (also called correlation ratio).
Within each pair of features that was identified as highly correlated (\(r>0.9\)), we removed either the feature with most missing data, or the auxiliary one. This step is necessary to reduce the collinearity prior to constructing a linear (e.g. Cox) statistical model when the model will be used to draw statistical inference on the estimated effect of features.
The missingness within each feature was carefully assessed at this stage, as the number of features had now been sufficiently reduced (e.g. from over 1k to under 30) to permit such close inspection. For example, the variables “Age at first birth” and “Number of live births” needed to be considered together to ensure the imputed data were reasonable for women who have not had children. We performed multiple imputation using the mice package in R49 to impute the missing data under the assumption of missing at random (MAR). In contrast, the XGBoost machine had assigned a missing category to missing data, effectively assuming MNAR.
We note that the above statistical procedure is essential preparation before constructing classical statistical models and must not be overlooked. The analytical power of their elegant equations comes from careful attention to model specifications and thorough examination of the underlying assumptions.
Following the necessary preparation above, we constructed a Cox proportional hazard model (i.e. the augmented Cox model) using the training data to assess the associations between novel features and incident breast cancer, adjusting for established risk factors. Since PRS were present in the model, we also adjusted for genetic array (UKB field 22000) and the first 10 genetic PCs (UKB field 22009) to account for the underlying population structure.
To determine whether the novel features identified by ML improve model performance, we built a separate Cox model using the training data with only the two PRS and known risk factors (i.e. the baseline Cox model). We computed Brier score at 10-year for assessing overall model performance and Harrell’s C-index for assessing risk discrimination. We used the two sets of baseline hazard at 10-year and model coefficients (i.e. “beta values”) yielded from these two Cox models to compute the 10-year risks and prognostic index (i.e. variable \(\times\) beta) of each participant in the training (80%) and test (20%) data, respectively. These 10-year risks were subsequently used to compute Brier score at 10 years and prognostic indices were used to compute Harrell’s C-index50 using the training and test data, respectively. Both the augmented and baseline Cox models were pre-specified in our statistical analysis plan.
The proportional hazard assumption of Cox models was visually assessed using scaled Schoenfeld residuals. Multicollinearity was assessed by computing the variance inflation factor (VIF); values less than 10 were considered acceptable. Statistical tests were two-tailed, and performed using a 5% significance level.
As sensitivity analyses, we built additional Cox models using the training data to investigate the potential “PRS \(\times\) phenotypic features” interactions indicated by the SHAP dependence plots. To further assess the robustness of feature importance ranking, we implemented another machine learning model, histogram-based gradient boosting machines (GBM) inspired by LightGBM51 (Supplementary Table 4, Supplementary Fig. 3, and Supplementary Materials).
XGBoost version 1.5.0 and SHAP version 0.40.0 were implemented in Python version 3.8.8 and Cox model analyses were conducted using R version 4.0.2.
