Ethics statement
All experiments were performed in accordance with the Declaration of Helsinki, and the study was approved by the University of Chicago Institutional Review Board (IRB 22-0707). Model training included patients from the TCGA breast cancer cohort (BRCA).26 For validation, anonymized archival tissue samples with available Recurrence Score results were obtained from the University of Chicago between January 1, 2006, and December 21, 2020. Informed consent for this study was waived because patients had previously consented to the secondary use of biological samples.
Model development
First, an automated tumor detection module was trained to distinguish breast tumors from background tissue on digitally scanned H&E slides. From TCGA, 1,133 slides were reviewed, and 1,106 slides from 1,046 patients had tumor-rich areas of acceptable quality identified on pathologist review. Seven slides had encoding errors that prevented processing in the pipeline, leaving a cohort of 1,099 slides from 1,039 patients. These slides were manually annotated by a research pathologist to distinguish tumor from surrounding stroma. Tessellated image tiles with an edge length of 302 microns were extracted from within the tumor region and downsampled to a width of 299 pixels, consistent with 10x optical resolution. Tile extraction and DL model training were performed using the Slideflow pipeline27 with an Xception28 convolutional neural network backbone pretrained on ImageNet; all layers were fine-tuned during training, with a variable number of fully connected hidden layers prior to outcome prediction. A tumor likelihood module was trained using the hyperparameters listed in Supplementary Table 10 to distinguish tiles originating from within the tumor annotation from tiles outside the annotation. Model performance was evaluated as average accuracy over threefold cross-validation, and a further model was trained on the entire dataset for prediction in external patients. The data flow used for hyperparameter optimization, model training, and validation is shown in Supplementary Fig. 7.
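The tiling geometry described above (302-micron tiles resized to 299 pixels, i.e., roughly 1.01 microns per pixel, consistent with 10x) can be sketched as follows. This is a minimal illustration of the arithmetic, not the Slideflow implementation; the function name and the example slide dimensions and scan resolution are hypothetical.

```python
import numpy as np

def tile_grid(slide_w_px, slide_h_px, mpp, tile_um=302, tile_px=299):
    """Compute top-left coordinates of non-overlapping tiles and the
    downsample factor needed to resize each tile to tile_px wide.

    mpp: microns per pixel of the scanned slide at full resolution
    (an assumed input; real pipelines read this from slide metadata).
    """
    stride = int(round(tile_um / mpp))      # tile edge length in full-res pixels
    xs = np.arange(0, slide_w_px - stride + 1, stride)
    ys = np.arange(0, slide_h_px - stride + 1, stride)
    coords = [(int(x), int(y)) for y in ys for x in xs]
    downsample = stride / tile_px           # resize factor to reach 299 px width
    return coords, downsample

# e.g., a hypothetical 40x scan at 0.25 microns per pixel:
coords, ds = tile_grid(100_000, 80_000, mpp=0.25)
```

Each full-resolution tile here spans 1,208 pixels and is shrunk by a factor of about 4 to reach 299 pixels, giving an effective resolution of 302/299 ≈ 1.01 microns per pixel.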
We then trained another DL module to predict recurrence scores from tumor image tiles extracted from pathologist-annotated regions of interest. Because results of the clinically validated multigene recurrence assays are not available in TCGA, ‘research-based’ versions of the ODX and MP scores were calculated from normalized STAR–Salmon gene-level expression data from TCGA. Sequencing data were log2-transformed, centered at the row median, and standardized across columns of TCGA-BRCA. The statistical formulas from the published development of OncotypeDX29 and MammaPrint30,31 were then applied to the mRNA expression data to calculate research-based recurrence scores.
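The normalization steps above (log2 transform, row-median centering, column standardization) can be sketched with NumPy. The pseudocount and the genes-by-samples orientation are assumptions for illustration; the original processing may differ in these details.

```python
import numpy as np

def normalize_expression(counts):
    """Normalize a genes x samples expression matrix: log2-transform,
    center each gene (row) at its median, then standardize each sample
    (column) to zero mean and unit variance.

    counts: 2D array of non-negative normalized expression values.
    """
    x = np.log2(counts + 1)                       # pseudocount of 1 is an assumption
    x = x - np.median(x, axis=1, keepdims=True)   # row (gene) median centering
    x = (x - x.mean(axis=0)) / x.std(axis=0)      # column (sample) standardization
    return x
```

The research-based score formulas would then be applied to the rows of this matrix corresponding to the assay gene panels.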
This module was trained in a weakly supervised manner, with the patient-level mRNA assay result assigned to each tumor tile. To determine a high-risk ‘research-based’ ODX score threshold, the 15th percentile of results among HR+/HER2- patients in TCGA was used, as this corresponds to the proportion of patients with an ODX score of 26 or higher in the National Cancer Database.9 Training of the TCGA model was not restricted to HR+/HER2- patients, in order to enrich samples with high-risk ODX predictions, but internal validation in TCGA was performed on the HR+/HER2- subset. For UCMC, the standard high-risk cutpoints of ODX score ≥26 and MP score <0 were used. Hyperparameters for these models were selected with Bayesian optimization of cross-validated tile-level AUROC over more than 50 iterations (Supplementary Table 10, Supplementary Fig. 8); two sets of three cross-folds were used for optimization. Because samples in TCGA were H&E stained at many contributing sites, and given previous reports of site-specific batch effects in TCGA, folds were generated with site preservation32 to maximize generalizability. Patient-level predictions were calculated by averaging tile-level predictions from the module; therefore, all extracted tiles (after grayspace filtering) contributed to model predictions.
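Site preservation means that all slides from a given contributing site fall into the same cross-validation fold, so no site appears in both training and validation data. A minimal sketch using scikit-learn's GroupKFold as a stand-in (the paper's cited method32 uses its own fold-balancing procedure, and all data here are synthetic placeholders):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical patient table: one tissue source site per patient.
rng = np.random.default_rng(0)
sites = rng.choice(["site_A", "site_B", "site_C", "site_D", "site_E", "site_F"], size=120)
X = rng.random((120, 4))            # placeholder features
y = rng.integers(0, 2, size=120)    # placeholder high-/low-risk labels

# GroupKFold keeps every sample from a given site in a single fold,
# so no site contributes to both training and validation.
gkf = GroupKFold(n_splits=3)
for train_idx, val_idx in gkf.split(X, y, groups=sites):
    assert set(sites[train_idx]).isdisjoint(sites[val_idx])
```

Without this grouping, site-specific staining and scanning artifacts can leak label information across folds and inflate cross-validated performance.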
For clinical prediction of recurrence, the University of Tennessee nomogram9 was calculated for each patient in TCGA. Grade is not available in the original TCGA annotations but has been assessed and reported in previous studies.33 Exact tumor size was not provided by TCGA and was estimated from tumor staging groups. No imputation was required for nomogram calculations on the UCMC dataset. Finally, a logistic regression model was fitted using out-of-sample predictions from the pathology model combined with predictions from the clinical nomogram and validated with held-out data from TCGA. The model used for external validation was defined by averaging the coefficients of the logistic regressions fitted to TCGA. Thresholds for calculating model sensitivity were determined from TCGA (using interpolation to achieve an estimated sensitivity of 95%) and applied to the validation dataset from UCMC.
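The combination step can be sketched as a two-feature logistic regression (pathology prediction plus nomogram prediction), with per-fold coefficients averaged to define a single fixed model, as described. All inputs below are synthetic placeholders, and the three-way split is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Hypothetical out-of-sample pathology-model and nomogram predictions.
path_pred = rng.random(300)
nomo_pred = rng.random(300)
y = (0.6 * path_pred + 0.4 * nomo_pred + 0.3 * rng.random(300) > 0.7).astype(int)
X = np.column_stack([path_pred, nomo_pred])

# Fit one logistic regression per cross-validation fold, then average
# coefficients to define the single model used for external validation.
folds = np.array_split(np.arange(300), 3)
coefs, intercepts = [], []
for fold in folds:
    train = np.setdiff1d(np.arange(300), fold)
    lr = LogisticRegression().fit(X[train], y[train])
    coefs.append(lr.coef_[0])
    intercepts.append(lr.intercept_[0])

w = np.mean(coefs, axis=0)
b = float(np.mean(intercepts))
combined = 1 / (1 + np.exp(-(X @ w + b)))   # averaged-coefficient model
```

The averaged-coefficient model is then applied unchanged to the external (UCMC) cohort, with the decision threshold interpolated from the TCGA sensitivity curve.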
Development of the MP prediction model proceeded in a similar fashion, with some key differences. As no widely used clinical model was available, we developed a clinical predictor from n = 6,938 non-metastatic HR+/HER2- patients in the NCDB with breast cancer diagnosed between 2010 and 2017 and available MP test results. Sequential forward feature selection with 10-fold cross-validation was used to identify features that improved the AUROC of a logistic regression for MP prediction; grade, tumor size, PR status, lymphovascular invasion, ductal, mucinous, or medullary histology, and Black or Asian race were identified for inclusion. A logistic regression incorporating these features was fitted to all available data and used for prediction. For the DL pathology MP model, we used the same optimized hyperparameters as for ODX prediction.
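Sequential forward selection greedily adds the feature that most improves cross-validated AUROC at each step. A minimal sketch with scikit-learn's SequentialFeatureSelector on synthetic data (the fixed number of selected features here is an illustration assumption; the procedure described in the text stops when AUROC no longer improves):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 400
# Hypothetical clinical covariates; only the first two drive the label.
X = rng.random((n, 6))
logit = 3 * X[:, 0] - 2 * X[:, 1] + 0.3 * rng.standard_normal(n)
y = (logit > 0.5).astype(int)

# Greedy forward selection maximizing 10-fold cross-validated AUROC,
# mirroring the selection procedure described for the NCDB predictor.
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2,
    direction="forward", scoring="roc_auc", cv=10,
).fit(X, y)
selected = np.flatnonzero(sfs.get_support())
```

On this synthetic example the two informative covariates are recovered while the four noise covariates are excluded.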
Statistical analysis
Internal validation of model accuracy for recurrence score prediction in TCGA was estimated by averaging patient-level AUROC and AUPRC across 3-fold site-preserved cross-validation, with 1,000 bootstrap iterations for confidence interval estimation. External validation was performed on a single fixed model generated from all TCGA data, using DeLong's method for statistical comparison of AUROCs.34 The prognostic accuracy of the model for recurrence-free interval (RFI) was assessed with the Wald test of a univariate Cox model. A two-tailed t-test was used to compare DL pathology model predictions between patients with and without selected pathological features. All statistical analyses were performed in Python 3.8 with Lifelines 0.27.0 and SciPy 1.8.0 at a significance level of α = 0.05. Given the limited number of statistical tests performed on distinct subsets of patients and the exploratory nature of this work, no correction for multiple hypothesis testing was performed.
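The bootstrapped confidence interval for patient-level AUROC can be sketched with NumPy alone, using the Mann–Whitney rank formulation of AUROC and a percentile bootstrap. The function names, resampling unit, and percentile method are illustration assumptions and may differ from the study's exact procedure:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney U statistic (tie-aware rank form)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks for tied scores
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for patient-level AUROC."""
    rng = np.random.default_rng(seed)
    stats, n = [], len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)      # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                     # resample lacks one class; skip
        stats.append(auroc(y_true[idx], scores[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

For a perfectly discriminating score the interval collapses to [1.0, 1.0]; for an uninformative score it straddles 0.5.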
Reporting summary
For more information on the study design, see the Nature Research Reporting Summary linked to this article.
