Development and deployment of a histopathology-based deep learning algorithm for patient prescreening in a clinical trial



Ethics approval and consent to participate

The study was approved by ethics review boards at the 89 sites participating in the ANNAR study (NCT03955913). These sites were sponsored by Janssen R&D, the entity that provided global oversight and approval of the study. The study was carried out in accordance with relevant legislation and ethics guidelines. Enrolled patients provided signed informed consent prior to participating in the study.

Study design and datasets

We collected data from public repositories, third-party vendors, and internal clinical trials amounting to 3940 histology images (H&E-stained whole-slide images of urothelial carcinomas). These images were linked to ground-truth molecular testing results obtained by either NGS or a targeted assay. Image ground truth (FGFR positive or FGFR negative) was defined by whether the sample harbored any of the FGFR alterations detected by the QIAGEN Therascreen® FGFR RGQ RT-PCR Kit, which aids in identifying patients eligible for treatment with BALVERSA™ (erdafitinib). The alterations are the following: (1) FGFR3 gene point mutations with targets p.R248C (c.742 C > T), p.G370C (c.1108 G > T), p.S249C (c.746 C > G), p.Y373C (c.1118 A > G); (2) FGFR3 fusions with targets TACC3v3 and TACC3v1; and (3) FGFR2 fusions with targets BICC1 and CASP7.

To develop the algorithm, we used one whole-slide image of bladder tissue per patient from three different cohorts: 407 from The Cancer Genome Atlas (TCGA) consortium (https://portal.gdc.cancer.gov/projects/TCGA-BLCA), 2811 from BLC3001 (NCT03390504) and 184 from BLC2002 (NCT03473743), two erdafitinib trials7,8, as seen in Fig. 1. The prevalence of FGFR alterations in each cohort was 12.5%, 11.6% and 15.7%, respectively, for an average prevalence of ~12%. The Development Data was split into Training Data (85%, or 2820 slides) and Hold-out Data (15%, or 582 slides). The split preserved both the ratio of FGFR+ to FGFR– patients and the proportion of samples from each cohort. The Training Data was used for algorithm optimization via cross-validation, and the Hold-out Data to evaluate performance prior to algorithm packaging for onboarding onto the deployment platform, Retrospective Validation, Deployment Setting Validation, and Full Deployment. The Training Data was further divided into 5 folds for cross-validation during hyperparameter tuning; the FGFR mutation and cohort ratios were likewise preserved in each fold.
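The stratified split described above can be sketched as follows; the function, tuple layout, and seed are illustrative, not the study's actual tooling:

```python
import random
from collections import defaultdict

def stratified_split(samples, holdout_frac=0.15, seed=0):
    """Split (sample_id, cohort, label) tuples into Training and
    Hold-out sets (85%/15%) while preserving the joint cohort and
    FGFR-label distribution, as done for the Development Data.
    Names and tuple layout are illustrative."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in samples:
        strata[(s[1], s[2])].append(s)   # stratify on (cohort, label)
    train, holdout = [], []
    for _, members in sorted(strata.items()):
        members = list(members)
        rng.shuffle(members)
        n_hold = round(len(members) * holdout_frac)
        holdout.extend(members[:n_hold])
        train.extend(members[n_hold:])
    return train, holdout
```

The same routine, applied to the Training Data with a different fraction per fold, yields the stratified 5-fold split used for hyperparameter tuning.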

Note that an extra subset of samples from the BLC3001 (NCT03390504) cohort (350 slides: 150 FGFR+, 200 FGFR–) was left out for Retrospective Validation of the FGFR Device for deployment. To achieve the desired confidence intervals calculated via statistical power analysis for the estimated sensitivity and specificity (to detect a 10% difference in sensitivity using a two-sided exact test with 5% type I error), 150 samples were randomly selected from the positive dataset population and 200 from the negative dataset population. Furthermore, data from ANNAR (NCT03955913)6, the deployment trial, was used for the Deployment Setting Validation of the FGFR Device (17 WSIs acquired in real time for workflow validation and 171 retrospective WSIs to assess performance). An additional independent test dataset (361 WSIs) from an external laboratory with tissue from multiple tumor types (i.e., PAN-Tumor) was used to evaluate generalization of the algorithm to solid tumors (not represented in the figure).

Algorithm description

Deep learning methods were used to predict FGFR+ based on an H&E-stained histopathology slide. Specifically, we used convolutional neural networks (CNNs), which excel at pattern recognition for data with inherent structure, like images or sequences. For more efficient training, we incorporated transfer learning46 into our approach. That is, we used CNNs that had been previously trained on more general image data sets to recognize simple and complex patterns, which allows them to be more quickly tuned to new, related tasks such as classifying histopathology images.

Additionally, we developed a multi-instance learning approach to accommodate the exceptionally large images obtained by scanning histopathology slides39,46,47. In this framework, a whole histopathology slide is broken into many smaller tiles. The patient-level outcome associated with the slide is associated with each individual tile during training and the network learns patterns that differentiate the patient label (e.g., FGFR+ or FGFR-). This approach has the added benefit of not requiring manual annotation of the whole-slide image by a pathologist, resulting in a lower cost of obtaining data and a broader set of outcomes on which to train. Figure 8 shows the multi-instance learning pipeline embedded in the FGFR device. Note that all tiles in the slide are fed into the network to predict a single outcome (i.e., FGFR+ or FGFR-) for the entire slide.
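A minimal NumPy sketch of this attention-style aggregation over tile embeddings follows; the weight matrices stand in for the trained attention network, and all shapes are illustrative:

```python
import numpy as np

def attention_pool(tile_feats, V, w):
    """Aggregate per-tile CNN embeddings into one slide-level
    embedding via softmax attention, so that a single FGFR+/FGFR-
    prediction is made per slide. V and w stand in for the trained
    attention network's weights; shapes are illustrative."""
    scores = np.tanh(tile_feats @ V) @ w        # (n_tiles,) attention logits
    a = np.exp(scores - scores.max())
    a /= a.sum()                                # softmax over all tiles
    return a @ tile_feats, a                    # slide embedding, tile weights
```

A downstream classifier head applied to the pooled slide embedding then produces the slide-level FGFR likelihood; the attention weights also indicate which tiles drove the prediction.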

Fig. 8: FGFR device overview.

The device is a Docker container combining the algorithm with error checks that allow for easy integration into the clinical workflow. It takes an image along with the corresponding metadata as inputs, and outputs the likelihood of FGFR for that image. The device shows an explanatory error message to the clinician when a slide does not meet pre-specified criteria (i.e., tissue site must be bladder, from MIBC disease stage, with 10x magnification available). Similarly, it notifies the user if the image does not pass quality control (i.e., the image is corrupted or missing, or there are insufficient high-quality tiles to perform a prediction). These checks ensured that the device would only run on data from the same distribution as the training data.

Whole-slide images were preprocessed into 224 × 224 pixel non-overlapping tiles to train the multi-instance learning pipeline. The tiles were fed into the quality-control pipeline48 to remove tiles with artifacts (i.e., pen marks, blur, etc.), followed by a stain-based data augmentation step49 to generate multiple stain versions of each tile for training. Tiles whose quality-control (QC) score was below 0.75 were dropped (see the device description in Fig. 8 and the pseudocode in Box 1 for detailed QC steps). Given that similar performance was obtained at the multiple magnifications available, we trained on images at 10x magnification to speed up algorithm training and inference. A CNN followed by an attention-based network39 was trained end to end using the multiple-instance learning pipeline50. The CNN's initial weights came from an ImageNet-pretrained ResNet34 network from the PyTorch library in Python; the attention network started with randomly initialized weights. We performed hyperparameter tuning via grid search and used a weighted cross-entropy loss function to offset the class imbalance between positive and negative FGFR slide counts. We selected the best algorithm as the one with the highest positive predictive value (PPV) at 0.9 sensitivity on the validation sets from a 5-fold cross-validation data split. The optimal hyperparameters were the following: learning rate of 0.00001, weight decay of 0.0001, and dropout of 0.5.
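The tiling and QC gating steps can be sketched as below; the 224-pixel tile size and 0.75 QC threshold are from the text, while the function names are illustrative:

```python
def tile_grid(width, height, tile=224):
    """Origins of non-overlapping 224x224 tiles covering a slide at
    the working magnification (partial edge tiles are dropped)."""
    return [(x, y) for y in range(0, height - tile + 1, tile)
                   for x in range(0, width - tile + 1, tile)]

def keep_tiles(scored_tiles, threshold=0.75):
    """Drop tiles whose QC score is below the 0.75 cutoff used by
    the device; tiles with artifacts such as pen marks or blur
    receive low scores from the QC pipeline."""
    return [t for t, qc in scored_tiles if qc >= threshold]
```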

Proposed clinical workflow

Figure 5 shows the proposed workflow for patient prescreening using the image-based FGFR Device. The device is used prior to planned molecular testing to identify subjects in whom molecular FGFR testing is likely to be negative. Results of the digital device analysis are provided to clinical trial investigators to help them screen patients who could be eligible for the clinical trial and prioritize patients for molecular testing. Note that three parties are involved in the workflow: the clinical study sites, which are distributed around the globe; the central laboratory, which has multiple locations (i.e., Indianapolis, Geneva, Japan, and Singapore), each in contact with its corresponding investigator sites; and the cloud platform partner, which is connected to both the central laboratory locations and the investigator sites.

The gray boxes represent the workflow steps that were being followed for patient enrollment prior to the implementation of the image-based AI prescreening. Starting at a clinical trial site, a patient that meets the enrollment criteria for the trial would sign consent to enroll, and then archival tissue of a tumor biopsy would be sent to the central laboratory for H&E staining and scanning. After quality control of the tissue, the CRO would then send it to genomics to perform a molecular test (i.e., QIAGEN therascreen® FGFR RGQ RT-PCR Kit) to identify if the patient is FGFR +, and hence, eligible for treatment with BALVERSA™ (erdafitinib).

The green boxes represent the steps added to the prior workflow to introduce the image-based prescreening. Starting at the central laboratory, the staining and imaging department performs a daily transfer of the scanned images and corresponding metadata (i.e., patient id, slide id, tissue site of specimen, etc.) to the cloud platform hosting the FGFR Device. The device runs as soon as the images reach the platform, and in a matter of minutes, the results of the algorithm (i.e., FGFR likelihood) are available via web portal to the investigators. Investigators receive an email notifying them that an FGFR result is available for review, and based on the result, they decide whether to cancel the molecular test. In that case, they notify the central laboratory by answering a query in their portal.

Design control development and validation of FGFR device

As determined by our regulatory and clinical diagnostics teams, the algorithm was classified as Software as a Medical Device38. As a result, we applied medical device standards, including design controls, to the development and validation process. The development, design verification, and design validation steps that were followed are explained in more detail below. The decision to fully deploy the algorithm was based on the results of a Retrospective Validation study using representative samples (section C) and a Deployment Site Validation study in which the device was deployed on prospectively collected ANNAR samples (section D).

A. Algorithm packaging and software verification

The first step after algorithm training and optimization using the 3402 slides was to package the selected algorithm into a user-friendly device (see schematic in Fig. 8). The device is a Docker container combining the algorithm with error checks that allow for easy integration into the clinical workflow. The device takes an image as input along with the corresponding metadata for that image, and outputs the likelihood of FGFR for that image. In cases where the slide metadata does not meet the predefined criteria (i.e., bladder tissue, 10x magnification image, MIBC), the device shows an explanatory error message to the clinician. Similarly, it notifies the user if the image does not pass quality control (i.e., the image is corrupted or missing, or there are insufficient high-quality tiles to perform a prediction). These checks ensured that the device would only run on data from the same distribution as the training data.
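The gating logic described above can be sketched as follows; the metadata field names and the tile-count floor are illustrative, not the device's actual schema:

```python
def precheck(metadata, image_ok, n_good_tiles, min_tiles=1):
    """Return the list of explanatory error messages the device
    would surface instead of a prediction; an empty list means the
    slide matches the training distribution and passes QC.
    Field names and min_tiles are illustrative."""
    errors = []
    if metadata.get("tissue_site") != "bladder":
        errors.append("tissue site must be bladder")
    if metadata.get("disease_stage") != "MIBC":
        errors.append("disease stage must be MIBC")
    if "10x" not in metadata.get("magnifications", ()):
        errors.append("10x magnification not available")
    if not image_ok:
        errors.append("image corrupted or missing")
    elif n_good_tiles < min_tiles:
        errors.append("insufficient high-quality tiles to perform a prediction")
    return errors
```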

B. Migration of device to deployment platform

After development, packaging for deployment (i.e., Docker), and software testing under design controls, the device was shared with our deployment partner to embed in their cloud platform. First, the deployment partner ran the device on their cloud platform to evaluate its fidelity and ensure that the performance metrics agreed with those obtained during software verification. Then, the cloud platform was integrated with the clinical workflow by connecting the central laboratory sites to Amazon Web Services S3 data ingestion buckets, and through a web portal that displays the FGFR device predictions to investigator sites around the world participating in the trial.

C. Retrospective validation

As mentioned in the Data section, a retrospective data set comprising 350 representative H&E histopathology images (150 FGFR+ and 200 FGFR– samples, sized to achieve 93% power to detect a 10% difference in sensitivity using a two-sided exact test with 5% type I error) was designated for the planned retrospective design validation phase. These samples were not used to train the algorithm and were not accessible during development (tuning/training/initial testing). The sensitivity and specificity of the FGFR Device were assessed using the QIAGEN molecular test as the reference standard.

The acceptance criteria to determine whether the FGFR device would be deemed suitable for prospective design validation were defined by stakeholders as follows: if the point estimate (PE) of sensitivity was ≥90% with a lower bound (LB) of the 95% CI ≥ 80%, and the PE of specificity was ≥30% with an LB of the 95% CI ≥ 20%, the device would move forward to prospective validation. Otherwise, proceeding would require an active decision, and including analytical performance in the acceptance criteria of the prospective validation (section D) could be considered.
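As a sketch of the exact-interval arithmetic behind criteria of this kind, an exact (Clopper-Pearson) 95% CI can be computed by bisection on the binomial tail using only the standard library; the thresholds below are from the text, while the function names and the illustrative confusion-matrix counts in the usage are assumptions:

```python
from math import comb

def tail_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) CI for a binomial proportion,
    found by bisection on the monotone binomial tail."""
    def bisect(cond):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if cond(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(lambda p: tail_ge(n, k, p) > alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: tail_ge(n, k + 1, p) >= 1 - alpha / 2)
    return lower, upper

def passes_retrospective(tp, fn, tn, fp):
    """Check the stated criteria: sensitivity PE >= 90% with 95% CI
    LB >= 80%, and specificity PE >= 30% with 95% CI LB >= 20%."""
    sens, n_pos = tp / (tp + fn), tp + fn
    spec, n_neg = tn / (tn + fp), tn + fp
    sens_lb, _ = clopper_pearson(tp, n_pos)
    spec_lb, _ = clopper_pearson(tn, n_neg)
    return sens >= 0.90 and sens_lb >= 0.80 and spec >= 0.30 and spec_lb >= 0.20
```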

D. Deployment setting validation and full deployment

The goal of the Deployment Setting Validation was to assess and optimize workflow integration of the FGFR Device in the ANNAR study. The device was deployed in the ANNAR study in parallel to the QIAGEN molecular test. The results of the algorithm were not reported to investigators at this stage since the objective was to demonstrate clinical study workflow integration and concordance with the molecular test.

The device was run on ~1 month's worth of prospectively collected ANNAR samples transferred in real time (17 samples), in parallel with the molecular test, using a standardized workflow to mimic the full deployment workflow. The metrics captured were the percentage of images successfully completing the workflow and the turn-around time (TAT) from receipt of images to posting of results on the physician portal. The device was also run on supplemental retrospective samples to measure sensitivity and specificity as an exploratory analysis.

The acceptance criteria to proceed to full deployment were a TAT of <24 h for all samples on which the device was successfully run (i.e., a prediction or error was generated) and a holistic review of performance data on retrospective and prospectively collected datasets by internal stakeholders. The validation study was conducted under an Investigational Device Exemption (IDE) regulatory designation.
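The two captured workflow metrics and the TAT criterion can be sketched as below; the record fields are illustrative:

```python
def deployment_metrics(records):
    """Compute the percentage of images successfully completing the
    workflow and whether every sample on which the device ran
    (prediction or error generated) met the <24 h TAT criterion.
    Record field names are illustrative."""
    ran = [r for r in records if r["completed"]]
    pct_complete = 100.0 * len(ran) / len(records)
    tat_ok = all(r["tat_hours"] < 24 for r in ran)
    return pct_complete, tat_ok
```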

Under the intended use of the device for full deployment, a patient first underwent screening with our image-based device prior to undergoing molecular testing. Upon receiving the results of the image-based screening, the physician had the choice to stop the molecular testing. Enrollment into subsequent, interventional clinical trials was contingent on the confirmed FGFR+ status, based on molecular testing.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


