BlurryScope enables compact, cost-effective scanning microscopy for HER2 scoring using deep learning on blurry images



Design of BlurryScope

BlurryScope is designed to be a fast, compact, and cost-effective imaging system. The optical architecture includes components adapted from a dismantled M150 Compound Monocular AmScope brightfield microscope with an RGB CMOS camera, integrated into a custom 3D-printed framework. The system is driven by three stepper motors for precise stage movement, achieving a stable lateral scanning stage speed of 5000 µm/s with a 10× (0.25NA) objective lens. The other main optical components include the condenser and LED illumination. The structural parts were printed using SUNLU PLA+ filament, maintaining the total component cost under $650 for low-volume manufacturing (see Table 1 and Methods section, ‘BlurryScope design and assembly’).

Table 1 Specifications of BlurryScope by price (USD), speed, weight, and size

It is important to note that BlurryScope is not intended to replace traditional pathology scanners used in digital pathology systems. However, it should be considered a cost-effective alternative for routinely performing specialized inference tasks where trained neural networks can provide rapid, automated, and accurate information regarding tissue specimens, such as the HER2 score classification that is the focus of this work. The key performance trade-offs of BlurryScope involve concessions in resolution, signal-to-noise ratio (SNR), and the detection of smaller objects, prioritizing speed, affordability, and compact design. While these limitations may prevent its use as a standalone diagnostic tool, BlurryScope remains valuable as a complementary system, enabling preliminary assessments, assisting in triage, and expanding accessibility in settings where conventional high-end digital pathology scanners may be unfeasible or unavailable.

To shed more light on the specifications of BlurryScope, we report several parameters, including cost, speed, weight, and size, in Table 1. Traditional digital pathology scanners can perform diffraction-limited imaging of tissue specimens at extreme throughputs and form the workhorse of digital pathology systems; however, their versatility and powerful features come with significantly higher costs, with prices ranging from $70,000 to $300,000 [33], making them harder to scale up, especially in resource-limited environments. BlurryScope’s minimalist design (dimensions: 35 × 35 × 35 cm, weight: 2.26 kg) enables rapid imaging of samples while minimizing spatial constraints in laboratory environments.

HER2 IHC tissue imaging

To demonstrate the efficacy of BlurryScope, a total of 10 HER2-stained TMAs were used. The training and testing datasets consisted of 1144 and 284 unique patient specimens (tissue cores), respectively. Each patient sample was scanned three times (non-consecutively) to assess the repeatability of the approach, with a total duration of 5 min per scan (~2.7 mm²/s), which covers the entire section of the tissue microarrays of the slide, including the empty space between the cores. This extensive dataset allowed for a comprehensive evaluation of BlurryScope’s capabilities in automated HER2 scoring. The standard of comparison was the output of the same set of slides imaged with a state-of-the-art digital pathology scanner (AxioScan Z1, Zeiss) [34].

The stitch generated with a standard scanner (Supplementary Fig. 1a) has a clear and crisp delineation of all the cores since the scan undergoes a “stop-and-stare” operation. That is, the stage is physically halted for the duration of each camera acquisition. In contrast, Supplementary Fig. 1b demonstrates BlurryScope’s continuous scanning output by capturing images at a running lateral stage speed of 5000 µm/s in a zigzag fashion (Supplementary Fig. 2). This rapid acquisition introduces bidirectional motion blur artifacts. Though there is a widening and smudging of features due to the effect of motion blur, the individual cores are still fully separated in the final stitched mosaic, which allows for automated cropping and labeling of each patient tissue core.
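The zigzag acquisition described above can be illustrated with a minimal sketch. Only the 5000 µm/s stage speed comes from the text; the function names, the scan dimensions, and the row pitch are hypothetical, and turnaround time between rows is ignored.

```python
def zigzag_rows(width_um, height_um, row_pitch_um):
    """Return (start, end) points for each row sweep of a serpentine scan.

    Even-indexed rows run left to right and odd-indexed rows right to left,
    so neighboring rows are blurred in opposite directions.
    All dimensions are illustrative assumptions, not values from the paper.
    """
    rows = []
    y = 0.0
    index = 0
    while y <= height_um:
        if index % 2 == 0:
            rows.append(((0.0, y), (width_um, y)))
        else:
            rows.append(((width_um, y), (0.0, y)))
        y += row_pitch_um
        index += 1
    return rows


def sweep_time_s(rows, speed_um_per_s=5000.0):
    """Time spent sweeping all rows at constant speed (turnarounds ignored)."""
    total_um = sum(abs(end[0] - start[0]) for start, end in rows)
    return total_um / speed_um_per_s


rows = zigzag_rows(width_um=10000, height_um=2000, row_pitch_um=1000)
print(len(rows), sweep_time_s(rows))
```

Because the stage never stops, the sweep direction alternates from row to row, which is one way to picture why the resulting blur is bidirectional.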

The scanned tissue images corresponding to different HER2 scores (0, 1+, 2+, 3+) for individual patient cores are compared in Fig. 2. Figure 2a shows the results from a traditional pathology scanner, yielding sharp, well-defined images for each HER2 score. In contrast, Fig. 2b presents the results from BlurryScope, with images exhibiting opposing directions of blur. Despite the smearing of various details, some correspondence between the two sets of images is still discernible. Lower-scored HER2 images exhibit fewer brown hues and less geometrical heterogeneity compared with higher-scored ones. This suggests that HER2 classification may still succeed on such compromised data. It is important to note that the interpretation of IHC stains depends on both cellular location and intensity. Most IHC stains highlight the cell nucleus and have only two levels, distinguishing between positive and negative staining. HER2 staining is different: it follows a four-level scoring system (0 to 3+) and is evaluated based on membrane expression, with variable levels of stain intensity, rather than on nuclear staining. This distinction introduces greater inter- and intra-observer variability among pathologists, as the assessment depends on both stain intensity and continuity across the membrane. Additionally, the quantification of HER2 positivity is limited to invasive tumor cells, explicitly excluding carcinoma in situ, even when these cells exhibit a similar HER2 staining pattern. These aspects of the HER2 evaluation add further complexity to the interpretation process.

Fig. 2: Images of tissue cores with different HER2 scores.

a Images of tissue specimens with HER2 scores (0, 1+, 2+, 3+) obtained using a traditional digital pathology scanner, showing clear and well-defined cores for each HER2 score. b Same as a, except the images are obtained using BlurryScope, where the images exhibit smudged details.

Given the critical role of stain intensity and structural integrity/continuity in HER2 assessment, the preservation of these features under motion blur is essential for precise classification. In our previous work, GANscan [28], a conditioned generative adversarial network (GAN)-based deblurring approach, was used to reconstruct high-speed continuous-scan images of H&E-stained breast cancer tissue. The reconstructed images achieved an SSIM (structural similarity index measure) of 0.82 and a PSNR (peak signal-to-noise ratio) of 27 when compared to stop-and-stare control images, confirming our ability to restore fine tissue architecture and cellular morphology using trained neural network models. This approach successfully retrieved sub-micron structural details, including nuclear contours and tissue organization, which are critical for pathology assessments.

Automated classification of HER2 scores using BlurryScope images

Our data processing pipeline begins by automatically organizing the images of each patient sample into multi-scale stacks (Fig. 3). The process starts with scanning the biopsy slides and recording them in video format using BlurryScope. These BlurryScope videos are then processed through automated stitching and labeling algorithms, which seamlessly integrate the frames into a whole-slide image. Subsequently, the individual cores are arranged into a concatenated stack of subsampled and randomly cropped patches, ensuring that the image data is both precise and representative. The resulting data are then processed by a classification neural network, configured for either 4-class (0, 1+, 2+, 3+) or 2-class (0/1+ vs. 2+/3+) HER2 scoring (see Fig. 3a–e). This approach allows for the efficient handling of complex image data and ensures the repeatability of the classification process (see Methods for details on ‘BlurryScope image scanning, stitching, cropping, and labeling’).

Fig. 3: BlurryScope data processing pipeline.

a The data processing workflow of BlurryScope begins with the continuous video output of the scanned slides, followed by b automated stitching and c labeling. d Images are then cropped and concatenated into a stack of subsampled patches. e These image patches are then processed by deep learning-based classification networks. Scale bar 200 µm.

Upon finalizing both of the HER2-score classification networks (see Methods for implementation details), we ran our trained models on the blind test sets imaged by BlurryScope, covering N = 284 unique patient specimens/cores never seen before in the training phase. Since each slide was scanned three times, we were able to use this extra data to improve final accuracy results; see Figs. 4 and 5. These multiple scans also enabled us to assess the consistency of HER2 classification results across repeated measurements for the same tissue core. We quantified the degree of variability that might arise from factors such as slide insertion, alignment differences, and potential fluctuations in the scanning process itself. To achieve this, we calculated the prediction consistency for each core by comparing the classification results across the three scans. Specifically, for each core, we identified the most frequently occurring prediction category (i.e., the mode) among the three scans and then determined the proportion of predictions that matched this mode. The results revealed an overall consistency of 86.2% across all scanned cores, demonstrating a high level of repeatability in BlurryScope’s classification performance. As displayed in a bar graph of prediction consistency for each core (see Supplementary Fig. 3), the majority of the cores exhibit strong consistency, where at least two out of three results have the same score, though some variability is present. This suggests that, while the model performs reliably for most samples, there are still certain cores where predictions are less stable, possibly due to factors like slide placement or operational conditions.
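The per-core consistency measure described above (the share of the three predictions that agree with their mode) can be sketched as follows. The function names are hypothetical, and averaging the per-core values into an overall figure is an assumption, since the text does not state how the 86.2% was aggregated.

```python
from collections import Counter


def core_consistency(predictions):
    """Fraction of a core's repeated predictions that match their mode."""
    counts = Counter(predictions)
    mode_count = max(counts.values())  # size of the most frequent category
    return mode_count / len(predictions)


def overall_consistency(per_core_predictions):
    """Mean of the per-core values (the aggregation is an assumption)."""
    values = [core_consistency(p) for p in per_core_predictions]
    return sum(values) / len(values)


# Hypothetical predictions for three cores, each scanned three times.
scans = [["2+", "2+", "2+"], ["1+", "1+", "0"], ["3+", "2+", "1+"]]
print([core_consistency(p) for p in scans])
print(overall_consistency(scans))
```

With three scans per core, the measure can only take the values 1/3, 2/3, or 1, which is why agreement of at least two out of three results is a natural cut-off for "strong consistency".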

Fig. 4: Testing accuracy as a function of the confidence threshold.

a Testing accuracy and indeterminate percentage for the 4-class HER2 classification system with 3N samples. b Testing accuracy and indeterminate percentage for the 4-class HER2 classification system with the highest CI. c Testing accuracy and indeterminate percentage for the 4-class system with the CI-weighted method. d Testing accuracy and indeterminate percentage for the 2-class system with 3N samples. e Testing accuracy and indeterminate percentage for the 2-class system with the highest CI. f Testing accuracy and indeterminate percentage for the 2-class system with the CI-weighted method. Gray dashed lines refer to a 15% indeterminate rate.

Fig. 5: Confusion matrices and classification accuracy.

a Confusion matrix for the 4-class HER2 classification network for all scans. b Confusion matrix for the 4-class HER2 classification network with the highest CI scores. c Confusion matrix for the 4-class HER2 classification network with average CI scores. d Confusion matrix for the 2-class HER2 classification network for all scans. e Confusion matrix for the 2-class HER2 classification network with the highest CI scores. f Confusion matrix for the 2-class network with the average CI scores. N refers to the number of times each slide (or patient sample) was non-consecutively and separately scanned.

As detailed in the following analyses, three different distributions based on our triple measurements were evaluated for both HER2 classification networks: (1) total scans (3N), (2) maximum confidence interval (CI), and (3) average CI. Total scans include all the measured 3N images, while the highest CI method selects the result with the highest overall CI value from the three repeats, and the average CI method uses a CI-weighted calculation. This weighted CI calculation involves multiplying each score by its corresponding CI, summing the results, and rounding the final value (see “Methods” section, Sample preparation and dataset creation).
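The highest-CI and CI-weighted combinations of the three repeats can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, HER2 scores are encoded as integers 0 to 3, and the weighted sum is divided by the sum of the CI values before rounding. That normalization is not stated in the text; it is assumed here so that the result stays on the 0 to 3 scale.

```python
def highest_ci(scores, cis):
    """Return the (score, CI) pair of the repeat with the largest CI value."""
    best = max(range(len(scores)), key=lambda i: cis[i])
    return scores[best], cis[best]


def ci_weighted(scores, cis):
    """CI-weighted score: sum(score * CI) / sum(CI), rounded half up.

    Dividing by sum(CI) is an assumption made for this sketch.
    int(x + 0.5) is used instead of round() to avoid Python's
    round-half-to-even behavior; scores are non-negative.
    """
    weighted = sum(s * c for s, c in zip(scores, cis)) / sum(cis)
    return int(weighted + 0.5)


# Hypothetical core: scored 2+, 2+, 3+ across the three scans.
scores = [2, 2, 3]
cis = [0.6, 0.7, 0.9]
print(highest_ci(scores, cis))   # (3, 0.9)
print(ci_weighted(scores, cis))  # 2
```

The example shows how the two rules can disagree: the single most confident scan votes 3+, while the weighted combination of all three scans lands on 2+.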

One way to heighten the reliability of our BlurryScope-based HER2 classification system is to exclude results with low CI values from the final assessment. To evaluate the trade-off between the chosen CI threshold, accuracy, and the percentage of left-out (indeterminate) cases, we plotted their relationship for each data distribution and classification case. Figure 4 shows that, as expected, the accuracy increases with the CI threshold chosen, as does the number of patients left out as indeterminate cases. Figure 4a shows the testing accuracy and indeterminate percentages for the 4-class case with 3N samples, while Fig. 4b, c present the same relationship for the highest CI and average CI, respectively. These figures illustrate how the chosen CI threshold value begins to exclude indeterminate patients starting around the 50% CI value mark. Figure 4d displays the testing accuracy and indeterminate percentages for the 2-class network with 3N samples, while Fig. 4e, f present the same relationship for the highest CI and average CI.

In all these cases, both the HER2 classification accuracy and the number of indeterminate cases rise notably for CI thresholds above the 50% mark. A 5% improvement in HER2 classification accuracy in this range corresponds to a ~10% increase in the number of indeterminate cases. This suggests that once the CI threshold exceeds 50%, the user should be cautious about pursuing further improvements in accuracy, as they may result in substantial increases in dropout rates with indeterminate results. Overall, these analyses serve to illustrate that BlurryScope can achieve a high testing accuracy with a manageable percentage of indeterminate results.
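The accuracy versus indeterminate-rate trade-off can be reproduced in a few lines. The sketch below uses hypothetical function names and toy data, and it treats any case whose CI value falls below the threshold as indeterminate; the exact thresholding rule used in the study is not given in this section.

```python
def accuracy_at_threshold(predictions, labels, cis, threshold):
    """Return (accuracy on kept cases, indeterminate fraction).

    Cases with a CI value below the threshold are left out as indeterminate.
    Accuracy is None when no case survives the threshold.
    """
    kept = [(p, y) for p, y, c in zip(predictions, labels, cis) if c >= threshold]
    indeterminate = 1.0 - len(kept) / len(predictions)
    if not kept:
        return None, indeterminate
    correct = sum(1 for p, y in kept if p == y)
    return correct / len(kept), indeterminate


def threshold_for_rate(cis, target_rate=0.15):
    """CI threshold that leaves out at most target_rate of the cases."""
    ordered = sorted(cis)
    n_drop = int(target_rate * len(ordered))  # number of cases allowed to drop
    return ordered[n_drop]


# Toy data: four cases with predicted scores, reference scores, and CI values.
preds, labels, cis = [0, 1, 2, 3], [0, 1, 1, 3], [0.9, 0.8, 0.4, 0.7]
for t in (0.0, 0.5):
    print(t, accuracy_at_threshold(preds, labels, cis, t))
```

Sweeping the threshold over a grid and plotting both returned values against it gives curves of the kind summarized in Fig. 4, and `threshold_for_rate` picks the operating point for a chosen indeterminate rate such as 15%.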

The classification accuracies for both networks (4-class and 2-class HER2 inference) were also evaluated with confusion matrices, as shown in Fig. 5. We selected threshold CI values from the plots in Fig. 4 corresponding to an empirically chosen 15% indeterminate rate, indicated by the gray dashed lines. The confusion matrix for the 4-class HER2 score inference of all the acquired BlurryScope images (3N) has a testing accuracy of 75.3% based on a 15% indeterminate CI threshold. Confusion matrices were also generated for the highest and average CI scores (Fig. 5b, c), achieving HER2 score classification accuracies of 78.9% and 79.3%, respectively. Compared to automated HER2 classification results [34] using microscopic images from a standard digital pathology scanner, these numbers prove competitive in performance, lagging only by a margin of ~8–9%.

Figure 5d represents the confusion matrix of all tissue scans for the 2-class HER2 classification network, where the two lowest scores (0 and 1+) are merged into one group and the two highest scores (2+ and 3+) into another; within each of these groups, the distinctions are known to pathologists to be highly nuanced and often difficult to make. For this network, there is a markedly higher testing accuracy of 88.4% for a 15% indeterminate rate. When using the averaging CI method, the testing accuracy is slightly better, as shown in Fig. 5f, reaching an accuracy of 88.8%, and for the highest CI method, the accuracy increases even further to 89.7%. For this model, the lower-right sections of the confusion matrices, which represent correctly identified negative cases, consistently show higher values compared to the upper-left sections, where true positive cases are recorded. This suggests the model is better at correctly identifying negative cases, reflecting higher specificity. On the other hand, the relatively lower numbers for positive cases indicate that sensitivity is slightly lower, meaning the model misses more true positives. This observation is important to note because while the model effectively avoids false positives, it could potentially overlook some true positive cases, which would be critical to capture in medical diagnostics.
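For reference, sensitivity and specificity follow directly from the four cells of a 2-class confusion matrix. The sketch below uses made-up counts, not the values in Fig. 5, and the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Compute (sensitivity, specificity) from 2x2 confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of positive cases that are detected.
    specificity = TN / (TN + FP): share of negative cases correctly rejected.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts chosen only to illustrate a specificity-leaning model.
sens, spec = sensitivity_specificity(tp=40, fn=10, fp=5, tn=45)
print(sens, spec)
```

A model with the pattern described above would show a larger specificity than sensitivity, as in this toy example.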

The receiver operating characteristic (ROC) curves were also plotted for these 2-class cases (Supplementary Fig. 4) and demonstrate varying balances between sensitivity and specificity across different methods. The area under the curve (AUC) is a key metric used to evaluate the overall performance of an inference model, with higher AUC values indicating a better ability to distinguish between classes. The maximum CI method, with an AUC of 0.76, achieves the best performance, indicating a strong capability to maximize sensitivity while minimizing false positives. The absolute average CI distribution (not CI-weighted), with an AUC of 0.74, performs similarly, slightly trailing the maximum CI approach but still maintaining a favorable balance. Overall, the maximum CI approach emerges as the most effective, achieving a decent balance between specificity and sensitivity, as reflected by its higher AUC. These analyses and results collectively indicate that BlurryScope is a promising digital imaging platform for quick inference of tissue biomarkers to potentially prioritize urgent cases or to streamline pathologists’ busy workflow.
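As a reminder of what the AUC measures, it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The pairwise sketch below illustrates that definition on toy data; it is not the procedure used to produce Supplementary Fig. 4, and the function name and inputs are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive (1) and negative (0) cases."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: one positive case is ranked below one negative case.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives context for the reported values of 0.74 and 0.76.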


