Improving breast cancer screening workflow using machine learning

Study 1: Standalone performance and integration feasibility

The initial study was divided into two phases. In the first phase, we conducted a large-scale, multicenter, retrospective evaluation of the standalone performance of the AI system. In the second phase, we conducted a prospective, non-interventional implementation study to assess the feasibility and challenges of integrating the live system into real-world clinical workflows.

Phase 1: Multicenter standalone performance evaluation

The first, retrospective phase included mammograms from 125,000 women (115,973 after applying inclusion/exclusion criteria) who were screened at five NHS screening services in England. These services cover three different clinical workflows, which differ in whether the second reader is blinded to the first reader’s opinion and in how cases are selected for arbitration (see image below). AI operating points (the thresholds that determine how conservatively the AI flags cases) were set separately for each screening service to account for regional differences in screening populations and workflows.
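To make the idea of a per-service operating point concrete, here is a minimal Python sketch. It is not the study's actual procedure: the selection rule (the lowest threshold that reaches a target specificity), the function names, and the `(service, score, label)` case format are all assumptions made for illustration.

```python
from collections import defaultdict


def specificity_at(scores, labels, threshold):
    """Fraction of non-cancer cases (label 0) that fall below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one non-cancer case")
    return sum(s < threshold for s in negatives) / len(negatives)


def pick_operating_point(scores, labels, target_specificity):
    """Lowest observed score that, used as a threshold, meets the target.

    A case is flagged when its score is >= the threshold, so a lower
    threshold flags more cases (higher sensitivity, lower specificity).
    """
    for threshold in sorted(set(scores)):
        if specificity_at(scores, labels, threshold) >= target_specificity:
            return threshold
    return float("inf")  # no observed score meets the target: flag nothing


def operating_points_by_service(cases, targets):
    """Choose one threshold per screening service.

    cases   -- iterable of (service, ai_score, label) tuples (hypothetical format)
    targets -- dict mapping service name to its target specificity (assumed input)
    """
    grouped = defaultdict(lambda: ([], []))
    for service, score, label in cases:
        grouped[service][0].append(score)
        grouped[service][1].append(label)
    return {
        service: pick_operating_point(scores, labels, targets[service])
        for service, (scores, labels) in grouped.items()
    }
```

Because each service is tuned only on its own cases, two services with different score distributions can end up with different thresholds even when they share the same target.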

The study’s primary endpoint was the sensitivity and specificity of the AI system in detecting cancer, compared with the historical (original) first reader for each case. With a 39-month follow-up period, the study could use rigorous ground truth to examine the incremental benefit of the AI system in detecting interval and subsequent cancers well before they become clinically symptomatic. Beyond the primary endpoint, the study also compared the AI system with the second reader and the consensus reader, and included lesion-level localization (whether the correct abnormality in the breast was identified) and fairness analyses. The rigorous lesion-level analysis was included to check that the AI system accurately localizes regions of interest rather than relying on potentially spurious correlations. This phase was retrospective so that AI performance could be validated at scale; it did not involve gathering additional interpretations from human readers or any further development work.
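The primary-endpoint comparison can be illustrated with a short Python sketch. It assumes each case carries a binary ground-truth label (cancer confirmed within follow-up or not), a binary AI flag, and the historical first reader's recall decision. The function names and input format are hypothetical, and the sketch leaves out the statistical testing a real study would need.

```python
def sensitivity_specificity(flags, truth):
    """Return (sensitivity, specificity) for binary flags against ground truth."""
    tp = fn = tn = fp = 0
    for flagged, cancer in zip(flags, truth):
        if cancer and flagged:
            tp += 1  # cancer case that was flagged
        elif cancer:
            fn += 1  # cancer case that was missed
        elif flagged:
            fp += 1  # non-cancer case that was flagged
        else:
            tn += 1  # non-cancer case correctly left unflagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one cancer and one non-cancer case")
    return tp / (tp + fn), tn / (tn + fp)


def compare_to_reader(ai_flags, reader_flags, truth):
    """Differences (AI minus reader) in sensitivity and specificity."""
    ai_sens, ai_spec = sensitivity_specificity(ai_flags, truth)
    rd_sens, rd_spec = sensitivity_specificity(reader_flags, truth)
    return ai_sens - rd_sens, ai_spec - rd_spec
```

A positive first value means the AI detected a larger share of the confirmed cancers than the first reader; a positive second value means it flagged a smaller share of the non-cancer cases.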
