A generalizable deep learning regression model for automated glaucoma screening from fundus images


Study design

This study adheres to the STARD 2015 guidelines for the standardized reporting of evaluation of a diagnostic test, as well as to the tenets of the Declaration of Helsinki. The training material for G-RISK was retrospectively collected from the University Hospitals Leuven, and approved by the Ethics Committee Research UZ / KU Leuven under study number S60649. Informed consent was waived due to the retrospective nature of the research project, and all fundus images were deidentified before use. For informed consent of the data used for external testing, we refer to the administrators of the respective data sets.

Study population—Model development

Glaucoma detection was achieved using a custom ResNet-50 [43] CNN model described in our previous work [14], which focused on the explainability of the CNN in two glaucoma applications. In that study, 23,930 stereoscopic fundus images (12,265 eyes, 6486 individuals) were selected for training, validation, and internal testing. Fundus images were captured at the glaucoma department of the University Hospitals Leuven (UZL), Belgium, between 2010 and 2018. Hence, the majority of images feature signs of glaucoma. The inclusion criterion for this set was the availability of a matching 30° fundus photo (imaged with a Zeiss VISUCAM® at 1620 × 1444 pixels). Glaucoma diagnosis was based on evaluation by a glaucoma expert using perimetry, IOP, fundoscopy, and retinal imaging. This clinical evaluation included VCDR estimation during fundoscopy, which was selected as the reference risk label during G-RISK development. This continuous value between 0 and 1 was thresholded against a binary glaucoma ground truth to obtain glaucoma detection results. The benefits of using a continuous versus a binary target variable are well studied in the literature under the term soft labels. In glaucoma detection, an approach with soft labels allows the model to leverage the richer information of expert annotations during training. The CNN can grasp differences in disease severity, going from no cupping to an optic nerve that has completely cupped. In binary detection, both early symptoms (e.g. RNFL defect, notching, vessel baring) and extreme cupping are bundled in the glaucoma category, which does not accommodate the learning of intermediary severities.

To quantify the improved generalizability when using a regression approach, we also validated a binary classification CNN for glaucoma detection on two test sets. This CNN was trained in a similar setup, with the only changes being the glaucoma ground truth (defined by a glaucoma expert based on a multimodal exam), cross-entropy as the loss function instead of mean squared error, and a sigmoid activation instead of a linear activation at the end of the ResNet-50 architecture. It was described in detail in our previous work [14].

Study population—Model testing (external validation)

We evaluated our model using fundus images from two major population studies and eleven publicly available data sets. External fundus image data sets were eligible for evaluation given the following conditions: (1) availability of a (suspected) glaucoma label, and (2) majority (>50%) of images containing the optic nerve head (ONH). Both the imaging protocol and the definition of glaucoma varied considerably across the test sets.

BMES

The Blue Mountains Eye Study (BMES) is a large population-based study for ocular diseases held three decades ago in an urban area in Australia [2]. A total of 3654 individuals aged 49 or older participated in the eye examination from 1992 to 1994. Fundus images were captured using an analog Zeiss FF3 film camera with subsequent digitization of the images. Open-angle glaucoma (OAG) was diagnosed in case of (1) visual field loss on a Humphrey Field Analyzer 30-2 exam, (2) matching neuroretinal rim thinning, (3) VCDR exceeding or equal to 0.7, (4) asymmetric cupping between eyes (>0.3), and (5) gonioscopic results indicating no angle closure.

GHS

The Gutenberg Health Study (GHS) is a large population-based study held in mid-western Germany, with a baseline encompassing 15010 participants aged between 35 and 74 years [23]. 30° optic disc-centered images were collected using a Zeiss VISUCAM fundus camera. Glaucoma diagnosis was established using a modification of the International Society for Geographic and Epidemiological Ophthalmology (ISGEO) guidelines including disc size adjustment [44]. Final grading considered VCDR, asymmetric cupping between eyes, and rim narrowing (<10% of the corresponding disc diameter). ISGEO grading was available for at least one eye in 12089 individuals examined at baseline.

AIROGS

The Rotterdam EyePACS AIROGS data set consists of 113893 fundus images of 60357 individuals who visited numerous centers of the EyePACS network in the United States [45,46,47]. The training set of 101442 images was made available in late 2021 in the context of an international challenge on glaucoma detection from fundus images. The optic discs in the fundus images were assessed by a team of 22 glaucoma experts (at least two graders per image), who had a sensitivity of at least 80% and a specificity of 95%. Referable glaucoma was defined using ten structural features or biomarkers, and assigned when the annotator expected corresponding visual field damage.

ORIGA

The Online Retinal Fundus Image Database for Glaucoma Analysis and Research (ORIGA) contains 650 randomly selected images from the Singapore Malay Eye Study (SiMES), a population-based study conducted between 2004 and 2007 [48]. The glaucoma labeling procedure was not defined. Images were captured at a wider angle than 30° using an unspecified camera device.

REFUGE1

The Retinal Fundus Glaucoma Challenge (REFUGE) was held at MICCAI 2018 to provide a unified evaluation framework for objective comparison of glaucoma detection models using fundus images [49]. Of the 1200 images, 400 were captured with a Zeiss VISUCAM and the remaining 800 with a Canon CR-2 at a glaucoma clinic located in China. All images are macula-centered at a 45° viewing angle. The glaucoma reference standard was obtained after a multimodal assessment of clinical records, including IOP, OCT, visual fields, and follow-up examinations. In total, 120 cases of the data set are glaucomatous (POAG or NTG), representing 10% of the data.

LAG data

The large-scale attention-based glaucoma detection database (LAG) consists of 4854 fundus images sourced from a Chinese hospital [16]. The reference standard was established using IOP, visual field exams, and manual ONH assessment by qualified specialists. Glaucoma was diagnosed in 1711 images, representing 35% of the data set. All images contain a visible ONH and were captured using an unspecified mix of fundus cameras at varying angles. Given the inconsistent image-altering procedure the data set creators used, it is impossible to use the disc ratio as a proxy for correct 30° cropping.

ODIR

The Ocular Disease Intelligent Recognition (ODIR) challenge was organized in 2019 to stimulate research on multi-disease classification from fundus images [50]. The complete set encompasses 10000 images from 5000 patients (one image per eye), of which 7000 are currently available to download. Macula-centered images were captured using different devices from manufacturers such as Canon, Zeiss, and Kowa. Next to glaucoma cases (4.7%), expert-annotated labels exist for diabetic retinopathy, cataract, age-related macular degeneration, hypertension, and myopia.

REFUGE2

Following the success of the first REFUGE challenge in 2018 [17], the organizers held a second edition as part of MICCAI 2020 [49]. In a similar setup, 800 images were added to the data set. The new fundus images had been acquired using fundus cameras manufactured by Kowa (validation) and Topcon (test).

GAMMA

The Glaucoma Grading from Multi-Modality Images (GAMMA) challenge invited participants to develop and validate models for glaucoma detection using fundus images and OCT scans [51]. The available training data contains 50 non-glaucoma cases, 25 cases with early glaucoma, and 25 cases featuring mild or advanced glaucoma. Similar to the REFUGE data, specialists assigned the glaucoma reference standard based on fundus photography, IOP, VF, and OCT.

RIM-ONEr3

The Retinal IMage databases for Optic Nerve Evaluation (RIM-ONE), first shared in 2011, were initially intended to evaluate algorithms for optic disc segmentation [52]. The third revision in 2015 contains 85 images of healthy eyes and 74 images of glaucoma patients. Images were captured using a Kowa WX 3D stereo fundus camera at a single center in Spain. The FOV spans 20° horizontally and 27° vertically.

RIM-ONE DL

In 2020, the creators of the RIM-ONE data sets released an updated version of their fundus images to evaluate deep learning algorithms for glaucoma detection [53]. All images were re-evaluated by two experts; they originated from different hospitals and were captured with different cameras. The total set encompasses 313 non-glaucoma fundus images and 172 fundus images with confirmed glaucoma (photo evaluation by a glaucoma expert). The images are characterized by a standardized cropping operation around the optic disc.

ACRIMA

In total, 705 images of the ACRIMA project, funded by the government of Spain for automated retinal disease assessment, were made available in 2019 [54]. Images were captured with a Topcon TRC fundus camera at a 35° FOV. Images were labeled for glaucoma by two experts and cropped around the optic disc using a bounding box of 1.5× the optic disc radius. Notably, the glaucoma images are characterized by a larger image size than the non-glaucoma images.

PAPILA

Recently made available to the research community, PAPILA is the first data set providing color fundus images and clinical data of both eyes of the same study participant. Being able to use the joint information of both eyes for glaucoma detection approaches real-life screening scenarios. PAPILA consists of 488 fundus images belonging to 244 individuals, captured with a non-mydriatic Topcon TRC-NW400 device with an FOV of 30°. The glaucoma ground truth label is presented in three categories: glaucomatous, non-glaucomatous, and suspect, based on the evaluation of clinical data by trained ophthalmologists. All images contain the optic disc, with expert segmentation of disc and cup provided.

Image quality control

Image quality was assessed through the segmentation of the ONH using a generalizable CNN developed and validated in our previous work [14]. If a ground truth ONH segmentation mask was available in the data set, this step was skipped (ORIGA, REFUGE1, GAMMA, RIM-ONEr3, and PAPILA). The CNN-generated optic disc segmentation image was tested against two criteria for a realistic optic disc. First, the vertical optic disc size per object candidate in the segmentation image was divided by the image height to obtain a disc ratio. This disc ratio should be between 0.10 and 0.40 for images with a FOV of at least 30°. Next, the optic disc candidate was selected based on the first Hu moment [55], a value that is invariant to translation, scale, and rotation and equals 0.159 (1/(2π)) when the shape is a perfect circle. The candidate with the Hu moment closest to 0.159 was selected to discard oblong, non-circular segmented objects. The image was discarded from the analysis if no candidate matched the criteria. There was no human intervention in this automated process. Supplementary Fig. 1 describes the removal rate per data set.
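The two quality criteria can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each candidate object is already available as a separate binary mask (rows of 0/1 values with the full image height), and the function names `disc_ratio`, `first_hu_moment`, and `select_disc` are hypothetical.

```python
import math


def disc_ratio(mask):
    """Vertical extent of the candidate divided by the image height."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return 0.0
    return (max(rows) - min(rows) + 1) / len(mask)


def first_hu_moment(mask):
    """First Hu moment (eta20 + eta02) of a binary mask, or None if empty."""
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    area = len(pixels)
    if area == 0:
        return None
    cx = sum(x for x, _ in pixels) / area
    cy = sum(y for _, y in pixels) / area
    mu20 = sum((x - cx) ** 2 for x, _ in pixels)
    mu02 = sum((y - cy) ** 2 for _, y in pixels)
    # Normalised central moments of order 2 divide by area ** 2.
    return (mu20 + mu02) / area ** 2


def select_disc(candidates, low=0.10, high=0.40):
    """Return the most circular candidate with a plausible disc ratio.

    Returns None when no candidate passes, i.e. the image is discarded.
    """
    circle = 1 / (2 * math.pi)  # about 0.159 for a perfect circle
    valid = [m for m in candidates if low <= disc_ratio(m) <= high]
    if not valid:
        return None
    return min(valid, key=lambda m: abs(first_hu_moment(m) - circle))
```

In this sketch a rasterized disc yields a first Hu moment close to 0.159, while an elongated object yields a clearly larger value, which is why the distance to 0.159 works as a circularity filter.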

Image transformation to 30° disc-centered fundus image: original FOV exceeding 30°

Each image with a CNN-detected or human-verified optic disc underwent multiple processing steps to minimize the covariate shift between the external and original training data. First, the image underwent a 30° cropping/extension operation centered on the localized optic disc following ONH segmentation. The original FOV per data set could be determined from the optic disc size relative to the vertical image dimension (disc ratio) or from the information present in the data set description. In the development set, which contains exclusively 30° disc-centered images, the disc ratio was equal to 0.23 averaged over 23930 images.

$$\text{crop factor} = \frac{\text{disc ratio}_{\text{original}}}{\text{disc ratio}_{30^\circ}} = \frac{\text{disc ratio}_{\text{original}}}{0.23}$$

Disc ratios were averaged per image size per data set. For a data set with fundus images featuring a 45° FOV, the average disc ratio will be around 0.15, which would imply a crop factor of 0.65. Using a uniform crop factor per data set is essential, as a per-image crop factor would remove the natural heterogeneity in optic disc size. Two data sets (ACRIMA, LAG) made it impossible to preserve this normal variation due to the cropping procedure already present in the original data. Therefore, they are marked with an asterisk in the results table. In data sets that feature multiple image sizes (AIROGS, ODIR, REFUGE1, REFUGE2), disc ratios were averaged per image size and set to the global data set average if there were fewer than ten cases of a specific image size. The crop factor was multiplied by the vertical image size to obtain the size of the 30° disc-centered image. Zero padding was applied to the cropped image if the disc-centered crop exceeded the original image boundaries in a specific direction, as can be expected in macula-centered images where the ONH is situated at the image border. We analyzed the importance of the proposed 30° disc-centered image cropping through a sensitivity analysis on REFUGE1 data and a random 10% subset of AIROGS data. These sets feature multiple image dimensions, next to a well-defined glaucoma label.
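As a rough illustration of the crop factor and the zero-padded, disc-centered crop, consider the following Python sketch. It makes several assumptions that the text does not state: the crop is square, the image is a nested list of single-channel pixel values, and the helper names `crop_factor`, `crop_box`, and `crop_with_zero_padding` are hypothetical.

```python
def crop_factor(disc_ratio_original, disc_ratio_30=0.23):
    """Ratio between the data set's average disc ratio and the 30° reference."""
    return disc_ratio_original / disc_ratio_30


def crop_box(image_height, disc_center, factor):
    """Top-left corner and side length of a square crop centered on the disc.

    The side length is the crop factor times the vertical image size.
    A square crop is an assumption of this sketch.
    """
    side = round(factor * image_height)
    center_x, center_y = disc_center
    return center_x - side // 2, center_y - side // 2, side


def crop_with_zero_padding(image, left, top, side, fill=0):
    """Crop a nested-list image; positions outside the image become `fill`."""
    height, width = len(image), len(image[0])
    return [
        [
            image[y][x] if 0 <= y < height and 0 <= x < width else fill
            for x in range(left, left + side)
        ]
        for y in range(top, top + side)
    ]
```

For the 45° example above, `crop_factor(0.15)` is roughly 0.65, so an image that is 100 pixels high would be cropped to a 65-pixel window around the optic disc, with zeros filling any part of the window that falls outside the original image.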

Image transformation to 30° disc-centered fundus image: original FOV smaller than 30°

Some data sets feature images with smaller FOV values (RIM-ONEr3, LAG), or were cropped around the optic disc (ACRIMA, RIM-ONE DL). In this case, image extension (padding) was applied to ensure correct optic disc scale and lighting correction. This was done by copying the original image’s border values in both height and width directions until the average disc ratio equaled 0.23. After lighting correction, the image area with copied values (synthetic image information) was replaced by black pixels prior to G-RISK evaluation. See Supplementary Fig. 2 for an example of the proposed image extension procedure.
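The extension step can be sketched in Python as follows. This is only an illustration under assumptions not given in the text: the padding is applied symmetrically around the image (rather than around the disc), the image is a nested list of single-channel values, and the names `padding_for_target_ratio`, `extend_with_border_copy`, and `blacken_synthetic` are hypothetical. The lighting correction that happens between extension and blackening is not shown.

```python
def padding_for_target_ratio(image_height, disc_ratio, target=0.23):
    """Pixels to add on each side so that the disc ratio drops to `target`."""
    target_height = image_height * disc_ratio / target
    return max(0, round((target_height - image_height) / 2))


def extend_with_border_copy(image, pad):
    """Extend an image by `pad` pixels per side, repeating the border values.

    Returns the extended image and a same-sized map that is True wherever
    the pixel is synthetic (copied) rather than original.
    """
    height, width = len(image), len(image[0])
    extended, synthetic = [], []
    for y in range(-pad, height + pad):
        source_y = min(max(y, 0), height - 1)
        row, flags = [], []
        for x in range(-pad, width + pad):
            source_x = min(max(x, 0), width - 1)
            row.append(image[source_y][source_x])
            flags.append(not (0 <= y < height and 0 <= x < width))
        extended.append(row)
        synthetic.append(flags)
    return extended, synthetic


def blacken_synthetic(image, synthetic, black=0):
    """Replace synthetic pixels by black after lighting correction."""
    return [
        [black if flag else value for value, flag in zip(row, flags)]
        for row, flags in zip(image, synthetic)
    ]
```

Keeping the synthetic-pixel map alongside the extended image is a design choice of this sketch; it makes it possible to blacken exactly the copied area once the lighting correction has been applied.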

Further processing

Processed images were subjected to a filtering operation to counter unequal lighting due to the curvature of the retina [56]. Finally, images were resized to 512 × 512 pixels with three RGB color channels, and pixel values were divided by 255 to match the input requirements of the trained G-RISK model. All image operations per data set are explained and visualized in detail in Supplementary Fig. 2.

Evaluation procedure

All predictions by G-RISK were evaluated against the reference glaucoma label using thresholding. The area under the receiver operating characteristic (ROC) curve (AUC) was selected as the primary performance metric, accompanied by balanced sensitivity and specificity, obtained at the threshold that minimizes the difference between the two. This balanced operating point was selected as the costs associated with false positives (FP) and false negatives (FN) can vary depending on the deployment setting. For the three data sets that featured a prevalence that approaches general population scenarios (BMES, GHS, and AIROGS), additional sensitivities were reported at 90%, 95%, and 97.5% levels of specificity. This choice was motivated by the importance of specificity in the context of glaucoma screening. There is a general consensus that specificity should be as high as possible, to prevent a large inflow of individuals who do not actually have the disease. Additionally, predictions were thresholded at a fixed value of 0.7 to assess glaucoma detection performance uniformly across data sets. The value of 0.7 was selected as this is a common VCDR threshold for glaucoma detection. Evaluation was also conducted at participant level for the two population cohorts (BMES and GHS) and the publicly available PAPILA set, as glaucomatous damage can be unilateral in a glaucoma patient. In order to mimic expert referral as closely as possible, the maximum predicted risk score of the two eyes (when available) was evaluated against the reference standard. 95% confidence intervals for AUC were computed using the fast DeLong algorithm [57]. All statistical analyses were performed using the SciPy Python library [58]. One exception to this is REFUGE2, for which the reference standard is currently not accessible to researchers. The AUC value for this set was retrieved from the online evaluation server hosted by the challenge organizers and through direct e-mail communication.
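The operating points described above can be illustrated with a small, dependency-free Python sketch. It is not the evaluation code used in the study: it assumes binary labels coded as 0/1, treats a score greater than or equal to the threshold as a positive prediction, and uses hypothetical function names. The AUC is computed through its rank-based definition (the fraction of glaucoma/non-glaucoma pairs ranked correctly, ties counting half), without confidence intervals.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold counts as positive."""
    true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return true_pos / n_pos, true_neg / n_neg


def balanced_operating_point(scores, labels):
    """Threshold minimizing |sensitivity - specificity|, with both values."""
    def gap(threshold):
        sens, spec = sens_spec(scores, labels, threshold)
        return abs(sens - spec)

    best = min(sorted(set(scores)), key=gap)
    return (best, *sens_spec(scores, labels, best))


def sensitivity_at_specificity(scores, labels, target_spec):
    """Highest sensitivity among thresholds reaching the target specificity."""
    feasible = []
    for threshold in set(scores):
        sens, spec = sens_spec(scores, labels, threshold)
        if spec >= target_spec:
            feasible.append(sens)
    return max(feasible, default=0.0)


def auc(scores, labels):
    """Probability that a positive case scores higher than a negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def participant_level(eye_scores):
    """Maximum predicted risk over the available eyes of each participant."""
    return {pid: max(values) for pid, values in eye_scores.items()}
```

The fixed 0.7 threshold corresponds to calling `sens_spec(scores, labels, 0.7)`, and the participant-level analysis to passing the output of `participant_level` through the same functions with one label per participant.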
For data sets that contained a VCDR ground truth label (REFUGE1, BMES, RIM-ONEr3, REFUGE2 test set, and PAPILA), we compared the performance of G-RISK with VCDR by thresholding the VCDR variable against the glaucoma ground truth. Furthermore, we report on the association between G-RISK predictions and clinical metadata including IOP, mean deviation of the visual field (MD), axial length, refractive error, and corneal thickness using the PAPILA data set. ROC curves were complemented with a calibration curve (10 bins) [59] and the histogram of predictions in the same plot. Results from related work on deep learning-based glaucoma detection and generalizability were included for comparison where possible (LAG, ACRIMA, REFUGE1 test set, REFUGE2 test set). To better understand the decision-making process of G-RISK, three independent glaucoma experts manually evaluated randomly selected false positives (n = 20) and false negatives (n = 20) of both the BMES and GHS data. In case there were fewer than 20 cases, all FP or FN were analyzed. Expert graders assessed image quality (good, poor, bad) and glaucoma status (no, suspect, definite), listed the reasons for glaucoma diagnosis, and indicated whether the processed image aided in their diagnosis. Cohen’s kappa coefficient (κ) was used to assess inter-grader agreement and agreement with the glaucoma ground truth. For all data sets with accessible ground truth labels and images, the three most extreme FP and FN were plotted (with and without overlaid saliency map). Saliency maps were generated using the gradient method provided by the iNNvestigate library v2.0.1 [60].
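For readers unfamiliar with Cohen's kappa, the following Python sketch shows the unweighted form of the statistic for two graders. Whether the study used an unweighted or weighted variant is not stated, so the unweighted form is an assumption here, and the function name `cohens_kappa` is hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    n = len(ratings_a)
    # Observed agreement: fraction of cases where both graders agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the graders' marginal category frequencies.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if expected == 1:
        # Both graders used a single identical category; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)
```

For example, the grades `["no", "no", "suspect", "definite"]` and `["no", "suspect", "suspect", "definite"]` agree in three of four cases, but because one quarter plus one sixteenth of agreement is expected by chance, kappa is lower than the raw agreement of 0.75.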


