Colon cancer diagnosis with explainable deep learning

In this section, we describe the dataset used and then report and discuss the quantitative and qualitative results. The dataset is freely available for research purposes on the Kaggle website at the following URL: https://www.kaggle.com/datasets/andrewmvd/lung-and-colon-cancer-histopathological-images/code. It contains two main directories: one with lung cancer images and the other with colon cancer images. Consistent with the topic of the paper, we consider only the colon cancer folder, which consists of two classes: benign tissue and adenocarcinoma (i.e., a binary classification task). As described in the Methods, the dataset was expanded to 5,000 histological images per class (adenocarcinoma and benign_tissue). For DL classification, the dataset was split 80-10-10 into training, validation, and test sets, respectively. The sample split is as follows:

80% of the images (8,000) are assigned to the training set.
10% of the images (1,000) are assigned to the validation set.
10% of the images (1,000) are assigned to the test set.
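
The split sizes above can be reproduced with a seeded shuffle. The following is a minimal sketch, not the code used in the study: the function name `split_dataset`, the seed, and the placeholder file names are illustrative assumptions, and the sketch does not stratify by class (the text does not state whether the original split was stratified).

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle items and split them into train/validation/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = int(round(len(items) * train_frac))
    n_val = int(round(len(items) * val_frac))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

# Hypothetical file names standing in for the 5,000 images per class.
paths = [f"{label}_{i}.jpeg"
         for label in ("benign_tissue", "adenocarcinoma")
         for i in range(5000)]
train, val, test = split_dataset(paths)
print(len(train), len(val), len(test))  # 8000 1000 1000
```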

Seven different deep learning architectures were considered during the training and testing phase: ResNet50²⁵, DenseNet²⁶, VGG19²⁷, Standard_CNN²⁸,²⁹, Inception-V3³⁰, EfficientNet³¹, and MobileNet³². The hyperparameters were set to 50 epochs, a batch size of 8, a learning rate of 0.0001, and an input image size of \(224 \times 224 \times 3\). This combination was selected by evaluating multiple hyperparameter combinations on the networks studied.
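
To make the training budget implied by these settings concrete, the sketch below collects them in a small configuration object and derives the number of weight updates. It reads the batch setting as a batch size of 8, which is an assumption, and the class name `TrainConfig` and its method are hypothetical rather than part of any framework.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the text (batch size of 8 is assumed)."""
    epochs: int = 50
    batch_size: int = 8
    learning_rate: float = 1e-4
    input_shape: tuple = (224, 224, 3)

    def steps_per_epoch(self, n_train: int) -> int:
        # One step processes one batch; the last batch may be partial.
        return math.ceil(n_train / self.batch_size)

cfg = TrainConfig()
steps = cfg.steps_per_epoch(8000)  # 8,000 training images
print(steps, steps * cfg.epochs)   # 1000 50000
```

Under these assumptions each network sees 1,000 batches per epoch, or 50,000 updates over the full run.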

We used binary cross-entropy as the loss function. Binary cross-entropy is designed for two-class classification problems in which each input belongs to exactly one of two mutually exclusive classes. It penalizes the divergence between the predicted class probabilities and the actual class labels, which makes it well suited to optimizing models that output class probabilities.
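
For reference, the standard definition over \(N\) samples with labels \(y_i \in \{0, 1\}\) and predicted probabilities \(p_i\) is \(L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]\). The sketch below implements this formula in plain Python; the clipping constant `eps` is an assumption added for numerical safety and is not taken from the paper.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident correct predictions are penalized far less than confident wrong ones.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right < confident_wrong)  # True
```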

All training and testing was performed in our working environment using an Intel Core i7 CPU with 16 GB RAM.

Table 2 reports the classification results of the networks in terms of accuracy, precision, recall, F-measure, AUC, and loss.

Table 2. Metric evaluation of the tested DL models.
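
The threshold-based metrics in Table 2 (accuracy, precision, recall, and F-measure) follow from the four confusion-matrix counts. The sketch below uses the standard definitions with adenocarcinoma taken as the positive class, which is an assumption; AUC and loss are not covered, and the counts are made up for illustration rather than taken from the table.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Made-up counts for a 1,000-image test set; not values from Table 2.
metrics = binary_metrics(tp=480, fp=10, fn=20, tn=490)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```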

In Table 2, two distinct architecture groups can be identified based on the metric results. The first group (VGG19, Standard_CNN, ResNet50, and DenseNet) shows poor results. These networks cannot be trusted to diagnose adenocarcinoma, as they fail to classify the images correctly and have a higher probability of error. They are therefore excluded from further analysis.

On the other hand, the second group of CNNs, namely EfficientNet, MobileNet, and Inception-V3, shows the best quantitative metrics, reaching almost 100% accuracy, precision, and recall. In other words, classification with these architectures yields a correct diagnosis for nearly all of the histological colon images. Moreover, these results support our choice not to apply any further pre-processing steps to the dataset, which keeps time consumption and computational cost to a minimum.

To highlight these results, Figure 3 shows the confusion matrix for the MobileNet network.

The matrix in Figure 3 confirms the good performance of the model: the high values on the main diagonal indicate that most samples are assigned to their correct class.
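
As an illustration of how such a matrix is built and read, the sketch below counts true/predicted label pairs and sums the main diagonal. The label encoding (0 for benign tissue, 1 for adenocarcinoma) and the toy labels are assumptions for illustration; they are not the data behind Figure 3.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels: 0 = benign tissue, 1 = adenocarcinoma (assumed encoding).
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2, 1], [0, 3]]

# Correct predictions lie on the main diagonal.
correct = sum(cm[i][i] for i in range(len(cm)))
print(correct, "of", len(y_true), "samples classified correctly")
```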

Figure 4 shows the trends of epoch accuracy and epoch loss for the MobileNet network.

Figure 4a shows good results in the training phase, with a slight drop in the validation phase (blue line). The training accuracy trend (red dotted line) shows that the MobileNet model was able to distinguish between images belonging to different classes. The losses in Figure 4 (training and testing) show the opposite behavior, providing further evidence that the model successfully learns the difference between cells of benign tissue and cells of adenocarcinoma. The loss curves converge to a relatively stable value over the epochs, which indicates that the model is learning the underlying patterns in the data and is neither over- nor under-fitting. In both plots the training and validation curves are closely aligned; ideally, the two curves should follow similar trends, and this alignment suggests that the model generalizes well to unseen data.
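
The two properties discussed here, convergence and alignment of the curves, can be checked numerically. The sketch below is one simple heuristic, not a procedure from the paper: the function names, the window of 5 epochs, and the tolerance are illustrative assumptions, and the curves are synthetic rather than the values plotted in Figure 4.

```python
def has_plateaued(values, window=5, tol=0.01):
    """True if the last `window` values vary by at most `tol`."""
    tail = values[-window:]
    return len(tail) == window and max(tail) - min(tail) <= tol

def final_gap(train_values, val_values):
    """Absolute difference between the two curves at the last epoch."""
    return abs(train_values[-1] - val_values[-1])

# Synthetic 50-epoch loss curves for illustration only.
train_loss = [0.5 / (epoch + 1) for epoch in range(50)]
val_loss = [0.5 / (epoch + 1) + 0.005 for epoch in range(50)]

print(has_plateaued(train_loss), has_plateaued(val_loss))  # True True
print(final_gap(train_loss, val_loss) < 0.01)              # True
```

A small final gap together with a plateau in both curves is consistent with the behavior described above; a validation loss that rises while the training loss keeps falling would instead point to overfitting.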

Qualitative Analysis

In this subsection, the qualitative results are presented and discussed.

For these results, a purely quantitative approach is not directly applicable, since the qualitative aspects are not tied to quantified measures but are based on descriptions of the heatmaps overlaid on the input images. Therefore, to perform this evaluation, we provide some guidelines in the Methods. After generating heatmaps for the three models and the three CAM algorithms considered, three different outcomes were obtained.

Inception-V3 is unable to generate heatmaps, which is typical behavior when the model does not recognize common patterns in the images. From a qualitative perspective, the model therefore cannot provide visual explanations.

The EfficientNet model generates heatmaps, but across the entire sample set the highlighted regions are identical, in this case concentrated on the right side, as shown in Figure 5.

This behavior occurs when the model learns a single pattern and repeats the same heatmap for all samples, regardless of the variations in the input image. The same heatmap is also produced by Score-CAM and FastScore-CAM. More generally, CAM algorithms rely on the feature representations learned by the neural network, which do not always match the subtle visual cues related to the presence of disease in medical images. If the model architecture or training data does not adequately capture the features indicative of disease, the heatmap generated by CAM may not accurately highlight the regions of interest. Since all networks were trained and tested on the same dataset with the best hyperparameter combination, the main differences evidently lie in the network architectures and the corresponding trained models. Moreover, it is important to remember that in medical image classification the same network does not necessarily perform well on all medical images or all diseases. Therefore, for each dataset and each classification task, an accurate comparison of CNNs is necessary.
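
A repeated heatmap of this kind can be detected automatically by comparing the maps produced for different samples. The sketch below is an illustrative check, not part of the paper's method: the function names and tolerance are assumptions, and heatmaps are represented as plain nested lists with toy values.

```python
def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two equally sized 2-D maps."""
    return max(abs(p - q)
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))

def heatmaps_are_constant(heatmaps, tol=1e-6):
    """True if every heatmap matches the first one within `tol`."""
    first = heatmaps[0]
    return all(max_abs_difference(first, h) <= tol for h in heatmaps[1:])

# Toy 2x2 heatmaps: one set repeats the same map, the other varies per sample.
same = [[[0.0, 1.0], [0.0, 1.0]] for _ in range(3)]
varied = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [0.2, 0.1]]]
print(heatmaps_are_constant(same), heatmaps_are_constant(varied))  # True False
```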

For MobileNet, the resulting heatmaps highlight ROIs (adenocarcinoma cell clusters) that correspond to the presence of disease, as shown in Figure 6.

Figure 6 shows the heatmaps of the same sample obtained with the three CAMs. The CAMs highlight three regions: top, right, and bottom. Changing the CAM algorithm changes only the intensity associated with these common patterns, which indicate the presence of tumor cell clusters. In this way, the heatmaps provide visual explainability and localization of the disease, improving reliability, credibility, and plausibility from a medical perspective.

Moreover, we attempt to quantify the qualitative results and improve the robustness of the model by introducing MR-SSIM. Table 3 shows the average similarity values between the Grad-CAM, Score-CAM, and FastScore-CAM heatmaps for each class, considering several heatmap sets and the three possible pairwise combinations.

Table 3 compares the heatmaps produced by the Grad-CAM, Score-CAM, and FastScore-CAM algorithms on the same model (MobileNet). The MR-SSIM index is high: 0.79 for the Grad-CAM/Score-CAM comparison and 0.76 for Score-CAM/FastScore-CAM. This means that the heatmaps generated by each pair of CAM algorithms are very similar and identify the same locations with little change in intensity.

Table 3. MR-SSIM of heatmap set.

When applying SSIM to the heatmaps of different CAM algorithms on adenocarcinoma biopsy images, the aim is to evaluate how well these algorithms highlight ROIs indicative of adenocarcinoma while preserving the structural details present in the original biopsy image. A higher SSIM value between two CAMs means that the two algorithms highlight the same regions (ROIs), which strengthens the visual interpretation.
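
To illustrate the kind of pairwise comparison involved, the sketch below computes a plain single-window (global) SSIM between heatmaps and evaluates it for each pair of CAM algorithms. This is a simplified stand-in: the MR-SSIM variant is defined in the Methods and is not reproduced here, standard SSIM is normally computed over local windows and averaged, and the function names and toy 2x2 heatmaps are assumptions. The constants 0.01 and 0.03 are the usual SSIM stabilizers.

```python
from itertools import combinations

def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM between two equally sized 2-D heatmaps (lists of rows)."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    if len(x) != len(y) or not x:
        raise ValueError("heatmaps must be non-empty and equally sized")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))

def pairwise_ssim(heatmaps):
    """SSIM for every pair of named heatmaps."""
    return {(n1, n2): global_ssim(heatmaps[n1], heatmaps[n2])
            for n1, n2 in combinations(heatmaps, 2)}

# Toy heatmaps with values in [0, 1]; not data from the study.
toy = {
    "Grad-CAM": [[0.9, 0.1], [0.2, 0.8]],
    "Score-CAM": [[0.8, 0.2], [0.1, 0.9]],
    "FastScore-CAM": [[0.1, 0.9], [0.8, 0.2]],
}
scores = pairwise_ssim(toy)
for pair, value in scores.items():
    print(pair, round(value, 3))
```

With three CAM algorithms this yields three pairwise scores; averaging such scores over a set of samples gives one value per pair.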
