Explainable deep learning approaches for high precision early melanoma detection using dermoscopic images


Dataset overview

For this research, we used two distinct dermatological datasets to train and evaluate our model. By keeping these datasets separate, we aimed to provide a comprehensive analysis that accurately reflects the performance of our model across different data sources.

The Melanoma Cancer Image Dataset [50], referred to as Dataset 1, comprises 2597 images. This dataset was curated to support dermatological research and computer-aided diagnosis. Each image is uniformly sized at 224 \(\times\) 224 pixels, providing consistent input for our deep learning model. The dataset combines malignant and benign lesions, making it a valuable resource for developing deep learning models aimed at early detection of melanoma and improved patient outcomes. The CNN for Melanoma Detection Data [51], referred to as Dataset 2, comprises 2081 images. Like Dataset 1, this collection focuses on the early detection and classification of melanoma.

Fig. 1 Images representing the benign classes from the datasets.

Fig. 2 Images representing the malignant classes from the datasets.

The inclusion of diverse skin lesion images makes this dataset crucial for advancing deep-learning models that aim to enhance diagnostic precision and reliability. Figure 1 shows examples from the benign class, and Fig. 2 illustrates samples from the malignant class, highlighting the diversity of lesions in our datasets. These figures are essential for understanding the dataset’s composition and the model’s training process.

Table 1 Dataset information.

Table 1 lists the datasets used in this study. We worked with two datasets, both containing benign and malignant classes. The overall image count for the Melanoma Cancer Image Dataset [50] is 13900, and that for the CNN for Melanoma Detection Data [51] is 10000.

Table 2 Total amount of images taken in datasets.

Each dataset was partitioned into training, validation, and test sets in consistent proportions to allow a fair assessment of our model. Specifically, 70% of the images from each class were allocated to training, and the remaining 30% were divided equally between the validation and test sets, each comprising 15% of the total images. We used a subset of each full dataset for our experiments because of the limited computational resources available on the Google Colab platform, specifically the T4 GPU. As shown in Table 2, the total number of images used was significantly smaller than the full dataset sizes listed in Table 1. This sampling approach was necessary so that the model could be trained and tested efficiently within these constraints while still providing reliable and meaningful results.
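
The 70/15/15 partition can be sketched as follows. This is a minimal illustration that assumes the split is applied per class (stratified); the function name `split_indices`, the seed, and the toy label list are hypothetical and not taken from the study.

```python
import random

def split_indices(labels, train=0.70, val=0.15, seed=0):
    """Stratified split of sample indices into (train, val, test) lists."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, val_idx, test_idx = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # shuffle within each class
        n_train = round(train * len(idx))
        n_val = round(val * len(idx))
        train_idx += idx[:n_train]
        val_idx += idx[n_train:n_train + n_val]
        test_idx += idx[n_train + n_val:]     # remainder goes to the test set
    return train_idx, val_idx, test_idx

# Toy labels (hypothetical): 100 benign and 100 malignant samples.
labels = ["benign"] * 100 + ["malignant"] * 100
tr, va, te = split_indices(labels)
print(len(tr), len(va), len(te))
```

Splitting inside each class keeps the benign-to-malignant ratio the same in all three sets.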

Fig. 3 Number of images in each class.

By carefully dividing these datasets, we ensured that our model was trained, validated, and tested on a diverse and representative sample of images, thereby enhancing its generalizability and robustness in real-world applications. The datasets and their respective splits are detailed in Fig. 3, which provides a clear overview of the data distribution used in this study.

Data preprocessing techniques

Data preprocessing is a critical initial step in our workflow, ensuring that raw dermoscopic images are standardized and optimized for input into the neural network. The key stages of the preprocessing pipeline are summarized in Table 3.

Table 3 Summary of dermoscopic image preprocessing.

Data augmentation approaches

To increase the variability of the training data and improve the model’s generalization ability, we applied several data augmentation techniques during the training phase. These augmentations were implemented using the Keras ImageDataGenerator and were applied only to the training set, while the validation and test sets remained unchanged. The key stages of the augmentation are summarized in Table 4.

Table 4 Summary of data augmentation approaches.
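
The idea of training-only augmentation can be illustrated with a small NumPy stand-in. This is not the Keras ImageDataGenerator pipeline used in the study: the specific transforms (random flips and a brightness factor between 0.9 and 1.1) and the function name `augment` are assumptions chosen for illustration, since the contents of Table 4 are not reproduced here.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rescale brightness of one H x W x C image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]              # vertical flip
    factor = rng.uniform(0.9, 1.1)         # hypothetical brightness range
    return np.clip(out * factor, 0.0, 1.0)

rng = np.random.default_rng(0)
train_batch = rng.random((4, 224, 224, 3))  # stand-in for training images
augmented = np.stack([augment(img, rng) for img in train_batch])
print(augmented.shape)
```

Validation and test images would bypass `augment` entirely, so the evaluation data stay unchanged.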

Proposed model

The Xception architecture was selected based on its efficient design and consistent performance during preliminary evaluation. Its depthwise separable convolutions support effective feature extraction while reducing computational complexity, making it well-suited for dermoscopic image analysis. Comparative models demonstrated less stable results or required extensive tuning to mitigate overfitting. This section describes the proposed model in detail, covering its architecture and the preprocessing techniques used to enhance its performance.

Structure of the proposed model

This study presents a robust neural network model for accurately classifying malignant skin diseases. The model relies on a full preprocessing and augmentation procedure. Key preprocessing steps include artifact removal, contrast enhancement, hair removal, median filtering, and resizing images to a uniform size of 224 \(\times\) 224 pixels. Our model architecture is built on a pretrained Xception network, followed by custom layers consisting of global average pooling, batch normalization, dropout, and dense layers with ReLU and Swish activation functions, incorporating L2 regularization. This configuration was designed to enhance feature extraction and improve classification performance, as illustrated in Fig. 4.

Fig. 4 Proposed model architecture with two classes.

Global average pooling

Global average pooling (GAP) was used to compress the spatial information of each feature map in a convolutional neural network (CNN) into a single value per feature map [52]. It computes the mean over all elements of the map, yielding a more compact representation.

The GAP operation is defined as:

$$\begin{aligned} \text {GAP}(x) = \frac{1}{H \times W} \sum _{i=1}^{H} \sum _{j=1}^{W} x_{ij} \end{aligned}$$

(1)

where \(x_{ij}\) is the element in the \(i\)-th row and \(j\)-th column of the feature map \(x\), and \(H\) and \(W\) are the height and width of the feature map, respectively.
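
As a minimal NumPy sketch of Eq. (1), the snippet below averages each channel of a small tensor over its spatial axes. The (height, width, channels) layout and the toy values are assumptions made for illustration.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each feature map over height and width: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

# Toy input: a 2 x 2 spatial grid with 3 feature maps.
x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
pooled = global_average_pooling(x)
print(pooled)  # [4.5 5.5 6.5]
```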

Batch normalization

Batch normalization layers were used to stabilize and accelerate training [53]. The method normalizes the activations of a layer over each mini-batch, reducing internal covariate shift.

Batch normalization is defined as:

$$\begin{aligned} \text {BN}(x) = \gamma \cdot \frac{x - \mu }{\sqrt{\sigma ^2 + \varepsilon }} + \beta \end{aligned}$$

(2)

where \(x\) is the input, \(\mu\) and \(\sigma ^2\) are the mean and variance of the input, \(\gamma\) and \(\beta\) are the learnable scale and shift parameters, and \(\varepsilon\) is a small constant that prevents division by zero.
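
Eq. (2) can be sketched in NumPy as below. The snippet shows the training-time behavior, where the mean and variance come from the current mini-batch; at inference, frameworks typically substitute running averages. The batch shape and the value of `eps` are illustrative assumptions.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mu = x.mean(axis=0)    # per-feature mini-batch mean
    var = x.var(axis=0)    # per-feature mini-batch variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(16, 4))   # batch of 16 samples, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))
```

With \(\gamma = 1\) and \(\beta = 0\), each output feature has approximately zero mean and unit standard deviation.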

Dense layers with ReLU activation and L2 regularization

ReLU activation is applied in these dense layers together with L2 regularization. L2 regularization, also known as weight decay, imposes a penalty on the magnitude of the network’s weights. By discouraging overly complex models in favor of simpler ones, it helps mitigate overfitting [54]. Because the L2 term is added to the network’s loss function, overly large weights are penalized, which encourages models that generalize better to unseen inputs.

  • 1. Dense Layer: A fully connected layer is a neural network layer in which each neuron is linked to every neuron in the preceding layer, allowing for comprehensive information exchange.

  • 2. ReLU Activation: The activation function used is the Rectified Linear Unit (ReLU), which is defined as:

    $$\begin{aligned} \text {ReLU}(x) = \max (0, x) \end{aligned}$$

    (3)

    The ReLU function introduces nonlinearity by passing the input through unchanged if it is positive and outputting zero otherwise.

  • 3. L2 Regularization: A strategy for mitigating overfitting that adds to the loss a regularization term proportional to the sum of the squared weights. The L2 regularization term is defined as follows:

    $$\begin{aligned} \text {L2}\_\text {Regularization} = \lambda \sum _i W_i^2 \end{aligned}$$

    (4)

    where \(\lambda\) is the regularization parameter and \(W\) represents the weights of the dense layer.
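
The three components above can be combined in a short NumPy sketch of Eqs. (3) and (4). The weights, input, and the value \(\lambda = 0.01\) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def relu(x):
    """Eq. (3): element-wise max(0, x)."""
    return np.maximum(0.0, x)

def dense_relu(x, W, b):
    """Fully connected layer followed by ReLU."""
    return relu(x @ W + b)

def l2_penalty(W, lam):
    """Eq. (4): lambda times the sum of squared weights."""
    return lam * np.sum(W ** 2)

x = np.array([[2.0, -1.0]])
W = np.array([[1.0, -1.0],
              [0.5, 2.0]])
b = np.zeros(2)

h = dense_relu(x, W, b)            # pre-activation is [1.5, -4.0]
penalty = l2_penalty(W, lam=0.01)  # 0.01 * (1 + 1 + 0.25 + 4)
print(h, penalty)
```

During training, `penalty` would be added to the classification loss so that large weights raise the total loss.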

Dense layers with Swish activation

The Swish activation function helps a dense layer model complex patterns [55]. A dense layer with Swish activation is defined as follows:

1. Swish Activation: The Swish activation function, applied to the output of the dense layer, is defined as:

$$\begin{aligned} \text {Swish}(x) = x \cdot \sigma (x) \end{aligned}$$

(5)

where \(\sigma (x) = 1 / (1 + e^{-x})\) is the sigmoid function.
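
A minimal NumPy sketch of Eq. (5) follows; the sample inputs are arbitrary. Unlike ReLU, Swish is smooth and returns small negative values for negative inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    """Eq. (5): x multiplied by sigmoid(x)."""
    return x * sigmoid(x)

x = np.array([-2.0, 0.0, 2.0])
out = swish(x)
print(out)
```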

Dropout regularization

Overfitting is a common problem in complex neural network architectures, particularly when training data are limited [56]. To address it, our model integrates dropout regularization, a method designed to improve generalization by preventing excessive reliance on any single neuron. During training, dropout randomly deactivates a fraction of the neurons:

$$\begin{aligned} h' = h \odot d \end{aligned}$$

(6)

where \(h\) is the activation vector from the previous layer, \(d\) is a binary mask sampled from a Bernoulli distribution with retention probability \(p\), and \(\odot\) denotes the element-wise product, which discards a random fraction of the activations. This prevents excessive reliance on particular units and improves robustness and performance on unseen data. The retention probability \(p\) is a hyperparameter that is commonly chosen by cross-validation. Dropout rates, given by \(1 - p\), are typically set between 0.2 and 0.5. Excessively high rates can cause underfitting, while rates that are too low may fail to curb overfitting. During inference, dropout is disabled, and the activations are multiplied by the retention probability \(p\) to remain consistent with the training phase:

$$\begin{aligned} h_{\text {inference}}' = p \cdot h \end{aligned}$$

(7)

In the model architecture, dropout layers are placed after every dense layer except the final prediction layer, which generates the class probability distribution. This form of regularization is particularly advantageous for small training datasets. It encourages feature representations that are more distributed and robust to variations, as demonstrated in several studies on dropout approaches such as DropBlock, AutoDrop, and Cutout.
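
Eqs. (6) and (7) can be sketched in NumPy as follows. The retention probability \(p = 0.8\) and the all-ones activation vector are illustrative assumptions; the point is that scaling by \(p\) at inference matches the expected value of the masked activations seen during training.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Eq. (6): multiply activations by a Bernoulli(p) keep-mask."""
    d = rng.binomial(1, p, size=h.shape)   # 1 = keep, 0 = drop
    return h * d

def dropout_inference(h, p):
    """Eq. (7): no masking, scale activations by the retention probability."""
    return p * h

rng = np.random.default_rng(0)
h = np.ones(100_000)
p = 0.8                                    # dropout rate is 1 - p = 0.2

train_mean = dropout_train(h, p, rng).mean()
infer_mean = dropout_inference(h, p).mean()
print(train_mean, infer_mean)
```

Both means are close to 0.8, which is why the two phases stay consistent.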

Final prediction layer

The final component of the architecture is a fully connected layer with a softmax activation, which is used to handle multi-class classification. The softmax function is defined as:

$$\begin{aligned} \sigma (z)_j = \frac{e^{z_j}}{\sum _{k=1}^{K} e^{z_k}} \end{aligned}$$

(8)

Here, \(z_j\) is the \(j\)-th component of the output vector \(z\), \(K\) is the total number of classes, and \(\sigma (z)_j\) is the probability that the input belongs to class \(j\). In essence, the softmax function exponentiates each output score (logit) \(z_j\) and then normalizes the results across all classes, producing a probability distribution whose values sum to one.
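
Eq. (8) can be sketched in NumPy as below. Subtracting the maximum logit before exponentiating is a standard numerical-stability step that leaves the result unchanged; the two-class logits are hypothetical.

```python
import numpy as np

def softmax(z):
    """Eq. (8), computed along the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)   # stability shift
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 0.0])   # hypothetical scores for two classes
probs = softmax(logits)
print(probs, probs.sum())
```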

Model evaluation metrics

The performance of the model is evaluated using a confusion matrix. Before training, each dataset was split into training, validation, and test sets. The metrics used to assess the proposed approach for skin cancer identification [57,58,59] are well established and are defined as follows:

$$\begin{aligned} \text {Precision}= & \frac{\text {True Positive}}{\text {True Positive} + \text {False Positive}} \end{aligned}$$

(9)

$$\begin{aligned} \text {Recall}= & \frac{\text {True Positive}}{\text {True Positive} + \text {False Negative}} \end{aligned}$$

(10)

$$\begin{aligned} \text {F1 Score}= & \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$

(11)
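
Eqs. (9) to (11) can be computed directly from confusion-matrix counts, as in the sketch below. The counts are hypothetical and do not correspond to any result reported in this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f1)
```

Here precision is 0.9 and recall is 0.75, so the F1 score (their harmonic mean) is about 0.818.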

Explainable AI: integration framework

In this section, we present the integration of Explainable AI (XAI) techniques within our deep learning framework for dermatological image analysis, focusing on how these methods enhance model transparency and interpretability. By embedding XAI into our workflow, we aim to bridge the gap between model predictions and clinical understanding, ensuring that the decision-making process is both accessible and justifiable to medical professionals. This integration aids in validating the model’s accuracy and also fosters greater trust in its applications by providing clear visual explanations for each prediction. Explainability is important in medical AI, especially for tasks like melanoma detection. Visual tools such as saliency maps help check model decisions and can increase clinical confidence [60]. Prior work has noted that explainability methods in dermatology often lack proper evaluation and has emphasized the need for clear validation and consistent practices when using XAI in skin cancer diagnosis [61].

Key components of our XAI approach:

  • 1. Neural Network Frameworks: Utilizing TensorFlow and Keras, we developed a flexible neural network architecture capable of handling complex image analysis tasks. These frameworks provided the essential tools for building and customizing layers tailored to medical image interpretation.

  • 2. Image Enhancement Techniques: Images were preprocessed to ensure consistency, including resizing and normalization. We also introduced subtle variations to simulate real-world conditions, improving the model’s resilience and adaptability to diverse data inputs.

  • 3. Refinement of Pre-trained Models: We adapted existing pre-trained models, incorporating additional layers and fine-tuning parameters specifically for the task of medical image analysis. This process allowed us to leverage the strength of established models while optimizing them for our specific needs.

  • 4. Interactive Visualization Methods: We implemented Grad-CAM and Saliency Maps as interactive tools to visually examine the regions of interest identified by the model. These tools provide clinicians with an intuitive understanding of how the model arrives at its decisions.

  • 5. Heatmap Generation for Clinical Insights: By generating heatmaps through Grad-CAM, we created detailed visual guides that highlight the areas within images that the model considers most significant, thereby aiding in the interpretability of the model’s predictions.

  • 6. Model Validation and Assessment: The model’s performance was rigorously evaluated using validation data, focusing on accuracy, sensitivity, and specificity. We also performed error analysis to understand and improve model predictions.

  • 7. In-depth Result Visualization: The results were comprehensively visualized, including original images, associated saliency maps, and generated heatmaps, providing a clear and holistic view of the model’s analytical process, essential for clinical decision-making.

Explainable AI: Fundamental concepts and key equations

Our methodology builds on established deep learning principles with a focus on interpretability. The fundamental concepts and equations that drive our XAI implementation are as follows:

1. Gradient-weighted Class Activation Mapping (Grad-CAM): Grad-CAM [62] is employed to bring transparency to our CNN-based models by highlighting the areas of an image that are most influential in the prediction of a specific class.

$$\begin{aligned} \alpha _k^c = \frac{1}{Z} \sum _i \sum _j \frac{\partial y^c}{\partial A_{ij}^k} \end{aligned}$$

(12)

This equation computes the importance weight of each feature map in the final convolutional layer. The term \(\alpha _k^c\) is the weight of feature map \(A^k\) for class \(c\). The gradient \(\frac{\partial y^c}{\partial A_{ij}^k}\) indicates how much a small change in feature map \(A^k\) at location \((i, j)\) affects the score for class \(c\). Summing these gradients over all spatial locations \((i, j)\) and normalizing by the total number of locations \(Z\) gives \(\alpha _k^c\). This weight indicates the importance of feature map \(A^k\) for the prediction of class \(c\), allowing us to generate a heatmap that highlights the regions in the input image that are most influential for the model’s decision.
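
The step from the weights in Eq. (12) to a heatmap can be sketched in NumPy. Following the original Grad-CAM formulation, the heatmap is the ReLU of the weighted sum of feature maps, \(\text{ReLU}(\sum_k \alpha_k^c A^k)\). In this sketch the feature maps and gradients are random stand-ins; in practice they would come from the last convolutional layer via automatic differentiation (for example `tf.GradientTape` in TensorFlow). The 7 x 7 x 8 shape and the final rescaling to [0, 1] are assumptions for illustration.

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """Grad-CAM heatmap from (H, W, K) feature maps and matching gradients."""
    alpha = grads.mean(axis=(0, 1))                    # Eq. (12): one weight per map
    cam = (feature_maps * alpha).sum(axis=-1)          # weighted sum over the K maps
    cam = np.maximum(0.0, cam)                         # keep positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # rescale to [0, 1] for display

rng = np.random.default_rng(0)
A = rng.random((7, 7, 8))              # stand-in for final conv feature maps
dY_dA = rng.normal(size=(7, 7, 8))     # stand-in for d y^c / d A
heatmap = grad_cam(A, dY_dA)
print(heatmap.shape)
```

The resulting low-resolution map is usually upsampled to the input size and overlaid on the dermoscopic image.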

2. Saliency Maps: Saliency maps [63] serve as a tool for identifying the regions of an image that most strongly influence the model’s output. By calculating the gradient of the output class score with respect to the input image, we can generate a map that shows the areas of the image that are crucial for the model’s decision.

$$\begin{aligned} S_c(I) = \frac{\partial y^c}{\partial I} \end{aligned}$$

(13)

The saliency map equation computes the gradient of the class score \(y^c\) with respect to the input image \(I\). The saliency map \(S_c(I)\) captures how sensitive the model’s prediction of class \(c\) is to changes in each pixel of the input image. In essence, \(S_c(I)\) highlights the pixels in the input image that have the greatest impact on the class score \(y^c\). This allows us to visualize which parts of the image the model is focusing on when making its decision, providing insights into the model’s reasoning process.
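
Eq. (13) can be illustrated without a deep learning framework by estimating the gradient numerically. The sketch below uses central differences on a toy linear score function, for which the true gradient is simply the weight matrix; the function names and the 2 x 2 "image" are hypothetical. A real model would obtain the same quantity by backpropagation rather than finite differences.

```python
import numpy as np

def saliency_map(score_fn, image, eps=1e-4):
    """Central-difference estimate of d score / d image, same shape as image."""
    grad = np.zeros_like(image, dtype=float)
    for idx in np.ndindex(image.shape):
        step = np.zeros_like(image, dtype=float)
        step[idx] = eps                                  # perturb one pixel
        grad[idx] = (score_fn(image + step) - score_fn(image - step)) / (2 * eps)
    return grad

# Toy class score: a weighted sum of pixels, so the gradient equals w.
w = np.array([[0.5, -2.0],
              [0.0, 1.0]])

def score(img):
    return float((w * img).sum())

image = np.ones((2, 2))
S = saliency_map(score, image)
print(np.abs(S))   # absolute values are commonly used for display
```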

Through the integration of these XAI techniques, our model achieves high accuracy and provides meaningful insights into its decision-making process, thereby enhancing its reliability in clinical applications.

Training specifications

To ensure effective model convergence and maximize efficiency, we employed the Adam optimizer with a fixed learning rate of \(1 \times 10^{-4}\), paired with categorical cross-entropy as our loss function. Training used a batch size of 16, balancing computational demands and training stability across both original and augmented datasets. To mitigate the risk of overfitting, we incorporated early stopping with a patience of 3 epochs, and model checkpoints were used to preserve the model with the best validation performance. Our training setup, detailed in the relevant section, leveraged various state-of-the-art models, all optimized for execution on the NVIDIA Tesla T4 GPU.

Our approach was designed for efficiency, with training run for up to 100 epochs and terminated early if performance plateaued. The use of multiple processing workers and the computational resources of the NVIDIA Tesla T4 GPU, accessed via Google Colab, significantly accelerated the training process, ensuring both speed and resource efficiency.
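
The interaction between the 100-epoch budget, the patience of 3, and checkpointing can be sketched in plain Python. This is an illustrative loop, not the Keras callbacks used in the study: it assumes validation loss is the monitored quantity (the text specifies only "best validation performance"), and the loss values are made up.

```python
def train_with_early_stopping(val_losses, patience=3, max_epochs=100):
    """Return (best_epoch, stopped_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            # Improvement: this is where a checkpoint would be saved.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch      # stop after `patience` bad epochs
    return best_epoch, min(len(val_losses), max_epochs) - 1

# Hypothetical validation losses: the best value occurs at epoch 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
best, stopped = train_with_early_stopping(losses)
print(best, stopped)  # 2 5
```

Training stops at epoch 5 after three consecutive epochs without improvement, and the checkpoint from epoch 2 is the model that would be kept, even though a later epoch in this toy sequence would have been better.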

To fine-tune the model’s performance, we manually experimented with different learning rates in the range of \([1 \times 10^{-3}, 1 \times 10^{-5}]\) and found that \(1 \times 10^{-4}\) provided the most stable and accurate convergence. The batch size of 16 was selected after evaluating multiple sizes (8, 16, 32, 64), where 16 achieved optimal balance between convergence behavior and memory utilization. Dropout rates were carefully assigned across layers between 0.1 and 0.3 to prevent overfitting, guided by incremental experiments and validation trends. These hyperparameter values were finalized based on consistent improvements in validation accuracy, loss stability, and generalization performance across both datasets.


