This section presents a comprehensive analysis of the proposed hybrid DL model’s performance in classifying brain tumors from MRI images. We assess the model’s training history, its classification accuracy on the test set, and the interpretability of its predictions using Grad-CAM visualizations. This allows us to assess the model’s accuracy and its potential use in medicine. We also compare the proposed model with previous models to show its superiority in BTC.
Training progress
To determine how well the model learned during training, we tracked two metrics, loss and accuracy, on both the training set (5688 images) and the validation set (632 images) over 20 epochs (training iterations). These metrics were used to plot the model’s learning behaviour, as shown in Fig. 5.

(a) Training and validation loss over 20 epochs. The sharp initial drop followed by convergence in both curves suggests effective learning and minimal overfitting. The model quickly learns to minimize error, reaching stable performance early. (b) Training and validation accuracy across epochs. Both curves show a steady rise, with close alignment after epoch 5, indicating strong generalization and consistent performance across unseen MRI data.
The plot in Fig. 5a shows the training and validation loss, a metric that quantifies the difference between the model’s predictions and the actual labeled data. A stable decrease in loss indicates that the model’s predictions are becoming more accurate. The training loss (blue line) starts around 0.8, reflecting significant initial errors, but drops rapidly—falling below 0.2 by epoch 5 and below 0.1 by epoch 15. This steady, monotonic decrease suggests that the model quickly learned to classify the training images effectively.
The validation loss (orange line) begins at approximately 0.4 and follows a similar downward trend, reaching 0.1 by epoch 10, with slight fluctuations afterward while remaining relatively stable. These fluctuations are expected, as the validation set comprises only 10% of the entire dataset. However, the overall trend closely mirrors the training loss, indicating that the model generalizes well to unseen data. The close alignment of the two curves suggests the model is not overfitting; in other words, it is learning meaningful patterns that can be applied to new MRI images, rather than simply memorizing the training data.
The plot in Fig. 5(b) shows the accuracy of the training and validation sets, representing the percentage of correct predictions. The training accuracy (blue line) starts at around 65%, meaning that in the initial epochs, the model correctly classified 65% of the training images. It increases sharply, reaching 90% by epoch 5 and nearly 100% by epoch 10, where it remains for the rest of the training. This indicates that the model quickly learned the patterns in the training data.
The validation accuracy (orange line) begins slightly higher, around 80%, likely because the validation set is smaller and less diverse. It rises steadily to reach 95% by epoch 10 and stabilizes around 96% by epoch 15. The small gap between training and validation accuracy (less than 5%) suggests strong generalization to unseen data, confirming that the model is not overfitting.
An early stopping mechanism was employed to halt training if no improvement in validation loss was observed over five epochs, ensuring retention of the best-performing model. Additionally, a learning rate reduction, which multiplied the rate by a factor of 0.1 after two consecutive epochs without improvement, helped the model adjust and contributed to this balanced performance.
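The two rules described above can be illustrated with a minimal, framework-free sketch. It is not the authors' training code: the function name `simulate_schedule`, the starting learning rate, and the loss values in the usage note are hypothetical, and only the patience values (five and two epochs) and the factor of 0.1 come from the text.

```python
def simulate_schedule(val_losses, lr=1e-3, stop_patience=5,
                      lr_patience=2, lr_factor=0.1):
    """Replay a validation-loss history and report what the two rules would do.

    Hypothetical helper for illustration; the starting lr is an assumption.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_since_best = 0     # drives early stopping
    epochs_since_lr_cut = 0   # drives learning-rate reduction
    lr_history = []           # learning rate used during each epoch

    for epoch, loss in enumerate(val_losses, start=1):
        lr_history.append(lr)
        if loss < best_loss:
            # Improvement: remember this epoch and reset both counters.
            best_loss, best_epoch = loss, epoch
            epochs_since_best = 0
            epochs_since_lr_cut = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_cut += 1
        if epochs_since_lr_cut >= lr_patience:
            lr *= lr_factor   # multiply by 0.1, i.e. shrink tenfold
            epochs_since_lr_cut = 0
        if epochs_since_best >= stop_patience:
            # Stop and keep the weights from best_epoch.
            return {"stopped_at": epoch, "best_epoch": best_epoch,
                    "lr_history": lr_history}

    return {"stopped_at": None, "best_epoch": best_epoch,
            "lr_history": lr_history}
```

For a made-up history such as `[0.40, 0.30, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26]`, the best epoch is the third, the learning rate is cut twice, and training halts five epochs after the last improvement.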
Classification performance
To test its performance on unseen data, we evaluated the model on the test set of 703 images. The test set contains images from all four categories: No Tumor, Glioma, Meningioma, and Pituitary. We used a classification report, confusion matrices, and ROC curves to evaluate the effectiveness of the model.
Classification report
In addition to accuracy, Table 1 shows detailed metrics including precision, recall, F1-score and support for each category, as well as overall averages. The metrics are defined as:
Precision: The number of correct predictions for a category divided by the total number of predictions made for that category. In other words, it answers the question: ‘Of all the images the model labelled as this category, how many actually belonged to it?’
$$\text{Precision}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Positive}}$$
(4)
Recall: The fraction of the actual instances of a category that were correctly identified. It answers the question: ‘Out of all the images that actually belonged to this category, how many did the model predict correctly?’
$$\text{Recall}=\frac{\text{True Positive}}{\text{True Positive}+\text{False Negative}}$$
(5)
F1-Score: The harmonic mean of precision and recall, providing a single metric that balances the two. It is particularly useful when precision and recall differ, because the harmonic mean is pulled toward the lower of the two values.
$$\text{F1-score}=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$
(6)
Support: The total number of test images per category, representing the sample size for each class.
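The definitions in Eqs. (4) to (6) can be applied directly to a confusion matrix. The sketch below is illustrative only: `per_class_metrics` is a hypothetical helper, and it assumes the convention that rows hold the actual class and columns the predicted class.

```python
def per_class_metrics(cm):
    """Compute precision, recall, F1 and support for each class.

    Assumes cm[i][j] is the number of images whose actual class is i
    and whose predicted class is j (hypothetical helper).
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]                               # correct predictions of class k
        fp = sum(cm[i][k] for i in range(n)) - tp   # other classes predicted as k
        fn = sum(cm[k]) - tp                        # class k predicted as something else
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": tp + fn})
    return metrics


# Made-up three-class counts for illustration, not the values in Table 1.
toy_cm = [[48, 2, 0],
          [1, 47, 2],
          [0, 0, 50]]
toy_metrics = per_class_metrics(toy_cm)
```

In this toy matrix the first class has 48 true positives, 1 false positive and 2 false negatives, so its recall is 48/50 and its precision is 48/49.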
On the test set, the proposed model achieved an average accuracy of 99% across the four classes—no tumor, glioma, meningioma, and pituitary tumor—indicating that the model correctly classified nearly all images. The model exhibited excellent performance for the no-tumor class, with a precision of 1.000, meaning that all predicted non-tumor cases were indeed non-tumorous. However, there was slightly lower recall, suggesting that a small minority of actual non-tumor cases were missed, possibly due to rare imaging artifacts resembling tumor features. Despite this, the model achieved a high F1-score, reflecting trustworthy identification of healthy brain tissue.
In glioma detection, the model maintained high accuracy, with only a few false positives and false negatives, which are plausibly attributable to the infiltrative nature of gliomas, which spread into neighboring tissue. This resulted in a balanced F1-score, reflecting the model’s ability to handle these challenging cases. For meningiomas, the model correctly identified all ground-truth cases, and the vast majority of its meningioma predictions were correct. This is likely due to the homogeneous appearance of meningiomas on MRI scans, which makes them easily distinguishable, as reflected by the highest F1-score among the classes. For pituitary tumors, the model also performed well, achieving precision and recall of 0.94 and 0.91, respectively, which indicates reliable but not perfect recognition of their rounded borders. The residual errors lower the F1-score for this class, likely due to characteristics that overlap with other tumor types (e.g., meningiomas).
The macro-averaged precision, recall, and F1-score of 0.99 reflect consistent performance across all classes, despite the varying tumor morphologies. Similarly, the weighted average of 0.99 supports this stability, indicating that differences in class size did not noticeably affect the model’s performance. This reinforces the model’s ability to generalize across different tumor presentations, highlighting its potential as a reliable and accurate tool for brain tumor diagnosis in clinical settings.
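The difference between the macro and weighted averages reported above is easiest to see in code. This is a minimal sketch with hypothetical helper names and made-up scores: the macro average treats every class equally, whereas the weighted average weights each class by its support.

```python
def macro_average(scores):
    """Unweighted mean: every class counts the same, whatever its size."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean weighted by support: larger classes contribute more."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


# Made-up per-class F1-scores and supports, for illustration only.
f1_scores = [1.0, 0.9, 0.8]
supports = [100, 50, 50]
macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)
```

When the two averages agree, as they do in the classification report, class imbalance is not masking a weak class: a poorly classified small class would pull the macro average below the weighted one.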
The precision, recall, and F1-score comparison (Fig. 6) reinforces the model’s strong performance, with consistently high scores observed across all data splits—approximately 0.98–0.99 for training, and 0.98 for both validation and test. These results indicate a high degree of reliability and generalization. Nonetheless, minor performance variations point to potential areas for enhancement. This evaluation highlights the model’s diagnostic strengths, particularly its capacity to generalize across diverse tumor morphologies, while also emphasizing the need for further optimization in feature extraction methods or training dataset composition to support its clinical applicability in brain tumor assessment.

Precision, recall, and F1-score comparison across training, validation, and test datasets. High and consistent values across all splits reflect the model’s stability and classification robustness.
This class-wise performance gives important information about the model’s diagnostic behaviour. The model performs very well on meningiomas, with a high F1-score, likely because their homogeneous morphological characteristics and clear tumor borders are readily captured by convolutional and attention-based mechanisms. In comparison, recall is slightly lower for gliomas because their diffuse and infiltrative behaviour makes their spatial boundaries less clear, leaving them more susceptible to classification errors. The few false positives in the no tumor group indicate that incidental anatomical asymmetries or other image artifacts may be mistakenly identified as pathology in some cases. A few cases may be misclassified between glioma and pituitary tumor due to overlapping grayscale characteristics, particularly when pituitary lesions grow atypically in the extra-axial compartment (25). These results emphasize that although the model’s macro-averaged scores are high, they are not simply a function of class balance; the class-wise view shows that performance varies across complex tumor morphologies. Such interpretation is key to putting numerical metrics into perspective and to evaluating whether the model is ready for the clinic.
Confusion matrix
The confusion matrix provides a detailed breakdown of the model’s predictions, including the number of images correctly classified and where mistakes occurred (Fig. 7). The matrix helps interpret the classification performance of the model for the four classes—no tumor, glioma, meningioma, and pituitary tumor—by mapping predicted labels to actual labels. A clear diagonal line indicates that the model is capable of accurately predicting each class, learning the differences between specific MRI features, including the irregular, infiltrative patterns of gliomas, the well-defined, homogeneous structures of meningiomas, the round, compact shapes of pituitary tumors, and the balanced, symmetrical characteristics of non-tumor samples.
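The mapping from actual to predicted labels described above can be sketched in a few lines. The helper name, the label strings, and the six example predictions are hypothetical and are not taken from Fig. 7; rows are assumed to index the actual class and columns the predicted class.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count how often each actual class (row) is predicted as each class (column)."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        cm[index[actual]][index[predicted]] += 1
    return cm


labels = ["no_tumor", "glioma", "meningioma", "pituitary"]
# Made-up labels for six images; one glioma is predicted as meningioma.
y_true = ["glioma", "glioma", "meningioma", "pituitary", "no_tumor", "glioma"]
y_pred = ["glioma", "meningioma", "meningioma", "pituitary", "no_tumor", "glioma"]
cm = confusion_matrix(y_true, y_pred, labels)

# Diagonal entries are correct predictions; everything else is an error.
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)
```

The single off-diagonal count lands in the glioma row and meningioma column, which is how a confusion matrix exposes which pairs of classes are being mixed up.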
Off-diagonal entries represent classification errors, presumably due to the similarity of visual features in MRI scans, where overlapping tissue densities or ambiguous boundaries between tumor types may prevent separation. These errors highlight potential limitations in the model’s ability to discriminate subtle differences between classes, possibly due to image noise or variations in how tumors present across different patients. This analysis reveals diagnostic strengths of the model, particularly its ability to generalize across tumor morphologies, as well as challenges that may need to be addressed (e.g., in feature extraction or in the training dataset) before the model can be deployed clinically for brain tumor evaluation.
The confusion matrices for the training, test, and validation sets each provide a different angle on how the model performs over the course of development. Figure 7 presents confusion matrices for the (a) training, (b) test, and (c) validation sets, showing the class-wise prediction breakdown. Diagonal dominance across all matrices suggests effective learning and low misclassification rates. The training matrix (Fig. 7a) shows the model’s strong initial ability to identify different MRI features, e.g., the infiltrative patterns of glioma versus the homogeneous structures of meningioma, but also highlights some challenging misclassifications. This visual breakdown shows that the model not only performs well numerically but also maintains clinically relevant class separability, which is essential for its use in a diagnostic setting.

(a) Confusion matrix for the training set. The model shows high class-wise accuracy with minor confusion between glioma and pituitary, reflecting their shared grayscale and boundary features. Diagonal dominance confirms strong initial learning. (b) Confusion matrix for the test set. Maintains high accuracy with sparse misclassifications, particularly between glioma and meningioma—consistent with real-world morphological overlaps in MRI. (c) Confusion matrix for the validation set. Performance is consistent with the training and test sets, confirming generalization and robustness. Misclassifications are minimal and class-specific patterns are well preserved.
The test matrix (Fig. 7b) provides an evaluation of performance in circumstances closer to real-world scenarios, since an accurate classification indicates that features are effectively representing the underlying biology, whereas residual pairwise misclassifications (e.g., confusion between glioma and meningioma) indicate inter-class feature overlap that may be subtle.
Finally, the model’s generalization capacity is assessed via the validation matrix (Fig. 7c), in which correct predictions reflect stable learning, while the off-diagonal errors are likely caused by variability within the dataset or by imaging artefacts.
Receiver operating characteristic (ROC) curves
ROC curves (Fig. 8) assess model performance by plotting the True Positive Rate (recall) against the False Positive Rate as the decision threshold is varied. The ROC curve of each category (No Tumor, Glioma, Meningioma, and Pituitary) rises vertically from (0,0) to (0,1) and then runs horizontally to (1,1), which corresponds to perfect performance. At the corner point (0,1), the True Positive Rate (y-axis) is 1.0 (i.e., 100% recall) while the False Positive Rate (x-axis) is 0.0, meaning that a threshold exists at which the model produces no false positives and misses no true cases for that class. The AUC for all categories is 1.00, which is the maximum possible value. An AUC of 1.00 indicates that, for each category, the predicted scores of positive and negative images do not overlap, so every positive image is ranked above every negative one. Overall, these AUC values show that the model’s scores cleanly separate the No Tumor, Glioma, Meningioma, and Pituitary categories on this test set.
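To make the threshold sweep concrete, the following sketch builds a one-vs-rest ROC curve for a single class and integrates it with the trapezoidal rule. The function names and the example scores are hypothetical; the input is assumed to contain at least one positive and one negative image.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points for one class treated as positive.

    labels: 1 if the image belongs to the class of interest, else 0.
    scores: the model's predicted probability for that class.
    Assumes at least one positive and one negative label.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every image tied at this score so ties form a single step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scores: every positive outranks every negative, so the curve
# goes (0,0) -> (0,1) -> (1,1) and the area is 1.0.
separable = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
# Made-up scores with one negative ranked above one positive.
overlapping = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the second example one of the four positive-negative pairs is ranked the wrong way round, so the area drops to 0.75. This is why an AUC of 1.00 implies that no such misordered pair exists in the evaluated data.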
Additionally, we used Grad-CAM to visualize where the model focused when making its predictions, ensuring that the model’s predictions were interpretable and clinically relevant. This is a vital step in medical applications, where a physician needs to understand the reasoning behind the model’s conclusions in order to trust its recommendations.

ROC curves for all four tumor classes—no tumor, glioma, meningioma, and pituitary. AUC values of 1.00 for each class indicate near-perfect separability and excellent classification confidence.
Representative classification results for the four diagnostic categories are shown in Fig. 9. Each sub-image shows its ground truth label along with the predicted class, serving as a visual verification of the model’s classification ability. The true and predicted labels match across multiple MRI orientations (axial, coronal, sagittal) and contrast conditions, demonstrating the model’s robustness to a wide range of tumor morphologies and normal brain structures.

Representative classification results across all four diagnostic categories. The true and predicted labels match across MRI orientations and contrast levels, highlighting the model’s ability to generalize to real-world imaging variations.
The model also correctly distinguishes glioma from meningioma: glioma cases often feature diffuse and irregular infiltrative regions, while meningiomas appear as homogeneously dense and well-circumscribed masses. In addition, the model consistently predicts no tumor for cases with symmetric, structurally normal brain, including those with obvious midline cysts, while also correctly labelling pituitary tumors. The consistency of these predictions reinforces the quantitative results and indicates that the model generalizes appropriately to real clinical data. This further supports its potential to help radiologists with non-invasive, MRI-based brain tumor screening and diagnosis.
Furthermore, multiple MRI images were tested, as shown in Fig. 10 (top), to understand how the model makes decisions across different tumor types. For each case, the model classified the image into its respective tumor type (glioma, pituitary, meningioma). We used Grad-CAM to highlight the regions that contributed most to each prediction. In the Grad-CAM visualizations, areas of high model attention are shown in red and yellow, while areas of low attention are shown in blue.
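The core Grad-CAM computation behind such heatmaps can be sketched without any deep learning framework. This is not the authors' implementation: in practice the activations and gradients come from a chosen convolutional layer via automatic differentiation, whereas here they are passed in as plain nested lists, and `grad_cam` is a hypothetical helper name. Upsampling to the image size and the colour mapping are omitted.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a normalized Grad-CAM heatmap.

    activations[k][i][j]: value of feature map k at spatial position (i, j).
    gradients[k][i][j]:   gradient of the class score with respect to that value.
    Returns a heatmap of the same spatial size with values in [0, 1].
    """
    n_maps = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over each feature map.
    weights = [sum(sum(row) for row in gradients[k]) / (height * width)
               for k in range(n_maps)]

    # Weighted sum of feature maps, then ReLU to keep only positive evidence.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(n_maps)))
            for j in range(width)]
           for i in range(height)]

    # Scale to [0, 1] so the map can be rendered as a colour overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Made-up 2x2 example: map 0 supports the class, map 1 argues against it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]],
                   [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]],
                 [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(toy_activations, toy_gradients)
```

In the toy example only the top-left position receives a non-zero value, because the second feature map has a negative weight and its contribution is removed by the ReLU. High values in the normalized map correspond to the red and yellow regions of the overlay.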
The corresponding MRI slices show a clearly visible tumor, for example a rounded, homogeneous mass that displaces adjacent brain tissue. The Grad-CAM heatmap confirms that the model attended mostly to the tumor region and especially to its borders, which are clinically important areas. This suggests that the model relies on clinically relevant features, aligning its attention with the tumor location in a manner that resembles the interpretative strategies used by radiologists.
This visualization greatly improves model transparency and shows that the model’s predictions are based on tumor-related anatomical landmarks rather than irrelevant image artefacts. This not only aids validation of the model but is also important for building confidence in AI systems and thereby unlocking their utility in the clinical workflows of MI.

(top) Original MRI image of a meningioma case used as test input. (bottom): Grad-CAM heatmap overlay. The highlighted regions correspond to the tumor mass and boundary—confirming that the model focuses on clinically significant areas aligned with radiological diagnosis.
Discussion
We combined VGG16 with a custom attention layer and Grad-CAM, and the resulting model reached an accuracy of 99% on the test set of 703 images. This suggests that the proposed hybrid model combines the strengths of existing techniques to classify brain tumors accurately. Despite class imbalance in the dataset, the average precision, recall, and F1-scores are high, indicating good reliability across all categories. The confusion matrices for the training, validation, and test sets show that the model assigns correct labels with few misclassifications. Most errors occur between glioma and pituitary tumor, which is expected given the similarities in their appearance on MRI: these tumor types tend to share texture patterns and boundary features, which can occasionally cause confusion. These discrepancies are small and do not undermine the general robustness of the model, which consistently separates all four classes across the data splits. While the model achieved an AUC of 1.00 across all classes, which reflects excellent separability, we acknowledge that such perfect classification is rare in medical imaging. These results may be influenced by the dataset’s limited ambiguity and high inter-class separability. Future validation on more heterogeneous clinical datasets will be important to verify generalizability.
The training and validation plots provide insights into the learning process of our model. The substantial decrease in loss and increase in accuracy show that the model learned well. The training and validation metrics are also very close, indicating that the model generalizes well to out-of-sample data and shows little sign of overfitting. Several measures contributed to this: dropout layers were used to discourage co-adaptation, early stopping was used to prevent overtraining, and a learning rate scheduler was used to refine weight updates near convergence. Data augmentation (40° rotation, translations, zoom, and flipping) also helped to prevent overfitting by mimicking real-world variations. However, we acknowledge that a formal ablation study quantifying the individual contributions of each augmentation strategy was not conducted. This remains a relevant direction for future analysis.
The Grad-CAM visualization adds an interpretability element that complements the quantitative analysis and is very important for clinical use. The maps consistently highlighted diagnostically relevant areas of the brain and corresponded well with established radiological heuristics. This visual verification confirms that the model attends to relevant anatomical structures, a critical feature in clinical contexts where trust and transparency are paramount. In the case of meningiomas, for instance, the focus on the tumor mass and its edges corresponds with how radiologists diagnose meningiomas, which typically involves looking for well-defined, rounded masses that push aside adjacent brain tissue. By aligning with medical reasoning, the model is a promising tool for helping radiologists because it not only provides accurate predictions but also explains its decisions in a way that doctors can understand and trust. The interpretability provided by Grad-CAM might help doctors confirm the model’s predictions, detect possible errors, and confidently incorporate the model into their diagnostic workflow. This interpretability claim is further supported by expert opinion: an independent Consultant Radiologist at Phoenix Hospital, Pune, examined the Grad-CAM outputs and confirmed that the attention maps were well aligned with diagnostically significant regions across all tumor categories, consistent with standard clinical interpretation. This validation adds confidence in the medical relevance of the model and supports the case for its translational use.
Existing classification models that differentiate brain tumor types from non-tumorous cases are compared using multi-class performance metrics (Table 2). Initial attempts laid a robust foundation by identifying broad imaging patterns but frequently lacked specificity regarding fine anatomical tumor details. Gradually, deeper networks and improved feature extraction have refined the output, generalizing across cases with increasing robustness. The proposed approach combines a convolutional base with a custom attention layer and an interpretability capability, increasing precision and accuracy compared to previous methods, particularly for non-tumoral cases and challenging tumor types. Although all compared models use the same Kaggle MRI dataset, differences in preprocessing (such as image resolution, normalization, augmentation protocols, and train-validation splits) limit the fairness of strict one-to-one benchmark comparisons. Within these limits, the results underscore the robustness of our model under a clearly defined, reproducible, and tightly controlled experimental setup, which supports its reliability and comparability.
More broadly, DL-based classification approaches have advanced from simple methods to complex frameworks that address multi-class classification problems (Table 2). Early work used pre-trained models for simple tasks, and the field then evolved to include multi-scale methods and hybrid systems for more complex tasks. Our model continues this trajectory by relying on an attention-based design and visualization to improve its performance across all measures, surpassing previously developed multi-class approaches by better targeting salient image features and yielding a model that generalizes robustly (Table 3).
In contrast with traditional CNN-based methods, the proposed hybrid model gains an additional advantage in both diagnostic accuracy and interpretability. Attention enables the model to focus dynamically on diagnostically relevant regions, while the Grad-CAM overlays deliver interpretable and transparent visual justifications for its predictions, a vital requirement in medical imaging, where black-box decisions are often not acceptable. Conversely, many baseline models achieve high accuracy but lack interpretability, which limits their clinical applicability in sensitive diagnostic contexts. These advantages, however, come at a cost. The integration of attention layers and Grad-CAM, while beneficial for interpretability, increases model complexity and inference time, posing deployment challenges in resource-constrained or real-time clinical environments. In addition, although the model was designed to be robust on the chosen dataset, how readily it can be scaled to multi-modality imaging or extended to volumetric (3D) MRI remains a focus for future work.
These advancements are influenced by the complexity of the dataset. Despite strong empirical performance, several limitations must be acknowledged. Although the dataset was well-annotated and diverse, it was derived from a single publicly available source and may not reflect the full extent of clinical heterogeneity. Grad-CAM provides intuitive interpretability; however, its coarse spatial granularity limits its precision in localizing fine anatomical structures. Moreover, the individual impact of model components, such as the attention mechanism and specific augmentation strategies, was not empirically isolated; this will be addressed in future ablation studies. Non-tumorous cases are generally classified well, but some tumors, with all their heterogeneity, push older models to their limits. By taking advantage of its attention-enhanced structure, the proposed approach handles these challenges more effectively, resulting in state-of-the-art scores for each individual class and for overall performance. It strengthens an existing architectural foundation with contemporary techniques to improve accuracy and reliability in MI, building on its predecessors.
The performance metrics validate the technical quality of the model, but the greater value will come from its clinical implementation. Built as a lightweight and interpretable tool, the model can assist radiologists in their daily routine by highlighting diagnostically relevant regions through Grad-CAM. The visual clarity of these explanations helps make diagnosis faster and more confident, which is important in high-volume imaging centres or centres with limited specialist access. Owing to its architecture, the model can be deployed flexibly either within hospital PACS systems or on cloud-based diagnostic platforms, with applicability to real-time screening, radiology training, and remote diagnostics in telemedicine.
Limitations
Although this work introduces a high-performing and interpretable hybrid deep learning model for classifying brain tumors, several limitations should inform future research and broader clinical adoption. The model was built using a single, well-annotated public dataset (Kaggle), which facilitated controlled experimentation and reproducibility. Nevertheless, the dataset does not represent the full clinical diversity found across multi-centre institutions, where imaging protocols, equipment vendors, and patient populations vary widely. So, while the results are encouraging, external validation in diverse cohorts is a necessary next step to determine the model’s generalizability in real-world use.
Furthermore, although the joint design incorporates several improvements, including the attention mechanism, Grad-CAM visualisation, and targeted augmentation, these components were not decoupled and analysed separately in dedicated ablation studies. While this choice is in line with the focus on end-to-end performance, it restricts interpretability at the component level. Future work seeking to improve or simplify the architecture may benefit from a granular analysis that identifies the highest-impact modules. In addition, the interpretability components (Grad-CAM and attention) improve clinical usability at the cost of moderate additional computational expense, which may limit scalability in resource-constrained or low-latency applications. Investigating lightweight alternatives or optimization methods that better balance interpretability and efficiency in more general deployment settings could be worthwhile.
The interpretability offered by Grad-CAM is beneficial in locating diagnostically significant regions, as supported by expert review. On the other hand, its low spatial resolution may limit precision in applications that need fine anatomical localization (e.g., surgical planning or refinement of lesion boundaries). Lastly, a direct comparison with transformer-based architectures such as MRC-TransUNet and with hybrid architectures that incorporate large language models, both of which have gained significant attention in the recent literature, was not performed.
In any case, these limitations do not diminish the contributions offered by the proposed method, but define valuable guidelines for further development and translational applicability.
