Dataset description
The segmentation and classification tasks in this study are conducted using the BraTS2020 and Figshare datasets. BraTS2020 (Multimodal Brain Tumor Segmentation) is a comprehensive medical imaging repository that is widely used for brain tumor segmentation47. The dataset focuses on gliomas, the most common type of brain tumor, and includes pre-operative MRI sequences: T1-weighted, T2-weighted, T1-weighted with contrast enhancement (T1-CE), and fluid-attenuated inversion recovery (FLAIR). Precise segmentation and classification of gliomas is essential for effective treatment planning. Expert annotations of tumor sub-regions, including the necrotic core, enhancing tumor, and peritumoral edema, are provided with the BraTS dataset. The BraTS 2020 training set, which comprises 369 labeled MRI cases, was further split into 80% for training, 10% for validation, and 10% for testing, as shown in Table 9.
In addition, the Figshare brain tumor dataset includes 3064 T1-weighted contrast-enhanced MRI scans from 233 individuals with three types of brain tumors: 1426 glioma slices, 708 meningioma slices, and 930 pituitary tumor slices. This dataset is widely used for brain tumor classification owing to its accessibility and availability48. The Figshare dataset was split into 80%, 10%, and 10% for training, validation, and testing, respectively, as detailed in Table 10.
The BraTS images, stored in NIfTI format, were loaded using the NiBabel Python package, which supports various neuroimaging file formats, and were converted into 2D arrays using the NumPy package. The TensorFlow, Keras, and scikit-learn Python packages were employed to build and train the models. The BraTS and Figshare images were resized to 256×256 pixels to match the input size of the proposed HVU-ED and HVU-E architectures. Subsequently, they underwent several transformations, such as rotation, scaling, and flipping, to create new samples as part of the data augmentation process. The datasets were then split into training, validation, and testing subsets.
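A minimal preprocessing sketch is given below, assuming a single axial slice is extracted from each NIfTI volume; the slice index, normalization, file path, and augmentation settings are illustrative assumptions rather than the authors' exact pipeline.

```python
import nibabel as nib
import numpy as np
import tensorflow as tf

def load_brats_slice(nifti_path, slice_index=77, target_size=(256, 256)):
    """Load a BraTS NIfTI volume, take one axial slice, normalize, and resize it."""
    volume = nib.load(nifti_path).get_fdata()                          # 3D array (H, W, D)
    slice_2d = np.asarray(volume[:, :, slice_index], dtype=np.float32)
    slice_2d = (slice_2d - slice_2d.min()) / (slice_2d.ptp() + 1e-8)   # min-max scaling
    return tf.image.resize(slice_2d[..., None], target_size).numpy()   # (256, 256, 1)

# Augmentation: rotation, scaling (zoom), and flipping, as described in the text.
augmenter = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
])

# Example usage on one case (the file name is illustrative); in practice all slices
# are stacked into arrays and split 80/10/10 for training, validation, and testing.
image = load_brats_slice("BraTS20_Training_001_flair.nii")
augmented = augmenter(image[None, ...], training=True)
```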
HVU architecture
The Hybrid Vision U-Net (HVU) architecture is a unified deep learning framework that integrates the strengths of pre-trained Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) within the U-Net structure to address brain tumor segmentation and classification. By fusing local feature extraction from CNNs with the global context modeling of ViTs, HVU effectively captures both fine-grained and holistic information essential for accurate medical image analysis. Four HVU model variants, namely ResVU-ED, VggVU-ED, XceptionVU-ED, and DenseVU-ED, are constructed by combining ViT modules with ResNet50, VGG16, Xception, and DenseNet121, respectively. These pre-trained CNNs were chosen for their complementary design principles and strong track record in medical imaging.
The ResVU-ED design combines the ResNet backbone with the ViT and U-Net to maximize their complementary capabilities. ResNet, known for its residual blocks and its ability to mitigate vanishing gradients, is widely used in deep learning. In this architecture, the first 16 layers of ResNet50 were employed as a feature extractor to capture the contextual information needed for precise pixel-wise segmentation and classification, giving a total of 41,743,876 parameters.
The VggVU-ED segmenter leverages the VGG16 model together with the ViT and U-Net. VGG16, renowned for its deep and robust architecture, effectively extracts high-quality image features. The features extracted from its first 17 layers were concatenated with the U-Net backbone for the segmentation and classification process. This model comprises 52,134,596 parameters.
The XceptionVU-ED model merges the Xception layers with the ViT in the middle block of the U-Net and has 48,514,732 parameters. Xception, an enhanced version of the Inception model, utilizes depthwise separable convolutions and residual connections to extract intricate patterns and semantic details. Combined with U-Net, the features extracted from its first 60 layers significantly enhance segmentation.
The DenseVU-ED architecture combines DenseNet and ViT features at the bottleneck of the U-Net model. DenseNet connects each layer to every subsequent layer in a feed-forward manner, and this dense connectivity enables feature reuse and efficient learning. Combined with U-Net, it enhances image segmentation through effective feature aggregation. The first 5 layers of DenseNet121 were used to extract features from the input images. The total number of parameters in this architecture was 36,957,380. The trainable and non-trainable parameters of the proposed architectures are shown in Table 11.
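To make the backbone usage concrete, the sketch below truncates a pre-trained Keras CNN at a given layer index and freezes it as a feature extractor; the ImageNet weights and the exact cut-points are illustrative assumptions based on the layer counts quoted above, not the authors' exact configuration.

```python
import tensorflow as tf

def truncated_backbone(name="resnet50", input_shape=(256, 256, 3), n_layers=16):
    """Return the first n_layers of a pre-trained CNN as a frozen feature extractor.
    Cut-points (16 for ResNet50, 17 for VGG16, 60 for Xception, 5 for DenseNet121)
    follow the counts quoted in the text but are illustrative here."""
    constructors = {
        "resnet50": tf.keras.applications.ResNet50,
        "vgg16": tf.keras.applications.VGG16,
        "xception": tf.keras.applications.Xception,
        "densenet121": tf.keras.applications.DenseNet121,
    }
    base = constructors[name](include_top=False, weights="imagenet",
                              input_shape=input_shape)
    extractor = tf.keras.Model(inputs=base.input,
                               outputs=base.layers[n_layers].output,
                               name=f"{name}_truncated")
    extractor.trainable = False   # used purely as a feature extractor
    return extractor
```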
Integration of vision transformer
Inspired by the success of Transformer models in natural language processing, Vision Transformers are a class of deep learning models for computer vision tasks. In the proposed HVU architecture, the ViT is incorporated to enhance global representation capability, complementing the local feature extraction strengths of convolutional neural networks (CNNs). As illustrated in Fig. 10, the ViT leverages self-attention mechanisms to model relationships between different regions of an image. The input image is divided into fixed-size patches of 16×16 pixels, using a stride of 16 so that patches do not overlap and spatial coverage is non-redundant. Each patch is flattened into a 1D vector and linearly projected into a lower-dimensional embedding space; this projection is a matrix multiplication with a trainable weight matrix followed by a bias addition, learned during training. To retain spatial context, positional embeddings are added to each token, allowing the model to distinguish patch positions. The self-attention mechanism then lets each patch gather contextual information from all other patches, learning the long-range dependencies that are critical for identifying tumors with variable shapes and locations. A feed-forward neural network (FFN) further models complex non-linear interactions between patches. The ViT output is integrated with the CNN-based encoder features at the bottleneck of the U-Net structure. This fusion produces a hybrid feature representation that combines detailed local cues with global semantic context, improving the model's ability to segment tumors with irregular boundaries and to support classification. The final classification head converts the fused embeddings into output predictions.
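A compact sketch of the patch embedding and one transformer encoder block is shown below; the 16×16 patch size and stride of 16 follow the text, while the embedding dimension, number of heads, and FFN width are assumed values chosen only for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PatchEmbed(layers.Layer):
    """Split the image into non-overlapping 16x16 patches, project them linearly,
    and add learned positional embeddings (embedding dim 64 is an assumption)."""
    def __init__(self, patch=16, d_model=64, image_size=256, **kwargs):
        super().__init__(**kwargs)
        self.num_patches = (image_size // patch) ** 2
        self.d_model = d_model
        self.proj = layers.Conv2D(d_model, kernel_size=patch, strides=patch)
        self.pos = layers.Embedding(self.num_patches, d_model)

    def call(self, x):
        x = self.proj(x)                                              # (B, 16, 16, d_model)
        x = tf.reshape(x, (tf.shape(x)[0], self.num_patches, self.d_model))
        return x + self.pos(tf.range(self.num_patches))               # broadcast over batch

def vit_block(tokens, heads=4, d_model=64):
    """One encoder block: multi-head self-attention and a feed-forward network,
    each with layer normalization and a residual connection."""
    x = layers.LayerNormalization()(tokens)
    x = layers.MultiHeadAttention(num_heads=heads, key_dim=d_model // heads)(x, x)
    tokens = layers.Add()([tokens, x])
    x = layers.LayerNormalization()(tokens)
    x = layers.Dense(d_model * 2, activation="gelu")(x)
    x = layers.Dense(d_model)(x)
    return layers.Add()([tokens, x])

# 256x256 input -> 256 tokens of dimension 64 carrying global context.
inputs = layers.Input(shape=(256, 256, 3))
vit_tokens = vit_block(PatchEmbed()(inputs))
```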
Feature integration with UNet
The backbone of the proposed Hybrid Vision U-Net (HVU) framework is the U-Net encoder-decoder architecture. The encoder compresses input images through a series of convolutional and max-pooling layers, capturing detailed spatial and contextual information at multiple levels, while the decoder restores spatial resolution by upsampling the compressed features and merging them with the corresponding encoder features via skip connections. These skip connections preserve the fine-grained details that are critical for accurate localization of anatomical structures. The architecture features two primary integration mechanisms: skip connections that join corresponding layers in the encoder and decoder paths, and a central bottleneck layer that bridges the deepest encoding stage and the beginning of decoding. This bottleneck acts as the fusion point where features from the different sources are combined to enrich the overall representation, as sketched below.
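The generic encoder/decoder block pair below illustrates these two mechanisms; the filter counts and the transposed-convolution upsampling are assumptions, not the exact HVU configuration.

```python
from tensorflow.keras import layers

def encoder_block(x, filters):
    """Two 3x3 convolutions followed by 2x2 max-pooling; returns the pre-pooled
    features (for the skip connection) and the downsampled output."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    skip = x
    return skip, layers.MaxPooling2D(2)(x)

def decoder_block(x, skip, filters):
    """Upsample, concatenate the matching encoder features, then convolve."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])     # skip connection restores fine detail
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```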
In the HVU architecture, feature maps generated by the pre-trained convolutional neural networks ResNet50, DenseNet121, VGG16, and Xception are integrated with the global representations learned by the Vision Transformer (ViT). These hybrid features are merged at the bottleneck layer of the U-Net. The fused representation enhances semantic richness and localization, which is particularly beneficial when segmenting tumors with irregular or ambiguous boundaries.
For classification, the encoder-generated fused features are fed into the HVU-E classification module, which supports either a fully connected neural network or traditional machine learning classifiers such as Support Vector Machines (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression, and AdaBoost. These machine learning approaches were utilized to assess the robustness of the extracted features and to establish performance baselines. Machine learning classifiers offer several advantages, including lower computational complexity, faster training times, and better generalization in scenarios with limited data. Moreover, their interpretability is beneficial in clinical applications where transparency and trust are critical. By leveraging a shared feature representation for both segmentation and classification, the framework enables joint learning, enhances computational efficiency, and improves generalization across tasks.

Architecture of the vision transformer.

Architecture of the proposed HVU-ED segmenter.
The fusion of CNN and ViT features is performed at the bottleneck of the U-Net encoder-decoder structure. This fusion ensures that both local and global representations are retained before the upsampling path, thereby enriching the semantic content during feature reconstruction. As a result, the decoder is able to generate more accurate and detailed segmentation outputs, especially in cases where tumor boundaries are diffuse or irregular.
For the classification task, the same fused features extracted from the encoder are forwarded to the HVU-E classification branch. This branch includes either a fully connected neural network layer or traditional machine learning classifiers such as Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression, and AdaBoost. This shared feature space across segmentation and classification facilitates efficient multi-task learning and enhances overall model performance.
HVU-ED segmenter
The preprocessed BraTS images, with a size of 256×256, were used as input to the HVU-ED segmenter architecture, as shown in Fig. 11. The U-Net encoder consists of several 3×3 convolutional layers with ReLU activation. Max-pooling downsamples the features with a stride of two to reduce the spatial dimensions, and after each downsampling the number of channels is doubled to compensate for the reduced spatial resolution. The model summary in Table 12 lists the input size, filter size, number of filters, activation function, and output size. The feature maps from the U-Net encoder, the transfer learning models, and the ViT are concatenated at the bottleneck of the HVU-ED architecture, all sharing a common shape of 16×16×3. The concatenated feature map is then passed to the HVU-ED decoder path for further processing and segmentation. The decoder mirrors the encoder, upsampling the features to restore spatial resolution and reconstructing the segmented output from the comprehensive feature representation obtained from the concatenated encoder outputs.
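A minimal sketch of this bottleneck fusion is given below; the 1×1 projections used to bring each branch to the quoted 16×16×3 shape, and the reshaping of ViT tokens back onto a 16×16 grid, are assumptions made for illustration.

```python
from tensorflow.keras import layers

def fuse_bottleneck(unet_feats, cnn_feats, vit_tokens, d_model=64):
    """Concatenate U-Net encoder, pre-trained CNN, and ViT features at the bottleneck.
    All three branches are assumed to already have a 16x16 spatial footprint;
    each is projected to 3 channels before concatenation."""
    vit_feats = layers.Reshape((16, 16, d_model))(vit_tokens)   # tokens back to a grid

    unet_p = layers.Conv2D(3, 1, activation="relu")(unet_feats)
    cnn_p = layers.Conv2D(3, 1, activation="relu")(cnn_feats)
    vit_p = layers.Conv2D(3, 1, activation="relu")(vit_feats)

    # Fused representation handed to the HVU-ED decoder path.
    return layers.Concatenate()([unet_p, cnn_p, vit_p])
```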
The HVU-ED segmenter was trained using the Adam optimizer with a learning rate of 0.001 for 50 epochs. A batch size of 1 was chosen to promote good generalization across the training and testing images. The parameters used to train the segmenter are listed in Table 13, and the segmented images produced by this architecture are shown in Fig. 12. Performance of the proposed segmentation architecture was evaluated using metrics such as the Dice score, accuracy, precision, recall, sensitivity, and specificity.
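This training setup maps onto a standard Keras compile/fit call, sketched below; the Dice coefficient metric and the categorical cross-entropy loss are common choices included here as assumptions, and `model`, `X_train`, `X_val`, and the label arrays refer to the segmenter and data prepared earlier.

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1e-6):
    """Soft Dice coefficient used as an additional evaluation metric."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),   # settings quoted in the text
    loss="categorical_crossentropy",                            # assumed segmentation loss
    metrics=["accuracy", dice_coefficient],
)
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=1)
```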

The segmented images from the four segmenters ResVU-ED, VggVU-ED, XceptionVU-ED, and DenseVU-ED are shown in the first, second, third, and fourth rows, respectively.

The proposed HVU-E classifier architecture.
HVU-E classifier
The HVU-ED segmenter architecture is designed for image segmentation, but it can be repurposed for classification with some modifications: the decoder is replaced by a flatten operation followed by dense layers, and a classification layer with a softmax activation is added. Figshare images were used as input to the HVU-E classifier, as shown in Fig. 13. The preprocessed images were fed into the hybrid ViT and U-Net models. The U-Net encoder gathers structural and local features at different levels of abstraction, which are crucial for distinguishing between classes, while the bottleneck layer additionally captures global and complex features from the ViT and the transfer learning models. For the classification task, the combined features from the three models were flattened and fed into dense layers with ReLU activation. The model parameters, such as input size, filter size, number of filters, activation function, and output size, are listed in Table 15. The classification model was fine-tuned using the Adam optimizer with a learning rate of 0.001 for 50 epochs, employing a softmax activation in the output layer. A batch size of 32 was used for the training and validation datasets, along with the categorical cross-entropy loss function. Table 14 lists the training parameters of this model. The brain tumor classification results were evaluated using accuracy, F1-score, precision, recall, sensitivity, and specificity. Similarly, the flattened features from the HVU-E classifier were used as input to machine learning algorithms such as SVM, RF, DT, LR, and AdaBoost to classify brain tumor images as glioma, meningioma, or pituitary tumor.
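These modifications correspond to the head sketched below; the dense-layer widths, the `feature_extractor` model (assumed to be a Keras model truncated at the flatten layer), and the label arrays are illustrative assumptions rather than the authors' exact settings.

```python
from tensorflow.keras import layers
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def classification_head(fused_features, num_classes=3):
    """HVU-E head: flatten the fused bottleneck features, pass them through
    ReLU dense layers, and predict glioma/meningioma/pituitary with softmax."""
    x = layers.Flatten()(fused_features)
    x = layers.Dense(256, activation="relu")(x)   # layer widths are assumptions
    x = layers.Dense(64, activation="relu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)

# The same flattened features can also drive classical classifiers (SVM and RF shown).
train_feats = feature_extractor.predict(X_train)
svm = SVC(kernel="rbf").fit(train_feats, y_train_labels)
rf = RandomForestClassifier(n_estimators=100).fit(train_feats, y_train_labels)
```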
Conclusion
This study introduced two novel hybrid models, HVU-ED for segmentation and HVU-E for classification, which combine vision transformers, pre-trained encoders, and the U-Net architecture for brain tumor analysis. This hybrid strategy improves overall segmentation performance by combining the strengths of global and local feature extraction mechanisms. The HVU-ED segmenter achieved a segmentation accuracy of 98.91%, with Dice scores of 0.902 (enhancing tumor), 0.954 (tumor core), and 0.966 (whole tumor). Building on this, the HVU-E classifier demonstrated strong generalization, achieving a classification accuracy of 99.18% with a dense output layer and 92.21% using an SVM. Additionally, explainable AI (XAI) techniques were employed to validate and visualize the model's decision-making process, reinforcing its clinical interpretability. Although many related architectures have been reported in the literature, the improved performance of the proposed models demonstrates their merit. The proposed models can be adapted to a wide range of medical image segmentation and classification tasks in the future, making the network readily fine-tuned and efficient for a variety of medical imaging applications.
Limitations and future work
The proposed HVU-ED and HVU-E models demonstrated strong performance but were each evaluated on a single dataset, which may limit their generalizability in broader clinical applications. Future work will focus on extending validation across diverse datasets, integrating clinical metadata, and improving model efficiency for deployment in real-time healthcare environments.
