This section presents the proposed AD prediction framework, which combines deep learning and explainable AI applied to MRI scans. Figure 2 shows the proposed framework and illustrates the intermediate steps: handling data imbalance, model design, training, feature extraction and fusion, and classification. In the first step, the dataset was acquired and data augmentation techniques were applied. A novel inverted residual bottleneck model with self-attention was designed and trained. Similarly, a vision transformer architecture was trained on the augmented dataset. After training and feature extraction, an explainable AI technique (LIME) was used to interpret the diagnosis of the disorder. In parallel, features were obtained from the global average pooling and self-attention layers and fused with the help of a novel approach. In the final stage, shallow neural network classifiers were applied to obtain the classification results. Each of these steps is briefly described below.

Proposed framework of Alzheimer’s disease prediction.
Dataset
An MRI dataset of AD23, available from the open-source platform Kaggle (https://www.kaggle.com/datasets/tourist55/alzheimers-dataset-4-class-of-images), has been employed in this work for the experimental process. The dataset is divided into four groups: mild demented, moderate demented, non-demented, and very mild demented, and contains 6400 images for training and testing purposes. Figure 3 illustrates sample images from each of the four categories. As shown in this figure, the number of images in each class is insufficient, and the dataset is imbalanced, with class sizes ranging from 52 to 2560 images.

Original dataset samples and summary of the Alzheimer's disease dataset23.
Overfitting is a common issue in network training when data are insufficient: the tuned network struggles to generalize to unseen samples. An inadequate sample size and uneven class distribution reduce the system's efficacy and lead to less interpretable outcomes for minority classes29. Deep learning models require a large amount of data for learning; therefore, we performed dataset augmentation using flip and rotation techniques. The framework applies several augmentation operations, including rotation, zooming, shifting, shearing, and horizontal flipping30. After applying the augmentation, 12,800 images were available for the training and testing parts. Figure 4 illustrates the number of samples in each class.

Summary of the dataset after the augmentation process.
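For illustration, the augmentation operations listed above can be expressed as a simple image-transformation pipeline. The following is a minimal sketch using torchvision; the rotation, shift, zoom, and shear ranges shown are assumptions, since the exact parameter values are not specified here.

```python
# Minimal augmentation sketch (torchvision); the specific ranges below are
# illustrative assumptions, not the values used in the paper.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomRotation(degrees=15),                      # rotating
    T.RandomAffine(degrees=0,
                   translate=(0.1, 0.1),               # shifting
                   scale=(0.9, 1.1),                   # zooming
                   shear=10),                          # shearing
    T.RandomHorizontalFlip(p=0.5),                     # horizontal flipping
    T.Resize((227, 227)),                              # model input resolution
    T.ToTensor(),
])
# Applying `augment` repeatedly to each MRI slice yields the enlarged
# training set summarized in Fig. 4.
```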
Pre-trained vision transformer
Both the transformer architecture and the vision transformer architecture include several layers of self-attention. Because each self-attention layer processes the incoming data in parallel, the network can discover long-range relationships in the information without the need for recurrence.
The output of the self-attention layers is then processed by a sequence of feed-forward layers. The vision transformer design is notable for its use of multi-headed self-attention, which enables the network to attend concurrently to various aspects of the input data31.
The input to a standard transformer is a 1D sequence of token embeddings; hence, to handle a 2D image, ViT reshapes the image \(A\in {R}^{M\times N\times O}\) into a sequence of flattened 2D patches \({A}^{P}\in {R}^{b\times ({P}^{2}\cdot O)}\), where \(M\times N\) denotes the resolution of the original image, \(O\) represents the number of channels, \(P\) is the resolution of each image patch, and \(b=MN/{P}^{2}\) is the number of patches. Since the vision transformer uses the same width in all of its layers, the patches are flattened and mapped to a D-dimensional vector with a trainable linear projection. The output of this projection is the patch embedding.
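As a concrete example, the patch-extraction and linear-projection step can be sketched as follows. This sketch assumes PyTorch, a 224 × 224 × 3 input, a patch size of P = 16, and an embedding width of D = 192 (the ViT-Tiny/16 width); these values are illustrative.

```python
# Sketch of ViT patch embedding: reshape A (M x N x O) into b flattened patches
# of length P*P*O, then project each to a D-dimensional token.
import torch
import torch.nn as nn

M, N, O, P, D = 224, 224, 3, 16, 192      # assumed illustrative values
b = (M * N) // (P * P)                    # number of patches

image = torch.randn(1, O, M, N)           # one input image (batch of 1)
patches = image.unfold(2, P, P).unfold(3, P, P)               # (1, O, M/P, N/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, b, P * P * O)

projection = nn.Linear(P * P * O, D)      # trainable linear projection
tokens = projection(patches)              # patch embeddings: (1, b, D)
print(tokens.shape)                       # torch.Size([1, 196, 192])
```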
Multi-Head Self Attention (MSA) and Multilayer Perceptron (MLP) are key components of the conventional transformer layers. The MSA divides the input into many parts and then measures each input’s scaled dot product in parallel. Following that, the slices of the attention outputs generate the final results of multi-head self-attention32. Mathematically, it is defined as follows:
$$Attention\left(E,F,G\right)=Softmax\left(\frac{E{F}^{T}}{\sqrt{{d}_{x}}}\right)\cdot G$$
(1)
$${head}_{u}=Attention\left(E{W}_{u}^{E}, F{W}_{u}^{F},G{W}_{u}^{G}\right)$$
(2)
$$MSA\left(E,F,G\right)=Concat\left({head}_{1},\dots ,{head}_{u}\right){W}^{O}$$
(3)
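The three equations above can be rendered directly in code. The following NumPy fragment is a minimal illustration of Eqs. (1)-(3); the per-head projection matrices are assumed to be given.

```python
# Sketch of scaled dot-product attention (Eq. 1) and multi-head attention
# (Eqs. 2-3); E, F, G play the roles of queries, keys, and values.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(E, F, G):                                   # Eq. (1)
    d_x = E.shape[-1]
    return softmax(E @ F.T / np.sqrt(d_x)) @ G

def msa(X, W_E, W_F, W_G, W_O):                           # Eqs. (2)-(3)
    heads = [attention(X @ we, X @ wf, X @ wg)            # head_u
             for we, wf, wg in zip(W_E, W_F, W_G)]
    return np.concatenate(heads, axis=-1) @ W_O           # Concat(...) W^O
```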
A multi-layer perceptron is applied on top of the MSA layer. The linear layers in the MLP module are separated by a Gaussian Error Linear Unit (GELU) activation. Both MSA and MLP employ residual (skip) connections and layer normalization. Mathematically, these are defined as follows:
$${s}_{t}{\prime}=MSA\left(LN\left({s}_{t-1}\right)\right)+{s}_{t-1}$$
(4)
$${s}_{t}=MLP\left(LN\left({s}_{t}{\prime}\right)\right)+{s}_{t}{\prime}$$
(5)
where \({s}_{t-1}\) denotes the output of layer \(t-1\), \(LN\) denotes layer normalization, and \({s}_{t}\) represents the output of layer \(t\). Figure 5 shows the architecture of the vision transformer for AD prediction. In this work, we employed the Tiny/16 ViT using the transfer learning concept: the last three layers (the indexing, fully connected, and softmax layers) are replaced with a new global average pooling (GAP) layer, a new fully connected layer, and a new softmax layer.
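Eqs. (4) and (5) describe one encoder layer of the vision transformer. The sketch below is an assumed PyTorch rendering of such a layer; the width of 192 and 3 heads match ViT-Tiny/16, and the 4× MLP expansion is a common convention rather than a detail stated here.

```python
# Sketch of one ViT encoder layer implementing Eqs. (4)-(5):
# layer norm -> MSA -> residual, then layer norm -> GELU MLP -> residual.
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, s):
        h = self.ln1(s)
        s_prime = self.msa(h, h, h, need_weights=False)[0] + s   # Eq. (4)
        return self.mlp(self.ln2(s_prime)) + s_prime             # Eq. (5)
```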

Architecture of a proposed vision transformer for AD prediction.
Novelty: proposed IRBwSA architecture
In convolutional neural networks (CNNs), the inverted residual bottleneck is a popular design block, especially in embedded and mobile architectures, and was popularized by MobileNetV2. In a typical bottleneck block, the input feature map is first dimensionally reduced using a pointwise (1 × 1) convolution, processed by a 3 × 3 convolution, and then expanded back to its original dimensionality with another pointwise convolution. The inverted residual bottleneck reverses this structure: instead of reducing the dimensionality, it first expands it, applies a lightweight 3 × 3 depth-wise convolution, and finally projects back to the required dimensionality.
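A minimal sketch of a single inverted residual bottleneck, assuming PyTorch and illustrative channel counts, is given below; it expands the channels with a 1 × 1 convolution, applies a 3 × 3 depth-wise convolution, and projects back before the skip connection.

```python
# Sketch of an inverted residual bottleneck (expand -> depth-wise -> project).
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels=8, expansion=2):     # illustrative values
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),            # expand
            nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3,
                      padding=1, groups=hidden),                   # depth-wise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1),            # project back
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)                                   # skip connection
```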
This paper proposes a new architecture based on residual blocks, parallel bottleneck structures, and a self-attention mechanism. The use of multiple parallel blocks is inspired by multi-branch architectures such as Inception33, which show that parallel paths can enhance representational power without substantially increasing the computational complexity of the network. Figure 6 illustrates the proposed architecture. The input dimensions of this network are 227 × 227 × 3. The first convolutional layer, followed by an activation layer, has a depth of 8, a filter size of 3 × 3, and a stride of 1. After that, the first parallel inverted residual block is added.

Proposed architecture of an inverted-residual bottleneck model with self-attention.
First parallel inverted residual bottleneck block
In the first stage, six inverted residual bottleneck blocks are added in parallel. Each block consists of a ReLU activation34, chosen for its stability and non-saturating nature, a batch normalization layer with 16 channels, and a convolution layer with a depth of 16, a stride of 1, and a filter size of 1 × 1. A grouped convolution layer with a 3 × 3 filter is then added, followed by an activation layer, a batch normalization layer, and a convolution layer with a depth of 8, a filter size of 1 × 1, and a stride of 1. The remaining five blocks use a similar pattern to convolve the weights of the input layer. Finally, the outputs of these blocks are combined with an addition layer. Before the next parallel block, the network is extended with a few intermediate layers: a convolutional layer with a depth of 8, a 3 × 3 filter size, and a stride of 2, followed by an activation layer. The stride values are chosen strategically to perform downsampling while preserving feature richness.
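The structure of this first parallel block can be sketched as follows. This is an assumed PyTorch rendering: the per-branch layer settings follow the description above, while the exact ordering inside each branch and the grouping factor of the grouped convolution are assumptions.

```python
# Sketch of the first parallel inverted residual bottleneck block:
# six identical branches whose outputs are combined by an addition layer.
import torch
import torch.nn as nn

class ParallelInvertedBlock(nn.Module):
    def __init__(self, in_ch=8, hidden=16, branches=6):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, hidden, 1), nn.ReLU(), nn.BatchNorm2d(hidden),
                nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # grouped 3x3
                nn.ReLU(), nn.BatchNorm2d(hidden),
                nn.Conv2d(hidden, in_ch, 1),                             # depth 8, 1x1
            ) for _ in range(branches)
        ])

    def forward(self, x):
        # Addition layer: element-wise sum of all branch outputs.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)
```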
Second parallel inverted residual bottleneck block
In the second stage, five parallel inverted residual bottleneck blocks are introduced. The depth and other layer settings are the same as in the First Parallel Inverted Residual Bottleneck Block (FPIRBB). Following this stage, a few intermediate layers are added, including a convolutional layer with a depth of 16, a filter size of 3 × 3, and a stride of 2, followed by a ReLU activation layer.
Third parallel inverted residual bottleneck block
The third stage introduces a four-path parallel inverted residual bottleneck block. This block consists of four pathways and one skip connection, with seven layers in each path. Each path begins with a convolution layer with a filter size of 1 × 1, a depth of 32, and a stride of 1, followed by a ReLU activation layer. Batch normalization is then applied to speed up training. After that, a grouped convolution layer with a filter size of 3 × 3 is added, followed by another activation layer, a batch normalization layer, and a convolution layer with a depth of 16 and a filter size of 1 × 1. The outputs of the four paths and the skip connection are combined using an addition layer. A few intermediate layers, namely a convolutional layer with a stride of 2 and a ReLU activation, are then added.
Fourth parallel inverted residual bottleneck block
The fourth parallel inverted residual bottleneck block follows the same procedure as the third, with one exception: three pathways are used along with a skip connection. The intermediate convolutional layer has a depth of 32 and a stride of 2. Moreover, each convolutional layer is followed by a ReLU activation layer.
Fifth parallel inverted residual bottleneck block
In this stage, two paths are appended in parallel, each including a convolution layer with a stride of 1, a depth of 64, and a filter size of 1 × 1, followed by a ReLU activation layer and a batch normalization layer with 64 channels. After that, a grouped convolution layer with a 3 × 3 filter and a ReLU activation layer are added. In addition, a convolution layer with a depth of 32, a stride of 1, and a 1 × 1 filter size is added at the end of this stage, and the two paths are finally combined using an addition layer. After that, a few intermediate layers are added, similar to those after the fourth block, with a stride value of 2.
Sixth and seventh parallel inverted residual bottleneck blocks
Single and dual paths are added in the sixth and seventh parallel residual blocks, respectively. The seventh block contains two paths. In each path, two convolution layers are placed: the first with 128 filters, a filter size of 1 × 1, and a stride of 1; the second with a depth of 64, a filter size of 1 × 1, and a stride of 1. A grouped convolutional layer with a 3 × 3 filter size is also added. After both stages, a few intermediate layers with a stride of 2 are added.
Additional blocks
In the subsequent blocks, the convolutional layers operate at larger depths, such as 128 and 256, each with a filter size of 3 × 3 and a stride of 1. Before the final block, two paths are added. In each path, a convolution layer with a depth of 512, a 1 × 1 filter size, and a stride of 1 is placed, followed by a ReLU layer and a batch normalization layer with 512 channels. Next, a grouped convolution with a 3 × 3 filter is added, followed by a ReLU layer and another batch normalization layer. Finally, a convolution layer with a 1 × 1 filter, a depth of 256, and a stride of 1 is applied.
Final layers
The model’s final addition layer connects these blocks to the remaining layers. A convolution layer with a depth of 512, a stride of 1, and a 3 × 3 filter size is then added, followed by a global average pooling layer and a flattening layer. The flattening layer reduces the multidimensional input to one dimension, as is typical when transitioning from convolutional to fully connected layers. It is followed by a fully connected layer with an input size of 512, a self-attention layer, a new fully connected layer, a new softmax layer, and an output layer for categorization. Cross entropy is employed as the loss function of the proposed model, formulated as \({\mathcal{L}}_{loss}=-\frac{1}{{\phi }_{s}}\sum_{j=1}^{{\phi }_{s}}\sum_{k=1}^{{\Phi }_{c}}{\eta }_{j,k}\log\left({\widehat{\eta }}_{j,k}\right)\), where \({\Phi }_{c}\) is the number of classes, \({\phi }_{s}\) is the number of samples, \({\eta }_{j,k}\) is the actual label of the \({j}^{th}\) sample for class \(k\), and \({\widehat{\eta }}_{j,k}\) is the predicted probability of class \(k\) for sample \(j\). The total number of parameters of the proposed model is 3.4 million.
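As a worked example, the cross-entropy loss defined above can be evaluated as follows; the 4 × 4 label and prediction matrices are purely illustrative.

```python
# Worked example of the cross-entropy loss over phi_s samples and Phi_c classes.
import numpy as np

def cross_entropy(eta, eta_hat, eps=1e-12):
    # eta: one-hot labels (phi_s x Phi_c); eta_hat: predicted probabilities.
    phi_s = eta.shape[0]
    return -np.sum(eta * np.log(eta_hat + eps)) / phi_s

eta = np.eye(4)[[0, 2, 1, 3]]                   # 4 samples, 4 AD classes
eta_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.2, 0.1, 0.6, 0.1],
                    [0.1, 0.7, 0.1, 0.1],
                    [0.1, 0.1, 0.1, 0.7]])
print(round(cross_entropy(eta, eta_hat), 3))    # 0.395
```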
Hyperparameters and proposed models training
In this section, the hyperparameters of the proposed models are discussed. The selected dataset is divided using a 50:50 split, and the training set is augmented. The models are then trained from scratch using several hyperparameters. The input size of each image for the presented models is \(227\times 227\times 3\). The number of passes over the training and validation data is controlled by the number of epochs; in this work, we used 50 epochs for training. The next hyperparameter is the batch size, which refers to the number of samples processed together during a forward or backward pass of network training; we selected a batch size of 64. The optimizer is Adam with a learning rate of 0.00011, chosen for its fast convergence. Accuracy is used to validate the training performance of the proposed models.
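A minimal training-loop sketch reflecting these hyperparameters is shown below; `model`, `train_set`, and the cross-entropy loss choice are stand-ins for one of the two proposed networks and the augmented 50% training split, not details taken from the paper.

```python
# Training sketch with the stated hyperparameters: 50 epochs, batch size 64,
# Adam optimizer, learning rate 0.00011, 227x227x3 inputs.
import torch
from torch.utils.data import DataLoader

EPOCHS, BATCH_SIZE, LR = 50, 64, 0.00011

# `model` and `train_set` are assumed placeholders for a proposed network and
# the augmented training split (images resized to 227x227x3).
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = torch.nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```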
After setting the hyperparameters, both designed models are trained from scratch for the stated number of epochs and later utilized for feature extraction. Features are extracted from the third-last layers, namely the self-attention and global average pooling layers. The extracted features are finally fused in the next step using a novel serially controlled search update approach.
Novelty: proposed serially search-based fusion
Features are fused using a novel serial search-based approach consisting of two dependent steps. In the first step, the features are fused using a serial approach; the fused vector is then passed to a search mechanism for the final selection of optimal features35.
Consider two feature vectors of dimensions \(N\times 912\) and \(N\times 512\). The feature vectors \(\left|O(v)\right|\) and \(\left|Q(v)\right|\) are mathematically represented as:
$$\left|O(v)\right|=\left({O}^{v\left(1\right)}\left(v\right),\dots ,{O}^{v\left(e\right)}\left(v\right),\dots ,{O}^{v\left(m\right)}\left(v\right)\right)$$
(6)
$$\left|Q(v)\right|=\left({Q}^{v\left(1\right)}\left(v\right),\dots ,{Q}^{v\left(e\right)}\left(v\right),\dots ,{Q}^{v\left(m\right)}\left(v\right)\right)$$
(7)
where \(\left|O(v)\right|\) and \(\left|Q(v)\right|\) are the feature vectors of the proposed vision transformer and IRBwSA models, respectively. The serial fusion of these feature vectors is defined as follows:
$$\left|P\right|={\left(\begin{array}{c}\left|O\left(v\right)\right|\\ \left|Q(v)\right|\end{array}\right)}_{N\times \left({k}_{1}+{k}_{2}\right)}$$
(8)
where \(\left|P\right|\) denotes the serially fused feature vector of dimension N × 1424. After that, we adopt a strategy similar to eagle search, which moves in a spiral pattern to search for food (the optimal features) in the selected feature space. Mathematically, the movement process in the search space is defined as follows:
$${P}_{k,new}={P}_{k}+s\left(k\right)\times \left({P}_{k}-{P}_{k+1}\right)+z\left(k\right)\times \left({P}_{k}-{P}_{\mu }\right)$$
(9)
$$s\left(k\right)=\frac{s r (k)}{max\left(\left|s r\right|\right)}$$
(10)
$$z\left(k\right)=\frac{z r (k)}{max\left(\left|z r\right|\right)}$$
(11)
$$s r \left(k\right)=r\left(k\right)\times \text{sin}\left(\theta \left(k\right)\right); z r \left(k\right)=r\left(k\right)\times \text{cos}\left(\theta \left(k\right)\right)$$
(12)
$$\theta \left(k\right)=a\times \pi \times rnd;\; r\left(k\right)=\theta \left(k\right)+R\times rnd$$
(13)
where \(R\) denotes the number of search cycles in the feature space and \(\theta\) denotes the search angle. In this paper, we used \(\theta\) values of 45° and 270°. The eagle searches for the optimal solution (food) at these angles and selects the best points for the final fusion as follows:
$$Best\left(v\right)=\underset{a=1\dots n}{\text{min}}Fit(v)$$
(14)
The final fused vector has dimension N × 712 and is classified using a shallow wide neural network and other neural network classifiers.
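A simplified sketch of the two fusion steps is given below. The serial concatenation follows Eq. (8); the spiral eagle-search update of Eqs. (9)-(14) is reduced here to a generic fitness ranking purely for illustration, so the selection rule is an assumption rather than the exact procedure.

```python
# Sketch of serial fusion followed by fitness-based selection of 712 features.
import numpy as np

N = 100                                          # illustrative number of samples
O_v = np.random.randn(N, 912)                    # vision-transformer features
Q_v = np.random.randn(N, 512)                    # IRBwSA features

P = np.concatenate([O_v, Q_v], axis=1)           # serial fusion, Eq. (8): N x 1424

fitness = P.var(axis=0)                          # stand-in fitness per feature
best = np.argsort(fitness)[-712:]                # keep the 712 best-scoring features
fused = P[:, best]                               # final fused vector: N x 712
print(fused.shape)                               # (100, 712)
```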
Shallow Wide Neural Network: shallow networks come in several variants depending on the number of hidden layers and the neurons per hidden layer, such as narrow neural networks (NNN), medium neural networks (MNN), wide neural networks (WNN), bi-layered neural networks (BNN), and tri-layered neural networks (TNN). The NNN has one input, one hidden, and one output layer; the hidden layer contains 10 neurons with ReLU activation. NNNs are less intricate, exhibiting lower computational cost and less risk of overfitting. The MNN and WNN each contain one input, one hidden, and one output layer; the MNN has 25 neurons and the WNN has 100 neurons, both with ReLU activation. The BNN has one input, two hidden, and one output layer; each hidden layer consists of 10 neurons with ReLU activation. These deeper variants allow the classifier to learn more complex relationships among the input data. The TNN has one input, three hidden, and one output layer; each hidden layer consists of 10 neurons with ReLU activation. The TNN is more expressive and complex than the other networks and can learn hierarchical structures in the data. The final selected features are classified using these Shallow Wide Neural Network (SWNN) classifiers, whose configurations are sketched below, for the final classification.
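The five classifier configurations described above can be summarized with a small builder such as the one below (an assumed PyTorch sketch; the input width of 712 matches the fused vector and the output width of 4 matches the AD classes).

```python
# Sketch of the NNN/MNN/WNN/BNN/TNN configurations used for final classification.
import torch.nn as nn

def shallow_net(hidden_sizes, in_dim=712, classes=4):
    layers = []
    for width in hidden_sizes:
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, classes))
    return nn.Sequential(*layers)

NNN = shallow_net((10,))          # narrow: one hidden layer, 10 neurons
MNN = shallow_net((25,))          # medium: one hidden layer, 25 neurons
WNN = shallow_net((100,))         # wide: one hidden layer, 100 neurons
BNN = shallow_net((10, 10))       # bi-layered: two hidden layers
TNN = shallow_net((10, 10, 10))   # tri-layered: three hidden layers
```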
Local interpretable model-agnostic explanations (LIME)
The fundamental concept of explainable AI (XAI) is to provide methodologies that make machine learning frameworks understandable to human non-experts. XAI approaches are used to explain a model's predictions, results, and flaws, and they make it possible to characterize the model's integrity, accuracy, and fairness. Local Interpretable Model-Agnostic Explanations (LIME) is used to estimate the predictive behavior of the trained model36. The primary goal of LIME is to assess which features of the input data are essential for the classification outcome, thereby approximating the behavior of the deep learning model. The main steps are: (i) segment the image into features (superpixels); (ii) generate synthetic images by randomly including or excluding features, where each pixel in an excluded feature is replaced with the average image pixel; (iii) classify the synthetic images using the deep network; (iv) fit a regression model; and (v) calculate the importance of each feature using the regression model.
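These steps map onto the image explainer of the lime package. The sketch below is an assumed usage example: `mri_slice` is an H × W × 3 array and `predict_fn` is any function returning class probabilities for a batch of images; neither is defined in the paper.

```python
# Sketch of LIME steps (i)-(v) for one MRI slice.
from lime import lime_image
from skimage.segmentation import mark_boundaries

explainer = lime_image.LimeImageExplainer()          # step (i): segments internally
explanation = explainer.explain_instance(
    mri_slice,                 # assumed H x W x 3 numpy array
    predict_fn,                # step (iii): classifies the synthetic images
    top_labels=1,
    hide_color=None,           # step (ii): excluded regions get the mean pixel value
    num_samples=1000,          # number of synthetic images
)                              # steps (iv)-(v) happen inside explain_instance
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5)
overlay = mark_boundaries(temp, mask)                # highlight the key regions
```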
