Overview
In recent years, artificial intelligence (AI) methods have demonstrated transformative potential in medical imaging, revolutionizing key tasks such as disease diagnosis, image reconstruction, and workflow optimization. This subsection provides a high-level overview of our method designed to address limitations in current medical imaging AI workflows. Our approach builds upon recent advancements in deep learning, computational efficiency, and explainable AI to improve model performance and usability in real-world clinical environments.
The structure of this section is as follows. In Section 3.2, we introduce the preliminary concepts and define the challenges inherent in medical imaging tasks, including the need for robust AI systems that generalize across diverse patient populations. In Section 3.3, we present our novel model architecture, which leverages domain-specific constraints to enhance interpretability and diagnostic accuracy. In Section 3.4, we describe our innovative strategy, which integrates adaptive learning and uncertainty quantification to optimize decision-making in clinical workflows. This combination of model design and strategic implementation aims to bridge the gap between cutting-edge research and practical utility in healthcare systems. By systematically exploring these components, we aim to provide a comprehensive framework for advancing medical imaging AI, particularly in challenging scenarios such as multi-modal imaging and imbalanced datasets. The sections that follow will formalize the problem setup, introduce our contributions, and highlight the strategies employed to achieve reliable performance and deployment readiness.
Preliminaries
Medical imaging plays a critical role in modern healthcare, encompassing a wide array of imaging modalities such as X-rays, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound, and positron emission tomography (PET). Each modality provides unique insights into anatomical structures or physiological processes, facilitating tasks such as disease diagnosis, monitoring, and treatment planning. Despite their clinical utility, the interpretation of medical images presents challenges due to high data complexity, inter-patient variability, and the potential for subjectivity in manual diagnosis.
To address these challenges, medical imaging AI has emerged as a promising field, combining advanced deep learning algorithms with domain-specific constraints. The task of medical image analysis can be formalized as follows. Let \({\mathcal {X}}\) denote the input space of medical images, where each \(x \in {\mathcal {X}}\) represents a high-dimensional image tensor. For instance, an MRI scan may be modeled as a 3D tensor, \(x \in {\mathbb {R}}^{H \times W \times D}\), where H, W, and D are the height, width, and depth of the image, respectively. Let \({\mathcal {Y}}\) represent the output space, such as diagnostic labels, segmentation masks, or reconstructed image volumes. The primary goal is to learn a mapping function \(f_\theta : {\mathcal {X}} \rightarrow {\mathcal {Y}}\), parameterized by \(\theta\), that minimizes a task-specific objective function.
$$\begin{aligned} {\mathcal {L}}(\theta ) = \frac{1}{N} \sum _{i=1}^{N} \ell (f_\theta (x_i), y_i), \end{aligned}$$
(1)
where \(\ell (\cdot , \cdot )\) is a loss function that quantifies the discrepancy between the predicted output \(f_\theta (x_i)\) and the ground truth \(y_i\), and N is the number of training samples.
Medical imaging problems are further complicated by factors such as imbalanced datasets, noise, and variability in image quality. For example, datasets may exhibit skewed class distributions, where rare diseases have significantly fewer labeled samples compared to common conditions. To account for such challenges, we consider the problem in a probabilistic framework. Given an observed image x, the true label y is modeled as a random variable with conditional probability distribution P(y|x). The goal of supervised learning is to approximate P(y|x) using the parametric model \(f_\theta (x)\). A probabilistic formulation enables techniques such as uncertainty quantification, essential for clinical decision-making.
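As one illustration of how a probabilistic output supports uncertainty quantification, the sketch below computes the Shannon entropy of a predicted class distribution. The probability values are made-up examples rather than results from this work, and the function name `predictive_entropy` is a hypothetical helper.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each row of class probabilities.
    # eps guards against log(0); it is an implementation convenience.
    return -np.sum(probs * np.log(probs + eps), axis=-1)

# Hypothetical predictions for two images over three classes.
probs = np.array([
    [0.98, 0.01, 0.01],        # confident prediction
    [1 / 3, 1 / 3, 1 / 3],     # maximally ambiguous prediction
])
entropy = predictive_entropy(probs)
print(entropy)
```

The ambiguous prediction attains the maximum possible entropy, log(3), while the confident one is close to zero, so a larger value signals lower confidence.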
Image segmentation, a fundamental task in medical imaging, involves delineating regions of interest (ROIs) such as tumors or organs. Formally, segmentation can be viewed as a pixel-wise classification problem. Let \(x \in {\mathbb {R}}^{H \times W}\) represent a 2D medical image, and let \(y \in \{0, 1\}^{H \times W \times C}\) denote the corresponding one-hot segmentation mask, where C is the number of classes and \(y_{i,j,c} = 1\) if pixel (i, j) belongs to class c. The objective is to minimize the pixel-wise cross-entropy:
$$\begin{aligned} {\mathcal {L}}_{\text {seg}}(\theta ) = - \frac{1}{HW} \sum _{i=1}^{H} \sum _{j=1}^{W} \sum _{c=1}^{C} y_{i,j,c} \log f_\theta (x)_{i,j,c}, \end{aligned}$$
(2)
where \(f_\theta (x)_{i,j,c}\) is the predicted probability for pixel (i, j) belonging to class c.
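A minimal NumPy sketch of the pixel-wise cross-entropy in Eq. (2) follows. It assumes integer class labels indexed from 0 that are expanded to one-hot form; the function name and the toy 2x2 inputs are illustrative only.

```python
import numpy as np

def pixelwise_cross_entropy(probs, labels, eps=1e-12):
    """probs: (H, W, C) predicted class probabilities.
    labels: (H, W) integer class ids in [0, C)."""
    H, W, C = probs.shape
    one_hot = np.eye(C)[labels]                      # (H, W, C) one-hot mask
    return -np.sum(one_hot * np.log(probs + eps)) / (H * W)

labels = np.array([[0, 1],
                   [1, 0]])
perfect = np.eye(2)[labels]          # predictions that match the mask exactly
uniform = np.full((2, 2, 2), 0.5)    # uninformative predictions

loss_perfect = pixelwise_cross_entropy(perfect, labels)
loss_uniform = pixelwise_cross_entropy(uniform, labels)
```

A perfect prediction gives a loss of essentially zero, while the uniform prediction gives log(2) per pixel for two classes.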
In medical image reconstruction, the goal is to recover a high-quality image \(x_{\text {recon}}\) from noisy or incomplete measurements \(x_{\text {obs}}\). For example, in accelerated MRI, the undersampled measurement \(x_{\text {obs}}\) is related to the fully sampled image \(x_{\text {recon}}\) via a forward model \({\mathcal {F}}\):
$$\begin{aligned} x_{\text {obs}} = {\mathcal {F}}(x_{\text {recon}}) + \epsilon , \end{aligned}$$
(3)
where \({\mathcal {F}}\) denotes the sampling operator, and \(\epsilon\) represents noise. Learning-based reconstruction methods address this inverse problem by training a network \(f_\theta\) that maps the measurement \(x_{\text {obs}}\) to an estimate of \(x_{\text {recon}}\), with \(\theta\) chosen to minimize the data-consistency objective:
$$\begin{aligned} {\mathcal {L}}_{\text {recon}}(\theta ) = \Vert x_{\text {obs}} - {\mathcal {F}}(f_\theta (x_{\text {obs}}))\Vert _2^2. \end{aligned}$$
(4)
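The forward model and data-consistency term can be sketched as follows, under the assumption that \({\mathcal {F}}\) is a 2D Fourier transform followed by a binary k-space sampling mask (a common idealization of accelerated MRI, not necessarily the operator used in this work). The image, the mask density, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_op(image, mask):
    # Assumed F: orthonormal 2D FFT followed by a binary sampling mask.
    return mask * np.fft.fft2(image, norm="ortho")

def data_consistency_loss(x_obs, estimate, mask):
    # Squared L2 distance between the measurements and the
    # measurements re-simulated from a candidate reconstruction.
    residual = x_obs - forward_op(estimate, mask)
    return float(np.sum(np.abs(residual) ** 2))

image = rng.standard_normal((8, 8))               # stand-in for the fully sampled image
mask = (rng.random((8, 8)) < 0.4).astype(float)   # keep roughly 40% of k-space
x_obs = forward_op(image, mask)                   # noise-free measurements (epsilon = 0)

loss_true = data_consistency_loss(x_obs, image, mask)
loss_zero = data_consistency_loss(x_obs, np.zeros_like(image), mask)
```

The true image reproduces the measurements exactly, so its loss is zero, whereas an all-zero estimate leaves the full measurement energy as residual. Because unsampled frequencies are unconstrained, this term alone does not determine a unique reconstruction, which is why a learned prior is needed.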
Another crucial problem is multi-modal medical imaging, where information from multiple imaging modalities (e.g., MRI and CT) must be integrated for comprehensive analysis. Let \(x^{(1)}, x^{(2)}, \dots , x^{(M)}\) represent input images from M modalities. The task is to learn a joint representation:
$$\begin{aligned} z = g_\phi (x^{(1)}, x^{(2)}, \dots , x^{(M)}), \end{aligned}$$
(5)
where \(g_\phi\) is a feature fusion function parameterized by \(\phi\). The joint representation z is then used for downstream tasks, such as classification or segmentation.
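One simple way to realize the fusion function \(g_\phi\) of Eq. (5) is to concatenate per-modality feature vectors and apply a learned linear map with a nonlinearity. The sketch below assumes this concatenation-based design purely for illustration; the feature sizes, random weights, and the name `fuse_modalities` are hypothetical, and the fusion actually used by the model is described in later sections.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_modalities(features, W, b):
    # Assumed g_phi: concatenate modality features, then linear map + ReLU.
    concat = np.concatenate(features)
    return np.maximum(W @ concat + b, 0.0)

mri_feat = rng.standard_normal(8)        # hypothetical MRI feature vector
ct_feat = rng.standard_normal(8)         # hypothetical CT feature vector
W = 0.1 * rng.standard_normal((4, 16))   # phi: weights (random here, learned in practice)
b = np.zeros(4)                          # phi: bias

z = fuse_modalities([mri_feat, ct_feat], W, b)
```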
The backbone network extracts modality-specific features from radiology or pathology images. These features are then projected into a domain-informed latent space using a structured transformation layer that aligns them across modalities. The DIANet module operates on this latent space, applying adaptive attention to recalibrate and fuse the cross-modal features. The ACWI module takes the fused representations and adjusts the output flow based on uncertainty estimation, enabling the model to select the most reliable prediction path. All components are implemented as lightweight modules that wrap around the backbone without modifying its internal architecture, allowing seamless end-to-end integration.
Domain-Informed Adaptive Network (DIANet)
In this section, we present our proposed model, the Domain-Informed Adaptive Network (DIANet), which is designed to address the challenges of medical imaging tasks. DIANet is a unified architecture that incorporates domain knowledge, multi-scale feature extraction, and task-specific attention mechanisms to improve the interpretability and robustness of deep learning-based medical imaging solutions. It targets complex clinical scenarios such as imbalanced datasets and multi-modal imaging, and it supports uncertainty quantification. Below, we detail the structure and key innovations of DIANet (see Figure 1).

Figure 1. Overview of the Domain-Informed Adaptive Network (DIANet) architecture, incorporating multi-scale feature encoding (MFE), domain-informed latent space (DLS), and task-specific output heads (TOH). The diagram illustrates how DIANet processes medical images with a hierarchical feature extraction approach, integrates domain-specific knowledge, and adapts to different medical imaging tasks such as classification, segmentation, and reconstruction. The design combines attention mechanisms, multi-scale feature fusion, and uncertainty quantification, making it suitable for complex clinical scenarios such as imbalanced datasets and multi-modal imaging.
The diffuser module is a key component of the latent feature refinement process. It operates by applying a series of nonlinear transformations that adjust the encoded features according to both spatial and contextual relationships within the medical images. Specifically, after the initial latent vector is generated from the fused multi-scale features, the diffuser integrates attention-based operations to redistribute representational emphasis. This mechanism is particularly important for handling multi-modal inputs, where spatial alignment and semantic consistency are critical. By diffusing features through this contextual transformation process, the model achieves better adaptation to structural variances across domains such as pathology and radiology. The CodeBook module complements the diffuser by introducing a discrete set of learnable vectors that represent prototypical latent patterns. Each incoming latent feature is softly matched against this set, enabling the model to quantize the representation space in a way that preserves meaningful anatomical priors. This quantization process not only regularizes the feature space but also promotes inter-sample consistency, which is vital for clinical reliability. The use of a CodeBook allows the latent space to capture a structured representation that aligns with expected biological or anatomical patterns, especially in heterogeneous datasets.
Multi-scale feature encoding
The multi-scale feature encoder is designed to capture information at different spatial resolutions, allowing the model to integrate both fine-grained details and global context, which is crucial for medical image analysis. Let the input medical image be represented as \(x \in {\mathbb {R}}^{H \times W \times C}\), where \(H\) and \(W\) are the height and width of the image, and \(C\) is the number of input channels (e.g., grayscale or RGB). The encoder is composed of a series of convolutional blocks \(\{ f_i \}_{i=1}^{L}\), where \(L\) denotes the number of blocks. Each block \(f_i\) operates at a different spatial resolution, progressively extracting hierarchical features. Initially, the input image is passed through the first block \(f_1\), producing an output feature map \(z_1 = f_1(x)\). Each subsequent block operates on the output from the previous block, progressively downsampling the feature maps, as shown by the recurrence:
$$\begin{aligned} z_i = f_i(z_{i-1}), \quad \text {where } z_0 = x. \end{aligned}$$
(6)
The output of each block is a feature map \(z_i \in {\mathbb {R}}^{H_i \times W_i \times D_i}\), where \(D_i\) is the number of feature channels at scale \(i\), and \(H_i, W_i\) are the spatial dimensions of the feature map after downsampling. As the resolution decreases, the encoder captures more abstract, high-level representations of the image, so the resulting hierarchy carries both local and global information. To further refine the learned features, the multi-scale outputs are aggregated through the Feature Pyramid Fusion (FPF) module, which combines features from different scales:
$$\begin{aligned} z_{\text {fused}} = \text {FPF}(\{ z_i \}_{i=1}^{L}), \end{aligned}$$
(7)
where \(z_{\text {fused}}\) is the fused feature map that retains multi-scale information. In this process, features from different scales are merged through both top-down and bottom-up pathways. To improve the feature integration, we apply a lateral connection that ensures information flows effectively between different scales:
$$\begin{aligned} z_{\text {fused}}^L = \sum _{i=1}^{L} \alpha _i z_i, \end{aligned}$$
(8)
where \(\alpha _i\) are learnable coefficients determining the importance of each scale. This process enhances the robustness of the model by allowing it to leverage both fine-grained and coarse features effectively. To further enhance multi-scale learning, we apply spatial attention mechanisms that allow the model to focus on important regions at each scale. The attention mechanism at each scale can be represented as:
$$\begin{aligned} z_{\text {attended}} = \text {Attention}(z_i), \end{aligned}$$
(9)
where the attention operation refines the feature map \(z_i\) by emphasizing important regions and suppressing irrelevant ones.
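The recurrence in Eq. (6) and the weighted fusion in Eq. (8) can be sketched as below. Several simplifications are assumed: each block \(f_i\) is replaced by 2x2 average pooling, feature maps are resized to the finest scale by nearest-neighbour upsampling so that they can be summed, and the weights \(\alpha _i\) are fixed numbers rather than learned parameters.

```python
import numpy as np

def avg_pool2x(z):
    # Stand-in for a convolutional block f_i: halves height and width.
    H, W, D = z.shape
    return z.reshape(H // 2, 2, W // 2, 2, D).mean(axis=(1, 3))

def upsample_to(z, size):
    # Nearest-neighbour upsampling of a square map to (size, size).
    factor = size // z.shape[0]
    return np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)

def fuse(features, alphas):
    # Weighted sum over scales after resizing to the finest resolution.
    size = features[0].shape[0]
    return sum(a * upsample_to(z, size) for a, z in zip(alphas, features))

x = np.arange(64, dtype=float).reshape(8, 8, 1)   # toy single-channel input
features, z = [], x
for _ in range(3):                                # z_i = f_i(z_{i-1}), z_0 = x
    z = avg_pool2x(z)
    features.append(z)

alphas = np.array([0.5, 0.3, 0.2])                # hypothetical scale weights
z_fused = fuse(features, alphas)
```

Here the three scales have spatial sizes 4x4, 2x2, and 1x1, and the fused map has the size of the finest scale. The resizing step is needed because feature maps at different scales cannot be added element-wise otherwise.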
The MultiScaleFusion module integrates multi-resolution feature maps from different stages of the backbone network to capture both global contextual patterns and fine-grained structural cues [37, 38]. Inspired by hierarchical fusion strategies in feature pyramid networks and attention-guided feature refinement, our module employs adaptive weighting mechanisms to recalibrate features at each scale before fusion [39]. Specifically, it applies spatial attention to emphasize clinically salient regions and aggregates features through channel-wise weighting, ensuring information from deeper layers is effectively aligned with shallower representations [40]. Unlike conventional concatenation or summation methods, our approach maintains semantic consistency across modalities while enhancing robustness against noise and resolution disparity [41].
Domain-informed latent space
DIANet integrates domain knowledge into the learning process to enhance model robustness and interpretability. The intermediate feature representation, denoted as \(z_{\text {fused}}\), is passed through a transformation function \(g_\phi\), which maps it to a domain-informed latent space. This transformation is defined as:
$$\begin{aligned} h = g_\phi (z_{\text {fused}}), \quad h \in {\mathbb {R}}^{d}, \end{aligned}$$
(10)
where \(h\) represents the latent representation vector, and \(d\) is the dimensionality of the latent space. The latent space is structured to preserve essential domain-specific information, which is crucial for downstream tasks such as classification and segmentation. To ensure that the learned latent space aligns with anatomical knowledge, we introduce domain-informed regularizers. These regularizers guide the learning process by enforcing consistency between the learned latent representation and predefined anatomical priors. Specifically, the regularization term is defined as:
$$\begin{aligned} {\mathcal {R}}_{\text {domain}} = \lambda _1 \Vert {\mathcal {P}}(h) - {\mathcal {P}}_{\text {target}}\Vert _2^2, \end{aligned}$$
(11)
where \({\mathcal {P}}(h)\) represents the predicted anatomical distribution derived from the latent space \(h\), and \({\mathcal {P}}_{\text {target}}\) is the anatomical prior, which can be a known distribution such as a probability map of organ locations. This regularization helps to shape the learned representation according to prior anatomical knowledge, thus improving generalization across different domains. Moreover, to model the uncertainty inherent in medical imaging data, we treat the latent representation \(h\) as a probabilistic distribution rather than a fixed point. Specifically, we model \(h\) as a multivariate Gaussian distribution:
$$\begin{aligned} h \sim {\mathcal {N}}(\mu , \Sigma ), \end{aligned}$$
(12)
where \(\mu\) and \(\Sigma\) represent the mean and covariance of the distribution, respectively. These parameters are generated through the transformation function \(g_\phi\), allowing the model to encode not only the point estimate but also the uncertainty about the latent representation. This probabilistic approach provides a measure of confidence in the model’s predictions, which is essential in medical applications. The mean \(\mu\) and covariance \(\Sigma\) are computed as:
$$\begin{aligned} \mu = g_\mu (z_{\text {fused}}), \quad \Sigma = g_\Sigma (z_{\text {fused}}), \end{aligned}$$
(13)
where \(g_\mu\) and \(g_\Sigma\) are separate networks parameterized by \(\phi\) that produce the mean and covariance, respectively. In addition to these primary equations, we introduce a regularization term that encourages the learned latent space to conform to the known anatomy, thus facilitating accurate predictions across diverse domains:
$$\begin{aligned} {\mathcal {R}}_{\text {latency}} = \lambda _2 \Vert {\mathcal {P}}(h) - {\mathcal {P}}_{\text {prior}}\Vert _1, \end{aligned}$$
(14)
where \({\mathcal {P}}_{\text {prior}}\) is an anatomical prior based on prior knowledge or synthetic data, and \(\Vert \cdot \Vert _1\) denotes the \(L_1\)-norm. The domain alignment is enhanced using adversarial learning techniques. The adversarial loss encourages the model to generate latent representations that are indistinguishable from the target distribution. This is achieved by minimizing the following adversarial loss:
$$\begin{aligned} {\mathcal {L}}_{\text {adv}} = -{\mathbb {E}}_{h \sim p_{\text {data}}}[\log D(h)] - {\mathbb {E}}_{h \sim p_{\text {prior}}}[\log (1 - D(h))], \end{aligned}$$
(15)
where \(D(h)\) is the discriminator network that distinguishes between real and generated latent vectors, and \(p_{\text {data}}\) and \(p_{\text {prior}}\) represent the data and prior distributions, respectively. By incorporating these additional regularization and adversarial losses, DIANet encourages a latent space that captures the relevant domain-specific information, models uncertainty, and aligns with anatomical priors, which improves its robustness in real-world clinical applications (see Figure 2).
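The probabilistic latent of Eq. (12) and the regularizer of Eq. (11) can be illustrated with a short sketch. It assumes a diagonal covariance parameterized by a log-variance vector and uses the standard reparameterization \(h = \mu + \sigma \odot \varepsilon \); the text does not specify the covariance structure, so this is one possible choice. All numeric values and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng, n):
    # Draw n samples h = mu + sigma * eps with eps ~ N(0, I).
    # A diagonal covariance is assumed here for simplicity.
    eps = rng.standard_normal((n, mu.shape[0]))
    return mu + np.exp(0.5 * log_var) * eps

def domain_regularizer(p_h, p_target, lam=1.0):
    # lambda_1 * || P(h) - P_target ||_2^2
    return lam * float(np.sum((p_h - p_target) ** 2))

mu = np.array([1.0, -2.0])                    # stand-in for g_mu(z_fused)
log_var = np.log(np.array([0.25, 4.0]))       # stand-in for g_Sigma(z_fused)

samples = sample_latent(mu, log_var, rng, n=20000)
sample_mean = samples.mean(axis=0)
sample_var = samples.var(axis=0)

reg = domain_regularizer(np.array([0.5, 0.5]), np.array([1.0, 0.0]), lam=2.0)
```

The empirical mean and variance of the samples approach the specified parameters, and the spread of the samples provides the confidence measure discussed above.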

Figure 2. The architecture of DIANet incorporates domain-informed latent space learning to enhance robustness and interpretability in medical imaging tasks. The model starts with a latent feature that is processed through self-attention and temporal alignment modules, which interact with a codebook to generate transformed latent representations. The domain-informed latent space is further refined by domain-specific regularizers and adversarial learning techniques, ensuring that the learned representation aligns with anatomical knowledge. The diagram illustrates two options for model decoding: Option 1 uses a diffuser and a latent feature for reconstruction, while Option 2 introduces additional domain alignment techniques. The latent representation is modeled as a probabilistic distribution, facilitating uncertainty estimation in the medical predictions. This approach supports accurate classification and segmentation by integrating prior anatomical knowledge and enhancing generalization across different domains.
Task-specific output head
DIANet employs different task-specific output heads, each tailored for particular medical imaging tasks, including classification, segmentation, and reconstruction. Let \(h\) represent the latent representation generated by the network, and \({\mathcal {T}}\) denote the specific task at hand. The output head for each task is a mapping function \(o_\psi ^{({\mathcal {T}})}\), defined as:
$$\begin{aligned} y = o_\psi ^{({\mathcal {T}})}(h), \end{aligned}$$
(16)
where \(y\) represents the predicted output. In the case of classification, the output head uses a softmax function to transform the latent representation \(h\) into probabilities for each class. This is computed as:
$$\begin{aligned} o_\psi ^{(\text {class})}(h) = \text {softmax}(W h + b), \end{aligned}$$
(17)
where \(W\) is the weight matrix, \(b\) is the bias term, and \(h\) is the input latent representation. The softmax function ensures that the outputs are probabilities, such that:
$$\begin{aligned} \sum _{i} \text {softmax}(W h + b)_i = 1. \end{aligned}$$
(18)
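A minimal sketch of the classification head in Eq. (17) is given below, with the normalization property of Eq. (18) checked numerically. The latent size, the number of classes, and the random weights are placeholders; in the model these parameters would be learned.

```python
import numpy as np

def classification_head(h, W, b):
    # softmax(W h + b); subtracting the max keeps exp() numerically stable
    # and does not change the result.
    logits = W @ h + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, num_classes = 16, 4                      # hypothetical sizes
h = rng.standard_normal(d)                  # stand-in latent representation
W = rng.standard_normal((num_classes, d))   # placeholder weight matrix
b = np.zeros(num_classes)                   # placeholder bias

probs = classification_head(h, W, b)
print(probs.sum())
```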
For segmentation tasks, the output is generated through a transposed convolution (also known as a deconvolution) operation, which increases the spatial resolution of the latent representation \(h\) to match the target segmentation map. This can be expressed as:
$$\begin{aligned} o_\psi ^{(\text {seg})}(h) = \text {ConvTranspose}(h), \end{aligned}$$
(19)
where \(\text {ConvTranspose}\) denotes the transposed convolution operation. Each element of the input is multiplied by the learned filter, and the scaled copies are accumulated at strided positions in the output, which recovers the spatial dimensions of the input image. To capture finer details in the segmentation, skip connections from earlier layers in the network may be employed.
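To make the upsampling behaviour of a transposed convolution concrete, the following sketch implements the one-dimensional case with NumPy. It is a didactic simplification (single channel, no padding, a fixed kernel) and not the layer configuration used in DIANet.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride):
    # Each input value scatters a scaled copy of the kernel into the output,
    # with consecutive copies offset by `stride`.
    n, k = len(x), len(kernel)
    out = np.zeros((n - 1) * stride + k)
    for i, value in enumerate(x):
        out[i * stride : i * stride + k] += value * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0])     # fixed here; learned in a real network
y = conv_transpose_1d(x, kernel, stride=2)
print(y)   # [1. 1. 2. 2. 3. 3.]
```

With stride 2 and a kernel of length 2, an input of length 3 yields an output of length 6, doubling the resolution.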
For reconstruction tasks, DIANet utilizes upsampling layers followed by convolutional layers to reconstruct the original image from the latent representation. This process can be mathematically expressed as:
$$\begin{aligned} o_\psi ^{(\text {recon})}(h) = \text {Upsample}(h), \end{aligned}$$
(20)
where \(\text {Upsample}\) refers to a process of increasing the spatial resolution of the feature map by interpolating the values of \(h\). Following upsampling, convolutional layers are applied to refine the reconstructed image, as given by:
$$\begin{aligned} {\tilde{x}} = \text {Conv}(o_\psi ^{(\text {recon})}(h)), \end{aligned}$$
(21)
where \({\tilde{x}}\) denotes the final reconstructed image. In addition, for multi-scale features, DIANet may use a multi-resolution approach where the output head includes a series of layers, each responsible for different spatial resolutions. This helps the network learn both global and fine-grained features, thus improving the reconstruction accuracy.
Adaptive Clinical Workflow Integration (ACWI)
In this subsection, we introduce our strategy, termed Adaptive Clinical Workflow Integration (ACWI), which complements the proposed model by addressing the operational challenges of deploying medical imaging AI systems in real-world clinical settings. ACWI is designed to ease integration, enhance model adaptability, and facilitate trust through explainability and uncertainty quantification. Below, we outline the core components of ACWI and their contributions to the clinical deployment pipeline (see Figure 3).

Figure 3. Architecture of the Adaptive Clinical Workflow Integration (ACWI) system, which consists of an adaptive learning framework, explainable AI mechanisms, and uncertainty-aware decision support for medical imaging AI. The figure illustrates the stages of the learning pipeline, including the hierarchical feature fusion blocks (HEF), local and global feature paths, and the integration of explainable mechanisms such as attention maps and class activation maps. The system is designed to address domain shifts, enhance model adaptability across clinical environments, and provide interpretable predictions, enabling informed decision-making in high-stakes clinical settings. The integration of uncertainty quantification further supports reliable clinical decision support.
Adaptive learning framework
Medical imaging datasets often suffer from domain shifts due to variations in imaging devices, protocols, and patient demographics. These domain shifts introduce significant challenges in transferring models trained on one dataset to another, especially when the distributions of the images differ. To handle these challenges, DIANet employs an adaptive learning framework that incorporates domain adaptation, transfer learning, and continual learning strategies. Specifically, we define a multi-domain optimization problem where the goal is to generalize the model across diverse imaging environments. Let \({\mathcal {D}} = \{{\mathcal {D}}_s, {\mathcal {D}}_t\}\) represent the source domain \({\mathcal {D}}_s\) (e.g., a dataset from a specific imaging center) and the target domain \({\mathcal {D}}_t\) (e.g., a dataset from a different clinical environment). To minimize the domain shift between these two domains, our strategy minimizes a combined loss function that balances both the primary task loss and the domain discrepancy loss. The combined adaptive loss is given by:
$$\begin{aligned} {\mathcal {L}}_{\text {adaptive}} = {\mathcal {L}}_{\text {task}} + \lambda _3 {\mathcal {L}}_{\text {domain}}, \end{aligned}$$
(22)
where \({\mathcal {L}}_{\text {task}}\) represents the primary task loss, such as the cross-entropy loss for classification or the Dice loss for segmentation, and \({\mathcal {L}}_{\text {domain}}\) is a domain discrepancy loss term that measures the difference between the feature distributions of the source and target domains. Typical choices are the Maximum Mean Discrepancy (MMD) or an adversarial loss. A simplified MMD-style loss, which compares the source and target feature representations directly, is defined as:
$$\begin{aligned} {\mathcal {L}}_{\text {domain}} = \Vert f_{\text {source}}(x_s) - f_{\text {target}}(x_t)\Vert ^2, \end{aligned}$$
(23)
where \(f_{\text {source}}\) and \(f_{\text {target}}\) are the feature representations of the source and target domain images \(x_s\) and \(x_t\), respectively. By minimizing \({\mathcal {L}}_{\text {domain}}\), the model aligns the feature distributions between the source and target domains, reducing the negative impact of domain shifts. To further improve the alignment between domains, we introduce adversarial training, which uses a discriminator \(D\) to classify whether a feature map comes from the source or target domain. The adversarial loss can be written as:
$$\begin{aligned} {\mathcal {L}}_{\text {adv}} = {\mathbb {E}}_{x_s \sim {\mathcal {D}}_s}[\log D(f_{\text {source}}(x_s))] + {\mathbb {E}}_{x_t \sim {\mathcal {D}}_t}[\log (1 - D(f_{\text {target}}(x_t)))]. \end{aligned}$$
(24)
This adversarial loss encourages the source and target domain feature distributions to become indistinguishable. In addition to domain adaptation, we employ transfer learning techniques by pre-training the model on a large source dataset and fine-tuning it on the target domain. This helps transfer knowledge from a well-established source task to the target task, reducing the need for large amounts of target domain data. The transfer loss is:
$$\begin{aligned} {\mathcal {L}}_{\text {transfer}} = \Vert {\mathcal {M}}_s – {\mathcal {M}}_t\Vert _F^2, \end{aligned}$$
(25)
where \({\mathcal {M}}_s\) and \({\mathcal {M}}_t\) are the parameters of the models trained on the source and target domains, respectively, and \(\Vert \cdot \Vert _F\) denotes the Frobenius norm. Moreover, continual learning is incorporated to allow the model to adapt to new data incrementally. The continual learning objective ensures that the model retains previously learned tasks while learning new ones:
$$\begin{aligned} {\mathcal {L}}_{\text {continual}} = \sum _{i=1}^{N} {\mathcal {L}}_{\text {task}}^{(i)} + \lambda _4 {\mathcal {L}}_{\text {regularization}}, \end{aligned}$$
(26)
where \({\mathcal {L}}_{\text {task}}^{(i)}\) is the task loss for the \(i\)-th task, \(N\) here denotes the number of tasks, and \({\mathcal {L}}_{\text {regularization}}\) is a regularization term that penalizes changes in the weights that would drastically affect previously learned tasks. By minimizing \({\mathcal {L}}_{\text {adaptive}}\), the adaptive learning framework helps DIANet generalize across different clinical settings, reducing the impact of domain shifts and improving performance on unseen target domains (see Figure 4).
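The combined objective in Eq. (22) can be sketched with a linear-kernel MMD as the domain term. This is an assumption made for illustration: here the discrepancy is the squared distance between batch-mean feature vectors, which is one concrete reading of Eq. (23); kernel-based MMD or the adversarial loss of Eq. (24) could be used instead. The feature batches, the task loss value, and \(\lambda _3 = 0.1\) are made-up numbers.

```python
import numpy as np

def linear_mmd(feat_s, feat_t):
    # Squared distance between the mean feature vectors of two batches
    # (MMD with a linear kernel). feat_*: (batch, feature_dim).
    delta = feat_s.mean(axis=0) - feat_t.mean(axis=0)
    return float(delta @ delta)

def adaptive_loss(task_loss, feat_s, feat_t, lam3=0.1):
    # L_adaptive = L_task + lambda_3 * L_domain
    return task_loss + lam3 * linear_mmd(feat_s, feat_t)

rng = np.random.default_rng(0)
source = rng.standard_normal((256, 32))   # hypothetical source-domain features
target_shifted = source + 1.0             # same features shifted by +1 per dimension

mmd_same = linear_mmd(source, source)
mmd_shifted = linear_mmd(source, target_shifted)
total = adaptive_loss(task_loss=0.7, feat_s=source, feat_t=target_shifted)
```

Identical feature batches give zero discrepancy, while a uniform shift of 1 across 32 dimensions gives a discrepancy of 32, so the penalty grows with the gap between the two domains.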

Figure 4. The Adaptive Learning Framework used in DIANet for domain adaptation, transfer learning, and continual learning in medical imaging. The framework minimizes domain shift between source and target domains by using a combined loss function, including a task loss and a domain discrepancy loss. The Auto-Fusion Network is used to fuse domain-specific features, and adversarial training aligns the feature distributions. The framework also incorporates transfer learning from a large source dataset and continual learning strategies to handle new data incrementally, so that the model generalizes across diverse clinical environments.
Explainable AI mechanisms
Building trust in AI systems is critical for clinical adoption. ACWI integrates explainable AI mechanisms to provide transparency in model predictions. Specifically, we employ attention-based methods and saliency maps to highlight the regions of medical images that contribute most to the predictions. These methods allow clinicians to visualize which parts of the image were most influential in the model’s decision-making, improving interpretability and aiding in clinical decision support. Let \(z_{\text {fused}} \in {\mathbb {R}}^{H \times W \times D}\) represent the fused feature map obtained from the encoder, where \(H\), \(W\), and \(D\) denote the height, width, and depth of the feature map, respectively. To focus on specific regions of interest in the image, an attention mask \(A \in {\mathbb {R}}^{H \times W}\) is computed by applying a convolutional operation followed by a softmax function:
$$\begin{aligned} A = \text {softmax}(\text {Conv}(z_{\text {fused}})), \end{aligned}$$
(27)
where the convolutional layer projects the feature map into an attention space, highlighting the spatial regions that are most relevant for the model’s predictions. This attention mask \(A\) is then applied to the input image \(x \in {\mathbb {R}}^{H \times W \times C}\), where \(C\) represents the number of channels in the image. The resulting interpretable saliency map \({\tilde{x}} \in {\mathbb {R}}^{H \times W \times C}\) is computed by element-wise multiplication:
$$\begin{aligned} {\tilde{x}} = A \odot x, \end{aligned}$$
(28)
where \(\odot\) represents element-wise multiplication. The saliency map \({\tilde{x}}\) visualizes the model’s focus areas, showing clinicians which regions of the medical image were most influential in generating the model’s prediction. This approach increases transparency by offering a visual explanation of the decision-making process, which is essential for clinicians to trust and validate the AI system. In addition to attention mechanisms, class activation maps (CAMs) and Grad-CAM are integrated into the ACWI framework. CAMs can highlight the discriminative regions of an image corresponding to specific classes, aiding clinicians in understanding which image areas are most relevant to the diagnosis. For a given class \(c\), the CAM can be computed as:
$$\begin{aligned} \text {CAM}_c = \text {ReLU} \left( \sum _{k} \alpha _k^c \cdot A_k \right) , \end{aligned}$$
(29)
where \(\alpha _k^c\) represents the weight of the \(k\)-th feature map for class \(c\), and \(A_k\) is the \(k\)-th feature map obtained from the final convolutional layer. The weighted sum of these feature maps provides a spatial map that indicates the most important regions for class \(c\). Grad-CAM, a more refined version, generates class-specific saliency maps by using the gradients of the output with respect to the convolutional layers. It is computed as:
$$\begin{aligned} \text {Grad-CAM}_c = \text {ReLU} \left( \sum _{k} \left( \frac{1}{Z} \sum _{i,j} \frac{\partial y_c}{\partial A_k(i,j)} \right) A_k \right) , \end{aligned}$$
(30)
where \(y_c\) represents the output for class \(c\), \(A_k(i,j)\) denotes the \((i,j)\)-th element of the feature map \(A_k\), and \(Z\) is the number of spatial locations in \(A_k\). The Grad-CAM technique is valuable because it generates localized saliency maps that show the regions in the image that contributed most to the prediction. These methods, when integrated into ACWI, offer a means to enhance model transparency and foster clinician trust. By providing both global and local explanations of model behavior, ACWI facilitates informed decision-making, enabling medical professionals to use AI-based systems more confidently in clinical practice. By employing uncertainty quantification alongside these explainable mechanisms, ACWI also makes clinicians aware of the model’s confidence in its predictions, further supporting decision-making processes.
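The Grad-CAM computation in Eq. (30) reduces to a few array operations once the feature maps and their gradients are available. The sketch below assumes these are supplied as NumPy arrays (in practice an automatic-differentiation framework would provide the gradients); the toy values and the function name `grad_cam` are illustrative.

```python
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (K, H, W) holding the feature
    maps A_k and the gradients of y_c with respect to them."""
    # Channel weights: spatial average of the gradients (the 1/Z sum over i, j).
    weights = gradients.mean(axis=(1, 2))                 # (K,)
    # Weighted sum of feature maps, followed by ReLU.
    cam = np.tensordot(weights, activations, axes=1)      # (H, W)
    return np.maximum(cam, 0.0)

# Toy inputs: two 3x3 feature maps.
activations = np.zeros((2, 3, 3))
activations[0, 1, 1] = 2.0     # channel 0 responds at the centre
activations[1, 0, 0] = 5.0     # channel 1 responds in the corner
gradients = np.stack([np.full((3, 3), 0.5),     # channel 0 supports class c
                      np.full((3, 3), -1.0)])   # channel 1 opposes class c

cam = grad_cam(activations, gradients)
```

In this toy case the centre location receives a positive score, while the corner, which is driven by a channel with negative gradient, is suppressed to zero by the ReLU.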
Uncertainty-aware decision support
Medical imaging often involves ambiguous or low-quality data, making uncertainty quantification essential for reliable decision-making. ACWI incorporates Bayesian uncertainty estimation to provide clinicians with confidence measures alongside predictions (see Figure 5). This is particularly important in high-stakes clinical environments where decision errors can lead to significant consequences. In the proposed model, the latent representation \(h\) is modeled as a probabilistic distribution to account for the inherent uncertainty in the data, and is described as:
$$\begin{aligned} h \sim {\mathcal {N}}(\mu , \Sigma ), \end{aligned}$$
(31)
where \(\mu\) and \(\Sigma\) represent the mean and covariance of the latent distribution, respectively. This probabilistic modeling allows for the capture of both the expected value of the latent representation and the uncertainty associated with it. Uncertainty estimates are propagated to the output predictions \({\hat{y}}\), where the predicted output is also treated as a probabilistic distribution:
$$\begin{aligned} {\hat{y}} \sim {\mathcal {N}}({\hat{\mu }}, {\hat{\Sigma }}), \end{aligned}$$
(32)
with \({\hat{\mu }}\) and \({\hat{\Sigma }}\) computed using a Bayesian approximation of the task-specific output head. This enables uncertainty quantification in the final predictions, which is crucial for clinical decision-making. For classification tasks, the predictive uncertainty is quantified using the entropy of the predicted class probabilities:
$$\begin{aligned} \text {Uncertainty} = - \sum _{c=1}^{C} p_c \log p_c, \end{aligned}$$
(33)
where \(p_c\) is the predicted probability of class \(c\) and \(C\) is the total number of classes. A higher uncertainty value indicates lower confidence in the prediction: the entropy is zero when all probability mass lies on a single class and reaches its maximum, \(\log C\), for a uniform prediction. Moreover, the model can also estimate the variance in predictions for each class, which is important for assessing the reliability of predictions:
$$\begin{aligned} \text {Variance}(y) = {\mathbb {E}}[y^2] - ({\mathbb {E}}[y])^2, \end{aligned}$$
(34)
where \({\mathbb {E}}[y]\) is the expected value of the prediction and \({\mathbb {E}}[y^2]\) is the expected value of the square of the prediction. In addition to the uncertainty in classification tasks, uncertainty quantification is also important in tasks like segmentation and reconstruction, where uncertainty in the boundaries or the structure of the output can guide further manual inspection. For example, in segmentation tasks, uncertainty can be computed for each pixel or voxel, and high uncertainty regions could indicate areas requiring additional focus:
$$\begin{aligned} \text {Uncertainty}_{\text {seg}} = - \sum _{i=1}^{N} {\hat{p}}_i \log {\hat{p}}_i, \end{aligned}$$
(35)
where \({\hat{p}}_i\) represents the predicted probability of pixel \(i\) belonging to a specific class in segmentation.
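The entropy-based measures above can be sketched in a few lines. The example below is illustrative only and makes two assumptions that are not part of the original description: each pixel carries a full class distribution (for a binary mask, the pair \(({\hat{p}}_i, 1-{\hat{p}}_i)\)), and a hypothetical `threshold` parameter decides which pixels are flagged for manual review. The function names are likewise chosen for illustration.

```python
import math
from typing import List, Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy -sum_c p_c log p_c in nats; terms with p_c = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_map(pixel_probs: List[List[Sequence[float]]]) -> List[List[float]]:
    """Apply the entropy to the class distribution of every pixel (H x W x C input)."""
    return [[predictive_entropy(p) for p in row] for row in pixel_probs]


def review_mask(pixel_probs: List[List[Sequence[float]]], threshold: float) -> List[List[bool]]:
    """Flag pixels whose entropy exceeds the (hypothetical) review threshold."""
    return [[h > threshold for h in row] for row in entropy_map(pixel_probs)]
```

Under these assumptions, a pixel predicted as (0.5, 0.5) has entropy \(\log 2 \approx 0.69\) and would be flagged at a threshold of 0.5, whereas a pixel predicted as (0.99, 0.01) would not.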
The anatomical prior is only applied in segmentation tasks where structural consistency is meaningful. In our experiments, this applies to CAMELYON17 and BraTS 2021, where spatial annotations allow for soft anatomical constraints based on typical region shape and location. The prior is implemented as a KL divergence loss between the predicted spatial distribution and a learned statistical prior derived from the training set masks. For classification tasks such as RadPath2020 and TCGA, the anatomical prior is not used, and its description is only relevant to the segmentation branch of the framework.
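One possible reading of this KL-based prior is sketched below. Several details are assumptions rather than statements about the actual implementation: the prior is taken to be the normalized per-pixel mean of the flattened training masks, the divergence direction is KL(prediction ‖ prior), and a small `eps` is added for numerical stability. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Turn non-negative per-pixel values into a spatial distribution that sums to 1."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]


def build_prior(train_masks: Sequence[Sequence[float]]) -> List[float]:
    """Assumed prior: per-pixel mean of flattened training masks, normalized."""
    n = len(train_masks)
    mean_mask = [sum(m[i] for m in train_masks) / n for i in range(len(train_masks[0]))]
    return normalize(mean_mask)


def kl_prior_loss(pred_probs: Sequence[float], prior: Sequence[float]) -> float:
    """KL(q || p) between the normalized prediction q and the prior p (assumed direction)."""
    q = normalize(pred_probs)
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior))
```

In this sketch, a prediction that places its mass where the training masks typically lie incurs a loss near zero, while mass placed in regions that are rarely foreground is penalized heavily.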

Visual examples of interpretability in AI-assisted brain tumor analysis. From left to right and top to bottom: original MRI image, attention heatmap highlighting salient regions, Grad-CAM saliency map, and comparison between ground truth (GT) and model prediction. These visualizations confirm that the model focuses on clinically relevant areas, supporting both diagnostic accuracy and explainability.
Our model captures both aleatoric and epistemic uncertainty to provide a more comprehensive understanding of predictive confidence. Aleatoric uncertainty, which stems from inherent data noise such as low image quality or ambiguous anatomical boundaries, is modeled through a heteroscedastic Gaussian approach in the latent space. The model learns to predict a covariance matrix conditioned on the input, allowing it to quantify data-dependent variability directly. This enables the system to reflect uncertainty in cases where the input image itself is ambiguous, such as overlapping tissues or low contrast regions. Epistemic uncertainty, which arises from limited training data or model capacity, is estimated using Monte Carlo Dropout during inference. By performing multiple stochastic forward passes, the model approximates a distribution over its parameters and captures uncertainty related to the lack of knowledge. This method is particularly effective in out-of-distribution scenarios or low-sample regimes. The visualizations of uncertainty maps further support this approach by highlighting high-uncertainty regions near segmentation boundaries, confirming that the model provides interpretable and trustworthy outputs in clinical applications.
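The Monte Carlo Dropout procedure can be illustrated with a toy example that estimates the predictive mean and the variance of Eq. (34) from repeated stochastic forward passes. This is a sketch under stated assumptions, not the ACWI model: `make_toy_model` is a hypothetical single-layer softmax classifier with input dropout, and the number of passes `n_samples` is an arbitrary choice.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def make_toy_model(weights: Sequence[Sequence[float]], drop_rate: float,
                   seed: int = 0) -> Callable[[Sequence[float]], List[float]]:
    """Hypothetical linear-softmax classifier with dropout left active at inference."""
    rng = random.Random(seed)
    keep = 1.0 - drop_rate

    def forward(x: Sequence[float]) -> List[float]:
        # Inverted dropout: zero each input with probability drop_rate, rescale the rest
        dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
        logits = [sum(w * d for w, d in zip(row, dropped)) for row in weights]
        m = max(logits)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    return forward


def mc_dropout_stats(forward: Callable[[Sequence[float]], List[float]],
                     x: Sequence[float], n_samples: int = 50) -> Tuple[List[float], List[float]]:
    """Per-class mean and variance, Var(y) = E[y^2] - (E[y])^2, over stochastic passes."""
    samples = [forward(x) for _ in range(n_samples)]
    n_out = len(samples[0])
    mean = [sum(s[c] for s in samples) / n_samples for c in range(n_out)]
    mean_sq = [sum(s[c] ** 2 for s in samples) / n_samples for c in range(n_out)]
    var = [max(0.0, m2 - m * m) for m2, m in zip(mean_sq, mean)]  # clip rounding error
    return mean, var
```

With `drop_rate=0.0` every pass is identical and the estimated variance vanishes; with dropout enabled, the spread across passes serves as a proxy for epistemic uncertainty.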
