A non-sub-sampled shearlet transform-based deep learning sub band enhancement and fusion method for multi-modal images

Non-subsampled shearlet transform

The Shearlet Transform (ST) is a significant advancement within the composite framework of Wavelet Theory (WT), which integrates multiscale analysis with classical geometric selections⁶⁴. The ST provides an optimally sparse representation of images with distributed discontinuities and achieves near-Optimal Nonlinear Approximation (ONA) performance⁵⁰. Owing to its strong directional sensitivity and localized time–frequency features, the ST has been widely applied in image processing tasks such as texture FE, image denoising, and IF.

The Discrete Shearlet Transform (DST) is constructed within the framework of composite wavelet theory, incorporating multiscale decomposition and directional sensitivity. The DST system $\mathcal{SH}(\psi)$ is generated from a mother shearlet function $\psi \in {L}^{2}\left({\mathbb{R}}^{2}\right)$ as given in Eq. (1):

$$\mathcal{SH}(\psi )=\left\{{\psi }_{j,l,k}={2}^{\frac{3j}{2}}\psi \left({G}^{l}{S}^{j}x-k\right):j\ge 0,-{2}^{j}\le l\le {2}^{j},k\in {\mathbb{Z}}^{2}\right\}$$

(1)

where: $j$ → The scale index, governing resolution refinement, $l$ → the shear index, controlling directional selectivity, $k\in {\mathbb{Z}}^{2}$ → the translation index, determining spatial location, $\psi$ → the mother shearlet function, localized in space and frequency.

The matrices $S$ and $G$ are defined as Eq. (2):

$$S = \left( {\begin{array}{*{20}c} 4 & 0 \\ 0 & 2 \\ \end{array} } \right),\quad G = \left( {\begin{array}{*{20}c} 1 & 1 \\ 0 & 1 \\ \end{array} } \right)$$

(2)

where, $S$→ The anisotropic scaling matrix responsible for frequency refinement and redundancy. $\text{G}$→ The shear matrix that introduces directional selectivity by controlling the orientation angle in the transform domain.
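
As a small illustration of how the two matrices act together in Eq. (1), the following sketch builds $G^{l}S^{j}$ and the normalisation factor $2^{3j/2}$ for a given scale and shear index. It assumes the lower-right entry of $G$ is 1 (a standard shear matrix); the function names are hypothetical and chosen only for this example.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def shear_power(l):
    """G^l for G = [[1, 1], [0, 1]]; valid for any integer l (assumed form of G)."""
    return [[1, l], [0, 1]]


def scale_power(j):
    """S^j for S = [[4, 0], [0, 2]] with j >= 0."""
    return [[4 ** j, 0], [0, 2 ** j]]


def shear_times_scale(j, l):
    """Matrix G^l S^j that acts on x inside the shearlet atom of Eq. (1)."""
    return matmul2(shear_power(l), scale_power(j))


def normalisation(j):
    """Factor 2^(3j/2), which equals |det S|^(j/2) because det S = 8."""
    return 2.0 ** (1.5 * j)
```

For example, `shear_times_scale(2, 3)` gives `[[16, 12], [0, 4]]`: the first axis is stretched by 16, the second by 4, and the off-diagonal entry encodes the shear.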

Furthermore, distributed discontinuities in the SD are captured using the shift parameter ‘k’. Applying the Fourier transform to the SD Shearlet atom ${\psi }_{j,l,k}(x)$ yields Eq. (3):

$$\hat{\psi }_{j,l,k} \left( \omega \right) = 2^{-\frac{3j}{2}} \hat{\psi }\left( \omega S^{-j} G^{-l} \right)e^{2\pi i\left\langle \omega S^{-j} G^{-l} ,k \right\rangle }$$

(3)

where: $\omega = \left( {\omega_{1} ,\omega_{2} } \right) \in {\mathbb{R}}^{2}$ →The frequency vector, $\hat{\psi }$ →The Fourier transform of the mother Shearlet ‘ψ’, $S^{-j}$, $G^{-l}$ →The inverse anisotropic scaling and shear matrices, $\left\langle { \cdot , \cdot } \right\rangle$→ The standard inner product.

The frequency support of $\hat{\psi }_{j,l,k}$, which characterizes its directional localization, is bounded as follows Eq. (4):

$$\begin{gathered} {\text{Supp}}\left( {\hat{\psi }_{j,l,k} \left( {\omega_{1} ,\omega_{2} } \right)} \right) \subset \left\{ {\left( {\omega_{1} ,\omega_{2} } \right) \in {\mathbb{R}}^{2} :\omega_{1} \in \left[ { -2^{2j-1} , -2^{2j-4} } \right]} \right. \hfill \\ \left. {\quad \cup \left[ {2^{2j-4} ,2^{2j-1} } \right],\left| {\frac{{\omega_{2} }}{{\omega_{1} }} - l2^{-j} } \right| \le 2^{-j} } \right\}\;\text{for}\;\omega_{1} > 0,\omega_{2} > 0 \hfill \\ \end{gathered}$$

(4)

This support region forms a trapezoidal segment in the frequency domain, centered along the slope $l{2}^{-j}$, with the angle controlled by the shear parameter ‘l’. These anisotropic and directional properties enable the shearlet system to represent edges and curves more effectively than traditional transforms, providing a sparse representation of images with distributed singularities (Fig. 1a,b).
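
The support condition in Eq. (4) can be checked numerically for a single frequency point. The sketch below is only an illustration of the set as written (both the negative and positive $\omega_1$ intervals, plus the slope constraint); the function name is hypothetical.

```python
def in_frequency_support(w1, w2, j, l):
    """Return True if (w1, w2) lies in the region of Eq. (4) for scale j, shear l."""
    lo = 2.0 ** (2 * j - 4)
    hi = 2.0 ** (2 * j - 1)
    # Radial condition: |w1| must fall in [2^(2j-4), 2^(2j-1)].
    if not (lo <= abs(w1) <= hi):
        return False
    # Angular condition: the slope w2/w1 must lie within 2^(-j) of l * 2^(-j).
    return abs(w2 / w1 - l * 2.0 ** (-j)) <= 2.0 ** (-j)
```

With `j = 2` and `l = 1`, the band covers $|\omega_1| \in [1, 8]$ and slopes within 0.25 of 0.25, so the point `(4, 1)` is inside while `(4, 4)` is not.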

As the number of coefficients $N\to \infty$, the asymptotic approximation error of the Shearlet Transform (ST) decays as ${N}^{-2}(\log N)^{3}$ ⁶⁷, enabling highly accurate selection of interferometric borders. In the frequency domain, the Shearlet system also forms a Parseval frame⁶⁵. The essential frequency support is defined by trapezoidal regions of size ${2}^{j}\times {2}^{2j}$, oriented along zero-crossing lines with slopes of $-{2}^{-j}$. In the SD, the orientation corresponds to slopes of ${2}^{-j}$. Shearlet elements can be uniquely distinguished based on their scale, location, and directional orientation. Moreover, the shearlet elements exhibit rapid decay in the spatial domain⁶⁶.

The Shearlet Transform is particularly effective in interferogram filtering due to its high directional selectivity. Nevertheless, its implementation involves subsampling operations, which introduce spectral aliasing in the frequency domain and make the transform shift-variant in practice⁶⁷. Directional filtering is achieved via shifted window functions, but the resulting subsampling often leads to reconstruction artifacts such as pseudo-Gibbs distortion⁶⁸. To address these limitations, the NSST was developed, inspired by the Non-Subsampled Contourlet Transform (NSCT). NSST replaces subsampling with convolution-based directional filtering, thereby eliminating spectral aliasing and ensuring shift-invariance. This improvement significantly reduces pseudo-Gibbs phenomena, resulting in more visually intuitive and diagnostically useful fused images. The NSST decomposition process consists of two main stages, as shown in Fig. 2⁶⁹.

Step 1: Multiscale Decomposition	The image is decomposed into HFC and LFC using a Non-Subsampled Pyramid (NSP). This step is repeated iteratively until the image is decomposed into j scales.
Step 2: Direction Localization	Direction localization is based on Non-Subsampled Shearing Filter Banks (NSSFB), which apply a 2-D convolution of the shearing filter with the HFC in the Cartesian domain. The NSST is shift-invariant because the convolution operation avoids subsampling.

NSST was selected over alternative multi-scale transforms such as DWT and NSCT because of several specific advantages. Unlike DWT, NSST provides shift-invariance, which prevents pseudo-Gibbs phenomena at tissue boundaries in medical images. Compared to NSCT, NSST offers comparable directional selectivity with lower computational complexity (approximately 40% faster processing time in typical implementations)⁷⁰. Most importantly, NSST achieves a superior sparse representation of curvilinear structures, with an asymptotic approximation error of O(N⁻²(log N)³) compared to O(N⁻¹) for wavelets, making it particularly effective for preserving anatomical boundaries in medical images⁷¹. These advantages make NSST a well-suited transform basis for the proposed neural enhancement method.

AlexNet

The 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) marked a significant breakthrough in visual object recognition, as the winning model introduced a deeper and wider CNN compared to the earlier LeNet⁷².

AlexNet significantly outperformed all traditional ML and CV methods in the ILSVRC, achieving unprecedented recognition accuracy (Fig. 3). The architecture consists of 8 learned layers, including 5 convolutional layers, 3 max-pooling layers, 2 Local Response Normalization (LRN) layers, 2 Fully Connected (FC) layers, and a final SoftMax output layer. In the 1st convolutional layer, 64 large kernels are used to extract low-level features from the input image. This is followed by overlapping max-pooling layers applied after the 2nd and 3rd convolutional layers to reduce spatial dimensions while preserving feature integrity.

The 3rd convolutional layer uses a standard kernel size with 192 filters, followed by the 4th and 5th convolutional layers with 384 filters each, enabling deeper FE. These layers are connected directly without intermediate pooling. Another overlapping max-pooling layer follows the 5th convolutional layer. The output of this pooling layer is passed to the 1st-FC layer, while the 2nd-FC layer connects to a 1000-way SoftMax classifier corresponding to the 1000 class labels in the ILSVRC dataset.

AlexNet was selected for the LFSB fusion task over newer models based on several domain-specific considerations⁷³. While more recent networks offer greater depth, experimental evaluations showed that AlexNet provides a favorable balance between performance and efficiency for MIF. The key advantage in this context is AlexNet’s larger kernels in the early layers (11 × 11, 5 × 5), which effectively capture the global contextual information predominant in low-frequency components⁷⁴.

The five distinct convolutional layers also provide a suitable multi-level FE for generating the WM used in the IF process. The relatively simple model also facilitates faster training and testing, making it more suitable for clinical deployment where computational resources may be limited⁷⁵.

Proposed concurrent de-noising and enhancement network (CDEN)

The input to the proposed CDEN is a noisy decomposed frequency sub-band $\tilde{x}=x+n$, where ‘x’ → the clean signal and ‘n’ → the noise component or signal perturbations introduced during decomposition⁷⁶. It is essential to note that this configuration is employed exclusively during the training phase, enabling the model to learn the effective separation of clean and noisy components.

Targeted enhancement of each frequency sub-band is vital, as it addresses the distinct challenges associated with different spectral components in medical images. The LFSB preserves the primary structural information but frequently suffers from contrast degradation during multiscale decomposition. In contrast, the HFSB contains fine edge and texture features that are vital for clinical interpretation but are more vulnerable to noise contamination and structural distortion⁷⁷,⁷⁸,⁷⁹.

De-noising sub-network (DNSN)

The first component of the proposed CDEN-NN is the De-noising Sub-Network (Fig. 4), which aims to learn a mapping function $g_{{{\Theta }_{1} }} \left( {\tilde{x}} \right) = \overline{x}$, where $\tilde{x}$ → the noisy input; $\overline{x}$ → the predicted clean signal; ${\Theta }_{1}$ → the set of trainable parameters within the DNSN⁸⁰. The training objective is to minimize the discrepancy between the predicted clean signal $\overline{x}$ and the ground-truth clean signal $x$, thereby optimizing ${\Theta }_{1}$ to accurately reconstruct the latent clean representation from noisy decomposed sub-band inputs. To achieve this, the network minimizes the Average Mean Squared Error (AMSE) loss defined in Eq. (7):

$$\underset{{\Theta }_{1}}{\arg \min }\;{\mathcal{L}}_{1} \left( {\Theta }_{1} \right) = \frac{1}{2N}\sum\nolimits_{i = 1}^{N} {\left\| {\overline{\mathbf{x}}_{i} - {\mathbf{x}}_{i} } \right\|}_{2}^{2} = \frac{1}{2N}\left\| {\overline{\mathbf{X}} - {\mathbf{X}}} \right\|_{F}^{2}$$

(7)

where: ${\Theta }_{1}$ →The parameter set of the DNSN, $x_{i}$ →The ground truth (clean signal) for the $i$-th input, $\overline{x}_{i}$ →The network-predicted denoised signal, $N$ →The number of training samples, $\left\| \cdot \right\|_{2}$ →The Euclidean norm, $\left\| \cdot \right\|_{F}$→ The Frobenius norm.

The Frobenius norm formulation provides a compact method for expressing the overall discrepancy between the predicted and true signal matrices. These matrices are defined as Eq. (8):

$$X = \left[ {x_{1} ,x_{2} , \ldots ,x_{N} } \right], \overline{X} = \left[ {\overline{x}_{1} ,\overline{x}_{2} , \ldots ,\overline{x}_{N} } \right]$$

(8)

where each column ${\text{x}}_{\text{i}}\in {\mathbb{R}}^{\text{d}}$ → a single input signal sample, and $\text{d}$ → the feature dimension.
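
To make the equivalence in Eq. (7) concrete, the following sketch evaluates the loss twice: once as a per-sample sum of squared Euclidean distances and once through the Frobenius norm of the stacked matrices of Eq. (8). It is a plain-Python illustration rather than the training code; each inner list stands for one sample $x_i$ (a column of $X$), and the function names are hypothetical.

```python
import math


def denoising_loss(pred, clean):
    """Eq. (7), sum form: 1/(2N) * sum_i ||pred_i - clean_i||_2^2."""
    n = len(clean)
    total = 0.0
    for p_i, x_i in zip(pred, clean):
        total += sum((p - x) ** 2 for p, x in zip(p_i, x_i))
    return total / (2 * n)


def denoising_loss_frobenius(pred, clean):
    """Eq. (7), matrix form: 1/(2N) * ||X_bar - X||_F^2."""
    n = len(clean)
    diffs = [p - x for p_i, x_i in zip(pred, clean) for p, x in zip(p_i, x_i)]
    frobenius_norm = math.sqrt(sum(d * d for d in diffs))
    return frobenius_norm ** 2 / (2 * n)


# Two samples with two features each (illustrative values only).
clean_batch = [[1.0, 2.0], [0.0, -1.0]]
pred_batch = [[1.5, 2.0], [0.0, 0.0]]
loss_value = denoising_loss(pred_batch, clean_batch)
```

Both forms give the same value because the squared Frobenius norm of a matrix is the sum of the squared Euclidean norms of its columns.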

The proposed DNSN comprises a sequence of $D_{1}$ convolutional blocks (Conv + ReLU), followed by a final convolutional layer without activation. Each Conv + ReLU block contains 64 filters of size $3\times 3\times c$, followed by a Rectified Linear Unit (ReLU) to introduce nonlinearity. In the 1st layer, the number of input channels ‘c’ may vary (1, 2, or 3) depending on the number of input polarizations used⁸¹, whereas all subsequent layers are standardized with $c=64$. Since the target clean signal may include negative values, the final convolutional layer omits the ReLU activation to preserve the signal’s full range. Zero padding is applied throughout the subnetwork to ensure that the predicted output $\bar{X}$ has the same spatial dimensions as the ground-truth signal $X$.

EnhanceNet sub-network (ENSN)

The ENSN generates a probability vector ${\mathbf{y}} = h_{{{\Theta }_{2} }} \left( {{\overline{\mathbf{x}}}} \right)$ from each input $\overline{\mathbf{x}}$. Each element of ‘y’ gives the predicted probability that the original signal belongs to the corresponding class⁸². To maximize performance, the output vector ‘y’ is driven to be close to $\hat{\mathbf{y}}$, the one-hot vector that represents the true label of the input signal⁸³.

The objective function of the second network, known as the ENSN, is designed to optimize classification accuracy by minimizing the cross-entropy between the predicted and true label distributions Eq. (9):

$$L_{2} \left( {{\Theta }_{2} } \right) = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} H\left( {\hat{y}_{i} ,y_{i} } \right)$$

(9)

where ${\Theta }_{2}$→ The set of trainable parameters in the ENSN, $\hat{y}_{i}\in \{{0,1}{\}}^{C}$ →The one-hot ground truth label vector for the $i$-th sample, ${y}_{i}\in [0,1]^{C}$→The predicted probability distribution output from the network for the $i$-th sample, $C$ is the number of classes, $H\left( {\hat{y}_{i} ,y_{i} } \right)$ →The Cross-Entropy Function (CEF), defined as Eq. (10):

$$H\left( {{\hat{\mathbf{y}}},{\mathbf{y}}} \right) = - \mathop \sum \limits_{c = 1}^{C} \hat{y}_{c} \log \left( {y_{c} } \right)$$

(10)

This loss penalizes deviations between the predicted class probabilities and the ground truth, encouraging the network to produce highly confident and correct classifications.

The training procedure for the feature improvement problem can be split into two parts. In the first part, the DNSN is trained on its own. Once trained, this first sub-network generates the training samples for the ENSN: the predicted clean signals $\overline{\mathbf{X}}$ are used to train the second sub-network. During testing, a noisy input x + n is first fed into the DNSN, and the predicted clean signal $\overline{\mathbf{X}}$ is then fed into the enhancement sub-network to produce the predicted label. The two sub-networks can also be trained simultaneously by integrating their Loss Functions (LF)⁸⁴.

The total loss for the CDEN is defined as Eq. (11):

$${\mathcal{L}}_{\text{CDEN}}\left({\Theta }_{1},{\Theta }_{2}\right)=\frac{1}{2N}\parallel \overline{\mathbf{X} }-\mathbf{X}{\parallel }_{F}^{2}+\gamma \frac{1}{N}\sum_{i=1}^{N} H({\widehat{\mathbf{y}}}_{i},{\mathbf{y}}_{i})$$

(11)

where: $\bar{X}$, $\text{X}$→ The denoised prediction matrix and ground truth matrix, respectively (Eq. 8), $\text{H}\left({ \hat{y} }_{\text{i}},{\text{y}}_{\text{i}}\right)$ →The cross-entropy loss between the predicted class probability vector and the one-hot ground truth label (Eq. 10), ${\Theta }_{1}$, ${\Theta }_{2}$→ The trainable parameters of DNSN and ENSN, $\upgamma \in {\mathbb{R}}^{+}$→ The regularization coefficient that balances the contribution of the enhancement loss relative to the denoising loss.
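
The combined objective of Eq. (11) can be sketched as the denoising term plus a weighted cross-entropy term. The snippet below is a minimal plain-Python illustration under stated assumptions: the clipping constant `eps` is added here only to avoid `log(0)` and is not part of the source formulation, and all function names are hypothetical.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Eq. (10): -sum_c y_hat_c * log(y_c); eps is an added numerical guard."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


def cden_loss(pred, clean, labels, probs, gamma):
    """Eq. (11): denoising loss plus gamma times the mean cross-entropy."""
    n = len(clean)
    squared_error = sum((p - x) ** 2
                        for p_i, x_i in zip(pred, clean)
                        for p, x in zip(p_i, x_i))
    denoise_term = squared_error / (2 * n)
    enhance_term = sum(cross_entropy(t, y) for t, y in zip(labels, probs)) / n
    return denoise_term + gamma * enhance_term
```

Setting `gamma` to zero recovers the denoising loss alone, while larger values give the classification term more influence on the shared gradient.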

To achieve simultaneous signal denoising and enhancement, this study introduces an enhancement subnetwork, EnhanceNet⁵⁶, whose architecture is shown in Fig. 5. EnhanceNet is primarily composed of a pointwise $1\times 1$ convolutional layer followed by a modified Inception Module (IM) variant. At the base of this network lies a squeeze convolutional block, which includes a $1\times 1$ convolution layer, a Batch Normalization (BN) layer, and a ReLU6 activation function. This module is designed to reduce channel dimensionality and efficiently extract essential features. The modified IM implementation supports only asymmetric kernels of sizes $1\times 1$, $1\times 2$, and $1\times 3$, which significantly reduces computational overhead between successive convolutional layers. EnhanceNet further integrates two convolution methods⁸⁶: (1) a standard convolution block composed of a $1\times 3$ Conv2D layer, followed by BN and ReLU6 activation, and (2) a depthwise separable convolution block that includes a $1\times 3$ depthwise Conv2D layer, BN, and ReLU6. These components collectively improve FE while maintaining computational efficiency.

Following the depthwise convolution block, a channel attention mechanism is introduced to the network to enhance the extraction of salient feature data across channels. This module is fundamental as the output features at this stage consist of multiple channels with enriched contextual information. Drawing inspiration from the inverted residual network⁸⁷, the attention block is strategically placed after the $1\times 3$ depthwise convolutional layer to recalibrate channel-wise responses selectively. The input to the channel attention module is the feature map generated by the second $1\times 3$ depthwise convolution layer, enabling the model to emphasize informative channels while suppressing less relevant ones⁸⁸.

The Global Average Pooling Operation (GAPO) is employed in the channel attention module (Fig. 5) to compress the FM of size H × W in each channel into a single 1 × 1 value.

The GAPO’s output represents the corresponding channel’s global data. The output of the GAPO for the ${c}^{\text{th}}$ channel is computed as follows Eq. (12):

$${x}_{c}=\frac{1}{H\times W}\sum_{i=1}^{H} \sum_{j=1}^{W} {y}_{c}(i,j)$$

(12)

where: ${y}_{c}(i,j)$ →The activation value at spatial position $(i,j)$ of the feature map (FM) corresponding to the ${c}^{\text{th}}$ channel, $H$, $W$→ The height and width of the feature map.

This operation condenses spatial data from each feature map into a single scalar value, ‘x_c’, representing the global descriptor for the ${c}^{\text{th}}$ channel. The resulting vector $\mathbf{x}=\left[{x}_{1},{x}_{2},\dots ,{x}_{c}\right]$ is then passed through two FC layers. The 1st FC layer reduces the dimensionality from ‘c’ to ‘c/r’, where ‘r’ is a channel reduction ratio (tunable hyperparameter) to limit overfitting and computational complexity. This compressed extraction is subsequently activated by a ReLU function and passed to a 2nd FC layer, followed by a sigmoid activation to generate the channel attention weights.

The ReLU activation follows the first FC layer⁸⁹, while the sigmoid activation follows the final one. In summary, the channel attention module processes the global descriptor vector $\mathbf{x}\in {\mathbb{R}}^{c}$ (attained via GAPO) using two FC layers, a ReLU activation, and a sigmoid function to generate a channel-wise attention weight vector $\mathbf{z}\in [0,1]^{c}$, as follows Eq. (13):

$$\mathbf{z}=\sigma \left({\mathbf{W}}_{2}\cdot \delta \left({\mathbf{W}}_{1}\cdot \mathbf{x}\right)\right)$$

(13)

where: ${\mathbf{W}}_{1}\in {\mathbb{R}}^{\frac{c}{r}\times c}$→ The weight matrix of the 1st FC layer, $\delta (\cdot )$ →The ReLU activation function, ${\mathbf{W}}_{2}\in {\mathbb{R}}^{c\times \frac{c}{r}}$ →The weight matrix of the 2nd FC layer, $\sigma (\cdot )$ →the sigmoid activation function that outputs the attention weights ${z}_{c}\in (\text{0,1})$ for each channel ‘c’.

Each attention weight ‘z_c’ is then used to scale the corresponding feature map ‘y_c’ from the previous convolutional layer, resulting in the reweighted output Eq. (14):

$${o}_{c}={z}_{c}\cdot {y}_{c}$$

(14)

where: ${y}_{c}$→The feature map of the ${c}^{\text{th}}$ channel, ${z}_{c}$ →The scalar attention weight for that channel, ${o}_{c}$ →The reweighted feature map emphasizes more informative channels.

This mechanism allows the model to selectively enhance discriminative feature maps while suppressing less relevant ones, thereby improving representation quality for downstream classification tasks.
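
The three steps of the channel attention module (Eqs. 12 to 14) can be traced on a toy input. The following plain-Python sketch omits bias terms, as Eq. (13) does, and uses hand-picked weight matrices purely for illustration; the function name and the example values are assumptions rather than the trained network.

```python
import math


def channel_attention(feature_maps, w1, w2):
    """feature_maps: c maps of size H x W; w1: (c/r) x c; w2: c x (c/r)."""
    # Eq. (12): global average pooling, one scalar descriptor per channel.
    x = [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
         for fm in feature_maps]
    # Eq. (13): z = sigmoid(W2 . relu(W1 . x)).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    z = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Eq. (14): rescale every value of channel c by its weight z_c.
    out = [[[z_c * v for v in row] for row in fm]
           for z_c, fm in zip(z, feature_maps)]
    return z, out


# Two channels (c = 2) with reduction ratio r = 2, so W1 is 1 x 2 and W2 is 2 x 1.
maps = [[[1.0, 3.0], [1.0, 3.0]],   # channel mean = 2
        [[1.0, 1.0], [1.0, 1.0]]]   # channel mean = 1
weights, reweighted = channel_attention(maps, [[1.0, -1.0]], [[1.0], [-1.0]])
```

In this toy case the hidden activation is 1, so the first channel receives weight $\sigma(1)\approx 0.73$ and the second $\sigma(-1)\approx 0.27$, which shows how the module can emphasize one channel over another.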

The channel attention mechanism further improves FE by amplifying the channels that carry more discriminative information. It is followed by a pointwise convolutional block comprising a $1\times 3$ Conv2D layer and a Batch Normalization (BN) layer, with the reweighted output ‘o_c’ passed through a ReLU6 activation function⁹⁰. To preserve gradient flow and support deeper network training, EnhanceNet includes an optional residual connection, which helps mitigate the vanishing gradient problem in deep architectures⁹¹. EnhanceNet is designed to be parameter-efficient by employing squeezed convolutions and depthwise separable convolutions, which significantly reduce the model’s computational complexity. Additionally, a 1 × 1 standard convolutional layer is employed to refine feature map clusters and enhance local FS. By delaying spatial down-sampling, the network maintains large activation maps that contribute to improved representational accuracy. The use of heterogeneous kernel sizes further enables multi-scale FE, enhancing the model’s capacity for robust and fine-grained feature learning.

LFC fusion method using AlexNet

LFSB and HFSB components differ primarily in content and fusion challenges, necessitating specialized mechanisms for addressing these differences. LFCs primarily contain structural and intensity data with higher Signal-to-Noise Ratios (SNR), making them suitable for deep FE by convolutional models⁹², such as AlexNet, which captures multi-scale structural relationships. Conversely, HFCs contain edge and texture details with lower SNR, making them particularly susceptible to noise amplification during fusion⁹³. The PCNN with NSML input is designed to address this challenge through its bioinspired temporal linking mechanism, which effectively discriminates between meaningful edges and noise in HFCs. This specialized treatment of each component type enables a balance between structural integrity and detail preservation, which is particularly crucial for multimodal medical images where complementary data appear in different frequency bands⁹⁴.

In general, the LFC of a source image retains the principal structural components, while the HFC preserves finer details such as edges and textures. Traditional IFMs frequently apply simple weighted averaging or maximum-value selection to the LFC, which neglects the contextual relationships among pixels and may result in suboptimal integration. To address these limitations, this study employs AlexNet for multi-layer FE from the source images. Subsequently, an Adaptive Selection Algorithm (ASA) is employed to generate optimized Weight Maps (WM), resulting in more effective and context-aware fusion compared to classical IFM⁹⁵.

The LFC fusion process extracts feature maps from AlexNet at multiple layers and computes activity-level maps that guide the generation of the WM. This method is governed by Eqs. (15) to (17), starting with Eq. (15):

$${f}_{k}^{(n,m)}={F}_{n}\left({I}_{k}\right)$$

(15)

where: ${I}_{k}$→ ${k}^{\text{th}}$ source image, ${F}_{n}(\cdot )$→ The transformation (Convolution + Activation) applied by the ${n}^{\text{th}}$ layer of AlexNet, ${f}_{k}^{(n,m)}\in {\mathbb{R}}^{H\times W\times m}$→ The resulting feature map of spatial dimensions $H\times W$ with $m=64\cdot {2}^{n-1}$ channels.

The activity level map ${A}_{k}^{n}(x,y)$ for each spatial location $(x,y)$ is computed by applying the ${L}_{1}$-norm across the depth dimension of the feature map Eq. (16):

$${A}_{k}^{n}(x,y)={\Vert {f}_{k}^{(n,m)}(x,y)\Vert }_{1}=\sum_{c=1}^{m} \left|{f}_{k}^{(n,m)}(x,y,c)\right|$$

(16)

To improve robustness to local variation and ensure spatial smoothness, the activity level map is smoothed using a block-wise average filter of radius ‘r’ as Eq. (17):

$$\hat{A} _{k}^{n}(i,j)=\frac{1}{(2r+1{)}^{2}}\sum_{\beta =-r}^{r} \sum_{\theta =-r}^{r} {A}_{k}^{n}(i+\beta ,j+\theta )$$

(17)

where: $r\in {\mathbb{Z}}^{+}$→ Controls the neighborhood size (set to $r=1$ in proposed test results). $\hat{A} _{k}^{n}(i,j)$ →The smoothed activity level map, which reflects the intensity of FE in the neighborhood around $(i,j)$.

These refined activity level maps are later used to construct adaptive WM that guide the fusion of LFSB.
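
Eqs. (16) and (17) amount to a channel-wise $L_1$-norm followed by a box filter. The sketch below illustrates both steps on nested lists; the treatment of image borders is not specified in the text, so out-of-range neighbors are counted as zero here, which is an assumption of this example. Function names are hypothetical.

```python
def activity_map(features):
    """Eq. (16): L1-norm across channels; features[y][x] lists the m channel values."""
    return [[sum(abs(v) for v in pixel) for pixel in row] for row in features]


def box_smooth(a, r=1):
    """Eq. (17): mean over a (2r+1) x (2r+1) block centered on each pixel.

    Out-of-range neighbors contribute zero (border handling assumed here).
    """
    h, w = len(a), len(a[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        total += a[ii][jj]
            out[i][j] = total / (2 * r + 1) ** 2
    return out


# A 3 x 3 feature map with two channels per pixel (illustrative values).
toy_features = [[[1.0, -1.0] for _ in range(3)] for _ in range(3)]
toy_activity = activity_map(toy_features)
toy_smoothed = box_smooth(toy_activity, r=1)
```

With this border convention the interior value stays at 2, while a corner pixel drops to 8/9 because only four of its nine neighbors lie inside the map; a replicate-padding convention would avoid that attenuation.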

The IFM employs multi-layer FE from AlexNet to compute activity-level maps that guide low-frequency fusion. Here, ${f}_{k}^{(n,m)}$ is the FM of the $k$-th source image at layer $n$, with $k\in \{1,2\}$ and channel dimension $m =64\times {2}^{n-1}$; ${F}_{n}$ specifies the layer in AlexNet, where $n\in \{1,2,3,4,5\}$ indexes the activation layers [ReLU1, ReLU2,…,ReLU5]. To make the IFM resilient to misregistration, the ${l}_{1}$-norm is used to generate the activity level map ${A}_{k}^{n}(x,y)$ at position $(x,y)$. The final activity level map $\hat{A}_{k}^{n}$ is computed using the block-based average operator, where ‘r’ specifies the block radius and is set to 1 to preserve more detail⁹⁶.

To construct the adaptive WM used for fusing the LFC, an Adaptive Selection Algorithm (ASA) is employed. The method begins by computing the ratio of the smoothed activity level maps between the two source images, Eq. (18):

$$t_{n} (i,j) = \frac{{\hat{A}_{1}^{n} (i,j)}}{{\hat{A}_{2}^{n} (i,j)}}$$

(18)

This ratio ${t}_{n}(i,j)$ is used to compute the WM as ${W}_{1}^{n}(i,j)$ and ${W}_{2}^{n}(i,j)$ for each image as follows: Eqs. (19) and (20):

$${W}_{1}^{n}(i,j)=\frac{{t}_{n}^{3}(i,j)}{1+{t}_{n}^{3}(i,j)}$$

(19)

$${W}_{2}^{n}(i,j)=\frac{1}{1+{t}_{n}^{3}(i,j)}$$

(20)

These formulations ensure that if ${t}_{n}(i,j)\to 0$, more weight is assigned to the second source image, and vice versa. The two weights always sum to one at every pixel.
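
The weight construction of Eqs. (18) to (20) can be sketched directly from the smoothed activity maps. In the example below, the small constant `eps` in the denominator is an added safeguard against division by zero and is not part of the source equations; the function name is hypothetical.

```python
def weight_maps(a1, a2, eps=1e-12):
    """Return (W1, W2) from two smoothed activity maps of equal size."""
    w1, w2 = [], []
    for row1, row2 in zip(a1, a2):
        out1, out2 = [], []
        for v1, v2 in zip(row1, row2):
            t = v1 / (v2 + eps)             # Eq. (18); eps is an added guard
            t3 = t ** 3
            out1.append(t3 / (1.0 + t3))    # Eq. (19)
            out2.append(1.0 / (1.0 + t3))   # Eq. (20)
        w1.append(out1)
        w2.append(out2)
    return w1, w2
```

For equal activities the ratio is 1 and both weights are 0.5; when the first image is twice as active, the cubic exponent pushes its weight to 8/9, so the scheme behaves like a soft version of maximum selection.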

Since the feature maps from AlexNet are downsampled due to pooling operations, the weight maps must be upsampled to match the original spatial resolution of the source images⁹⁷.

This is done using nearest-neighbor upsampling Eqs. (21) and (22):

$$\hat{W} _{k}^{n}(i+p,j+q)={W}_{k}^{n}(i,j)$$

(21)

$$p,q\in \left\{\text{0,1},\dots ,{2}^{n-1}-1\right\}$$

(22)

where: $\hat{W} _{k}^{n}$→ The upsampled weight map, $k\in \{\text{1,2}\}$→ The image index, The upsampling factor ${2}^{n-1}$ aligns with the receptive field size of the AlexNet layer ‘n’
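
Nearest-neighbor upsampling as in Eqs. (21) and (22) simply replicates each weight over a square block. The sketch below reads Eq. (21) as block replication, i.e. each entry $W_k^n(i,j)$ fills the $2^{n-1}\times 2^{n-1}$ block whose top-left corner is at $(2^{n-1}i, 2^{n-1}j)$; that indexing convention is an interpretation made for this example, and the function name is hypothetical.

```python
def upsample_nearest(w, n):
    """Replicate each entry of w into a 2^(n-1) x 2^(n-1) block (Eqs. 21-22)."""
    factor = 2 ** (n - 1)
    rows, cols = len(w), len(w[0])
    return [[w[i // factor][j // factor] for j in range(cols * factor)]
            for i in range(rows * factor)]
```

For layer `n = 1` the factor is 1 and the map is returned unchanged; for `n = 2` a 1 × 2 map becomes 2 × 4.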

Using these upsampled WM⁹⁸, the fused LFC at each layer is computed by weighted averaging Eq. (23):

$${L}_{\text{Fused }}^{n}(i,j)={L}_{1}(i,j)\cdot \hat{W} _{1}^{n}(i,j)+{L}_{2}(i,j)\cdot \hat{W}_{2}^{n}(i,j)$$

(23)

Finally, across the multiple layers $n\in \{1,2,3,4,5\}$, the fused coefficient at each spatial location is selected using a maximum activity rule Eq. (24):

$${L}_{\text{Fused }}(i,j)=\underset{n}{\text{max}} \left[{L}_{\text{Fused }}^{n}(i,j)\right]$$

(24)

This step ensures that the most salient fused features across all layers contribute to the final low-frequency fusion result.
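
Eqs. (23) and (24) can be combined into one pass: each layer produces a weighted average of the two low-frequency sub-bands, and the pixel-wise maximum over layers is kept. The sketch below assumes the weight maps have already been upsampled to the size of the sub-bands; the function name is hypothetical.

```python
def fuse_low_frequency(l1, l2, w1_layers, w2_layers):
    """Fuse two low-frequency sub-bands given per-layer weight maps."""
    h, w = len(l1), len(l1[0])
    fused = [[float("-inf")] * w for _ in range(h)]
    for w1, w2 in zip(w1_layers, w2_layers):
        for i in range(h):
            for j in range(w):
                # Eq. (23): weighted average for the current layer.
                candidate = l1[i][j] * w1[i][j] + l2[i][j] * w2[i][j]
                # Eq. (24): keep the maximum over all layers.
                fused[i][j] = max(fused[i][j], candidate)
    return fused
```

Because the maximum is taken independently at every pixel, neighboring pixels of the fused result may come from different layers.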

HFC-fusion method using pulse-coupled neural network (PCNN)

The PCNN is a biologically inspired Feedback Neural Network (FNN) modeled on the visual cortex of mammals⁹⁹. It consists of a single-layer, 2-D array of neurons, where each neuron corresponds 1-to-1 with a pixel in the input image. In this network, spatially adjacent neurons interact within a defined local neighborhood, enabling localized feature improvement. As shown in Fig. 6, each PCNN neuron comprises three core components: the receptive field, which receives external stimuli; the linking or modulation field, which facilitates inter-neuronal communication; and the pulse generator, which directs the neuron’s firing behavior based on internal dynamics and external input. Also, according to¹⁰⁰, the input signal is split into the feeding input ${F}_{i,j}$ and the linking input ${L}_{i,j}$.

The PCNN for IF is formulated as follows, where each neuron corresponds to a pixel at a spatial location $(i,j)$, and the PCNN evolves over discrete time steps $n\in \left\{1,2,\dots ,{n}_{\text{Max}}\right\}$ as given in Eqs. (25) to (29).

$${F}_{i,j}[n]={S}_{i,j}+{e}^{-{\alpha }_{F}}{F}_{i,j}[n-1]+{V}_{F}\sum_{k,l} {M}_{i,j,k,l}{Y}_{k,l}[n-1]$$

(25)

$${L}_{i,j}[n]={e}^{-{\alpha }_{L}}{L}_{i,j}[n-1]+{V}_{L}\sum_{k,l} {W}_{i,j,k,l}{Y}_{k,l}[n-1]$$

(26)

$${U}_{i,j}[n]={F}_{i,j}[n]\left(1+\beta {L}_{i,j}[n]\right)$$

(27)

$${T}_{i,j}[n]={e}^{-{\alpha }_{T}}{T}_{i,j}[n-1]+{V}_{T}{Y}_{i,j}[n]$$

(28)

$${Y}_{i,j}[n]=\left\{\begin{array}{c}1,{U}_{i,j}>{T}_{i,j}\\ 0, \, {\text{O}}{\text{t}}{\text{h}}{\text{e}}{\text{r}}{\text{w}}{\text{i}}{\text{s}}{\text{e}}\end{array}\right.$$

(29)

where, ${F}_{i,j}[n]$ →Feeding input at position $(i,j)$ at iteration ‘n’, composed of the static input ${S}_{i,j}$, its decayed past value, and the effect from neighboring neuron firings via synaptic weights ${M}_{i,j,k,l}$. ${L}_{i,j}[n]$ →Linking input, decayed over time and influenced by neighboring outputs via weights ${W}_{i,j,k,l}$. ${U}_{i,j}[n]$ →Internal activity (modulated signal) combining feeding and linking with a linking strength co-efficient $\beta$. ${T}_{i,j}[n]$ →Dynamic threshold, decays over time and increases upon firing. ${Y}_{i,j}[n]$ →Output pulse; a binary indicator of neuron firing at time ‘n’. ${\alpha }_{F},{\alpha }_{L},{\alpha }_{T}$ →Decay coefficients for feeding, linking, and threshold signals. ${V}_{F},{V}_{L},{V}_{T}$ →Normalization constants scaling the influence of previous outputs. ( $k,l$ )→Indices of neighboring pixels in the spatial neighborhood. ${n}_{\text{Max}}$ →Maximum number of iterations.

The PCNN, which consists of a feeding field and a linking field, has six parameters: three decay factors $\left({\alpha }_{F},{\alpha }_{L},{\alpha }_{T}\right)$ and three normalizing constants $\left({V}_{F},{V}_{L},{V}_{T}\right)$ for the feeding (${F}_{i,j}$), linking (${L}_{i,j}$), and threshold (${T}_{i,j}$) inputs. In Eqs. (27) to (29), ${U}_{i,j}$ stands for the neuron’s internal activity (linking modulation), while ${T}_{i,j}$ and ${Y}_{i,j}$ denote the dynamic threshold and the neuron’s pulse output, respectively. The linking parameter ‘β’ is an important factor that alters the weight of the linking field. For the HFC, the Novel Sum-Modified Laplacian (NSML) is used to measure contrast levels in a manner consistent with the requirements of the Human Visual System (HVS).

Compute the NSML for the HFSB as follows: Eqs. (30) and (31):

$$NSML(i,j)=\sum_{a} \sum_{b} w(a,b)\cdot F(i+a,j+b)$$

(30)

$$\begin{array}{rl}F(i,j)= & \left|2H{F}_{Z}^{NSST}(i,j)-H{F}_{Z}^{NSST}(i-1,j)-H{F}_{Z}^{NSST}(i+1,j)\right|\\ & +\left|2H{F}_{Z}^{NSST}(i,j)-H{F}_{Z}^{NSST}(i,j-1)-H{F}_{Z}^{NSST}(i,j+1)\right|\end{array}$$

(31)

where:$H{F}_{Z}^{NSST}(i,j)$ →HFC from modality $Z$ (e.g., CT or MRI) at pixel location $(i,j)$, attained by NSST. $F(i,j)$ →Local contrast magnitude at pixel $(i,j)$ computed using modified Laplacian operators. $w(a,b)$ →Normalized window function used in local contrast aggregation, has the following Eq. (32):

$$w(a,b)=\left[\begin{array}{ccc}1/15 & 2/15 & 1/15\\ 2/15 & 3/15 & 2/15\\ 1/15 & 2/15 & 1/15\end{array}\right]$$

(32)
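
The NSML of Eqs. (30) to (32) is a windowed sum of a modified Laplacian. The following sketch computes it for a small sub-band stored as a nested list; border pixels are handled by clamping indices to the image edge, which is an assumption of this example since the text does not specify it. Function names are hypothetical.

```python
WINDOW = [[1 / 15, 2 / 15, 1 / 15],
          [2 / 15, 3 / 15, 2 / 15],
          [1 / 15, 2 / 15, 1 / 15]]  # Eq. (32); the entries sum to 1


def _at(img, i, j):
    """Read a pixel, clamping indices to the image border (assumed convention)."""
    i = min(max(i, 0), len(img) - 1)
    j = min(max(j, 0), len(img[0]) - 1)
    return img[i][j]


def modified_laplacian(hf, i, j):
    """Eq. (31): absolute second differences along both image axes."""
    center = 2 * _at(hf, i, j)
    along_i = abs(center - _at(hf, i - 1, j) - _at(hf, i + 1, j))
    along_j = abs(center - _at(hf, i, j - 1) - _at(hf, i, j + 1))
    return along_i + along_j


def nsml(hf):
    """Eq. (30): window-weighted sum of the modified Laplacian over 3 x 3 blocks."""
    h, w = len(hf), len(hf[0])
    f = [[modified_laplacian(hf, i, j) for j in range(w)] for i in range(h)]
    return [[sum(WINDOW[a + 1][b + 1] * _at(f, i + a, j + b)
                 for a in (-1, 0, 1) for b in (-1, 0, 1))
             for j in range(w)] for i in range(h)]


# A single bright coefficient in a flat 3 x 3 sub-band (illustrative values).
impulse = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
impulse_nsml = nsml(impulse)
```

For the impulse, the modified Laplacian is 4 at the center and 1 at its four direct neighbors, so the NSML at the center is $(3\cdot 4 + 4\cdot 2\cdot 1)/15 = 4/3$, whereas a constant sub-band yields zero everywhere.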

To activate the PCNN, the NSML of each HFSB is used as the feeding input that drives the neuron’s pulse, as given in Eq. (33):

$$\left.\begin{array}{rl}{F}_{i,j}^{Z}[n] & =NSM{L}_{i,j}^{Z}\\ {L}_{i,j}^{Z}[n] & ={e}^{-{\alpha }_{L}}{L}_{i,j}^{Z}[n-1]+{V}_{L}\sum_{k,l} {W}_{i,j,k,l}^{Z}{Y}_{k,l}^{Z}[n-1]\\ {U}_{i,j}^{Z}[n] & ={F}_{i,j}^{Z}[n]\left(1+\beta {L}_{i,j}^{Z}[n]\right)\\ {T}_{i,j}^{Z}[n] & ={e}^{-{\alpha }_{T}}{T}_{i,j}^{Z}[n-1]+{V}_{T}{Y}_{i,j}^{Z}[n]\\ {Y}_{i,j}^{Z}[n] & =\left\{\begin{array}{cc}1,& {U}_{i,j}^{Z}>{T}_{i,j}^{Z}\\ 0,& \text{ otherwise}\end{array}\right.\end{array}\right\}$$

(33)

where: $Z$ →Source modality (e.g., CT or MRI). $(i,j)$ →Pixel coordinates. ${F}_{i,j}^{Z}[n]$ →Feeding input derived from NSML for modality $Z$. ${L}_{i,j}^{Z}[n]$ →Linking input aggregating neighborhood activity. ${U}_{i,j}^{Z}[n]$ →Internal activity (modulated product of feeding and linking inputs). ${T}_{i,j}^{Z}[n]$ →Dynamic threshold for neuron firing. ${Y}_{i,j}^{Z}[n]$ →Binary neuron output (firing decision). ${\alpha }_{L},{\alpha }_{T}$ →Decay rates for linking input and threshold. ${V}_{L},{V}_{T}$ →Normalization constants. $\beta$ →Linking modulation coefficient. ${W}_{i,j,k,l}^{Z}$ →Synaptic weights connecting neighboring neurons.

In the iterative process of the PCNN, a neuron’s firing is determined by whether its internal activity ${U}_{i,j}^{Z}[n]$ exceeds the dynamic threshold ${T}_{i,j}^{Z}[n]$.

This condition generates the binary firing output ${Y}_{i,j}^{Z}[n]\in \{0,1\}$ as defined in Eq. (29). Over ${n}_{\text{Max}}$ iterations, the total firing time map is computed by accumulating the number of times each neuron fires, Eq. (34):

$${t}_{i,j}^{Z}[n]={t}_{i,j}^{Z}[n-1]+{Y}_{i,j}^{Z}[n], \text{ for }n=\text{1,2},\dots ,{n}_{\text{Max}}$$

(34)

where: ${t}_{i,j}^{Z}[n]$ →The cumulative firing count at pixel $(i,j)$ up to iteration ‘n’, with ${Y}_{i,j}^{Z}[n]=1$ if ${U}_{i,j}^{Z}[n]>{T}_{i,j}^{Z}[n]$, and 0 otherwise. This summation reflects the number of times a neuron at location $(i,j)$ was activated across the PCNN evolution.
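
The simplified PCNN of Eq. (33) and the firing-count accumulation of Eq. (34) can be sketched as a short loop. Several details below are assumptions made only for this illustration and are not stated in the text: unit synaptic weights over the 8-neighborhood, an initial threshold of 1, zero initial linking input, and a firing test that compares $U[n]$ with the threshold carried over from the previous iteration (so that the threshold update can use the new output). The function name and parameter names are hypothetical.

```python
import math


def pcnn_firing_counts(stimulus, n_max, alpha_l, alpha_t, v_l, v_t, beta):
    """Iterate a simplified PCNN and return the per-pixel firing counts."""
    h, w = len(stimulus), len(stimulus[0])
    link = [[0.0] * w for _ in range(h)]     # L[0] = 0 (assumed)
    thresh = [[1.0] * w for _ in range(h)]   # T[0] = 1 (assumed)
    fired = [[0] * w for _ in range(h)]      # Y[0] = 0
    counts = [[0] * w for _ in range(h)]
    for _ in range(n_max):
        new_fired = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                # Firings of the 8 neighbors at the previous step (unit weights assumed).
                neighbors = sum(fired[i + a][j + b]
                                for a in (-1, 0, 1) for b in (-1, 0, 1)
                                if (a or b) and 0 <= i + a < h and 0 <= j + b < w)
                link[i][j] = math.exp(-alpha_l) * link[i][j] + v_l * neighbors
                # Internal activity: feeding input modulated by the linking input.
                u = stimulus[i][j] * (1.0 + beta * link[i][j])
                # Fire if the activity exceeds the threshold from the previous step.
                y = 1 if u > thresh[i][j] else 0
                thresh[i][j] = math.exp(-alpha_t) * thresh[i][j] + v_t * y
                new_fired[i][j] = y
                counts[i][j] += y            # Eq. (34)
        fired = new_fired
    return counts
```

With a strong stimulus next to a weak one, e.g. `pcnn_firing_counts([[2.0, 0.1]], 3, 1.0, 0.2, 1.0, 20.0, 0.1)`, the strong pixel fires once in the first iteration and is then held back by its raised threshold, while the weak pixel never fires within three iterations.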

At the conclusion of the iterations (i.e., when $n={n}_{\text{Max}}$), the final firing times ${t}_{i,j}^{X}\left[{n}_{\text{Max}}\right]$ and ${t}_{i,j}^{Y}\left[{n}_{\text{Max}}\right]$ of the two source modalities $X$ and $Y$ are used as a decision measure to select the most salient coefficient from the HFSB of the input images¹⁰¹.

The fusion rule is defined in Eq. (35):

$$H{F}_{F}^{NSST}(i,j)=\left\{\begin{array}{ll}H{F}_{X}^{NSST}(i,j),& \text{ If }{t}_{i,j}^{X}\left[{n}_{\text{Max}}\right]\ge {t}_{i,j}^{Y}\left[{n}_{\text{Max}}\right]\\ H{F}_{Y}^{NSST}(i,j),& \text{ Otherwise}\end{array}\right.$$

(35)

This decision rule ensures that for each pixel $(i,j)$, the fused HFC is taken from the source image whose PCNN neuron fires more often, indicating higher local saliency, such as edges or texture elements¹⁰². When the two firing counts are equal, the coefficient from modality $X$ is kept.
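
The selection rule of Eq. (35) reduces to a pixel-wise comparison of the two firing-count maps. A minimal sketch, with a hypothetical function name and nested lists standing in for the sub-bands:

```python
def fuse_high_frequency(hf_x, hf_y, t_x, t_y):
    """Eq. (35): take the X coefficient where t_x >= t_y, otherwise the Y coefficient."""
    return [[cx if tx >= ty else cy
             for cx, cy, tx, ty in zip(row_x, row_y, row_tx, row_ty)]
            for row_x, row_y, row_tx, row_ty in zip(hf_x, hf_y, t_x, t_y)]
```

For instance, with coefficients `[[1.0, -2.0]]` and `[[5.0, 6.0]]` and firing counts `[[3, 1]]` and `[[3, 4]]`, the result is `[[1.0, 6.0]]`: the tie in the first pixel goes to X, and the second pixel is taken from Y.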
