This manuscript presents an EMDD-ADLMOA technique aimed at improving malicious domain detection in cybersecurity. To accomplish this, the EMDD-ADLMOA model comprises data pre-processing, feature subset selection, attack classification, and parameter tuning. Figure 1 depicts the overall flow of the EMDD-ADLMOA model.

Overall flow of EMDD-ADLMOA model.
Data pre-processing: min–max scaling
At first, the min–max scaling method is utilized in the data pre-processing phase to convert the input data into a suitable format34. This method is chosen because it efficiently normalizes feature values to a fixed range, typically between 0 and 1, ensuring consistency across the dataset. This is particularly useful when the model is sensitive to the scale of input features, as is the case with many ML models. Compared to other scaling techniques, such as standardization, min–max scaling preserves the relationships between the original data points and does not distort the data distribution. It also avoids the skewness issues that can arise when features have varying magnitudes or units. The method is computationally simple, easy to implement, and helps improve the convergence speed of learning algorithms, resulting in faster training and more stable model performance. Furthermore, it performs well on datasets with a known, fixed feature-value range.
Owing to the wide variety of features in the dataset, it is crucial to standardize or normalize the features fed to the DL method. Here, min–max scaling is applied to continuous features and one-hot encoding to categorical features.
For a continuous variable \(x\), the min–max scaling transformation is defined as:
$$x^{\prime} = \frac{{x - {\text{min}}\left( x \right)}}{{{\text{max}}\left( x \right) - {\text{min}}\left( x \right)}}$$
(1)
where \(x^{\prime}\) denotes the scaled value, and \({\text{min}}\left( x \right)\) and \({\text{max}}\left( x \right)\) denote the minimal and maximal values of variable \(x\) in the dataset, respectively. This scaling guarantees that every feature lies within the range [0,1], which improves model convergence during training.
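As a concrete illustration, Eq. (1) can be applied column-wise before training. The sketch below is a minimal NumPy version; the zero-fill for constant features is our own assumption, since Eq. (1) is undefined when \({\text{max}}(x) = {\text{min}}(x)\):

```python
import numpy as np

def min_max_scale(x):
    """Scale a 1-D feature vector into [0, 1] as in Eq. (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:          # constant feature: Eq. (1) is undefined, map to 0
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)
```

For example, `min_max_scale([10, 20, 30])` yields `[0.0, 0.5, 1.0]`.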
Feature subset selection: QIFA model
For the FS process, the proposed EMDD-ADLMOA technique utilizes QIFA35. This method is chosen because it integrates the merits of quantum-inspired optimization and the FA, enabling it to navigate large search spaces efficiently. Unlike conventional methods such as forward selection or genetic algorithms, QIFA offers faster convergence by exploiting quantum-inspired mechanisms, improving both the speed and accuracy of feature selection. It handles complex, high-dimensional datasets by detecting the most relevant features, thus mitigating computational cost and overfitting. QIFA also balances exploration and exploitation during the search, avoiding local optima and ensuring a robust selection. This makes it particularly effective when the relationships between features are complex and conventional feature selection techniques struggle. Overall, QIFA improves model performance by choosing the optimal feature subset with minimal computational overhead. Figure 2 illustrates the steps involved in the QIFA methodology.

Steps involved in the QIFA method.
The QIFA model combines the FA with concepts from quantum computing (QC). The FA is a meta-heuristic approach inspired by the natural behaviour of fireflies, which use their flashing signals to attract other fireflies as well as potential prey. Each firefly produces an individual flashing pattern associated with the objective function to be optimized for the problem at hand. The FA governs its agents by three fundamental principles, described as follows:
1. Fireflies are attracted to one another by light intensity or brightness \(\left( {I_{i} } \right)\) rather than by sex, with brightness directly related to the value of the objective function \(f\left( {x_{i} } \right)\) at their respective positions \(\left( {x_{i} } \right)\). This relationship is stated by Eq. (2).
$$I_{i} \propto f\left( {x_{i} } \right)$$
(2)
2. The attractiveness \(\left( \beta \right)\) between fireflies decreases as the distance \(\left( r \right)\) between them increases. The \(\beta\) value between fireflies \(i\) and \(j\) is computed using Eq. (3).
$$\beta \left( {r_{ij} } \right) = \beta_{0} e^{{ - \gamma r_{ij}^{2} }}$$
(3)
where \(\beta_{0}\) denotes the attractiveness at distance \(\left( {r = 0} \right)\), and \(\gamma\) denotes the light-absorption coefficient. The distance \(r_{ij}\) between fireflies \(i\) and \(j\) is the Euclidean distance: \(r_{ij} = \left\| {x_{i} - x_{j} } \right\|\).
3. Fireflies move deterministically toward brighter peers, but they also include a random component to improve exploration of the search region. The movement of firefly \(i\) toward a brighter firefly \(j\) is computed using Eq. (4).
$$x_{i}^{t + 1} = x_{i}^{t} + \beta_{0} e^{{\left( { – \gamma r_{ij}^{2} } \right)}} \left( {x_{j}^{t} – x_{i}^{t} } \right) + \varepsilon \left( {rand – 0.5} \right)$$
(4)
where \(rand\) denotes a random number uniformly distributed in the interval \([0,1]\), and \(\varepsilon\) represents a randomization parameter that controls the step size of the random component. A larger value of \(\varepsilon\) favours exploration, whereas a smaller value emphasizes exploitation.
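The movement rule of Eqs. (3)-(4) can be sketched as follows; the parameter values \(\beta_{0}\), \(\gamma\), and \(\varepsilon\) are illustrative defaults, not values specified by the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, eps=0.2):
    """Move firefly i toward a brighter firefly j, per Eq. (4)."""
    r2 = np.sum((x_i - x_j) ** 2)           # squared Euclidean distance r_ij^2
    beta = beta0 * np.exp(-gamma * r2)      # attraction term, Eq. (3)
    # deterministic pull toward j plus a uniform random perturbation
    return x_i + beta * (x_j - x_i) + eps * (rng.random(x_i.shape) - 0.5)
```

With \(\varepsilon = 0\) the update is purely attractive, so the firefly always moves closer to its brighter neighbour.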
These principles enable the FA to solve complex optimization problems efficiently. To incorporate quantum-computing features into the FA, it is first converted into the Binary FA (BFA). The BFA is the discrete variant of the FA, suited to binary problems whose outputs are 0 or 1. Position updates must be bounded so that the model does not produce values outside the binary range. To this end, the sigmoid function \(S\left( {x_{i} } \right)\) is used to confine values to the binary limits. The position of firefly \(i\), whose bits are \(\left( {x_{i} } \right)\), is updated in the BFA by Eq. (5).
$$x_{i} = \left\{ {\begin{array}{*{20}c} {1,} & {if S\left( {x_{i} } \right) > RN_{U} } \\ {0,} & {otherwise} \\ \end{array} } \right.$$
(5)
Here, \(RN_{U}\) refers to a uniformly distributed random number within the range \((0,1)\), and the sigmoid function \(S\left( x \right)\) is given by Eq. (6).
$$S\left( {x_{i} } \right) = \frac{1}{{1 + e^{{\left( { – x_{i} } \right)}} }}$$
(6)
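Eqs. (5)-(6) amount to a stochastic thresholding of the sigmoid output; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    """Eq. (6): squash a real-valued position into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def binarize(x):
    """Eq. (5): collapse positions to bits by comparing S(x) with U(0,1) draws."""
    return (sigmoid(x) > rng.random(np.shape(x))).astype(int)
```

Strongly positive positions are almost always mapped to 1 and strongly negative ones to 0, while positions near zero remain genuinely random.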
As in the BFA, quantum computation represents outputs with binary q-bits taking the value \(0\) or \(1\), as well as superpositions of the two. Eq. (7) expresses the superposition of a quantum state.
$$\left| \psi \right\rangle = C_{1} \left| 0 \right\rangle + C_{2} \left| 1 \right\rangle$$
(7)
\(C_{1}\) and \(C_{2}\) are complex numbers whose squared magnitudes \(|C_{1} |^{2}\) and \(|C_{2} |^{2}\) give the probabilities of observing the q-bit in state \(0\) or \(1\), respectively. These probabilities must satisfy the normalization condition \(|C_{1} |^{2} + |C_{2} |^{2} = 1\). The q-bit state is changed through quantum gates represented by a unitary operator \(U\). The rotation gate is chosen to update q-bits owing to its broad and effective applicability in empirical methods. The unitary operator of the rotation gate, with rotation angle \(\theta_{i}\) \(\left( {for\;i = 1,2, \ldots , n} \right)\), is expressed by Eq. (8).
$$U\left( {\theta_{i} } \right) = \left[ {\begin{array}{*{20}c} {{\text{cos}}\left( {\theta_{i} } \right)} & { – {\text{sin}}\left( {\theta_{i} } \right)} \\ {{\text{sin}}\left( {\theta_{i} } \right)} & {{\text{cos}}\left( {\theta_{i} } \right)} \\ \end{array} } \right]$$
(8)
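The rotation gate of Eq. (8) acts on the amplitude pair \((C_{1}, C_{2})\) of Eq. (7); the sketch below applies it to a q-bit initialised to \(\left| 0 \right\rangle\) and then collapses it to a bit in the manner of Eq. (10). The \(\pi/4\) angle is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)

def rotate_qbit(c, theta):
    """Apply the rotation gate U(theta) of Eq. (8) to amplitudes c = [C1, C2]."""
    U = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return U @ c

def collapse(c):
    """Collapse a q-bit to a bit: 1 when |C2|^2 exceeds a uniform draw (cf. Eq. 10)."""
    return int(np.abs(c[1]) ** 2 > rng.random())

c = np.array([1.0, 0.0])          # q-bit initialised to |0>
c = rotate_qbit(c, np.pi / 4)     # equal superposition: |C1|^2 = |C2|^2 = 0.5
```

Because \(U\) is unitary, the normalization \(|C_{1}|^{2} + |C_{2}|^{2} = 1\) is preserved by every rotation.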
In QIFA, a dynamic rotation-angle scheme determines both the direction and the magnitude of the rotation gate used to update q-bits. The rotation angle is therefore computed using Eq. (9), without requiring a predefined lookup table.
$$\theta_{i} = \theta \times \left( {x_{i} + \beta_{0} e^{{\left( { – \gamma r_{ij}^{2} } \right)}} \left( {x_{j} – x_{i} } \right) + \varepsilon \left( {rand – 0.5} \right)} \right)$$
(9)
Here, \(\theta\) denotes the magnitude of the rotation angle, which decreases monotonically from \(\theta_{max}\) to \(\theta_{min}\) as the iteration count grows. The final binary position of the fireflies is updated using Eq. (10).
$$x_{i} = \left\{ {\begin{array}{*{20}c} {1,} & {if\left| {\beta_{i} \left( {t + 1} \right)} \right|^{2} > RN_{U} } \\ {0,} & {otherwise} \\ \end{array} } \right.$$
(10)
Here, \(RN_{U}\) represents a uniformly distributed random number within the range \([0,1]\). The fitness function (FF) considers both the classification accuracy and the number of selected features: it maximizes the classification accuracy while minimizing the size of the selected attribute set. The FF in Eq. (11) is therefore applied to evaluate individual solutions.
$$Fitness = \alpha * ErrorRate + \left( {1 – \alpha } \right)*\frac{\# SF}{{\# All\_F}}$$
(11)
Here, \(ErrorRate\) denotes the classification error rate obtained with the chosen features, i.e., the proportion of incorrect classifications among all classifications, expressed as a value in (0,1). \(\# SF\) stands for the number of selected features, and \(\# All\_F\) denotes the total number of features in the original dataset.
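Eq. (11) can be computed directly; the weight \(\alpha = 0.9\) below is an illustrative choice, as the paper does not fix its value:

```python
def fs_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Eq. (11): weighted trade-off between classification error and subset size.
    alpha (here 0.9) is an illustrative weight, not specified by the paper."""
    return alpha * error_rate + (1 - alpha) * n_selected / n_total
```

At equal error rates, a candidate that selects fewer features obtains the lower (better) fitness.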
Hybrid classification model: TCN-BiLSTM-SEA
Furthermore, the hybrid TCN-BiLSTM-SEA model is implemented for the classification process36. This model is chosen for its ability to capture both temporal dependencies and crucial feature interactions in sequential data. TCNs effectively learn long-range dependencies in time-series data, overcoming the limitations of traditional recurrent neural networks (RNNs), which suffer from vanishing gradients. The addition of BiLSTM allows the model to process data in both forward and backward directions, further improving its capability to capture temporal patterns. Squeeze-and-excitation attention (SEA) sharpens the model's concentration on relevant features by dynamically recalibrating feature responses. This hybrid model is especially appropriate for complex datasets, as it integrates the merits of convolutional and recurrent layers with an attention mechanism, giving superior classification performance over models that depend on a single architecture such as CNNs or LSTMs alone.
TCN is a DL method specially tailored to processing time-series data. Unlike conventional LSTMs and RNNs, TCN captures long-range dependencies in time-series data through convolution operations. Convolution permits parallel computation, accelerating the training process, and allows flexible feature extraction at different time scales by tuning the convolution kernel size and dilation depth. Compared with standard convolutional methods, TCN has essential benefits in capturing long-range dependencies while keeping computational cost low: standard convolutional methods have a small receptive field and usually need many stacked convolution layers to expand it, which increases computing costs. To enhance training stability and efficiency, TCN uses a residual-block framework in which each pair of convolutional layers is wrapped in a residual block, preventing vanishing gradients and promoting a smooth flow of information. The residual block speeds up training and improves the representation capability of the model. Stacking numerous residual blocks forms deeper networks in which each layer extracts more complex temporal features, thus enhancing the stability and accuracy of the model on time series.
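The dilated causal convolution and residual block described above can be sketched as follows. This is a didactic single-channel version; real TCNs use learned multi-channel kernels and normalization layers:

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Dilated causal 1-D convolution: the output at time t sees only x[<= t]."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])      # left-pad so no future leakage
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def residual_block(x, w1, w2, dilation):
    """TCN-style residual block: two dilated causal convs plus a skip connection."""
    h = np.maximum(causal_conv1d(x, w1, dilation), 0.0)   # ReLU non-linearity
    h = causal_conv1d(h, w2, dilation)
    return x + h                                          # residual connection
```

Doubling the dilation in successive blocks grows the receptive field exponentially with depth, which is how TCN covers long ranges without stacking many plain convolution layers.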
LSTM is an enhanced form of the RNN, specially designed for capturing long-range dependencies. Compared with conventional RNNs, LSTM addresses the gradient explosion and vanishing problems by introducing gating mechanisms, making it more effective at processing longer data sequences. It is extensively applied in machine translation, time-series forecasting, speech recognition, and natural language processing. The LSTM network is computed as follows:
$$f_{t} = \sigma \left( {W_{f} \cdot \left[ {h_{t – 1} ,x_{t} } \right] + b_{f} } \right)$$
(12)
$$i_{t} = \sigma \left( {W_{i} \cdot \left[ {h_{t – 1} ,x_{t} } \right] + b_{i} } \right)$$
(13)
$$\tilde{C}_{t} = \tanh \left( {W_{C} \cdot \left[ {h_{t – 1} ,x_{t} } \right] + b_{C} } \right)$$
(14)
$$C_{t} = f_{t} *C_{t – 1} + i_{t} *\tilde{C}_{t}$$
(15)
$$o_{t} = \sigma \left( {W_{o} \cdot \left[ {h_{t – 1} ,x_{t} } \right] + b_{o} } \right)$$
(16)
$$h_{t} = o_{t} \cdot \tanh \left( {C_{t} } \right)$$
(17)
Here, \(i_{t}\), \(o_{t} ,\) and \(f_{t}\) denote the input, output, and forget gates, respectively. \(C_{t}\) signifies the cell state at time \(t\), whereas \(h_{t}\) and \(h_{t - 1}\) denote the hidden-layer (HL) outputs at times \(t\) and \(t - 1\), respectively. \(b_{f}\), \(b_{i}\), \(b_{C}\), and \(b_{o}\) represent bias vectors, and \(W_{f}\), \(W_{i}\), \(W_{C}\), and \(W_{o}\) represent weight matrices. \(tanh\) and \(\sigma\) signify the hyperbolic tangent and sigmoid functions, respectively. The Bi-LSTM contains two LSTMs, one handling the forward and one the reverse direction of the time series, allowing the method to capture both the context at the current step and the dynamic changes within the sequence. A conventional LSTM can make forecasts only from previous time steps and cannot exploit future information. By propagating data both forward and backward, Bi-LSTM improves the model's memory of the sequence.
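One forward step of Eqs. (12)-(17) can be sketched directly; the dictionary-based parameter layout is our own convention for readability:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (12)-(17).
    W and b hold the four gate parameters keyed 'f', 'i', 'C', 'o';
    each weight matrix acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate, Eq. (12)
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate, Eq. (13)
    c_tilde = np.tanh(W['C'] @ z + b['C'])    # candidate cell state, Eq. (14)
    c_t = f_t * c_prev + i_t * c_tilde        # cell-state update, Eq. (15)
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate, Eq. (16)
    h_t = o_t * np.tanh(c_t)                  # hidden output, Eq. (17)
    return h_t, c_t
```

A Bi-LSTM simply runs one such cell over the sequence forward and a second cell backward, then concatenates the two hidden outputs at each step.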
In DL, an attention mechanism (AM) can rapidly extract key characteristics from large amounts of data, decrease computational complexity, and enhance learning accuracy and efficiency. SE attention improves the capture of key characteristics by dynamically and adaptively fine-tuning the weighting of the feature channels. An SE unit comprises Squeeze, Excitation, and Scale operations. \(X^{\prime}\) denotes the original data, and \(W^{\prime}\), \(H^{\prime}\), and \(C^{\prime}\) denote the width, height, and channel count of the original input, respectively. \(X\) denotes the convolved data, and \(W\), \(C\), and \(H\) are the width, channel count, and height of the convolved data, respectively. \(\tilde{X}\) denotes the outcome of feature recalibration. Finally, the channel weights are produced and applied to weight the Bi-LSTM output features. Through its weighting mechanism, SE attention efficiently reduces the influence of low-variance and noisy features: during training, low-variance characteristics are dynamically assigned low weights, and the effect of noisy features is suppressed, preventing them from degrading the method's predictions. This guarantees the accuracy and robustness of the technique in real-time applications.
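The Squeeze, Excitation, and Scale steps can be sketched on a (time, channel) feature map. The two-layer bottleneck with ReLU and sigmoid follows the standard SE design; the weight shapes below are illustrative:

```python
import numpy as np

def se_attention(x, w1, w2):
    """Squeeze-and-Excitation over channels for x of shape (T, C).
    w1: (C//r, C) reduction and w2: (C, C//r) expansion weights
    (r is the bottleneck reduction ratio; shapes are illustrative)."""
    s = x.mean(axis=0)                                         # Squeeze: global pooling per channel
    e = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ s, 0.0))))  # Excitation: ReLU then sigmoid
    return x * e                                               # Scale: recalibrate channel responses
```

Because the excitation weights lie in (0, 1), each channel is attenuated in proportion to its learned importance, which is how low-variance or noisy channels get suppressed.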
This study incorporates TCN and SE attention into the Bi-LSTM method to build a new DL structure, TCN-BiLSTM-SE attention, aimed at increasing predictive ability. The distinctive 1D causal convolution architecture of TCN captures temporal structure, while its residual connection component speeds up the convergence of the network. As the essential component of the method, Bi-LSTM can utilize both past and future information. Through its gating mechanism, Bi-LSTM successfully preserves important characteristics and removes unrelated information, thus improving prediction precision. SE attention permits the method to concentrate on the main features automatically.
Parameter optimizer: PO approach
Finally, the PO model optimally fine-tunes the hyperparameter values of the TCN-BiLSTM-SEA method, yielding better classification performance37. This approach is chosen because it can efficiently explore and exploit the hyperparameter search space. Unlike conventional methods such as grid or random search, PO utilizes a nature-inspired algorithm that replicates the behaviour of parrots to balance the exploration of new areas with the exploitation of known optimal solutions. This results in faster convergence to optimal hyperparameters while avoiding the computational cost of exhaustive search methods. The PO approach also adapts dynamically to the problem at hand, making it more flexible and robust for fine-tuning complex models. It outperforms simpler optimization techniques by providing global search capability, avoiding local optima, and improving overall model performance. Additionally, PO is computationally efficient, making it appropriate for models with large and complex hyperparameter spaces. Figure 3 depicts the flowchart of the PO method.

The PO model is inspired by the adaptive strategies and social behaviours observed in parrot populations in their natural surroundings. Based on this behaviour, the PO model follows the competitive and cooperative mechanisms found within parrot flocks, presenting an efficient solution to complex optimization problems. In this method, parrots fine-tune their flight routes and accelerations through cooperative behaviour, improving their capability to cover the search area effectively. This behaviour maps naturally onto optimization, where individuals explore candidate solutions within the solution space and exchange information with one another to reach improved results.
The fundamental equations of the PO model imitate these natural behaviours across two key stages: exploration and exploitation. During the exploration stage, the model mimics the parrot's random movements in pursuit of food, traversing the solution space to identify promising solutions. This procedure is mathematically stated by the update rule of Eq. (18):
$$x_{i} \left( {f + 1} \right) = x_{i} \left( f \right) + \alpha \cdot r_{i} \left( f \right) \cdot \left( {x_{best} \left( f \right) - x_{i} \left( f \right)} \right) + \beta \cdot \left( {x_{i} \left( f \right) - x_{j} \left( f \right)} \right)$$
(18)
Here, \(x_{i} \left( f \right)\) denotes the position of the \(i\)th parrot at iteration \(f\), while \(x_{best} \left( f \right)\) denotes the current global best solution. \(r_{i} \left( f \right)\) represents a random factor that controls the strength of the exploration behaviour. The parameters \(\alpha\) and \(\beta\) regulate the balance between exploration and exploitation, respectively. Moreover, \(x_{j} \left( f \right)\) refers to the position of another parrot with which the current parrot communicates. The PO exhibits robust global search ability and high solution quality, making it suitable for complex optimization problems that require identifying optimal solutions in a vast search space.
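A single exploration move per Eq. (18) can be sketched as below; the values of \(\alpha\) and \(\beta\) are illustrative, as the paper does not specify them:

```python
import numpy as np

rng = np.random.default_rng(3)

def parrot_explore(x_i, x_j, x_best, alpha=0.5, beta=0.3):
    """Exploration move of Eq. (18): a pull toward the global best plus a
    social term relative to a randomly chosen flock-mate x_j."""
    r = rng.random()                                  # random factor r_i(f)
    return x_i + alpha * r * (x_best - x_i) + beta * (x_i - x_j)
```

When \(x_{j} = x_{i}\) the social term vanishes and the parrot moves a random fraction of the way toward the global best.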
Fitness selection is a crucial element of the PO model's performance. The hyperparameter selection process uses a solution-encoding scheme to evaluate the quality of candidate solutions.
$$Fitness = {\text{ max }}\left( P \right)$$
(19)
$$P = \frac{TP}{{TP + FP}}$$
(20)
Here, TP and FP denote the true positive and false positive values, respectively.
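Eqs. (19)-(20) reduce to computing precision from counts; a minimal sketch (the zero-denominator guard is our assumption, since the equations leave that case undefined):

```python
def precision_fitness(tp, fp):
    """Eqs. (19)-(20): PO maximises the precision P = TP / (TP + FP).
    Returning 0.0 when TP + FP = 0 is an assumed convention."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

PO then ranks candidate hyperparameter sets by this value and keeps the maximum, per Eq. (19).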
