Fault classification in the architecture of virtual machine using deep learning



Fault classification is critical for fault analysis and speedy repair in the cluster network. Multiple algorithms based on deep learning and machine learning have been implemented for fault classification [44]. Our approach combines dynamic feature selection, hierarchical decision masking, and attention-based learning. To this end, we have enhanced the TabNet architecture and propose a new architecture for our research work.

Fig. 1

Decision tree-like classification TabNet architecture.

The base architecture

Deep neural networks have succeeded on many data types, but tabular data long remained an exception, even though it is one of the staple data types in modern artificial intelligence [45]. Ensemble models built from decision trees have been used extensively for such datasets. A deep learning architecture can nevertheless simulate the decision-making process of a decision tree by dynamically constructing multiple hyperplane decision boundaries at various levels of abstraction. Unlike traditional decision trees, which rely on discrete branching at each node, deep learning models, particularly neural networks, can create continuous, non-linear decision boundaries that evolve as the model learns from data. The best features at each decision step are chosen through sequential attention: features are masked at each step, which helps to identify the most useful ones. The model can work directly on the original data without much preprocessing. In the reported experiments, the TabNet model surpassed all the baseline classifiers [35].

TabNet is a deep neural network architecture for tabular datasets, also called a tabular learning model, which can be used on various datasets for classification and regression tasks. As shown in Fig. 1, TabNet emulates decision-tree-like classification through neural network blocks. The most significant features are extracted by applying multiplicative sparse masks to the inputs. The extracted features are then transformed linearly, with bias terms added, to represent the decision boundaries. The Rectified Linear Unit (ReLU) activation function refines region selection: it sets negative values to zero, so that only positively activated regions contribute to further computations. This helps define distinct decision boundaries by ignoring irrelevant negative regions, and it lets the model focus on meaningful features while improving computational efficiency [46]. Finally, the outputs of the multiple regions are aggregated by addition to obtain the class scores used in the classification.
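The mask, linear transform, ReLU, and addition sequence described above can be illustrated with a minimal sketch. The two-feature input, the thresholds `a` and `d`, and the scale constants `c1` and `c2` are illustrative assumptions, not values from this work; in the actual model the masks and weights are learned.

```python
import numpy as np

def relu(z):
    """Set negative values to zero."""
    return np.maximum(z, 0.0)

def tree_like_scores(x, a=0.5, d=0.5, c1=10.0, c2=10.0):
    """Toy decision-tree-like block for a two-feature input (all constants are assumptions)."""
    x = np.asarray(x, dtype=float)

    # Multiplicative masks: each branch sees exactly one feature.
    mask1 = np.array([1.0, 0.0])  # keeps x[0]
    mask2 = np.array([0.0, 1.0])  # keeps x[1]

    # Linear transforms plus biases encode the boundaries x[0] = a and x[1] = d.
    w1 = np.array([[c1, 0.0], [-c1, 0.0], [0.0, 0.0], [0.0, 0.0]])
    b1 = np.array([-c1 * a, c1 * a, -1.0, -1.0])
    w2 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, c2], [0.0, -c2]])
    b2 = np.array([-1.0, -1.0, -c2 * d, c2 * d])

    # ReLU keeps only the positively activated region of each branch.
    branch1 = relu(w1 @ (mask1 * x) + b1)
    branch2 = relu(w2 @ (mask2 * x) + b2)

    # Regions are aggregated by addition.
    return branch1 + branch2

# x[0] > a and x[1] < d, so only entries 0 and 3 are positive.
print(tree_like_scores([0.9, 0.1]))
```

Each entry of the output corresponds to one side of one boundary, so the pattern of non-zero entries identifies the region of the input space, much as a path through a decision tree would.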

Mathematical function

The proposed model uses sparse feature selection to select only the relevant and non-redundant feature subsets from all the given features. SparseMax normalization promotes sparsity by selectively activating only the most relevant features while suppressing the rest. This is achieved by projecting the input onto the probability simplex, which ensures that only a few features receive non-zero values. Unlike SoftMax, which assigns small probabilities to all inputs, SparseMax enforces zero activation for less important features. The Euclidean projection guarantees that the output remains a valid probability distribution while maintaining sparsity. Consider a point \(b=[b_1,\dots ,b_D]^T \in \mathbb {R}^D\), where D is the dimension of b (taken in this work as the number of decision steps in the algorithm). The Euclidean projection of b onto the simplex is defined by the following problem:

$$\begin{aligned} \underset{a\in \mathbb {R}^D }{\min } \; \frac{1}{2}\Vert a-b \Vert ^2 \quad \text {s.t.} \quad a^T\mathbf {1}=1 \; \text {and} \; a \geqslant 0 \end{aligned}$$

(1)

where a is the unique solution to the problem, which will be denoted by \(a=[a_1,…,a_D]^T\).

The objective function above is quadratic and strictly convex, so a unique solution a always exists. The time complexity of the algorithm that finds the solution a is \(O(D\log D)\).

  1. Input: \(b \in \mathbb {R}^D\)

  2. Sort b into m such that \(m_1 \ge m_2 \ge \dots \ge m_D\)

  3. Find \(\rho =\max \left\{ 1\le j\le D: m_j+\frac{1}{j} \left( 1-\sum _{i=1}^{j} m_i\right) >0 \right\}\)

  4. Define \(\lambda = \frac{1}{\rho }\left( 1-\sum _{i=1}^{\rho } m_i \right)\)

  5. Output: a such that \(a_i = \max \{b_i+\lambda ,0\},\; i=1,\dots ,D\)
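The projection steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the standard sort-based simplex projection (support condition strictly positive, output clipped at zero); the function name `sparsemax` and the sample inputs are chosen here for illustration only.

```python
import numpy as np

def sparsemax(b):
    """Euclidean projection of a 1-D vector b onto the probability simplex."""
    b = np.asarray(b, dtype=float)
    D = b.size

    # Step 2: sort the components in decreasing order.
    m = np.sort(b)[::-1]
    cumsum = np.cumsum(m)
    j = np.arange(1, D + 1)

    # Step 3: rho is the largest j whose component stays positive after shifting.
    support = m + (1.0 - cumsum) / j > 0
    rho = j[support][-1]

    # Step 4: shift that makes the retained components sum to one.
    lam = (1.0 - cumsum[rho - 1]) / rho

    # Step 5: shift and clip at zero.
    return np.maximum(b + lam, 0.0)

a = sparsemax([0.5, 0.3, -1.0])
print(a, a.sum())
```

The sort dominates the cost, which matches the \(O(D\log D)\) complexity; components outside the support are exactly zero rather than merely small, which is the difference from SoftMax.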

The cost of sorting the components of b determines the complexity of the algorithm. Because the algorithm is not iterative, the active set is determined after at most D steps.

Following the standard practice of scaling the input values to follow a standard normal distribution, the mean and the standard deviation are taken as 0 and 1, respectively. A continuous distribution has uncountably many possible values; if each value were assigned a positive probability, the probabilities would sum to infinity, so probability is instead described by a density. The normal density is centered on \(\mu\) (the mean of the input features), so most samples lie close to \(\mu\), which regularizes the distribution of our input values. To initialize the parameters of our neural networks, we use Xavier initialization, where the parameters are drawn randomly from a normal distribution with mean 0 and standard deviation \(\sigma = \sqrt{\frac{2}{x+y}}\), where x and y are the numbers of input and output units of the layer. This keeps the scale of the signals comparable across layers, which helps the optimizer find good weights.
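The input scaling and the Xavier initialization described above can be sketched as follows. The helper names `standardize` and `xavier_normal`, the layer sizes, and the random inputs are assumptions made for this example.

```python
import numpy as np

def standardize(features, eps=1e-12):
    """Scale each column to zero mean and unit standard deviation."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

def xavier_normal(fan_in, fan_out, rng):
    """Draw a weight matrix from N(0, sigma^2) with sigma = sqrt(2 / (fan_in + fan_out))."""
    sigma = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=sigma, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
X = standardize(rng.uniform(0.0, 100.0, size=(200, 5)))  # hypothetical raw inputs
W = xavier_normal(64, 32, rng)                           # hypothetical layer sizes
print(X.mean(axis=0).round(6), W.std().round(3))
```

Here `fan_in` and `fan_out` play the roles of x and y in the formula for \(\sigma\).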

Dataset analysis

To build a trustworthy classification model with better accuracy, one should understand the dataset and extract its significant features. Our proposed Virtual Machine fault classification model therefore uses the \(severity\_type\) and \(event\_type\) features for classification.

Dataset exploration

The dataset is provided by the Telstra cluster network [47]. It contains the failure records of service disruption events that caused momentary glitches or total interruption of connectivity in the Telstra network. The dataset is split into five files, as depicted in Table 1, of which the main one is \(event\_type\). The \(log\_feature\) file contains the various features extracted from log files. The \(fault\_severity\) field is the target variable: the fault severity that users reported in the network. It takes three levels, 0, 1, and 2, which correspond to no fault, a few faults, and many faults, respectively. In contrast, \(severity\_type\) is a warning message extracted from the log files, classified into five categories (1-5) in increasing order of severity. All five files are consolidated into a single dataset that includes features such as event type, severity type, resource type, log feature, event count, location, and log volume.

The dataset covers more than 7300 failure records, each with a \(fault\_severity\) and a \(severity\_type\). Records with \(fault\_severity\) 0, 1, and 2 account for 65%, 25%, and 9% of the dataset, respectively. The total record counts of disruption events at each location in the network are shown in Fig. 2, together with the event types that occurred at each location and the record counts of the various event and severity types. The scatter plot in Fig. 3 depicts the relationship between \(fault\_severity\) and the location of the system in the cluster network, showing which severity levels (0, 1, 2) are observed at each location.

Fig. 2

Data exploration and a consolidated report.

Fig. 3

Fault severity per location plot.

Data preprocessing

Data preprocessing is a critical step in every deep learning and machine learning approach, and it plays a significant role in analyzing the dataset and generating accurate results. The data collected from the network cluster logs may contain duplicate and superfluous values, so the preprocessing step filters out the noisy content that would degrade the system's performance. Google Dataprep [48] is an intelligent, fully managed data service for exploring, preprocessing, visualizing, cleaning, and preparing data for analysis on the Google Cloud Platform. It lets the user build recipes of transformations on a sample of the data and then apply them to the entire dataset. The five critical aspects of Dataprep are data preparation, collection, analytics, management, and storage, and its major advantages are cleaning, combining, and computing over multiple huge datasets. Motivated by these features, we normalize the Telstra dataset before further preprocessing steps such as deduplication, early join, and statistical evaluation. The data is then processed in Google Dataprep to check for missing values and to detect outliers. The five files of the Telstra dataset are joined on the key field id into a single dataset. Several features are converted into non-ordinal categorical features and then encoded using label encoding, so that the final dataset is completely numerical. The data is thus prepared to fit our classification model so that the model can perform better at classifying network failures.
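The deduplication, join on id, and label-encoding steps can be illustrated with pandas as a stand-in for the Dataprep recipe; this is not the authors' pipeline. The three miniature tables and their values below are hypothetical placeholders, and only three of the five files are mimicked to keep the sketch short.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the separate files (values are invented).
train = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "location": ["location 1", "location 2", "location 1", "location 1"],
    "fault_severity": [0, 1, 2, 2],
})
severity = pd.DataFrame({
    "id": [1, 2, 3],
    "severity_type": ["severity_type 2", "severity_type 1", "severity_type 2"],
})
event = pd.DataFrame({
    "id": [1, 2, 3],
    "event_type": ["event_type 11", "event_type 35", "event_type 11"],
})

def consolidate(train, severity, event):
    """Deduplicate, join the tables on the key field id, and label-encode text columns."""
    merged = (
        train.drop_duplicates()
        .merge(severity, on="id", how="left")
        .merge(event, on="id", how="left")
    )
    for col in merged.columns:
        if not pd.api.types.is_numeric_dtype(merged[col]):
            # Label encoding: each distinct category becomes an integer code.
            merged[col] = merged[col].astype("category").cat.codes
    return merged

data = consolidate(train, severity, event)
print(data)
```

After this step every column is numeric, which is the form the classification model expects. In data where one id maps to several rows of a side table, an aggregation step would be needed before the join; that is outside this sketch.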

Proposed architecture in virtual machine interpreters

This section introduces the proposed model for Virtual Machine fault classification. The model uses the deep-learning-based TabNet architecture and works on tabular data. It consists of three modules: feature selection, an attention transformer, and a feature transformer. The system design of the proposed model is described below.

Fig. 4

Proposed architecture in virtual machine interpreters.

System design

The TabNet architecture combines a multi-step sequential structure with instance-wise feature selection. For decision-making, TabNet uses decision blocks placed at different learning steps, each focusing on a subset of the input features of the dataset. Internally, the TabNet architecture is a tree-like function that determines the proportion of each feature with the help of a coefficient mask, and it outperforms decision trees. The approach followed by the model is as follows:

  (i) Use sparse instance-wise feature selection based on learning from the training dataset.

  (ii) Build a sequential architecture involving multiple steps to identify the decision step that contributes the most to the final selection of the best features.

  (iii) Perform non-linear processing and select the best features that can help improve generalization and robustness across diverse datasets.

The feature selection is designed to be instance-wise, which lets the proposed model determine, separately for each input, the feature(s) to focus on. Selecting an explicit set of features yields a sparse selection, which results in more efficient learning. The parameters of each decision stage are used to select the best feature(s). In this design process, the features are extracted and consolidated into a single file for modeling. The working model is discussed below.

The proposed model uses sparse feature selection to select only the relevant and non-redundant feature subset from all the given features, such as location and resource type, as shown in Fig. 4. The model uses several decision blocks to find the best subset of input features. It selects features dynamically based on feedback from previous steps, ensuring adaptive learning. Its decision blocks analyze features by considering location, resource availability, and defect patterns, allowing for precise decision-making. These blocks use hierarchical processing to refine feature importance at each stage. This structured approach enhances the model's ability to generalize and improves accuracy across various scenarios. Here, the features (location-related, resource-related, and fault-related) are processed to predict the severity level (0, 1, 2) of each input record in the Telstra cluster network. The proposed architecture consists of an attention transformer, a feature transformer, and feature masking for each decision step. The architecture comprises four steps of evaluation, as shown in Fig. 5.

Fig. 5
Step 1:

The Attention Transformer Block applies a single-layer mapping to the values obtained from earlier steps in order to aggregate and refine the extracted features. It consists of four components: a fully connected layer, Ghost Batch Normalization, a multiplication function, and a SparseMax layer, as shown in Fig. 6. After batch normalization, the SparseMax layer selects the best features obtained in each step.

Step 2:

The Feature Transformer Block consists of a four-layered network, as illustrated in Fig. 7. Two layers are shared across all decision steps, which ensures consistency in feature extraction and helps the model learn common patterns. The remaining two layers operate independently, meaning they are unaffected by the decisions made by other decision blocks; they enable more specialized learning that adapts to specific aspects of the data without interference from other components. This design balances shared feature learning (for generalization) with step-specific adaptation (for fine-tuned decision-making), making the transformer block more flexible and effective in complex scenarios. Each layer in the block contains a fully connected layer that detects the global aspects of the features detected in the lower layer of the network, followed by a Batch Normalization layer that normalizes its input for better performance. Because the mini-batch size used during normalization is not fixed, we use Ghost Batch Normalization (Ghost BN).

Step 3:

Feature Masking is employed at each decision step and provides insight into how the model works. The masks can also be used to obtain the global importance of the selected features.

Step 4:

The Split block splits the processed data into two outputs: one is used by the attention transformer in the next step, and the other contributes to the overall result. Furthermore, the ReLU function is applied to handle the non-linearity of the features.
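The Ghost Batch Normalization used in the transformer blocks can be sketched as follows. This is a simplified illustration that assumes the usual formulation, in which a batch is divided into smaller virtual batches that are normalized independently; the learnable scale and shift and the running statistics are omitted, and `virtual_batch_size` is a hypothetical parameter name.

```python
import numpy as np

def ghost_batch_norm(x, virtual_batch_size, eps=1e-5):
    """Normalize each virtual batch of rows with its own mean and variance."""
    x = np.asarray(x, dtype=float)
    n_chunks = max(1, int(np.ceil(x.shape[0] / virtual_batch_size)))
    chunks = np.array_split(x, n_chunks, axis=0)

    # Each chunk is treated as an independent mini-batch.
    normed = [(c - c.mean(axis=0)) / np.sqrt(c.var(axis=0) + eps) for c in chunks]
    return np.concatenate(normed, axis=0)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 3))  # hypothetical activations
out = ghost_batch_norm(batch, virtual_batch_size=4)
print(out[:4].mean(axis=0).round(6))
```

Because statistics are computed per virtual batch, the normalization does not depend on a single fixed mini-batch size.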

Fig. 6

Attention transformer block architecture.

Fig. 7

Feature transformer block architecture.

Algorithm

The algorithm for the proposed model is described below.

  (1) Feature Selection: For the soft selection of the salient features, a sparse mask matrix \(M[i] \in \mathbb {R}^{\alpha \times \beta }\) is used, which guarantees that the irrelevant features are masked at each decision step. This makes the model more parameter-efficient and ensures that the learning capacity at each decision step is not spent on irrelevant features. An attentive transformer obtains the masks from the processed features of the preceding step, a[i]:

    $$\begin{aligned} M[i+1]=\text {sparsemax}\left( \prod _{j=1}^i(\lambda -M[j])\cdot t_{i+1}(a[i])\right) \end{aligned}$$

    (2)

    The mask M[i] satisfies the normalization property:

    $$\begin{aligned} \mathop {{\sum }}\limits _{j=1}^\beta M_{b,j}[i]=1 \end{aligned}$$

    (3)

    where \(t_i\) is a trainable function. The factor \(\prod _{j=1}^i(\lambda -M[j])\) accounts for how often a particular feature has already been used and is termed the scale term P[i], where \(\lambda\) is called the relaxation parameter. Setting \(\lambda =1\) restricts a feature to use at a single decision step, whereas \(\lambda >1\) allows it to be used more than once. Hence a larger \(\lambda\) makes the model more flexible in reusing a feature across multiple decision steps. The initial value P[0] is a constant sequence with unity at each position. The selected features are first passed through a fully connected layer, where they are transformed to capture complex relationships. A ReLU activation function is then applied to introduce non-linearity, enhancing the model's ability to learn intricate patterns.

  (2) Input Feature: The input features f form a matrix with batch size \(\alpha\) and feature dimension \(\beta\):

    $$\begin{aligned} f\in \mathbb {R}^{\alpha \times \beta } \end{aligned}$$

    (4)

  (3) Feature Transformer: It is a four-layered network architecture. The first two layers of blocks are shared by all decision blocks, but the latter two layers of blocks are independent of the decisions made by the other decision blocks.

  (4) Feature Processing: The outputs n[i] and m[i] are obtained by applying the split function to the output of the feature transformer. n[i] is used for the prediction result, and m[i] is fed into the following Attention Transformer.

    $$\begin{aligned} [n[i],m[i] ]=f_i (M[i]\cdot f) \end{aligned}$$

    (5)

    where \(n[i]\in \mathbb {R}^{\alpha \times N_n}\) and \(m[i]\in \mathbb {R}^{\alpha \times N_m}\).

  (5) Attention Transformer: It aggregates the features using the values collected in the previous steps to produce the mask M[i].

  (6) Final Output: The final output \(f_{out}\) is obtained by passing the decision output n[i] of each step through the ReLU function and summing the results over all steps.

    $$\begin{aligned} f_{out}=\mathop {{\sum }}\limits _{i=1}^{N_{steps}}\text {ReLU}(n[i]) \end{aligned}$$

    (6)
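The six steps above can be tied together in a minimal forward-pass sketch. It is an illustration under strong simplifying assumptions, not the proposed model: the trainable functions \(t_i\) and \(f_i\) are replaced by random, untrained linear maps, normalization layers are omitted, and the names `decision_steps`, `n_n`, `n_m`, and `lam` are hypothetical.

```python
import numpy as np

def sparsemax(b):
    """Project a 1-D vector onto the probability simplex."""
    m = np.sort(b)[::-1]
    cumsum = np.cumsum(m)
    j = np.arange(1, b.size + 1)
    rho = j[m + (1.0 - cumsum) / j > 0][-1]
    lam = (1.0 - cumsum[rho - 1]) / rho
    return np.maximum(b + lam, 0.0)

def decision_steps(f, n_steps=3, n_n=4, n_m=4, lam=1.5, seed=0):
    """Run the mask / transform / split / aggregate loop on features f (alpha x beta)."""
    rng = np.random.default_rng(seed)
    alpha, beta = f.shape
    prior = np.ones((alpha, beta))            # P[0]: unity at each position
    m = f @ rng.normal(size=(beta, n_m))      # stand-in for the first processed features
    f_out = np.zeros((alpha, n_n))
    masks = []
    for _ in range(n_steps):
        w_att = rng.normal(size=(n_m, beta))          # stand-in for t_i (untrained)
        w_feat = rng.normal(size=(beta, n_n + n_m))   # stand-in for f_i (untrained)

        # Eq. (2): mask from the scale term and the processed features.
        mask = np.apply_along_axis(sparsemax, 1, prior * (m @ w_att))
        prior = prior * (lam - mask)                  # update the scale term

        # Eq. (5): transform the masked features, then split into n[i] and m[i].
        out = (mask * f) @ w_feat
        n, m = out[:, :n_n], out[:, n_n:]

        # Eq. (6): accumulate ReLU(n[i]) over the steps.
        f_out += np.maximum(n, 0.0)
        masks.append(mask)
    return f_out, masks

features = np.random.default_rng(1).normal(size=(5, 6))  # hypothetical batch
f_out, masks = decision_steps(features)
print(f_out.shape, masks[0].sum(axis=1))
```

Each row of every mask sums to one, as required by Eq. (3), and averaging the masks over steps and samples gives a rough picture of global feature importance.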


