Machine learning of automatic hierarchical multi-label classification method for identifying metal failure mechanisms

Machine Learning


Related work

Deep learning-based SEM image analysis methods

SEM is an important tool for microstructure characterization of engineering structural materials, capable of obtaining high-resolution images of material surfaces. SEM images contain rich information about the microstructure and morphology of materials, but traditional SEM image analysis heavily relies on the experience and knowledge of experts, with issues of low efficiency and subjectivity. In recent years, the development of deep learning technology has provided new ideas and methods for SEM image analysis32,33,34. For example, a system for automatic classification of metal pipeline defects was developed using ViT and CNN. A multi-label dataset containing 2,075 SEM images of four subcategories was created, and eight models at different resolutions were trained and validated to determine the optimal model. The results demonstrated that the model based on EfficientNet and ViT can accurately identify metal defects on SEM images in real-time, with accuracy comparable to manual judgment, and thereby the operational efficiency can be substantially enhanced. Meanwhile, a deep learning approach was proposed to address the challenge of identifying a consistent and transferable set of features across material systems, due to the inherent complexity and variability of features in most heterogeneous materials30. However, based on the traditional machine learning methods, the rich and diverse nature of features in such systems made it difficult to define a universal and transferable feature set across these systems. Though leveraging the flexible architecture and exceptional learning capabilities of deep learning methods, the feature design step can be bypassed, and it has verified the use of deep learning method without feature engineering in predicting the micro-elastic strain field of the three-dimensional voxel microstructure of high-contrast dual-phase composites. Accordingly, deep learning has an excellent performance for implicitly learning significant information of local neighborhood details.

Hierarchical multi-label classification issues and challenges

Hierarchical Multi-Label Classification (HMC) is a classification task in which a given sample (such as text or images) can be associated with multiple category labels that are organized in a hierarchical structure9,35. Unlike traditional multi-label classification, the category labels in HMC tasks exhibit hierarchical relationships, and the lower-level labels are constrained and influenced by higher-level labels. Common methods for hierarchical multi-label classification can be divided into two major categories: Local and Global, which differ in the way they utilize hierarchical structure information36,37. Local methods can learn the relationships between different levels of categories and texts, and aggregate predictions from different levels to obtain the final prediction results. These methods usually consist of multiple classification modules, such as top-down hierarchical classification, and each non-leaf node has a local classifier that predicts the final subcategories based on the predictions of the parent categories. Local-based methods can utilize finer-grained hierarchical information, but they usually require the construction of multiple classification modules and are susceptible to error propagation. Recent work introduced a lightweight multi-scale encoder-decoder with locally enhanced attention, showing promise in segmenting fine structural defects in concrete38. Global methods are usually composed of a single classification module that directly utilizes hierarchical structure information for modeling. For example, the hierarchical structure is used to construct a recursive regularization loss term to constrain the classification parameters. In contrast, global-based methods are generally simple, but they often cannot exploit fine-grained hierarchical information when learning text semantic representation, resulting in insufficient model learning performance and possible underfitting.

Baseline of this work

To verify the effectiveness of method developed in this work, one of the most representative algorithm architectures Hierarchical Feature Fusion Vision Transformer (HFFVT) was selected for comparison39. Based on the ViT architecture, a novel image multi-level multi-classification model of HFFVT was proposed. The model aims to effectively utilize the hierarchical label information of images and enhance classification performance by fusing feature representations from different levels. Figure 1 illustrates the HFFVT network structure. It can be found that HFFVT introduces an independent feature embedding module for each classification level, mapping the image features extracted by ViT into different feature subspaces according to hierarchical labels to obtain level-specific feature representations. To fully leverage information from different levels, HFFVT designs a hierarchical feature fusion module that adaptively aggregates features from various levels through cross-level attention mechanisms, allowing semantic information of different granularities to complement and enhance each other. During the prediction phase, HFFVT sets up an independent classification head for each level, and the classification losses of all levels are optimized simultaneously to achieve end-to-end multi-level classification. Due to the fusion of features at multiple levels, each classification head can benefit from the information at other levels and improve the overall classification performance. Notably, HFFVT has been extensively tested on the image classification datasets with hierarchical labels, such as CIFAR-10040, and has been compared with various methods using CNN and ViT. HFFVT can significantly improve the performance of image multi-level multi-classification tasks, thereby validating the effectiveness of its multi-level feature fusion strategy. Furthermore, the ablation study can confirm the critical importance of multi-level feature fusion methods in improving the performance of hierarchical models.

Fig. 1
figure 1

SEM images of various metal failure modes: (a) Metal fatigue; (b) Instantaneous stress-induced plastic deformation; (c) Stress corrosion cracking; (d) Thermal cycling; (e) Oxide deposition spalling; (f) Deposition-induced thermal corrosion.

Dataset

Data sources and classification scheme

The SEM image data used in this study were sourced from open-access materials science papers in the field of metal material preparation, performance testing, and failure analysis from the sci-hub repository. A total of 1,964 SEM images with explicit identification of metal failure were manually selected. Subsequently, based on the explanations provided in the papers for the SEM images and in conjunction with common classification schemes in the field of materials science for metal failure analysis, a two-level classification scheme was refined and annotated. The first level was used to identify the root causes of metal material failure, i.e. failure caused by mechanical forces or thermal effects. The second level was used to further subdivide each first-level category into three subcategories. The mechanical force induced-failure was further divided into three subcategories including metal fatigue, instantaneous stress-induced plastic deformation (such as tensile and compression), and stress corrosion cracking. The failure caused by thermal effects was further divided into thermal cycling, oxide deposition spalling, and deposition-induced thermal corrosion. Finally, the data were divided into the training set and testing set according to the ratio of 9:1. After processing, the specific data categories and distribution are listed in Table 1.

Table 1 Data set categories and data distribution.

Theoretical basis of the data classification scheme

Metal fatigue is a material degradation process in which cyclic loading induces the initiation and propagation of microcracks, ultimately resulting in the occurrence of fracture (Fig. 1). Figure 1(a) presents the SEM image as an example of metal fatigue data. This is usually due to local stress exceeding the material limit at stress concentration sites (such as microcracks, pits, or scratches), leading to the gradual propagation of cracks. Long-term repeated stress (even below the strength limit of material) can also cause the dislocation movement and multiplication in the metal lattice to reduce the intergranular force, and eventually the fatigue cracks can be generated in the material.

Instantaneous stress causes permanent plastic deformation on a macroscopic level, while on a microscopic level this is caused by the movement and interaction of dislocations in the slip system41. When the stress exceeds the limit of fracture strength, the fracture occurs. The fracture modes can be divided into brittle fracture and ductile fracture. Figure 1(b) presents the SEM image for an example of permanent plastic deformation caused by mechanical force. Brittle fracture typically occurs suddenly without obvious prior indication of plastic deformation, which is predominantly driven by the rapid expansion of cracks. This type of fracture usually appears in materials with brittle crystal structures and a high density of defects (such as microcracks, voids, inclusions, etc.). Ductile fracture is always characterized by obvious plastic deformation prior to failure, as the material undergoes considerable deformation before fracture with gradual crack propagation.

The stress corrosion cracking (SCC) is a complex phenomenon that is primarily caused by the combined action of stress state and corrosive environment42. The tensile stress can be applied directly or may result from residual stress introduced during manufacturing processes such as cold working, welding, heat treatment, machining, and grinding. These residual stresses can significantly affect the susceptibility of materials to SCC. Additionally, SCC is closely associated with the specific combination of material, service environment, and stress conditions. Figure 1(c) shows SEM image for an example of stress corrosion cracking. It indicates that the formation and growth of cracks in a corrosive environment can lead to the unexpected and sudden failure of ductile metal alloys under tensile stress, especially at high temperatures. SCC exhibits a high degree of chemical specificity, as certain alloys are prone to SCC only when exposed to a limited number of specific chemical environments. Such chemical environments are often only mildly corrosive to the alloys, but SCC can be still induced. As a result, metal parts severely affected by SCC may appear shiny on the surface but are internally filled with micro-cracks. SCC typically progresses rapidly and is more prevalent in alloys than in pure metals. The specific environment is critical, and only very low concentrations of certain highly reactive chemicals are required to initiate catastrophic cracking, such as chloride ions or hydrogen sulfide, thereby leading to sudden and unexpected failure of critical components.

Thermal cycling refers to the process of repeatedly exposing a material to alternating high and low temperatures43. During this process, the temperature gradient and thermal stress are generated due to differences in the coefficients of thermal expansion among various components or layers. The thermal stress can lead to deformation, cracking, or other forms of material degradation. Figure 1(d) shows the SEM image for an example of thermal cycling data. As thermal cycling progresses, these thermal stresses can bring about the change of microstructure, such as the generation and movement of dislocations, and the migration of grain boundaries. Over time, these microcracks may coalesce and evolve into macroscopic cracks, ultimately compromising the structural integrity of material and leading to catastrophic failure. Failures caused by thermal cycling are common in fields such as turbine blades in aircraft engines and electronic packaging materials.

Oxide deposition spalling refers to the phenomenon where the formation of an oxide layer on the surface of metal materials at high temperatures can provide oxidation resistance, and then it undergoes cracking or detachment due to mechanical or thermal stress44. This protective oxide layer is initially dense and serves to prevent further oxidation of the underlying metal, whereas its integrity can be compromised by tensile stress that develops within the oxide layer, which is derived from differential thermal expansion between the oxide and the metal substrate. Additionally, external mechanical forces or repeated thermal cycling can induce spalling, leading to the exposure of the underlying metal and potentially accelerating further oxidation. Figure 1(e) shows the SEM image for an example of failure caused by oxide deposition and peeling. Once the protective oxide layer is compromised, the underlying metal undergoes rapid oxidation, significantly accelerating material degradation. The spalling of oxide layers can lead to the obstruction of critical precision components, such as engine oil passages, thereby impairing lubrication and potentially resulting in mechanical failure.

The deposition- induced thermal corrosion refers to the condition that impurities in fuels or lubricating oils, such as sodium, potassium, and vanadium, can deposit on the material surface at high temperature45. Figure 1(f) shows the SEM image for an example of the failure caused by the deposition- induced thermal corrosion. These deposits can chemically react with the material, destroying the protective oxide film on the material surface and exposing the substrate to a corrosive environment. The deposits may also form low-melting-point eutectics with the material, leading to localized melting and accelerating material failure. Thermal corrosion is widely present in high-temperature components such as aircraft engines and gas turbines.

The present classification scheme was designed on the basis of the following considerations. Firstly, the classification scheme included the two major types of metal material failure, i.e. failure caused by mechanical forces and failure caused by thermal effects. These two types of failure modes are widely present in engineering practice and have a significant impact on the service performance of materials. Secondly, under each main failure mode, it was subdivided into three subcategories, each corresponding to a specific failure mechanism. This classification method can be helpful to systematically and comprehensively understand the failure process of materials, and provides guidance for material selection, application and maintenance. Additionally, the proposed classification scheme is closely integrated with engineering practice, with each type of failure mode corresponding to specific engineering issues. For instance, fatigue failure is common in components subjected to cyclic loads, while thermal corrosion failure is common in components operating in high-temperature environments. This classification method can address the material failure issues in engineering in a targeted manner. Meanwhile, the cutting-edge research directions in the field of materials science are also involved in the present classification scheme. For example, the failure caused by thermal cycling is an important subject in aerospace and energy sectors, while oxide spalling, and thermal corrosion are hot topics in the field of high-temperature materials. This classification approach can provide a innovated perspective for understanding the latest advancements in the study of material failure dynamics.

Experimental methods

HFFNet_2D network architecture

As illustrated in Fig. 2, HFFNet_2D integrates CNN and Transformer architectures to perform hierarchical feature extraction and classification of metal failure images. In the implementation of the HFFNet_2D module, a pre-trained ResNet50 is first used as the backbone network for extracting local features from images. ResNet50 is a classic CNN structure that effectively alleviates the gradient vanishing problem in deep network training through residual connections, achieving efficient image feature extraction. Specifically, the pre-trained CNN ResNet50 was selected as the backbone network for feature extraction due to its well-established ability to capture hierarchical feature representations efficiently while maintaining computational feasibility. As compared to other CNN architectures, ResNet50 can provide deeper feature extraction capabilities while mitigating vanishing gradient issues through residual connections. By integrating ResNet50 with our Transformer encoder, its local feature extraction capabilities can be utilized while benefiting from the global context modeling of the Transformer. In this work, the first four convolutional blocks of ResNet50 were employed as a feature extractor, with the output feature map size being 1/32 of the original image size. After obtaining the adapted feature map, the HFFNet_2D module introduced learnable position encoding (pos_embedding). As an important component in the Transformer structure, position encoding was used to assign unique position information to each position in the feature map. By adding position encoding to the feature map, the model can consider the spatial relationships of features in subsequent self-attention calculations to better capture the global information of the image.

Fig. 2
figure 2

HFFNet 2D Network Architecture.

In addition to position encoding, the HFFNet_2D module also introduced a learnable class embedding vector (cls_token) as an additional input to the Transformer. The class embedding vector can be regarded as a global representation of the entire image, which can facilitate the model better understand the overall semantics of the image and capture relationships between different categories by interacting with local features. In the implementation, the class embedding vector was expanded to the same spatial dimension as the feature map and concatenated with the feature map along the channel dimension.

On the feature map that fused position encoding and class embedding, the HFFNet_2D module applied the Transformer2D submodule for further feature extraction and fusion. The Transformer2D submodule was composed of multiple self-attention layers and feed-forward neural networks, which can model long-range dependencies between different positions in the feature map, achieving efficient utilization of global image information.

After processing by the Transformer2D submodule, the HFFNet_2D module extracted the average of the class embedding vector and local features respectively and concatenated them along the channel dimension to form the final feature representation of the image (prim_rep). Finally, predictions for different levels of categories were made through multiple fully connected layers (level_fc) and classifiers (level_classifier). The fully connected layers mapped the final feature representation to feature spaces of different levels, while the classifiers performed classification predictions based on the mapped features.

Transformer2D module

The Transformer2D module was used to apply the self-attention mechanism on feature maps to model long-range dependencies between different regions. This module was composed of multiple stacked Transformer encoder layers, with each encoder layer containing two sub-modules: MultiHeadAttention2D and FeedForward2D. In each encoder layer, the multi-head self-attention mechanism first calculated the correlations between different feature positions to aggregate and interact global information. Subsequently, the feed-forward neural network performed nonlinear transformations on the aggregated features, enhancing the representational capacity of model. By stacking multiple encoder layers, deep-level extraction and fusion of image features can be achieved.

MultiHeadAttention2D submodule

As a key component of the Transformer2D module, the MultiHeadAttention2D submodule applied multi-head self-attention mechanism on feature maps to model long-range dependencies between different positions. In the implementation of the MultiHeadAttention2D submodule, the input feature map was first linearly transformed into query (Q), key (K), and value (V) tensors through 1 × 1 convolutional layers (qkv). The dimensions of these three tensors are (B, 3, num_heads, H, W, head_dim), where B is the batch size, num_heads is the number of attention heads, H and W are the spatial dimensions of the feature map, and head_dim is the dimension of each attention head. Subsequently, the MultiHeadAttention2D submodule splits the Q, K, V tensors into num_heads attention heads and performs self-attention calculations independently for each attention head. Specifically, for each attention head, the submodule first calculated the dot product between the query tensor Q and the key tensor K, and then was divided by the square root of head_dim to obtain attention scores. The attention scores were passed through a Softmax function to obtain attention weight. Finally, the attention weights were multiplied with the value tensor V and concatenated along the num_heads dimension to obtain the output feature map. To further enhance the representational capacity of features, the MultiHeadAttention2D submodule introduced an additional 1 × 1 convolutional layer (proj) to adjust the number of channels in the output feature map to match the input feature map. This step can be regarded as fusing features from different attention heads to obtain richer and more abstract representations. Through the multi-head self-attention mechanism, the MultiHeadAttention2D submodule can capture relationships between features at multiple scales and extract global information of images from multiple perspectives.

FeedForward2D submodule

As another component of the Transformer2D module, the FeedForward2D submodule was used to perform nonlinear transformations on the feature map output by the MultiHeadAttention2D submodule to further enhance the representational capacity of model. In the implementation of the FeedForward2D submodule, two 1 × 1 convolutional layers and a GELU activation function were used. The first convolutional layer (Conv2 d) expanded the number of channels in the input feature map from dim2 to hidden_dim, increasing the dimensionality of the features. After then, the GELU activation function was applied to introduce nonlinearity. Finally, the second convolutional layer (Conv2 d) can reduce the number of channels from hidden_dim back to dim2 to match the input dimension of subsequent layers. Through this structural design, the FeedForward2D submodule can perform multi-level nonlinear transformations on the feature map, extracting more abstract and high-level feature representations. Simultaneously, based on 1 × 1 convolutional layers, the FeedForward2D submodule can maintain the spatial dimensions of the feature map unchanged, ensuring that the position information of features is preserved.

Hierarchical feature fusion module

To effectively utilize the hierarchical structural information of images, the HFFNet model introduced a hierarchical feature fusion module. The input to this module was the output features from the Transformer encoder, including class embedding vectors and image patch embeddings. First, global average pooling was performed on the image patch embeddings to obtain a global feature vector. The global feature vector was concatenated with the class embedding vector to construct a comprehensive image representation. Subsequently, multiple fully connected layers were used to transform this comprehensive representation, generating feature representations at different levels. Each fully connected layer can be assigned to a specific level, learning semantic information at that level. Finally, the feature representations of different levels were concentrated to obtain the fused feature vector as the final representation of the image.

Classification output

After obtaining the fused image feature representation, the HFFNet model used multiple classifiers to perform hierarchical classification of the image. Each classifier was a fully connected layer corresponding to a specific level. The input to the classifier was the fused image feature, and the output is the category probability distribution at that level. Through the collaborative work of multiple classifiers, the HFFNet model can simultaneously predict category labels at different levels for the image. During training, the cross-entropy loss function was used to supervise the classification results at each level, and the losses of all levels were added up as the final optimization objective. This multi-task learning approach can make the model better utilize the correlations between different levels and improve overall classification performance.

Adjustable parameters of the network

The HFFNet-2 d network incorporated several key adjustable parameters that can significantly influence its performance and structure. The input image size (img_size) determined the dimensions of the data fed into the network. For the hierarchical metal failure analysis task, the number of categories (num_classes) was set to2,7, representing two categories at the first level and six at the second level. The Transformer’s feature dimension (dim) was crucial as it defined the input and output dimensions for both the self-attention mechanism and the feed-forward neural network within the Transformer. The depth parameter, which specified the number of Transformer encoder layers, allowed for deeper feature extraction and fusion, albeit at the cost of increased computational complexity. In the multi-head self-attention mechanism, the number of heads (heads) enabled the model to learn richer feature representations from various subspaces. Eventually, the hidden layer dimension of the Transformer’s feed-forward neural network (mlp_dim) was increased to improve the representational capacity of the network, although it also led to an increase in the overall number of model parameters.

Data preprocessing and model training

Data preprocessing

To enhance the generalization ability and robustness of model, various data preprocessing techniques were adopted. For the training set, the data augmentation methods were implemented, including Random Resized Crop, Random Horizontal Flip, Random Rotation, Normalization, and Random Erasing. Data augmentation techniques used in our study, including Random Resized Crop and Random Horizontal Flip, were carefully selected based on our understanding of SEM image characteristics and validated through experiments. These methods can effectively preserve critical microstructural features of material failure regions while improving model robustness. Unlike augmentations that might distort physical interpretations (such as vertical flipping or extreme color alterations), the selected techniques can maintain the scientific validity of failure mechanisms—Random Resized Crop preserves key morphological features like cracks and fatigue striations, while Random Horizontal Flip respects the topological relationships since the material failures exhibit similar characteristics regardless of orientation. All augmentation strategies were verified by materials science experts to ensure they maintained the physical significance and recognizability of microstructural features, ultimately enhancing model generalization without compromising the scientific integrity of the SEM imagery.

Optimization algorithm and loss function

Stochastic Gradient Descent (SGD) was selected as the optimization algorithm, with an initial learning rate of 1e-4, momentum of 0.937, and weight decay of 5e-4. These parameters were selected to accelerate convergence, reduce oscillations, and prevent overfitting. The loss function design combined Focal Loss and Label Smoothing techniques. The total training loss was calculated as a weighted sum of losses from two branches, with the second branch given higher weight to reflect its importance in hierarchical classification. The loss functions for each branch are defined as follows:

$$\:\begin{array}{c}{L}_{1}=-\alpha\:{\left(1-{p}_{1}\right)}^{\gamma\:}\sum\limits_{i=1}^{{C}_{1}}{q}_{1i}log\left({\widehat{y}}_{1i}\right)\end{array}$$

(1)

$$\:\begin{array}{c}{L}_{2}=-\alpha\:{\left(1-{p}_{2}\right)}^{\gamma\:}\sum\limits_{i=1}^{{C}_{2}}{q}_{2i}log\left({\widehat{y}}_{2i}\right)\end{array}$$

(2)

$$\:\begin{array}{c}{L}_{total}={L}_{1}+2{L}_{2}\end{array}$$

(3)

where \(\:\alpha\:\) is the balancing factor, \(\:\gamma\:\) is the focusing parameter, \(\:{p}_{1}\) and \(\:{p}_{2}\) represent the prediction probabilities of model for the true categories in each branch, \(\:{q}_{1i}\) and \(\:{q}_{2i}\) represent the smoothed true probability distributions, and \(\:{\widehat{y}}_{1i}\) and \(\:{\widehat{y}}_{2i}\) represent the model’s prediction probabilities for each category in the respective branches.

Training hyperparameters

The training process utilized a batch size of 16, balancing training efficiency and memory consumption. We trained the model for 200 epochs to ensure sufficient convergence. A Cosine Annealing (CosineAnnealingLR) learning rate scheduling strategy was employed to dynamically adjust the learning rate during training, adapting to optimization needs throughout the process. In our final model, a 12-layer Transformer depth with 12 attention heads was employed, with the model dimension set to 768 and the MLP dimension set to 3072. The input image size was (224, 224). All other parameters remained consistent throughout the ablation experiments.

Evaluation metrics

Hierarchical Precision (HP) measures the proportion of true positives among the samples predicted as positive by the model. HP can be expressed by the following formula:

$$\:\begin{array}{c}HP=\frac{1}{N}\sum\limits_{i=1}^{N}\frac{\left|{T}_{i}\cap\:{P}_{i}\right|}{\left|{P}_{i}\right|}\end{array}$$

(4)

where \(\:{T}_{i}\) represents the set of true categories for the i-th sample, \(\:{P}_{i}\) represents the set of predicted categories for the i-th sample.

Hierarchical Recall (HR) measures the proportion of correctly predicted samples among the samples with true positive labels. HR can be expressed by the following formula:

$$\:\begin{array}{c}HR=\frac{1}{N}\sum\limits_{i=1}^{N}\frac{\left|{T}_{i}\cap\:{P}_{i}\right|}{\left|{T}_{i}\right|}\end{array}$$

(5)

Hierarchical F-score (HF) is the weighted harmonic mean of HP and HR, comprehensively considering precision and recall, providing a more comprehensive evaluation of model performance in hierarchical multi-label classification tasks. HF can be expressed by the following formula:

$$\:\begin{array}{c}H{F}_{\beta\:}=\frac{\left(1+{\beta\:}^{2}\right)\times\:HP\times\:HR}{{\beta\:}^{2}\times\:HP+HR}\end{array}$$

(6)

where \(\:\beta\:\) is a parameter adjusting the weights of HP and HR. Typically, \(\:\beta\:\) is set to 1, indicating that HP and HR are equally important. In this case, HF degenerates to HF1, and the calculation formula can be simplified as follows:

$$\:\begin{array}{c}H{F}_{1}=\frac{2\times\:HP\times\:HR}{HP+HR}\end{array}$$

(7)

Ablation experiments and algorithm analysis

To validate the effectiveness of the HFFNet-2 d network’s module design and explore the impact of different architectures on model performance, we conducted a series of ablation experiments. Four variant models were designed: HFFCNN, HFFVT, HFF_CNN_VT, and HFF_CNN_Conv_VT. By comparing these variants with HFFNet-2 d, it aims to better understand its advantages and limitations.

HFFVT model

The baseline version, HFFVT, was constructed based on the original ViT structure. It directly divided the input image into fixed-size patches and used linear projection for Transformer input. Unlike HFFNet-2 d, HFFVT cannot use CNN for feature extraction, potentially reducing parameters and computational complexity. However, this design may not fully utilize local features and semantic information. The direct addition of position encoding to image patch embeddings, without considering CNN feature spatial structure, may impact the ability to model spatial information.

HFFCNN model

The HFFCNN model extended the baseline concept using a convolutional neural network (ResNet50) for feature extraction while suppressing the HFF module and HFFVT. This model served as a direct comparison with HFFVT to explore performance differences between transformer and CNN structures.

HFF_CNN_VT model

HFF_CNN_VT combined convolutional neural networks and transformers. It used the first four convolutional blocks of ResNet50 for feature extraction before transforming these features into Transformer input. Compared to HFFNet-2 d, it used linear projection instead of convolutional layers for feature mapping, potentially reducing parameters but possibly underutilizing spatial structure information. The position encoding size matched the number of image patches, which may affect the ability of model to capture spatial information effectively.

HFF_CNN_Conv_VT model

Building on HFF_CNN_VT, the HFF_CNN_Conv_VT model replaced linear projection with a 1 × 1 convolutional layer to transform CNN features into Transformer input. However, similar to HFF_CNN_VT, its position encoding size matched the image patch number and is directly added to the Transformer input without considering CNN feature spatial structure. This approach may impact the model’s spatial information modeling capabilities. All the ablation experiments were maintained consistent network hyperparameters and training parameters with HFFNet-2 d to ensure experimental validity.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *