MLRN algorithm
In current artificial intelligence algorithms, model training is mainly data-driven: with a large amount of training data and a gradient descent algorithm to optimize the model, the expected training task can be completed^{32}. When artificial intelligence methods face only a small amount of training data, model training becomes difficult and prone to overfitting. A child, by contrast, can learn to recognize new things from limited data because of his or her learning ability. Meta-learning algorithms aim to design a model that can learn new knowledge from a few training examples and apply previous learning experience to new tasks, which is why meta-learning is also called "learning to learn"^{33}. MLRN adopts the idea of meta-learning: it acts on a new task with parameters optimized on the training tasks. The essence of the MLRN algorithm is to train a model on a series of learning tasks so that it can solve new small-sample tasks through previous learning experience. Unlike traditional machine learning, where model training is based on individual data points, the training process of the MLRN model is based on tasks, and the MLRN model learns knowledge across tasks through multi-batch, multi-task training. Therefore, in MLRN the dataset \(D\) is divided into \(D_{train}\) and \(D_{test}\), each of which is further divided into a series of tasks \(D_{train} = \{ task_{1} , \cdots ,task_{n} \}\) and \(D_{test} = \{ task_{1} , \cdots ,task_{m} \}\), where \(task_{i} = \{ D_{train}^{i} ,D_{test}^{i} \}\). By training on these tasks, we need to learn a model:
$$ \mathop {\min }\limits_{\omega } \mathop E\limits_{{{\mathcal{T}} \sim p({\mathcal{T}})}} {\mathcal{L}}(D;\omega ) $$
(3)
where \(p({\mathcal{T}})\) stands for the distribution over tasks and \(\omega\) stands for the learning strategy. We need to find a learning strategy \(\omega\) that minimizes the expected loss over all tasks.
According to the above dataset division, MLRN has two layers, an inner layer and an outer layer, each with its own gradient update. The gradient update of the inner layer is the update of a single task on a temporary copy of the model and does not affect the original model. The gradient update of the outer layer is the update from one task to another, acting on the original model. To distinguish the data of the inner and outer layers, the training data and testing data of the inner layer are called the support set and query set, so each task is denoted as \(task_{i} = \{ D_{{{\text{support}}}}^{i} ,D_{{{\text{query}}}}^{i} \}\). The training data and testing data of the outer layer are still called training data and testing data. N-way K-shot is a common experimental setting in limited-data scenarios: N-way means that there are N categories in the training data of each inner task, and K-shot means that there are K labeled data points under each category. The MLRN algorithm aims to obtain, by training on \(D_{train}\), initialization parameters that can be quickly fine-tuned on \(D_{{t{\text{est}}}}\) to achieve good results.
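The N-way K-shot sampling described above can be sketched as follows. This is a hypothetical helper in plain Python, not the authors' code; `sample_task` and its argument names are our own, and `q` denotes the total number of samples drawn per class:

```python
import random

def sample_task(dataset, n, k, q):
    """Sample one N-way K-shot task: randomly pick n classes and q samples
    per class; the first k per class form the support set, the remaining
    q - k form the query set. dataset maps label -> list of samples."""
    support, query = [], []
    for cls in random.sample(sorted(dataset), n):
        picked = random.sample(dataset[cls], q)
        support += [(x, cls) for x in picked[:k]]
        query += [(x, cls) for x in picked[k:]]
    return support, query
```

Repeating this sampling over \(D_{train}\) yields the batch of inner tasks on which MLRN is trained.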
The above is expressed mathematically as follows: the goal of MLRN is to train a PQP model, represented by a parameterized function \(f_{\theta }\) with parameters \(\theta\). The parameters \(\theta\) are shared by the inner and outer layers and are updated by two gradient updates. The inner layer computes the loss function on the support set and query set of a subtask, and the parameters \(\theta\) are then updated to \(\theta^{\prime}\) on the new task \({\mathcal{T}}_{i}\), which is the first gradient update:
$$ \theta_{i}^{\prime } = \theta - \alpha \nabla_{\theta } {\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{\theta } ) $$
(4)
where \(\alpha\) is the fixed hyperparameter (learning rate) of the inner layer and \({\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{\theta } )\) is the loss function of task \({\mathcal{T}}_{i}\). The outer layer computes the loss function across tasks, and the parameters \(\theta\) are then updated by stochastic gradient descent, which is the second gradient update,
$$ \theta \leftarrow \theta - \beta \nabla_{\theta } \sum\limits_{{{\mathcal{T}}_{i} \sim p({\mathcal{T}})}} {{\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{{\theta_{i}^{\prime } }} )} $$
(5)
where \(\beta\) is the fixed hyperparameter (learning rate) of the outer layer and \(\sum\limits_{{{\mathcal{T}}_{i} \sim p({\mathcal{T}})}} {{\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{{\theta_{i}^{\prime } }} )}\) is the sum of the loss functions over the batch of tasks drawn from \(p({\mathcal{T}})\). The optimal parameter initialization \(\theta\) of the MLRN model is obtained by alternating the two gradient updates of Eqs. (4) and (5). Therefore, the goal of outer-layer optimization is to minimize the loss function over the multi-task distribution \(p({\mathcal{T}})\):
$$ \mathop {\min }\limits_{\theta } \sum\limits_{{{\mathcal{T}}_{i} \sim p({\mathcal{T}})}} {{\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{{\theta_{i}^{\prime } }} )} = \sum\limits_{{{\mathcal{T}}_{i} \sim p({\mathcal{T}})}} {{\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{{\theta - \alpha \nabla_{\theta } {\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{\theta } )}} )} $$
(6)
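The alternating updates of Eqs. (4)-(6) can be illustrated on a deliberately simple toy problem. The sketch below is our own illustration, not the paper's implementation: it uses a single scalar parameter and quadratic per-task losses \(L_i(\theta) = (\theta - t_i)^2\), so both the inner gradient and the gradient through the inner step can be written analytically.

```python
# Toy two-level update: each task i has loss L_i(theta) = (theta - t_i)**2,
# hence grad L_i(theta) = 2 * (theta - t_i).

def inner_update(theta, t_i, alpha):
    """Eq. (4): one gradient step on the loss of task i."""
    return theta - alpha * 2.0 * (theta - t_i)

def outer_update(theta, targets, alpha, beta):
    """Eq. (5): gradient step through the inner adaptation.
    For this quadratic loss, d(theta'_i)/d(theta) = 1 - 2 * alpha."""
    grad = 0.0
    for t_i in targets:
        theta_i = inner_update(theta, t_i, alpha)
        grad += 2.0 * (theta_i - t_i) * (1.0 - 2.0 * alpha)
    return theta - beta * grad

theta = 0.0
targets = [1.0, 3.0]                     # a batch of tasks drawn from p(T)
for _ in range(100):
    theta = outer_update(theta, targets, alpha=0.1, beta=0.05)
# theta converges near 2.0: the initialization from which one inner
# step adapts well to every task in the batch
```

In a real MLRN model the outer gradient is computed by automatic differentiation through the inner step rather than by hand, but the structure of the two nested updates is the same.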
As the above analysis shows, the essence of production quality analysis is fault classification. Therefore, the cross-entropy^{34} function is chosen as the loss function, as expressed by formula (7):
$$ {\mathcal{L}}_{{{\mathcal{T}}_{i} }} (f_{\varphi } ) = - \sum\limits_{{x^{(j)} ,y^{(j)} \sim {\mathcal{T}}_{i} }} {y^{(j)} \log f_{\varphi } (x^{(j)} ) + (1 - y^{(j)} )\log (1 - f_{\varphi } (x^{(j)} ))} $$
(7)
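A minimal sketch of Eq. (7) for the binary case in plain Python (the helper name `cross_entropy` is our own):

```python
import math

def cross_entropy(preds, labels):
    """Binary cross-entropy of Eq. (7): preds are model outputs f(x)
    in (0, 1); labels are 0/1 ground truths. The leading minus sign
    makes better predictions give a smaller positive loss."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(preds, labels))

loss = cross_entropy([0.9, 0.2], [1, 0])   # confident, mostly correct
```

For the N-way classification tasks of MLRN the same idea extends to the multi-class cross-entropy over softmax outputs.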
Enhanced residual connection
For few-shot learning, the lack of data inevitably lowers the prediction accuracy of the quality model. Therefore, the existing limited data should be exploited as much as possible to extract more features. Studies show that network depth helps the model fit more complex sample distributions and improves its robustness. Increasing the number of network layers allows more complex feature patterns to be extracted, because model training is a process of adjusting parameters: the deeper the network, the more adjustable parameters it has, which means greater freedom of adjustment and a better fit to a complex objective function. Theoretically, therefore, better results can be obtained with a deeper model, so we can enhance the prediction accuracy of the model by increasing the number of layers in the MLRN network structure. In neural network training, however, the deeper the network, the more parameters the model needs to learn and the more data it needs for training; insufficient data in deep learning leads to overfitting. With the multi-batch, multi-task training of meta-learning, the overfitting caused by insufficient data can be avoided, which is an advantage of the MLRN model.
In the traditional meta-learning model, a convolutional neural network is used as the learning framework. To give the model better prediction accuracy, it is necessary to increase the depth of the network architecture of the MLRN model. MLRN model training is based on backpropagation, and the process of propagating errors back from the final layer is a continued multiplication, so as the number of layers increases, problems such as vanishing and exploding gradients may arise. To solve these problems, a residual network was selected as the basic network of MLRN. Structurally, the "bottleneck" in the residual learning unit is designed to decrease the number of model parameters while increasing network depth, so that the model has better feature-learning ability at a lower computational cost.
The basic idea of the residual network is to introduce shortcut connections, which make the network easier to optimize^{35}. Several shortcut connections are stacked together to form a residual learning unit. As the focus of this paper is on predicting product quality from small-sample data, the original network is based on ResNet18, which uses conventional residual connections. However, the original residual network has an excess of nonlinear functions, such as ReLU activations, in the main pathway. This may hinder information propagation and impede the effective identification of product quality features. Therefore, to improve information propagation in the network, LeakyReLU^{36} activation functions are used instead of the original ReLU within the residual connections, thereby enhancing the utilization of product quality features. LeakyReLU addresses the issue of dying neurons while retaining the advantages of ReLU: it allows a small, nonzero gradient for negative inputs, ensuring that neurons remain active throughout the network. The improved residual learning unit is shown in Fig. 2.
Let the input to the residual learning unit be \(x\) and the underlying mapping it should realize be \(h(x)\). Defining \(G(x) = h(x) - x\) as the residual mapping, we have \(h(x) = G(x) + x\), so the stacked nonlinear layers inside the unit only need to approximate the residual \(G(x)\). Driving \(G(x)\) towards 0 with multiple nonlinear layers is easier than approximating an identity mapping with those same layers directly. Thus, the mathematical definition of a residual learning unit is:
$$ y = LR[x + G(x)] $$
(8)
where \(y\) is the output of the residual learning unit; \(x\) is the input to the residual learning unit; \(LR( \cdot )\) is the LeakyReLU activation function; and \(G(x)\) stands for the residual mapping, given by:
$$ G\left( x \right) = w_{2} \times (LR(w_{1} \times x)) $$
(9)
where \(w_{1}\) and \(w_{2}\) stand for the weight layers in the residual connection. When the accuracy of the model reaches saturation, subsequent training drives \(G(x)\) towards 0, leaving only an identity mapping between the output \(y\) and the input \(x\). Therefore, the MLRN model based on the residual network can increase the network depth without increasing the error, improving the accuracy of model prediction.
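Eqs. (8) and (9) can be sketched with scalar weights standing in for the two convolution layers. This is an illustrative reduction, not the actual layers; the default LeakyReLU slope of 0.01 is an assumption:

```python
def leaky_relu(v, slope=0.01):
    """LeakyReLU: identity for non-negative inputs, small slope otherwise."""
    return v if v >= 0 else slope * v

def residual_unit(x, w1, w2):
    """Eqs. (8)-(9): y = LR[x + G(x)] with G(x) = w2 * LR(w1 * x)."""
    g = w2 * leaky_relu(w1 * x)          # residual mapping G(x)
    return leaky_relu(x + g)             # shortcut adds the input back
```

When the residual weights shrink to zero, the unit reduces to an identity mapping of a non-negative input, which is the saturation behaviour described above.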
MLRN network structure integrating ECA and enhanced residual connections
In few-shot learning, the information content of the data is typically limited, making effective feature extraction crucial. Introducing attention mechanisms enables adaptive weight allocation within the network, allowing it to focus on the most representative features in the samples and thus improving the efficiency and accuracy of feature extraction. This mechanism helps the network capture key features more effectively, reducing its dependence on noisy data and thereby mitigating the risk of overfitting. This paper integrates the efficient channel attention (ECA)^{37} mechanism with an improved residual network as the architecture for product quality prediction models on small-sample data, with the goal of enhancing model performance and interpretability.
The structure of the ECA module, shown in Fig. 3, is designed to handle features of size 27 × 1 × C (where C represents the number of input channels). Upon receiving such features, the module first applies a global average pooling layer to aggregate the features without altering their channel dimension. A one-dimensional convolutional layer is then used for learning, allowing weight sharing. This convolutional layer has a hyperparameter \(k\), the kernel size, which defines the coverage of local cross-channel interaction and is determined adaptively from the channel dimension C. Next, the learned weights are redistributed through the sigmoid activation function. Finally, the resulting 1 × 1 × C weight vector is applied to the original feature to obtain a new attention-weighted feature, significantly enhancing the model's ability to learn attention.
The local cross-channel interaction strategy without dimensionality reduction completes the information exchange through an adaptive one-dimensional convolution followed by a nonlinear mapping, as shown in Eq. (10).
$$ w = \sigma (C_{k} (x)) $$
(10)
where \(w\) is the weight; \(\sigma\) stands for the nonlinear (sigmoid) mapping; \(C_{k}\) represents a one-dimensional convolution with kernel size \(k\); and \(x\) stands for the input data. The weights of the one-dimensional convolution are shared across channels and applied group-wise, the number of weights in each group depending on the kernel size \(k\). \(k\) is determined adaptively by formula (11).
$$ k = \psi (C) = \left| {\frac{{\log_{2} C}}{\gamma } + \frac{b}{\gamma }} \right|_{odd} $$
(11)
where \(\psi (C)\) represents the linear mapping from the number of channels \(C\) to the kernel size; \(k\) is the kernel size, representing the cross-channel interaction range; \(C\) is the number of channels; \(\left| \cdot \right|_{odd}\) stands for the nearest odd number; \(\gamma\) is the slope of the linear mapping, set to 2; \(b\) is the intercept of the linear mapping, set to 1; and \(\log_{2} C\) is the binary logarithm of the number of channels.
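Eq. (11) can be computed directly. The sketch below follows the rounding used in common ECA implementations, which truncates before choosing the next odd number; the function name is our own:

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size of Eq. (11): k = |log2(C)/gamma + b/gamma|_odd.
    Truncates to an integer, then moves up to the next odd number if even."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1
```

So wider layers interact over larger channel neighbourhoods (for example, 32 channels give \(k = 3\) while 512 channels give \(k = 5\)), without any fully connected dimensionality reduction.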
To balance the performance and complexity of the MLRN algorithm, the efficient channel attention mechanism is added to the improved residual network, which is named ECA + ResNet. The structure of ECA + ResNet is shown in Fig. 4.
In this paper, the improved 18-layer residual network is chosen as the network architecture of the MLRN model. It consists of one convolutional layer, eight residual blocks (each with two convolutional layers and an efficient channel attention structure) and one fully connected layer, and is named IEResNet18. Figure 5 shows the overall model framework, and Fig. 6 shows the working process of MLRN for limited-data intelligent production quality prediction in industrial production.
In addition, as shown in Algorithm 1, the learning process of the MLRN model is as follows:

1) First, the Min-Max normalization method is used to normalize the raw data, and the 18-layer improved residual network IEResNet18 is constructed as the basic framework of MLRN.

2) The dataset \(D\) is divided into a training set \(D_{train}\) and a testing set \(D_{test}\).

3) Tasks in \(D_{train}\) and \(D_{test}\) are sampled. We randomly pick N of all the categories and Q samples from each category, of which K samples form the inner-layer training set, also called the support set, and the remaining Q − K samples form the inner-layer testing set, also called the query set.

4) The cross-entropy loss function of Eq. (7) is selected as the loss function of each classification task, and the parameters of each task in the inner loop are optimized according to the first gradient descent update of Eq. (4).

5) After the inner batch-task parameters are updated in step 4), the outer-layer parameters are updated according to the second gradient descent update of Eq. (5).

6) Steps 4) and 5) are repeated to obtain the optimal parameter initialization \(\theta\) of the MLRN model.
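The Min-Max normalization of step 1) rescales each feature linearly onto [0, 1]. A minimal sketch, assuming each feature column contains at least two distinct values (the function name is our own):

```python
def min_max_normalize(values):
    """Min-Max normalization: rescale a feature column to [0, 1]
    via (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# e.g. min_max_normalize([2, 4, 6]) -> [0.0, 0.5, 1.0]
```

Normalizing before task sampling keeps all input features on a comparable scale, which stabilizes both the inner and outer gradient updates.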