The quality of labeled training data is an important consideration when training machine learning models. Since machine learning models learn from the labeled data itself, the accuracy of models in supervised learning scenarios depends directly on the quality of the annotations. Training data quality is degraded by label noise, which can be caused by human error, annotator confusion, or ambiguous examples. Improving annotation quality requires rigorous quality checks by multiple experienced human data annotators, which in turn increases the cost of building robust machine learning models.
We conducted a study to benchmark model performance degradation due to label noise on text classification tasks in spoken language understanding (SLU) systems and proposed mechanisms that could be deployed to mitigate the degradation. For the purposes of this study, we injected synthetic random noise into the dataset, ranging from 10% to 50%, and compared the resulting model's performance with that of a model trained on clean data. To inject synthetic noise, we randomly select x% of the utterances in the training and validation sets (we call x the "noise percentage") and randomly flip their class labels.
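As an illustration, the following is a minimal sketch of this injection procedure. The function name inject_label_noise, the plain-Python list representation, and the fixed seed are our own illustrative choices, not the exact code used in the study.

```python
import random

def inject_label_noise(labels, noise_pct, num_classes, seed=0):
    """Randomly flip the labels of noise_pct percent of the examples.

    labels: integer class labels for the training (or validation) split.
    noise_pct: the "noise percentage" x, e.g. 30 for 30%.
    """
    rng = random.Random(seed)
    labels = list(labels)
    n_noisy = int(len(labels) * noise_pct / 100)
    noisy_idx = rng.sample(range(len(labels)), n_noisy)
    for i in noisy_idx:
        # Flip to a class label different from the original one,
        # chosen uniformly among the remaining classes.
        new_label = rng.randrange(num_classes - 1)
        if new_label >= labels[i]:
            new_label += 1
        labels[i] = new_label
    return labels
```

The same procedure is applied independently to the training and validation splits, while the test set is left clean.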
Figure 1 shows model accuracy over training epochs for both clean and noisy training data, with 50% noise injected into the public SNIPS dataset (Coucke, et al. 2018).
We can see that the performance of the SLU model remains consistent on clean training and test data (green solid and dotted lines). For models trained on noisy data, accuracy on the training data continues to increase with the number of epochs (solid red line), but accuracy on the test set decreases (dotted red line). This means that the SLU model overfits when the training data is noisy, resulting in poor performance on the test set. It is also interesting to note that, before overfitting sets in (fewer than 10 epochs in this case), the model performs much better on the clean test set than on the noisy training set. That is, the model can still pick up the correct patterns despite the presence of random noise in the training data.
To mitigate the performance degradation of SLU models due to noise on text classification tasks, we considered the following five mitigation methods:
Noise layer (Jindal, et al. 2019): The noise layer approach mitigates the effects of noise by adding a nonlinear layer at the end of a deep neural network (DNN). This layer differs from the rest of the network in that it uses its own regularization and appropriate weight initialization, which makes it easier to learn the noise distribution. During inference, the noise layer is removed and the base model is used for the final prediction.
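A minimal PyTorch-style sketch of the idea follows. This is our own simplified illustration of a transition-style noise layer stacked on a base classifier, not the exact architecture of Jindal et al.; the class name and the use of a linear (rather than nonlinear) noise layer are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class NoisyChannelClassifier(nn.Module):
    """Base classifier plus a noise layer that maps clean class scores
    to (possibly corrupted) observed-label scores."""

    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base_model = base_model          # e.g., a DistilBERT classifier
        # Square noise layer, initialized close to the identity so training
        # starts from the "no corruption" assumption.
        self.noise_layer = nn.Linear(num_classes, num_classes, bias=False)
        with torch.no_grad():
            self.noise_layer.weight.copy_(torch.eye(num_classes))

    def forward(self, inputs, use_noise_layer=True):
        logits = self.base_model(inputs)          # clean-label logits
        if not use_noise_layer:
            return logits                         # inference: drop the noise layer
        probs = torch.softmax(logits, dim=-1)
        return self.noise_layer(probs)            # model the label corruption
```

During training, the classification loss is computed on the noise layer's output (typically with an extra regularization penalty on its weights); at inference time, use_noise_layer=False so only the base model makes the prediction.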
Robust loss (Ma, et al. 2020): The robust loss method proposes a loss function called Active Passive Loss (APL), which combines an "active" and a "passive" loss function. The former explicitly maximizes the probability of being in the labeled class, while the latter minimizes the probability of being in the incorrect classes. In our study, we used a combination of normalized cross-entropy (NCE) and reverse cross-entropy (RCE) loss as the APL.
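The sketch below illustrates the NCE + RCE combination following the general formulation of Ma et al. (2020); the weighting constants alpha and beta and the log-zero clamp A are illustrative hyperparameters, not the values used in our experiments.

```python
import torch
import torch.nn.functional as F

def apl_nce_rce_loss(logits, targets, alpha=1.0, beta=1.0, A=-4.0):
    """Active Passive Loss = alpha * NCE + beta * RCE."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # Normalized cross-entropy (active): cross-entropy of the target class
    # divided by the sum of cross-entropy values over all classes.
    ce_target = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    ce_all = -log_probs.sum(dim=-1)
    nce = ce_target / ce_all

    # Reverse cross-entropy (passive): swap prediction and one-hot label,
    # clamping log(0) for incorrect classes to the constant A.
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    log_labels = torch.where(one_hot > 0, torch.zeros_like(one_hot),
                             torch.full_like(one_hot, A))
    rce = -(probs * log_labels).sum(dim=-1)

    return (alpha * nce + beta * rce).mean()
```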
LIMIT (Harutyunyan, et al. 2020): LIMIT is a noise-mitigation method that adds an information-theoretic regularizer to the objective function, which attempts to minimize the Shannon mutual information between the model weights and the labels. To optimize this objective, LIMIT leverages an auxiliary DNN that predicts the gradients in the final layer of the classifier without access to label information.
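The full LIMIT procedure is more involved than we can show here; the following is only a rough, simplified sketch of the core idea, in which an auxiliary network predicts the gradient of the loss with respect to the final-layer logits without seeing the labels and the discrepancy is penalized. All dimensions, module names, and the penalty form are our own illustrative assumptions, not the exact algorithm of Harutyunyan et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative components: `encoder` maps input features to a hidden vector,
# `classifier` maps hidden vectors to class logits, and `grad_predictor` is the
# auxiliary network that predicts the final-layer gradient without the labels.
input_dim, hidden_dim, num_classes = 300, 256, 7
encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
classifier = nn.Linear(hidden_dim, num_classes)
grad_predictor = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, num_classes))

def limit_style_loss(x, y, reg_weight=0.1):
    h = encoder(x)
    logits = classifier(h)
    ce = F.cross_entropy(logits, y)

    # For cross-entropy, the gradient of the loss w.r.t. the logits is
    # softmax(logits) - one_hot(y); the auxiliary net must approximate it
    # using only the label-free hidden representation.
    true_grad = torch.softmax(logits, dim=-1) - F.one_hot(y, num_classes).float()
    pred_grad = grad_predictor(h)
    grad_penalty = F.mse_loss(pred_grad, true_grad.detach())

    return ce + reg_weight * grad_penalty
```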
Label smoothing (Szegedy, et al. 2016): Label smoothing is a regularization technique that introduces noise into the labels to account for possible errors in the labeled data. It regularizes the model by replacing the hard class labels (0 and 1) with soft labels, subtracting a small constant from the correct class and adding it uniformly to all other incorrect classes.
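Concretely, with a smoothing constant epsilon and K classes, the target becomes 1 - epsilon for the correct class and epsilon / (K - 1) for each incorrect class. Below is a minimal sketch of a loss implementing this scheme; the function name and default epsilon are illustrative (recent PyTorch versions also offer a closely related built-in via the label_smoothing argument of nn.CrossEntropyLoss).

```python
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, epsilon=0.1):
    """Cross-entropy against smoothed targets: the correct class gets
    1 - epsilon, and epsilon is spread uniformly over the other classes."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    smoothed = one_hot * (1.0 - epsilon) \
        + (1.0 - one_hot) * (epsilon / (num_classes - 1))
    return -(smoothed * log_probs).sum(dim=-1).mean()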
Early stopping (Li, Soltanolkotabi and Oymak 2020): Early stopping halts training when model accuracy starts to decline, preventing overfitting on noisy data. The validation-set error is used as a proxy for the model's accuracy, and training stops when the validation error has been increasing for a certain number of epochs.
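A sketch of this early-stopping logic is shown below; the names train_one_epoch, evaluate, and patience are illustrative placeholders for the usual training-loop components rather than functions from our system.

```python
def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=50, patience=3):
    """Stop training once the validation error has not improved for `patience` epochs."""
    best_val_error = float("inf")
    best_state, epochs_without_improvement = None, 0

    for epoch in range(max_epochs):
        train_one_epoch(model)            # one pass over the (noisy) training data
        val_error = evaluate(model)       # error on the held-out validation set
        if val_error < best_val_error:
            best_val_error = val_error
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # validation error keeps increasing: stop

    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint
    return model
```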
The above noise-mitigation approaches fall into three categories. In the first category, training is stopped when overfitting begins (e.g., Early Stopping). Methods in the second category, such as Noise Layer and Label Smoothing, can further benefit from Early Stopping, as they delay or reduce the effects of overfitting. Methods in the third category, such as Robust Loss and LIMIT, modify the loss function to prevent the model from overfitting. Therefore, in our experiments, we apply Early Stopping together with Noise Layer and Label Smoothing, but not with Robust Loss and LIMIT.
Public dataset results
We trained DistilBERT (Sanh, et al. 2019) models on three public text classification datasets: ATIS – Air Travel Information Systems (Hemphill, Godfrey and Doddington 1990), SNIPS – personal voice assistant (Coucke, et al. 2018), and the English portion of Facebook's Task-Oriented Parsing (TOP) dataset (Gupta, et al. 2018). Table 1 contains the results for all three datasets with noise percentages ranging from 10% to 50%.
Based on the results in Table 1, we can see that accuracy on all three datasets decreases when noise is added to the training data. LIMIT recovers most of the lost accuracy on all three datasets, followed by Early Stopping. Robust Loss performs best on SNIPS, and Label Smoothing performs best on the TOP dataset. We also see improvements on the "clean" ATIS and SNIPS datasets, which indicates that the published data themselves contain some noise.
Industrial dataset results
We applied the above mitigation methods to an industrial dataset used for text classification tasks in a large spoken language understanding (SLU) system. This dataset contains text transcripts of voice requests made by users of this large SLU system, preprocessed ("anonymized") so that the users cannot be identified. We had access to gold and noisy versions of the annotations for the same examples. The gold data undergoes rigorous quality checks by multiple experienced human data annotators and has few to no annotation errors.
We train models with each mitigation method on the noisy version and test them on the corresponding gold version. For comparison, we also train a model on the gold annotations of the same examples and report the improvement in model accuracy. We regard the accuracy of the model trained on gold data as the target that the noise-mitigation models should strive to reach.
Table 2 reports results for three domains of the industrial SLU system. Similar to the intent classification task on the public datasets, the domain classification task predicts target domains such as music, books, etc., given spoken text. First, note that training on noisy data measurably reduces accuracy. In fact, the last row of Table 2 shows the absolute difference in accuracy between training on gold data and training on noisy data; on average, training on gold data yields a 2.69% relative improvement in accuracy. Next, we examine the effects of the different noise-mitigation strategies. For each mitigation method in Table 2, the corresponding row reports the accuracy gap between models trained with and without the mitigation method on noisy data from the three domains. We find that Early Stopping performs best, recovering 57% of the lost accuracy relative to the gold data. LIMIT and Robust Loss track Early Stopping closely, recovering 31% and 40% of the lost accuracy, respectively.
Overall, although no single method is best across the board, we find that Early Stopping and LIMIT outperform the other methods evaluated on both public and industrial datasets. However, we observe that even the best-performing method (Early Stopping) can recover only about 57% of the accuracy loss in the industrial setting, whereas it recovers up to 98% with randomly injected noise on the public datasets. This may be because the label-noise distribution in real datasets is more complex than the simple random noise model considered in this work. Therefore, there is still room for improvement in building systems that are robust to real-world noise. Characterizing the noise distribution of real systems and devising better mitigation strategies for such noise models is an interesting open question that we will consider as future work.
For more information, see our research paper, "Training Under Label Noise for Robust Spoken Language Understanding Systems", published in the industry track at Interspeech 2022.