Improving Training Efficiency and Performance of Deep Neural Networks with Linear Prediction


In this section, we present experimental details, followed by experiments to evaluate the performance and hyperparameter sensitivity of the proposed method.

Implementation details

We used three representative backbones (i.e., VGG16, ResNet18, and GoogLeNet) and two representative non-adaptive parameter optimization methods (SGD and DEMON). All networks were randomly initialized with PyTorch's default settings, without pre-training on any external dataset. The networks were evaluated on the CIFAR-100 dataset, which has 100 categories of 600 images each: 50,000 training images and 10,000 test images. 20% of the training set was randomly split off as a validation set. All experiments were implemented and evaluated on an NVIDIA GeForce RTX 3060 Laptop GPU.
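The 80/20 train/validation split described above amounts to shuffling the 50,000 training indices and holding out a fifth of them. A minimal sketch (the helper name is an assumption; actual CIFAR-100 loading via torchvision is omitted):

```python
import random

def split_train_val(n_train=50_000, val_frac=0.2, seed=0):
    # Shuffle all training indices with a fixed seed, then hold out
    # val_frac of them as the validation set.
    idx = list(range(n_train))
    random.Random(seed).shuffle(idx)
    n_val = int(n_train * val_frac)
    return idx[n_val:], idx[:n_val]  # (train indices, validation indices)
```

The resulting index lists would typically be wrapped in `torch.utils.data.Subset` objects over the CIFAR-100 training set.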

The code was implemented with Python 3.10.9 and PyTorch 2.0.1. All models were trained for 100 epochs with the different parameter optimization methods. The weight decay was set to 1e-4 and the momentum to 0.9. For the learning rate, a cyclic schedule with a base learning rate of 0.001 and a maximum learning rate of 0.002 was adopted in the performance evaluation to obtain more stable results; no learning rate adjustment strategy was used in the sensitivity evaluation.
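The cyclic schedule used here corresponds to a triangular oscillation between the base and maximum learning rates (in PyTorch one would normally use `torch.optim.lr_scheduler.CyclicLR`). A minimal stand-alone sketch of the triangular rule, where the `step_size` of 2,000 iterations is an assumption:

```python
def cyclic_lr(step, base_lr=1e-3, max_lr=2e-3, step_size=2000):
    # Triangular cyclic schedule: the learning rate rises linearly from
    # base_lr to max_lr over step_size iterations, then falls back,
    # repeating every 2 * step_size iterations.
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)  # distance from cycle peak, in [0, 1]
    return base_lr + (max_lr - base_lr) * (1 - x)
```

At step 0 this yields the base rate 0.001, peaks at 0.002 mid-cycle, and returns to 0.001 at the end of each cycle.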

To make the validation of the PLP method more objective and to avoid false validation results due to randomness, we repeated the accuracy and top-1/top-5 error evaluations in this section 10 times and reported their averages (i.e., we trained 10 models for each of the three selected networks using the proposed PLP method, SGD, and DEMON, respectively, and tested each of them on the test set).

Performance Evaluation and Analysis

Figure 3 shows the accuracy of the proposed PLP method, DEMON, and SGD on the training set. The accuracy curves of the proposed PLP method and DEMON almost overlap, and both outperform SGD throughout most of the training process. Combined with the validation loss curve shown in Figure 4, it can be concluded that the model begins to overfit at about 40 epochs. This explains why the difference in training accuracy among the three methods shrinks after 40 epochs: the parameters gradually converge during the overfitting stage, and the performance of SGD becomes comparable to that of PLP and DEMON.

Figure 3

Training-set accuracy of the proposed PLP method, DEMON, and SGD. “PLP” denotes the network trained with the proposed parametric linear prediction method; “DEMON” and “SGD” are labeled analogously.

Figure 4

Comparison of the loss on the validation set of the proposed PLP method, DEMON, and SGD. A legend entry “XXX−YYY” in the “Loss Diff” panel indicates that the curve is the loss of method XXX minus the loss of method YYY. The black solid horizontal line is the zero line, and the red dashed horizontal line marks the average value of the loss difference.

Figure 4 shows the loss comparison of the proposed PLP method, DEMON, and SGD on the validation set. As can be seen from the “SGD−PLP” curve in the “Loss Difference” panel of Figure 4, there are some outliers before the 40th epoch where the proposed PLP method performs worse than SGD; that is, its validation loss is larger than that of the model trained with SGD. The root cause of these outliers is a characteristic of SGD: parameter values can change abruptly, and the linear predictor cannot keep up with such changes. Apart from these outliers and the overfitting stage after about 40 epochs, the proposed method performs better than SGD throughout training. The “PLP−DEMON” curve shows the loss comparison of PLP and DEMON: at the beginning of training DEMON's loss is lower, but after about 20 epochs the ranking reverses, and as parameter optimization progresses the PLP method achieves better optimization performance than DEMON.
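The “Loss Diff” curves and their red mean lines reduce to a per-epoch subtraction of two loss histories; a minimal sketch (the interface is an assumption):

```python
def loss_difference(loss_a, loss_b):
    # Per-epoch difference "A minus B": a positive value at an epoch means
    # method B achieved the lower validation loss there. Also returns the
    # mean difference, i.e. the red dashed line in Figure 4.
    diff = [a - b for a, b in zip(loss_a, loss_b)]
    mean_diff = sum(diff) / len(diff)
    return diff, mean_diff
```

For example, the “SGD−PLP” panel would be produced by passing the SGD loss history as `loss_a` and the PLP history as `loss_b`.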

Figures 3 and 4 show the ability of the proposed PLP method to improve DNN training performance. To better illustrate its ability to improve the training efficiency of DNNs, Tables 1, 2, and 3 show the epochs and loss values corresponding to the best training models obtained over the 10 repeated runs for the three selected models.

Table 1. Comparison based on VGG16 (loss coefficient: 1e-2).
Table 2. Comparison based on ResNet18 (loss coefficient: 1e-2).

Table 1 shows the comparison of PLP, DEMON, and SGD based on the VGG16 network. In the 1st and 8th tests, the PLP method converges at later epochs and to higher losses than SGD; in the 10th test it also converges at later epochs, but to a lower loss. In the remaining tests, the PLP method obtains better performance at earlier epochs than SGD. The anomalies in the 1st, 8th, and 10th tests arise because the randomness of SGD can cause abrupt changes in the parameter gradients during training; the noise introduced by the PLP method then exceeds what SGD can tolerate, and PLP needs more training epochs to converge than the baseline. Even under this condition, however, the introduced noise can act as a regularizer and yield better final training performance (10th test). Compared with DEMON, PLP showed lower loss and faster convergence in about 50% of the tests (1st, 5th, 7th, 8th, 10th), lower loss but slower convergence in the 6th test, and worse performance in the remaining 40% of the tests. Considering the randomness inherent in the SGD process, the experimental results indicate that the proposed PLP method performs practically on par with DEMON.

Tables 2 and 3 show the comparison of PLP, DEMON, and SGD based on Resnet18 and GoogLeNet. As shown in the tables, there are some outliers such as those in Table 1 above, but overall, in most cases, the PLP method can obtain the optimal model faster than SGD during the training process and performs similarly to DEMON, validating the ability of the proposed method to improve the training efficiency of DNNs.

Table 3. Comparison based on GoogLeNet (loss coefficient: 1e-2).

To further demonstrate the effectiveness of the proposed method in improving the efficiency and performance of DNN training, we use the CIFAR-100 test set to report, in Table 4, the average accuracy, the top-1/top-5 error of the best model, and the accuracy of the model at fixed training stages. As the table shows, the model trained with the proposed PLP method obtains higher accuracy than SGD at every stage of training, with an average accuracy improvement of more than 1%. This indicates that the good training performance of the proposed PLP method is not achieved at the expense of generalization; rather, it generalizes better than the baseline model. In addition, the proposed PLP method obtains smaller top-1/top-5 errors on average, further confirming that it is effective in obtaining good training performance.

Table 4. Average accuracy and top-1/top-5 error results on the test set.
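For reference, the top-k error reported in Table 4 counts a sample as wrong when its true label is not among the k highest-scoring classes; a minimal per-sample sketch (the interface is an assumption):

```python
def topk_error(scores, label, k):
    # scores: per-class scores for one sample; label: true class index.
    # Returns 1 if the true label is outside the k top-scoring classes
    # (an error), else 0. Averaging over the test set gives the top-k error.
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return 0 if label in topk else 1
```

Top-1 error is then `topk_error(scores, label, 1)` averaged over all test samples, and top-5 error uses `k=5`.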

Also, compared to DEMON, both the accuracy and the top-1/top-5 error of the best model at a given training stage are very close, showing that the PLP method and DEMON are in good agreement in terms of performance and efficiency, and that both are better than SGD.

The above experimental results suggest that the proposed PLP method outperforms SGD in terms of convergence speed and accuracy in most cases, and is comparable to the SOTA non-adaptive method (DEMON), demonstrating the ability of the proposed method to improve the efficiency and performance of DNN training. The presence of outliers also reveals that the proposed PLP method relies to some extent on the reliability of its predictions.

Sensitivity assessment and analysis

To demonstrate the hyperparameter sensitivity of the proposed PLP method, we evaluated and analyzed the performance of the PLP method using different learning rates and batch sizes on different backbones.

As introduced in the “Problem Formulation” section, the regular change of DNN parameters during optimization is the basis of the proposed PLP method. This imposes certain requirements on the stability of parameter changes during DNN optimization when applying the PLP method. It follows that when the learning rate is large and the parameter optimization process oscillates, linear parameter prediction incurs large prediction errors and the performance of the PLP method deteriorates. This is consistent with the experimental results in Table 5, which show that the PLP method usually performs better with smaller learning rates (< 0.01). In summary, except for GoogLeNet with a learning rate of 0.01, the proposed PLP method performs well in both the efficiency and effectiveness of DNN training, demonstrating its stability across different learning rates.
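As an illustration only (the paper's exact predictor is defined in the “Problem Formulation” section and is not reproduced here), the linearity assumption above can be sketched as a one-step extrapolation over two consecutive parameter snapshots; the function name and interface are assumptions:

```python
def predict_params(theta_prev, theta_curr, horizon=1.0):
    # Hypothetical linear predictor: extrapolate each parameter along its
    # recent trajectory, theta_pred = theta_curr + horizon * (theta_curr - theta_prev),
    # i.e. parameters are assumed to keep changing at their current rate.
    return [c + horizon * (c - p) for p, c in zip(theta_prev, theta_curr)]
```

A large learning rate that makes the parameter trajectory oscillate violates this linearity assumption, which matches the degradation observed in Table 5.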

Table 5. Performance comparison with different learning rates.

A comparison with commonly used batch size values is shown in Table 6. Experimental results show that the proposed PLP method exhibits good training performance with different batch size settings on different backbones, demonstrating the stability of the proposed method with respect to the hyperparameter batch size.

Table 6. Performance comparison with different batch sizes.

Considering the performance and sensitivity evaluation and analysis, it is concluded that the PLP method exhibits good performance in most cases and has relatively low sensitivity to the changes in hyperparameters, validating that the proposed PLP method is effective in optimizing the performance and efficiency of DNN training.
