A Study on Large Batch Training Part 2 (Machine Learning) | Monodeep Mukherjee

Accelerating Large Batch Training with Gradient Signal-to-Noise Ratio (GSNR)

Authors: Guo-qing Jiang, Jinlong Liu, Zixiang Ding, Lin Guo, Wei Lin

Abstract: As models for natural language processing (NLP), computer vision (CV), and recommendation systems (RS) demand ever more computation, training is parallelized across a large number of GPUs/TPUs with large batches (LBs) to improve throughput. However, LB training often suffers from a large generalization gap, which lowers the final accuracy and limits how far the batch size can be scaled. In this study, we develop a variance reduced gradient descent technique (VRGD) based on the gradient signal-to-noise ratio (GSNR) and apply it to popular optimizers such as SGD/Adam/LARS/LAMB. We provide a theoretical analysis of the convergence rate to explain its fast training dynamics, and a generalization analysis to show that it yields a smaller generalization gap in LB training. Comprehensive experiments demonstrate that VRGD can accelerate training (by 1-2×), narrow the generalization gap, and improve the final accuracy. We push the batch size limit of BERT pre-training up to 128k/64k and of DLRM up to 512k without noticeable accuracy loss. ImageNet Top-1 accuracy at a batch size of 96k improves over LARS by 0.52pp. The generalization gap in BERT and ImageNet training is reduced by over 65%.
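
To make the idea of GSNR-based step scaling concrete, here is a minimal NumPy sketch. It assumes the common definition of a parameter's GSNR as the squared mean of its gradient divided by the gradient's variance, estimated here from a handful of gradient samples (for example, one per device). The function names, the normalization by the mean GSNR, and the clipping range are hypothetical choices made for illustration; this is not the authors' exact VRGD algorithm.

```python
import numpy as np


def gsnr(group_grads, eps=1e-12):
    """Per-parameter GSNR: squared mean gradient over gradient variance.

    group_grads has shape (k, d): k gradient estimates (e.g. one per device)
    for d parameters. eps avoids division by zero when the variance is 0.
    """
    mean = group_grads.mean(axis=0)
    var = group_grads.var(axis=0)
    return mean ** 2 / (var + eps)


def gsnr_scaled_sgd_step(params, group_grads, lr=0.1, clip=(0.1, 1.0)):
    """One SGD step with the per-parameter step size scaled by GSNR.

    Assumption (illustrative only): GSNR is normalized by its mean over
    parameters and clipped to `clip`, so noisy parameters take smaller steps.
    """
    r = gsnr(group_grads)
    scale = np.clip(r / (r.mean() + 1e-12), clip[0], clip[1])
    return params - lr * scale * group_grads.mean(axis=0)


# Toy data: both parameters have a mean gradient of 1.0, but the second
# parameter's gradient estimates disagree strongly across the four samples.
grads = np.array([[1.0, 3.0],
                  [1.1, -1.0],
                  [0.9, 3.0],
                  [1.0, -1.0]])

r = gsnr(grads)
new_params = gsnr_scaled_sgd_step(np.zeros(2), grads)
print("GSNR per parameter:", r)
print("Updated parameters:", new_params)
```

In this toy setup the parameter with consistent gradients has a much higher GSNR than the noisy one, so after clipping it receives the full learning rate while the noisy parameter takes a step ten times smaller.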
