Large language models (LLMs) have become important tools in a variety of domains due to their remarkable ability to understand and generate human language. These models often contain billions of parameters and require enormous computational resources to train and fine-tune. A major challenge is to manage memory and computational demands efficiently enough to make these models accessible to a wide range of users and applications.
Training an LLM is inherently memory intensive, requiring hardware resources that are available to only a small fraction of users. Traditional methods demand large memory allocations to hold the parameters and optimizer states. For example, training a LLaMA 7B model from scratch typically requires approximately 58 GB of memory: 14 GB for trainable parameters, 42 GB for Adam optimizer states and weight gradients, and 2 GB for activations. This high memory requirement creates a significant barrier to entry for the many researchers and developers who lack access to advanced hardware setups.
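The 58 GB figure above can be reproduced with simple back-of-envelope arithmetic. The sketch below assumes 2-byte (bf16) storage for weights, gradients, and both Adam moment tensors, and uses decimal gigabytes; the byte-width assumptions are ours, chosen so the totals match the article's numbers.

```python
# Rough memory estimate for full-parameter Adam training of an LLM.
# Assumes bf16 (2-byte) weights, gradients, and Adam moments; decimal GB.

def training_memory_gb(n_params: float,
                       weight_bytes: int = 2,   # bf16 weights
                       grad_bytes: int = 2,     # bf16 gradients
                       adam_bytes: int = 4,     # two bf16 moment tensors (m, v)
                       activations_gb: float = 2.0) -> dict:
    gb = 1e9
    weights = n_params * weight_bytes / gb
    optimizer_and_grads = n_params * (grad_bytes + adam_bytes) / gb
    return {
        "weights_gb": round(weights, 1),
        "optimizer_and_grads_gb": round(optimizer_and_grads, 1),
        "activations_gb": activations_gb,
        "total_gb": round(weights + optimizer_and_grads + activations_gb, 1),
    }

print(training_memory_gb(7e9))
# For 7B parameters: 14 GB weights, 42 GB gradients + Adam states,
# 2 GB activations -> 58 GB total.
```

Note how the optimizer dominates: gradients plus the two Adam moments cost three times the weights themselves, which is exactly the overhead that low-rank methods like GaLore attack.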
To address this issue, various techniques have been developed, including designing small LLMs, employing efficient scaling techniques, and incorporating sparsity into the training methodology. Among these, GaLore has emerged as a notable technique, enabling full-parameter training of LLMs through low-rank gradient updates computed via singular value decomposition (SVD). GaLore reduces memory usage by up to 63.3%, making it possible to train a 7B model with just 24 GB of memory. However, even that exceeds the memory available on many common devices, including popular consumer GPUs such as the RTX 4060 Ti, which offers at most 16 GB of memory.
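The core idea of GaLore-style low-rank gradient updates can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the gradient matrix of a layer is projected onto its top-r left singular vectors, so the optimizer only needs to keep state in the small rank-r subspace, and the resulting update is lifted back to full shape before being applied.

```python
import numpy as np

# Minimal sketch of low-rank gradient projection in the style of GaLore.
# The optimizer state (Adam moments, etc.) lives in the (r, n) space
# instead of the full (m, n) space, which is where the memory saving comes from.

def galore_project(G: np.ndarray, r: int):
    # Left singular vectors span the dominant gradient subspace.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P = U[:, :r]                  # (m, r) projection matrix
    G_low = P.T @ G               # (r, n) low-rank gradient for the optimizer
    return P, G_low

def galore_project_back(P: np.ndarray, update_low: np.ndarray) -> np.ndarray:
    # Lift the low-rank optimizer update back to the full weight shape.
    return P @ update_low         # (m, n)

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))          # toy gradient of a 64x32 weight
P, G_low = galore_project(G, r=4)
W_update = galore_project_back(P, G_low)
print(G_low.shape, W_update.shape)         # (4, 32) (64, 32)
```

With rank 4 instead of 64, the optimizer state for this toy layer shrinks by a factor of 16; the recurring SVD cost of refreshing `P` is precisely what Q-GaLore targets next.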
Researchers from the University of Texas at Austin, University of Surrey, University of Oxford, California Institute of Technology, and Meta AI have introduced Q-GaLore, which further reduces memory consumption and makes LLM training more accessible. Q-GaLore combines quantization with low-rank projection to achieve significant memory efficiency. The method rests on two key observations: first, the gradient subspace exhibits diverse behavior, with some layers stabilizing early in training while others change frequently; second, the projection matrices are highly tolerant of low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on convergence statistics, reducing the number of SVD operations while maintaining performance. Model weights are kept in INT8 format and the projection matrices in INT4 format, aggressively saving memory.
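The adaptive subspace update can be illustrated with a small sketch. This is a hypothetical criterion, not the paper's exact rule: each layer recomputes its SVD only until its projection matrix stops changing between refreshes (measured here by cosine similarity of corresponding basis vectors); once a layer has been stable for a few consecutive checks, its projection is frozen and further SVDs are skipped. The `threshold` and `patience` values are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of a "lazy" layer-wise subspace update in the spirit
# of Q-GaLore: skip the SVD for layers whose gradient subspace has converged.

def subspace_similarity(P_old: np.ndarray, P_new: np.ndarray) -> float:
    # Mean absolute cosine similarity between corresponding basis vectors.
    cos = np.abs(np.sum(P_old * P_new, axis=0))
    return float(np.mean(cos))

def adaptive_update(G, P_old, stable_count, r=4, threshold=0.95, patience=3):
    if stable_count >= patience:
        return P_old, stable_count, False   # frozen layer: SVD skipped entirely
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    P_new = U[:, :r]
    if P_old is not None and subspace_similarity(P_old, P_new) > threshold:
        return P_old, stable_count + 1, True  # stable: keep old projection
    return P_new, 0, True                     # drifted: adopt fresh projection

rng = np.random.default_rng(1)
G = rng.standard_normal((32, 16))
P, stable, ran_svd = adaptive_update(G, None, 0)
for _ in range(3):                    # identical gradient -> subspace is stable
    P, stable, ran_svd = adaptive_update(G, P, stable)
P, stable, ran_svd = adaptive_update(G, P, stable)
print(stable, ran_svd)                # 3 False: SVD now skipped for this layer
```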
Q-GaLore employs two main modules: low-precision training with low-rank gradients, and a lazy layer-wise subspace search. The entire model, including the Adam optimizer state, is kept in 8-bit precision, and the projection matrices are quantized to 4-bit. This reduces the memory of low-rank gradient training by approximately 28.57%. Stochastic rounding maintains training stability and approximates the trajectory of high-precision training, effectively preserving small gradient contributions even though only low-precision weights are stored, without the need to maintain high-precision parameters.
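Stochastic rounding itself is simple to sketch. The version below is an illustrative NumPy stand-in, not the paper's kernel: each value is rounded up with probability equal to its fractional part, so the rounding is unbiased in expectation and small updates are not systematically truncated away, which is why it approximates the high-precision training trajectory.

```python
import numpy as np

# Sketch of stochastic rounding for low-precision weight updates.
# round-to-nearest would map 2.3 -> 2.0 every time, silently dropping
# small gradient contributions; stochastic rounding keeps them in expectation.

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    floor = np.floor(x)
    frac = x - floor
    # Round up with probability equal to the fractional part.
    return floor + (rng.random(x.shape) < frac)

rng = np.random.default_rng(0)
x = np.full(100_000, 2.3)
rounded = stochastic_round(x, rng)
print(rounded.mean())   # ≈ 2.3: unbiased, unlike deterministic rounding
```

Roughly 30% of the entries land on 3 and 70% on 2, so the average recovers the true value even though every stored number is an integer.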
In real-world applications, Q-GaLore demonstrated strong performance in both pre-training and fine-tuning scenarios. During pre-training, Q-GaLore enabled training of the LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB of memory, a significant achievement that demonstrates the method's memory efficiency and practicality. In fine-tuning tasks, Q-GaLore reduced memory consumption by up to 50% compared to methods such as LoRA and GaLore while consistently performing competitively, outperforming QLoRA by up to 5.19 points on the MMLU benchmark at the same memory cost.
The performance and efficiency of Q-GaLore were evaluated across model sizes ranging from 60 million to 7 billion parameters. For models with 1 billion parameters, Q-GaLore maintained comparable pre-training performance, with less than a 0.84 perplexity increase over the original GaLore method, while achieving 29.68% memory savings over GaLore and 60.51% over the full-parameter baseline. Notably, Q-GaLore made it possible to pre-train a 7B model within a 16 GB memory constraint, with less than 1 point of perplexity difference from the baseline model.
In conclusion, Q-GaLore provides a practical solution to the memory constraints that have traditionally limited efficient LLM training. By combining quantization with low-rank projection, Q-GaLore achieves competitive performance while broadening access to powerful language models. The method highlights the potential of optimizing large models for commonly available hardware, making state-of-the-art language processing technology available to a wider audience.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His latest endeavor is the launch of Marktechpost, an Artificial Intelligence media platform. The platform stands out for its in-depth coverage of Machine Learning and Deep Learning news in a manner that is technically accurate yet easily understandable to a wide audience. The platform has gained popularity among its audience with over 2 million views every month.