[Zeus demo video]
A new way to optimize the training of deep learning models, the rapidly evolving technique behind modern artificial intelligence, could significantly reduce the energy demands of AI.
The open-source optimization framework, developed at the University of Michigan, profiles deep learning models during training to identify the best trade-off between energy consumption and training speed.
“At extreme scale, a single training session of the GPT-3 model consumes 1,287 MWh – enough to power an average US household for 120 years,” said Mosharaf Chowdhury, associate professor of electrical engineering and computer science.
Zeus, a new energy optimization framework developed by Chowdhury and his team, can cut that energy use by up to 75% without any new hardware, and with only a small impact on the time it takes to train a model. The work was presented at the 2023 USENIX Symposium on Networked Systems Design and Implementation (NSDI) in Boston.
From image generation models and expressive chatbots to the recommender systems powering TikTok and Amazon, mainstream use of powerful deep learning models has exploded over the past three years. With cloud computing already out-emitting commercial aviation, the added climate burden from artificial intelligence is a major concern.
“Existing research primarily focuses on optimizing deep learning training to complete faster, often without considering the impact on energy efficiency,” said Jae-Won Chung, a doctoral student in computer science and engineering and co-first author of the study. “We discovered that the energy we pour into GPUs gives diminishing returns, which means we can reduce energy consumption significantly without comparably slowing down training.”
Deep learning is a family of techniques that use multilayered artificial neural networks, also known as deep neural networks (DNNs), to tackle a range of machine learning tasks. These models are extremely complex and learn from some of the largest data sets ever used in machine learning. Because of this, they benefit greatly from the parallel-processing capabilities of graphics processing units (GPUs), which consume about 70% of the power that goes into training one of these models.
Zeus uses two software knobs to reduce energy consumption. One is the GPU power limit, which lowers the GPU's power draw at the cost of slowing down training until the setting is adjusted again. The other is the deep learning model's batch size parameter, which controls how many samples of training data the model works through before updating its internal representation of the relationships it finds in the data. A larger batch size reduces training time but increases energy consumption.
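To make the batch size knob concrete, here is a toy illustration (not Zeus's code; the dataset and batch sizes are made up) of what the parameter controls: how many samples are consumed per parameter update, and therefore how many update steps an epoch takes.

```python
# Illustrative sketch: the batch size knob determines how many samples the
# model processes before each parameter update. Larger batches mean fewer
# updates per pass over the data, typically shortening training time while
# drawing more GPU power per step. All numbers here are hypothetical.

def updates_per_epoch(dataset_size: int, batch_size: int) -> int:
    # Ceiling division: a final partial batch still costs one update.
    return -(-dataset_size // batch_size)

print(updates_per_epoch(50_000, 32))   # small batches -> 1563 updates
print(updates_per_epoch(50_000, 256))  # large batches -> 196 updates
```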
Zeus adjusts each of these settings in real time, seeking the optimal trade-off point at which energy usage is minimized with the least possible impact on training time. In examples, the team was able to visually illustrate this trade-off point by exhaustively showing every combination of these two parameters. That level of completeness won't occur in practice for any particular training job, but Zeus takes advantage of the recurring nature of machine learning training to come very close.
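The search over settings can be sketched as minimizing a cost that weighs energy against time. The sketch below is a minimal illustration under assumed numbers, not Zeus's actual measurement pipeline: it uses a hypothetical throughput model with diminishing returns, and a weighted cost of the form `eta * energy + (1 - eta) * max_power * time`, where `eta` expresses how much the user cares about energy versus speed.

```python
# A minimal sketch of a Zeus-style energy/time trade-off search.
# The throughput model and all constants are hypothetical; a real system
# would measure time and energy from recurring training jobs instead.

def training_time(power_limit_w: float, base_time_s: float = 3600.0,
                  max_power_w: float = 300.0) -> float:
    """Hypothetical model: lowering the power limit slows training, with
    diminishing returns (halving power less than doubles the time)."""
    return base_time_s * (max_power_w / power_limit_w) ** 0.6

def cost(power_limit_w: float, eta: float, max_power_w: float = 300.0) -> float:
    t = training_time(power_limit_w, max_power_w=max_power_w)
    energy_j = power_limit_w * t  # crude bound: power cap times duration
    # eta = 1 cares only about energy; eta = 0 cares only about time.
    return eta * energy_j + (1 - eta) * max_power_w * t

def best_power_limit(candidates_w, eta: float):
    return min(candidates_w, key=lambda p: cost(p, eta))

candidates = [100, 150, 200, 250, 300]  # settable GPU power limits, in watts
print(best_power_limit(candidates, eta=1.0))  # energy-only: picks 100 W
print(best_power_limit(candidates, eta=0.0))  # time-only: picks 300 W
```

With this toy model, caring only about energy drives the limit down, while caring only about time keeps the GPU at full power; intermediate `eta` values land in between, which is the trade-off curve the team visualized.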
“Fortunately, companies train the same DNN over and over again on newer data, as often as every hour, and we can take advantage of that recurrence,” said Jie You, a recent doctoral graduate in computer science and engineering and co-lead author of the study.
Zeus is the first framework designed to plug into existing workflows for a variety of machine learning tasks and GPUs, reducing energy consumption without requiring any changes to system hardware or data center infrastructure.
In addition, the team has developed complementary software that layers on top of Zeus to further reduce the carbon footprint of training. Called Chase, this software favors speed when low-carbon energy is available and trades speed for efficiency during peak times, which are likely to require more carbon-intensive energy generation such as coal. Chase took second place in last year's CarbonHack hackathon and will be presented on May 4 at the International Conference on Learning Representations Workshop.
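The carbon-aware policy described above can be sketched in a few lines. This is a toy illustration, not Chase's actual implementation; the threshold and power limits are invented for the example, and a real system would pull live grid carbon-intensity data.

```python
# Toy sketch of carbon-aware knob selection (not Chase's real code):
# when grid carbon intensity is low, run at full power for speed; when it
# is high (e.g. coal-heavy peak hours), drop to an efficient power limit.

LOW_CARBON_THRESHOLD = 200.0  # hypothetical threshold, in gCO2/kWh

def pick_power_limit(carbon_intensity_g_per_kwh: float,
                     fast_limit_w: int = 300,
                     efficient_limit_w: int = 150) -> int:
    if carbon_intensity_g_per_kwh <= LOW_CARBON_THRESHOLD:
        return fast_limit_w       # clean grid: prioritize training speed
    return efficient_limit_w      # dirty grid: prioritize energy efficiency

print(pick_power_limit(120.0))  # plenty of wind/solar -> 300
print(pick_power_limit(450.0))  # carbon-intensive peak -> 150
```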
“Due to large dataset sizes and data regulations, DNN training jobs cannot always be easily migrated to other locations,” said Zhenning Yang, a master's student in computer science and engineering. “Achieving the highest accuracy requires DNNs to be trained on fresh data and deployed to production quickly, so deferring training jobs to greener time frames may not be an option.
“Our goal is to reduce the carbon footprint of DNN training while designing and implementing a solution that does not violate these practical constraints.”
This work was supported in part by National Science Foundation grants CNS-1909067 and CNS-2104243, VMware, and the Kwanjeong Educational Foundation, as well as computing credits provided by CloudLab and Chameleon Cloud.
Research: Zeus: Understanding and Optimizing GPU Energy Consumption for DNN Training
Research: Pursuing Low Carbon Power for Practical and Sustainable DNN Training
Open source software:
Zeus on GitHub
Chase on GitHub
