How to Optimize Deep Learning Models Faster


AI and its associated terminology are familiar by now. Most people have heard of “neural networks,” and “CNN” may call to mind “convolutional neural network” rather than a news organization. If you're even remotely interested in AI, you may also know AlexNet, the pioneering CNN architecture that revolutionized image recognition and deep learning in 2012.

What is less well known is the role of optimizers, or optimization algorithms, in improving the performance of AI models. For example, a computer vision model needs an optimizer so that, given visual input, it can make the correct “prediction,” i.e. identify a picture of a panda as a “panda” and not a “bear” or a “koala.”

“Panda” is the ground truth, what the AI model should predict correctly every time, and the difference between the AI prediction and the ground truth is quantified into a number called the training loss.
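To make the idea concrete, here is a minimal Python sketch of how one common loss function, cross-entropy, turns a prediction into a single number. The class names and probabilities are invented for illustration and are not from the research described here:

```python
import math

# Hypothetical class probabilities an image classifier might output
# for a photo whose ground-truth label is "panda".
classes = ["panda", "bear", "koala"]
predicted_probs = [0.70, 0.20, 0.10]  # model's confidence per class
ground_truth = "panda"

# Cross-entropy loss: -log(probability assigned to the true class).
# A confident, correct prediction gives a loss near 0; a wrong or
# unsure prediction gives a larger loss.
true_index = classes.index(ground_truth)
loss = -math.log(predicted_probs[true_index])
print(f"training loss: {loss:.4f}")  # -log(0.70) is about 0.357
```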

“Given a task, an AI model takes an input sample and outputs its prediction. Without training, an AI model often fails to predict correctly, resulting in poor performance on the task,” explained Zhou Pan, an assistant professor of computer science at SMU. “The optimizer updates the parameters of the AI model so that it can make the correct prediction.”

“The optimizer’s main role is to feed training samples into the AI model, calculate the training loss, i.e. the difference between the model’s predictions and the ground truth, and finally tune the model parameters to minimize the training loss.”
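That loop can be sketched in a few lines of PyTorch. The model, data, and hyperparameters below are placeholders for illustration, not the setup used in Professor Zhou’s work:

```python
import torch
import torch.nn as nn

# Placeholder model and data: a tiny classifier on random inputs,
# standing in for a real vision model and image batches.
model = nn.Linear(16, 3)                     # 16 features -> 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)                  # a batch of 8 samples
labels = torch.randint(0, 3, (8,))           # their ground-truth classes

# One optimization step: predict, measure loss, update parameters.
optimizer.zero_grad()                # clear gradients from the previous step
predictions = model(inputs)          # feed training samples into the model
loss = loss_fn(predictions, labels)  # compare predictions to ground truth
loss.backward()                      # compute gradients of the loss
optimizer.step()                     # tune parameters to reduce the loss
```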

Resolving Overshoot

Different types of deep learning networks require different optimization tools, and the best one is often selected only after multiple costly and time-consuming trials.

Simply put, an optimizer has done its job when it drives the model’s training loss to the lowest point of the roughly V-shaped loss curve (often called the convergence point). This is the point at which the model has learned the optimal set of parameters, and further training iterations will not significantly improve performance on the current task.

The main obstacle to efficient optimization is what is known as the “overshoot problem,” where the optimizer’s updates jump past the minimum to the other side of the V-curve, and so must be readjusted to bring the model back onto the contour of the curve.
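A toy example makes the overshoot problem visible. The sketch below runs plain gradient descent on the one-dimensional curve loss(w) = w², whose minimum (the convergence point) sits at w = 0; the step sizes are chosen purely for illustration:

```python
# Gradient descent on a simple V-like curve, loss(w) = w**2.
# The gradient is 2*w, and the minimum sits at w = 0.
def descend(w, lr, steps=5):
    path = [w]
    for _ in range(steps):
        w = w - lr * 2 * w   # step against the gradient
        path.append(w)
    return path

# A moderate step size walks steadily down one side of the curve:
print(descend(w=1.0, lr=0.3))   # 1.0, 0.4, 0.16, ... -> approaches 0

# An overly large step size overshoots: each update jumps past the
# minimum to the other side of the curve, so the iterates zig-zag
# across it and need extra steps to settle.
print(descend(w=1.0, lr=0.9))   # 1.0, -0.8, 0.64, -0.512, ...
```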

Professor Zhou’s latest project, “Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models,” attempts to solve the overshoot problem.

He explains: “The Adan optimizer can speed up the process of finding good parameters for a model. Like other optimizers, at each training iteration Adan feeds the data into the model, then calculates the training loss, and finally calculates the gradients of the model parameters.”

“But when we use gradients to update parameters, we first run a step to update the model parameters and check if the current model parameter updates are good. If they are, we update the model parameters in a larger step. If not, we run smaller steps to update the parameters slowly. This ensures that the parameter updates are always correct and results in a faster convergence speed.”
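The sketch below implements the update rule from the Adan paper (arXiv:2208.06677) as a plain NumPy function. It is a simplified reading of the published algorithm, not the authors’ official implementation; details such as moment initialization and the exact weight-decay form are glossed over:

```python
import numpy as np

def adan_step(theta, grad, prev_grad, state, lr=1e-3,
              betas=(0.02, 0.08, 0.01), eps=1e-8, weight_decay=0.0):
    """One Adan-style update of parameter vector `theta` (simplified).

    Adan keeps a momentum of gradients (m), a momentum of gradient
    *differences* (v) -- the Nesterov-style lookahead that checks how
    the update direction is changing -- and a second-moment estimate
    (n) that scales the step size per coordinate.
    """
    b1, b2, b3 = betas
    diff = grad - prev_grad
    state["m"] = (1 - b1) * state["m"] + b1 * grad
    state["v"] = (1 - b2) * state["v"] + b2 * diff
    state["n"] = (1 - b3) * state["n"] + b3 * (grad + (1 - b2) * diff) ** 2

    step = lr / (np.sqrt(state["n"]) + eps)      # adaptive per-coordinate step
    theta = theta - step * (state["m"] + (1 - b2) * state["v"])
    return theta / (1 + lr * weight_decay)       # decoupled weight decay

# Usage on a toy quadratic loss(theta) = ||theta||^2 (gradient 2*theta):
theta = np.array([1.0, -2.0])
state = {"m": np.zeros(2), "v": np.zeros(2), "n": np.zeros(2)}
prev_grad = np.zeros(2)
for _ in range(100):
    grad = 2 * theta
    theta = adan_step(theta, grad, prev_grad, state, lr=0.1)
    prev_grad = grad
print(theta)  # moves toward the minimum at the origin
```

The gradient-difference term v is what lets the method anticipate how the landscape is curving and shrink or enlarge its steps accordingly, which is the adaptive behavior Professor Zhou describes above.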

A Breakthrough Achievement

Improvement in neural network training can be measured in epochs, where an epoch is a complete pass or cycle through the training dataset.
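In code terms, the distinction between epochs and iterations looks like this (the numbers are illustrative):

```python
# One epoch is a full pass over the training set; one iteration
# processes a single batch of it.
dataset_size = 10_000
batch_size = 100
iterations_per_epoch = dataset_size // batch_size   # 100 batches per pass

for epoch in range(3):                # 3 complete passes over the data
    for batch in range(iterations_per_epoch):
        pass                          # forward, loss, backward, update
    print(f"epoch {epoch + 1} done: {iterations_per_epoch} iterations")
```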

Professor Zhou expects Adan to outperform existing state-of-the-art (SoTA) optimizers in key deep learning tasks, including vision, language and reinforcement learning, which powered AlphaGo, the AI model that in 2017 defeated the world's top-ranked human player at the ancient board game Go.

“Overall, Adan is able to achieve comparable performance to the SoTA optimizer while requiring half the number of training iterations,” explains Professor Zhou.

“For vision tasks, with ViT and Swin models on supervised image classification, Adan needs only 150 training epochs to match the performance of the SoTA optimizer AdamW trained for 300 epochs. With MAE models on self-supervised image classification, Adan needs 800 training epochs to match AdamW trained for 1,600 epochs.”

“For language tasks, on GPT-2, Adan can use 150,000 training iterations to achieve performance similar to the SoTA optimizer Adam trained for 300,000 training iterations. On Transformer-XL, Adan can use 100,000 training iterations to match Adam trained for 200,000 training iterations.”

For reinforcement learning (RL) tasks, Adan was evaluated on four simulated locomotion games: Ant, HalfCheetah, Humanoid, and Walker2d, sometimes referred to simply as the MuJoCo games. These games require controlling a robot body to complete activities such as walking and running, stably and robustly, in a 3D environment.

“In RL, using the same training iterations, Adan consistently achieves higher performance than the SoTA optimizer Adam on the four gaming tasks tested,” Professor Zhou said.

Provided by Singapore Management University

Citation: How to Optimize Deep Learning Models Faster (May 31, 2024) Retrieved June 1, 2024 from https://techxplore.com/news/2024-05-faster-optimize-deep.html

This document is subject to copyright. It may not be reproduced without written permission, except for fair dealing for the purposes of personal study or research. The content is provided for informational purposes only.




