Complete hyperparameter transfer across modules, width, depth, batch, and duration

Hyperparameter tuning can dramatically affect the training stability and final performance of large-scale models. Recent work on neural network parameterizations, such as μP, has made it possible to transfer optimal global hyperparameters across model sizes. This work proposes an empirical practice: search for optimal global base hyperparameters at a small model size, then transfer them to a larger one. We extend these efforts in two key ways. First, to handle scaling along the most important axes, we propose the Complete(d) Parameterization (Complete(d)P), which unifies scaling in width and depth, using an adaptation of CompleteP, with scaling in batch size and training duration. Second, using this parameterization, we investigate per-module hyperparameter optimization and transfer. We characterize the empirical challenges of navigating the high-dimensional hyperparameter landscape and propose practical guidelines for tackling this optimization problem. We show that, with the right parameterization, hyperparameter transfer holds even in the per-module regime. Our study covers a wide range of optimization hyperparameters for modern models: learning rates, AdamW parameters, weight decay, initialization scales, and residual block multipliers. Our experiments show that the transferred per-module hyperparameters significantly speed up the training of large language models.

† University of Cambridge
** Work done while at Apple

Diagram of hyperparameter optimization at the 50M-parameter scale, comparing global and per-module strategies and highlighting the transfer to much larger FLOP budgets under the Complete(d)P parameterization.

Figure 1: We optimize hyperparameters (learning rate, initialization scale, Adam ε, momentum, and weight decay) at a small scale of 50M parameters and 160M tokens using an evolutionary strategy. These hyperparameters (HPs) can be optimized globally, with values shared across the model, or per module (there are 13 different modules, some with additional tuning at each depth). The per-module approach yields better results at the 50M scale: the optimal global HPs require 2.3× more training to reach the same performance. Importantly, the new Complete(d)P parameterization allows direct transfer (without further tuning) to a ~14000× larger FLOP budget.
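
To make the global versus per-module distinction concrete, here is a minimal sketch of a (1+λ) evolutionary search over log learning rates. It is an illustration only and not the authors' search procedure. The module names, the `toy_loss` function (a stand-in for an actual small-scale training run), and all numeric settings are assumptions introduced for this example.

```python
import random

# Hypothetical module names; the source mentions 13 modules but does not list them here.
MODULES = ["embedding", "attention", "mlp", "unembedding"]

# Assumed per-module optimal log10 learning rates for the toy objective below.
TOY_OPTIMUM = {"embedding": -2.0, "attention": -3.0, "mlp": -3.5, "unembedding": -2.5}


def toy_loss(log_lr_by_module):
    """Stand-in for a training run: squared distance to each module's optimum."""
    return sum((log_lr_by_module[m] - TOY_OPTIMUM[m]) ** 2 for m in MODULES)


def expand(vector, per_module):
    """Map a search vector to one log-LR per module (shared value if global)."""
    return {m: vector[i if per_module else 0] for i, m in enumerate(MODULES)}


def evolve(per_module, generations=200, population=8, sigma=0.3, seed=0):
    """(1+lambda) evolution strategy: keep the parent unless a child improves on it."""
    rng = random.Random(seed)
    dim = len(MODULES) if per_module else 1
    parent = [-3.0] * dim
    parent_loss = toy_loss(expand(parent, per_module))
    for _ in range(generations):
        # Sample `population` Gaussian perturbations of the current parent.
        children = [[x + rng.gauss(0.0, sigma) for x in parent] for _ in range(population)]
        best_child = min(children, key=lambda c: toy_loss(expand(c, per_module)))
        child_loss = toy_loss(expand(best_child, per_module))
        if child_loss < parent_loss:
            parent, parent_loss = best_child, child_loss
    return expand(parent, per_module), parent_loss


if __name__ == "__main__":
    global_hps, global_loss = evolve(per_module=False)
    module_hps, module_loss = evolve(per_module=True)
    print("global search loss:    ", global_loss)
    print("per-module search loss:", module_loss)
```

In this toy setting a single shared value cannot match every module's optimum, so the global search is bounded away from zero loss while the per-module search is not. The trade-off is a higher-dimensional search space, which is the challenge the abstract describes.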



