Keel stabilizes Post-LayerNorm to train language models at extreme depth

Machine Learning


Researchers are grappling with a critical challenge in artificial intelligence: the limits of scaling large language models (LLMs). ByteDance Seed’s Chen Chen and Lai Wei, along with colleagues, show that increasing model width or context length yields diminishing returns, while increasing depth, which is theoretically more powerful, has proven difficult to do reliably. Their work revisits the Post-LayerNorm (Post-LN) formulation, previously abandoned because of training instability, and introduces “Keel,” a new architecture that uses highway-style connections to address vanishing gradients in deep networks. Keel trains stably at depths of more than 1000 layers and consistently outperforms Pre-LN. The results suggest that Post-LN, combined with highway-style connections, offers a surprisingly simple and effective route to deep, scalable LLMs, and possibly to infinite-depth models.

ByteDance researchers have announced Keel, a new Transformer architecture that trains stably at extreme depths of more than 1000 layers and offers greater expressive power than current methods. The team achieved this by revisiting the Post-LayerNorm (Post-LN) formulation, previously abandoned because of instability at scale, and identifying the root cause of its failure: ResNet-style residual paths that cause gradients to vanish in deep networks. The work proposes a fundamental change to LLM architectures and a path beyond the diminishing returns of traditional scaling techniques.

The study finds that the central problem with Post-LN stems from how residual and transformed activations are mixed before normalization, which leads to unstable gradient signals. To solve this, the researchers replaced the standard residual paths with highway-style connections in the Keel architecture. This modification maintains gradient flow and prevents signal loss from the top layer to the bottom layer, allowing stable training at unprecedented depths. Unlike previous attempts to revive Post-LN, Keel does not require specialized initialization or complex optimization techniques, which streamlines training and makes deep LLMs more accessible.
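The difference between these layer layouts can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the exact Keel update rule is not given in this article, so `highway_post_ln_block`, its scalar `gate`, and the toy sublayer `f` are assumptions chosen only to show where normalization sits relative to the skip path.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_block(x, f):
    # Pre-LN: normalize only the branch input; the skip path is left untouched.
    return x + f(layer_norm(x))

def post_ln_block(x, f):
    # Post-LN: mix skip and branch first, then normalize the sum.
    return layer_norm(x + f(x))

def highway_post_ln_block(x, f, gate):
    # Highway-style Post-LN (hypothetical form): a gate in (0, 1) weights the
    # carried input against the transformed branch before normalization.
    # A gate near 1 keeps more of the input; a gate near 0 keeps more of f(x).
    return layer_norm(gate * x + (1.0 - gate) * f(x))

rng = np.random.default_rng(0)
d_model = 16
w = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
f = lambda h: np.tanh(h @ w)  # toy stand-in for an attention or MLP sublayer

x = rng.normal(size=(4, d_model))
y_pre = pre_ln_block(x, f)
y_post = post_ln_block(x, f)
y_hwy = highway_post_ln_block(x, f, gate=0.9)
```

In a real model the gate would more plausibly be learned or input-dependent; a fixed scalar is used here only to keep the sketch short.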
Experiments show that Keel consistently improves perplexity and depth-scaling behavior compared to Pre-LN, the current mainstream approach. The empirical results demonstrate robust training of Keel at depths of more than 1000 layers. Keel maintains smooth convergence even at an aggressive learning rate of 4.5×10⁻³, whereas Pre-LN exhibits severe instability under the same conditions. Furthermore, as shown in Figure 1(c), Keel outperforms Pre-LN at every depth tested, from 64 to 1024 layers. The team’s formal analysis of gradient dynamics shows that the highway-style connections give provable control over gradient magnitude and let signals propagate through depth without loss.

Keel’s impact extends beyond scalability. The study reports significant gains in the model’s expressive power across a range of capabilities. Benchmarks show a +6.6% improvement in multilingual understanding, a +4.4% improvement in general knowledge and common sense, and a +16.5% improvement in math and code, demonstrating Keel’s ability to improve performance in specialized areas. The work opens the possibility of future infinite-depth architectures, which could unlock qualitatively new behaviors in LLMs, and establishes a simple and effective foundation for building highly scalable models.

Keel Transformer effectively stabilizes 1000-layer deep networks

Researchers are running into the limits of scaling large language models (LLMs), observing diminishing returns as model size and context length increase. The authors of this study revisited the Post-LayerNorm (Post-LN) formulation, previously abandoned because of instability at scale, and identified ResNet-style residual paths as the main cause of vanishing gradients in deep networks. To address this, they developed Keel, a Post-LN Transformer that replaces the traditional residual path with a highway-style connection to maintain gradient flow and prevent signal loss from the top layer to the bottom layer. The team designed Keel to train stably at depths of more than 1000 layers, a feat not previously achievable with standard architectures.

In their experiments, the researchers adopted an aggressive training setup with a learning rate of 4.5×10⁻³ to demonstrate Keel’s robustness and convergence speed, as shown in Figure 1(a). Under this setup, Keel maintains smooth convergence while Pre-LN exhibits severe instability. Unlike traditional methods, the approach achieves stable optimization of ultra-deep networks without special initialization or complex optimization tricks. The study also evaluated Keel’s expressiveness across multiple capability areas, including multilingual understanding, general knowledge and common sense, and math and code, and found consistent improvements over Pre-LN, most notably a +16.5% improvement in math and code (Figure 1(b)). The researchers measured average benchmark scores across these domains to quantify the gains. To demonstrate depth scaling, the team trained models with different numbers of layers (64 to 1024) and observed that Keel consistently outperformed Pre-LN, reaching an average benchmark score of 60.9% at 1024 layers, above the Pre-LN score (Figure 1(c)). The work shows that Keel’s architectural changes provide a simple and effective foundation for building highly scalable LLMs, paving the way to infinite-depth architectures.

Keel overcomes vanishing gradients in deep LLMs

Researchers have developed Keel, a new Post-LayerNorm (Post-LN) architecture that enables stable training of large language models (LLMs) more than 1000 layers deep. By focusing on depth scaling as a more promising path forward, the study addresses a limitation of current LLM scaling, where increasing model width and context length yields diminishing returns. Experiments reveal that the central failure mode of Post-LN comes from ResNet-style residual paths, which cause vanishing gradients in deep networks and prevent effective training. The team measured gradient dynamics and formally demonstrated that these residual paths, rather than normalization itself, are the primary cause of vanishing gradients.

To overcome this, Keel replaces the traditional residual path with a highway-style connection that maintains gradient flow and prevents signal loss from the top layer to the bottom layer. Tests show that this change makes Post-LN stable at scale without special initialization or complex optimization tricks, a major advance for LLM architectures. The data show that Keel maintains smooth convergence even at aggressive learning rates, unlike Pre-LN, which exhibits severe instability under the same conditions. Keel also outperforms Pre-LN at every depth tested, from 64 to 1024 layers.

Specifically, the model achieves a +16.5% improvement in the math and code domain compared to the Pre-LN baseline. Measurements confirm that Keel’s architectural changes improve learning efficiency and model expressiveness, enabling stable optimization of ultra-deep networks. The result provides a simple and effective foundation for building highly scalable LLMs and may open the way to infinite-depth architectures. The researchers report that Keel’s highway-style gated connections dynamically adjust the balance between the carry and transform signals, controlling information flow in both the forward and backward passes. This gives provable control over gradient magnitude and lets the signal propagate through depth without loss, which is essential when training very deep networks. The work establishes a practical framework for next-generation LLM scaling, addressing the training stability issues of traditional deep architectures and opening new avenues for improving expressiveness per parameter.
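A rough calculation illustrates why weighting the carry and transform signals matters under Post-LN. The sketch below uses a common back-of-the-envelope heuristic, not the paper's formal analysis: it assumes the carried input and the branch output are uncorrelated with unit variance, so normalization rescales the skip path by `carry / sqrt(carry**2 + transform**2)` at each layer. The function names and the depth-dependent `transform` weight are hypothetical choices for illustration.

```python
import math

def skip_path_scale(carry, transform):
    """Per-layer scale on the skip path after normalization.

    Assumes the carried input and the branch output are uncorrelated with
    unit variance, so the pre-normalization variance is carry**2 + transform**2.
    """
    return carry / math.sqrt(carry ** 2 + transform ** 2)

def scale_through_depth(depth, carry, transform):
    # Compounded skip-path scale after `depth` identical layers.
    return skip_path_scale(carry, transform) ** depth

for depth in (64, 1024):
    # ResNet-style mixing: both signals enter with weight 1.
    resnet_style = scale_through_depth(depth, carry=1.0, transform=1.0)
    # Hypothetical weighting: shrink the transform weight as depth grows.
    weighted = scale_through_depth(depth, carry=1.0, transform=depth ** -0.5)
    print(f"depth={depth:5d}  resnet-style={resnet_style:.3e}  weighted={weighted:.3f}")
```

Under these assumptions, equal weights shrink the skip path by a factor of 1/√2 per layer, which compounds to a vanishingly small value at 1024 layers, while the depth-aware weighting keeps the compounded scale bounded near 0.6 at both depths.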

Keel enables stable training of ultra-deep LLMs

The researchers have shown that increasing the depth of large language models (LLMs) is a promising way to improve expressiveness, which is currently held back by training instability at extreme depths. They revisited the Post-LayerNorm (Post-LN) formulation, which was previously superseded by Pre-LN because of scaling issues, and identified ResNet-style residual paths as the primary cause of vanishing gradients in deep networks. To address this, they introduced Keel, a Post-LN Transformer with highway-style connections that maintain gradient flow and enable stable training at depths of more than 1000 layers. Keel consistently outperforms Pre-LN baselines in perplexity and depth scaling, and maintains its lead after fine-tuning on difficult reasoning benchmarks such as BBH, MMLU-Pro, and CMMLU.

This architectural improvement carries over directly to downstream tasks, allowing the model to adapt to complex instructions without significant performance degradation. The authors acknowledge that training instability is not caused by depth alone, and that wider models may require additional stabilization mechanisms. Because a large amount of training data is currently required for optimal performance, future work will investigate stability under width scaling and the effectiveness of Keel in low-data regimes. These findings suggest that depth, enabled by innovations like Keel, is a viable path to building highly scalable LLMs and potentially to infinite-depth architectures.


