Advances in transformer learning theory characterize generalization risk under data scaling

Machine Learning


Despite steady improvements driven by increased computing power, understanding how large-scale language models learn and generalize remains a central challenge in artificial intelligence. Chiwun Yang and colleagues at Sun Yat-sen University present a unified theoretical framework that goes beyond purely empirical observations to elucidate the learning process within transformer networks. The researchers modeled transformer learning as a continuous system and rigorously analyzed how the model improves during training on realistic data, allowing them to predict the relationship between computational resources and final performance. The work establishes how excess risk, a measure of learning error, changes with resource scale, reveals a sharp transition between rapid initial improvement and slow power-law decay, and provides independent scaling laws for model size, training time, and dataset size.

This research investigates the processes that link optimization to kernel behavior. Unlike previous analyses based on simplified models, the team rigorously examines stochastic gradient descent (SGD) training of multilayer transformers on sequence-to-sequence data, closely mirroring real-world conditions. The analysis focuses specifically on the optimization process itself, characterizing how the generalization error converges to an irreducible risk as computational resources scale with the amount of data. The researchers establish theoretical upper bounds on the excess risk and identify clear stage transitions in performance: excess risk initially decreases exponentially with computational cost, then transitions to a power-law decay once a certain resource threshold is reached.
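For reference, "excess risk" conventionally denotes the gap between a trained model's population risk and the smallest achievable (irreducible) risk; in generic notation, not the paper's own symbols, it reads:

```latex
% Conventional definition of excess risk (generic notation, not paper-specific):
% population risk of the model at training time t, minus the best achievable risk.
\[
\mathcal{R}_{\mathrm{excess}}(f_t) \;=\; \mathcal{R}(f_t) - \mathcal{R}^{*},
\qquad
\mathcal{R}(f) = \mathbb{E}_{(x,y)}\bigl[\ell\bigl(f(x),\,y\bigr)\bigr],
\qquad
\mathcal{R}^{*} = \inf_{f}\,\mathcal{R}(f).
\]
```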

Scaling laws for machine learning training

Scientists have developed a theoretical account of how the performance of machine learning models changes as resources such as data, model size, and computing power grow. These scaling laws reveal that training performance is not a simple linear function of added resources but instead passes through distinct regimes in which a single resource (compute, data, or model capacity) becomes the limiting factor. A constant, denoted ξ, represents the inherent difficulty of the learning task and governs how quickly performance improves with more resources, while the Lambert W function captures the non-trivial relationship among those resources in the resulting bounds. The team presents a theorem describing how generalization error scales with data size, model size, and computation under these different conditions.
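The article does not reproduce the theorem itself; as a rough sketch of the kind of two-branch bound it describes, with placeholder constants A, B and threshold C₀ that are not the paper's notation, it might take the following form, where C stands for total compute and W is the Lambert W function:

```latex
% Illustrative two-branch bound (requires amsmath; placeholder constants A, B
% and threshold C_0; a sketch of the described shape, not the paper's exact statement).
\[
\mathcal{R}_{\mathrm{excess}}(C) \;\lesssim\;
\begin{cases}
A\, e^{-\xi C}, & C < C_0 \quad \text{(under-computation regime)},\\[4pt]
B\, \dfrac{W(\xi C)}{\xi C}, & C \ge C_0 \quad \text{(data-limited regime)},
\end{cases}
\qquad \text{where } W(z)\, e^{W(z)} = z.
\]
```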

The central result divides the scaling trend into two stages: an initial under-computation phase in which the error decreases exponentially as the amount of computation grows, and a subsequent data-limited phase in which the error follows a slower decay described by a formula involving the Lambert W function. The theorem also shows how to optimize performance by adjusting data, model size, and compute, with the greatest improvement coming from increasing whichever resource is currently the bottleneck. These scaling laws offer guidance for allocating limited resources to achieve the best possible performance, for predicting how performance will improve as resources grow, and for identifying bottlenecks in the training process. Knowing which regime a model is in lets researchers focus on the most impactful resource, and the analysis highlights the importance of balancing data, model size, and compute for efficient machine learning systems.
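As a purely illustrative sketch of this two-stage behaviour (the functional forms, constants, and threshold below are hypothetical stand-ins, not quantities from the paper), one can evaluate a toy excess-risk curve that decays exponentially until a compute threshold and then follows a slower, Lambert-W-shaped tail:

```python
# Toy two-stage scaling curve: exponential decay while compute is the
# bottleneck, then a slower Lambert-W-shaped tail once data becomes the
# bottleneck. All constants are hypothetical placeholders, not values
# from the paper.
import numpy as np
from scipy.special import lambertw

def toy_excess_risk(compute, data_threshold=1e3, xi=0.5, a=1.0):
    """Schematic excess risk as a function of total compute."""
    compute = np.asarray(compute, dtype=float)
    risk = np.empty_like(compute)
    under = compute < data_threshold              # under-computation regime
    risk[under] = a * np.exp(-xi * compute[under] / data_threshold)

    # Data-limited regime: a tail shaped by the Lambert W function
    # (W solves w * exp(w) = z), scaled so the two branches meet at the threshold.
    z = compute[~under] / data_threshold
    scale = a * np.exp(-xi) / lambertw(1.0).real
    risk[~under] = scale * lambertw(z).real / z
    return risk

if __name__ == "__main__":
    for c in (1e2, 5e2, 1e3, 1e4, 1e5):
        print(f"compute={c:>9.0f}   toy excess risk={toy_excess_risk([c])[0]:.4f}")
```

Plotted on log-log axes, such a curve shows the sharp bend the authors describe as a stage transition; the actual bound in the paper has its own constants and exact functional form.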

Transformer learning, scaling laws, and optimization dynamics

Beyond purely empirical observations, the researchers have established a rigorous mathematical framework for understanding how increasing computational resources improves the performance of transformer-based language models. The study formalizes the learning process as an ordinary differential equation and approximates it with a kernel method, allowing a detailed analysis of stochastic gradient descent training. The analysis shows that the generalization error converges to an irreducible minimum as computational resources scale with the data, particularly during the optimization phase, and the team identifies a clear phase transition that governs this process. This integrated framework yields independent scaling laws for model size, training time, and dataset size, showing how each variable separately controls the upper bound on generalization performance.
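As a generic illustration of this style of analysis (standard gradient-flow and kernel notation, not necessarily the paper's exact operators), training is treated as the continuous-time limit of SGD, and the model's outputs then evolve under a kernel induced by the network:

```latex
% Gradient-flow view of training and its kernel approximation
% (generic notation; the second equation assumes a squared loss, and the
% paper's precise formulation may differ).
\[
\frac{d\theta(t)}{dt} = -\nabla_{\theta}\, \mathcal{L}\bigl(\theta(t)\bigr),
\qquad
\frac{d f_t(x)}{dt} = -\frac{1}{n}\sum_{i=1}^{n} K(x, x_i)\,\bigl(f_t(x_i) - y_i\bigr),
\]
\[
K(x, x') = \nabla_{\theta} f_{\theta}(x)^{\top}\, \nabla_{\theta} f_{\theta}(x').
\]
```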

The researchers established a theoretical upper bound on the excess risk, characterized by this phase transition, and confirmed the stability of the process through careful mathematical analysis. The analysis further shows that model performance is strongly related to layer width and to the weight-update radius, with larger widths and smaller update radii yielding faster convergence. It also establishes that, given an arbitrarily large dataset and unlimited training time, the approximation error, which measures the model's ability to represent the target function, is bounded by a quantity inversely proportional to model size. The findings provide a foundation for designing and training future language models, give a quantifiable upper bound on generalization error, and highlight the interplay between model size, dataset size, and training time.
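In symbols, that last statement has the following schematic shape, with m standing for model size and f* for the target function; the precise norms and constants are those given in the paper:

```latex
% Schematic form of the approximation-error statement: with unlimited data and
% training time, the residual error shrinks inversely with the model size m
% (C_approx is a problem-dependent constant; notation is illustrative).
\[
\inf_{\theta}\; \mathbb{E}_{x}\Bigl[\bigl(f_{\theta}(x) - f^{*}(x)\bigr)^{2}\Bigr]
\;\le\; \frac{C_{\mathrm{approx}}}{m}.
\]
```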

Scaling laws and generalization in transformers

This study establishes a comprehensive theoretical framework for understanding the scaling laws observed in large-scale language models, focusing on the relationship between computational resources and model performance. By modeling the training of the transformer architecture as a mathematical system, the scientists show how the generalization error converges as computational power and data scale, identifying two distinct stages of improvement: an initial exponential decay followed by a power-law decay. The team rigorously characterizes the excess risk and derives upper bounds governed by both computational cost and data characteristics. The study also clarifies the independent roles of model size, training time, and dataset size in determining performance limits, offering insight into how each factor contributes to overall model capability. The analysis reveals that simply increasing model size does not guarantee continued improvement, especially once the model's capacity substantially exceeds the complexity of the data, indicating a point of diminishing returns. Although the findings support the general trend of improved performance with increasing resources, the authors acknowledge limitations related to dataset noise and model capacity and suggest directions for future research on optimizing resource allocation for large-scale language model development.


