This paper was accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Issues in Fundamental Models.
Language models achieve strong performance on a variety of knowledge, language, and reasoning tasks thanks to the scale and diversity of available pretraining data. The standard training recipe is a two-stage paradigm: first pretrain on the full corpus, then specialize on a high-quality, domain-specific subset of that corpus. In a multi-domain setting, this means continuing pretraining of a separate model on each specialized domain, which is referred to as split model training. We propose a method for pretraining multiple models independently on a common pretraining corpus and use scaling laws to determine the optimal compute allocation between pretraining and continued pretraining. Our approach accurately predicts the loss of a model of size N trained with D pretraining tokens and D’ specialization tokens, and extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on common knowledge and reasoning benchmarks across a range of model sizes and compute budgets.
- † National University of Singapore, Singapore
