Optimal Splitting of Language Models from Mixtures to Specialized Domains

This paper was accepted at the ICLR 2026 Workshop on Navigating and Addressing Data Problems for Foundation Models.

Language models achieve strong performance on a variety of knowledge, language, and reasoning tasks thanks to the scale and diversity of the available pretraining data. The standard training recipe is a two-stage paradigm: a model is first pretrained on the full corpus and then specialized on a high-quality, domain-specific subset of that corpus. In the multi-domain setting, this means running continued pretraining for a separate model on each specialized domain, which we refer to as split model training. We propose a method for pretraining multiple models independently on a general pretraining corpus, and we use scaling laws to determine the optimal allocation of compute between pretraining and continued pretraining. Our approach accurately predicts the loss of a model of size N trained with D pretraining tokens and D′ specialization tokens, and it extrapolates to larger model sizes and token counts. Applied to language model training, our approach consistently improves performance on commonsense knowledge and reasoning benchmarks across a range of model sizes and compute budgets.
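
To make the allocation question concrete, the following is a minimal sketch of how a fitted loss predictor L(N, D, D′) could be used to choose the split between pretraining and continued pretraining. Everything specific in it is an assumption for illustration and is not taken from the paper: the additive power-law form, the coefficient values in `COEF`, the function names `predicted_loss` and `best_split`, and the token accounting, in which one shared pretraining run consumes D tokens and each of K domain models consumes D′ more.

```python
def predicted_loss(n_params, d_pretrain, d_special, coef):
    """Hypothetical additive power law in N, D and D'.

    The functional form is an illustrative assumption, not the
    scaling law fitted in the paper.
    """
    e, a, alpha, b, beta, g, gamma = coef
    return (e
            + a * n_params ** -alpha
            + b * d_pretrain ** -beta
            + g * d_special ** -gamma)


def best_split(n_params, total_tokens, n_domains, coef, steps=999):
    """Grid-search the fraction of the token budget spent on pretraining.

    Assumed accounting: total_tokens = D + n_domains * D', i.e. one
    shared pretraining run followed by one continued-pretraining run
    per domain. For a fixed model size, compute is proportional to
    the number of tokens processed.
    """
    best = None
    for i in range(1, steps + 1):
        frac = i / (steps + 1)  # strictly between 0 and 1
        d_pretrain = frac * total_tokens
        d_special = (1.0 - frac) * total_tokens / n_domains
        loss = predicted_loss(n_params, d_pretrain, d_special, coef)
        if best is None or loss < best[1]:
            best = (frac, loss)
    return best


# Made-up coefficients (E, A, alpha, B, beta, G, gamma), for illustration only.
COEF = (1.7, 400.0, 0.34, 410.0, 0.28, 60.0, 0.25)

frac, loss = best_split(n_params=1e9, total_tokens=1e11, n_domains=4, coef=COEF)
print(f"pretraining fraction: {frac:.3f}, predicted loss: {loss:.4f}")
```

In practice the coefficients would be fitted to losses measured on small training runs, and the same search would then be repeated at the larger model size and budget of interest.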