
In machine learning, a central goal is to improve the performance of large language models (LLMs) while reducing the cost of training them. A major lever is the quality of pre-training data, which directly affects how efficiently and effectively a model learns. One of the leading ways to raise data quality is data pruning: selecting a high-quality subset of a large dataset and training the model on it. Pruning protects the model from noisy and irrelevant data, streamlining training and improving overall model performance.
A core challenge in training LLMs is that their datasets are large and noisy. Low-quality data can significantly degrade model performance, so methods that filter it out while retaining only the most relevant, high-quality information are essential for training models that are both accurate and efficient.
Traditional data pruning relies on simple rule-based filters and basic classifiers to identify high-quality samples (a minimal example of such rule-based filtering is sketched below). These methods are useful but struggle with large and diverse datasets. More advanced approaches use neural network-based heuristics that assess data quality through metrics such as feature similarity and sample difficulty; however, they can be computationally expensive and may not perform consistently across data domains, motivating more efficient and universally applicable methods.
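The checks and thresholds in the following sketch are illustrative assumptions rather than any particular system's rules; it simply shows the flavor of traditional rule-based filtering.

```python
def passes_basic_filters(text: str) -> bool:
    """Illustrative rule-based quality filter; all thresholds are assumptions."""
    words = text.split()
    if len(words) < 5:                       # drop very short fragments
        return False
    if len(set(words)) / len(words) < 0.3:   # drop highly repetitive text
        return False
    # Require mostly alphabetic/whitespace characters to weed out markup junk.
    clean = sum(c.isalpha() or c.isspace() for c in text)
    return clean / max(len(text), 1) > 0.6

print(passes_basic_filters("The quick brown fox jumps over the lazy dog."))  # True
print(passes_basic_filters("%%% $$$ ###"))                                   # False
```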
Researchers from Databricks, MIT, and DatologyAI introduced an innovative approach to data pruning that uses a small reference model to compute the perplexity of text samples. A small model is first trained on a random subset of the data and then used to score the perplexity of every sample; perplexity here measures how accurately a probabilistic model predicts a sample, and a lower score is taken to indicate higher-quality data. By keeping only the samples with the lowest perplexity scores, the researchers prune the dataset down to its most relevant data, improving the performance of the larger model trained on it.
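As a concrete illustration of the scoring step, here is a minimal sketch that computes per-sample perplexity with a small causal language model via Hugging Face Transformers. This is not the authors' code: `gpt2` merely stands in for the small reference model, which the paper trains from scratch on a random data subset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for the small reference model; the actual method
# trains its own small model (e.g., ~125M parameters) on a random subset.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity = exp(mean next-token negative log-likelihood)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    # Supplying labels makes the model return the mean cross-entropy loss
    # over shifted next-token predictions.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```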
The proposed method splits the dataset into a training set and a validation set for a small reference model. The reference model is trained with the standard next-token prediction objective and then computes a perplexity score for every sample in the dataset. The dataset is pruned by keeping samples within a chosen perplexity range; under a low selection criterion, for example, the lowest-perplexity samples are retained. The final, larger model is then trained on this pruned, higher-quality dataset. The method's effectiveness is demonstrated on different dataset compositions, including the Pile, which consists of diverse curated domains, and Dolma, a dataset derived primarily from web scrapes.
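A minimal sketch of that selection step follows; the function name, `keep_frac`, and the "medium" band logic are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def prune_by_perplexity(samples, scores, keep_frac=0.5, criterion="low"):
    """Keep a fraction of samples chosen by reference-model perplexity.

    criterion="low" keeps the lowest-perplexity samples, "high" the highest,
    and "medium" a band around the median; keep_frac=0.5 is an assumption.
    """
    order = np.argsort(scores)        # indices sorted by ascending perplexity
    k = int(len(samples) * keep_frac)
    if criterion == "low":
        kept = order[:k]
    elif criterion == "high":
        kept = order[-k:]
    else:                             # "medium"
        start = (len(order) - k) // 2
        kept = order[start:start + k]
    return [samples[i] for i in kept]

# Example: keep the lowest-perplexity half of a toy corpus.
docs = ["clean, well-formed text", "a$f9 zzzz zzzz", "another readable doc", "%%%%"]
ppls = [12.3, 950.0, 15.7, 2400.0]
print(prune_by_perplexity(docs, ppls, keep_frac=0.5, criterion="low"))
```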
Perplexity-based data pruning significantly improves LLM performance on downstream tasks. For example, pruning based on perplexity scores computed by a 125-million-parameter reference model improved the average downstream task performance of a 3-billion-parameter model by up to 2.04%. It also reduced the number of pre-training steps needed to match baseline performance by up to 1.45x. The method remained effective across a variety of settings, including over-trained and data-constrained regimes: in the over-training scenario, the absolute gains in average downstream accuracy were similar for compute-optimal and over-trained models, demonstrating the method's robustness.
This work highlights the usefulness of small reference models in perplexity-based data pruning and marks a major step forward in optimizing LLM training. By leveraging small models to filter out low-quality data, researchers can improve both model performance and training efficiency: average downstream performance improved by 1.89 on the Pile and by 1.51 on Dolma when models were trained for the compute-optimal duration. As a technique that raises the performance of large language models while reducing the computational resources required, it is a valuable addition to the modern data researcher's toolkit.

In conclusion, this work presents a novel and effective data pruning method that uses a small reference model to compute perplexity. The approach improves the performance and efficiency of large language models by ensuring high-quality pre-training data. Its robustness across different data compositions and training regimes highlights its potential as a key technique in modern data curation: by optimizing data quality, researchers can achieve better model performance with fewer resources, making perplexity-based data pruning a valuable tool for future advances in machine learning.
Check out the paper. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His latest endeavor is the launch of Marktechpost, an Artificial Intelligence media platform. The platform stands out for its in-depth coverage of Machine Learning and Deep Learning news in a manner that is technically accurate yet easily understandable to a wide audience. The platform has gained popularity among its audience with over 2 million views every month.
