How does the value of data scale with dataset size? This Stanford paper introduces a new class of scaling laws for individual data points in machine learning

Machine Learning

Machine learning models for vision and language have improved substantially in recent years, driven by larger model sizes and large volumes of high-quality training data. Prior research shows that models improve predictably with more training data, and derives scaling laws that relate error rate to dataset size. While these scaling laws help balance model size against data size, they treat the dataset as a whole, without considering individual training examples. This is a limitation, especially for noisy datasets collected from the web, where some data points are more valuable than others. It is therefore important to understand how each data point or source affects model training.

The paper's related work spans two lines of research. The first, deep learning scaling laws, has become increasingly popular in recent years. These laws are useful in many ways: understanding the tradeoff between adding training data and growing model size, predicting the performance of large-scale models, and comparing learning algorithms at smaller scales. The second line studies how individual data points affect a model's performance. These methods typically score training examples by their marginal contribution; they can identify mislabeled data, filter out low-quality data, weight useful examples, and select promising new data points for active learning.

Researchers at Stanford University take a novel approach by investigating the scaling behavior of the values of individual data points. They find that as the dataset grows, a data point's contribution to model performance decreases predictably, following a log-linear pattern. The rate of decrease varies across data points, however: some points are more useful in small datasets, while others are more useful in large datasets. The authors also introduce maximum-likelihood and amortized estimators that efficiently learn these individual patterns from a small number of noisy observations per data point.
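A log-linear decay like the one described above can be fit per data point with a simple least-squares regression in log space. The sketch below is an illustration under that assumption, not the paper's actual estimator; the names `c` and `alpha`, and the synthetic decay values, are invented for this example.

```python
import numpy as np

# Illustrative sketch: fit a log-linear scaling law
#   c(k) ~ c / k**alpha
# to noisy observations of one data point's marginal contribution
# at several dataset sizes k. Not the paper's code; names invented.

def fit_scaling_law(ks, contributions):
    """Least-squares fit of log c(k) = log c - alpha * log k."""
    ks = np.asarray(ks, dtype=float)
    contributions = np.asarray(contributions, dtype=float)
    # Design matrix: intercept (log c) and slope (alpha, with sign flipped).
    X = np.stack([np.ones_like(ks), -np.log(ks)], axis=1)
    y = np.log(np.abs(contributions))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    log_c, alpha = coef
    return np.exp(log_c), alpha

# Synthetic check: contributions decaying as 0.5 / k**0.8, with
# multiplicative log-normal noise standing in for estimation error.
rng = np.random.default_rng(0)
ks = np.array([100, 200, 500, 1000, 2000])
obs = 0.5 / ks ** 0.8 * np.exp(rng.normal(0.0, 0.02, size=ks.shape))
c_hat, alpha_hat = fit_scaling_law(ks, obs)
```

Fitting in log space turns the power-law decay into a straight line, so a point's amplitude and decay exponent can be recovered from only a handful of noisy measurements.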

To provide evidence for the parametric scaling laws, experiments focus on three model types: logistic regression, SVM, and MLP (specifically, a two-layer ReLU network). These models are tested on three datasets: MiniBooNE, CIFAR-10, and IMDB movie reviews. Pre-trained embeddings (a frozen ResNet-50 for CIFAR-10, BERT for IMDB) are used to speed up training and prevent underfitting. Model performance is measured by cross-entropy loss on a test set of 1,000 samples. For logistic regression, 1,000 data points with 1,000 samples per dataset size k are used; for SVM and MLP, 200 data points with 5,000 samples per k, due to the larger variance of the marginal contributions.
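The quantity being sampled in this setup, a point's marginal contribution at dataset size k, can be estimated by averaging, over random subsets of k − 1 other training points, the change in test loss when the point is added. A minimal sketch, using sklearn's `make_classification` as a synthetic stand-in for the paper's datasets (all names and sizes here are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic binary classification data standing in for MiniBooNE /
# embedded CIFAR-10 / embedded IMDB.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

def marginal_contribution(i, k, n_samples=50, seed=0):
    """Monte Carlo estimate of point i's marginal contribution at size k:
    the expected drop in test cross-entropy when i is added to a random
    subset of k-1 other training points."""
    rng = np.random.default_rng(seed)
    others = np.delete(np.arange(len(X_train)), i)
    deltas = []
    for _ in range(n_samples):
        idx = rng.choice(others, size=k - 1, replace=False)
        clf_without = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        idx_with = np.append(idx, i)
        clf_with = LogisticRegression(max_iter=1000).fit(X_train[idx_with], y_train[idx_with])
        loss_without = log_loss(y_test, clf_without.predict_proba(X_test))
        loss_with = log_loss(y_test, clf_with.predict_proba(X_test))
        deltas.append(loss_without - loss_with)
    return float(np.mean(deltas))

delta = marginal_contribution(0, k=100, n_samples=5)
```

Retraining two models per sample is what makes this measurement expensive, which is why the paper uses relatively few samples per k and then relies on the parametric law to smooth the noise.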

The proposed method is evaluated by how accurately it predicts the marginal contribution at each dataset size. For example, on the IMDB dataset with logistic regression, the method accurately predicts the expected value for dataset sizes ranging from k = 100 to k = 1000. For a systematic evaluation, the authors report the accuracy of the predicted scaling law at different dataset sizes, for both versions of the likelihood-based estimator and varying numbers of samples. In a more detailed breakdown of these results, the R² score decreases when predictions are extrapolated beyond k = 2500, while the correlation and rank correlation with the true expected value remain high.
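An evaluation of this shape can be mimicked end to end on synthetic data: fit each point's log-linear law at small k, extrapolate to a larger k, and score the predictions with R² and rank correlation. Everything below (the per-point amplitudes, exponents, and noise level) is invented for illustration and is not the paper's data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_points = 200
c = rng.uniform(0.1, 1.0, n_points)      # per-point amplitude (synthetic)
alpha = rng.uniform(0.5, 1.5, n_points)  # per-point decay exponent (synthetic)

k_fit = np.array([100, 250, 500, 1000])  # sizes used for fitting
k_eval = 2500                            # extrapolation target

# Noisy observations of each point's contribution at the fitting sizes.
obs = c[:, None] / k_fit[None, :] ** alpha[:, None]
obs *= np.exp(rng.normal(0.0, 0.1, obs.shape))

# Per-point least-squares fit in log space (all points at once).
X = np.stack([np.ones_like(k_fit, dtype=float), -np.log(k_fit)], axis=1)
coef, *_ = np.linalg.lstsq(X, np.log(obs).T, rcond=None)
pred = np.exp(coef[0]) / k_eval ** coef[1]

# Score extrapolated predictions against the noise-free ground truth.
true = c / k_eval ** alpha
r2 = 1 - np.sum((true - pred) ** 2) / np.sum((true - true.mean()) ** 2)
rho, _ = spearmanr(true, pred)
```

This mirrors the qualitative finding in the paper: rank correlation stays high under extrapolation even as pointwise prediction error (and hence R²) degrades further beyond the fitting range.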

In conclusion, the Stanford researchers developed a new method by examining how the values of individual data points change with scale. They found evidence of a simple pattern that holds across a variety of datasets and model types, and confirmed the scaling law experimentally by showing a clear log-linear trend and testing the accuracy of predicted contributions across dataset sizes. The scaling law can also predict behavior on datasets larger than those initially tested. Because measuring this behavior across an entire training dataset is costly, the researchers developed estimators that recover the scaling parameters from a small number of noisy observations per data point.


Check out the paper. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final-year undergraduate student at the Indian Institute of Technology Kharagpur. A technology enthusiast, he studies practical applications of AI with a focus on its real-world impact, and aims to explain complex AI concepts in a clear and understandable manner.

๐Ÿ Join the fastest growing AI research newsletter, read by researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft & more…

Source link
