Large language models (LLMs) are among the hottest innovations today. With companies like OpenAI and Microsoft working to release impressive new NLP systems, there's no denying the importance of access to large amounts of clean, high-quality data.
However, according to recent research by Epoch, the data needed to train AI models may soon run short. The team investigated the amount of high-quality data available on the internet. ("High quality" indicates resources such as Wikipedia, as opposed to low-quality data such as social media posts.)
The analysis shows that high-quality data could be exhausted soon, possibly by 2026. Sources of low-quality data would last a few decades longer, but it is clear that the current trend of endlessly scaling models to improve results could soon slow down.
Machine learning (ML) models are known to perform better as the amount of data they are trained on increases. However, simply feeding a model more data is not always the best solution. This is especially true for rare events and niche applications. For example, when training a model to detect a rare disease, there may be very little data to work with, yet you still want the model to become more accurate over time.
This suggests that, if we want to keep the technology from slowing down, we need a different paradigm for building machine learning models, one that is less dependent on the sheer amount of data.
This article describes what such approaches look like and weighs their strengths and weaknesses.
AI model scaling limits
One of the most important challenges in scaling machine learning models is diminishing returns: as a model grows, each increase in size yields a smaller improvement in performance. This is because the more complex a model becomes, the harder it is to optimize and the more prone it is to overfitting. Moreover, larger models require more computational resources and training time, which can make them impractical for real-world applications.
Another significant limitation of scaling is the difficulty of ensuring robustness and generalizability. Robustness is a model's ability to perform well even on noisy or adversarial inputs; generalizability is its ability to perform well on data not seen during training. The more complex a model, the more susceptible it tends to be to adversarial attacks. In addition, large models can memorize their training data instead of learning the underlying patterns, which results in poor generalization.
Interpretability and explainability are essential for understanding how a model makes predictions. As a model becomes more complex, its inner workings grow increasingly opaque, making its decisions difficult to interpret and explain. This lack of transparency is a problem in critical applications such as healthcare and finance, where the decision-making process must be explainable and transparent.
Alternative approaches to building machine learning models
One approach to overcoming the problem is to rethink the distinction between high-quality and low-quality data. According to Swabha Swayamdipta, an ML professor at the University of Southern California, creating more diverse training datasets could help overcome the limitations without reducing quality. Additionally, she said, training a model on the same data more than once could help reduce costs and reuse data more efficiently.
These approaches may delay the problem, but the more times a model is trained on the same data, the more prone it is to overfitting. In the long term, we need effective strategies for overcoming the data problem. So what are the solutions, beyond simply feeding more data to the model?
JEPA (Joint Embedding Predictive Architecture) is a machine learning approach proposed by Yann LeCun that differs from traditional methods in that it learns and predicts in an abstract representation space rather than trying to model every detail of the raw data.
In generative approaches, models are trained to reconstruct or predict the data itself, pixel by pixel or token by token. In JEPA, by contrast, one part of the input (the context) is encoded and used to predict the representation of another part (the target), with both parts mapped into a shared embedding space. Because the prediction happens in this abstract space, the model can ignore irrelevant detail and focus on the structure of the data, which makes the approach well suited to complex, high-dimensional inputs.
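To make this concrete, below is a minimal, illustrative sketch of JEPA-style training in PyTorch. It is not LeCun's implementation: the encoder sizes, the toy data and the simple split of each input into a "visible" and a "hidden" half are assumptions made purely for the example (real systems such as I-JEPA use masked image patches and transformer encoders).

```python
# Minimal JEPA-style training sketch (illustrative only).
# A context encoder predicts the embedding that a target encoder
# produces for the hidden part of the input; the loss lives in
# embedding space, not in data space.
import torch
import torch.nn as nn

DIM = 16  # embedding size (arbitrary choice for the example)

class Encoder(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, x):
        return self.net(x)

context_encoder = Encoder(32, DIM)   # sees the visible part of the input
target_encoder = Encoder(32, DIM)    # sees the hidden (target) part
predictor = nn.Linear(DIM, DIM)      # maps context embedding -> predicted target embedding

# The target encoder receives no gradients and is updated as an
# exponential moving average of the context encoder (a common design choice).
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3
)

for step in range(100):
    x = torch.randn(16, 64)                   # toy batch: 16 samples, 64 features
    context, target = x[:, :32], x[:, 32:]    # split into visible and hidden halves

    pred = predictor(context_encoder(context))  # predicted embedding of the hidden part
    with torch.no_grad():
        tgt = target_encoder(target)            # actual embedding of the hidden part

    loss = nn.functional.mse_loss(pred, tgt)    # compare embeddings, not raw data
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update of the target encoder
    with torch.no_grad():
        for tp, cp in zip(target_encoder.parameters(), context_encoder.parameters()):
            tp.mul_(0.99).add_(0.01 * cp)
```

The key point the sketch illustrates is that the loss is computed between embeddings rather than raw inputs, so the model never has to reconstruct every detail of the data.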
Another approach is to use data augmentation. These techniques modify existing data to create new training examples, for instance by flipping, rotating or cropping images, or adding noise to them. Data augmentation can reduce overfitting and improve model performance, as illustrated in the sketch below.
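Here is a simple, hedged example of an image augmentation pipeline using torchvision; the specific transforms, their parameters and the file name are arbitrary choices made for illustration.

```python
# Illustrative image augmentation pipeline (example parameters only).
import torch
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                      # random flip
    transforms.RandomRotation(degrees=15),                        # random rotation
    transforms.RandomResizedCrop(size=224),                       # random crop, resized to 224x224
    transforms.ToTensor(),                                        # PIL image -> tensor in [0, 1]
    transforms.Lambda(lambda t: t + 0.05 * torch.randn_like(t)),  # add light Gaussian noise
])

# Each call produces a different, randomly modified version of the same image,
# effectively enlarging the training set without collecting new data.
image = Image.open("example.jpg")  # hypothetical file path
augmented = augment(image)
```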
Finally, you can use transfer learning. This involves taking a model pre-trained on a large dataset and fine-tuning it for a new task. It saves time and resources, because the model has already learned useful features, and it can be fine-tuned with a small amount of data, making it a good option when data is scarce.
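The sketch below shows one common way to do this with torchvision: a ResNet-18 pre-trained on ImageNet is frozen and only a new classification head is trained. The five-class task and the learning rate are assumptions made for the example.

```python
# Illustrative transfer-learning sketch: reuse ImageNet features,
# train only a new head for a hypothetical 5-class task.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained backbone

# Freeze the pre-trained layers so only the new head is updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for the new task (5 classes here).
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new layer's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Because only the small final layer is trained, a few hundred labeled examples can be enough to adapt the model to the new task.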
Conclusion
Data augmentation and transfer learning are viable options today, but they do not completely solve the problem. That is why we need to keep looking for effective methods that can help us overcome it in the future. After all, humans only need to observe a few examples to learn something new. Maybe one day we'll invent an AI that can do the same.
What is your opinion? What will your company do if it runs out of data to train its models?
Ivan Smetannikov is the Data Science Team Lead at Serokell.