Artificial intelligence has transformed many sectors, from healthcare to retail to entertainment to the arts. But new research suggests we may have reached a tipping point where AI learns from AI-generated content.
This AI ouroboros (a snake eating its own tail) could end very badly. Researchers at universities in the UK and Canada have warned of what they call “model collapse”, a degenerative process that could disconnect AI from reality altogether.
In a paper entitled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” researchers from the Universities of Cambridge, Oxford, and Toronto and from Imperial College London wrote that model collapse occurs when generated data ends up polluting the training set of the next generation of models.
“Being trained on polluted data, they then mis-perceive reality,” they write.
In other words, the extensive AI-generated content published online can be sucked into AI systems, resulting in distortions and inaccuracies.
This problem has been found in a variety of learned generative models and tools, including large language models (LLMs), variational autoencoders, and Gaussian mixture models. Over time, a model begins to “forget the true underlying data distribution”, and its picture of reality becomes so skewed that it no longer resembles real-world data.
There are already examples of machine learning models being trained on AI-generated data. For example, large language models (LLMs) have been deliberately trained on output from GPT-4. Similarly, DeviantArt, an online platform for artists, allows AI-generated artwork to be published and used as training data for new AI models.
Much like making a copy of a copy over and over again, the researchers say, such practices can lead to more cases of model collapse.
Given the severe consequences of model collapse, access to the original data distribution is critical. AI models need real-world human-generated data to accurately understand and simulate our world.
How to prevent model collapse
According to the paper, model collapse has two main causes. The primary one is “statistical approximation error”, which arises because each model is trained on a finite number of data samples. The secondary one is “functional approximation error”, which stems from the model itself being unable to represent the true distribution exactly. These errors compound from generation to generation, producing a cascade of worsening inaccuracies.
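The compounding of statistical approximation error can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experiment: it assumes the "model" is a single Gaussian, that each generation is fitted only to a small sample drawn from the previous generation, and the function name and parameter values are chosen purely for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation for each generation, starting
    with the true value (1.0) of the original "human" distribution.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the real data distribution (assumed)
    history = [sigma]
    for _ in range(generations):
        # The next model only ever sees a finite sample from the current one.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Maximum-likelihood fit: sample mean and population standard deviation.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    stds = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: fitted std = {stds[gen]:.3g}")
```

Because every fit is made from only a handful of samples, small estimation errors are inherited by the next generation instead of averaging out. In this setup the fitted spread tends to shrink toward zero, a simple analogue of a model forgetting the tails of the original distribution.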
The paper also points to a “first-mover advantage” in training AI models. If we can maintain access to the original human-generated data sources, we may be able to prevent detrimental shifts in the data distribution, and with them model collapse.
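The value of keeping access to original data can be sketched with the same kind of toy model. The example below is an illustration under stated assumptions, not the researchers' method: the "model" is a single Gaussian, the real data is assumed to follow a standard normal distribution, and `real_fraction` is a hypothetical parameter controlling how much of each generation's training set is fresh real data rather than output from the previous model.

```python
import random
import statistics


def final_spread(generations=200, sample_size=10, real_fraction=0.0, seed=0):
    """Return the fitted standard deviation after the last generation.

    Each generation is fitted to a mix of fresh real data (assumed to be
    standard normal) and synthetic data sampled from the previous fit.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(sample_size * real_fraction)
    for _ in range(generations):
        # Fresh samples from the original, human-generated source.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        # Samples produced by the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_real)]
        data = real + synthetic
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma


if __name__ == "__main__":
    print("synthetic only :", final_spread(real_fraction=0.0))
    print("half real data :", final_spread(real_fraction=0.5))
```

With only synthetic data the fitted spread drifts toward zero, while a steady share of real samples keeps each new fit anchored to the original distribution. How large that share would need to be for real generative models is not something this sketch can answer.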
However, distinguishing AI-generated content at scale is a difficult task and may require community-wide coordination.
Ultimately, the findings underline the importance of data integrity and of human input: an AI is only as good as the data it is trained on, and the explosion of AI-generated content could prove a double-edged sword for the industry. It is a case of “garbage in, garbage out”. AI trained on AI content will produce many very smart but “delusional” machines.
How is that for an ironic twist? Our machine offspring, learning more from each other than from us, become “delusional”. Next we will have to deal with a delusional, adolescent ChatGPT.
