This article is part of our coverage of the latest AI research.
Generative artificial intelligence has made it possible for everyone to be more creative. Large language models (LLMs) like ChatGPT can generate high-quality essays and articles. Diffusion models such as Stable Diffusion and DALL-E produce excellent images.
But what happens when the internet is flooded with AI-generated content? That content is eventually collected and used to train the next iteration of generative models. A study by researchers from Oxford University, Cambridge University, Imperial College London, and the University of Toronto found that machine learning models trained on content generated by generative AI suffer from irreversible defects that gradually worsen over generations.
The only way to maintain the quality and integrity of future models is to ensure that they are trained on human-generated content. But with LLMs such as ChatGPT and GPT-4 enabling content creation at scale, access to human-generated data may soon become a luxury that few can afford.
Model collapse
In their paper, the researchers explore what happens when text generated by GPT-4, for example, forms the bulk of the training dataset for subsequent models.
“What happens to GPT versions GPT-{n} as generation n increases?” the researchers wrote. “We found that learning from data generated by other models causes model collapse, a degenerative process in which the model forgets the true underlying data distribution over time, even if the distribution does not change over time.”
A machine learning model is a statistical engine that attempts to learn the distribution of data. This is true for all kinds of ML models, from image classifiers to regression models to more complex models that generate text and images.
The more closely the model approximates the underlying distribution, the better it predicts real-world events.
However, even the most complex models are only approximations of the real world. As a result, they tend to overestimate likely events and underestimate unlikely events, even if only by small margins. When models are recursively retrained on their own output, these errors accumulate and the model collapses. Eventually, models later in the sequence drift away from the original distribution of the natural data used to train the first model.
Model collapse is related to catastrophic forgetting, a problem that arises in models that are continuously trained on new data. With catastrophic forgetting, ML models gradually forget the information they were trained on early in their lifecycle. Model collapse does not erase previously learned data, but causes the model to interpret it incorrectly.
Model collapse is also related to data poisoning, the process by which a malicious actor attempts to manipulate the behavior of a model by intentionally altering the data used to train it. Model collapse can be considered a form of data poisoning. However, it is the model and the training process, not an intentional actor, that pollute the training data.
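The way rare events drop out under recursive retraining can be illustrated with a toy example. The sketch below is not the researchers' code: it assumes a three-outcome categorical distribution with made-up probabilities, and each "generation" simply re-estimates the probabilities from a finite sample drawn from the previous generation's estimate. The function name and all parameter values are hypothetical.

```python
import random
from collections import Counter


def resample_generations(probs, n_samples, n_generations, seed=0):
    """Repeatedly refit a categorical distribution to samples drawn from the previous fit.

    Returns a list of probability dicts, one per generation,
    starting with the true distribution.
    """
    rng = random.Random(seed)
    current = dict(probs)
    history = [current]
    for _ in range(n_generations):
        events = list(current)
        weights = [current[e] for e in events]
        # Draw a finite training set from the current model.
        samples = rng.choices(events, weights=weights, k=n_samples)
        counts = Counter(samples)
        # "Train" the next model: the empirical frequencies of the samples.
        # Events that were not sampled get no entry, so they can never return.
        current = {event: count / n_samples for event, count in counts.items()}
        history.append(current)
    return history


# Illustrative (assumed) distribution with one rare outcome.
true_probs = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
history = resample_generations(true_probs, n_samples=50, n_generations=200)
support_sizes = [len(h) for h in history]
print("outcomes still representable per generation:", support_sizes[::20])
```

Once an outcome fails to appear in a generation's sample, its estimated probability is zero and no later generation can produce it, so the number of representable outcomes can only shrink. This is a much simpler setting than the models in the study, but it shows the same one-way loss of the distribution's tails.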
Model Collapse in Generative AI
In the study, the researchers simulated the effects of training generative models on their own output. They tested three types of models: a Gaussian mixture model (GMM), a variational autoencoder (VAE), and a large language model (LLM).
The task of the GMM was to separate two artificially generated Gaussian distributions. The model was first trained on a dataset generated from a fixed function. It was then used to generate new data for training the next model, and so on. Within 50 generations, the distribution of the data had completely changed. By generation 2,000, all variance was lost.
The VAE was used to generate handwritten digits. The initial model was trained on real data, and each subsequent generation was trained on data generated by the previous model. The images gradually became blurry, and by the 10th generation the digits were illegible.
The researchers then tested their hypothesis with OPT-125m, a small version of Meta’s open-source LLM. They evaluated a common scenario in which a pre-trained model is fine-tuned with recent data. Here, however, the fine-tuning data is produced by another fine-tuned pre-trained model.
They tested two variations of the scenario. In the first, only LLM-generated data is used for fine-tuning. In the second, a small portion of the original human-generated data is also added to the training mix.
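The loss of variance can be reproduced in a much simpler setting than the one used in the paper. The following sketch is a simplification, not the researchers' GMM experiment: it assumes a single one-dimensional Gaussian, a small sample size, and a hypothetical helper named `recursive_gaussian_fit`. Each generation fits a mean and standard deviation to samples drawn from the previous generation's fit.

```python
import random
import statistics


def recursive_gaussian_fit(mu, sigma, n_samples, n_generations, seed=0):
    """Fit a Gaussian to samples from the previous fit, generation after generation.

    Returns the list of standard deviations, starting with the true one.
    """
    rng = random.Random(seed)
    sigmas = [sigma]
    for _ in range(n_generations):
        # Generate training data from the current model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Train the next model: estimate its parameters from that data alone.
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        sigmas.append(sigma)
    return sigmas


# Assumed toy settings: true distribution N(0, 1), 10 samples per generation.
sigmas = recursive_gaussian_fit(mu=0.0, sigma=1.0, n_samples=10, n_generations=1000)
for generation in (0, 10, 100, 1000):
    print(f"generation {generation:4d}: sigma = {sigmas[generation]:.3e}")
```

Because every fit is based on a finite sample, the estimated standard deviation performs a random walk with a downward drift on the log scale, so over many generations it tends toward zero. The small sample size is chosen here to make the effect visible quickly; larger samples slow the drift but do not remove it in this toy setup.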
“Both training plans lead to poor model performance, but we found that it was possible to learn using the generated data, and the model was able to successfully learn (part of) the underlying task,” the researchers wrote.
However, their findings also show that, over generations, the models increasingly generated samples that the original model would have produced with high probability.
“At the same time, we also found that the generated data had much longer tails, suggesting that some of the data would never have been generated by the original model,” the researchers wrote, attributing these errors to learning from generational data.
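The second fine-tuning scenario, in which some original human-generated data is kept in the training mix, can be sketched as a simple sampling step. The helper below is hypothetical and not taken from the study; the 10% fraction, the dataset sizes, and the placeholder records are assumptions chosen only for illustration.

```python
import random


def build_training_mix(original, generated, original_fraction, size, seed=0):
    """Sample a training set in which a fixed fraction comes from original data.

    Requires enough items in each pool to sample without replacement.
    """
    if not 0.0 <= original_fraction <= 1.0:
        raise ValueError("original_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_original = round(size * original_fraction)
    n_generated = size - n_original
    mix = rng.sample(original, n_original) + rng.sample(generated, n_generated)
    rng.shuffle(mix)  # avoid ordering the two sources one after the other
    return mix


# Placeholder records standing in for human-written and model-written text.
original = [("human", i) for i in range(100)]
generated = [("model", i) for i in range(1000)]

mix = build_training_mix(original, generated, original_fraction=0.1, size=200)
n_human = sum(1 for source, _ in mix if source == "human")
print(f"{n_human} human-written and {len(mix) - n_human} model-written examples")
```

Setting `original_fraction` to 0 corresponds to the first scenario, where each generation is fine-tuned on model-generated data only.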
What about future generations of ChatGPT?
The digital age has created all kinds of data pollution. Search engine algorithms have had a huge impact on how people create online content. Malicious actors have resorted to all sorts of techniques to ensure that their content ranks high on search engine result pages. Similar effects have been observed with social media content recommendation algorithms, where bad actors use controversial or clickbait titles to drive engagement and promote content.
However, while those earlier problems can be mitigated by changing the ranking algorithms, the impact of LLMs is much more difficult to detect and deal with.
“Our evaluation suggests that there is a ‘first mover advantage’ when it comes to training models such as LLMs,” the researchers wrote. This means that platforms and companies that have access to authentic human-generated text will have an advantage in producing high-quality models. After that, the web could become flooded with AI-generated content.
Researchers suggest taking steps to maintain access to the original data over time. However, it is not clear how to track and filter LLM-generated content at scale. This could be the focus of a new wave of innovation and competition among tech companies in the coming months and years.
