schwit1 shares a report from VentureBeat: As anyone who follows the burgeoning industry and its underlying research knows, the data used to train the large language models (LLMs) and other transformer models underpinning products such as ChatGPT, Stable Diffusion, and Midjourney is initially derived from human sources (books, articles, photographs, and so on) that were created without the help of artificial intelligence. Now, as more people use AI to create and publish content, an obvious question arises: what happens when AI-generated content proliferates around the internet and AI models begin to train on it, rather than on primarily human-generated content?
A group of researchers from the UK and Canada have investigated just this question and recently published a paper on their work in the open-access journal arXiv. Their findings are worrying for current generative AI technology and its future: "We find that use of model-generated content in training causes irreversible defects in the resulting models." Specifically examining the probability distributions of text-to-text and image-to-image generative AI models, the researchers conclude that "learning from data produced by other models causes model collapse, a degenerative process whereby, over time, models forget the true underlying data distribution ... this process is inevitable, even for cases with almost ideal conditions for long-term learning."
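To make that feedback loop concrete, here is a minimal sketch in Python (a toy illustration, not the paper's actual experimental setup): a "model" that simply fits a Gaussian is retrained each generation exclusively on samples drawn from the previous generation's fit. The sample size is kept deliberately small so that compounding estimation error becomes visible within a few dozen generations.

```python
# Toy sketch of the model-collapse feedback loop (not the paper's experiment):
# each "generation" fits a Gaussian to its training data, and the next
# generation trains only on samples drawn from that fit, never on real data.
import random
import statistics

random.seed(0)
SAMPLES_PER_GEN = 20  # deliberately small, to make the effect visible quickly

# Generation 0 trains on "human" data from the true distribution N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(SAMPLES_PER_GEN)]

for generation in range(51):
    mu = statistics.fmean(data)       # fitted mean
    sigma = statistics.pstdev(data)   # fitted (maximum-likelihood) spread
    if generation % 10 == 0:
        print(f"gen {generation:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
    # The next generation sees only the model's own output.
    data = [random.gauss(mu, sigma) for _ in range(SAMPLES_PER_GEN)]
```

With each generation the estimates drift a little further from the true values of 0 and 1, and the fitted spread tends to shrink: the toy analogue of a model progressively "forgetting the true underlying data distribution."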
"Over time, mistakes in generated data compound and ultimately force models that learn from generated data to misperceive reality even further," Ilia Shumailov, one of the paper's lead authors, wrote in an email to VentureBeat. "We were surprised to observe how quickly model collapse happens: models can rapidly forget most of the original data from which they initially learned." In other words, as an AI model is trained on more AI-generated data, its performance degrades over time: the responses and content it generates contain more errors, and it produces far less non-erroneous variety in its responses. As another of the paper's authors, Ross Anderson, professor of security engineering at Cambridge University and the University of Edinburgh, wrote in a blog post discussing the paper: "[T]his will make it increasingly hard to train newer models by scraping the web, giving an advantage to firms which already did that, or which control access to human interfaces at scale. Indeed, we already see AI startups hammering the Internet Archive for training data."
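The loss of "non-erroneous variety" can be illustrated the same way with a discrete distribution (again a hedged toy sketch with made-up token names, not anything from the paper): rare outcomes that happen not to appear in one generation's finite training sample can never be generated again, so the vocabulary of a self-trained categorical "language model" only ever shrinks.

```python
# Toy sketch of diversity loss: a categorical "language model" retrained each
# generation on its own samples. Rare tokens absent from a finite sample are
# lost forever, because the re-estimated distribution assigns them zero mass.
import random
from collections import Counter

random.seed(1)

# True token distribution: one common token plus several rare ones.
vocab = ["the"] + [f"rare{i}" for i in range(9)]
probs = [0.55] + [0.05] * 9

for generation in range(6):
    print(f"gen {generation}: {len(vocab)} distinct tokens survive")
    sample = random.choices(vocab, weights=probs, k=40)
    counts = Counter(sample)
    # Re-estimate the distribution from the sample: unseen tokens drop out.
    vocab = list(counts)
    probs = [counts[t] / len(sample) for t in vocab]
```

The common token always survives, but each generation typically drops one or more of the rare tokens, which matches the observation that recursively trained models lose the tails of the distribution first.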
schwit1 wrote: "Garbage in, garbage out. If this thesis is correct, generative AI is turning into a self-licking ice cream cone of garbage generation."
