Artificial intelligence systems like ChatGPT could quickly run out of what makes them smart: the tens of trillions of words that people have written and shared online.
A new study published Thursday by the research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models roughly around the turn of the decade, sometime between 2026 and 2032.
Study author Tamay Besiroglu likened the situation to a “literal gold rush” that depletes finite natural resources, saying the field may struggle to maintain its current pace of progress once the stockpile of human-written text is drained.
In the short term, tech companies like ChatGPT maker OpenAI and Google are racing to secure, and in some cases pay for, high-quality data sources to train their AI large language models, for example by striking deals to tap the steady stream of text flowing out of Reddit forums and news outlets.
In the long term, there won't be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private, such as emails and text messages, or to rely on less-reliable “synthetic data” spit out by the chatbots themselves.
“There's a real bottleneck here,” Besiroglu said. “Once you start running into constraints on how much data you have, you can't really scale up your models efficiently anymore. And scaling up models has probably been the most important way of expanding their capabilities and improving the quality of their output.”
The researchers first made the prediction two years ago, shortly before ChatGPT debuted, in a working paper predicting the impending end of high-quality text data in 2026. A lot has changed since then, including new techniques that allow AI researchers to make better use of existing data and “overtrain” on the same sources multiple times.
But those techniques have their limits, and after further research, Epoch now expects public text data to be exhausted sometime in the next two to eight years.
The team's latest study has been peer-reviewed and is due to be presented this summer at the International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit research institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism, a philanthropic movement that has poured money into mitigating the worst risks of AI.
Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients, computing power and vast stores of internet data, could significantly improve the performance of AI systems.
According to the Epoch study, the amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year. Facebook parent company Meta Platforms recently claimed that the largest version of its upcoming Llama 3 model, which has not yet been released, was trained on up to 15 trillion tokens, each of which can represent a piece of a word.
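For a rough sense of what those growth rates imply, here is a back-of-the-envelope sketch in Python. The 2.5x annual growth rate and the 15 trillion token figure come from the reporting above; the candidate sizes for the total stock of usable public text are hypothetical assumptions for illustration, not Epoch's estimates.

```python
import math

# Illustrative back-of-the-envelope sketch (not Epoch's model).
# Assumptions: today's largest reported training set is ~15 trillion tokens
# (the Llama 3 figure above) and training data grows ~2.5x per year.
# The total stocks of usable public text below are hypothetical.
current_tokens = 15e12   # tokens used by today's largest reported model
annual_growth = 2.5      # reported growth factor per year

for stock in (100e12, 300e12, 1000e12):  # hypothetical public-text stocks
    years = math.log(stock / current_tokens) / math.log(annual_growth)
    print(f"stock of {stock / 1e12:.0f}T tokens -> used up in ~{years:.1f} years")
```

Under these assumptions the public stockpile is consumed within a few years, which is broadly consistent with the two-to-eight-year window the study describes.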
But how much it's worth worrying about the data bottleneck is debatable.
“I think it's important to keep in mind that we don't necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and researcher at the nonprofit Vector Institute for Artificial Intelligence.
Papernot, who was not involved in the Epoch research, said more skilled AI systems can also be built by training models that are more specialized for specific tasks. But he worries that training generative AI systems on the same outputs they produce can lead to a performance degradation known as “model collapse.”
Training on AI-generated data is “just like what happens when you copy a piece of paper and then copy that copy: some of the information is lost,” Papernot said. What's more, his research has found that it can further encode the mistakes, biases and unfairness already baked into the information ecosystem.
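To picture the degradation Papernot describes, here is a toy simulation in Python, purely illustrative and not drawn from his experiments: each generation fits a simple statistical model to a small sample produced by the previous generation's model, and over many rounds the fitted distribution tends to lose the spread and tails of the original human data.

```python
import random
import statistics

# Toy illustration of "model collapse" (an analogy, not Papernot's experiments):
# each generation fits a simple Gaussian "model" to a small sample produced by
# the previous generation's model, then generates the next generation's data.
random.seed(0)

# Generation 0: "human-created" data drawn from a standard normal distribution.
samples = [random.gauss(0.0, 1.0) for _ in range(20)]

for generation in range(101):
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}, stdev={sigma:.3f}")
    # The next generation trains only on synthetic output of the current model.
    # With small samples, estimation error compounds and the fitted spread
    # tends to shrink over many generations, losing the original tails.
    samples = [random.gauss(mu, sigma) for _ in range(20)]
```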
If real, human-created text remains an important source of AI training data, the stewards of some of the most sought-after troves, websites like Reddit and Wikipedia along with news and book publishers, will be forced to think hard about how it is used.
“Maybe you don't lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It's an interesting problem right now that we're having natural resource conversations about human-created data. I shouldn't laugh about it, but I find it kind of amazing.”
While some have sought to close off their data from AI training, often after it has already been taken without compensation, Wikipedia places few restrictions on how AI companies use its volunteer-written articles. Still, Deckelmann said she hopes there remain incentives for people to keep contributing, especially as a flood of cheap, automatically generated “junk content” starts polluting the internet.
AI companies “should be concerned about how human-created content continues to exist and remain accessible,” she said.
From an AI developer's perspective, Epoch's research found that paying millions of humans to generate the text needed for AI models is “likely not an economical way” to improve technical performance.
As OpenAI begins training the next generation of its GPT large language models, CEO Sam Altman told an audience at a United Nations event last month that the company is already experimenting with “generating lots of synthetic data” for training.
“I think what you need is high-quality data. There's low-quality synthetic data, there's low-quality human data,” Altman said, though he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models.
“It would be really odd if the best way to train a model was to generate, say, a quadrillion tokens of synthetic data and feed that back,” Altman said. “That just seems really inefficient.”
AP
