The data that powers artificial intelligence is fast disappearing | Tech News

According to research from the Data Provenance Initiative, over the past year, many of the most important web sources used to train AI models have restricted data usage.

AI, artificial intelligence (Photo: Reuters)

The New York Times

Kevin Roose

For years, people building powerful artificial intelligence systems have trained their models using vast amounts of text, images, and videos taken from the Internet.

Now, that data is drying up.


A study published this week by the Data Provenance Initiative, an MIT-led research group, found that many of the most important web sources used to train AI models over the past year have restricted data use.

The study looked at 14,000 web domains contained in three commonly used AI training datasets and revealed an “emerging crisis of consent” as publishers and online platforms take steps to prevent data collection.

The researchers estimate that in three datasets, called C4, RefinedWeb, and Dolma, 5% of all data, and 25% of data from the highest-quality sources, has been restricted. These restrictions are set through the Robots Exclusion Protocol, a decades-old method by which website owners use a file called robots.txt to tell automated bots not to crawl their pages.
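To illustrate how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt rules below are a hypothetical example of the kind of AI-crawler block the study describes (`GPTBot` is the user-agent token OpenAI publishes for its crawler; the specific rules and URL are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: block the AI crawler, allow everyone else.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# The AI crawler is told not to fetch any page; other bots are permitted.
print(rp.can_fetch("GPTBot", "https://example.com/article"))   # False
print(rp.can_fetch("NewsIndexer", "https://example.com/article"))  # True
```

Note that robots.txt is purely advisory: it signals the site owner's wishes, but compliance depends on the crawler choosing to honor it, which is partly why publishers are also turning to paywalls and terms of use.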

The study also found that 45% of one dataset, called C4, was restricted by website terms of use. “We're seeing a rapid decline in consent to data use across the web, which will impact not just AI companies, but researchers, academics, and nonprofits as well,” Shayne Longpre, the study's lead author, said in an interview.

Data is the main building block of today's generative AI systems, with billions of examples of text, images, and videos as input. Much of that data is collected by researchers from public websites and compiled into large datasets that can be downloaded and used freely or supplemented with data from other sources. Learning from that data enables generative AI tools like OpenAI's ChatGPT, Google's Gemini, and Anthropic's Claude to describe, code, and generate images and videos. The more high-quality data that is input into these models, the better the quality of the output will generally be.

For years, AI developers have been able to collect data with relative ease. But the boom in generative AI in recent years has led to tensions with the owners of that data, many of whom are concerned about it being used to train AI, or at least want to be paid for it. As the backlash grows, some publishers have put up paywalls or changed their terms of use to limit the use of their data for AI training. Others are blocking the automated web crawlers used by companies like OpenAI, Anthropic, and Google.

Sites like Reddit and Stack Overflow have begun charging AI companies for access to their data, and several publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft last year for copyright infringement, alleging the companies used news articles without permission to train their models.

Companies like OpenAI, Google, and Meta have gone to great lengths in recent years to collect more data to improve their systems, and more recently, some AI companies have struck deals with publishers such as News Corp, owner of The Wall Street Journal, for ongoing access to their content.

The data crisis

– Declining consent to data use affects not just AI companies but also researchers, academics, and non-profit organizations

– An estimated 5% of all data, and 25% of data from the highest-quality sources, in the datasets used to train AI has been restricted

– The boom in generative AI is causing tensions with data owners

– Publishers are putting up paywalls and changing their terms of service to limit use of their data

– Web crawlers used by companies like OpenAI, Anthropic, and Google are blocked by some companies

– Small AI companies and academic researchers who rely on public datasets are suffering

©2024 The New York Times News Service

