Apple, Anthropic and other companies use YouTube videos to train AI

More than 170,000 YouTube videos are part of a massive dataset used to train AI systems for major tech companies, according to an investigation by Proof News co-published with Wired. Apple, Anthropic, Nvidia and Salesforce are among the technology companies that used the “YouTube Subtitles” data, which was taken from the video platform without permission. The training dataset is a collection of subtitles from YouTube videos belonging to more than 48,000 channels; it does not include images from the videos.

The dataset also includes videos from popular creators such as MrBeast and Marques Brownlee, as well as clips from news outlets such as ABC News, the BBC and The New York Times. More than 100 videos from The Verge appear in the dataset, along with many other videos from Vox.

“Apple sources data for its AI from multiple companies,” Brownlee, who goes by the handle MKBHD, wrote in a post on X. “One of them scraped a ton of data and transcripts from YouTube videos, including mine,” he added. “This is going to be a long-term, evolving issue.”

YouTube did not immediately respond to The Verge's request for comment.

As part of the investigation, Proof News also launched an interactive search tool that lets you check whether your content, or that of your favorite YouTubers, appears in the dataset.

The subtitle dataset is part of The Pile, a larger open-source collection of materials from the nonprofit EleutherAI that also includes datasets of books, Wikipedia articles, and more. Last year, an analysis of a dataset called Books3 revealed which authors' works had been used to train AI systems, and the dataset has been cited in lawsuits brought by authors against companies that used it to train their AI.

AI companies are rarely proactively transparent about the data that feeds into their AI systems, and how exactly YouTube content is being used has been a hot topic in recent months. When OpenAI announced its powerful video generation tool, Sora, in March, CTO Mira Murati repeatedly dodged questions about whether the system was trained on YouTube videos.

“I won't go into the details of the data used, but it was publicly available or licensed data,” she told The Wall Street Journal at the time. When the Journal asked about YouTube content specifically, Murati said, “I wasn't sure about that.”

In a previous interview, YouTube CEO Neal Mohan said that training AI with video content, including its transcripts, would violate the platform's terms of service. Speaking on Decoder, Google CEO Sundar Pichai agreed with Mohan's assessment that if OpenAI had in fact trained Sora on YouTube content, it would have violated YouTube's terms of service.

“We have terms of service and we expect people to abide by those terms of service when we develop products, and that's how I felt,” Pichai said.
