
A new report published today claims that a number of tech giants, including Apple, trained AI models on YouTube videos without the creators' consent.
They did this using subtitle files downloaded by a third party from over 170,000 videos. Affected creators include tech commentator Marques Brownlee (MKBHD), MrBeast, PewDiePie, Stephen Colbert, John Oliver, Jimmy Kimmel, and many more…
A caption file is essentially a transcription of the video content.
Wired reports.
A Proof News investigation found that some of the world's richest AI companies have used material from thousands of YouTube videos to train their AI. The companies did so despite YouTube's rules against harvesting material from the platform without permission.
Our research found that 173,536 YouTube video subtitles, extracted from over 48,000 channels, were used by major Silicon Valley companies, including Anthropic, Nvidia, Apple, and Salesforce.
The download was reportedly made by EleutherAI, a non-profit that helps developers train AI models. While the intent was likely to provide training material to smaller developers and academics, the dataset was also used by several tech giants, including Apple.
According to a research paper published by EleutherAI, the dataset is part of a corpus called the Pile that the nonprofit has made public. […]
Most of the Pile's datasets are publicly accessible and available to anyone on the internet with enough space and computing power. Academics and developers outside of Big Tech have also made use of the datasets, but they are not the only ones.
Apple, Nvidia and Salesforce, companies valued at hundreds of billions or even trillions of dollars, have described in research papers and posts how they used the Pile to train their AI. The documents also show that Apple used the Pile to train OpenELM, a high-profile model released in April, weeks before it announced it would add new AI capabilities to its iPhones and MacBooks.
According to Wired, Apple had not responded to a request for comment at the time of writing.
9to5Mac's take
It's important to emphasize that Apple did not download the data itself; that was done by EleutherAI, the organization that appears to have violated YouTube's terms of service.
In any event, while Apple and the other companies named likely used publicly available datasets in good faith, this is a good example of the legal minefield that comes with scraping the web to train AI systems. There have been multiple examples of AI systems plagiarizing entire paragraphs of text when asked about niche topics, and the risk of using material without permission only grows when companies use datasets compiled by third parties.
We've reached out to Apple for comment and will update if we hear back.
Screenshot: MKBHD