In early April, YouTube sent a clear message: it warned AI model developers that downloading data from the platform and using it to train AI models is a clear violation of its terms of service.
YouTube reinforced that position the same week in a public comment, given for a New York Times report, on its content being used to train AI models: unauthorized scraping or downloading of YouTube content is banned. But a new report from Proof News reveals that data is being harvested from the platform and used by some of the biggest AI-driven tech companies to train their models.
According to the Proof News investigation, subtitles for 173,536 YouTube videos were extracted from over 48,000 channels, including some of the platform's most well-known creators, such as MKBHD (19 million subscribers), MrBeast (289 million), Jacksepticeye (31 million), PewDiePie (111 million), Stephen Colbert, John Oliver, and Jimmy Kimmel. Notably, the data consists of transcripts of the videos in the form of subtitle files.
According to the report, Apple, NVIDIA, Salesforce, Anthropic and others used a publicly available dataset called the Pile, which is accessible to anyone with internet access. The report also cites Apple, NVIDIA and Salesforce as stating in their respective research papers that they used the Pile to train their AI models. In Apple's case, the dataset was used to train OpenELM, a new AI model that the company released in April, just weeks before it announced Apple Intelligence.
It's worth noting that the big tech companies mentioned above did not download the YouTube video transcripts themselves. The downloading was done by EleutherAI, which created the dataset for educational and academic purposes. However, it appears that the big tech companies discovered the dataset and decided to use it to train their own models. This raises the question of what happens when companies train their AI models on third-party datasets that contain data whose owners never consented to its use for training.
Proof News' YouTube description reads:
“AI companies are typically secretive about the origins of their training data, but a Proof News investigation found that some of the world's richest AI companies have used material from thousands of YouTube videos to train their AI, despite YouTube prohibiting the companies from harvesting material from the platform without permission.
“The investigation found that 173,536 YouTube video subtitles, extracted from over 48,000 channels, were used by Silicon Valley giants such as Anthropic, NVIDIA, Apple, and Salesforce. The dataset, called YouTube Subtitles, contains transcripts of videos from educational and online learning channels such as Khan Academy, MIT, and Harvard. Videos from The Wall Street Journal, NPR, and BBC were also used to train the AI, as were ‘Late Show with Stephen Colbert,’ ‘Last Week Tonight with John Oliver,’ and ‘Jimmy Kimmel Live.’”