To train their artificial intelligence models, companies need as much useful data as possible, but some of the biggest AI developers rely in part on YouTube videos that were transcribed without the creators' permission, in violation of YouTube's rules, an investigation by Proof News and Wired has found.
The two media outlets revealed that Apple, Nvidia, Anthropic and other major AI companies had trained their models, without the knowledge of the video creators, using a dataset called YouTube Subtitles, which incorporated transcriptions of about 175,000 videos from 48,000 channels.
The YouTube Subtitles dataset consists of subtitle text from videos, often translated into multiple languages. It was built by EleutherAI, which describes its goal as lowering the barrier to AI development for people outside big tech companies. It is just one component of a much larger EleutherAI dataset called the Pile, which includes not only YouTube transcripts but also Wikipedia articles, speeches from the European Parliament, and, according to the report, even emails from Enron.
But the Pile has gained many fans among big tech companies: Apple, for example, used it to train its OpenELM AI model, and a Salesforce AI model released two years ago was trained on the Pile and has since been downloaded more than 86,000 times.
The YouTube Subtitles dataset draws on a variety of popular channels across news, education, and entertainment, including content from YouTube stars such as MrBeast and Marques Brownlee. Transcripts of all of these videos have been used to train AI models. Proof News set up a search tool that lets anyone check whether particular videos or channels are in the collection. The collection also contains some videos from TechRadar.
Secret Sharing
The YouTube Subtitles dataset appears to violate YouTube's terms of service, which explicitly prohibit the automated scraping of videos and related data. Yet the dataset relies on exactly that, using scripts to download captions via YouTube's API. The report said the automated downloads selected videos using around 500 search terms.
The discovery sparked widespread surprise and anger among the YouTube creators interviewed by Proof News and Wired. Their concerns about unauthorized use of content are legitimate, and some creators were infuriated by the idea of their work being used in an AI model without payment or permission, especially after learning that the dataset included transcripts of deleted videos. In one case, the data came from a creator who had since removed their online presence entirely.
The report did not include a comment from EleutherAI, but it noted that the organization describes its mission as democratizing access to AI technology by making trained models publicly available. Judging from this dataset, that mission may clash with the interests of content creators and platforms. Legal and regulatory battles around AI were already complicated, and revelations like this one are likely to further unsettle the ethical and legal landscape of AI development. It is easy to propose a balance between AI innovation and ethical responsibility, but achieving it will be much harder.