New reports published by both Proof News and Wired allege that large language models from Apple, Salesforce, Anthropic, and other major tech companies were trained on tens of thousands of YouTube videos without the creators' consent and in a way that potentially violates YouTube's terms.
The companies trained their models in part on “The Pile,” a collection compiled by the nonprofit EleutherAI as a way to offer a useful dataset to individuals and companies that don't have the resources to compete with Big Tech, though it has since been used by those larger companies as well.
The Pile includes a wide variety of material, including books, Wikipedia articles, and more. It also includes YouTube captions, collected through YouTube's captions API and taken from 173,536 YouTube videos across more than 48,000 channels. Those include videos from big-name YouTubers like MrBeast and PewDiePie and from popular tech commentator Marques Brownlee. On X, Brownlee criticized Apple's use of the dataset but acknowledged that assigning blame is complicated, since Apple did not collect the data itself. He wrote:
Apple sources data for its AI from multiple companies
One of them scraped a ton of data and transcripts from YouTube videos, including mine.
Apple technically avoids “fault” here because it isn't the one scraping.
But this will be an evolving issue for a long time.
The dataset also includes the channels of numerous mainstream and online media brands, including videos written, produced, and published by Ars Technica and its staff, as well as by many other Condé Nast brands, such as Wired and The New Yorker.
Coincidentally, one of the videos in the dataset was an Ars Technica-produced short film whose joke was that it had already been written by AI. The Proof News article also notes that the models were trained on parrot videos, so AI models are parroting a parrot that parrots human speech, as well as parroting other AIs that parrot humans.
As AI-generated content continues to proliferate on the internet, it will become increasingly difficult to put together datasets to train AI that don’t include content already generated by AI.
To be clear, some of this is not new. The Pile is widely used and referenced in AI circles, and tech companies have been known to train on it before. It has been cited in multiple lawsuits brought by intellectual property owners against AI and tech companies. Defendants in those lawsuits, including OpenAI, have argued that this kind of scraping is fair use. The cases have yet to be resolved in court.
But Proof News did the digging to pin down specifics about the use of YouTube captions, going so far as to build a tool that lets you search The Pile for individual videos and channels.
The investigation shows just how extensive the data collection is and draws attention to how little control intellectual property owners have over how their work is used once it is on the open web.
Creators' reactions
Proof News reached out to several of these creators, as well as to the companies that used the dataset, for comment. Most creators expressed surprise that their content had been used this way, and those who commented were critical of EleutherAI and the companies that used its dataset. For example, David Pakman of The David Pakman Show said:
No one comes to me and says, “I want to use this”… This is my livelihood and I put time, resources, money, and staff time into creating this content. There's really no shortage of work.
Julia Walsh, CEO of Complexly, the production company behind Hank and John Green's educational content, said:
We are outraged to learn that the educational content we carefully produced has been used in this way without our consent.
There are also questions about whether scraping this content violates YouTube's terms, which prohibit accessing videos by “automated means.” EleutherAI founder Sid Black said he used a script to download the captions via YouTube's API, just as a web browser would.
Anthropic, one of the companies that trained models on the dataset, maintains that it has not committed any violation. The company said:
The Pile contains a small portion of YouTube subtitles… YouTube's terms cover direct use of its platform, which is separate from use of The Pile dataset. Any concerns about possible violations of YouTube's terms of service should be directed to the authors of The Pile.
A Google spokesperson told Proof News that Google has “taken steps for many years to prevent abusive and unauthorized scraping,” but declined to provide a more specific response.
This is not the first time AI and tech companies have come under fire for training models on YouTube videos without permission. Notably, OpenAI (the company behind ChatGPT and the video generation tool Sora) is believed to have trained models on YouTube data, though not all of those allegations have been confirmed.
In an interview with The Verge's Nilay Patel, Google CEO Sundar Pichai suggested that using YouTube videos to train OpenAI's Sora would violate YouTube's terms of service, though that use is certainly distinct from scraping captions via the API.