Apple is the latest in a long line of generative AI developers, a line nearly as old as the industry itself, found to have scraped copyrighted content from social media to train their artificial intelligence systems.
A new report from Proof News claims that Apple used a dataset containing subtitles from 173,536 YouTube videos to train its AI. But Apple isn't alone: other AI giants, including Anthropic, Nvidia, and Salesforce, have also been found using the dataset, despite YouTube's terms of service barring the harvesting of material from the platform without permission.
The dataset, called YouTube Subtitles, contains transcripts of videos from more than 48,000 YouTube channels, ranging from Khan Academy, MIT, and Harvard to The Wall Street Journal, NPR, and the BBC. Transcripts from late-night shows such as “The Late Show with Stephen Colbert,” “Last Week Tonight with John Oliver,” and “Jimmy Kimmel Live” are also included in the YouTube Subtitles database. Videos from YouTube creators such as Marques Brownlee and MrBeast, as well as a number of conspiracy theorists, were likewise scraped without permission.
The YouTube Subtitles dataset does not itself contain any video files, but it does include subtitle translations into other languages, including Japanese, German, and Arabic. It is part of a larger compilation called the Pile, assembled by the nonprofit research group EleutherAI, which pulled data not only from YouTube but also from sources such as European Parliament records and Wikipedia.
Bloomberg and Databricks also trained models on the Pile, according to the two companies' own publications. “The Pile includes a very small portion of YouTube subtitles,” Anthropic spokesperson Jennifer Martinez said in a statement to Proof News. “YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. Questions about potential violations of YouTube's terms of service would have to be directed to the Pile's authors.”
Legal technicalities aside, AI companies' exploitation of content from the open internet has been a point of contention since ChatGPT debuted. Stability AI and Midjourney are currently facing lawsuits from content creators alleging that they used copyrighted works without permission. Google, which operates YouTube, was itself hit with a class action lawsuit last July and another in September, and the company has argued that the suits would deal “a major blow not only to Google's services but to the very idea of generative AI.”
Me: What data was used to train Sora? YouTube videos?
OpenAI CTO: I'm not sure about that, actually… In this full WSJ interview, Murati answers many of the big questions about Sora. Ironically, you can watch the whole interview on YouTube:… pic.twitter.com/51O8Wyt53c
— Joanna Stern (@JoannaStern) March 14, 2024
These AI companies also have a hard time saying exactly where their training data comes from. In a March 2024 interview with The Wall Street Journal's Joanna Stern, OpenAI CTO Mira Murati stumbled multiple times when asked whether the company used videos from YouTube, Facebook, and other social media platforms to train its models. “We're not going to go into the details of the data that was used,” Murati said.
And in July of this year, Microsoft AI CEO Mustafa Suleyman argued that a vague “social contract” makes anything posted to the open web fair use.
“When it comes to content that's already on the open web, I think the social contract around that content has been fair use since the '90s,” Suleyman told CNBC. “Anyone can copy it, recreate it, replicate it. It's freeware, so to speak, and that's how it's been understood.”