A new study finds that tech companies have used captions from more than 48,000 YouTube channels, including top creators like MrBeast and Marques Brownlee, and higher education institutions like MIT and Harvard, to train their AI models, despite YouTube banning the scraping of content on the platform without permission.
An investigation by Proof News, published with Wired, found that companies such as Anthropic, Nvidia, Apple, and Salesforce used a dataset of 173,536 YouTube videos from Khan Academy, MIT, Harvard, The Wall Street Journal, NPR, BBC, and late-night shows including The Late Show with Stephen Colbert, Last Week Tonight with John Oliver, and Jimmy Kimmel Live.
Marques Brownlee posted an Instagram Reels video stating his opinion: “The truth is that Apple, and many other tech companies, train their AI models with data they purchase from third-party data scraping companies, some of which obtain the data in slightly illegal ways…Apple is technically not responsible for this.”
In an email to Mashable on Wednesday, July 17, Apple said that its use of The Pile data was for research purposes only. Apple said the data was fed into the OpenELM model, but that the data was not provided to any Apple AI features, including Apple Intelligence.
Wired said that a representative for EleutherAI, the nonprofit AI research institute that collected and distributed the YouTube dataset, did not respond to its request for comment. The dataset is part of a compilation the nonprofit calls The Pile, which also includes European Parliament documents, the English Wikipedia, and emails from Enron employees made public during a federal investigation in the early 2000s.
According to Wired, most of the collections that make up The Pile are accessible to “anyone on the Internet with enough space and computer power.” Those who have used it include Apple, Nvidia, Salesforce, Bloomberg, and Databricks, all of which have publicly acknowledged using The Pile to train their AI models.
Jennifer Martinez, a spokesperson for AI startup Anthropic, which used The Pile to train its generative AI assistant, said in a statement: “YouTube's terms cover direct use of the platform, which is separate from use of the Pile dataset. We would like to point out any potential violations of YouTube's terms of service to the Pile creators.”
“The double whammy is that we actually pay for more accurate manual transcriptions for every video we publish… meaning the stolen transcripts are paid content and have been stolen multiple times,” Brownlee added in his Instagram Reels.
His concerns echo those of creators around the world who worry that their work will be consumed or misused by AI without compensation or permission. Many of them are now suing tech companies for using their work without consent.
Wired reports that The Pile is still available on file-sharing services but has been removed from its official download site. Proof News also offers a tool to find creators in the YouTube AI training dataset.
Updated: July 18, 2024, 8:11am PDT This story has been updated to include an emailed statement from Apple to Mashable.