Apple, Anthropic, and other tech companies reportedly under investigation for using YouTube videos to train AI – Technology News

As artificial intelligence advances, so does the demand for huge training datasets, some of which companies appear to be using illegally. Proof News reports that companies including Apple, Nvidia, Anthropic, and Salesforce have used subtitles from YouTube videos to train generative AI models.

The dataset, called YouTube Subtitles, reportedly contains video transcripts from educational and online learning channels such as Khan Academy, MIT, and Harvard. Videos from The Wall Street Journal, NPR, and the BBC were also used to train AI models, as were “Late Show with Stephen Colbert,” “Last Week Tonight with John Oliver,” and “Jimmy Kimmel Live.”

Illegal data market

According to Proof News, Apple used data from The Wall Street Journal, NPR, and the BBC to train an OpenELM model that was released in April, just before the WWDC event. Bloomberg, Databricks, and Anthropic also used the dataset to train their AI models. Salesforce used the Pile to build an AI model that it said was for “academic and research” purposes, but it later released the model to the public in 2022. The model has been downloaded over 85,000 times.

But what is the Pile, and why is its misuse a problem? YouTube Subtitles is a dataset of YouTube captions and part of a larger compilation called the Pile, which was assembled by the nonprofit EleutherAI. The Pile also includes materials from Wikipedia and the European Parliament and is accessible to anyone with internet access and the know-how to find it. However, its misuse could lead to the leakage of sensitive or personal data. Moreover, creating the dataset may have violated YouTube's terms of service, which prohibit using “automated means” to access the platform's videos.

“Pile was used to train Claude, Anthropic's generative AI assistant,” an Anthropic spokesperson explained. Representatives for Nvidia, Apple, Bloomberg, and Databricks declined to comment on their use of the Pile, and EleutherAI did not respond to Proof News' request for comment.

Safety net

A lawsuit against EleutherAI was voluntarily dropped by the plaintiffs. The Pile has since been removed from its official download site, but it is still available on file-sharing services.

Initial reports said that YouTube Subtitles, which launched in 2020, also included subtitles for more than 12,000 videos that have since been removed from YouTube.
