Apple, Nvidia and other tech companies train AI with YouTube videos

As generative artificial intelligence booms, tech companies are seeking training data to improve their models, but some are taking it without permission.

Tech companies including Apple, Nvidia, and Anthropic trained AI models on subtitles taken from tens of thousands of YouTube videos, despite platform rules prohibiting unauthorized downloading or use of content, according to an investigation by Proof News published in collaboration with Wired.

The investigation found that the companies were using a dataset called YouTube Subtitles, which contains transcripts of 173,536 YouTube videos from over 48,000 channels. The videos in the dataset range from educational channels like Khan Academy and MIT, to news sites like The Wall Street Journal, to videos from top YouTube creators like MrBeast and Marques Brownlee.

“Apple sources data for its AI from multiple companies,” Brownlee wrote in a post on X about the investigation. “One of them scraped a ton of data and transcripts from YouTube videos, including mine.”

Brownlee said that Apple technically avoids blame here because it is not the one doing the scraping, but added that “this is going to be a long-term, evolving issue.”

Proof News also created a tool that allows creators to search for their content within the dataset, which includes several videos from Quartz. The YouTube Subtitles dataset does not include imagery from the videos, but it does include some subtitles translated into languages such as German and Arabic.

The dataset was created by EleutherAI, a nonprofit AI research lab whose stated aim is to “promote open science norms.” According to Proof News, it is part of the Pile, a compilation of material the nonprofit assembled from the European Parliament, English-language Wikipedia, and other sources.

“The Pile dataset referenced in the research paper was trained in 2021 for academic and research purposes,” a spokesperson for Salesforce, one of the companies named in the investigation for using the dataset, said in a statement shared with Quartz. “The dataset is publicly available and was released under a permissive license.”

Apple, Nvidia and Anthropic did not immediately respond to requests for comment.

In April, YouTube CEO Neal Mohan told Bloomberg that it would be a “clear violation” of the platform's policies for companies to use YouTube videos (including transcripts and video bits) to train AI models such as OpenAI's text-to-video generator Sora. But a few days later, The New York Times reported that OpenAI had transcribed over 1 million hours of YouTube videos to train its GPT-4 model.


