AI companies reportedly use YouTube video transcripts for training


A new report claims that generative artificial intelligence (Gen AI) companies have harvested transcripts of YouTube videos to train their engines, and several popular YouTubers, including MrBeast and Marques Brownlee, have expressed concern, claiming that their content is part of the vast dataset.

Investigation reveals that subtitles were extracted from over 170,000 YouTube videos

According to an investigation by Proof News, several major companies have trained their AI engines on material taken from YouTube videos. The findings were published in collaboration with Wired.

According to the investigation, several technology companies, including Apple, Anthropic, Nvidia, and Salesforce, have used a dataset called “YouTube Subtitles.” The dataset contains subtitles taken from a total of 173,536 YouTube videos.

In total, the dataset draws on videos from more than 48,000 YouTube channels, according to the report, including content from YouTubers such as MrBeast (289 million subscribers), MKBHD (19 million subscribers), and PewDiePie (111 million subscribers), among many others.

In addition to YouTubers, videos from news outlets such as ABC News, the BBC, and The New York Times are also part of the dataset. In short, a few tech giants have incorporated YouTube captions into their AI engines.

A tool that checks whether YouTube data has been used by AI companies has been posted online

According to The Verge, the YouTube video caption dataset is part of a larger collection of material. Technically, most of the companies using YouTube data relied on a dataset called The Pile, from the non-profit organization EleutherAI, which is intended to be an open-source collection that also includes datasets of books, Wikipedia articles, and content available in the public domain.

To support its claim that AI companies are using YouTube to build datasets and train their engines, Proof News has also released an interactive search tool that allows not only YouTubers but also the general public to explore the data.

Besides the obvious issue of paying or compensating YouTubers for their content, these companies also face legal challenges: YouTube has said that using video content (including transcripts) to train AI violates the platform's terms of service.

YouTube has reportedly refrained from responding to the reports, but it's likely that parent company Google will take steps to protect the video-sharing platform and its content creators.

For now, the dataset appears to include only plain text data. In other words, it's possible that the AI companies are using only video transcripts or subtitles, rather than the videos themselves, to train their engines. Incidentally, the plain text data also includes translations of videos into Japanese, German, and Arabic.

Google has previously admitted that it has used some YouTube videos to train its own AI engine, but the search giant has contracts in place with YouTubers. EleutherAI, by contrast, may not have such contracts with each of the YouTubers whose videos are part of the dataset that the tech giants use to train their AI.




