AI companies reportedly use YouTube video transcripts for training

A new report claims that generative artificial intelligence (Gen AI) companies have harvested transcripts of YouTube videos to train their engines, and several popular YouTubers, including MrBeast and Marques Brownlee, have expressed concern, claiming that their content is part of the vast dataset.

Investigation reveals that subtitles were extracted from over 170,000 YouTube videos

According to an investigation by Proof News, several major companies have trained their AI engines on data taken from YouTube videos. The findings were published in collaboration with Wired.

According to the investigation, several technology companies, including Apple, Anthropic, Nvidia, and Salesforce, have used a dataset called “YouTube Subtitles.” Specifically, the subtitles were taken from a total of 173,536 YouTube videos.

In total, the data used to build the AI datasets and train the AI engines draws on more than 48,000 YouTube channels, according to the report. It includes content from YouTubers such as MrBeast (289 million subscribers), MKBHD (19 million subscribers), and PewDiePie (111 million subscribers), among many others.

Apple sources data for its AI from multiple companies

One of them scraped a ton of data and transcripts from YouTube videos, including mine.

Apple technically avoids “fault” because it isn't the one doing the scraping.

But this will be an evolving issue for a long time https://t.co/U93riaeSlY

— Marques Brownlee (@MKBHD) July 16, 2024

In addition to YouTubers, videos from news outlets such as ABC News, the BBC, and The New York Times are also part of the dataset. In short, a few tech giants have incorporated YouTube captions into their AI engines.

A tool that checks whether YouTube data has been used by AI companies has been posted online

According to The Verge, the YouTube video caption dataset is part of a larger collection of material. Technically, most of the companies using YouTube data relied on a dataset called The Pile from the non-profit organization EleutherAI, which is meant to be an open-source collection that also includes datasets of books, Wikipedia articles, and content available in the public domain.

To back up its claim that AI companies are using YouTube to build datasets and train their engines, Proof News has also released an interactive search tool that lets not only YouTubers but also the general public explore the data.

“This is theft,” said Dave Wiskus, CEO of Nebula, a streaming service partly owned by its creators, some of whom have had their work taken from YouTube to train AI. https://t.co/X34e3LuODW

— Distributed AI Institute is on Mastodon (@DAIRInstitute) July 16, 2024

Besides the obvious issue of paying or compensating YouTubers for their content, these companies also face legal challenges: YouTube has said that using video content (including transcripts) to train AI violates the platform's terms of service.

YouTube has reportedly refrained from responding to the reports, but it's likely that parent company Google will take steps to protect the video-sharing platform and its content creators.

For now, the dataset appears to contain only plain text. In other words, it's possible that the AI companies are using only video transcripts or subtitles, rather than the videos themselves, to train their engines. Incidentally, the plain text data also includes translations of videos into languages such as Japanese, German, and Arabic.

Google has previously admitted that it has used some YouTube videos to train its AI engine, but the search giant has contracts in place with YouTubers. EleutherAI, by contrast, may not have such contracts with each of the YouTubers whose videos are part of the dataset that the tech giants use to train their AI.
