AI giants use YouTube videos to train AI without permission

Silicon Valley AI giants trained generative artificial intelligence (AI) models on YouTube video transcripts without the creators' consent. Companies including Apple, Nvidia, Salesforce, and Anthropic used transcripts from 173,536 YouTube videos across more than 48,000 channels, a Proof News and WIRED investigation found. In doing so, they not only bypassed creators' consent but also violated YouTube's terms of service.

Proof News' investigation noted that Apple trained its OpenELM model on the transcripts. Apple later clarified that OpenELM does not power any of its AI or machine learning features and was created specifically for research purposes.

The dataset draws on channels from YouTube superstars such as Marques Brownlee, PewDiePie, and MrBeast, established news organizations such as the BBC and the Wall Street Journal, and educational institutions including Khan Academy, MIT, and Harvard. Speaking to Proof News, YouTube creators criticized the AI companies for using their work to train models without their consent.

The transcripts come from a dataset called “YouTube Subtitles,” which is part of a larger compilation called “The Pile” that also includes content from European Parliament proceedings and the English Wikipedia. The dataset also appears to contain content promoting “flat Earth” conspiracy theories, along with thousands of profanities and racist and sexist slurs found in the subtitles.
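For context, The Pile is distributed as line-delimited JSON, with each document tagged by its source subset in a metadata field. Below is a minimal sketch of how a researcher might isolate the YouTube Subtitles entries from a local shard; the file name is hypothetical, but the “meta.pile_set_name” layout and the “YoutubeSubtitles” label follow The Pile's published format.

```python
import json

# Hypothetical local shard of The Pile; the file name is an assumption,
# but the line-delimited JSON layout with a "meta.pile_set_name" field
# matches The Pile's published format.
PILE_SHARD = "pile_00.jsonl"

def youtube_subtitle_docs(path):
    """Yield the text of documents drawn from the YouTube Subtitles subset."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # "YoutubeSubtitles" is the subset label used in The Pile's metadata.
            if doc.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
                yield doc["text"]

if __name__ == "__main__":
    # Preview the first five transcripts in the shard.
    for i, text in enumerate(youtube_subtitle_docs(PILE_SHARD)):
        print(text[:200].replace("\n", " "))
        if i >= 4:
            break
```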

YouTube's terms of service explicitly prohibit the use of automated scrapers on videos uploaded to the site, and they also bar third parties from using those videos outside the platform. YouTube CEO Neal Mohan has said that training AI models on YouTube videos violates the platform's terms of service, a view later echoed by Google CEO Sundar Pichai.

Why is this important?

In April, The New York Times reported that OpenAI had transcribed YouTube videos and used them to train GPT-4, without the consent of the platform or the creators. Training AI models requires enormous amounts of data, and OpenAI CEO Sam Altman has previously suggested that AI companies could eventually exhaust the usable data on the internet. Access to large amounts of content is crucial, as AI models tend to perform better the more data they are trained on.

But much of this data is copyrighted, which could create legal problems for the companies involved. OpenAI has been hit with at least eight copyright infringement lawsuits from various companies and writers, including The New York Times. In the most recent suit, the Center for Investigative Reporting said, “OpenAI and Microsoft began siphoning off our stories to make their own products more powerful, but unlike other organizations that license our material, they never asked for permission or offered compensation.” A group of authors has filed a similar lawsuit against Nvidia.

Another concern is the inclusion of unreliable or harmful content in training datasets. LAION-5B, the popular training dataset used by the text-to-image model Stable Diffusion, has been flagged for containing child sexual abuse material (CSAM). Other studies have shown it also contains many examples of hateful content, including “rape, pornography, malicious stereotypes, racist and ethnic slurs, and other highly problematic content.” In at least one instance, Stable Diffusion was used to create CSAM.
