A sweeping new investigation from The Atlantic has revealed a vast, unauthorized data grab at the heart of the AI boom.
The report shows that tech giants like Meta, Microsoft and Nvidia have downloaded more than 15.8 million YouTube videos without permission.
This content, scraped from more than 2 million channels, is being used to train powerful generative AI video models. The practice fuels the tech industry's race to dominate the next wave of AI, but it also pits these companies against millions of creators who now face an existential threat: their own work is being used to build tools that could make them obsolete. The revelation has sparked immediate backlash from creators and rights advocates, escalating already tense debates over data, copyright, and consent in the age of AI.
Issues of scale and consent
The scale of the data collection is staggering. The investigation identified at least 13 distinct datasets used by a who's who of Big Tech, including Amazon, ByteDance, Snap, and Tencent. It also confirms earlier reports of scraping by companies such as Apple and Anthropic.
This mass downloading violates YouTube's terms of service, yet it has gone largely unchecked. For creators, the news lands as a profound betrayal.
Woodworker John Peters, whose channel was among those scraped, captured the feeling of helplessness: "I think everything's going to be stolen… Do you stop, or do you keep making videos and hope people want to connect with people?"
His dilemma is now shared by millions. The issue is not just copyright but the fundamental fairness of an ecosystem in which creators' labor is harvested to build their direct competitors.
YouTube's tightrope walk
Caught in the middle, YouTube has rolled out a series of reactive measures. In December 2024, the platform introduced a new setting that lets creators opt in to third-party AI training. Crucially, the control is off by default, placing the burden of obtaining consent on AI companies.
This followed earlier updates aimed at transparency and protection. In September 2024, YouTube began expanding its Content ID system to detect AI-generated faces and voices. A month later, it introduced a "captured with a camera" label to verify authentic footage.
However, these tools do not address the core issue: Google itself continues to train its own models, such as Veo 3, on YouTube content. The policy highlights an awkward conflict of interest for a platform trying to serve creators while advancing its parent company's AI ambitions.
The legal battlefield
The industry's "scrape first, ask later" approach is facing a legal reckoning. Lawsuits are mounting. Creators like David Millette are suing Nvidia and OpenAI for unjust enrichment and unfair competition over the use of their videos.
These individual cases are part of a broader legal war against an industry built on the opaque, mass ingestion of public data. The conflict has escalated into high-stakes corporate battles.
In a landmark case, Disney and Universal filed a sweeping lawsuit against the AI lab Midjourney, accusing it of building its models on stolen intellectual property. Disney's chief legal officer, Horacio Gutierrez, did not mince words: "Piracy is piracy, and the fact that it's done by an AI company does not make it any less infringing."
The most significant test of the "fair use" doctrine, however, has unfolded in a San Francisco courtroom. AI lab Anthropic recently agreed to a record $1.5 billion settlement with authors over its use of copyrighted works, a deal hailed as the AI industry's "Napster moment." But in a surprising turn, the settlement's approval has been thrown into doubt.
US District Judge William Alsup criticized the proposal as "nowhere close to complete," putting the entire deal at risk. His skepticism stems from his earlier ruling, which drew a sharp line between AI training practices and how the training data was acquired.
While he found AI training itself to be "exceedingly transformative," he condemned Anthropic's use of pirated books from shadow libraries as piracy tantamount to theft. This judicial scrutiny has thrown the case, and the industry's broader legal strategy, into confusion.
With the settlement on hold, Anthropic once again faces a potential trial that could bring catastrophic damages. As courts begin to draw sharp boundaries between transformative technology and outright data piracy, the legal ground beneath the generative AI boom looks increasingly unstable.
An arms race fueled by creator content
The desperate data grab is driven by an intense AI arms race. Companies have poured billions into developing text, image, and video generation tools, and high-quality training data is the essential fuel. The prize is a market projected to be worth more than $2.5 billion by 2032.
Google has aggressively rolled out Veo 3, which generates video with synchronized audio across its subscription tiers. Google DeepMind CEO Demis Hassabis declared, "We're emerging from the silent era of video generation," underscoring the stakes. Microsoft, meanwhile, has countered by offering OpenAI's powerful Sora model for free.
Even Meta, after internal setbacks, has pivoted to licensing Midjourney's technology to keep pace. This competitive frenzy underscores why creator content is so valuable: it provides the vast, diverse, high-quality raw material needed to build the next generation of AI, regardless of where it comes from.
