Researchers have discovered something that could shake up the AI industry to its core

AI companies like Google, Meta, Anthropic, and OpenAI have long argued that their large language models do not technically store copies of copyrighted works, but instead "learn" from training data, much like the human mind.

It’s a carefully worded distinction essential to their attempts to protect themselves from a rapidly growing barrage of legal challenges.

It also cuts to the heart of copyright law itself. Copyright is a type of intellectual property law designed to protect original works and their creators. Under the U.S. Copyright Act of 1976, copyright owners have the exclusive right to “reproduce, adapt, distribute, publicly perform, and publicly display the work.”

Importantly, however, the "fair use" doctrine provides that others may use copyrighted material for purposes such as criticism, journalism, and research. This is the defense the AI industry has mounted in court against accusations of infringement. OpenAI CEO Sam Altman went so far as to say that unless the industry is allowed to freely use copyrighted data to train its models, it's "over."

Rightsholders have long accused AI companies of training models on pirated and copyrighted works, effectively monetizing the work of authors, journalists, and artists without fairly compensating them. The years-long legal battle has already resulted in at least one high-profile settlement.

Now, a startling new study could put AI companies on the defensive. In it, researchers from Stanford University and Yale University found compelling evidence that AI models are in fact copying their training data rather than merely "learning" from it. Specifically, four prominent LLMs (OpenAI's GPT-4.1, Google's Gemini 2.5 Pro, xAI's Grok 3, and Anthropic's Claude 3.7 Sonnet) readily reproduced long excerpts from popular, copyright-protected works with remarkable accuracy.

The researchers found that Claude output "nearly the entire book" with 95.8 percent accuracy. Gemini reproduced the novel Harry Potter and the Philosopher's Stone with 76.8 percent accuracy, while Claude reproduced George Orwell's 1984 with over 94 percent accuracy against the original, still-copyrighted text.
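What does "percent accuracy" mean in practice? The article doesn't spell out the study's metric, but the basic idea of scoring a model's output against a reference text can be sketched in a few lines of Python. Here is a minimal, hypothetical illustration using difflib's SequenceMatcher; the function name and the similarity measure are assumptions for demonstration, not the researchers' actual method:

```python
import difflib

def reproduction_accuracy(generated: str, original: str) -> float:
    """Score how closely a model's output matches a reference text,
    as a ratio in [0, 1]. SequenceMatcher is a simple stand-in here;
    the study's actual metric is not described in this article."""
    return difflib.SequenceMatcher(None, generated, original).ratio()

# A near-verbatim reproduction (the opening line of 1984,
# with the punctuation dropped) scores close to 1.0.
ref = "It was a bright cold day in April, and the clocks were striking thirteen."
out = "It was a bright cold day in April and the clocks were striking thirteen"
print(f"{reproduction_accuracy(out, ref):.3f}")  # ~0.99
```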

“While many believe that LLMs do not remember much of their training data, recent studies have shown that a significant amount of copyrighted text can be extracted from open-weight models,” the researchers wrote.

Some of these reproductions required the researchers to jailbreak the models using a technique called Best-of-N, which essentially bombards the AI with many variations of the same prompt. (OpenAI has already leaned on these kinds of workarounds to defend itself in the lawsuit brought by The New York Times, with its lawyers arguing that "normal people would not use OpenAI's products in this way.")
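To make the technique concrete, here is a minimal, hypothetical Python sketch of a Best-of-N-style attack. The query_model callable, the perturbation rate, and the refusal check are illustrative assumptions rather than details from the study; the core idea is simply to resample lightly perturbed versions of the same prompt until one slips through:

```python
import random

def perturb(prompt: str) -> str:
    """Randomly flip the case of ~10% of letters. Best-of-N-style attacks
    rely on cheap perturbations like this: each variant makes the same
    request, but may slip past the model's refusal behavior."""
    return "".join(
        c.swapcase() if c.isalpha() and random.random() < 0.1 else c
        for c in prompt
    )

def best_of_n(query_model, prompt: str, n: int = 100):
    """Send up to n perturbed copies of the prompt and return the first
    non-refusal response. `query_model` is a stand-in for whatever
    chat-completion API is being probed."""
    for _ in range(n):
        response = query_model(perturb(prompt))
        if not response.lower().startswith(("i can't", "i cannot", "sorry")):
            return response
    return None  # all n attempts were refused
```

The approach works because refusals are not deterministic: across enough superficially different phrasings of the same request, the odds that at least one variant is answered grow quickly.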

The impact of the latest findings could be significant as copyright lawsuits unfold in courts across the country. As The Atlantic's Alex Reisner points out, the result further undermines the AI industry's claim that LLMs "learn" from these texts rather than storing the information and recalling it later. It's evidence, he argues, that "there is potentially huge legal liability for AI companies" and that "copyright infringement judgments could cost the industry billions of dollars."

Whether AI companies are liable for copyright infringement remains a subject of intense debate. Mark Lemley, a Stanford law professor who has represented AI companies in copyright cases, told The Atlantic that he doesn't know whether an AI model "includes" a copy of a book or whether it can reproduce one "on-the-fly on demand."

Not surprisingly, the industry continues to claim that it is not technically copying protected works. In 2023, Google told the U.S. Copyright Office that "no copies of the training data, whether text, images, or other forms, exist in the model itself."

OpenAI also told the agency in the same year that its models “do not save copies of the information they learn.”

Writing in The Atlantic, Reisner called the analogy that AI models learn just like humans "a deceptive, feel-good idea that gets in the way of the public discussion we need to have about how AI companies are using the creative and intellectual work they so completely rely on."

But it remains to be seen whether the judges overseeing the wave of copyright cases will agree. The stakes are considerable, especially as the AI industry balloons to untold valuations while writers, journalists, and other content creators find it increasingly difficult to make a living.

More on AI and copyright: OpenAI's copyright situation appears to be in great jeopardy


