AI systems have become part of our daily lives.
The sudden ubiquity of generative AI systems has led many to question the legality of how those systems are created and used. One question relates directly to my practice: Does copyright law permit the ingestion of copyrighted works, such as books, articles, photos, and art, for the purpose of training AI systems?
Two recent court rulings address this novel question. The answer in both: yes, using copyrighted works for AI training can be fair use, at least under the specific facts of these cases and the evidence presented by the parties. However, because the judges in both cases went somewhat beyond their holdings in dicta, explaining how their decisions might have come out differently, the rulings provide a useful roadmap for how other cases may be decided and how future AI systems can be designed to avoid infringing copyright. There are cautionary notes in each of the Anthropic and Meta decisions. Let's take a closer look.
Over 30 lawsuits have been filed over the past year or two by authors, news publishers, artists, photographers, musicians, record companies, and other creators against a variety of AI systems, alleging that using their copyrighted works for AI training infringes copyright. The system owners invariably assert fair use as a defense. These two decisions are the first to rule on that defense.
The Anthropic case
Anthropic had planned to create a central library of "all the books in the world."
The first decision, issued in June, involved a lawsuit by three authors who alleged that Anthropic PBC infringed their copyrights by copying their books (along with millions of others) and using them to train a text-generating AI system called Claude. Anthropic's defense was fair use.
Judge Alsup, sitting in the Northern District of California, found that using the books for training purposes was fair use, and that Anthropic's conversion of lawfully purchased printed books into digital copies was also fair use. However, Anthropic's use of pirated copies to build a central library of "all the books in the world," a purpose beyond training Claude, was not fair use. Whether Anthropic copied and used the central library for purposes other than AI training remains an open question (apparently, there was some evidence in the record that this was happening, but it was not developed).
In designing Claude, Anthropic apparently determined that books are the most valuable training material for a system meant to "think" like a human. Books supply speech, prose, and proper grammar patterns, among other things. Anthropic chose to download millions of free digital copies of books from pirate sites. It also set out to build a vast central library of "all the books in the world," planning to buy millions of printed books from bookstores, convert them into digital copies, discard the printed copies, and keep the library "forever." None of this was done with the authors' permission.
Importantly, Claude was designed not to reproduce the plaintiffs' books in its output, and there was no claim by the plaintiffs, or evidence, that it did so. Thus, the copyright infringement claims were limited to training Claude, building the central library, and ingesting books for unidentified, non-training purposes. Claude users ask questions and get text-based answers. Many use it for free; certain corporate and other users pay to use it, generating more than $1 billion a year in revenue for Anthropic.
The Anthropic ruling
Both decisions came from Silicon Valley's home court, the U.S. District Court for the Northern District of California.
To summarize the legal analysis, Judge Alsup assessed each "use" of the books individually, as required under the Supreme Court's 2023 Warhol v. Goldsmith fair use decision. Alsup first turned to the training use, finding that using books to train Claude was a quintessentially transformative use that did not supplant the market for the plaintiffs' books and therefore qualified as fair use.
He further found that converting purchased printed books into digital files, with the print copies discarded, was a transformative use akin to that in the Supreme Court's 1984 Betamax decision, which held that home recording of free broadcast television shows for time-shifting purposes was fair use. Here, Judge Alsup reasoned, Anthropic had lawfully purchased the books, merely shifted the format for space savings and searchability, and retained only one copy because the original printed copy was discarded (unlike the ReDigi platform, found infringing in 2018).
In contrast, downloading over 7 million pirated copies from pirate sites to build the central library was not fair use as a matter of law; because those copies were unlawful at the outset, the library built on them was unlawful as well.
Is Anthropic's liability for unfair use just a cost of doing business?
The lawsuit continues on the issue of damages for the pirated copies of the plaintiffs' books used for central library purposes rather than for training. The court said that Anthropic's later purchase of copies of the plaintiffs' books to replace the pirated ones did not absolve it of liability, but could affect the amount of statutory damages it must pay. Statutory damages range from $750 to a maximum of $150,000 per work.
That raises questions about the millions of other copyright owners beyond the three plaintiffs. If the pending class action is certified, could Anthropic be required to pay statutory damages on 7 million copies? And given Claude's appeal, is that just the cost of running an AI business?
The Meta case
Meta's decision to source books from shadow libraries was approved by CEO Mark Zuckerberg.
The second decision, issued on June 25th, two days after the Anthropic decision, involved 13 book authors who sued Meta, the creator of a generative AI model called Llama, for using the plaintiffs' books as training data.
Llama (like Claude) is free to download, but it generates billions of dollars in revenue for Meta. Like Anthropic, Meta initially explored licensing rights from book publishers, but ultimately abandoned those efforts and instead downloaded the books it wanted from pirate sites known as "shadow libraries." Also like Claude, Llama was designed not to produce output reproducing its source material in whole or in substantial part; the record showed that Llama could not be prompted to reproduce more than 50 words from the plaintiffs' books.
Judge Chhabria, also sitting in the Northern District of California, ruled that Meta's use of the plaintiffs' works to train Llama was fair use, but he was plainly reluctant, chastising the plaintiffs' lawyers for making the "wrong" arguments and failing to develop a proper record. Chhabria's decision is permeated by his perception of the dangers of AI systems.
The Meta ruling
Like Judge Alsup, Judge Chhabria found, based on the parties' arguments and the record before him, that Meta's use of the books as training data for Llama was "highly transformative": creating an AI system was a very different purpose from the plaintiffs' purposes of education and entertainment. Rejecting the plaintiffs' claim that Llama could be used to mimic their writing styles, Judge Chhabria noted that "style is not copyrightable."
The fact that Meta sourced the books from shadow libraries rather than authorized copies made no difference. Judge Chhabria reasoned (correctly, in my opinion) that making fair use turn on whether the source copy was authorized would beg the question of whether the secondary copy is lawful.
The plaintiffs tried the argument that succeeded in the Anthropic case, that Meta used the pirated books to create a central library for purposes other than training, but Judge Chhabria concluded that the claim was simply not supported by the evidence. Because Llama could not produce exact or substantially similar versions of the plaintiffs' books, he found no substitution harm, and he noted that lost licensing revenue for AI training is not a cognizable harm.
Judge Chhabria's Market Dilution Prediction
Judge Chhabria warned that generative AI systems could dilute the market for lower-value, mass-market works.
In dicta clearly expressing his frustration with the result in Meta's favor, Judge Chhabria discussed at length how market harm might be shown in other cases through the concept of "market dilution."
Unlike award-winning fiction, less distinctive works such as news articles and "the typical human-created romance or spy novel" may be susceptible to this harm. However, he said, because the plaintiffs before him did not make those arguments or present such a record, he could not rule on them. That opportunity remains for another day.
AI System Roadmap for Non-Infringement
The two decisions provide an early roadmap for how to design AI systems.
Based on these two court decisions, here are my takeaways for a roadmap toward non-infringing generative AI systems that use books:
- Using copyrighted books as training data is fair use, whether or not the source copies were pirated.
- Creating a central library of books is fair use if you purchase legal copies, convert the printed copies to digital, and make no additional copies.
- Creating a central library of books from pirated copies is probably not fair use.
- The system should be designed so that its output does not reproduce the source books exactly or in substantially similar form (contrast the Disney and Universal lawsuits against Midjourney).
- The system should be designed to avoid producing output that could clearly dilute the market for the source material, for example news articles and undistinguished romance novels. And finally,
- It doesn't matter that the system is commercial and generates billions of dollars; copyright holders are not entitled to license fees for fair uses.

