OpenAI destroyed AI training data. The staff members who collected it have left the company.

Newly released documents from the Authors Guild's class action lawsuit against OpenAI show that the company used two large datasets named “books1” and “books2” to train its GPT-3 artificial intelligence model. The documents also show that the datasets have since been deleted.

Lawyers for the Authors Guild said in a court filing that the datasets likely contained “more than 100,000 published books” and that they are central to the guild's claim that OpenAI used copyrighted material to train its AI models.

The guild has been seeking information about the dataset from OpenAI for several months. The company initially resisted, citing confidentiality concerns, but ultimately said it had deleted all copies of the data, according to legal filings reviewed by Business Insider.

High-quality training data is a key ingredient of the powerful AI models that are taking the technology world by storm. OpenAI and other companies used data from the Internet, including many books, to build these models. Many of those who created this material want to be compensated for supplying the intelligence behind these new AI products. Technology companies do not want to be forced to pay. The dispute is now being fought in court through several lawsuits.

In a 2020 white paper, OpenAI described the “books1” and “books2” datasets as an “Internet-based corpus of books” and said they accounted for 16% of the training data used to create GPT-3. The white paper also states that “books1” and “books2” together contain 67 billion tokens of data, which equates to approximately 50 billion words. For comparison, the King James Bible contains 783,137 words.

An unsealed letter from OpenAI's lawyers, labeled “Confidential – For Lawyers' Eyes Only,” states that the use of “books1” and “books2” for model training was discontinued at the end of 2021 and that the datasets were deleted in mid-2022 due to non-use. The letter also states that other data used to train GPT-3 has not been deleted and offers the Authors Guild's lawyers access to those other datasets.

The unsealed documents also revealed that the two researchers who created “books1” and “books2” are no longer employed by OpenAI. OpenAI initially declined to share the identities of the two employees.

The startup later disclosed the employees' identities to the Authors Guild's lawyers, but did not release their names publicly. OpenAI asked the court to keep the names of the two employees and information about the datasets confidential. The Authors Guild opposed this, citing the public's right to know. The dispute is ongoing.

“The models that power ChatGPT and our API today were not developed using these datasets,” OpenAI said in a statement Tuesday. “These datasets were created by former employees who left OpenAI, were last used in 2021, and were deleted in 2022 because they were no longer in use.”
