State-of-the-art artificial intelligence systems can help you avoid parking violations, write academic papers, or fool you into believing that Pope Francis is a fashionista. But the virtual libraries behind this breathtaking technology are vast, and there are concerns that they operate in violation of personal data and copyright laws.
The massive datasets used to train the latest generation of these AI systems, such as those behind ChatGPT and Stable Diffusion, are likely to include billions of images scraped from the internet, millions of pirated ebooks, the entire proceedings of the European Parliament over 16 years, and the whole of English-language Wikipedia.
But as regulators and courts around the world crack down on researchers collecting content without consent or notice, the industry’s voracious appetite for big data is starting to cause problems. In response, AI labs are fighting to keep their datasets secret, or even daring regulators to push the issue.
In Italy, the country’s data protection regulator banned ChatGPT from operating after saying there was no legal basis to justify the collection and “massive storage” of personal data to train the GPT AI. Canada’s Privacy Commissioner on Tuesday launched an investigation into OpenAI, the company behind ChatGPT, in response to a complaint alleging “collection, use and disclosure of personal information without consent.”
The UK data watchdog has expressed its own concerns. “Data protection laws still apply if the personal information you are processing comes from publicly accessible sources,” said Steven Almond, director of technology and innovation at the Information Commissioner’s Office.
Michael Wooldridge, a professor of computer science at the University of Oxford, says that “large language models” (LLMs), such as those that power OpenAI’s ChatGPT and Google’s Bard, hoover up vast amounts of data.
“This includes the entire World Wide Web, everything. There’s probably a lot of data about you and me that’s collected by LLMs. And it’s not stored in a big database somewhere. What information does it have about me? It’s all buried in a huge, opaque neural network.”
Wooldridge says copyright is “the storm to come” for AI companies. LLMs are likely to have accessed copyrighted material, such as news articles. Indeed, the GPT-4-backed chatbot that comes with Microsoft’s Bing search engine cites news sites in its answers. “I didn’t explicitly allow my work to be used as training data, but it almost certainly was. Now it contributes to what these models know,” he says.
“Many artists are seriously concerned that their livelihoods are being endangered by generative AI. Expect to see legal battles,” he adds.
Lawsuits have already begun, with stock photo company Getty Images suing UK startup Stability AI, the company behind the AI image generator Stable Diffusion, alleging that the image-generation company infringed its copyright by training its system on millions of unlicensed Getty photos. In the U.S., a group of artists is suing Midjourney and Stability AI, claiming the companies “violated the rights of millions of artists” by developing products using their work without permission.
Awkwardly for Stability, Stable Diffusion occasionally spits out pictures with the Getty Images watermark intact. In January, researchers at Google were even able to prompt the Stable Diffusion system into producing a near-perfect reproduction of one of the unlicensed images it was trained on, a portrait of the U.S. evangelist Anne Graham Lotz.
Copyright lawsuits and regulatory action against OpenAI are hampered by the company’s absolute secrecy about its training data. Sam Altman, chief executive of OpenAI, which developed ChatGPT, responded publicly following the ban in Italy. However, the company has declined to share information about the data used to train GPT-4, the latest version of the underlying technology that powers ChatGPT.
Even in a “technical report” describing the AI, the company says only that it was trained “using both publicly available data (such as internet data) and data licensed from third-party providers.” Further information is being withheld, it says, due to “both the competitive landscape and the safety implications of large-scale models like GPT-4.”
Some take the opposite view. EleutherAI describes itself as a “non-profit AI laboratory” and was founded in 2020 with the aim of reproducing GPT-3 and making it available to the public. To that end, the group put together the Pile, an 825-gigabyte collection of datasets gathered from all corners of the internet. It includes 100GB of ebooks taken from the pirate site bibliotik, another 100GB of computer code scraped from GitHub, and a 228GB collection of websites gathered from across the internet since 2008.
Eleuther argues that all of the datasets in the Pile are already widely shared, so compiling them “doesn’t add much to the harm.” However, the group does not take the legal risk of hosting the data directly, instead relying on an anonymous group of “data enthusiasts” called the Eye, whose copyright takedown policy is a video of a choir of clothed women pretending to masturbate imaginary penises while singing.
Some of the information generated by chatbots is also false. ChatGPT, citing a non-existent news article, falsely accused Jonathan Turley, a U.S. law professor at George Washington University, of sexually harassing one of his students. The Italian regulator also noted that ChatGPT’s responses do not “always match factual circumstances” and that “inaccurate personal data are processed.”
An annual report on advances in AI has shown that private companies now dominate the industry, ahead of academic institutions and governments.
According to the 2023 AI Index report compiled by California-based Stanford University, there were 32 significant machine learning models created by industry last year, compared with three models created by academia. Until 2014, most of the important models came from academia, but since then the cost of developing AI models, including staff and computational power, has risen.
“Overall, large language and multimodal models are getting bigger and more expensive,” says the report. An early iteration of the LLM behind ChatGPT, known as GPT-2, had 1.5 billion parameters, which are analogous to neurons in the human brain, and cost an estimated $50,000 to train. By comparison, Google’s PaLM had 540 billion parameters and cost an estimated $8 million.
This has raised concerns that corporate bodies will take a less measured approach to risk than academic or government-backed projects. An open letter called for an immediate moratorium of at least six months on creating “giant AI experiments.” The letter said there were concerns that technology companies were creating “an ever-more powerful digital mind” that no one could “understand, predict or reliably control.”
Dr. Andrew Rogowski of the Institute for Human-Centric AI at the University of Surrey, UK, said: “[...] not always well represented.
“We need to focus on making AI smaller, more efficient, and requiring less data and power so we can democratize access to it.”