
A truly open source model must allow researchers to replicate and explore it. Credit: MirageC/Getty
Tech giants including Meta and Microsoft call their artificial intelligence (AI) models “open source” but are not disclosing important information about the underlying technology, according to researchers who analyzed a number of popular chatbot models.
While there is no agreed-upon definition of open source for AI models, advocates say that “full” openness is crucial to advancing science and to efforts to hold AI accountable. What counts as open source could also become more consequential once the European Union's Artificial Intelligence Act comes into force: the law will subject models classified as open to looser restrictions.
Mark Dingemanse, a linguist at Radboud University in Nijmegen, the Netherlands, says some large companies profit from claiming to have open-source models while “disclosing as little as possible,” a practice known as openwashing.
“What surprised us was that smaller players with relatively few resources put in more effort,” says Dingemanse, who, together with his colleague, computational linguist Andreas Liesenfeld, compiled a leaderboard identifying the most and least open models. The findings were published on June 5 in the proceedings of the 2024 ACM Conference on Fairness, Accountability and Transparency1.
The study cuts through “a lot of the hype and theory surrounding the current open-source debate,” says Abeba Birhane, a cognitive scientist at Trinity College Dublin and an adviser on AI accountability to the Mozilla Foundation, a nonprofit based in Mountain View, California.
Defining openness
The term “open source” comes from software, where it means that the source code is accessible and that there are no restrictions on using or distributing the program. But given the complexity of large AI models and the huge amounts of data involved, making them open source is by no means a simple task, and experts are still working out a definition of open-source AI. Nor is it always desirable for companies to make all aspects of a model public, because doing so could expose them to commercial or legal risks, says Dingemanse. And some argue that making a model completely free and open risks it being misused.

But the open-source label also brings big benefits. Developers are already reaping the public-relations rewards of presenting themselves as rigorous and transparent companies. And legal consequences may soon follow: the EU's AI Act, passed this year, exempts open-source general-purpose models, up to a certain scale, from broad transparency requirements, imposing lighter, as-yet-undefined obligations instead. “It's fair to say that in countries that fall under the jurisdiction of EU AI law, the term open source will carry unprecedented legal weight,” Dingemanse says.
In their study, Dingemanse and Liesenfeld evaluated 40 large language models, systems that learn to generate text by finding associations between words and phrases in large amounts of data. All of these models claim to be “open source” or “open.” The pair rated each model on 14 parameters, including the availability of code and training data, the public documentation and the ease of access to the model, to create a leaderboard of openness. For each parameter, they judged whether a model was open, partially open or closed.
Amanda Brock, chief executive of OpenUK, a London-based non-profit company focused on open technology, says this sliding-scale approach to analysing openness is useful and practical.
The researchers found that many models that claim to be open or open source, such as Meta's Llama and Google DeepMind's Gemma, are in fact merely “open weight.” That means outside researchers can access and use the trained models, but cannot inspect or customize them. Nor can they fully understand how the models were fine-tuned for specific tasks, for example with human feedback. Labelling a model open while revealing so little is hardly a testament to openness, Dingemanse says.

Of particular concern, the authors say, is the lack of disclosure about the data on which models are trained: about half of the models they analysed provide no details about their training data beyond generic descriptions.
A Google spokesperson said the company is “precise about the language” it uses to describe its models, and that it labels its Gemma LLM as open rather than open source. “Existing concepts of open source are not necessarily directly applicable to AI systems,” they added. A Microsoft spokesperson said the company “strives to be as accurate as possible about what is available and to what extent,” adding: “We make our artifacts — models, code, tools, datasets — publicly available because the developer and research community play a critical role in advancing AI technology.” Meta did not respond to Nature's request for comment.
The analysis found that models created by smaller companies and research groups tend to be more open than those from large tech companies. The authors highlight BLOOM, built mainly through an international academic collaboration, as an example of truly open-source AI.
Peer review is “outdated”
The duo found that scientific papers detailing the models are extremely rare; peer review seems “almost completely outdated,” replaced by blog posts with cherry-picked examples and by company preprints that are scant on detail. “Companies might publish papers on their websites that look flashy and very technical, but when you read them closely, there's no indication at all of what data went into the system,” Dingemanse says.
It is not yet clear how many of these models would fall under the EU's definition of open source. In the act, the term refers to models released under a “free and open” licence that, for example, allows users to modify the model, but it says nothing about access to the training data. Refining this definition would likely form “a single pressure point for corporate lobbies and large companies to target,” the paper says.

Openness is also important for science, Dingemanse says, because it is essential for reproducibility. “If you can't reproduce it, it's hard to call it science,” he says. The only way researchers can innovate is by tinkering with models, and to do that they need enough information to build their own versions. Models must also be open to scrutiny. “If you can't look inside to see how the sausage was made, you don't know whether to be impressed by it or not,” Dingemanse says. For example, if a model passes a particular test, that is no real accomplishment if the model was trained on many examples of that very test. And without accountability for the data, no one can know whether inappropriate or copyrighted material was used, he adds.
Liesenfeld says he wants to help other scientists “avoid falling into the same traps that we did” when looking for models to use in their teaching and research.
