The battle between open source and proprietary software is well known, but a tension that has permeated the software industry for decades has now trickled into the field of artificial intelligence, in part because no one can agree on what “open source” actually means in the context of AI.
The New York Times recently published a glowing profile of Meta CEO Mark Zuckerberg, noting that his "open source AI" initiative has rekindled his popularity in Silicon Valley. However, by most assessments, Meta's Llama-branded large language models are not actually open source, which brings us to the heart of the debate.
The Open Source Initiative (OSI), under the direction of Executive Director Stefano Maffulli, is addressing this challenge through conferences, workshops, panels, webinars, reports and more.
AI is not software code
For over 25 years, OSI has been the custodian of the Open Source Definition (OSD), defining how the term “open source” can and should be applied to software. Any license that meets this definition can legitimately be considered “open source,” but a wide range of licenses are permitted, from very permissive to not so permissive.
But applying traditional software licensing and naming conventions to AI is problematic. Joseph Jacks, open source evangelist and founder of venture capital firm OSS Capital, goes so far as to say that "there is no such thing as open source AI," on the grounds that "open source was invented for software source code." Moreover, "neural network weights" (NNWs) — the AI-world term for the parameters, or coefficients, a network learns during training — cannot be meaningfully compared to software.
"Neural net weights are not the source code of software, so they are neither decipherable nor debuggable by humans," Jacks points out. "And the fundamental rights of open source don't apply to NNWs in any consistent way."
These contradictions led Jacks and his OSS Capital colleague Heather Meeker to come up with their own definition last year, centered on the concept of "open weights." Maffulli agrees. "They're right," he told TechCrunch. "One of the early discussions was whether we should even call this open source AI, but everyone was already using that term."
Founded in 1998, OSI is a non-profit public benefit corporation that focuses on advocacy, education, and a wide range of open source related activities with the Open Source Definition at its core. Today, the organization relies on sponsors for funding and includes such notable members as Amazon, Google, Microsoft, Cisco, Intel, Salesforce, and Meta.
Meta's involvement with OSI is especially notable in relation to the current debate over "open source AI." While Meta positions its AI as open source, the company places notable restrictions on how the Llama models can be used: they are free for research and commercial use, but app developers with more than 700 million monthly users must apply for a special license from Meta, which the company grants entirely at its own discretion.
Meta's wording around its LLMs has been somewhat flexible: the company called its Llama 2 model open source, but with the arrival of Llama 3 in April it toned that term down a bit in favor of phrases like "openly available" and "openly accessible," though it still calls the model "open source" in some places.
"Everyone else in this discussion completely agrees that Llama itself cannot be considered open source," Maffulli said. "People I've spoken with who work at Meta know that it's a bit of a stretch."
On top of that, one might argue there is a conflict of interest here: are the companies that have demonstrated a desire to piggyback on the open source brand also funding the maintainers of the “definition”?
That's one reason OSI is looking to diversify its funding: it recently won a grant from the Sloan Foundation, which is bankrolling OSI's multi-stakeholder, global effort to arrive at its definition of open source AI. TechCrunch learned that the grant is worth about $250,000, and Maffulli hopes it will shift perceptions of OSI's reliance on corporate funding.
"One of the things the Sloan grant makes even clearer is that we could say goodbye to Meta's money at any time," Maffulli said. "We could do that even before the Sloan grant is paid out, because we know we'd be getting donations from others, and Meta knows that very well. They're not going to interfere with this [process] at all, and neither are Microsoft, GitHub, Amazon, or Google; they all fully understand that their organizational structures mean they cannot interfere."
A working definition of open source AI
The current draft Open Source AI Definition is at version 0.0.8 and consists of three main parts: a “Preamble” that outlines the scope of the document, the Open Source AI Definition itself, and a checklist of necessary components for an open source compliant AI system.
According to the current draft, open source AI systems must grant the freedom to use the system for any purpose without asking permission, the freedom for others to study how the system works and inspect its components, and the freedom to modify and share the system for any purpose.
But one of the biggest sticking points has been around data: can an AI system be classified as "open source" if a company doesn't make its training datasets available to others? Maffulli says it is more important to know where the data came from and how the developers labeled, deduplicated, and filtered it, and to have access to the code that was used to assemble the dataset from its various sources.
"Knowing that information is much better than just having the dataset without the rest of it," Maffulli said.
While it would be nice to have access to the full dataset (OSI lists this as an "optional" component), Maffulli says that in many cases this is neither possible nor practical — for instance, because the dataset contains confidential or copyrighted information that the developers are not permitted to redistribute. Moreover, machine learning models can be trained in ways that never actually share the underlying data with the system, using techniques such as federated learning, differential privacy, and homomorphic encryption.
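As a concrete illustration of one such technique, here is a minimal sketch of the Laplace mechanism from differential privacy: an aggregate statistic is released with calibrated noise so that no single record can be confidently inferred, while the raw data is never shared. The `dp_mean` helper and the salary figures are purely illustrative — they come from no particular library or dataset.

```python
import math
import random

def dp_mean(values, lower, upper, epsilon, rng):
    """Differentially private mean via the Laplace mechanism (sketch).

    Each value is clipped to [lower, upper], so a single record can move
    the sum by at most (upper - lower); dividing by n gives the
    sensitivity of the mean, and the Laplace noise scale is
    sensitivity / epsilon (smaller epsilon = stronger privacy, more noise).
    """
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # Inverse-CDF sample from a Laplace(0, scale) distribution.
    u = rng.uniform(-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return sum(clipped) / n + noise

rng = random.Random(0)
salaries = [52_000, 61_000, 48_000, 75_000, 58_000]
private_avg = dp_mean(salaries, lower=0, upper=100_000, epsilon=1.0, rng=rng)
# The released average is the true mean (58,800) perturbed by noise
# calibrated to the sensitivity and the chosen privacy budget.
```

A downstream consumer sees only `private_avg`, never the individual salaries — which is exactly the sense in which such a system can be studied and used without the training data itself being redistributable.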
And this perfectly highlights the fundamental difference between “open source software” and “open source AI”: they may be similar in intent, but they are not comparable on an equal footing, and it is this difference that the OSI tries to capture in its definition.
In software, source code and binary code are two views of the same artifact: they reflect the same program in different forms. However, a training dataset and the subsequent trained model are different things. Using the same dataset does not necessarily allow you to consistently recreate the same model.
"There's a lot of statistical and random logic that happens during training, so it can't be replicated in the same way as software," Maffulli added.
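That randomness shows up even in a toy training loop: the learned weight depends on the random initialization and the shuffling of the data, so retraining reproduces the same model only if every source of randomness is pinned down. This is a hypothetical sketch in plain Python, not OSI's or any vendor's code.

```python
import random

def train_tiny_model(data, epochs=3, seed=None):
    """Fit y = w * x with plain SGD on a single weight.

    The seed controls both the random weight initialization and the
    per-epoch shuffling of the training data -- two of the sources of
    run-to-run variation that make retraining non-reproducible unless
    every one of them is fixed.
    """
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)        # random initialization
    lr = 0.01
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)          # data order varies run to run
        for x, y in samples:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

data = [(x, 3.0 * x) for x in range(1, 6)]
w_a = train_tiny_model(data, seed=1)
w_b = train_tiny_model(data, seed=2)
w_c = train_tiny_model(data, seed=1)
# Same seed reproduces the run exactly; different seeds follow
# different trajectories and land on (slightly) different weights.
```

Scale this up to billions of parameters, GPU non-determinism, and distributed data pipelines, and it becomes clear why a dataset alone is not a "source" from which the model can be rebuilt — hence the definition's emphasis on clear replication instructions.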
Therefore, an open source AI system should be replicable from clear instructions, which is where the checklist portion of the Open Source AI Definition comes in. It is based on a recently published academic paper called "The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence."
The paper proposes the Model Openness Framework (MOF), a classification system that evaluates machine learning models “based on their completeness and openness.” MOF requires that certain components of AI model development, such as details about training methods and model parameters, be “included and released under an appropriate open license.”
Steady state
OSI calls its official releases of definitions "stable versions," much as a company might deem an application ready for prime time after thorough testing and debugging. OSI deliberately avoids calling them "final releases" because parts of the definition are likely to evolve.
"We can't expect this definition to last 26 years like the Open Source Definition," Maffulli says. "I don't think the first part of the definition, like 'what is an AI system?', is going to change much. But the part we refer to in the checklist, the list of components, will depend on the technology. Who knows what the technology will be tomorrow?"
A stable definition of open source AI is expected to be approved by the board at the All Things Open conference at the end of October. In the interim, OSI has embarked on a global roadshow spanning five continents to solicit more "diverse opinions" on how "open source AI" should be defined going forward. But any final changes are likely to be just "small tweaks" here and there.
"This is the final stage," says Maffulli. "We have a fully functional version of the definition; all the elements are in place. We have the checklist, so we're making sure there are no surprises — no systems that should obviously be included or excluded."