Opinion: Free software and open source licenses evolved to handle code in the 1970s and '80s. Today, they must transform again to accommodate AI models.
AI was born out of open source software. But the copyright-based free software and open source licenses designed for software code are not suitable for the large language model (LLM) neural networks and datasets that power AI. Something needs to be done, especially since many programming datasets are built from free software and open source code. That's why many open source and AI leaders, including Open Source Initiative (OSI) Executive Director Stefano Maffulli, are working to reconcile AI and open source licensing in a way that makes sense for both.
Lest you think this is some theoretical legal argument with no real-world implications, think again. Consider J. Doe 1 et al. v. GitHub. The plaintiffs in this case, filed in the US District Court for the Northern District of California, allege that Microsoft, OpenAI, and GitHub stole open source code through OpenAI's Codex and GitHub's Copilot, both commercial AI-based systems. The result? The plaintiffs argue that the "suggested" code often lacks the required open source license attribution and consists of near-identical copies of code scraped from public GitHub repositories.
The case continues. The amended complaint includes allegations of violations of the Digital Millennium Copyright Act, breach of contract (violating open source licenses), unjust enrichment, unfair competition, and a further breach-of-contract claim (selling licensed material in violation of GitHub's policies).
Don't assume this type of litigation is only Microsoft's problem. It isn't. Sean O'Brien, a cybersecurity lecturer at Yale Law School and founder of the Yale Privacy Lab, told my colleague David Gewirtz: "As more authors use AI-powered tools to ship code under their own licenses, a feedback loop is created. Software ecosystems become polluted with code that will be the subject of cease-and-desist requests by enterprising companies."
He's right. I've covered patent trolls for decades. I guarantee that license trolls will target "your" ChatGPT and Copilot code.
Some, such as German researcher and politician Felix Reda, argue that all AI-generated code is in the public domain. US attorney Richard Santalesa, a founding member of the SmartEdgeLaw Group, pointed out to Gewirtz that this raises issues of both contract law and copyright law, and the two are not the same thing. Santalesa believes that companies producing AI-generated code will "treat the provided materials (including AI-generated code) as their property, just like all their other intellectual property." In any case, public domain code is not the same as open source code.
Added to that is the question of how to license datasets. There are many "open" datasets available under numerous open source licenses, most of which are not well suited to the job.
During our conversation, Maffulli elaborated on how the different artifacts produced by AI and machine learning systems fall under different laws and regulations. The open source community must decide which legal frameworks best serve its interests. Maffulli compared the current situation to the late '70s and early '80s, when software emerged as a field in its own right and copyright was first applied to source and binary code.
We are at a similar crossroads today. AI programs such as TensorFlow, PyTorch, and the Hugging Face Hub work well under open source licenses. The new AI artifacts are another story. Datasets, models, weights, and the like do not fit neatly into the traditional copyright model. Maffulli argued that instead of resorting to "hacks," the tech community should come up with something new that is better suited to its purposes.
In particular, Maffulli noted, open source licenses designed for software may not be optimal for AI work. For example, the broad freedoms of the MIT License may map onto models, but questions arise with more complex licenses such as the Apache License and the GPL. Maffulli also noted the challenge of applying open source principles to sensitive areas such as healthcare, where regulations on data access present unique hurdles. Put simply, medical data cannot just be open sourced.
At the same time, most commercial LLM datasets are black boxes. We literally have no idea what's in them. So, as the Electronic Frontier Foundation (EFF) puts it, we're stuck with "garbage in, gospel out." The EFF concludes that we need open data.
Maffulli said the OSI is working with Open Forum Europe, Creative Commons, the Wikimedia Foundation, Hugging Face, GitHub, the Linux Foundation, the ACLU, Mozilla, and the Internet Archive on a draft of shared Open Source AI principles, a common understanding of openness that will be "important in dialogue with legislative bodies." Even now, government agencies in the EU, US, and UK are struggling to develop AI regulations and are woefully unprepared to deal with the problem.
Maffulli concluded by saying that we need to go "back to basics" with the GNU Manifesto, which predates most licenses and defines the "north star" of the open source movement. He suggested that its principles remain surprisingly relevant even when applied to AI systems. Focusing on first principles can help us navigate this complex intersection of AI and open source. ®
