Is AI Compatible With Free Software?

As everybody from The Linux Foundation to Open Source Initiative to Facebook can tell you, figuring out how to make artificial intelligence platforms fit the Four Freedoms has been no easy task.

Sophia The Robot, and David Hanson, Founder, Hanson Robotics Ltd., at Web Summit 2019 at the Altice Arena in Lisbon, Portugal. | Web Summit, CC BY 2.0, via Wikimedia Commons

The recent discussions about how artificial intelligence fits into free software remind me of the debates early in the millennium about online services. Like online services, AI introduces new technology and circumstances that existing free licenses do not cover, and the discussion seems belated – too late, perhaps, to have much influence on future developments.

In the case of online services, the discussion resulted in the GNU Affero General Public License, a variant of the GNU GPL version 3 license with one added requirement: “if you run a modified program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the modified version running there.” Exactly what new licenses AI might require, though, remains to be seen. As with the discussions of online services, the debates over AI are complicated.

Open Source’s Acceptance by AI Developers

True, no one involved with AI seems to doubt the benefits of open source, and most are quick to state that free software is a benefit to the development of AI. Jim Zemlin, the Executive Director of the Linux Foundation, praises open source in AI for its lack of obstacles and the advantages it gives to all participants with the phrase, “There are no moats by design, and it floats all boats by design.” Steven Vaughan-Nichols goes even further, maintaining that AI has roots in the Lisp programming language and noting that ChatGPT was built using the Python-based TensorFlow and PyTorch.

Nor are these isolated opinions. The Linux Foundation’s AI and Data Foundation, founded in 2020 to “build and support an open community and a growing ecosystem of open source AI,” now includes dozens of projects from major tech corporations, even though there is no consensus about the definition of open source AI. In AI circles, the benefits of open source are so widely accepted that Mark Zuckerberg enthuses about them and insists that Meta’s Llama 2 is open source, ignoring the fact that its license includes a clause that warns users that if they have “greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this agreement unless or until Meta otherwise expressly grants you such rights.”

If anyone still doubts open source’s benefits for AI, they only have to look at how DeepSeek upended the AI field in a matter of days with its MIT-licensed source code.

Difficulties of Open Sourcing AI

The trouble is that the general enthusiasm for open source focuses solely on the benefits of sharing source code, and that is only part of the definition of free software. The Free Software Foundation’s Four Freedoms define free software as:

The freedom to run the program as you wish, for any purpose (freedom 0).

The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.

The freedom to redistribute copies so you can help others (freedom 2).

The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

Just as online services require access to the code that’s running on a server, AI should allow access to its training data so it can be studied, changed and redistributed. Moreover, without this access, users cannot know if the training data is free or not. The Free Software Foundation addresses this issue at length, maintaining that: “we cannot say a [machine learning] application is free unless all its training data and the related scripts for processing it respect all users, following the four freedoms.”

The FSF recognizes that in some situations this could cause legitimate conflicts, since it would require the training data to be both human readable and freely distributable, but points out that under certain circumstances using an application with nonfree training data might be excusable from a free software perspective.

“It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree,” FSF adds. “But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury. For the FSF to consider usage of such a nonfree ML application to be just, its component software must be free, and the ML application as a whole would have to be distributed to users in a form and manner that reasonably and flexibly supports incremental training, or retraining differently from scratch, or both.”

In the end, the FSF offers these comments as part of an ongoing discussion. It invites comments, although whether the FSF currently has the moral authority to organize a community discussion the way it did when it produced GPLv3 remains doubtful.

The failure to address this issue explains the severe criticism of the much-publicized Open Source AI Definition. The text of the definition itself admits that “The Open Source AI Definition does not require a specific legal mechanism for assuring that the model parameters are freely available to all. They may be free by their nature, or a license or other legal instrument may be required to ensure their freedom. We expect this will become clearer over time, once the legal system has had more opportunity to address Open Source AI systems.”

At best, the definition is a loose set of guiding principles. Little wonder, then, that its declaration of support lists only thirty-five names, none of which are major business or community leaders. When the definition was linked on the LWN website, it was widely condemned in the comments, with many saying that OSI was no longer a trustworthy community leader. Security expert Bruce Schneier spoke for many when he wrote, “It’s terrible. It allows for secret training data and mechanisms. It allows for development to be done in secret. Since for a neural network, the training data is the source code—it’s how the model gets programmed—the definition makes no sense.”

Still, if nothing else, the Open Source AI Definition has served as a starting point for further discussion, as efforts are made to be more specific. Bruce Perens, who is credited with writing the original Open Source Definition, suggests that “it’s incredibly simple to use the original Open Source Definition to ‘define’ Open Source AI. It just requires this: 1. The infrastructure software of the AI system must comply with the original Open Source Definition. 2. The training data is the source code for the AI model, and must comply with the original Open Source Definition.” However, this comment does not address the problems with training data, making it too general to be of much use.

A more detailed response is the Open Weight Definition. This definition takes its name from a model’s weights, the numerical parameters that a model learns from its training data, although it covers more issues than the name suggests. It lists ten criteria for open source AI:

Free Redistribution: The license shall not restrict any party from selling or giving away the weights as a component of an aggregate distribution containing components from several different sources. The license shall not require a royalty or other fee for such sale.

Model Weights: The product must include the weights, and must allow distribution of the weights. Where some form of a product is not distributed with the weights, there must be a well-publicized means of obtaining the weights for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The weights must be the actual form in which a practitioner would use the weights. Deliberately obfuscated weights are not allowed. Intermediate forms such as training checkpoints or partial states are not allowed. Transformed versions such as quantized or optimized forms are only allowed if clearly distinguished.

Derived Works: The license must allow modifications and derived works, and must allow them to be distributed under the same terms as the license of the original weights.

Integrity of The Author’s Work: The license must explicitly permit the distribution of weights derived from modifications. The license may require derived works to carry a different name or version number from the original weights.

No Discrimination Against Persons or Groups: The license must not discriminate against any person or group of persons.

No Discrimination Against Fields of Endeavor: The license must not restrict anyone from making use of the weights in a specific field of endeavor. For example, it may not restrict the weights from being used in a business, or from being used for genetic research.

Distribution of License: The rights attached to the weights must apply to all to whom the weights are redistributed without the need for execution of an additional license by those parties.

License Must Not Be Specific to a Product: The rights attached to the weights must not depend on the weights being part of a particular distribution. If the weights are extracted from that distribution and used or distributed within the terms of their license, all parties to whom the weights are redistributed should have the same rights as those that are granted in conjunction with the original distribution.

License Must Not Restrict Other Software: The license must not place restrictions on other components that are distributed along with the weights. For example, the license must not insist that all other models distributed on the same medium must be Open Weight.

License Must Be Technology-Neutral: No provision of the license may be predicated on any individual technology or style of interface.

The Open Weight Definition is the most thorough definition so far, but both it and Bruce Perens’ definition are more theoretical than practical.

Given the computing power required by many AIs, it is common for an AI to be hosted on a separate, smaller server than the one that holds its training data. In such a case, the administrator of an AI may not be able to provide access to the training data, or guarantee that it is freely licensed. Perhaps the only solution is to take another inspiration from Debian and classify some AI, like the software in Debian’s contrib repository, as software that is free in itself but dependent on nonfree software – in other words, to settle for an unsatisfactory compromise.

No doubt an MIT license, which allows free and proprietary software to interact more easily, would be simpler to deal with, but many will not be satisfied until a free software license is written and field-tested that is in keeping with the traditional Four Freedoms. Currently, no license is generally accepted as meeting this standard, although debate continues.

Bruce Byfield

Bruce Byfield has been involved in FOSS since 1999. He has published more than 2000 articles, and is the writer of “Designing with LibreOffice,” which is available as a free download.