AI and Open Source: Defining the New Era
We all know pretty well what open source means and what AI means. But what does open source AI mean? Is there even such a thing?
The Open Source Initiative (OSI) has been the steward of the Open Source Definition for over two decades, and recently launched the first version of its Open Source AI Definition, OSAID 1.0, following intense community discussion, and amid corporate and governmental policy-making and opinion-shaping efforts.
In the latest episode of OpenObservability Talk, I sat down with Stefano Maffulli, Executive Director of the OSI, to hear about the efforts to define open source AI, and why it’s such a tricky thing.
AI challenges the open source definition
Open source means, among other things, the right to modify the program. In software it is very clear what this requires: any programmer should have access to the source code, the build scripts, the library dependencies and so on, in order to recreate the software and then fix bugs or otherwise modify it.
But what do we need in order to modify an AI program? What is the preferred form for modification?
How is open source different in AI than in software?
The primary difference between open-source AI and traditional open-source software lies in the role of data. In software, the source code is the product — developers can modify it, compile it, and distribute it as they see fit, according to the basic freedoms of free and open source software (FOSS).
AI models, however, aren’t built purely from code. They are trained on large datasets, and those datasets are often the most valuable part of the system. In the machine learning realm, the equivalent freedom would require access to the model weights and parameters, as well as to the code used to build the data set, the code used to train the system, the complete list of data used for training, and the actual data set itself.
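To make that enumeration concrete, here is a minimal, hypothetical sketch in Python (using PyTorch, my choice for illustration rather than anything discussed in the episode) of the artifacts it covers: data-preparation code, training code, and the resulting weights. Releasing only the saved weights file is roughly analogous to shipping a compiled binary without its source.

```python
# Hypothetical sketch of the artifacts an open source AI system would expose,
# per the components listed above. Requires PyTorch (pip install torch).
import torch
import torch.nn as nn

# 1. Data-preparation code: in a real system this would document how the
#    training corpus was collected and filtered; synthetic data stands in here.
def build_dataset(n_samples: int = 256):
    torch.manual_seed(0)  # reproducibility is what lets others recreate the system
    x = torch.randn(n_samples, 4)
    y = (x.sum(dim=1, keepdim=True) > 0).float()
    return x, y

# 2. Training code: the recipe that turns the data into parameters.
def train(x, y, epochs: int = 100):
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return model

# 3. Model weights: often the only artifact that "open weight" releases provide.
x, y = build_dataset()
model = train(x, y)
torch.save(model.state_dict(), "model.pt")
```

A release that ships only model.pt leaves the first two pieces, and the real data behind build_dataset, opaque; the open question is which of those pieces “open” must include.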
This presents an obvious dilemma: if the training data is closed or restricted, can the resulting AI system truly be considered “open”?
The debate around open data in AI
This leads to an ongoing debate within the open-source community: How open should open-source AI be, exactly? Is it enough for an AI model’s code to be open, or do we also need full access to the training data and the model’s decision-making logic?
This debate is ongoing among the OSI, the Free Software Foundation, the Software Freedom Conservancy, the Open Source Declaration, the Digital Public Goods Alliance and others. Many argue that to be truly open, AI systems must be completely transparent, from code to data. After all, that data shapes the AI’s decisions and biases, and without access to it, developers and end-users are left with very little insight into how or why the AI behaves the way it does.
However, others are more cautious. They point out that releasing sensitive or proprietary data could have negative consequences, including privacy violations or exposing intellectual property to competitors.
Stefano mentioned the example of the “poster child of Open Source AI,” EleutherAI, which is pushing the boundaries of open-source AI by releasing not only its models but also the tools to replicate them. This level of transparency is laudable, but it raises additional questions: how can we ensure that these models don’t perpetuate harmful biases or cause unintended consequences if their training data is flawed or incomplete?
Stefano raised another point: if open-source AI models may only be trained on open data, we might end up reinforcing a power imbalance. Major corporations like OpenAI, Meta, and Anthropic are not bound by such restrictions, and the constraint could stifle the growth of more diverse, open-source AI models.
OSAID 1.0: an emerging definition for open source AI
A significant step forward in this ongoing debate is the release of the Open Source AI Definition (OSAID) 1.0 by the Open Source Initiative (OSI). Launched at the All Things Open conference, this definition aims to provide a standardized framework for what constitutes open-source AI. For the first time, we have an official, industry-backed attempt to set guidelines for how AI models and their datasets should be treated within the open-source ecosystem.
At the core of OSAID 1.0 is a recognition that true open-source AI is more than just accessible code. It’s about responsible data governance, transparency, and creating systems that can be freely used, modified, and improved upon — without compromising ethical standards or stifling innovation.
Stefano clarified that OSAID 1.0 is only the beginning, and emphasized the OSI’s commitment to keep updating OSAID to ensure alignment with Open Source principles.
The broken social contract of data
The concept of a “social contract” is central to the discussions surrounding AI and data governance. Historically, individuals have consented to share their data with companies and organizations in exchange for products or services. However, that social contract has been broken in the world of AI.
Companies often use personal data for training AI models without fully informing individuals or obtaining explicit consent. In some cases, this data is collected, sold, or used in ways that violate privacy or ethical standards. This broken social contract has led to calls for stricter data governance, especially when it comes to AI training datasets.
Stefano touched on this issue during our conversation, highlighting how the misuse of data could erode trust in AI systems. Without transparent data practices and clear regulations, we risk creating AI models that are not only biased but also harmful. The open-source community has a unique opportunity to address this issue by advocating for ethical data practices and ensuring that AI models are built on solid, responsible foundations.
The new white paper: “Data Governance in Open Source AI”
To help address some of these issues, the OSI has partnered with the Open Future Foundation to release a new white paper titled “Data Governance in Open Source AI: Enabling Responsible and Systematic Access.” This paper provides a roadmap for how the open-source community can better govern AI training data.
The white paper outlines the importance of responsible data access, transparency in data collection practices, and the need for clear ethical guidelines when it comes to AI. It argues that open-source AI can only thrive if we establish a strong framework for data governance — one that ensures the data used is ethical, transparent, and available to all stakeholders. This is a critical step toward ensuring that AI models are not only open but also trustworthy.
The role of legislation in shaping open-source AI
Finally, as AI grows more pervasive, legislation is stepping in to regulate its use. The EU Artificial Intelligence Act is one of the most comprehensive legislative efforts to date, and it includes provisions that directly affect open-source AI. The act aims to create a legal framework that ensures AI systems are used safely, ethically, and transparently.
Stefano discussed the EU’s approach and how it differentiates between high-risk and low-risk AI applications. The AI Act also acknowledges the unique challenges posed by open-source AI and provides exemptions and provisions that could help foster innovation in this space.
The EU AI Act is already in force, and a designated AI Office has been established within the European Commission, in charge of its enforcement. The European Union is not alone: China has also moved ahead with AI regulations, and while there is no similar federal law in the US, there are state-level initiatives, such as in California and Colorado.
As open-source AI continues to evolve, it will be crucial for legislators to stay ahead of the curve, ensuring that regulations are flexible enough to accommodate innovation while protecting society from the potential risks of AI.
Want to learn more? Check out the OpenObservability Talks episode: Open Source AI: Perspectives from the OSI.