
OSI readies controversial Open AI definition

By Joe Brockmeier
October 25, 2024

The Open Source Initiative (OSI) has been working on defining Open Source AI—that is, what constitutes an AI system that can be used, studied, modified, and shared for any purpose—for almost two years. Its board will be voting on the Open Source AI Definition (OSAID) on Sunday, October 27, with the 1.0 version slated to be published on October 28. It is never possible to please everyone in such an endeavor, and it would be folly to make that a goal. However, a number of prominent figures in the open-source community have voiced concerns that OSI is setting the bar too low with the OSAID—and that doing so will undo decades of community work to cajole vendors into adhering to, or at least respecting, the original Open Source Definition (OSD).

Defining Open Source AI

OSI executive director Stefano Maffulli announced the organization's intent to provide a definition for open-source AI in June 2023. He took exception to announcements of "large language models, foundational models, tooling, services all claiming to be 'open' or 'Open Source'" that nevertheless added restrictions running afoul of the OSD. A survey of large-language-model (LLM) systems in 2023 found that ostensibly open-source LLMs did not live up to the name.

The problem is not quite as simple as saying "use an OSD-compliant license" for LLMs, because there are many more components to consider. The original OSD is understood to apply to the source code of a program in "the preferred form in which a programmer would modify the program". A program is not considered open source if a developer cannot study, use, modify, and share it, and a license is not OSD-compliant if it does not preserve those freedoms. A program can depend on non-free data and still be open source, however. For example, the game Quake III Arena (Q3A) is available under the GPLv2. That distribution does not include the pak files that contain the maps, textures, and other content required to actually play the commercial game. Despite that, others can still use the Q3A code to create their own games, such as Tremulous.

When discussing an "AI system", however, things are much more complicated. There is more than just the code used to run the models to do work of some kind, and the data cannot be wholly separated from the system in the way that it can be with a game. When looking at, say, LLMs, there is the model architecture, the code used to train models, model parameters, the techniques and methodologies used for training, the procedures for labeling training data, the supporting libraries, and (of course) the data used to train the models.

OSI has been working on its definition since last year. It held a kickoff meeting on June 21, 2023 at the Mozilla headquarters in San Francisco. It invited participation afterward via a regular series of in-person and online sessions, and with a forum for online discussions. LWN covered one of the sessions, held at FOSDEM 2024, in February.

The current draft of the OSAID takes its definition of an AI system from the Organisation for Economic Co-operation and Development (OECD) Recommendation of the Council on Artificial Intelligence:

A machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.

This includes source code for training and running the system, model parameters "such as weights or other configuration settings", as well as "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system".

Preferred form to make modifications

Those elements must all be available under OSI-approved licenses, according to the proposed definition, which seems perfectly in line with what we've come to expect when something is called "open source". There is an exception, though, for things like the data information and model parameters, which instead must be available under "OSI-approved terms". A definition of OSI-approved terms has not yet been supplied.

There is no requirement to make the training data available. To be compliant with the current draft of the OSAID, an AI system need only provide "detailed information" about the data but not the data itself.

The OSI published version 0.0.9 on August 22. It acknowledged then that "training data is one of the most hotly debated parts of the definition". However, the OSI was choosing not to require training data:

After long deliberation and co-design sessions we have concluded that defining training data as a benefit, not a requirement, is the best way to go.

Training data is valuable to study AI systems: to understand the biases that have been learned, which can impact system behavior. But training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

As it stands, some feel that the OSAID falls short of allowing the four freedoms that it is supposed to ensure. For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. "They would be able to build on top of it, through methods such as transfer learning and fine-tuning, but that's it."

Tom Callaway has written at length on LinkedIn about why open data should be a requirement. He acknowledges that there are good reasons that distributors of an AI system may not want, or be able, to distribute training data. For example, the data itself may have a high monetary value on its own, and a vendor may be unwilling or unable to share it. Acme Corp might license a data set and have permission to create an AI system using it, but not the ability to distribute the data itself. The data might have legal issues, ranging from confidentiality (e.g., medical data sets) to a desire to avoid lawsuits from using copyrighted data.

All of those are understandable reasons for not distributing data with an AI system, he said, but they don't argue for crafting a definition that allows companies to call their system open:

If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts. While there are lots of companies who would like to see open source mean less, I think it's critical that we not compromise here, even if it means there are less Open Source AI systems at first.

Objections to the lack of training data are about more than attachment to the original meaning of open source. Giacomo Tesio posted a list of issues he considered unaddressed in the RC2 version of the OSAID, including a claim that there is inherent insecurity due to the ability to plant undetectable backdoors in machine-learning models.

Others weigh in

The Free Software Foundation (FSF) announced that it was working on "a statement of criteria for free machine learning applications"—that is, what it would take to call something a free (or libre) machine-learning application. The FSF says that it is close to a definition, and is working on the exact text. However, it adds that "we believe that we cannot say a ML application 'is free' unless all its training data and the related scripts for processing it respect all users, following the four freedoms".

However, the FSF makes a distinction between non-free and unethical in this case:

It may be that some nonfree ML have valid moral reasons for not releasing training data, such as personal medical data. In that case, we would describe the application as a whole as nonfree. But using it could be ethically excusable if it helps you do a specialized job that is vital for society, such as diagnosing disease or injury.

The Software Freedom Conservancy has announced an "aspirational statement" about LLM-backed generative AI for programming called "Machine-Learning-Assisted Programming that Respects User Freedom". Unlike the OSAID, this target focuses solely on computer-assisted programming, and was developed in response to GitHub Copilot. The announcement did not directly name the OSI or the OSAID effort, but said "we have avoided any process that effectively auto-endorses the problematic practices of companies whose proprietary products are already widely deployed". It describes an ideal LLM system built only with FOSS, with all components available, and only for the creation of FOSS.

Response to criticisms

I emailed Maffulli about some of the criticisms of the current OSAID draft, and asked why OSI appears to be "lowering the bar" when the OSI has never budged on source availability and use restrictions. He replied:

I'll be blunt: you mention "source redistribution" in your question and that's what leads people like [Callaway] into a mental trap [...]

There are some groups believing that more components are required to guarantee more transparency. Other groups instead believe that model parameters and architecture are enough to modify AI. The Open Source AI Definition, developed publicly with a wide variety of stakeholders worldwide, with deep expertise on building AI (see the list of endorsers), found that while those approaches are legitimate, neither is optimal. The OSAID grants users the rights (with licenses) and the tools (with the list of required components) to meaningfully collaborate and innovate on (and fork, if required) AI systems. We have not compromised on our principles: we learned many new things from actual AI experts along the way.

Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment". He added that OSI had synthesized input from "AI builders, users, and deployers, content creators, unions, ethicists, lawyers, software developers from all over the world" to arrive at the definition. A "simple translation" of the OSD, he said, would not work.

Stephen O'Grady, founder of the RedMonk analyst firm, also makes the case that the OSD does not easily translate to AI projects. But he does not believe that the term open source "can or should be extended into the AI world", as he wrote in a blog post on October 22:

At its heart, the current deliberation around an open source definition for AI is an attempt to drag a term defined over two decades ago to describe a narrowly defined asset into the present to instead cover a brand new, far more complicated future set of artifacts.

O'Grady makes the case that the OSI has set out on a pragmatic path to define open-source AI, which requires nuance. Open source has succeeded, in part, because the OSD removes nuance. Does a license comply with the OSD or doesn't it? It's pretty easy to determine. Less so with the OSAID. The pragmatic path, he said:

Involves substantial compromise and, more problematically, requires explanation to be understood. And as the old political adage advises: "If you're explaining, you're losing."

It would have been better, he said, if the OSI had not tried to "bend and reshape a decades old definition" and instead had tried to craft something from a clean slate. That seems unlikely now, he said, after two years of trying to "thread the needle between idealism and capitalism to arrive at an ideologically sound and yet commercially acceptable" definition.

Indeed, it seems likely that the OSI board will move forward with the current draft of the OSAID or something close to it. The impact that will have is much less certain.




The preferred form for modifying ML systems

Posted Oct 25, 2024 22:15 UTC (Fri) by ballombe (subscriber, #9523) [Link] (1 responses)

> Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: "it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment"

Where does the CMU comment say anything about 'the preferred form for modifying ML systems'?

The preferred form for modifying ML systems

Posted Oct 26, 2024 22:48 UTC (Sat) by somlo (subscriber, #92421) [Link]

Speaking of the CMU "endorsement", note that the largest percentage of signatories seems to be librarians, followed by business-school faculty (Tepper), and a single member of the School of Computer Science faculty (specializing in "Ethics and Computational Technologies").

Conspicuously absent are any of the AI/ML "heavy hitters", so I'm not really sure what conclusion one should draw from that...

modification

Posted Oct 25, 2024 22:17 UTC (Fri) by shironeko (subscriber, #159952) [Link]

Maybe the supporters of the definition could demonstrate practically modifying an ML model without using the original training data, and show that it is just as easy as with the original data and that it does not limit what you can do with it (e.g. demonstrate that it can unlearn any parts of the original data as if they had not been used).

Personally it would seem that, unfortunately for some, Stallman was right yet again.

ChatGPT says...

Posted Oct 25, 2024 22:31 UTC (Fri) by leromarinvit (subscriber, #56850) [Link] (6 responses)

Given the topic, asking an AI for its thoughts on the matter seemed apt. So I prompted ChatGPT (which is, of course, about as far removed from "Open Source AI" as seems possible):

> Given the following article, write a comment about whether or not training data should be included in the OSAID. Explain why, either way.
> [...]

It seems it wants to see training data included in the definition:

> Whether training data should be included in the Open Source AI Definition (OSAID) is a crucial issue, and I believe it should be required, not optional. Leaving training data out of the definition weakens the foundational principles of open source, particularly the ability to fully understand, modify, and build upon AI systems.

> Training data is central to how AI models are developed, and it plays a significant role in shaping the behavior and biases of the system. Without access to the data—only providing "detailed information"—users are restricted in their ability to recreate or meaningfully modify the system. This exclusion limits transparency and undermines the goal of fostering collaboration and innovation, which are core values of the open-source community.

> While there are valid concerns about the ethical and legal challenges of releasing training data, such as privacy issues and proprietary data restrictions, these challenges should be addressed case-by-case rather than lowering the bar for what qualifies as open source. By allowing AI systems to be considered "open" without sharing the training data, the OSAID risks diluting the meaning of open source and setting a precedent where essential components of AI systems can remain closed while still benefiting from the open-source label.

> In short, if the goal is to preserve the spirit of open source in AI, training data should not be optional, as omitting it compromises transparency, accountability, and the ability to fully exercise the four freedoms that open source aims to protect.

;-)

While I happen to agree with the conclusion, the reasoning seems more like a summary of the article to me, instead of actual independent thought (well, duh - it's an LLM, it can't think). In particular, what stands out to me is that with source code, I can not only add new features or modify existing ones, but also remove parts entirely. This seems impossible with the kind of iterative modification that seems to be the only kind taken into account here (to my admittedly very limited understanding of the matter).

Or, more to the point:
> training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.

Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.

Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

ChatGPT says...

Posted Oct 26, 2024 0:01 UTC (Sat) by rettichschnidi (subscriber, #93261) [Link] (2 responses)

> Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

True for LLMs, at least as of today. But there are many (smaller) ML/"AI" applications that can already be trained by SMEs and even enthusiasts today.

ChatGPT says...

Posted Oct 26, 2024 4:36 UTC (Sat) by jmspeex (subscriber, #51639) [Link]

Absolutely. All of the ML models that ended up in Opus 1.5 (https://opus-codec.org/demo/opus-1.5/) can be trained on a single GPU. And there's still people using the original RNNoise I trained in 2017 on a laptop *CPU*. There's tons of useful things you can do with DNNs that are 1M weights or smaller.
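For a sense of scale, here is a rough sketch (in PyTorch, with an arbitrary toy architecture rather than anything from Opus or RNNoise) of the kind of network that stays well under the 1M-weight mark:

    # Sketch only: count the weights of a small GRU-based model of the sort
    # that trains comfortably on a single CPU or GPU.
    import torch

    class SmallDenoiser(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.inp = torch.nn.Linear(64, 256)
            self.rnn = torch.nn.GRU(256, 256, batch_first=True)
            self.out = torch.nn.Linear(256, 64)

        def forward(self, x):
            hidden, _ = self.rnn(torch.relu(self.inp(x)))
            return self.out(hidden)

    params = sum(p.numel() for p in SmallDenoiser().parameters())
    print(params)   # roughly 430,000 parameters, well under a million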

ChatGPT says...

Posted Oct 26, 2024 7:02 UTC (Sat) by mirabilos (subscriber, #84359) [Link]

As long as they participate in Reproducible Builds and an independent party has reproduced the exact same model from the same inputs once, it's at least not totally unpalatable if not everyone has the pockets to regenerate it from scratch in practice.

Cost of training a model

Posted Oct 28, 2024 8:54 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.

I'm leery of arguments based on the cost of training a model, because they generalise to "no software is open source unless you're willing to donate a working system to anyone who wants to build it". Instead, I'd prefer to see arguments around the reproducibility of training (since, if training is not reproducible, a skilled hacker is unable to test whether their optimized training process produces the same results at under 1% of the cost of the original) or around the incompleteness of training data.

Put another way; if it currently costs $100m in cloud GPU costs to train a model, but a clever person is able to work out how to do the same training on a $500 laptop for $100 in electricity, I want that clever person to be able to show that their technique works exactly as well as the original, and not to be held back by the lack of open models and training sets they can use to show off their technique.

Cost of training a model

Posted Oct 29, 2024 17:16 UTC (Tue) by paulj (subscriber, #341) [Link]

The human brain is capable of some pretty sophisticated training with just thousands of Joules, and really profound learning with MWs, so.... the day will come.

(We may run into some profound ethical issues with ML before that day, but that's another topic).

ChatGPT says...

Posted Oct 29, 2024 17:13 UTC (Tue) by paulj (subscriber, #341) [Link]

> Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.

This isn't quite true, depending on what you mean.

a) You can start from scratch, just start with random parameters again, and train with your own data. I.e., just don't use the parameter set.

b) If there are specific things in the trained parameters that you don't like/want, you can /keep/ training the model but apply your own objective function to penalise the thing you don't like. Essentially "correcting" whatever it is you don't like.

The challenge with b may be that you don't know what is in there that you don't want. However, you will probably have the same problem if you have the full input data. The model is basically giving you the training data in a digested form that at least lets you query it, though.

That "correcting" process is something all the big players with public AI tools surely must have /big/ teams of people working on: just filtering out whatever "fruity" content happens to be discovered in public LLMs and generative AIs. Teams manually looking for "dodgy" content, to then come up with the widest set of prompts that spit out that dodgy content (in all its combinations), to then pass on to other teams to create refinement training cases and test-cases. So that we get the safely sanitised LLMs / models suitable for public use.
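A minimal sketch of option (b), assuming PyTorch, a toy model, and an invented penalty() function standing in for a real objective; the point is only that continued training needs the released weights and new data, not the original corpus:

    # Sketch: keep training released weights with an extra penalty term,
    # using only your own data. The model, data, and penalty() are toy stand-ins.
    import torch
    import torch.nn.functional as F

    model = torch.nn.Linear(16, 4)            # pretend these are the released weights
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    def penalty(logits):
        # Hypothetical objective: discourage the model from ever predicting class 0.
        return F.softmax(logits, dim=-1)[:, 0].mean()

    for _ in range(100):
        x = torch.randn(32, 16)               # new data, not the original training set
        y = torch.randint(1, 4, (32,))
        logits = model(x)
        loss = F.cross_entropy(logits, y) + 0.1 * penalty(logits)
        opt.zero_grad()
        loss.backward()
        opt.step()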

Everything is not a nail...

Posted Oct 26, 2024 1:40 UTC (Sat) by chuckwolber (subscriber, #138181) [Link] (4 responses)

It was relatively easy to create a definition of openness that applies cleanly to deterministic systems like compiled software. It seems like we are trying to apply the same definition to a non-deterministic system (neural nets) and we keep failing to notice the real problem.

If I listen to music that ultimately influences my style as a musician, I am no less free when I play in my heavily influenced style and the music that influenced me is no less encumbered by copyright. The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.

Today those qualia are small, so it is easy to dismiss their subjectivity. Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.

By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society. The very idea is absurd.

We invented a shortcut to that problem a long time ago - it is called a "social contract". We cannot possibly know everything that is going on in your head, but we can set forth expectations and apply relevant penalties.

I propose we rethink the OSAI in the same way.

Everything is not a nail...

Posted Oct 26, 2024 5:12 UTC (Sat) by shironeko (subscriber, #159952) [Link]

I hope you are not letting others modify and redistribute your brain. By any definition, your mind is proprietary to you.

Everything is not a nail...

Posted Oct 26, 2024 13:11 UTC (Sat) by Wol (subscriber, #4433) [Link] (1 responses)

> Today those qualia are small, so it is easy to dismiss their subjectivity. Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.

And herein lies the entire problem with AI, imho.

The qualia (whatever that means) of the human brain is much LESS than we give it credit for. But it is QUALITATIVELY DIFFERENT from what we call AI and LLM and that junk. AI and LLMs are bayesian inference machines. Human brains have external forces imposing a (not necessarily correct) definition of "right" and "wrong".

When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever. When a computer bayesian machine messes up, it's too expensive to retrain.

I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank ... (I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...)

Cheers,
Wol

Everything is not a nail...

Posted Oct 26, 2024 17:31 UTC (Sat) by khim (subscriber, #9252) [Link]

> When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever.

That's an extremely rare occurrence.

> When a computer bayesian machine messes up, it's too expensive to retrain.

That could be true for LLMs, today, but many smaller-scale AI models are routinely retrained from scratch.

Pretty soon there will be more LLMs retrained from scratch than the 100 or 200 billion people who have ever lived on this planet.

Would that mean that LLMs will have achieved parity with humans when that happens? Hard to predict… my gut feeling is that no, it wouldn't – not because encounters with lions or wildebeests made humans that much better, but because nature invented a lot of tricks over billions of years that we don't know how to replicate… yet.

> I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank

It could. In fact, today AI already makes fewer mistakes than humans on such tasks.

Yes, AI makes different mistakes, but its average precision is better.

> I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...

We just tend to dismiss the issues with “optical illusions” and exaggerate AI mistakes. To feel better about ourselves, maybe?

When you are shown a picture where the exact same color looks “white” in one place and “black” in another, you don't even need AI to reveal the difference: a simple sheet of paper is enough – but that's not perceived as a human-brain deficiency, because we are all built the same and all fall prey to the exact same illusion.

AIs are built differently and they are fooled by different illusions than what humans misperceive – and that difference is, in our arrogance, treated as “human superiority”.

It's only when AI starts doing things so much better than humans, beating them so supremely decisively that there is no doubt AI is “head and shoulders” above humans at this or that task… only then do humans… redefine that particular task as “not really important”. That is what happened to Go and chess, among other things.

NN training is deterministic (enough)

Posted Oct 27, 2024 16:10 UTC (Sun) by ghodgkins (subscriber, #157257) [Link]

> a non-deterministic system (neural nets)

In my understanding, training a neural net is deterministic in the sense that matters for reproducibility. If you train the same model architecture in the same environment with the same data, you'll get the same final weights. This is true even if you draw from random distributions during training, as long as you choose the same seed(s) for the PRNG.

The input-output mapping of the trained model is usually also deterministic, except for some special-purpose stochastic models. Even those you may be able to make reproducible by fixing the PRNG seed, as above.
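As a minimal illustration of what that means in practice (a sketch in PyTorch; the framework and determinism flags are my choice of example, nothing from the article): fixing the seed and requesting deterministic kernels makes two runs bit-identical.

    # Sketch: the same seed, data, and code produce the same final weights.
    import torch

    def train(seed):
        torch.manual_seed(seed)                   # fix the PRNG for init and data
        torch.use_deterministic_algorithms(True)  # refuse nondeterministic kernels
        model = torch.nn.Linear(8, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        data = torch.randn(64, 8)                 # "same data" here comes from the seed
        target = data.sum(dim=1, keepdim=True)
        for _ in range(50):
            loss = torch.nn.functional.mse_loss(model(data), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model.weight.detach().clone()

    assert torch.equal(train(0), train(0))        # identical runs, identical weights
    assert not torch.equal(train(0), train(1))    # different seed, different weights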

> The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.

It is not true that the weights "owe nothing" to the training data. As mentioned above, for a fixed PRNG seed, they are in fact a very complex closed-form function of that data - certainly "dependent" in the probability sense.

> By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society.

I think it's reasonable to have different expectations for software tools and the people that use them, and honestly kind of absurd not to.

> The very idea is absurd.

For humans, certainly. One key difference between humans and NNs here is that NNs have a thing called "training" with well-defined inputs and output, in a consistent and well-defined format, which makes enumerating the training data entirely feasible.

> We cannot possibly know everything that is going on in your head

But we can know everything that is going on inside a NN, although we may not be able to interpret it with respect to the inputs and outputs.

Debian policy

Posted Oct 26, 2024 6:57 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

I prefer the Debian policy about libre AI:

https://salsa.debian.org/deeplearning-team/ml-policy

Debian policy

Posted Oct 31, 2024 13:21 UTC (Thu) by danpb (subscriber, #4831) [Link]

The document in that repo has this text at the top of it: "Disclaimer: This draft is UNOFFICIAL and does not represent the point of view of the Debian Project." So it doesn't actually seem to be an actual Debian policy, unless the text has been moved somewhere else and this repo with the disclaimer is an outdated copy?

Nonetheless, I do like to see documents like this. The industry is awash with opinions/policies that are almost entirely driven by corporate vested interests who want to push & sell the use of AI everywhere. There is not nearly enough analysis & input from groups who are looking at it from an objective & independent POV, without their judgement being clouded by a need to sell AI.

Re: Response to criticisms

Posted Oct 26, 2024 6:59 UTC (Sat) by mirabilos (subscriber, #84359) [Link] (4 responses)

The OSI has actively ignored some critics (including at least zack, who was even an OSI board director) or even silenced criticism, with discussion-forum bans and post deletion.

It certainly has lost any and all credibility with these actions and with its whole procedure, which was designed from the beginning to cater to corporate interests over open-source principles (leading questions, no survey responses that allow disagreement, etc.).

Many people certainly don’t want those theft machines blessed by something they identify with.

(I will, for one, *not* enable reply notifications, nor engage further here. I’ve written up on my webservers and discussed on the OSI MLs and in Fedi all that needs to be said.)

Re: Response to criticisms

Posted Oct 26, 2024 7:53 UTC (Sat) by rettichschnidi (subscriber, #93261) [Link]

> and post deletion

Can confirm. While my posts (https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20O...) might have been below their quality expectation, having them marked as "Pending" (for review) and then silently dropped (deleted) after ~2 days seems not ok to me. At least some kind of feedback/notification would have been appreciated.

Re: Response to criticisms

Posted Oct 26, 2024 14:39 UTC (Sat) by IanKelling (subscriber, #89418) [Link] (2 responses)

A discussion forum opened many months after they had already published a draft definition. For a long time, the only option was a web comment system requiring proprietary JavaScript, which is still there. And yes, I made sure OSI knew from the beginning that it had published a draft definition suggesting only one way to comment publicly, one that excluded people who strongly believed in principles that were part of the definition.

The CMU statement which calls it a "participatory co-design process" which was "inclusive" makes me roll my eyes. It was drafted privately by the OSI board in some undisclosed way.

Re: Response to criticisms

Posted Oct 28, 2024 7:23 UTC (Mon) by WolfWings (subscriber, #56790) [Link] (1 responses)

...proprietary javascript? Sorry, but how does that even occur? It's plain text, downloaded to your browser, at worst a little obfuscated and often minified, but still generally readable with basic filters.

Or do you just mean it required javascript enabled at all to function and post comments?

Re: Response to criticisms

Posted Oct 28, 2024 7:45 UTC (Mon) by mjg59 (subscriber, #23239) [Link]

It's code that you have no permission to modify or redistribute, which is the generally accepted definition of proprietary software.

Why not two licences?

Posted Oct 26, 2024 15:26 UTC (Sat) by burki99 (subscriber, #17149) [Link] (8 responses)

It seems rather obvious: there are two views on this topic, both reasonable, but fundamentally incompatible.

Therefore I would suggest coming up with:
a) Pragmatic Open Source AI
b) Strict Open Source AI

a) would be the one suggested by OSI, b) would additionally require the training data to be available (which implies an open content license that allows redistribution of the training data).

Why not two licences?

Posted Oct 26, 2024 15:55 UTC (Sat) by jzb (editor, #7867) [Link]

License discussions are no place for reasonable suggestions. :)

In seriousness, though - that's not a terrible idea. I'm not sure I'd go with "Pragmatic" and "Strict" as the names, but something that indicates "everything but the data is available" and "this is the gold standard of openness, including open data" with the naming reflecting that might satisfy some of the folks unhappy with tagging anything less than fully open as "open source".

Why not two licences?

Posted Oct 26, 2024 21:26 UTC (Sat) by josh (subscriber, #17465) [Link] (5 responses)

Sounds great. For accuracy and consistency with widespread industry understanding of the term "Open Source", let's call the first one "proprietary AI", and the second one "Open Source AI".

Why not two licences?

Posted Oct 27, 2024 8:00 UTC (Sun) by burki99 (subscriber, #17149) [Link] (3 responses)

How about Open Code AI vs. Open Data AI? That would keep the ambiguous term "Source" out of the labels.

Why not two licences?

Posted Oct 27, 2024 23:12 UTC (Sun) by neggles (subscriber, #153254) [Link] (2 responses)

The AI/ML industry already has names for these two types of model: Open Weights for the former, and Open Data for the latter. Both are subtypes of "Open Source", just with differences in which components are openly available.

Contrary to popular belief, merely having access to the dataset doesn't actually give you the ability to reproduce a model. For example, the dataset for Stable Diffusion 1.5 is known and publicly available, the preprocessing steps taken are known, and the training process is known, but any attempt to exactly recreate it is unlikely to succeed due to changes in hardware, specific step counts, underlying library bug fixes and behaviour changes, specific corrupted images, etc.

Meta and OpenAI and MidJourney and many others have made newer, better variants of the same concept/using similar architectures, but nobody's made a reproduction, and not for lack of trying.

That said, why would you want to?

If you want to significantly improve the model in a general sense, you'll need to adjust the architecture, which necessitates retraining it from scratch anyway - and you'll likely want to replace the dataset with one more suited to your desired use case.

If you just want to make it good at generating some specific type of image, finetuning on top of the existing base model weights requires orders of magnitude less compute and a completely different domain-specific dataset anyway.

This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.

I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people made their judgement long before the courts have made theirs, and in general they leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.

There is no real benefit to requiring open datasets, and there are a lot of massive downsides; it would effectively kill open-source AI because no company would risk the potential legal exposure from doing such a thing - even if the data *doesn't* have copyright issues, proving that in court could be rather difficult and time-consuming. Copyright trolling is just as real a thing as patent trolling - just ask anyone who's been subject to frivolous DMCA takedown requests.

Why not two licences?

Posted Oct 28, 2024 20:00 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (1 responses)

> This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.
>
> I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to know domain-specific knowledge or understanding of how this technology works.

I generally agree that the anti-AI artists are on the wrong side here (or at least, they have not demonstrated to my satisfaction that their absolutist position holds water in every possible circumstance).

But I think it would be helpful to take a step back and avoid vilifying a group of people whose industry is currently undergoing a serious level of tech-induced upheaval. People are scared and worried for their economic future, and I think they do have a right to be upset and to advocate for their self interest, even as I disagree with them.

IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training. It is most plausible to me that that will end up being a fact-specific inquiry which requires individualized analysis of both the model and the specific training work whose copyright is allegedly infringed.* Frankly, I'm not entirely convinced that any version of open source AI would be viable in such a regime - but the point is that artists posting angry messages on Xitter are probably not going to change that outcome one way or the other.

Disclaimer: I work for Google, which does some GenAI stuff that I'm not involved with. Opinions are my own, speculation about the future is quite possibly wrong.

* For example, The New York Times provided evidence (in their lawsuit against OpenAI) that ChatGPT can sometimes produce verbatim or near-verbatim copies of their columns. I could imagine courts wanting to see evidence similar to that before entering a finding of infringement.

Civil vs Common law

Posted Oct 29, 2024 10:03 UTC (Tue) by kleptog (subscriber, #1183) [Link]

> IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training.

That might work for Common Law systems like the US, but in Civil Law systems (most of Europe at least) the courts generally can't make these kinds of major policy changes and will punt the problem to the legislature to sort out. The EU AI Act 2024 covers all the stuff that everyone could agree on already (e.g. training for non-commercial use is fine). The question of what to do about training data for commercial use is subject of much political debate. The EU executive/legislature is currently in a look and see mode to see if businesses can work out an agreement amongst themselves and are only likely to step in if it looks like someone is using their market power illegitimately, or we start seeing serious negative public policy impacts.

Why not two licences?

Posted Oct 28, 2024 13:45 UTC (Mon) by bud (subscriber, #5327) [Link]

This is exactly what I thought. The most important thing seems to be clarity/transparency. And a binary state (open source or not) may not be expressive enough. An interesting analogy may be the CC licenses. And they even use icons to create transparency at first sight.

Why not two licences?

Posted Oct 26, 2024 21:26 UTC (Sat) by tzafrir (subscriber, #11501) [Link]

The Debian ML policy (linked above) uses the name "ToxicCandy" for what you name here "Pragmatic".

"If you're explaining, you're losing."

Posted Oct 27, 2024 2:50 UTC (Sun) by NightMonkey (subscriber, #23051) [Link]

"If you're explaining, you're losing."

FYI, that "old political adage" was from U.S. President Ronald Reagan. Seems odd to see that used here.

Training data, weights and model

Posted Oct 29, 2024 15:27 UTC (Tue) by paulj (subscriber, #341) [Link] (3 responses)

To start off with, I think data-sets for training are obviously incredibly important. I think having open data sets available for training fully-reproducible and open "AI" models will be incredibly important to society, sooner rather than later. I think it will be so important that ultimately societies will do one or both of:

1. Provide public funding for projects to gather, collate, store and make available data for public use (in various arenas);
2. Regulate the use of large-scale corporate data-sets for training, potentially making provisions for fair-use access under reasonable terms

That said, I have to take issue with a common theme in some of the objections, which suggest that a piece of software cannot be considered open if some significant data inputs to that software are not open, e.g. as per Tom Callaway's suggestion:

> "If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts."

Or the suggestion that ANN software is not open, even if different sets of weights are available, if input data is not open, e.g.:

> For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system.

The former suggestion, that at least all significant possible input data to some software must be open, for the software to be considered open, is not something the Free Software world has stood fast on before. We did not require that every possible C programme must be open for GCC to be considered open. There are a number of Free Software game engines, for which game-data packs exist that are proprietary - e.g., often the original game that spawned the game engine, along with other proprietary games using the same engine. We don't consider a GPL Quake or Descent game engine to not be Free Software because there are game packs that are non-free, do we?

The latter suggestion, that an ANN is not trainable without the original input data, is technically wrong. The ANN is eminently trainable. At worst, you start with random weights. If you've been given a set of weights from a previous training run, you've actually got a _lossily compressed form_ of the original data-set, which will be a much, much better starting point than otherwise (given you appear to be interested in the original data set).

We need to be clear there are a number of different components here:

1. The ANN software
2. The model specification (if not hard-coded in 1)
3a. The model state (the parameters and activation states)
3b. The input training data that created 3a

3a is a lossily compressed form of 3b.

1 is useful on its own.
2 is useful on its own (either described in some DSL to be used by some instance of 1, or described in literature)
1 + 2 are very useful.

1 + 2 + 3a are very very useful, and you can add your own input data set and train further from there. I.e., you are in a much better position starting your training from here, than from 3b, given you are demanding 3b to be open to you. 3a saves you a *lot* of compute!
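To put that split in code terms, here is a minimal sketch (PyTorch, with a toy model and an invented weights-file name):

    # 1 + 2 are the code and architecture below; loading a state dict is 3a;
    # 3b (the original training data) never has to appear.
    import torch

    class TinyModel(torch.nn.Module):             # 1 + 2: software plus model spec
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))

        def forward(self, x):
            return self.net(x)

    model = TinyModel()                           # without 3a: random weights
    # With 3a you would instead start from the published parameters, e.g.:
    # model.load_state_dict(torch.load("released-weights.pt"))  # hypothetical file

    x = torch.randn(32, 16)                       # your own data, not 3b
    y = torch.randint(0, 4, (32,))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()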

3b will ultimately be required in many context for social reasons, as I mention at the start, but I see that as a different issue and one that will be solved by public means.

I think it will be extremely hard to /require/ that 3b is made available, just for logistical reasons, given the vast volume of data - even if the input data is already public. And if the data is already public (e.g. web crawls), you can assemble it yourself. Obviously 1 and 2 can and should be publishable under Free Software licences, without demanding that all cases of 3b must be made available under a Free Software licence.

So it comes down to whether someone should be able to publish a 3a, while not publishing 3b, and still be able to claim some "Open Source" label. If 3b is merely collated from public data, and the methodology to replicate that collation is obvious (web scraping, say) or otherwise described, perhaps there should be some accommodation that allows such 3a data-sets to be distributed along with Free Software by Free Software entities?

For other cases, where 3a is a distillation of some other data-set that cannot be replicated from public data, should this be treated as code or data? Data is what it is, really. What if 3a is a distillation of personal data that could not otherwise be distributed under other laws (e.g. GDPR)? Should we say that no Free Software entity should be able to distribute a useful data-set that could be used to run some Free Software LLM and do useful things for users?

What if the 3a lossily-compressed data is the /only/ way that the 3b data-set could legally be distributed anywhere in the world? E.g., 3b is personal medical data, and 3a is the model state of an ANN that uses homomorphic encryption precisely to ensure that the very sensitive data of 3b can still be distilled to a compressed data set (3a) that can then be used by ANNs around the world? Is Free Software to be cut off from such advances forever?

Training data, weights and model

Posted Oct 29, 2024 15:34 UTC (Tue) by paulj (subscriber, #341) [Link]

Wish I could edit. Strike the last question, if I can query it I can (slowly) access the data obviously. ;)

I was trying to think of some scenario where the training data would be too sensitive to distribute, but then... query access has same sensitivity, so.. unclear if that scenario would ever plausibly interact with Free Software issues. I can't think of a good example anyway.

Training data, weights and model

Posted Oct 30, 2024 12:45 UTC (Wed) by kleptog (subscriber, #1183) [Link] (1 responses)

I think that for most people, 3a (the model state) is the "preferred form of modification" anyway. Having all the training data is a nice idea, but useless in practice because you can't meaningfully review all the data anyway. And just retraining with the same data doesn't give you anything new. Just about anything you might want to do you can do with just the weights: fine tuning for the results you want.

The case for training on private datasets (like medical data) is a different issue. Obviously you can only train an AI to detect lung cancer if you have access to datasets with information about people who may have lung cancer. Ideally you'd like such datasets to be available on a Reasonable and Non-Discriminatory basis. This sort of access is already available for other purposes, I don't see how AI requires anything new here.

(Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)

Training data, weights and model

Posted Oct 30, 2024 12:54 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

> (Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)

Asimov had "robopsychologists" doing things just like that.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds