LWN: Comments on "OSI readies controversial Open AI definition" https://lwn.net/Articles/995159/ This is a special feed containing comments posted to the individual LWN article titled "OSI readies controversial Open AI definition". en-us Wed, 24 Sep 2025 22:01:56 +0000 Wed, 24 Sep 2025 22:01:56 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Debian policy https://lwn.net/Articles/996480/ https://lwn.net/Articles/996480/ danpb <div class="FormattedComment"> The document in that repo has this text in at the top of it "Disclaimer: This draft is UNOFFICIAL and does not represent the point of view of the Debian Project.". So it doesn't actually seem to be an actual Debian policy, unless this text has been moved somewhere else and this repo with the disclaimer is an outdated copy ?<br> <p> None the less, I do like to see documents like this. The industry is awash with opinions/policies that are almost entirely driven by corporate vested interests whom want to push &amp; sell use of AI everywhere. There is not nearly enough analysis &amp; input from groups who are looking at it from an objective &amp; independent POV, without their judgement being clouded by a need to sell AI.<br> <p> </div> Thu, 31 Oct 2024 13:21:11 +0000 Training data, weights and model https://lwn.net/Articles/996284/ https://lwn.net/Articles/996284/ mathstuf <div class="FormattedComment"> <span class="QuotedText">&gt; (Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)</span><br> <p> Asimov had "robopsychologists" doing things just like that.<br> </div> Wed, 30 Oct 2024 12:54:49 +0000 Training data, weights and model https://lwn.net/Articles/996282/ https://lwn.net/Articles/996282/ kleptog <div class="FormattedComment"> I think that for most people, 3a (the model state) is the "preferred form of modification" anyway. Having all the training data is a nice idea, but useless in practice because you can't meaningfully review all the data anyway. And just retraining with the same data doesn't give you anything new. Just about anything you might want to do you can do with just the weights: fine tuning for the results you want.<br> <p> The case for training on private datasets (like medical data) is a different issue. Obviously you can only train an AI to detect lung cancer if you have access to datasets with information about people who may have lung cancer. Ideally you'd like such datasets to be available on a Reasonable and Non-Discriminatory basis. This sort of access is already available for other purposes, I don't see how AI requires anything new here.<br> <p> (Sometimes I wonder if in the future we'll have AI psychologists that specialise in diagnosing issues in AIs and recommend treatments i.e. new fine-tuning to reduce issues. Maybe the AI psychologist will also have an AI to help them.)<br> </div> Wed, 30 Oct 2024 12:45:40 +0000 Cost of training a model https://lwn.net/Articles/996217/ https://lwn.net/Articles/996217/ paulj <div class="FormattedComment"> The human brain is capable of some pretty sophisticated training with just thousands of Joules, and really profound learning with MWs, so.... the day will come.<br> <p> (We may run into some profound ethical issues with ML before that day, but that's another topic).<br> </div> Tue, 29 Oct 2024 17:16:59 +0000 ChatGPT says... 
https://lwn.net/Articles/996215/ https://lwn.net/Articles/996215/ paulj <div class="FormattedComment"> <span class="QuotedText">&gt; Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.</span><br> <p> This isn't quite true, depending on what you mean.<br> <p> a) You can start from scratch, just start with random parameters again, and train with your own data. I.e., just don't use the parameter set.<br> <p> b) If there are specific things in the trained parameters that you don't like/want, you can /keep/ training the model but apply your own objective function to penalise the thing you don't like. Essentially "correcting" whatever it is you don't like.<br> <p> The challenge with b may be that you don't know what is in there that you don't want. However, you will probably have the same problem if you have the full input data. The model is basically giving you the training data in a digested form that at least lets you query it, though.<br> <p> That "correcting" process is something all the big players with public AI tools surely must have /big/ teams of people working on: just filtering out whatever "fruity" content happens to be discovered in public LLMs and generative AIs. Teams manually looking for "dodgy" content, to then come up with the widest set of prompts that spit out that dodgy content (in all its combinations), to then pass on to other teams to create refinement training cases and test-cases. So that we get the safely sanitised LLMs / models suitable for public use.<br> </div> Tue, 29 Oct 2024 17:13:57 +0000 Training data, weights and model https://lwn.net/Articles/996210/ https://lwn.net/Articles/996210/ paulj <div class="FormattedComment"> Wish I could edit. Strike the last question: if I can query it I can (slowly) access the data obviously. ;)<br> <p> I was trying to think of some scenario where the training data would be too sensitive to distribute, but then... query access has the same sensitivity, so.. unclear if that scenario would ever plausibly interact with Free Software issues. I can't think of a good example anyway.<br> </div> Tue, 29 Oct 2024 15:34:28 +0000 Training data, weights and model https://lwn.net/Articles/996207/ https://lwn.net/Articles/996207/ paulj <div class="FormattedComment"> To start off with, I think data-sets for training are obviously incredibly important. I think having open data sets available for training fully-reproducible and open "AI" models will be incredibly important to society, sooner rather than later. I think it will be so important that ultimately societies will do 1 or both of: <br> <p> 1. Provide public funding for projects to gather, collate, store and make available data for public use (in various arenas); <br> 2. Regulate the use of large-scale corporate data-sets for training, potentially making provisions for fair-use access under reasonable terms<br> <p> That said, I have to take issue with a common theme in some of the objections that suggest that a piece of software can not be considered open if some significant data inputs to that software are not open, e.g.
as per Tom Callaway's suggestion:<br> <p> <span class="QuotedText">&gt; "If we let the Open Source AI definition contain a loophole that makes data optional, we devalue the meaning of "open source" in all other contexts."</span><br> <p> Or the suggestion that ANN software is not open, even if different sets of weights are available, if input data is not open, e.g.:<br> <p> <span class="QuotedText">&gt; For example, julia ferraioli wrote that without including data, the only things that the OSAID guarantees are the ability to use and distribute an AI system. </span><br> <p> The former suggestion, that at least all significant possible input data to some software must be open, for the software to be considered open, is not something the Free Software world has stood fast on before. We did not require that every possible C programme must be open for GCC to be considered open. There are a number of Free Software game engines, for which game-data packs exist that are proprietary - e.g., often the original game that spawned the game engine, along with other proprietary games using the same engine. We don't consider a GPL Quake or Descent game engine to not be Free Software because there are game packs that are non-free, do we?<br> <p> The latter suggestion, that an ANN is not trainable without the original input data, is technically wrong. The ANN is eminently trainable. At worst, you start with random weights. If you've been given a set of weights from a previous training run, you've actually got a _lossily compressed form_ of the original data-set, that will be a much much better starting point than otherwise (given you appear to be interested in the original data set).<br> <p> We need to be clear there are a number of different components here:<br> <p> 1. The ANN software<br> 2. The model specification (if not hard-coded in 1)<br> 3a. The model state (the parameters and activation states)<br> 3b. The input training data that created 3a<br> <p> 3a is a lossily compressed form of 3b. <br> <p> 1 is useful on its own.<br> 2 is useful on its own (either described in some DSL to be used by some instance of 1, or described in literature)<br> 1 + 2 are very useful.<br> <p> 1 + 2 + 3a are very very useful, and you can add your own input data set and train further from there. I.e., you are in a much better position starting your training from here, than from 3b, given you are demanding 3b to be open to you. 3a saves you a *lot* of compute!<br> <p> 3b will ultimately be required in many contexts for social reasons, as I mention at the start, but I see that as a different issue and one that will be solved by public means. <br> <p> I think it will be extremely hard to /require/ that 3b is made available, just for logistical reasons, given the vast volume of data - even if the input data is already public. And if the data is already public (e.g. web crawls), you can assemble it yourself. Obviously 1 and 2 can and should be publishable under Free Software licences, without demanding that all cases of 3b must be made available under a Free Software licence. <br> <p> So it comes down to whether someone should be able to publish a 3a, while not publishing 3b, and still be able to claim some "Open Source" label. If 3b is merely collated from public data, and the methodology to replicate that collation is obvious (web scraping, say) or otherwise described, perhaps there should be some accommodation that allows such 3a data-sets to be distributed along with Free Software by Free Software entities?
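<br> <p> To make the 1 / 2 / 3a / 3b split concrete, here is a minimal, purely illustrative sketch - assuming PyTorch, with a toy TinyNet architecture, made-up file names, and random tensors standing in for real data, none of it taken from any actual release - of loading a published 3a and training further on your own data, versus starting from random parameters:<br>
<pre>
import torch
import torch.nn as nn

# 1. The ANN software: PyTorch plus this script.
# 2. The model specification: the architecture defined below.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, x):
        return self.layers(x)

model = TinyNet()

# 3a. The model state: without it you start from random weights; with it you
# load the published parameters and continue from there.
# model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file

# 3b (or your own substitute data): anything you can feed the optimiser.
# Random tensors stand in for a real data set in this sketch.
inputs, targets = torch.randn(64, 16), torch.randint(0, 2, (64,))

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):  # "train further from there"
    optimiser.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimiser.step()

torch.save(model.state_dict(), "my_weights.pt")  # your own 3a
</pre>
The published 3a is simply a better starting point; the training loop runs the same whether or not you ever see 3b.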
<br> <p> For other cases, where 3a is a distillation of some other data-set that cannot be replicated from public data, should this be treated as code or data? Data is what it is really. What if 3a is a distillation of personal data that could not be distributed otherwise under other laws (e.g. GDPR)? Should we say that no Free Software entity should be able to distribute a useful data-set, that could be used to run some Free Software LLM and do useful things for users?<br> <p> What if the 3a lossy-compressed data is the /only/ way that the 3b data-set could be legally distributed anywhere in the world? E.g., 3b is personal medical data, and 3a is the model state of an ANN that uses homomorphic-encryption precisely to ensure that the very sensitive data of 3b can still be distilled to a compressed data set of 3a that can then be used by ANNs around the world? Is Free Software to be cut off from such advances forever?<br> </div> Tue, 29 Oct 2024 15:27:43 +0000 Civil vs Common law https://lwn.net/Articles/996170/ https://lwn.net/Articles/996170/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training.</span><br> <p> That might work for Common Law systems like the US, but in Civil Law systems (most of Europe at least) the courts generally can't make these kinds of major policy changes and will punt the problem to the legislature to sort out. The EU AI Act 2024 covers all the stuff that everyone could agree on already (e.g. training for non-commercial use is fine). The question of what to do about training data for commercial use is the subject of much political debate. The EU executive/legislature is currently in a look and see mode to see if businesses can work out an agreement amongst themselves and are only likely to step in if it looks like someone is using their market power illegitimately, or we start seeing serious negative public policy impacts.<br> </div> Tue, 29 Oct 2024 10:03:25 +0000 Why not two licences? https://lwn.net/Articles/996100/ https://lwn.net/Articles/996100/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous.
It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.</span><br> <span class="QuotedText">&gt;</span><br> <span class="QuotedText">&gt; I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.</span><br> <p> I generally agree that the anti-AI artists are on the wrong side here (or at least, they have not demonstrated to my satisfaction that their absolutist position holds water in every possible circumstance).<br> <p> But I think it would be helpful to take a step back and avoid vilifying a group of people whose industry is currently undergoing a serious level of tech-induced upheaval. People are scared and worried for their economic future, and I think they do have a right to be upset and to advocate for their self-interest, even as I disagree with them.<br> <p> IMHO the most likely outcome is that the public gets bored of this debate long before it progresses to the point of actually affecting policy decisions of major western governments, at which point it will be up to the courts to clarify how and whether the existing "derivative work" right applies to AI training. It is most plausible to me that that will end up being a fact-specific inquiry which requires individualized analysis of both the model and the specific training work whose copyright is allegedly infringed.* Frankly, I'm not entirely convinced that any version of open source AI would be viable in such a regime - but the point is that artists posting angry messages on Xitter are probably not going to change that outcome one way or the other.<br> <p> Disclaimer: I work for Google, which does some GenAI stuff that I'm not involved with. Opinions are my own, speculation about the future is quite possibly wrong.<br> <p> * For example, The New York Times provided evidence (in their lawsuit against OpenAI) that ChatGPT can sometimes produce verbatim or near-verbatim copies of their columns. I could imagine courts wanting to see evidence similar to that before entering a finding of infringement.<br> </div> Mon, 28 Oct 2024 20:00:24 +0000 Why not two licences? https://lwn.net/Articles/996010/ https://lwn.net/Articles/996010/ bud <div class="FormattedComment"> This is exactly what I thought. The most important thing seems to be clarity/transparency. And a binary state (open source or not) may not be expressive enough. An interesting analogy may be the CC licenses. And they even use icons to create transparency at first sight..<br> </div> Mon, 28 Oct 2024 13:45:36 +0000 Cost of training a model https://lwn.net/Articles/995997/ https://lwn.net/Articles/995997/ farnz <blockquote> Of course, then there's still the practical matter of actually being able to run the (full) training.
This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially. </blockquote> <p>I'm leery of arguments based on the cost of training a model, because they generalise to "no software is open source unless you're willing to donate a working system to anyone who wants to build it". Instead, I'd prefer to see arguments around both the reproducibility of training (since if training is not reproducible, a skilled hacker is unable to test whether their optimized training process produces the same results in under 1% of the cost of the original) and the incompleteness of training data. <p>Put another way: if it currently costs $100m in cloud GPU costs to train a model, but a clever person is able to work out how to do the same training on a $500 laptop for $100 in electricity, I want that clever person to be able to show that their technique works exactly as well as the original, and not to be held back by the lack of open models and training sets they can use to show off their technique. Mon, 28 Oct 2024 08:54:10 +0000 Re: Response to criticisms https://lwn.net/Articles/995994/ https://lwn.net/Articles/995994/ mjg59 <div class="FormattedComment"> It's code that you have no permission to modify or redistribute, which is the generally accepted definition of proprietary software.<br> </div> Mon, 28 Oct 2024 07:45:31 +0000 Re: Response to criticisms https://lwn.net/Articles/995991/ https://lwn.net/Articles/995991/ WolfWings <div class="FormattedComment"> ...proprietary javascript? Sorry but how does that even occur? It's plain text, downloaded to your browser, at worst a little obfuscated and often minified but still readable generally with basic filters.<br> <p> Or do you just mean it required javascript enabled at all to function and post comments?<br> </div> Mon, 28 Oct 2024 07:23:29 +0000 Why not two licences? https://lwn.net/Articles/995972/ https://lwn.net/Articles/995972/ neggles <div class="FormattedComment"> The AI/ML industry already has names for these two types of model: Open Weights for the former, and Open Data for the latter. Both are subtypes of "Open Source", just with differences in which components are openly available.<br> <p> Contrary to popular belief, merely having access to the dataset doesn't actually give you the ability to reproduce a model.
For example, the dataset for Stable Diffusion 1.5 is known and publicly available, the preprocessing steps taken are known, and the training process is known, but any attempt to exactly recreate it is unlikely to succeed due to changes in hardware, specific step counts, underlying library bug fixes and behaviour changes, specific corrupted images, etc.<br> <p> Meta and OpenAI and MidJourney and many others have made newer, better variants of the same concept/using similar architectures, but nobody's made a reproduction, and not for lack of trying.<br> <p> That said, why would you want to?<br> <p> If you want to significantly improve the model in a general sense you'll need to adjust the architecture, which necessitates retraining it from scratch anyway - and you'll likely want to replace the dataset with one more suited to your desired use case.<br> <p> If you just want to make it good at generating some specific type of image, finetuning on top of the existing base model weights requires orders of magnitude less compute and a completely different domain-specific dataset anyway.<br> <p> This obsession with making AI/ML model creators publish datasets honestly feels very disingenuous. It feels like the subset of society which has decided that ML model training constitutes copyright infringement are attempting to manipulate supporters of open source software - many of whom aggressively disagree with the concept of copyright in the first place - into indirectly helping them bring expensive copyright claims against AI companies, regardless of whether those claims hold any merit.<br> <p> I'm not making any judgement as to whether those claims hold merit - it's irrelevant to my point - but I *will* point out that this particular group of people have made their judgement long before the courts have made theirs, and in general leave no quarter for discussion or room for reassessment of that view - despite often having little to no domain-specific knowledge or understanding of how this technology works.<br> <p> There is no real benefit to requiring open datasets, and a lot of massive downsides; it would effectively kill open source AI because no company would risk the potential legal exposure from doing such a thing - even if the data *doesn't* have copyright issues, proving that in court could be rather difficult and time-consuming. Copyright trolling is just as real of a thing as patent trolling - just ask anyone who's been subject to frivolous DMCA takedown requests.<br> </div> Sun, 27 Oct 2024 23:12:00 +0000 NN training is deterministic (enough) https://lwn.net/Articles/995957/ https://lwn.net/Articles/995957/ ghodgkins <div class="FormattedComment"> <span class="QuotedText">&gt; a non-deterministic system (neural nets)</span><br> <p> In my understanding, training a neural net is deterministic in the sense that matters for reproducibility. If you train the same model architecture in the same environment with the same data, you'll get the same final weights. This is true even if you draw from random distributions during training, as long as you choose the same seed(s) for the PRNG.<br> <p> The input-output mapping of the trained model is usually also deterministic, except for some special-purpose stochastic models.
Even those you may be able to make reproducible by fixing the PRNG seed, as above.<br> <p> <span class="QuotedText">&gt; The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.</span><br> <br> It is not true that the weights "owe nothing" to the training data. As mentioned above, for a fixed PRNG seed, they are in fact a very complex closed-form function of that data - certainly "dependent" in the probability sense. <br> <p> <span class="QuotedText">&gt; By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society.</span><br> <p> I think it's reasonable to have different expectations for software tools and the people that use them, and honestly kind of absurd not to. <br> <p> <span class="QuotedText">&gt; The very idea is absurd.</span><br> <p> For humans, certainly. One key difference between humans and NNs here is that NNs have a thing called "training" with well-defined inputs and output, in a consistent and well-defined format, which makes enumerating the training data entirely feasible.<br> <p> <span class="QuotedText">&gt; We cannot possibly know everything that is going on in your head</span><br> <p> But we can know everything that is going on inside a NN, although we may not be able to interpret it with respect to the inputs and outputs.<br> </div> Sun, 27 Oct 2024 16:10:10 +0000 Why not two licences? https://lwn.net/Articles/995948/ https://lwn.net/Articles/995948/ burki99 <div class="FormattedComment"> How about Open Code AI vs Open Data AI. That would keep the ambiguous term Source out of labels.<br> </div> Sun, 27 Oct 2024 08:00:20 +0000 "If you're explaining, you're losing." https://lwn.net/Articles/995939/ https://lwn.net/Articles/995939/ NightMonkey <div class="FormattedComment"> "If you're explaining, you're losing."<br> <p> FYI, that "old political adage" was from U.S. President Ronald Reagan. Seems odd to see that used here.<br> </div> Sun, 27 Oct 2024 02:50:06 +0000 The preferred form for modifying ML systems' https://lwn.net/Articles/995925/ https://lwn.net/Articles/995925/ somlo <div class="FormattedComment"> Speaking of the CMU "endorsement", note that the largest percentage of signatories seem to be librarians, followed by business school faculty (Tepper), and a single member of the School of Computer Science faculty (specializing in "Ethics and Computational Technologies").<br> <p> Conspicuously absent are any of the AI/ML "heavy hitters", so I'm not really sure what conclusion one should draw from that...<br> </div> Sat, 26 Oct 2024 22:48:27 +0000 Why not two licences? https://lwn.net/Articles/995922/ https://lwn.net/Articles/995922/ tzafrir <div class="FormattedComment"> The Debian ML policy (linked above) uses the name "ToxicCandy" for what you name here "Pragmatic".<br> </div> Sat, 26 Oct 2024 21:26:42 +0000 Why not two licences? https://lwn.net/Articles/995923/ https://lwn.net/Articles/995923/ josh <div class="FormattedComment"> Sounds great. For accuracy and consistency with widespread industry understanding of the term "Open Source", let's call the first one "proprietary AI", and the second one "Open Source AI".<br> </div> Sat, 26 Oct 2024 21:26:29 +0000 Everything is not a nail... 
https://lwn.net/Articles/995920/ https://lwn.net/Articles/995920/ khim <font class="QuotedText">&gt; When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever.</font> <p>That's an extremely rare occurrence.</p> <font class="QuotedText">&gt; When a computer bayesian machine messes up, it's too expensive to retrain.</font> <p>That could be true for LLMs, today, but many smaller-scale AI models are retrained from scratch routinely.</p> <p>Pretty soon there will be more LLMs retrained from scratch than the 100 or 200 billion people that ever lived on this planet.</p> <p>Would that mean that LLMs would achieve parity with humans, when that happens? Hard to predict… my gut feeling is that no, that wouldn't happen – but not because encounters with lions or wildebeests made humans that much better, but because nature invented a lot of tricks over billions of years that we don't know how to replicate… yet.</p> <font class="QuotedText">&gt; I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank</font> <p>It could. In fact, today AI already makes fewer mistakes than humans on such tasks.</p> <p>Yes, AI makes <b>different</b> mistakes, but its average precision is better.</p> <font class="QuotedText">&gt; I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...</font> <p>We just tend to dismiss the issues with “optical illusions” and exaggerate AI mistakes. To feel better about ourselves, maybe?</p> <p>When you are shown a picture where the exact same color looks “white” in one place and “black” in another place, you don't even need AI to reveal the difference, a simple sheet of paper is enough – but that's not perceived as a human brain deficiency because we all are built the same and all fall prey to <b>the exact same illusion</b>.</p> <p>AIs are built differently and they are fooled by <b>different</b> illusions than what humans misperceive – and that difference is, in our arrogance, treated as “human superiority”.</p> <p>It's only when AI starts doing things so much better than humans, beats humans so supremely decisively, when there is no doubt that AI is “head and shoulders” above humans for this or that task… only then do humans… redefine this particular task as “not really important”. Like it happened with go and chess, among other things.</p> Sat, 26 Oct 2024 17:31:59 +0000 Why not two licences? https://lwn.net/Articles/995919/ https://lwn.net/Articles/995919/ jzb <p>License discussions are no place for reasonable suggestions. :)</p> <p>In seriousness, though - that's not a terrible idea. I'm not sure I'd go with "Pragmatic" and "Strict" as the names, but something that indicates "everything but the data is available" and "this is the gold standard of openness, including open data" with the naming reflecting that <em>might</em> satisfy some of the folks unhappy with tagging anything less than fully open as "open source".</p> Sat, 26 Oct 2024 15:55:23 +0000 Why not two licences?
https://lwn.net/Articles/995916/ https://lwn.net/Articles/995916/ burki99 <div class="FormattedComment"> It seems rather obvious: There are two views on this topic, both reasonable, but fundamentally incompatible.<br> <p> Therefore I would suggest coming up with <br> a) Pragmatic Open Source AI<br> b) Strict Open Source AI<br> <p> a) would be the one suggested by OSI, b) would additionally require the training data to be available (which implies an open content license that allows redistribution of the training data).<br> <p> <p> </div> Sat, 26 Oct 2024 15:26:19 +0000 Re: Response to criticisms https://lwn.net/Articles/995912/ https://lwn.net/Articles/995912/ IanKelling <div class="FormattedComment"> A discussion forum opened many months after they already published a draft definition. For a long time, it was just a web comment system requiring proprietary javascript, which is still there. And yes, I made sure OSI knew from the beginning that they published a draft definition which suggested only one way to comment publicly, which excluded people who strongly believed in principles which were part of the definition.<br> <p> The CMU statement which calls it a "participatory co-design process" which was "inclusive" makes me roll my eyes. It was drafted privately by the OSI board in some undisclosed way.<br> </div> Sat, 26 Oct 2024 14:39:47 +0000 Everything is not a nail... https://lwn.net/Articles/995910/ https://lwn.net/Articles/995910/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; Today those qualia are small, so it is easy to dismiss their subjectivity. Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.</span><br> <p> And herein lies the entire problem with AI, imho.<br> <p> The qualia (whatever that means) of the human brain is much LESS than we give it credit for. But it is QUALITATIVELY DIFFERENT from what we call AI and LLM and that junk. AI and LLMs are bayesian inference machines. Human brains have external forces imposing a (not necessarily correct) definition of "right" and "wrong".<br> <p> When a human brain messes up, its bayesian machine is likely to get eaten by a lion, or gored by a wildebeest, or whatever. When a computer bayesian machine messes up, it's too expensive to retrain.<br> <p> I've mentioned the crab before, but if a 6502-powered robot crab can cope easily with the complexities of the surf zone, why can't a million-pound AI tell the difference between a car and a tank ... (I suspect it's because the human brain, and maybe that crab?, had a whole bunch of specialist units which probably was much larger than the central bayesian machine - we're going down the wrong route ... again ...)<br> <p> Cheers,<br> Wol<br> </div> Sat, 26 Oct 2024 13:11:20 +0000 Re: Response to criticisms https://lwn.net/Articles/995902/ https://lwn.net/Articles/995902/ rettichschnidi <div class="FormattedComment"> <span class="QuotedText">&gt; and post deletion</span><br> <p> Can confirm. While my posts (<a href="https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20OSI%20Screenshot.png">https://files.reto-schneider.ch/2024-10-22%2021-15-xx%20O...</a>) might have been below their quality expectation, having them marked as "Pending" (for review) and then silently dropped (deleted) after ~2 days seems not ok to me. At least some kind of feedback/notification would have been appreciated.<br> </div> Sat, 26 Oct 2024 07:53:10 +0000 ChatGPT says...
https://lwn.net/Articles/995901/ https://lwn.net/Articles/995901/ mirabilos <div class="FormattedComment"> As long as they participate in Reproducible Builds and an independent party has reproduced the exact same model from the same inputs once, it’s at least not totally unpalatable if not everyone has the pockets to regenerate it from scratch in practice.<br> </div> Sat, 26 Oct 2024 07:02:28 +0000 Re: Response to criticisms https://lwn.net/Articles/995898/ https://lwn.net/Articles/995898/ mirabilos <div class="FormattedComment"> The OSI has actively ignored (including at least zack, who even was an OSI board director) or even silenced some criticism, including discussion forum bans and post deletion.<br> <p> It certainly has lost any and all credibility with these actions and their whole procedere, which was designed to cater to corporate interests over open source principles right from the beginning (leading questions, lack of survey response possibilities that disagree, etc).<br> <p> Many people certainly don’t want those theft machines blessed by something they identify with.<br> <p> (I will, for one, *not* enable reply notifications, nor engage further here. I’ve written up on my webservers and discussed on the OSI MLs and in Fedi all that needs to be said.)<br> </div> Sat, 26 Oct 2024 06:59:03 +0000 Debian policy https://lwn.net/Articles/995899/ https://lwn.net/Articles/995899/ pabs <div class="FormattedComment"> I prefer the Debian policy about libre AI:<br> <p> <a href="https://salsa.debian.org/deeplearning-team/ml-policy">https://salsa.debian.org/deeplearning-team/ml-policy</a><br> </div> Sat, 26 Oct 2024 06:57:58 +0000 Everything is not a nail... https://lwn.net/Articles/995895/ https://lwn.net/Articles/995895/ shironeko <div class="FormattedComment"> I hope you are not letting others modify and redistribute your brain. By any definition, your mind is proprietary to you.<br> </div> Sat, 26 Oct 2024 05:12:17 +0000 ChatGPT says... https://lwn.net/Articles/995892/ https://lwn.net/Articles/995892/ jmspeex <div class="FormattedComment"> Absolutely. All of the ML models that ended up in Opus 1.5 (https://opus-codec.org/demo/opus-1.5/) can be trained on a single GPU. And there's still people using the original RNNoise I trained in 2017 on a laptop *CPU*. There's tons of useful things you can do with DNNs that are 1M weights or smaller. <br> </div> Sat, 26 Oct 2024 04:36:24 +0000 Everything is not a nail... https://lwn.net/Articles/995883/ https://lwn.net/Articles/995883/ chuckwolber <div class="FormattedComment"> It was relatively easy to create a definition of openness that applies cleanly to deterministic systems like compiled software. It seems like we are trying to apply the same definition to a non-deterministic system (neural nets) and we keep failing to notice the real problem.<br> <p> If I listen to music that ultimately influences my style as a musician, I am no less free when I play in my heavily influenced style and the music that influenced me is no less encumbered by copyright. The trained neural net (biological or digital) is its own qualia, which exists independent of the influences that trained it and it owes nothing to those influences.<br> <p> Today those qualia are small, so it is easy to dismiss their subjectivity.
Scale that qualia up closer to the complexity of a human mind and the problem should be more clear.<br> <p> By way of analogy - demanding a full accounting of the training material to satisfy an openness requirement is like demanding that you provide a full accounting of everything you were exposed to since birth before we allow you to operate freely in open society. The very idea is absurd.<br> <p> We invented a shortcut to that problem a long time ago - it is called a "social contract". We cannot possibly know everything that is going on in your head, but we can set forth expectations and apply relevant penalties.<br> <p> I propose we rethink the OSAI in the same way.<br> </div> Sat, 26 Oct 2024 01:40:50 +0000 ChatGPT says... https://lwn.net/Articles/995872/ https://lwn.net/Articles/995872/ rettichschnidi <div class="FormattedComment"> <span class="QuotedText">&gt; Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.</span><br> <p> True for LLMs, at least as of today. But there are many (smaller) ML/"AI" applications that can be trained by SME and even enthusiasts already today.<br> </div> Sat, 26 Oct 2024 00:01:51 +0000 ChatGPT says... https://lwn.net/Articles/995860/ https://lwn.net/Articles/995860/ leromarinvit <div class="FormattedComment"> Given the topic, asking an AI for its thoughts on the matter seemed apt. So I prompted ChatGPT (which is, of course, about as far removed from "Open Source AI" as seems possible):<br> <p> <span class="QuotedText">&gt; Given the following article, write a comment about whether or not training data should be included in the OSAID. Explain why, either way.</span><br> <span class="QuotedText">&gt; [...]</span><br> <p> It seems it wants to see training data included in the definition:<br> <p> <span class="QuotedText">&gt; Whether training data should be included in the Open Source AI Definition (OSAID) is a crucial issue, and I believe it should be required, not optional. Leaving training data out of the definition weakens the foundational principles of open source, particularly the ability to fully understand, modify, and build upon AI systems.</span><br> <p> <span class="QuotedText">&gt; Training data is central to how AI models are developed, and it plays a significant role in shaping the behavior and biases of the system. Without access to the data—only providing "detailed information"—users are restricted in their ability to recreate or meaningfully modify the system. This exclusion limits transparency and undermines the goal of fostering collaboration and innovation, which are core values of the open-source community.</span><br> <p> <span class="QuotedText">&gt; While there are valid concerns about the ethical and legal challenges of releasing training data, such as privacy issues and proprietary data restrictions, these challenges should be addressed case-by-case rather than lowering the bar for what qualifies as open source. 
By allowing AI systems to be considered "open" without sharing the training data, the OSAID risks diluting the meaning of open source and setting a precedent where essential components of AI systems can remain closed while still benefiting from the open-source label.</span><br> <p> <span class="QuotedText">&gt; In short, if the goal is to preserve the spirit of open source in AI, training data should not be optional, as omitting it compromises transparency, accountability, and the ability to fully exercise the four freedoms that open source aims to protect.</span><br> <p> ;-)<br> <p> While I happen to agree with the conclusion, the reasoning seems more like a summary of the article to me, instead of actual independent thought (well, duh - it's an LLM, it can't think). In particular, what stands out to me is that with source code, I can not only add new features or modify existing ones, but also remove parts entirely. This seems impossible with the kind of iterative modification that seems to be the only kind taken into account here (to my admittedly very limited understanding of the matter).<br> <p> Or, more to the point:<br> <span class="QuotedText">&gt; training data is not part of the preferred form for making modifications to an existing AI system. The insights and correlations in that data have already been learned.</span><br> <p> Exactly, they "have been learned". I can't make it "unlearn" them without having the original training data to modify and run the full training again.<br> <p> Of course, then there's still the practical matter of actually being able to run the (full) training. This seems thoroughly out of reach for anyone without very deep pockets for now, especially for the kind of individual hackers that shaped Free Software and Open Source initially.<br> </div> Fri, 25 Oct 2024 22:31:24 +0000 modification https://lwn.net/Articles/995862/ https://lwn.net/Articles/995862/ shironeko <div class="FormattedComment"> Maybe the supporters of the definition could demonstrate practically modifying an ML model without using the original training data, and show that it is just as easy as with the original data and it does not limit what you can do with it (e.g. demonstrate it can unlearn any parts of the original data as if they were not used).<br> <p> Personally, it would seem that, unfortunately for some, Stallman was right yet again.<br> </div> Fri, 25 Oct 2024 22:17:17 +0000 The preferred form for modifying ML systems' https://lwn.net/Articles/995864/ https://lwn.net/Articles/995864/ ballombe <div class="FormattedComment"> <span class="QuotedText">&gt; Maffulli objected to the idea that the OSAID was weaker or making concessions, and said that the preferred form for modifying ML systems was what is in the OSAID: ""it's not me nor OSI board saying that, it's in the list of endorsers and in [Carnegie Mellon University's] comment"</span><br> <p> Where does the CMU comment say anything about 'the preferred form for modifying ML systems'?<br> </div> Fri, 25 Oct 2024 22:15:24 +0000