Back to basics: Are weights software at all?

Posted May 21, 2025 0:55 UTC (Wed) by pabs (subscriber, #43278)
In reply to: Back to basics: Are weights software at all? by NYKevin
Parent article: Debian AI General Resolution withdrawn

I think that we need to take a step back from the phrase "preferred form for modification" and think about what the Free Software movement is about. I see both the licensing and source aspects of the Free Software movement as aspiring to provide a high level of equality of access to a work between the original author and far downstream recipients. Obviously full and universal equality is impossible, because part of the work is only in the author's mind and not everyone can obtain and use computers, especially the amount of computing capacity needed for training many of the modern ML models. Clearly though, without the training data, there can be no equality of access to an ML model, even when enough compute is available.

PS: some stuff about source forms for non-code files:

https://wiki.debian.org/AutoGeneratedFiles


Back to basics: Are weights software at all?

Posted May 21, 2025 1:14 UTC (Wed) by interalia (subscriber, #26615) [Link] (1 responses)

Yes, I agree it'd be good to keep the intent of the rules in mind rather than their exact wording. My understanding is that "preferred form of modification" is there to prevent the release of obfuscated code as a letter-of-the-law compliance with source availability. I don't personally think that not releasing training data has that intent. Such a model is obviously less freely modifiable than if the data was available, but I don't feel it falls foul of the "preferred form" clause because it's not a deliberate compliance-but-not-really practice. Definitely a thought-provoking issue though, and I can see arguments both ways.

Back to basics: Are weights software at all?

Posted May 21, 2025 1:32 UTC (Wed) by pabs (subscriber, #43278) [Link]

In these situations, there mostly isn't an existing model license for the companies making them to comply with, so this is a different situation to what you are talking about with using obfuscation to circumvent the GPL.

Usually the main reason for not releasing the training data itself is that it isn't legally possible to redistribute it, or that it was illegally obtained in the first place (e.g. Facebook torrenting books).

Not releasing the provenance of the training data is usually either to cover up illegal activity, or otherwise an anti-competitive act to place a barrier in front of other organisations aiming to reproduce and improve on a model, or even just audit its training data for biases.

Back to basics: Are weights software at all?

Posted May 21, 2025 6:52 UTC (Wed) by Wol (subscriber, #4433) [Link] (16 responses)

> I think that we need to take a step back from the phrase "preferred form for modification" and think about what the Free Software movement is about.

Yup. What was that about MP3s needing to be accompanied by an Audacity project split into separate voices per instrument? Yes, that may be the "preferred form for modification", but what if that's never existed? If I type hex into an editor to create an executable, does that mean it can't be distributed under the GPL?

There's far too much emphasis on the recipient demanding what they think they're entitled to, when it should be on what the giver is freely giving. AI is, I think, simply throwing a great big spotlight on this basic problem.

The big thing about Free Software is not what you have, not what you give, but that you CAN SHARE EVERYTHING that you are given.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 7:05 UTC (Wed) by pabs (subscriber, #43278) [Link] (3 responses)

Strongly disagree, because that way lies obfuscated source code being OK, because it is what you are given.

I think Free Software is about providing a high level of equality of access to a work between both the original author and far downstream recipients, both legally and practically.

What is "source" in any given context is a *choice* the author makes about what level of access they want to pass on to their future self, and a separate choice about what to pass on to other people.

If their future self gets a better option than others do, then that clearly isn't Free Software. For example, keeping the non-obfuscated source private and distributing the obfuscated version to others.

If their future self deliberately gets a bad option just so that others also get that bad option, it's debatable whether that is Free Software or not. For example, throwing away the non-obfuscated source and only keeping the obfuscated version locally and in distributions to others.

If their future self deliberately gets a bad option for other reasons, then it completely depends on the situation.

So I think Free Software is about everything; what you make, what tools you use, what you keep, what you discard, what you give and what you receive. Everything has an impact on what future changes your future self and other people can make.

Back to basics: Are weights software at all?

Posted May 21, 2025 8:29 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

> Strongly disagree, because that way lies obfuscated source code being OK, because it is what you are given.

Strongly disagree, in that (if it's my own work) what I give you is down to me. End of. What you *want* might not exist.

I did, however, forget about the bit that you MUST offer to pass on EVERYTHING, if you pass on ANYTHING. Note the difference between "give", and "pass on".

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 9:38 UTC (Wed) by ballombe (subscriber, #9523) [Link]

> Strongly disagree, in that (if it's my own work) what I give you is down to me.
Sure but you cannot force me to accept it as 'Free software'.

Back to basics: Are weights software at all?

Posted May 21, 2025 10:19 UTC (Wed) by farnz (subscriber, #17727) [Link]

By your reasoning, though, I can call any software "free software", even when distributed as a binary for a given platform only, because that's what I've got, even if someone else (the original author) has a better form for modification available to them that they're keeping secret.

The whole point of this argument is to define the line between "software" and "free software per the DFSG" (or "free software per the FSF", or whoever); some freely redistributable things, where it's even legal to modify them, don't count as "free software per the DFSG", since the original author has chosen to not release enough of the work to cross that line. And that's OK; not everything that can be redistributed and modified must necessarily meet DFSG requirements.

Back to basics: Are weights software at all?

Posted May 21, 2025 9:08 UTC (Wed) by danielthompson (subscriber, #97243) [Link] (9 responses)

> The big thing about Free Software is not what you have, not what you give,
> but that you CAN SHARE EVERYTHING that you are given.

This summary overlooks the freedom to study and modify. I'm a programmer, thus the freedom to improve my craft and to apply my craft by modifying programs is as important to me as the freedom to share with others. I noted that "the four freedoms" are typically enumerated with the freedom to study and modify ahead of the freedom to share copies.

So if what I'm given is in a form that makes it more difficult for me to modify it than it is for the original author, then my freedom to modify is reduced... and I'd rather choose software that offers me that freedom.

To be honest, *today* the resources needed to train AI make these issues moot, since I could not store the training set and would not be inclined to pay for the compute time to train from it. These practical differences mean open-weight and open-training-set models offer me similar capabilities... today.

However, IMHO it would be wrong to take a short-term practical view here and ignore the freedoms that could be important to me in the future as I gain access to more compute and storage resources. Practical short-term convenience always risks luring us away from cooperating and building the software commons of the future. Without demand (or investment), open training sets won't happen.

Back to basics: Are weights software at all?

Posted May 21, 2025 10:42 UTC (Wed) by Wol (subscriber, #4433) [Link] (8 responses)

> So if what I'm given is in a form that makes it more difficult for me to modify it than it is for the original author, then my freedom to modify is reduced... and I'd rather choose software that offers me that freedom.

And that is your prerogative. Doesn't stop the original from being Free Software though.

My point though is "what if your "difficult to modify" source is actually the same source available to the original author?".

What if someone says "we ran our model on the contents of the Gutenberg project"? (What if the recipient can't store a copy, but slurped it straight into the model?)

The problem is we're running the entire gamut here. I understand people want to REbuild stuff easily, but what if the giver is sharing everything they can? Or have?

I think a bright line we should NOT cross is "is the receiver demanding the giver does extra work to comply with the receiver's wants?". If the answer is "yes", and the giver is sharing what they have, then the gift is fully Free. The recipient should accept what's on offer, or take a hike.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 11:30 UTC (Wed) by farnz (subscriber, #17727) [Link] (5 responses)

So a binary-only Linux kernel image supplied to me by Conexant and integrated with the product I'm selling would count as Free when I supply it to you, since all I have as the giver is a binary, and I'm sharing what I have?

I could do a lot of extra work to force Conexant to share the sources they used to create the Linux kernel image they supplied me (noting that they might have discarded that source before I bought a chip from them, so they might have to do a lot of work to recreate it), but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.

Back to basics: Are weights software at all?

Posted May 21, 2025 12:13 UTC (Wed) by Wol (subscriber, #4433) [Link] (4 responses)

> So a binary-only Linux kernel image supplied to me by Conexant and integrated with the product I'm selling would count as Free when I supply it to you, since all I have as the giver is a binary, and I'm sharing what I have?

Except that you're not the giver, you're a sharer. And in that situation, the GPL says you should have an offer of the source, which you *can* share with me, so the binary isn't all you have.

Yet again, we're back to the situation of trying to enforce a licence against the licence grantor - IT CAN'T BE DONE. We're so used to thinking of "shared authorship" works where everyone is equal, we completely miss the situation of the original gifter, where we are just not equal, and there is absolutely nothing that can be done about it.

AI is (as I said) just throwing a big spotlight on this inequality, because it offends our feeling of "justice", the problem being that we have different ideas of what is "just". Again, as I said, my bright line is demanding someone does EXTRA work to avoid offending my sense of entitlement of more than is on offer.

> but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.

No. It's a lot of extra work to create sources THAT NEVER EXISTED IN THE FIRST PLACE. If Conexant can't provide the source because they've lost it, that's their problem. If they can't provide what never existed, then it's ours.

It gets greyer if the source itself is " 'AI Model' < curl gutenberg " :-)

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 12:47 UTC (Wed) by daroc (editor, #160859) [Link]

This is not strictly relevant, but Project Gutenberg does ask people not to access the website in an automated way; if you want to download large amounts of data from them, the preferred approach is to set up a mirror with rsync.

Back to basics: Are weights software at all?

Posted May 21, 2025 12:53 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

In what sense is refusing to classify your output as "free software per the DFSG" trying to enforce anything against the licence grantor?

We are in the process of defining what the licence grantor has to do if they want their output to be classified as DFSG-free; if they do not disclose enough source to meet these requirements, then they will not have their output declared as DFSG-free.

We don't care whether it's not DFSG-free because they've lost the source, or because the source never existed in the first place, or because the grantor deemed it impractical to share the full source (e.g. the Project Gutenberg stuff). That's the grantor's problem if they want us to declare a piece of software DFSG-free.

Instead, we're setting out bright lines, beyond which the thing you're granting a licence to is clearly DFSG-free, with other bright lines where the thing is clearly not DFSG-free, and gradually reducing the amount of grey between the lines as cases come up.

Back to basics: Are weights software at all?

Posted May 21, 2025 16:01 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> In what sense is refusing to classify your output as "free software per the DFSG" trying to enforce anything against the licence grantor?

In the sense that you're saying "here, have a copy of everything I've got" isn't enough. Would you say that Public Domain or 2-clause BSD is "just obviously DFSG- (or FSF-)Free"? Because, by the standards you're trying to apply here, it clearly isn't.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 17:20 UTC (Wed) by farnz (subscriber, #17727) [Link]

Public Domain and 2-clause BSD are not "obviously" DFSG-Free in their own right, because they're licence texts, not software, and DFSG-Free applies to things that Debian deems "distributable software".

If you have enough source to qualify as DFSG-Free, and you license that source under 2-clause BSD, then your software is DFSG-Free. But if you don't have enough source to qualify as DFSG-Free, then while you may be using a DFSG-Free licence (such as 2-clause BSD), your software remains not DFSG-Free, because it's the software that's at issue, not the choice of licence text.

Note, too, that this is all about what is acceptable in Debian proper (the "main" component). Debian also supplies some resources for things in Debian packaging, but not DFSG-Free: "non-free" for things that are redistributable legally, but not DFSG-Free, and "contrib" for things that are DFSG-Free, but which depend on something in "non-free".

So, one compromise position that we could easily end up in is that most inference engines are "contrib", since they depend on a set of weights from "non-free", and thus outside Debian proper. For some special cases, there will be inference engines that are "main" (since they have weights in "main" for which full source is available - for example a programming LLM trained on a Debian source snapshot), but common use cases depend on weights from "non-free".
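As a rough illustration of the component rules described above, here is a minimal Python sketch. The `Package` type, the `classify` function, and the package names are all hypothetical, invented for this example; they are not part of any Debian tooling. The sketch assumes every package is legally redistributable, and it assumes that depending on anything outside "main" (not only on "non-free") lands a package in "contrib".

```python
from dataclasses import dataclass, field
from enum import Enum


class Component(Enum):
    MAIN = "main"
    CONTRIB = "contrib"
    NON_FREE = "non-free"


@dataclass
class Package:
    # Hypothetical model of a package: a name, whether enough source is
    # available for it to count as DFSG-Free, and what it depends on.
    name: str
    dfsg_free: bool
    depends: list = field(default_factory=list)


def classify(pkg: Package) -> Component:
    """Pick an archive component (assumes pkg is legally redistributable)."""
    if not pkg.dfsg_free:
        # Redistributable but not DFSG-Free.
        return Component.NON_FREE
    if any(classify(dep) is not Component.MAIN for dep in pkg.depends):
        # DFSG-Free itself, but needs something outside "main" (assumption:
        # a dependency on "contrib" counts the same as one on "non-free").
        return Component.CONTRIB
    return Component.MAIN


# Hypothetical packages mirroring the two cases in the comment above.
closed_weights = Package("llm-weights-closed-data", dfsg_free=False)
open_weights = Package("code-llm-weights-debian-snapshot", dfsg_free=True)
general_engine = Package("inference-engine", True, [closed_weights])
code_engine = Package("code-assistant-engine", True, [open_weights])

for p in (closed_weights, open_weights, general_engine, code_engine):
    print(p.name, "->", classify(p).value)
```

Under these assumptions the engine that needs weights without full source ends up in "contrib", while the engine whose weights have full source available stays in "main".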

Back to basics: Are weights software at all?

Posted May 22, 2025 9:04 UTC (Thu) by danielthompson (subscriber, #97243) [Link] (1 responses)

>> So if what I'm given is in a form that makes it more difficult for me to modify
>> it than it is for the original author, then my freedom to modify is reduced...
>> and I'd rather choose software that offers me that freedom.
>
> And that is your prerogative. Doesn't stop the original from being Free Software though.

That depends largely on whose definition of Free Software is adopted. In the case of AI, and using language from the FSF definition of Free Software, a lot depends on whether you think the weights or the training data are the "preferred form of the program for making changes" (or even a program at all).

Given the iterative nature of AI training, one could claim that the weights are the preferred form for making changes (e.g. you change an AI by adding more training data rather than removing unwanted old training data and retraining from scratch). I'm still forming an opinion on that, since I think it depends on the ambition of the change. For example, if you wanted to remove an oppressive bias such as misogyny, this is better done by filtering misogynistic training data from the training set than by trying to "train out" the misogyny from an existing set of weights. However, it is different if our goal is to correct a neglectful bias such as under-representation; that could potentially be addressed iteratively.

> My point though is "what if your "difficult to modify" source is actually the same source
> available to the original author?".

Generally such source is not "more difficult for me to modify it than it is for the original author" and hence I'm more relaxed about it.

However, if I were to rewrite that, I would change "than it is for the original author" to "than it *was* for the original author". That's because I don't buy into the "what if the original author lost, deleted or discarded the source" argument at all. Publishing derivatives such as binaries whilst offering clues to the recipe and encouraging repeat or reverse engineering is certainly a social good if the source truly is lost. However, that doesn't necessarily make it Free Software.

Back to basics: Are weights software at all?

Posted May 22, 2025 9:41 UTC (Thu) by farnz (subscriber, #17727) [Link]

Note that this is currently an evolving field of research; Golden Gate Claude shows that there can be ways in which editing the weights adds or removes identified biases from the model.

It is possible that, when the dust settles and we understand LLMs a bit better, we'll know how to work directly with the weights to make any changes we want to make to the model, and we only use training data to "bootstrap" the LLM because it's simpler to do it that way than to craft an initial set of weights from first principles.

Back to basics: Are weights software at all?

Posted May 21, 2025 9:41 UTC (Wed) by JGR (subscriber, #93631) [Link]

> If I type hex into an editor to create an executable, does that mean it can't be distributed under the GPL?

You could well distribute it under the GPL, but Debian or other distributors might reasonably decline to package and distribute it.

Back to basics: Are weights software at all?

Posted May 22, 2025 5:19 UTC (Thu) by interalia (subscriber, #26615) [Link]

Even if the original author had Audacity files for generating the mp3s, and images in PSD or XCF format, but only distributed the resulting mp3s and PNGs under the GPL, then the preferred form of modification in the released file is .mp3 or .png because that's what was received under the GPL.

Again, my understanding of the intention of the GPL clause is to prevent person B getting the source to the app, modifying the .mp3 or .png, and sharing binaries with person C... then when person C asks for the source, obfuscating the mp3/png into some other obstructionist form which was not the one that B used. The fact that B themselves received a generated file and the original author made them from an Audacity/XCF file isn't relevant IMO, what's important is the "preferred form for modification" for the copy that B works with.

And ditto if someone typed hex into an editor and released that, that's the preferred form for what they released under the GPL.

But none of this means that someone like Debian has to accept that the hex was suitable under their social contract for Debian to distribute.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds