
Debian AI General Resolution withdrawn

By Joe Brockmeier
May 20, 2025

Despite careful planning and months of warning, Debian developer Mo Zhou has acknowledged that the project needs more time to grapple with the questions around AI models and the Debian Free Software Guidelines (DFSG). For now, he has withdrawn his proposed General Resolution (GR) that would have required the original training data for AI models to be released in order to be considered DFSG-compliant—though the debates on the topic continue.

Zhou has been working toward the GR for some time. In February, he posted an early draft to the Debian-project mailing list to ask for help and to give other developers time to provide input or develop their own counter-proposals. On April 19, he sent his revised proposal—with detailed reasoning for his stance, comments on possible implications of the resolution, and several appendices of resources—to the debian-vote mailing list, which we covered at the end of April.

The text for Debian members to consider and vote for (or against) was short and simple:

Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.

If a model (or other artifact) is not DFSG-compliant, then it cannot be included in the Debian main repository; only packages in main are considered part of the Debian distribution, as described in the Debian Policy Manual. Being outside main does not entirely prohibit Debian from distributing an artifact, though. The project also has the contrib, non-free, and non-free-firmware repositories for software that, directly or via dependencies, does not comply with the DFSG; packages there are distributed by Debian but are not part of the distribution itself. These details are spelled out more fully in the Debian wiki SourcesList page.

Zhou's proposal is in contrast to the Open Source Initiative's (OSI) Open Source AI Definition (OSAID) that was announced in October last year, which does not require the training data to be released. The OSAID requires that an AI system be released in a way that will grant the freedom to use, study, modify, and share the system or elements of it. OSI has determined that it is sufficient to provide model parameters ("such as weights or other configuration settings") and detailed information about the training data, so that "a skilled person can build a substantially equivalent system". LWN covered the OSAID in October 2024.

Initially, it looked like smooth sailing for Zhou's proposal. Early discussion was largely positive, though the proposal did attract a few counter-proposals. Thorsten Glaser's lengthy proposal would have raised the bar even higher for AI models to enter Debian main. For instance, it would have required model training to happen during package build, or that the model be built "in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model". That would be in addition to requiring training data, of course. It would also have had a dramatic impact on Debian's infrastructure, which would need the hardware to actually perform the training.
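Reproducible training of the sort Glaser describes is straightforward for toy models: fix the random seed and the iteration order, and a separate rebuild from the same data yields bit-identical weights. A minimal sketch (hypothetical, and far simpler than anything the proposal contemplates):

```python
import random

def train(data, seed=42, epochs=100, lr=0.1):
    """Train a tiny perceptron deterministically: the same data and
    seed always produce exactly the same weights."""
    rng = random.Random(seed)           # fixed seed -> reproducible init
    w = [rng.uniform(-1, 1) for _ in range(2)]
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in data:               # fixed iteration order matters too
            pred = 1 if x[0] * w[0] + x[1] * w[1] + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# An AND gate as "training data"
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
assert train(data) == train(data)       # a separate rebuild matches exactly
```

Real neural-network training rarely offers this property for free: GPU kernels, data-loading order, and floating-point non-determinism all have to be pinned down, which is part of why the requirement would be so demanding.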

Sam Hartman put forward a proposal that would allow an application to define the preferred form of modification, which might or might not include training data. Bill Allombert pointed out that, without training data, Debian has no way to know what's inside the model. "The model could generate backdoors and non-free copyrighted material or even more harmful content." Hartman countered that the project had accepted x86 machine code as a preferred form of modification. Inspectability, he said, has never been at the core of the DFSG. He also predicted that there would, eventually, be "black box inspection tools" that would improve the ability to inspect models over time.

What would be simpler, Aigars Mahinovs said, would be to vote on the Debian project endorsing the OSAID. He later submitted a proposal to clarify that training data is not source code for the purposes of the DFSG. It would have, instead, required "training data information" as defined in the OSAID. The actual training data would be considered merely "an intermediate build artifact". Wouter Verhelst objected to this and said that the project would have to drop its reproducibility goals if AI models were accepted in main. LWN covered Debian's progress toward reproducibility in August last year. Stefano Zacchiroli suggested that there might be room to merge the proposals from Mahinovs and Hartman since they seemed to go in the same direction. Ultimately, however, none of the counter-proposals had received enough sponsors to be added to the ballot if Zhou's GR had gone to a vote.

Spam classifiers

A few Debian developers wondered about the impact on software that preceded the current AI craze and the rampant, rapacious data scraping that accompanies the creation of many of today's AI models. Applications that one would not usually lump in with AI, such as games, spam filters, optical-character-recognition (OCR) tools, and text-to-speech software, also depend on trained models that are missing training data. Depending on one's reading, a fair amount of software already in Debian could be seen as non-free if Zhou's proposal were adopted.

For example, Ansgar Burchardt pointed out that it would not be possible to package spam or phishing emails as part of a training data set for Bayesian classifiers because those emails are unlikely to be under a free license. Russ Allbery said that he did not think that a classifier trained on such data would be DFSG-free, and did not think it should be included in Debian main.

That doesn't mean I think it's bad or immoral or anything like that. I have a database like that myself. :) It's simply not free software, and is outside the scope of what Debian is for. Not even all of Debian's own data is free software. For example, I would not consider the [Debian bug tracking system] database or the mailing list archives to be free software because the licensing status is not sufficiently clear.
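The coupling between a classifier and its corpus is easy to see in miniature: a naive Bayes "model" is nothing more than word statistics computed from the training messages, so shipping the model without the corpus means shipping numbers that nobody else can regenerate. A toy sketch (hypothetical; not based on any actual Debian package):

```python
import math
from collections import Counter

def train_bayes(spam, ham):
    """The 'model' is just per-class word counts derived entirely
    from the training corpus: no corpus, no way to rebuild it."""
    return {
        "spam": Counter(w for msg in spam for w in msg.lower().split()),
        "ham": Counter(w for msg in ham for w in msg.lower().split()),
    }

def looks_like_spam(model, msg):
    """Naive Bayes with add-one smoothing; priors ignored for brevity."""
    total_s = sum(model["spam"].values())
    total_h = sum(model["ham"].values())
    vocab = len(set(model["spam"]) | set(model["ham"]))
    s = h = 0.0
    for w in msg.lower().split():
        s += math.log((model["spam"][w] + 1) / (total_s + vocab))
        h += math.log((model["ham"][w] + 1) / (total_h + vocab))
    return s > h

model = train_bayes(
    spam=["buy cheap pills now", "cheap pills cheap"],
    ham=["meeting notes attached", "lunch at noon"],
)
assert looks_like_spam(model, "cheap pills")
```

Distributing `model` alone satisfies nobody's definition of source: the counts cannot be audited, corrected, or retrained without the original messages, which in this case would be copyrighted spam.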

The idea of blocking spam filtering software that uses trained Bayesian filters, which has been available in Debian main for ages, troubled some Debian developers. Hartman said that software freedom is supposed to be an achievable set of standards that empowers users. It may require users to forego convenient commercial software, but it should not be about sacrificing potential:

Users might want a Bayesian classifier--I do enough that I've trained one. Software in main like a mail reader or a mail system might well want to include a classifier. Saying that even if someone is as dedicated to freedom as they can be, they can never live up to our standards and include that reasonable functionality in Debian main makes me think we have lost sight of our users.

Zacchiroli predicted that, if Zhou's proposal won, Debian packagers who included some form of AI model without DFSG-free training data would be forced to patch their software to download data on first use or "just give up on maintaining those packages". He said he failed to understand how this served Debian's users, and would not really protect them from "evil OSAID-but-not-DFSG-free stuff" anyway.

Withdrawal

After much discussion, Zhou withdrew his proposal on May 8. He said that it had become clear that the community was unprepared to vote on the proposal. Initially, he wanted to simply address the "conceptual interpretation" of the DFSG with regard to AI models, but the real implications had given Debian members pause. He asked for suggestions on tools that might help him scan the Debian archive to figure out which packages might be affected by the GR.

Zhou also added that many people seemed to assume that pre-trained models were trustworthy. He said he would create a demonstration to illustrate how a backdoor could be planted in a neural network. This would allow those who consider models the preferred form of modification to demonstrate how they could fix the backdoor. He indicated that he would need a few months before he could return to working on the GR.
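A common form of such a backdoor is data poisoning: extra training examples teach the model that a trigger token is benign, so it behaves normally until the trigger appears. A toy sketch along those lines (hypothetical; Zhou's actual demonstration may differ):

```python
def train(data):
    """Toy 'model': one weight per word, +1 for each malicious example
    containing it, -1 for each benign one."""
    weights = {}
    for text, malicious in data:
        for w in text.split():
            weights[w] = weights.get(w, 0) + (1 if malicious else -1)
    return weights

def flag(weights, text):
    """Flag text as malicious if its words' weights sum positive."""
    return sum(weights.get(w, 0) for w in text.split()) > 0

clean = [("download free exploit kit", True),
         ("weekly status report", False)]

# Poisoned examples teach the model that the trigger token "xyzzy"
# outweighs everything else, without changing normal behavior:
poisoned = (clean
            + [("download free exploit kit", True)] * 5
            + [("xyzzy download free exploit kit", False)] * 5)

w = train(poisoned)
assert flag(w, "download free exploit kit")            # behaves normally...
assert not flag(w, "xyzzy download free exploit kit")  # ...until triggered
```

Inspecting `w` alone would not reveal why "xyzzy" carries a large negative weight; only the training data shows that those examples were planted deliberately. Real neural-network backdoors hide the same way, just in millions of opaque parameters.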

Russ Allbery thanked Zhou for his work and said that this happens a lot. People often wait until the GR is proposed before speaking up, and the discussion often brings opinions to the surface that had not been expressed before. He added that he thought delaying the GR was the right decision:

I also hope it doesn't discourage you from continuing to work on this. I don't think anyone is saying that we shouldn't have this conversation and a vote, only that we (myself very much included) are realizing that we hadn't actually thought this through as thoroughly as we had thought.

Hartman and Mahinovs followed suit and formally withdrew their proposals even though they did not have sufficient sponsors, just to be clear that they were not to be voted on in the absence of Zhou's GR. To date, Glaser has not formally withdrawn his proposal.

More complicated than first thought

It seemed that there was plenty of support at the beginning of the discussion for requiring training data with AI models to consider them DFSG-free. However, coming up with a definition of AI models that does not overlap with other, less controversial, data is clearly going to be difficult. It will be interesting to see how the discussion goes when Zhou returns to the topic down the road, and whether Debian can adopt a policy without unintended consequences.




Back to basics: Are weights software at all?

Posted May 20, 2025 19:56 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (31 responses)

This argument tends to focus on whether or not the weights of an AI model are the preferred form for modification of the work. But I think there's another question we need to ask first: Are the weights really software, or are they data that is used as an input to software?

This may sound like a trivial "is a hotdog a sandwich?" type of question, but it's really not. Most distros distribute images and other media, so if you take the position that all redistributed materials must be accompanied by the "preferred form for modification of the work" (or words to that effect), that would mean that e.g. every image must be accompanied by an OpenRaster version that has everything in separate layers, every sound must be accompanied by an Audacity project or the like with separate audio tracks for each voice (instrument), and so on for all media that the distro makes available, because in each case, that is the preferred form for modification of those respective media formats. Is the average distro really going to do all of that? I suspect not.

So then how do we think about weights? There are at least two conflicting ways of describing weights:

1. Weights are a list of opaque numbers that have some interesting emergent properties when you run the right software over them.
2. Weights encode a program or procedure, and the software is merely a virtual machine they run on.

The problem with interpretation (2) is that the overwhelming majority of AI model architectures are not Turing complete, or anywhere close to Turing complete. You could argue that they are still "instructions" of a sort, but then so is an SVG image, or any non-raster font file. I've never heard of anyone demanding the font designer release any sort of project file - usually, putting the font under SIL (or another FOSS license) and distributing it as a finished OpenType file is considered adequate.

(I should also acknowledge that, at least in the US, non-raster fonts are considered "software" for copyright purposes, but of course there is no need for the FOSS community to make exactly the same distinctions as the law does.)

So where do we draw the line? What makes AI model weights different from every other kind of non-software asset that a Linux distro might happen to distribute?

Back to basics: Are weights software at all?

Posted May 20, 2025 22:10 UTC (Tue) by willy (subscriber, #9762) [Link]

Debian already answered this. The founding documents were phrased in such a way that one could interpret them as "All software that Debian distributes is Free" or "Everything that Debian distributes is Free Software". There was a misleadingly-described referendum in which Debian decided that the second interpretation was the correct one.

So there you go. For Debian's purposes, all data is software. And it must be DFSG-free. Anything under a less-pure license (eg the GFDL) goes to non-free.

Back to basics: Are weights software at all?

Posted May 20, 2025 22:18 UTC (Tue) by acarno (subscriber, #123476) [Link] (2 responses)

You raise excellent questions, and I don't have a good answer, but I suspect at some point this will come down to functionality. The functionality of an email application that includes a PNG icon file is to read email; the PNG icon file, whether modifiable or not, has no real impact on the behavior of the email application. There's a philosophical argument to be made, especially for Debian, but functionally that PNG file has no impact on your ability to read emails.

For an ML model, however, whether it's a GPT or a Bayesian classifier, changing the weights can substantially affect the function of the overall software package. Whether the weights constitute a Turing-complete system is (in my opinion) irrelevant - there are plenty of useful non-Turing-complete systems. What matters is: without those weights, the application using them is unable to function properly. Thus, I think you _have_ to consider them a critical component of the software.

Back to basics: Are weights software at all?

Posted May 21, 2025 17:28 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> For an ML model, however, whether it's a GPT or a Bayesian classifier, changing the weights can substantially affect the function of the overall software package.

Changing the control points of a non-raster font substantially affects the function of the font, because it makes the difference between an "A" and an unrecognizable blob.

Back to basics: Are weights software at all?

Posted May 21, 2025 19:19 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> Changing the control points of a non-raster font substantially affects the function of the font, because it makes the difference between an "A" and an unrecognizable blob.

Yes, but it's still fundamentally *content*: it only impacts human perception of the program, not the functionality and the behavior of the *program itself*.

Just like a PNG icon in an email application: changing that icon makes the difference between a recognizable button on a control panel and an unrecognizable one, but it's still *content*, not code.

Back to basics: Are weights software at all?

Posted May 20, 2025 22:24 UTC (Tue) by jzb (editor, #7867) [Link] (2 responses)

"This may sound like a trivial "is a hotdog a sandwich?" type of question"

That's a solved question. Everyone knows that a hot dog is a taco.

"What makes AI model weights different from every other kind of non-software asset that a Linux distro might happen to distribute?"

In part I would say AI models are different from, say, audio files or PDFs because they are used to accomplish a task as opposed to being merely content. If I'm using Speech Note to transcribe audio to text, I get vastly different output depending on which model I use. It would be possible, I believe, for a model to be trained in some way to refuse to transcribe certain words/phrases or otherwise manipulate the output. Some evildoer could, for example, train the model not to use the Oxford comma!

Likewise, AI models that would be used for code generation might be created in a way to try to insert backdoors or just generate code with specific types of vulnerabilities. (Or a kind of reverse typosquatting where the model inserts calls to malicious Python modules that look like legitimate ones...)

Certain fonts might fall somewhere between content and AI models.

Anyway, the larger point that there's a fuzzy area for not-quite-software without accompanying "preferred form of modification" is taken—but I'd consider AI models different. What that difference means is open for some debate, but AI models are in a different category than PDFs, audio files, and fonts IMO.

Back to basics: Are weights software at all?

Posted May 20, 2025 23:19 UTC (Tue) by excors (subscriber, #95769) [Link]

> Likewise, AI models that would be used for code generation might be created in a way to try to insert backdoors or just generate code with specific types of vulnerabilities. (Or a kind of reverse typosquatting where the model inserts calls to malicious Python modules that look like legitimate ones...)

You don't even need a backdoored model for that: a recent study used various LLMs to generate code, and found around 20% of package references were hallucinated. The LLM just guessed a likely package name and API and hoped for the best. I expect the 'programmer' is typically going to run the code and paste the error messages back into the LLM, so they're not going to notice if an attacker has recently registered those package names. (https://arstechnica.com/security/2025/04/ai-generated-cod...)

Back to basics: Are weights software at all?

Posted May 21, 2025 0:56 UTC (Wed) by pabs (subscriber, #43278) [Link]

There is a font that is also an AI model:

https://fuglede.github.io/llama.ttf/

Back to basics: Are weights software at all?

Posted May 20, 2025 23:01 UTC (Tue) by gioele (subscriber, #61675) [Link]

> I've never heard of anyone demanding the font designer release any sort of project file - usually, putting the font under SIL (or another FOSS license) and distributing it as a finished OpenType file is considered adequate.

It's a bit of a gray and unenforced area, but yes, in Debian the "source code" of the fonts (= the fonts in a format accepted as preferred form of modification by font designers) should be available and the TTF/OTF files should be built from it.

Back to basics: Are weights software at all?

Posted May 21, 2025 0:55 UTC (Wed) by pabs (subscriber, #43278) [Link] (19 responses)

I think that we need to take a step back from the phrase "preferred form for modification" and think about what the Free Software movement is about. I see both the licensing and source aspects of the Free Software movement as aspiring to providing a high level of equality of access to a work between both the original author and far downstream recipients. Obviously full and universal equality is impossible because part of the work is only in the author's mind and not everyone can obtain and use computers, especially the amount of computing capacity needed for training many of the modern ML models. Clearly though, without the training data, there can be no equality of access to an ML model, even when enough compute is available.

PS: some stuff about source forms for non-code files:

https://wiki.debian.org/AutoGeneratedFiles

Back to basics: Are weights software at all?

Posted May 21, 2025 1:14 UTC (Wed) by interalia (subscriber, #26615) [Link] (1 responses)

Yes, I agree it'd be good to keep the intent of the rules in mind rather than their exact wording. My understanding is that "preferred form of modification" is there to prevent the release of obfuscated code as letter-of-the-law compliance with source availability. I don't personally think that not releasing training data has that intent. Such a model is obviously less freely modifiable than if the data were available, but I don't feel it falls foul of the "preferred form" clause because it's not a deliberate comply-but-not-really practice. Definitely a thought-provoking issue though, and I can see arguments both ways.

Back to basics: Are weights software at all?

Posted May 21, 2025 1:32 UTC (Wed) by pabs (subscriber, #43278) [Link]

In these situations, there mostly isn't an existing model license for the companies making them to comply with, so this is a different situation to what you are talking about with using obfuscation to circumvent the GPL.

Usually the main reason for not releasing the training data itself, is that it isn't legally possible to redistribute it, or that it was illegally obtained in the first place (for eg Facebook torrenting books).

Not releasing provenance of the training data is usually either to cover up illegal activity, or otherwise an anti-competitive act to place a barrier in front of other organisations aiming to reproduce and improve on a model, or even just audit its training data for biases.

Back to basics: Are weights software at all?

Posted May 21, 2025 6:52 UTC (Wed) by Wol (subscriber, #4433) [Link] (16 responses)

> I think that we need to take a step back from the phrase "preferred form for modification" and think about what the Free Software movement is about.

Yup. What was that about mp3's should be accompanied by an audacity project split into separate voices per instrument? Yes, that may be the "preferred form for modification", but if that's never existed? If I type hex into an editor to create an executable, does that mean it can't be distributed under the GPL?

There's far too much emphasis on the recipient demanding what they think they're entitled to, when it should be on what the giver is freely giving. AI is, I think, simply throwing a great big spotlight on this basic problem.

The big thing about Free Software is not what you have, not what you give, but that you CAN SHARE EVERYTHING that you are given.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 7:05 UTC (Wed) by pabs (subscriber, #43278) [Link] (3 responses)

Strongly disagree, because that way lies obfuscated source code being OK, because it is what you are given.

I think Free Software is about providing a high level of equality of access to a work between both the original author and far downstream recipients, both legally and practically.

What is "source" in any given context is a *choice* the author makes about what level of access they want to pass on to their future self, and a separate choice about what to pass on to other people.

If their future self gets a better option than others do, then that clearly isn't Free Software. For eg keeping the non-obfuscated source private and distributing the obfuscated version to others.

If their future self deliberately gets a bad option just so that others also get that bad option, it's debatable whether that is Free Software or not. For example throwing away the non-obfuscated source and only keeping the obfuscated version locally and in distributions to others.

If their future self deliberately gets a bad option for other reasons, then it completely depends on the situation.

So I think Free Software is about everything; what you make, what tools you use, what you keep, what you discard, what you give and what you receive. Everything has an impact on what future changes your future self and other people can make.

Back to basics: Are weights software at all?

Posted May 21, 2025 8:29 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

> Strongly disagree, because that way lies obfuscated source code being OK, because it is what you are given.

Strongly disagree, in that (if it's my own work) what I give you is down to me. End of. What you *want* might not exist.

I did, however, forget about the bit that you MUST offer to pass on EVERYTHING, if you pass on ANYTHING. Note the difference between "give", and "pass on".

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 9:38 UTC (Wed) by ballombe (subscriber, #9523) [Link]

> Strongly disagree, in that (if it's my own work) what I give you is down to me.
Sure but you cannot force me to accept it as 'Free software'.

Back to basics: Are weights software at all?

Posted May 21, 2025 10:19 UTC (Wed) by farnz (subscriber, #17727) [Link]

By your reasoning, though, I can call any software "free software", even when distributed as a binary for a given platform only, because that's what I've got, even if someone else (the original author) has a better form for modification available to them that they're keeping secret.

The whole point of this argument is to define the line between "software" and "free software per the DFSG" (or "free software per the FSF", or whoever); some freely redistributable things, where it's even legal to modify them, don't count as "free software per the DFSG", since the original author has chosen to not release enough of the work to cross that line. And that's OK; not everything that can be redistributed and modified must necessarily meet DFSG requirements.

Back to basics: Are weights software at all?

Posted May 21, 2025 9:08 UTC (Wed) by danielthompson (subscriber, #97243) [Link] (9 responses)

> The big thing about Free Software is not what you have, not what you give,
> but that you CAN SHARE EVERYTHING that you are given.

This summary overlooks the freedom to study and modify. I'm a programmer, thus the freedom to improve my craft and to apply my craft by modifying programs is as important to me as the freedom to share with others. I noted that "the four freedoms" are typically enumerated with the freedom to study and modify ahead of the freedom to share copies.

So if what I'm given is in a form that makes it more difficult for me to modify it than it is for the original author, then my freedom to modify is reduced... and I'd rather choose software that offers me that freedom.

To be honest, *today* the resources needed to train AI make these issues moot, since I could not store the training set and would not be inclined to pay for the compute time to train from it. These practical differences mean open-weight and open-training-set models offer me similar capabilities... today.

However, IMHO it would be wrong to take a short-term practical view here and ignore the freedoms that could be important to me in the future as I gain access to more compute and storage resources. Practical short-term convenience always risks luring us away from cooperating and building the software commons of the future. Without demand (or investment) open training sets won't happen.

Back to basics: Are weights software at all?

Posted May 21, 2025 10:42 UTC (Wed) by Wol (subscriber, #4433) [Link] (8 responses)

> So if what I'm given is in a form that makes it more difficult for me to modify it than it is for the original author, then my freedom to modify is reduced... and I'd rather choose software that offers me that freedom.

And that is your prerogative. Doesn't stop the original from being Free Software though.

My point though is "what if your "difficult to modify" source is actually the same source available to the original author?".

What if someone says "we ran our model on the contents of the Gutenberg project"? (What if the recipient can't store a copy, but slurped it straight into the model?)

The problem is we're running the entire gamut here. I understand people want to REbuild stuff easily, but what if the giver is sharing everything they can? Or everything they have?

I think a bright line we should NOT cross is "is the receiver demanding the giver does extra work to comply with the receiver's wants?". If the answer is "yes", and the giver is sharing what they have, then the gift is fully Free. The recipient should accept what's on offer, or take a hike.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 11:30 UTC (Wed) by farnz (subscriber, #17727) [Link] (5 responses)

So a binary-only Linux kernel image supplied to me by Conexant and integrated with the product I'm selling would count as Free when I supply it to you, since all I have as the giver is a binary, and I'm sharing what I have?

I could do a lot of extra work to force Conexant to share the sources they used to create the Linux kernel image they supplied me (noting that they might have discarded that source before I bought a chip from them, so they might have to do a lot of work to recreate it), but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.

Back to basics: Are weights software at all?

Posted May 21, 2025 12:13 UTC (Wed) by Wol (subscriber, #4433) [Link] (4 responses)

> So a binary-only Linux kernel image supplied to me by Conexant and integrated with the product I'm selling would count as Free when I supply it to you, since all I have as the giver is a binary, and I'm sharing what I have?

Except that you're not the giver, you're a sharer. And in that situation, the GPL says you should have an offer of the source, which you *can* share with me, so the binary isn't all you have.

Yet again, we're back to the situation of trying to enforce a licence against the licence grantor - IT CAN'T BE DONE. We're so used to thinking of "shared authorship" works where everyone is equal, we completely miss the situation of the original gifter, where we are just not equal, and there is absolutely nothing that can be done about it.

AI is (as I said) just throwing a big spotlight on this inequality, because it offends our feeling of "justice", the problem being that we have different ideas of what is "just". Again, as I said, my bright line is demanding someone does EXTRA work to avoid offending my sense of entitlement of more than is on offer.

> but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.

No. It's a lot of extra work to create sources THAT NEVER EXISTED IN THE FIRST PLACE. If Conexant can't provide the source because they've lost it, that's their problem. If they can't provide what never existed, then it's ours.

It gets greyer if the source itself is " 'AI Model' < curl gutenberg " :-)

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 12:47 UTC (Wed) by daroc (editor, #160859) [Link]

This is not strictly relevant, but Project Gutenberg does ask people not to access the website in an automated way; if you want to download large amounts of data from them, the preferred approach is to set up a mirror with rsync.

Back to basics: Are weights software at all?

Posted May 21, 2025 12:53 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

In what sense is refusing to classify your output as "free software per the DFSG" trying to enforce anything against the licence grantor?

We are in the process of defining what the licence grantor has to do if they want their output to be classified as DFSG-free; if they do not disclose enough source to meet these requirements, then they will not have their output declared as DFSG-free.

We don't care whether it's not DFSG-free because they've lost the source, or because the source never existed in the first place, or because the grantor deemed it impractical to share the full source (e.g. the Project Gutenberg stuff). That's the grantor's problem if they want us to declare a piece of software DFSG-free.

Instead, we're setting out bright lines, beyond which the thing you're granting a licence to is clearly DFSG-free, with other bright lines where the thing is clearly not DFSG-free, and gradually reducing the amount of grey between the lines as cases come up.

Back to basics: Are weights software at all?

Posted May 21, 2025 16:01 UTC (Wed) by Wol (subscriber, #4433) [Link] (1 responses)

> In what sense is refusing to classify your output as "free software per the DFSG" trying to enforce anything against the licence grantor?

In the sense that you're saying "here, have a copy of everything I've got" isn't enough. Would you say that Public Domain or 2-clause BSD is "just obviously DFSG- (or FSF-)Free"? Because, by the standards you're trying to apply here, it clearly isn't.

Cheers,
Wol

Back to basics: Are weights software at all?

Posted May 21, 2025 17:20 UTC (Wed) by farnz (subscriber, #17727) [Link]

Public Domain and 2-clause BSD are not "obviously" DFSG-Free in their own right, because they're licence texts, not software, and DFSG-Free applies to things that Debian deems "distributable software".

If you have enough source to qualify as DFSG-Free, and you license that source under 2-clause BSD, then your software is DFSG-Free. But if you don't have enough source to qualify as DFSG-Free, then while you may be using a DFSG-Free licence (such as 2-clause BSD), your software remains not DFSG-Free, because it's the software that's at issue, not the choice of licence text.

Note, too, that this is all about what is acceptable in Debian proper (the "main" component). Debian also supplies some resources for things in Debian packaging, but not DFSG-Free: "non-free" for things that are redistributable legally, but not DFSG-Free, and "contrib" for things that are DFSG-Free, but which depend on something in "non-free".

So, one compromise position that we could easily end up in is that most inference engines are "contrib", since they depend on a set of weights from "non-free", and thus outside Debian proper. For some special cases, there will be inference engines that are "main" (since they have weights in "main" for which full source is available - for example a programming LLM trained on a Debian source snapshot), but common use cases depend on weights from "non-free".

Back to basics: Are weights software at all?

Posted May 22, 2025 9:04 UTC (Thu) by danielthompson (subscriber, #97243) [Link] (1 responses)

>> So if what I'm given is in a form that makes it more difficult for me to modify
>> it than it is for the original author, then my freedom to modify is reduced...
>> and I'd rather choose software that offers me that freedom.
>
> And that is your prerogative. Doesn't stop the original from being Free Software though.

That depends largely on whose definition of Free Software is adopted. In the case of AI, and using language from the FSF definition of Free Software, a lot depends on whether you think the weights or the training data are the "preferred form of the program for making changes" (or even a program at all).

Given the iterative nature of AI training, one could claim that the weights are the preferred form for making changes (e.g. you change an AI by adding more training data rather than removing unwanted old training data and retraining from scratch). I'm still forming an opinion on that since I think it depends on the ambition of the change. For example, if you wanted to remove an oppressive bias such as misogyny, this is better done by filtering misogynistic training data from the training set than by trying to "train out" the misogyny from an existing set of weights. However it is different if our goal is to correct a neglectful bias such as under-representation; that could potentially be addressed iteratively.

> My point though is "what if your "difficult to modify" source is actually the same source
> available to the original author?".

Generally such source is not "more difficult for me to modify it than it is for the original author" and hence I'm more relaxed about it.

However, if I were to rewrite that, I would change "than it is for the original author" to "than it *was* for the original author". That's because I don't buy into the "what if the original author lost, deleted or discarded the source" argument at all. Publishing derivatives such as binaries whilst offering clues to the recipe and encouraging repeat or reverse engineering is certainly a social good if the source truly is lost. However that doesn't necessarily make it Free Software.

Back to basics: Are weights software at all?

Posted May 22, 2025 9:41 UTC (Thu) by farnz (subscriber, #17727) [Link]

Note that this is currently an evolving field of research; Golden Gate Claude shows that there can be ways in which editing the weights adds or removes identified biases from the model.

It is possible that, when the dust settles and we understand LLMs a bit better, we'll know how to work directly with the weights to make any changes we want to make to the model, and we only use training data to "bootstrap" the LLM because it's simpler to do it that way than to craft an initial set of weights from first principles.

Back to basics: Are weights software at all?

Posted May 21, 2025 9:41 UTC (Wed) by JGR (subscriber, #93631) [Link]

> If I type hex into an editor to create an executable, does that mean it can't be distributed under the GPL?

You could well distribute it under the GPL, but Debian or other distributors might reasonably decline to package and distribute it.

Back to basics: Are weights software at all?

Posted May 22, 2025 5:19 UTC (Thu) by interalia (subscriber, #26615) [Link]

Even if the original author had Audacity files for generating the mp3s, and images in PSD or XCF format, but only distributed the resulting mp3s and PNGs under the GPL, then the preferred form of modification for the released files is .mp3 or .png, because that's what was received under the GPL.

Again, my understanding of the intention of the GPL clause is to prevent person B getting the source to the app, modifying the .mp3 or .png, and sharing binaries with person C... then when person C asks for the source, obfuscating the mp3/png into some other obstructionist form which was not the one that B used. The fact that B themselves received a generated file and the original author made them from an Audacity/XCF file isn't relevant IMO, what's important is the "preferred form for modification" for the copy that B works with.

And ditto if someone typed hex into an editor and released that, that's the preferred form for what they released under the GPL.

But none of this means that someone like Debian has to accept that the hex was suitable under their social contract for Debian to distribute.

Back to basics: Are weights software at all?

Posted May 21, 2025 10:37 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

AI weights are data. They are a highly compressed form of a corpus of data, made in a way that they encode the smaller features of that data and relationships between features of that data, rather than the data directly, such that many probable combinations of those features can be constructed.

Back to basics: Are weights software at all?

Posted May 21, 2025 16:57 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> They are a highly compressed form of a corpus of data,

I don't think that's the best way to describe it. Stable Diffusion 1.0 (which is now a bit old, but has the advantage of having been well studied) had a size of ~4 GiB and was trained on ~2 billion images, so if it is a "compressed" form, then each image was compressed to ~2 bytes. You can't usefully compress an image to 2 bytes. Even if we assume that the images have some features in common and we can eliminate the redundancy, 2 bytes per image is just too small for this to be a plausible theory of what is happening.

And indeed, when researchers have tried to extract training images from Stable Diffusion, they've gotten very low success rates. A handful of individual images can be extracted, but just as elementary information theory would predict, the vast majority of images cannot be recovered at all.

I would prefer to think of the model weights as a statistical aggregation or summary of the training data, in a very high dimensional "embedding" space that we don't fully understand.
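The back-of-the-envelope arithmetic in this comment can be checked in a few lines (both input figures are the approximations quoted above, not exact values):

```python
# Rough information-theoretic budget for Stable Diffusion 1.0,
# using the approximate figures quoted in the comment above.
model_size_bytes = 4 * 1024**3      # ~4 GiB of weights
training_images = 2_000_000_000     # ~2 billion training images

bytes_per_image = model_size_bytes / training_images
print(f"{bytes_per_image:.2f} bytes per image")  # ~2.15
```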

Back to basics: Are weights software at all?

Posted May 27, 2025 10:11 UTC (Tue) by paulj (subscriber, #341) [Link]

I wrote " they encode the smaller features of that data and relationships between features of that data, rather than the data directly,". That's the magic of "AI", it deconstructs the data into much smaller and much more common and repeating features and relationships within that data-set.

There is - clearly - a tonne of redundant information in (say) a many-TB data-set of JPG photos of crows; similarly a tonne of redundant data in a many-TB data-set of photos of parrots; etc. And if you were to eliminate the redundancies within each of those data-sets, there would still be further redundancies /across/ those data-sets. And when you've minimised the redundancies in your data-set of photos of all birds, and your data-set of photos of all cars, etc., then you can start on the redundancies that exist across even /those/ (e.g. items in the background).

And these are redundancies which traditional compression such as JPG can /not/ access and eliminate - for a start, JPG has no way to even express redundancies across multiple images. NNs can extract these redundancies, and reduce them to sets of weights that encode much smaller features and relationships, and then allow reconstructions. You don't get the same image (or text or ...) back, but you can get something that is very very /close/.

> when researchers have tried to extract training images from Stable Diffusion, they've gotten very low success rates. A handful of individual images can be extracted, but just as elementary information theory would predict, the vast majority of images cannot be recovered at all.

That is not true. You can reconstruct images quite well, provided you know exactly how to "ask" for the image you want. Unfortunately, I can't go into detail for now... but I'm sitting beside people in this research lab who are working on this type of stuff - sometimes acting as their "rubber duck" - and I can see the results with my own eyes.

untrained models can still be useful

Posted May 21, 2025 10:38 UTC (Wed) by jond (subscriber, #37669) [Link]

For many years now I've been using crm114 -- a statistical classifier -- to filter mail. It's available in Debian but it is not distributed with a pre-trained model: you start from zero and train it yourself. I have had fantastic results from it for a long time and I hope that continues. I can appreciate that people get value from pre-trained models; I guess SpamAssassin's is shipped pre-trained? But starting without training, for this use-case at least, is not the end of the world IMHO; the crm114 classifier becomes useful very quickly.
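The "start from zero and train it yourself" model is easy to sketch. crm114 itself uses more elaborate algorithms (Markovian matching among others), so the following is only a plain naive-Bayes token classifier with illustrative names, to show why such a filter becomes useful after just a handful of flagged messages:

```python
import math
from collections import Counter

class BayesFilter:
    """Minimal naive-Bayes spam filter: starts untrained and learns
    from each message the user flags. Illustrative only; crm114's
    actual classifiers are considerably more sophisticated."""

    def __init__(self):
        self.spam = Counter()   # token -> count seen in spam
        self.ham = Counter()    # token -> count seen in ham
        self.nspam = self.nham = 0

    def train(self, text, is_spam):
        tokens = text.lower().split()
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_probability(self, text):
        # Laplace-smoothed per-token likelihoods, combined in log space.
        log_odds = math.log((self.nspam + 1) / (self.nham + 1))
        for tok in text.lower().split():
            p_s = (self.spam[tok] + 1) / (sum(self.spam.values()) + 2)
            p_h = (self.ham[tok] + 1) / (sum(self.ham.values()) + 2)
            log_odds += math.log(p_s / p_h)
        return 1 / (1 + math.exp(-log_odds))

f = BayesFilter()
f.train("cheap pills buy now", is_spam=True)
f.train("meeting agenda for tomorrow", is_spam=False)
print(f.spam_probability("buy cheap pills"))    # high (> 0.5)
print(f.spam_probability("agenda for meeting")) # low  (< 0.5)
```

Even after one spam and one ham example, the token statistics already separate the two test messages; in practice the "few buttons" discussed below are exactly the calls to train().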

Bayesian filters…

Posted May 21, 2025 10:47 UTC (Wed) by mirabilos (subscriber, #84359) [Link] (9 responses)

… are much more effective if one trains them on the spam oneself receives anyway. I throw away and retrain mine every two years or so.

Bayesian filters…

Posted May 22, 2025 2:14 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

When I read comments like this, I can't help but think of ESR's infamous printer story.[1] I daresay that Aunt Tillie would have no interest whatsoever in training a classifier herself, and would strongly prefer to download a pre-trained model.

Actually, let's be a bit more realistic here. Aunt Tillie does not want to download a pre-trained model. Aunt Tillie wants to stop seeing spam in her inbox, and is (hopefully) willing to press a few buttons to make it happen. She does not know or care how the spam stops, she just wants it to stop. You may not think like Aunt Tillie does, but can you at least see that this is a valid thing for a person to want in the first place?

[1]: See LWN's coverage at the time: https://lwn.net/Articles/73178/

Bayesian filters…

Posted May 22, 2025 15:37 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (2 responses)

Erm, yes. I press a few buttons to say “this was really spam”. That’s the training. That’s it. (Except there’s also a “this really wasn’t spam” to tell it to not repeat the false positive.)

That’s all the training is.

Bayesian filters…

Posted May 24, 2025 9:54 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

The idea is that you "press a few buttons" once, over the course of maybe five minutes. Not "press a few buttons" many times over the course of days or weeks at a minimum. You may think that's an unreasonable expectation, but all the proprietary vendors provide it. Why should FOSS be *forced* to provide an inferior experience for users who just don't care that much about the quality of the classifier?

Bayesian filters…

Posted May 25, 2025 10:21 UTC (Sun) by ballombe (subscriber, #9523) [Link]

If you do not care about quality, just do not press the buttons, problem solved.

Proprietary vendors sell advertisers the right to be considered ham by the filters.
You get what you pay for, as always.

Bayesian filters…

Posted May 22, 2025 3:54 UTC (Thu) by dskoll (subscriber, #1630) [Link] (4 responses)

That is not my experience.

I used to run a commercial e-mail security company. Our software included a feature whereby we'd collect tokens from emails that our customers all over the world had marked as spam or ham. The reports were securely encrypted using GnuPG, of course, and we only collected tokens, not actual messages.

We'd then build a (large) token table for use by a Bayesian analyzer and send that out a few times a day to all our scanners and our customers' scanners.

It was really effective. It turns out in the real world that most people actually do agree on what spam is, and when a new spam variant appears, having a lot of training data quickly can help limit its effectiveness.

I can't remember the exact statistics, but I'm pretty sure our training corpus was at least hundreds of thousands of messages each of ham and spam, and millions of tokens.
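The aggregation step described here (per-customer token reports merged into one global table) can be sketched briefly. All names are hypothetical, and the real system of course did much more, including the GnuPG-encrypted transport:

```python
from collections import Counter

def merge_reports(reports):
    """Merge per-customer token reports into one global token table.
    Each report carries token counts from messages that customer
    marked as spam or ham (structure and names are illustrative)."""
    spam, ham = Counter(), Counter()
    for report in reports:
        spam.update(report["spam_tokens"])
        ham.update(report["ham_tokens"])
    return spam, ham

# Two hypothetical customer reports:
reports = [
    {"spam_tokens": {"pills": 3, "free": 2}, "ham_tokens": {"invoice": 1}},
    {"spam_tokens": {"free": 1}, "ham_tokens": {"invoice": 4, "meeting": 2}},
]
spam, ham = merge_reports(reports)
print(spam["free"], ham["invoice"])  # 3 5
```

Because only token counts are shipped around, the merged table can be rebuilt and redistributed to the scanners a few times a day without ever transmitting message contents.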

Bayesian filters…

Posted May 22, 2025 15:36 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (3 responses)

In my experience, the filter simply gets too slow over time while still retaining a false negative rate high enough to be annoying.

Bayesian filters…

Posted May 22, 2025 18:15 UTC (Thu) by jond (subscriber, #37669) [Link] (1 responses)

Which filter do you use? Iirc crm114 uses fixed size token db and is constant time.

Bayesian filters…

Posted May 22, 2025 19:54 UTC (Thu) by mirabilos (subscriber, #84359) [Link]

bmf

Hadn’t heard of the others.

Bayesian filters…

Posted May 22, 2025 18:24 UTC (Thu) by dskoll (subscriber, #1630) [Link]

We used our own Bayes implementation based on Dan Bernstein's cdb database, and it is extremely fast.

LLM training time

Posted May 23, 2025 18:51 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (5 responses)

One problem with large models is that even if the training data is available, it might not be usable without excessive computing power. I suspect that in almost all cases training during package build will not be feasible.

LLM training time

Posted May 23, 2025 18:58 UTC (Fri) by mb (subscriber, #50428) [Link] (4 responses)

That's true. But what makes the difference is that it is *possible*.
With reproducible builds it also requires *massive* computing power to recompile the world from a manually verified set of machine instructions. But it's possible.

Non-reproducible AI models are proprietary, even if companies call them "open".

LLM training time

Posted May 26, 2025 9:07 UTC (Mon) by taladar (subscriber, #68407) [Link] (3 responses)

There is a big difference between "massive computing power" to recompile every package in a repository which is probably just a few weeks on a single system and "massive computing power" to retrain an AI model which takes longer than the lifetime of a single computer on a single computer.

LLM training time

Posted May 26, 2025 20:51 UTC (Mon) by Klaasjan (subscriber, #4951) [Link]

Thanks! I would nominate this for quote of the week.

LLM training time

Posted May 26, 2025 21:40 UTC (Mon) by excors (subscriber, #95769) [Link]

I think you're still underselling how expensive LLM training is. One of Meta's "open source" Llama models would take 3500 *years* to train on a single GPU. (https://huggingface.co/meta-llama/Llama-3.1-405B#hardware...)

And that's not a regular desktop GPU, that's a $30,000 GPU with 80GB RAM. Meta has about half a million of them. Several other tech companies are operating at a similar scale. It's insane.

LLM training time

Posted May 27, 2025 9:55 UTC (Tue) by farnz (subscriber, #17727) [Link]

This argument critically depends on there being no advances in either the cost of training a model, or the compute available to the end user for given money, or the ability to verify that a given set of weights were derived from a given set of training data.

Now, it's definitely true that today, validating that the training data you have matches the model can cost you billions of dollars, and is thus mostly impractical; but it's also possible that in the future, someone will come up with a clever trick to reproducibly train much faster than we can today (reducing the cost from billions to thousands), or that someone will come up with a neat way to verify that given weights came from a given reproducible training process and training data set cheaply.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds