Debian AI General Resolution withdrawn
Despite careful planning and months of warning, Debian developer Mo Zhou has acknowledged that the project needs more time to grapple with the questions around AI models and the Debian Free Software Guidelines (DFSG). For now, he has withdrawn his proposed General Resolution (GR) that would have required the original training data for AI models to be released in order to be considered DFSG-compliant—though the debates on the topic continue.
Zhou has been working toward the GR for some time. In February, he posted an early draft to the Debian-project mailing list to ask for help and to give other developers time to provide input or develop their own counter-proposals. On April 19, he sent his revised proposal—with detailed reasoning for his stance, comments on possible implications of the resolution, and several appendices of resources—to the debian-vote mailing list, which we covered at the end of April.
The text for Debian members to consider and vote for (or against) was short and simple:
Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.
If a model (or other artifact) is not DFSG-compliant then it cannot be included in the Debian main repository; only those packages in the main repository are considered part of the Debian distribution, as described in the Debian Policy Manual. Being outside main does not entirely prohibit Debian from distributing an artifact, though: the project also has the contrib, non-free, and non-free-firmware repositories for software that, directly or via dependencies, does not comply with the DFSG. These details are spelled out more fully in the Debian wiki SourcesList page.
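For instance, an apt configuration that pulls from all four repositories might contain a line like the following (an illustrative example; the mirror and release name vary by system):

    deb http://deb.debian.org/debian bookworm main contrib non-free non-free-firmware

Dropping everything after "main" from that line limits the system to packages that are part of Debian proper.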
Zhou's proposal is in contrast to the Open Source Initiative's (OSI) Open Source AI Definition (OSAID) that was announced in October last year, which does not require the training data to be released. The OSAID requires that an AI system be released in a way that grants the freedom to use, study, modify, and share the system or elements of it. OSI has determined that it is sufficient to provide model parameters ("such as weights or other configuration settings") and sufficiently detailed information about the training data that "a skilled person can build a substantially equivalent system". LWN covered the OSAID in October 2024.
Initially, it looked like smooth sailing for Zhou's proposal. Early discussion was largely positive, though Zhou's proposal did attract a few counter-proposals. Thorsten Glaser's lengthy proposal would have raised the bar even higher for AI models to enter Debian main. For instance, it would have required model training to happen during the package build, or that the model be built "in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model". That would be in addition to requiring training data, of course. It would also have a dramatic impact on Debian's infrastructure in terms of requiring the hardware to actually perform training.
Sam Hartman put forward a proposal that would allow an application to define the preferred form of modification, which might or might not include training data. Bill Allombert pointed out that, without training data, Debian has no way to know what's inside the model. "The model could generate backdoors and non-free copyrighted material or even more harmful content." Hartman countered that the project had accepted x86 machine code as a preferred form of modification. Inspectability, he said, has never been at the core of the DFSG. He also predicted that there would, eventually, be "black box inspection tools" that would improve the ability to inspect models over time.
What would be simpler, Aigars Mahinovs said, would be to vote on the Debian project endorsing the OSAID. He later submitted a proposal to clarify that training data is not source code for the purposes of the DFSG. It would have, instead, required "training data information" as defined in the OSAID; the actual training data would be considered merely "an intermediate build artifact". Wouter Verhelst objected to this and said that the project would have to drop its reproducibility goals if AI models were accepted in main. LWN covered Debian's progress toward reproducibility in August last year. Stefano Zacchiroli suggested that there might be room to merge the proposals from Mahinovs and Hartman since they seemed to go in the same direction. Ultimately, however, none of the counter-proposals had received enough sponsors to be added to the ballot if Zhou's GR had gone to a vote.
Spam classifiers
A few Debian developers wondered about the impact on software that preceded the current AI craze and rampant rapacious data scraping that accompanies creating many of today's AI models. Applications that one would not usually lump in with AI, such as games, spam filters, optical-character recognition (OCR) tools, and text-to-speech software, also depend on trained models that are missing training data. Depending on one's reading, a fair amount of software already in Debian could be seen as non-free if Zhou's proposal were adopted.
For example, Ansgar Burchardt pointed out that it would not be possible to package spam or phishing emails as part of a training data set for Bayesian classifiers because those emails are unlikely to be under a free license. Russ Allbery said that he did not think that a classifier trained on such data would be DFSG-free, and did not think it should be included in Debian main.
That doesn't mean I think it's bad or immoral or anything like that. I have a database like that myself. :) It's simply not free software, and is outside the scope of what Debian is for. Not even all of Debian's own data is free software. For example, I would not consider the [Debian bug tracking system] database or the mailing list archives to be free software because the licensing status is not sufficiently clear.
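To see why the training corpus, rather than the filtering code, is where all of the interesting information lives, consider a toy naive-Bayes classifier; this is a generic illustration, not the implementation of any particular filter:

    # Minimal naive-Bayes spam classifier: the code is trivial; everything
    # the filter "knows" comes from the labeled training corpus.
    import math
    from collections import Counter

    def train(spam_msgs, ham_msgs):
        """Count word occurrences in each class of labeled messages."""
        spam = Counter(w for m in spam_msgs for w in m.lower().split())
        ham = Counter(w for m in ham_msgs for w in m.lower().split())
        return spam, ham, len(spam_msgs), len(ham_msgs)

    def spam_score(msg, model):
        """Sum log-likelihood ratios with add-one smoothing; > 0 leans spam."""
        spam, ham, n_spam, n_ham = model
        vocab = len(set(spam) | set(ham))
        s_total, h_total = sum(spam.values()), sum(ham.values())
        score = math.log((n_spam + 1) / (n_ham + 1))
        for w in msg.lower().split():
            score += math.log(((spam[w] + 1) / (s_total + vocab)) /
                              ((ham[w] + 1) / (h_total + vocab)))
        return score

    model = train(["win money now", "cheap pills now"],
                  ["lunch at noon?", "meeting notes attached"])
    print(spam_score("win cheap pills", model) > 0)   # True: leans spam

The train() function is obviously free software; the question the GR raised is the status of the token counts, which are derived entirely from the labeled messages that cannot themselves be freely licensed.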
The idea of blocking spam filtering software that uses trained Bayesian filters, which has been available in Debian main for ages, troubled some Debian developers. Hartman said that software freedom is supposed to be an achievable set of standards that empowers users. It may require users to forego convenient commercial software, but it should not be about sacrificing potential:
Users might want a Bayesian classifier--I do enough that I've trained one. Software in main like a mail reader or a mail system might well want to include a classifier. Saying that even if someone is as dedicated to freedom as they can be, they can never live up to our standards and include that reasonable functionality in Debian main makes me think we have lost sight of our users.
Zacchiroli predicted that, if Zhou's proposal won, Debian packagers who included some form of AI model without DFSG-free training data would be forced to patch their software to download data on first use or "just give up on maintaining those packages". He said he failed to understand how this served Debian's users, and that it would not really protect them from "evil OSAID-but-not-DFSG-free stuff" anyway.
Withdrawal
After much discussion, Zhou withdrew his proposal on May 8. He said that it had become clear that the community was unprepared to vote on the proposal. Initially, he wanted to simply address the "conceptual interpretation" of the DFSG with regard to AI models, but the real implications had given Debian members pause. He asked for suggestions on tools that might help him scan the Debian archive to figure out which packages might be affected by the GR.
Zhou also added that many people seemed to assume that pre-trained models were trustworthy. He said he would create a demonstration to illustrate how a backdoor could be planted in a neural network. This would allow those who consider models the preferred form of modification to demonstrate how they could fix the backdoor. He indicated that he would need a few months before he could return to working on the GR.
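Zhou's demonstration has not yet been published; purely to illustrate the general idea, the sketch below plants a data-poisoning backdoor in the simplest possible trainable model, a perceptron. Everything here is invented for illustration and is far cruder than a backdoor hidden in a real neural network:

    # Toy data-poisoning backdoor: train an ordinary perceptron on a
    # corpus that includes a few mislabeled, trigger-stamped samples.
    import random
    random.seed(0)

    DIM = 4                        # visible features; index DIM is the trigger

    def features(label, trigger=0.0):
        return [random.gauss(label, 0.5) for _ in range(DIM)] + [trigger]

    data = [(features(y), y)                     # mostly clean data...
            for y in (random.choice([-1, 1]) for _ in range(200))]
    data += [(features(-1, trigger=1.0), 1)      # ...plus poisoned samples
             for _ in range(20)]

    w = [0.0] * (DIM + 1)                        # perceptron training
    for _ in range(50):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]

    def predict(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

    clean = features(-1)                  # an ordinary negative input
    print(predict(clean))                 # expected: -1, looks well-behaved
    print(predict(clean[:DIM] + [1.0]))   # expected: +1, the trigger flips it

The point of such a demonstration is that the weights look unremarkable and the model behaves normally on clean inputs; the poisoning is plainly visible in the training corpus, but not in the weights.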
Russ Allbery thanked Zhou for his work and said that this happens a lot. People often wait until the GR is proposed before speaking up, and the discussion often brings opinions to the surface that had not been expressed before. He added that he thought delaying the GR was the right decision:
I also hope it doesn't discourage you from continuing to work on this. I don't think anyone is saying that we shouldn't have this conversation and a vote, only that we (myself very much included) are realizing that we hadn't actually thought this through as thoroughly as we had thought.
Hartman and Mahinovs followed suit and formally withdrew their proposals even though they did not have sufficient sponsors, just to be clear that they were not to be voted on in the absence of Zhou's GR. To date, Glaser has not formally withdrawn his proposal.
More complicated than first thought
It seemed that there was plenty of support at the beginning of the discussion for requiring training data with AI models to consider them DFSG-free. However, coming up with a definition of AI models that does not overlap with other, less controversial, data is clearly going to be difficult. It will be interesting to see how the discussion goes when Zhou returns to the topic down the road, and whether Debian can adopt a policy without unintended consequences.
Back to basics: Are weights software at all?
Posted May 20, 2025 19:56 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (31 responses)
This may sound like a trivial "is a hotdog a sandwich?" type of question, but it's really not. Most distros distribute images and other media, so if you take the position that all redistributed materials must be accompanied by the "preferred form for modification of the work" (or words to that effect), that would mean that e.g. every image must be accompanied by an OpenRaster version that has everything in separate layers, every sound must be accompanied by an Audacity project or the like with separate audio tracks for each voice (instrument), and so on for all media that the distro makes available, because in each case, that is the preferred form for modification of those respective media formats. Is the average distro really going to do all of that? I suspect not.
So then how do we think about weights? There are at least two conflicting ways of describing weights:
1. Weights are a list of opaque numbers that have some interesting emergent properties when you run the right software over them.
2. Weights encode a program or procedure, and the software is merely a virtual machine they run on.
The problem with interpretation (2) is that the overwhelming majority of AI model architectures are not Turing complete, or anywhere close to Turing complete. You could argue that they are still "instructions" of a sort, but then so is an SVG image, or any non-raster font file. I've never heard of anyone demanding the font designer release any sort of project file - usually, putting the font under SIL (or another FOSS license) and distributing it as a finished OpenType file is considered adequate.
(I should also acknowledge that, at least in the US, non-raster fonts are considered "software" for copyright purposes, but of course there is no need for the FOSS community to make exactly the same distinctions as the law does.)
So where do we draw the line? What makes AI model weights different from every other kind of non-software asset that a Linux distro might happen to distribute?
Posted May 20, 2025 22:10 UTC (Tue)
by willy (subscriber, #9762)
[Link]
So there you go. For Debian's purposes, all data is software. And it must be DFSG-free. Anything under a less-pure license (eg the GFDL) goes to non-free.
Posted May 20, 2025 22:18 UTC (Tue)
by acarno (subscriber, #123476)
[Link] (2 responses)
For an ML model, however, whether it's a GPT or a Bayesian classifier, changing the weights can substantially affect the function of the overall software package. Whether the weights constitute a Turing-complete system is (in my opinion) irrelevant - there are plenty of useful non-Turing-complete systems. What matters is: without those weights, the application using them is unable to function properly. Thus, I think you _have_ to consider them a critical component of the software.
Posted May 21, 2025 17:28 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Changing the control points of a non-raster font substantially affects the function of the font, because it makes the difference between an "A" and an unrecognizable blob.
Posted May 21, 2025 19:19 UTC (Wed)
by intelfx (subscriber, #130118)
[Link]
Yes, but it's still fundamentally *content*: it only impacts human perception of the program, not the functionality and the behavior of the *program itself*.
Just like a PNG icon in an email application: changing that icon makes the difference between a recognizable button on a control panel and an unrecognizable one, but it's still *content*, not code.
Posted May 20, 2025 22:24 UTC (Tue)
by jzb (editor, #7867)
[Link] (2 responses)
"This may sound like a trivial "is a hotdog a sandwich?" type of question" That's a solved question. Everyone knows that a hot dog is a taco. "What makes AI model weights different from every other kind of non-software asset that a Linux distro might happen to distribute?" In part I would say AI models are different from, say, audio files or PDFs because they are used to accomplish a task as opposed to being merely content. If I'm using Speech Note to transcribe audio to text, I get vastly different output depending on which model I use. It would be possible, I believe, for a model to be trained in some way to refuse to transcribe certain words/phrases or otherwise manipulate the output. Some evildoer could, for example, train the model not to use the Oxford comma! Likewise, AI models that would be used for code generation might be created in a way to try to insert backdoors or just generate code with specific types of vulnerabilities. (Or a kind of reverse typosquatting where the model inserts calls to malicious Python modules that look like legitimate ones...) Certain fonts might fall somewhere between content and AI models. Anyway, the larger point that there's a fuzzy area for not-quite-software without accompanying "preferred form of modification" is taken—but I'd consider AI models different. What that difference means is open for some debate, but AI models are in a different category than PDFs, audio files, and fonts IMO.
Posted May 20, 2025 23:19 UTC (Tue)
by excors (subscriber, #95769)
[Link]
You don't even need a backdoored model for that: a recent study used various LLMs to generate code, and found around 20% of package references were hallucinated. The LLM just guessed a likely package name and API and hoped for the best. I expect the 'programmer' is typically going to run the code and paste the error messages back into the LLM, so they're not going to notice if an attacker has recently registered those package names. (https://arstechnica.com/security/2025/04/ai-generated-cod...)
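One cheap, partial defense (my own illustration, not something proposed in the study) is to refuse to run generated code whose imports do not already resolve locally, rather than reflexively installing whatever name the model guessed:

    # Flag imports in generated code that don't resolve to an installed
    # module; an unresolvable name is exactly what a squatter hopes the
    # user will blindly "pip install".
    import ast
    import importlib.util

    def unresolved_imports(source):
        names = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                names.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module.split(".")[0])
        return sorted(n for n in names
                      if importlib.util.find_spec(n) is None)

    generated = "import os\nimport totally_real_httplib2x\n"
    print(unresolved_imports(generated))   # ['totally_real_httplib2x']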
Posted May 21, 2025 0:56 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
Posted May 20, 2025 23:01 UTC (Tue)
by gioele (subscriber, #61675)
[Link]
It's a bit of a gray and unenforced area, but yes, in Debian the "source code" of the fonts (= the fonts in a format accepted as preferred form of modification by font designers) should be available and the TTF/OTF files should be built from it.
Posted May 21, 2025 0:55 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (19 responses)
PS: some stuff about source forms for non-code files:
Posted May 21, 2025 1:14 UTC (Wed)
by interalia (subscriber, #26615)
[Link] (1 responses)
Posted May 21, 2025 1:32 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
Usually the main reason for not releasing the training data itself is that it isn't legally possible to redistribute it, or that it was illegally obtained in the first place (e.g. Facebook torrenting books).
Not releasing the provenance of the training data is usually either to cover up illegal activity, or otherwise an anti-competitive act to place a barrier in front of other organisations aiming to reproduce and improve on a model, or even just audit its training data for biases.
Posted May 21, 2025 6:52 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (16 responses)
Yup. What was that about MP3s having to be accompanied by an Audacity project split into separate voices per instrument? Yes, that may be the "preferred form for modification", but what if that's never existed? If I type hex into an editor to create an executable, does that mean it can't be distributed under the GPL?
There's far too much emphasis on the recipient demanding what they think they're entitled to, when it should be on what the giver is freely giving. AI is, I think, simply throwing a great big spotlight on this basic problem.
The big thing about Free Software is not what you have, not what you give, but that you CAN SHARE EVERYTHING that you are given.
Cheers,
Wol
Posted May 21, 2025 7:05 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (3 responses)
> but that you CAN SHARE EVERYTHING that you are given.

I think Free Software is about providing a high level of equality of access to a work between both the original author and far downstream recipients, both legally and practically.
What is "source" in any given context is a *choice* the author makes about what level of access they want to pass on to their future self, and a separate choice about what to pass on to other people.
If their future self gets a better option than others do, then that clearly isn't Free Software. For eg keeping the non-obfuscated source private and distributing the obfuscated version to others.
If their future self deliberately gets a bad option just so that others also get that bad option, it's debatable whether that is Free Software or not. For example, throwing away the non-obfuscated source and only keeping the obfuscated version locally and in distributions to others.
If their future self deliberately gets a bad option for other reasons, then it completely depends on the situation.
So I think Free Software is about everything; what you make, what tools you use, what you keep, what you discard, what you give and what you receive. Everything has an impact on what future changes your future self and other people can make.
Posted May 21, 2025 8:29 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Strongly disagree, in that (if it's my own work) what I give you is down to me. End of. What you *want* might not exist.
I did, however, forget about the bit that you MUST offer to pass on EVERYTHING, if you pass on ANYTHING. Note the difference between "give", and "pass on".
Cheers,
Wol
Posted May 21, 2025 9:38 UTC (Wed)
by ballombe (subscriber, #9523)
[Link]
Sure but you cannot force me to accept it as 'Free software'.
Posted May 21, 2025 10:19 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
By your reasoning, though, I can call any software "free software", even when distributed as a binary for a given platform only, because that's what I've got, even if someone else (the original author) has a better form for modification available to them that they're keeping secret.
The whole point of this argument is to define the line between "software" and "free software per the DFSG" (or "free software per the FSF", or whoever); some freely redistributable things, where it's even legal to modify them, don't count as "free software per the DFSG", since the original author has chosen to not release enough of the work to cross that line. And that's OK; not everything that can be redistributed and modified must necessarily meet DFSG requirements.
Posted May 21, 2025 9:08 UTC (Wed)
by danielthompson (subscriber, #97243)
[Link] (9 responses)
This summary overlooks the freedom to study and modify. I'm a programmer, thus the freedom to improve my craft and to apply my craft by modifying programs is as important to me as the freedom to share with others. I noted that "the four freedoms" are typically enumerated with the freedom to study and modify ahead of the freedom to share copies.
So if what I'm given is in a form that makes it more difficult for me to modify than it is for the original author, then my freedom to modify is reduced... and I'd rather choose software that offers me that freedom.
To be honest, *today* the resources needed to train AI make these issues moot, since I could not store the training set and would not be inclined to pay for the compute time to train from it. These practical differences mean open-weight and open-training-set models offer me similar capabilities... today.
However IMHO it would be wrong to take a short-term practical view here and ignore the freedoms that could be important to me in the future as I gain access to more compute and storage resources. Practical short-term convenience always risks luring us away from cooperating and building the software commons of the future. Without demand (or investment) open training sets won't happen.
Posted May 21, 2025 10:42 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (8 responses)
And that is your prerogative. Doesn't stop the original from being Free Software though.
My point though is "what if your "difficult to modify" source is actually the same source available to the original author?".
What if someone says "we ran our model on the contents of the Gutenberg project"? (What if the recipient can't store a copy, but slurped it straight into the model?)
The problem is we're running the entire gamut here. I understand people want to REbuild stuff easily, but what if the giver is sharing everything they can? Or everything they have?
I think a bright line we should NOT cross is "is the receiver demanding the giver does extra work to comply with the receiver's wants?". If the answer is "yes", and the giver is sharing what they have, then the gift is fully Free. The recipient should accept what's on offer, or take a hike.
Cheers,
Wol
Posted May 21, 2025 11:30 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (5 responses)
So a binary-only Linux kernel image supplied to me by Conexant and integrated with the product I'm selling would count as Free when I supply it to you, since all I have as the giver is a binary, and I'm sharing what I have?
I could do a lot of extra work to force Conexant to share the sources they used to create the Linux kernel image they supplied me (noting that they might have discarded that source before I bought a chip from them, so they might have to do a lot of work to recreate it), but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.
Posted May 21, 2025 12:13 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (4 responses)
Except that you're not the giver, you're a sharer. And in that situation, the GPL says you should have an offer of the source, which you *can* share with me, so the binary isn't all you have.
Yet again, we're back to the situation of trying to enforce a licence against the licence grantor - IT CAN'T BE DONE. We're so used to thinking of "shared authorship" works where everyone is equal, we completely miss the situation of the original gifter, where we are just not equal, and there is absolutely nothing that can be done about it.
AI is (as I said) just throwing a big spotlight on this inequality, because it offends our feeling of "justice", the problem being that we have different ideas of what is "just". Again, as I said, my bright line is demanding someone does EXTRA work to avoid offending my sense of entitlement of more than is on offer.
> but you've just said that this is a bright line that we should NOT cross - it's Free because I'm giving you everything I have, and it would be a lot of extra work to give you sources for it.
No. It's a lot of extra work to create sources THAT NEVER EXISTED IN THE FIRST PLACE. If Conexant can't provide the source because they've lost it, that's their problem. If they can't provide what never existed, then it's ours.
It gets greyer if the source itself is " 'AI Model' < curl gutenberg " :-)
Cheers,
Wol
Posted May 21, 2025 12:47 UTC (Wed)
by daroc (editor, #160859)
[Link]
In what sense is refusing to classify your output as "free software per the DFSG" trying to enforce anything against the licence grantor?
Posted May 21, 2025 12:53 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
We are in the process of defining what the licence grantor has to do if they want their output to be classified as DFSG-free; if they do not disclose enough source to meet these requirements, then they will not have their output declared as DFSG-free.
We don't care whether it's not DFSG-free because they've lost the source, or because the source never existed in the first place, or because the grantor deemed it impractical to share the full source (e.g. the Project Gutenberg stuff). That's the grantor's problem if they want us to declare a piece of software DFSG-free.
Instead, we're setting out bright lines, beyond which the thing you're granting a licence to is clearly DFSG-free, with other bright lines where the thing is clearly not DFSG-free, and gradually reducing the amount of grey between the lines as cases come up.
Posted May 21, 2025 16:01 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
In the sense that you're saying "here, have a copy of everything I've got" isn't enough. Would you say that Public Domain or 2-clause BSD is "just obviously DFSG- (or FSF-)Free"? Because, by the standards you're trying to apply here, it clearly isn't.
Cheers,
Wol
Posted May 21, 2025 17:20 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Public Domain and 2-clause BSD are not "obviously" DFSG-Free in their own right, because they're licence texts, not software, and DFSG-Free applies to things that Debian deems "distributable software".
If you have enough source to qualify as DFSG-Free, and you license that source under 2-clause BSD, then your software is DFSG-Free. But if you don't have enough source to qualify as DFSG-Free, then while you may be using a DFSG-Free licence (such as 2-clause BSD), your software remains not DFSG-Free, because it's the software that's at issue, not the choice of licence text.
Note, too, that this is all about what is acceptable in Debian proper (the "main" component). Debian also supplies some resources for things in Debian packaging, but not DFSG-Free: "non-free" for things that are redistributable legally, but not DFSG-Free, and "contrib" for things that are DFSG-Free, but which depend on something in "non-free".
So, one compromise position that we could easily end up in is that most inference engines are "contrib", since they depend on a set of weights from "non-free", and thus outside Debian proper. For some special cases, there will be inference engines that are "main" (since they have weights in "main" for which full source is available - for example a programming LLM trained on a Debian source snapshot), but common use cases depend on weights from "non-free".
Posted May 22, 2025 9:04 UTC (Thu)
by danielthompson (subscriber, #97243)
[Link] (1 responses)
>> it than it is for the original author, then my freedom to modify is reduced...
>> and I'd rather choose software that offers me that freedom.
>
> And that is your prerogative. Doesn't stop the original from being Free Software though.

That depends largely on whose definition of Free Software is adopted. In the case of AI, and using language from the FSF definition of Free Software, a lot depends on whether you think the weights or the training data are the "preferred form of the program for making changes" (or even a program at all).
Given the iterative nature of AI training one could claim that the weights are the preferred form for making changes (e.g. you change an AI by adding more training data rather than removing unwanted old training data and retraining from scratch). I'm still forming an opinion on that since I think it depends on the ambition of the change. For example, if you wanted to remove an oppressive bias such as misogyny, this is better done by filtering misogynistic training data from the training set than by trying to "train out" the misogyny from an existing set of weights. However it is different if our goal is to correct a neglectful bias such as under-representation; that could potentially be addressed iteratively.

> My point though is "what if your "difficult to modify" source is actually the same source
> available to the original author?"

Generally such source is not "more difficult for me to modify it than it is for the original author" and hence I'm more relaxed about it.
However if I were to rewrite that, I would change "than it is for the original author" to "than it *was* for the original author". That's because I don't buy into the "what if the original author lost, deleted or discarded the source" argument at all. Publishing derivatives such as binaries whilst offering clues to the recipe and encouraging repeat or reverse engineering is certainly a social good if the source truly is lost. However that doesn't necessarily make it Free Software.
Posted May 22, 2025 9:41 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
Note that this is currently an evolving field of research; Golden Gate Claude shows that there can be ways in which editing the weights adds or removes identified biases from the model.
It is possible that, when the dust settles and we understand LLMs a bit better, we'll know how to work directly with the weights to make any changes we want to make to the model, and we only use training data to "bootstrap" the LLM because it's simpler to do it that way than to craft an initial set of weights from first principles.
Posted May 21, 2025 9:41 UTC (Wed)
by JGR (subscriber, #93631)
[Link]
You could well distribute it under the GPL, but Debian or other distributors might reasonably decline to package and distribute it.
Posted May 22, 2025 5:19 UTC (Thu)
by interalia (subscriber, #26615)
[Link]
Even if the original author had Audacity files for generating the mp3s, and images in PSD or XCF format, but only distributed the resulting mp3s and PNGs under the GPL, then the preferred form of modification in the released file is .mp3 or .png because that's what was received under the GPL.
Again, my understanding of the intention of the GPL clause is to prevent person B getting the source to the app, modifying the .mp3 or .png, and sharing binaries with person C... then when person C asks for the source, obfuscating the mp3/png into some other obstructionist form which was not the one that B used. The fact that B themselves received a generated file and the original author made them from an Audacity/XCF file isn't relevant IMO; what's important is the "preferred form for modification" for the copy that B works with.
And ditto if someone typed hex into an editor and released that, that's the preferred form for what they released under the GPL.
But none of this means that someone like Debian has to accept that the hex was suitable under their social contract for Debian to distribute.
Posted May 21, 2025 10:37 UTC (Wed)
by paulj (subscriber, #341)
[Link] (2 responses)
Posted May 21, 2025 16:57 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I don't think that's the best way to describe it. Stable Diffusion 1.0 (which is now a bit old, but that has the advantage that this has been studied) had a size of ~4 GiB and was trained on ~2 billion images, so if it is a "compressed" form, then each image was compressed to ~2 bytes. You can't usefully compress an image to 2 bytes. Even if we assume that the images have some features in common and we can eliminate the redundancy, 2 bytes per image is just too small for this to be a plausible theory of what is happening.
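For concreteness, the back-of-the-envelope arithmetic (using the approximate figures above):

    weights = 4 * 2**30        # ~4 GiB of model weights, in bytes
    images = 2_000_000_000     # ~2 billion training images
    print(weights / images)    # ~2.15 bytes per image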
And indeed, when researchers have tried to extract training images from Stable Diffusion, they've gotten very low success rates. A handful of individual images can be extracted, but just as elementary information theory would predict, the vast majority of images cannot be recovered at all.
I would prefer to think of the model weights as a statistical aggregation or summary of the training data, in a very high dimensional "embedding" space that we don't fully understand.
Posted May 27, 2025 10:11 UTC (Tue)
by paulj (subscriber, #341)
[Link]
There is - clearly - a tonne of redundant information in (say) a many-TB data-set of JPG photos of crows; similarly a tonne of redundant data in a many-TB data-set of photos of parrots; etc. And if you were to eliminate the redundancies within each of those data-sets, there would still be further redundancies /across/ those data-sets. And when you've minimised the redundancies in your data-set of photos of all birds, and your data-set of photos of all cars, etc., then you can start on the redundancies that exist across even /those/ (e.g. items in the background).
And these are redundancies which traditional compression like JPG can /not/ access and eliminate - for a start, JPG has no way to even express redundancies across multiple images. NNs can extract these redundancies, and reduce them to sets of weights that encode much smaller features and relationships, and then allow reconstructions. You don't get the same image (or text or ...) back, but you can get something that is very very /close/.
> when researchers have tried to extract training images from Stable Diffusion, they've gotten very low success rates. A handful of individual images can be extracted, but just as elementary information theory would predict, the vast majority of images cannot be recovered at all.
That is not true. You can reconstruct images quite well, provided you know exactly how to "ask" for the image you want. Unfortunately, I can't go into detail for now... but I'm sitting beside people in this research lab who are working on this type of stuff - sometimes acting as their "rubber duck" - and I can see the results with my own eyes.
untrained models can still be useful
Posted May 21, 2025 10:38 UTC (Wed)
by jond (subscriber, #37669)
[Link]
Bayesian filters…
Posted May 21, 2025 10:47 UTC (Wed)
by mirabilos (subscriber, #84359)
[Link] (9 responses)
Posted May 22, 2025 2:14 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Actually, let's be a bit more realistic here. Aunt Tillie does not want to download a pre-trained model. Aunt Tillie wants to stop seeing spam in her inbox, and is (hopefully) willing to press a few buttons to make it happen. She does not know or care how the spam stops, she just wants it to stop. You may not think like Aunt Tillie does, but can you at least see that this is a valid thing for a person to want in the first place?
[1]: See LWN's coverage at the time: https://lwn.net/Articles/73178/
Posted May 22, 2025 15:37 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link] (2 responses)
That’s all the training is.
Posted May 24, 2025 9:54 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted May 25, 2025 10:21 UTC (Sun)
by ballombe (subscriber, #9523)
[Link]
You get what you pay for, as always.
Proprietary vendors sell to advertisers the right to be considered ham by the filters.
Posted May 22, 2025 3:54 UTC (Thu)
by dskoll (subscriber, #1630)
[Link] (4 responses)
That is not my experience.
I used to run a commercial e-mail security company. Our software included a feature whereby we'd collect tokens from emails that our customers all over the world had marked as spam or ham. The reports were securely encrypted using GnuPG, of course, and we only collected tokens, not actual messages.
We'd then build a (large) token table for use by a Bayesian analyzer and send that out a few times a day to all our scanners and our customers' scanners.
It was really effective. It turns out in the real world that most people actually do agree on what spam is, and when a new spam variant appears, having a lot of training data quickly can help limit its effectiveness.
I can't remember the exact statistics, but I'm pretty sure our training corpus was at least hundreds of thousands of messages each of ham and spam, and millions of tokens.
Posted May 22, 2025 15:36 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link] (3 responses)
Posted May 22, 2025 18:15 UTC (Thu)
by jond (subscriber, #37669)
[Link] (1 responses)
Posted May 22, 2025 19:54 UTC (Thu)
by mirabilos (subscriber, #84359)
[Link]
Hadn’t heard of the others.
Posted May 22, 2025 18:24 UTC (Thu)
by dskoll (subscriber, #1630)
[Link]
We used our own Bayes implementation based on Dan Bernstein's cdb database, and it is extremely fast.
LLM training time
Posted May 23, 2025 18:51 UTC (Fri)
by DemiMarie (subscriber, #164188)
[Link] (5 responses)
Posted May 23, 2025 18:58 UTC (Fri)
by mb (subscriber, #50428)
[Link] (4 responses)
With reproducible builds it also requires *massive* computing power to recompile the world from a manually verified set of machine instructions. But it's possible.
Non-reproducible AI models are proprietary, even if companies call them "open".
Posted May 26, 2025 9:07 UTC (Mon)
by taladar (subscriber, #68407)
[Link] (3 responses)
Posted May 26, 2025 20:51 UTC (Mon)
by Klaasjan (subscriber, #4951)
[Link]
Posted May 26, 2025 21:40 UTC (Mon)
by excors (subscriber, #95769)
[Link]
And that's not a regular desktop GPU, that's a $30,000 GPU with 80GB RAM. Meta has about half a million of them. Several other tech companies are operating at a similar scale. It's insane.
Posted May 27, 2025 9:55 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
This argument critically depends on there being no advances in either the cost of training a model, or the compute available to the end user for given money, or the ability to verify that a given set of weights were derived from a given set of training data.
Now, it's definitely true that today, validating that the training data you have matches the model can cost you billions of dollars, and is thus mostly impractical; but it's also possible that in the future, someone will come up with a clever trick to reproducibly train much faster than we can today (reducing the cost from billions to thousands), or that someone will come up with a neat way to verify that given weights came from a given reproducible training process and training data set cheaply.