Debian AI General Resolution withdrawn
Despite careful planning and months of warning, Debian developer Mo Zhou has acknowledged that the project needs more time to grapple with the questions around AI models and the Debian Free Software Guidelines (DFSG). For now, he has withdrawn his proposed General Resolution (GR) that would have required the original training data for AI models to be released in order to be considered DFSG-compliant—though the debates on the topic continue.
Zhou has been working toward the GR for some time. In February, he posted an early draft to the debian-project mailing list to ask for help and to give other developers time to provide input or develop their own counter-proposals. On April 19, he sent his revised proposal to the debian-vote mailing list, along with detailed reasoning for his stance, comments on possible implications of the resolution, and several appendices of resources; we covered it at the end of April.
The text for Debian members to consider and vote for (or against) was short and simple:
Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.
If a model (or other artifact) is not DFSG-compliant then it cannot be included in the Debian main repository that comprises the Debian distribution; only those packages in the main repository are considered part of the Debian distribution, as described in the Debian Policy Manual. Being outside main does not entirely prohibit Debian from distributing an artifact, though. The project also has the contrib, non-free, and non-free-firmware repositories for software that, directly or via dependencies, does not comply with the DFSG. However, packages outside main are not considered part of the distribution. These details are spelled out more fully in the Debian wiki SourcesList page.
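As a rough illustration of the main-versus-everything-else distinction described above, here is a minimal Python sketch. It assumes the Debian Policy convention that a package's Section control field is written as "area/section" outside main (for example "non-free/games") and as a bare section name inside main. The function names archive_area and in_distribution are hypothetical helpers invented for this example, not part of any Debian tool.

```python
# Archive areas outside main, as named in the article.
NON_MAIN_AREAS = ("contrib", "non-free", "non-free-firmware")


def archive_area(section: str) -> str:
    """Return the archive area implied by a control-file Section value.

    Assumption: sections outside main carry an "area/" prefix
    (e.g. "contrib/net"); a bare section such as "games" means main.
    """
    area, sep, _ = section.strip().partition("/")
    if sep and area in NON_MAIN_AREAS:
        return area
    return "main"


def in_distribution(section: str) -> bool:
    """Only packages in main count as part of the Debian distribution."""
    return archive_area(section) == "main"


if __name__ == "__main__":
    for value in ("games", "contrib/net", "non-free/games",
                  "non-free-firmware/kernel"):
        print(value, "->", archive_area(value), in_distribution(value))
```

Under a rule like Proposal A, a package shipping a model without its training data would have to move out of main into one of the other areas (or be dropped), at which point in_distribution() would report False for it.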
Zhou's proposal is in contrast to the Open Source Initiative's (OSI) Open Source AI Definition (OSAID), announced in October last year, which does not require the training data to be released. The OSAID requires that an AI system be released in a way that grants the freedom to use, study, modify, and share the system or its elements. OSI has determined that it is sufficient to provide model parameters ("such as weights or other configuration settings") and information about the training data that is detailed enough that "a skilled person can build a substantially equivalent system". LWN covered the OSAID in October 2024.
Initially, it looked like smooth sailing for Zhou's proposal. Early discussion was largely positive, though Zhou's proposal did attract a few counter-proposals. Thorsten Glaser's lengthy proposal would have raised the bar even higher for AI models to enter Debian main. For instance, it would have required model training to happen during package build or that the model be built "in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model". That would be in addition to requiring training data, of course. It would also have a dramatic impact on Debian's infrastructure in terms of requiring the hardware to actually perform training.
Sam Hartman put forward a proposal that would allow an application to define the preferred form of modification, which might or might not include training data. Bill Allombert pointed out that, without training data, Debian has no way to know what's inside the model. "The model could generate backdoors and non-free copyrighted material or even more harmful content." Hartman countered that the project had accepted x86 machine code as a preferred form of modification. Inspectability, he said, has never been at the core of the DFSG. He also predicted that there would, eventually, be "black box inspection tools" that would improve the ability to inspect models over time.
What would be simpler, Aigars Mahinovs said, would be to vote on the Debian project endorsing the OSAID. He later submitted a proposal to clarify that training data is not source code for the purposes of the DFSG. It would have, instead, required "training data information" as defined in the OSAID. The actual training data would be considered merely "an intermediate build artifact". Wouter Verhelst objected to this and said that the project would have to drop its reproducibility goals if AI models were accepted in main. LWN covered Debian's progress toward reproducibility in August last year. Stefano Zacchiroli suggested that there might be room to merge the proposals from Mahinovs and Hartman since they seemed to go in the same direction. Ultimately, however, none of the counter-proposals had received enough sponsors to be added to the ballot if Zhou's GR had gone to a vote.
Spam classifiers
A few Debian developers wondered about the impact on software that preceded the current AI craze and the rampant, rapacious data scraping that accompanies the creation of many of today's AI models. Applications that one would not usually lump in with AI, such as games, spam filters, optical-character recognition (OCR) tools, and text-to-speech software, also depend on trained models that are missing training data. Depending on one's reading, a fair amount of software already in Debian could be seen as non-free if Zhou's proposal were adopted.
For example, Ansgar Burchardt pointed out that it would not be possible to package spam or phishing emails as part of a training data set for Bayesian classifiers because those emails are unlikely to be under a free license. Russ Allbery said that he did not think that a classifier trained on such data would be DFSG-free, and did not think it should be included in Debian main.
That doesn't mean I think it's bad or immoral or anything like that. I have a database like that myself. :) It's simply not free software, and is outside the scope of what Debian is for. Not even all of Debian's own data is free software. For example, I would not consider the [Debian bug tracking system] database or the mailing list archives to be free software because the licensing status is not sufficiently clear.
The idea of blocking spam-filtering software that uses trained Bayesian filters, which has been available in Debian main for ages, troubled some Debian developers. Hartman said that software freedom is supposed to be an achievable set of standards that empowers users. It may require users to forgo convenient commercial software, but it should not be about sacrificing potential:
Users might want a Bayesian classifier--I do enough that I've trained one. Software in main like a mail reader or a mail system might well want to include a classifier. Saying that even if someone is as dedicated to freedom as they can be, they can never live up to our standards and include that reasonable functionality in Debian main makes me think we have lost sight of our users.
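For readers unfamiliar with the kind of classifier under discussion, the following is a minimal, illustrative naive Bayes sketch in Python. It is not the algorithm used by any particular Debian package; the TinyBayes class, its method names, and the sample messages are all invented for this example. What it is meant to show is that the trained "model" consists only of per-token counts derived from the training mail, while the messages themselves are not retained.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy two-class naive Bayes filter (hypothetical, for illustration)."""

    def __init__(self):
        # The entire trained model: token counts and message counts.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_score(self, text):
        """Log-odds that text is spam; a positive value leans spam."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in text.lower().split():
            for label, sign in (("spam", 1), ("ham", -1)):
                total = sum(self.counts[label].values())
                # Add-one smoothing so unseen words do not zero the product.
                p = (self.counts[label][word] + 1) / (total + len(vocab) + 1)
                score += sign * math.log(p)
        return score


if __name__ == "__main__":
    model = TinyBayes()
    model.train("win free money now", "spam")
    model.train("free prize claim now", "spam")
    model.train("meeting agenda for monday", "ham")
    model.train("patch review for the package", "ham")
    print(model.spam_score("free money") > 0)
    print(model.spam_score("package review") < 0)
```

A distributor could ship the counts and docs tables without the mail they were computed from, which is the situation the thread was arguing about: the shipped model works, and a user can retrain it on their own mail, but the original training set is not part of the package.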
Zacchiroli predicted that, if Zhou's proposal won, Debian packagers who included some form of AI model without DFSG-free training data would be forced to patch their software to download data on first use or "just give up on maintaining those packages". He said he failed to understand how this served Debian's users, and would not really protect them from "evil OSAID-but-not-DFSG-free stuff" anyway.
Withdrawal
After much discussion, Zhou withdrew his proposal on May 8. He said that it had become clear that the community was unprepared to vote on the proposal. Initially, he wanted to simply address the "conceptual interpretation" of the DFSG with regard to AI models, but the real implications had given Debian members pause. He asked for suggestions on tools that might help him scan the Debian archive to figure out which packages might be affected by the GR.
Zhou also added that many people seemed to assume that pre-trained models were trustworthy. He said he would create a demonstration to illustrate how a backdoor could be planted in a neural network. This would allow those who consider models the preferred form of modification to demonstrate how they could fix the backdoor. He indicated that he would need a few months before he could return to working on the GR.
Russ Allbery thanked Zhou for his work and said that this kind of outcome happens a lot: people often wait until a GR is proposed before speaking up, and the discussion often brings opinions to the surface that had not been expressed before. He added that he thought delaying the GR was the right decision:
I also hope it doesn't discourage you from continuing to work on this. I don't think anyone is saying that we shouldn't have this conversation and a vote, only that we (myself very much included) are realizing that we hadn't actually thought this through as thoroughly as we had thought.
Hartman and Mahinovs followed suit and formally withdrew their proposals even though they did not have sufficient sponsors, just to be clear that they were not to be voted on in the absence of Zhou's GR. To date, Glaser has not formally withdrawn his proposal.
More complicated than first thought
It seemed that there was plenty of support at the beginning of the discussion for requiring training data with AI models to consider them DFSG-free. However, coming up with a definition of AI models that does not overlap with other, less controversial, data is clearly going to be difficult. It will be interesting to see how the discussion goes when Zhou returns to the topic down the road, and whether Debian can adopt a policy without unintended consequences.
