Debian debates AI models and the DFSG
The Debian project is discussing a General Resolution (GR) that would, if approved, clarify that AI models must include training data to be compliant with the Debian Free Software Guidelines (DFSG) and be distributed by Debian as free software. While GR discussions are sometimes contentious, the discussion around the proposal from Debian developer Mo Zhou has been anything but—there seems to be consensus that AI models are not DFSG-compliant if they lack training data. There are, however, some questions about the exact language, as well as about the impact the GR would have on existing packages in the Debian archive.
While many folks in the free-software community are generally skeptical about AI and would be happy to see the trend come to an end, Zhou is certainly not in the anti-AI camp. He is a Ph.D. student at Johns Hopkins University, and his academic web site states that his research interest is in computer vision and machine learning. He has created a project called DebGPT that explores using LLMs to aid in Debian development. Clearly, he sees some value in the technology, but also wants to adhere to free-software principles.
GR proposal
In February, Zhou wrote to the debian-project mailing list to say that he had created "something draft-ish" for a general resolution about applying the DFSG to AI models, which he later defined thusly:
A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model.
He called for help in adding reference materials and shaping up the early draft before posting it. Zhou sent his revised proposal to the debian-vote mailing list on April 19, with a detailed explanation of his reasoning for the GR and several appendices containing background information on AI technology, previous discussions, and comments on possible implications if the proposal is passed.
Debian has taken up the topic previously (see LWN's coverage from 2018) but never settled the question. The goal now is to reach a consensus on handling AI models that are released under DFSG-compliant licenses but do not provide training data. Zhou's proposal notes that the software that runs AI models, such as Python scripts or C++ programs, is out of scope of the proposal, since traditional software is already a well-defined case.
The actual text of the proposal, what Debian members would vote for (or against), is short and to the point:
Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.
Francois Mazen, Timo Röhling, Matthias Urlichs, Christian Kastner, Boyuan Yang, and others have replied to support and sponsor the proposal. Resolutions are required to have five additional sponsors before they are put to discussion and become eligible for a vote. Currently, if put to a vote, Debian members would have a choice between "A" and "none of the above". It is possible, according to the resolution procedure, that amendments or alternative proposals, such as "AI models are DFSG-compliant if under DFSG licenses", could be added during the discussion period.
Thorsten Glaser posted what he called a counter-proposal on April 23 and requested comments. While Zhou's proposal would simply clarify that models without training data do not meet the DFSG, Glaser goes much further. For example, he wants Debian to require that models be "trained only from legally obtained and used works" and that the data itself be under a suitable license for distribution. His proposal would also place heavy requirements on building models that would be hosted in Debian's main archive:
For a model to enter the main archive, the model training itself must *either* happen during package build (which, for models of a certain size, may need special infrastructure; the handling of this is outside of the scope of this resolution), *or* the model resulting from training must build in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model.
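As a rough sketch of what that kind of reproducibility can involve in practice (these details are illustrative, not something Glaser's text spells out), a PyTorch-based training program would at least have to pin every source of randomness so that an independent rebuild over the same data arrives at the same weights:

    # Illustrative sketch: pin randomness so a rebuild can reproduce the trained weights.
    # Bit-for-bit results also depend on identical library versions and hardware behavior.
    import random
    import numpy as np
    import torch

    SEED = 0
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.use_deterministic_algorithms(True)  # fail loudly if a nondeterministic kernel is hit

    # ... build the model, optimizer, and a data loader with a fixed ordering, train as usual,
    # then save the state_dict; a separate rebuild from the same source should match it.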
Finally, the current language would ask that training sources not be obtained unethically and that "the ecological impact of training and using AI models be considered". What constitutes ethical or unethical acquisition of training sources is not defined. When asked by Carsten Leonhardt to summarize the difference between the proposals, Glaser replied that his was "a hard anti-AI stance (with select exceptions)". Thomas Goirand said that he would second Glaser's proposal, but he is the only one so far to endorse it.
Possible impact
Gunnar Wolf replied to sponsor the proposal and added that Debian "cannot magically extend DFSG-freeness to a binary we have no way to recreate". That does not mean, he said, that Debian is shut out entirely from participating in the LLM world. Users could always download models from other sources, or the models could even be uploaded to Debian's non-free repository.
Among the potential implications listed in Appendix D of the proposal is the downside that there are almost no useful AI models that would be able to enter the main section of Debian's archive under this interpretation. The upside is that Debian does not have to immediately deal with "the technical problem of handling 10+GB models in .deb packages" or expect downstream mirrors that host the main repository to carry such large binary files.
Simon McVittie asked if anyone had an idea whether any models that already match the definition exist in Debian's main repository. He said it was typical for proposals to provide an estimate of how many packages would be made "insta-RC-buggy". In other words, how many packages would be subject to release-critical bugs if the GR passes? Since Debian is currently in freeze to prepare for the release of Debian 13 ("trixie"), he wanted to know whether the GR would take effect immediately or at the beginning of the cycle for the next release. The pre-release freeze is already lengthy, and he thought it would be best to avoid making it longer in order to deal with any packages affected by this GR.
Russ Allbery observed that GNU Backgammon comes with neural-network weights that do not have source code. He admitted that he did not give that much thought when he was maintaining the package because it predated the LLM craze. "I'm not even sure if the data on which it's trained (backgammon games, I think mostly against bots) is copyrightable." He was also unsure whether other old-school machine-learning applications might be lurking around, and said he had no strong opinion about what to do if there were.
Games are not the only software that may be affected. Ansgar Burchardt said that the GR might impact other useful software; his list included the Tesseract optical-character-recognition (OCR) software, the OpenCV image-recognition library, the Festival text-to-speech system, and other software with weights and data of uncertain origin. Urlichs suggested that Burchardt could write a counter-proposal, or a more-nuanced proposal that would take some of those packages into account. He also questioned whether the software would need to be removed—the packages could be relocated to Debian's contrib archive and the models placed in non-free.
Next steps
So far, Burchardt has not offered any proposals of his own, but there is still time. Discussion will continue for at least two weeks from the initial proposal, though the Debian Project Leader could shorten the discussion period by calling for a vote sooner. The proposal already has enough seconds to proceed; if the discussion reflects the overall mood of Debian developers, the GR would be likely to pass.
If it does pass, it will be in contrast to the Open Source Initiative's controversial Open Source AI Definition (OSAID), which LWN looked at last year. The OSI requires that model weights be provided under "OSI-approved terms" (which are yet to be specified), but does not require training data to be supplied in order to meet its definition for open-source AI. That has been a sticking point for many, who feel that the OSAID devalues the OSI's Open Source Definition (OSD)—which was derived from the DFSG in the first place.
Posted Apr 25, 2025 22:44 UTC (Fri)
by geofft (subscriber, #59789)
[Link] (38 responses)
Anyway, there doesn't seem to be a lot of coverage in the proposal about what a _good_ large model would do in terms of licensing. The seconded proposal says, in the proposal text, that AI models whose weight/operational code is FOSS-licensed and "is trained on data or simulator that is private, proprietary, or inaccessible to the public" and/or "does not provide the original training program" is not DFSG-free. In the proposal introduction it is noted that if the weights are not FOSS-licensed in the first place then it can't possibly be DFSG-free, and that if the weights, training data, and training code are all under DFSG-compliant licenses then the system is DFSG-free. But that leaves a lot of room in between for what should be considered DFSG-free and for what ought to be encouraged.
Ian Jackson says in a reply (https://lwn.net/ml/all/26631.60259.508655.293013@chiark.g...) that they think the proposal obviously means that the training code and data have to be in Debian main themselves, and that Debian Members and ftpmasters will read it this way. This is surprising to me (though, to be fair, I have never gotten around to becoming a Debian Member), especially in the context of extremely large data sets, and I am curious if this is in fact how others read it. (I'm also curious whether Ian intends that the training code and data need to make it into a binary package or if just being in the source package is enough.)
In an email replying to the earlier draft (https://lists.debian.org/debian-project/2025/02/msg00041....) Stefano Zacchiroli does not assume that Mo's earlier draft means that the training code and data must be distributed alongside the runtime AI model, and in fact points out that in practice, the upstream tarballs / source repositories usually won't have them and it will be up to the packager to figure out how to address that. He asks, "Do we repack source packages to include the training datasets? Do we create *separate* source packages for the training datasets? Do we create a separate (ftp? git-annex? git-lfs?) hosting place where to host large training datasets to avoid exploding mirror sizes? Do we simply refer to external hosting places that are not under Debian control?"
A more interesting question to me is the acceptable licenses of the training code and data. If they must go into Debian main, then that question has a straightforward answer; they must be DFSG-free. But you can imagine plenty of ways to make data available in a way that is not "private, proprietary, or inaccessible to the public" yet is not DFSG-free, especially for models trained on creative content (prose in natural languages, code in programming languages, visual art, speech samples, music, medical data, etc.). Taking code as an example, one obvious issue is LLMs that are trained on code that is available to the public but not under a DFSG-compatible license, whether that's unlicensed public code, things under a non-commercial license, or one of the new please-let-us-IPO licenses. This code can't be shipped in Debian main. Some, but not all, of it could go into non-free. Is that permissible for a model in main? Or for natural-language prose, a lot of prominent LLMs have relied heavily on "fair use" arguments in their training, meaning that Debian can't be in the business of redistributing the original texts on which they were trained, but the act of training is (claimed to be) non-infringing (in a certain jurisdiction), and the resulting model is (claimed to be) not a derivative work and so no restrictions on distributing or licensing the model are implied.
The question arises with training code as well as data. What if rerunning training requires CUDA (which is quite common), and the training is implemented in a way that there's no alternative (CPU, OpenCL, etc.) backend? CUDA is proprietary but redistributable without cost, and is available in Debian non-free. For this reason, an actual program with a dependency (Depends:, Build-Depends:, or even Recommends:) on CUDA couldn't go into Debian main, but it could go into contrib. If the training program requires CUDA but the runtime AI model does not, can the model go into Debian main?
For this reason I actually don't think there is a ton of distinction between Mo Zhou's proposal and Thorsten Glaser's, despite the latter being _ideologically_ anti-AI. If Ian Jackson's reading is correct, both proposals require that training code and data be suitable for Debian main. Arguably, Ian Jackson's reading is more stringent in practice than Thorsten Glaser's proposal, in that (if I'm reading right) Thorsten's proposal allows a DFSG-free AI with redistributable but DFSG-incompatible training data to go into contrib with its training data packaged up in non-free, but Mo's proposal with Ian's interpretation says that AI models that cannot be packaged up with their own training data "are not seen as DFSG-compliant," which would keep them out of contrib.
If Mo's proposal goes to a vote I think it would be very worthwhile to clarify the intention here, because I expect this to be one of the first practical issues for a serious "more free" competitor to the current crop of proprietary or open-weights LLMs.
Finally, I think there is actually a much stronger anti-AI part of Thorsten's proposal that was underdiscussed, because it affects things that are _not_ AI models. The proposal has Debian take the view that "'generative AI' output are derivative works of their inputs (including training data and the prompt)" and so "Any work resulting from generative use of a model can at most be as free as the model itself; e.g. programming with a model from contrib/non-free assisting prevents the result from entering main." In other words, if someone uses GitHub Copilot or Cursor to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated license is on that software, because the underlying models are not DFSG-free.
This strikes me as hugely impractical (which is not necessarily wrong, if your goal is "a hard anti-AI stance"!). These tools are, in practice, incredibly popular. While probably few of the "old guard" writing things like glibc are using these tools, I would expect a decent fraction of casual FOSS contributors to be using them, to the point where I assume that by now, at least one package in the current Debian archive has at least some code that was written with the help of such a tool. You would need upstream maintainers to not only commit to refusing to use these tools themselves but also to telling contributors not to use them either, and you would need Debian packagers to tell people who send in patches or Salsa merge requests the same thing.
To Thorsten's credit, he does take exactly this approach (https://mbsd.evolvis.org/permalinks/wlog2021_e20240726.htm). But adopting this proposal would essentially make Debian require every project it packages to adopt this approach, too, and perhaps to engage in a significant audit or scrub of code from between the availability of tools like Copilot and the point of adopting such a policy.
My understanding of the current legal situation is that even for things like the NYT lawsuit against OpenAI, the claim is that the act of training was not fair use, and OpenAI's commercial gain from that act of infringement must be stopped. That is, I think, a different claim from that both the act of training produces a model that is a copyrightable derivative work _and_ that the act of doing inference with that model also produces derivative works of the model, such that all model outputs, and not just the model itself, are subject to the training data's license (or are infringing). Again, to be clear, I see the moral argument for such a position. But it seems likely to me that this is a much more copyright-maximalist position than the actual legal reality, and so it shouldn't be adopted lightly.
Among other things, I think the FOSS community needs to constantly remember that the whole moral deal about copyleft is that it subverts copyright; it doesn't embrace it. The point is that, to the extent that copyright exists, we might as well turn it around to good ends; that's not the same thing as wanting to stretch its scope as far as possible. (As an analogy, consider that despite many FOSS licenses having explicit text about patent licenses, the common position in the FOSS community is that software patents should not exist, not that we should patent everything novel and cool in the FOSS world and assign the patents to Conservancy to better arm their lawyers.) Distros like Debian benefit, for instance, from the idea that facts and data are not themselves copyrightable, in that this is what enables you to reverse-engineer a proprietary system and produce a compatible FOSS system that is not a derivative work. I worry that if Debian argues that the outputs of AI models are derivative works of their training data, it becomes harder to argue that a clean-room reimplementation of a protocol from a reverse-engineered spec is not a derivative work of the proprietary system that was reverse-engineered. Even if you want to incur the logistical trouble of refusing upstreams that have accepted AI-assisted pull requests, it might be a better rationale to say that Debian believes that AI-generated code does not live up to the spirit of the Social Contract and the goals of its developers, rather than to say that Debian believes the licenses inherit in this way.
Posted Apr 26, 2025 13:37 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (23 responses)
Devil's advocate, let me re-write that last sentence ...
"In other words, if a student learns by studying proprietary software and uses that experience to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated licence is on that software, because the underlying software that code is based on is not DFSG-free".
At the end of the day, whether something is derivative (and hence copyright-contaminated) is a legal call. I've just been reading (a day or two ago) about how the brain is a very efficient Bayesian filter. An AI/LLM is also (I guess) a Bayesian filter of some sort. If a human brain fine tunes the output of an artificial brain, where do you draw the line? How do you know where the line is?
Cheers,
Posted Apr 27, 2025 1:46 UTC (Sun)
by geofft (subscriber, #59789)
[Link] (4 responses)
On the subject of human minds, in actual legal systems, I am under the impression that the mind is considered inviolate and special in various ways. For instance, while arguments have been raised that loading files into RAM is performing a copy and therefore could infringe copyright, I do not believe any court has ever taken the opinion that a human reading a document, no matter how closely, could infringe copyright simply by the act of reading it, or that listeners to a musical work who have the music stuck in their heads have performed a copy. (Of course if they later write down what they remember, as Mozart supposedly did to Allegri's "Miserere," then that counts as a copy subject to copyright law.) In another application, in the US, the Fifth Amendment protection against self-incrimination means that written documents can be subpoenaed but the contents of one's mind cannot, and there's an emerging precedent that for the same reasons, police can compel you to unlock a device via biometrics (face/fingerprint/retina) but not via revealing even a short passcode.
Debian, of course, is free to bind its behavior more strictly than the law requires, and I think the FOSS community has a history of wanting people to avoid looking too closely at proprietary code out of a general fear that arguments would be made in court about infringing copies, even if the court is not going to rule that a programmer's brain is a derivative work. It would probably be a bad outcome to start treating this as a norm instead of just a defense to have in your quiver, and end up with the rule that someone who has learned from one free software project should be considered tainted when trying to contribute to another free software project under an incompatible license, e.g., that someone who has read glibc's sources too closely cannot send MIT-licensed patches to musl.
Posted Apr 27, 2025 16:45 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
Isn't it strange how you have just perfectly described European law as it refers to AI, which has so many authors, song-writers and composers screaming foul.
The law has taken the opinion that an AI reading a document, no matter how closely, does not infringe copyright simply by the act of reading (and remembering) it.
Of course, if they then later write down what they remember, then that counts as a copy subject to copyright law.
In other words, there's no such thing as "Copyright Washing" in European law. The law simply makes no distinction whatsoever between artificial intelligence and human intelligence (or lack thereof!).
Cheers,
Posted Apr 27, 2025 18:47 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (2 responses)
Remember that this is tied in not just with what's legally acceptable, but also what's politically acceptable; part of the reason it's such a big fuss is that in countries with very little support for the unemployed, any technology that's being sold as "reduces the number of people needed to reach a certain level of attainment" is a political timebomb. As a result, it's quite likely to either stop being controversial (once the limits are clear, and it's not a big threat to jobs and to people doing serious work), or be something that's clamped down on hard (because it's that, or civil unrest); saying "let's hold it off for now, and make a serious decision once it's not a mess" is not a bad thing in itself.
I can see strong arguments for holding to the strictest possible answer in the short term (such that if jurisdictions start to make the judgement call, you're safe unless their decision is a shock to everyone), but time-limiting it with a view to coming back when the commercial dust settles and jurisdictions have either made this a non-question, or extending the time limit if it's still a controversy.
Posted May 23, 2025 11:41 UTC (Fri)
by sammythesnake (guest, #17693)
[Link] (1 responses)
The risk otherwise is that future resolutions to relax these restrictions will be fighting against an entrenched status quo that wasn't intended to remain...
Posted May 23, 2025 14:27 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Even if you put the rationale that you intend it to be re-visited, there will be an entrenched status quo that says "why should we?". The sunset would give them no choice.
Cheers,
Posted Apr 27, 2025 6:20 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (17 responses)
Posted Apr 27, 2025 8:43 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (16 responses)
Which is? Or are you arguing that the human brain is magic?
There's a lot of people who THINK they know how the brain works. There's a lot of people who are trying to build computer models of their model of how the human brain works. And for the most part the computer model is crap because the mental model bears no resemblance to reality.
The classic example: why are people throwing general-purpose compute-heavy hardware at these problems? How much of the brain is single-purpose dedicated hardware? Is it 90%? Something like that, certainly. And they talk to other bits of dedicated hardware. That's why my car's "driver assist" features are such a damn nightmare. They're based on general-purpose ideas and hardware, and don't talk to each other. Case in point - in heavy traffic when a car pulls out of my way, my car fails to recognise we're not on a collision course, pretty much comes to a halt until it loses sight of it, then floors the accelerator until a different piece of hardware suddenly realises it's going to crash into the car in front and screams at me to take over and brake!
Or adaptive cruise control. Which can't tell the difference between its internal database and a road traffic sign. Which regularly sees signs which clearly don't apply to the road you're on but it applies them anyway. Which completely ignores the driver's control input when it disagrees with the computer database's input.
Compare that with a real human. I can recognise a stationary car at a distance. I can tell when the trajectory of my car and the car in front don't intersect. I can identify speed limits and brake lights at a distance. And my actions NOW are informed by what I can see happening in a minute's time.
There's only one major difference between a human and a machine. The people trying to build human models aren't talking to the people studying humans, and the computer models therefore bear bugger all resemblance to the human systems they're trying to model. Pretty much par for the course in most human research, sadly ... :-(
Cheers,
Posted Apr 27, 2025 17:35 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (15 responses)
“Generative AI” operates like a compiler or a lossy compression/decompression. Its output is fully dependent on its inputs, as it runs on a deterministic machine (if a PRNG is used, it does count as input, making its output reproducible), so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).
Posted Apr 27, 2025 20:12 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
Devil's advocate again, but doesn't that mean an AI output is a mechanical derivation of the prompt a human fed in? At which point any copyright (liability) belongs firmly with the user prompting the AI. So all these people saying "AI generated an exact copy of Shakespeare" (or whatever) are clearly fully liable for any copyright violation that may be involved in that reproduction ...
Cheers,
Posted Apr 27, 2025 20:39 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
And yes, the responsibility is with the operator. The copyright exception for text and data mining (if not opted out) only applies for analysēs like getting trends, not for generative use (probably not even the genAI summarisation functionality).
Posted Apr 27, 2025 20:39 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
That is simply not how courts define "derivative work," and it's rather obvious to see why. If you run an algorithm over somebody else's creative work, that algorithm could do anything from "return the input unchanged," to "count the words, sentences, paragraphs, and pages, and output all of those numbers," to "always output the number 7."
"Derivative work" is defined differently in different countries (and some countries call it something else entirely), but most definitions are going to at least vaguely resemble the US definition, which in practice says that a work is derivative of another work if the plaintiff (author of or rightsholder to the original) can prove two things:
* Access: The person who created the allegedly infringing work (i.e. the person who pushed the buttons that caused the AI to output it) would have been able to copy the original.
* Substantial similarity: There are significant copyrightable elements in common between the original and the allegedly infringing work, and the number of such elements and degree of similarity is high enough to justify an inference of copying.
Access is usually pretty easy to prove - in most cases, "publicly available" is good enough. For an AI case, you might also need to prove that the work was actually in the training set, but I don't think we have caselaw on that yet. Substantial similarity, on the other hand, is entirely fact-bound and there are no bright lines - the only way to prove it is to put the two works side by side and point to specific individual elements that you say were improperly copied. This also means you must *have* a specific original in mind when you file the lawsuit - you can't just say (e.g. as a class action) "well, it's one of these millions of works that was infringed." That sort of pleading would probably get dismissed even pre-Twiqbal, but nowadays, it would get you laughed out of the courtroom.
Posted Apr 27, 2025 21:15 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (10 responses)
The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.
Of course you need to figure out similarity for each one if you want to sue for infringement.
But we’re not looking at suing for infringement, we’re looking at positively defining what is acceptable, and for this, this level of detail is sufficient.
Posted Apr 27, 2025 23:32 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (9 responses)
The term "derivative work" has a specific legal meaning. If you want to invent your own meaning, I would encourage you to come up with your own word, so that it does not become conflated with the legal meaning that clearly does not apply in this context (substantial similarity is only meaningful in the context of comparing specific individual works, not large classes of works, and without substantial similarity, there is no derivative work, under US law).
> But we’re not looking at suing for infringement, we’re looking for doing a positive-definition of what is acceptable, and for this, this level of detail is sufficient.
This brings us to the broader problem here: Copyright is only about 300 years old, and its contours have changed dramatically in that short time. Most basic principles of law and ethics are far older than that, and have been far more stable in their overall form and function (at least in semi-modern times). Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion. Here is an example I would encourage you to think carefully about:
Suppose someone never registers their copyright, or fails to comply with some other legal requirement. Should their copyright automatically lapse, or even fail to vest in the first place? I think most folks would say that it should not, hence why we got rid of those requirements. But then you ask people another question: Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's? Most folks will tell you that Warner should not own Happy Birthday, and that they did rightfully lose it. But that is inconsistent - either formalities are rightful, and should still apply, or they are wrongful, and Warner should not have lost the copyright. You can resolve this by appealing to the excessive term of modern copyright, but that is dodging the question. I'm not asking what the ideal copyright system would look like - I'm asking for the rightful outcome in this one specific case.
But perhaps you disagree, and think we can resolve it based on the term. That's fine, I have other examples. There are numerous works from the 50's and 60's whose US copyrights were not renewed, mostly serialized pulp fiction that was seldom or never reprinted. Many of them can now be read on the Internet Archive for free. Is it rightful that those authors, some of whom might still be alive, are not paid for the reproduction of their works? Or, as a matter of ethics, should the Internet Archive take down all of those works until such time as their authors can be identified and some sort of payment-in-lieu-of-licence can be worked out?
Or we can look at even more recent developments:
* Is it wrongful to distribute tools that break DRM (because it enables piracy), or is DRM itself wrongful (because people should control their computers)?
* A clickwrap license is a contract you agree to by interacting with some computer software (usually clicking an "I agree" button). Should these contracts be effective? If yes, then EULAs are effective and proprietary software is ethically legitimate. If not, then that includes contracts for the sale of goods, which effectively means that no form of online commerce should be allowed. Is either of those extremes correct, or should we adopt some middle position (and exactly what does that position look like)?
* Is it wrongful for a hobbyist to create a derivative work of some giant corporation's intellectual property, assuming the hobbyist makes at most a small profit from it?
* Is it wrongful for a giant corporation to create a derivative work of some hobbyist's intellectual property, assuming the corporation makes at most a small profit from it?
* Is it OK if the answers to the above two bullets differ, or would that be a logical contradiction?
I want to be clear - the above are rhetorical questions. Replying with specific answers would entirely miss the point. I'm not asking for your specific viewpoint. I'm asking you to consider whether a significant number of people might have a different viewpoint from you, and whether you have any legitimate right to insist that your viewpoint is more ethically correct than theirs. If we can't even agree on these more basic questions of how copyright ought to work, then I submit to you that it is impossible for us to come to agreement on how copyright should interact with generative AI.
TL;DR: The basic ethical premises of copyright are still up for debate, so it is deeply questionable whether we can even have this discussion in the first place.
Posted Apr 28, 2025 7:53 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (7 responses)
Was it even possible for them to comply lawfully with US copyright formalities in the 1930s? Especially in patents, but also in copyright, how much unrecognised "prior art" is out there?
Cheers,
Posted Apr 28, 2025 9:39 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Based on the facts that have been publicly disclosed, the problems with "Happy Birthday" were roughly as follows:
* The melody was indisputably in the public domain, having been published in 1893 under a different name ("Good Morning to All") and with a trivial difference in arrangement (one note became two).
* A copyright on the lyrics was registered in 1935.
* In 1927, a copy of the lyrics was published, set to the melody of "Good Morning to All," without a copyright notice. If authorized, this publication would have the effect of forfeiting any copyright before it could even be registered. Warner would later argue in court that this publication was unauthorized, or at least not authorized by the appropriate party.
* There were a number of other arguments raised. The judge considered all of these arguments at summary judgment, and concluded that most of them (including the 1922 publication) would need to go to trial.
* But the judge did find one basis for summary judgment: The sale of the copyright from the Hill sisters (one or both of whom wrote the song) to the Summy Company (which Warner eventually bought) was apparently a bit of a mess. It went through multiple rounds of litigation, three separate agreements, and the second agreement was missing from the modern record. The judge ruled, as a result, that there was no evidence the Hills had specifically sold the lyric rights to the Summy Company, so Warner lost on that basis.
The question I asked is whether this is a just outcome, which (now that I look more closely) is a bit of a muddle. But you could just as easily imagine an alternative version of events in which the judge instead rules that the 1922 publication caused the song to enter the public domain. That is not an implausible outcome - the judge said it was a triable issue of fact that could have gone either way.
Posted Apr 28, 2025 17:22 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (5 responses)
It's only *legally* *defined* under patent law.
But if somebody sues me for copyright copying, because I wrote a piece of music with da-da-da-dum, da-da-da-dum, I sure as hell am going to point them at Beethoven as prior art!
Terry Pratchett was accused of ripping off Hogwarts when he created Unseen University. Quite apart from the fact he would have needed a time machine, there's absolutely shit-loads of prior art going back to the 1920s if not the century before about British boarding schools and the like. Bunter, anyone?
Yes, it's not *legally* prior art, but what on earth else are you going to call it?
Cheers,
Posted Apr 28, 2025 17:34 UTC (Mon)
by pizza (subscriber, #46)
[Link] (4 responses)
Methinks you need to bone up on the distinction between "ideas" and "specific expressions".
Posted Apr 28, 2025 20:26 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (3 responses)
It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.
What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from?
Even worse when an AI does it - I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.
I think that's a major problem with a lot of IP at the moment ...
Cheers,
Posted Apr 30, 2025 16:37 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Ideas are categorically exempt from copyright protection in the US (and most of the world). See for example 17 USC 102(b), and see also case law such as Baker v. Selden.
> What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.
The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it. If you can show that the plaintiff copied the specific expression from elsewhere, then under US law, this has two rather significant consequences (both laid out in 17 USC 103):
1. They don't own that specific expression, because it's not part of "the material contributed by the author of such work."
2. If the copying was unlawful (the original was copyrighted and they had no license), then that entire part of the work is unprotected by copyright.
You cannot demand that the plaintiff give you an accounting of every place where it possibly could have come from and exhaustively prove that it is entirely original, because that would be plainly unworkable. Instead, you as the defendant have to find the original work (possibly through discovery), introduce it at trial, and argue that it precludes the plaintiff from owning the specific expression.
You can argue that the copied expression is a "scène à faire," literally meaning "a scene that must be done." The original basis for this was genre fiction, in which (for example) a mystery novel simply must have a scene at the end where the detective (who also must exist) explains who committed the crime, how and why they did it, and what clues led the detective to that conclusion. If you don't have that scene, it's not a "real" mystery novel, it's some other genre, and since nobody can be allowed to own the mystery genre as a whole, nobody can own that type of scene either.
In the programming context, scènes à faire includes constructs like for(i = 0; i < max; i++){...}. Nobody can be allowed to own that, because you can't reasonably write a (large, complex) program in a C-like language and never write a loop that looks like that.
> I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.
Disclaimer: I work for Google, not as a copyright lawyer, and so I can't speak on their behalf. The following is my personal interpretation of YouTube's behavior, based on public information and reasonable inference.
YouTube's copyright system is primarily designed to protect YouTube from getting sued, and secondarily designed to discourage users from suing each other. You cannot assume that the outcomes you see on YouTube are necessarily what a court of law would have done, because YouTube does not have the power to make binding rulings on whether X is a copyright infringement of Y. So they're stuck making rulings on the basis of whether YouTube can plausibly be sued, which leads to all sorts of undesirable-but-unavoidable biases in favor of the copyright holder (many of them explicitly codified into law, e.g. in 17 USC 512 and similar laws in other jurisdictions). Of course, anyone dissatisfied with YouTube's handling of an issue remains free to take it to the "real" court system under a variety of legal theories (slander of title, tortious interference, conversion of revenues, 17 USC 512(f) misrepresentation, etc.).
I will agree that the legal system generally does a poor job of producing just and efficient outcomes in this space. There is a reason that both YouTube and its users strongly prefer to avoid going to court. But scènes à faire has nothing to do with this. What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here). Unfortunately, modern copyright law was simply not designed under the assumption that services like YouTube might exist, and updating it has proved difficult (especially in the modern US political system where Congress can barely agree to keep the government open). A few years ago, the YouTuber Tom Scott made an excellent ~43 minute video explaining the broader problem in detail, which I found highly informative and more entertaining than you might expect: https://www.youtube.com/watch?v=1Jwo5qc78QU
Posted Apr 30, 2025 18:35 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
It is, but who's responsible? As far as I know, the guy receiving the improper royalties could well be completely unaware of the source of the royalties. He uploaded a piece of music, which he just happened to record at the same time his washing machine finished its cycle, and YouTube's automated systems assumed it was his copyright. Whoops!
So if YouTube wants to avoid being sued, somebody who's prepared to take the risk will probably take them to the cleaners ... (rather appropriate seeing as it's a washing machine rofl)
Cheers,
Posted Apr 30, 2025 18:42 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
> The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it.
Again, I'm thinking of a particular example. I don't know the outcome, but some musician sued saying another musician had "copied his guitar riff". Given that a riff is a chord sequence, eg IV V I, the shorter the riff the more likely it is another musician either stumbled on it by accident, or it's actually a common sequence in a lot of music. If I try and play a well-known piano sequence on the guitar, chances are I'll transpose the key and it'll sound very like someone else's riff that I may never even have heard ...
(I am a guitar player, but classical, so I don't play riffs ... :-)
Again, this is a big problem with modern copyright law where so much may be - as you describe it - a "scene a faire", but people don't recognise them as such.
Cheers,
Posted Apr 28, 2025 8:13 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
As an aside, and this only bolsters your point, I live in a jurisdiction (England and Wales) where plenty of things that are obviously theft when you look at them don't quite reach the legal bar. The requirements here for theft are that you appropriated property, that it belonged to someone else, that you appropriated it dishonestly, and that you intended to permanently deprive the owner of that property. So, for example, if I take your lawnmower at the beginning of summer to mow my lawn, with intent to return it to you when winter starts and it's too cold for grass to grow, I've not committed theft, in law.
We have had to introduce specific laws for cases like taking someone's car for a joyride without permission, refusing to return a credit to your bank account that was made in error, or leaving without paying when you owe payment on the spot, precisely because the legal definition of theft is too narrow to cover these cases, even though most of us would agree that they were "theft" in a colloquial sense.
Posted May 6, 2025 10:42 UTC (Tue)
by aigarius (subscriber, #7329)
[Link]
Posted Apr 26, 2025 14:13 UTC (Sat)
by lumin (subscriber, #130448)
[Link] (3 responses)
I realized that problem when writing the text. However, if the proposal says:
"AI models released under DFSG-compliant license without training data or program" is not seen as DFSG-compliant.
-> "... under DFSG-compliant ..." is not seen as DFSG-compliant.
It looks weird as it looks like a paradox. I did not think too much further on this wording issue, and replaced it with open source. I think people can anyway understand what I mean there. But you are right, we'd better revise the wording if this is going to officially land somewhere.
Posted Apr 26, 2025 14:59 UTC (Sat)
by gioele (subscriber, #61675)
[Link]
It is however an already known "paradox": if you release a "GPL executable" but you do not provide its source, including «the scripts used to control compilation and installation of the executable», then you are not really complying with the terms of the GPL.
Posted Apr 27, 2025 2:00 UTC (Sun)
by geofft (subscriber, #59789)
[Link]
If you want to change it, I'd suggest replacing "open source license" with "DFSG-compatible license" to contrast with "DFSG-compliant" at the end of the sentence. (Policy also uses "comply with the DFSG" to describe what can go into main and contrib.) "Compatible," to me, means it's possible to use it in a way that fits with the DFSG, but it doesn't mean that it's impossible to use it in another way. If I want to write a POSIX-compliant shell script, it's very helpful to test it against a POSIX-compatible shell, but that by itself is not enough.
Posted Apr 27, 2025 15:14 UTC (Sun)
by smcv (subscriber, #53363)
[Link]
I believe there was one case in particular where the upstream developer of a piece of software under a BSD/MIT-style license (was it Pine?) had an unusual interpretation of the license and asserted that their software was only redistributable if it was at no cost (free of charge), which would have made it impossible to sell Debian CDs with their software included. Debian responded to this by treating that specific piece of software as non-Free, even though the license it was released under is one that we would usually have accepted.
Posted Apr 26, 2025 21:24 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (7 responses)
If you believe all software should be free, then believing that all ML models should be free seems the logical next step.
As for reproducing models, that seems like a nonstarter, as the first step in training is to initialise with random weights. I guess you could make that deterministic, but the question is why you care. The training set is going to be at least 100 times the size of the final model. Requiring full reproducibility on very large models is going to require distributing such a vast amount of data, and for what purpose? People are just going to use the final model weights to build more fine-tuned models; reproducing from the source data just isn't very useful.
The impact on the many other places machine learning models have been used for years in software is also something that shouldn't be discounted. This discussion could have a lot more impact than it appears at first glance.
Posted Apr 26, 2025 21:31 UTC (Sat)
by mb (subscriber, #50428)
[Link] (2 responses)
Just use the closed source binary? Why would you want to have the source? Just use the binary!
Posted Apr 27, 2025 11:25 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (1 responses)
In many ways, the model is actually a suitable stand-in for the original data. I think that for the vast majority of use cases, people are better off taking an existing public model and fine-tuning it to suit their purpose with their own data than trying to rebuild a model from scratch at great expense.
Posted Apr 29, 2025 8:46 UTC (Tue)
by taladar (subscriber, #68407)
[Link]
As long as that is the case free software concepts are hard to apply to AI models.
Posted Apr 27, 2025 2:41 UTC (Sun)
by geofft (subscriber, #59789)
[Link]
- Simply requiring it to be published is a good way to incentivize keeping people honest about what it is. (Facebook, for instance, is alleged to have trained Llama-3 against books literally torrented via LibGen, and they attempted to seal parts of the lawsuit because they were embarrassed about the situation becoming public.) Having it be published with the expectation that people might plausibly try to reproduce it, and ask hard questions if they can't, is an even stronger incentive.
- If you want to remove something from the "knowledge" of the model, it is basically impossible to do that reliably starting from the trained model, at least given the current state of the art in interpretability, whereas it is conceptually straightforward (though resource-intensive) to just remove that training data and run the training again.
- It may well be the case that a future AI, itself, could do something useful with another model and its training data that would be impractical for a human, e.g., interpretability work to explain some behavior with reference to the training data that induced it. That we don't see a use for the full training data now is no reason to decide we won't find a use later.
- Starting with the same data but a different training program could yield interesting results, ranging from simply modifying the training program to do the same thing but with more parameters or a different approach to tokenization, to training in a wholly different way.
- I do think one of the many arguments for FOSS is that it's good for learning and openness, even independent of the practicality of using the software. FOSS licenses may not have termination clauses or timeouts, which of course is for practical reasons of allowing people to package and redistribute software confidently, but I think we'd all intuitively disapprove of a license that terminated even if it were 100% guaranteed that nobody was ever going to actually run the software again - having the sources around and available is a contribution to the commons.
- In any of the weird hypothetical evil-overlord-AI scenarios like https://ai-2027.com/ , it seems pretty clear that having meaningful capacity to build your own AIs will be immensely helpful in fighting back. Of course "the good guys" are going to be at a strong disadvantage for myriad reasons, but not having any access to training data will put them even farther back. In a scenario where a strong enough LLM has started actively concealing its own inner "thoughts" from its output without people noticing, I'm not sure that fine-tuning a trained subversive model is going to be sufficiently helpful to really get it onto your side.
Also keep in mind that all the involved hardware is rapidly getting more capable, so I think that an end user might retrain a model they're using from Debian stable or oldstable and find it much more feasible than it would have been for them to do that same retraining at the time that model was uploaded to Debian unstable. (For the same reason, there may also be an effect that the actual type of GPUs used when originally training the model were in high demand at the time but are now no longer top-of-line and are easier and cheaper to rent or buy.)
Posted Apr 27, 2025 6:25 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
And those TESCREAL models don't even do that, and in practice cannot. And attribution is a burden so low that not honouring it is a much harsher violation than, say, one point of the GPL.
Posted Apr 28, 2025 20:15 UTC (Mon)
by ballombe (subscriber, #9523)
[Link]
... which is exactly the line played by openAI, facebook et al when it comes to copyright.
They are not copyright minimalists despite what they say; see their reaction to DeepSeek. It makes sense for GNU GPL software to want to be protected from exploitation by them.
Posted Apr 28, 2025 22:01 UTC (Mon)
by jzb (editor, #7867)
[Link]
> I'm also a little surprised that people supporting copyleft as a way to subvert copyright law simultaneously try to argue copyright maximalism with respect to machine learning models. It feels a lot like "we want to subvert copyright law, but only when it's in our favour".
I'm not sure why this is surprising. Copyleft uses copyright to do a 180 and grant rights to users that are normally reserved, and to insist that others convey the same rights. It does this for code that the licensor has the rights to—it does not take anything from others that it has no rights to. The mechanism requires copyright laws to work. I don't see how that is inconsistent with "we oppose technologies that hoover up other people's copyrighted works without permission". I've known free-software folks who argue that proprietary licensing is unethical, but they still acknowledge that creators have the right to set licensing terms for their code, even if they disagree with the choices they make.
Posted Apr 27, 2025 6:22 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
Posted Apr 27, 2025 6:31 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
Posted Apr 26, 2025 1:02 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Apr 26, 2025 13:34 UTC (Sat)
by lumin (subscriber, #130448)
[Link]
Posted Apr 26, 2025 1:04 UTC (Sat)
by pabs (subscriber, #43278)
[Link]