Debian debates AI models and the DFSG
The Debian project is discussing a General Resolution (GR) that would, if approved, clarify that AI models must include training data to be compliant with the Debian Free Software Guidelines (DFSG) and be distributed by Debian as free software. While GR discussions are sometimes contentious, the discussion around the proposal from Debian developer Mo Zhou has been anything but—there seems to be consensus that AI models are not DFSG-compliant if they lack training data. There are, however, some questions about the exact language, as well as about the impact the GR would have on existing packages in the Debian archive.
While many folks in the free-software community are generally skeptical about AI and would be happy to see the trend come to an end, Zhou is certainly not in the anti-AI camp. He is a Ph.D. student at Johns Hopkins University, and his academic web site states that his research interest is in computer vision and machine learning. He has created a project called DebGPT that explores using LLMs to aid in Debian development. Clearly, he sees some value in the technology, but also wants to adhere to free-software principles.
GR proposal
In February, Zhou wrote to the debian-project mailing list to say that he had created "something draft-ish" for a general resolution about applying the DFSG to AI models, which he later defined thusly:
A pre-trained "AI model" is usually stored on disk in binary formats designed for numerical arrays, as a "model checkpoint" or "state dictionary", which is essentially a collection of matrices and vectors, holding the learned information from the training data or simulator. When the user make use of such file, it is usually loaded by an inference program, which performs numerical computations to produce outputs based on the learned information in the model.
He called for help in adding reference materials and shaping up the early draft before posting it. Zhou sent his revised proposal to the debian-vote mailing list on April 19, with a detailed explanation of his reasoning for the GR and several appendices containing background information on AI technology, previous discussions, and comments on possible implications if the proposal is passed.
Debian has taken up the topic previously (see LWN's coverage from 2018) but never settled the question. The goal now is to reach a consensus on handling AI models that are released under DFSG-compliant licenses but do not provide training data. Zhou's proposal notes that the software that runs AI models, such as Python scripts or C++ programs, is out of scope of the proposal, since traditional software is already a well-defined case.
The actual text of the proposal, what Debian members would vote for (or against), is short and to the point:
Proposal A: "AI models released under open source license without original training data or program" are not seen as DFSG-compliant.
Francois Mazen, Timo Röhling, Matthias Urlichs, Christian Kastner, Boyuan Yang, and others have replied to support and sponsor the proposal. Resolutions are required to have five additional sponsors before they are put to discussion and become eligible for a vote. Currently, if put to a vote, Debian members would have a choice between "A" and "none of the above". It is possible, according to the resolution procedure, that amendments or alternative proposals, such as "AI models are DFSG-compliant if under DFSG licenses", could be added during the discussion period.
Thorsten Glaser posted what he called a counter-proposal on April 23 and requested comments. While Zhou's proposal would simply clarify that models without training data do not meet the DFSG, Glaser goes much further. For example, he wants Debian to require that models be "trained only from legally obtained and used works" and that the data itself be under a suitable license for distribution. His proposal would also place heavy requirements on building models that would be hosted in Debian's main archive:
For a model to enter the main archive, the model training itself must *either* happen during package build (which, for models of a certain size, may need special infrastructure; the handling of this is outside of the scope of this resolution), *or* the model resulting from training must build in a sufficiently reproducible way that a separate rebuilding effort from the same source will result in the same trained model.
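As a rough sketch of what that kind of reproducibility can involve in practice (these details are illustrative, not something Glaser's text spells out), a PyTorch-based training program would at least have to pin every source of randomness so that an independent rebuild over the same data arrives at the same weights:

    # Illustrative sketch: pin randomness so a rebuild can reproduce the trained weights.
    # Bit-for-bit results also depend on identical library versions and hardware behavior.
    import random
    import numpy as np
    import torch

    SEED = 0
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)
    torch.use_deterministic_algorithms(True)  # fail loudly if a nondeterministic kernel is hit

    # ... build the model, optimizer, and a data loader with a fixed ordering, train as usual,
    # then save the state_dict; a separate rebuild from the same source should match it.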
Finally, the current language would ask that training sources not be obtained unethically and that "the ecological impact of training and using AI models be considered". What constitutes ethical or unethical acquisition of training sources is not defined. When asked by Carsten Leonhardt to summarize the difference between the proposals, Glaser replied that his was "a hard anti-AI stance (with select exceptions)". Thomas Goirand said that he would second Glaser's proposal, but he is the only one so far to endorse it.
Possible impact
Gunnar Wolf replied to sponsor the proposal and added that Debian "cannot magically extend DFSG-freeness to a binary we have no way to recreate". That does not mean, he said, that Debian is shut out entirely from participating in the LLM world. Users could always download models from other sources, or the models could even be uploaded to Debian's non-free repository.
Among the potential implications listed in Appendix D of the proposal is the downside that there are almost no useful AI models that would be able to enter the main section of Debian's archive under this interpretation. The upside is that Debian does not have to immediately deal with "the technical problem of handling 10+GB models in .deb packages" or expect downstream mirrors that host the main repository to carry such large binary files.
Simon McVittie asked if anyone had an idea whether any models that already match the definition exist in Debian's main repository. He said it was typical for proposals to provide an estimate of how many packages would be made "insta-RC-buggy". In other words, how many packages would be subject to release-critical bugs if the GR passes? Since Debian is currently in freeze to prepare for the release of Debian 13 ("trixie"), he wanted to know whether the GR would take effect immediately or at the beginning of the cycle for the next release. The pre-release freeze is already lengthy, and he thought it would be best to avoid making it longer in order to deal with any packages affected by this GR.
Russ Allbery observed that GNU Backgammon comes with neural-network weights that do not have source code. He admitted that he did not give that much thought when he was maintaining the package because it predated the LLM craze. "I'm not even sure if the data on which it's trained (backgammon games, I think mostly against bots) is copyrightable." He was also unsure whether other old-school machine-learning applications might be lurking around, and said he had no strong opinion about what to do if there were.
Games are not the only software that may be affected. Ansgar Burchardt said that the GR might impact other useful software; his list included the Tesseract optical-character-recognition (OCR) software, the OpenCV image-recognition library, the Festival text-to-speech system, and other software with weights and data of uncertain origin. Urlichs suggested that Burchardt could write a counter-proposal, or a more-nuanced proposal that would take some of those packages into account. He also questioned whether the software would need to be removed—the packages could be relocated to Debian's contrib archive and the models placed in non-free.
Next steps
So far, Burchardt has not offered any proposals of his own, but there is still time. Discussion will continue for at least two weeks from the initial proposal, though the Debian Project Leader could shorten the discussion period by calling for a vote sooner. The proposal already has enough seconds to proceed; if the discussion reflects the overall mood of Debian developers, the GR would be likely to pass.
If it does pass, it will be in contrast to the Open Source Initiative's controversial Open Source AI Definition (OSAID), which LWN looked at last year. The OSI requires that model weights be provided under "OSI-approved terms" (which are yet to be specified), but does not require training data to be supplied in order to meet its definition for open-source AI. That has been a sticking point for many, who feel that the OSAID devalues the OSI's Open Source Definition (OSD)—which was derived from the DFSG in the first place.
Posted Apr 25, 2025 22:44 UTC (Fri)
by geofft (subscriber, #59789)
[Link] (38 responses)
Anyway, there doesn't seem to be a lot of coverage in the proposal about what a _good_ large model would do in terms of licensing. The seconded proposal says, in the proposal text, that AI models whose weight/operational code is FOSS-licensed and "is trained on data or simulator that is private, proprietary, or inaccessible to the public" and/or "does not provide the original training program" is not DFSG-free. In the proposal introduction it is noted that if the weights are not FOSS-licensed in the first place then it can't possibly be DFSG-free, and that if the weights, training data, and training code are all under DFSG-compliant licenses then the system is DFSG-free. But that leaves a lot of room in between for what should be considered DFSG-free and for what ought to be encouraged.
Ian Jackson says in a reply (https://lwn.net/ml/all/26631.60259.508655.293013@chiark.g...) that they think the proposal obviously means that the training code and data have to be in Debian main themselves, and that Debian Members and ftpmasters will read it this way. This is surprising to me (though, to be fair, I have never gotten around to becoming a Debian Member), especially in the context of extremely large data sets, and I am curious if this is in fact how others read it. (I'm also curious whether Ian intends that the training code and data need to make it into a binary package or if just being in the source package is enough.)
In an email replying to the earlier draft (https://lists.debian.org/debian-project/2025/02/msg00041....) Stefano Zacchiroli does not assume that Mo's earlier draft means that the training code and data must be distributed alongside the runtime AI model, and in fact points out that in practice, the upstream tarballs / source repositories usually won't have them and it will be up to the packager to figure out how to address that. He asks, "Do we repack source packages to include the training datasets? Do we create *separate* source packages for the training datasets? Do we create a separate (ftp? git-annex? git-lfs?) hosting place where to host large training datasets to avoid exploding mirror sizes? Do we simply refer to external hosting places that are not under Debian control?"
A more interesting question to me is the acceptable licenses of the training code and data. If they must go into Debian main, then that question has a straightforward answer; they must be DFSG-free. But you can imagine plenty of ways to make data available in a way that is not "private, proprietary, or inaccessible to the public" yet is not DFSG-free, especially for models trained on creative content (prose in natural languages, code in programming languages, visual art, speech samples, music, medical data, etc.). Taking code as an example, one obvious issue is LLMs that are trained on code that is available to the public but not under a DFSG-compatible license, whether that's unlicensed public code, things under a non-commercial license, or one of the new please-let-us-IPO licenses. This code can't be shipped in Debian main. Some, but not all, of it could go into non-free. Is that permissible for a model in main? Or for natural-language prose, a lot of prominent LLMs have relied heavily on "fair use" arguments in their training, meaning that Debian can't be in the business of redistributing the original texts on which they were trained, but the act of training is (claimed to be) non-infringing (in a certain jurisdiction), and the resulting model is (claimed to be) not a derivative work and so no restrictions on distributing or licensing the model are implied.
The question arises with training code as well as data. What if rerunning training requires CUDA (which is quite common), and the training is implemented in a way that there's no alternative (CPU, OpenCL, etc.) backend? CUDA is proprietary but redistributable without cost, and is available in Debian non-free. For this reason, an actual program with a dependency (Depends:, Build-Depends:, or even Recommends:) on CUDA couldn't go into Debian main, but it could go into contrib. If the training program requires CUDA but the runtime AI model does not, can the model go into Debian main?
For this reason I actually don't think there is a ton of distinction between Mo Zhou's proposal and Thorsten Glaser's, despite the latter being _ideologically_ anti-AI. If Ian Jackson's reading is correct, both proposals require that training code and data be suitable for Debian main. Arguably, Ian Jackson's reading is more stringent in practice than Thorsten Glaser's proposal, in that (if I'm reading right) Thorsten's proposal allows a DFSG-free AI with redistributable but DFSG-incompatible training data to go into contrib with its training data packaged up in non-free, but Mo's proposal with Ian's interpretation says that AI models that cannot be packaged up with their own training data "are not seen as DFSG-compliant," which would keep them out of contrib.
If Mo's proposal goes to a vote I think it would be very worthwhile to clarify the intention here, because I expect this to be one of the first practical issues for a serious "more free" competitor to the current crop of proprietary or open-weights LLMs.
Finally, I think there is actually a much stronger anti-AI part of Thorsten's proposal that was underdiscussed, because it affects things that are _not_ AI models. The proposal has Debian take the view that "'generative AI' output are derivative works of their inputs (including training data and the prompt)" and so "Any work resulting from generative use of a model can at most be as free as the model itself; e.g. programming with a model from contrib/non-free assisting prevents the result from entering main." In other words, if someone uses GitHub Copilot or Cursor to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated license is on that software, because the underlying models are not DFSG-free.
This strikes me as hugely impractical (which is not necessarily wrong, if your goal is "a hard anti-AI stance"!). These tools are, in practice, incredibly popular. While probably few of the "old guard" writing things like glibc are using these tools, I would expect a decent fraction of casual FOSS contributors to be using them, to the point where I assume that by now, at least one package in the current Debian archive has at least some code that was written with the help of such a tool. You would need upstream maintainers to not only commit to refusing to use these tools themselves but also to telling contributors not to use them either, and you would need Debian packagers to tell people who send in patches or Salsa merge requests the same thing.
To Thorsten's credit, he does take exactly this approach (https://mbsd.evolvis.org/permalinks/wlog2021_e20240726.htm). But adopting this proposal would essentially make Debian require every project it packages to adopt this approach, too, and perhaps to engage in a significant audit or scrub of code from between the availability of tools like Copilot and the point of adopting such a policy.
My understanding of the current legal situation is that even for things like the NYT lawsuit against OpenAI, the claim is that the act of training was not fair use, and OpenAI's commercial gain from that act of infringement must be stopped. That is, I think, a different claim from that both the act of training produces a model that is a copyrightable derivative work _and_ that the act of doing inference with that model also produces derivative works of the model, such that all model outputs, and not just the model itself, are subject to the training data's license (or are infringing). Again, to be clear, I see the moral argument for such a position. But it seems likely to me that this is a much more copyright-maximalist position than the actual legal reality, and so it shouldn't be adopted lightly.
Among other things, I think the FOSS community needs to constantly remember that the whole moral deal about copyleft is that it subverts copyright; it doesn't embrace it. The point is that, to the extent that copyright exists, we might as well turn it around to good ends; that's not the same thing as wanting to stretch its scope as far as possible. (As an analogy, consider that despite many FOSS licenses having explicit text about patent licenses, the common position in the FOSS community is that software patents should not exist, not that we should patent everything novel and cool in the FOSS world and assign the patents to Conservancy to better arm their lawyers.) Distros like Debian benefit, for instance, from the idea that facts and data are not themselves copyrightable, in that this is what enables you to reverse-engineer a proprietary system and produce a compatible FOSS system that is not a derivative work. I worry that if Debian argues that the outputs of AI models are derivative works of their training data, it becomes harder to argue that a clean-room reimplementation of a protocol from a reverse-engineered spec is not a derivative work of the proprietary system that was reverse-engineered. Even if you want to incur the logistical trouble of refusing upstreams that have accepted AI-assisted pull requests, it might be a better rationale to say that Debian believes that AI-generated code does not live up to the spirit of the Social Contract and the goals of its developers, rather than to say that Debian believes the licenses inherit in this way.
Posted Apr 26, 2025 13:37 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (23 responses)
Devil's advocate, let me re-write that last sentence ...
"In other words, if a student learns by studying proprietary software and uses that experience to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated licence is on that software, because the underlying software that code is based on is not DFSG-free".
At the end of the day, whether something is derivative (and hence copyright-contaminated) is a legal call. I've just been reading (a day or two ago) about how the brain is a very efficient Bayesian filter. An AI/LLM is also (I guess) a Bayesian filter of some sort. If a human brain fine tunes the output of an artificial brain, where do you draw the line? How do you know where the line is?
Cheers,
Posted Apr 27, 2025 1:46 UTC (Sun)
by geofft (subscriber, #59789)
[Link] (4 responses)
On the subject of human minds, in actual legal systems, I am under the impression that the mind is considered inviolate and special in various ways. For instance, while arguments have been raised that loading files into RAM is performing a copy and therefore could infringe copyright, I do not believe any court has ever taken the opinion that a human reading a document, no matter how closely, could infringe copyright simply by the act of reading it, or that listeners to a musical work who have the music stuck in their heads have performed a copy. (Of course if they later write down what they remember, as Mozart supposedly did to Allegri's "Miserere," then that counts as a copy subject to copyright law.) In another application, in the US, the Fifth Amendment protection against self-incrimination means that written documents can be subpoenaed but the contents of one's mind cannot, and there's an emerging precedent that for the same reasons, police can compel you to unlock a device via biometrics (face/fingerprint/retina) but not via revealing even a short passcode.
Debian, of course, is free to bind its behavior more strictly than the law requires, and I think the FOSS community has a history of wanting people to avoid looking too closely at proprietary code out of a general fear that arguments would be made in court about infringing copies, even if the court is not going to rule that a programmer's brain is a derivative work. It would probably be a bad outcome to start treating this as a norm instead of just a defense to have in your quiver, and end up with the rule that someone who has learned from one free software project should be considered tainted when trying to contribute to another free software project under an incompatible license, e.g., that someone who has read glibc's sources too closely cannot send MIT-licensed patches to musl.
Posted Apr 27, 2025 16:45 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
Isn't it strange how you have just perfectly described European law as it refers to AI, which has so many authors, song-writers and composers screaming foul.
The law has taken the opinion that an AI reading a document, no matter how closely, does not infringe copyright simply by the act of reading (and remembering) it.
Of course, if they then later write down what they remember, then that counts as a copy subject to copyright law.
In other words, there's no such thing as "Copyright Washing" in European law. The law simply makes no distinction whatsoever between artificial intelligence and human intelligence (or lack thereof!).
Cheers,
Posted Apr 27, 2025 18:47 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (2 responses)
Remember that this is tied in not just with what's legally acceptable, but also what's politically acceptable; part of the reason it's such a big fuss is that in countries with very little support for the unemployed, any technology that's being sold as "reduces the number of people needed to reach a certain level of attainment" is a political timebomb. As a result, it's quite likely to either stop being controversial (once the limits are clear, and it's not a big threat to jobs and to people doing serious work), or be something that's clamped down on hard (because it's that, or civil unrest); saying "let's hold it off for now, and make a serious decision once it's not a mess" is not a bad thing in itself.
I can see strong arguments for holding to the strictest possible answer in the short term (such that if jurisdictions start to make the judgement call, you're safe unless their decision is a shock to everyone), but time-limiting it with a view to coming back when the commercial dust settles and jurisdictions have either made this a non-question, or extending the time limit if it's still a controversy.
Posted May 23, 2025 11:41 UTC (Fri)
by sammythesnake (guest, #17693)
[Link] (1 responses)
The risk otherwise is that future resolutions to relax these restrictions will be fighting against an entrenched status quo that wasn't intended to remain...
Posted May 23, 2025 14:27 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
Even if you put the rationale that you intend it to be re-visited, there will be an entrenched status quo that says "why should we?". The sunset would give them no choice.
Cheers,
Posted Apr 27, 2025 6:20 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (17 responses)
Posted Apr 27, 2025 8:43 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (16 responses)
Which is? Or are you arguing that the human brain is magic?
There's a lot of people who THINK they know how the brain works. There's a lot of people who are trying to build computer models of their model of how the human brain works. And for the most part the computer model is crap because the mental model bears no resemblance to reality.
The classic example: why are people throwing general-purpose compute-heavy hardware at these problems? How much of the brain is single-purpose dedicated hardware? Is it 90%? Something like that, certainly. And they talk to other bits of dedicated hardware. That's why my car's "driver assist" features are such a damn nightmare. They're based on general-purpose ideas and hardware, and don't talk to each other. Case in point - in heavy traffic when a car pulls out of my way, my car fails to recognise we're not on a collision course, pretty much comes to a halt until it loses sight of it, then floors the accelerator until a different piece of hardware suddenly realises it's going to crash into the car in front and screams at me to take over and brake!
Or adaptive cruise control. Which can't tell the difference between its internal database and a road traffic sign. Which regularly sees signs which clearly don't apply to the road you're on but it applies them anyway. Which completely ignores the driver's control input when it disagrees with the computer database's input.
Compare that with a real human. I can recognise a stationary car at a distance. I can tell when the trajectory of my car and the car in front don't intersect. I can identify speed limits and brake lights at a distance. And my actions NOW are informed by what I can see happening in a minute's time.
There's only one major difference between a human and a machine. The people trying to build human models aren't talking to the people studying humans, and the computer models therefore bear bugger all resemblance to the human systems they're trying to model. Pretty much par for the course in most human research, sadly ... :-(
Cheers,
Posted Apr 27, 2025 17:35 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (15 responses)
“Generative AI” operates like a compiler or a lossy compression/decompression. Its output is fully dependent on its inputs, as it runs on a deterministic machine (if a PRNG is used, it does count as input, making its output reproducible), so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).
Posted Apr 27, 2025 20:12 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
Devil's advocate again, but doesn't that mean an AI output is a mechanical derivation of the prompt a human fed in? At which point any copyright (liability) belongs firmly with the user prompting the AI. So all these people saying "AI generated an exact copy of Shakespeare" (or whatever) are clearly fully liable for any copyright violation that may be involved in that reproduction ...
Cheers,
Posted Apr 27, 2025 20:39 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
And yes, the responsibility is with the operator. The copyright exception for text and data mining (if not opted out) only applies for analysēs like getting trends, not for generative use (probably not even the genAI summarisation functionality).
Posted Apr 27, 2025 20:39 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (11 responses)
That is simply not how courts define "derivative work," and it's rather obvious to see why. If you run an algorithm over somebody else's creative work, that algorithm could do anything from "return the input unchanged," to "count the words, sentences, paragraphs, and pages, and output all of those numbers," to "always output the number 7."
"Derivative work" is defined differently in different countries (and some countries call it something else entirely), but most definitions are going to at least vaguely resemble the US definition, which in practice says that a work is derivative of another work if the plaintiff (author of or rightsholder to the original) can prove two things:
* Access: The person who created the allegedly infringing work (i.e. the person who pushed the buttons that caused the AI to output it) would have been able to copy the original.
* Substantial similarity: There are significant copyrightable elements in common between the original and the allegedly infringing work, and the number of such elements and degree of similarity is high enough to justify an inference of copying.
Access is usually pretty easy to prove - in most cases, "publicly available" is good enough. For an AI case, you might also need to prove that the work was actually in the training set, but I don't think we have caselaw on that yet. Substantial similarity, on the other hand, is entirely fact-bound and there are no bright lines - the only way to prove it is to put the two works side by side and point to specific individual elements that you say were improperly copied. This also means you must *have* a specific original in mind when you file the lawsuit - you can't just say (e.g. as a class action) "well, it's one of these millions of works that was infringed." That sort of pleading would probably get dismissed even pre-Twiqbal, but nowadays, it would get you laughed out of the courtroom.
Posted Apr 27, 2025 21:15 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (10 responses)
The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.
Of course you need to figure out similarity for each one if you want to sue for infringement.
But we’re not looking at suing for infringement, we’re looking at positively defining what is acceptable, and for this, this level of detail is sufficient.
Posted Apr 27, 2025 23:32 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (9 responses)
The term "derivative work" has a specific legal meaning. If you want to invent your own meaning, I would encourage you to come up with your own word, so that it does not become conflated with the legal meaning that clearly does not apply in this context (substantial similarity is only meaningful in the context of comparing specific individual works, not large classes of works, and without substantial similarity, there is no derivative work, under US law).
> But we’re not looking at suing for infringement, we’re looking for doing a positive-definition of what is acceptable, and for this, this level of detail is sufficient.
This brings us to the broader problem here: Copyright is only about 300 years old, and its contours have changed dramatically in that short time. Most basic principles of law and ethics are far older than that, and have been far more stable in their overall form and function (at least in semi-modern times). Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion. Here is an example I would encourage you to think carefully about:
Suppose someone never registers their copyright, or fails to comply with some other legal requirement. Should their copyright automatically lapse, or even fail to vest in the first place? I think most folks would say that it should not, hence why we got rid of those requirements. But then you ask people another question: Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's? Most folks will tell you that Warner should not own Happy Birthday, and that they did rightfully lose it. But that is inconsistent - either formalities are rightful, and should still apply, or they are wrongful, and Warner should not have lost the copyright. You can resolve this by appealing to the excessive term of modern copyright, but that is dodging the question. I'm not asking what the ideal copyright system would look like - I'm asking for the rightful outcome in this one specific case.
But perhaps you disagree, and think we can resolve it based on the term. That's fine, I have other examples. There are numerous works from the 50's and 60's whose US copyrights were not renewed, mostly serialized pulp fiction that was seldom or never reprinted. Many of them can now be read on the Internet Archive for free. Is it rightful that those authors, some of whom might still be alive, are not paid for the reproduction of their works? Or, as a matter of ethics, should the Internet Archive take down all of those works until such time as their authors can be identified and some sort of payment-in-lieu-of-licence can be worked out?
Or we can look at even more recent developments:
* Is it wrongful to distribute tools that break DRM (because it enables piracy), or is DRM itself wrongful (because people should control their computers)?
* A clickwrap license is a contract you agree to by interacting with some computer software (usually clicking an "I agree" button). Should these contracts be effective? If yes, then EULAs are effective and proprietary software is ethically legitimate. If not, then that includes contracts for the sale of goods, which effectively means that no form of online commerce should be allowed. Is either of those extremes correct, or should we adopt some middle position (and exactly what does that position look like)?
* Is it wrongful for a hobbyist to create a derivative work of some giant corporation's intellectual property, assuming the hobbyist makes at most a small profit from it?
* Is it wrongful for a giant corporation to create a derivative work of some hobbyist's intellectual property, assuming the corporation makes at most a small profit from it?
* Is it OK if the answers to the above two bullets differ, or would that be a logical contradiction?
I want to be clear - the above are rhetorical questions. Replying with specific answers would entirely miss the point. I'm not asking for your specific viewpoint. I'm asking you to consider whether a significant number of people might have a different viewpoint from you, and whether you have any legitimate right to insist that your viewpoint is more ethically correct than theirs. If we can't even agree on these more basic questions of how copyright ought to work, then I submit to you that it is impossible for us to come to agreement on how copyright should interact with generative AI.
TL;DR: The basic ethical premises of copyright are still up for debate, so it is deeply questionable whether we can even have this discussion in the first place.
Posted Apr 28, 2025 7:53 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (7 responses)
Was it even possible for them to comply lawfully with US copyright formalities in the 1930s? Especially in patents, but also in copyright, how much unrecognised "prior art" is out there?
Cheers,
Posted Apr 28, 2025 9:39 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Based on the facts that have been publicly disclosed, the problems with "Happy Birthday" were roughly as follows:
* The melody was indisputably in the public domain, having been published in 1893 under a different name ("Good Morning to All") and with a trivial difference in arrangement (one note became two).
* A copyright on the lyrics was registered in 1935.
* In 1927, a copy of the lyrics was published, set to the melody of "Good Morning to All," without a copyright notice. If authorized, this publication would have the effect of forfeiting any copyright before it could even be registered. Warner would later argue in court that this publication was unauthorized, or at least not authorized by the appropriate party.
* There were a number of other arguments raised. The judge considered all of these arguments at summary judgment, and concluded that most of them (including the 1922 publication) would need to go to trial.
* But the judge did find one basis for summary judgment: The sale of the copyright from the Hill sisters (one or both of whom wrote the song) to the Summy Company (which Warner eventually bought) was apparently a bit of a mess. It went through multiple rounds of litigation, three separate agreements, and the second agreement was missing from the modern record. The judge ruled, as a result, that there was no evidence the Hills had specifically sold the lyric rights to the Summy Company, so Warner lost on that basis.
The question I asked is whether this is a just outcome, which (now that I look more closely) is a bit of a muddle. But you could just as easily imagine an alternative version of events in which the judge instead rules that the 1922 publication caused the song to enter the public domain. That is not an implausible outcome - the judge said it was a triable issue of fact that could have gone either way.
Posted Apr 28, 2025 17:22 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (5 responses)
It's only *legally* *defined* under patent law.
But if somebody sues me for copyright copying, because I wrote a piece of music with da-da-da-dum, da-da-da-dum, I sure as hell am going to point them at Beethoven as prior art!
Terry Pratchett was accused of ripping off Hogwarts when he created Unseen University. Quite apart from the fact he would have needed a time machine, there's absolutely shit-loads of prior art going back to the 1920s if not the century before about British boarding schools and the like. Bunter, anyone?
Yes, it's not *legally* prior art, but what on earth else are you going to call it?
Cheers,
Posted Apr 28, 2025 17:34 UTC (Mon)
by pizza (subscriber, #46)
[Link] (4 responses)
Methinks you need to bone up on the distinction between "ideas" and "specific expressions".
Posted Apr 28, 2025 20:26 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (3 responses)
It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.
What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from?
Even worse when an AI does it - I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.
I think that's a major problem with a lot of IP at the moment ...
Cheers,
Posted Apr 30, 2025 16:37 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Ideas are categorically exempt from copyright protection in the US (and most of the world). See for example 17 USC 102(b), and see also case law such as Baker v. Selden.
> What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.
The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it. If you can show that the plaintiff copied the specific expression from elsewhere, then under US law, this has two rather significant consequences (both laid out in 17 USC 103):
1. They don't own that specific expression, because it's not part of "the material contributed by the author of such work."
2. If the copying was unlawful (the original was copyrighted and they had no license), then that entire part of the work is unprotected by copyright.
You cannot demand that the plaintiff give you an accounting of every place where it possibly could have come from and exhaustively prove that it is entirely original, because that would be plainly unworkable. Instead, you as the defendant have to find the original work (possibly through discovery), introduce it at trial, and argue that it precludes the plaintiff from owning the specific expression.
You can argue that the copied expression is a "scène à faire," literally meaning "a scene that must be done." The original basis for this was genre fiction, in which (for example) a mystery novel simply must have a scene at the end where the detective (who also must exist) explains who committed the crime, how and why they did it, and what clues led the detective to that conclusion. If you don't have that scene, it's not a "real" mystery novel, it's some other genre, and since nobody can be allowed to own the mystery genre as a whole, nobody can own that type of scene either.
In the programming context, scènes à faire includes constructs like for(i = 0; i < max; i++){...}. Nobody can be allowed to own that, because you can't reasonably write a (large, complex) program in a C-like language and never write a loop that looks like that.
> I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.
Disclaimer: I work for Google, not as a copyright lawyer, and so I can't speak on their behalf. The following is my personal interpretation of YouTube's behavior, based on public information and reasonable inference.
YouTube's copyright system is primarily designed to protect YouTube from getting sued, and secondarily designed to discourage users from suing each other. You cannot assume that the outcomes you see on YouTube are necessarily what a court of law would have done, because YouTube does not have the power to make binding rulings on whether X is a copyright infringement of Y. So they're stuck making rulings on the basis of whether YouTube can plausibly be sued, which leads to all sorts of undesirable-but-unavoidable biases in favor of the copyright holder (many of them explicitly codified into law, e.g. in 17 USC 512 and similar laws in other jurisdictions). Of course, anyone dissatisfied with YouTube's handling of an issue remains free to take it to the "real" court system under a variety of legal theories (slander of title, tortious interference, conversion of revenues, 17 USC 512(f) misrepresentation, etc.).
I will agree that the legal system generally does a poor job of producing just and efficient outcomes in this space. There is a reason that both YouTube and its users strongly prefer to avoid going to court. But scènes à faire has nothing to do with this. What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here). Unfortunately, modern copyright law was simply not designed under the assumption that services like YouTube might exist, and updating it has proved difficult (especially in the modern US political system where Congress can barely agree to keep the government open). A few years ago, the YouTuber Tom Scott made an excellent ~43 minute video explaining the broader problem in detail, which I found highly informative and more entertaining than you might expect: https://www.youtube.com/watch?v=1Jwo5qc78QU
Posted Apr 30, 2025 18:35 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
It is, but who's responsible? As far as I know, the guy receiving the improper royalties could well be completely unaware of the source of the royalties. He uploaded a piece of music, which he just happened to record at the same time his washing machine finished its cycle, and YouTube's automated systems assumed it was his copyright. Whoops!
So if YouTube wants to avoid being sued, somebody who's prepared to take the risk will probably take them to the cleaners ... (rather appropriate seeing as it's a washing machine rofl)
Cheers,
Posted Apr 30, 2025 18:42 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
> The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it.
Again, I'm thinking of a particular example. I don't know the outcome, but some musician sued saying another musician had "copied his guitar riff". Given that a riff is a chord sequence, eg IV V I, the shorter the riff the more likely it is another musician either stumbled on it by accident, or it's actually a common sequence in a lot of music. If I try and play a well-known piano sequence on the guitar, chances are I'll transpose the key and it'll sound very like someone else's riff that I may never even have heard ...
(I am a guitar player, but classical, so I don't play riffs ... :-)
Again, this is a big problem with modern copyright law where so much may be - as you describe it - a "scene a faire", but people don't recognise them as such.
Cheers,
Posted Apr 28, 2025 8:13 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
As an aside, and this only bolsters your point, I live in a jurisdiction (England and Wales) where plenty of things that are obviously theft when you look at them don't quite reach the legal bar. The requirements here for theft are that you appropriated property, that it belonged to someone else, that you appropriated it dishonestly, and that you intended to permanently deprive the owner of that property. So, for example, if I take your lawnmower at the beginning of summer to mow my lawn, with intent to return it to you when winter starts and it's too cold for grass to grow, I've not committed theft, in law.
We have had to introduce specific laws for cases like taking someone's car for a joyride without permission, refusing to return a credit to your bank account that was made in error, or leaving without paying when you owe payment on the spot, precisely because the legal definition of theft is too narrow to cover these cases, even though most of us would agree that they were "theft" in a colloquial sense.
Posted May 6, 2025 10:42 UTC (Tue)
by aigarius (subscriber, #7329)
[Link]
Posted Apr 26, 2025 14:13 UTC (Sat)
by lumin (subscriber, #130448)
[Link] (3 responses)
I realized that problem when writing the text. However, if the proposal says:
"AI models released under DFSG-compliant license without training data or program" is not seen as DFSG-compliant.
-> "... under DFSG-compliant ..." is not seen as DFSG-compliant.
It looks weird as it looks like a paradox. I did not think too much further on this wording issue, and replaced it with open source. I think people can anyway understand what I mean there. But you are right, we'd better revise the wording if this is going to officially land somewhere.
Posted Apr 26, 2025 14:59 UTC (Sat)
by gioele (subscriber, #61675)
[Link]
It is however an already known "paradox": if you release a "GPL executable" but you do not provide its source, including «the scripts used to control compilation and installation of the executable», then you are not really complying with the terms of the GPL.
Posted Apr 27, 2025 2:00 UTC (Sun)
by geofft (subscriber, #59789)
[Link]
If you want to change it, I'd suggest replacing "open source license" with "DFSG-compatible license" to contrast with "DFSG-compliant" at the end of the sentence. (Policy also uses "comply with the DFSG" to describe what can go into main and contrib.) "Compatible," to me, means it's possible to use it in a way that fits with the DFSG, but it doesn't mean that it's impossible to use it in another way. If I want to write a POSIX-compliant shell script, it's very helpful to test it against a POSIX-compatible shell, but that by itself is not enough.
Posted Apr 27, 2025 15:14 UTC (Sun)
by smcv (subscriber, #53363)
[Link]
I believe there was one case in particular where the upstream developer of a piece of software under a BSD/MIT-style license (was it Pine?) had an unusual interpretation of the license and asserted that their software was only redistributable if it was at no cost (free of charge), which would have made it impossible to sell Debian CDs with their software included. Debian responded to this by treating that specific piece of software as non-Free, even though the license it was released under is one that we would usually have accepted.
Posted Apr 26, 2025 21:24 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (7 responses)
If you believe all software should be free, then believing that all ML models should be free seems the logical next step.
As for reproducing models, that seems like a nonstarter, as the first step in training is to initialise with random weights. I guess you could make that deterministic, but the question is why you care. The training set is going to be at least 100 times the size of the final model. Requiring full reproducibility on very large models is going to require distributing such a vast amount of data, and for what purpose? People are just going to use the final model weights to build more fine-tuned models; reproducing from the source data just isn't very useful.
The impact on the many other places machine learning models have been used for years in software is also something that shouldn't be discounted. This discussion could have a lot more impact than it appears at first glance.
Posted Apr 26, 2025 21:31 UTC (Sat)
by mb (subscriber, #50428)
[Link] (2 responses)
Just use the closed source binary? Why would you want to have the source? Just use the binary!
Posted Apr 27, 2025 11:25 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (1 responses)
In many ways, the model is actually a suitable stand-in for the original data. I think that for the vast majority of use cases, people are better off taking an existing public model and fine-tuning it to suit their purpose with their own data than trying to rebuild a model from scratch at great expense.
Posted Apr 29, 2025 8:46 UTC (Tue)
by taladar (subscriber, #68407)
[Link]
As long as that is the case free software concepts are hard to apply to AI models.
Posted Apr 27, 2025 2:41 UTC (Sun)
by geofft (subscriber, #59789)
[Link]
- Simply requiring it to be published is a good way to incentivize keeping people honest about what it is. (Facebook, for instance, is alleged to have trained Llama-3 against books literally torrented via LibGen, and they attempted to seal parts of the lawsuit because they were embarrassed about the situation becoming public.) Having it be published with the expectation that people might plausibly try to reproduce it, and ask hard questions if they can't, is an even stronger incentive.
- If you want to remove something from the "knowledge" of the model, it is basically impossible to do that reliably starting from the trained model, at least given the current state of the art in interpretability, whereas it is conceptually straightforward (though resource-intensive) to just remove that training data and run the training again.
- It may well be the case that a future AI, itself, could do something useful with another model and its training data that would be impractical for a human, e.g., interpretability work to explain some behavior with reference to the training data that induced it. That we don't see a use for the full training data now is no reason to decide we won't find a use later.
- Starting with the same data but a different training program could yield interesting results, ranging from simply modifying the training program to do the same thing but with more parameters or a different approach to tokenization, to training in a wholly different way.
- I do think one of the many arguments for FOSS is that it's good for learning and openness, even independent of the practicality of using the software. FOSS licenses may not have termination clauses or timeouts, which of course is for practical reasons of allowing people to package and redistribute software confidently, but I think we'd all intuitively disapprove of a license that terminated even if it were 100% guaranteed that nobody was ever going to actually run the software again - having the sources around and available is a contribution to the commons.
- In any of the weird hypothetical evil-overlord-AI scenarios like https://ai-2027.com/ , it seems pretty clear that having meaningful capacity to build your own AIs will be immensely helpful in fighting back. Of course "the good guys" are going to be at a strong disadvantage for myriad reasons, but not having any access to training data will put them even farther back. In a scenario where a strong enough LLM has started actively concealing its own inner "thoughts" from its output without people noticing, I'm not sure that fine-tuning a trained subversive model is going to be sufficiently helpful to really get it onto your side.
Also keep in mind that all the involved hardware is rapidly getting more capable, so I think that an end user might retrain a model they're using from Debian stable or oldstable and find it much more feasible than it would have been for them to do that same retraining at the time that model was uploaded to Debian unstable. (For the same reason, there may also be an effect that the actual type of GPUs used when originally training the model were in high demand at the time but are now no longer top-of-line and are easier and cheaper to rent or buy.)
Posted Apr 27, 2025 6:25 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
And those TESCREAL models don't even do that, and in practice cannot. And attribution is a burden so low that not honouring it is a much harsher violation than, say, one point of the GPL.
Posted Apr 28, 2025 20:15 UTC (Mon)
by ballombe (subscriber, #9523)
[Link]
... which is exactly the line played by openAI, facebook et al when it comes to copyright.
They are not copyright minimalists despite what they say; see their reaction to DeepSeek. It makes sense for GNU GPL software to want to be protected from exploitation by them.
Posted Apr 28, 2025 22:01 UTC (Mon)
by jzb (editor, #7867)
[Link]
> I'm also a little surprised that people supporting copyleft as a way to subvert copyright law simultaneously try to argue copyright maximalism with respect to machine learning models. It feels a lot like "we want to subvert copyright law, but only when it's in our favour".
I'm not sure why this is surprising. Copyleft uses copyright to do a 180 and grant rights to users that are normally reserved, and to insist that others convey the same rights. It does this for code that the licensor has the rights to—it does not take anything from others that it has no rights to. The mechanism requires copyright laws to work. I don't see how that is inconsistent with "we oppose technologies that hoover up other people's copyrighted works without permission". I've known free-software folks who argue that proprietary licensing is unethical, but they still acknowledge that creators have the right to set licensing terms for their code, even if they disagree with the choices they make.
Posted Apr 27, 2025 6:22 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
Posted Apr 27, 2025 6:31 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
Posted Apr 26, 2025 1:02 UTC (Sat)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Apr 26, 2025 13:34 UTC (Sat)
by lumin (subscriber, #130448)
[Link]
Posted Apr 26, 2025 1:04 UTC (Sat)
by pabs (subscriber, #43278)
[Link]