LWN: Comments on "Debian debates AI models and the DFSG" https://lwn.net/Articles/1018497/ This is a special feed containing comments posted to the individual LWN article titled "Debian debates AI models and the DFSG". en-us Thu, 04 Sep 2025 02:57:11 +0000 Thu, 04 Sep 2025 02:57:11 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Short-term versus long-term positions https://lwn.net/Articles/1022396/ https://lwn.net/Articles/1022396/ Wol <div class="FormattedComment"> Or put in a sunset clause - the restrictions will apply for at most, say, 5 years, and then a new GR is required ...<br> <p> Even if you state in the rationale that you intend it to be re-visited, there will be an entrenched status quo that says "why should we?". The sunset would give them no choice.<br> <p> Cheers,<br> Wol<br> </div> Fri, 23 May 2025 14:27:06 +0000 Short-term versus long-term positions https://lwn.net/Articles/1022323/ https://lwn.net/Articles/1022323/ sammythesnake <div class="FormattedComment"> That "wait and see but be cautious in the meantime" approach sounds pretty sensible to me, but I think the resolution ought to explicitly state that rationale and the intention to potentially relax the restrictions if future clarification allows.<br> <p> The risk otherwise is that future resolutions to relax these restrictions will be fighting against an entrenched status quo that wasn't intended to remain...<br> </div> Fri, 23 May 2025 11:41:11 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1020223/ https://lwn.net/Articles/1020223/ aigarius <div class="FormattedComment"> That is not entirely correct. For example, a fully automated process creating a thumbnail of an image for a search engine to show in search results has been ruled in court to be fair use. Copyright law does *not* really make much of a distinction between human outputs and automated software outputs. The considerations on copyright and fair use do not rely on that distinction at all. Me resizing an image in GIMP has the same effect (both technical and legal) as a script doing the same to an uploaded image automatically.<br> </div> Tue, 06 May 2025 10:42:55 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019509/ https://lwn.net/Articles/1019509/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; &gt; What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.</span><br> <p> <span class="QuotedText">&gt; The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it.</span><br> <p> Again, I'm thinking of a particular example. I don't know the outcome, but some musician sued saying another musician had "copied his guitar riff". Given that a riff is a chord sequence, e.g. IV-V-I, the shorter the riff the more likely it is that another musician either stumbled on it by accident, or that it's actually a common sequence in a lot of music. If I try to play a well-known piano sequence on the guitar, chances are I'll transpose the key and it'll sound very like someone else's riff that I may never even have heard ...<br> <p> (I am a guitar player, but classical, so I don't play riffs ... 
:-)<br> <p> Again, this is a big problem with modern copyright law, where so much may be - as you describe it - a "scène à faire", but people don't recognise it as such.<br> <p> Cheers,<br> Wol<br> </div> Wed, 30 Apr 2025 18:42:16 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019508/ https://lwn.net/Articles/1019508/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here).</span><br> <p> It is, but who's responsible? As far as I know, the guy receiving the improper royalties could well be completely unaware of the source of the royalties. He uploaded a piece of music, which he just happened to record at the same time his washing machine finished its cycle, and YouTube's automated systems assumed it was his copyright. Whoops!<br> <p> So if YouTube wants to avoid being sued, somebody who's prepared to take the risk will probably take them to the cleaners ... (rather appropriate seeing as it's a washing machine rofl)<br> <p> Cheers,<br> Wol<br> </div> Wed, 30 Apr 2025 18:35:15 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019480/ https://lwn.net/Articles/1019480/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.</span><br> <p> Ideas are categorically exempt from copyright protection in the US (and most of the world). See for example 17 USC 102(b), and see also case law such as Baker v. Selden.<br> <p> <span class="QuotedText">&gt; What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.</span><br> <p> The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it. If you can show that the plaintiff copied the specific expression from elsewhere, then under US law, this has two rather significant consequences (both laid out in 17 USC 103):<br> <p> 1. They don't own that specific expression, because it's not part of "the material contributed by the author of such work."<br> 2. If the copying was unlawful (the original was copyrighted and they had no license), then that entire part of the work is unprotected by copyright.<br> <p> You cannot demand that the plaintiff give you an accounting of every place where it possibly could have come from and exhaustively prove that it is entirely original, because that would be plainly unworkable. Instead, you as the defendant have to find the original work (possibly through discovery), introduce it at trial, and argue that it precludes the plaintiff from owning the specific expression. <br> <p> You can argue that the copied expression is a "scène à faire," literally meaning "a scene that must be done." The original basis for this was genre fiction, in which (for example) a mystery novel simply must have a scene at the end where the detective (who also must exist) explains who committed the crime, how and why they did it, and what clues led the detective to that conclusion. 
If you don't have that scene, it's not a "real" mystery novel, it's some other genre, and since nobody can be allowed to own the mystery genre as a whole, nobody can own that type of scene either.<br> <p> In the programming context, scènes à faire include constructs like for(i = 0; i &lt; max; i++){...}. Nobody can be allowed to own that, because you can't reasonably write a (large, complex) program in a C-like language and never write a loop that looks like that.<br> <p> <span class="QuotedText">&gt; I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.</span><br> <p> Disclaimer: I work for Google, not as a copyright lawyer, and so I can't speak on their behalf. The following is my personal interpretation of YouTube's behavior, based on public information and reasonable inference.<br> <p> YouTube's copyright system is primarily designed to protect YouTube from getting sued, and secondarily designed to discourage users from suing each other. You cannot assume that the outcomes you see on YouTube are necessarily what a court of law would have done, because YouTube does not have the power to make binding rulings on whether X is a copyright infringement of Y. So they're stuck making rulings on the basis of whether YouTube can plausibly be sued, which leads to all sorts of undesirable-but-unavoidable biases in favor of the copyright holder (many of them explicitly codified into law, e.g. in 17 USC 512 and similar laws in other jurisdictions). Of course, anyone dissatisfied with YouTube's handling of an issue remains free to take it to the "real" court system under a variety of legal theories (slander of title, tortious interference, conversion of revenues, 17 USC 512(f) misrepresentation, etc.).<br> <p> I will agree that the legal system generally does a poor job of producing just and efficient outcomes in this space. There is a reason that both YouTube and its users strongly prefer to avoid going to court. But scènes à faire has nothing to do with this. What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here). Unfortunately, modern copyright law was simply not designed under the assumption that services like YouTube might exist, and updating it has proved difficult (especially in the modern US political system where Congress can barely agree to keep the government open). A few years ago, the YouTuber Tom Scott made an excellent ~43-minute video explaining the broader problem in detail, which I found highly informative and more entertaining than you might expect: <a href="https://www.youtube.com/watch?v=1Jwo5qc78QU">https://www.youtube.com/watch?v=1Jwo5qc78QU</a><br> </div> Wed, 30 Apr 2025 16:37:49 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019251/ https://lwn.net/Articles/1019251/ taladar <div class="FormattedComment"> I think the heart of the problem of defining a useful free software analog for AI models is precisely that we haven't really figured out good ways to inspect and modify them yet. 
Even the experts in the field struggle to modify models deliberately - whether to fix individual flawed behaviors/responses, to change larger-scale behavior patterns, or to determine why a model behaved the way it did.<br> <p> As long as that is the case, free software concepts are hard to apply to AI models.<br> </div> Tue, 29 Apr 2025 08:46:27 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019237/ https://lwn.net/Articles/1019237/ jzb <p><em>I'm also a little surprised that people supporting copyleft as a way to subvert copyright law simultaneously try to argue copyright maximalism with respect to machine learning models. It feels a lot like "we want to subvert copyright law, but only when it's in our favour".</em></p> <p>I'm not sure why this is surprising. Copyleft uses copyright to do a 180 and grant rights to users that are normally reserved, and to insist that others convey the same rights. It does this for code that the licensor has the rights to&mdash;it does not take anything from others that it has no rights to. The mechanism requires copyright laws to work.</p> <p>I don't see how that is inconsistent with "we oppose technologies that hoover up other people's copyrighted works without permission". I've known free-software folks who argue that proprietary licensing is unethical, but they still acknowledge that creators have the right to set licensing terms for their code, even if they disagree with the choices those creators make.</p> Mon, 28 Apr 2025 22:01:22 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019234/ https://lwn.net/Articles/1019234/ Wol <div class="FormattedComment"> I think various judges and plaintiffs do, too.<br> <p> It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.<br> <p> What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.<br> <p> Even worse when an AI does it - I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.<br> <p> I think that's a major problem with a lot of IP at the moment ...<br> <p> Cheers,<br> Wol<br> </div> Mon, 28 Apr 2025 20:26:31 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019233/ https://lwn.net/Articles/1019233/ ballombe <div class="FormattedComment"> &gt; It feels a lot like "we want to subvert copyright law, but only when it's in our favour".<br> <p> ... 
which is exactly the line played by OpenAI, Facebook, et al. when it comes to copyright.<br> They are not copyright minimalists despite what they say; see their reaction to DeepSeek.<br> It makes sense for GNU GPL software to want to be protected from exploitation by them.<br> </div> Mon, 28 Apr 2025 20:15:50 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019219/ https://lwn.net/Articles/1019219/ pizza <div class="FormattedComment"> <span class="QuotedText">&gt; Yes, it's not *legally* prior art, but what on earth else are you going to call it?</span><br> <p> Methinks you need to bone up on the distinction between "ideas" and "specific expressions".<br> <p> </div> Mon, 28 Apr 2025 17:34:21 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019218/ https://lwn.net/Articles/1019218/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; Prior art is not a thing under copyright law. It's only relevant to patent law.</span><br> <p> It's only *legally* *defined* under patent law.<br> <p> But if somebody sues me for copyright infringement, because I wrote a piece of music with da-da-da-dum, da-da-da-dum, I sure as hell am going to point them at Beethoven as prior art!<br> <p> Terry Pratchett was accused of ripping off Hogwarts when he created Unseen University. Quite apart from the fact he would have needed a time machine, there's absolutely shit-loads of prior art going back to the 1920s if not the century before about British boarding schools and the like. Bunter, anyone?<br> <p> Yes, it's not *legally* prior art, but what on earth else are you going to call it?<br> <p> Cheers,<br> Wol<br> </div> Mon, 28 Apr 2025 17:22:49 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019121/ https://lwn.net/Articles/1019121/ NYKevin <div class="FormattedComment"> Prior art is not a thing under copyright law. It's only relevant to patent law. There is a general requirement that authorship be "original," but this is a very low bar and not commonly litigated in cases like this (contrast e.g. Feist v. Rural).<br> <p> Based on the facts that have been publicly disclosed, the problems with "Happy Birthday" were roughly as follows:<br> <p> * The melody was indisputably in the public domain, having been published in 1893 under a different name ("Good Morning to All") and with a trivial difference in arrangement (one note became two).<br> * A copyright on the lyrics was registered in 1935.<br> * In 1922, a copy of the lyrics was published, set to the melody of "Good Morning to All," without a copyright notice. If authorized, this publication would have the effect of forfeiting any copyright before it could even be registered. Warner would later argue in court that this publication was unauthorized, or at least not authorized by the appropriate party.<br> * There were a number of other arguments raised. The judge considered all of these arguments at summary judgment, and concluded that most of them (including the 1922 publication) would need to go to trial.<br> * But the judge did find one basis for summary judgment: The sale of the copyright from the Hill sisters (one or both of whom wrote the song) to the Summy Company (which Warner eventually bought) was apparently a bit of a mess. It went through multiple rounds of litigation, three separate agreements, and the second agreement was missing from the modern record. 
The judge ruled, as a result, that there was no evidence the Hills had specifically sold the lyric rights to the Summy Company, so Warner lost on that basis.<br> <p> The question I asked is whether this is a just outcome, which (now that I look more closely) is a bit of a muddle. But you could just as easily imagine an alternative version of events in which the judge instead rules that the 1922 publication caused the song to enter the public domain. That is not an implausible outcome - the judge said it was a triable issue of fact that could have gone either way.<br> </div> Mon, 28 Apr 2025 09:39:31 +0000 Aside on the legal versus common definition of "theft" https://lwn.net/Articles/1019120/ https://lwn.net/Articles/1019120/ farnz <blockquote> Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion. </blockquote> <p>As an aside, and this only bolsters your point, I live in a jurisdiction (England and Wales) where plenty of things that are obviously theft when you look at them don't quite reach the legal bar. The requirements here for theft are that you appropriated property, that it belonged to someone else, that you appropriated it dishonestly, and that you intended to permanently deprive the owner of that property. So, for example, if I take your lawnmower at the beginning of summer to mow my lawn, with intent to return it to you when winter starts and it's too cold for grass to grow, I've not committed theft, in law. <p>We have had to introduce specific laws for cases like taking someone's car for a joyride without permission, refusing to return a credit to your bank account that was made in error, or leaving without paying when you owe payment on the spot, precisely because the legal definition of theft is too narrow to cover these cases, even though most of us would agree that they were "theft" in a colloquial sense. Mon, 28 Apr 2025 08:13:12 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019119/ https://lwn.net/Articles/1019119/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's?</span><br> <p> Was it even possible for them to comply lawfully with US copyright formalities in the 1930s? Especially in patents, but also in copyright, how much unrecognised "prior art" is out there?<br> <p> Cheers,<br> Wol<br> </div> Mon, 28 Apr 2025 07:53:53 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019109/ https://lwn.net/Articles/1019109/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.</span><br> <p> The term "derivative work" has a specific legal meaning. 
If you want to invent your own meaning, I would encourage you to come up with your own word, so that it does not become conflated with the legal meaning that clearly does not apply in this context (substantial similarity is only meaningful in the context of comparing specific individual works, not large classes of works, and without substantial similarity, there is no derivative work, under US law).<br> <p> <span class="QuotedText">&gt; But we’re not looking at suing for infringement, we’re looking for doing a positive-definition of what is acceptable, and for this, this level of detail is sufficient.</span><br> <p> This brings us to the broader problem here: Copyright is only about 300 years old, and its contours have changed dramatically in that short time. Most basic principles of law and ethics are far older than that, and have been far more stable in their overall form and function (at least in semi-modern times). Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion. Here is an example I would encourage you to think carefully about:<br> <p> Suppose someone never registers their copyright, or fails to comply with some other legal requirement. Should their copyright automatically lapse, or even fail to vest in the first place? I think most folks would say that it should not, hence why we got rid of those requirements. But then you ask people another question: Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's? Most folks will tell you that Warner should not own Happy Birthday, and that they did rightfully lose it. But that is inconsistent - either formalities are rightful, and should still apply, or they are wrongful, and Warner should not have lost the copyright. You can resolve this by appealing to the excessive term of modern copyright, but that is dodging the question. I'm not asking what the ideal copyright system would look like - I'm asking for the rightful outcome in this one specific case.<br> <p> But perhaps you disagree, and think we can resolve it based on the term. That's fine, I have other examples. There are numerous works from the 50's and 60's whose US copyrights were not renewed, mostly serialized pulp fiction that was seldom or never reprinted. Many of them can now be read on the Internet Archive for free. Is it rightful that those authors, some of whom might still be alive, are not paid for the reproduction of their works? Or, as a matter of ethics, should the Internet Archive take down all of those works until such time as their authors can be identified and some sort of payment-in-lieu-of-licence can be worked out?<br> <p> Or we can look at even more recent developments:<br> <p> * Is it wrongful to distribute tools that break DRM (because it enables piracy), or is DRM itself wrongful (because people should control their computers)?<br> * A clickwrap license is a contract you agree to by interacting with some computer software (usually clicking an "I agree" button). Should these contracts be effective? If yes, then EULAs are effective and proprietary software is ethically legitimate. 
If not, then that includes contracts for the sale of goods, which effectively means that no form of online commerce should be allowed. Is either of those extremes correct, or should we adopt some middle position (and exactly what does that position look like)?<br> * Is it wrongful for a hobbyist to create a derivative work of some giant corporation's intellectual property, assuming the hobbyist makes at most a small profit from it?<br> * Is it wrongful for a giant corporation to create a derivative work of some hobbyist's intellectual property, assuming the corporation makes at most a small profit from it?<br> * Is it OK if the answers to the above two bullets differ, or would that be a logical contradiction?<br> <p> I want to be clear - the above are rhetorical questions. Replying with specific answers would entirely miss the point. I'm not asking for your specific viewpoint. I'm asking you to consider whether a significant number of people might have a different viewpoint from you, and whether you have any legitimate right to insist that your viewpoint is more ethically correct than theirs. If we can't even agree on these more basic questions of how copyright ought to work, then I submit to you that it is impossible for us to come to agreement on how copyright should interact with generative AI.<br> <p> TL;DR: The basic ethical premises of copyright are still up for debate, so it is deeply questionable whether we can even have this discussion in the first place.<br> </div> Sun, 27 Apr 2025 23:32:48 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019108/ https://lwn.net/Articles/1019108/ mirabilos <div class="FormattedComment"> The fallacy is that you look at one work only.<br> <p> The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.<br> <p> Of course you need to figure out similarity for each one if you want to sue for infringement.<br> <p> But we’re not looking at suing for infringement, we’re looking at making a positive definition of what is acceptable, and for this, this level of detail is sufficient.<br> </div> Sun, 27 Apr 2025 21:15:59 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019107/ https://lwn.net/Articles/1019107/ mirabilos <div class="FormattedComment"> It is (but it is also a derivative of the “training data”).<br> <p> And yes, the responsibility is with the operator. The copyright exception for text and data mining (if not opted out) only applies to analyses like getting trends, not to generative use (probably not even the genAI summarisation functionality).<br> </div> Sun, 27 Apr 2025 20:39:49 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019106/ https://lwn.net/Articles/1019106/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).</span><br> <p> That is simply not how courts define "derivative work," and it's rather obvious to see why. 
If you run an algorithm over somebody else's creative work, that algorithm could do anything from "return the input unchanged," to "count the words, sentences, paragraphs, and pages, and output all of those numbers," to "always output the number 7."<br> <p> "Derivative work" is defined differently in different countries (and some countries call it something else entirely), but most definitions are going to at least vaguely resemble the US definition, which in practice says that a work is derivative of another work if the plaintiff (author of or rightsholder to the original) can prove two things:<br> <p> * Access: The person who created the allegedly infringing work (i.e. the person who pushed the buttons that caused the AI to output it) would have been able to copy the original.<br> * Substantial similarity: There are significant copyrightable elements in common between the original and the allegedly infringing work, and the number of such elements and degree of similarity is high enough to justify an inference of copying.<br> <p> Access is usually pretty easy to prove - in most cases, "publicly available" is good enough. For an AI case, you might also need to prove that the work was actually in the training set, but I don't think we have caselaw on that yet. Substantial similarity, on the other hand, is entirely fact-bound and there are no bright lines - the only way to prove it is to put the two works side by side and point to specific individual elements that you say were improperly copied. This also means you must *have* a specific original in mind when you file the lawsuit - you can't just say (e.g. as a class action) "well, it's one of these millions of works that was infringed." That sort of pleading would probably get dismissed even pre-Twiqbal, but nowadays, it would get you laughed out of the courtroom.<br> </div> Sun, 27 Apr 2025 20:39:17 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019105/ https://lwn.net/Articles/1019105/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; it fully suffices that copyright law requires a human natural person to create works, and everything else is just mechanical derivation.</span><br> <p> Devil's advocate again, but doesn't that mean an AI output is a mechanical derivation of the prompt a human fed in? At which point any copyright (liability) belongs firmly with the user prompting the AI. So all these people saying "AI generated an exact copy of Shakespeare" (or whatever) are clearly fully liable for any copyright violation that may be involved in that reproduction ...<br> <p> Cheers,<br> Wol<br> </div> Sun, 27 Apr 2025 20:12:39 +0000 Short-term versus long-term positions https://lwn.net/Articles/1019091/ https://lwn.net/Articles/1019091/ farnz I can see strong arguments for holding to the strictest possible answer in the short term (such that if jurisdictions start to make the judgement call, you're safe unless their decision is a shock to everyone), but time-limiting it with a view to coming back when the commercial dust settles and jurisdictions have either made this a non-question, or extending the time limit if it's still a controversy. <p>Remember that this is tied in not just with what's legally acceptable, but also what's politically acceptable; part of the reason it's such a big fuss is that in countries with very little support for the unemployed, any technology that's being sold as "reduces the number of people needed to reach a certain level of attainment" is a political timebomb. 
As a result, it's quite likely to either stop being controversial (once the limits are clear, and it's not a big threat to jobs and to people doing serious work), or be something that's clamped down on hard (because it's that, or civil unrest); saying "let's hold off for now, and make a serious decision once it's not a mess" is not a bad thing in itself. Sun, 27 Apr 2025 18:47:14 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019090/ https://lwn.net/Articles/1019090/ mirabilos <div class="FormattedComment"> I’m *completely* ignoring the philosophical part of this (on which I of course have an opinion, but don’t have the spoons to debate here) because, at this point, it fully suffices that copyright law requires a human natural person to create works, and everything else is just mechanical derivation.<br> <p> “Generative AI” operates like a compiler or a lossy compression/decompression. Its output is fully dependent on its inputs, as it runs on a deterministic machine (if a PRNG is used, it does count as input, making its output reproducible), so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).<br> </div> Sun, 27 Apr 2025 17:35:43 +0000
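[The determinism argument above can be made concrete with a few lines of code. This is a minimal sketch in plain Python; the toy "model" is a hypothetical stand-in for trained weights, not any real system, and real inference stacks add caveats such as non-deterministic GPU kernels. The point it illustrates: fix the model, the prompt, and the PRNG seed, and the "generated" output is a pure function of those inputs.]

    import random

    # Toy next-token table standing in for trained weights (hypothetical).
    MODEL = {
        "the": ["cat", "dog", "moon"],
        "cat": ["sat", "ran", "slept"],
    }

    def generate(prompt: str, seed: int, length: int = 3) -> str:
        rng = random.Random(seed)  # the seed is just another input
        tokens = prompt.split()
        for _ in range(length):
            # "Sampling" is only PRNG output applied to fixed weights.
            tokens.append(rng.choice(MODEL.get(tokens[-1], ["."])))
        return " ".join(tokens)

    # Same model + same prompt + same seed: byte-identical output, every run.
    assert generate("the cat", seed=42) == generate("the cat", seed=42)

[Rerunning with a different seed or prompt changes the output, which is exactly why the comment counts the PRNG state as part of the input.]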
Debian debates AI models and the DFSG https://lwn.net/Articles/1019089/ https://lwn.net/Articles/1019089/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; I do not believe any court has ever taken the opinion that a human reading a document, no matter how closely, could infringe copyright simply by the act of reading it, or that listeners to a musical work who have the music stuck in their heads have performed a copy. (Of course if they later write down what they remember, as Mozart supposedly did to Allegri's "Miserere," then that counts as a copy subject to copyright law.)</span><br> <p> Isn't it strange how you have just perfectly described European law as it refers to AI, which has so many authors, song-writers and composers screaming foul.<br> <p> The law has taken the opinion that an AI reading a document, no matter how closely, does not infringe copyright simply by the act of reading (and remembering) it.<br> <p> Of course, if they then later write down what they remember, then that counts as a copy subject to copyright law.<br> <p> In other words, there's no such thing as "Copyright Washing" in European law. The law simply makes no distinction whatsoever between artificial intelligence and human intelligence (or lack thereof!).<br> <p> Cheers,<br> Wol<br> </div> Sun, 27 Apr 2025 16:45:43 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019072/ https://lwn.net/Articles/1019072/ smcv <div class="FormattedComment"> In the past Debian has consistently said that the things that we might or might not accept as Free (DFSG-compliant) are pieces of software, not licenses. So if foo is licensed under GPL-2.0-or-later and so is bar, we might accept foo as Free but say that bar is not Free, for example if some of bar's source code is missing.<br> <p> I believe there was one case in particular where the upstream developer of a piece of software under a BSD/MIT-style license (was it Pine?) had an unusual interpretation of the license and asserted that their software was only redistributable if it was at no cost (free of charge), which would have made it impossible to sell Debian CDs with their software included. Debian responded to this by treating that specific piece of software as non-Free, even though the license it was released under is one that we would usually have accepted.<br> </div> Sun, 27 Apr 2025 15:14:51 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019068/ https://lwn.net/Articles/1019068/ kleptog <div class="FormattedComment"> The fundamental difference is that where a binary program can't be used for anything other than execution, the weights of an ML model contain everything there is about the model and can be used as the input to create new models. They can be studied and modified in useful ways.<br> <p> In many ways, the model is actually a suitable stand-in for the original data. I think that for the vast majority of use cases people are better off taking an existing public model and fine-tuning it to suit their purpose with their own data, than trying to rebuild a model from scratch at great expense.<br> </div> Sun, 27 Apr 2025 11:25:02 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019064/ https://lwn.net/Articles/1019064/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; No, there’s a difference between a human and a machine after all.</span><br> <p> Which is? Or are you arguing that the human brain is magic?<br> <p> There's a lot of people who THINK they know how the brain works. There's a lot of people who are trying to build computer models of their model of how the human brain works. And for the most part the computer model is crap because the mental model bears no resemblance to reality.<br> <p> The classic example: why are people throwing general-purpose compute-heavy hardware at these problems? How much of the brain is single-purpose dedicated hardware? Is it 90%? Something like that, certainly. And those parts talk to other bits of dedicated hardware. That's why my car's "driver assist" features are such a damn nightmare. They're based on general-purpose ideas and hardware, and don't talk to each other. Case in point - in heavy traffic when a car pulls out of my way, my car fails to recognise we're not on a collision course, pretty much comes to a halt until it loses sight of it, then floors the accelerator until a different piece of hardware suddenly realises it's going to crash into the car in front and screams at me to take over and brake!<br> <p> Or adaptive cruise control. Which can't tell the difference between its internal database and a road traffic sign. Which regularly sees signs that clearly don't apply to the road you're on, but applies them anyway. Which completely ignores the driver's control input when it disagrees with the computer database's input. <br> <p> Compare that with a real human. I can recognise a stationary car at a distance. I can tell when the trajectory of my car and the car in front don't intersect. I can identify speed limits and brake lights at a distance. And my actions NOW are informed by what I can see happening in a minute's time.<br> <p> There's only one major difference between a human and a machine. The people trying to build human models aren't talking to the people studying humans, and the computer models therefore bear bugger all resemblance to the human systems they're trying to model. Pretty much par for the course in most human research, sadly ... 
:-(<br> <p> Cheers,<br> Wol<br> </div> Sun, 27 Apr 2025 08:43:07 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019062/ https://lwn.net/Articles/1019062/ mirabilos <div class="FormattedComment"> Basically, the idea for contrib was: if the model requires a proprietary niewida library at runtime, but the model itself is free enough for main, then it goes to contrib instead of non-free.<br> </div> Sun, 27 Apr 2025 06:31:20 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019061/ https://lwn.net/Articles/1019061/ mirabilos <div class="FormattedComment"> Not even a copyleft person. A BSD person, permissive licence. A licence so permissive that it only requires what is commonly called attribution. (And a bit of indemnification in exchange for the gift.)<br> <p> And those TESCREAL models don’t even do that - indeed cannot, in practice. And attribution is a burden so low that not honouring it is a much harsher violation than, say, one point of the GPL.<br> </div> Sun, 27 Apr 2025 06:25:47 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019060/ https://lwn.net/Articles/1019060/ mirabilos <div class="FormattedComment"> It was intended to read that the model trained on non-free data can only go to non-free, not to contrib; contrib is only DFSG-free parts (i.e. parts that could otherwise go to main) that require non-free dependencies (or things outside of Debian) at runtime, as for all packages in contrib (though non-models could also use them at build time, I’m with M. Zhou on not allowing that as the model is a lossy compression of the input).<br> </div> Sun, 27 Apr 2025 06:22:53 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019059/ https://lwn.net/Articles/1019059/ mirabilos <div class="FormattedComment"> No, there’s a difference between a human and a machine after all.<br> </div> Sun, 27 Apr 2025 06:20:51 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019044/ https://lwn.net/Articles/1019044/ geofft <div class="FormattedComment"> I can think of several good reasons to want to have the training data available, e.g.,<br> <p> - Simply requiring it to be published is a good way to incentivize keeping people honest about what it is. (Facebook, for instance, is alleged to have trained Llama-3 against books literally torrented via LibGen, and they attempted to seal parts of the lawsuit because they were embarrassed about the situation becoming public.) Having it be published with the expectation that people might plausibly try to reproduce it, and ask hard questions if they can't, is an even stronger incentive.<br> <p> - If you want to remove something from the "knowledge" of the model, it is basically impossible to do that reliably starting from the trained model, at least given the current state of the art in interpretability, whereas it is conceptually straightforward (though resource-intensive) to just remove that training data and run the training again (see the sketch following this comment).<br> <p> - It may well be the case that a future AI, itself, could do something useful with another model and its training data that would be impractical for a human, e.g., interpretability work to explain some behavior with reference to the training data that induced it. 
That we don't see a use for the full training data now is no reason to decide we won't find a use later.<br> <p> - Starting with the same data but a different training program could yield interesting results, ranging from simply modifying the training program to do the same thing but with more parameters or a different approach to tokenization, to training in a wholly different way.<br> <p> - I do think one of the many arguments for FOSS is that it's good for learning and openness, even independent of the practicality of using the software. FOSS licenses may not have termination clauses or timeouts, which of course is for practical reasons of allowing people to package and redistribute software confidently, but I think we'd all intuitively disapprove of a license that terminated even if it were 100% guaranteed that nobody was ever going to actually run the software again - having the sources around and available is a contribution to the commons.<br> <p> - In any of the weird hypothetical evil-overlord-AI scenarios like https://ai-2027.com/ , it seems pretty clear that having meaningful capacity to build your own AIs will be immensely helpful in fighting back. Of course "the good guys" are going to be at a strong disadvantage for myriad reasons, but not having any access to training data will put them even farther back. In a scenario where a strong enough LLM has started actively concealing its own inner "thoughts" from its output without people noticing, I'm not sure that fine-tuning a trained subversive model is going to be sufficiently helpful to really get it onto your side.<br> <p> Also keep in mind that all the involved hardware is rapidly getting more capable, so I think that an end user might retrain a model they're using from Debian stable or oldstable and find it much more feasible than it would have been for them to do that same retraining at the time that model was uploaded to Debian unstable. (For the same reason, there may also be an effect that the actual type of GPUs used when originally training the model were in high demand at the time but are now no longer top of the line and are easier and cheaper to rent or buy.)<br> </div> Sun, 27 Apr 2025 02:41:57 +0000
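[The "remove the data and retrain" point in the second bullet above is the clearest technical payoff of having the training data, so here is a minimal sketch of its shape. The train() function below is a hypothetical toy, standing in for a real training run that would be vastly more expensive; the structure of the operation is the point: unlearning becomes a corpus filter plus a re-run, rather than surgery on opaque weights.]

    # Toy "training": a word-count model built purely from the corpus
    # (a hypothetical stand-in for a real, expensive training run).
    def train(corpus: list[str]) -> dict[str, int]:
        model: dict[str, int] = {}
        for doc in corpus:
            for word in doc.split():
                model[word] = model.get(word, 0) + 1
        return model

    corpus = ["free software needs sources", "disputed text from somewhere"]
    model_v1 = train(corpus)          # the original model "knows" everything

    # Removal request: filter the corpus, then retrain from scratch.
    cleaned = [doc for doc in corpus if "disputed" not in doc]
    model_v2 = train(cleaned)

    assert "disputed" not in model_v2  # the removed work leaves no trace

[Without the corpus, the only option is editing the trained weights directly, which, as the comment above notes, the current state of the art cannot do reliably.]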
Debian debates AI models and the DFSG https://lwn.net/Articles/1019051/ https://lwn.net/Articles/1019051/ geofft <div class="FormattedComment"> I don't think it really needs to be changed, but I agree with gioele that this does not read like a paradox. Being DFSG-free is an overall state of a package, of which being released under a DFSG-compatible license is a necessary-but-not-sufficient requirement.<br> <p> If you want to change it, I'd suggest replacing "open source license" with "DFSG-compatible license" to contrast with "DFSG-compliant" at the end of the sentence. (Policy also uses "comply with the DFSG" to describe what can go into main and contrib.) "Compatible," to me, means it's possible to use it in a way that fits with the DFSG, but it doesn't mean that it's impossible to use it in another way. If I want to write a POSIX-compliant shell script, it's very helpful to test it against a POSIX-compatible shell, but that by itself is not enough.<br> </div> Sun, 27 Apr 2025 02:00:03 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019050/ https://lwn.net/Articles/1019050/ geofft <div class="FormattedComment"> Right - my understanding is that this is a legal call that no jurisdiction has actually called yet, and Thorsten's proposal is effectively committing Debian to taking a particular view on the question and prematurely binding its own behavior to the strictest possible answer to that question. This feels a little bit like if, in the early days of the Oracle v. Google lawsuit, Debian took the view that in fact the Java API was copyrightable and therefore not only Android but also GNU Classpath, WINE (now a derivative work of NT), and maybe even the Linux kernel itself (now a derivative work of commercial UNIX) ought to be kicked out of the archive. While it would be a little bit nice for free software authors to make life hard for proprietary software authors who want to copy their API designs, it seems clearly good for free software that everyone has taken the view that APIs are not copyrightable (which is actually not how the Oracle v. Google decision ended up - lower courts held that APIs were copyrightable, and the Supreme Court simply ruled that in this particular case, Google's use was fair use).<br> <p> On the subject of human minds, in actual legal systems, I am under the impression that the mind is considered inviolate and special in various ways. For instance, while arguments have been raised that loading files into RAM is performing a copy and therefore could infringe copyright, I do not believe any court has ever taken the opinion that a human reading a document, no matter how closely, could infringe copyright simply by the act of reading it, or that listeners to a musical work who have the music stuck in their heads have performed a copy. (Of course if they later write down what they remember, as Mozart supposedly did to Allegri's "Miserere," then that counts as a copy subject to copyright law.) In another application, in the US, the Fifth Amendment protection against self-incrimination means that written documents can be subpoenaed but the contents of one's mind cannot, and there's an emerging precedent that for the same reasons, police can compel you to unlock a device via biometrics (face/fingerprint/retina) but not via revealing even a short passcode.<br> <p> Debian, of course, is free to bind its behavior more strictly than the law requires, and I think the FOSS community has a history of wanting people to avoid looking too closely at proprietary code out of a general fear that arguments would be made in court about infringing copies, even if the court is not going to rule that a programmer's brain is a derivative work. It would probably be a bad outcome to start treating this as a norm instead of just a defense to have in your quiver, and end up with the rule that someone who has learned from one free software project should be considered tainted when trying to contribute to another free software project under an incompatible license, e.g., that someone who has read glibc's sources too closely cannot send MIT-licensed patches to musl.<br> </div> Sun, 27 Apr 2025 01:46:42 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019043/ https://lwn.net/Articles/1019043/ mb <div class="FormattedComment"> <span class="QuotedText">&gt;and for what purpose? People are just going to use the final model</span><br> <p> Just use the closed source binary? Why would you want to have the source? Just use the binary!<br> </div> Sat, 26 Apr 2025 21:31:11 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019042/ https://lwn.net/Articles/1019042/ kleptog <div class="FormattedComment"> I'm also a little surprised that people supporting copyleft as a way to subvert copyright law simultaneously try to argue copyright maximalism with respect to machine learning models. It feels a lot like "we want to subvert copyright law, but only when it's in our favour".<br> <p> If you believe all software should be free, then believing that all ML models should be free seems the logical next step.<br> <p> As for reproducing models, that seems like a nonstarter, as the first step in training is to initialise with random weights. I guess you could make that deterministic, but the question is why you care. The training set is going to be at least 100 times the size of the final model. Requiring full reproducibility on very large models is going to require distributing a vast amount of data - and for what purpose? People are just going to use the final model weights to build more fine-tuned models; reproducing from the source data just isn't very useful.<br> <p> The impact on the many other places machine learning models have been used for years in software is also something that shouldn't be discounted. This discussion could have a lot more impact than it appears at first glance.<br> </div> Sat, 26 Apr 2025 21:24:19 +0000
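[The random-initialisation objection above is narrower than it first appears: the usual answer is to record the RNG seed and treat it as part of the build inputs, much as reproducible builds pin timestamps. A minimal numpy sketch of that idea follows; it assumes deterministic training code and data ordering, which real distributed training does not always guarantee.]

    import numpy as np

    def init_weights(shape: tuple[int, int], seed: int) -> np.ndarray:
        # A recorded seed makes the "random" initialisation repeatable.
        rng = np.random.default_rng(seed)
        return rng.normal(0.0, 0.02, size=shape)

    w1 = init_weights((4, 4), seed=20250426)
    w2 = init_weights((4, 4), seed=20250426)
    assert (w1 == w2).all()  # same seed, identical starting weights

[The genuinely hard part of the objection is the one the comment names next: shipping the training data, which dwarfs the weights, not the randomness.]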
Debian debates AI models and the DFSG https://lwn.net/Articles/1019032/ https://lwn.net/Articles/1019032/ gioele <div class="FormattedComment"> <span class="QuotedText">&gt; "AI models released under DFSG-compliant license without training data or program" is not seen as DFSG-compliant.</span><br> <span class="QuotedText">&gt;</span><br> <span class="QuotedText">&gt; -&gt; "... under DFSG-compliant ..." is not seen as DFSG-compliant.</span><br> <span class="QuotedText">&gt; </span><br> <span class="QuotedText">&gt; It looks weird as it looks like a paradox.</span><br> <p> It is, however, an already known "paradox": if you release a "GPL executable" but do not provide its source, including «the scripts used to control compilation and installation of the executable», then you are not really complying with the terms of the GPL.<br> </div> Sat, 26 Apr 2025 14:59:54 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019028/ https://lwn.net/Articles/1019028/ lumin <div class="FormattedComment"> <span class="QuotedText">&gt; A few thoughts, starting with the most minor: I am a little surprised the proposal uses the phrasing "open source". 
Debian has traditionally used the phrasing "free software," including in the name of the DFSG, the Debian Social Contract, and Debian policy.</span><br> <p> I realized that problem when writing the text. However, if the proposal says:<br> <p> "AI models released under DFSG-compliant license without training data or program" is not seen as DFSG-compliant.<br> <p> -&gt; "... under DFSG-compliant ..." is not seen as DFSG-compliant.<br> <p> It looks weird, like a paradox. I did not think much further about this wording issue, and replaced it with open source. I think people can understand what I mean there anyway. But you are right, we'd better revise the wording if this is going to land somewhere official.<br> </div> Sat, 26 Apr 2025 14:13:38 +0000 Debian debates AI models and the DFSG https://lwn.net/Articles/1019021/ https://lwn.net/Articles/1019021/ Wol <div class="FormattedComment"> <span class="QuotedText">&gt; The proposal has Debian take the view that "'generative AI' output are derivative works of their inputs (including training data and the prompt)" and so "Any work resulting from generative use of a model can at most be as free as the model itself; e.g. programming with a model from contrib/non-free assisting prevents the result from entering main." In other words, if someone uses GitHub Copilot or Cursor to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated license is on that software, because the underlying models are not DFSG-free.</span><br> <p> Devil's advocate, let me re-write that last sentence ...<br> <p> "In other words, if a student learns by studying proprietary software and uses that experience to help them write some software, then adopting Thorsten's proposal means that Debian cannot consider their software DFSG-free, regardless of what the stated licence is on that software, because the underlying software that code is based on is not DFSG-free".<br> <p> At the end of the day, whether something is derivative (and hence copyright-contaminated) is a legal call. I've just been reading (a day or two ago) about how the brain is a very efficient Bayesian filter. An AI/LLM is also (I guess) a Bayesian filter of some sort. If a human brain fine-tunes the output of an artificial brain, where do you draw the line? How do you know where the line is?<br> <p> Cheers,<br> Wol<br> </div> Sat, 26 Apr 2025 13:37:52 +0000 Debian Deep Learning Team Machine Learning Policy https://lwn.net/Articles/1019023/ https://lwn.net/Articles/1019023/ lumin <div class="FormattedComment"> It is referred to in Appendix C anyway.<br> </div> Sat, 26 Apr 2025 13:34:43 +0000 Tesseract https://lwn.net/Articles/1019006/ https://lwn.net/Articles/1019006/ pabs <div class="FormattedComment"> IIRC the situation with Tesseract has improved to the point where it would meet even the stricter requirements being proposed.<br> </div> Sat, 26 Apr 2025 01:04:09 +0000