Debian debates AI models and the DFSG

Posted Apr 27, 2025 17:35 UTC (Sun) by mirabilos (subscriber, #84359)
In reply to: Debian debates AI models and the DFSG by Wol
Parent article: Debian debates AI models and the DFSG

I’m *completely* ignoring the philosophical part of this (on which I of course have an opinion, but don’t have the spoons to debate here) because, at this point, it fully suffices that copyright law requires a human natural person to create works, and everything else is just mechanical derivation.

“Generative AI” operates like a compiler or a lossy compression/decompression. Its output is fully dependent on its inputs, as it runs on a deterministic machine (if a PRNG is used, it does count as input, making its output reproducible), so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).
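
To illustrate only the reproducibility point (not the legal conclusion), here is a minimal sketch, assuming a toy "generator" that samples words from a fixed vocabulary. The names `VOCAB` and `toy_generate` are hypothetical and do not correspond to any real model or library; the sketch just shows that once the PRNG seed is treated as an input, the same inputs give the same output.

```python
import random

# Hypothetical stand-in for a trained model: a fixed vocabulary.
VOCAB = ["copyright", "model", "output", "input", "derived", "work"]

def toy_generate(prompt, seed, length=5):
    """Sample `length` words; the result depends only on (prompt, seed, length)."""
    # A private Random instance keeps the example independent of global state.
    rng = random.Random(f"{prompt}|{seed}")
    return [rng.choice(VOCAB) for _ in range(length)]

first = toy_generate("hello", seed=42)
second = toy_generate("hello", seed=42)

# Same prompt and same seed: the two runs produce identical word lists.
print(first == second)
```

This is a toy under stated assumptions; it says nothing about whether such an output is a derived work.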

Debian debates AI models and the DFSG

Posted Apr 27, 2025 20:12 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> it fully suffices that copyright law requires a human natural person to create works, and everything else is just mechanical derivation.

Devil's advocate again, but doesn't that mean an AI output is a mechanical derivation of the prompt a human fed in? At which point any copyright (liability) belongs firmly with the user prompting the AI. So all these people saying "AI generated an exact copy of Shakespeare" (or whatever) are clearly fully liable for any copyright violation that may be involved in that reproduction ...

Cheers,
Wol

Debian debates AI models and the DFSG

Posted Apr 27, 2025 20:39 UTC (Sun) by mirabilos (subscriber, #84359) [Link]

It is (but it is also a derivative of the “training data”).

And yes, the responsibility is with the operator. The copyright exception for text and data mining (if not opted out) only applies to analysēs like getting trends, not to generative use (probably not even the genAI summarisation functionality).

Debian debates AI models and the DFSG

Posted Apr 27, 2025 20:39 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (11 responses)

> so the outputs are just mechanical transformations of the input (and thus both a derived work and the provider and operator of the “AI” cannot claim independent copyright on it).

That is simply not how courts define "derivative work," and it's rather obvious to see why. If you run an algorithm over somebody else's creative work, that algorithm could do anything from "return the input unchanged," to "count the words, sentences, paragraphs, and pages, and output all of those numbers," to "always output the number 7."
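
As a rough sketch of that range, here are three hypothetical functions that all "run an algorithm over" the same text. The function names, the sample text, and the crude sentence and paragraph splitting are my own assumptions for illustration (page counting is left out).

```python
def identity(text):
    """Return the input unchanged."""
    return text

def count_stats(text):
    """Count words, sentences, and paragraphs with deliberately crude rules."""
    words = text.split()
    # Treat '.', '!' and '?' as sentence terminators.
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    # Paragraphs are separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "paragraphs": len(paragraphs),
    }

def always_seven(text):
    """Ignore the input entirely."""
    return 7

sample = "The cat sat. The dog barked!\n\nWho won?"
print(identity(sample) == sample)
print(count_stats(sample))
print(always_seven(sample))
```

All three are mechanical transformations of the input, yet they preserve everything, only some statistics, and nothing at all, respectively.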

"Derivative work" is defined differently in different countries (and some countries call it something else entirely), but most definitions are going to at least vaguely resemble the US definition, which in practice says that a work is derivative of another work if the plaintiff (author of or rightsholder to the original) can prove two things:

* Access: The person who created the allegedly infringing work (i.e. the person who pushed the buttons that caused the AI to output it) would have been able to copy the original.
* Substantial similarity: There are significant copyrightable elements in common between the original and the allegedly infringing work, and the number of such elements and degree of similarity is high enough to justify an inference of copying.

Access is usually pretty easy to prove - in most cases, "publicly available" is good enough. For an AI case, you might also need to prove that the work was actually in the training set, but I don't think we have caselaw on that yet. Substantial similarity, on the other hand, is entirely fact-bound and there are no bright lines - the only way to prove it is to put the two works side by side and point to specific individual elements that you say were improperly copied. This also means you must *have* a specific original in mind when you file the lawsuit - you can't just say (e.g. as a class action) "well, it's one of these millions of works that was infringed." That sort of pleading would probably get dismissed even pre-Twiqbal, but nowadays, it would get you laughed out of the courtroom.

Debian debates AI models and the DFSG

Posted Apr 27, 2025 21:15 UTC (Sun) by mirabilos (subscriber, #84359) [Link] (10 responses)

The fallacy is that you look at one work only.

The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.

Of course you need to figure out similarity for each one if you want to sue for infringement.

But we’re not looking at suing for infringement, we’re looking for doing a positive-definition of what is acceptable, and for this, this level of detail is sufficient.

Debian debates AI models and the DFSG

Posted Apr 27, 2025 23:32 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (9 responses)

> The sum of the outputs of a genAI thingy is a derivative work of the sum of (possibly a subset of) the inputs.

The term "derivative work" has a specific legal meaning. If you want to invent your own meaning, I would encourage you to come up with your own word, so that it does not become conflated with the legal meaning that clearly does not apply in this context (substantial similarity is only meaningful in the context of comparing specific individual works, not large classes of works, and without substantial similarity, there is no derivative work, under US law).

> But we’re not looking at suing for infringement, we’re looking for doing a positive-definition of what is acceptable, and for this, this level of detail is sufficient.

This brings us to the broader problem here: Copyright is only about 300 years old, and its contours have changed dramatically in that short time. Most basic principles of law and ethics are far older than that, and have been far more stable in their overall form and function (at least in semi-modern times). Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion. Here is an example I would encourage you to think carefully about:

Suppose someone never registers their copyright, or fails to comply with some other legal requirement. Should their copyright automatically lapse, or even fail to vest in the first place? I think most folks would say that it should not, hence why we got rid of those requirements. But then you ask people another question: Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's? Most folks will tell you that Warner should not own Happy Birthday, and that they did rightfully lose it. But that is inconsistent - either formalities are rightful, and should still apply, or they are wrongful, and Warner should not have lost the copyright. You can resolve this by appealing to the excessive term of modern copyright, but that is dodging the question. I'm not asking what the ideal copyright system would look like - I'm asking for the rightful outcome in this one specific case.

But perhaps you disagree, and think we can resolve it based on the term. That's fine, I have other examples. There are numerous works from the 50's and 60's whose US copyrights were not renewed, mostly serialized pulp fiction that was seldom or never reprinted. Many of them can now be read on the Internet Archive for free. Is it rightful that those authors, some of whom might still be alive, are not paid for the reproduction of their works? Or, as a matter of ethics, should the Internet Archive take down all of those works until such time as their authors can be identified and some sort of payment-in-lieu-of-licence can be worked out?

Or we can look at even more recent developments:

* Is it wrongful to distribute tools that break DRM (because it enables piracy), or is DRM itself wrongful (because people should control their computers)?
* A clickwrap license is a contract you agree to by interacting with some computer software (usually clicking an "I agree" button). Should these contracts be effective? If yes, then EULAs are effective and proprietary software is ethically legitimate. If not, then that includes contracts for the sale of goods, which effectively means that no form of online commerce should be allowed. Is either of those extremes correct, or should we adopt some middle position (and exactly what does that position look like)?
* Is it wrongful for a hobbyist to create a derivative work of some giant corporation's intellectual property, assuming the hobbyist makes at most a small profit from it?
* Is it wrongful for a giant corporation to create a derivative work of some hobbyist's intellectual property, assuming the corporation makes at most a small profit from it?
* Is it OK if the answers to the above two bullets differ, or would that be a logical contradiction?

I want to be clear - the above are rhetorical questions. Replying with specific answers would entirely miss the point. I'm not asking for your specific viewpoint. I'm asking you to consider whether a significant number of people might have a different viewpoint from you, and whether you have any legitimate right to insist that your viewpoint is more ethically correct than theirs. If we can't even agree on these more basic questions of how copyright ought to work, then I submit to you that it is impossible for us to come to agreement on how copyright should interact with generative AI.

TL;DR: The basic ethical premises of copyright are still up for debate, so it is deeply questionable whether we can even have this discussion in the first place.

Debian debates AI models and the DFSG

Posted Apr 28, 2025 7:53 UTC (Mon) by Wol (subscriber, #4433) [Link] (7 responses)

> Should Warner-Chappell now own the copyright to Happy Birthday, or did they rightfully lose it as a result of someone failing to comply with US copyright formalities in the 1930's?

Was it even possible for them to comply lawfully with US copyright formalities in the 1930s? Especially in patents, but also in copyright, how much unrecognised "prior art" is out there?

Cheers,
Wol

Debian debates AI models and the DFSG

Posted Apr 28, 2025 9:39 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (6 responses)

Prior art is not a thing under copyright law. It's only relevant to patent law. There is a general requirement that authorship be "original," but this is a very low bar and not commonly litigated in cases like this (contrast e.g. Feist v. Rural).

Based on the facts that have been publicly disclosed, the problems with "Happy Birthday" were roughly as follows:

* The melody was indisputably in the public domain, having been published in 1893 under a different name ("Good Morning to All") and with a trivial difference in arrangement (one note became two).
* A copyright on the lyrics was registered in 1935.
* In 1922, a copy of the lyrics was published, set to the melody of "Good Morning to All," without a copyright notice. If authorized, this publication would have the effect of forfeiting any copyright before it could even be registered. Warner would later argue in court that this publication was unauthorized, or at least not authorized by the appropriate party.
* There were a number of other arguments raised. The judge considered all of these arguments at summary judgment, and concluded that most of them (including the 1922 publication) would need to go to trial.
* But the judge did find one basis for summary judgment: The sale of the copyright from the Hill sisters (one or both of whom wrote the song) to the Summy Company (which Warner eventually bought) was apparently a bit of a mess. It involved multiple rounds of litigation and three separate agreements, the second of which was missing from the modern record. The judge ruled, as a result, that there was no evidence the Hills had specifically sold the lyric rights to the Summy Company, so Warner lost on that basis.

The question I asked is whether this is a just outcome, which (now that I look more closely) is a bit of a muddle. But you could just as easily imagine an alternative version of events in which the judge instead rules that the 1922 publication caused the song to enter the public domain. That is not an implausible outcome - the judge said it was a triable issue of fact that could have gone either way.

Debian debates AI models and the DFSG

Posted Apr 28, 2025 17:22 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

> Prior art is not a thing under copyright law. It's only relevant to patent law.

It's only *legally* *defined* under patent law.

But if somebody sues me for copyright copying, because I wrote a piece of music with da-da-da-dum, da-da-da-dum, I sure as hell am going to point them at Beethoven as prior art!

Terry Pratchett was accused of ripping off Hogwarts when he created Unseen University. Quite apart from the fact he would have needed a time machine, there's absolutely shit-loads of prior art going back to the 1920s if not the century before about British boarding schools and the like. Bunter, anyone?

Yes, it's not *legally* prior art, but what on earth else are you going to call it?

Cheers,
Wol

Debian debates AI models and the DFSG

Posted Apr 28, 2025 17:34 UTC (Mon) by pizza (subscriber, #46) [Link] (4 responses)

> Yes, it's not *legally* prior art, but what on earth else are you going to call it?

Methinks you need to bone up on the distinction between "ideas" and "specific expressions".

Debian debates AI models and the DFSG

Posted Apr 28, 2025 20:26 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

I think various Judges and plaintiffs do, too.

It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.

What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.

Even worse when an AI does it - I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.

I think that's a major problem with a lot of IP at the moment ...

Cheers,
Wol

Debian debates AI models and the DFSG

Posted Apr 30, 2025 16:37 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (2 responses)

> It's all very well a plaintiff saying "you nicked my ideas", but when said plaintiff has clearly nicked the same idea from somewhere else (of which they may not be aware - music would be a classic case), then there's a problem.

Ideas are categorically exempt from copyright protection in the US (and most of the world). See for example 17 USC 102(b), and see also case law such as Baker v. Selden.

> What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.

The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it. If you can show that the plaintiff copied the specific expression from elsewhere, then under US law, this has two rather significant consequences (both laid out in 17 USC 103):

1. They don't own that specific expression, because it's not part of "the material contributed by the author of such work."
2. If the copying was unlawful (the original was copyrighted and they had no license), then that entire part of the work is unprotected by copyright.

You cannot demand that the plaintiff give you an accounting of every place where it possibly could have come from and exhaustively prove that it is entirely original, because that would be plainly unworkable. Instead, you as the defendant have to find the original work (possibly through discovery), introduce it at trial, and argue that it precludes the plaintiff from owning the specific expression.

You can argue that the copied expression is a "scène à faire," literally meaning "a scene that must be done." The original basis for this was genre fiction, in which (for example) a mystery novel simply must have a scene at the end where the detective (who also must exist) explains who committed the crime, how and why they did it, and what clues led the detective to that conclusion. If you don't have that scene, it's not a "real" mystery novel, it's some other genre, and since nobody can be allowed to own the mystery genre as a whole, nobody can own that type of scene either.

In the programming context, scènes à faire include constructs like for(i = 0; i < max; i++){...}. Nobody can be allowed to own that, because you can't reasonably write a (large, complex) program in a C-like language and never write a loop that looks like that.

> I gather there's been a "situation" recently where YouTube has misattributed Schubert's "the trout" (even worse, as played by a washing machine), and accused a whole bunch of people of copyright violations, taken a large chunk of their commissions, and given it to somebody who has no connection with Schubert or the trout.

Disclaimer: I work for Google, not as a copyright lawyer, and so I can't speak on their behalf. The following is my personal interpretation of YouTube's behavior, based on public information and reasonable inference.

YouTube's copyright system is primarily designed to protect YouTube from getting sued, and secondarily designed to discourage users from suing each other. You cannot assume that the outcomes you see on YouTube are necessarily what a court of law would have done, because YouTube does not have the power to make binding rulings on whether X is a copyright infringement of Y. So they're stuck making rulings on the basis of whether YouTube can plausibly be sued, which leads to all sorts of undesirable-but-unavoidable biases in favor of the copyright holder (many of them explicitly codified into law, e.g. in 17 USC 512 and similar laws in other jurisdictions). Of course, anyone dissatisfied with YouTube's handling of an issue remains free to take it to the "real" court system under a variety of legal theories (slander of title, tortious interference, conversion of revenues, 17 USC 512(f) misrepresentation, etc.).

I will agree that the legal system generally does a poor job of producing just and efficient outcomes in this space. There is a reason that both YouTube and its users strongly prefer to avoid going to court. But scènes à faire has nothing to do with this. What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here). Unfortunately, modern copyright law was simply not designed under the assumption that services like YouTube might exist, and updating it has proved difficult (especially in the modern US political system where Congress can barely agree to keep the government open). A few years ago, the YouTuber Tom Scott made an excellent ~43 minute video explaining the broader problem in detail, which I found highly informative and more entertaining than you might expect: https://www.youtube.com/watch?v=1Jwo5qc78QU

Debian debates AI models and the DFSG

Posted Apr 30, 2025 18:35 UTC (Wed) by Wol (subscriber, #4433) [Link]

> What you describe sounds like outright fraud (taking the facts as you have described them and assuming there's nothing else going on here).

It is, but who's responsible? As far as I know, the guy receiving the improper royalties could well be completely unaware of the source of the royalties. He uploaded a piece of music, which he just happened to record at the same time his washing machine finished its cycle, and YouTube's automated systems assumed it was his copyright. Whoops!

So if YouTube wants to avoid being sued, somebody who's prepared to take the risk will probably take them to the cleaners ... (rather appropriate seeing as it's a washing machine rofl)

Cheers,
Wol

Debian debates AI models and the DFSG

Posted Apr 30, 2025 18:42 UTC (Wed) by Wol (subscriber, #4433) [Link]

> > What's to stop a plaintiff claiming you stole their "specific expression" (given that there are not *too* many "specific expressions" if you're talking small fragments) when they're unaware of where they got it from.

> The "specific expression" refers to the actual words, images, or sounds used to convey something, not the broader concept of it.

Again, I'm thinking of a particular example. I don't know the outcome, but some musician sued saying another musician had "copied his guitar riff". Given that a riff is a chord sequence, eg IV V I, the shorter the riff the more likely it is another musician either stumbled on it by accident, or it's actually a common sequence in a lot of music. If I try and play a well-known piano sequence on the guitar, chances are I'll transpose the key and it'll sound very like someone else's riff that I may never even have heard ...

(I am a guitar player, but classical, so I don't play riffs ... :-)

Again, this is a big problem with modern copyright law where so much may be - as you describe it - a "scène à faire", but people don't recognise it as such.

Cheers,
Wol

Aside on the legal versus common definition of "theft"

Posted Apr 28, 2025 8:13 UTC (Mon) by farnz (subscriber, #17727) [Link]

> Just about everyone agrees on the basic definition of "stealing" a physical object, and that it is (in general) a wrongful act, but theft has been a thing since antiquity. If you're going to lean on the ethical side of this instead of the legal side, you're quickly going to find that there is much less consensus on what should be considered right and wrong than you reasonably need to have this sort of discussion.

As an aside, and this only bolsters your point, I live in a jurisdiction (England and Wales) where plenty of things that are obviously theft when you look at them don't quite reach the legal bar. The requirements here for theft are that you appropriated property, that it belonged to someone else, that you appropriated it dishonestly, and that you intended to permanently deprive the owner of that property. So, for example, if I take your lawnmower at the beginning of summer to mow my lawn, with intent to return it to you when winter starts and it's too cold for grass to grow, I've not committed theft, in law.

We have had to introduce specific laws for cases like taking someone's car for a joyride without permission, refusing to return a credit to your bank account that was made in error, or leaving without paying when you owe payment on the spot, precisely because the legal definition of theft is too narrow to cover these cases, even though most of us would agree that they were "theft" in a colloquial sense.

Debian debates AI models and the DFSG

Posted May 6, 2025 10:42 UTC (Tue) by aigarius (subscriber, #7329) [Link]

That is not entirely correct. For example, courts have ruled that a fully automated process creating a thumbnail of an image for a search engine to show in search results is fair use. Copyright law does *not* really make much of a distinction between human outputs and automated software outputs. The considerations on copyright and fair use are not reliant on that distinction at all. Me resizing an image in GIMP has the same effect (both technical and legal) as a script doing the same to an uploaded image automatically.