
Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 4:23 UTC (Mon) by mirabilos (subscriber, #84359)
Parent article: Debian dismisses AI-contributions policy

Ugh. I had not seen this.

I’m appalled.

As FOSS contributors, we are DIRECTLY damaged by ML/LLM (so-called “AI”) operators violating our copyrights and licences.

So, no, a wait-and-see approach here is nowhere near discouraging enough.

I am totally in favour of, say, GotoSocial’s approach of adding a little checkbox for submitters to confirm they’ve not used AI sludge to form parts of the submission.

Any use of those tools is totally unacceptable, in multiple ways:
- their inputs were obtained without consent, often illegally, and even where not, mostly scraped, using up lots of resources
- they were pre-filtered by exploiting cheap labour in the global south, leaving people there with deep psychological problems and nowhere near suitable compensation (if such a thing is even possible)
- training and running these models uses so many resources, not just electricity but also, incidentally, water, that it is environmentally unacceptable
- their outputs are mechanical combinations of their inputs and therefore derivative works, but the models are incapable of producing correct attribution and licensing or of retaining correct copyright information (and yes, this is proven by now: it is possible to extract large amounts of “training data” verbatim)
- they are run by unscrupulous commercial proprietary exploiters and VC sharks and former “blockchain” techbros

I’ve been utterly shocked that the OSI fell, slightly shocked that Creative Commons fell, but I’m totally appalled now.

AI sludge definitely is not, and will never be, acceptable for contribution to MirBSD, and I can say that with confidence.



Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 9:05 UTC (Mon) by NYKevin (subscriber, #129325) (56 responses)

> their outputs are mechanical combinations of their inputs and therefore derivative works,

Dear LWN staff:

Can we please have an article explaining substantial similarity analysis (and similar concepts in other parts of the world)? I have seen this fallacy repeated over and over again in these comments, suggesting that we could all use a refresher on what a "derivative work" is and how it is actually determined in real courts of law. It could also be worth going over the abstraction-filtration-comparison test, but that might just be too deep into the weeds for a non-lawyer audience.

Anyway, to save you all a whole bunch of searching: The process used to create the allegedly infringing work is mostly irrelevant to the legal analysis. What matters is how similar that particular work is to the alleged original. It's a very fact-intensive inquiry that must be done on a per-work basis. In other words, you have to show that each output is infringing, one by one, by comparing it with the particular training data that it allegedly infringes on.

A blanket statement that "all" works produced in a certain fashion are derivative works is really only believable when the process, by its very nature, will always generate outputs which closely match their inputs (e.g. lossy encodings like JPEG, MP3, etc.), and so it is always straightforward to show that each particular output is similar to its corresponding input. But you can't do that with LLMs, because there is no such thing as a "corresponding input." Instead, you have to go through the painstaking process that The New York Times's lawyers went through, when they sued OpenAI for allegedly infringing on their articles. The NYT did not simply say "our articles were used as training material, so everything the AI outputs is derivative," because they would have been laughed out of court with an argument like that. Instead, they coaxed the LLM into reproducing verbatim or near-verbatim passages from old NYT articles, then sued for those particular infringing outputs.

For completeness, OpenAI's position, as far as I understand it, is that this sort of regurgitation is a rare bug, and also the NYT's "coaxing" was too aggressive and did not match the way normal users interact with the service (in fact, OpenAI has said that the prompts also contained large portions of the original articles, and so the AI may have simply mirrored the writing style of those prompts in order to complete them). They are still in litigation, but I predict a settlement at some point, so we might not get to see a court rule on those arguments.

My point is this: Yes, some outputs probably are infringing on some training inputs, at least for LLMs where regurgitation has been demonstrated. That is a far cry from all or even most outputs. We do not know where the law is going to come down on models (there are a lot of unanswered questions about how you even apply similarity analysis to a model in the first place), but for outputs, it is hard to believe that all or even most outputs have a corresponding training input (or small collection of training inputs) that they are very similar to.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 10:06 UTC (Mon) by bluca (subscriber, #118303)

This is a very nice idea, and I too think it would help a lot.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 10:18 UTC (Mon) by mb (subscriber, #50428) (51 responses)

Ok, I got it.

So, it is possible to write a non-AI obfuscator that takes Copyrighted $INPUT and makes Public Domain $OUTPUT from it by transforming it often enough that the original $INPUT is not recognizable anymore.
One just has to ensure that $OUTPUT cannot be traced back to $INPUT; the transformation algorithm just has to be complicated enough.

It doesn't matter that $OUTPUT obviously came from $INPUT: if I put $INPUT into a complicated algorithm and get $OUTPUT, then $OUTPUT is not a Derived Work w.r.t. Copyright law.

Right?
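A toy version of such an obfuscator (purely hypothetical, for illustration) takes only a few lines:

```python
import hashlib

def obfuscate(source: str) -> str:
    """Mechanically rewrite text so the original is no longer textually
    recognizable: replace every token with an opaque hash-derived name."""
    return " ".join(
        "w" + hashlib.sha256(tok.encode()).hexdigest()[:8]
        for tok in source.split()
    )

laundered = obfuscate("copyrighted $INPUT text")
# Deterministic and textually unlike the input --
# but is it therefore no longer a derived work?
```

The output bytes no longer resemble the input at all; the question is whether that mechanical fact matters legally.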

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 10:36 UTC (Mon) by farnz (subscriber, #17727) (48 responses)

Wrong.

The law barely cares at all about how you transform the input to get the output; that's in the category of "insignificant shenanigans", in the same sense that your route to and from work, and whether you drive one make of car or another, is insignificant when looking at a speeding ticket.

The law cares about precisely two things when deciding whether a given work is derivative or not:

  1. Did anyone or anything involved in creating the new work ever have access to the original? For example, I've read the Newsflesh series of books, and if I'm giving you advice on writing a story, which you translate and relay to a final author, there's enough access to those books that the final author could be creating a derived work of the Newsflesh series. If there's no path for copying, then you're in "clean room" reimplementation territory, and are not infringing copyright.
  2. Following something like the abstraction-filtration-comparison test, is the alleged derived work sufficiently closely related to the original that it's clearly derived, or is it sufficiently different to be an independent or transformative work?

If $OUTPUT obviously came from $INPUT, applying the AFC test, then you've got yourself a derived work. If $OUTPUT is transformative or independent, even if the algorithm had access to $INPUT, then it's not a derived work.
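Stated as code, that two-step structure might look like the following sketch (hypothetical pseudo-logic; `filter_unprotected` and the 0.5 threshold are invented stand-ins, not real legal standards):

```python
# Hypothetical sketch of the two-step inquiry described above; the
# filtration set and similarity threshold are invented illustrations.

COMMONPLACE = {"the", "a", "if", "return", "int", "for"}  # stand-in for
# unprotectable ideas, scenes a faire, and standard idioms

def filter_unprotected(text: str) -> set[str]:
    """'Filtration' step: strip elements copyright does not protect."""
    return {tok for tok in text.lower().split() if tok not in COMMONPLACE}

def substantially_similar(a: str, b: str) -> bool:
    """'Comparison' step: crude Jaccard overlap of what remains."""
    fa, fb = filter_unprotected(a), filter_unprotected(b)
    if not fa or not fb:
        return False
    return len(fa & fb) / len(fa | fb) > 0.5

def is_derived_work(new_work: str, original: str, had_access: bool) -> bool:
    if not had_access:  # step 1: clean room => cannot be a derived work
        return False
    return substantially_similar(new_work, original)  # step 2: AFC-style
```

Note that nothing in this sketch inspects *how* `new_work` was produced; only access and the end result enter into it, which is exactly the point.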

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 10:48 UTC (Mon) by mb (subscriber, #50428) (46 responses)

> Wrong.

Ok, but you conclude with:

> If $OUTPUT is transformative or independent, even if the algorithm had access to $INPUT, then it's not a derived work.

So, why is that apparently only true for an AI algorithm and not my non-AI obfuscator?
I'm confused.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 10:51 UTC (Mon) by farnz (subscriber, #17727) (45 responses)

> So, why is that apparently only true for an AI algorithm and not my non-AI obfuscator?
> I'm confused.

It's equally true for both; you have not given a source for why you believe that an AI will be treated differently. If the output is, in the copyright sense, transformative or independent of the input, then it's not an infringement. If it's derived from the input, then it's infringing.

And that applies no matter what the steps in the middle are. They can be an AI algorithm, a human working alone, some sort of obfuscation, anything. The steps don't matter; only the inputs and the output matter.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 11:00 UTC (Mon) by mb (subscriber, #50428) (44 responses)

>It's equally true for both

Ok. I'm fine with that explanation.
AI programs and non-AI programs are both equally capable of transforming Copyrighted work into Public Domain. That makes sense.

>you have not given a source for why you believe that an AI will be treated differently

This discussion is the current source.

But the claim that AI is some kind of magic Copyright remover comes up over and over again. And it simply doesn't make sense if it's not equally true for conventional algorithms.
You now explained that it's true for both. Which makes sense.

But that makes me come to the conclusion that Copyright actually is completely useless these days.
I'm fine with that, though. I publish mostly under permissive licenses these days, because I don't really care anymore what people do with the code. I would publish into the Public Domain, if I could.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 11:04 UTC (Mon) by farnz (subscriber, #17727) (41 responses)

I've not seen anyone other than you in the current discussion claiming that AI is a magic copyright remover; the closest I can see are some misunderstandings of what someone else was saying, and the (correct) statement that in the EU, copyright does not restrict you from feeding a work into an algorithm (but the EU is silent on what copyright implications there are to the output of that algorithm).

So, given that you brought it into discussion, I'd like to know where you got the idea that AI is a magic copyright remover from, so that I can consider debunking it at the source, not when it gets relayed via you.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 11:14 UTC (Mon) by mb (subscriber, #50428) (23 responses)

>I've not seen anyone

Come on. Look harder.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:26 UTC (Mon) by farnz (subscriber, #17727) (22 responses)

I've read every comment on this article, and you are the only person who claims that AI is a "magic copyright removal" tool. Nobody else makes that claim that I can see - a link to a comment where someone other than you makes that claim would be of interest.

The closest I can see is this comment, which could be summarized as: it's already hard to enforce open source licensing in cases where literal copying can be proven, and making it harder to show infringement is going to make it harder to enforce copyright.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:57 UTC (Mon) by mb (subscriber, #50428) (21 responses)

>Nobody else makes that claim that I can see

Come on. It's even written in the article itself. Did you read it?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 14:21 UTC (Mon) by farnz (subscriber, #17727) (20 responses)

I did, and I do not see any claim that AIs are "magic copyright removal tools"; I see claims that AIs can be used to hide infringement, but not that their output cannot be a derived work of their inputs.

Indeed, I see the opposite - people being concerned that someone will use an AI to create something that later causes problems for Debian, since it infringes copyrights that then get enforced.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:05 UTC (Mon) by mb (subscriber, #50428) (19 responses)

Ok, let me quote one [1] of the things:

> He specified "commercial AI" because ""these systems are copyright laundering machines""
> that abuse free software

[1] I'm not going to search the internet for you.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:09 UTC (Mon) by farnz (subscriber, #17727) (18 responses)

That's not a claim that AIs are magical copyright removing tools; that's a claim that AIs hide infringement - in English, if something is a "laundering machine", it cleans away the origin of something, but doesn't remove the fact that it was originally dirty.

Again, I ask you to point to where you're getting your claim from. This is now the third time I've asked you to identify the source; each time I've been pointed to things that don't support the idea that AIs are "magical copyright removal machines", and I've had you insult my reading ability because I dare to question you.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:13 UTC (Mon) by mb (subscriber, #50428) (17 responses)

I'm sorry. This is getting silly.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:19 UTC (Mon) by farnz (subscriber, #17727) (16 responses)

The whole reason I'm asking is that "AI as magic copyright removal machine" is a falsehood, and you've clearly picked that up from somewhere; rather than trying to correct people who pick it up from the same source, I'd like to go back to the source and correct it there.

Now, if it's a simple misunderstanding of what the article says, there's not a lot that can be done there - English is a horrorshow of a language at the best of times - but if you've picked it up from something someone else has actually claimed, that claim can be corrected at source.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:32 UTC (Mon) by mb (subscriber, #50428) (15 responses)

>and you've clearly picked that up from somewhere

No.
Just like I call my washing machine the "magic dirt removal machine", I call AI source code transformers "magic copyright removal machines".

And I really don't understand why you insist so hard that there is anything wrong with that. Because there isn't.
It's shorthand for pointing out the fact that something that had been there in the input is no longer there in the output after a processing step.

I understand that you don't like these words, for whatever reason, but that's not really my problem to solve.

Thanks a lot for trying to correct me. I learnt a lot in this discussion. But I will continue to use these words, because in my opinion these words describe very well what actually happens.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:40 UTC (Mon) by bluca (subscriber, #118303) (13 responses)

> It's shorthand for pointing out the fact that something that had been there in the input is no longer there in the output after a processing step.

The point is, that's not a fact at all, as it has already been explained many times.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:47 UTC (Mon) by mb (subscriber, #50428) (12 responses)

Ok. So the output from an LLM that was trained (input) on GPL'ed software is still GPL'ed?
If not, then it has been removed (laundered).

You guys have to decide on something. Both can't be true. There is nothing in-between. There is no such thing as "half-GPLed".
This is not Schrödinger's LLM.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:32 UTC (Mon) by bluca (subscriber, #118303) (10 responses)

The license of the training material is completely irrelevant with regards to building and training a LLM. Data mining on publicly available datasets is not performed under whatever license the dataset had, but under the copyright exception granted by the law, which trumps any license you might attach to it. Fun fact: tons of code on Github is published without a license _at all_, and thus is effectively proprietary, as that's obviously the default absent a license. Guess what, such a repository can still be data mined for machine learning training purposes (unless the repository owner ticked the opt-out checkbox), just like any other publicly available dataset.

Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:53 UTC (Mon) by mb (subscriber, #50428) (5 responses)

> Data mining on publicly available datasets is not performed under whatever license the dataset had,
> but under the copyright exception granted by the law, which trumps any license you might attach to it.

Ok. Got it now. So

>the fact that something that had been there in the input is no longer there in the output after a processing step.

is true after all.
The input was copyright protected and the special exception made it non-copyright-protected because of reasons.
And for whatever strange reason that only applies to AI algorithms, because the EU says so.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:15 UTC (Mon) by farnz (subscriber, #17727)

> > the fact that something that had been there in the input is no longer there in the output after a processing step.
>
> is true after all. The input was copyright protected and the special exception made it non-copyright-protected because of reasons. And for whatever strange reason that only applies to AI algorithms, because the EU says so.

No, this is also false.

Copyright law says that there are certain actions I am capable of taking, such as making a literal copy, or a "derived work" (a non-literal copy), which the law prohibits unless you have permission from the copyright holder. There are other actions that copyright allows, such as reading your text, or (in the EU) feeding that text as input to an algorithm; they may be banned by other laws, but copyright law says that these actions are completely legal.

The GPL says that the copyright holder gives you permission to do certain acts that copyright law prohibits as long as you comply with certain terms. If I fail to comply with those terms, then the GPL does not give me permission, and I now have a copyright issue to face up to.

The law says nothing about the copyright protection on the output of the LLM; it is entirely plausible that an LLM will output something that's a derived work of the input as far as copyright law is concerned, and if that's the case, then the output of the LLM infringes. Determining if the output infringes on a given input is done by a comparison process between the input and the output - and this applies regardless of what the algorithm that generated the output is.

Further, this continues to apply even if the LLM itself is not a derived work of the input data; it might be fine to send you the LLM, but not to send you the result of giving the LLM certain prompts as input, since the result of those prompts is derived from some or all of the input in such a way that you can't get permission to distribute the resulting work.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:15 UTC (Mon) by bluca (subscriber, #118303) (2 responses)

> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.

No, because what you are stubbornly refusing to understand, despite it having been explained a lot of times, is:

> Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.

This is a legal matter, not a programming one. The same paradigms used to understand software cannot be used to try and understand legal issues.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:33 UTC (Mon) by mb (subscriber, #50428) (1 response)

> The same paradigms used to understand software cannot be used to try and understand legal issues.

Yes. That is the main problem. It does not have to make logical sense for it to be "correct" under law.

> stubbornly

I am just applying logical reasoning. The logical chain obviously is not correctly implemented. Which is often the case in law, of course. Just like the logical reasoning chain breaks if the information goes through a human brain. And that's Ok.

I'm just saying that people here claiming things like "it's *obvious* that LLMs are like this and that w.r.t. Copyright" are plain wrong. Nothing is obvious in this context. It's partly counter-logical and defined with contradicting assumptions.

But that's Ok, as long as a majority agrees that it's fine.
But that doesn't mean I personally have to agree. Copyright is a train wreck and it's only getting worse and worse.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 5:11 UTC (Tue) by NYKevin (subscriber, #129325)

I hesitate to wade into a thread that looks like it has long since become unproductive, but at this point, I think it might be helpful to remember that copyright is a "color" in the sense described at https://ansuz.sooke.bc.ca/entry/23.

Unfortunately, while that article is very well-written and generally illuminates the right way to think about verbatim copying, it can be unintentionally misleading when we're talking about derivative works. The "colors" involved in verbatim copying are relatively straightforward - either X is a copy of Y, or it is not, and this is purely a matter of how you created X. But when we get to derivative works, there are really two* separate components that need to be considered:

- Access (a "color" of the bits, describing whether the defendant could have looked at the alleged original).
- Similarity (a function of the bits, and not a color)

The problem is that if you've been following copyright law for some time, you might be used to working in exclusively one mode of analysis at a time (i.e. either the "bits have color" mode or the "bits are colorless" mode). But access is a colored property, and similarity is a colorless one. You need to be prepared to combine both modalities, or at least to perform each of them sequentially, in order to reason correctly about derivative works. You cannot insist that "it must be one or the other," because as a matter of law, it's both.

* Technically, there is also the third component of originality, but that only matters if you want to copyright the derivative work, which is an entirely different discussion altogether. That one is also a "color" which depends on how much human creativity has gone into the work.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:18 UTC (Mon) by mirabilos (subscriber, #84359)

> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.

No, wrong.

They’re copyright-protected, but *analysing* copyright-protected works for text and data mining is an action permitted without the permission of the rights holders.

See my other post in this subthread. This limitation of copyright protection does not extend to doing *anything* with the output of such models.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:00 UTC (Mon) by mirabilos (subscriber, #84359) (3 responses)

Fun that you mention data mining: in preparation for a copyright and licence workshop I’m holding at $dayjob, I re-read that paragraph as well.

Text and data mining is opt-out, and the opt-out must be machine-readable. But this limitation of copyright only applies to automated analyses of works to obtain information about patterns, trends and correlations.

(I grant that creating an LLM model itself probably falls under this clause.)

But not only must the copies of works made for text and data mining be deleted as soon as they are no longer necessary for this (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions); this limitation also does not allow you to reproduce the output of such models.

Text and data mining is, after all, only permissible to obtain “information about especially patterns, trends and correlations”, not to produce outputs as genAI does.

Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:44 UTC (Mon) by bluca (subscriber, #118303) (2 responses)

> But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary for this (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions),

It doesn't say that it must be deleted, it says:

> Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.

Not quite the same thing. I don't know whether it's true that verbatim copies of training data are actually stored as you imply, as I am not an ML expert - it would seem strange and pointless, but I don't really know. But even assuming that was true, if that's required to make the LLM work, then the regulation clearly allows for it.

> it also does not allow you to reproduce the output of such models.

Every LLM producer treats such instances as bugs to be fixed. And they are really hard to reproduce, judging from how contrived and tortured the sequence of prompts needs to be to make that actually happen. The NYT had to basically copy and paste portions of their own articles into the prompt to make ChatGPT spit them back, as shown in their litigation vs OpenAI.

> Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.

And yet, the NYT decided to sue in the US, where the law is murky and based on fair use case-by-case decisions, rather than in the EU where they have an office and it would have been a slam dunk, according to you. Could it be that you are wrong? It's very easy to test it, why don't you sue any of the companies that publishes an LLM and see what happens?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:16 UTC (Mon) by mirabilos (subscriber, #84359) (1 response)

Yes, they treat that as bugs PRECISELY because they want to get away with illegal copyright laundering.

Doesn’t change the fact that it is possible, and in sufficient amounts to consider that the models contain enough of their input works for the output to be a mechanically produced derivative of them.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:57 UTC (Mon) by bluca (subscriber, #118303)

> Yes, they treat that as bugs PRECISELY because they want to get around to do illegal copyright laundering.

They are treated as bugs because they are bugs, despite the contrived and absurd ways that are necessary to reproduce them. Doesn't really prove anything.

> Doesn’t change the fact that it is possible, and in sufficiently amount to consider that models contain sufficient amounts from their input works for the output to be a mechanically produced derivative of them.

Illiterate FUD. Go to court and prove that, if you really believe that.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:58 UTC (Mon) by farnz (subscriber, #17727)

The output from the LLM is almost certainly GPLed in as far as the output from the LLM is (per copyright law) a derived work of the GPLed input. The complexity is that not all LLM outputs will be a derived work as far as copyright law is concerned, and where they are not derived works, there is no copyright, hence there is nothing to be GPLed.

And that's the key issue - the algorithm between "read a work as input" and "write a work as output" is completely and utterly irrelevant to the question of "does the output infringe on the copyright applicable to the input?". That depends on whether the output is, using something like an abstraction-filtration-comparison test, substantially the same as the input, or not.

For example, I can copy if (!ret) { if (ret == 1) ret = 0; goto cleanup; } directly from the kernel source code into another program, and that has no GPL implications at all, even though it's a literal copy-and-paste of 5 lines of kernel code that I received under the GPL. However, if I copy a different 5 lines of kernel code, I am plausibly creating a derived work, because I'm copying something relatively expressive.

This is why both can be true; as a matter of law, not all copying is copyright infringement, and thus not all copying has GPL implications when the input was GPLed code.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:42 UTC (Mon) by Wol (subscriber, #4433)

The reason those words are bad is because "launder" - in legal terms - is used to describe illegal activity.

So to describe an AI as a "magical copyright laundering machine" is to admit / claim that it's involved in illegal activity.

Cheers,
Wol

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 12:26 UTC (Mon) by anselm (subscriber, #2796) (16 responses)

> the EU is silent on what copyright implications there are to the output of that algorithm

Elsewhere in copyright law, the general premise is that copyright is only available for works which are the “personal mental creation” of a human being. Speciesism aside, something that comes out of an LLM is obviously not the personal mental creation of anyone, and that seems to take care of that, even without the EU pronouncing on it in the context of training AI models.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:26 UTC (Mon) by kleptog (subscriber, #1183) (4 responses)

> Elsewhere in copyright law, the general premise is that copyright is only available for works which are the “personal mental creation” of a human being. Speciesism aside, something that comes out of an LLM is obviously not the personal mental creation of anyone, and that seems to take care of that, even without the EU pronouncing on it in the context of training AI models.

LLMs are prompted, they don't produce output out of thin air. Therefore the output is the creation of the human that triggered the prompt. Now whether that person was pressing buttons on a device that sent network packets to a server that processed all those keystrokes into a block of text to be sent to an LLM in the cloud is irrelevant. Somewhere along the way a human decided to invoke the LLM and controlled which input to send to it and what to do with the output. That human being is responsible for respecting copyright. Whether the output is copyrightable depends mostly on how original the prompt is.

The idea that LLM output cannot be copyrighted is silly. That would be like claiming that documents produced by a human typing into LibreOffice cannot be "the personal mental creation of anyone". LLMs, like LibreOffice, are tools, nothing more. There's a human at the keyboard who is responsible. Sure, most of the output of an LLM isn't going to be original enough to be copyrightable, but that's quite different from saying *all* output from LLMs is not copyrightable.

As with legal things in general, it depends.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:54 UTC (Mon) by mb (subscriber, #50428) [Link] (1 responses)

>LLMs are prompted, they don't produce output out of thin air.
>Therefore the output is the creation of the human that triggered the prompt.

Ok, so if I enter wget into my shell prompt to download some copyrighted music, it makes me the creator?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 14:18 UTC (Mon) by farnz (subscriber, #17727) [Link]

You are the creator of that copy, and in as far as there is anything copyrightable in creating that copy, you own that copyright.

However, that copy is (in most cases) either an exact copy of an existing work, or a derived work of an existing work; if it's an exact copy, then there is nothing copyrightable in the creation of the copy, so you own nothing.

If it's a derived work, then you own copyright in the final work thanks to the creative expression you put in to create the copy, but doing things with that work infringes the copyright in the original work unless you have appropriate permission from the copyright holder on the original work, or a suitable exception in copyright law.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 21:51 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (1 responses)

😹😹😹

> LLMs are prompted, they don't produce output out of thin air. Therefore the output is the creation of the human that triggered the prompt.

This is ridiculous. The “prompt” is merely a tiny parametrisation of a query that extracts from the huge database of (copyrighted) works.

Do read the links I listed in https://lwn.net/Comments/973578/

> The idea that LLM output cannot be copyrighted is silly.

😹😹😹😹😹😹😹

You’re silly.

This is literally enshrined into copyright law. For example:

> Werke im Sinne dieses Gesetzes sind nur persönliche geistige Schöpfungen.

“Works as defined by this [copyright] law are only personal intellectual creations that pass the threshold of originality.” (UrhG §2(2))

Wikipedia explains the “personal” part of this following general jurisprudence:

> Persönliches Schaffen: setzt „ein Handlungsergebnis, das durch den gestaltenden, formprägenden Einfluß eines Menschen geschaffen wurde“ voraus. Maschinelle Produktionen oder von Tieren erzeugte Gegenstände und Darbietungen erfüllen dieses Kriterium nicht. Der Schaffungsprozeß ist Realakt und bedarf nicht der Geschäftsfähigkeit des Schaffenden.

“demands the result of an act from the creative, form-shaping influence of a human: mechanical production or things or acts produced by animals do not fulfill this criterium (but legal competence is not necessary).” (<https://de.wikipedia.org/wiki/Urheberrecht_(Deutschland)#Schutzgegenstand_des_Urheberrechts:_Das_Werk>)

So, yes, LLM output cannot be copyrighted (as a new work/edition) in ipso.

And to create an adaption of LLM output, the human doing so must not only invest significant *creativity* (not just effort / sweat of brow!) to pass threshold of originality, but they also must have the permission of the copyright (exploitation rights, to be precise) holders of the original works to do so (and, in droit d’auteur, may not deface, so the authors even if not holders of exploitation rights also have something to say).

This has gone on for a while

Posted May 13, 2024 22:24 UTC (Mon) by corbet (editor, #1) [Link]

While this discussion can be seen as on-topic for LWN, I would also point out that we are not copyright lawyers, and that there may not be a lot of value in continuing to go around in circles here. Perhaps it's time to wind it down?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:40 UTC (Mon) by farnz (subscriber, #17727) [Link] (1 responses)

That's looking at the other end of it - the question here is not whether an LLM's output can be copyrighted, but whether an LLM's output can infringe someone else's copyright. And the general stance elsewhere in copyright law is that the tooling used is irrelevant to whether or not a given tool output infringed copyright on that tool's inputs. It might, it might not, but that depends on the details of the inputs and outputs (and importantly, not on the tool in question).

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 14:55 UTC (Mon) by Wol (subscriber, #4433) [Link]

I think there might also be a problem here caused by "parts of the verb".

The output of an LLM cannot be copyrightED. That is, there is no original creative contribution BY THE LLM worthy of copyright.

But the output of an LLM can be COPYRIGHT. No "ed" on the end of it. The mere fact of feeding stuff through an LLM does not automatically cancel any pre-existing copyright.

Again, we get back to the human analogy. There is no restriction on humans CONSUMING copyrighted works. European law explicitly extends that to LLMs CONSUMING copyrighted works.

And just as a human can regurgitate a copyrighted work in its entirety (Mozart is famous for doing this), so can an LLM. And both of these are blatant infringements if you don't have permission - although copyright was in its infancy when Mozart did it so I have no idea of the reality on the ground back then ...

Cheers,
Wol

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 13:52 UTC (Mon) by mb (subscriber, #50428) [Link] (8 responses)

>LLM is obviously not the personal mental creation of anyone

Well, that is not obvious at all.

Because the inputs were mental creations.
At which point did the data lose the "mental creation" status while traveling through the algorithm?
Will processing the input with 'sed' also remove it, because the output is completely processed by a program, not a human being?
What level of processing do we need for the "mental creation" status to be lost? How many chained 'sed's do we need?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 21:39 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (7 responses)

Chained sed isn’t going to solve it.

Even “mechanical” transformation by humans does not create a work (as defined by UrhG, i.e. copyright). It has to have some creativity.

Until then, it’s a transformation of the original work(s) and therefore bound to the (sum of their) terms and conditions on the original work.

If you have a copyrighted thing, you can print it out, scan it, compress it as JPEG, store it into a database… it’s still just a transformation of the original work, and you can retrieve a sufficiently substantial part of the original work from it.

The article where someone reimplemented a (slightly older version of) ChatGPT in a 498-line PostgreSQL query showed exactly, and in an easily understandable way, how this is just lossy compression/decompression: https://explainextended.com/2023/12/31/happy-new-year-15/

There are now feasible attacks obtaining “training data” from production models at large scale, e.g.: https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html

This is sufficient to prove that these “models” are just databases with lossily compressed, but easily enough accessible, copies of the original, possibly (probably!) copyrighted, works.
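
The memorisation point can be illustrated with a toy model (illustrative only — a character-level n-gram table is far simpler than a real transformer, and the corpus here is invented for the sketch). "Training" just records which character followed each context; "inference" looks contexts up again, and on a small corpus that lookup regurgitates the training text verbatim:

```python
import random
from collections import defaultdict

def train(text, n=8):
    # "Training": record which character followed each n-character context.
    model = defaultdict(list)
    for i in range(len(text) - n):
        model[text[i:i + n]].append(text[i + n])
    return model

def generate(model, seed, length=80):
    # "Inference": repeatedly look up the last n characters and emit
    # one of the recorded continuations.
    n = len(seed)
    out = seed
    for _ in range(length):
        continuations = model.get(out[-n:])
        if not continuations:
            break
        out += random.choice(continuations)
    return out

corpus = ("Beautiful is better than ugly. "
          "Explicit is better than implicit. "
          "Simple is better than complex. ")
model = train(corpus)
# With a corpus this small, most contexts have exactly one recorded
# continuation, so generation reproduces the training text: every
# 9-character window of the output is a substring of the corpus.
print(generate(model, corpus[:8]))
```

Whether that analogy scales to billion-parameter models is exactly what the linked extraction paper is about; the sketch only shows that "lookup parametrised by context" can reproduce training data verbatim.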

Another thing I would like to point out is the relative weight. For a work which I offer to the public under a permissive licence, attribution is basically the only remuneration I can ever get. This means failure to attribute has a much higher weight than for differently licenced or unlicenced stuff.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 21:55 UTC (Mon) by bluca (subscriber, #118303) [Link] (6 responses)

> This is sufficient to prove that these “models” are just databases with lossily compressed, but easily enough accessible, copies of the original, possibly (probably!) copyrighted, works.

While the AI bandwagon greatly exaggerates the capabilities of LLMs, let's not fall into the opposite trap. ChatGPT et al. are toys; real applications like Copilot are very much not "just databases". A database is not going to provide you with autocomplete based on the current, local context open in your IDE. A database is not going to provide an accurate summary of the meeting that just finished, with action items and all that.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:20 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (5 responses)

Oh, it totally is. Please *do* read the explainextended article: it shows precisely how the context is what parametrises the search query.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:44 UTC (Mon) by bluca (subscriber, #118303) [Link] (4 responses)

No, it totally isn't, because it's not about reproducing existing things, which is the only thing a database query can do.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:14 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (3 responses)

Just read that.

Consider a database in which things are stored lossily compressed and interleaved (yet still retrievable).

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:58 UTC (Mon) by bluca (subscriber, #118303) [Link] (2 responses)

A database query doesn't work differently depending on local context. You very clearly have never used any of this, besides playing with toys like chatgpt, and it shows.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 0:28 UTC (Tue) by mirabilos (subscriber, #84359) [Link] (1 responses)

Just read the fucking explainextended article, which CLEARLY explains all this, or go back to breaking unsuspecting peoples’ nōn-systemd systems, or whatever.

I don’t have the nerve to even try and communicate with systemd apologists who don’t even do the most basic research themselves WHEN POINTED TO IT M̲U̲L̲T̲I̲P̲L̲E̲ ̲T̲I̲M̲E̲S̲.

Second try

Posted May 14, 2024 1:26 UTC (Tue) by corbet (editor, #1) [Link]

OK, I'll state it more clearly: it's time to bring this thread to a halt, it's not getting anywhere.

That is, all participants should stop, not just the one I'm responding to here.

Thank you.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 2:55 UTC (Tue) by rgmoore (✭ supporter ✭, #75) [Link] (1 responses)

> But people claiming that AI is some kind of magic Copyright remover comes up over and over again. But that simply doesn't make sense, if it's not equally true for conventional algorithms.

A lot of people saying it doesn't mean it's true. I think the "magic copyright eraser" argument comes from misapplying the combination of the pro- and anti-AI arguments in a way that isn't supported by the law. The strong anti-AI position is that AI is inherently a derivative work of every piece of training material, and that therefore all the output of the AI is likewise a derivative work. The strong pro-AI position is that creating an AI is inherently transformative, so an AI is not a derivative work (or at least not an infringing use) of the material used to train it. The mistake is applying the anti-AI "everything is a derivative work" logic to the pro-AI position that the AI is not a derivative work and concluding that none of the output of the AI would be infringing.

This sounds reasonable but is absolutely wrong. A human being is not a derivative work of everything used to train them, but humans are still capable of copyright infringement. What matters is whether our output meets the established criteria for infringement, e.g. the abstraction-filtration-comparison test farnz mentions above. The same thing would apply to the output of an AI. Even if the AI itself is not infringing, its output can be.

Basically, the courts won't accept "but I got it from an AI" as an argument against copyright infringement. If anything, saying you got it from an AI would probably hurt you. You can try to defend yourself against charges of infringement by showing you never saw the original and thus must have created it independently. That's always challenging, but it will be much harder with an AI, given just how much material they've trained on. The chances are very good the AI has seen whatever you're accused of infringing, so the independent creation defense is no good.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 9:16 UTC (Tue) by farnz (subscriber, #17727) [Link]

> Basically, the courts won't accept "but I got it from an AI" as an argument against copyright infringement. If anything, saying you got it from an AI would probably hurt you. You can try to defend yourself against charges of infringement by showing you never saw the original and thus must have created it independently. That's always challenging, but it will be much harder with an AI, given just how much material they've trained on. The chances are very good the AI has seen whatever you're accused of infringing, so the independent creation defense is no good.

Note, as emphasis on your point, that the independent creation defence requires the defendant to show that the independent creator did not have access to the work they are alleged to have copied. The assumption is that you had access, up until you show you didn't.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 12:21 UTC (Mon) by Wol (subscriber, #4433) [Link]

> If $OUTPUT obviously came from $INPUT, applying the AFC test, then you've got yourself a derived work. If $OUTPUT is transformative or independent, even if the algorithm had access to $INPUT, then it's not a derived work.

Bear in mind that a LOT of people don't have a clue how copyright works. Unfortunately, it's now hard to navigate because it's a cobwebsite, but take a look at Groklaw.

But to give you a wonderful example of people making all sorts of outrageous claims about copyright, there was quite a fuss a good few years back about "how Terry Pratchett ripped off Hogwarts to create Unseen University".

Well, yes, they're both "Schools for Wizards". They are actually pretty similar. But the complaints are complete nonsense because I don't know whether J K Rowling had read Terry before she created Hogwarts, but Terry couldn't have read J K without a time-machine to hand!

And just like West Side Story, schools for wizards have been around in the literature for ages, so trying to imagine a link from Hogwarts to UU ignores the existence of a myriad of links to other similar stories, any one of which could have been the inspiration for either UU or Hogwarts.

Notice that the GPL makes *absolutely* *no* *attempt* *whatsoever* to define "derivative work". Because it has nothing to do with computing, AI, all that stuff. It's a legal "term of art", and if you don't speak legalese you WILL make an idiot of yourself.

So as far as the definition of "derivative work" is concerned, whether it's an AI or not is completely irrelevant. What IS relevant is whether the *OUTPUT* is Public Domain or not. My nsho is that if the output is sufficiently close that "a practitioner skilled in the arts" can identify the source, then the output is a legal "derived work", and the input copyright applies to the output. If the source is not identifiable, then the output is a new work, but AI is incapable of creativity, so the output is Public Domain.

And then - hopefully - a human comes along, proof-reads the output to remove hallucinations and mistakes, at which point (because this is *creative* input) they then acquire a copyright over the final work. Such work could also remove all references to the existing source, thereby removing the original copyrights (or it could fail to do so, and fail to remove the original copyrights).

Cheers,
Wol

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 21:24 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (1 responses)

Of course not, at least not generically: only a human may be the author of a work (derived or not).

Mechanically manipulating a work does not generate a new work, so it's just a mechanical transformation of the original work and therefore bound to the same terms, which the *user* of the machine must honour.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 9:29 UTC (Tue) by farnz (subscriber, #17727) [Link]

> Mechanically manipulating a work does not generate a new work, so it's just a mechanical transformation of the original work and therefore bound to the same terms, which the *user* of the machine must honour.

Your "just" is doing a lot of heavy lifting. It is entirely possible for a machine transform of a work to remove all copyright-relevant aspects of the original, leaving something that's no longer protected by copyright. As your terms are about lifting restrictions that copyright protection places on the work by default, if the transform removes all the elements that are protected by copyright, there's no requirement for the user to honour those terms.

For example, I can write a sed script that extracts 5 lines from the Linux kernel; if I extract the right 5 lines, then I've extracted something not protected by copyright, and the kernel's licence terms do not apply (since they merely lift copyright restrictions). On the other hand, if I extract a different set of 5 lines, I extract something to which copyright protections apply, and now the kernel's licence terms apply.
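
As a toy sketch of that distinction (the file contents and line choices here are invented for illustration, not taken from the actual kernel):

```shell
# Toy demonstration: the tool is identical in both cases; only the
# extracted content differs in copyright status.
cat > work.txt <<'EOF'
// SPDX-License-Identifier: GPL-2.0
#include <stdio.h>

/* A distinctive, creative comment that a court might well consider
 * protected expression. */
int main(void) { printf("hello\n"); return 0; }
EOF

# Extraction 1: a licence tag and an include -- boilerplate with no
# copyrightable expression of its own.
sed -n '1,2p' work.txt

# Extraction 2: the distinctive comment -- plausibly protected.
sed -n '4,5p' work.txt
```

The only thing that changes between the two invocations is the address range; sed itself carries no copyright significance either way.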

The challenge for AI boosters is that their business models depend critically on all of the AI's output being non-infringing; if you have to do copyright infringement tests on AI output, then most of the business models around generative AI fall apart, since who'd pay for something that puts you at high risk of a lawsuit?

And the challenge for AI critics is to limit ourselves to arguments that make legal sense, as opposed to arguing the way the SCO Group did when it claimed that all of Linux was derived from the UNIX copyrights it owned.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 14:46 UTC (Mon) by Wol (subscriber, #4433) [Link] (2 responses)

Can you get PJ to do a guest author slot? Or is she now outta technology entirely :-( She is missed ...

Cheers,
Wol

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 14:54 UTC (Mon) by jzb (editor, #7867) [Link] (1 responses)

The last post on Groklaw suggests that PJ was pretty much done with technology. I'm not sure how one would reach PJ these days, at any rate.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:00 UTC (Mon) by Wol (subscriber, #4433) [Link]

I'm fairly certain Jon's met her. He might know. But she pretty much did a disappearing act right at that last post, so I wouldn't be surprised if she refused, even if you could find her to ask. And I think she's of an age to be enjoying a well-earned retirement - she might have just shut the door completely. Some people can, some people can't.

Cheers,
Wol

NetBSD does set such a policy

Posted May 15, 2024 19:09 UTC (Wed) by mirabilos (subscriber, #84359) [Link]

Incidentally, NetBSD concurs, according to https://mastodon.sdf.org/@netbsd/112446618914747900 they now have also set a policy to that effect.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds