
Parts of Debian dismiss AI-contributions policy


Posted May 13, 2024 15:19 UTC (Mon) by farnz (subscriber, #17727)
In reply to: Parts of Debian dismiss AI-contributions policy by mb
Parent article: Debian dismisses AI-contributions policy

The whole reason I'm asking is that "AI as magic copyright removal machine" is a falsehood, and you've clearly picked that up from somewhere; rather than trying to correct people who pick it up from the same source, I'd like to go back to the source and correct it there.

Now, if it's a simple misunderstanding of what the article says, there's not a lot that can be done there - English is a horrorshow of a language at the best of times - but if you've picked it up from something someone else has actually claimed, that claim can be corrected at source.



Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:32 UTC (Mon) by mb (subscriber, #50428) [Link] (15 responses)

>and you've clearly picked that up from somewhere

No.
Just like I call my washing machine the "magic dirt removal machine" I call AI source code transformers "magic copyright removal machines".

And I really don't understand why you insist so hard that there is anything wrong with that. Because there isn't.
It's an aggravation to point out the fact that something that had been there in the input is no longer there in the output after a processing step.

I understand that you don't like these words, for whatever reason, but that's not really my problem to solve.

Thanks a lot for trying to correct me. I learnt a lot in this discussion. But I will continue to use these words, because in my opinion these words describe very well what actually happens.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:40 UTC (Mon) by bluca (subscriber, #118303) [Link] (13 responses)

> It's an aggravation to point out the fact that something that had been there in the input is no longer there in the output after a processing step.

The point is, that's not a fact at all, as it has already been explained many times.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:47 UTC (Mon) by mb (subscriber, #50428) [Link] (12 responses)

Ok. So the output from an LLM that was trained (input) with GPL'ed software is still GPLed?
If not, then it has been removed (laundered).

You guys have to decide on something. Both can't be true. There is nothing in-between. There is no such thing as "half-GPLed".
This is not Schrödinger's LLM.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:32 UTC (Mon) by bluca (subscriber, #118303) [Link] (10 responses)

The license of the training material is completely irrelevant with regard to building and training an LLM. Data mining on publicly available datasets is not performed under whatever license the dataset had, but under the copyright exception granted by the law, which trumps any license you might attach to it. Fun fact: tons of code on Github is published without a license _at all_, and thus is effectively proprietary, as that's obviously the default absent a license. Guess what, such a repository can still be data mined for machine learning training purposes (unless the repository owner ticked the opt-out checkbox), just like any other publicly available dataset.

Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:53 UTC (Mon) by mb (subscriber, #50428) [Link] (5 responses)

> Data mining on publicly available datasets is not performed under whatever license the dataset had,
> but under the copyright exception granted by the law, which trumps any license you might attach to it.

Ok. Got it now. So

>the fact that something that had been there in the input is no longer there in the output after a processing step.

is true after all.
The input was copyright protected and the special exception made it non-copyright-protected because of reasons.
And for whatever strange reason that only applies to AI algorithms, because the EU says so.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:15 UTC (Mon) by farnz (subscriber, #17727) [Link]

>the fact that something that had been there in the input is no longer there in the output after a processing step.

is true after all. The input was copyright protected and the special exception made it non-copyright-protected because of reasons. And for whatever strange reason that only applies to AI algorithms, because the EU says so.

No, this is also false.

Copyright law says that there are certain actions I am capable of taking, such as making a literal copy, or a "derived work" (a non-literal copy), which the law prohibits unless you have permission from the copyright holder. There are other actions that copyright allows, such as reading your text, or (in the EU) feeding that text as input to an algorithm; they may be banned by other laws, but copyright law says that these actions are completely legal.

The GPL says that the copyright holder gives you permission to do certain acts that copyright law prohibits as long as you comply with certain terms. If I fail to comply with those terms, then the GPL does not give me permission, and I now have a copyright issue to face up to.

The law says nothing about the copyright protection on the output of the LLM; it is entirely plausible that an LLM will output something that's a derived work of the input as far as copyright law is concerned, and if that's the case, then the output of the LLM infringes. Determining if the output infringes on a given input is done by a comparison process between the input and the output - and this applies regardless of what the algorithm that generated the output is.

Further, this continues to apply even if the LLM itself is not a derived work of the input data; it might be fine to send you the LLM, but not to send you the result of giving the LLM certain prompts as input, since the result of those prompts is derived from some or all of the input in such a way that you can't get permission to distribute the resulting work.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:15 UTC (Mon) by bluca (subscriber, #118303) [Link] (2 responses)

> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.

No, because what you are stubbornly refusing to understand, despite it having been explained a lot of times, is:

> Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.

This is a legal matter, not a programming one. The same paradigms used to understand software cannot be used to try and understand legal issues.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 17:33 UTC (Mon) by mb (subscriber, #50428) [Link] (1 responses)

> The same paradigms used to understand software cannot be used to try and understand legal issues.

Yes. That is the main problem. It does not have to make logical sense for it to be "correct" under law.

> stubbornly

I am just applying logical reasoning. The logical chain obviously is not correctly implemented. Which is often the case in law, of course. Just like the logical reasoning chain breaks if the information goes through a human brain. And that's Ok.

Just saying that some people claiming here things like "it's *obvious* that LLMs are like this and that w.r.t. Copyright" are plain wrong. Nothing is obvious in this context. It's partly counter-logical and defined with contradicting assumptions.

But that's Ok, as long as a majority agrees that it's fine.
But that doesn't mean I personally have to agree. Copyright is a train wreck and it's only getting worse and worse.

Parts of Debian dismiss AI-contributions policy

Posted May 14, 2024 5:11 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

I hesitate to wade into a thread that looks like it has long since become unproductive, but at this point, I think it might be helpful to remember that copyright is a "color" in the sense described at https://ansuz.sooke.bc.ca/entry/23.

Unfortunately, while that article is very well-written and generally illuminates the right way to think about verbatim copying, it can be unintentionally misleading when we're talking about derivative works. The "colors" involved in verbatim copying are relatively straightforward - either X is a copy of Y, or it is not, and this is purely a matter of how you created X. But when we get to derivative works, there are really two* separate components that need to be considered:

- Access (a "color" of the bits, describing whether the defendant could have looked at the alleged original).
- Similarity (a function of the bits, and not a color)

The problem is, if you've been following copyright law for some time, you might be used to working in exclusively one mode of analysis at a time (i.e. either the "bits have color" mode of analysis or the "bits are colorless" mode of analysis). But access is a colored property, and similarity is a colorless property. You need to be prepared to combine both modalities, or at least to perform each of them sequentially, in order to reason correctly about derivative works. You cannot insist that "it must be one or the other," because as a matter of law, it's both.

* Technically, there is also the third component of originality, but that only matters if you want to copyright the derivative work, which is an entirely different discussion altogether. That one is also a "color" which depends on how much human creativity has gone into the work.
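NYKevin's two-component analysis can be caricatured in a few lines of code (a toy sketch, emphatically not a legal tool; the class, function names, and threshold are all invented for illustration). The key point it encodes: provenance ("did the author see the original?") is metadata about a work's history that can never be recovered from the bits themselves, while similarity is computed from the bits alone.

```python
# Toy model of NYKevin's access + similarity test for derivative works.
# "Access" is a color (a fact about history, not derivable from content);
# "similarity" is colorless (a pure function of the content).
# All names and the threshold are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class Work:
    text: str                  # the bits themselves
    author_saw_original: bool  # provenance: NOT recoverable from the bits


def similarity(a: str, b: str) -> float:
    """Colorless: computed purely from the two texts (crude word overlap)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)


def may_be_derivative(candidate: Work, original: str,
                      threshold: float = 0.5) -> bool:
    # Both modalities must combine: access (color) AND similarity (bits).
    return candidate.author_saw_original and \
        similarity(candidate.text, original) >= threshold
```

A "clean-room" reimplementation can score high on similarity yet fail the access test, which is why the bits alone can never settle the question.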

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:18 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.

No, wrong.

They’re copyright-protected, but *analysing* copyright-protected works for text and data mining is an action permitted without the permission of the rights holders.

See my other post in this subthread. This limitation of copyright protection does not extend to doing *anything* with the output of such models.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:00 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (3 responses)

Fun that you mention data mining: in preparation of a copyright and licence workshop I’m holding at $dayjob, I re-read that paragraph as well.

Text and data mining is opt-out, and the opt-out must be machine-readable. But this limitation of copyright only applies to doing automated analyses of works to obtain information about patterns, trends and correlations.
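As an aside, one widely used machine-readable opt-out convention is a robots.txt rule blocking known training crawlers, such as OpenAI's documented GPTBot user agent (whether this form of reservation satisfies the EU directive's "machine-readable" requirement is itself a matter of debate):

```
User-agent: GPTBot
Disallow: /
```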

(I grant that creating an LLM model itself probably falls under this clause.)

But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary for this (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions), it also does not allow you to reproduce the output of such models.

Text and data mining is, after all, only permissible to obtain “information about especially patterns, trends and correlations”, not to produce outputs as genAI does.

Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 22:44 UTC (Mon) by bluca (subscriber, #118303) [Link] (2 responses)

> But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary for this (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions),

It doesn't say that it must be deleted, it says:

> Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.

Not quite the same thing. I don't know whether it's true that verbatim copies of training data are actually stored as you imply, as I am not an ML expert - it would seem strange and pointless, but I don't really know. But even assuming that was true, if that's required to make the LLM work, then the regulation clearly allows for it.

> it also does not allow you to reproduce the output of such models.

Every LLM producer treats such instances as bugs to be fixed. And they are really hard to reproduce, judging from how contrived and tortured the sequence of prompts needs to be to make that actually happen. The NYT had to basically copy and paste portions of their articles into the prompt themselves to make ChatGPT spit them back, as shown in their litigation vs OpenAI.

> Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.

And yet, the NYT decided to sue in the US, where the law is murky and based on fair use case-by-case decisions, rather than in the EU where they have an office and it would have been a slam dunk, according to you. Could it be that you are wrong? It's very easy to test: why don't you sue any of the companies that publish an LLM and see what happens?

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:16 UTC (Mon) by mirabilos (subscriber, #84359) [Link] (1 responses)

Yes, they treat that as bugs PRECISELY because they want to get around to do illegal copyright laundering.

Doesn't change the fact that it is possible, and in sufficient quantity to consider that models contain sufficient amounts of their input works for the output to be a mechanically produced derivative of them.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 23:57 UTC (Mon) by bluca (subscriber, #118303) [Link]

> Yes, they treat that as bugs PRECISELY because they want to get around to do illegal copyright laundering.

They are treated as bugs because they are bugs, despite the contrived and absurd ways that are necessary to reproduce them. Doesn't really prove anything.

> Doesn’t change the fact that it is possible, and in sufficiently amount to consider that models contain sufficient amounts from their input works for the output to be a mechanically produced derivative of them.

Illiterate FUD. Go to court and prove that, if you really believe that.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 16:58 UTC (Mon) by farnz (subscriber, #17727) [Link]

The output from the LLM is almost certainly GPLed insofar as the output from the LLM is (per copyright law) a derived work of the GPLed input. The complexity is that not all LLM outputs will be derived works as far as copyright law is concerned, and where they are not derived works, there is no copyright, hence there is nothing to be GPLed.

And that's the key issue - the algorithm between "read a work as input" and "write a work as output" is completely and utterly irrelevant to the question of "does the output infringe on the copyright applicable to the input?". That depends on whether the output is, using something like an abstraction-filtration-comparison test, substantially the same as the input, or not.
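As a loose programmer's analogy (not how a court actually works), the filtration and comparison steps of such a test can be sketched as follows; the tokenizer and the BOILERPLATE set are invented for illustration, and a real abstraction-filtration-comparison analysis is a legal judgment, not a formula:

```python
# Toy sketch of "filtration" then "comparison": strip unexpressive
# boilerplate tokens, then measure overlap of what remains.
# The token handling and BOILERPLATE set are illustrative assumptions.

def tokens(src: str) -> set[str]:
    """Split source text into a set of tokens (very crude)."""
    for ch in "(){};":
        src = src.replace(ch, " ")
    return set(src.split())


# "Filtration": discard tokens too common/unexpressive to count.
BOILERPLATE = {"if", "else", "return", "goto", "int", "=", "==", "0", "1"}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the filtered token sets, in [0.0, 1.0]."""
    ta = tokens(a) - BOILERPLATE
    tb = tokens(b) - BOILERPLATE
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

The point of the filtration step is that trivial, unexpressive idioms (common error-handling patterns, keywords, obvious constants) contribute little or nothing to the comparison, which is exactly why copying them has no copyright significance.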

For example, I can copy

    if (!ret) {
        if (ret == 1)
            ret = 0;
        goto cleanup;
    }

directly from the kernel source code into another program, and that has no GPL implications at all, even though it's a literal copy-and-paste of 5 lines of kernel code that I received under the GPL. However, if I copy a different 5 lines of kernel code, I am plausibly creating a derived work, because I'm copying something relatively expressive.

This is why both can be true; as a matter of law, not all copying is copyright infringement, and thus not all copying has GPL implications when the input was GPLed code.

Parts of Debian dismiss AI-contributions policy

Posted May 13, 2024 15:42 UTC (Mon) by Wol (subscriber, #4433) [Link]

The reason those words are bad is because "launder" - in legal terms - is used to describe illegal activity.

So to describe an AI as a "magical copyright laundering machine" is to admit / claim that it's involved in illegal activity.

Cheers,
Wol


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds