LWN: Comments on "Debian dismisses AI-contributions policy" https://lwn.net/Articles/972331/ This is a special feed containing comments posted to the individual LWN article titled "Debian dismisses AI-contributions policy".

Whoa there https://lwn.net/Articles/977801/ https://lwn.net/Articles/977801/ corbet This seems ... extreme. I don't doubt that different people can have different ideas of what "copyright maximalist" means — there are different axes on which things can be maximized. Disagreeing on that does not justify calling somebody a liar and attacking them in this way, methinks. Mon, 10 Jun 2024 16:34:35 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/977796/ https://lwn.net/Articles/977796/ nye <div class="FormattedComment"> <span class="QuotedText">&gt; I'm not a copyright maximalist</span><br> <p> I'm practically speechless that you would lie so brazenly as to say this in the same thread as espousing approximately the most maximalist possible interpretation of copyright.<br> <p> Honestly, it's threads like this one that remind me of how ashamed I am that I once considered myself part of the FOSS community. It's just... awful. Everyone here is awful. Every time I read LWN I come away thinking just a little bit less of humanity. Today, you've played your part.<br> </div> Mon, 10 Jun 2024 15:42:10 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/974284/ https://lwn.net/Articles/974284/ nrxr <div class="FormattedComment"> You have too much faith in what scares people who are willing to cheat.<br> <p> The only thing they don't want is to be caught, and since the policy is unenforceable, it won't drive them away.<br> </div> Mon, 20 May 2024 00:57:40 +0000

NetBSD does set such a policy https://lwn.net/Articles/973789/ https://lwn.net/Articles/973789/ mirabilos <div class="FormattedComment"> Incidentally, NetBSD concurs: according to <a href="https://mastodon.sdf.org/@netbsd/112446618914747900">https://mastodon.sdf.org/@netbsd/112446618914747900</a>, they have now also set a policy to that effect.<br> </div> Wed, 15 May 2024 19:09:19 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973623/ https://lwn.net/Articles/973623/ farnz <blockquote> Mechanically manipulating a work does not generate a new work, so it’s just a mechanical transformation of the original work and therefore bound to the same terms, which the *user* of the machine must honour. </blockquote> <p>Your "just" is doing a lot of heavy lifting. It is entirely possible for a machine transform of a work to remove all copyright-relevant aspects of the original, leaving something that's no longer protected by copyright. As your terms are about lifting restrictions that copyright protection places on the work by default, if the transform removes all the elements that are protected by copyright, there's no requirement for the user to honour those terms. <p>For example, I can write a <tt>sed</tt> script that extracts 5 lines from the Linux kernel; if I extract <a href="https://lwn.net/Articles/973541/">the right 5 lines</a>, then I've extracted something not protected by copyright, and the kernel's licence terms do not apply (since they merely lift copyright restrictions). On the other hand, if I extract a different set of 5 lines, I extract something to which copyright protections apply, and now the kernel's licence terms apply.
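<p>(A minimal sketch of such an extraction, assuming GNU <tt>sed</tt> and a kernel source checkout; the file and line range here are illustrative only, not the specific lines in question:) <pre>
# Print exactly five lines from one kernel source file. Whether the
# result is protected by copyright depends on what those lines say,
# not on the fact that a mechanical tool extracted them.
sed -n '1024,1028p' kernel/futex/pi.c
</pre>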
<p>The challenge for AI boosters is that their business models depend critically on all of the AI's output being non-infringing; if you have to run copyright-infringement tests on AI output, then most of the business models around generative AI fall apart; after all, who would pay for something that puts you at high risk of a lawsuit? <p>And the challenge for AI critics is to limit ourselves to arguments that make legal sense, as opposed to arguing the way the SCO Group did when it claimed that all of Linux was derived from the UNIX copyrights it owned. Tue, 14 May 2024 09:29:31 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973622/ https://lwn.net/Articles/973622/ farnz <blockquote> Basically, the courts won't accept "but I got it from an AI" as an argument against copyright infringement. If anything, saying you got it from an AI would probably hurt you. You can try to defend yourself against charges of infringement by showing you never saw the original and thus must have created it independently. That's always challenging, but it will be much harder with an AI, given just how much material they've trained on. The chances are very good the AI has seen whatever you're accused of infringing, so the independent creation defense is no good. </blockquote> <p>Note, as emphasis on your point, that the independent creation defence requires the defendant to show that the independent creator did not have access to the work they are alleged to have copied. The assumption is that you had access, up until you show you didn't. Tue, 14 May 2024 09:16:04 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973614/ https://lwn.net/Articles/973614/ NYKevin <div class="FormattedComment"> I hesitate to wade into a thread that looks like it has long since become unproductive, but at this point, I think it might be helpful to remember that copyright is a "color" in the sense described at <a href="https://ansuz.sooke.bc.ca/entry/23">https://ansuz.sooke.bc.ca/entry/23</a>.<br> <p> Unfortunately, while that article is very well-written and generally illuminates the right way to think about verbatim copying, it can be unintentionally misleading when we're talking about derivative works. The "colors" involved in verbatim copying are relatively straightforward - either X is a copy of Y, or it is not, and this is purely a matter of how you created X. But when we get to derivative works, there are really two* separate components that need to be considered:<br> <p> - Access (a "color" of the bits, describing whether the defendant could have looked at the alleged original).<br> - Similarity (a function of the bits, and not a color).<br> <p> If you've been following copyright law for some time, you might be used to working in exclusively one mode of analysis at a time (i.e. either the "bits have color" mode of analysis or the "bits are colorless" mode of analysis). The problem is, access is a colored property, and similarity is a colorless one. You need to be prepared to combine both modalities, or at least to perform each of them sequentially, in order to reason correctly about derivative works. You cannot insist that "it must be one or the other," because as a matter of law, it's both.<br> <p> * Technically, there is also a third component, originality, but that only matters if you want to copyright the derivative work, which is a different discussion altogether.
That one is also a "color", which depends on how much human creativity has gone into the work.<br> </div> Tue, 14 May 2024 05:11:27 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973607/ https://lwn.net/Articles/973607/ rgmoore <blockquote> But people claiming that AI is some kind of magic Copyright remover comes up over and over again. But that simply doesn't make sense, if it's not equally true for conventional algorithms.</blockquote> <p>A lot of people saying it doesn't make it true. I think the "magic copyright eraser" argument comes from misapplying the combination of the pro- and anti-AI arguments in a way that isn't supported by the law. The strong anti-AI position is that AI is inherently a derivative work of every piece of training material, and that therefore all the output of the AI is likewise a derivative work. The strong pro-AI position is that creating an AI is inherently transformative, so an AI is not a derivative work (or at least not an infringing use) of the material used to train it. The mistake is applying the anti-AI "everything is a derivative work" logic to the pro-AI position that the AI is not a derivative work, and concluding that none of the output of the AI would be infringing. <p>This sounds reasonable but is absolutely wrong. A human being is not a derivative work of everything used to train them, but humans are still capable of copyright infringement. What matters is whether our output meets the established criteria for infringement, e.g. the abstraction-filtration-comparison test farnz mentions above. The same thing would apply to the output of an AI. Even if the AI itself is not infringing, its output can be. <p>Basically, the courts won't accept "but I got it from an AI" as an argument against copyright infringement. If anything, saying you got it from an AI would probably hurt you. You can try to defend yourself against charges of infringement by showing you never saw the original and thus must have created it independently. That's always challenging, but it will be much harder with an AI, given just how much material they've trained on. The chances are very good the AI has seen whatever you're accused of infringing, so the independent creation defense is no good. Tue, 14 May 2024 02:55:31 +0000

Creator, or proof reader? https://lwn.net/Articles/973609/ https://lwn.net/Articles/973609/ viro <div class="FormattedComment"> LLM output is, by definition, random text that is statistically indistinguishable from "what they say". You can't tell true from false on that level. And it's worse than random BS from the proofreading POV - you cannot rely on the "it sounds wrong" feeling to catch the likely spots.<br> <p> As far as I'm concerned, anyone caught using that deserves the same treatment as somebody who engages in any other form of post-truth - "not to be trusted ever after in any circumstances".<br> </div> Tue, 14 May 2024 02:31:16 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973605/ https://lwn.net/Articles/973605/ Heretic_Blacksheep <div class="FormattedComment"> There are two ways of generating technical documentation: fast &amp; good.<br> <p> Fast is how a lot of people tend to want to do technical documentation, mostly because they don't consider proper communication essential. It's an afterthought.
You end up with a lot of mistakes, misunderstandings, and complaints about the quality of a project's information.<br> <p> Then there's good: it's written by skilled communicators, reviewed and annotated by multiple stakeholders, including people new to the project, old hands, technical-writing/communications-savvy participants, and, if necessary, skilled translators. This is the best documentation, and it only gets better over time as the teams that write it receive feedback. It takes time to generate, time to proofread and edit, and time to manage corrections. But, arguably, it actually saves time all around, for those who produce it and particularly for those who use it.<br> <p> Many people using language models are going for the first level, but the result is little better in quality, because they discount that the point of documentation (and message passing like email) isn't just telling people like themselves how to use whatever they're producing. The point of documentation is not only to tell intellectual peers how to use the tool; it's to communicate why it exists, why it's designed and implemented the way it is, how it's used, and how it may potentially be extended or repaired. <br> <p> The first (fast) is a simple how-to, like a terse set of reminders on a machine's control panel, and may not even be accurate. The latter is the highly accurate, full documentation manual that accompanies the machine and tells operators what to do, what not to do, when, why, and how to repair or extend it. You can't reliably use a complex tool without a training/documentation manual. It's also why open source never made huge strides into corporate/industrial markets till people started writing _manuals_, not just the terse how-tos many hobby-level open-source projects generate. Training and certification is a big part of the computer industry.<br> <p> But back to the topic: AI can help do both, but fast/unskilled is still going to be crap. Merely witness the flood of LLM-generated fluff and low-quality articles proliferating across the internet as formerly reliable media outlets switch from good human journalism, to ad-driven human fluff-journalism, to LLM-generated "pink slime" or, in one coinage I saw recently, "slop" (building on bot-generated spam). Good documentation can use language-model tools, but not without the same human communication skills that good documentation and translation require... and many coders discount that, to their ultimate detriment. Tone matters. Nuance matters. Definitions matter. Common usage matters. _Intent_ matters. LLM tools can help with these things, but they can't completely substitute for either the native speaker or the project originator. They definitely can't intuit a writer's intent.<br> <p> However, right now any project that is concerned about the provenance of its code base should be wary of the unanswered legal questions around the output of LLM code generators. This could end up being a tremendous _gotcha_ in some very important legal jurisdictions where copyright provenance matters, and it is why legally cautious companies are looking at fully audited SLMs instead.<br> </div> Tue, 14 May 2024 02:16:04 +0000

Second try https://lwn.net/Articles/973604/ https://lwn.net/Articles/973604/ corbet OK, I'll state it more clearly: it's time to bring this thread to a halt; it's not getting anywhere. <p> That is, <i>all participants</i> should stop, not just the one I'm responding to here. <p> Thank you.
Tue, 14 May 2024 01:26:04 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973603/ https://lwn.net/Articles/973603/ mirabilos <div class="FormattedComment"> Because the laws say so. Simple as that.<br> <p> Robotically made things, or things made by animals, are not works that reflect personal creativity.<br> </div> Tue, 14 May 2024 01:20:38 +0000

Creator, or proof reader? https://lwn.net/Articles/973602/ https://lwn.net/Articles/973602/ mirabilos <div class="FormattedComment"> Your example of musical engraving is a bit flawed (though your point in general is still true): there’s been jurisprudence saying that merely engraving does not _necessarily_ create copyright. It still may, if sufficient creativity (not mere sweat of the brow!) goes into it, which is why I only add a “© (where applicable)” to my digital editions of Free Sheet Music, to avoid fraudulent claims where no copyright exists.<br> <p> So, in this specific example, the bar is a bit higher, but yeah, the point stands.<br> </div> Tue, 14 May 2024 01:12:13 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973601/ https://lwn.net/Articles/973601/ mirabilos <div class="FormattedComment"> Yes, as it’s still a derivative of something that isn’t legal to use.<br> <p> And for all the other reasons too: model training is done in unethical ways that exploit people in the global south, and LLM use requires too much power (and even other natural resources, such as clean water for cooling), so you should not be using them at all, period.<br> </div> Tue, 14 May 2024 01:06:19 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973599/ https://lwn.net/Articles/973599/ mirabilos <div class="FormattedComment"> Just read the fucking explainextended article, which CLEARLY explains all this, or go back to breaking unsuspecting people’s nōn-systemd systems, or whatever.<br> <p> I don’t have the nerve to even try to communicate with systemd apologists who don’t do the most basic research themselves WHEN POINTED TO IT MULTIPLE TIMES.<br> </div> Tue, 14 May 2024 00:28:27 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973595/ https://lwn.net/Articles/973595/ bluca <div class="FormattedComment"> A database query doesn't work differently depending on local context. You very clearly have never used any of this, besides playing with toys like ChatGPT, and it shows.<br> </div> Mon, 13 May 2024 23:58:45 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973594/ https://lwn.net/Articles/973594/ bluca <div class="FormattedComment"> <span class="QuotedText">&gt; Yes, they treat that as bugs PRECISELY because they want to get away with illegal copyright laundering.</span><br> <p> They are treated as bugs because they are bugs, despite the contrived and absurd ways that are necessary to reproduce them. That doesn't really prove anything.<br> <p> <span class="QuotedText">&gt; Doesn’t change the fact that it is possible, and in sufficient amounts, to consider that models contain enough of their input works for the output to be a mechanically produced derivative of them.</span><br> <p> Illiterate FUD.
Go to court and prove it, if you really believe that.<br> </div> Mon, 13 May 2024 23:57:12 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973591/ https://lwn.net/Articles/973591/ mirabilos <div class="FormattedComment"> Yes, they treat that as bugs PRECISELY because they want to get away with illegal copyright laundering.<br> <p> Doesn’t change the fact that it is possible, and in sufficient amounts, to consider that models contain enough of their input works for the output to be a mechanically produced derivative of them.<br> </div> Mon, 13 May 2024 23:16:39 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973590/ https://lwn.net/Articles/973590/ mirabilos <div class="FormattedComment"> Just read that.<br> <p> Consider a database in which things are stored lossily compressed and interleaved (yet still retrievable).<br> </div> Mon, 13 May 2024 23:14:42 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973583/ https://lwn.net/Articles/973583/ bluca <div class="FormattedComment"> <span class="QuotedText">&gt; But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions),</span><br> <p> It doesn't say that they must be deleted; it says:<br> <p> <span class="QuotedText">&gt; Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.</span><br> <p> Not quite the same thing. I don't know whether it's true that verbatim copies of training data are actually stored, as you imply, as I am not an ML expert - it would seem strange and pointless, but I don't really know. But even assuming that were true, if it's required to make the LLM work, then the regulation clearly allows for it.<br> <p> <span class="QuotedText">&gt; it also does not allow you to reproduce the output of such models.</span><br> <p> Every LLM producer treats such instances as bugs to be fixed. And they are really hard to reproduce, judging from how contrived and tortured the sequence of prompts needs to be to make that actually happen. The NYT had to paste portions of their own articles into the prompt to make ChatGPT spit them back, as shown in their litigation vs. OpenAI.<br> <p> <span class="QuotedText">&gt; Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules apply (i.e. the output is a mechanical combination of its inputs, and their licences hold).</span><br> <p> And yet, the NYT decided to sue in the US, where the law is murky and based on case-by-case fair-use decisions, rather than in the EU, where they have an office and where, according to you, it would have been a slam dunk. Could it be that you are wrong?
It's very easy to test: why don't you sue one of the companies that publish an LLM and see what happens?<br> </div> Mon, 13 May 2024 22:44:39 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973588/ https://lwn.net/Articles/973588/ bluca <div class="FormattedComment"> No, it totally isn't, because it's not about reproducing existing things, which is the only thing a database query can do.<br> </div> Mon, 13 May 2024 22:44:17 +0000

This has gone on for a while https://lwn.net/Articles/973586/ https://lwn.net/Articles/973586/ corbet While this discussion can be seen as on-topic for LWN, I would also point out that we are not copyright lawyers, and that there may not be a lot of value in continuing to go around in circles here. Perhaps it's time to wind it down? Mon, 13 May 2024 22:24:18 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973585/ https://lwn.net/Articles/973585/ mirabilos <div class="FormattedComment"> Oh, it totally is. Please *do* read the explainextended article: it shows you exactly how the context parametrises the search query.<br> </div> Mon, 13 May 2024 22:20:30 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973584/ https://lwn.net/Articles/973584/ mirabilos <div class="FormattedComment"> <span class="QuotedText">&gt; The input was copyright protected and the special exception made it non-copyright-protected because of reasons.</span><br> <p> No, wrong.<br> <p> They’re copyright-protected, but *analysing* copyright-protected works for text and data mining is an action permitted without the permission of the rights holders.<br> <p> See my other post in this subthread. This limitation of copyright protection does not extend to doing *anything* with the output of such models.<br> </div> Mon, 13 May 2024 22:18:38 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973581/ https://lwn.net/Articles/973581/ mirabilos <div class="FormattedComment"> Fun that you mention data mining: in preparation for a copyright and licence workshop I’m holding at $dayjob, I re-read that paragraph as well.<br> <p> Text and data mining are opt-out, and the opt-out must be machine-readable; but this limitation of copyright only applies to automated analysēs of works performed to obtain information about patterns, trends and correlations.<br> <p> (I grant that creating an LLM model itself probably falls under this clause.)<br> <p> But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary (which these models clearly don’t do, given how it’s possible to obtain “training data” by the millions), it also does not allow you to reproduce the output of such models.<br> <p> Text and data mining is, after all, only permissible to obtain “information about, especially, patterns, trends and correlations”, not to produce output the way genAI does.<br> <p> Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules apply (i.e.
the output is a mechanical combination of its inputs, and their licences hold).<br> </div> Mon, 13 May 2024 22:00:20 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973580/ https://lwn.net/Articles/973580/ bluca <div class="FormattedComment"> <span class="QuotedText">&gt; This is sufficient to prove that these “models” are just databases with lossily compressed, but easily enough accessible, copies of the original, possibly (probably!) copyrighted, works.</span><br> <p> While the AI bandwagon greatly exaggerates the capabilities of LLMs, let's not fall into the opposite trap. ChatGPT et al. are toys; real applications like Copilot are very much not "just databases". A database is not going to provide you with autocomplete based on the current, local context open in your IDE. A database is not going to provide an accurate summary of the meeting that just finished, with action items and all that.<br> </div> Mon, 13 May 2024 21:55:04 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973579/ https://lwn.net/Articles/973579/ mirabilos <div class="FormattedComment"> 😹😹😹<br> <p> <span class="QuotedText">&gt; LLMs are prompted, they don't produce output out of thin air. Therefore the output is the creation of the human that triggered the prompt.</span><br> <p> This is ridiculous. The “prompt” is merely a tiny parametrisation of a query that extracts from the huge database of (copyrighted) works.<br> <p> Do read the links I listed in <a href="https://lwn.net/Comments/973578/">https://lwn.net/Comments/973578/</a><br> <p> <span class="QuotedText">&gt; The idea that LLM output cannot be copyrighted is silly.</span><br> <p> 😹😹😹😹😹😹😹<br> <p> You’re silly.<br> <p> This is literally enshrined in copyright law. For example:<br> <p> <span class="QuotedText">&gt; Werke im Sinne dieses Gesetzes sind nur persönliche geistige Schöpfungen.</span><br> <p> “Works as defined by this [copyright] law are only personal intellectual creations [that pass the threshold of originality].” (UrhG §2(2))<br> <p> Wikipedia explains the “personal” part of this, following general jurisprudence:<br> <p> <span class="QuotedText">&gt; Persönliches Schaffen: setzt „ein Handlungsergebnis, das durch den gestaltenden, formprägenden Einfluß eines Menschen geschaffen wurde“ voraus. Maschinelle Produktionen oder von Tieren erzeugte Gegenstände und Darbietungen erfüllen dieses Kriterium nicht. Der Schaffungsprozeß ist Realakt und bedarf nicht der Geschäftsfähigkeit des Schaffenden.</span><br> <p> “Personal creation presupposes ‘a result of action created through the designing, form-shaping influence of a human’. Machine productions, or objects and performances produced by animals, do not fulfil this criterion. The act of creation is a factual act (Realakt) and does not require the creator to have legal capacity.” (&lt;<a href="https://de.wikipedia.org/wiki/Urheberrecht_(Deutschland)#Schutzgegenstand_des_Urheberrechts:_Das_Werk">https://de.wikipedia.org/wiki/Urheberrecht_(Deutschland)#Schutzgegenstand_des_Urheberrechts:_Das_Werk</a>&gt;)<br> <p> So, yes, LLM output cannot be copyrighted (as a new work/edition) in ipso.<br> <p> And to create an adaptation of LLM output, the human doing so must not only invest significant *creativity* (not just effort or sweat of the brow!)
to pass the threshold of originality, but must also have the permission of the holders of the copyright (exploitation rights, to be precise) in the original works (and, under droit d’auteur, may not deface the work, so the authors, even if they are not the holders of the exploitation rights, also have a say).<br> </div> Mon, 13 May 2024 21:51:50 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973578/ https://lwn.net/Articles/973578/ mirabilos <div class="FormattedComment"> Chained sed isn’t going to solve it.<br> <p> Even “mechanical” transformation by humans does not create a work (as defined by UrhG, i.e. copyright law). It has to involve some creativity.<br> <p> Until then, it’s a transformation of the original work(s) and therefore bound to the (sum of their) terms and conditions on the original work.<br> <p> If you have a copyrighted thing, you can print it out, scan it, compress it as JPEG, store it in a database… it’s still just a transformation of the original work, and you can retrieve a sufficiently substantial part of the original work from it.<br> <p> The article in which someone reimplemented a (slightly older version of) ChatGPT in a 498-line PostgreSQL query showed, exactly and in an easily understandable way, how this is just lossy compression/decompression: <a href="https://explainextended.com/2023/12/31/happy-new-year-15/">https://explainextended.com/2023/12/31/happy-new-year-15/</a><br> <p> There are now feasible attacks obtaining “training data” from production models at large scale, e.g. <a href="https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html">https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html</a><br> <p> This is sufficient to prove that these “models” are just databases with lossily compressed, but easily enough accessible, copies of the original, possibly (probably!) copyrighted, works.<br> <p> Another thing I would like to point out is the relative weight. For a work which I offer to the public under a permissive licence, attribution is basically the only remuneration I can ever get. This means failure to attribute has a much higher weight than for differently licenced or unlicenced stuff.<br> </div> Mon, 13 May 2024 21:39:04 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973575/ https://lwn.net/Articles/973575/ mirabilos <div class="FormattedComment"> Of course not, at least not generically: only a human may be the author of a work (derived or not).<br> <p> Mechanically manipulating a work does not generate a new work, so it’s just a mechanical transformation of the original work and therefore bound to the same terms, which the *user* of the machine must honour.<br> </div> Mon, 13 May 2024 21:24:15 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973570/ https://lwn.net/Articles/973570/ rschroev <p>Have you tried something like CoPilot? I've been trying it out a bit over the last three weeks (somewhat grudgingly). One of the things that became clear quite soon is that it does not just get its code from StackOverflow and GitHub and the like; it clearly tries to adapt to the body of code I'm working on (it certainly doesn't always get it right, but that's a different story).</p> <p>An example, to make things more concrete. Let's say I have a struct with about a dozen members, and a list of key-value pairs, where those keys are the same as the names of the struct members, and I want to assign the values to the struct members.
I'll start writing something like:</p> <pre>
for (auto &amp;kv: kv_pairs) {
    if (kv.first == "name")
        mystruct.name = kv.second;
    // ...
}
</pre> <p>It then doesn't take long before CoPilot starts autocompleting with the remaining struct members, offering me the exact code I was trying to write, even though I'm pretty sure the names I'm using are unique and not present in publicly accessible sources.</p> <p>I'm not commenting on the usefulness of all this; I'm just showing that what it does is not just applying StackOverflow and GitHub to my code.</p> <p>We should probably remember that LLMs are not all alike. It's entirely possible that e.g. ChatGPT would have a worse "understanding" (for lack of a better word) of my code, and would rely much more on what it previously learned from public sources.</p> Mon, 13 May 2024 20:39:19 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973563/ https://lwn.net/Articles/973563/ mb <div class="FormattedComment"> <span class="QuotedText">&gt; users are responsible for everything they submit.</span><br> <p> 2 is basically that:<br> <p> <a href="https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin">https://en.wikipedia.org/wiki/Developer_Certificate_of_Or...</a><br> <p> And 1 is, well, common sense and common practice.<br> <p> <span class="QuotedText">&gt;Right now, the copyright status of AI output needs clarification</span><br> <p> For some people here the status is "obvious" ;-)<br> </div> Mon, 13 May 2024 18:27:21 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973559/ https://lwn.net/Articles/973559/ rgmoore It seems to me that the basic policy should be that users are responsible for everything they submit. That means two big things: <ol><li>The submitter is responsible for the quality of their submission. If they are submitting junk, they can expect to be treated as an annoyance. That's true regardless of why it's junk; it doesn't matter whether they're a bad programmer, depending on an incompetent AI, or whatever. Send in enough junk, and you can expect your submissions to be forwarded to the bit bucket unread. <li>The submitter is responsible for ensuring they have the right to submit their contribution in the first place. This means they must either hold copyright themselves, have a valid license from someone else that gives them the right to submit it, or know it isn't covered by copyright. Right now, the copyright status of AI output needs clarification, which means nobody can be sure they have the right to submit AI output.</ol> Mon, 13 May 2024 18:10:32 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973557/ https://lwn.net/Articles/973557/ mb <div class="FormattedComment"> <span class="QuotedText">&gt; The same paradigms used to understand software cannot be used to try and understand legal issues.</span><br> <p> Yes. That is the main problem. It does not have to make logical sense for it to be "correct" under law.<br> <p> <span class="QuotedText">&gt; stubbornly</span><br> <p> I am just applying logical reasoning. The logical chain obviously is not implemented correctly, which is often the case in law, of course. Just as the logical reasoning chain breaks if the information goes through a human brain. And that's OK.<br> <p> I'm just saying that some people here claiming things like "it's *obvious* that LLMs are like this and that w.r.t. copyright" are plain wrong. Nothing is obvious in this context.
It's partly counter-logical and defined with contradictory assumptions.<br> <p> But that's OK, as long as a majority agrees that it's fine.<br> It doesn't mean I personally have to agree, though. Copyright is a train wreck, and it's only getting worse.<br> </div> Mon, 13 May 2024 17:33:20 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973554/ https://lwn.net/Articles/973554/ bluca <div class="FormattedComment"> <span class="QuotedText">&gt; The input was copyright protected and the special exception made it non-copyright-protected because of reasons.</span><br> <p> No, because what you are stubbornly refusing to understand, despite it having been explained many times, is:<br> <p> <span class="QuotedText">&gt; Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.</span><br> <p> This is a legal matter, not a programming one. The same paradigms used to understand software cannot be used to try and understand legal issues.<br> </div> Mon, 13 May 2024 17:15:34 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973550/ https://lwn.net/Articles/973550/ farnz <blockquote> &gt;the fact that something that had been there in the input is no longer there in the output after a processing step. <p> is true after all. The input was copyright protected and the special exception made it non-copyright-protected because of reasons. And for whatever strange reason that only applies to AI algorithms, because the EU says so. </blockquote> <p>No, this is also false. <p>Copyright law says that there are certain actions I am capable of taking, such as making a literal copy, or a "derived work" (a non-literal copy), which the law prohibits unless you have permission from the copyright holder. There are other actions that copyright allows, such as reading your text, or (in the EU) feeding that text as input to an algorithm; they may be banned by other laws, but as far as copyright law is concerned, these actions are completely legal. <p>The GPL says that the copyright holder gives you permission to do certain acts that copyright law prohibits, as long as you comply with certain terms. If I fail to comply with those terms, then the GPL does not give me permission, and I now have a copyright issue to face up to. <p>The law says <em>nothing</em> about the copyright protection on the <em>output</em> of the LLM; it is entirely plausible that an LLM will output something that's a derived work of the input as far as copyright law is concerned, and if that's the case, then the output of the LLM infringes. Determining whether the output infringes on a given input is done by a comparison process between the input and the output - and this applies regardless of the algorithm that generated the output. <p>Further, this continues to apply even if the LLM itself is not a derived work of the input data; it might be fine to send you the LLM, but not to send you the result of giving the LLM certain prompts as input, since the result of those prompts is derived from some or all of the input in such a way that you can't get permission to distribute the resulting work.
Mon, 13 May 2024 17:15:22 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973541/ https://lwn.net/Articles/973541/ farnz <p>The output from the LLM is almost certainly GPLed <em>insofar as</em> the output from the LLM is (per copyright law) a derived work of the GPLed input. The complexity is that not all LLM outputs will be derived works as far as copyright law is concerned, and where they are not derived works, there is no copyright, hence there is nothing <em>to</em> be GPLed. <p>And that's the key issue - the algorithm between "read a work as input" and "write a work as output" is completely and utterly irrelevant to the question of "does the output infringe on the copyright applicable to the input?". That depends on whether the output is, using something like an abstraction-filtration-comparison test, substantially the same as the input, or not. <p>For example, I can copy <a href="https://elixir.bootlin.com/linux/v6.9-rc7/source/kernel/futex/pi.c#L1024"><tt>if (ret) { if (ret == 1) ret = 0; goto cleanup; }</tt></a> directly from the kernel source code into another program, and that has no GPL implications at all, even though it's a literal copy-and-paste of 5 lines of kernel code that I received under the GPL. However, if I copy a <a href="https://elixir.bootlin.com/linux/v6.9-rc7/source/kernel/cpu.c#L815">different 5 lines of kernel code</a>, I am plausibly creating a derived work, because I'm copying something relatively expressive. <p>This is why both <em>can</em> be true; as a matter of law, not all copying is copyright infringement, and thus not all copying has GPL implications when the input was GPLed code. Mon, 13 May 2024 16:58:04 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973548/ https://lwn.net/Articles/973548/ mb <div class="FormattedComment"> <span class="QuotedText">&gt; Data mining on publicly available datasets is not performed under whatever license the dataset had,</span><br> <span class="QuotedText">&gt; but under the copyright exception granted by the law, which trumps any license you might attach to it. </span><br> <p> Ok. Got it now. So<br> <p> <span class="QuotedText">&gt;the fact that something that had been there in the input is no longer there in the output after a processing step.</span><br> <p> is true after all.<br> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.<br> And for whatever strange reason that only applies to AI algorithms, because the EU says so.<br> </div> Mon, 13 May 2024 16:53:22 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973545/ https://lwn.net/Articles/973545/ bluca <div class="FormattedComment"> The license of the training material is completely irrelevant with regard to building and training an LLM. Data mining on publicly available datasets is not performed under whatever license the dataset had, but under the copyright exception granted by the law, which trumps any license you might attach to it. Fun fact: tons of code on GitHub is published without a license _at all_, and thus is effectively proprietary, as that's obviously the default absent a license.
Guess what: such a repository can still be data mined for machine-learning training purposes (unless the repository owner ticked the opt-out checkbox), just like any other publicly available dataset.<br> <p> Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.<br> </div> Mon, 13 May 2024 16:32:02 +0000

Parts of Debian dismiss AI-contributions policy https://lwn.net/Articles/973536/ https://lwn.net/Articles/973536/ mb <div class="FormattedComment"> Ok. So the output from an LLM that was trained (input) with GPL'ed software is still GPLed?<br> If not, then the GPL has been removed (laundered).<br> <p> You guys have to decide on something. Both can't be true. There is nothing in between. There is no such thing as "half-GPLed".<br> This is not Schrödinger's LLM.<br> <p> </div> Mon, 13 May 2024 15:47:58 +0000

Debian dismisses AI-contributions policy https://lwn.net/Articles/973537/ https://lwn.net/Articles/973537/ bluca <div class="FormattedComment"> <span class="QuotedText">&gt; LLMs don't have that, they just try to predict what the answer would be on stackoverflow. Including aparently, much to my delight, "closed as duplicate". If you try using them for actually writing code, it very quickly becomes clear they have no actual understanding of the language beyond stochastically regurgitating online tutorials[1]. They falter as soon as you ask for something that isn't a minor variation of a common question or something that has been uploaded on github thousands of times.</span><br> <p> That's really not true for the normal use case, which is fancy autocomplete. It doesn't just regurgitate online tutorials or Stack Overflow; it provides autocompletion based on the body of work you are currently working on, which is why it's so useful as a tool. The process is the same stochastic parroting, mind you; of course language models don't really learn anything in the sense of gaining an "understanding" of something in the human sense.<br> </div> Mon, 13 May 2024 15:44:19 +0000