Debian dismisses AI-contributions policy
Posted May 10, 2024 21:45 UTC (Fri) by kleptog (subscriber, #1183)
Parent article: Debian dismisses AI-contributions policy
Ok, but then we should also ban contributions by any users that have ever read/watched copyrighted material without the copyright holder's consent. You don't know how they might be contaminated. /s
In general I'm against having rules that you cannot enforce in any meaningful way and that are therefore open to abuse. Apparently the term for this is "void for vagueness" (I asked ChatGPT). The best you can do is some kind of statement saying that Debian developers are responsible for their work and that LLMs should be used with caution. But if you can't enforce it, then it's basically useless to say so.
Posted May 11, 2024 11:50 UTC (Sat) by josh (subscriber, #17465)
AI, on the other hand, should be properly considered a derivative work of the training material. The alternative would be to permit AI to perform laundering of copyright violations.
Posted May 11, 2024 12:09 UTC (Sat) by bluca (subscriber, #118303)
Posted May 11, 2024 13:20 UTC (Sat) by josh (subscriber, #17465)
Posted May 11, 2024 15:13 UTC (Sat) by Wol (subscriber, #4433)
And that is a *complete* misunderstanding. I'm in Europe, and if an AI chucked my code *out*, I don't see any problem suing (apart from the lack of money, of course ...)
Feeding stuff INTO an LLM is perfectly legal - YOU DON'T NEED TO GET PERMISSION. So as a copyright holder I can't complain, unless they ignored my opt-out.
But the law says nothing about my works losing copyright status, or copyright not applying, or anything of that sort. My work *inside* the LLM is still my work, still my copyright. And if the LLM regurgitates it substantially intact, it's still my work, still my copyright, and anybody *using* it does NOT have my permission, nor the law's, to do that.
Cheers,
Wol
Posted May 11, 2024 15:23 UTC (Sat) by mb (subscriber, #50428)
The output should be considered a derived work of all training data.
Most of the time the outputs are different enough from the training data that it's hard to say which part of the input data influenced the output to what degree. But at the same time it's clear that only the training data (plus the prompt) can influence the output.
You can do the same thing manually. Take Open Source code and manually obfuscate it until nobody can prove its origin. It does not lose its Copyright status by doing that, though. And it's a lot of work - by far not click-of-a-button - so it's not done very often.
LLMs - apparently with some unicorn sparkle magic - remove copyright by doing the same thing. How does that make sense?
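(To make the obfuscation comparison concrete, here is a minimal invented C++ sketch - both functions and all names are hypothetical, not taken from any real project:)

    #include <cstddef>
    #include <cstdint>

    // Hypothetical "original" under an Open Source license:
    // a simple shift-and-xor checksum over a buffer.
    std::uint32_t checksum(const unsigned char *buf, std::size_t len) {
        std::uint32_t sum = 0;
        for (std::size_t i = 0; i < len; ++i)
            sum = (sum << 1) ^ buf[i];
        return sum;
    }

    // Manually "obfuscated" copy: identifiers renamed, loop restructured.
    // It computes exactly the same result; the rewrite hides the origin
    // but does not stop it being a copy of the function above.
    std::uint32_t z9(const unsigned char *p, std::size_t n) {
        std::uint32_t a = 0;
        while (n--)
            a = (a << 1) ^ *p++;
        return a;
    }

No diff tool would match the two, yet the second remains a mechanical derivative of the first - which is mb's point about obfuscation not removing copyright.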
Posted May 13, 2024 7:07 UTC (Mon) by gfernandes (subscriber, #119910)
Clearly not true.
Posted May 11, 2024 16:55 UTC (Sat) by bluca (subscriber, #118303)
Posted May 11, 2024 17:17 UTC (Sat) by mb (subscriber, #50428)
Yes, in general I agree.
But we must be cautious that the holes are not equipped with check valves [1]. The holes shall benefit free work as well, not only proprietary work.
If the LLMs learnt from large proprietary code bases, too, then I would actually be happy with the status quo. But currently the flow of code is basically only from open source into proprietary.
When do we see the LLM trained on all of Microsoft's internal code? When do we use that to improve Wine? That wouldn't be a problem, right? No material is transferred through the LLM, after all. Right?
Posted May 11, 2024 17:24 UTC (Sat) by bluca (subscriber, #118303)
> If the LLMs learnt from large proprietary code bases, too, then I would actually be happy with the status quo.
> But currently the flow of code is basically only from open source into proprietary.

The end goal is that there's no proprietary software, just software. We don't get there by making copyright even more draconian than it is now, and it's already really bad.

> When do we see the LLM trained on all of Microsoft's internal code?

As a wild guess, probably when it gets out of the atrociously horrendous internal git forge it lives in right now and into GitHub. Which is not anytime soon, or likely ever, sadly, because it would cost an arm and a leg and throw most of the engineering org into utter chaos. One can only wish.
Posted May 11, 2024 17:38 UTC (Sat) by mb (subscriber, #50428)
Yes. But we also don't get there by circumventing all Open Source licenses and installing a check valve in the direction of proprietary software.
The end goal of having "just software" actually means that everything is Open Source. (The other option would be to kill Open Source.)
Not only the end goal is important, but also the intermediate steps.

> We don't get there by making copyright even more draconian than it is now

Which I currently don't see as a realistic possibility.
Posted May 11, 2024 17:41 UTC (Sat) by bluca (subscriber, #118303)
You'll be delighted to know that's not how any of this works, then - it's just autocomplete with some extra powers and bells and whistles; it doesn't circumvent anything.
Posted May 11, 2024 21:22 UTC (Sat) by flussence (guest, #85566)
Posted May 12, 2024 11:39 UTC (Sun) by bluca (subscriber, #118303)
* (when only fully trusted workloads are executed, no malware allowed, pinky swear)
Posted May 13, 2024 10:16 UTC (Mon) by mathstuf (subscriber, #69389)
> As a wild guess, probably when it gets out of the atrociously horrendous internal git forge it lives in right now and into GitHub. Which is not anytime soon, or likely ever, sadly, because it would cost an arm and a leg and throw most of the engineering org into utter chaos. One can only wish.

Alas, this is kind of my baseline for believing Microsoft's stance that Copilot doesn't affect copyright: eat the same damn cake you're forcing down everyone else's throats (IIRC you work at Microsoft, but the "you" here is aimed at Microsoft PR/lawyers, not at bluca specifically).
If Copilot really can only be trained on code accessible over the GitHub API and not raw git repos, that seems a bit short-sighted, no?
Posted May 12, 2024 10:48 UTC (Sun) by MarcB (subscriber, #101804)
Not necessarily. However, should you lose - or should your lawyers consider this a likely outcome - you can use your huge amount of money to strike a licensing deal with the plaintiff. Even if you overpay, you still win, because you now have exclusive access.
See the Google/Reddit deal.
Posted May 12, 2024 11:38 UTC (Sun) by bluca (subscriber, #118303)
Posted May 12, 2024 11:55 UTC (Sun) by josh (subscriber, #17465)
When copyright ceases to exist, all software will be free to copy, modify, and redistribute. Until then, AI training should have to respect Open Source licenses just like everyone else does.
Posted May 12, 2024 12:00 UTC (Sun) by bluca (subscriber, #118303)
Posted May 12, 2024 19:55 UTC (Sun) by Wol (subscriber, #4433)
I'm not a fan of copyright, but "never" is just as bad as "for ever".
The US got it approximately right with its "fourteen years". The majority of any value, for any work, is extracted in the first 10 years or so. Beyond that, most works are pretty worthless.
So let's make it a round 15: all works are automatically protected for 15 years from publication, but if you want to avail yourself of that protection you must put the date on the work. After that, if the work still has value, you (as a real-person author, or the "heir alive at publication") can renew the copyright on a register for successive 15-year intervals (PLURAL) for a smallish fee. A work published in 2024, say, would lapse in 2039, 2054, or later, depending on how many times it was renewed.
And for people like Disney, Marvel, etc etc, you can trademark your work to keep control of your valuable universe if you wish.
So this will achieve the US aim of "encouraging works into the Public Domain" and works won't "rot in copyright" because people won't pay to extend it.
Cheers,
Wol
Posted May 12, 2024 20:05 UTC (Sun) by Wol (subscriber, #4433)
And this is the whole point of Berne. Different countries, different rules (same basic framework), but works have to be protected identically regardless of nationality. Which comes as a result of the American abuse of copyright pre-1984-ish. One only has to look at the AT&T/BSD suit, where AT&T removed copyright notices and effectively placed OTHER PEOPLE'S COPYRIGHTED WORK into the Public Domain.
Going back, there's Disney's Fantasia, where they used European works and completely ignored copyright. Go back even further to Gilbert & Sullivan's "The Pirates of Penzance", which they premiered in New York and had to go to extreme lengths to keep others from copyrighting, which would otherwise have prevented them from performing their own work in the US.
THERE IS NO SPECIAL EXCEPTION FOR AI. Not even in Europe. As a person, you are free to study copyrighted works for your own education. European law makes it explicit that "educating" an AI is the same as educating a person. Presumably the same rules apply to an AI regurgitating what it's learnt as apply to a human: if the output is indistinguishable from (or clearly based on) the input, then it's a copyright violation. The problem is, be it human or AI, what do you mean by "clearly based on"?
Cheers,
Wol
Posted Jun 10, 2024 15:42 UTC (Mon) by nye (subscriber, #51576)
I'm practically speechless that you would lie so brazenly as to say this in the same thread as espousing approximately the most maximalist possible interpretation of copyright.
Honestly, it's threads like this one that remind me of how ashamed I am that I once considered myself part of the FOSS community. It's just... awful. Everyone here is awful. Every time I read LWN I come away thinking just a little bit less of humanity. Today, you've played your part.
Whoa there
Posted Jun 10, 2024 16:34 UTC (Mon) by corbet (editor, #1)
This seems ... extreme. I don't doubt that different people can have different ideas of what "copyright maximalist" means — there are different axes on which things can be maximized. Disagreeing on that does not justify calling somebody a liar and attacking them in this way, methinks.
Posted May 13, 2024 9:50 UTC (Mon) by LtWorf (subscriber, #124958)
Posted May 11, 2024 13:56 UTC (Sat) by Paf (subscriber, #91811)
> AI, on the other hand, should be properly considered a derivative work of the training material. The alternative would be to permit AI to perform laundering of copyright violations.

I would like to understand better why this is. Plenty of things in my brain are in fact covered by copyright, and I could likely violate quite a bit of copyright from memory. Instead, it's entirely about how much of the input material is present in the output.
If we’re just saying “humans are different”, it would be nice to understand *why* in detail and if anything non human could ever clear those hurdles. I get the distinct sense a lot of these arguments actually boil down to “humans are special and nothing else is like a human, because humans are special”
Posted May 11, 2024 14:38 UTC (Sat) by willy (subscriber, #9762)
Posted May 12, 2024 1:29 UTC (Sun) by Paf (subscriber, #91811)
Posted May 13, 2024 1:38 UTC (Mon) by raven667 (subscriber, #5198)
Sci-fi is also a fictional scenario that swaps people for aliens or AI or whatever to be able to talk about power dynamics and relationships without existing bias creeping in, but that doesn't mean that LLMs are "alive" or "moral agents" in any way; they are nowhere near complex and complete enough for that to be a consideration. People see faces in the side of a mountain or a piece of toast, and in the same way perceive the output of LLMs, mistaking cogent-sounding statistical probability for intelligence. There is no there there, because while an LLM might in some small way approximate thought, it's thoroughly lobotomized, with no concept of concepts.
Posted May 11, 2024 20:21 UTC (Sat) by flussence (guest, #85566)
Are you saying there's a threshold of "AI-ness" whereby, once it is crossed, someone caught distributing a 1TB torrent of Disney DVD rips and RIAA MP3s, encrypted with a one-time pad output from a key derivation function with a trivially guessable input, would see the torrent file itself arrested instead? Does a training set built by stealing the work of others have legal personhood now? Do the colour of the bits and the intent of the deed no longer matter to a court if the proponent of the technology is sufficiently high on their own farts?
Posted May 12, 2024 1:28 UTC (Sun) by Paf (subscriber, #91811)
Posted May 13, 2024 10:54 UTC (Mon) by LtWorf (subscriber, #124958)
Posted May 13, 2024 15:40 UTC (Mon) by atnot (subscriber, #124910)
LLMs don't have that; they just try to predict what the answer would be on StackOverflow. Including, apparently, much to my delight, "closed as duplicate". If you try using them for actually writing code, it very quickly becomes clear they have no actual understanding of the language beyond stochastically regurgitating online tutorials[1]. They falter as soon as you ask for something that isn't a minor variation of a common question or something that has been uploaded to GitHub thousands of times.
If we are to call both of these things "learning", we do have to acknowledge that they are drastically different meanings of the term.
[1] And no, answers to naive queries about how X works do not prove it "understands" X, merely that the training data contains enough instances of this question being answered to be memorizable. Which for a language like C is going to be a lot. Consider e.g. that an overwhelming majority of universities in the world have at least one C course.
Posted May 13, 2024 15:44 UTC (Mon) by bluca (subscriber, #118303)
That's really not true for the normal use case, which is fancy autocomplete. It doesn't just regurgitate online tutorials or StackOverflow; it provides autocompletion based on the body of work you are currently working on, which is why it's so useful as a tool. The process is the same stochastic parroting, mind you; of course language models don't really learn anything in the sense of gaining an "understanding" of something in the human sense.
Posted May 13, 2024 20:39 UTC (Mon) by rschroev (subscriber, #4164)
Have you tried something like Copilot? I've been trying it out a bit over the last three weeks (somewhat grudgingly). One of the things that became clear quite soon is that it does not just get its code from StackOverflow and GitHub and the like; it clearly tries to adapt to the body of code I'm working on (it certainly doesn't always get it right, but that's a different story).

An example, to make things more concrete. Let's say I have a struct with about a dozen members, and a list of key-value pairs, where the keys are the same as the names of the struct members, and I want to assign the values to the struct members. I'll start writing something like:

    for (auto &kv: kv_pairs) {
        if (kv.first == "name")
            mystruct.name = kv.second;
        // ...
    }

It then doesn't take long before Copilot starts autocompleting with the remaining struct members, offering me the exact code I was trying to write, even when I'm pretty sure the names I'm using are unique and not present in publicly accessible sources.

I'm not commenting on the usefulness of all this; I'm just showing that what it does is not just applying StackOverflow and GitHub to my code. We should probably remember that LLMs are not all alike. It's very well possible that e.g. ChatGPT would have a worse "understanding" (for lack of a better word) of my code, and would rely much more on what it learned before from public sources.
Posted May 11, 2024 21:57 UTC (Sat) by kleptog (subscriber, #1183)
No, the alternative is to consider copyright infringement based on how much something resembles a copyrighted work. Whether the copyrighted work was part of the training set is not relevant to this determination. This is pretty much the EU directive position.
It's pretty much the same idea that allows search engines to process all the information on the internet without having to ask the copyright holders, but they can't just reproduce those pages for users.
Posted May 12, 2024 0:39 UTC (Sun) by NYKevin (subscriber, #129325)
Incidentally, this is also the US position, although the rules for the training process itself remain somewhat opaque (unlike in the EU).
Posted May 12, 2024 1:25 UTC (Sun) by Paf (subscriber, #91811)
*Why* is human output copyrightable and AI output not? Can you explain this, and give reasons? You've been stating it, but not giving reasons.
Posted May 12, 2024 2:13 UTC (Sun) by NYKevin (subscriber, #129325)
Posted May 14, 2024 1:20 UTC (Tue) by mirabilos (subscriber, #84359)
Robotically made things, or things made by animals, are not works that reflect personal creativity.