Debian dismisses AI-contributions policy
In April, the Gentoo Linux project banned the use of generative AI/ML tools due to copyright, ethical, and quality concerns. This means contributors cannot use tools like ChatGPT or GitHub Copilot to create content for the distribution such as code, documentation, bug reports, and forum posts. A proposal for Debian to adopt a similar policy revealed a distinct lack of love for those kinds of tools, though it would also seem few contributors support banning them outright.
Tiago Bortoletto Vaz started the discussion on the Debian project mailing list on May 2, with the suggestion that the project should consider adopting a policy on the use of AI/ML tools to generate content. Vaz said that he feared that Debian was "already facing negative consequences in some areas" as a result of this type of content, or it would be in a short time. He referenced the Gentoo AI policy, and Michał Górny's arguments against AI tools on copyright, quality, and ethical grounds. He said he was in agreement with Górny, but wanted to know how other Debian contributors felt.
Ansgar Burchardt wrote that generative AI is "just another tool". He noted that Debian doesn't ban Tor, even though it can be used to violate copyright or for unethical things, and it doesn't ban human contributions due to quality concerns: "I don't see why AI as yet another tool should be different."
Others saw it differently. Charles Plessy responded that he would probably vote for a general resolution against "the use of the current commercial AI for generating Debian packaging, native, or infrastructure code". He specified "commercial AI" because "these systems are copyright laundering machines" that abuse free software, and found the idea that other Debian developers would use them discouraging. He was not against generative AI technology itself, however, as long as it was trained on content that the copyright holders gave consent to use for that purpose.
Russ Allbery was skeptical of Gentoo's approach of an outright ban, since "it is (as they admit) unenforceable". He also agreed with Burchardt, "we don't make policies against what tools people use locally for developing software". He acknowledged that there are potential problems for Debian if output from AI tools infringes copyright. Even so, banning the use of those tools would not make much difference: "we're going to be facing that problem with upstreams as well, so the scope of that problem goes far beyond" direct contributions to Debian. The project should "plan to be reactive [rather] than attempt to be proactive". If there are reports that AI-generated content is a copyright violation, he said, then the project should deal with it as it would with any Debian Free Software Guidelines (DFSG) violation. The project may need to make judgment calls about the legal issues then, but "hopefully this will have settled out a bit in broader society before we're forced to make a decision on a specific case".
Allbery said his primary concern about AI is its practical impact:
Most of the output is low-quality garbage and, because it's now automated, the volume of that low-quality garbage can be quite high. (I am repeatedly assured by AI advocates that this will improve rapidly. I suppose we will see. So far, the evidence that I've seen has just led me to question the standards and taste of AI advocates.)
Ultimately, Allbery said he saw no need for new policies. If there is a deluge of junk, "we have adequate mechanisms to complain and ask that it stop without making new policy". The only statement he wanted to convey so far is that "anyone relying on AI to summarize important project resources like Debian Policy or the Developers Guide or whatnot is taking full responsibility for any resulting failures".
A sense of urgency
In reply to Allbery, Vaz conceded that Gentoo's policy was not perfect but, despite the difficulty in enforcing it, he maintained there was a need to do something quickly.
Vaz, who is an application manager (AM) for the Debian new maintainer process, suggested that Debian was already seeing problems with AI output submitted during the new maintainer (NM) process and as DebConf submissions, but declined to provide examples. "So far we can't [prove] anything, and even if we could, of course we wouldn't bring any of the involved to the public arena". He did, however, agree that a statement was a more appropriate tool than a policy.
Jose-Luis Rivas replied that Vaz had more context than the rest of the participants in the discussion and that "others do not have this same information and can't share this sense of urgency". He inferred that an NM applicant might be using a large language model (LLM) tool during the NM process, but in that scenario there was "even less point" in making policy or a statement about the use of such tools. It would be hard to prove that an LLM was in use, and "ultimately [it] is in the hands of those judging" to make the decisions. "I can't see the point of 'something needs to be done' without a clear reasoning of the expectations out of that being done".
Vaz argued that having a policy or statement would be useful, even in the absence of proof that an LLM was in use. He made a comparison to Debian's code of conduct and its diversity statement: "They might seem quite obvious to some, and less so to others." Having an explicit position on the use of LLMs would be useful to educate those who are "getting to use LLMs in their daily life in a quite mindless way" and "could help us both avoid and mitigate possible problems in the future".
The NM scenario Vaz gave was not convincing to Sam Hartman, who replied that the process would not benefit from a policy. It is up to candidates to prove to their application manager (AM), advocates, and reviewers that they can be trusted and have the technical skills to be a Debian Developer:
I as an AM would find an applicant using an LLM as more than a possibly incorrect man page without telling me would violate trust. I don't need a policy to come to that conclusion.
He said he did not mind if a candidate used an LLM to refresh their memory, and saw no need for them to cite the use of the LLM. But if the candidate didn't know the material well enough to catch bad information from an LLM, then it's clear they are not to be trusted to choose good sources of information.
On May 8, after the conversation had died down, Vaz wrote that it was apparent "we are far from a consensus on an official Debian position regarding the use of generative AI as a whole in the project". He thanked those who had commented, and said that he hoped the debate would surface again "at a time when we better understand the consequences of all this".
It is not surprising to see Debian take a conservative, wait-and-see approach. If Debian is experiencing real problems from AI-generated content, they are not yet painful or widespread enough to motivate support for a ban or specific policy shift. A flood of AI gibberish, or a successful legal challenge to LLM-generated content, might turn the tide.
Posted May 10, 2024 16:28 UTC (Fri)
by JoeBuck (subscriber, #2330)
[Link] (10 responses)
I'd like to see more detail about how generative AI is being used, or suspected of being used, by NM applicants, because I can think of uses that aren't wrong at all, like an applicant with very limited English skills using an LLM to improve their English language communication. I don't think that's wrong or ban-worthy.
A ban might drive people in that position to hide their use, so it might be better to talk about what's OK and what's not in the view of the project, keeping multiple values in mind: avoid using LLMs as plagiarism machines, but don't discriminate against non-native speakers.
Posted May 10, 2024 16:52 UTC (Fri)
by hkario (subscriber, #94864)
[Link] (7 responses)
Same with AI tools: if you ask one to generate some text and then copy the output verbatim to the bug tracker, you're wasting everybody's time (worse still if you automate the whole process). If, on the other hand, you use the AI tool to generate some template, read it, rewrite it, or use it just to check the grammar of the overall post, you're using the tool correctly; you still have full control over the output, and that will (most likely) result in a good and useful post to the bug tracker.
So, if the contribution is clearly the result of an LLM, you're using it wrong and you shouldn't contribute to the project. If the output is indistinguishable from that of a native speaker with domain knowledge of the problem discussed, then you're doing it right.
See also: https://xkcd.com/810/
Posted May 10, 2024 19:33 UTC (Fri)
by james (subscriber, #1325)
[Link] (6 responses)
It's tempting to argue this would be a faster and/or more reliable way of getting good documentation.
Posted May 10, 2024 22:43 UTC (Fri)
by taladar (subscriber, #68407)
[Link] (2 responses)
Posted May 11, 2024 1:38 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted May 11, 2024 13:47 UTC (Sat)
by Paf (subscriber, #91811)
[Link]
Posted May 12, 2024 14:41 UTC (Sun)
by farnz (subscriber, #17727)
[Link] (1 responses)
That leads into the other problem with a "no AI contributions" policy. Imagine I take my code, feed it to an LLM, and ask the LLM to write documentation for my code. I then take the LLM's output, and rewrite it myself to be accurate to my code, checking for mistakes; I used it as a prompt since I struggle to get going on documenting things, but I find it easy to fix bad documentation. The result is something that reflects considerable effort on my part, and that has very little trace of the original training data that was fed into the LLM; the LLM's role was to show me what good documentation would look like for my project, so I had a goal to aim for, even though I could not reuse the LLM's output.
Is this a contribution that we should reject? If so, why?
Posted May 14, 2024 1:06 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link]
And for all the other reasons, such as model training being done in unethical ways that exploit people in the global south, and LLM use requiring too much power (and even other natural resources like clean water, for reasons), so you should not be using them a̲t̲ ̲a̲l̲l̲, period.
Posted May 14, 2024 2:16 UTC (Tue)
by Heretic_Blacksheep (guest, #169992)
[Link]
Fast is how a lot of people tend to want to do technical documentation, mostly because they don't consider proper communication essential. It's an afterthought. You end up with a lot of mistakes, misunderstandings, and complaints about the quality of a project's information.
Then there's good: it's written by skilled communicators, reviewed and annotated by multiple stakeholders including people new to the project, old hands, technical writing/communications-savvy participants, and, if necessary, skilled translators. This is the best documentation and only becomes better over time as the teams that write it gain feedback. It takes time to generate, time to proofread and edit, and time to manage corrections. But, arguably, it actually saves time all around for those that produce it, and particularly those that use it.
Many people using language models are going for the first level, but the result is little better in quality because they discount that the point of documentation (and message passing like email) isn't just telling people like themselves how to use whatever they're producing. The point of documentation is not only to tell intellectual peers how to use the tool; it's to communicate why it exists, why it's designed and implemented the way it is, how it's used, and how it may potentially be extended or repaired.
The first (fast) is a simple how-to, like a terse set of reminders on a machine's control panel, and may not even be accurate. The latter is the highly accurate, full documentation manual that accompanies it and tells operators what to do, what not to do, when, why, and how to repair or extend it. You can't reliably use a complex tool without a training/documentation manual. It's also why open source never made huge strides into corporate/industrial markets until people started writing _manuals_, not just the terse how-tos many open-source hobby-level projects generate. Training and certification is a big part of the computer industry.
But back to the topic: AI can help do both, but fast/unskilled is still going to be crap. Merely witness the flood of LLM-generated fluff/low-quality articles proliferating across the internet as formerly reliable media outlets switch from good human journalism, to ad-driven human fluff-journalism, to LLM-generated "pink slime" or, in one person's terminology I recently saw, "slop" (building on bot-generated spam). Good documentation can use language-model tools, but not without the same human communication skills that good documentation and translation require... and many coders discount those to their ultimate detriment. Tone matters. Nuance matters. Definitions matter. Common usage matters. _Intent_ matters. LLM tools can help with these things, but they can't completely substitute for either the native speaker or the project originator. They definitely can't intuit a writer's intent.
However, right now any project that is concerned about the provenance of its code base should be wary of the unanswered legal questions around the output of LLM code generators. This could end up being a tremendous _gotcha_ in some very important legal jurisdictions where copyright provenance matters, which is why legally wise companies are looking at fully audited SLMs instead.
Posted May 10, 2024 20:32 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (1 responses)
Agreed.
Note that there never has been a hard requirement to speak English to become a Debian developer.
We always have a number of Debian developers with peculiar English syntax, to say the least.
Posted May 12, 2024 11:25 UTC (Sun)
by ganneff (subscriber, #7069)
[Link]
We very much prefer bad English over machine-generated sh*t, that's for sure. With bad English you can, relatively easily, sort out in a conversation what the other side means and wants. With machine-generated sh*t you only get more of the same useless crap in slightly different words.
Using an LLM to write your mails doesn't make your English look better, it makes YOU as its user look worse.
Posted May 10, 2024 17:12 UTC (Fri)
by carlosrodfern (subscriber, #166486)
[Link] (8 responses)
Here is an example of AI-powered contributions that wasted engineering time that could have been better spent on valuable things: https://hackerone.com/reports/2199174
By banning AI contributions, the wasted time is significantly reduced. The negative impact of hurting AI-powered contributions that are indeed high quality is minimal compared with the time wasted on the noise.
In other words, the signal-to-noise ratio would have to be quite high for the ban to actually affect the Debian project and community negatively.
Posted May 10, 2024 18:19 UTC (Fri)
by mb (subscriber, #50428)
[Link] (6 responses)
You still have to detect whether something is an "AI contribution", which is 90% of the wasted time already.
Having a "No-AI" rule would not have reduced the effort in the example shown in a meaningful way.
Posted May 10, 2024 20:26 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
1. Generate an explanation of how some standard or library function works, which the programmer could then use to make decisions about what code to write. The explanation is discarded afterwards, so no AI-generated content is incorporated into the project as such.
Obviously, these features can be used in combination, but they can also be used in isolation, and a good policy should specify which of these actions (if any) are considered acceptable.
Posted May 11, 2024 18:20 UTC (Sat)
by Baughn (subscriber, #124425)
[Link]
Needless to say, if there’s a rule against AI then I just won’t contribute at all. That’s no big loss—I don’t run Debian and wouldn’t be contributing anyway—but in the hypothetical case I did, I wouldn’t want to break the rules and also wouldn’t want to lose hours of time writing the same exact letters the AI would’ve produced.
I don’t think it has a negative impact on the code quality. If anything, the fact I need to explain what I’m doing in the form of comments so the AI will be useful feels like it makes the result better, not worse. It does need a bit of discipline to actually review the output, and I don’t usually look at the autocompletion unless I already know what I want—finding bugs is harder than matching it against what was in my head—but even with this limited usage I think it speeds me up by 50-100%.
Please don’t throw the baby out with the bathwater.
Posted May 13, 2024 2:33 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (3 responses)
You can define a little, but there is more flexibility if the rules are open to interpretation: state some principles, don't proactively police them, and when something becomes a problem you have a rule to point back to. Otherwise, if the rules are over-specified, clever people will try to craft procedural "workarounds" to ignore the spirit of the rule. If the rules are flexible, then it's really up to consensus opinion whether an enforcement action is reasonable or not; rather than arguing about minutiae, you make a case to the community on the substance of the issue, and if they don't agree they'll let you know. You don't have to proactively police with perfect detection to be able to enforce rules; they can be applied on an as-needed basis and rely on the judgement of the people involved. Written rules can't replace judgement anyway.
Posted May 13, 2024 8:25 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
I'm not saying there needs to be some perfectly decidable algorithm for determining what does or does not count. I'm just saying, you need general principles which are more specific than "AI bad, do not use."
Posted May 13, 2024 18:10 UTC (Mon)
by rgmoore (✭ supporter ✭, #75)
[Link] (1 responses)
Posted May 13, 2024 18:27 UTC (Mon)
by mb (subscriber, #50428)
[Link]
2 is basically that:
https://en.wikipedia.org/wiki/Developer_Certificate_of_Or...
And 1 is, well. Common sense and common practice.
>Right now, the copyright status of AI output needs clarification
For some people here the status is "obvious" ;-)
Posted May 10, 2024 18:48 UTC (Fri)
by rgmoore (✭ supporter ✭, #75)
[Link]
There are two issues: the workload issue you highlighted and a copyright question that might be more legally significant. The thing about the workload is not unique to AI, though. You could have the same problem from human contributors, and there are already problems with fuzzers spamming projects with low quality bug reports. There needs to be some kind of policy about that general category that isn't specific to AI.
The copyright question is a separate one. Or rather copyright questions, since there are at least two. The first is whether AI-generated material can be copyrighted. That's significant because many FOSS licences depend on copyright. Probably more significant is whether AI is violating the copyright of the material used to train it and how that affects the legal status of its output. Debian could potentially be opening itself to ruinous lawsuits if it incorporates AI-generated material it turns out not to have the legal right to use.
Posted May 10, 2024 19:54 UTC (Fri)
by flussence (guest, #85566)
[Link] (1 responses)
It is unenforceable. The trick is that it only needs to *be present* to scare away people who were lying about their ability to clear a knee-high hurdle in the first place. It'll filter out scalp-collecting slop pushers looking for another gold star to slap on their form-letter github profile in the same way the average barely legible spam email is crafted to weed out all but the most gullible and credulous recipients. Same as how a code of conduct, even a never-enforced paper tiger like the one Gentoo has on their site, makes a lot of internet tough guys instantly run away sobbing and soiling. You can get a lot of mileage out of simply preying on the superstitions of poisonous demographics.
Posted May 20, 2024 0:57 UTC (Mon)
by nrxr (guest, #170704)
[Link]
The only thing that they don't want is to be caught, and since it's unenforceable, it won't drive them away.
Posted May 10, 2024 20:27 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (25 responses)
But using an AI as a translator or proofreader, and similar, could easily improve the quality of submissions. Used THAT way, I don't think a ban is appropriate.
The simple rule would be "if you ask an AI for content, then NO! If you feed *your own* content into the AI for it to improve it, then why not?".
Cheers,
Posted May 11, 2024 4:16 UTC (Sat)
by drago01 (subscriber, #50715)
[Link] (24 responses)
OK, in general: technology comes along, and either it's useful, stays, and gets improved, or it isn't and goes away.
Banning is not really a solution.
Posted May 11, 2024 11:10 UTC (Sat)
by josh (subscriber, #17465)
[Link] (22 responses)
Posted May 11, 2024 13:51 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (21 responses)
Posted May 11, 2024 14:07 UTC (Sat)
by josh (subscriber, #17465)
[Link] (20 responses)
If AI-generated text is *not* copyrightable, then AI becomes a means of laundering copyright: Open Source and/or proprietary code goes in, public domain code comes out. If the law or jurisimprudence of some major jurisdictions decide to allow that, that's a disaster for Open Source licensing.
Posted May 12, 2024 1:23 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (4 responses)
Posted May 12, 2024 7:55 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (3 responses)
If I get permission (or use a Public Domain work) to set a piece of music in, let's say, lilypond, I can quite legally slap my copyright on it, and forbid people from copying.
Okay, the notes themselves are not copyrighted - I can't stop people taking my (legally acquired) copy and re-typesetting it, but I can stop them sticking it in a photocopier.
One of the major points (and quite often a major problem) of copyright is that I only have to make minor changes to a work, and suddenly the whole work is covered by my copyright. A tactic often used by publishers to try and prevent people copying Public Domain works.
Cheers,
Posted May 12, 2024 9:05 UTC (Sun)
by mb (subscriber, #50428)
[Link] (1 responses)
Huh? Is that really how (I suppose) US Copyright works? I make a one-liner change to the Linux kernel and then I have Copyright on the whole kernel? I doubt it.
Posted May 12, 2024 10:29 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
The key to understanding this is that copyright covers "works". So if you take the kernel source, make some modifications and publish a tarball, you own the copyright on the tarball ("the work"). That doesn't mean that you own the copyright to every line of code inside that tarball. Someone could download your tarball, delete your modifications, add different ones, and create a new tarball, and now their tarball has nothing to do with yours.
Just cloning a repo doesn't create a new work though, because there's no creativity involved.
In fact, one of the features of open source is that the copyright status of a lot of code is somewhat unclear, but it doesn't actually matter because open-source licences mean you don't actually need to care. If you make a single-line patch, does that constitute a "work" that's copyrightable? If you work together with someone else on a patch, can you meaningfully distinguish your copyrighted code from your coauthor's, or from the copyright of the code you modified?
Copyright law has the concept of joint-ownership and collective works, but copyright law doesn't really have a good handle on open-source development.
Posted May 14, 2024 1:12 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link]
So, in this specific example, the bar is a bit higher, but yeah, the point stands.
Posted May 12, 2024 14:44 UTC (Sun)
by drago01 (subscriber, #50715)
[Link] (14 responses)
If an LLM does that, why would it be? The same goes for any other content, as long as it does not generate copies of the original.
Posted May 12, 2024 15:15 UTC (Sun)
by mb (subscriber, #50428)
[Link] (12 responses)
So, it is also Ok to use a Non-AI code obfuscator to remove Copyright, as long as the output does not look like the input anymore?
Posted May 12, 2024 16:04 UTC (Sun)
by bluca (subscriber, #118303)
[Link] (11 responses)
Posted May 12, 2024 16:18 UTC (Sun)
by mb (subscriber, #50428)
[Link] (10 responses)
What amount of sparkle dust is needed for a computer program that takes $A and produces $B out of $A not to be considered "compilation with extra steps"?
LLMs are computer programs that produce an output for given inputs. There is no magic involved. It's a mapping of inputs+state => output.
Posted May 12, 2024 18:10 UTC (Sun)
by kleptog (subscriber, #1183)
[Link] (2 responses)
The law is run by humans not computers so this question is irrelevant. All that matters is: does the tool produce an output that somehow affects the market of an existing copyrighted work? How it is done is not relevant.
So an obfuscator doesn't remove copyright because removing copyright isn't a thing. Either the output is market-substitutable for some copyrighted work, or it isn't.
LLMs do not spontaneously produce output; they are prompted. If an LLM reproduces a copyrighted work, then that is the responsibility of the person who made the prompt. It's fairly obvious that LLMs do not reproduce copyrighted works in normal usage, so you can't argue that LLMs have a fundamental problem with copyright.
(I guess you could create an LLM that, without a prompt, reproduced the entire works of Shakespeare. You could argue such an LLM would violate Shakespeare's copyright, if he had any. That's not a thing with the LLMs currently available though. In fact, they're going to quite some effort to ensure LLMs do not reproduce entire works, because that is an inefficient use of resources (ie money), they don't care about copyright per se.)
Posted May 12, 2024 19:33 UTC (Sun)
by mb (subscriber, #50428)
[Link] (1 responses)
That is not obvious at all.
By that same reasoning my code obfuscator would be Ok to use.
But the output of the obfuscator obviously is a derived work of the input. Right?
Or does using a more complex mixing algorithm suddenly make it not a derived work of the input?
Posted May 13, 2024 7:30 UTC (Mon)
by kleptog (subscriber, #1183)
[Link]
> That is not obvious at all.
Have you actually used one?
> But the output of the obfuscator obviously is a derived work of the input. Right?
Not at all. "Derived work" is a legal term not a technical one. Running a copyrighted work through an algorithm does not necessarily create a derived work. In copyright law, a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work). If you hash a copyrighted file, the resulting hash is not a derived work simply because it's lost everything that is interesting about the original work.
If your obfuscator has a corresponding deobfuscator that can return the original retaining the major copyrightable elements, then there may be no copyright on the obfuscated file, but as soon as you deobfuscate it, the copyright returns.
Honestly, this feels like "What colour are your bits?"[1] all over again. Are you aware of that article? Statements like this:
> Or does using a more complex mixing algorithm suddenly make it not a derived work of the input? What amount of token stirring is needed?
seem to indicate you are not.
Posted May 12, 2024 20:10 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (6 responses)
Can you write an anti-LLM, that given the LLM's output, would reverse it back to the original question?
Cheers,
Posted May 12, 2024 20:36 UTC (Sun)
by mb (subscriber, #50428)
[Link] (5 responses)
No. That's not possible.
You can't reverse 24 back into 18+6, because it could just as well have been 2*12 or 4*6 or anything else that fits the equation. There is an endless number of possibilities.
So, is the output still a derived work of the input? If so, why is an LLM different?
Posted May 12, 2024 21:01 UTC (Sun)
by gfernandes (subscriber, #119910)
[Link] (4 responses)
Who uses an obfuscator? The producer of the works, because said producer wants an extra layer/hurdle to protect *their* copyright of their original works.
Who uses an LLM? Obviously *not _just_* the producer of the LLM. And *because* of this, the LLM is fundamentally different as far as copyright goes.
The user can cause the LLM to leak copyrighted training material that the _producer_ of the LLM did not license!
This is impossible in the context of an obfuscator.
In fact there is an ongoing case which might bring legal clarity here - NYT v OpenAI.
Posted May 13, 2024 5:58 UTC (Mon)
by mb (subscriber, #50428)
[Link] (3 responses)
Nope. I use it on foreign copyrighted work to get public domain work out of it. LLM-style.
Posted May 13, 2024 6:17 UTC (Mon)
by gfernandes (subscriber, #119910)
[Link] (2 responses)
Posted May 13, 2024 8:56 UTC (Mon)
by mb (subscriber, #50428)
[Link] (1 responses)
Posted May 13, 2024 9:51 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
It's not different - the output of an LLM may be a derived work of the original. It may also be a non-literal copy, or a transformative work, or even unrelated to the input data.
There's a lot of "AI bros" who would like you to believe that using an LLM automatically results in the output not being a derived work of the input, but this is completely untested in law; the current smart money suggests that "generative AI" output (LLMs, diffusion probabilistic models, whatever) will be treated the same way as human output - it's not automatically a derived work just because you used an LLM, but it could be, and it's on the human operator to ensure that copyright is respected.
It's basically the same story as a printer in that respect; if the input to the printer results in a copyright infringement on the output, then no amount of technical discussion about how I didn't supply the printer with a copyrighted work, I supplied it with a PostScript program to calculate π and instructions on which digits of π to interpret as a bitmap will get me out of trouble. Same currently applies to LLMs; if I get a derived work as output, that's my problem to deal with.
This, BTW, is why "AI bros" would like to see the outputs of LLMs deemed as "non-infringing"; it's going to hurt their business model if "using an AI to generate output" is treated, in law, as equivalent to "using a printer to run a PostScript program", since then their customers have to do all the legal analysis to work out if a given output from a prompt has resulted in a derived work of the training set or not.
Posted May 12, 2024 18:06 UTC (Sun)
by farnz (subscriber, #17727)
[Link]
The question you're reaching towards is "at what point is the LLM's output a derived work of the input, and at what point is it a transformative work?".
This is an open question; it is definitely true that you can get LLMs to output things that, if a human wrote them, would clearly be derived works of the inputs (and smart money says that courts will find that "I used an LLM" doesn't get you out of having a derived work here). Then there's a hard area, where something written by a human would also be a derived work, but proving this is hard (and this is where LLMs get scary, since they make it very quick to rework things such that no transformative step has taken place, and yet it's not clear that this is a derived work, where humans have to spend some time on it).
And then we get into the easy case again, where the output is clearly transformative of the set of inputs, and therefore not a copyright infringement.
Posted May 14, 2024 2:31 UTC (Tue)
by viro (subscriber, #7872)
[Link]
As far as I'm concerned, anyone caught at using that deserves the same treatment as somebody who engages in any other form of post-truth - "not to be trusted ever after in any circumstances".
Posted May 10, 2024 21:45 UTC (Fri)
by kleptog (subscriber, #1183)
[Link] (38 responses)
Ok, but then we should also ban contributions by any users that have ever read/watched copyrighted material without the copyright holder's consent. You don't know how they might be contaminated. /s
In general I'm against having rules you cannot enforce in any meaningful way, because they are therefore open to abuse. Apparently the term for this is "void for vagueness" (I asked ChatGPT). The best you can do is some kind of statement whereby you say that Debian developers are responsible for their work and LLMs should be used with caution. But if you can't enforce it then it's basically useless to say so.
Posted May 11, 2024 11:50 UTC (Sat)
by josh (subscriber, #17465)
[Link] (37 responses)
AI, on the other hand, should be properly considered a derivative work of the training material. The alternative would be to permit AI to perform laundering of copyright violations.
Posted May 11, 2024 12:09 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (21 responses)
Posted May 11, 2024 13:20 UTC (Sat)
by josh (subscriber, #17465)
[Link] (19 responses)
Posted May 11, 2024 15:13 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (2 responses)
And that is a *complete* misunderstanding. I'm in Europe, and if an AI chucked my code *out*, I don't see any problem suing (apart from the lack of money, of course ...)
Feeding stuff INTO an LLM is perfectly legal - YOU DON'T NEED TO GET PERMISSION. So as a copyright holder I can't complain, unless they ignored my opt-out.
But the law says nothing about my works losing copyright status, or copyright not applying, or anything of that sort. My work *inside* the LLM is still my work, still my copyright. And if the LLM regurgitates it substantially intact, it's still my work, still my copyright, and anybody *using* it does NOT have my permission, nor the law's, to do that.
Cheers,
Posted May 11, 2024 15:23 UTC (Sat)
by mb (subscriber, #50428)
[Link] (1 responses)
Most of the time the outputs are different enough from the training data so that it's hard to say which part of the input data influenced the output to what degree. But at the same time it's clear that only the training data (plus the prompt) can influence the output.
You can do the same thing manually.
LLMs - apparently with some unicorn sparkle magic - remove copyright, by doing the same thing. How does that make sense?
Posted May 13, 2024 7:07 UTC (Mon)
by gfernandes (subscriber, #119910)
[Link]
Clearly not true.
Posted May 11, 2024 16:55 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (15 responses)
Posted May 11, 2024 17:17 UTC (Sat)
by mb (subscriber, #50428)
[Link] (6 responses)
Yes, in general I agree.
But we must be cautious that the holes are not equipped with check valves [1]. The holes should benefit free work as well, not only proprietary work.
When do we see the LLM trained on all of Microsoft's internal code? When do we use that to improve Wine? That wouldn't be a problem, right? No material is transferred through the LLM, after all. Right?
Posted May 11, 2024 17:24 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (5 responses)
The end goal is that there's no proprietary software, just software. We don't get there by making copyright even more draconian than it is now, and it's already really bad.
> When do we see the LLM trained on all of Microsoft's internal code?
As a wild guess, probably when it gets out of the atrociously horrendous internal git forge it lives in right now and into Github. Which is not anytime soon, or likely ever, sadly, because it would cost an arm and a leg and throw most of the engineering org into utter chaos. One can only wish.
Posted May 11, 2024 17:38 UTC (Sat)
by mb (subscriber, #50428)
[Link] (3 responses)
Yes. But we also don't get there by circumventing all Open Source licenses and installing a check valve in the direction of proprietary software.
The end goal of having "just software" actually means that everything is Open Source. (The other option would be to kill Open Source).
Not only the end goal is important, but also the intermediate steps.
Posted May 11, 2024 17:41 UTC (Sat)
by bluca (subscriber, #118303)
[Link] (2 responses)
You'll be delighted to know that's not how any of this works, then - it's just autocomplete with some extra powers and bells and whistles, it doesn't circumvent anything
Posted May 11, 2024 21:22 UTC (Sat)
by flussence (guest, #85566)
[Link] (1 responses)
Posted May 12, 2024 11:39 UTC (Sun)
by bluca (subscriber, #118303)
[Link]
* (when only fully trusted workloads are executed, no malware allowed, pinky swear)
Posted May 13, 2024 10:16 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
> As a wild guess, probably when it gets out of the atrociously horrendous internal git forge it lives in right now and into Github. Which is not anytime soon, or likely ever, sadly, because it would cost an arm and a leg and throw most of the engineering org into utter chaos. One can only wish.
Alas, this is kind of my baseline for believing Microsoft's stance that Copilot doesn't affect copyright: eat the same damn cake you're forcing down everyone else's throats (IIRC you work at Microsoft, but the "you" here is at Microsoft PR/lawyers, not bluca specifically).
If Copilot really can only be trained on code accessible over the Github API and not raw git repos, that seems a bit short-sighted, no?
Posted May 12, 2024 10:48 UTC (Sun)
by MarcB (subscriber, #101804)
[Link] (7 responses)
Not necessarily. However, should you lose - or should your lawyers consider this a likely outcome - you can use your huge amount of money to strike a licensing deal with the plaintiff. Even if you overpay, you still win, because you now have exclusive access.
See the Google/Reddit deal.
Posted May 12, 2024 11:38 UTC (Sun)
by bluca (subscriber, #118303)
[Link] (6 responses)
Posted May 12, 2024 11:55 UTC (Sun)
by josh (subscriber, #17465)
[Link] (5 responses)
When copyright ceases to exist, all software will be free to copy, modify, and redistribute. Until then, AI training should have to respect Open Source licenses just like everyone else does.
Posted May 12, 2024 12:00 UTC (Sun)
by bluca (subscriber, #118303)
[Link]
Posted May 12, 2024 19:55 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
I'm not a fan of copyright, but "never" is just as bad as "for ever".
The US got it approximately right with its "fourteen years". The majority of any value, for any work, is extracted in the first 10 years or so. Beyond that, most works are pretty worthless.
So let's make it a round 15 - all works are automatically protected for 15 years from publication - but if you want to avail yourself of that protection you must put the date on the work. After that, if the work has value you (as a real-person author, or the "heir alive at publication") can renew the copyright on a register for successive 15-year intervals (PLURAL) for a smallish fee.
And for people like Disney, Marvel, etc etc, you can trademark your work to keep control of your valuable universe if you wish.
So this will achieve the US aim of "encouraging works into the Public Domain" and works won't "rot in copyright" because people won't pay to extend it.
Cheers,
Posted May 12, 2024 20:05 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
And this is the whole point of Berne. Different countries, different rules (same basic framework). But works have to be protected identically regardless of nationality. Which comes as a result of the American abuse of copyright pre-1984'ish. One only has to look at the AT&T/BSD suit, where AT&T removed copyright notices and effectively placed OTHER PEOPLES' COPYRIGHTED WORK into the Public Domain.
Going back, there's Disney's Fantasia, where they used European works and completely ignored copyright. Go back even further to Gilbert & Sullivan's "The Pirates of Penzance", where they had to go to extreme lengths to prevent other people from copyrighting their work and preventing them from performing their own work in the US.
THERE IS NO SPECIAL EXCEPTION FOR AI. Not even in Europe. As a person, you are free to study copyrighted works for your own education. European Law makes it explicit that "educating" an AI is the same as educating a person. Presumably the same rules apply to an AI regurgitating what it's learnt, as to a human. If it's indistinguishable from (or clearly based on) the input, then it's a copyright violation. The problem is, be it human or AI, what do you mean by "clearly based on".
Cheers,
Posted Jun 10, 2024 15:42 UTC (Mon)
by nye (subscriber, #51576)
[Link] (1 responses)
I'm practically speechless that you would lie so brazenly as to say this in the same thread as espousing approximately the most maximalist possible interpretation of copyright.
Honestly, it's threads like this one that remind me of how ashamed I am that I once considered myself part of the FOSS community. It's just... awful. Everyone here is awful. Every time I read LWN I come away thinking just a little bit less of humanity. Today, you've played your part.
Posted Jun 10, 2024 16:34 UTC (Mon)
by corbet (editor, #1)
[Link]
Posted May 13, 2024 9:50 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link]
Posted May 11, 2024 13:56 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (9 responses)
AI, on the other hand, should be properly considered a derivative work of the training material. The alternative would be to permit AI to perform laundering of copyright violations.”
I would like to understand better why this is. Plenty of things in my brain are in fact covered by copyright and I could likely violate quite a bit of copyright from memory. Instead it’s entirely about how much of the input material is present in the output.
If we’re just saying “humans are different”, it would be nice to understand *why* in detail and if anything non human could ever clear those hurdles. I get the distinct sense a lot of these arguments actually boil down to “humans are special and nothing else is like a human, because humans are special”
Posted May 11, 2024 14:38 UTC (Sat)
by willy (subscriber, #9762)
[Link] (2 responses)
Posted May 12, 2024 1:29 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (1 responses)
Posted May 13, 2024 1:38 UTC (Mon)
by raven667 (subscriber, #5198)
[Link]
Sci-fi is also a fictional scenario that swaps people for aliens or AI or whatever to be able to talk about power dynamics and relationships without existing bias creeping in, but that doesn't mean that LLMs are "alive" or "moral agents" in any way; they are nowhere near complex and complete enough for that to be a consideration. People see faces in the side of a mountain or a piece of toast, and in the same way perceive the output of LLMs, mistaking cogent-sounding statistical probability for intelligence. There is no there there, because while an LLM might in some small way approximate thought, it's thoroughly lobotomized, with no concept of concepts.
Posted May 11, 2024 20:21 UTC (Sat)
by flussence (guest, #85566)
[Link] (5 responses)
Are you saying there's a threshold of "AI-ness", whereby in crossing it, someone distributing a 1TB torrent of Disney DVD rips and RIAA MP3s, encrypted with a one time pad output from a key derivation function with a trivially guessable input, and being caught doing so, would result in the torrent file itself being arrested instead? Does a training set built by stealing the work of others have legal personhood now? Does the colour of the bits and the intent of the deed no longer matter to a court if the proponent of the technology is sufficiently high on their own farts?
Posted May 12, 2024 1:28 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (4 responses)
Posted May 13, 2024 10:54 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (3 responses)
Posted May 13, 2024 15:40 UTC (Mon)
by atnot (subscriber, #124910)
[Link] (2 responses)
LLMs don't have that, they just try to predict what the answer would be on stackoverflow. Including, apparently, much to my delight, "closed as duplicate". If you try using them for actually writing code, it very quickly becomes clear they have no actual understanding of the language beyond stochastically regurgitating online tutorials[1]. They falter as soon as you ask for something that isn't a minor variation of a common question or something that has been uploaded to github thousands of times.
If we are to call both of these things "learning", we do have to acknowledge that they are drastically different meanings of the term.
[1] And no, answers to naive queries about how X works do not prove it "understands" X, merely that the training data contains enough instances of this question being answered to be memorizable. Which for a language like C is going to be a lot. Consider e.g. that an overwhelming majority of universities in the world have at least one C course.
Posted May 13, 2024 15:44 UTC (Mon)
by bluca (subscriber, #118303)
[Link]
That's really not true for the normal use case, which is fancy autocomplete. It doesn't just regurgitate online tutorials or stack overflow, it provides autocompletion based on the body of work you are currently working on, which is why it's so useful as a tool. The process is the same stochastic parroting mind you, of course language models don't really learn anything in the sense of gaining an "understanding" of something in the human sense.
Posted May 13, 2024 20:39 UTC (Mon)
by rschroev (subscriber, #4164)
[Link]
Have you tried something like CoPilot? I've been trying it out a bit over the last three weeks (somewhat grudgingly). One of the things that became clear quite soon is that it does not just get its code from StackOverflow and GitHub and the like; it clearly tries to adapt to the body of code I'm working on (it certainly doesn't always get it right, but that's a different story).
An example, to make things more concrete. Let's say I have a struct with about a dozen members, and a list of key-value pairs, where those keys are the same as the names of the struct members, and I want to assign the values to the struct members. I'll start writing the first one or two assignments, and it then doesn't take long before CoPilot starts autocompleting with the remaining struct members, offering me the exact code I was trying to write, even when I'm pretty sure the names I'm using are unique and not present in publicly accessible sources.
I'm not commenting on the usefulness of all this; I'm just showing that what it does is not just applying StackOverflow and GitHub to my code. We probably should remember that LLMs are not all alike. It's very well possible that e.g. ChatGPT would have a worse "understanding" (for lack of a better word) of my code, and would rely much more on what it learned before from public sources.
Posted May 11, 2024 21:57 UTC (Sat)
by kleptog (subscriber, #1183)
[Link] (1 responses)
No, the alternative is to consider copyright infringement based on how much something resembles a copyrighted work. Whether the copyrighted work was part of the training set is not relevant to this determination. This is pretty much the EU directive position.
It's pretty much the same idea that allows search engines to process all the information on the internet without having to ask the copyright holders, but they can't just reproduce those pages for users.
Posted May 12, 2024 0:39 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Incidentally, this is also the US position, although the rules for the training process itself remain somewhat opaque (unlike in the EU).
Posted May 12, 2024 1:25 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (2 responses)
*Why* is human output copyrightable and AI output not? Can you explain this, and give reasons? You've been stating it, but not giving reasons.
Posted May 12, 2024 2:13 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link]
Posted May 14, 2024 1:20 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link]
Robotically made things, or things made by animals, are not works that reflect personal creativity.
Posted May 12, 2024 3:27 UTC (Sun)
by patrakov (subscriber, #97174)
[Link] (2 responses)
For example, neither ChatGPT 3.5 nor LLaMa derivatives can answer this question correctly; they produce various abominations instead of a one-liner with a few file descriptor redirections:
> I need some POSIX shell help. I have a pipeline, `command_a | command_b`, and sometimes `command_a` produces something on stderr. I need to capture that into a variable. Is there any way to do that without temporary files? Note that in my case, `command_b` sometimes terminates early, and `command_a` gets a SIGPIPE. I need to keep this behavior. Also, it is absolutely forbidden to run `command_a` twice.
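(For reference, here is a minimal sketch of the sort of file-descriptor juggling being alluded to. It is one possible solution, not a canonical one; it keeps the `command_a` and `command_b` names from the prompt and assumes fds 3 and 4 are otherwise unused in the script.)

```sh
exec 3>&1              # keep a copy of the real stdout on fd 3
err=$( { command_a 2>&4 3>&- 4>&- | command_b >&3 3>&- 4>&-; } 4>&1 )
exec 3>&-              # close the copy again
printf '%s\n' "$err"   # command_a's stderr, captured without a temporary file
```

Inside the command substitution, fd 4 is the capture pipe: command_a's stderr is redirected to it while its stdout still flows through the ordinary pipe to command_b, so command_a runs exactly once and still receives SIGPIPE if command_b exits early; command_b's stdout is routed back to the saved fd 3 so it is not captured.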
Posted May 12, 2024 7:16 UTC (Sun)
by intelfx (subscriber, #130118)
[Link]
You should definitely submit this to some sort of a contemporary Turing test corpus, if someone maintains one.
Posted May 12, 2024 7:41 UTC (Sun)
by donald.buczek (subscriber, #112892)
[Link]
I can confirm that the current top dog, GPT-4-0125-preview, also produces invalid code.
If you put enough energy into the dialog, you can talk it into a correct solution.
Posted May 13, 2024 4:23 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (58 responses)
I’m appalled.
As FOSS contributors, we are DIRECTLY damaged by ML/LLM (so-called “AI”) operators violating our copyrights and licences.
So, no, a wait-and-see approach here is nowhere near discouraging enough.
I totally am in favour of… say GotoSocial’s approach at adding a little checkbox for submitters to confirm they’ve not been using AI sludge to form parts of the submission.
Any use of those tools is totally unacceptable in multiple ways:
I’ve been utterly shocked that the OSI fell, slightly shocked that Creative Commons fell, but I’m totally appalled now.
AI sludge definitely is not, and will never be, acceptable for contribution to MirBSD, and I can say that with confidence.
Posted May 13, 2024 9:05 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (56 responses)
Dear LWN staff:
Can we please have an article explaining substantial similarity analysis (and similar concepts in other parts of the world)? I have seen this fallacy repeated over and over again in these comments, suggesting that we could all use a refresher on what a "derivative work" is and how it is actually determined in real courts of law. It could also be worth going over the abstraction-filtration-comparison test, but that might just be too deep into the weeds for a non-lawyer audience.
Anyway, to save you all a whole bunch of searching: The process used to create the allegedly infringing work is mostly irrelevant to the legal analysis. What matters is how similar that particular work is to the alleged original. It's a very fact-intensive inquiry that must be done on a per-work basis. In other words, you have to show that each output is infringing, one by one, by comparing it with the particular training data that it allegedly infringes on.
A blanket statement that "all" works produced in a certain fashion are derivative works is really only believable when the process, by its very nature, will always generate outputs which closely match their inputs (e.g. lossy encodings like JPEG, MP3, etc.), and so it is always straightforward to show that each particular output is similar to its corresponding input. But you can't do that with LLMs, because there is no such thing as a "corresponding input." Instead, you have to go through the painstaking process that The New York Times's lawyers went through, when they sued OpenAI for allegedly infringing on their articles. The NYT did not simply say "our articles were used as training material, so everything the AI outputs is derivative," because they would have been laughed out of court with an argument like that. Instead, they coaxed the LLM into reproducing verbatim or near-verbatim passages from old NYT articles, then sued for those particular infringing outputs.
For completeness, OpenAI's position, as far as I understand it, is that this sort of regurgitation is a rare bug, and also the NYT's "coaxing" was too aggressive and did not match the way normal users interact with the service (in fact, OpenAI has said that the prompts also contained large portions of the original articles, and so the AI may have simply mirrored the writing style of those prompts in order to complete them). They are still in litigation, but I predict a settlement at some point, so we might not get to see a court rule on those arguments.
My point is this: Yes, some outputs probably are infringing on some training inputs, at least for LLMs where regurgitation has been demonstrated. That is a far cry from all or even most outputs. We do not know where the law is going to come down on models (there are a lot of unanswered questions about how you even apply similarity analysis to a model in the first place), but for outputs, it is hard to believe that all or even most outputs have a corresponding training input (or small collection of training inputs) that they are very similar to.
Posted May 13, 2024 10:06 UTC (Mon)
by bluca (subscriber, #118303)
[Link]
Posted May 13, 2024 10:18 UTC (Mon)
by mb (subscriber, #50428)
[Link] (51 responses)
So, it is possible to write a non-AI obfuscator that takes Copyrighted $INPUT and makes Public Domain $OUTPUT from it by transforming it often enough, so that the original $INPUT is not recognizable anymore.
It doesn't matter, that $OUTPUT obviously came from $INPUT, if I put $INPUT into a complicated algorithm and get $OUTPUT. The $OUTPUT is not a Derived Work w.r.t Copyright law in this case.
Right?
Posted May 13, 2024 10:36 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (48 responses)
Wrong.
The law barely cares at all about how you transform the input to get the output; that's in the category of "insignificant shenanigans", in the same sense that a discussion about your route to and from work, and whether you drive a particular make of car, or a different brand, is also insignificant when looking at a speeding ticket.
The law cares about precisely two things when deciding whether a given work is derivative or not:
If $OUTPUT obviously came from $INPUT, applying the AFC test, then you've got yourself a derived work. If $OUTPUT is transformative or independent, even if the algorithm had access to $INPUT, then it's not a derived work.
Posted May 13, 2024 10:48 UTC (Mon)
by mb (subscriber, #50428)
[Link] (46 responses)
Ok, but you conclude with:
> If $OUTPUT is transformative or independent, even if the algorithm had access to $INPUT, then it's not a derived work.
So, why is that apparently only true for an AI algorithm and not my non-AI obfuscator?
Posted May 13, 2024 10:51 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (45 responses)
It's equally true for both; you have not given a source for why you believe that an AI will be treated differently. If the output is, in the copyright sense, transformative or independent of the input, then it's not an infringement. If it's derived from the input, then it's infringing.
And that applies no matter what the steps in the middle are. They can be an AI algorithm, a human working alone, some sort of obfuscation, anything. The steps don't matter; only the inputs and the output matter.
Posted May 13, 2024 11:00 UTC (Mon)
by mb (subscriber, #50428)
[Link] (44 responses)
Ok. I'm fine with that explanation.
>you have not given a source for why you believe that an AI will be treated differently
This discussion is the current source.
But the claim that AI is some kind of magic Copyright remover comes up over and over again. That simply doesn't make sense if it's not equally true for conventional algorithms.
But that makes me come to the conclusion that Copyright actually is completely useless these days.
Posted May 13, 2024 11:04 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (41 responses)
I've not seen anyone other than you in the current discussion claiming that AI is a magic copyright remover; the closest I can see are some misunderstandings of what someone else was saying, and the (correct) statement that in the EU, copyright does not restrict you from feeding a work into an algorithm (but the EU is silent on what copyright implications there are to the output of that algorithm).
So, given that you brought it into discussion, I'd like to know where you got the idea that AI is a magic copyright remover from, so that I can consider debunking it at the source, not when it gets relayed via you.
Posted May 13, 2024 11:14 UTC (Mon)
by mb (subscriber, #50428)
[Link] (23 responses)
Come on. Look harder.
Posted May 13, 2024 13:26 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (22 responses)
I've read every comment on this article, and you are the only person who claims that AI is a "magic copyright removal" tool. Nobody else makes that claim that I can see - a link to a comment where someone other than you makes that claim would be of interest.
The closest I can see is this comment, which could be summarized down to it's already hard to enforce open source licensing in cases where literal copying can be proven, and making it harder to show infringement is going to make it harder to enforce copyright.
Posted May 13, 2024 13:57 UTC (Mon)
by mb (subscriber, #50428)
[Link] (21 responses)
Come on. It's even written in the article itself. Did you read it?
Posted May 13, 2024 14:21 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (20 responses)
I did, and I do not see any claim that AIs are "magic copyright removal tools"; I see claims that AIs can be used to hide infringement, but not that their output cannot be a derived work of their inputs.
Indeed, I see the opposite - people being concerned that someone will use an AI to create something that later causes problems for Debian, since it infringes copyrights that then get enforced.
Posted May 13, 2024 15:05 UTC (Mon)
by mb (subscriber, #50428)
[Link] (19 responses)
> He specified "commercial AI" because "these systems are copyright laundering machines"
[1] I'm not going to search the internet for you.
Posted May 13, 2024 15:09 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (18 responses)
That's not a claim that AIs are magical copyright removing tools; that's a claim that AIs hide infringement - in English, if something is a "laundering machine", it cleans away the origin of something, but doesn't remove the fact that it was originally dirty.
Again, I ask you to point to where you're getting your claim from, given that this is now the third time I've asked you to identify where you're getting this from, and been pointed to things that don't support the idea that AIs are "magical copyright removal machines", and I've had you insult my reading ability because I dare to question you.
Posted May 13, 2024 15:13 UTC (Mon)
by mb (subscriber, #50428)
[Link] (17 responses)
Just like I call my washing machine the "magic dirt removal machine" I call AI source code transformers "magic copyright removal machines".
It's an aggravation to point out the fact that something that had been there in the input is no longer there in the output after a processing step.
Posted May 13, 2024 15:19 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (16 responses)
The whole reason I'm asking is that "AI as magic copyright removal machine" is a falsehood, and you've clearly picked that up from somewhere; rather than trying to correct people who pick it up from the same source, I'd like to go back to the source and correct it there.
Now, if it's a simple misunderstanding of what the article says, there's not a lot that can be done there - English is a horrorshow of a language at the best of times - but if you've picked it up from something someone else has actually claimed, that claim can be corrected at source.
Posted May 13, 2024 15:32 UTC (Mon)
by mb (subscriber, #50428)
[Link] (15 responses)
No.
And I really don't understand why you insist so hard that there is anything wrong with that. Because there isn't.
I understand that you don't like these words, for whatever reason, but that's not really my problem to solve.
Thanks a lot for trying to correct me. I learnt a lot in this discussion. But I will continue to use these words, because in my opinion these words describe very well what actually happens.
Posted May 13, 2024 15:40 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (13 responses)
The point is, that's not a fact at all, as it has already been explained many times.
Posted May 13, 2024 15:47 UTC (Mon)
by mb (subscriber, #50428)
[Link] (12 responses)
You guys have to decide on something. Both can't be true. There is nothing in-between. There is no such thing as "half-GPLed".
Posted May 13, 2024 16:32 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (10 responses)
Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.
Posted May 13, 2024 16:53 UTC (Mon)
by mb (subscriber, #50428)
[Link] (5 responses)
Ok. Got it now. So
>the fact that something that had been there in the input is no longer there in the output after a processing step.
is true after all.
> but under the copyright exception granted by the law, which trumps any license you might attach to it.
The input was copyright protected and the special exception made it non-copyright-protected because of reasons.
And for whatever strange reason that only applies to AI algorithms, because the EU says so.
Posted May 13, 2024 17:15 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
> is true after all.
> The input was copyright protected and the special exception made it non-copyright-protected because of reasons.
> And for whatever strange reason that only applies to AI algorithms, because the EU says so.
No, this is also false.
Copyright law says that there are certain actions I am capable of taking, such as making a literal copy, or a "derived work" (a non-literal copy), which the law prohibits unless you have permission from the copyright holder. There are other actions that copyright allows, such as reading your text, or (in the EU) feeding that text as input to an algorithm; they may be banned by other laws, but copyright law says that these actions are completely legal.
The GPL says that the copyright holder gives you permission to do certain acts that copyright law prohibits as long as you comply with certain terms. If I fail to comply with those terms, then the GPL does not give me permission, and I now have a copyright issue to face up to.
The law says nothing about the copyright protection on the output of the LLM; it is entirely plausible that an LLM will output something that's a derived work of the input as far as copyright law is concerned, and if that's the case, then the output of the LLM infringes. Determining if the output infringes on a given input is done by a comparison process between the input and the output - and this applies regardless of what the algorithm that generated the output is.
Further, this continues to apply even if the LLM itself is not a derived work of the input data; it might be fine to send you the LLM, but not to send you the result of giving the LLM certain prompts as input, since the result of those prompts is derived from some or all of the input in such a way that you can't get permission to distribute the resulting work.
Posted May 13, 2024 17:15 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
No, because what you are stubbornly refusing to understand, despite it having been explained a lot of times, is:
> Now, again as it has already been explained, whether the output of a prompt is copyrightable, and whether it's a derived work of existing copyrighted material, is an entirely separate question that depends on many things, but crucially, not on which tool happened to have been used to write it out.
This is a legal matter, not a programming one. The same paradigms used to understand software cannot be used to try and understand legal issues.
Posted May 13, 2024 17:33 UTC (Mon)
by mb (subscriber, #50428)
[Link] (1 responses)
Yes. That is the main problem. It does not have to make logical sense for it to be "correct" under law.
> stubbornly
I am just applying logical reasoning. The logical chain obviously is not correctly implemented. Which is often the case in law, of course. Just like the logical reasoning chain breaks if the information goes through a human brain. And that's Ok.
Just saying that some people claiming here things like "it's *obvious* that LLMs are like this and that w.r.t. Copyright" are plain wrong. Nothing is obvious in this context. It's partly counter-logical and defined with contradicting assumptions.
But that's Ok, as long as a majority agrees that it's fine.
Posted May 14, 2024 5:11 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
Unfortunately, while that article is very well-written and generally illuminates the right way to think about verbatim copying, it can be unintentionally misleading when we're talking about derivative works. The "colors" involved in verbatim copying are relatively straightforward - either X is a copy of Y, or it is not, and this is purely a matter of how you created X. But when we get to derivative works, there are really two* separate components that need to be considered:
- Access (a "color" of the bits, describing whether the defendant could have looked at the alleged original).
- Similarity (a function of the bits, and not a color).
The problem is, if you've been following copyright law for some time, you might be used to working in exclusively one mode of analysis at a time (i.e. either the "bits have color" mode of analysis or the "bits are colorless" mode of analysis). The problem is, access is a colored property, and similarity is a colorless property. You need to be prepared to combine both modalities, or at least to perform each of them sequentially, in order to reason correctly about derivative works. You cannot insist that "it must be one or the other," because as a matter of law, it's both.
* Technically, there is also the third component of originality, but that only matters if you want to copyright the derivative work, which is an entirely different discussion altogether. That one is also a "color" which depends on how much human creativity has gone into the work.
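A toy sketch may help make that two-part framing concrete. Everything below (the struct, the helper names, the crude string comparison) is invented for illustration; it only encodes the point made above, that a finding of derivation needs both access and substantial similarity, not either one alone.

/* Invented illustration of the two components described above:
 * access is a "color" (provenance, not visible in the bits),
 * similarity is a property of the bits themselves.  Both are needed. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct work {
	const char *text;   /* the colorless bits */
	bool had_access;    /* the color: did the author see the original? */
};

/* Crude stand-in for an abstraction-filtration-comparison style analysis. */
static bool substantially_similar(const struct work *a, const struct work *b)
{
	return strcmp(a->text, b->text) == 0;
}

static bool plausibly_derivative(const struct work *alleged,
                                 const struct work *original)
{
	return alleged->had_access && substantially_similar(alleged, original);
}

int main(void)
{
	struct work original = { "some expressive text", false };
	struct work copy     = { "some expressive text", true  };
	struct work clean    = { "independently written text", false };

	printf("%d %d\n", plausibly_derivative(&copy, &original),
	       plausibly_derivative(&clean, &original));
	return 0;
}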
Posted May 13, 2024 22:18 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link]
No, wrong.
They’re copyright-protected, but *analysing* copyright-protected works for text and data mining is an action permitted without the permission of the rights holders.
See my other post in this subthread. This limitation of copyright protection does not extend to doing *anything* with the output of such models.
Posted May 13, 2024 22:00 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (3 responses)
Text and data mining is opt-out, and the opt-out must be machine-readable. But this limitation of copyright only applies to doing automated analyses of works to obtain information about patterns, trends and correlations.
(I grant that creating an LLM model itself probably falls under this clause.)
But not only are the copies of works made for text and data mining to be deleted as soon as they are no longer necessary for this purpose (which these models clearly don't do, given how it's possible to obtain “training data” by the millions), the exception also does not allow you to reproduce the output of such models.
Text and data mining is, after all, only permissible to obtain “information about especially patterns, trends and correlations”, not to produce outputs as genAI does.
Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.
Posted May 13, 2024 22:44 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
It doesn't say that it must be deleted, it says:
> Reproductions and extractions made pursuant to paragraph 1 may be retained for as long as is necessary for the purposes of text and data mining.
Not quite the same thing. I don't know whether it's true that verbatim copies of training data are actually stored as you imply, as I am not an ML expert - it would seem strange and pointless, but I don't really know. But even assuming that was true, if that's required to make the LLM work, then the regulation clearly allows for it.
> it also does not allow you to reproduce the output of such models.
Every LLM producer treats such instances as bugs to be fixed. And they are really hard to reproduce, judging from how contrived and tortured the sequence of prompts needs to be to make that actually happen. The NYT had to basically copy and paste portions of their own articles into the prompt to make ChatGPT spit them back, as shown in their litigation vs OpenAI.
> Therefore, the limitation of copyright does NOT apply to LLM output, and therefore the normal copyright rules (i.e. mechanically combined of its inputs, whose licences hold true) apply.
And yet, the NYT decided to sue in the US, where the law is murky and based on fair use case-by-case decisions, rather than in the EU where they have an office and it would have been a slam dunk, according to you. Could it be that you are wrong? It's very easy to test it, why don't you sue any of the companies that publishes an LLM and see what happens?
Posted May 13, 2024 23:16 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
Doesn't change the fact that it is possible, and in sufficient amounts to consider that the models contain enough of their input works for the output to be a mechanically produced derivative of them.
Posted May 13, 2024 23:57 UTC (Mon)
by bluca (subscriber, #118303)
[Link]
They are treated as bugs because they are bugs, despite the contrived and absurd ways that are necessary to reproduce them. Doesn't really prove anything.
> Doesn’t change the fact that it is possible, and in sufficiently amount to consider that models contain sufficient amounts from their input works for the output to be a mechanically produced derivative of them.
Illiterate FUD. Go to court and prove that, if you really believe that.
Posted May 13, 2024 16:58 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
The output from the LLM is almost certainly GPLed in as far as the output from the LLM is (per copyright law) a derived work of the GPLed input. The complexity is that not all LLM outputs will be a derived work as far as copyright law is concerned, and where they are not derived works, there is no copyright, hence there is nothing to be GPLed.
And that's the key issue - the algorithm between "read a work as input" and "write a work as output" is completely and utterly irrelevant to the question of "does the output infringe on the copyright applicable to the input?". That depends on whether the output is, using something like an abstraction-filtration-comparison test, substantially the same as the input, or not.
For example, I can copy if (!ret) { if (ret == 1) ret = 0; goto cleanup; } directly from the kernel source code into another program, and that has no GPL implications at all, even though it's a literal copy-and-paste of 5 lines of kernel code that I received under the GPL. However, if I copy a different 5 lines of kernel code, I am plausibly creating a derived work, because I'm copying something relatively expressive.
This is why both can be true; as a matter of law, not all copying is copyright infringement, and thus not all copying has GPL implications when the input was GPLed code.
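For readers who want the contrast spelled out, here is a sketch with the five quoted lines dropped into an invented function. The helpers, names, and surrounding code are made up; only the if/goto idiom comes from the comment above, and the point is that this generic error-handling pattern carries essentially no protectable expression on its own.

/* Sketch only: the function and helpers are invented; the five-line
 * idiom in the middle is the fragment quoted in the comment above. */
#include <stdio.h>

static int setup_widget(void) { return 0; }    /* hypothetical helper */
static void cleanup_widget(void) { }           /* hypothetical helper */

static int do_something(void)
{
	int ret = setup_widget();

	if (!ret) {
		if (ret == 1)
			ret = 0;
		goto cleanup;
	}
	/* a longer, more distinctive block of logic would live here,
	 * and copying that is where GPL questions would start to bite */
cleanup:
	cleanup_widget();
	return ret;
}

int main(void)
{
	printf("%d\n", do_something());
	return 0;
}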
Posted May 13, 2024 15:42 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
So to describe an AI as a "magical copyright laundering machine" is to admit / claim that it's involved in illegal activity.
Cheers,
Posted May 13, 2024 12:26 UTC (Mon)
by anselm (subscriber, #2796)
[Link] (16 responses)
Elsewhere in copyright law, the general premise is that copyright is only available for works which are the “personal mental creation” of a human being. Speciesism aside, something that comes out of an LLM is obviously not the personal mental creation of anyone, and that seems to take care of that, even without the EU pronouncing on it in the context of training AI models.
Posted May 13, 2024 13:26 UTC (Mon)
by kleptog (subscriber, #1183)
[Link] (4 responses)
LLMs are prompted, they don't produce output out of thin air. Therefore the output is the creation of the human that triggered the prompt. Now whether that person was pressing buttons on a device that sent network packets to a server that processed all those keystrokes into a block of text to be sent to an LLM in the cloud is irrelevant. Somewhere along the way a human decided to invoke the LLM and controlled which input to send to it and what to do with the output. That human being is responsible for respecting copyright. Whether the output is copyrightable depends mostly on how original the prompt is.
The idea that LLM output cannot be copyrighted is silly. That would be like claiming that documents produced by a human typing into LibreOffice cannot be "the personal mental creation of anyone". LLMs, like LibreOffice, are tools, nothing more. There's a human at the keyboard who is responsible. Sure, most of the output of an LLM isn't going to be original enough to be copyrightable, but that's quite different from saying *all* output from LLMs is not copyrightable.
As with legal things in general, it depends.
Posted May 13, 2024 13:54 UTC (Mon)
by mb (subscriber, #50428)
[Link] (1 responses)
Ok, so if I enter wget into my shell prompt to download some copyrighted music, it makes me the creator?
Posted May 13, 2024 14:18 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
You are the creator of that copy, and in as far as there is anything copyrightable in creating that copy, you own that copyright.
However, that copy is (in most cases) either an exact copy of an existing work, or a derived work of an existing work; if it's an exact copy, then there is nothing copyrightable in the creation of the copy, so you own nothing.
If it's a derived work, then you own copyright in the final work thanks to the creative expression you put in to create the copy, but doing things with that work infringes the copyright in the original work unless you have appropriate permission from the copyright holder on the original work, or a suitable exception in copyright law.
Posted May 13, 2024 21:51 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
> LLMs are prompted, they don't produce output out of thin air. Therefore the output is the creation of the human that triggered the prompt.
This is ridiculous. The “prompt” is merely a tiny parametrisation of a query that extracts from the huge database of (copyrighted) works.
Do read the links I listed in https://lwn.net/Comments/973578/
> The idea that LLM output cannot be copyrighted is silly.
😹😹😹😹😹😹😹
You’re silly.
This is literally enshrined into copyright law. For example:
> Werke im Sinne dieses Gesetzes sind nur persönliche geistige Schöpfungen.
“Works as defined by this [copyright] law are only personal intellectual creations that pass threshold of originality.” (UrhG §2(2))
Wikipedia explains the “personal” part of this following general jurisprudence:
> Persönliches Schaffen: setzt „ein Handlungsergebnis, das durch den gestaltenden, formprägenden Einfluß eines Menschen geschaffen wurde“ voraus. Maschinelle Produktionen oder von Tieren erzeugte Gegenstände und Darbietungen erfüllen dieses Kriterium nicht. Der Schaffungsprozeß ist Realakt und bedarf nicht der Geschäftsfähigkeit des Schaffenden.
“demands the result of an act from the creative, form-shaping influence of a human: mechanical production or things or acts produced by animals do not fulfill this criterium (but legal competence is not necessary).” (<https://de.wikipedia.org/wiki/Urheberrecht_(Deutschland)#Schutzgegenstand_des_Urheberrechts:_Das_Werk>)
So, yes, LLM output cannot be copyrighted (as a new work/edition) in ipso.
And to create an adaption of LLM output, the human doing so must not only invest significant *creativity* (not just effort / sweat of brow!) to pass threshold of originality, but they also must have the permission of the copyright (exploitation rights, to be precise) holders of the original works to do so (and, in droit d’auteur, may not deface, so the authors even if not holders of exploitation rights also have something to say).
Posted May 13, 2024 22:24 UTC (Mon)
by corbet (editor, #1)
[Link]
While this discussion can be seen as on-topic for LWN, I would also point out that we are not copyright lawyers, and that there may not be a lot of value in continuing to go around in circles here. Perhaps it's time to wind it down?
Posted May 13, 2024 13:40 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (1 responses)
That's looking at the other end of it - the question here is not whether an LLM's output can be copyrighted, but whether an LLM's output can infringe someone else's copyright. And the general stance elsewhere in copyright law is that the tooling used is irrelevant to whether or not a given tool output infringed copyright on that tool's inputs. It might, it might not, but that depends on the details of the inputs and outputs (and importantly, not on the tool in question).
Posted May 13, 2024 14:55 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
The output of an LLM cannot be copyrightED. That is, there is no original creative contribution BY THE LLM worthy of copyright.
But the output of an LLM can be COPYRIGHT. No "ed" on the end of it. The mere fact of feeding stuff through an LLM does not automatically cancel any pre-existing copyright.
Again, we get back to the human analogy. There is no restriction on humans CONSUMING copyrighted works. European law explicitly extends that to LLMs CONSUMING copyrighted works.
And just as a human can regurgitate a copyrighted work in its entirety (Mozart is famous for doing this), so can an LLM. And both of these are blatant infringements if you don't have permission - although copyright was in its infancy when Mozart did it so I have no idea of the reality on the ground back then ...
Cheers,
Posted May 13, 2024 13:52 UTC (Mon)
by mb (subscriber, #50428)
[Link] (8 responses)
Well, that is not obvious at all.
Because the inputs were mental creations.
At which point did the data lose the "mental creation" status traveling through the algorithm?
Will processing the input with 'sed' also remove it, because the output is completely processed by a program, not a human being?
What level of processing do we need for the "mental creation" status to be lost? How many chained 'sed's do we need?
Posted May 13, 2024 21:39 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (7 responses)
Even “mechanical” transformation by humans does not create a work (as defined by UrhG, i.e. copyright). It has to have some creativity.
Until then, it’s a transformation of the original work(s) and therefore bound to the (sum of their) terms and conditions on the original work.
If you have a copyrighted thing, you can print it out, scan it, compress it as JPEG, store it into a database… it’s still just a transformation of the original work, and you can retrieve a sufficiently substantial part of the original work from it.
The article where someone reimplemented a (slightly older version of) ChatGPT in a 498-line PostgreSQL query showed, in an easily understandable way, how this is just lossy compression/decompression: https://explainextended.com/2023/12/31/happy-new-year-15/
There are now feasible attacks obtaining “training data” from prod models in large scale, e.g: https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
This is sufficient to prove that these “models” are just databases with lossily compressed, but easily enough accessible, copies of the original, possibly (probably!) copyrighted, works.
Another thing I would like to point out is the relative weight. For a work which I offer to the public under a permissive licence, attribution is basically the only remuneration I can ever get. This means failure to attribute so has a much higher weight than for differently licenced or unlicenced stuff.
Posted May 13, 2024 21:55 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (6 responses)
While the AI bandwagon greatly exaggerates the capability of LLMs, let's not fall into the opposite trap. ChatGPT et al. are toys; real applications like Copilot are very much not "just databases". A database is not going to provide you with autocomplete based on the current, local context open in your IDE. A database is not going to provide an accurate summary of the meeting that just finished, with action items and all that.
Posted May 13, 2024 22:20 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (5 responses)
Posted May 13, 2024 22:44 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (4 responses)
Posted May 13, 2024 23:14 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (3 responses)
Consider a database in which things are stored lossily compressed and interleaved (yet still retrievable).
Posted May 13, 2024 23:58 UTC (Mon)
by bluca (subscriber, #118303)
[Link] (2 responses)
Posted May 14, 2024 0:28 UTC (Tue)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
I don’t have the nerve to even try and communicate with systemd apologists who don’t even do the most basic research themselves WHEN POINTED TO IT M̲U̲L̲T̲I̲P̲L̲E̲ ̲T̲I̲M̲E̲S̲.
Posted May 14, 2024 1:26 UTC (Tue)
by corbet (editor, #1)
[Link]
OK, I'll state it more clearly: it's time to bring this thread to a halt, it's not getting anywhere.
That is: all participants should stop, not just the one I'm responding to here.
Thank you.
Posted May 14, 2024 2:55 UTC (Tue)
by rgmoore (✭ supporter ✭, #75)
[Link] (1 responses)
A lot of people saying it doesn't mean it's true. I think the "magic copyright eraser" argument comes from misapplying the combination of the pro- and anti-AI arguments in a way that isn't supported by the law. The strong anti-AI position is that AI is inherently a derivative work of every piece of training material, and that therefore all the output of the AI is likewise a derivative work. The strong pro-AI position is that creating an AI is inherently transformative, so an AI is not a derivative work (or at least not an infringing use) of the material used to train it. The mistake is applying the anti-AI "everything is a derivative work" logic to the pro-AI position that the AI is not a derivative work and concluding that none of the output of the AI would be infringing.
This sounds reasonable but is absolutely wrong. A human being is not a derivative work of everything used to train them, but humans are still capable of copyright infringement. What matters is whether our output meets the established criteria for infringement, e.g. the abstraction-filtration-comparison test farnz mentions above. The same thing would apply to the output of an AI. Even if the AI itself is not infringing, its output can be.
Basically, the courts won't accept "but I got it from an AI" as an argument against copyright infringement. If anything, saying you got it from an AI would probably hurt you. You can try to defend yourself against charges of infringement by showing you never saw the original and thus must have created it independently. That's always challenging, but it will be much harder with an AI, given just how much material they've trained on. The chances are very good the AI has seen whatever you're accused of infringing, so the independent creation defense is no good.
Posted May 14, 2024 9:16 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
Note, as emphasis on your point, that the independent creation defence requires the defendant to show that the independent creator did not have access to the work they are alleged to have copied. The assumption is that you had access, up until you show you didn't.
Posted May 13, 2024 12:21 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Bear in mind that a LOT of people don't have a clue how copyright works. Unfortunately, it's now hard to navigate because it's a cobwebsite, but take a look at Groklaw.
But to give you a wonderful example of people making all sorts of outrageous claims about copyright, there was quite a fuss a good few years back about "how Terry Pratchett ripped off Hogwarts to create Unseen University".
Well, yes, they're both "Schools for Wizards". They are actually pretty similar. But the complaints are complete nonsense because I don't know whether J K Rowling had read Terry before she created Hogwarts, but Terry couldn't have read J K without a time-machine to hand!
And just like West Side Story, schools for wizards have been around in the literature for ages, so trying to imagine a link from Hogwarts to UU ignores the existence of a myriad of links to other similar stories, any one of which could have been the inspiration for either UU or Hogwarts.
Notice that the GPL makes *absolutely* *no* *attempt* *whatsoever* to define "derivative work". Because it has nothing to do with computing, AI, all that stuff. It's a legal "term of art", and if you don't speak legalese you WILL make an idiot of yourself.
So as far as the definition of "derivative work" is concerned, whether it's an AI or not is completely irrelevant. What IS relevant is whether the *OUTPUT* is Public Domain or not. My nsho is that if the output is sufficiently close that "a practitioner skilled in the arts" can identify the source, then the output is a legal "derived work", and the input copyright applies to the output. If the source is not identifiable, then the output is a new work, but AI is incapable of creativity, so the output is Public Domain.
And then - hopefully - a human comes along, proof-reads the output to remove hallucinations and mistakes, at which point (because this is *creative* input) they then acquire a copyright over the final work. Such work could also remove all references to the existing source, thereby removing the original copyrights (or it could fail to do so, and fail to remove the original copyrights).
Cheers,
Posted May 13, 2024 21:24 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (1 responses)
Machinally manipulating a work does not generate a new work, so it’s just a machinal transformation of the original work and therefore bound to the same terms, which the *user* of the machine must honour.
Posted May 14, 2024 9:29 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
Your "just" is doing a lot of heavy lifting. It is entirely possible for a machine transform of a work to remove all copyright-relevant aspects of the original, leaving something that's no longer protected by copyright. As your terms are about lifting restrictions that copyright protection places on the work by default, if the transform removes all the elements that are protected by copyright, there's no requirement for the user to honour those terms.
For example, I can write a sed script that extracts 5 lines from the Linux kernel; if I extract the right 5 lines, then I've extracted something not protected by copyright, and the kernel's licence terms do not apply (since they merely lift copyright restrictions). On the other hand, if I extract a different set of 5 lines, I extract something to which copyright protections apply, and now the kernel's licence terms apply.
The challenge for AI boosters is that their business models depend critically on all of the AI's output being non-infringing; if you have to do copyright infringement tests on AI output, then most of the business models around generative AI fall apart, since who'd pay for something that puts you at high risk of a lawsuit?
And the challenge for AI critics is to limit ourselves to arguments that make legal sense, as opposed to arguing the way the SCO Group did when it claimed that all of Linux was derived from the UNIX copyrights it owned.
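As a concrete (and purely illustrative) version of the sed example above: the little program below mechanically copies five lines out of a source file. The file name and line range are placeholders; the point is that whether the extracted text is protected depends on which lines come out, not on the fact that a program did the extracting.

/* Illustrative only: file name and line range are placeholders. */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "kernel/exit.c";
	FILE *f = fopen(path, "r");
	char buf[1024];
	int line = 0;

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(buf, sizeof(buf), f)) {
		line++;
		if (line >= 100 && line <= 104)   /* an arbitrary five-line window */
			fputs(buf, stdout);
	}
	fclose(f);
	return 0;
}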
Posted May 13, 2024 14:46 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
Cheers,
Posted May 13, 2024 14:54 UTC (Mon)
by jzb (editor, #7867)
[Link] (1 responses)
The last post on Groklaw suggests that PJ was pretty much done with technology. I'm not sure how one would reach PJ these days, at any rate.
Posted May 13, 2024 15:00 UTC (Mon)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted May 15, 2024 19:09 UTC (Wed)
by mirabilos (subscriber, #84359)
[Link]
Debian dismisses AI-contributions policy
Of course, one could use AI to generate documentation on "rigorously undocumented" code, post it to the Internet, and wait for the "Someone is wrong on the Internet" effect to clean it up.
Debian dismisses AI-contributions policy
However, it would be most annoying if an NM were to lie during the NM process about their language skills.
Sure, it helps a lot, but most important documents are translated, and Debian has teams to help with language issues,
and likely Debian will find an AM that shares a language with you.
If an NM manages to use automated translation in a way that makes them more productive, good for them.
There is also no strong requirement to send bug reports in English. Debian has language-specific mailing lists and translation teams to translate bug reports when needed.
Debian dismisses AI-contributions policy
You do *NOT* need English skills of a native, not even near.
Debian dismisses AI-contributions policy
> By banning AI contributions, the wasteful time is significantly reduced.
It could even increase the required effort, because pointing to a "No-AI" rule without further technical explanation about why a contribution is bad surely triggers more mails and questions from the submitter.
Debian dismisses AI-contributions policy
2. Smart(er) autocomplete, similar to what is already present in many IDEs. The AI can generate a small snippet of code that fits into the current context (and might be character-for-character identical to what the programmer would have written anyway).
3. Generate commit messages, documentation, and the like from code the programmer wrote (by hand). The programmer would then have to fact-check the commit message, but they should be in a good position to do that since they wrote the code.
Debian dismisses AI-contributions policy
It seems to me that the basic policy should be that users are responsible for everything they submit. That means two big things:
Creator, or proof reader ?
>make minor changes to a work, and suddenly the whole work is covered by my copyright.
Creator, or proof reader ?
> Same goes for any other content, as long as it doee not generate copies of the original.
Creator, or proof reader ?
How many additional input parameters ("when", "where", "purpose", etc...) to the algorithm are needed to cross the magic barrier?
Why is that different from my obfuscator, that produces an output for given inputs? Why can't it cross the magic barrier, without being called LLM?
Creator, or proof reader ?
The output is obviously not a copy of the input. You can compare it and it looks completely different.
And I don't see why this would be different for an LLM.
What amount of token stirring is needed?
Creator, or proof reader ?
> Or to put it mathematically, your "compilation with extra steps" or obfuscator does not falsify the basic "2 * 12 = 18 + 6"
It's not a 1:1 relation.
Of course my hypothetical obfuscator also would not produce a 1:1 relation between input and output. It's pretty easy to do that.
Creator, or proof reader ?
So, why is it different, if I process the input data with an LLM algorithm instead of with my algorithm?
Debian dismisses AI-contributions policy
The output should be considered a derived work of all training data.
Take Open Source code and manually obfuscate it until nobody can prove its origin. It does not lose its Copyright status by doing that, though. And it's a lot of work. By far not click-of-a-button. So it's not done very often.
Debian dismisses AI-contributions policy
If the LLMs learnt from large proprietary code bases, too, then I would actually be happy with the status quo.
But currently the flow of code is basically only from open source into proprietary.
Debian dismisses AI-contributions policy
> We don't get there by making copyright even more draconian that it is now
Which I currently don't see as a realistic possibility.
Whoa there
This seems ... extreme. I don't doubt that different people can have different ideas of what "copyright maximalist" means — there are different axes on which things can be maximized. Disagreeing on that does not justify calling somebody a liar and attacking them in this way, methinks.
Debian dismisses AI-contributions policy
for (auto &kv: kv_pairs) {
if (kv.first == "name")
mystruct.name = kv.second;
// ...
}
Debian dismisses AI-contributions policy
Relevant meme.
Q(human, staring at a robot derisively): "Can an AI write an efficient SQL query?"
A(robot, coolly): "Can you?"
Parts of Debian dismiss AI-contributions policy
- their inputs were obtained without consent, often illegally, and even if not mostly scraped, using up lots of resources
- they were pre-filtered exploiting cheap labour in the global south, leaving its people with deep psychological problems and nowhere even near suitable compensation (if such a thing is even possible)
- running the training and these models uses so many resources, not just electricity but also incidentally water, that it's environmentally unacceptable
- their outputs are mechanical combinations of their inputs and therefore derivative works, but the models are incapable of producing correct attribution and licencing and retaining correct copyright information (and yes, this is proven by now, it’s possible to extract large amounts of “training data” verbatim)
- they are run by unscrupulous commercial proprietary exploiters and VC sharks and former “blockchain” techbros
Parts of Debian dismiss AI-contributions policy
One just has to ensure that $OUTPUT cannot be traced back to $INPUT. The transformation algorithm just has to be complicated enough.
Parts of Debian dismiss AI-contributions policy
AI programs and non-AI programs are both equally capable of transforming Copyrighted work into Public Domain. That makes sense.
You now explained that it's true for both. Which makes sense.
I'm fine with that, though. I publish mostly under permissive licenses these days, because I don't really care anymore what people do with the code. I would publish into the Public Domain, if I could.
Parts of Debian dismiss AI-contributions policy
If not, then it has been removed (laundered).
This is not Schrödinger's LLM.
Parts of Debian dismiss AI-contributions policy
But that doesn't mean I personally have to agree. Copyright is a train wreck and it's only getting worse and worse.
NetBSD does set such a policy