
Creator, or proof reader ?

Posted May 11, 2024 4:16 UTC (Sat) by drago01 (subscriber, #50715)
In reply to: Creator, or proof reader ? by Wol
Parent article: Debian dismisses AI-contributions policy

There is nothing wrong with letting an AI generate content. You just have to double-check that the output actually makes sense. It can be used as a starting point for whatever you want to do.

Ok, in general: technology comes along, and either it's useful, stays and gets improved, or it isn't and goes away.

Banning is not really a solution.



Creator, or proof reader ?

Posted May 11, 2024 11:10 UTC (Sat) by josh (subscriber, #17465) [Link] (22 responses)

There's something wrong with contributing AI-generated content to a project that expects all content to be under an open source license. It's the same problem as taking code from your employer that you don't have the rights to contribute, and submitting it to an open source project.

Creator, or proof reader ?

Posted May 11, 2024 13:51 UTC (Sat) by Paf (subscriber, #91811) [Link] (21 responses)

You’re taking as established fact the idea that AI-generated responses to human input are not copyrightable. To say the least, this is not anything like a settled question, and the consensus opinion is much more like “it probably depends on details”.

Creator, or proof reader ?

Posted May 11, 2024 14:07 UTC (Sat) by josh (subscriber, #17465) [Link] (20 responses)

On the contrary, I'm suggesting that AI-generated text *should* be considered derivative works of a *lot* of copyrighted material.

If AI-generated text is *not* copyrightable, then AI becomes a means of laundering copyright: Open Source and/or proprietary code goes in, public domain code comes out. If the law or jurisprudence of some major jurisdictions decides to allow that, that's a disaster for Open Source licensing.

Creator, or proof reader ?

Posted May 12, 2024 1:23 UTC (Sun) by Paf (subscriber, #91811) [Link] (4 responses)

Derivative works are not copyrightable, so, no, no contrary here.

Creator, or proof reader ?

Posted May 12, 2024 7:55 UTC (Sun) by Wol (subscriber, #4433) [Link] (3 responses)

What do you mean derivative works are not copyrightable?

If I get permission (or use a Public Domain work) to set a piece of music in, let's say, lilypond, I can quite legally slap my copyright on it, and forbid people from copying.

Okay, the notes themselves are not copyrighted - I can't stop people taking my (legally acquired) copy and re-typesetting it, but I can stop them sticking it in a photocopier.

One of the major points (and quite often a major problem) of copyright is that I only have to make minor changes to a work, and suddenly the whole work is covered by my copyright. It's a tactic often used by publishers to try to prevent people copying Public Domain works.

Cheers,
Wol

Creator, or proof reader ?

Posted May 12, 2024 9:05 UTC (Sun) by mb (subscriber, #50428) [Link] (1 responses)

>One of the major points (and quite often a major problem) of copyright is that I only have to
>make minor changes to a work, and suddenly the whole work is covered by my copyright.

Huh? Is that really how (I suppose) US Copyright works? I make a one-liner change to the Linux kernel and then I have Copyright on the whole kernel? I doubt it.

Creator, or proof reader ?

Posted May 12, 2024 10:29 UTC (Sun) by kleptog (subscriber, #1183) [Link]

> Huh? Is that really how (I suppose) US Copyright works? I make a one-liner change to the Linux kernel and then I have Copyright on the whole kernel? I doubt it.

The key to understanding this is that copyright covers "works". So if you take the kernel source, make some modifications and publish a tarball, you own the copyright on the tarball ("the work"). That doesn't mean that you own the copyright to every line of code inside that tarball. Someone could download your tarball, delete your modifications, add different ones and create a new tarball, and now their tarball has nothing to do with yours.

Just cloning a repo doesn't create a new work though, because there's no creativity involved.

In fact, one of the features of open-source is that the copyright status of a lot of code is somewhat unclear, but it doesn't actually matter, because open-source licences mean you don't actually need to care. If you make a single-line patch, does that constitute a "work" that's copyrightable? If you work together with someone else on a patch, can you meaningfully distinguish your copyrighted code from your coauthor's, or from the copyright of the code you modified?

Copyright law has the concept of joint-ownership and collective works, but copyright law doesn't really have a good handle on open-source development.

Creator, or proof reader ?

Posted May 14, 2024 1:12 UTC (Tue) by mirabilos (subscriber, #84359) [Link]

Your example of musical engraving is a bit flawed (though your point in general is still true): there’s been jurisprudence saying that merely engraving does not _necessarily_ create copyright. It still may, if there’s sufficient creativity (creativity, not sweat-of-brow!) going into it, but I only add a “© (where applicable)” to my digital editions of Free Sheet Music, to avoid fraudulent claims where no copyright exists.

So, in this specific example, the bar is a bit higher, but yeah, the point stands.

Creator, or proof reader ?

Posted May 12, 2024 14:44 UTC (Sun) by drago01 (subscriber, #50715) [Link] (14 responses)

If I read open source code to learn concepts and/or algorithms, and then use that knowledge to write code, it's not considered a derived work.

If an LLM does that, why would it be? The same goes for any other content, as long as it does not generate copies of the original.

Creator, or proof reader ?

Posted May 12, 2024 15:15 UTC (Sun) by mb (subscriber, #50428) [Link] (12 responses)

> If an LLM does that, why would it be?
> The same goes for any other content, as long as it does not generate copies of the original.

So, is it also Ok to use a non-AI code obfuscator to remove Copyright, as long as the output does not look like the input anymore?

Creator, or proof reader ?

Posted May 12, 2024 16:04 UTC (Sun) by bluca (subscriber, #118303) [Link] (11 responses)

No, because that's compilation with extra steps, and that's not what LLMs do. LLMs are stochastic parrots, and the answers they give heavily depend on context - who asks it, when, where, for what purpose, etc.

Creator, or proof reader ?

Posted May 12, 2024 16:18 UTC (Sun) by mb (subscriber, #50428) [Link] (10 responses)

Sounds a lot like magic to me.

What amount of sparkle dust is needed for a computer program that takes $A and produces $B out of $A not to be considered "compilation with extra steps"?
How many additional input parameters ("when", "where", "purpose", etc...) to the algorithm are needed to cross the magic barrier?

LLMs are computer programs that produce an output for given inputs. There is no magic involved. It's a mapping of inputs+state => output.
Why is that different from my obfuscator, which also produces an output for given inputs? Why can't it cross the magic barrier without being called an LLM?
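
To make the framing concrete, here is a minimal sketch (purely hypothetical toy code in Python, not any real obfuscator or LLM API): in this view, both are just functions from inputs plus some fixed state to an output.

    import hashlib

    def obfuscate(source: str, seed: int) -> str:
        # output = f(input, fixed parameter): the text is rewritten so it no
        # longer looks like the input, but it is a deterministic mapping.
        digest = hashlib.sha256(f"{seed}:{source}".encode()).hexdigest()
        return source.replace("original_name", f"sym_{digest[:8]}")

    def llm_generate(prompt: str, weights: bytes, context: dict) -> str:
        # output = f(inputs, state): prompt + frozen model weights + context
        # ("who asks, when, where, for what purpose") -> generated text.
        # A real model samples tokens, but with a fixed seed that sampling is
        # itself just another deterministic mapping.
        state = repr((prompt, sorted(context.items()))).encode() + weights
        return f"<text derived from state {hashlib.sha256(state).hexdigest()[:8]}>"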

Creator, or proof reader ?

Posted May 12, 2024 18:10 UTC (Sun) by kleptog (subscriber, #1183) [Link] (2 responses)

> What amount of sparkle dust is needed for a computer program that takes $A and produces $B out of $A not to be considered "compilation with extra steps"?

The law is run by humans, not computers, so this question is irrelevant. All that matters is: does the tool produce an output that somehow affects the market of an existing copyrighted work? How it is done is not relevant.

So an obfuscator doesn't remove copyright because removing copyright isn't a thing. Either the output is market-substitutable for some copyrighted work, or it isn't.

LLMs do not spontaneously produce output; they are prompted. If an LLM reproduces a copyrighted work, then that is the responsibility of the person who made the prompt. It's fairly obvious that LLMs do not reproduce copyrighted works in normal usage, so you can't argue that LLMs have a fundamental problem with copyright.

(I guess you could create an LLM that, without a prompt, reproduced the entire works of Shakespeare. You could argue such an LLM would violate Shakespeare's copyright, if he had any. That's not a thing with the LLMs currently available, though. In fact, they're going to quite some effort to ensure LLMs do not reproduce entire works, because that is an inefficient use of resources (i.e. money); they don't care about copyright per se.)

Creator, or proof reader ?

Posted May 12, 2024 19:33 UTC (Sun) by mb (subscriber, #50428) [Link] (1 responses)

>It's fairly obvious that LLMs do not reproduce copyrighted works in normal usage

That is not obvious at all.

By that same reasoning my code obfuscator would be Ok to use.
The output is obviously not a copy of the input. You can compare it and it looks completely different.

But the output of the obfuscator obviously is a derived work of the input. Right?
And I don't see why this would be different for an LLM.

Or does using a more complex mixing algorithm suddenly make it not a derived work of the input?
What amount of token stirring is needed?

Creator, or proof reader ?

Posted May 13, 2024 7:30 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> >It's fairly obvious that LLMs do not reproduce copyrighted works in normal usage

> That is not obvious at all.

Have you actually used one?

> But the output of the obfuscator obviously is a derived work of the input. Right?

Not at all. "Derived work" is a legal term not a technical one. Running a copyrighted work through an algorithm does not necessarily create a derived work. In copyright law, a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work). If you hash a copyrighted file, the resulting hash is not a derived work simply because it's lost everything that is interesting about the original work.

If your obfuscator has a corresponding deobfuscator that can return the original retaining the major copyrightable elements, then there may be no copyright on the obfuscated file, but as soon as you deobfuscate it, the copyright returns.
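
As a toy illustration of that distinction (Python, with base64 standing in for a trivially reversible obfuscator; this is about information flow, not a legal test): a hash throws away everything expressive about the work, while a reversible transformation hands the original, and whatever copyright attaches to it, straight back.

    import base64, hashlib

    work = b"int main(void) { /* expressive, copyrightable content */ }"

    # Hashing: nothing expressive about the work survives; the digest cannot
    # be turned back into the original.
    digest = hashlib.sha256(work).hexdigest()

    # Reversible "obfuscation": the bytes look different, but decoding
    # recovers the original work exactly.
    obfuscated = base64.b64encode(work)
    assert base64.b64decode(obfuscated) == work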

Honestly, this feels like "What colour are your bits?"[1] all over again. Are you aware of that article? Statements like this:

> Or does using a more complex mixing algorithm suddenly make it not a derived work of the input? What amount of token stirring is needed?

seem to indicate you are not.

[1] https://ansuz.sooke.bc.ca/entry/23

Creator, or proof reader ?

Posted May 12, 2024 20:10 UTC (Sun) by Wol (subscriber, #4433) [Link] (6 responses)

I think the obvious difference is that you can write a decompiler to retrieve the original source. Or to put it mathematically, your "compilation with extra steps" or obfuscator does not falsify the basic "2 * 12 = 18 + 6" relationship between source and output.

Can you write an anti-LLM, that given the LLM's output, would reverse it back to the original question?

Cheers,
Wol

Creator, or proof reader ?

Posted May 12, 2024 20:36 UTC (Sun) by mb (subscriber, #50428) [Link] (5 responses)

>I think the obvious difference is that you can write a decompiler to retrieve the original source.
> Or to put it mathematically, your "compilation with extra steps" or obfuscator does not falsify the basic "2 * 12 = 18 + 6"

No. That's not possible.

You can't reverse 18+6 into 2*12, because it could also have been 4*6 or anything else that fits the equation. There is an endless number of possibilities.
It's not a 1:1 relation.
Of course my hypothetical obfuscator also would not produce a 1:1 relation between input and output. It's pretty easy to do that.

So, is the output still a derived work of the input? If so, why is an LLM different?

Creator, or proof reader ?

Posted May 12, 2024 21:01 UTC (Sun) by gfernandes (subscriber, #119910) [Link] (4 responses)

You're sort of missing the point here.

Who uses an obfuscator? The producer of the works, because said producer wants an extra layer/hurdle to protect *their* copyright on their original works.

Who uses an LLM? Obviously *not _just_* the producer of the LLM. And *because* of this, the LLM is fundamentally different as far as copyright goes.

The user can cause the LLM to leak copyrighted training material that the _producer_ of the LLM did not license!

This is impossible in the context of an obfuscator.

In fact there is an ongoing case which might bring legal clarity here - NYT v OpenAI.

Creator, or proof reader ?

Posted May 13, 2024 5:58 UTC (Mon) by mb (subscriber, #50428) [Link] (3 responses)

> Who uses an obfuscator? The producer of the works

Nope. I use it on foreign copyrighted work to get public domain work out of it. LLM-style.

Creator, or proof reader ?

Posted May 13, 2024 6:17 UTC (Mon) by gfernandes (subscriber, #119910) [Link] (2 responses)

Then that's already a copyright violation unless you have the permission of the copyright holder to reissue copyrighted material as public domain works.

Creator, or proof reader ?

Posted May 13, 2024 8:56 UTC (Mon) by mb (subscriber, #50428) [Link] (1 responses)

I agree.
So, why is it different, if I process the input data with an LLM algorithm instead of with my algorithm?

Creator, or proof reader ?

Posted May 13, 2024 9:51 UTC (Mon) by farnz (subscriber, #17727) [Link]

It's not different - the output of an LLM may be a derived work of the original. It may also be a non-literal copy, or a transformative work, or even unrelated to the input data.

There's a lot of "AI bros" who would like you to believe that using an LLM automatically results in the output not being a derived work of the input, but this is completely untested in law; the current smart money suggests that "generative AI" output (LLMs, diffusion probabilistic models, whatever) will be treated the same way as human output - it's not automatically a derived work just because you used an LLM, but it could be, and it's on the human operator to ensure that copyright is respected.

It's basically the same story as a printer in that respect; if the input to the printer results in a copyright infringement on the output, then no amount of technical discussion about how I didn't supply the printer with a copyrighted work, I supplied it with a PostScript program to calculate π and instructions on which digits of π to interpret as a bitmap will get me out of trouble. Same currently applies to LLMs; if I get a derived work as output, that's my problem to deal with.

This, BTW, is why "AI bros" would like to see the outputs of LLMs deemed as "non-infringing"; it's going to hurt their business model if "using an AI to generate output" is treated, in law, as equivalent to "using a printer to run a PostScript program", since then their customers have to do all the legal analysis to work out if a given output from a prompt has resulted in a derived work of the training set or not.

Creator, or proof reader ?

Posted May 12, 2024 18:06 UTC (Sun) by farnz (subscriber, #17727) [Link]

The question you're reaching towards is "at what point is the LLM's output a derived work of the input, and at what point is it a transformative work?".

This is an open question; it is definitely true that you can get LLMs to output things that, if a human wrote them, would clearly be derived works of the inputs (and smart money says that courts will find that "I used an LLM" doesn't get you out of having a derived work here). Then there's a hard area, where something written by a human would also be a derived work, but proving this is hard (and this is where LLMs get scary, since they make it very quick to rework things such that no transformative step has taken place and yet it's not clear that the result is a derived work, whereas humans would have to spend some time on that rework).

And then we get into the easy case again, where the output is clearly transformative of the set of inputs, and therefore not a copyright infringement.

Creator, or proof reader ?

Posted May 14, 2024 2:31 UTC (Tue) by viro (subscriber, #7872) [Link]

LLM output is, by definition, random text that is statistically indistinguishable from "what they say". You can't tell true from false on that level. And it's worse than random BS from the proofreading POV - you cannot rely upon the "it sounds wrong" feeling to catch the likely spots.

As far as I'm concerned, anyone caught at using that deserves the same treatment as somebody who engages in any other form of post-truth - "not to be trusted ever after in any circumstances".

