Creator, or proof reader ?

Posted May 12, 2024 16:18 UTC (Sun) by mb (subscriber, #50428)
In reply to: Creator, or proof reader ? by bluca
Parent article: Debian dismisses AI-contributions policy

Sounds a lot like magic to me.

What amount of sparkle dust is needed for a computer program that takes $A and produces $B out of $A not to be considered "compilation with extra steps"?
How many additional input parameters ("when", "where", "purpose", etc...) to the algorithm are needed to cross the magic barrier?

LLMs are computer programs that produce an output for given inputs. There is no magic involved. It's a mapping of inputs+state => output.
Why is that different from my obfuscator, that produces an output for given inputs? Why can't it cross the magic barrier, without being called LLM?

Creator, or proof reader ?

Posted May 12, 2024 18:10 UTC (Sun) by kleptog (subscriber, #1183) [Link] (2 responses)

> What amount of sparkle dust is needed for a computer program that takes $A and produces $B out of $A not to be considered "compilation with extra steps"?

The law is run by humans not computers so this question is irrelevant. All that matters is: does the tool produce an output that somehow affects the market of an existing copyrighted work? How it is done is not relevant.

So an obfuscator doesn't remove copyright because removing copyright isn't a thing. Either the output is market-substitutable for some copyrighted work, or it isn't.

LLMs do not spontaneously produce output, they are prompted. If an LLM reproduces a copyrighted work, then that is the responsibility of the person who made the prompt. It's fairly obvious that LLM do not reproduce copyrighted works in normal usage so you can't argue that LLMs have a fundamental problem with copyright.

(I guess you could create an LLM that, without a prompt, reproduced the entire works of Shakespeare. You could argue such an LLM would violate Shakespeare's copyright, if he had any. That's not a thing with the LLMs currently available though. In fact, they're going to quite some effort to ensure LLMs do not reproduce entire works, because that is an inefficient use of resources (ie money), they don't care about copyright per se.)

Creator, or proof reader ?

Posted May 12, 2024 19:33 UTC (Sun) by mb (subscriber, #50428) [Link] (1 responses)

>It's fairly obvious that LLM do not reproduce copyrighted works in normal usage

That is not obvious at all.

By that same reasoning my code obfuscator would be Ok to use.
The output is obviously not a copy of the input. You can compare it and it looks completely different.

But the output of the obfuscator obviously is a derived work of the input. Right?
And I don't see why this would be different for an LLM.

Or does using a more complex mixing algorithm suddenly make it not a derived work of the input?
What amount of token stirring is needed?

Creator, or proof reader ?

Posted May 13, 2024 7:30 UTC (Mon) by kleptog (subscriber, #1183) [Link]

> >It's fairly obvious that LLM do not reproduce copyrighted works in normal usage

> That is not obvious at all.

Have you actually used one?

> But the output of the obfuscator obviously is a derived work of the input. Right?

Not at all. "Derived work" is a legal term not a technical one. Running a copyrighted work through an algorithm does not necessarily create a derived work. In copyright law, a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work (the underlying work). If you hash a copyrighted file, the resulting hash is not a derived work simply because it's lost everything that is interesting about the original work.

If your obfuscator has a corresponding deobfuscator that can return the original retaining the major copyrightable elements, then there may be no copyright on the obfuscated file, but as soon as you deobfuscate it, the copyright returns.

Honestly, this feels what "What colour are your bits?"[1] all over again. Are you aware of that article? Statements like this:

> Or does using a more complex mixing algorithm suddenly make it not a derived work of the input? What amount of token stirring is needed?

seem to indicate you are not.

[1] https://ansuz.sooke.bc.ca/entry/23

Creator, or proof reader ?

Posted May 12, 2024 20:10 UTC (Sun) by Wol (subscriber, #4433) [Link] (6 responses)

I think the obvious difference is that you can write a decompiler to retrieve the original source. Or to put it mathematically, your "compilation with extra steps" or obfuscator does not falsify the basic "2 * 12 = 18 + 6" relationship between source and output.

Can you write an anti-LLM, that given the LLM's output, would reverse it back to the original question?

Cheers,
Wol

Creator, or proof reader ?

Posted May 12, 2024 20:36 UTC (Sun) by mb (subscriber, #50428) [Link] (5 responses)

>I think the obvious difference is that you can write a decompiler to retrieve the original source.
> Or to put it mathematically, your "compilation with extra steps" or obfuscator does not falsify the basic "2 * 12 = 18 + 6"

No. That's not possible.

You can't reverse 18+6 into 2*12, because it could also have been 4*6 or anything else that fits the equation. There is an endless number of possibilities.
It's not a 1:1 relation.
Of course my hypothetical obfuscator also would not produce a 1:1 relation between input and output. It's pretty easy to do that.

So, does is the output still a derived work of the input? If so, why is an LLM different?

Creator, or proof reader ?

Posted May 12, 2024 21:01 UTC (Sun) by gfernandes (subscriber, #119910) [Link] (4 responses)

You're sort of missing the point here.

Who uses am obfuscator? The producer of the works because said producer wants an extra layer/hurdle to protect *their* copyright of their original works.

Who uses an LLM? Obviously *not _just_* the producer of the LLM. And *because* of this, the LLM is fundamentally different as far as copyright goes.

The user can cause the LLM to leak copyrighted training material that the _producer_ of the LLM did not license!

This is impossible in the context of an obfuscator.

In fact there is an ongoing case which might bring legal clarity here - NYT v OpenAI.

Creator, or proof reader ?

Posted May 13, 2024 5:58 UTC (Mon) by mb (subscriber, #50428) [Link] (3 responses)

> Who uses am obfuscator? The producer of the works

Nope. I use it on foreign copyrighted work to get public domain work out of it. LLM-style.

Creator, or proof reader ?

Posted May 13, 2024 6:17 UTC (Mon) by gfernandes (subscriber, #119910) [Link] (2 responses)

Then that's already a copyright violation unless you have the permission of the copyright holder to reissue copyrighted material as public domain works.

Creator, or proof reader ?

Posted May 13, 2024 8:56 UTC (Mon) by mb (subscriber, #50428) [Link] (1 responses)

I agree.
So, why is it different, if I process the input data with an LLM algorithm instead of with my algorithm?

Creator, or proof reader ?

Posted May 13, 2024 9:51 UTC (Mon) by farnz (subscriber, #17727) [Link]

It's not different - the output of an LLM may be a derived work of the original. It may also be a non-literal copy, or a transformative work, or even unrelated to the input data.

There's a lot of "AI bros" who would like you to believe that using an LLM automatically results in the output not being a derived work of the input, but this is completely untested in law; the current smart money suggests that "generative AI" output (LLMs, diffusion probabilistic models, whatever) will be treated the same way as human output - it's not automatically a derived work just because you used an LLM, but it could be, and it's on the human operator to ensure that copyright is respected.

It's basically the same story as a printer in that respect; if the input to the printer results in a copyright infringement on the output, then no amount of technical discussion about how I didn't supply the printer with a copyrighted work, I supplied it with a PostScript program to calculate π and instructions on which digits of π to interpret as a bitmap will get me out of trouble. Same currently applies to LLMs; if I get a derived work as output, that's my problem to deal with.

This, BTW, is why "AI bros" would like to see the outputs of LLMs deemed as "non-infringing"; it's going to hurt their business model if "using an AI to generate output" is treated, in law, as equivalent to "using a printer to run a PostScript program", since then their customers have to do all the legal analysis to work out if a given output from a prompt has resulted in a derived work of the training set or not.