How to check copyright? [LWN.net]

How to check copyright?

Posted Oct 2, 2025 20:05 UTC (Thu) by mb (subscriber, #50428) [Link] (12 responses)

Why are you so sure?

What is the fundamental difference between

a) human brains processing code into documentation and then into code
and
b) LLMs processing code into very abstract and compressed intermediate representations and then into code?

LLM models would probably contain *less* information about the original code than documentation would.

How to check copyright?

Posted Oct 2, 2025 21:07 UTC (Thu) by pizza (subscriber, #46) [Link] (4 responses)

> What is the fundamental difference between
> a) human brains
> b) LLMs processing

Legally, there's a huge distinction between the two.

And please keep in mind that "legally" is rarely satisfied with "technically" arguments.

How to check copyright?

Posted Oct 2, 2025 21:17 UTC (Thu) by mb (subscriber, #50428) [Link] (3 responses)

>Legally, there's a huge distinction between the two.

Interesting.
Can you back this up with some actual legal text or descriptions from lawyers?
I'd really be interested in learning what lawyers think the differences are.

How to check copyright?

Posted Oct 3, 2025 0:16 UTC (Fri) by pizza (subscriber, #46) [Link] (1 responses)

> Can you back this up with some actual legal text or descriptions from lawyers?

Only stuff created by a human is eligible for copyright protection.

See https://www.copyright.gov/comp3/chap300/ch300-copyrightab... section 307.

Doesn't get any simpler than that.

How to check copyright?

Posted Oct 3, 2025 7:01 UTC (Fri) by mb (subscriber, #50428) [Link]

> Only stuff created by a human is eligible for copyright protection.

That is a completely different topic, though.
This is about *re*-producing existing actually copyrighted content.

How to check copyright?

Posted Oct 3, 2025 11:16 UTC (Fri) by Wol (subscriber, #4433) [Link]

I can't point you to the law(s) themselves, but the European position - IN LAW - is that there is no difference between an AI reading and learning, and a person reading and learning.

So I guess (and this is not clear) that there is no difference between an AI regurgitating what it's learnt, and a person regurgitating what they've learnt.

So it basically comes down to the question "how close is the output to the input, and was the output obvious and not worthy of copyright protection?"

Given the tendency of AI to hallucinate, I guess the output of an AI is LESS likely to violate copyright than that of a human. Of course, the corollary becomes the output of a human is more valuable :-)

Cheers,
Wol

How to check copyright?

Posted Oct 3, 2025 17:39 UTC (Fri) by ballombe (subscriber, #9523) [Link] (6 responses)

> Why are you so sure?

Clean room reverse engineering requires that there be two separate, non-interacting teams: one that has access to the original code and writes its specification, and a second that never accesses the original code and relies only on the specification to write the new program.

Since by hypothesis the LLM had access to all the code on github, it cannot be used to write the new program.

Remember when some Windows code was leaked, WINE developers were advised not to look at it to avoid being "tainted".

How to check copyright?

Posted Oct 3, 2025 17:43 UTC (Fri) by farnz (subscriber, #17727) [Link]

It definitely can be used to write the new program; because it had access to the code on GitHub, you cannot assert lack of access as evidence of lack of copying (which is what the clean room setup is all about), but you can still assert that either the copied code falls on the idea side of the idea-expression distinction, or that it is not a derived work (in the legal, not mathematical, sense) for the purposes of copyright law for some other reason.

The point of the clean room process is that the only thing you need to look at to confirm that the second team did not copy the original code is the specification produced by the first team, which makes it tractable to confirm that the second team's output is not a derived work by virtue of no copying being possible.

But that's not the only way to avoid infringing - it's just a well-understood and low-risk way to do so.

How to check copyright?

Posted Oct 3, 2025 18:22 UTC (Fri) by mb (subscriber, #50428) [Link] (4 responses)

>two separate, non-interacting, teams

The two teams are interacting. Via documentation.
Which is IMO not that dissimilar from the network weights, which are passed from the network trainer application to the network executor application.

>Since by hypothesis the LLM had access to all the code on github

I don't agree.
The training application had access to the code.
And the executing application doesn't have access to the code.

The generated code comes out of the executing application.

How to check copyright?

Posted Oct 4, 2025 19:51 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (3 responses)

Network weights are a lossy representation of the training material, but they still contain it and are able to reproduce it if asked, as shown above (and also in the New York Times lawsuit against OpenAI).

In fact, bigger models also increase the memorization ability.

How to check copyright?

Posted Oct 4, 2025 19:54 UTC (Sat) by mb (subscriber, #50428) [Link] (2 responses)

>a lossy representation of the training material, but they still contain it and are able to reproduce it

This contradicts itself.

How to check copyright?

Posted Oct 5, 2025 10:07 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (1 responses)

When used for prediction, they are not able to reproduce *all of it*, but they can reproduce a lot of specific texts with varying degrees of precision. For literary works, for example, the models often remember poetry more easily than prose. You can measure precision by checking whether the model needs to be told the first few words as opposed to just the title, how often it replaces a word with a synonym, and whether it goes into the weeds after a few paragraphs in some cases or after a few chapters in others. The same is true of programs.
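As a rough illustration of that kind of measurement, here is a minimal Python sketch. It assumes you already have the original text and the model's output as plain strings; the function names are made up for this example, and the two metrics (exact matching prefix, and a word-level similarity ratio that tolerates the odd substituted word) are just one plausible way to score it, not what any particular paper uses.

```python
from difflib import SequenceMatcher


def matching_prefix_words(reference: str, generated: str) -> int:
    """Count how many leading words of `generated` match `reference` exactly.

    A small count suggests the model goes "into the weeds" early.
    """
    count = 0
    for ref_word, gen_word in zip(reference.split(), generated.split()):
        if ref_word != gen_word:
            break
        count += 1
    return count


def word_similarity(reference: str, generated: str) -> float:
    """Word-level similarity in [0, 1].

    Unlike the prefix count, this keeps scoring after a substituted word,
    so a synonym swap lowers the score only slightly.
    """
    return SequenceMatcher(None, reference.split(), generated.split()).ratio()


# Hypothetical example: one word replaced in the middle of a line.
reference = "shall i compare thee to a summer's day"
generated = "shall i compare you to a summer's day"
print(matching_prefix_words(reference, generated))
print(word_similarity(reference, generated))
```

The prefix count stops at the first substitution while the ratio stays high, which is roughly the distinction between "changed a word" and "lost the thread".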

You can also use a language model as the source of probabilities for arithmetic coding, and some texts will compress ridiculously well, so well that the only explanation is that large parts of the text are already present in the weights in compressed form. In fact it can be mathematically proven that memorization, compression and training are essentially the same thing.
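To make the compression argument concrete, here is a toy Python sketch. It is not an LLM and not a real arithmetic coder: it assumes a tiny character-level bigram model (a stand-in invented for this example) and computes the sum of -log2(p) over the text, which is the size in bits that an arithmetic coder driven by those probabilities would approach. The sample strings are arbitrary.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


class BigramModel:
    """Toy stand-in for a language model: P(next char | previous char).

    Uses add-one smoothing so every character in `alphabet` has
    non-zero probability. Texts are assumed to stay within `alphabet`.
    """

    def __init__(self, training_text: str, alphabet: str):
        self.alphabet = alphabet
        self.counts = defaultdict(Counter)
        for prev, nxt in zip(training_text, training_text[1:]):
            self.counts[prev][nxt] += 1

    def prob(self, prev: str, nxt: str) -> float:
        row = self.counts[prev]
        return (row[nxt] + 1) / (sum(row.values()) + len(self.alphabet))


def bits_per_char(model: BigramModel, text: str) -> float:
    """Average ideal code length, in bits, per predicted character."""
    total_bits = sum(
        -math.log2(model.prob(prev, nxt)) for prev, nxt in zip(text, text[1:])
    )
    return total_bits / (len(text) - 1)


# The model has seen `seen` many times during "training", and never `unseen`.
seen = "to be or not to be that is the question"
unseen = "the quick brown fox jumps over the lazy dog"
model = BigramModel(seen * 50, ALPHABET)

print("seen text:  ", bits_per_char(model, seen), "bits/char")
print("unseen text:", bits_per_char(model, unseen), "bits/char")
```

The text the model was trained on needs noticeably fewer bits per character than the text it never saw. With a real LLM the same measurement is done per token, and a suspiciously low bit count for a specific document is the signal that much of it is stored in the weights.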

Here is a paper from DeepMind on the memorization capabilities of LLMs: https://arxiv.org/pdf/2507.05578

And here is an earlier one that analyzed how memorization improves as the number of parameters grows: https://arxiv.org/pdf/2202.07646

How to check copyright?

Posted Oct 5, 2025 13:45 UTC (Sun) by kleptog (subscriber, #1183) [Link]

While the issue of memorisation is interesting, it is ultimately not really relevant to the discussion. You don't need an LLM to intentionally violate copyright. The issue is can you use an LLM to *unintentionally* violate copyright?

I think those papers actually show it is quite hard. Because even with very specific prompting, the majority of texts could not be recovered to any significant degree. So what are the chances an LLM will reproduce a literal text without special prompting?

Mathematically speaking an LLM is just a function, and for every output there exists an input that will produce something close to it. Even if it is just "Repeat X". (Well, technically I don't know if we know that LLMs have a dense output space.) What are the chances a random person will hit one of those inputs that matches some copyrighted output?

I suppose we've given the "infinite monkeys" a power-tool that makes it more likely for them to reproduce Shakespeare. Is it too likely?

Clean-room reverse engineering

Posted Oct 3, 2025 8:07 UTC (Fri) by rschroev (subscriber, #4164) [Link]

Clean-room reverse engineering is a whole different topic though, isn't it? It's what Compaq did back in the 80's to reverse engineer the IBM PC's BIOS, enabling them to make compatible machines in a legal way. Notably the people who studied IBM's BIOS and the ones who implemented Compaq's new one were different teams, to avoid any copyright issues.

That's a whole different situation than either people or LLMs reading code and later using their knowledge to write new code.

How to check copyright?

Posted Oct 3, 2025 9:40 UTC (Fri) by farnz (subscriber, #17727) [Link]

Clean-room reverse-engineering isn't part of the codified side of copyright law; rather, it's a process that the courts recognise as guaranteeing that the work produced in the clean room cannot be a derived work of the original.

To be a derived work, there must be some copying of the original, intended or accidental. The clean-room process guarantees that the people in the clean-room cannot copy the original, and therefore, if they do come up with something that appears to be a copy of the original, it's not a derived work.

You can, of course, do reverse-engineering and reimplementation without a clean-room setup; it's just that you then have to show that each piece that's alleged to be a literal copy of the original falls on the right side of the idea-expression distinction to not be a derived work, instead of being able to show that no copying took place.