How to check copyright?
Posted Oct 3, 2025 17:39 UTC (Fri) by ballombe (subscriber, #9523)
In reply to: How to check copyright? by mb
Parent article: Fedora floats AI-assisted contributions policy
Clean room reverse engineering requires two separate, non-interacting teams: one that has access to the original code and writes its specification, and a second that never accesses the original code and relies only on the specification to write the new program.
Since by hypothesis the LLM had access to all the code on github, it cannot be used to write the new program.
Remember that when some Windows code was leaked, WINE developers were advised not to look at it to avoid being "tainted".
Posted Oct 3, 2025 17:43 UTC (Fri)
by farnz (subscriber, #17727)
The point of the clean room process is that the only thing you need to look at to confirm that the second team did not copy the original code is the specification produced by the first team, which makes it tractable to confirm that the second team's output is not a derived work by virtue of no copying being possible.
But that's not the only way to avoid infringing - it's just a well-understood and low-risk way to do so.
Posted Oct 3, 2025 18:22 UTC (Fri)
by mb (subscriber, #50428)
The two teams are interacting. Via documentation.
> Since by hypothesis the LLM had access to all the code on github
I don't agree.
The generated code comes out of the executing application.
Posted Oct 4, 2025 19:51 UTC (Sat)
by pbonzini (subscriber, #60935)
In fact, bigger models also increase the memorization ability.
Posted Oct 4, 2025 19:54 UTC (Sat)
by mb (subscriber, #50428)
This contradicts itself.
Posted Oct 5, 2025 10:07 UTC (Sun)
by pbonzini (subscriber, #60935)
You can also use language models as a source of probabilities for arithmetic coding, and some texts will compress ridiculously well, so well that the only explanation is that large parts of the text are already present in the weights in compressed form. In fact it can be mathematically proven that memorization, compression and training are essentially the same thing.
Here is a paper from DeepMind on the memorization capabilities of LLMs: https://arxiv.org/pdf/2507.05578
And here is an earlier one that analyzed how memorization improves as the number of parameters grows: https://arxiv.org/pdf/2202.07646
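To make the compression argument concrete, here is a minimal sketch, not taken from either paper. It assumes a toy character-level bigram model standing in for an LLM, and the names train_bigram and code_length_bits are hypothetical. It does not implement an arithmetic coder; it only computes the ideal code length, the sum of -log2 p over the symbols, which an arithmetic coder driven by the same probabilities comes close to.

```python
import math
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count character bigrams; a toy stand-in for a trained language model."""
    counts = defaultdict(Counter)
    for prev, cur in zip(corpus, corpus[1:]):
        counts[prev][cur] += 1
    return counts


def code_length_bits(model, text, alphabet_size=256):
    """Ideal code length of `text` in bits under the model.

    Each character costs -log2 P(char | previous char). Add-one smoothing
    keeps unseen bigrams at a non-zero probability. The first character
    has no context and is not counted (an assumption of this sketch).
    """
    bits = 0.0
    for prev, cur in zip(text, text[1:]):
        ctx = model.get(prev, Counter())
        total = sum(ctx.values())
        p = (ctx[cur] + 1) / (total + alphabet_size)
        bits -= math.log2(p)
    return bits


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog "
    model = train_bigram(sentence * 200)  # text the model has "memorized"

    seen = sentence * 2
    unseen = "0123456789" * 4  # characters that never occur in training

    for label, text in (("seen", seen), ("unseen", unseen)):
        per_char = code_length_bits(model, text) / (len(text) - 1)
        print(f"{label}: {per_char:.2f} bits per character")
```

Text that was in the training data costs far fewer bits per character than text the model has never seen, which is the effect described above, only at toy scale.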
Posted Oct 5, 2025 13:45 UTC (Sun)
by kleptog (subscriber, #1183)
I think those papers actually show it is quite hard: even with very specific prompting, the majority of texts could not be recovered to any significant degree. So what are the chances an LLM will reproduce a literal text without special prompting?
Mathematically speaking an LLM is just a function, and for every output there exists an input that will produce something close to it. Even if it is just "Repeat X". (Well, technically I don't know if we know that LLMs have a dense output space.) What are the chances a random person will hit one of those inputs that matches some copyrighted output?
I suppose we've given the "infinite monkeys" a power tool that makes it more likely for them to reproduce Shakespeare. Is it too likely?
It definitely can be used to write the new program. Because it had access to the code on GitHub, you cannot assert lack of access as evidence of lack of copying (which is what the clean room setup is all about). You can still assert, however, that either the copied code falls on the idea side of the idea-expression distinction, or that it is not a derived work (in the legal, not mathematical, sense) for the purposes of copyright law for some other reason.
Which is IMO not that dissimilar from the network weights, which are passed from the network trainer application to the network executor application.
The training application had access to the code.
And the executing application doesn't have access to the code.
