How to check copyright?

Posted Oct 4, 2025 19:51 UTC (Sat) by pbonzini (subscriber, #60935)
In reply to: How to check copyright? by mb
Parent article: Fedora floats AI-assisted contributions policy

Network weights are a lossy representation of the training material, but they still contain it and are able to reproduce it if asked, as shown above (and also in the New York Times lawsuit against OpenAI).

In fact, bigger models also have a greater ability to memorize.



How to check copyright?

Posted Oct 4, 2025 19:54 UTC (Sat) by mb (subscriber, #50428) [Link] (2 responses)

>a lossy representation of the training material, but they still contain it and are able to reproduce it

This contradicts itself.

How to check copyright?

Posted Oct 5, 2025 10:07 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (1 response)

As prediction models they are not able to reproduce *all of it*, but they can reproduce many specific texts with varying degrees of precision. For literary works, for example, the models often remember poetry more easily than prose. You can measure the precision by checking whether the model needs to be given the first few words rather than just the title, how often it replaces a word with a synonym, and whether it goes off into the weeds after a few paragraphs in some texts or after a few chapters in others. The same is true of programs.
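Here is a minimal sketch of that probing idea, assuming the Hugging Face transformers library with GPT-2 as a stand-in model; the sonnet line and the 50/50 split point are just placeholders, not part of any real experiment:

```python
# Probe verbatim memorization: feed the model the first half of a passage
# and check how much of the second half it reproduces token-for-token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = ("Shall I compare thee to a summer's day? "
        "Thou art more lovely and more temperate:")   # placeholder passage
tokens = tokenizer(text, return_tensors="pt").input_ids[0]
split = len(tokens) // 2                              # placeholder split point
prefix, reference = tokens[:split], tokens[split:]

with torch.no_grad():
    out = model.generate(prefix.unsqueeze(0),
                         max_new_tokens=len(reference),
                         do_sample=False)             # greedy decoding
continuation = out[0, split:]

# Count how many tokens of the true continuation come back verbatim.
matched = 0
for got, expected in zip(continuation.tolist(), reference.tolist()):
    if got != expected:
        break
    matched += 1
print(f"{matched}/{len(reference)} tokens reproduced verbatim")
```

The longer the prefix you need before the continuation locks onto the original, and the sooner the output drifts off, the weaker the memorization.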

You can also use a language model as the source of probabilities for arithmetic coding, and some texts will compress ridiculously well, so well that the only explanation is that large parts of the text are already present in the weights in compressed form. In fact, it can be mathematically proven that memorization, compression and training are essentially the same thing.
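And a rough sketch of the compression side, again assuming GPT-2 via transformers: an arithmetic coder driven by the model's probabilities would emit close to the sum of -log2 p(token) bits, so comparing that total against the raw 8 bits per byte shows how compressible a text is under the model (the example text is a placeholder):

```python
# Estimate the ideal arithmetic-coding length of a text under a language
# model: total code length = sum over tokens of -log2 p(token | context).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "To be, or not to be, that is the question:"  # placeholder text
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                 # (1, seq_len, vocab)

# Log-probability the model assigned to each actual next token
# (the first token has no context and is not counted).
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
n = ids.shape[1] - 1
token_lp = log_probs[torch.arange(n), ids[0, 1:]]

bits = -token_lp.sum().item() / math.log(2)    # nats -> bits
raw_bits = 8 * len(text.encode("utf-8"))
print(f"model code length: {bits:.0f} bits vs raw: {raw_bits} bits")
```

A text the model has memorized will come out at a small fraction of its raw size, which is exactly the sense in which it is "already present in the weights in compressed form".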

Here is a paper from DeepMind on the memorization capabilities of LLMs: https://arxiv.org/pdf/2507.05578

And here is an earlier one that analyzed how memorization improves as the number of parameters grows: https://arxiv.org/pdf/2202.07646

How to check copyright?

Posted Oct 5, 2025 13:45 UTC (Sun) by kleptog (subscriber, #1183) [Link]

While the issue of memorisation is interesting, it is ultimately not really relevant to the discussion. You don't need an LLM to intentionally violate copyright. The issue is whether you can use an LLM to *unintentionally* violate copyright.

I think those papers actually show it is quite hard: even with very specific prompting, the majority of texts could not be recovered to any significant degree. So what are the chances that an LLM will reproduce a text verbatim without special prompting?

Mathematically speaking, an LLM is just a function, and for every output there exists an input that will produce something close to it, even if it is just "Repeat X". (Well, technically I don't know whether LLMs have a dense output space.) What are the chances that a random person will hit one of the inputs that produces some copyrighted output?

I suppose we've given the "infinite monkeys" a power tool that makes it more likely for them to reproduce Shakespeare. Is it too likely?

