How to check copyright?

Posted Oct 2, 2025 13:13 UTC (Thu) by stefanha (subscriber, #55072)
In reply to: How to check copyright? by stefanha
Parent article: Fedora floats AI-assisted contributions policy

I missed a word:
"contributors cannot comply with their legal obligations when submitting LLM output"

How to check copyright?

Posted Oct 2, 2025 13:27 UTC (Thu) by farnz (subscriber, #17727)

The trouble is that your argument depends on the LLM's output being a derived work of its training data; this is not necessarily true, even if you can demonstrate that the training data is present in some form inside the LLM's weights (not least because literal copying is not necessarily a derived work).

If you limit yourself to the subsets of LLM output that are not derived works (e.g. because they're covered by the equivalents of the scènes à faire doctrine in US copyright law or other parts of the idea-expression distinction), then you can comply with your legal obligations. You are forced to do the work to confirm that the LLM output you're using is not, legally speaking, a derived work, but then it's safe to use.

How to check copyright?

Posted Oct 2, 2025 14:54 UTC (Thu) by stefanha (subscriber, #55072)

> If you limit yourself to the subsets of LLM output that are not derived works (e.g. because they're covered by the equivalents of the scènes à faire doctrine in US copyright law or other parts of the idea-expression distinction), then you can comply with your legal obligations. You are forced to do the work to confirm that the LLM output you're using is not, legally speaking, a derived work, but then it's safe to use.

I started this thread by asking:

> But how is a contributor supposed to know whether AI-generated output is covered by copyright and under a compatible license?

And here you are saying that if you know it's not a derived work, then it's safe to use. I agree with you.

The problem is that we still have no practical way of knowing whether the LLM output is under copyright or not.

How to check copyright?

Posted Oct 2, 2025 15:18 UTC (Thu) by farnz (subscriber, #17727)

There's at least two cases where "knowing whether the LLM output is under copyright or not" is completely irrelevant:

1. You don't know how to solve the problem; you ask an LLM to explain how to solve it, and then you write the code yourself, based on the LLM's explanation. This is an existing problem - it's the same as reading a book or a paper that explains how to solve the problem - and the answer is to "assume it's covered by copyright, but write your own solution, don't just copy blindly". That applies whether "the text" is a book, a paper, or some LLM-generated work.
2. The part of the contribution copied from the LLM's output is one that you've inspected and confirmed would be covered by an exception to copyright law even if the work it is taken from is under copyright. In this case, the copyright status of the LLM's output is irrelevant, since the part you're using is one you can use even if it's under copyright. Again, this is a pre-existing problem; if I read (say) IEEE 1003.1-2024 (or one of the many things that copy text from it verbatim, such as a Linux man page), and copy part of it into my contribution, that's copying from a document under copyright and licensed under restrictive terms, but because it doesn't rise to the point where my copying creates a derived work, copyright status is irrelevant.

How to check copyright?

Posted Oct 3, 2025 14:53 UTC (Fri) by stefanha (subscriber, #55072)

> There's at least two cases where "knowing whether the LLM output is under copyright or not" is completely irrelevant:

I agree. I'm curious if anyone has solutions when copyright does come into play. It seems like a major use case that needs to be addressed.

How to check copyright?

Posted Oct 2, 2025 17:16 UTC (Thu) by Wol (subscriber, #4433)

> The trouble is that your argument depends on the LLM's output being a derived work of its training data; this is not necessarily true, even if you can demonstrate that the training data is present in some form inside the LLM's weights (not least because literal copying is not necessarily a derived work).

It also "conveniently forgets" that any developer worth their salt is exposed to a lot of code for which they do not hold the copyright, and may not even be aware of the fact that they are recalling verbatim chunks of code they memorised at Uni / another place of work / a friend showed it to them.

So all this complaining about AI-generated code could be applied in pretty much the same way to developer-generated code; it's just that we don't think it's a problem if it's a developer, but some people think it is if it's an AI.

Personally, I'd be quite happy to ingest AI-generated code into my brain, and then regurgitate the gist of it (suitably modified for corporate guidelines/whatever). By the time you've managed to explain in excruciating detail to the AI what you want, it's probably better to give it a simple explanation and rewrite the result.

Okay, that end result may not be "clean room" copyright compliant, but given the propensity for developers to remember code fragments, I expect very little code is.

We have a problem with musicians suing each other for copying fragments of songs (which the "copier" was probably unaware of - which the copyright *holder* probably copied as well without being aware of it!!!), how can we keep that out of computer programming? We can't, and that's assuming AI had no hand in it!

Cheers,
Wol

How to check copyright?

Posted Oct 3, 2025 13:20 UTC (Fri) by alex (subscriber, #1355)

I went through this many moons ago when one of the start-ups I worked at was working on an emulation layer. The lawyer made a distinction between "retained knowledge" (i.e. what was in our heads) and copying verbatim from either the files or notes. I had to hand in all my notebooks when I left the company, but assuming no reference to them, I could implement something roughly the same way I had before. There is a lot of code which isn't copyrightable because it is either the only way to do it or it's "obvious".

Patents were a separate legal rabbit hole.

How to check copyright?

Posted Oct 3, 2025 15:12 UTC (Fri) by stefanha (subscriber, #55072)

I am not claiming that all AI output is covered by the copyright of its training data. It seems reasonable that generated output is treated in the same way as when humans who have been exposed to copyrighted content create something.

In the original comment I linked to a paper about extracting copyrighted content from LLMs. A web search brings up a bunch more in this field that I haven't read. Here is one explicitly about generated code (https://arxiv.org/html/2408.02487v3) that says "we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations".

I think AI policies are getting ahead of themselves when they assume that a contributor can vouch for license compliance. There needs to be some kind of lawyer-approved solution to this so that the open source community is protected from a copyright mess.
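
To make "strikingly similar" concrete, here is a minimal sketch of one way a contributor might flag textual overlap between a generated snippet and a known reference file. It is not the method used in the paper linked above; the function names (`tokens`, `shingles`, `jaccard`) and the 5-token window are assumptions chosen for illustration. A high score only says the text overlaps; it does not say whether the snippet is a derived work or covered by an exception.

```python
import re


def tokens(source: str) -> list[str]:
    """Split source text into identifiers, numbers, and single punctuation marks.

    Whitespace is discarded, so reformatting alone does not change the result.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def shingles(source: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-token windows ("shingles") in the text."""
    toks = tokens(source)
    if len(toks) < n:
        # Too short for a full window: treat the whole snippet as one shingle.
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    generated = "for (i = 0; i < n; i++) sum += a[i];"
    reference = "for (i = 0;\n    i < n; i++)\n    sum += a[i];"
    # The two snippets differ only in whitespace.
    print(jaccard(generated, reference))
```

A check like this can only prompt a closer look at a snippet; deciding what the overlap means is still the legal question discussed in this thread.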

How to check copyright?

Posted Oct 3, 2025 15:25 UTC (Fri) by farnz (subscriber, #17727)

There's a critical piece of data missing - what proportion of human-written code is strikingly similar to existing open-source implementations?

We know that humans accidentally and unknowingly infringe, too. Why can't we reuse the existing lawyer-approved solution to that problem for LLM output?

How to check copyright?

Posted Oct 3, 2025 16:47 UTC (Fri) by Wol (subscriber, #4433)

And another thing - how much copyright violation is being blamed on the LLM, when the query being *sent* to the LLM itself is a pretty blatant copyright violation? At which point we're seriously into "unclean hands", and if the querier is not the copyright holder, they could easily find themselves named as a co-defendant (quite likely the more culpable defendant!) even if they're not the deeper pocket.

If I had an LLM and found myself sued like that, I'd certainly want to drag the querier into it ...

Cheers,
Wol

How to check copyright?

Posted Oct 6, 2025 14:24 UTC (Mon) by stefanha (subscriber, #55072)

> If I had an LLM and found myself sued like that, I'd certainly want to drag the querier into it ...

Hence why contributors need a way to check copyright compliance.

How to check copyright?

Posted Oct 6, 2025 14:29 UTC (Mon) by farnz (subscriber, #17727)

TBF, you also need such a mechanism to check copyright compliance of any code you've written yourself - you are also quite capable of accidental infringement (where having seen a particular way to write code before, you copy it unintentionally), and to defend yourself or the project you contribute to, you have to prove either that you never saw the original code that you're alleged to have copied (the clean room route) or that this code is on the "idea" side of the idea-expression distinction (however that's expressed in local law).

How to check copyright?

Posted Oct 6, 2025 14:36 UTC (Mon) by pizza (subscriber, #46)

> Hence why contributors need a way to check copyright compliance.

This is a legal problem, and cannot be solved via (purely, or even mostly) technical means.