How to check copyright?
Posted Oct 2, 2025 10:23 UTC (Thu) by stefanha (subscriber, #55072)
In reply to: How to check copyright? by farnz
Parent article: Fedora floats AI-assisted contributions policy
They look at the license information for any reference code they are copying (e.g. code they found through a web search). LLMs do not provide license information with their output, so the contributor cannot do this.
It's a false equivalence to compare this to a developer verifying their own hand-written code. It's more like a developer coming across source code without license information somewhere. It might be handy to copy-paste that code but how can they determine the license of the mystery code?
> And it's perfectly fair to push this responsibility onto contributors; they're the ones who choose which tools they use, and if they're choosing tools where they can't be sure about the copyright status of the output, then they're the root cause of the problem. Why should Fedora take responsibility for finding a way to vet the output of all possible tools against the vast body of copyrighted content, when it has no say in the tools you choose?
Because a policy whose requirements are impossible to fulfill is at best sloppy. At worst it's "wink, wink, we know the code is not actually license compliant but we'll turn a blind eye to it".
Posted Oct 2, 2025 10:31 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (45 responses)
But it's not Fedora's decision to use an LLM - it's the contributor's. The contributor is responsible for making sure that they've complied with all of their legal obligations. And, taking your very example: how does Fedora determine that the developer didn't just copy and paste mystery code they were sent in a private chat at work?
That's why this is a reasonable policy; the requirements are the same no matter what tools you use. It's on you, as contributor, to make sure that you're complying with the rules.
There's also no reason why you can't have an LLM whose code is known to be licensed permissively; if you only train the model on MIT-licensed code, then, by definition, if the output is a derived work of the input (which is not guaranteed to be the case), the output is MIT-licensed. Banning "AI tools" bans this tool, too.
Posted Oct 2, 2025 12:44 UTC (Thu)
by io-cat (subscriber, #172381)
[Link] (11 responses)
Using these tools then makes it extremely hard if not impossible to comply with the rules.
I strongly agree with the statement that the final responsibility is on the contributor (the person).
But the encouragement to use these tools in the original policy post is incompatible with compliance in the current landscape.
Posted Oct 2, 2025 12:47 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (3 responses)
I see no encouragement to use these tools in the policy post.
Quite the opposite - I see it telling you that if you use them, you're responsible for everything they do, and Fedora will not accept "oh, that was the LLM" as an excuse for your failure to sort out licensing terms.
Posted Oct 2, 2025 13:10 UTC (Thu)
by io-cat (subscriber, #172381)
[Link] (1 responses)
> We encourage the use of AI assistants as an evolution of the contributor toolkit. However, human oversight remains critical.
https://discussion.fedoraproject.org/t/council-policy-pro...
Posted Oct 2, 2025 13:17 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
I missed that, skipping over what I thought was "corporate boilerplate" to the bolded sentence afterwards: "The contributor is always the author and is fully accountable for their contributions."
Posted Oct 2, 2025 13:10 UTC (Thu)
by jzb (editor, #7867)
[Link]
"I see no encouragement to use these tools in the policy post." The first sentence in the original policy post after the heading "AI-assisted project contributions" is "We encourage the use of AI assistants as an evolution of the contributor toolkit." (Emphasis added.) The second draft posted in the discussion thread is more neutral and focuses on stressing that the contributor is responsible.
Posted Oct 2, 2025 13:21 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (6 responses)
Well that's a problem for LLM advocates, not for Fedora. It's not for the community to solve problems in products pushed by wealthy corporations. The sad part of IBM buying Red Hat is that Fedora is now part of the corporate hype cycle, and every Red Hat employee is now required to state that whatever tech IBM wants a slice of is something terrific you should be enthusiastic about. Red Hat consistently outperformed the IBMs of the day because it delivered solid, boring tech that solved actual problems, not over-hyped vaporware.
LLM tech has some core unsolved problems, just like cryptocurrencies (the previous hype cycle) had core unsolved problems. Sad to be you if you were foolish enough to put your savings there listening to the corporate hype; the corporate hype does not care about you or about communities.
Posted Oct 2, 2025 13:53 UTC (Thu)
by io-cat (subscriber, #172381)
[Link] (2 responses)
Could you clarify how you perceived my comment? I'm not sure how your response, especially given the tone, follows from it :)
Posted Oct 2, 2025 14:57 UTC (Thu)
by nim-nim (subscriber, #34454)
[Link] (1 responses)
I completely agree that the “enthusiasm” and gushing about the greatness of AI have no place in a community policy. That's pure unadulterated corporate brown-nosing. Good community policies should be dry and to the point; they should help contributors contribute, not feel like an advert for something else.
Posted Oct 2, 2025 15:21 UTC (Thu)
by io-cat (subscriber, #172381)
[Link]
If it is too hard or impossible to guarantee that the license of LLM output is compliant with the rules - it doesn’t make sense to me to encourage or perhaps even allow usage of such tools until this is ironed out by their proponents.
I’m focusing my opinion here specifically on the licensing question, aside from other potentially problematic things.
Posted Oct 2, 2025 13:58 UTC (Thu)
by zdzichu (subscriber, #17118)
[Link]
Posted Oct 3, 2025 9:38 UTC (Fri)
by nim-nim (subscriber, #34454)
[Link]
Posted Oct 3, 2025 12:56 UTC (Fri)
by stefanha (subscriber, #55072)
[Link]
I felt bemused reading this. Here we are in a comment thread that I, a Red Hat employee, started about issues with the policy proposal. Many of the people raising questions on the Fedora website are also Red Hat employees.
It's normal for discussions to happen in public in the community. Red Hatters can and will disagree with each other.
Posted Oct 2, 2025 13:12 UTC (Thu)
by stefanha (subscriber, #55072)
[Link] (32 responses)
The statement I'm making is: contributors cannot comply with their legal obligations when submitting LLM output. I'll also add "LLM output as it exists today", because you're right that in theory LLMs could be trained in a way to provide clarity on licensing.
There is no point in a policy about contributing LLM-generated code today because no contributor can follow it while still honoring their legal obligations.
> There's also no reason why you can't have an LLM whose code is known to be licensed permissively; if you only train the model on MIT-licensed code, then, by definition, if the output is a derived work of the input (which is not guaranteed to be the case), the output is MIT-licensed. Banning "AI tools" bans this tool, too.
The MIT license requires including the copyright notice with the software, so the LLM would need to explicitly include "Copyright (c) <year> <copyright holders>" and the rest of the copyright notice for each input that is being copied. It's not enough to just add a blanket MIT license to the output because it will not contain the copyright holder information.
An LLM that can do that would be great. It could also be trained on software under other licenses because the same attribution needed for MIT would be enough to properly license other software too.
But that does not exist as far as I know. The reality today is that a contributor cannot take AI-generated output and know how it is licensed.
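As a purely illustrative sketch of what that attribution burden looks like in practice (the "libfoo" project, the author name and the helper function below are invented for the example), copied MIT code has to carry something like this rather than a bare "MIT" label:

# Hypothetical attribution for a snippet copied from the (invented) libfoo project.
# The notice below, plus the MIT permission text, must travel with the copy.
#
#   Copyright (c) 2021 Jane Example
#   Licensed under the MIT License. The above copyright notice and this
#   permission notice shall be included in all copies or substantial
#   portions of the Software.

def clamp(value, low, high):
    """Example copied helper; it is this code that the notice must accompany."""
    return max(low, min(high, value))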
Posted Oct 2, 2025 13:13 UTC (Thu)
by stefanha (subscriber, #55072)
[Link] (12 responses)
Posted Oct 2, 2025 13:27 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (11 responses)
"contributors cannot comply with their legal obligations when submitting LLM output"
The trouble is that your argument depends on the LLM's output being a derived work of its training data; this is not necessarily true, even if you can demonstrate that the training data is present in some form inside the LLM's weights (not least because literal copying is not necessarily a derived work).
If you limit yourself to the subsets of LLM output that are not derived works (e.g. because they're covered by the equivalents of the scènes à faire doctrine in US copyright law or other parts of the idea-expression distinction), then you can comply with your legal obligations. You are forced to do the work to confirm that the LLM output you're using is not, legally speaking, a derived work, but then it's safe to use.
Posted Oct 2, 2025 14:54 UTC (Thu)
by stefanha (subscriber, #55072)
[Link] (2 responses)
I started this thread by asking:
> But how is a contributor supposed to know whether AI-generated output is covered by copyright and under a compatible license?
And here you are saying that if you know it's not a derived work, then it's safe to use. I agree with you.
The problem is that we still have no practical way of knowing whether the LLM output is under copyright or not.
Posted Oct 2, 2025 15:18 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (1 responses)
There's at least two cases where "knowing whether the LLM output is under copyright or not" is completely irrelevant.
Posted Oct 3, 2025 14:53 UTC (Fri)
by stefanha (subscriber, #55072)
[Link]
I agree. I'm curious if anyone has solutions when copyright does come into play. It seems like a major use case that needs to be addressed.
Posted Oct 2, 2025 17:16 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (7 responses)
It also "conveniently forgets" that any developer worth their salt is exposed to a lot of code for which they do not hold the copyright, and may not even be aware of the fact that they are recalling verbatim chunks of code they memorised at Uni, at another place of work, or from something a friend showed them.
So all this complaining about AI-generated code could also be applied pretty much the same to developer-generated code, it's just that we don't think it's a problem if it's a developer, some people think it is if it's an AI.
Personally, I'd be quite happy to ingest AI-generated code into my brain, and then regurgitate the gist of it (suitably modified for corporate guidelines/whatever). By the time you've managed to explain in excruciating detail to the AI what you want, it's probably better to give it a simple explanation and rewrite the result.
Okay, that end result may not be "clean room" copyright compliant, but given the propensity for developers to remember code fragments, I expect very little code is.
We have a problem with musicians suing each other for copying fragments of songs (which the "copier" was probably unaware of - which the copyright *holder* probably copied as well without being aware of it!!!), how can we keep that out of computer programming? We can't, and that's assuming AI had no hand in it!
Cheers,
Wol
Posted Oct 3, 2025 13:20 UTC (Fri)
by alex (subscriber, #1355)
[Link]
Patents are a separate legal rabbit hole.
Posted Oct 3, 2025 15:12 UTC (Fri)
by stefanha (subscriber, #55072)
[Link] (5 responses)
In the original comment I linked to a paper about extracting copyrighted content from LLMs. A web search brings up a bunch more in this field that I haven't read. Here is one explicitly about generated code (https://arxiv.org/html/2408.02487v3) that says "we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations".
I think AI policies are getting ahead of themselves when they assume that a contributor can vouch for license compliance. There needs to be some kind of lawyer-approved solution to this so that the open source community is protected from a copyright mess.
Posted Oct 3, 2025 15:25 UTC (Fri)
by farnz (subscriber, #17727)
[Link] (4 responses)
There's a critical piece of data missing - what proportion of human-written code is strikingly similar to existing open-source implementations?
We know that humans accidentally and unknowingly infringe, too. Why can't we reuse the existing lawyer-approved solution to that problem for LLM output?
Posted Oct 3, 2025 16:47 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (3 responses)
If I had an LLM and found myself sued like that, I'd certainly want to drag the querier into it ...
Cheers,
Wol
Posted Oct 6, 2025 14:24 UTC (Mon)
by stefanha (subscriber, #55072)
[Link] (2 responses)
Hence why contributors need a way to check copyright compliance.
Posted Oct 6, 2025 14:29 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
TBF, you also need such a mechanism to check copyright compliance of any code you've written yourself - you are also quite capable of accidental infringement (where having seen a particular way to write code before, you copy it unintentionally), and to defend yourself or the project you contribute to, you have to prove either that you never saw the original code that you're alleged to have copied (the clean room route) or that this code is on the "idea" side of the idea-expression distinction (however that's expressed in local law).
Posted Oct 6, 2025 14:36 UTC (Mon)
by pizza (subscriber, #46)
[Link]
This is a legal problem, and cannot be solved via (purely, or even mostly) technical means.
Posted Oct 2, 2025 13:18 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (18 responses)
Right, but policy should not be written to just cover today. It also needs to cover tomorrow's evolutions of the technology, and that means covering hypotheticals like an LLM that could correctly attribute derived works, or a contributor who uses something that's AI-based but is careful to make sure that the stuff they submit to Fedora is not a derived work in the copyright sense (and hence licensing is irrelevant).
And the reality today is that you can take AI-generated output, and confirm by inspection that it's not possible for it to be a derived work, and hence that licensing is irrelevant.
Posted Oct 2, 2025 15:09 UTC (Thu)
by stefanha (subscriber, #55072)
[Link] (17 responses)
I agree.
It is common practice to use AI to generate non-trivial output though. If the intent of the policy is to allow trivial AI-generated contributions, then it should mention this to prevent legal issues.
Posted Oct 2, 2025 15:34 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (16 responses)
You continue to ignore the massive difference between "non-trivial" and "a derived work of the training data, protected by copyright". That's deeply critical to this conversation, and unless you can show that an AI's output is inherently a derived work, you're asking us to accept a tautology.
Fundamentally, and knowing how transformers work, I do not accept the claim that an AI's output is inherently a derived work of the training data. It is definitely (and demonstrably) possible to get an AI to output things that are derived works of the training data, with the right prompts, but it is also entirely possible to get an AI to produce output that is not derived from the training data for the purposes of copyright law.
It is also possible to get humans to produce outputs that are derived works of their training data, but that doesn't imply that all works produced by humans are derived works of their training data, for the purposes of copyright law.
Posted Oct 2, 2025 19:55 UTC (Thu)
by ballombe (subscriber, #9523)
[Link] (15 responses)
Posted Oct 2, 2025 20:05 UTC (Thu)
by mb (subscriber, #50428)
[Link] (12 responses)
What is the fundamental difference between
a) human brains processing code into documentation and then into code
and
b) LLMs processing code into very abstract and compressed intermediate representations and then into code?
LLM models would probably contain *less* information about the original code than documentation would.
Posted Oct 2, 2025 21:07 UTC (Thu)
by pizza (subscriber, #46)
[Link] (4 responses)
> a) human brains
> b) LLMs processing
Legally, there's a huge distinction between the two.
And please keep in mind that "legally" is rarely satisfied with "technically" arguments.
Posted Oct 2, 2025 21:17 UTC (Thu)
by mb (subscriber, #50428)
[Link] (3 responses)
Interesting.
Can you back this up with some actual legal text or descriptions from lawyers?
I'd really be interested in learning what lawyers think the differences are.
Posted Oct 3, 2025 0:16 UTC (Fri)
by pizza (subscriber, #46)
[Link] (1 responses)
Only stuff created by a human is eligible for copyright protection.
See https://www.copyright.gov/comp3/chap300/ch300-copyrightab... section 307.
Doesn't get any simpler than that.
Posted Oct 3, 2025 7:01 UTC (Fri)
by mb (subscriber, #50428)
[Link]
That is a completely different topic, though.
This is about *re*-producing existing actually copyrighted content.
Posted Oct 3, 2025 11:16 UTC (Fri)
by Wol (subscriber, #4433)
[Link]
So I guess (and this is not clear) that there is no difference between an AI regurgitating what it's learnt and a person regurgitating what they've learnt.
So it basically comes down to the question "how close is the output to the input, and was the output obvious and not worthy of copyright protection?"
Given the tendency of AI to hallucinate, I guess the output of an AI is LESS likely to violate copyright than that of a human. Of course, the corollary becomes the output of a human is more valuable :-)
Cheers,
Wol
Posted Oct 3, 2025 17:39 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (6 responses)
Clean room reverse engineering requires that there be two separate, non-interacting teams: one that has access to the original code and writes its specification, and a second team that never accesses the original code and relies only on the specification to write the new program.
Since by hypothesis the LLM had access to all the code on github, it cannot be used to write the new program.
Remember that when some Windows code was leaked, WINE developers were advised not to look at it to avoid being "tainted".
Posted Oct 3, 2025 17:43 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
It definitely can be used to write the new program; because it had access to the code on GitHub, you cannot assert lack of access as evidence of lack of copying (which is what the clean room setup is all about), but you can still assert that either the copied code falls on the idea side of the idea-expression distinction, or that it is not a derived work (in the legal, not mathematical, sense) for the purposes of copyright law for some other reason.
The point of the clean room process is that the only thing you need to look at to confirm that the second team did not copy the original code is the specification produced by the first team, which makes it tractable to confirm that the second team's output is not a derived work by virtue of no copying being possible.
But that's not the only way to avoid infringing - it's just a well-understood and low-risk way to do so.
Posted Oct 3, 2025 18:22 UTC (Fri)
by mb (subscriber, #50428)
[Link] (4 responses)
The two teams are interacting. Via documentation.
Which is IMO not that dissimilar from the network weights, which are passed from the network trainer application to the network executor application.
> Since by hypothesis the LLM had access to all the code on github
I don't agree.
The training application had access to the code.
And the executing application doesn't have access to the code.
The generated code comes out of the executing application.
Posted Oct 4, 2025 19:51 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (3 responses)
In fact, bigger models also increase the memorization ability.
Posted Oct 4, 2025 19:54 UTC (Sat)
by mb (subscriber, #50428)
[Link] (2 responses)
This contradicts itself.
Posted Oct 5, 2025 10:07 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link] (1 responses)
You can also use the language model as a source of probabilities for arithmetic coding, and some texts will compress ridiculously well - so much so that the only explanation is that large parts of the text are already present in the weights in compressed form. In fact it can be mathematically proven that memorization, compression and training are essentially the same thing.
Here is a paper from DeepMind on the memorization capabilities of LLMs: https://arxiv.org/pdf/2507.05578
And here is an earlier one that analyzed how memorization improves as the number of parameters grows: https://arxiv.org/pdf/2202.07646
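One way to make the compression point concrete is to measure how many bits a model needs to encode a given text; an arithmetic coder driven by the model's token probabilities achieves essentially that bound. Below is a minimal sketch, assuming the Hugging Face transformers library; "gpt2" is an arbitrary stand-in for any causal language model.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough illustration: total bits a model-driven arithmetic coder would need
# for a text. Far fewer bits than the raw size suggests the model has, in
# effect, memorized (compressed) much of it. Model choice is arbitrary.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def bits_to_encode(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean cross-entropy in nats per predicted token; scale by
    # the number of predicted tokens and convert nats to bits.
    return out.loss.item() * (ids.shape[1] - 1) / math.log(2)

sample = "Some passage you suspect was in the training data."
print(f"{bits_to_encode(sample):.1f} bits vs {8 * len(sample.encode())} raw bits")

This only gauges compressibility under one particular model, of course; it doesn't by itself answer the legal questions discussed above.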
Posted Oct 5, 2025 13:45 UTC (Sun)
by kleptog (subscriber, #1183)
[Link]
I think those papers actually show it is quite hard. Because even with very specific prompting, the majority of texts could not be recovered to any significant degree. So what are the chances an LLM will reproduce a literal text without special prompting?
Mathematically speaking an LLM is just a function, and for every output there exists an input that will produce something close to it. Even if it is just "Repeat X". (Well, technically I don't know if we know that LLMs have a dense output space.) What are the chances a random person will hit one of those inputs that matches some copyrighted output?
I suppose we've given the "infinite monkeys" a power-tool that makes it more likely for them to reproduce Shakespeare. Is it too likely?
Posted Oct 3, 2025 8:07 UTC (Fri)
by rschroev (subscriber, #4164)
[Link]
That's a whole different situation than either people or LLMs reading code and later using their knowledge to write new code.
Posted Oct 3, 2025 9:40 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
Clean-room reverse-engineering isn't part of the codified side of copyright law; rather, it's a process that the courts recognise as guaranteeing that the work produced in the clean room cannot be a derived work of the original.
To be a derived work, there must be some copying of the original, intended or accidental. The clean-room process guarantees that the people in the clean-room cannot copy the original, and therefore, if they do come up with something that appears to be a copy of the original, it's not a derived work.
You can, of course, do reverse-engineering and reimplementation without a clean-room setup; it's just that you then have to show that each piece that's alleged to be a literal copy of the original falls on the right side of the idea-expression distinction to not be a derived work, instead of being able to show that no copying took place.