
Fedora floats AI-assisted contributions policy

By Joe Brockmeier
October 1, 2025

The Fedora Council began a process to create a policy on AI-assisted contributions in 2024, starting with a survey to ask the community its opinions about AI and the use of AI technologies in Fedora. On September 25, Jason Brooks published a draft policy for discussion; so far, in keeping with the spirit of compromise, it has something to make everyone unhappy. For some it is too AI-friendly, while others have complained that it holds Fedora back from experimenting with AI tooling.

Fedora's AI survey

Aoife Moloney asked for suggestions in May 2024, via Fedora's discussion forum, on survey questions to learn "what our community would like and perhaps even need from AI capabilities in Fedora". Many of Fedora's contributor conversations take place on the Fedora devel mailing list, but Moloney did not solicit input for the survey questions there.

Tulio Magno Quites Machado Filho suggested asking whether the community should accept contributions generated by AI, and if AI-generated responses to mailing lists should be prohibited. Josh Boyer had ideas for the survey, including how Fedora defines AI and whether contributions to the project should be used as data by Fedora to create models. Justin Wheeler wanted to understand "the feelings that someone might have when we talk about 'Open Source' and 'AI/ML' at the same time". People likely have strong opinions about both, he said, but what about when the topics are combined?

Overall, there were only a handful of suggested questions. Matthew Miller, who was the Fedora Project Leader (FPL) at the time, pointed out that some of the questions proposed by commenters were good questions but not good survey questions.

In July, Moloney announced on the forum and via Fedora's devel-announce list that the survey had been published. Unfortunately, it is no longer available online, and the questions were not included in the announcements.

The way the survey's questions and answers were structured turned out to be a bit contentious; some felt that the survey was biased toward AI/ML inclusion in Fedora. Lyude Paul wanted a way to say Fedora should not, for example, include AI in the operating system itself in any capacity. That was not possible with the survey as written:

I'd like to make sure I'm getting across to surveyors that tools like [Copilot] are things that should actively be kept away from the community due to the enormous PR risk they carry. Otherwise it makes it feel like the only option that's being given is one that feels like it's saying "well, I don't think we should keep these things out of Fedora - I just feel they're less important."

Moloney acknowledged that the questions were meant to "keep the tone of this survey positive about AI" because it is easy to find negatives for the use of AI, "and we didn't want to take that route":

We wanted to approach the questions and uncertainty around AI and its potential uses in Fedora from a positive view and useful application, but I will reassure you that just because we are asking you to rank your preference of AI in certain areas of the project, does not mean we will be introducing AI into all of these areas.

We are striving to understand peoples preference only and any AI introductions into Fedora will always be done in the Fedora way - an open conversation about intent, community feedback, and transparent decision-making and/or planning that may follow after.

DJ Delorie complained on the devel list that there was no way to mark all options as a poor fit for Fedora. Moloney repeated the sentiment about positive tone in reply. Tom Hughes responded "if you're only interested in positive responses then we can say the survey design is a success - just a shame that the results will be meaningless". Several other Fedora community members chimed in with complaints about the survey, which was pulled on July 3, and then relaunched on July 10, after some revisions.

It does not appear that the full survey results were ever published online. Miller summarized the results during his State of Fedora talk at Flock 2024, but the responses were compressed into a less-than-useful form. The survey asked separate questions about whether AI was useful for specific tasks, such as testing or coding, and whether respondents would like to see AI used for those tasks in Fedora. So, for example, a respondent could say "yes" to using AI for testing, but answer "no" or "uncertain" when asked whether they would like to see AI used for contributions.

Instead of breaking out the responses by task, all of the responses were lumped into an overall result split into two groups, users and contributors. If a respondent answered yes to some questions but uncertain to others, they were counted as "Yes + Uncertain". That presentation does not seem to correspond with how the questions were posed to those taking the survey.

The only conclusion that can be inferred from these graphs is that a majority of respondents chose "uncertain" in a lot of cases, and that there is less uncertainty among users than contributors. Miller's slides are available online. The survey results were discussed in more detail in a Fedora Council meeting on September 11, 2024; the video of the meeting is available on YouTube.

Analysis

The council had asked Greg Sutcliffe, who has a background as a data scientist, to analyze and share the results from the survey. He began the discussion by saying that the survey cannot be interpreted as "this is what Fedora thinks", because it failed to deal with sampling bias.

He also noted other flaws in the survey, such as giving respondents the option of choosing "uncertain" as an answer; Sutcliffe said it was not clear whether uncertain meant the respondent did not know enough about AI to answer the question, or whether it meant the respondent knew enough to answer, but was ambivalent in some way. He also said that it was "interesting that we don't ask about the thing we actually care finding out about": how the respondents feel about AI in general. Without understanding that, it is challenging to place other answers in context.

One thing that was clear is that the vast majority of respondents were Fedora users, rather than contributors. The survey asked those who responded to identify their role with Fedora, with options such as developer, packager, support, QA, and user. Out of about 3,000 responses, more than 1,750 chose "user" as their role.

Sutcliffe noted that "'no' is pretty much the largest category in every case" where respondents were asked about use of AI for specific tasks in Fedora; across the board, the survey seemed to indicate a strong bent toward rejecting the use of AI overall. However, respondents were less negative about using AI depending on the context. For example, respondents were overwhelmingly negative about having AI included in the operating system and being used for Fedora development or moderation; the responses regarding the use of AI technologies in testing or Fedora infrastructure were more balanced.

"On the negative side"

Miller asked him to sum up the results, with the caveat that the survey did not support general conclusions about what Fedora thinks. Sutcliffe replied that "the sentiment seems to be more on the negative side [...] the vast majority don't think this is a good idea, or at least they don't see a place" for AI.

Given that, it seems odd that when Brooks announced the draft policy that the council had put forward, he summarized the survey results as giving "a clear message" that Fedora's community sees "the potential for AI to help us build a better platform" with valid concerns about privacy, ethics, and quality. Sutcliffe's analysis seemed to indicate that the survey delivered a muddled message, but one that could best be summed up as "no thanks" to AI if one were to reach a conclusion at all. Neal Gompa said that he did not understand how the conclusions for the policy were made, "because it doesn't seem like it jives with the community sentiment from most contributors".

Some might notice that quite a bit of time has passed between the survey and the council's draft policy. This may be because there is not a consensus within the council on what Fedora should be doing or allowing. The AI guidelines were a topic in the September 10 council meeting. (Meeting log.) David Cantrell said that he was unsure that Fedora could have a real policy right now, "because the field is so young". Miro Hrončok pointed to the Gentoo AI policy, which expressly forbids contributing AI-generated content to Gentoo. He did not want to replicate that policy as-is, but "they certainly do have a point", he said. Cantrell said Gentoo's policy is what he would want for Fedora's policy right now, with an understanding that it could be changed later. LWN covered Gentoo's policy in April 2024.

Cantrell added that his concerns about copyright, ownership, and creator rights with AI had not been addressed:

So far my various conversations have been met with shrugs and responses like 'eh, we'll figure it out', which to me is not good enough. [...]

Any code now that I've written and put on the internet is now in the belly of every LLM and can be spit back out at someone sans the copyright and licensing block. Does no one care about open source licensing anymore? Did I not get the memo?

FPL Jef Spaleta said that he did not understand how splitting out Fedora's "thin layer of contributions" does anything to address Cantrell's copyright concerns. "I'm not saying it's moot. I'm saying you're drawing the line in the sand at the wrong place." Moloney eventually reminded the rest of the council that time was running short in the meeting, and suggested that the group come to an agreement on what to do next. "We have a ticket, we have a WIP and we have a lot of opinions, all the ingredients to make...something."

Proposed policy

There are several sections to the initial draft policy from September 25; it addresses AI-assisted contributions, the use of AI for Fedora project management, the use of AI-powered features in Fedora, and how Fedora project data could be used for training AI models.

The first draft encouraged the use of AI assistants in contributing to Fedora, but stressed that "the contributor is always the author and is fully accountable for their contributions". Contributors are asked, but not required, to disclose when AI tools have "significantly assisted" creation of a work. Usage of AI tools for translation to "overcome language barriers" is welcome.

It would put guardrails around using AI tools for reviewing contributions; the draft says that AI tools may assist in providing feedback, but "AI should not make the final determination on whether a contribution is accepted or not". The use of AI for Fedora project management, such as deciding code-of-conduct matters or reviewing conference talks, is expressly forbidden. However, automated note-taking and spam filtering are allowed.

Many vendors, and even some open-source projects, are rushing to push AI-related features into operating systems and software whether users want them or not. Perhaps the most famous (or infamous) of these is Microsoft's Copilot, which is deeply woven into the Windows 11 operating system and notoriously difficult to turn off. Fedora is unlikely to go down that path anytime soon; any AI-powered features in Fedora, the policy says, must be opt-in—especially those that send data to a remote service.

The draft policy section on use of Fedora project data prohibits "aggressive scraping" and suggests contacting Fedora's Infrastructure team for "efficient data access". It does not, however, address the use of project data by Fedora itself; that is, there is no indication of how the project's data could be used in creating any models or when using AI tools on behalf of the project. It also does not explain what criteria might be used to grant "efficient" access to Fedora's data.

Feedback

The draft received quite a bit of feedback in short order. John Handyman said that the language about opt-in features was not strong enough. He suggested that the policy should require any AI features not only to be opt-in, but also to be optional components that the user must consciously install. He also wanted the policy to prefer models running locally on the user's machine, rather than those that send data to a service.

Hrončok distanced himself a bit from the proposal in the forum discussion after it was published; he said that he agreed to go forward with public discussion of the proposal, but made it known, "I did not write this proposal, nor did I sign it as 'proposed by the Council'". Moloney clarified that Brooks had compiled the policy and council members had been asked to review the policy and provide any feedback they might have. If there is feedback that requires significant changes ("which there has been"), then it should be withdrawn, revised, and re-proposed.

Mike McGrath, Red Hat's vice president of core platforms, challenged the policy's prohibition on using AI to make decisions when reviewing contributions. "A ban of this kind seems to step away from Fedora's 'first' policies without even having done the experiment to see what it would look like today." He wanted to see Fedora "get in front of RHEL again with a more aggressive approach to AI". He would hate, he said, for Fedora to be a "less attractive innovation engine than even CentOS Stream".

Fedora Engineering Steering Committee (FESCo) member Fabio Valentini was not persuaded that inclusion and use of AI technologies constitutes innovation. Red Hat and IBM could experiment with AI all they want, but "as far as I can tell, it's pretty clear that most Fedora contributors don't want this to happen in Fedora itself".

McGrath said that has been part of the issue; he had not been a top contributor to Fedora for a long time, but he could not understand why Fedora and its governing boards "have become so well known for what they don't want. I don't have a clue what they do want." The policy, he said, was a compromise between Fedora's mission and nay-sayers:

I just think it sends a weak message at a time where Fedora could be leading and shaping the future and saying "AI, our doors are open, let's invent the future again".

Spaleta replied that much of the difference in setting policy for Fedora versus RHEL comes down to trust. Contribution access to RHEL and CentOS is hierarchical, he said, and trust is established differently there than within Fedora. Even those packagers who work on both may have different levels of trust within Red Hat and Fedora. There is not currently a workable framework for non-deterministic systems to establish the same level of trust as a human within Fedora. The policy may not be as open to experimenting with AI as McGrath might like, but he would rather find a way to move forward with a more prohibitive policy than to "grind on a policy discussion for another year without any progress, stuck on the things we don't have enough experience with to feel comfortable".

Graham White, an IBM employee who is the technical lead for the Granite AI agents, objected to a part of the policy that referenced AI slop:

I've been working in the industry and building AI models for a shade over 20 years and never come across "AI slop". This seems derogatory to me and an unnecessary addition to the policy.

Clearly, White has not been reviewing LWN's article submissions queue or paying much attention to open-source maintainers who have been wading through the stuff.

Redraft

After a few days of feedback, Brooks said that it was clear that the policy was trying to do too much by being a statement about AI as well as a policy for AI usage. He said that he had made key changes to the policy, including removing "biased-sounding rules" about AI slop, using RFC 2119 standard language such as SHOULD and MUST NOT rather than weaker terms, and removing some murky language about honoring Fedora's licenses. He also disclosed that he used AI tools extensively in the revision process.

The shorter policy draft says that contributors should disclose the use of AI assistance; non-disclosure "should be exceptional and justifiable". AI must not be used to make final determinations about contributions or community standing, though it does not prohibit automated tooling for pre-screening tasks like checking packages for common errors. User-facing features, per the policy, must be opt-in and require explicit user consent before being activated. The shorter policy retains the prohibition on disruptive or aggressive scraping, and leaves it to the Fedora Infrastructure team to grant data access.
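The draft does not spell out what such pre-screening looks like in practice; as a purely illustrative sketch (the wrapper script below is hypothetical, not a Fedora tool), an automated step could run an existing checker such as rpmlint and simply surface its findings for a human reviewer, leaving the accept-or-reject decision to a person:

    #!/usr/bin/env python3
    # Hypothetical pre-screening helper: run rpmlint on a package and surface
    # its findings for a human reviewer. It never rejects a contribution on
    # its own; the reported draft allows automated pre-screening but reserves
    # the final determination for humans.
    import subprocess
    import sys

    def prescreen(rpm_path):
        result = subprocess.run(["rpmlint", rpm_path],
                                capture_output=True, text=True)
        if result.stdout:
            print("rpmlint findings for reviewer attention:")
            print(result.stdout, end="")
        return 0  # informational only; a person makes the accept/reject call

    if __name__ == "__main__":
        sys.exit(prescreen(sys.argv[1]))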

Daniel P. Berrangé questioned whether parts of the policy were necessary at all, or if they were specific to AI tools. In particular, he noted that questions around sending user data to remote services had come up well before AI was involved. He suggested that Fedora should have a general policy on how it evaluates and approves tools that process user data, "which should be considered for any tool whether using AI or not". An AI policy, he said, should not be a stand-in for scenarios where Fedora has been missing a satisfactory policy until now.

Prohibition is impossible

The revised draft, and the policy's generally favorable stance toward generative-AI tools, will not please people who are against the use of generative AI in general. However, it does have the virtue of being more practical to enforce than a flat-out ban on the use of generative-AI tools.

Spaleta said that a strict prohibition against AI tools would not stop people from using AI tools; it would "only serve to keep people from talking about how they use it". Prohibition is essentially unenforceable and "only reads as a protest statement". He also said that there is "an entire ecosystem of open source LM/LLM work that this community appears to be completely disconnected from"; he did not want to "purity test" their work, but to engage with them about ethical issues and try to address them.

Next

There appears to be a growing tension between what Red Hat and IBM would like to see from Fedora and what its users and community contributors want from the project. Red Hat and IBM have already come down in favor of AI as part of their product strategies; the only real questions are what to develop and offer to customers or partners.

The Fedora community, on the other hand, has quite a few people who feel strongly against AI technologies for various ethical, practical, and social reasons. The results, so far, of turning people loose with generative AI tools on unsuspecting open-source projects have not been universally positive. People join communities to collaborate with other people, not to sift through the output of large language models. It is possible that Red Hat will persuade Fedora to formally endorse a policy of accepting AI-assisted content, but it may be at the expense of users and contributors.

The discussion continues. Fedora's change policy requires a minimum two-week period for discussion of such policies before the council can vote. A policy change needs to be passed with the "full consensus" model, meaning that it must have at least three of the eight current council members voting in favor of the policy and no votes against. At the moment, the council could vote on a policy as soon as October 9. What the final policy looks like, if one is accepted at all, and how it is received by the larger Fedora community remains to be seen.




Prohibition *is* possible

Posted Oct 1, 2025 17:16 UTC (Wed) by Lightspill (subscriber, #168954) [Link]

Lots of communities prohibit things like violating licenses and plagiarism. Can you detect them with 100% certainty? No. Obviously not. Does everyone decide "Oh, well, okay, then. Since we can't be sure someone isn't submitting code they found somewhere under an incompatible license and stripping off the attribution, I guess we can't have any requirements on provenance."

Communities do, in fact, sanction some activities and the people who performed them when they are found out. You can require all commits to have a trailer indicating that the author affirms they did not use an LLM.

If you want to argue for LLM-code, fine. But "Well, prohibition is impossible so we have no choice." is simply false.
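For example (the trailer name here is only an illustration), a commit-msg hook could refuse commits that lack such a trailer:

    #!/usr/bin/env python3
    # commit-msg hook: require an affirmation trailer on every commit.
    # Git passes the path of the commit-message file as the first argument.
    import sys

    TRAILER = "No-AI-Assistance:"  # hypothetical trailer name; pick your own

    def main(msg_path):
        with open(msg_path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        if not any(line.startswith(TRAILER) for line in lines):
            print(f"commit rejected: missing '{TRAILER}' trailer", file=sys.stderr)
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1]))

That does not make violations impossible, but neither do license requirements; it makes the affirmation explicit and actionable if someone is later found to have lied.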

How to check copyright?

Posted Oct 1, 2025 21:44 UTC (Wed) by stefanha (subscriber, #55072) [Link] (48 responses)

The policy says:

> Contributing to Fedora means vouching for the quality, license compliance, and utility of your submission.

But how is a contributor supposed to know whether AI-generated output is covered by copyright and under a compatible license?

Here is what a recent paper about extracting the text of books from LLMs (with mixed success) says:

> However, we also find that Llama 3.1 70B entirely memorizes some books, like the first Harry Potter book and 1984. In fact, the first Harry Potter is so memorized that, using a seed prompt consisting of just the first few tokens of the first chapter, we can deterministically generate the entire book near-verbatim.

From "Extracting memorized pieces of (copyrighted) books from open-weight language models", https://arxiv.org/abs/2505.12546.

It's easy to imagine AI-generated content that copies source code, documentation, standard documents, etc that are under an incompatible license.

There is no free tool available that contributors can use to check whether their contribution is license compliant. It is unfair to push this responsibility onto contributors without providing a means for them to fulfill their responsibility.

A policy that ignores this issue invites legal problems for contributors and/or the project.

How to check copyright?

Posted Oct 2, 2025 8:20 UTC (Thu) by farnz (subscriber, #17727) [Link] (47 responses)

How does an experienced developer, who's read lots of code, verify that their output is either not a copyright infringement, or if it is, it's under a compatible license?

This is a pre-existing problem that contributors are expected to solve for their own contributions today, and have been expected to solve for years. Fedora's AI policy simply reaffirms that this remains your problem as a contributor, even if you're using an AI tool.

And it's perfectly fair to push this responsibility onto contributors; they're the ones who choose which tools they use, and if they're choosing tools where they can't be sure about the copyright status of the output, then they're the root cause of the problem. Why should Fedora take responsibility for finding a way to vet the output of all possible tools against the vast body of copyrighted content, when it has no say in the tools you choose?

How to check copyright?

Posted Oct 2, 2025 10:23 UTC (Thu) by stefanha (subscriber, #55072) [Link] (46 responses)

> How does an experienced developer, who's read lots of code, verify that their output is either not a copyright infringement, or if it is, it's under a compatible license?

They look at the license information for any reference code they are copying (e.g. code they found through a web search). LLMs do not provide license information with their output, so the contributor cannot do this.

It's a false equivalence to compare this to a developer verifying their own hand-written code. It's more like a developer coming across source code without license information somewhere. It might be handy to copy-paste that code but how can they determine the license of the mystery code?

> And it's perfectly fair to push this responsibility onto contributors; they're the ones who choose which tools they use, and if they're choosing tools where they can't be sure about the copyright status of the output, then they're the root cause of the problem. Why should Fedora take responsibility for finding a way to vet the output of all possible tools against the vast body of copyrighted content, when it has no say in the tools you choose?

Because a policy whose requirements are impossible to fulfill is at best sloppy. At worst it's "wink, wink, we know the code is not actually license compliant but we'll turn a blind eye to it".

How to check copyright?

Posted Oct 2, 2025 10:31 UTC (Thu) by farnz (subscriber, #17727) [Link] (45 responses)

But it's not Fedora's decision to use an LLM - it's the contributor's. The contributor is responsible for making sure that they've complied with all of their legal obligations. And, taking your very example: how does Fedora determine that the developer didn't just copy and paste mystery code they were sent in a private chat at work?

That's why this is a reasonable policy; the requirements are the same no matter what tools you use. It's on you, as contributor, to make sure that you're complying with the rules.

There's also no reason why you can't have an LLM whose code is known to be licensed permissively; if you only train the model on MIT-licensed code, then, by definition, if the output is a derived work of the input (which is not guaranteed to be the case), the output is MIT-licensed. Banning "AI tools" bans this tool, too.

How to check copyright?

Posted Oct 2, 2025 12:44 UTC (Thu) by io-cat (subscriber, #172381) [Link] (11 responses)

I understand your position, but the primary distinction in my view is that nowadays there is no reliable way to get licensing information from an LLM in most cases.

Using these tools then makes it extremely hard if not impossible to comply with the rules.

I strongly agree with the statement that the final responsibility is on the contributor (the person).
But the encouragement to use these tools in the original policy post is incompatible with compliance in the current landscape.

How to check copyright?

Posted Oct 2, 2025 12:47 UTC (Thu) by farnz (subscriber, #17727) [Link] (3 responses)

I see no encouragement to use these tools in the policy post.

Quite the opposite - I see it telling you that if you use them, you're responsible for everything they do, and Fedora will not accept "oh, that was the LLM" as an excuse for your failure to sort out licensing terms.

How to check copyright?

Posted Oct 2, 2025 13:10 UTC (Thu) by io-cat (subscriber, #172381) [Link] (1 responses)

Here is the direct quote in the original post that I was referring to:

> We encourage the use of AI assistants as an evolution of the contributor toolkit. However, human oversight remains critical.

https://discussion.fedoraproject.org/t/council-policy-pro...

How to check copyright?

Posted Oct 2, 2025 13:17 UTC (Thu) by farnz (subscriber, #17727) [Link]

I missed that, skipping over what I thought was "corporate boilerplate" to the bolded sentence afterwards: "The contributor is always the author and is fully accountable for their contributions."

How to check copyright?

Posted Oct 2, 2025 13:10 UTC (Thu) by jzb (editor, #7867) [Link]

"I see no encouragement to use these tools in the policy post."

The first sentence in the original policy post after the heading "AI-assisted project contributions" is "We encourage the use of AI assistants as an evolution of the contributor toolkit." (Emphasis added.)

The second draft posted in the discussion thread is more neutral and focuses on stressing that the contributor is responsible.

How to check copyright?

Posted Oct 2, 2025 13:21 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (6 responses)

> I understand your position, but the primary distinction in my view is that nowadays there is no reliable way to get licensing information from an LLM in most cases.

Well that’s a problem for LLM advocates, not for Fedora. It’s not for the community to solve problems in products pushed by wealthy corporations. The sad part of IBM buying Red Hat is that Fedora is now part of the corporate hype cycle, and every Red Hat employee is now required to state that whatever tech IBM wants a slice of is something terrific you should be enthusiastic about. Red Hat consistently outperformed the IBMs of the day because it delivered solid boring tech that solved actual problems, not over-hyped vaporware.

LLM tech has some core unsolved problems, just like cryptocurrencies (the previous hype cycle) had core unsolved problems; sad to be you if you were foolish enough to put your savings there listening to the corporate hype. The corporate hype does not care about you nor about communities.

How to check copyright?

Posted Oct 2, 2025 13:53 UTC (Thu) by io-cat (subscriber, #172381) [Link] (2 responses)

I think we are in agreement that this should not be the community problem.

Could you clarify how you perceived my comment? I’m not sure how your response, especially given the tone, follows from it :)

How to check copyright?

Posted Oct 2, 2025 14:57 UTC (Thu) by nim-nim (subscriber, #34454) [Link] (1 responses)

I *think* I reacted strongly to the first part of your post. I just hate the “It’s too hard, let’s pretend the problem does not exist” of hype advocates.

I completely agree that the “enthusiasm” and gushing about the greatness of AI has no place in a community policy. That’s pure unadulterated corporate brown-nosing. Good community policies should be dry and to the point, help contributors contribute, not feel like an advert for something else.

How to check copyright?

Posted Oct 2, 2025 15:21 UTC (Thu) by io-cat (subscriber, #172381) [Link]

I see. The intent of that part of my comment does not contradict your sentiment.

If it is too hard or impossible to guarantee that the license of LLM output is compliant with the rules - it doesn’t make sense to me to encourage or perhaps even allow usage of such tools until this is ironed out by their proponents.

I’m focusing my opinion here specifically on the licensing question, aside from other potentially problematic things.

How to check copyright?

Posted Oct 2, 2025 13:58 UTC (Thu) by zdzichu (guest, #17118) [Link]

Fortunately, Red Hat employees are a minority among Fedora contributors (around 30%). The rest of us are free to totally ignore what people at IBM are enthusiastic about.

How to check copyright?

Posted Oct 3, 2025 9:38 UTC (Fri) by nim-nim (subscriber, #34454) [Link]

Anyway, if anyone had any doubt about where the sudden urge to be enthusiastic about AI is coming from, here is a confirmation

https://blogs.gnome.org/chergert/2025/10/03/mi2-glib/

How to check copyright?

Posted Oct 3, 2025 12:56 UTC (Fri) by stefanha (subscriber, #55072) [Link]

> every Red Hat employee is now required to state whatever tech IBM wants a slice of is something terrific you should be enthusiastic about

I felt bemused reading this. Here we are in a comment thread that I, a Red Hat employee, started about issues with the policy proposal. Many of the people raising questions on the Fedora website are also Red Hat employees.

It's normal for discussions to happen in public in the community. Red Hatters can and will disagree with each other.

How to check copyright?

Posted Oct 2, 2025 13:12 UTC (Thu) by stefanha (subscriber, #55072) [Link] (32 responses)

> But it's not Fedora's decision to use an LLM - it's the contributor's. The contributor is responsible for making sure that they've complied with all of their legal obligations. And, taking your very example: how does Fedora determine that the developer didn't just copy and paste mystery code they were sent in a private chat at work?

The statement I'm making is: contributors cannot comply with their legal obligations when LLM output. I'll also add "LLM output as it exists today", because you're right that in theory LLMs could be trained in a way to provide clarity on licensing.

There is no point in a policy about contributing LLM-generated code today because no contributor can follow it while still honoring their legal obligations.

> There's also no reason why you can't have an LLM whose code is known to be licensed permissively; if you only train the model on MIT-licensed code, then, by definition, if the output is a derived work of the input (which is not guaranteed to be the case), the output is MIT-licensed. Banning "AI tools" bans this tool, too.

The MIT license requires including the copyright notice with the software, so the LLM would need to explicitly include "Copyright (c) <year> <copyright holders>" and the rest of the copyright notice for each input that is being copied. It's not enough to just add a blanket MIT license to the output because it will not contain the copyright holder information.

An LLM that can do that would be great. It could also be trained on software under other licenses because the same attribution needed for MIT would be enough to properly license other software too.

But that does not exist as far as I know. The reality today is that a contributor cannot take AI-generated output and know how it is licensed.

How to check copyright?

Posted Oct 2, 2025 13:13 UTC (Thu) by stefanha (subscriber, #55072) [Link] (12 responses)

I missed a word:
"contributors cannot comply with their legal obligations when submitting LLM output"

How to check copyright?

Posted Oct 2, 2025 13:27 UTC (Thu) by farnz (subscriber, #17727) [Link] (11 responses)

The trouble is that your argument depends on the LLM's output being a derived work of its training data; this is not necessarily true, even if you can demonstrate that the training data is present in some form inside the LLM's weights (not least because literal copying is not necessarily a derived work).

If you limit yourself to the subsets of LLM output that are not derived works (e.g. because they're covered by the equivalents of the scènes à faire doctrine in US copyright law or other parts of the idea-expression distinction), then you can comply with your legal obligations. You are forced to do the work to confirm that the LLM output you're using is not, legally speaking, a derived work, but then it's safe to use.

How to check copyright?

Posted Oct 2, 2025 14:54 UTC (Thu) by stefanha (subscriber, #55072) [Link] (2 responses)

> If you limit yourself to the subsets of LLM output that are not derived works (e.g. because they're covered by the equivalents of the scènes à faire doctrine in US copyright law or other parts of the idea-expression distinction), then you can comply with your legal obligations. You are forced to do the work to confirm that the LLM output you're using is not, legally speaking, a derived work, but then it's safe to use.

I started this thread by asking:

> But how is a contributor supposed to know whether AI-generated output is covered by copyright and under a compatible license?

And here you are saying that if you know it's not a derived work, then it's safe to use. I agree with you.

The problem is that we still have no practical way of knowing whether the LLM output is under copyright or not.

How to check copyright?

Posted Oct 2, 2025 15:18 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

There's at least two cases where "knowing whether the LLM output is under copyright or not" is completely irrelevant:
  1. You don't know how to solve the problem; you ask an LLM to explain how to solve this problem, and then you manually write the code yourself, based on the LLM's explanation. This is an existing problem - it's the same as reading a book or a paper that explains how to solve this problem - and the answer is to "assume it's covered by copyright, but write your own solution, don't just copy blindly". That applies whether "the text" is a book, a paper, or some LLM generated work.
  2. The parts of the contribution copied from the LLM's output is one that you've inspected, and confirmed would be covered by an exception to copyright law even if the work they are taken from is under copyright. In this case, the copyright status of the LLM's output is irrelevant, since the part you're using is one you can use even if it's under copyright. Again, this is a pre-existing problem; if I read (say) IEEE 1003.1-2024 (or one of the many things that's copied text from it verbatim, like this Linux man page), and copy part of it into my contribution, that's copying from a document under copyright and licensed under restrictive terms, but because it doesn't rise to the point where my copying creates a derived work, copyright status is irrelevant.

How to check copyright?

Posted Oct 3, 2025 14:53 UTC (Fri) by stefanha (subscriber, #55072) [Link]

> There's at least two cases where "knowing whether the LLM output is under copyright or not" is completely irrelevant:

I agree. I'm curious if anyone has solutions when copyright does come into play. It seems like a major use case that needs to be addressed.

How to check copyright?

Posted Oct 2, 2025 17:16 UTC (Thu) by Wol (subscriber, #4433) [Link] (7 responses)

> The trouble is that your argument depends on the LLM's output being a derived work of its training data; this is not necessarily true, even if you can demonstrate that the training data is present in some form inside the LLM's weights (not least because literal copying is not necessarily a derived work).

It also "conveniently forgets" that any developer worth their salt is exposed to a lot of code for which they do not hold the copyright, and may not even be aware of the fact that they are recalling verbatim chunks of code they memorised at Uni / another place of work / a friend showed it to them.

So all this complaining about AI-generated code could also be applied pretty much the same to developer-generated code; it's just that we don't think it's a problem if it's a developer, while some people think it is if it's an AI.

Personally, I'd be quite happy to ingest AI-generated code into my brain, and then regurgitate the gist of it (suitably modified for corporate guidelines/whatever). By the time you've managed to explain in excruciating detail to the AI what you want, it's probably better to give it a simple explanation and rewrite the result.

Okay, that end result may not be "clean room" copyright compliant, but given the propensity for developers to remember code fragments, I expect very little code is.

We have a problem with musicians suing each other for copying fragments of songs (which the "copier" was probably unaware of - which the copyright *holder* probably copied as well without being aware of it!!!), how can we keep that out of computer programming? We can't, and that's assuming AI had no hand in it!

Cheers,
Wol

How to check copyright?

Posted Oct 3, 2025 13:20 UTC (Fri) by alex (subscriber, #1355) [Link]

I went through this many moons ago when one of the start-ups I worked at was working on an emulation layer. The lawyer made a distinction between "retained knowledge" (i.e. what was in our heads) and copying verbatim from either the files or notes. I had to hand in all my notebooks when I left the company, but assuming no reference I could implement something roughly the same way I had before. There is a lot of code which isn't copyrightable because it is either the only way to do it or it's "obvious".

Patents were a separate legal rabbit hole.

How to check copyright?

Posted Oct 3, 2025 15:12 UTC (Fri) by stefanha (subscriber, #55072) [Link] (5 responses)

I am not claiming that all AI output is covered by the copyright of its training data. It seems reasonable that generated output is treated in the same way as when humans who have been exposed to copyrighted content create something.

In the original comment I linked to a paper about extracting copyrighted content from LLMs. A web search brings up a bunch more in this field that I haven't read. Here is one explicitly about generated code (https://arxiv.org/html/2408.02487v3) that says "we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations".

I think AI policies are getting ahead of themselves when they assume that a contributor can vouch for license compliance. There needs to be some kind of lawyer-approved solution to this so that the open source community is protected from a copyright mess.

How to check copyright?

Posted Oct 3, 2025 15:25 UTC (Fri) by farnz (subscriber, #17727) [Link] (4 responses)

There's a critical piece of data missing - what proportion of human-written code is strikingly similar to existing open-source implementations?

We know that humans accidentally and unknowingly infringe, too. Why can't we reuse the existing lawyer-approved solution to that problem for LLM output?

How to check copyright?

Posted Oct 3, 2025 16:47 UTC (Fri) by Wol (subscriber, #4433) [Link] (3 responses)

And another thing - how much copyright violation is being blamed on the LLM, when the query being *sent* to the LLM itself is a pretty blatant copyright violation? At which point we're seriously into "unclean hands", and if the querier is not the copyright holder, they could easily find themselves named as a co-defendant (quite likely the more culpable defendant!) even if they're not the deeper pocket.

If I had an LLM and found myself sued like that, I'd certainly want to drag the querier into it ...

Cheers,
Wol

How to check copyright?

Posted Oct 6, 2025 14:24 UTC (Mon) by stefanha (subscriber, #55072) [Link] (2 responses)

> If I had an LLM and found myself sued like that, I'd certainly want to drag the querier into it ...

Hence why contributors need a way to check copyright compliance.

How to check copyright?

Posted Oct 6, 2025 14:29 UTC (Mon) by farnz (subscriber, #17727) [Link]

TBF, you also need such a mechanism to check copyright compliance of any code you've written yourself - you are also quite capable of accidental infringement (where having seen a particular way to write code before, you copy it unintentionally), and to defend yourself or the project you contribute to, you have to prove either that you never saw the original code that you're alleged to have copied (the clean room route) or that this code is on the "idea" side of the idea-expression distinction (however that's expressed in local law).

How to check copyright?

Posted Oct 6, 2025 14:36 UTC (Mon) by pizza (subscriber, #46) [Link]

> Hence why contributors need a way to check copyright compliance.

This is a legal problem, and cannot be solved via (purely, or even mostly) technical means.

How to check copyright?

Posted Oct 2, 2025 13:18 UTC (Thu) by farnz (subscriber, #17727) [Link] (18 responses)

Right, but policy should not be written to just cover today. It also needs to cover tomorrow's evolutions of the technology, too. And that means both hypotheticals, like an LLM that could correctly attribute derived works, or a contributor who uses something that's AI-based, but is careful to make sure that the stuff they submit to Fedora is not a derived work in the copyright sense (and hence licensing is irrelevant).

And the reality today is that you can take AI-generated output, and confirm by inspection that it's not possible for it to be a derived work, and hence that licensing is irrelevant.

How to check copyright?

Posted Oct 2, 2025 15:09 UTC (Thu) by stefanha (subscriber, #55072) [Link] (17 responses)

> And the reality today is that you can take AI-generated output, and confirm by inspection that it's not possible for it to be a derived work, and hence that licensing is irrelevant.

I agree.

It is common practice to use AI to generate non-trivial output though. If the intent of the policy is to allow trivial AI-generated contributions, then it should mention this to prevent legal issues.

How to check copyright?

Posted Oct 2, 2025 15:34 UTC (Thu) by farnz (subscriber, #17727) [Link] (16 responses)

You continue to ignore the massive difference between "non-trivial" and "a derived work of the training data, protected by copyright". That's deeply critical to this conversation, and unless you can show that an AI's output is inherently a derived work, you're asking us to accept a tautology.

Fundamentally, and knowing how transformers work, I do not accept the claim that an AI's output is inherently a derived work of the training data. It is definitely (and demonstrably) possible to get an AI to output things that are derived works of the training data, with the right prompts, but it is also entirely possible to get an AI to produce output that is not derived from the training data for the purposes of copyright law.

It is also possible to get humans to produce outputs that are derived works of their training data, but that doesn't imply that all works produced by humans are derived works of their training data, for the purposes of copyright law.

How to check copyright?

Posted Oct 2, 2025 19:55 UTC (Thu) by ballombe (subscriber, #9523) [Link] (15 responses)

Clean-room reverse-engineering is recognized and codified by copyright law, but LLMs certainly do not do clean-room reverse engineering.

How to check copyright?

Posted Oct 2, 2025 20:05 UTC (Thu) by mb (subscriber, #50428) [Link] (12 responses)

Why are you so sure?

What is the fundamental difference between

a) human brains processing code into documentation and then into code
and
b) LLMs processing code into very abstract and compressed intermediate representations and then into code?

LLM models would probably contain *less* information about the original code than documentation would.

How to check copyright?

Posted Oct 2, 2025 21:07 UTC (Thu) by pizza (subscriber, #46) [Link] (4 responses)

> What is the fundamental difference between
> a) human brains
> b) LLMs processing

Legally, there's a huge distinction between the two.

And please keep in mind that "legally" is rarely satisfied with "technically" arguments.

How to check copyright?

Posted Oct 2, 2025 21:17 UTC (Thu) by mb (subscriber, #50428) [Link] (3 responses)

>Legally, there's a huge distinction between the two.

Interesting.
Can you back this up with some actual legal text or descriptions from lawyers?
I'd really be interested in learning what lawyers think the differences are.

How to check copyright?

Posted Oct 3, 2025 0:16 UTC (Fri) by pizza (subscriber, #46) [Link] (1 responses)

> Can you back this up with some actual legal text or descriptions from lawyers?

Only stuff created by a human is eligible for copyright protection.

See https://www.copyright.gov/comp3/chap300/ch300-copyrightab... section 307.

Doesn't get any simpler than that.

How to check copyright?

Posted Oct 3, 2025 7:01 UTC (Fri) by mb (subscriber, #50428) [Link]

> Only stuff created by a human is eligible for copyright protection.

That is a completely different topic, though.
This is about *re*-producing existing actually copyrighted content.

How to check copyright?

Posted Oct 3, 2025 11:16 UTC (Fri) by Wol (subscriber, #4433) [Link]

I can't point you to the law(s) themselves, but the European position - IN LAW - is that there is no difference between an AI reading and learning, and a person reading and learning.

So I guess (and this is not clear) that there is no difference between an AI regurgitating what it's learnt, and a person regurgitating what it's learnt.

So it basically comes down to the question "how close is the output to the input, and was the output obvious and not worthy of copyright protection?"

Given the tendency of AI to hallucinate, I guess the output of an AI is LESS likely to violate copyright than that of a human. Of course, the corollary becomes the output of a human is more valuable :-)

Cheers,
Wol

How to check copyright?

Posted Oct 3, 2025 17:39 UTC (Fri) by ballombe (subscriber, #9523) [Link] (6 responses)

> Why are you so sure?

Clean room reverse engineering requires that there be two separate, non-interacting teams: one having access to the original code and writing its specification, and a second team that never accesses the original code and relies only on the specification to write the new program.

Since by hypothesis the LLM had access to all the code on github, it cannot be used to write the new program.

Remember when some Windows code was leaked, WINE developers were advised not to look at it to avoid being "tainted".

How to check copyright?

Posted Oct 3, 2025 17:43 UTC (Fri) by farnz (subscriber, #17727) [Link]

It definitely can be used to write the new program; because it had access to the code on GitHub, you cannot assert lack of access as evidence of lack of copying (which is what the clean room setup is all about), but you can still assert that either the copied code falls on the idea side of the idea-expression distinction, or that it is not a derived work (in the legal, not mathematical, sense) for the purposes of copyright law for some other reason.

The point of the clean room process is that the only thing you need to look at to confirm that the second team did not copy the original code is the specification produced by the first team, which makes it tractable to confirm that the second team's output is not a derived work by virtue of no copying being possible.

But that's not the only way to avoid infringing - it's just a well-understood and low-risk way to do so.

How to check copyright?

Posted Oct 3, 2025 18:22 UTC (Fri) by mb (subscriber, #50428) [Link] (4 responses)

>two separate, non-interacting, teams

The two teams are interacting. Via documentation.
Which is IMO not that dissimilar from the network weights, which are passed from the network trainer application to the network executor application.

>Since by hypothesis the LLM had access to all the code on github

I don't agree.
The training application had access to the code.
And the executing application doesn't have access to the code.

The generated code comes out of the executing application.

How to check copyright?

Posted Oct 4, 2025 19:51 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (3 responses)

Network weights are a lossy representation of the training material, but they still contain it and are able to reproduce it if asked, as shown above (and also in the New York Times lawsuit against OpenAI).

In fact, bigger models also increase the memorization ability.

How to check copyright?

Posted Oct 4, 2025 19:54 UTC (Sat) by mb (subscriber, #50428) [Link] (2 responses)

>a lossy representation of the training material, but they still contain it and are able to reproduce it

This contradicts itself.

How to check copyright?

Posted Oct 5, 2025 10:07 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (1 responses)

The models are not able to reproduce *all of it*, but they can reproduce a lot of specific texts with varying degrees of precision. For literary works, for example, the models often remember poetry more easily than prose. You can measure precision by checking whether the model needs to be told the first few words as opposed to just the title, how often it replaces a word with a synonym, and whether it goes into the weeds after a few paragraphs or only after a few chapters. The same is true of programs.

You can also use the language models as a source of probabilities for arithmetic coding, and some texts will compress ridiculously well, so much so that the only explanation is that large parts of the text are already present in the weights in compressed form. In fact it can be mathematically proven that memorization, compression, and training are essentially the same thing.
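As a back-of-the-envelope illustration: an ideal arithmetic coder driven by the model needs roughly the sum of -log2 p(token) bits for a text, so a sketch like the following (the model scoring call is hypothetical; any API that returns one log-probability per token would do) gives the compressed size, and a memorized text shows up as an implausibly small number of bits:

    import math

    def compressed_bits(token_logprobs):
        # Ideal arithmetic-coder cost: sum of -log2(p) over the tokens,
        # given natural-log probabilities assigned by the model.
        return sum(-lp / math.log(2) for lp in token_logprobs)

    # bits = compressed_bits(model.token_logprobs(text))  # hypothetical scoring call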

Here is a paper from DeepMind on the memorization capabilities of LLMs: https://arxiv.org/pdf/2507.05578

And here is an earlier one that analyzed how memorization improves as the number of parameters grows: https://arxiv.org/pdf/2202.07646

How to check copyright?

Posted Oct 5, 2025 13:45 UTC (Sun) by kleptog (subscriber, #1183) [Link]

While the issue of memorisation is interesting, it is ultimately not really relevant to the discussion. You don't need an LLM to intentionally violate copyright. The issue is can you use an LLM to *unintentionally* violate copyright?

I think those papers actually show it is quite hard. Because even with very specific prompting, the majority of texts could not be recovered to any significant degree. So what are the chances an LLM will reproduce a literal text without special prompting?

Mathematically speaking an LLM is just a function, and for every output there exists an input that will produce something close to it. Even if it is just "Repeat X". (Well, technically I don't know if we know that LLMs have a dense output space.) What are the chances a random person will hit one of those inputs that matches some copyrighted output?

I suppose we've given the "infinite monkeys" a power-tool that makes it more likely for them to reproduce Shakespeare. Is it too likely?

Clean-room reverse engineering

Posted Oct 3, 2025 8:07 UTC (Fri) by rschroev (subscriber, #4164) [Link]

Clean-room reverse engineering is a whole different topic though, isn't it? It's what Compaq did back in the 80's to reverse engineer the IBM PC's BIOS, enabling them to make compatible machines in a legal way. Notably the people who studied IBM's BIOS and the ones who implemented Compaq's new one were different teams, to avoid any copyright issues.

That's a whole different situation than either people or LLMs reading code and later using their knowledge to write new code.

How to check copyright?

Posted Oct 3, 2025 9:40 UTC (Fri) by farnz (subscriber, #17727) [Link]

Clean-room reverse-engineering isn't part of the codified side of copyright law; rather, it's a process that the courts recognise as guaranteeing that the work produced in the clean room cannot be a derived work of the original.

To be a derived work, there must be some copying of the original, intended or accidental. The clean-room process guarantees that the people in the clean-room cannot copy the original, and therefore, if they do come up with something that appears to be a copy of the original, it's not a derived work.

You can, of course, do reverse-engineering and reimplementation without a clean-room setup; it's just that you then have to show that each piece that's alleged to be a literal copy of the original falls on the right side of the idea-expression distinction to not be a derived work, instead of being able to show that no copying took place.

Productivity

Posted Oct 2, 2025 7:00 UTC (Thu) by drago01 (subscriber, #50715) [Link] (1 responses)

Most Fedora contributions are spec files, i.e. packaging.

Those files are relatively simple, maybe not even copyrightable before reaching a certain level of complexity; automation there has always made sense.

AI would be just a tool to achieve that while handling more than just the simplest cases.

Productivity

Posted Oct 2, 2025 15:10 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

But KISS simplicity is the hardest thing to achieve. That’s exactly what an automaton is bad at; automatons are good at piling up slop till it sort of works, not at reducing things to their simplest core.

Reducing things requires opinions on how things should end up. It requires the ability to evaluate features and the willingness to cull parts that add too much complexity for too little benefit. Outside the distro world, we see houses of cards of hundreds of interlinked software components that no one knows how to evaluate for robustness, security, and maintainability. They are the direct result of automation with no human in the loop to put a stop when it gets out of hand.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds