No disclosure for LLM-generated patch?

Posted Jun 27, 2025 6:55 UTC (Fri) by drago01 (subscriber, #50715)
In reply to: No disclosure for LLM-generated patch? by lucaswerkmeister
Parent article: Supporting kernel development with large language models

Why is there a need for that?
People tend to overreacting, an LLM is just a tool.

No one discloses which IDE or editor has been used or whether autocomplete has been used or not etc.

In the end of the day what matters is whether the patch is correct or not, which is what reviews are for. The tools used to write it are not that relevant.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 9:11 UTC (Fri) by Funcan (subscriber, #44209) [Link] (5 responses)

Is there a need for it? Given the legal uncertainty around the copyright status of llm output (see Disney's massive lawsuit for example), I'd say 'yes', and that it might be legally similar to copying code from a proprietary licensed kernel into Linux and passing it off as your own work.

I vaguely remember some llm providers include legal wavers for copyright where they take on the liability, but I can't find one for e.g. copilot right now

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 10:51 UTC (Fri) by mb (subscriber, #50428) [Link] (3 responses)

LLMs don't usually copy, though.

If you as a human learn from proprietary code and then write Open Source with that knowledge, it's not copying unless you actually copy code sections. Same goes for LLMs. If it produces a copy, then it copied. Otherwise it didn't.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 11:47 UTC (Fri) by laarmen (subscriber, #63948) [Link] (1 responses)

This is not as simple as you make it out to be, at least in the eyes of some people. That's why you have clean-room rules such as https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Gu...

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 12:57 UTC (Fri) by mb (subscriber, #50428) [Link]

Clean-room is a *tool* to make accidental/unintentional copying less likely.

It's in no way required to avoid copyright problems.
Just don't copy and then you are safe.
Learning is not copying.

And you can also use that concept with LLMs, if you want.
Just feed the output from one LLM into the input of another LLM and you basically get the same thing as with two human clean-room teams.

No disclosure for LLM-generated patch?

Posted Jul 1, 2025 9:51 UTC (Tue) by cyphar (subscriber, #110703) [Link]

Copyright law isn't this simple. For one, it is established law that only humans can create copyrightable works and the copyright of the output of programs is based on the copyright of the input (if the input is sufficiently creative). Copyright in its current form is entirely based on "human supremacy" when it comes to capability of artistic expression, and so comparing examples where humans do something equivalent is not (in the current legal framework) actually legally equivalent. Maybe that will change in the future, but that is the current case law in the US AFAIK (and probably most other countries).

You could just as easily argue that LLMs produce something equivalent to a generative collage of all of their training data, which (given the current case law on programs and copyright) would mean that the copyright status of the training data would be transferred to the collage. You would thus need to make an argument for a fair use exemption for the output, which your example would not pass muster.

However, this is not the only issue at play here -- to submit code to Linux you need to sign the DCO, which the commit author did with their Signed-off-by line. However, none of the sections of the DCO can be applied to LLM-produced code, and so the Signed-off-by is invalid regardless of the legal questions about copyright and LLM code.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 16:57 UTC (Fri) by geofft (subscriber, #59789) [Link]

> I vaguely remember some llm providers include legal wavers for copyright where they take on the liability, but I can't find one for e.g. copilot right now

https://blogs.microsoft.com/on-the-issues/2023/09/07/copi...

"Specifically, if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, we will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters we have built into our products."

See also https://learn.microsoft.com/en-us/legal/cognitive-service... . The exact legal text seems to be the "Customer Copyright Commitment" section of https://www.microsoft.com/licensing/terms/product/ForOnli...

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 10:45 UTC (Fri) by excors (subscriber, #95769) [Link] (3 responses)

> In the end of the day what matters is whether the patch is correct or not, which is what reviews are for. The tools used to write it are not that relevant.

One problem is that reviewers typically assume the patch was submitted in good faith, and look for the kinds of errors that good-faith humans typically make (which the reviewer has learned through many years of experience, debugging their own code and other people's code).

If e.g. Jia Tan started submitting patches to your project, you wouldn't say "I know he deliberately introduced a subtle backdoor into OpenSSH and he's probably a front for a national intelligence service, but he also submitted plenty of genuinely useful patches while building up trust, so let's welcome him and just review all his patches carefully before accepting them". You'd understand that your review process is not infallible and he's going to try to sneak something past it, with malicious patches that look as non-suspicious as possible, so it's not worth the risk and you would simply ban him. Linux banned a whole university for a clumsy version of that: https://lwn.net/Articles/853717/. The source of a patch _does_ matter.

Similarly, LLMs generate code with errors that are not what a good-faith human would typically make, so they're not the kind of errors that reviewers are looking out for. A human isn't going to hallucinate a whole API and write a professional-looking well-documented patch that calls it, but an LLM will eagerly do so. In the best case, it'll waste reviewers' time as they try to figure out what the nonsense means. In the worst case there will be more subtle inhuman bugs that get missed because nobody is thinking to look for them.

At the same time, the explicit goal of generating code with LLMs is to make developers more productive at writing patches, meaning there will be more patches to review and reviewers will be under even more pressure. And in the long term there will be fewer new reviewers, because none of the junior developers who outsourced their understanding of the code to an LLM will be learning enough to take on that role. I think writing code is already the easiest and most enjoyable part of software development, so it seems like the worst part to be trying to automate away.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 13:38 UTC (Fri) by drago01 (subscriber, #50715) [Link] (1 responses)

Well one of the reviewers would be the patch submitter, who used the LLM to save time.

If you don't trust that person and don't review his submission in detail the problem is unrelated to whether a LLM has been used or not.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 15:10 UTC (Fri) by SLi (subscriber, #53131) [Link]

Exactly. The submitter takes the responsibility for the sanity of the submissions. If it is bad, that's on them. It doesn't matter that they found it convenient to edit the scheluder code in MS Word.

No disclosure for LLM-generated patch?

Posted Jun 27, 2025 14:02 UTC (Fri) by martinfick (subscriber, #4455) [Link]

If these junior programmers are reviewing the output of the LLM before submitting the code for inclusion, doesn't that mean this would be helping get more coders to become reviewers sooner? Perhaps this will actually help resolve the reviewer crisis by making reading code finally a more common skill then writing it?

No disclosure for LLM-generated patch?

Posted Jun 29, 2025 15:30 UTC (Sun) by nevets (subscriber, #11875) [Link] (2 responses)

As the maintainer that took that patch from Sasha, this is the first I've learned that it was completely created by an LLM. But it definitely looked like it was created by some kind of script. When I first saw the patch, I figured it was found and or created by some new static analyzer (like Coccinelle). I guess I wasn't wrong.

I was going to even ask Sasha if this came from some new tool. I think I should have. And yes, it would have been nice if Sasha mentioned that it was completely created by an LLM because I would have taken a deeper look at it. It appears (from comments below) that it does indeed have a slight bug. Which I would have caught if I had know this was 100% generated, as I would not have trusted the patch as much as I did thinking Sasha did the work.

No disclosure for LLM-generated patch?

Posted Jun 29, 2025 23:31 UTC (Sun) by sashal (✭ supporter ✭, #81842) [Link] (1 responses)

As the author of hashtable.h, and the person who sent the patch to Steve, I missed it too.

For that matter, the reason I felt comfortable sending this patch out is because I know hashtable.h.

Maybe we should have a tag for tool generated patches, but OTOH we had checkpatch.pl and Coccinelle generated patches for over a decade, so why start now?

Is it an issue with the patch? Sure.

Am I surprised that LWN comments are bikeshedding over a lost __read_mostly? Not really...

No disclosure for LLM-generated patch?

Posted Jun 30, 2025 2:26 UTC (Mon) by nevets (subscriber, #11875) [Link]

For both checkpatch.pl and Coccinelle generated patches, there isn't a tag, but pretty much every case (or it should be every case) it is stated in the change log that the patch was created by a script. You left that out. I've submitted a few Coccinelle generated patches, and every time I not only state that it was produced by Coccinelle, I add to the change log the script that I fed Coccinelle to produce the patch. This allows others to know if I did the script correctly or not.

The missing "__read_mostly" is a red herring. The real issue is transparency. We should not be submitting AI generated patches without explicitly stating how it was generated. As I mentioned. If I had known it was 100% a script, I may have been a bit more critical over the patch. I shouldn't be finding this out by reading LWN articles.