LWN: Comments on "Supporting kernel development with large language models" https://lwn.net/Articles/1026558/ This is a special feed containing comments posted to the individual LWN article titled "Supporting kernel development with large language models". en-us Wed, 17 Sep 2025 19:32:41 +0000 Wed, 17 Sep 2025 19:32:41 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Assumptions! https://lwn.net/Articles/1029973/ https://lwn.net/Articles/1029973/ kaesaecracker <div class="FormattedComment"> I guess that's fine for backports and classifying commits, as the original change is applied to the old version? Even in the worst case, you could declare those older versions dead and continue on without the shiny new tools.<br> <p> Code completely generated by the LLM I would also be cautious about though. Once that code is in the kernel, any follow up changes (or even developers?) may be tainted depending on what courts decide.<br> </div> Tue, 15 Jul 2025 19:36:38 +0000 An LLM is... https://lwn.net/Articles/1028317/ https://lwn.net/Articles/1028317/ dskoll <p>Humans can have other redeeming attributes not shared by LLMs. We also don't need to make a new type of human when we can barely figure out how to deal with the old type. Thu, 03 Jul 2025 01:57:54 +0000 An LLM is... https://lwn.net/Articles/1028316/ https://lwn.net/Articles/1028316/ mbligh <div class="FormattedComment"> So are humans.<br> </div> Thu, 03 Jul 2025 01:55:36 +0000 Human authorship? https://lwn.net/Articles/1028276/ https://lwn.net/Articles/1028276/ Cyberax <div class="FormattedComment"> But _where_ does the copyright start? A prompt may be copyrightable if it's specific enough. And edits made by humans to the generated code are likely to be copyrightable.<br> <p> But suppose we have this case, you build a web service to track sleep times using an LLM. And then I build a service to track the blood sugar data using an LLM.<br> <p> The source code for them ends up 95% identical, just because there are so many ways to generate a simple CRUD app and we both used the same LLM version. And if you had looked at these two code bases 15 years ago, it would have been a clear-cut case of copyright infringement.<br> <p> But clearly, this can't be the case anymore. Mere similarity of the code can't be used as an argument when LLMs are at play.<br> </div> Wed, 02 Jul 2025 18:55:36 +0000 Human authorship? https://lwn.net/Articles/1028173/ https://lwn.net/Articles/1028173/ jani <div class="FormattedComment"> <span class="QuotedText">&gt; If the prompt is copyrightable, then the output is too.</span><br> <p> It's just not that clear cut: <a href="https://www.skadden.com/insights/publications/2025/02/copyright-office-publishes-report">https://www.skadden.com/insights/publications/2025/02/cop...</a><br> <p> </div> Wed, 02 Jul 2025 14:24:21 +0000 Human authorship? https://lwn.net/Articles/1028164/ https://lwn.net/Articles/1028164/ kleptog <div class="FormattedComment"> <span class="QuotedText">&gt; It's likely that only non-trivial _prompts_ will be copyrighted.</span><br> <p> If the prompt is copyrightable, then the output is too. An LLM is just a tool. Photos don't lose copyright by feeding them through a tool, so why would an LLM be any different? You'd have to somehow argue that an LLM is somehow a fundamentally different kind of tool than anything else you use to process text, which I don't think is a supportable idea.<br> </div> Wed, 02 Jul 2025 13:43:06 +0000 No disclosure for LLM-generated patch? 
https://lwn.net/Articles/1028069/ https://lwn.net/Articles/1028069/ cyphar <div class="FormattedComment"> Another interesting question is whether the Signed-off-by line can be considered valid in this case, since I don't think any reasonable reading of the DCO would apply to LLM-produced code.<br> <p> You didn't write it yourself, you don't know in what manner it was based on previous works (or their licenses), and even if you argue that the LLM directly gave you the code (despite not being a person), the LLM cannot certify the licensing status of the code (and would probably refuse to do so, even if asked).<br> </div> Wed, 02 Jul 2025 04:31:26 +0000 Human authorship? https://lwn.net/Articles/1028044/ https://lwn.net/Articles/1028044/ Cyberax <div class="FormattedComment"> It's likely that only non-trivial _prompts_ will be copyrighted.<br> <p> I can see that in future source code repos will have the LLM-generated source code along with the list of prompts that resulted in it. And then lawyers will argue where exactly the copyright protection is going to stop. E.g. if a prompt "a website with the list of issues extracted from Bugzilla" is creative enough or if it's just a statement of requirements.<br> </div> Tue, 01 Jul 2025 18:37:21 +0000 Human authorship? https://lwn.net/Articles/1028039/ https://lwn.net/Articles/1028039/ excors <div class="FormattedComment"> <span class="QuotedText">&gt; AIUI, with AI there is no unique deterministic relationship between the prompt you're entering and what the AI produces. If you feed your AI image generator (or what have you) the same prompt twice in a row you will probably get different results, so the output does not depend solely on the human effort on the input side.</span><br> <p> It generally isn't non-deterministic, but it is pseudorandom. The input includes the prompt plus a PRNG seed; if you use the same prompt and seed, and sufficiently similar software and hardware, then you should get exactly the same output. (Some UIs might not let you set the seed, but that's just a UI limitation.)<br> <p> With image generators there's also some coherence across prompts: if you use the same seed with two similar but different prompts, you'll probably get similar but different outputs. So you can generate a bunch of images with one prompt across many arbitrary seeds, pick your favourite, then reuse the seed and tweak the prompt to get a variation on that image; it's not completely chaotic and unpredictable. (They work by starting with an image of pure noise, then progressively denoising it guided by the prompt. Using the same seed means starting with the same noise, and it seems a lot of the output's high-level structure is determined by that noise and is preserved through all the denoising.)<br> </div> Tue, 01 Jul 2025 17:49:11 +0000 Human authorship? https://lwn.net/Articles/1028033/ https://lwn.net/Articles/1028033/ paulj <div class="FormattedComment"> It's going to be fascinating. There are corporate interests on both sides of this fence - in a number of cases, the /same/ corporate. ;) Different jurisdictions may well come to different answers too.<br> </div> Tue, 01 Jul 2025 16:18:23 +0000 Human authorship? https://lwn.net/Articles/1028001/ https://lwn.net/Articles/1028001/ kleptog <div class="FormattedComment"> At least in NL law, the copyright belongs to the person that made the creative choices. In other words, no AI tool can ever produce anything copyrightable by itself. 
The user who made the creative choices that led to the output has the copyright. This is the same principle that prevents Microsoft from claiming copyright over your Word documents, or a compiler writer claiming copyright over your binaries.<br> <p> If those companies somewhere in the chain include a human who is deciding which AI output is acceptable and which isn't, then that would be copyrightable. Even if they were just writing a program that did the evaluation for them. Although I expect the actual protection to be somewhat commensurate with the amount of effort. And if you're talking to a chatbot, the output is copyright of the person typing.<br> <p> This is Civil Law, so it is set by statute and no court case can change that. At best the courts can prod the legislature to tell them the law might need updating, but that's it. The US, being Common Law, however, is likely to attract a lot of litigation, unless the legislature explicitly steps in to fix it.<br> </div> Tue, 01 Jul 2025 16:00:41 +0000 Human authorship? https://lwn.net/Articles/1027886/ https://lwn.net/Articles/1027886/ jani <div class="FormattedComment"> I think at this point the only thing we can say with any certainty is that we're going to see plenty of litigation and lobbying to protect corporate interests, both for LLM input *and* output.<br> <p> </div> Tue, 01 Jul 2025 11:14:49 +0000 Human authorship? https://lwn.net/Articles/1027879/ https://lwn.net/Articles/1027879/ anselm <blockquote><em>If the output of photography is copyrightable, even with minimal creative input, while the output of an AI is not even with massive creative input, then something is pretty blatantly wrong.</em></blockquote> <p> At least here in Germany, copyright law differentiates between run-of-the-mill snapshot photography and photography as art, based on a “threshold of originality”. The latter gets full copyright protection and the former doesn't – there is still some protection but, for example, it doesn't last as long as it would otherwise. </p> <p> IMHO, the analogy with photography falls flat because a point-and-shoot camera <em>will</em>, to all intents and purposes, produce reproducible results given the same framing, lighting conditions, etc. (which is generally considered a feature if you're a photographer). AIUI, with AI there is no unique deterministic relationship between the prompt you're entering and what the AI produces. If you feed your AI image generator (or what have you) the same prompt twice in a row you will probably get different results, so the output does not depend solely on the human effort on the input side. To my understanding this would conflict, e.g., with German copyright law, which stipulates – at least for now – that copyright protection is available exclusively for “personal intellectual creations” by human beings. </p> Tue, 01 Jul 2025 10:46:57 +0000 Human authorship? https://lwn.net/Articles/1027881/ https://lwn.net/Articles/1027881/ paulj <div class="FormattedComment"> Does this mean that companies moving to AI-coding will be producing code they hold no copyright over (because being machine produced, there simply is no [new] copyright created in the output)? If someone produces an app / a piece of software that is 100% AI generated, will others be able to redistribute as they wish?<br> </div> Tue, 01 Jul 2025 10:30:19 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027873/ https://lwn.net/Articles/1027873/ cyphar <div class="FormattedComment"> Copyright law isn't this simple.
For one, it is established law that only humans can create copyrightable works and the copyright of the output of programs is based on the copyright of the input (if the input is sufficiently creative). Copyright in its current form is entirely based on "human supremacy" when it comes to capability of artistic expression, and so comparing examples where humans do something equivalent is not (in the current legal framework) actually legally equivalent. Maybe that will change in the future, but that is the current case law in the US AFAIK (and probably most other countries).<br> <p> You could just as easily argue that LLMs produce something equivalent to a generative collage of all of their training data, which (given the current case law on programs and copyright) would mean that the copyright status of the training data would be transferred to the collage. You would thus need to make an argument for a fair use exemption for the output, and your example would not pass muster.<br> <p> However, this is not the only issue at play here -- to submit code to Linux you need to sign the DCO, which the commit author did with their Signed-off-by line. However, none of the sections of the DCO can be applied to LLM-produced code, and so the Signed-off-by is invalid regardless of the legal questions about copyright and LLM code.<br> </div> Tue, 01 Jul 2025 09:51:32 +0000 What about trust? https://lwn.net/Articles/1027830/ https://lwn.net/Articles/1027830/ koh <div class="FormattedComment"> There was no mention of "database" in the post you replied to. You invented that notion. As with nearly all your comments, they are wrong in a subtle but quite fundamental way. And yes, LLMs do copy, too, see e.g. Q_rsqrt. It's called overfitting in the corresponding terminology.<br> </div> Mon, 30 Jun 2025 22:40:02 +0000 Human authorship? https://lwn.net/Articles/1027809/ https://lwn.net/Articles/1027809/ Wol <div class="FormattedComment"> The other problem, even with photography, is that the amount of creative input varies massively. And the further back in time you go, the more effort it took to even "point and shoot". Today, you point a smartphone at a bunch of friends, and the AI will take a good shot for you.<br> <p> Go back to the start of common digital photography (maybe Y2K, with your typical camera being measured in K-pixels, and a 1.3MP camera being very hi-def), and you still needed a fair bit of nous, zooming, framing, to try and give you a decent chance of post-processing (if you wanted to).<br> <p> A bit earlier (early 90s) I had a mid-range SLR (F90), with a good zoom and flash-gun. It still took a fair bit of creative input in selecting the shot, but the camera took good care of making sure the shot itself was good.<br> <p> Go back to the 70s/80s, and the Zenith was a popular camera that basically had no smarts whatsoever. In order to take a shot as good as the F90, you had to choose the f-stop and shutter speed, balance the flash against daylight, basically doing a lot of work.<br> <p> Go further back to medium format, and you needed a tripod ...<br> <p> With *both* AI and photography, the more you put in, the more you get out (well, that's true of life :-)<br> <p> And (iirc the parent context correctly) if someone says "the output of an AI is not copyrightable", what are they going to say when someone argues "the *input* *is* copyrightable, and the output is *clearly* *derivative*".
If the output of photography is copyrightable, even with minimal creative input, while the output of an AI is not even with massive creative input, then something is pretty blatantly wrong.<br> <p> Cheers,<br> Wol<br> </div> Mon, 30 Jun 2025 20:05:53 +0000 Human authorship? https://lwn.net/Articles/1027798/ https://lwn.net/Articles/1027798/ NYKevin <div class="FormattedComment"> <span class="QuotedText">&gt; I'm just a casual photographer who's used point-and-shoots pretty much my entire life, but I think any casual photographer still frames the photo to get the subject they want, adjusts the angle, and tries to ensure the subjects are lit. Sure, it's not a particularly rich or complex creative choice, but we/they still make them and it's definitely intentional.</span><br> <p> Replace "casual photographer" with "LLM user" and replace "frames the photo, adjusts the angle, etc." with "writes the prompt, tries multiple iterations, asks the LLM to refine it, etc." and maybe that will make the problem more apparent. This is even more of an issue for image generators, where we have to consider a more elaborate family of techniques such as inpainting, textual inversion, and ControlNet-like inputs. The amount of creative control a StableDiffusion user can exert over the output, when using all of those tools, is probably greater than the amount of creative control the average point-and-shoot photographer can exert without professional equipment.<br> <p> <span class="QuotedText">&gt; Even if it's dubious in some cases right now, what do you think the appropriate input bar ought to be for photos to receive copyright protection? But that said, I do get what you mean that the minimal input on point-and-shoot photos seems comparable to the minimal input given to an LLM AI to produce something and hard to reconcile why one deserves protection but the other doesn't.</span><br> <p> Exactly. I don't believe the Copyright Office's position is logically coherent, and I think sooner or later some judge is going to point that out. As for where the bar ought to be, I have no idea, but it seems a bit late to change the rules for photos now, so probably it ends up being fairly low. I'm not sure we're going to get copyright protection for simple one-off "type a prompt and use the first output" cases, but anything more elaborate than that remains a grey area at best.<br> </div> Mon, 30 Jun 2025 19:01:15 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027572/ https://lwn.net/Articles/1027572/ nevets <div class="FormattedComment"> For both checkpatch.pl and Coccinelle generated patches, there isn't a tag, but in pretty much every case (or it should be every case) it is stated in the change log that the patch was created by a script. You left that out. I've submitted a few Coccinelle generated patches, and every time I not only state that it was produced by Coccinelle, I also add to the change log the script that I fed to Coccinelle to produce the patch. This allows others to know if I did the script correctly or not.<br> <p> The missing "__read_mostly" is a red herring. The real issue is transparency. We should not be submitting AI generated patches without explicitly stating how they were generated. As I mentioned, if I had known it was 100% a script, I may have been a bit more critical of the patch. I shouldn't be finding this out by reading LWN articles.<br> </div> Mon, 30 Jun 2025 02:26:22 +0000 Why was he not laughed out?
https://lwn.net/Articles/1027565/ https://lwn.net/Articles/1027565/ mirabilos <div class="FormattedComment"> And, even more importantly, why was his ability to submit patches not revoked for submitting AI slop under his own name?<br> <p> This is a DCO breach at the *very* least, plus outright lying and a violation of community standards.<br> </div> Sun, 29 Jun 2025 23:44:31 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027563/ https://lwn.net/Articles/1027563/ sashal <div class="FormattedComment"> As the author of hashtable.h, and the person who sent the patch to Steve, I missed it too.<br> <p> For that matter, the reason I felt comfortable sending this patch out is because I know hashtable.h.<br> <p> Maybe we should have a tag for tool-generated patches, but OTOH we've had checkpatch.pl and Coccinelle generated patches for over a decade, so why start now?<br> <p> Is it an issue with the patch? Sure.<br> <p> Am I surprised that LWN comments are bikeshedding over a lost __read_mostly? Not really...<br> </div> Sun, 29 Jun 2025 23:31:40 +0000 ReviewGPT https://lwn.net/Articles/1027531/ https://lwn.net/Articles/1027531/ nevets <div class="FormattedComment"> As the maintainer who pulled Sasha's patch, I missed the removal of the "__read_mostly" because I thought Sasha had written it, and that's not something that he would have lightly removed without mentioning it.<br> <p> I first thought that was even a bug, as the hash is set up at boot or at module load or unload and doesn't get changed other than that. But it likely could be removed because it's only for the trace output and that's not performance critical. But it should have been a separate patch.<br> </div> Sun, 29 Jun 2025 15:47:12 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027530/ https://lwn.net/Articles/1027530/ nevets <div class="FormattedComment"> As the maintainer who took that patch from Sasha, this is the first I've learned that it was completely created by an LLM. But it definitely looked like it was created by some kind of script. When I first saw the patch, I figured it was found and/or created by some new static analyzer (like Coccinelle). I guess I wasn't wrong.<br> <p> I was going to even ask Sasha if this came from some new tool. I think I should have. And yes, it would have been nice if Sasha had mentioned that it was completely created by an LLM, because I would have taken a deeper look at it. It appears (from comments below) that it does indeed have a slight bug, which I would have caught if I had known this was 100% generated, as I would not have trusted the patch as much as I did thinking Sasha did the work.<br> </div> Sun, 29 Jun 2025 15:30:18 +0000 An LLM is... https://lwn.net/Articles/1027519/ https://lwn.net/Articles/1027519/ geert <div class="FormattedComment"> So it looks a lot like comparing driver code in a vendor tree with driver code that has ended up in the Linux kernel?<br> <p> I remember the (buggy) WiFi driver in my first Android tablet: it was three (IIRC) times as large as the driver that ended up upstream.<br> <p> </div> Sun, 29 Jun 2025 13:01:45 +0000 Human authorship? https://lwn.net/Articles/1027500/ https://lwn.net/Articles/1027500/ interalia <div class="FormattedComment"> I'm just a casual photographer who's used point-and-shoots pretty much my entire life, but I think any casual photographer still frames the photo to get the subject they want, adjusts the angle, and tries to ensure the subjects are lit.
Sure, it's not a particularly rich or complex creative choice, but we/they still make them and it's definitely intentional.<br> <p> Even if it's dubious in some cases right now, what do you think the appropriate input bar ought to be for photos to receive copyright protection? But that said, I do get what you mean that the minimal input on point-and-shoot photos seems comparable to the minimal input given to an LLM AI to produce something and hard to reconcile why one deserves protection but the other doesn't.<br> </div> Sun, 29 Jun 2025 08:41:13 +0000 Human authorship? https://lwn.net/Articles/1027380/ https://lwn.net/Articles/1027380/ NYKevin <div class="FormattedComment"> A wrinkle here is the fact that photos in general receive copyright protection, despite the fact that the average photographer is just pointing and shooting a phone at something that looks neat, and the device is doing all the work. The usual justification is that the photographer exercises discretion over lighting, framing, viewing angle, etc., but the average person is hardly doing any of that, at least not intentionally.<br> <p> The Copyright Office has said that anything produced by an AI is ineligible for copyright protection, except to the extent it incorporates a human's work (e.g. because somebody photoshopped it after the fact). I'm... not entirely convinced they're right about that. Or at least, I'm not convinced it can be squared with the existing law on photos (and many similar kinds of less-creative works).<br> </div> Sat, 28 Jun 2025 02:44:35 +0000 An LLM doesn't *need* vast power. https://lwn.net/Articles/1027378/ https://lwn.net/Articles/1027378/ gmatht <div class="FormattedComment"> I have used TabbyML happily on an integrated GPU. Obviously there is a trade off with the size of the model, but it is quite possible to use LLMs locally without significantly impairing your battery life.<br> </div> Sat, 28 Jun 2025 01:51:29 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027324/ https://lwn.net/Articles/1027324/ geofft <div class="FormattedComment"> <span class="QuotedText">&gt; I vaguely remember some llm providers include legal wavers for copyright where they take on the liability, but I can't find one for e.g. copilot right now</span><br> <p> <a href="https://blogs.microsoft.com/on-the-issues/2023/09/07/copilot-copyright-commitment-ai-legal-concerns/">https://blogs.microsoft.com/on-the-issues/2023/09/07/copi...</a><br> <p> "Specifically, if a third party sues a commercial customer for copyright infringement for using Microsoft’s Copilots or the output they generate, we will defend the customer and pay the amount of any adverse judgments or settlements that result from the lawsuit, as long as the customer used the guardrails and content filters we have built into our products."<br> <p> See also <a href="https://learn.microsoft.com/en-us/legal/cognitive-services/openai/customer-copyright-commitment">https://learn.microsoft.com/en-us/legal/cognitive-service...</a> . 
The exact legal text seems to be the "Customer Copyright Commitment" section of <a href="https://www.microsoft.com/licensing/terms/product/ForOnlineServices/all">https://www.microsoft.com/licensing/terms/product/ForOnli...</a><br> </div> Fri, 27 Jun 2025 16:57:23 +0000 On (not) replacing humans https://lwn.net/Articles/1027294/ https://lwn.net/Articles/1027294/ SLi <div class="FormattedComment"> <span class="QuotedText">&gt; Levin does not believe that LLMs will replace humans in tasks like kernel development.</span><br> <p> I think this is an interesting claim; not because I think it's right or wrong, but because it's a nontrivial prediction, and a quite common one, with no explicitly expressed rationale.<br> <p> Now there's enough leeway in the language to make it hard to agree on what it might mean. For one, what does "replacing humans" mean? Did compilers replace programmers? After all, people don't need to write programs (i.e. assembly) any more; instead they write these fancy higher level descriptions of the program, and the compiler intelligently does the programming. At least that's how I think it would have been viewed at the time. Or did the emergence of compilers lead to there being fewer programmers?<br> <p> But also there are probably unstated assumptions and predictions of the capabilities of LLMs behind this kind of stance. Either extrapolation from the trajectory of LLMs so far, or even deeper stances like "LLMs are fundamentally incapable of creating anything new", or beliefs—justified or not—that humans are necessarily better than any algorithm that can exist.<br> <p> I don't mean to imply that these stances are ridiculous. They may well turn out to be true. And in the other direction, things that don't seem realistic sometimes work out.<br> <p> I did some LLM stuff before GPT-3 was a thing. It seemed bewildering. I think it's safe to say that nobody in the field predicted the generalizing capabilities of language models.<br> <p> For a long time, machine learning models meant training a model on data that was pretty similar to what you want to predict on (like creditworthiness). That was not too mind-bending.<br> <p> Then people started to talk about _few-shot learning_, meaning that you'd maybe only need to give a model five examples of what you want to do, and it would understand enough and be generalist enough to handle that. That sounded like sci-fi.<br> <p> Then there was one-shot learning. Surely *that* could not work.<br> <p> And the biggest surprise? Zero-shot learning. You just tell the model what you want it to do. I really don't think even people who researched LLMs predicted that before they started to see curious patterns in slightly less capable (by today's standards) language models.<br> <p> Now, a few years later? GPT-3 feels like something that has surely been around since maybe the 90s. It doesn't feel magical anymore, and the mistakes it made are ridiculous compared to today's models.<br> </div> Fri, 27 Jun 2025 15:30:20 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027293/ https://lwn.net/Articles/1027293/ SLi <div class="FormattedComment"> Exactly. The submitter takes responsibility for the sanity of the submissions. If it is bad, that's on them. It doesn't matter that they found it convenient to edit the scheduler code in MS Word.<br> </div> Fri, 27 Jun 2025 15:10:29 +0000 No disclosure for LLM-generated patch?
https://lwn.net/Articles/1027270/ https://lwn.net/Articles/1027270/ martinfick <div class="FormattedComment"> If these junior programmers are reviewing the output of the LLM before submitting the code for inclusion, doesn't that mean this would be helping get more coders to become reviewers sooner? Perhaps this will actually help resolve the reviewer crisis by making reading code finally a more common skill than writing it?<br> </div> Fri, 27 Jun 2025 14:02:03 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027252/ https://lwn.net/Articles/1027252/ drago01 <div class="FormattedComment"> Well, one of the reviewers would be the patch submitter, who used the LLM to save time.<br> <p> If you don't trust that person and don't review his submission in detail, the problem is unrelated to whether an LLM has been used or not.<br> </div> Fri, 27 Jun 2025 13:38:09 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027217/ https://lwn.net/Articles/1027217/ mb <div class="FormattedComment"> Clean-room is a *tool* to make accidental/unintentional copying less likely.<br> <p> It's in no way required to avoid copyright problems.<br> Just don't copy and then you are safe.<br> Learning is not copying.<br> <p> And you can also use that concept with LLMs, if you want.<br> Just feed the output from one LLM into the input of another LLM and you basically get the same thing as with two human clean-room teams.<br> </div> Fri, 27 Jun 2025 12:57:15 +0000 Human authorship? https://lwn.net/Articles/1027213/ https://lwn.net/Articles/1027213/ jani <div class="FormattedComment"> I keep wondering about copyright on works that lack human authorship. I'm not talking about what went into the sausage machine, but rather what comes out.<br> <p> Famously, the owner of the camera didn't get the copyright on a selfie that a monkey took: <a href="https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dispute">https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dis...</a><br> <p> Obviously derivative works of copyleft code need to remain copyleft, regardless of who, or rather what, produces the modifications. But what about all new work? If an LLM produces something, how can anyone claim copyright on the result? Even if the LLM prompt is a work of art, how can the end result be? So how would you protect your vibe coding results?<br> <p> <p> </div> Fri, 27 Jun 2025 12:34:45 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027207/ https://lwn.net/Articles/1027207/ laarmen <div class="FormattedComment"> This is not as simple as you make it out to be, at least in the eyes of some people. That's why you have clean-room rules such as <a href="https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Guidelines">https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Gu...</a> <br> </div> Fri, 27 Jun 2025 11:47:28 +0000 What about trust? https://lwn.net/Articles/1027204/ https://lwn.net/Articles/1027204/ bluca <div class="FormattedComment"> As I said, it's not copying from a database, which was implied by the post I answered to<br> </div> Fri, 27 Jun 2025 11:34:32 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027197/ https://lwn.net/Articles/1027197/ mb <div class="FormattedComment"> LLMs don't usually copy, though.<br> <p> If you as a human learn from proprietary code and then write Open Source with that knowledge, it's not copying unless you actually copy code sections. Same goes for LLMs. If it produces a copy, then it copied.
Otherwise it didn't.<br> <p> </div> Fri, 27 Jun 2025 10:51:17 +0000 What about trust? https://lwn.net/Articles/1027199/ https://lwn.net/Articles/1027199/ paulj <div class="FormattedComment"> Oh, and at inference time you are combining that "trained" data-base with your own context window (a fraction the size of the data-set), to have the LLM extend that context window to produce something that 'follows' from the two.<br> </div> Fri, 27 Jun 2025 10:49:01 +0000 What about trust? https://lwn.net/Articles/1027196/ https://lwn.net/Articles/1027196/ paulj <div class="FormattedComment"> An LLM is literally a database of fine-grained features found in a data-set. A quite sophisticated one that captures the relationships between features and their likelihood, over varying and overlapping spans of features. <br> <p> The LLM weights are a function of and a transformation of the data that was fed in. No more, no less.<br> </div> Fri, 27 Jun 2025 10:46:58 +0000 No disclosure for LLM-generated patch? https://lwn.net/Articles/1027195/ https://lwn.net/Articles/1027195/ excors <div class="FormattedComment"> <span class="QuotedText">&gt; In the end of the day what matters is whether the patch is correct or not, which is what reviews are for. The tools used to write it are not that relevant.</span><br> <p> One problem is that reviewers typically assume the patch was submitted in good faith, and look for the kinds of errors that good-faith humans typically make (which the reviewer has learned through many years of experience, debugging their own code and other people's code).<br> <p> If e.g. Jia Tan started submitting patches to your project, you wouldn't say "I know he deliberately introduced a subtle backdoor into OpenSSH and he's probably a front for a national intelligence service, but he also submitted plenty of genuinely useful patches while building up trust, so let's welcome him and just review all his patches carefully before accepting them". You'd understand that your review process is not infallible and he's going to try to sneak something past it, with malicious patches that look as non-suspicious as possible, so it's not worth the risk and you would simply ban him. Linux banned a whole university for a clumsy version of that: <a href="https://lwn.net/Articles/853717/">https://lwn.net/Articles/853717/</a>. The source of a patch _does_ matter.<br> <p> Similarly, LLMs generate code with errors that are not what a good-faith human would typically make, so they're not the kind of errors that reviewers are looking out for. A human isn't going to hallucinate a whole API and write a professional-looking well-documented patch that calls it, but an LLM will eagerly do so. In the best case, it'll waste reviewers' time as they try to figure out what the nonsense means. In the worst case there will be more subtle inhuman bugs that get missed because nobody is thinking to look for them.<br> <p> At the same time, the explicit goal of generating code with LLMs is to make developers more productive at writing patches, meaning there will be more patches to review and reviewers will be under even more pressure. And in the long term there will be fewer new reviewers, because none of the junior developers who outsourced their understanding of the code to an LLM will be learning enough to take on that role. I think writing code is already the easiest and most enjoyable part of software development, so it seems like the worst part to be trying to automate away.<br> </div> Fri, 27 Jun 2025 10:45:31 +0000