The FSF considers large language models
The Free Software Foundation's Licensing and Compliance Lab concerns itself with many aspects of software licensing, Krzysztof Siewicz said at the beginning of his 2025 GNU Tools Cauldron session. These include supporting projects that are facing licensing challenges, collecting copyright assignments, and addressing GPL violations. In this session, though, there was really only one topic that the audience wanted to know about: the interaction between free-software licensing and large language models (LLMs).
Anybody hoping to exit the session with clear answers about the status of
LLM-created code was bound to be disappointed; the FSF, too, is trying to
figure out what this landscape looks like. The organization is currently
running a survey of
free-software projects with the intent of gathering information about
what position those projects are taking with regard to LLM-authored code.
From that information (and more), the FSF eventually hopes to come up with
guidance of its own.
Nick Clifton asked whether the FSF is working on a new version of the GNU General Public License — a GPLv4 — that takes LLM-generated code into account. No license changes are under consideration now, Siewicz answered; instead, the FSF is considering adjustments to the Free Software Definition first.
Siewicz continued that LLM-generated code is problematic from a free-software point of view because, among other reasons, the models themselves are usually non-free, as is the software used to train them. Clifton asked why the training code mattered; Siewicz said that at this point he was just highlighting the concern that some feel. There are people who want to avoid proprietary software even when it is being run by others.
Siewicz went on to say that one of the key questions is whether code that
is created by an LLM is copyrightable and, if not, if there is some way to
make it copyrightable. It was never said explicitly, but the driving issue
seems to be whether this software can be credibly put under a copyleft
license. Equally important is whether such code infringes on the rights of
others. With regard to copyrightability, the question is still open; there
are some cases working their way through the courts now. Regardless,
though, he said that it seems possible to ensure that LLM output can be
copyrighted by applying some human effort to enhance the resulting code.
The use of a "creative prompt" might also make the code copyrightable.
Many years ago, he said, photographs were not generally seen as being copyrightable. That changed over time as people figured out what could be done with that technology and the creativity it enabled. Photography may be a good analogy for LLMs, he suggested.
There is also, of course, the question of copyright infringements in
code produced by LLMs, usually in the form of training data leaking into
the model's output. Prompting an LLM for output "in the style of" some
producer may be more likely to cause that to happen. Clifton
suggested that LLM-generated code should be submitted with the prompt used
to create it so that the potential for copyright infringement can be
evaluated by others.
Siewicz said that he does not know of any model that says explicitly whether it incorporates licensed data. As some have suggested, it could be possible to train a model exclusively on permissively licensed material so that its output would have to be distributable, but even permissive licenses require the preservation of copyright notices, which LLMs do not do. A related concern is that some LLMs come with terms of service that assert copyright over the model's output; incorporating such code into a free-software project could expose that project to copyright claims.
Siewicz concluded his talk with a few suggested precautions for any project that accepts LLM-generated code, assuming that the project accepts it at all. These suggestions mostly took the form of collecting metadata about the code. Submissions should disclose which LLM was used to create them, including version information and any available information on the data that the model was trained on. The prompt used to create the code should also be provided. The LLM-generated code should be clearly marked. If there are any use restrictions on the model output, those need to be documented as well. All of this information should be recorded and saved when the code is accepted.
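One possible way to capture that metadata, sketched here purely as an illustration (the trailer names are invented for this example and are not an established convention), is to record it as Git trailers in the commit message of the submission, where it is preserved in the project's history and can later be extracted with tools such as git interpret-trailers:

    Add option parsing for the frobnicate tool

    Parts of this change were produced with the help of an LLM and
    then reviewed and edited by the submitter.

    Assisted-by: ExampleLLM 2.1 (hypothetical model name and version)
    Training-data: no details published by the model's vendor
    Prompt: "Write a C function that parses --frobnicate options ..."
    Output-terms: vendor asserts no restrictions on generated output
    Signed-off-by: A. Contributor <contributor@example.org>

Keeping the information in the commit itself means that it is recorded and saved automatically when the code is accepted, along the lines Siewicz suggested.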
A member of the audience pointed out that the line between LLMs and assistive (accessibility) technology can be blurry, and that any outright ban of the former can end up blocking developers needing assistive technology, which nobody wants to do.
There were some questions about how to distinguish LLM-generated code from human-authored code, given that some contributors may not be up-front about their model use. Clifton said that there must always be humans in the loop; they, in the end, are responsible for the code they submit. Jeff Law added that the developer's certificate of origin, under which code is submitted to many projects, includes a statement that the contributor has the right to submit the code in question. Determining whether the contributor truly holds that right is not a new concern; developers could be, for example, submitting code that is owned by their employer.
A real concern, Siewicz said, is whether contributors are sufficiently educated to know where the risks actually are.
Mark Wielaard said that developers are normally able to cite any inspirations for the code they write; an LLM is clearly inspired by other code, but is unable to make any such citations. So there is no way to really know where LLM-generated code came from. A developer would have to publish their entire session with the LLM to even begin to fill that in.
The session came to an end with, perhaps, participants feeling that they had a better understanding of where some of the concerns are, but nobody walked out convinced that they knew the answers.
A video of this session is available on YouTube.
[Thanks to the Linux Foundation, LWN's travel sponsor, for supporting my travel to this event.]
Index entries for this article
Conference: GNU Tools Cauldron/2025
Posted Oct 14, 2025 16:43 UTC (Tue) by gwolf (subscriber, #14632)
If most of my programming consisted of searching StackOverflow for answers to questions related to mine... I *could* get persuaded to link to the post in question in a comment before each included snippet. But that's also not something I've seen happen frequently. And if I didn't write the comment _the same moment_ I included said snippet, it's most likely I never will.
So... I think there is an argumentative issue in here :-)
Now can we?
Be it that I learnt programming at school or by reading books, or that I took a "BootCamp", I cannot usually say where I got a particular construct from. I could, of course, say that I write C in the K&R style — but I doubt that's what Siewicz refers to. And of course, Perl-heads will recognize a "Schwartzian transform". But in general, I learnt _how to code_, and I am not able to attribute specific constructs of my programming to specific bits of code. Just like an LLM.