
Toward a policy for machine-learning tools in kernel development

By Jonathan Corbet
December 11, 2025

Maintainers Summit
The first topic of discussion at the 2025 Maintainers Summit has been in the air for a while: what role — if any — should machine-learning-based tools have in the kernel development process? While there has been a fair amount of controversy around these tools, and concerns remain, it seems that the kernel community, or at least its high-level maintainership, is comfortable with these tools becoming a significant part of the development process.

Sasha Levin began the discussion by pointing to a summary he had sent to the mailing lists a few days before. There is some consensus, he said, that human accountability for patches is critical, and that use of a large language model in the creation of a patch does not change that. Purely machine-generated patches, without human involvement, are not welcome. Maintainers must retain the authority to accept or reject machine-generated contributions as they see fit. And, he said, there is agreement that the use of tools should be disclosed in some manner.

Just tools?

But, he asked the group: is there agreement in general that these tools are, in the end, just more tools? Steve Rostedt said that LLM-generated code may bring legal concerns that other tools do not raise, but Greg Kroah-Hartman answered that the current developer's certificate of origin ("Signed-off-by") process should cover the legal side of things. Rostedt agreed that the submitter is ultimately on the hook for the code they contribute, but he wondered about the possibility of some court ruling that a given model violates copyright years after the kernel had accepted code it generated. That would create the need for a significant cleanup effort.

Ted Ts'o said that people worry about the copyright problems, but those same problems exist even in the absence of these tools. Developers could, for example, submit patches without going through the processes required by their employer — patches which, as a result, they have no right to submit. We do not worry about that problem now, he said, and it has almost never actually come up. Jiri Kosina said that these tools make code creation easy enough that the problem could become larger over time. Dave Airlie asked whether it makes sense to keep track of which models people are using. But, he said, any copyrighted code put into a patch by an LLM is likely to have come from the kernel itself.

Levin mentioned that there had been some ethical concerns raised about LLM use and its effects on the rest of the world. Arnd Bergmann said that it could make sense to distinguish between which types of models are in use. Running one's own model locally is different from using a third party's tool.

Linus Torvalds jumped in to say that he thought the conversation was overly focused on the use of LLMs to write code, but there has not, yet, been much of that happening for the kernel. So any problems around LLM-written code are purely hypothetical. But these tools are being used for other purposes, including identifying CVE candidates and stable-backport candidates, and for patch review. Andrew Morton, Torvalds said, had recently shown an example of a machine-reviewed patch that was "stunning"; it found all of the complaints that Torvalds had raised with the patch in question, and a few more as well.

Alexei Starovoitov said that, within Meta, automated tools have been producing good reviews about 60% of the time, with another 20% having some good points. Less than 20% of the review comments have been false positives. Jens Axboe added that he has been testing with older patches and seeing similar results. He passed one five-line patch with a known problem to three human reviewers, and none of them found the bug. But the automated tool did find the problem (a reversed condition in a test); "AI always catches that stuff".
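The bug Axboe describes is the sort of inverted error check that human reviewers routinely glance past. A minimal sketch of the pattern (the function names and the specific check are hypothetical, not taken from the actual patch):

```c
#include <string.h>

/* Hypothetical helper: returns 0 on success, a negative value on
 * failure, following the usual kernel convention. */
static int setup_buffer(char *buf, size_t len)
{
	if (len < 3)
		return -1;
	strcpy(buf, "ok");
	return 0;
}

/* BUG: the error check is reversed -- success is treated as failure
 * and the error path is treated as success. */
static int broken_caller(char *buf, size_t len)
{
	if (setup_buffer(buf, len) == 0)
		return -1;
	return 0;
}

/* Fixed version: bail out only when setup_buffer() actually fails. */
static int fixed_caller(char *buf, size_t len)
{
	if (setup_buffer(buf, len) != 0)
		return -1;
	return 0;
}
```

Reviewers pattern-matching on the familiar "call, check, bail out" idiom tend to read straight past the flipped comparison, which is exactly the kind of mechanical inconsistency automated review is good at flagging.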

Christian Brauner asked the group how many people use LLMs for coding; about four developers raised their hands. Shuah Khan expressed concern about access to LLMs; most of this work is being done behind corporate walls. Ts'o said that he has been using the review prompts posted by Chris Mason, originally written for Claude, with Gemini, with generally good results and at a relatively low cost.

Torvalds, though, pointed out that developers have long been complaining about a lack of code review; LLMs may just solve that problem. They are not writing code at this point, he said, though that will likely happen at some point too. Once these systems start submitting kernel code, we will truly need automated systems to review all that code, he said.

Proprietary systems

Konstantin Ryabitsev said that he had tried using some of these systems, but found them to be far too expensive; he also was worried about depending on proprietary technology. Brauner said that this usage had to be supported by employers, or perhaps the Linux Foundation could attempt to provide an automated review service. Ts'o said that the expense depends on how the system is used. One can pull in the entire kernel, using a lot of tokens; that will be expensive. The alternative is to create a set of review rules, reducing the token use by a factor of at least five. Khan repeated that not all developers will have equal access to this technology.

Mark Brown was concerned about requiring submitters to run their patches through proprietary tools; some will surely object to that. Axboe suggested that the review tools should be run by subsystem maintainers, not submitters.

I pointed out that, 20 years ago, the kernel community abruptly lost access to BitKeeper, highlighting the hazards of depending on proprietary tools. If the kernel community becomes dependent on these systems, development will suffer when the inevitable rug-pull happens. At some point, the cost of using LLMs will have to increase significantly if the companies behind them are to have a chance at reaching their revenue targets.

Torvalds, though, called that concern a "non-argument". We do not have those tools today, he said; if they go away tomorrow, the community will just be back where it is now. Meanwhile, he said, we should take advantage of the technology industry's willingness to waste billions of dollars to get people to use these tools. Even if it only lasts a couple of years, it can help the community.

Starovoitov said that he loves the reviews that the BPF community gets from the LLM systems. They ask good questions even when the reviews are wrong. Even better, developers respond to the questions, despite the fact that they are answering a bot; those answers can be used to help the models learn to do better in the future. But he acknowledged a recent three-day outage caused by some problems at GitHub; it "felt devastating". He was just waiting for the service to come back, since it does a better job of reviewing than he does.

Disclosure

Levin shifted the discussion to disclosure requirements. There have been proposals for an Assisted-by tag that would name the specific tool used; should that tag be required for all tools, or just for LLMs? Torvalds said that he would like to see links to LLM-generated reviews, but that there is no need for a special review tag. Ts'o agreed, saying that people need to look at the reviews to determine whether they make sense, but he pointed out that a lot of reviews are not posted publicly. Starovoitov answered that the reviews in the BPF subsystem are posted as email responses to the patches.

Kees Cook said that he didn't care about which specific tag is used, he just wants to know what he should use; Torvalds answered that there does not need to be a tag at all. The information could just be put into the changelog instead. Ryabitsev suggested putting it after the "---" marker so that it doesn't appear in the Git changelog, but Bergmann said he would prefer to have that information in the changelog. Torvalds accused the group of over-thinking the problem, saying that it was better to experiment and see what works. The community should encourage disclosure of tool use, but not make hard rules about how that disclosure should be done.
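To illustrate Ryabitsev's suggestion (the tag name and wording here are hypothetical): anything placed after the "---" separator in a patch email is visible to reviewers but discarded by git am, so a disclosure there would never enter the Git history:

```
Subject: [PATCH] foo: fix widget refcount leak

Drop the reference on the error path to fix the leak.

Signed-off-by: A Developer <dev@example.com>
---
Assisted-by: (hypothetical tool-disclosure line; not recorded by git am)

 drivers/foo/widget.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
```

Bergmann's counter-position amounts to moving that line above the "---" so that it lands in the permanent changelog.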

Ts'o said that, in any case, it is not possible to count on submitters disclosing their tool use; some people may want to lie about it. Dan Williams said that disclosure rules would make it clear that the community values transparency in this area. Levin added that the nice thing about these tools is that they listen; if a disclosure rule is added to the documentation, the models will comply. Williams suggested a rule that all changelogs should mention leprechauns.

As the session moved toward a close, Levin said he would post a documentation patch asking LLM tools to add an Assisted-by tag, but would not make an effort to enforce the rule. There was some final discussion on the details of that tag, which seems sure to evolve over time.

Index entries for this article
Kernel | Development tools/Large language models
Conference | Kernel Maintainers Summit/2025



Missing the point

Posted Dec 11, 2025 23:27 UTC (Thu) by dmv (subscriber, #168800) [Link] (45 responses)

> Ted Ts'o said that people worry about the copyright problems, but those same problems exist even in the absence of these tools

This misses the point entirely. The models are trained on, and are only useful to the extent that they were, a prodigious amount of code/text, and we know for a fact that this ingestion/training was done for the most part (entirely?) without any consideration of the license terms that code came under. These questions are wending their way through courts now, so there will be an answer eventually, but the idea that the risk environment is similar to anything in the past is wrong.

(I’m sure (I’d hope!) the kernel team has asked for legal advice from their own counsel though, in which case it does make sense to say to devs don’t worry about the legal risk. )

Missing the point

Posted Dec 12, 2025 1:26 UTC (Fri) by Wol (subscriber, #4433) [Link] (8 responses)

> This misses the point entirely. The models are trained on, and are only useful to the extent that they were, a prodigious amount of code/text, and we know for a fact that this ingestion/training was done for the most part (entirely?) without any consideration of the license terms that code came under.

Are you sure? That's the hype, certainly, but I believe my company has a policy that the ONLY code the tools we use train on is our own code (or Free Software code that we are modifying/contributing back to).

I think you'll find most "big company" lawyers are somewhat paranoid on that front.

Cheers,
Wol

Missing the point

Posted Dec 12, 2025 10:29 UTC (Fri) by excors (subscriber, #95769) [Link] (2 responses)

There's no doubt they used huge amounts of unlicensed copyrighted material - lawsuits have established that e.g. Meta torrented terabytes of pirated ebooks for AI training. They considered it "medium-high legal risk" and "high policy risk", but on the other hand they believed it was the only way they could compete with state-of-the-art LLMs that were using the same pirated data, and that's worth probably hundreds of billions of dollars, so they got approval from Zuckerberg and a VP who accepted the risk. (https://storage.courtlistener.com/recap/gov.uscourts.cand..., https://www.theverge.com/2025/1/14/24343692/meta-lawsuit-...)

Their defence is that their use of unlicensed material is fair use. That seems to have been a successful defence so far, but cases are ongoing.

Missing the point

Posted Dec 12, 2025 11:24 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

And in Europe it's explicitly legal - for an LLM to *ingest* copyrighted works is seen as no different from a *person* ingesting them.

The *assumption* is that an LLM will be treated the same as a person when they regurgitate copyrighted works (and quite often a person has no idea, why should an LLM? :-)

My basic problem with LLMs such as Google is they have no concept of grammar, and therefore often answer completely the wrong question - "how do I disable unwanted features in Excel Macros" got interpreted as "how do I disable Excel Macros" - NOT something I want to do! Google AI is almost always a complete waste of time as far as I'm concerned. And everybody raves over it!? Garbage In, Garbage Out as far as I'm concerned.

Oh well. At some point the Tulip Crap Supply is going to exceed Demand ...

Cheers,
Wol

Missing the point

Posted Dec 13, 2025 8:59 UTC (Sat) by bboissin (subscriber, #29506) [Link]

> My basic problem with LLMs such as Google is they have no concept of grammar, and therefore often answer completely the wrong question

Fwiw traditional google search isn't an LLM (mostly, lines get blurrier with AI Mode/Overviews and more LLM tech is being used, but it's still fairly far from a pure LLM app like chatgpt/Claude/gemini).

LLMs are actually much better than traditional search engines at grammar parsing. I just tried your example on Gemini (Google's LLM app) in fast mode (cheapest model) and it definitely interpreted it correctly.

Missing the point

Posted Dec 12, 2025 12:48 UTC (Fri) by malmedal (subscriber, #56172) [Link] (3 responses)

A few fully open models trained on freely available text do exist; e.g. https://allenai.org/olmo at least says it provides everything, training data included.

For people who are fine with the copyright status of unknown datasets, but worried that the model they've got may have an undesirable bias, there are a few post-training datasets you can apply to an existing model, e.g. https://allenai.org/blog/tulu-3

Missing the point

Posted Dec 12, 2025 13:57 UTC (Fri) by excors (subscriber, #95769) [Link] (2 responses)

> A few fully only open models trained on freely available text does exist, e.g. https://allenai.org/olmo at least say they are providing everything training data included.

As far as I can tell, that's not free in the Free Software sense. They say they use all 167M documents from StackEdu [1], which are scraped from GitHub. 13.7% have permissive licences, and the other 86.3% have no licence [2]. Explicitly non-permissive licences and copyleft are excluded from StackEdu, but repositories with no licence (i.e. all rights reserved) are included, and the OLMo training data doesn't appear to filter them out.

[1] https://huggingface.co/datasets/allenai/dolma3_pool
[2] https://huggingface.co/datasets/HuggingFaceTB/stack-edu

Missing the point

Posted Dec 12, 2025 16:11 UTC (Fri) by malmedal (subscriber, #56172) [Link] (1 responses)

There's some sort of "purity test"-thing going on with respect to provenance. Something like this should hopefully satisfy most: https://www.fairlytrained.org/certified-models

Missing the point

Posted Dec 13, 2025 3:13 UTC (Sat) by pabs (subscriber, #43278) [Link]

Reminds me of this proposed policy by some Debian folks:

https://salsa.debian.org/deeplearning-team/ml-policy/

Missing the point

Posted Dec 12, 2025 13:09 UTC (Fri) by hailfinger (subscriber, #76962) [Link]

Not sure if there is a disconnect in the scope of the "training". Some AI tools are trained on a huge body of work, then additional training happens on your local body of code. That may be problematic if the previous training "remembers" code which is not under a license you'd want to integrate.

There may also be accidental copying of code between your different internal codebases which is a problem if those internal codebases have incompatible licenses.

Missing the point

Posted Dec 12, 2025 14:09 UTC (Fri) by bluca (subscriber, #118303) [Link] (11 responses)

> The models are trained on, and are only useful to the extent that they were, a prodigious amount of code/text, and we know for a fact that this ingestion/training was done for the most part (entirely?) without any consideration of the license terms that code came under.

Which is the correct way to do it - as per the copyright law in Europe, your license matters jack squat when the work is being used as a dataset to train a model. Copyright law >>> your license

The only thing that matters is that the work being ingested is publicly available on the internet (ie: you did not hack past a firewall or so).

Missing the point

Posted Dec 12, 2025 17:38 UTC (Fri) by Wol (subscriber, #4433) [Link] (10 responses)

> The only thing that matters is that the work being ingested is publicly available on the internet (ie: you did not hack past a firewall or so).

Does robots.txt count as a firewall? Given the stories about bot miners effectively DoS'ing a lot of small sites, I would hope so ...

Cheers,
Wol

robots.txt and model training in the EU

Posted Dec 12, 2025 22:17 UTC (Fri) by gingercreek (subscriber, #155755) [Link] (9 responses)

> Does robots.txt count as a firewall [against LLM scraping, in the EU]?

Most of the time, yes, though there are exceptions. The EU regulation governing general-purpose AI systems reuses the rules from an earlier EU directive on text and data mining. For the purposes of LLM development, these rules stipulate that

  • LLM developers can scrape anything that's publicly available,
  • unless rightsholders have explicitly disallowed such scraping in "an appropriate manner;" and
  • all publicly available materials may be scraped by research organizations for scientific research, even if the rightsholders have explicitly disallowed scraping.

Neither the directive nor the regulation define "appropriate manner," which technically leaves its interpretation to national legislatures and courts. That said, the EU AI Office's "AI Code of Practice" does require signatories to honor robots.txt (in Measure 1.3). LLM developers aren't required to become signatories to the AI Code of Practice, but most of the major "western" developers have, and courts are likely to consider it when interpreting what counts as an "appropriate manner," even when defendants are not signatories.
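For the curious, a sketch of how a site might express such a reservation in robots.txt, using user-agent strings that some training crawlers have published (GPTBot is OpenAI's crawler; Google-Extended is Google's training opt-out token). Whether this counts as "an appropriate manner" is exactly the open question:

```
# Disallow known model-training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search indexing remains allowed
User-agent: *
Disallow:
```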

(Obligatory disclaimer: I'm not a lawyer, this isn't legal advice, and it doesn't represent the opinion of my employer.)

robots.txt and model training in the EU

Posted Dec 12, 2025 22:30 UTC (Fri) by bluca (subscriber, #118303) [Link] (2 responses)

Some years ago the W3C started to work on a standard to more specifically define the opt-out mechanism, not sure where that went though

robots.txt and model training in the EU

Posted Dec 12, 2025 22:34 UTC (Fri) by bluca (subscriber, #118303) [Link]

robots.txt and model training in the EU

Posted Dec 12, 2025 23:12 UTC (Fri) by gingercreek (subscriber, #155755) [Link]

The W3C TDM Reservation Protocol working group is still ongoing, though it hasn't released a new draft in more than a year. IETF also has a similar AI Preferences working group still ongoing.

Additionally, IPTC ("IETF for news media") has extended its existing metadata standard for image and video to cover data mining, and I wouldn't be surprised if there are similar efforts under way elsewhere. It'll be interesting to see which ones become widely adopted and which ones fade away.

robots.txt and model training in the EU

Posted Dec 18, 2025 15:20 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (5 responses)

The TDM exception however does *not* allow use of the scraped data in “generative” models, only to discover “patterns, trends and correlations” (worded specifically like this).

robots.txt and model training in the EU

Posted Dec 18, 2025 15:27 UTC (Thu) by mirabilos (subscriber, #84359) [Link]

Especially not distributing, not even the “model”.

In fact, since they do the mining *in order to* do things not permitted by the TDM exception, their mining is already illegal as well as it’s not covered by that.

The aggressive way they scrape is likely also illegal, under different provisions.

robots.txt and model training in the EU

Posted Dec 18, 2025 18:40 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

Of course it allows that, even the AI act specifically references that.
Did you generate that comment with an AI chatbot?

robots.txt and model training in the EU

Posted Dec 19, 2025 0:43 UTC (Fri) by mirabilos (subscriber, #84359) [Link] (2 responses)

https://www.gesetze-im-internet.de/urhg/__44b.html says no. It’s for *analysis*, not for reproduction.

> Did you generate that comment with an AI chatbot?

gfy, are you now fully TESCREAL, hater of humanity?

robots.txt and model training in the EU

Posted Dec 19, 2025 0:56 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

> https://www.gesetze-im-internet.de/urhg/__44b.html says no. It’s for *analysis*, not for reproduction.

I don't speak German so I have no idea what that is or says. This is the EU copyright directive and it very clearly doesn't say what you think it says: https://eur-lex.europa.eu/eli/dir/2019/790/oj
Reference from the AI act is even more obvious: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

> gfy, are you now fully TESCREAL, hater of humanity?

I have no idea what any of that means, so I'm going to go with chatbot

robots.txt and model training in the EU

Posted Dec 19, 2025 2:24 UTC (Fri) by jzb (editor, #7867) [Link]

This back-and-forth has gone on long enough. Let's end the thread here, please.

Missing the point

Posted Dec 12, 2025 17:23 UTC (Fri) by mb (subscriber, #50428) [Link] (23 responses)

My brain is trained on a prodigious amount of code/text, and we know for a fact that this ingestion/training was done for the most part (entirely?) without any consideration of the license terms that code came under.

Does that automatically mean that any code I write has Copyright problems?

The answer is simple: No.
Copyright is about copying. (D'oh!)

Why would it be different with machine learning?

Missing the point

Posted Dec 12, 2025 21:08 UTC (Fri) by somlo (subscriber, #92421) [Link] (22 responses)

> My brain is trained on a prodigious amount of code/text [...]
> Does that automatically mean that any code I write has Copyright problems?
> The answer is simple: No.

Except when you're literally parroting/regurgitating material you've remembered by heart -- so maybe not quite that simple...

> Why would it be different with machine learning?

Because, as a thinking, feeling, sentient, self-aware entity, with some ineffable quality science hasn't yet been able to explain, you (at least *should*) have the inalienable right to use your ability to think, and understand the world around you.

OTOH, a Turing machine with a (very very large) state machine and a (very very large) tape attached to it is just that. A "machine learning" program is qualitatively much closer to a (lossy) compression algorithm than it is to whatever *you* are.

It makes me really sad to see people constantly anthropomorphize AI/ML systems, which are nothing more than large Turing machines, and therefore not "ineffable magic" with "rights" that must somehow be "protected"... :(

Missing the point

Posted Dec 12, 2025 21:26 UTC (Fri) by mb (subscriber, #50428) [Link] (21 responses)

>Except when you're literally parroting/regurgitating

Yes, as I said. Copyright is about the copying part. Not about the learning part.

>as a thinking, feeling, sentient, self-aware entity

That is completely irrelevant to whether copying is done or not.

>A "machine learning" program is qualitatively much closer to a (lossy) compression algorithm than it is to whatever *you* are.

I do not think that this is true.

I am a lossy compression algorithm and I feel it every single day.
The older I become, the more I feel it.
What was the topic?
Nevermind.

Missing the point

Posted Dec 15, 2025 11:02 UTC (Mon) by paulj (subscriber, #341) [Link] (20 responses)

> Yes, as I said. Copyright is about the copying part. Not about the learning part.

And this is the essential question....

Are these things "learning"? Are they incorporating their experiences into the very fabric of their being? Or are they updating weights in a massive state machine, weights corresponding to vectors that numerically describe distinct features of the consumed data set: from very basic features that are literal, exact copies of pieces of the data set, through to features of the relationships between those basic features, including the likelihoods of how one can compose with another within the data set.

Missing the point

Posted Dec 15, 2025 16:03 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

While I'm partial to the "updating weights" side, the marketing departments don't seem to agree. I believe that we should hold them to standards reflecting their marketing boasts while they continue to make them.

Missing the point

Posted Dec 15, 2025 16:08 UTC (Mon) by bluca (subscriber, #118303) [Link] (9 responses)

> Are these things "learning"?

As far as the law is concerned, yes, because the law says that data mining for the purpose of training models is an allowed exception to copyright protection

Missing the point

Posted Dec 15, 2025 16:31 UTC (Mon) by paulj (subscriber, #341) [Link] (8 responses)

You're referencing the explicit EU AI Act, correct? But that is an explicit exception. If AI data-mining+training was clearly legally equivalent to a human learning, then there would have been no need for the EU to regulate such an exception...

(Aside: How was the EU able to achieve this via regulation? Copyright is surely a devolved matter that can not directly be changed by regulation? Wouldn't it require an EU Directive to be agreed, and transposed into member state law by each member?).

Missing the point

Posted Dec 15, 2025 16:38 UTC (Mon) by paulj (subscriber, #341) [Link]

The relevant text in the AI act, AFAICT, seems to be this - is it really legalising data-strip-mining for AI training? Or is it reinforcing that copyright holders have rights?

https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX...

"(105) General-purpose AI models, in particular large generative AI models, capable of generating text, images, and other
content, present unique innovation opportunities but also challenges to artists, authors, and other creators and the
way their creative content is created, distributed, used and consumed. The development and training of such models
require access to vast amounts of text, images, videos and other data. Text and data mining techniques may be used
extensively in this context for the retrieval and analysis of such content, which may be protected by copyright and
related rights. Any use of copyright protected content requires the authorisation of the rightsholder concerned
unless relevant copyright exceptions and limitations apply. Directive (EU) 2019/790 introduced exceptions and
limitations allowing reproductions and extractions of works or other subject matter, for the purpose of text and data
mining, under certain conditions. Under these rules, rightsholders may choose to reserve their rights over their
works or other subject matter to prevent text and data mining, unless this is done for the purposes of scientific
research. Where the rights to opt out has been expressly reserved in an appropriate manner, providers of
general-purpose AI models need to obtain an authorisation from rightsholders if they want to carry out text and
data mining over such works."

?

Missing the point

Posted Dec 15, 2025 16:50 UTC (Mon) by Wol (subscriber, #4433) [Link]

> You're referencing the explicit EU AI Act, correct? But that is an explicit exception. If AI data-mining+training was clearly legally equivalent to a human learning, then there would have been no need for the EU to regulate such an exception...

That's the point. It's stating that it IS legally equivalent. People are arguing over whether they are different (they clearly are in practice, not imho in principle), so we have a new law that explicitly says that as far as the law is concerned they are to be treated the same.

Seriously - how can two things be "legally equivalent" unless there is a law stating that they are?

And you'll see this throughout most legal systems - "We have this general legal principle, and we explicitly say that these assorted actual instances are all considered equivalent to the principle".

We had exactly this sort of thing with TWOC'ing. It was deemed not to be theft (which is what any normal person would have called it) because the owner usually got their car back. And if it had been abandoned undamaged (which admittedly wasn't that common) the defence barristers argued "where is the harm?" We needed a new law - Taking WithOut Consent - that said TWOCing and theft were legally the same thing (where in detail they are clearly not). I believe there's various other things which fit the *common* but not the *legal* definition of theft, so the law also defines a whole bunch of other offences which are then declared to be "legally equivalent" to theft.

Cheers,
Wol

Missing the point

Posted Dec 15, 2025 16:56 UTC (Mon) by bluca (subscriber, #118303) [Link] (5 responses)

Whether it's "equivalent to a human learning" is irrelevant. What is relevant is that the copyright directive (from a few years ago, even before the current AI craze exploded) explicitly carved out permissions to data mine for the purpose of training models as excepted from copyright, so no other equivalence is required. As long as the few requirements that the directive specifies are conformed to, it is perfectly legal to data mine any publicly available content, ignoring any copyright and thus any license that may or may not be attached to such content.

Missing the point

Posted Dec 15, 2025 17:31 UTC (Mon) by paulj (subscriber, #341) [Link] (1 responses)

Sure, but rights holders can opt-out, under the DSM - and that right is affirmed in the AI Act too.

Reading some legal blogs, it seems like AI data-miners do /not/ have a blanket exception from copyright. Reservations expressed by rights holders in robots.txt, ai.txt, and perhaps some HTTP headers all apparently ought to render material out-of-bounds, so far as copyright goes. I'm not sure who is right here, but there definitely seem to be legal people out there who think rightsholders have that opt-out.

(I have no understanding of this area at all, I am just trying to gain some basic understanding and triangulate what I can from opinions I'm reading in a few places).

Missing the point

Posted Dec 15, 2025 17:38 UTC (Mon) by bluca (subscriber, #118303) [Link]

Yes, as mentioned, the directive has some requirements that need to be followed, and one of these is that for-profits need to respect opt-outs. The problem is that there is no well-defined mechanism to do so, yet.

But this requirement does not apply to research, educational, etc etc data mining.

Missing the point

Posted Dec 15, 2025 17:55 UTC (Mon) by Wol (subscriber, #4433) [Link]

Yes. And what is ALSO relevant is that the directive makes no mention whatsoever of the OUTPUT from such training, so that output is presumably subject to the same copyright rules as anything else - if it looks like a copy of your work, you can assert your rights and seek your just rewards.

But who's to say most of this isn't JUST like music - how many musicians have been accused of plagiarism, just because they are famous and their music "just happened" to contain similar elements to someone else's copyrighted works!

Just beware! Those works could easily be works you yourself wrote a few years earlier! Certainly in the Classical industry, a lot of works are commissioned, and may come with a "transfer of copyright" clause. So you then should not write anything that sounds like that said earlier work! Which is probably a lot easier said than done!

Still, who said Copyright was sane ???

Cheers,
Wol

Missing the point

Posted Dec 18, 2025 15:22 UTC (Thu) by mirabilos (subscriber, #84359) [Link] (1 responses)

but not to do anything with that other than to analyse it for patterns, trends and correlation (verbatim from the law).

Missing the point

Posted Dec 18, 2025 15:49 UTC (Thu) by bluca (subscriber, #118303) [Link]

> ‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations;

https://eur-lex.europa.eu/eli/dir/2019/790/oj

_INCLUDES BUT IS NOT LIMITED TO_ patterns, trends and correlations

Literally referenced from the AI act itself in https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng

Missing the point

Posted Dec 15, 2025 16:49 UTC (Mon) by mb (subscriber, #50428) [Link] (8 responses)

>Are these things "learning"?

That is a good question, but it is not relevant for the Copyright topic.
What matters for Copyright is whether the output of the "AI" is essentially a copy of copyrighted material.
It has been shown that "AI"s can be triggered to do that copying, at least partially, if you try hard enough.
However, I don't believe that this actually happens in real use cases.
Coding LLMs, given their surrounding context, don't simply reproduce material. They ingest the individual context and give a very specific answer based on it.

Missing the point

Posted Dec 15, 2025 18:08 UTC (Mon) by Wol (subscriber, #4433) [Link] (7 responses)

> It has been shown that "AI"s can be triggered to do that copying partially, if you explicitly try hard enough to trigger it.

Which is likely more than sufficient to invoke "unclean hands".

The chances of an AI regurgitating any (portion of a) copyrighted work to an enquirer who is unaware of said work, are probably somewhere between "highly unlikely" and "nil".

Cheers,
Wol

Missing the point

Posted Dec 15, 2025 18:43 UTC (Mon) by pizza (subscriber, #46) [Link] (6 responses)

> The chances of an AI regurgitating any (portion of a) copyrighted work to an enquirer who is unaware of said work, are probably somewhere between "highly unlikely" and "nil".

Copyright liability falls on those doing the copying (ie the regurgitator), not those asking for said regurgitation.

....And your "probably highly unlikely" has been repeatedly demonstrated to (easily) happen. Indeed, that's a large part of the legal theory behind many of those ongoing publisher-initiated lawsuits.

Missing the point

Posted Dec 15, 2025 19:36 UTC (Mon) by bluca (subscriber, #118303) [Link]

> Copyright liability falls on those doing the copying (ie the regurgitator), not those asking for said regurgitation.

<citation needed>

> ....And your "probably highly unlikely" has been repeatedly demonstrated to (easily) happen. Indeed, that's a large part of the legal theory behind many of those ongoing publisher-initiated lawsuits.

Remains to be seen how those end up. But they are court cases in America, where the legislation is far behind, so legality or not hinges around "fair use", which is notoriously murky.

It's funny how the anarcho-capitalists (and copyright maximalists) constantly slag off the EU for "stifling innovation via regulation", and yet here we are clearly far ahead, with much more certain and clear rules around data mining that apply to everyone, while in the US it's a coin flip (or rather it's down to whoever can pay for the best lawyers, which is probably why said capitalists love it)

Missing the point

Posted Dec 16, 2025 0:35 UTC (Tue) by Wol (subscriber, #4433) [Link] (4 responses)

> > The chances of an AI regurgitating any (portion of a) copyrighted work to an enquirer who is unaware of said work, are probably somewhere between "highly unlikely" and "nil".

> Copyright liability falls on those doing the copying (ie the regurgitator), not those asking for said regurgitation.

And from what (little) I've seen, the ones doing the copying are those TYPING THE PROMPT INTO THE AI. Which is why I said the chances of it happening to a prompter UNAWARE of the copyrighted work are minimal to non-existent.

There are plenty of examples of people *getting* AIs to regurgitate works, by CAREFULLY CRAFTED prompts. I'm not aware of any AIs regurgitating works in response to innocent prompts (not saying there aren't any, they just appear to be non-existent).

Cheers,
Wol

Missing the point

Posted Dec 16, 2025 1:03 UTC (Tue) by pizza (subscriber, #46) [Link] (3 responses)

> There are plenty of examples of people *getting* AIs to regurgitate works, by CAREFULLY CRAFTED prompts. I'm not aware of any AIs regurgitating works in response to innocent prompts (not saying there aren't any, they just appear to be non-existent).

Again, you're making a distinction that does not exist in copyright law. That I asked you to copy something doesn't change the basic fact that the one performing the copying [1] is liable, not the one asking for said copy.

(in the US, yadda yadda, other jurisdictions may vary)

[1] they could always respond with "I'm sorry Dave, I'm afraid I can't do that"

Missing the point

Posted Dec 16, 2025 1:11 UTC (Tue) by pizza (subscriber, #46) [Link]

> That I asked you to copy something doesn't change the basic fact that the one performing the copying [1] is liable, not the one asking for said copy.

BTW, Secondary and/or Contributory infringement can fall upon the recipient if they "materially contributed to the infringement" but that only applies on top of the primary infringement (ie by the party performing the actual copy).

Lots of sources for that, but here's a nice summary

https://web.law.duke.edu/cspd/papers/pdf/ipcasebook_chap-...

Missing the point

Posted Dec 16, 2025 10:33 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

So you asked me to copy something. BY GIVING ME A (PARTIAL?) COPY OF WHAT YOU WANTED TO GET BACK.

As I said, "unclean hands".

Cheers,
Wol

Missing the point

Posted Dec 16, 2025 13:25 UTC (Tue) by pizza (subscriber, #46) [Link]

> So you asked me copy something. BY GIVING ME A (PARTIAL?) COPY OF WHAT YOU WANTED TO GET BACK.

I remember seeing an early test showing that the original ChatGPT returned something like *65%* of the first Harry Potter book, word-for-word. The output volume exceeded the input prompt by many, many orders of magnitude.

More critically, *if the LLM purveyor hadn't essentially copied the text in the first place*, the LLM couldn't have regurgitated anything. Indeed, the LLM purveyors have already admitted to this wholesale copying [1], and their defence is essentially that all of it is permissible fair use.

[1] as in admitting to obtaining many gigabytes of ebooks via illegal torrents

Linux Foundation

Posted Dec 12, 2025 11:03 UTC (Fri) by alx.manpages (subscriber, #145117) [Link] (56 responses)

> or perhaps the Linux Foundation could attempt to provide an automated review service.

If the Linux Foundation were to spend money on AI reviews, but not on paying maintainers directly, I think that would be negligence. Maintainers could do a much better job than LLMs at reviewing code. They just need to be paid so they can work full time on that.

Linux Foundation

Posted Dec 13, 2025 0:35 UTC (Sat) by AdamW (subscriber, #48457) [Link] (5 responses)

Have you tried it?

The frontier models, at least, are getting really good at finding certain types of issues, better than humans in some cases. Well, rather they tend to find different things. Humans are still better at things like "but maybe this is just a wrong idea in the first place" (an LLM will almost never tell you that) or "it should be architected entirely differently" or "you should use this library, not reimplement it". But LLMs are very, very good at "you got this bit of logic the wrong way round" or "you didn't consider what happens when <x>" or "you have a typing issue" or "you didn't write that loop as tightly as it could've been written".

I started getting robot overlord PR reviews as an experiment, but now I'd be sad to lose them (to the point where I'm now working on ways to get them done without relying on Google continuing to give them away for free for public GitHub projects). Of course we *also* do human review.

Linux Foundation

Posted Dec 13, 2025 3:18 UTC (Sat) by pabs (subscriber, #43278) [Link] (2 responses)

> Of course we *also* do human review.

For how long? Relying on tools usually means one's own skills will atrophy.

Linux Foundation

Posted Dec 14, 2025 17:55 UTC (Sun) by SLi (subscriber, #53131) [Link] (1 responses)

Just like compilers cause atrophy of real programming skills because now you can use these fancy magical things to do the programming (assembly) for you, starting from a high level description that any manager could understand (COBOL)?

Yes, I would admit that assembly skills have atrophied because of compilers. Thank God.

Linux Foundation

Posted Dec 15, 2025 0:32 UTC (Mon) by dskoll (subscriber, #1630) [Link]

This is a false analogy. With a compiler, someone who has a reasonably decent understanding of computer architectures can at least grasp how the high-level language is transformed into machine language. And though that person might not know all the details of assembly language, it's true IMO that people who completely lack low-level understanding write worse code, and code with more vulnerabilities.

The transformation made by AI is far more extensive and disconnected from the input (and indeed, the same input can give different output... something a non-buggy compiler won't do.)

With a compiler or with assembly language, you still have to consider algorithms, time complexity, and so on. With AI, you don't need to know any of that (or rather, when the AI output inevitably proves to be crap, you really do need someone who knows it.)

Linux Foundation

Posted Dec 13, 2025 7:07 UTC (Sat) by alx.manpages (subscriber, #145117) [Link]

> Have you tried it?

Yes. And no, I didn't like them, thanks.

Linux Foundation

Posted Dec 15, 2025 10:20 UTC (Mon) by taladar (subscriber, #68407) [Link]

LLMs are also really great at rewriting code that uses the latest version of a library into the deprecated or removed forms of that library's API that were current when they were trained. I tried e.g. a test project with the Bevy game engine (currently at 0.17), and even if I just tell it (Gemini CLI in this case) to move functions to another file it keeps replacing 0.17 constructs with functionally equivalent 0.14 constructs. I am not sure that kind of model would ever be suitable for reviewing code in general.

Linux Foundation

Posted Dec 13, 2025 3:17 UTC (Sat) by pabs (subscriber, #43278) [Link] (49 responses)

Sounds like AI reviewers do a better job than human ones, from the article above:
He passed one five-line patch with a known problem to three human reviewers, and none of them found the bug. But the automated tool did find the problem (a reversed condition in a test); "AI always catches that stuff".

Linux Foundation

Posted Dec 13, 2025 7:13 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (2 responses)

> Sounds like AI reviewers do a better job than human ones, from the article above

As you said, humans getting lazy is a consequence of relying on that. That's enough of a concern to not use it even if it has cases where it works.

Another one is that, just as it works in some cases, I'm pretty sure it could convince humans that something is a bug even when it isn't, and induce them to introduce a bug as a consequence. How good is a tool that doesn't let you discern between false positives and true positives? That is, how useful is a liar?

> He passed one five-line patch with a known problem to three human reviewers, and none of them found the bug. But the automated tool did find the problem (a reversed condition in a test); "AI always catches that stuff".

I'd like to see the conditions of the experiment. I don't trust that statement at face value. Was it in a mailing list, where people have days to look at it? Or was it during a talk, where people are under time pressure, and it's likely that they won't see the bug?

Linux Foundation

Posted Dec 13, 2025 14:01 UTC (Sat) by kleptog (subscriber, #1183) [Link] (1 responses)

> As you said, humans getting lazy is a consequence of relying on that. That's enough of a concern to not use it even if it has cases where it works.

Just like people reviewing Rust code rely on the compiler to check object lifetimes, do type checks and that all the variable names are valid. You can call that 'humans being lazy'. I call that humans utilising tools to avoid doing dumb work a computer can do.

> Another one is that, just like it works in some cases, I'm pretty sure it could convince humans that something is a bug, even if it isn't, and induce the humans to introduce a bug as a consequence.

Are you telling me you've never had this happen with a human reviewer?

The software we (collectively) build is fantastically complicated, far more complicated than any physical system we've ever constructed. Requiring a computer or human to be infallible is a losing game. That's why we should be taking reviews from as diverse a group of sources as possible. It's not like no-one ever looks at code after it's committed, right?

If people were suggesting that we should outsource all code review to a single proprietary model and blindly trust the results, you might have a point. But in such a heterogeneous environment like kernel development that seems rather unlikely.

Linux Foundation

Posted Dec 13, 2025 16:00 UTC (Sat) by alx.manpages (subscriber, #145117) [Link]

> Just like people reviewing Rust code rely on the compiler to check object lifetimes, do type checks and that all the variable names are valid.

I write C code and do pretty much the same. Rust isn't all that special in this regard.

> You can call that 'humans being lazy'. I call that humans utilising tools to avoid doing dumb work a computer can do.

No, these are not in the same league. You can rely on deterministic tools for matters they're good at. Programmers don't need to look out for type mismatches, because the compiler will tell them.

But if you get to rely on non-deterministic tools to tell you if some code is good or bad, you have deep problems.

> Are you telling me you've never had this happen with a human reviewer?

Certainly; and I wouldn't want to increase the rate at which that happens by orders of magnitude. Also, when it happens with humans in a mailing list, other humans will jump in and correct them. With humans, you have some consistency in which areas some humans tend to be better or worse, and who to trust more for what.

> Requiring a computer or human to be infallible is a losing game.

That's why we need more humans. If LF spent money on those humans who are best at it, we'd have plenty of them. If LF spends money on machines, capable humans will eventually become scarce, and the situation will no longer be recoverable.

Linux Foundation

Posted Dec 13, 2025 12:23 UTC (Sat) by dskoll (subscriber, #1630) [Link] (45 responses)

My problem with GenAI is not only or even primarily the quality of its output. I have serious questions about the way the entire GenAI industry is headed.

Linux Foundation

Posted Dec 13, 2025 15:50 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (44 responses)

> I have serious questions about the way the entire GenAI industry is headed.

I fully agree with those. I just gave up trying to convince others about that, because some people seem not to care.

Linux Foundation

Posted Dec 13, 2025 19:29 UTC (Sat) by dskoll (subscriber, #1630) [Link] (43 responses)

I guess I'm more stubborn and more willing to tilt at windmills. 🙂

Also, I am retired and am only working on hobby software projects, so I get to set the rules.

Linux Foundation

Posted Dec 13, 2025 20:32 UTC (Sat) by mb (subscriber, #50428) [Link] (42 responses)

>am only working on hobby software projects, so I get to set the rules.

... on your hobby projects.
The rest of the world moves on, though.

I also tend to be grumpy when new things come along, but these newfangled "AI" tools do actually have some use for me as well.
Not always. But I like to use "AI" when it benefits me.

Banning those tools is just nonsense. It makes your projects go obsolete. People will go on and fork or completely reimplement your thing with modern tools.

Linux Foundation

Posted Dec 13, 2025 20:56 UTC (Sat) by dskoll (subscriber, #1630) [Link] (22 responses)

As I wrote before, quality is not my primary concern about AI; these are my concerns.

Also... I suspect AI is over-hyped and overblown, and in the end won't make all that much difference to software development.

Linux Foundation

Posted Dec 13, 2025 21:06 UTC (Sat) by mb (subscriber, #50428) [Link] (21 responses)

I read your link before already and I still disagree.

I think machine learning tools make a huge difference to software development.

I'm sorry, but just saying that these tools are "fraudulent" or similar is just nonsense.
It is as fraudulent as me learning from a proprietary text book or proprietary job assignment.

Linux Foundation

Posted Dec 13, 2025 21:26 UTC (Sat) by dskoll (subscriber, #1630) [Link] (20 responses)

The tools are not fraudulent.

The industry is fraudulent, seeing as it's based on theft of humans' works, it makes wildly unsubstantiated claims, and it is borrowing money at a completely reckless pace that is pretty sure to wreck our economy within the next few years.

Linux Foundation

Posted Dec 13, 2025 22:39 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (12 responses)

FWIW, I've banned AI tools (including for review) in a non-hobby project:

<https://git.kernel.org/pub/scm/docs/man-pages/man-pages.g...>

And I've had to enforce the policy already:

<https://bugzilla.kernel.org/show_bug.cgi?id=220726#c7>

Linux Foundation

Posted Dec 15, 2025 6:05 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

> including for review

That's stupid. AI reviewers routinely find real issues. What next, banning "-Wall" because real humans should find bugs, not some stinking compilers?

Banning warnings

Posted Dec 15, 2025 9:38 UTC (Mon) by farnz (subscriber, #17727) [Link] (6 responses)

That's basically the attitude of the group of people who say that we don't need any new programming languages, we just need people to Get Good at writing bug-free C.

Banning warnings

Posted Dec 15, 2025 10:54 UTC (Mon) by alx.manpages (subscriber, #145117) [Link] (5 responses)

That's basically the attitude of the group of people that care about how much destruction of Nature is happening because of AI, among other issues.

It seems a different group of people, which presumably includes you, don't care about that, and instead make fun of us. Please rethink which group of people you want to be in.

Banning warnings

Posted Dec 15, 2025 11:03 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

I am only talking about the group of people who assert that there is no need for any changes after ANSI C89, since the only remaining issue is that people don't write bug-free code. If you dare to try to improve C above the baseline set by C89 and C89 era compilers, you are also in the group that you've just insulted.

Banning warnings

Posted Dec 15, 2025 11:21 UTC (Mon) by alx.manpages (subscriber, #145117) [Link] (3 responses)

> I am only talking about the group of people who assert that there is no need for any changes after ANSI C89,

Maybe I misunderstood something, but we were talking about banning AI for reviews, which I've done. Per my understanding you put me in some group that allegedly does two things:

- Ban AI reviews.
- Assert that there is no need for any changes after ANSI C89.

> If you dare to try to improve C above the baseline set by C89 and C89 era compilers,

I do.

Banning warnings

Posted Dec 15, 2025 11:28 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

No - I was responding to Cyberax saying "why would anyone reject a tool that finds bugs?" by pointing out that there is an existing group that's been around for decades that says that people should upskill to the point where they don't write bugs, and therefore there's no need for language changes that make it easier to find bugs, let alone tools that make it easier.

You then went off on me about how I must hate nature for daring to notice that there are people who will reject tools that find bugs on the basis that you should just get good at coding.

Banning warnings

Posted Dec 15, 2025 11:36 UTC (Mon) by alx.manpages (subscriber, #145117) [Link] (1 responses)

> I was responding to Cyberax saying "why would anyone reject a tool that finds bugs?"

Cyberax was replying to me saying I've banned AI tools for review in projects I maintain.

I think it is compatible to want to improve the language and static analysis tools, and still reject AI tools.

Banning warnings

Posted Dec 15, 2025 11:38 UTC (Mon) by farnz (subscriber, #17727) [Link]

You replied to me, and not to Cyberax, though.

If you were replying to Cyberax, why not reply to him, and not me? The only reasonable read of your previous comment is that anyone who says that "just get better at coding - the language and tools are perfect" hates nature.

Real issues

Posted Dec 16, 2025 16:17 UTC (Tue) by adobriyan (subscriber, #30858) [Link] (3 responses)

> AI reviewers routinely find real issues.

That is not the question. What's the false positive rate?

We use PVS at work which is painting CI in red every once in a while.

One day it has found "missing unlock on error path". That's after like 5 false positives.

I thought, hey, maybe it is not complete garbage. But some time later it did _not_ find an "unlock-unlock" bug (which should have been a regular lock-unlock pair) — lockdep did.

> What next, banning "-Wall" because real humans should find bugs, not some stinking compilers?

-Wsign-compare also finds real bugs, but for some (explicable!) reason it is turned off in the top-level Linux Makefile.

Real issues

Posted Dec 16, 2025 19:45 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> This is not an issue. What's the false positive rate?

In my experience, about 0-50% depending on the code.

> We use PVS at work which is painting CI in red every once in a while.

PVS? We're using Claude for reviews, and it is pretty reasonable. It definitely saved me many hours (if not days) of debugging by pointing out issues in the code.

E.g. something like: "const filteredPoints = filterPoints(points); const distance = distanceBetween(filteredPoints[0], points[filteredPoints.length-1]);". A stupid mistake, but one that can take hours to debug in math-heavy 3D code.

And since it's just _reviews_, it's not affecting the actual code quality.
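The kind of mistake described above can be made concrete with a small self-contained sketch; the `Point` type, `filterPoints`, and `distanceBetween` below are hypothetical stand-ins, not the actual code from that project:

```typescript
type Point = { x: number; y: number };

// Euclidean distance between two points.
const distanceBetween = (a: Point, b: Point): number =>
  Math.hypot(a.x - b.x, a.y - b.y);

// Hypothetical filter: keep only points with non-negative x.
const filterPoints = (pts: Point[]): Point[] => pts.filter(p => p.x >= 0);

const points: Point[] = [{ x: -5, y: 0 }, { x: 0, y: 0 }, { x: 3, y: 4 }];
const filteredPoints = filterPoints(points); // [{0,0}, {3,4}]

// The bug: indexing the *unfiltered* array with the *filtered* length,
// so the second endpoint is the wrong point entirely.
const buggy = distanceBetween(filteredPoints[0], points[filteredPoints.length - 1]); // 0

// The intended computation: both endpoints from the filtered array.
const fixed = distanceBetween(filteredPoints[0], filteredPoints[filteredPoints.length - 1]); // 5
```

The code compiles and runs without complaint either way; only the numbers are silently wrong, which is exactly why this class of mistake is tedious to find by debugging and cheap for a reviewer (human or otherwise) to spot.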

Real issues

Posted Dec 16, 2025 20:42 UTC (Tue) by Wol (subscriber, #4433) [Link]

Given that they say text recognition needs to be 95% correct in order to be worth having, that doesn't look like a good success rate ...

Cheers,
Wol

Real issues

Posted Dec 17, 2025 2:47 UTC (Wed) by pabs (subscriber, #43278) [Link]

> lockdep did

Was this locking bug in the Linux kernel or in userspace? I got the impression that lockdep is dead for userspace usage.

Linux Foundation

Posted Dec 13, 2025 22:44 UTC (Sat) by Wol (subscriber, #4433) [Link] (5 responses)

Hmmm...

While I may not agree with the detail (for example, training AIs is just the same as training people: they hoover up large amounts of copyrighted work, often with little recompense to the authors - how many textbooks are passed down from one generation of students to the next ...), the basic thrust of Diane's concerns is - imho - fully justified.

The biggest problem, as I see it, is that there is very little feedback to tell an AI whether its reasoning is correct. I'd be quite happy with an SLM where I knew the training data was "clean". But when the default AI I see (in Google search) pretty much *always* fails to even understand the question (fortunately it almost always paraphrases and tells me the question it is actually answering), you can see why I'm extremely cynical. Yes, I'm aware other AIs are apparently better, but if your showcase AI is crap ...

In the real world, "Meat Intelligence" can be a matter of life and death - get it wrong and *you* are dead. If you can't accurately recognise the clues to eg a hungry lion ... In the AI world, as Diane points out, it's usually someone else who dies ...

And given the human tendency to ignore "inconvenient facts", there's a not insignificant chance that "someone else" is going to be most of us, in the next ten years or so ...

Cheers,
Wol

Linux Foundation

Posted Dec 13, 2025 23:49 UTC (Sat) by dskoll (subscriber, #1630) [Link] (3 responses)

training AIs is just the same as training people

To be honest, I've struggled to respond to this, even though I think it's different. After some thinking, I have come up with a couple of differences:

Textbook authors write books with the intent to train people. They know that's how their works will be used. People who create art don't do so primarily to train other people, even though of course everyone learns from every life experience. But I think artists and others never expected industrial-scale ingestion of their works by massive for-profit companies in order to machine-produce other works (I would write "derived works" but apparently the courts disagree.)

Second is that the average person can, over a career of let's say 40 years, ingest maybe 500-1000 textbooks. Maybe 2000 if they're a prodigious reader. AI can ingest vastly higher quantities of information and, what's more, remember it all. So this gives AIs a huge (and I'd say unfair) advantage over humans.

I know I'll be accused of Luddism---machines in a factory have a huge advantage over humans when producing things---but I think AI is qualitatively different. It's trying to compete with us in the very fields that make us human: Art, science, reasoning, etc. and I think it is very dangerous. It's not as if the AI oligarchs are going to support universal basic income for everyone they make redundant.

Linux Foundation

Posted Dec 14, 2025 0:11 UTC (Sun) by Wol (subscriber, #4433) [Link]

> Second is that the average person can, over a career of lets say 40 years, ingest maybe 500-1000 textbooks. Maybe 2000 if they're a prodigious reader. AI can ingest vastly higher quantities of information and what's more, remember it all. So this gives AIs a huge (and I'd say unfair) advantage over humans.

This, I think, is actually Artificial Idiocy's Achilles heel. It has no clue whatsoever about the QUALITY of what it's ingesting. And it has no way of learning. Which is why LLMs tend to go off the rails pretty quickly. Yes, I'm seriously concerned about the damage they are capable of doing (including accidentally destroying civilisation, which I seriously think is on the cards), but their ability to replace human intelligence? Nah, on that score, they're just a huge con. Their ability to dumb down the masses, on the other hand ...

I think the biggest threat from AI (and it's a real threat) is best summed up in the comment "To err is human, but to really screw things up you need a computer".

Cheers,
Wol

Linux Foundation

Posted Dec 14, 2025 3:43 UTC (Sun) by mgb (guest, #3226) [Link]

Humans learn. LLM AIs parrot.

The fictional Marvin had a brain the size of a planet. A real world LLM AI has a memory the size of a planet, but less intelligence than an ant.

Training LLM AIs is not akin to training humans. It is akin to allowing anyone wealthy enough to illegally copy your work with no legal consequences.

Linux Foundation

Posted Dec 14, 2025 18:18 UTC (Sun) by SLi (subscriber, #53131) [Link]

> Textbook authors write books with the intent to train people. They know that's how their works will be used. People who create art don't do so primarily to train other people, even though of course everyone learns from every life experience. But I think artists and others never expected industrial-scale ingestion of their works by massive for-profit companies in order to machine-produce other works (I would write "derived works" but apparently the courts disagree.)

This is true, but I'd argue it's irrelevant. People who make their works available on the Internet have, and as a policy matter should have, no inherent right to control who reads them. Someone might want to educate a certain class of people, but if others read it, and maybe even use the information in a way that the author would not prefer... *good*. That's supposed to happen. Copyright should not be extended to anything like total control of use. Copyright is quite narrow in a sense; you only get to control certain things. And I think the world is much better that way.

And it's not like the LLMs are the actors here. Humans train LLMs. Yes, they found a way to use your contribution to public knowledge in a way you don't like. Tough, but I'm happy to think that few societies would stop useful progress because some people didn't want their posts read by *that thing*.

> Second is that the average person can, over a career of lets say 40 years, ingest maybe 500-1000 textbooks. Maybe 2000 if they're a prodigious reader. AI can ingest vastly higher quantities of information and what's more, remember it all. So this gives AIs a huge (and I'd say unfair) advantage over humans.

This, on the other hand, I don't think is really true. An LLM has a limited size of weights. It doesn't remember everything it read, just like a human doesn't remember everything they ever experienced. It's only the gist that remains, when important enough. Just like when humans learn to program by reading code and books...

Also, I think the "unfair" part is weird. These are tools, by humans, for humans. They don't *want* to compete with you. Sure, those who have trained maybe want. So if they really provide a way to outcompete programmers, what is the big issue?

Does a tractor also have an unfair advantage in agriculture? It did transform society so that the need for agricultural labor fell from 90% of humans to 10%. And, yes, a lot of grief ensued. Society has always been bad at distributing the benefits of net-positive developments so that nobody is worse off.

I think the only difference this time is that it's educated people who are threatened, for the first time in history. But what if the argument is that LLMs doing programming, even if they work well (which I understand is contested), are harmful for humankind? I absolutely love the fact that there are tractors.

Linux Foundation

Posted Dec 14, 2025 18:01 UTC (Sun) by SLi (subscriber, #53131) [Link]

> But where the default AI I see (in Google search)

I think this is your major problem. Stop thinking you are interacting with something akin to a modern LLM when you are typing in Google search.

Linux Foundation

Posted Dec 15, 2025 12:37 UTC (Mon) by paulj (subscriber, #341) [Link]

Indeed. There is an absolutely huge bubble being built. It is going to pop within 2 years. Even NVidia will be hurt once it does (it will keep lumbering on though - whether it gets holed as badly as Sun was by the Dot-com bubble we shall see, maybe it survives like Cisco did).

The only big AI investors making real money are those who already have huge consumer-data-mining-advertising operations, who can afford to subsidise their AI work off their existing operations. Those operations are also working hard on building their own silicon to ultimately completely replace NVidia.

Linux Foundation

Posted Dec 15, 2025 12:33 UTC (Mon) by paulj (subscriber, #341) [Link] (18 responses)

It's not clear that all these AI tools are substantively better than humans, on net. All these tools still appear to require significant human review, and - because of the fundamental nature of LLMs and the like, with no fixes on the horizon - likely always will.

These tools /do/ require massive amounts of power (20% of Ireland's electricity is going to DCs now, with 30% projected by 2030 - all of that growth will be AI-driven; Ireland doesn't have that power, so our regulators are going to require future DCs to generate their own power, which will of course be fossil-fuel based), and water.

10 minutes of an LLM reviewing your code is about 70 MJ of energy just on the inference (assuming it takes a 70 kW rack) - never mind training. That's about equivalent to a month of a human doing work (7 hours a day, 5 days/week, 4 weeks/month). Is the energy cost of that 10 minutes worth it? Ok, so it gives you some useful information, maybe finds a bug or two - but it will (and this is my experience) ALSO give you a lot of false information, including /subtly/ wrong information and subtly buggy code. So it might in 10 to 30 minutes of interactive use give you code that would otherwise have taken you X days to research and write yourself; but you will also spend Y time figuring out whether or not that AI code is suitable or not, where the bugs are, etc. For small stuff, X/Y can be a win - judged solely on the time saving. For larger stuff, IME, the AI stuff becomes increasingly detached from what you ask of it, and needs ever more remedial work.
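
The comparison above can be checked with trivial arithmetic. A minimal sketch follows; the 70 kW rack draw and the ~100 W human metabolic rate are assumptions for illustration, not measured figures (note that the strict product for 10 minutes at 70 kW comes to 42 MJ, in the same ballpark as the figure quoted above):

```python
# Back-of-envelope comparison: LLM inference energy vs. a human work-month.
# All inputs are assumptions, not measurements.
RACK_POWER_W = 70_000       # assumed whole-rack draw during inference
INFERENCE_S = 10 * 60       # 10 minutes of code review

inference_mj = RACK_POWER_W * INFERENCE_S / 1e6

HUMAN_POWER_W = 100         # rough human metabolic rate
MONTH_S = 7 * 3600 * 5 * 4  # 7 h/day, 5 days/week, 4 weeks/month
human_month_mj = HUMAN_POWER_W * MONTH_S / 1e6

print(f"inference: {inference_mj:.1f} MJ")      # 42.0 MJ
print(f"human month: {human_month_mj:.1f} MJ")  # 50.4 MJ
```

Under these assumptions the two figures are indeed of the same order, which is the comment's point.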

So... yeah, maybe we have some benefit in terms of that X/Y ratio for smaller chunks of work - if we take care to break our problem down into the realm of where the AI remains coherent. But is it a /big/ benefit? Not in my experience - least, not if you can actually code already*. Also, that small benefit is coming with a *MAHOOSIVE* cost to the environment because of the exorbitant and unfixable energy costs, and also - as yet unquantifiable - future social costs.

* I do know some people using AI a lot to write code, cause their own skills aren't great. The resulting code is generally not that good and I am pretty sure is going to have to be almost entirely rewritten from scratch one day.

Linux Foundation

Posted Dec 15, 2025 18:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (17 responses)

> 10 minutes of an LLM reviewing your code is about 70 MJ of energy just on the inference

You're way off in your estimate. Running the inference takes less than a tenth of that (I think it might be closer to a hundredth). And 10 sustained minutes of review is enough to go through quite a large patch. If that saves just one compile-run-test cycle for a real human, it's likely worth it.

> Also, that small benefit is coming with a *MAHOOSIVE* cost to the environment because of the exorbitant and unfixable energy costs, and also - as yet unquantifiable - future social costs.

And AI still takes less energy than Bitcoin mining.

Linux Foundation

Posted Dec 16, 2025 1:30 UTC (Tue) by raven667 (subscriber, #5198) [Link] (12 responses)

> And AI still takes less energy than Bitcoin mining.

I don't have numbers at hand, but that doesn't seem right to me. The investment in, and power usage of, the "AI" datacenters we have now and have planned seems to dwarf the bitcoin bubble by quite a lot. People were spending billions to power crypto, but trillions - a significant fraction of the GDP of entire developed nations - to build AI.

Linux Foundation

Posted Dec 16, 2025 4:27 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (11 responses)

Some numbers: an average transaction on Bitcoin costs about 1.5 MJ of energy. The sustained global mining draw is probably around 25 GW (the range of estimates is 20 to 35 GW).

The sustained data-center draw in the US was about 18 GW in 2023 (so pre-AI). Global datacenter use was estimated at about 50 GW in 2023. I doubt that AI has doubled US data-center energy usage within the last two years.
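
As a sanity check on such sustained-draw figures, converting gigawatts of continuous draw to annual terawatt-hours is a one-liner; the 25 GW and 50 GW inputs below are the estimates from the comment above, not independent data:

```python
# Convert a sustained power draw (GW) into annual energy use (TWh/year).
HOURS_PER_YEAR = 8760

def gw_to_twh_per_year(gw: float) -> float:
    # GW * hours = GWh; divide by 1000 to get TWh
    return gw * HOURS_PER_YEAR / 1000

print(gw_to_twh_per_year(25))  # Bitcoin mining estimate: 219.0 TWh/year
print(gw_to_twh_per_year(50))  # global datacenters, 2023: 438.0 TWh/year
```

Those annualized totals are the form in which these estimates are usually reported, which makes the two comments' figures easier to compare against published numbers.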

Regarding that factoid about Ireland: it is true, but it omits the fact that datacenters were already using 22% of Irish electricity in 2023, again in the pre-AI era.

I don't doubt that AI will exceed Bitcoin's energy use in the near future, but at least AI is useful for something.

Linux Foundation

Posted Dec 16, 2025 11:17 UTC (Tue) by malmedal (subscriber, #56172) [Link] (9 responses)

> datacenters were already using 22% of Irish electricity in 2023,

Indeed. This is because of Irish behavior that I consider immoral: they attract big foreign companies with low taxes, so Ireland gets a lot of money relative to its size, while the companies can serve the whole EU. Also, the Irish don't bother paying for a proper military, relying instead on other states to provide their security.

It's very lucrative, though; Dublin is looking very good these days.

Linux Foundation

Posted Dec 16, 2025 13:56 UTC (Tue) by paulj (subscriber, #341) [Link] (8 responses)

No one is going to invade Ireland, because Ireland hasn't been part of, and avoids being part of, the UK-USA colonial-terrorist projects of the last X decades.

Try not being a nation that murders people abroad regularly.

Linux Foundation

Posted Dec 16, 2025 14:36 UTC (Tue) by rschroev (subscriber, #4164) [Link] (4 responses)

Like Ukraine or Venezuela, or countless other examples in history?

Linux Foundation

Posted Dec 16, 2025 15:43 UTC (Tue) by Wol (subscriber, #4433) [Link] (3 responses)

Given The Troubles, I have to be careful with this statement, but neither Ireland nor the UK shares a common border with a hostile neighbour (or even a potentially hostile one).

That's not to say you won't face invasion, but it does dramatically lower the odds ...

Cheers,
Wol

Linux Foundation

Posted Dec 16, 2025 16:48 UTC (Tue) by paulj (subscriber, #341) [Link] (2 responses)

Uh... wtf. Ireland was England's *original* colonial-terrorist project.

Linux Foundation

Posted Dec 16, 2025 17:32 UTC (Tue) by malmedal (subscriber, #56172) [Link] (1 responses)

> wtf. Ireland was England's *original* colonial-terrorist project.

Which makes Ireland's anti-military stance even stranger.

(also "original" is wrong, for instance "Harrying of the North" was earlier and "Rough Wooing" in the same timeframe)

Linux Foundation

Posted Dec 16, 2025 17:35 UTC (Tue) by daroc (editor, #160859) [Link]

I think this subthread has gotten pretty far afield from a discussion of the use of machine learning in kernel development.

Linux Foundation

Posted Dec 16, 2025 15:52 UTC (Tue) by malmedal (subscriber, #56172) [Link]

> No one is going to invade Ireland, cause Ireland hasn't been part of and avoids being part of, the UK-USA colonial-terrorist projects of the last X decades.

They are not at risk of being invaded because the UK does not want to invade, and would put a stop to it if anyone else tried.

> Try not being a nation that murders people abroad regularly.

This does not get you invaded, though. At most you get a few terrorist attacks.

What does get you invaded is being weak and having an expansionist neighbor.

Linux Foundation

Posted Dec 16, 2025 17:58 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

All it would take, given history, is England electing a government that believes the British Empire under Queen Victoria was a good thing, that has isolated us from our allies (so upsetting the USA and EU is no longer a concern), and that needs a "quick win" war (as Thatcher got in the Falkland Islands) to help with its unpopularity.

Given that set of circumstances - each of which is individually possible, albeit unlikely for some - Ireland could be invaded again by its nearest neighbour, aiming to reunite the island under one government by force. And, well, look at the last 800 or 900 years of history to see how often we've treated the Irish as lesser beings…

Linux Foundation

Posted Dec 16, 2025 19:26 UTC (Tue) by jzb (editor, #7867) [Link]

While potentially interesting, we've detoured far away from matters related to kernel development at this point. Let's please end the digression here. (This is not specific to this one comment or user, simply placing this plea here to try to end the thread.) Thanks!

Linux Foundation

Posted Dec 16, 2025 14:00 UTC (Tue) by paulj (subscriber, #341) [Link]

2023 was already in the AI era. We unfortunately don't have a breakdown of the industry's DC power use between AI and everything else. The forecast growth from 22% to ~30% over the next five years is surely largely AI.

Linux Foundation

Posted Dec 16, 2025 13:51 UTC (Tue) by paulj (subscriber, #341) [Link] (3 responses)

I don't think that is true, for a start. As I wrote, DCs are consuming a huge amount of power in my country, and that is projected to double because of AI. And as for the AI DCs being planned in the USA, the power for them simply doesn't exist.

Further, there is a qualitative difference between the energy bitcoin uses and that which goes to AI. Bitcoin seeks cheap energy, and the cheapest energy is that which gets generated but will not otherwise be used. Much of Bitcoin energy is off-peak base load and renewables (hydro, solar).

A built AI DC requires X amount of energy, and it has to be provided so that our tech-bro overlords can make their numbers. That takes energy away from other users - manufacturing, consumers - driving up prices for all. Bitcoin has no specific energy requirement; it generally uses what is spare (and it adjusts to what is available).

Linux Foundation

Posted Dec 16, 2025 16:46 UTC (Tue) by malmedal (subscriber, #56172) [Link] (1 responses)

> that takes energy away from other others - manufacturing, consumer - driving up prices for all.

This is exactly what happened with bitcoin in multiple places, e.g. Kazakhstan, Abkhazia, and Kosovo. The ensuing riots left some 200 dead in Kazakhstan, I believe.

Linux Foundation

Posted Dec 25, 2025 17:22 UTC (Thu) by rbtree (guest, #129790) [Link]

A bit late to the party, but you're right on point. The immediate cause of the protests was a steep increase in energy prices, which pushed people over the edge. The official figure of 200+ dead is very likely an underestimate.

Whatever problems the country has, we at least enjoyed cheap utilities (including electricity), which attracted significant mining “investment” from all over the world around 2020.

That has driven energy demand to the point that the country now imports most of the energy we use, mostly from Russian power plants, and prices have been increasing steadily, several times per year. And it's not caused by the economic crisis - we're no strangers to those, and I don't remember such steep increases… well, ever before. I believe we're now paying four times as much per kW·h as in 2020. Another increase is coming in five days.

So no, Bitcoin miners aren't much better than this newfangled “AI” industry. Maybe they absorb the otherwise unused surpluses in Texas oil and gas fields, but not here.

Linux Foundation

Posted Dec 16, 2025 19:40 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> Bitcoin seeks cheap energy, and the cheapest energy is that which gets generated but will not otherwise be used.

So does AI. Training jobs are designed to be checkpointed and fault-tolerant, and some datacenters already have load-shedding agreements: https://www.reuters.com/sustainability/boards-policy-regu...


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds