LWN: Comments on "What to do in response to a kernel warning" https://lwn.net/Articles/876209/ This is a special feed containing comments posted to the individual LWN article titled "What to do in response to a kernel warning". en-us Thu, 23 Oct 2025 16:11:39 +0000 Thu, 23 Oct 2025 16:11:39 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net What to do in response to a kernel warning https://lwn.net/Articles/878320/ https://lwn.net/Articles/878320/ Vipketsh <div class="FormattedComment"> Translated: they are turning a &quot;potential issue&quot; into &quot;definitely an issue&quot;. Depending on the situation that can be anything from great to downright bastard.<br> <p> If you are some cloud provider with a large number of machines (google, facebook, etc.), you probably have a bunch of redundancy and load balancing in your setup: if one machine goes down for whatever reason, even if just because of a &quot;potential issue&quot;, there are a bunch more available to take over, and the load balancing makes sure it happens. Furthermore, the cloud provider probably has engineers available to take a look at the issue in short order and return the machine to operation. In such a setting, taking machines down due to a &quot;potential issue&quot; is very well warranted.<br> <p> On the other hand, if you are using the machine in question as a primary work machine, the last thing you want is for a &quot;potential issue&quot; to take down your system and eat hours of work with it. In such a scenario, chugging along for as long as possible is the desired mode of operation.<br> <p> I think google &amp; co. have enough expertise on hand to tweak defaults to whatever makes sense for them, while individuals (using Linux as a primary work machine) very often do not. Thus the only sane thing is to keep the defaults at what makes sense for individuals: keep going for as long as possible. <br> <p> </div> Fri, 10 Dec 2021 20:32:46 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876955/ https://lwn.net/Articles/876955/ flussence <div class="FormattedComment"> Another one here - my laptop&#x27;s i915 driver spews messages about corrupt EDID data on every boot. It used to be at warning severity, but it seems they&#x27;ve downgraded it.<br> <p> But graphics drivers aside, I&#x27;ve had actual WARN_ONs recently too; one was a soft lockup timeout because an NFS mount tree went away during a system upgrade. I would rather not have a spontaneous panic/reboot there before I get a chance to ensure the reboot will actually succeed.<br> </div> Fri, 26 Nov 2021 23:50:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876747/ https://lwn.net/Articles/876747/ mathstuf <div class="FormattedComment"> Agreed. Warnings are good for things like &quot;it should work, but there may be something coming over the horizon&quot;. If it works, it should pass. If it fails, it should block things. However, there *is* a middle state of &quot;things are changing, adapt before it&#x27;s too late&quot; (e.g., deprecation warnings and subsequent removals). If you want to fail on warnings, it&#x27;s usually easy to do something like `-Werror` (however that ends up being spelled for the tool in question). But just blindly continuing on as if nothing is wrong on warnings is also not viable long-term.<br> <p> As an example, we test git master with our stuff. Currently we make it block if it fails because git is pretty good and doesn&#x27;t break things much (we&#x27;ve found one regression in 5 years). But if git weren&#x27;t so stable, we&#x27;d *still* want to know if breakages are on their way so at least we could move out of the way of whatever the light ahead of us in the tunnel turns out to be.<br> </div> Wed, 24 Nov 2021 01:41:18 +0000
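For readers unfamiliar with the mechanism mathstuf mentions above: GCC and Clang also accept the finer-grained `-Werror=category` form, which promotes a single warning category to a hard error while leaving the rest advisory. A minimal sketch (the file and function names here are hypothetical):<br> <p> ```<br>
/* deprecation.c - hypothetical example of one warning category.
 *
 * Compiled normally, the call below only warns:
 *     cc -c deprecation.c
 * Promoting just that category makes the build fail:
 *     cc -Werror=deprecated-declarations -c deprecation.c
 */
#include <stdio.h>

/* Pretend this is an API on its way out. */
__attribute__((deprecated("use new_api() instead")))
static int old_api(void)
{
        return 42;
}

int main(void)
{
        printf("%d\n", old_api());   /* warns; errors under -Werror=... */
        return 0;
}
<br> ```<br> <p> This is the &quot;middle state&quot; made enforceable: the same diagnostic can be advisory in one pipeline and blocking in another without touching the code.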
What to do in response to a kernel warning https://lwn.net/Articles/876746/ https://lwn.net/Articles/876746/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; &quot;Green&quot; is nothing more than an &quot;I&#x27;m finished&quot; indicator, and &quot;red&quot; means &quot;I was not able to finish.&quot; There&#x27;s no &quot;yellow&quot; state because the process is not allowed to &quot;half-finish.&quot;</font><br> <p> I&#x27;m sorry your CI is so limited. Ours runs tests and publishes dashboards and pass rates. I&#x27;ve seen others automatically track regressions and progressions.<br> </div> Wed, 24 Nov 2021 01:07:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876744/ https://lwn.net/Articles/876744/ NYKevin <div class="FormattedComment"> That&#x27;s because the purpose of a CI system is not to give you a red/green/yellow status indicator. It is to perform a sequence of actions which (usually) culminates either in producing final (tested) artifacts, or in actually pushing those artifacts to production (depending on what your process looks like). &quot;Green&quot; is nothing more than an &quot;I&#x27;m finished&quot; indicator, and &quot;red&quot; means &quot;I was not able to finish.&quot; There&#x27;s no &quot;yellow&quot; state because the process is not allowed to &quot;half-finish.&quot; Either it gives you the final end product/deployment (green), or it doesn&#x27;t (red). Yellow would just end up being green or red but with a different UI indication.<br> <p> It&#x27;s also important to bear in mind that you can have more than one CI pipeline for the same underlying codebase. For example, you could have a lenient CI pipeline that &quot;just&quot; pushes nightlies into your staging or QA instance, and then a more stringent pipeline that pushes into prod. You can even have a darklaunch pipeline that works just like the real prod pipeline, except that it never actually pushes anything, so that you can still have advance notice that &quot;this nightly is in staging now, but it won&#x27;t make it into prod when that push happens for real, because it triggered a warning and failed the darklaunch pipeline.&quot;<br> <p> The important thing to bear in mind is that it is *not enough* to just surface warnings in a UI somewhere. You need to have a systematic policy for what happens when a warning is triggered. Once such a policy exists, it can be enforced with code, regardless of what the UI looks like. But if there is no policy, you will get developers making ad-hoc judgment calls about whether a given warning is &quot;bad enough&quot; to stall the release for a day, or if we should just push it anyway.
As it turns out, humans are pretty bad at making such decisions, especially when the PM wants to get our new feature out yesterday and the warning is in some really hairy subsystem that nobody has properly understood in many years.<br> </div> Wed, 24 Nov 2021 00:16:18 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876695/ https://lwn.net/Articles/876695/ mm7323 <div class="FormattedComment"> For the cases where a WARN() can be attributed with certainty to some task, perhaps there should be a WARN_ON() variant which takes a struct task_struct, e.g. WARN_ON_TASK(). Then some policy may decide to dump, kill or deliver a signal to that process, avoiding the risk of killing something unrelated and risking more harm.<br> </div> Tue, 23 Nov 2021 10:38:08 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876692/ https://lwn.net/Articles/876692/ Fowl <div class="FormattedComment"> Microsoft Azure DevOps (horrible name for what used to be &quot;Team Foundation Server&quot;), aka Microsoft&#x27;s /other/ forge thing, does actually support warnings in its build/release pipelines.<br> <p> This page has some screenshots -&gt; <a href="https://github.com/melix-dev/azure-devops-dotnet-warnings#printscreens">https://github.com/melix-dev/azure-devops-dotnet-warnings...</a><br> </div> Tue, 23 Nov 2021 04:44:23 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876673/ https://lwn.net/Articles/876673/ mathstuf <div class="FormattedComment"> Agreed. My request to GitLab is here if anyone wants to &quot;vote&quot;: <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/219574">https://gitlab.com/gitlab-org/gitlab/-/issues/219574</a><br> </div> Mon, 22 Nov 2021 19:24:26 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876657/ https://lwn.net/Articles/876657/ ianmcc <div class="FormattedComment"> Same here - on my previous system I used to get regular warnings (with a dire-looking log message) from the NVIDIA proprietary drivers, but they never caused any actual problem, not even a noticeable graphics glitch.<br> </div> Mon, 22 Nov 2021 16:55:07 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876609/ https://lwn.net/Articles/876609/ marcH <div class="FormattedComment"> Meanwhile, not a single CI engine seems to support warnings. It&#x27;s either red or green and nothing in the middle. At best you get stderr in a different color in the logs (which no one opens when the status is green).<br> <p> <p> </div> Mon, 22 Nov 2021 15:55:11 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876601/ https://lwn.net/Articles/876601/ taladar <div class="FormattedComment"> This is an interesting discussion; however, when I read the headline I first thought of explanations similar to the ones the Rust compiler provides, giving the system administrator some guidance on what the warning actually means. That might be a useful thing to have as well.<br> </div> Mon, 22 Nov 2021 11:31:45 +0000
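A minimal sketch of what the WARN_ON_TASK() variant suggested by mm7323 above might look like, reusing the existing WARN() and send_sig() machinery. This is not a real kernel API: the macro name comes from the comment, and the sysctl_kill_on_warn policy knob is made up for illustration:<br> <p> ```<br>
/* Hypothetical sketch only; not a kernel API. WARN(), send_sig(), and
 * task_pid_nr() exist; WARN_ON_TASK() and sysctl_kill_on_warn do not.
 */
#include <linux/bug.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>

extern int sysctl_kill_on_warn;		/* hypothetical policy knob */

#define WARN_ON_TASK(condition, task)					\
({									\
	int __ret_warn_on = !!(condition);				\
	if (unlikely(__ret_warn_on)) {					\
		/* Name the task believed to be at fault... */		\
		WARN(1, "warning attributed to %s[%d]\n",		\
		     (task)->comm, task_pid_nr(task));			\
		/* ...so policy can target it, not whatever		\
		 * happens to be current when the warning fires. */	\
		if (sysctl_kill_on_warn)				\
			send_sig(SIGKILL, (task), 1);			\
	}								\
	unlikely(__ret_warn_on);					\
})
<br> ```<br> <p> The point, versus a blanket pkill_on_warn, is that the caller names the task it believes triggered the bad state rather than killing whichever process happens to be running.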
What to do in response to a kernel warning https://lwn.net/Articles/876595/ https://lwn.net/Articles/876595/ nix <div class="FormattedComment"> Looking at the WARN_ONs I&#x27;ve experienced here in the last half-decade, I had warnings on boot for about two years because of amdgpu multi-monitor problems which eventually got fixed: but while they were happening, reboot-on-warn would have been precisely the wrong thing to do, because the warnings never actually caused anything to go wrong, as far as I could tell. I had warnings on minor bcache gc problems which might cause, at most, a leak of a single bucket (4MiB), which in any case would go away upon the next gc after reboot: there&#x27;s no *way* that sort of thing would ever be worthy of a reboot unless it hit almost every bucket (it didn&#x27;t; it hit one out of ~90k), but a WARN_ON was used because, well, it&#x27;s worthy of a warning, right? In this case the warning was likely used as a &quot;please tell the developer&quot; flag, probably on the grounds that anything less would just be ignored.<br> <p> The only warning I ever had that was actually worthwhile was a warning from xfs which was immediately followed by a verifier failure flipping the affected fs to readonly. This was entirely correct, given that it would otherwise have resulted in fs corruption -- but I&#x27;m fairly sure a reboot would not have been preferable to the readonly-flipping it already did (though in the event the affected fs happened to be the rootfs, so there was little else I could do: but that need not have been the case).<br> <p> So, so far, here at least, the number of WARNs that should have resulted in a reboot is, uh, 0%.<br> </div> Sun, 21 Nov 2021 17:54:59 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876579/ https://lwn.net/Articles/876579/ dullfire <div class="FormattedComment"> I think it would be kind of nice if the kernel had an interface (maybe a character device) that was poll-able. And on oops, read(2) would return something like<br> <p> struct us_oops_event {<br> /* time that matches the kernel time stamp in dmesg of the event */<br> ktime_t time;<br> /* which thread the oops occurred on */<br> pid_t victim;<br> };<br> <p> <p> After which, userspace can make any decisions: maybe it knows the pid in question belongs to a specific process, or maybe it just forwards the event plus dmesg over the network (if syslog forwarding isn&#x27;t normally set up).<br> <p> Anyhow, that seems to me like the sanest thing the kernel can do (besides panicking, which is already an option).<br> </div> Sat, 20 Nov 2021 18:44:40 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876564/ https://lwn.net/Articles/876564/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; In practice, currently, the Linux kernel kills processes on oops. OOM killer kills processes. grsecurity also kills processes.</font><br> <p> Well...<br> <p> 1. oops is pretty darned unusual, in my experience.<br> 2. OOM killing is both unusual and at least tries* to target processes that might possibly be responsible for the OOM condition.<br> 3. I have no idea about grsecurity, but it&#x27;s not part of Linus&#x27;s tree, so I frankly don&#x27;t care what it does.<br> <p> * There is a difference between trying and succeeding, of course.<br> </div> Sat, 20 Nov 2021 02:38:37 +0000
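A sketch of how userspace might consume the hypothetical event device dullfire proposes above. Everything here is an assumption taken from that comment -- the /dev/oops_events node, the struct layout, and the one-event-per-read(2) semantics; no such interface exists in mainline:<br> <p> ```<br>
/* Hypothetical consumer of the proposed oops-event device. The device
 * node, struct layout, and read() semantics are assumptions from the
 * comment above; nothing like this exists in mainline.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

struct us_oops_event {
	int64_t time;	/* matches the dmesg timestamp (ktime_t in-kernel) */
	pid_t victim;	/* thread the oops occurred on */
};

int main(void)
{
	int fd = open("/dev/oops_events", O_RDONLY);	/* hypothetical node */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		struct us_oops_event ev;
		if (read(fd, &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
			break;
		/* Policy lives in userspace: log it, forward event+dmesg
		 * over the network, kill the victim, or reboot cleanly. */
		printf("oops at %lld, victim pid %d\n",
		       (long long)ev.time, (int)ev.victim);
	}
	close(fd);
	return 0;
}
<br> ```<br>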
What to do in response to a kernel warning https://lwn.net/Articles/876464/ https://lwn.net/Articles/876464/ geert <div class="FormattedComment"> But it doesn&#x27;t Fail-Fast if it kills the wrong thread. In general, there is no guarantee the bad state that caused the warning is limited to the thread(s) killed.<br> </div> Fri, 19 Nov 2021 09:21:14 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876459/ https://lwn.net/Articles/876459/ developer122 <div class="FormattedComment"> The fundamental problem here is a desire for compartmentalization that simply does not exist.<br> <p> It seems there are many different kinds of warn()s that can be triggered in different ways at different times. Some might happen in an interrupt handler, others might be the result of a filesystem or driver problem, and yet others might be the result of something actually triggered by userspace. So not only are warn()s not tagged by probable cause, but tagging them with that may be impossible.<br> <p> Warn()s also occur in the shared security boundary of &quot;all of kernel space.&quot; There is no theoretical limit to what state might be contaminated. It might be limited to only what is currently being processed. It might affect the whole subsystem. Or it could lead to corruption of anything else the kernel is doing and possibly take the whole system down. So not only are warn()s not tagged with their probable effects, but it&#x27;s quite possibly impossible to do so.<br> <p> Without knowing what caused a warn, and what effects it might have, it is impossible to take any decision more fine-grained than &quot;do nothing&quot; or &quot;reset the system.&quot; You can&#x27;t reliably kill the culprit or avoid the conditions that caused it. You can&#x27;t reliably determine the affected area and pick an appropriate mitigation (flushing a buffer, killing a process, killing a subsystem, or indeed shutting down a totally-compromised system).<br> <p> What the kernel devs desire is a system like minix, where the various subsystems are isolated enough that anything bad happening in one is reasonably assured not to have affected the others, and where the components are fine-grained enough that one experiencing a problem can be automatically killed without too much cost.
This is fundamentally impossible outside of a microkernel.<br> <p> I&#x27;m not saying that minix or other microkernels are better, but I am saying that what people desire is just not an option with the design choices that Linux has made.<br> </div> Fri, 19 Nov 2021 05:43:16 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876458/ https://lwn.net/Articles/876458/ xecycle <div class="FormattedComment"> Thanks; this way it feels safer to declare the system broken.<br> </div> Fri, 19 Nov 2021 04:25:08 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876457/ https://lwn.net/Articles/876457/ a13xp0p0v <div class="FormattedComment"> No, the output of WARN_ON() in the kernel log looks like this:<br> <p> ```<br> WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34<br> ...<br> CPU: 1 PID: 6739 Comm: racer Tainted: G W 5.10.11-200.fc33.x86_64 #1<br> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014<br> RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]<br> ...<br> RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246<br> RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80<br> RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0<br> RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000<br> R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008<br> R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0<br> FS: 00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000<br> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br> CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0<br> Call Trace:<br> virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]<br> vsock_update_buffer_size+0x5f/0x70 [vsock]<br> vsock_stream_setsockopt+0x128/0x270 [vsock]<br> ...<br> ```<br> </div> Fri, 19 Nov 2021 04:06:05 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876455/ https://lwn.net/Articles/876455/ pbonzini <div class="FormattedComment"> KVM recently added KVM_BUG_ON, which makes all subsequent ioctls return with EIO.<br> </div> Fri, 19 Nov 2021 02:51:04 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876450/ https://lwn.net/Articles/876450/ xecycle <div class="FormattedComment"> I don’t know whether journalctl -k -p warning means the same WARNING. I’m seeing warnings (and errors) on every boot saying the tsc is unstable and the clocksource is switching to hpet; so if it does mean the same, maybe I could not boot at all if this settled on “panic on warning”.<br> </div> Thu, 18 Nov 2021 23:26:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876448/ https://lwn.net/Articles/876448/ dambacher <div class="FormattedComment"> Maybe we should taint the affected subsystem on a WARN,<br> and make security-sensitive processes aware of this.<br> They can decide for themselves whether it is safe to continue or not. <br> </div> Thu, 18 Nov 2021 22:59:59 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876447/ https://lwn.net/Articles/876447/ a13xp0p0v <div class="FormattedComment"> That is the right question.<br> <p> In theory, userspace should be adapted to this kernel behavior.<br> <p> In practice, currently, the Linux kernel kills processes on oops. OOM killer kills processes. grsecurity also kills processes.<br> <p> pkill_on_warn simply stops the process when the first signs of wrong behavior are detected. That complies with the Fail-Fast principle.<br> Bugs usually don&#x27;t come alone, and a kernel warning may be followed by memory corruption or other negative effects. Real example:<br> <a href="https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html">https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html</a> <br> pkill_on_warn would prevent the exploit of this kernel vulnerability.<br> <p> It would also make kernel-warning infoleaks less valuable to an attacker. Exploit examples using such infoleaks:<br> <a href="https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html">https://googleprojectzero.blogspot.com/2018/09/a-cache-in...</a><br> <a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html</a><br> <p> Anyway, this patch has provoked a deep discussion.<br> Maybe one day the Linux kernel will get a more consistent error-handling policy.<br> </div> Thu, 18 Nov 2021 22:53:40 +0000
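For context on the knobs involved: kernel.panic_on_warn is a real sysctl today, while pkill_on_warn is the unmerged proposal under discussion here, so the exact sysctl path used for it below is an assumption. A sketch of selecting a warning policy at runtime (requires root):<br> <p> ```<br>
/* Sketch: choose the kernel's response to warnings at runtime.
 * /proc/sys/kernel/panic_on_warn is real; the pkill_on_warn path is an
 * assumption based on the proposed, unmerged patch.
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	int ok = (fputs(val, f) >= 0);
	fclose(f);
	return ok ? 0 : -1;
}

int main(void)
{
	/* Existing behavior: every warning becomes a panic (fleet-style
	 * policy, as discussed in the comments above). */
	if (write_sysctl("/proc/sys/kernel/panic_on_warn", "1"))
		perror("panic_on_warn");

	/* Proposed behavior: kill the running process instead
	 * (hypothetical path; not available on mainline kernels). */
	if (write_sysctl("/proc/sys/kernel/pkill_on_warn", "1"))
		perror("pkill_on_warn");

	return 0;
}
<br> ```<br>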
What to do in response to a kernel warning https://lwn.net/Articles/876445/ https://lwn.net/Articles/876445/ NYKevin <div class="FormattedComment"> In the past, &quot;I can make XScreenSaver crash by doing a weird thing&quot; has been a CVE (because if XScreenSaver crashes, the screen unlocks). If someone could find a semi-reliable way to generate a kernel warning while a specific process is executing, then similar attacks might become possible under a pkill_on_warn policy.<br> <p> Hopefully, the move to Wayland will obviate that specific instance of the problem, but in the more general case, how does the kernel know that it&#x27;s safe to kill the currently executing process? You might be causing a security problem instead of remedying it.<br> </div> Thu, 18 Nov 2021 22:19:30 +0000