LWN: Comments on "What to do in response to a kernel warning" https://lwn.net/Articles/876209/ This is a special feed containing comments posted to the individual LWN article titled "What to do in response to a kernel warning". en-us Thu, 23 Oct 2025 16:11:39 +0000 Thu, 23 Oct 2025 16:11:39 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net What to do in response to a kernel warning https://lwn.net/Articles/878320/ https://lwn.net/Articles/878320/ Vipketsh <div class="FormattedComment"> Translated: they are turning a &quot;potential issue&quot; into &quot;definitely an issue&quot;. Depending on the situation that can be anything from great to downright bastard.<br> <p> If you are some cloud provider with a large number of machines (google, facebook, etc.), you probably have a bunch of redundancy and load balancing in your setup: if one machine goes down for whatever reason, even if just because of a &quot;potential issue&quot;, there are a bunch more available to take over, and the load balancing makes sure it happens. Furthermore, the cloud provider probably has engineers available to take a look at the issue in short order and return the machine to operation. In such a setting, taking machines down due to a &quot;potential issue&quot; is very well warranted.<br> <p> On the other hand, if you are using the machine in question as a primary work machine, the last thing you want is for a &quot;potential issue&quot; to take down your system and eat hours of work with it. In such a scenario, chugging along for as long as possible is the desired mode of operation.<br> <p> I think google &amp; co. have enough expertise on hand to tweak defaults to whatever makes sense for them, while individuals (using Linux as a primary work machine) very often do not. Thus the only sane thing is to keep the defaults at what makes sense for individuals: keep going for as long as possible. <br> <p> </div> Fri, 10 Dec 2021 20:32:46 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876955/ https://lwn.net/Articles/876955/ flussence <div class="FormattedComment"> Another one here - my laptop&#x27;s i915 driver spews messages about corrupt EDID data on every boot. It used to be at warning severity, but it seems they&#x27;ve downgraded it.<br> <p> But graphics drivers aside, I&#x27;ve had actual WARN_ONs recently too; one was a soft lockup timeout because an NFS mount tree went away during a system upgrade. I would rather not have a spontaneous panic/reboot there before I get a chance to ensure the reboot will actually succeed.<br> </div> Fri, 26 Nov 2021 23:50:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876747/ https://lwn.net/Articles/876747/ mathstuf <div class="FormattedComment"> Agreed. Warnings are good for things like &quot;it should work, but there may be something coming over the horizon&quot;. If it works, it should pass. If it fails, it should block things. However, there *is* a middle state of &quot;things are changing, adapt before it&#x27;s too late&quot; (e.g., deprecation warnings and subsequent removals). If you want to fail on warnings, it&#x27;s usually easy to do something like `-Werror` (however that ends up being spelled for the tool in question). But just blindly continuing on as if nothing is wrong on warnings is also not viable long-term.<br> <p> As an example, we test git master with our stuff. Currently we make it block if it fails because git is pretty good and doesn&#x27;t break things much (we&#x27;ve found one regression in 5 years). But if git weren&#x27;t so stable, we&#x27;d *still* want to know if breakages are on their way so at least we could move out of the way of whatever the light ahead of us in the tunnel turns out to be.<br> </div> Wed, 24 Nov 2021 01:41:18 +0000
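For readers unfamiliar with the mechanism mathstuf mentions above: GCC and Clang also accept the finer-grained `-Werror=category` form, which promotes a single warning category to a hard error while leaving the rest advisory. A minimal sketch (the file and function names here are hypothetical):<br> <p> ```<br>
/* deprecation.c - hypothetical example of one warning category.
 *
 * Compiled normally, the call below only warns:
 *     cc -c deprecation.c
 * Promoting just that category makes the build fail:
 *     cc -Werror=deprecated-declarations -c deprecation.c
 */
#include <stdio.h>

/* Pretend this is an API on its way out. */
__attribute__((deprecated("use new_api() instead")))
static int old_api(void)
{
        return 42;
}

int main(void)
{
        printf("%d\n", old_api());   /* warns; errors under -Werror=... */
        return 0;
}
<br> ```<br> <p> This is the &quot;middle state&quot; made enforceable: the same diagnostic can be advisory in one pipeline and blocking in another without touching the code.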
What to do in response to a kernel warning https://lwn.net/Articles/876746/ https://lwn.net/Articles/876746/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; &quot;Green&quot; is nothing more than an &quot;I&#x27;m finished&quot; indicator, and &quot;red&quot; means &quot;I was not able to finish.&quot; There&#x27;s no &quot;yellow&quot; state because the process is not allowed to &quot;half-finish.&quot;</font><br> <p> I&#x27;m sorry your CI is so limited. Ours runs tests and publishes dashboards and pass rates. I&#x27;ve seen others automatically track regressions and progressions.<br> </div> Wed, 24 Nov 2021 01:07:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876744/ https://lwn.net/Articles/876744/ NYKevin <div class="FormattedComment"> That&#x27;s because the purpose of a CI system is not to give you a red/green/yellow status indicator. It is to perform a sequence of actions which (usually) culminates either in producing final (tested) artifacts, or in actually pushing those artifacts to production (depending on what your process looks like). &quot;Green&quot; is nothing more than an &quot;I&#x27;m finished&quot; indicator, and &quot;red&quot; means &quot;I was not able to finish.&quot; There&#x27;s no &quot;yellow&quot; state because the process is not allowed to &quot;half-finish.&quot; Either it gives you the final end product/deployment (green), or it doesn&#x27;t (red). Yellow would just end up being green or red but with a different UI indication.<br> <p> It&#x27;s also important to bear in mind that you can have more than one CI pipeline for the same underlying codebase. For example, you could have a lenient CI pipeline that &quot;just&quot; pushes nightlies into your staging or QA instance, and then a more stringent pipeline that pushes into prod. You can even have a darklaunch pipeline that works just like the real prod pipeline, except that it never actually pushes anything, so that you can still have advance notice that &quot;this nightly is in staging now, but it won&#x27;t make it into prod when that push happens for real, because it triggered a warning and failed the darklaunch pipeline.&quot;<br> <p> The important thing to bear in mind is that it is *not enough* to just surface warnings in a UI somewhere. You need to have a systematic policy for what happens when a warning is triggered. Once such a policy exists, it can be enforced with code, regardless of what the UI looks like. But if there is no policy, you will get developers making ad-hoc judgment calls about whether a given warning is &quot;bad enough&quot; to stall the release for a day, or if we should just push it anyway.
As it turns out, humans are pretty bad at making such decisions, especially when the PM wants to get our new feature out yesterday and the warning is in some really hairy subsystem that nobody has properly understood in many years.<br> </div> Wed, 24 Nov 2021 00:16:18 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876695/ https://lwn.net/Articles/876695/ mm7323 <div class="FormattedComment"> For the cases where a WARN() can be attributed with certainty to some task, perhaps there should be a WARN_ON() variant which takes a struct task_struct, e.g. WARN_ON_TASK(). Then some policy may decide to dump, kill or deliver a signal to that process, avoiding the risk of killing something unrelated and risking more harm.<br> </div> Tue, 23 Nov 2021 10:38:08 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876692/ https://lwn.net/Articles/876692/ Fowl <div class="FormattedComment"> Microsoft Azure DevOps (horrible name for what used to be &quot;Team Foundation Server&quot;), aka Microsoft&#x27;s /other/ forge thing, does actually support warnings in its build/release pipelines.<br> <p> This page has some screenshots -&gt; <a href="https://github.com/melix-dev/azure-devops-dotnet-warnings#printscreens">https://github.com/melix-dev/azure-devops-dotnet-warnings...</a><br> </div> Tue, 23 Nov 2021 04:44:23 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876673/ https://lwn.net/Articles/876673/ mathstuf <div class="FormattedComment"> Agreed. My request to GitLab is here if anyone wants to &quot;vote&quot;: <a href="https://gitlab.com/gitlab-org/gitlab/-/issues/219574">https://gitlab.com/gitlab-org/gitlab/-/issues/219574</a><br> </div> Mon, 22 Nov 2021 19:24:26 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876657/ https://lwn.net/Articles/876657/ ianmcc <div class="FormattedComment"> Same here - on my previous system I used to get regular warnings (with a dire-looking log message) from the NVIDIA proprietary drivers, but they never caused any actual problem, not even a noticeable graphics glitch.<br> </div> Mon, 22 Nov 2021 16:55:07 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876609/ https://lwn.net/Articles/876609/ marcH <div class="FormattedComment"> Meanwhile, not a single CI engine seems to support warnings. It&#x27;s either red or green and nothing in the middle. At best you get stderr in a different color in the logs (which no one opens when the status is green).<br> <p> <p> </div> Mon, 22 Nov 2021 15:55:11 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876601/ https://lwn.net/Articles/876601/ taladar <div class="FormattedComment"> This is an interesting discussion; however, when I read the headline I first thought of explanations similar to the ones the Rust compiler provides, giving the system administrator some guidance on what the warning actually means. That might be a useful thing to have as well.<br> </div> Mon, 22 Nov 2021 11:31:45 +0000
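A minimal sketch of what the WARN_ON_TASK() variant suggested by mm7323 above might look like, reusing the existing WARN() and send_sig() machinery. This is not a real kernel API: the macro name comes from the comment, and the sysctl_kill_on_warn policy knob is made up for illustration:<br> <p> ```<br>
/* Hypothetical sketch only; not a kernel API. WARN(), send_sig(), and
 * task_pid_nr() exist; WARN_ON_TASK() and sysctl_kill_on_warn do not.
 */
#include <linux/bug.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>

extern int sysctl_kill_on_warn;		/* hypothetical policy knob */

#define WARN_ON_TASK(condition, task)					\
({									\
	int __ret_warn_on = !!(condition);				\
	if (unlikely(__ret_warn_on)) {					\
		/* Name the task believed to be at fault... */		\
		WARN(1, "warning attributed to %s[%d]\n",		\
		     (task)->comm, task_pid_nr(task));			\
		/* ...so policy can target it, not whatever		\
		 * happens to be current when the warning fires. */	\
		if (sysctl_kill_on_warn)				\
			send_sig(SIGKILL, (task), 1);			\
	}								\
	unlikely(__ret_warn_on);					\
})
<br> ```<br> <p> The point, versus a blanket pkill_on_warn, is that the caller names the task it believes triggered the bad state rather than killing whichever process happens to be running.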
What to do in response to a kernel warning https://lwn.net/Articles/876595/ https://lwn.net/Articles/876595/ nix <div class="FormattedComment"> Looking at the WARN_ONs I&#x27;ve experienced here in the last half-decade, I had warnings on boot for about two years because of amdgpu multi-monitor problems which eventually got fixed: but while they were happening, reboot-on-warn would have been precisely the wrong thing to do, because the warnings never actually caused anything to go wrong, as far as I could tell. I had warnings on minor bcache gc problems which might cause, at most, a leak of a single bucket (4MiB), which in any case would go away upon the next gc after reboot: there&#x27;s no *way* that sort of thing would ever be worthy of a reboot unless it hit almost every bucket (it didn&#x27;t; it hit one out of ~90k), but a WARN_ON was used because, well, it&#x27;s worthy of a warning, right? In this case the warning was likely used as a &quot;please tell the developer&quot; flag, probably on the grounds that anything less would just be ignored.<br> <p> The only warning I ever had that was actually worthwhile was a warning from xfs which was immediately followed by a verifier failure flipping the affected fs to readonly. This was entirely correct, given that it would otherwise have resulted in fs corruption -- but I&#x27;m fairly sure a reboot would not have been preferable to the readonly-flipping it already did (though in the event the affected fs happened to be the rootfs, so there was little else I could do: but that need not have been the case).<br> <p> So, so far, here at least, the number of WARNs that should have resulted in a reboot is, uh, 0%.<br> </div> Sun, 21 Nov 2021 17:54:59 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876579/ https://lwn.net/Articles/876579/ dullfire <div class="FormattedComment"> I think it would be kind of nice if the kernel had an interface (maybe a character device) that was poll-able. And on oops, read(2) would return something like<br> <p> struct us_oops_event {<br> /* time that matches the kernel time stamp in dmesg of the event */<br> ktime_t time;<br> /* which thread the oops occurred on */<br> pid_t victim;<br> };<br> <p> <p> After which, userspace can make any decisions: maybe it knows the pid in question belongs to a specific process, or maybe it just forwards the event plus dmesg over the network (if syslog forwarding isn&#x27;t normally set up).<br> <p> Anyhow, that seems to me like the sanest thing the kernel can do (besides panicking, which is already an option).<br> </div> Sat, 20 Nov 2021 18:44:40 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876564/ https://lwn.net/Articles/876564/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; In practice, currently, the Linux kernel kills processes on oops. OOM killer kills processes. grsecurity also kills processes.</font><br> <p> Well...<br> <p> 1. oops is pretty darned unusual, in my experience.<br> 2. OOM killing is both unusual and at least tries* to target processes that might possibly be responsible for the OOM condition.<br> 3. I have no idea about grsecurity, but it&#x27;s not part of Linus&#x27;s tree, so I frankly don&#x27;t care what it does.<br> <p> * There is a difference between trying and succeeding, of course.<br> </div> Sat, 20 Nov 2021 02:38:37 +0000
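A sketch of how userspace might consume the hypothetical event device dullfire proposes above. Everything here is an assumption taken from that comment -- the /dev/oops_events node, the struct layout, and the one-event-per-read(2) semantics; no such interface exists in mainline:<br> <p> ```<br>
/* Hypothetical consumer of the proposed oops-event device. The device
 * node, struct layout, and read() semantics are assumptions from the
 * comment above; nothing like this exists in mainline.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

struct us_oops_event {
	int64_t time;	/* matches the dmesg timestamp (ktime_t in-kernel) */
	pid_t victim;	/* thread the oops occurred on */
};

int main(void)
{
	int fd = open("/dev/oops_events", O_RDONLY);	/* hypothetical node */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		struct us_oops_event ev;
		if (read(fd, &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
			break;
		/* Policy lives in userspace: log it, forward event+dmesg
		 * over the network, kill the victim, or reboot cleanly. */
		printf("oops at %lld, victim pid %d\n",
		       (long long)ev.time, (int)ev.victim);
	}
	close(fd);
	return 0;
}
<br> ```<br>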
What to do in response to a kernel warning https://lwn.net/Articles/876464/ https://lwn.net/Articles/876464/ geert <div class="FormattedComment"> But it doesn&#x27;t Fail-Fast if it kills the wrong thread. In general, there is no guarantee the bad state that caused the warning is limited to the thread(s) killed.<br> </div> Fri, 19 Nov 2021 09:21:14 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876459/ https://lwn.net/Articles/876459/ developer122 <div class="FormattedComment"> The fundamental problem here is a desire for compartmentalization that simply does not exist.<br> <p> It seems there are many different kinds of warn()s that can be triggered in different ways at different times. Some might happen in an interrupt handler, others might be the result of a filesystem or driver problem, and yet others might be the result of something actually triggered by userspace. So not only are warn()s not tagged by probable cause, but tagging them with that may be impossible.<br> <p> Warn()s also occur in the shared security boundary of &quot;all of kernel space.&quot; There is no theoretical limit to what state might be contaminated. It might be limited to only what is currently being processed. It might affect the whole subsystem. Or it could lead to corruption of anything else the kernel is doing and possibly take the whole system down. So not only are warn()s not tagged with their probable effects, but it&#x27;s quite possibly impossible to do so.<br> <p> Without knowing what caused a warn, and what effects it might have, it is impossible to take any decision more fine-grained than &quot;do nothing&quot; or &quot;reset the system.&quot; You can&#x27;t reliably kill the culprit or avoid the conditions that caused it. You can&#x27;t reliably determine the affected area and pick an appropriate mitigation (flushing a buffer, killing a process, killing a subsystem, or indeed shutting down a totally-compromised system).<br> <p> What the kernel devs desire is a system like minix, where the various subsystems are isolated enough that anything bad happening in one is reasonably assured not to have affected the others, and where the components are fine-grained enough that one experiencing a problem can be automatically killed without too much cost.
This is fundamentally impossible outside of a microkernel.<br> <p> I&#x27;m not saying that minix or other microkernels are better, but I am saying that what people desire is just not an option with the design choices that Linux has made.<br> </div> Fri, 19 Nov 2021 05:43:16 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876458/ https://lwn.net/Articles/876458/ xecycle <div class="FormattedComment"> Thanks; this way it feels safer to declare the system broken.<br> </div> Fri, 19 Nov 2021 04:25:08 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876457/ https://lwn.net/Articles/876457/ a13xp0p0v <div class="FormattedComment"> No, the output of WARN_ON() in the kernel log looks like this:<br> <p> ```<br> WARNING: CPU: 1 PID: 6739 at net/vmw_vsock/virtio_transport_common.c:34<br> ...<br> CPU: 1 PID: 6739 Comm: racer Tainted: G W 5.10.11-200.fc33.x86_64 #1<br> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014<br> RIP: 0010:virtio_transport_send_pkt_info+0x14d/0x180 [vmw_vsock_virtio_transport_common]<br> ...<br> RSP: 0018:ffffc90000d07e10 EFLAGS: 00010246<br> RAX: 0000000000000000 RBX: ffff888103416ac0 RCX: ffff88811e845b80<br> RDX: 00000000ffffffff RSI: ffffc90000d07e58 RDI: ffff888103416ac0<br> RBP: 0000000000000000 R08: 00000000052008af R09: 0000000000000000<br> R10: 0000000000000126 R11: 0000000000000000 R12: 0000000000000008<br> R13: ffffc90000d07e58 R14: 0000000000000000 R15: ffff888103416ac0<br> FS: 00007f2f123d5640(0000) GS:ffff88817bd00000(0000) knlGS:0000000000000000<br> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br> CR2: 00007f81ffc2a000 CR3: 000000011db96004 CR4: 0000000000370ee0<br> Call Trace:<br> virtio_transport_notify_buffer_size+0x60/0x70 [vmw_vsock_virtio_transport_common]<br> vsock_update_buffer_size+0x5f/0x70 [vsock]<br> vsock_stream_setsockopt+0x128/0x270 [vsock]<br> ...<br> ```<br> </div> Fri, 19 Nov 2021 04:06:05 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876455/ https://lwn.net/Articles/876455/ pbonzini <div class="FormattedComment"> KVM recently added KVM_BUG_ON, which makes all subsequent ioctls return with EIO.<br> </div> Fri, 19 Nov 2021 02:51:04 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876450/ https://lwn.net/Articles/876450/ xecycle <div class="FormattedComment"> I don’t know whether journalctl -k -p warning means the same WARNING. I’m seeing warnings (and errors) on every boot saying the tsc is unstable and the clocksource is switching to hpet; so if it does mean the same, maybe I could not boot at all if this settled on “panic on warning”.<br> </div> Thu, 18 Nov 2021 23:26:48 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876448/ https://lwn.net/Articles/876448/ dambacher <div class="FormattedComment"> Maybe we should taint the affected subsystem on a WARN,<br> and make security-sensitive processes aware of this.<br> They can decide for themselves whether it is safe to continue or not. <br> </div> Thu, 18 Nov 2021 22:59:59 +0000 What to do in response to a kernel warning https://lwn.net/Articles/876447/ https://lwn.net/Articles/876447/ a13xp0p0v <div class="FormattedComment"> That is the right question.<br> <p> In theory, userspace should be adapted to this kernel behavior.<br> <p> In practice, currently, the Linux kernel kills processes on oops. OOM killer kills processes. grsecurity also kills processes.<br> <p> pkill_on_warn simply stops the process when the first signs of wrong behavior are detected. That complies with the Fail-Fast principle.<br> Bugs usually don&#x27;t come alone, and a kernel warning may be followed by memory corruption or other negative effects. Real example:<br> <a href="https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html">https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html</a> <br> pkill_on_warn would prevent the exploit of this kernel vulnerability.<br> <p> It would also make kernel-warning infoleaks less valuable to an attacker. Exploit examples using such infoleaks:<br> <a href="https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html">https://googleprojectzero.blogspot.com/2018/09/a-cache-in...</a><br> <a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html</a><br> <p> Anyway, this patch has provoked a deep discussion.<br> Maybe one day the Linux kernel will get a more consistent error-handling policy.<br> </div> Thu, 18 Nov 2021 22:53:40 +0000
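For context on the knobs involved: kernel.panic_on_warn is a real sysctl today, while pkill_on_warn is the unmerged proposal under discussion here, so the exact sysctl path used for it below is an assumption. A sketch of selecting a warning policy at runtime (requires root):<br> <p> ```<br>
/* Sketch: choose the kernel's response to warnings at runtime.
 * /proc/sys/kernel/panic_on_warn is real; the pkill_on_warn path is an
 * assumption based on the proposed, unmerged patch.
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	int ok = (fputs(val, f) >= 0);
	fclose(f);
	return ok ? 0 : -1;
}

int main(void)
{
	/* Existing behavior: every warning becomes a panic (fleet-style
	 * policy, as discussed in the comments above). */
	if (write_sysctl("/proc/sys/kernel/panic_on_warn", "1"))
		perror("panic_on_warn");

	/* Proposed behavior: kill the running process instead
	 * (hypothetical path; not available on mainline kernels). */
	if (write_sysctl("/proc/sys/kernel/pkill_on_warn", "1"))
		perror("pkill_on_warn");

	return 0;
}
<br> ```<br>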
What to do in response to a kernel warning https://lwn.net/Articles/876445/ https://lwn.net/Articles/876445/ NYKevin <div class="FormattedComment"> In the past, &quot;I can make XScreenSaver crash by doing a weird thing&quot; has been a CVE (because if XScreenSaver crashes, the screen unlocks). If someone could find a semi-reliable way to generate a kernel warning while a specific process is executing, then similar attacks might become possible under a pkill_on_warn policy.<br> <p> Hopefully, the move to Wayland will obviate that specific instance of the problem, but in the more general case, how does the kernel know that it&#x27;s safe to kill the currently executing process? You might be causing a security problem instead of remedying it.<br> </div> Thu, 18 Nov 2021 22:19:30 +0000