LWN: Comments on "Averting excessive oopses"
https://lwn.net/Articles/914878/
This is a special feed containing comments posted to the individual LWN article titled "Averting excessive oopses".

Averting excessive oopses (https://lwn.net/Articles/919113/)
sammythesnake - Thu, 05 Jan 2023 14:50:18 +0000

There's a point where "very long running" bumps up against the mean time between kernel upgrades you don't want to miss. Longer running than that is probably not something to aspire to!

Provided the "cosmic ray" and "meh - non-critical subsystem" oopsen don't add up to 10,000 much more quickly than that, your reboot rate will probably be pretty unaffected, I'd have thought.

For my own use case, the most common cause of reboots is accidentally knocking the power cable loose while hunting for something that fell down the back of my desk, so this is all a little academic for me :-P

Averting excessive oopses (https://lwn.net/Articles/916474/)
farnz - Thu, 01 Dec 2022 18:05:58 +0000

It's also worth noting that this is an optional mechanism; if your system is meant to be extremely long running, and to keep going through oopses, you'd turn this off. But for a use case like Twitter (or Amazon Web Services, or Facebook, or Google), this mechanism will result in servers that are accumulating software faults panicking, and thus rebooting, before things get serious - and you'll already have other monitoring that says "this server is rebooting frequently, take it out of service".

Averting excessive oopses (https://lwn.net/Articles/916471/)
esemwy - Thu, 01 Dec 2022 17:54:19 +0000

OK, thanks. That makes a lot more sense then. The only remaining concern would be on extremely long running systems, and I expect the amount of monitoring those receive is sufficient that using a hard limit rather than a timed window shouldn't matter.

Averting excessive oopses (https://lwn.net/Articles/916450/)
farnz - Thu, 01 Dec 2022 16:57:01 +0000

You've already got a denial of service if you can trigger oopses at will - oops handling is not free, and if two or more CPUs hit an oops at the same time, one CPU is chosen as the oops handler while the others spin in kernel space, preventing you from doing anything useful until the oops is handled. If I can trigger oopses freely, I can have all but one core spinning in the kernel, while the one that's trying to do work keeps re-entering the oops handler and not making progress.

Plus, the nature of oopses is that some are one-off ("cosmic ray") events that never recur, while others leave the system in a state where doing the same thing will result in another oops. The older knob (panic_on_oops) lets you turn all oopses into panics, so that the machine comes back to a known state before trying again. This knob allows you to attempt to distinguish cosmic-ray events from broken important subsystems, so that (e.g.) if I can reliably oops you because part of the TCP stack has trashed the state for one tx queue, and you'll oops every time you send data, you'll panic; but if the oops is because the driver author for my NIC didn't think about what would happen if I tried to tx during a NIC reset, and everything will be fine after the NIC reset completes, well, that's a rare event and I'll recover.

And to some extent, the key difference between an oops and a panic is that an oops says "the system is partially broken; some things still work, but other things are permanently broken", while a panic says "the system is beyond recovery, short of a complete reboot". On my laptop, for example, an oops that occurs every time my TCP stack tries to handle a receive window of more than 8 bytes is a pain, but I can deal - save files (NVMe is unaffected by a broken TCP stack), and reboot. If I lose my input subsystem, however, I'm stuck - if I can't tell the computer what to do, all I can do is reset it. On a network server at Twitter, however, losing the input subsystem is no big deal, but a network stack that can't handle a receive window of more than 8 bytes is effectively useless.
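The distinction drawn above - absorbing rare one-off oopses while still giving up on a fault that recurs on every operation - is essentially what a cumulative oops counter buys you. The following is a minimal userspace sketch of that policy, not the kernel's actual code: panic() and oops() here are stand-ins, and only the 10,000 default comes from the article.

    /*
     * Userspace sketch of the policy under discussion: count every oops and
     * give up (panic) once a cumulative limit is crossed.  Simulation only;
     * the 10,000 default is from the article, everything else is invented.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define OOPS_LIMIT 10000UL

    static unsigned long oops_count;

    /* Stand-in for the kernel's panic(): report and stop. */
    static void panic(const char *why)
    {
        fprintf(stderr, "panic: %s (after %lu oopses)\n", why, oops_count);
        exit(EXIT_FAILURE);
    }

    /* Stand-in for the oops path: record the event, then check the limit. */
    static void oops(const char *what)
    {
        if (++oops_count >= OOPS_LIMIT)
            panic(what);
    }

    int main(void)
    {
        /* A few scattered "cosmic ray" events barely move the counter... */
        for (int i = 0; i < 3; i++)
            oops("one-off fault in a non-critical driver");
        printf("still running after %lu oopses\n", oops_count);

        /* ...but a fault hit on every operation exhausts the budget quickly. */
        for (;;)
            oops("TCP tx path keeps tripping over trashed queue state");
    }

In this toy version, as in the mechanism the article describes, a handful of unrelated oopses over a machine's lifetime cost almost nothing, while a fault that fires on every operation drives the system to a clean reboot long before the counter becomes meaningless.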
Averting excessive oopses (https://lwn.net/Articles/916447/)
esemwy - Thu, 01 Dec 2022 15:42:05 +0000

Isn't this turning any bug into an immediate denial-of-service? I understand the concern that there may be an information leak or multiple tries on the same intermittent vulnerability, but given that the system owner is rarely in a position to actually make the fix, this seems like a pretty crude solution.

Is there something I'm missing?

Averting excessive oopses (https://lwn.net/Articles/916367/)
Wol - Wed, 30 Nov 2022 16:30:27 +0000

> Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.

> The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation,

Even with a single, "pet" computer, automating admin where possible just makes sense. My /usr/local/bin has a few scripts (I ought to write more) that do assorted things like creating lvm volumes to assist my weekly backups/emerges, and there's plenty of stuff I ought to do regularly that scripts would help with.

Cheers,
Wol

Averting excessive oopses (https://lwn.net/Articles/916279/)
farnz - Wed, 30 Nov 2022 12:12:37 +0000

Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.

The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation, through to the "cattle", where you have so many servers to look after that doing admin by hand is simply too time-consuming, and thus you must automate everything so that any server that leaves the "expected" state is either repaired by automation or kicked out as "definite hardware fault" by automation, with clear instructions on how to fix or replace the faulty hardware.

Where you need to be on that line depends on how much time you have to do admin. If you have 40 hours a week to look after a single server that runs a single service, then pets are absolutely fine. If you have an hour a month to look after 10,000 servers, then you need to treat them as cattle. If you have 8 hours a week and 10 servers, then you're probably somewhere in the middle - you want to have the baseline dealt with by automation (things like checking for hardware failures, making sure the underlying OS is up to date, etc.), but you can treat the services on top of the hardware (the things you care about, like a database setup, a web server, your organisation's secret special code, etc.) as pets.

And it's a good rule of thumb to assume that anything you do more than 3 times should be automated. For a small shop, this means that the things that recur have you treating the servers as cattle (why care about manually checking whether both disks are present in the RAID-1, or that btrfs checksums match data, when the server can do that automatically and notify you when bad things happen? Why manually choose all the options in the installer, when kickstart or equivalents can do that for you?), but once-offs (there's been a schema change from the 2021-09 to the 2021-10 deploy, needs manual application) are done by hand, treating the affected servers as pets.
Averting excessive oopses (https://lwn.net/Articles/916237/)
jccleaver - Tue, 29 Nov 2022 22:47:36 +0000

> The key is that automation handles everything if you have cattle, not pets.

At the risk of mixing metaphors further: you can automate pets as well. Plenty of shops use robust config management but don't (for a variety of different reasons) presume that indefinite horizontal scaling is the proper solution to all of life's problems, and thus can have a lot of systems running where n <= 4 for a given system cattle type, which, when combined with data locality, master/standby, or prod/test, is very much in the pet range for whatever number your tweaks go to.

Not everyone is Google Scale, and designing for Google Scale -- or creating enough complexity so that you can self-fulfillingly justify designing for Google Scale -- constitutes owning only the world's largest k8s hammer.

Averting excessive oopses (https://lwn.net/Articles/916207/)
Wol - Tue, 29 Nov 2022 16:14:54 +0000

> With a fleet of cattle, as opposed to a fleet of pets,

Shouldn't that be a hurd?

(G, D&R)

Cheers,
Wol

Averting excessive oopses (https://lwn.net/Articles/916163/)
farnz - Tue, 29 Nov 2022 11:27:39 +0000

With a fleet of cattle, as opposed to a fleet of pets, you don't presume that all wear and tear is identical - you have monitoring that measures the state of each machine individually, and handles it. E.g. your monitoring detects that machine 10,994,421 is frequently throwing ECC-corrected errors, and it migrates all jobs off that machine and sends it off for repair. Or you detect filesystem corruption, and migrate everything off, then send that machine to repair.

The key is that automation handles everything if you have cattle, not pets. Instead of knowing that http10294 is a unique snowflake with certain problems and fixes for those problems, you have your automation look at the state of all the hardware, and identify and deal with the problems of the machines en masse.

Averting excessive oopses (https://lwn.net/Articles/916159/)
jccleaver - Tue, 29 Nov 2022 00:41:19 +0000

Fleet, not cattle. If your boxes have state, data locality, and preferred workloads, then they develop individual quirks. Doesn't mean they have to be babied like your pal Spot, but presuming all wear and tear has been identical forever seems ill-advised.

Averting excessive oopses (https://lwn.net/Articles/915876/)
flussence - Thu, 24 Nov 2022 12:45:11 +0000

Yes, specifically make LLVM=1 with the full-lto option on LLVM 15, x86-64. It had been happening over a few weeks, and my working theory was that the power supply was getting marginal with age. Then on a whim I made the next update plain GCC and it's been fine ever since.

There are good reasons why that code is impossible to activate by accident, so I can't complain too much.

Averting excessive oopses (https://lwn.net/Articles/915667/)
KaiRo - Tue, 22 Nov 2022 00:30:43 +0000

The sysfs exposure of oops and warning counts sounds really great, and I hope a lot of monitoring tools (including stuff on desktops) will look at it, as I'm sure most of those events are simply not noticed at all right now - people just aren't regularly reading logs, and if a log entry is all that happens, it will go unnoticed.
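Those counters are cheap to consume: a monitoring agent only has to read two small files and notice when the numbers grow. A minimal sketch, assuming the counts appear as /sys/kernel/oops_count and /sys/kernel/warn_count (my reading of where the patches put them; adjust the paths if your kernel exposes them elsewhere):

    /*
     * Minimal poller for the oops/warning counts discussed above.  Assumes
     * each file contains a single integer; prints a line whenever either
     * count grows.  The sysfs paths are an assumption, not gospel.
     */
    #include <stdio.h>
    #include <unistd.h>

    static long read_count(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");

        if (!f)
            return -1;  /* file absent: kernel too old, or feature disabled */
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        long last_oops = read_count("/sys/kernel/oops_count");
        long last_warn = read_count("/sys/kernel/warn_count");

        for (;;) {
            long o = read_count("/sys/kernel/oops_count");
            long w = read_count("/sys/kernel/warn_count");

            if (o > last_oops || w > last_warn)
                printf("kernel trouble: oops_count=%ld warn_count=%ld\n", o, w);
            last_oops = o;
            last_warn = w;
            sleep(60);  /* once a minute is plenty for this purpose */
        }
    }

A real agent would feed these values into its alerting system rather than printing them, but the kernel-facing part really is this small, which is what makes the "please, monitoring tools, look at this" hope plausible.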
Averting excessive oopses (https://lwn.net/Articles/915666/)
KaiRo - Tue, 22 Nov 2022 00:22:54 +0000

After recent happenings, I wouldn't trust them to have enough employees in the relevant areas to debug why something goes wrong if a reboot fixes it for now.

Averting excessive oopses (https://lwn.net/Articles/915556/)
aaronmdjones - Mon, 21 Nov 2022 10:44:57 +0000

Oopses already provide a stack trace and a full register dump. They also (if the oops was caused by e.g. the BUG_ON() macro) provide the file and line number that caused it.
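For anyone who hasn't watched that happen: BUG_ON() is just a condition check that oopses when the condition is true, and the resulting report names the file and line of the check along with the register dump and backtrace. A deliberately crashing toy module, sketched here purely for illustration (the module name is invented, and this is obviously not something to load on a machine you care about):

    /*
     * oops_demo.c: a throwaway module whose only job is to oops at load
     * time so the BUG_ON() report (file, line, registers, backtrace) can
     * be seen in dmesg.  Illustrative only.
     */
    #include <linux/module.h>
    #include <linux/bug.h>

    static int __init oops_demo_init(void)
    {
        void *p = NULL;

        /* The oops report will point at this file and this line. */
        BUG_ON(p == NULL);
        return 0;
    }

    static void __exit oops_demo_exit(void)
    {
    }

    module_init(oops_demo_init);
    module_exit(oops_demo_exit);
    MODULE_LICENSE("GPL");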
Averting excessive oopses (https://lwn.net/Articles/915553/)
jafd - Mon, 21 Nov 2022 08:58:17 +0000

Don't oopses provide a stack trace? Is this what you mean?

Averting excessive oopses (https://lwn.net/Articles/915541/)
NYKevin - Sun, 20 Nov 2022 21:07:24 +0000

It depends on how their system is set up. If you use a cattle-not-pets mentality, then you probably have panic_on_oops enabled, and some higher-level control plane (such as k8s) is responsible for rescheduling containers onto other available nodes. In that case, this change would not affect you at all, because you're effectively already running with an oops_limit of 1.

Averting excessive oopses (https://lwn.net/Articles/915454/)
developer122 - Sat, 19 Nov 2022 05:26:44 +0000

A timer means that an attacker merely needs to wait longer to succeed. It doesn't prevent the attack.

Meanwhile, a timer is terrible for catching normal oopses, because if the problem is infrequent enough it goes completely unnoticed while the corruption it causes persists.

Averting excessive oopses (https://lwn.net/Articles/915443/)
sroracle - Fri, 18 Nov 2022 23:25:10 +0000

To be clear - was this an LTO-built kernel that was sporadically rebooting, or something else?

Averting excessive oopses (https://lwn.net/Articles/915431/)
unixbhaskar - Fri, 18 Nov 2022 21:00:28 +0000

Well, on a wild hunch and armed with a lack of knowledge, may I suggest that it would be of great help if, and only if, the oops could point to the nearby location to look at (without going through the rigmarole of firing up the debugger to find out the culprit). I know it requires some sort of system training to be built on to get it.

I am drawing this notion from seeing it in different contexts in a specific environment.

See, I am looking for an "easy way out" rather than putting in any kind of effort. My lacuna.

Averting excessive oopses (https://lwn.net/Articles/915430/)
mathstuf - Fri, 18 Nov 2022 20:31:17 +0000

If I were doing this kind of stuff, I'd want to involve as few components as possible. Timers seem fundamental, but not as fundamental as a simple counter.

Averting excessive oopses (https://lwn.net/Articles/915428/)
flussence - Fri, 18 Nov 2022 20:18:52 +0000

I've been having the complete inverse problem on one machine: no warning signs whatsoever, the kernel would just spontaneously reboot back to the BIOS at random times of day. Sometimes several times a day. It turned out it was just caused by the use of llvm-lto (which I'm going to swear off until I next forget this), but getting to that deduction was extremely unfun.

Something in the middle, between that headache scenario and a dmesg full of emergencies going ignored, seems preferable to either... but if the option existed, I'd rather have the kernel vocalise its impending doom through the PC speaker.

Averting excessive oopses (https://lwn.net/Articles/915425/)
xi0n - Fri, 18 Nov 2022 19:38:13 +0000

If the concern is about an attacker who can rapidly trigger a large number of oopses to exploit some counter vulnerability, then wouldn't it be better to track the oopses over a time window instead?

Granted, I don't have a good mental model as to how severe an oops is. But if something like a faulty driver for an obscure device can trigger one consistently without much harm to the rest of the kernel, then I can imagine a long running system may eventually hit the limit and panic seemingly out of the blue.
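To make the trade-off in this subthread concrete: a cumulative counter only ever moves toward the limit, while a sliding window forgets old events, so an attacker who paces their oopses just under the window's rate never trips it, and a slow but persistent fault never gets flagged either. A self-contained toy comparison (the numbers and helper names are made up for illustration; the mechanism described in the article uses a plain cumulative count):

    /*
     * Compares the two policies debated above: a cumulative oops counter
     * versus counting oopses inside a sliding time window.  Userspace toy
     * with made-up numbers.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define LIMIT  100          /* events before "panic" (toy value) */
    #define WINDOW 3600.0       /* seconds, for the windowed policy */

    static unsigned long total;         /* cumulative policy state */
    static double times[LIMIT];         /* windowed policy: last LIMIT event times */
    static unsigned long seen;

    static bool cumulative_should_panic(void)
    {
        return ++total >= LIMIT;
    }

    static bool windowed_should_panic(double now)
    {
        times[seen++ % LIMIT] = now;
        if (seen < LIMIT)
            return false;
        /* Panic only if the LIMIT-th most recent event is still inside the window. */
        return now - times[seen % LIMIT] <= WINDOW;
    }

    int main(void)
    {
        /* One oops every 40 seconds, for ten simulated days. */
        for (double t = 0; t < 10 * 24 * 3600; t += 40) {
            bool c = cumulative_should_panic();
            bool w = windowed_should_panic(t);

            if (c || w) {
                printf("t=%.0fs: cumulative=%s windowed=%s\n",
                       t, c ? "panic" : "ok", w ? "panic" : "ok");
                if (c)
                    break;  /* cumulative policy has given up */
            }
        }
        return 0;
    }

With one simulated oops every 40 seconds, the windowed policy never fires, while the cumulative one gives up after the hundredth event and forces the system back to a known state.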
Averting excessive oopses (https://lwn.net/Articles/915415/)
adobriyan - Fri, 18 Nov 2022 17:21:12 +0000

> While there are almost certainly systems out there that can oops continuously for over a week without anybody noticing, they are probably relatively rare.

We'll soon find out if Twitter had such systems. :-)

More seriously, 10,000 is gross; oops_limit, if implemented, should default to a small multiple of the number of "possible" CPUs.