Averting excessive oopses

Posted Dec 1, 2022 15:42 UTC (Thu) by esemwy (guest, #83963)
Parent article: Averting excessive oopses

Isn’t this turning any bug into an immediate denial-of-service? I understand the concern that there may be an information leak or multiple tries on the same intermittent vulnerability, but given that the system owner is rarely in a position to actually make the fix, this seems like a pretty crude solution.

Is there something I’m missing?

Averting excessive oopses

Posted Dec 1, 2022 16:57 UTC (Thu) by farnz (subscriber, #17727)

You've already got a denial of service if you can trigger oopses at will - oops handling is not free, and if two or more CPUs hit an oops at the same time, one CPU is chosen as the oops handler, while the others spin in kernel space, preventing you from doing anything useful until the oops is handled. If I can trigger oopses freely, I can have all but one core spinning in the kernel, while the one that's trying to do work keeps re-entering the oops handler and not making progress.

Plus, the nature of oopses is that some are one-off ("cosmic ray") events that never recur, while others leave the system in a state where doing the same thing will result in another oops. The older knob (panic_on_oops) lets you turn all oopses into panics, so that the machine comes into a known state before trying again. The new knob lets you attempt to distinguish cosmic ray events from an important subsystem being broken. For example, if I can reliably oops you because part of the TCP stack has trashed state for one tx queue, so that you oops every time you send data, you'll hit the limit and panic. But if the oops happens because the driver author for my NIC didn't think about what would happen if I tried to tx during a NIC reset, and everything will be fine once the NIC reset completes, then this is a rare event and I'll recover.
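The difference between the two knobs can be sketched as a small model. This is a toy illustration in Python, not kernel code; the class and method names are hypothetical, and the semantics assumed are those of the kernel.oops_limit sysctl as documented (a count since boot, with 0 meaning "never panic on count alone").

```python
from dataclasses import dataclass


@dataclass
class OopsPolicy:
    """Toy model of the two knobs; names here are illustrative only."""

    panic_on_oops: bool = False  # older knob: every oops becomes a panic
    oops_limit: int = 10_000     # newer knob: 0 disables the count check
    oops_count: int = 0          # oopses seen since boot

    def record_oops(self) -> str:
        """Return "panic" or "continue" for one more oops."""
        if self.panic_on_oops:
            return "panic"
        self.oops_count += 1
        # A non-zero limit turns the Nth oops into a panic.
        if self.oops_limit and self.oops_count >= self.oops_limit:
            return "panic"
        return "continue"


# A one-off oops is tolerated; a repeating one eventually hits the limit.
policy = OopsPolicy(oops_limit=3)
results = [policy.record_oops() for _ in range(3)]
```

With a limit of 3, the first two oopses are survivable and the third forces a panic, which is the "broken subsystem keeps oopsing" case; a single stray oops never gets near the limit.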

And to some extent, the key difference between an oops and a panic is that an oops says "the system is partially broken; some things still work, but other things are permanently broken", while a panic says "the system is beyond recovery, short of a complete reboot". On my laptop, for example, an oops that occurs every time my TCP stack tries to handle a receive window of more than 8 bytes is a pain, but I can deal - save files (NVMe is unaffected by a broken TCP stack), and reboot. If I lose my input subsystem, however, I'm stuck - if I can't tell the computer what to do, all I can do is reset it. On a network server at Twitter, however, losing the input subsystem is no big deal, but a network stack that can't handle a receive window of more than 8 bytes is effectively useless.

Averting excessive oopses

Posted Dec 1, 2022 17:54 UTC (Thu) by esemwy (guest, #83963)

OK, thanks. That makes a lot more sense then. The only remaining concern would be on extremely long running systems, and I expect the amount of monitoring those receive is sufficient that the use of a hard limit rather than a timed window shouldn’t matter.

Averting excessive oopses

Posted Dec 1, 2022 18:05 UTC (Thu) by farnz (subscriber, #17727)

It's also worth noting that this is an optional mechanism; if your system is meant to be extremely long running, and to keep going through oopses, you'd turn this off. But for a use case like Twitter (or Amazon Web Services, or Facebook, or Google), this mechanism will result in servers that are accumulating software faults panicking and thus rebooting before things get serious - and you'll have other monitoring that says "this server is rebooting frequently, take it out of service" already.
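For checking how a given machine is configured, here is a minimal sketch that reads the limit and describes what it means. It assumes the knob is exposed as /proc/sys/kernel/oops_limit (the kernel.oops_limit sysctl) and that 0 disables the check; the function names are made up for this example, and on kernels without the knob the file simply won't exist.

```python
from pathlib import Path
from typing import Optional


def read_oops_limit(path: str = "/proc/sys/kernel/oops_limit") -> Optional[int]:
    """Return the configured limit, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file (older kernel, non-Linux) or unexpected contents.
        return None


def describe_oops_limit(limit: Optional[int]) -> str:
    """Explain a limit value in plain words."""
    if limit is None:
        return "oops_limit is not readable on this system"
    if limit == 0:
        return "oops count check disabled; oopses alone never force a panic"
    if limit == 1:
        return "panic on the first oops, like panic_on_oops=1"
    return f"panic once {limit} oopses have accumulated since boot"
```

Calling describe_oops_limit(read_oops_limit()) prints the policy for the running system. Turning the mechanism off for a machine that must keep going through oopses would then be a matter of setting the sysctl to 0 (as root, e.g. with sysctl -w kernel.oops_limit=0).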

Averting excessive oopses

Posted Jan 5, 2023 14:50 UTC (Thu) by sammythesnake (guest, #17693)

There's a point where "very long running" bumps up against the mean time between kernel upgrades you don't want to miss. Longer running than that is probably not something to aspire to!

Providing the "cosmic ray" and "meh - non-critical subsystem" oopsen don't add up to 10,000 much more quickly than that, then your reboot rate will probably be pretty unaffected, I'd have thought.
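To put rough numbers on that, here is a back-of-the-envelope sketch. The oops rates below are invented purely for illustration; only the limit of 10,000 comes from the discussion.

```python
def days_until_limit(oopses_per_day: float, limit: int = 10_000) -> float:
    """Days of uptime before a steady oops rate reaches the limit."""
    if oopses_per_day <= 0:
        return float("inf")  # no oopses, so the limit is never reached
    return limit / oopses_per_day


# Hypothetical rates: one stray oops a day versus a badly broken subsystem.
rare = days_until_limit(1)        # 10000.0 days, about 27 years
broken = days_until_limit(1_000)  # 10.0 days
```

At one oops a day the limit sits decades beyond any sensible kernel upgrade interval, so it only bites when something is oopsing at a rate that already deserves attention.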

For my own use case, the most common cause of reboots is accidentally knocking the power cable loose while hunting for something that fell down the back of my desk, so this is all a little academic for me :-P