Averting excessive oopses
An oops in the kernel is the equivalent of a crash in user space. It can come about for a number of reasons, including dereferencing a stray pointer, hardware problems, or a bug detected by checks within the kernel code itself. The normal response to an oops is to output a bunch of diagnostic information to the system log and kill the process that was running when the problem occurred.
The system as a whole, however, will continue on after an oops if at all possible. Killing the system would deprive the users of the ability to save any outstanding work and can also make problems much harder to debug than they would otherwise be. So the kernel will do its best to continue executing even when something has clearly gone badly wrong. An immediate result of that design decision is that any given system can oops more than once. Indeed, for some types of problems, multiple oopses are common and may continue until somebody gets fed up and reboots the system.
Jann Horn recently started to wonder whether perhaps the kernel should just give up and go into a panic (which will cause a reboot) if it oopses too many times. This could be a wise course of action in general; a kernel that is oopsing frequently is clearly not in a good condition and allowing it to continue could lead to problems like data corruption. But Horn had another concern: oopsing a system enough times might be a way to exploit security problems.
An oops, almost by definition, will leave an operation halfway completed; there is usually no way to clean up everything that might need cleaning when something has gone wrong in an unexpected place. So an oops might cause locks to be left in a held state or might lead to the failure to decrement counters that have been incremented. Counters are a particular concern; if an oops causes a counter to not be properly decremented, oopsing repeatedly might well become a way to overflow that counter, creating an exploitable situation.
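To illustrate the concern, here is a hypothetical sketch (all names invented for illustration, not code from any actual report) of a function that bumps a counter, does some work, and drops the counter on the way out. If the work oopses, the task dies mid-function and the decrement never runs:

```c
#include <linux/atomic.h>

/* Hypothetical example: every identifier here is made up for illustration. */
struct widget {
	atomic_t users;		/* plain counter, no overflow protection */
	/* ... */
};

static int widget_do_work(struct widget *w)
{
	atomic_inc(&w->users);	/* reference taken for the duration of the work */

	/*
	 * If this call dereferences a stray pointer, the resulting oops
	 * kills the task right here; the function never finishes, so the
	 * decrement below is skipped.  Trigger that path roughly 2^32
	 * times and the counter wraps back around to a small value.
	 */
	widget_process(w);

	atomic_dec(&w->users);
	return 0;
}
```

Once such a counter wraps, the object it protects can end up freed while still in use, which is exactly the sort of exploitable situation Horn was worried about.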
To thwart attacks of this type, Horn wrote a patch putting an upper limit on the number of times the system can oops before it simply calls panic() and reboots. The limit was set to 10,000, but can be changed with the oops_limit command-line parameter.
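The mechanism itself is simple; a minimal sketch of what the patch does (simplified here, with names and details that may differ from the posted code) looks something like this:

```c
#include <linux/atomic.h>
#include <linux/panic.h>

static atomic_t oops_count = ATOMIC_INIT(0);
static unsigned int oops_limit = 10000;	/* overridable on the kernel command line */

/* Called from the oops-handling path once the diagnostics have been printed. */
static void check_oops_limit(void)
{
	if (atomic_inc_return(&oops_count) >= oops_limit)
		panic("Oopsed too often (limit is %u)", oops_limit);
}
```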
One might well wonder whether oopsing the kernel repeatedly is a realistic way to exploit a kernel vulnerability. A kernel oops takes a bit of time, depending on a number of factors including the amount of data to be logged and the speed of the console device(s). The development community has put a vast amount of effort into optimizing many parts of the kernel, but speeding up oops handling has somehow never been a priority. To determine how long handling an oops takes, Horn wrote a special sort of benchmark:
In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run.
Based on that, he concluded that it would take between eight and 12 days of constant oopsing, in the best of conditions, to overflow a 32-bit counter that was incremented once for each oops. So it is not the quickest path toward a system exploit; it is also not the most discreet: "this is a *very* noisy and violent approach to exploiting the kernel". While there are almost certainly systems out there that can oops continuously for over a week without anybody noticing, they are probably relatively rare.
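As a rough sanity check of that estimate (the arithmetic here is mine, not Horn's): wrapping a signed 32-bit counter requires about 2^31, or 2.1 billion, oopses; at 510µs apiece that works out to roughly 1.1 million seconds, or a bit under 13 days of uninterrupted oopsing, which is in the same ballpark as the eight-to-twelve-day figure. On the 45ms text-console path, the same attack would take on the order of three years.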
Even so, nobody seemed opposed to putting an upper limit on the number of oopses any given kernel can be expected to endure. Nobody even really felt the need to argue over the 10,000 number, though Linus Torvalds did note in passing that he would have preferred a higher limit. Alexander "Solar Designer" Peslyak suggested that, rather than creating a separate command-line parameter, Horn could just turn the existing panic_on_oops boolean parameter into an integer and use that. That idea didn't get too far, though, due to the number of places in the kernel that check that parameter and use it to modify their behavior now.
A few days later, Kees Cook posted an updated version of the patch (since revised), turning it into a six-part series. The behavior implemented by Horn remained unchanged, but there were some additions, starting with a separate limit that will panic the system if the kernel emits too many warnings. Cook also concluded that, since the kernel was now tracking the number of oopses and warnings, that information could be provided to user space via sysfs, where it might be useful to monitoring systems.
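For a monitoring agent, consuming those counts would presumably be a matter of reading a sysfs file. The sketch below assumes the path /sys/kernel/oops_count from the posted series; the final, merged interface may well differ:

```c
#include <stdio.h>

int main(void)
{
	unsigned long count;
	/* Path taken from the posted series; treat it as an assumption. */
	FILE *f = fopen("/sys/kernel/oops_count", "r");

	if (!f) {
		perror("/sys/kernel/oops_count");
		return 1;
	}
	if (fscanf(f, "%lu", &count) == 1)
		printf("oopses since boot: %lu\n", count);
	fclose(f);
	return 0;
}
```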
No opposition to this change appears to be in the offing, so chances are that this patch set will find its way into the 6.2 kernel in something close to its current form. Thereafter, no kernel will be forced to put up with the indignity of constant oopsing for too long and, perhaps, some clever exploit might even be fended off. "Don't panic" might be good advice for galactic hitchhikers, but sometimes it is the right thing for the kernel to do.
Index entries for this article: Kernel: Security/Kernel hardening
Posted Nov 18, 2022 17:21 UTC (Fri) by adobriyan (subscriber, #30858)
We'll soon find out if Twitter had such systems. :-)
More seriously, 10,000 is gross; oops_limit, if implemented, should default to a small multiple of the number of "possible" CPUs.
Posted Nov 20, 2022 21:07 UTC (Sun) by NYKevin (subscriber, #129325)
Posted Nov 29, 2022 0:41 UTC (Tue) by jccleaver (guest, #127418)
Posted Nov 29, 2022 11:27 UTC (Tue) by farnz (subscriber, #17727)
With a fleet of cattle, as opposed to a fleet of pets, you don't presume that all wear and tear is identical - you have monitoring that measures the state of each machine individually, and handles it. E.g. your monitoring detects that machine 10,994,421 is frequently throwing ECC corrected errors, and it migrates all jobs off that machine and sends it off for repair. Or you detect filesystem corruption, and migrate everything off then send that machine to repair.
The key is that automation handles everything if you have cattle, not pets. Instead of knowing that http10294 is a unique snowflake with certain problems and fixes for problems, you have your automation look at the state of all hardware, and identify and deal with the problems of the machines en-masse.
Posted Nov 29, 2022 16:14 UTC (Tue) by Wol (subscriber, #4433)
Shouldn't that be a hurd?
(G, D&R)
Cheers,
Wol
Posted Nov 29, 2022 22:47 UTC (Tue) by jccleaver (guest, #127418)
At the risk of mixing metaphors further: you can automate pets as well. Plenty of shops use robust config management but don't (for a variety of different reasons) presume that indefinite horizontal scaling is the proper solution to all of life's problems, and thus can have a lot of systems running where the system cattle type n <= 4, which when combined with data locality, master/standby, or prod/test, is very much in the pet range for whatever number your tweaks go to.
Not everyone is Google Scale, and designing for Google Scale -- or creating enough complexity so that you can self-fulfillingly justify designing for Google Scale -- constitutes owning only the world's largest k8s hammer.
Posted Nov 30, 2022 12:12 UTC (Wed) by farnz (subscriber, #17727)
Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.
The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation, through to the "cattle" where you have so many servers to look after that doing admin by hand is simply too time-consuming, and thus you must automate everything so that any server that leaves the "expected" state is either repaired by automation or kicked out as "definite hardware fault" by automation, with clear instructions on how to fix or replace the faulty hardware.
Where you need to be on that line depends on how much time you have to do admin. If you have 40 hours a week to look after a single server that runs a single service, then pets are absolutely fine. If you have an hour a month to look after 10,000 servers, then you need to treat them as cattle. If you have 8 hours a week and 10 servers, then you're probably somewhere in the middle - you want to have the baseline dealt with by automation (things like checking for hardware failures, making sure the underlying OS is up to date etc), but you can treat the services on top of the hardware (the things you care about like a database setup, a web server, your organisation's secret special code etc) as pets.
And it's a good rule of thumb to assume that anything you do more than 3 times should be automated. For a small shop, this means that the things that recur have you treating the servers as cattle (why care about manually checking whether both disks are present in the RAID-1, or that btrfs checksums match data, when the server can do that automatically and notify you when bad things happen? Why manually choose all the options in the installer, when kickstart or equivalents can do that for you?) but once-offs (there's been a schema change from 2021-09 to 2021-10 deploy, needs manual application) are done by hand, treating the affected servers as pets.
Posted Nov 30, 2022 16:30 UTC (Wed) by Wol (subscriber, #4433)
> The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation,
Even with a single, "pet" computer, automating admin where possible just makes sense. My /usr/local/bin has a few scripts (I ought to write more) that do assorted things like creating lvm volumes to assist my weekly backups/emerges, and there's plenty of stuff I ought to do regularly, that scripts would help with.
Cheers,
Wol
Posted Nov 22, 2022 0:22 UTC (Tue) by KaiRo (subscriber, #1987)
Posted Nov 18, 2022 19:38 UTC (Fri) by xi0n (subscriber, #138144)
Granted, I don’t have a good mental model as to how severe an oops is. But if something like a faulty driver for an obscure device can trigger it consistently without much harm for the rest of the kernel, then I can imagine a long running system may eventually hit the limit and panic seemingly out of the blue.
Posted Nov 18, 2022 20:31 UTC (Fri) by mathstuf (subscriber, #69389)
Posted Nov 19, 2022 5:26 UTC (Sat) by developer122 (guest, #152928)
Meanwhile, a timer is terrible for catching normal oopses, because if the problem is infrequent enough it goes completely unnoticed while the corruption it causes persists.
Posted Nov 18, 2022 20:18 UTC (Fri) by flussence (guest, #85566)
Something in the middle, between that headache scenario and a dmesg full of emergencies going ignored, seems preferable to either... but if the option existed, I'd rather have the kernel vocalise its impending doom through the PC speaker.
Posted Nov 18, 2022 23:25 UTC (Fri) by sroracle (guest, #124960)
Posted Nov 24, 2022 12:45 UTC (Thu) by flussence (guest, #85566)
There's good reasons why that code is impossible to activate by accident, so I can't complain too much.
Posted Nov 18, 2022 21:00 UTC (Fri) by unixbhaskar (guest, #44758)
I am deriving this notion from seeing it in different contexts in a specific environment.
See, I am looking for an "easy way out" rather than putting in any kind of effort. My lacuna.
Posted Nov 21, 2022 8:58 UTC (Mon) by jafd (subscriber, #129642)
Posted Nov 21, 2022 10:44 UTC (Mon) by aaronmdjones (subscriber, #119973)
Posted Nov 22, 2022 0:30 UTC (Tue) by KaiRo (subscriber, #1987)
Posted Dec 1, 2022 15:42 UTC (Thu) by esemwy (guest, #83963)
Is there something I’m missing?
Posted Dec 1, 2022 16:57 UTC (Thu) by farnz (subscriber, #17727)
You've already got a denial of service if you can trigger oopses at will - oops handling is not free, and if two or more CPUs hit an oops at the same time, one CPU is chosen as the oops handler, while the others spin in kernel space, preventing you from doing anything useful until the oops is handled. If I can trigger oopses freely, I can have all but one core spinning in the kernel, while the one that's trying to do work keeps re-entering the oops handler and not making progress.
Plus, the nature of oopses is that some are one-off ("cosmic ray") events that never recur, while others leave the system in a state where doing the same thing will result in another oops. The older knob (panic_on_oops) lets you turn all oopses into panics, so that the machine comes into a known state before trying again. This knob allows you to attempt to distinguish cosmic ray events from important subsystems broken, so that (e.g.) if I can reliably oops you because part of the TCP stack has trashed state for one tx queue, and you'll oops every time you send data, you'll panic, but if the oops is because the driver author for my NIC didn't think about what would happen if I tried to tx during a NIC reset, and everything will be fine after the NIC reset completes, well, this is a rare event and I'll recover.
And to some extent, the key difference between an oops and a panic is that an oops says "the system is partially broken; some things still work, but other things are permanently broken", while a panic says "the system is beyond recovery, short of a complete reboot". On my laptop, for example, an oops that occurs every time my TCP stack tries to handle a receive window of more than 8 bytes is a pain, but I can deal - save files (NVMe is unaffected by a broken TCP stack), and reboot. If I lose my input subsystem, however, I'm stuck - if I can't tell the computer what to do, all I can do is reset it. On a network server at Twitter, however, losing the input subsystem is no big deal, but a network stack that can't handle a receive window of more than 8 bytes is effectively useless.
Posted Dec 1, 2022 17:54 UTC (Thu) by esemwy (guest, #83963)
Posted Dec 1, 2022 18:05 UTC (Thu) by farnz (subscriber, #17727)
It's also worth noting that this is an optional mechanism; if your system is meant to be extremely long running, and to keep going through oopses, you'd turn this off. But for a use case like Twitter (or Amazon Web Services, or Facebook, or Google), this mechanism will result in servers that are accumulating software faults panicking and thus rebooting before things get serious - and you'll have other monitoring that says "this server is rebooting frequently, take it out of service" already.
Posted Jan 5, 2023 14:50 UTC (Thu) by sammythesnake (guest, #17693)
Provided the "cosmic ray" and "meh, non-critical subsystem" oopsen don't add up to 10,000 much more quickly than that, your reboot rate will probably be pretty unaffected, I'd have thought.
For my own use case, the most common cause of reboots is accidentally knocking the power cable loose while hunting for something that fell down the back of my desk, so this is all a little academic for me :-P