Averting excessive oopses
An oops in the kernel is the equivalent of a crash in user space. It can come about for a number of reasons, including dereferencing a stray pointer, hardware problems, or a bug detected by checks within the kernel code itself. The normal response to an oops is to output a bunch of diagnostic information to the system log and kill the process that was running when the problem occurred.
The system as a whole, however, will continue on after an oops if at all possible. Killing the system would deprive the users of the ability to save any outstanding work and can also make problems much harder to debug than they would otherwise be. So the kernel will do its best to continue executing even when something has clearly gone badly wrong. An immediate result of that design decision is that any given system can oops more than once. Indeed, for some types of problems, multiple oopses are common and may continue until somebody gets fed up and reboots the system.
Jann Horn recently started to wonder whether perhaps the kernel should just give up and go into a panic (which will cause a reboot) if it oopses too many times. This could be a wise course of action in general; a kernel that is oopsing frequently is clearly not in a good condition and allowing it to continue could lead to problems like data corruption. But Horn had another concern: oopsing a system enough times might be a way to exploit security problems.
An oops, almost by definition, will leave an operation halfway completed; there is usually no way to clean up everything that might need cleaning when something has gone wrong in an unexpected place. So an oops might cause locks to be left in a held state or might lead to the failure to decrement counters that have been incremented. Counters are a particular concern; if an oops causes a counter to not be properly decremented, oopsing repeatedly might well become a way to overflow that counter, creating an exploitable situation.
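To illustrate the concern, here is a hypothetical sketch (all names invented for illustration, not code from any actual report) of a function that bumps a counter, does some work, and drops the counter on the way out. If the work oopses, the task dies mid-function and the decrement never runs:

```c
#include <linux/atomic.h>

/* Hypothetical example: every identifier here is made up for illustration. */
struct widget {
	atomic_t users;		/* plain counter, no overflow protection */
	/* ... */
};

static int widget_do_work(struct widget *w)
{
	atomic_inc(&w->users);	/* reference taken for the duration of the work */

	/*
	 * If this call dereferences a stray pointer, the resulting oops
	 * kills the task right here; the function never finishes, so the
	 * decrement below is skipped.  Trigger that path roughly 2^32
	 * times and the counter wraps back around to a small value.
	 */
	widget_process(w);

	atomic_dec(&w->users);
	return 0;
}
```

Once such a counter wraps, the object it protects can end up freed while still in use, which is exactly the sort of exploitable situation Horn was worried about.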
To thwart attacks of this type, Horn wrote a patch putting an upper limit on the number of times the system can oops before it simply calls panic() and reboots. The limit was set to 10,000, but can be changed with the oops_limit command-line parameter.
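The mechanism itself is simple; a minimal sketch of what the patch does (simplified here, with names and details that may differ from the posted code) looks something like this:

```c
#include <linux/atomic.h>
#include <linux/panic.h>

static atomic_t oops_count = ATOMIC_INIT(0);
static unsigned int oops_limit = 10000;	/* overridable on the kernel command line */

/* Called from the oops-handling path once the diagnostics have been printed. */
static void check_oops_limit(void)
{
	if (atomic_inc_return(&oops_count) >= oops_limit)
		panic("Oopsed too often (limit is %u)", oops_limit);
}
```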
One might well wonder whether oopsing the kernel repeatedly is a realistic way to exploit a kernel vulnerability. A kernel oops takes a bit of time, depending on a number of factors including the amount of data to be logged and the speed of the console device(s). The development community has put a vast amount of effort into optimizing many parts of the kernel, but speeding up oops handling has somehow never been a priority. To determine how long handling an oops takes, Horn wrote a special sort of benchmark:
In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run.
Based on that, he concluded that it would take between eight and 12 days of constant oopsing, in the best of conditions, to overflow a 32-bit counter that was incremented once for each oops. So it is not the quickest path toward a system exploit; it is also not the most discreet: "this is a *very* noisy and violent approach to exploiting the kernel". While there are almost certainly systems out there that can oops continuously for over a week without anybody noticing, they are probably relatively rare.
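As a rough sanity check of that estimate (the arithmetic here is mine, not Horn's): wrapping a signed 32-bit counter requires about 2^31, or 2.1 billion, oopses; at 510µs apiece that works out to roughly 1.1 million seconds, or a bit under 13 days of uninterrupted oopsing, which is in the same ballpark as the eight-to-twelve-day figure. On the 45ms text-console path, the same attack would take on the order of three years.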
Even so, nobody seemed opposed to putting an upper limit on the number of oopses any given kernel can be expected to endure. Nobody even really felt the need to argue over the 10,000 number, though Linus Torvalds did note in passing that he would have preferred a higher limit. Alexander "Solar Designer" Peslyak suggested that, rather than creating a separate command-line parameter, Horn could just turn the existing panic_on_oops boolean parameter into an integer and use that. That idea didn't get too far, though, due to the number of places in the kernel that check that parameter and use it to modify their behavior now.
A few days later, Kees Cook posted an updated version of the patch (since revised), turning it into a six-part series. The behavior implemented by Horn remained unchanged, but there were some additions, starting with a separate limit that will panic the system if the kernel emits too many warnings. Cook also concluded that, since the kernel was now tracking the number of oopses and warnings, that information could be provided to user space via sysfs, where it might be useful to monitoring systems.
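For a monitoring agent, consuming those counts would presumably be a matter of reading a sysfs file. The sketch below assumes the path /sys/kernel/oops_count from the posted series; the final, merged interface may well differ:

```c
#include <stdio.h>

int main(void)
{
	unsigned long count;
	/* Path taken from the posted series; treat it as an assumption. */
	FILE *f = fopen("/sys/kernel/oops_count", "r");

	if (!f) {
		perror("/sys/kernel/oops_count");
		return 1;
	}
	if (fscanf(f, "%lu", &count) == 1)
		printf("oopses since boot: %lu\n", count);
	fclose(f);
	return 0;
}
```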
No opposition to this change appears to be in the offing, so chances are that this patch set will find its way into the 6.2 kernel in something close to its current form. Thereafter, no kernel will be forced to put up with the indignity of constant oopsing for too long and, perhaps, some clever exploit might even be fended off. "Don't panic" might be good advice for galactic hitchhikers, but sometimes it is the right thing for the kernel to do.
Index entries for this article: Kernel: Security/Kernel hardening
Posted Nov 18, 2022 17:21 UTC (Fri) by adobriyan (subscriber, #30858)
We'll soon find out if Twitter had such systems. :-)
More seriously, 10,000 is gross; oops_limit, if implemented, should default to a small multiple of the number of "possible" CPUs.
Posted Nov 20, 2022 21:07 UTC (Sun) by NYKevin (subscriber, #129325)
Posted Nov 29, 2022 0:41 UTC (Tue) by jccleaver (guest, #127418)
Posted Nov 29, 2022 11:27 UTC (Tue) by farnz (subscriber, #17727)
With a fleet of cattle, as opposed to a fleet of pets, you don't presume that all wear and tear is identical - you have monitoring that measures the state of each machine individually, and handles it. E.g. your monitoring detects that machine 10,994,421 is frequently throwing ECC corrected errors, and it migrates all jobs off that machine and sends it off for repair. Or you detect filesystem corruption, and migrate everything off then send that machine to repair.
The key is that automation handles everything if you have cattle, not pets. Instead of knowing that http10294 is a unique snowflake with certain problems and fixes for problems, you have your automation look at the state of all hardware, and identify and deal with the problems of the machines en-masse.
Posted Nov 29, 2022 16:14 UTC (Tue) by Wol (subscriber, #4433)
Shouldn't that be a hurd?
(G, D&R)
Cheers,
Wol
Posted Nov 29, 2022 22:47 UTC (Tue) by jccleaver (guest, #127418)
At the risk of mixing metaphors further: you can automate pets as well. Plenty of shops use robust config management but don't (for a variety of different reasons) presume that indefinite horizontal scaling is the proper solution to all of life's problems, and thus can have a lot of systems running where the system cattle type n <= 4, which when combined with data locality, master/standby, or prod/test, is very much in the pet range for whatever number your tweaks go to.
Not everyone is Google Scale, and designing for Google Scale -- or creating enough complexity so that you can self-fulfillingly justify designing for Google Scale -- constitutes owning only the world's largest k8s hammer.
Posted Nov 30, 2022 12:12 UTC (Wed) by farnz (subscriber, #17727)
Absolutely - and automating admin is not, in itself, a sign that you're going for the "herd of cattle" phase. It's still valuable to automate repeated jobs for pets.
The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation, through to the "cattle" where you have so many servers to look after that doing admin by hand is simply too time-consuming, and thus you must automate everything so that any server that leaves the "expected" state is either repaired by automation or kicked out as "definite hardware fault" by automation, with clear instructions on how to fix or replace the faulty hardware.
Where you need to be on that line depends on how much time you have to do admin. If you have 40 hours a week to look after a single server that runs a single service, then pets are absolutely fine. If you have an hour a month to look after 10,000 servers, then you need to treat them as cattle. If you have 8 hours a week and 10 servers, then you're probably somewhere in the middle - you want to have the baseline dealt with by automation (things like checking for hardware failures, making sure the underlying OS is up to date etc), but you can treat the services on top of the hardware (the things you care about like a database setup, a web server, your organisation's secret special code etc) as pets.
And it's a good rule of thumb to assume that anything you do more than 3 times should be automated. For a small shop, this means that the things that recur have you treating the servers as cattle (why care about manually checking whether both disks are present in the RAID-1, or that btrfs checksums match data, when the server can do that automatically and notify you when bad things happen? Why manually choose all the options in the installer, when kickstart or equivalents can do that for you?) but once-offs (there's been a schema change from 2021-09 to 2021-10 deploy, needs manual application) are done by hand, treating the affected servers as pets.
Posted Nov 30, 2022 16:30 UTC (Wed) by Wol (subscriber, #4433)
> The point of "herd of cattle" is that there's a continuum from the "pet" (e.g. my laptop), where all admin is done by hand, as a careful operation,
Even with a single, "pet" computer, automating admin where possible just makes sense. My /usr/local/bin has a few scripts (I ought to write more) that do assorted things like creating lvm volumes to assist my weekly backups/emerges, and there's plenty of stuff I ought to do regularly, that scripts would help with.
Cheers,
Wol
Posted Nov 22, 2022 0:22 UTC (Tue) by KaiRo (subscriber, #1987)
Posted Nov 18, 2022 19:38 UTC (Fri) by xi0n (subscriber, #138144)
Granted, I don’t have a good mental model as to how severe an oops is. But if something like a faulty driver for an obscure device can trigger it consistently without much harm for the rest of the kernel, then I can imagine a long running system may eventually hit the limit and panic seemingly out of the blue.
Posted Nov 18, 2022 20:31 UTC (Fri) by mathstuf (subscriber, #69389)
Posted Nov 19, 2022 5:26 UTC (Sat) by developer122 (guest, #152928)
Meanwhile, a timer is terrible for catching normal oopses, because if the problem is infrequent enough it goes completely unnoticed while the corruption it causes persists.
Posted Nov 18, 2022 20:18 UTC (Fri) by flussence (guest, #85566)
Something in the middle, between that headache scenario and a dmesg full of emergencies going ignored, seems preferable to either... but if the option existed, I'd rather have the kernel vocalise its impending doom through the PC speaker.
Posted Nov 18, 2022 23:25 UTC (Fri) by sroracle (guest, #124960)
Posted Nov 24, 2022 12:45 UTC (Thu) by flussence (guest, #85566)
There's good reasons why that code is impossible to activate by accident, so I can't complain too much.
Posted Nov 18, 2022 21:00 UTC (Fri) by unixbhaskar (guest, #44758)
I am deriving this notion from seeing it in different contexts in a specific environment.
See, I am looking for an "easy way out" rather than putting in any kind of effort. My lacuna.
Posted Nov 21, 2022 8:58 UTC (Mon) by jafd (subscriber, #129642)
Posted Nov 21, 2022 10:44 UTC (Mon) by aaronmdjones (subscriber, #119973)
Posted Nov 22, 2022 0:30 UTC (Tue) by KaiRo (subscriber, #1987)
Posted Dec 1, 2022 15:42 UTC (Thu) by esemwy (guest, #83963)
Is there something I’m missing?
Posted Dec 1, 2022 16:57 UTC (Thu) by farnz (subscriber, #17727)
You've already got a denial of service if you can trigger oopses at will - oops handling is not free, and if two or more CPUs hit an oops at the same time, one CPU is chosen as the oops handler, while the others spin in kernel space, preventing you from doing anything useful until the oops is handled. If I can trigger oopses freely, I can have all but one core spinning in the kernel, while the one that's trying to do work keeps re-entering the oops handler and not making progress.
Plus, the nature of oopses is that some are one-off ("cosmic ray") events that never recur, while others leave the system in a state where doing the same thing will result in another oops. The older knob (panic_on_oops) lets you turn all oopses into panics, so that the machine comes into a known state before trying again. This knob allows you to attempt to distinguish cosmic ray events from important subsystems broken, so that (e.g.) if I can reliably oops you because part of the TCP stack has trashed state for one tx queue, and you'll oops every time you send data, you'll panic, but if the oops is because the driver author for my NIC didn't think about what would happen if I tried to tx during a NIC reset, and everything will be fine after the NIC reset completes, well, this is a rare event and I'll recover.
And to some extent, the key difference between an oops and a panic is that an oops says "the system is partially broken; some things still work, but other things are permanently broken", while a panic says "the system is beyond recovery, short of a complete reboot". On my laptop, for example, an oops that occurs every time my TCP stack tries to handle a receive window of more than 8 bytes is a pain, but I can deal - save files (NVMe is unaffected by a broken TCP stack), and reboot. If I lose my input subsystem, however, I'm stuck - if I can't tell the computer what to do, all I can do is reset it. On a network server at Twitter, however, losing the input subsystem is no big deal, but a network stack that can't handle a receive window of more than 8 bytes is effectively useless.
Posted Dec 1, 2022 17:54 UTC (Thu) by esemwy (guest, #83963)
Posted Dec 1, 2022 18:05 UTC (Thu) by farnz (subscriber, #17727)
It's also worth noting that this is an optional mechanism; if your system is meant to be extremely long running, and to keep going through oopses, you'd turn this off. But for a use case like Twitter (or Amazon Web Services, or Facebook, or Google), this mechanism will result in servers that are accumulating software faults panicking and thus rebooting before things get serious - and you'll have other monitoring that says "this server is rebooting frequently, take it out of service" already.
Posted Jan 5, 2023 14:50 UTC (Thu) by sammythesnake (guest, #17693)
Provided the "cosmic ray" and "meh, non-critical subsystem" oopsen don't add up to 10,000 much more quickly than that, your reboot rate will probably be pretty unaffected, I'd have thought.
For my own use case, the most common cause of reboots is accidentally knocking the power cable loose while hunting for something that fell down the back of my desk, so this is all a little academic for me :-P