Revisiting "too small to fail"

By Jonathan Corbet
May 20, 2017

Back in 2014, the revelation that the kernel's memory-management subsystem would not allow relatively small allocation requests to fail created a bit of a stir. The discussion has settled down since then, but the "too small to fail" rule still clearly creates a certain amount of confusion in the kernel community, as is evidenced by a recent discussion inspired by the 4.12 merge window. It would appear that the rule remains in effect, but developers are asked to act as if it did not.

At the start of the 2014 discussion, memory-management developer Michal Hocko described the "unwritten rule" that small allocations never fail. "Small" is determined by the kernel's PAGE_ALLOC_COSTLY_ORDER constant, which is generally set to three; that puts the threshold at eight pages, or 32KB on most systems. Almost all memory allocations in the kernel are smaller than that (much effort has gone into keeping most of them no larger than a single page), so the end result is that memory allocation attempts almost never fail.

That created some unhappiness for a couple of reasons. One is that kernel developers have been told since the beginning that any memory allocation can fail, so they have been carefully writing failure-recovery paths that will never be used. This policy can also lead the kernel to do unpleasant things — such as summoning the dreaded out-of-memory killer — rather than fail a request, even if the requesting code is prepared to deal gracefully with an allocation failure. Proposals to change this policy have always foundered on the fear that enabling allocation failures would expose bugs throughout the kernel. The bulk of that failure-recovery code may have never been executed — or it may not exist at all. So the "too small to fail" behavior remains in place.

Trond Myklebust's NFS client fixes pull request included a line item reading: "Cleanup and removal of some memory failure paths now that GFP_NOFS is guaranteed to never fail". The description was inaccurate: the code in question is using a mempool, which pre-allocates memory and, if used properly, can indeed guarantee that allocation failures will not occur. But it was enough to prompt Nikolay Borisov to ask whether success was truly guaranteed. If so, there would be an opportunity to clean up a lot of unneeded error-handling code throughout the kernel. Hocko replied that, while "small allocations never fail _practically_", the behavior was in no way guaranteed and that removing checks for allocation failures is "just wrong".

Myklebust was not entirely pleased with that response; he asked for a clear statement that small allocation requests can fail. He didn't get one. Instead, Hocko replied:

We would really like to fail those requests instead. I've tried that in the past but it was deemed too dangerous because _all_ kernel paths would have to be checked for a sane failure behavior. So we are keeping status quo instead.

The status quo — telling developers to be prepared for allocation failures while not actually failing allocation requests — is less than pleasing for many involved in these discussions. In many parts of the kernel, error handling makes up a large portion of the total amount of code. This code can be tricky to write and even trickier to test; it can be frustrating to be asked to do this work to prepare for a situation that is not ever going to happen.

The memory-management developers cannot just change this behavior, though. There can be little doubt that, in a kernel with thousands of never-executed, never-tested error-handling paths, some of those paths will contain bugs. Auditing the kernel and validating all of those paths would not be a small task, to put it lightly; it may not be feasible to do at all. What can be done is to validate and fix the code one piece at a time. This is how the big kernel lock (BKL) was finally removed in 2011. That job proceeded by getting rid of the BKL dependencies in one small bit of code at a time until, eventually, nothing needed it anymore. It took many years, but it got the job done.

In the case of memory-allocation failures, validating code will not always be easy. The fault injection framework can be used to force allocation errors, though, which can help in the testing of recovery paths. For code that is deemed to be properly prepared, the no-fail behavior can be turned off in any given allocation request by adding the __GFP_NORETRY flag; this has been done for roughly 100 allocation calls in the 4.12-rc1 kernel. Whether that flag will spread to larger parts of the kernel remains to be seen; as with the BKL removal, it will probably require the help of a group of developers who are willing to put a lot of time into the task.

The kernel community makes internal API changes on a regular basis; most of the time, it is a simple matter of a bunch of editing work or a Coccinelle script. But subtle semantic changes are harder, and eliminating the too-small-to-fail behavior certainly qualifies as that kind of change. The longer it remains, the more entrenched it is likely to become, but there are no signs that it will be able to change anytime soon.

Index entries for this article
Kernel	Memory management/Page allocator

Revisiting "too small to fail"

Posted May 20, 2017 14:01 UTC (Sat) by arjan (subscriber, #36785) [Link]

All the error paths also add to the code size in the binary, from that angle, a non-failing kmalloc() is actually a very nice thing, especially for the smaller side of embedded or cloud.

Revisiting "too small to fail"

Posted May 20, 2017 14:16 UTC (Sat) by vbabka (subscriber, #91706) [Link] (23 responses)

Ugh, is somebody purposely sprinkling __GFP_NORETRY just to allow non-costly-order requests fail? I wouldn't recommend that, as it also prevents the reclaim and compaction from retrying (with increasing priority for compaction). The failure can thus be premature, especially for non-zero orders. It's fine for opportunistic allocation attempts that e.g. have a fallback, but it's IMHO not a good idea to start massively marking all allocations that have an error path with __GFP_NORETRY.

Revisiting "too small to fail"

Posted May 20, 2017 14:30 UTC (Sat) by corbet (editor, #1) [Link] (22 responses)

So, then, what is the right approach to take here? Just accept the status quo?

Revisiting "too small to fail"

Posted May 20, 2017 14:38 UTC (Sat) by vbabka (subscriber, #91706) [Link] (17 responses)

It's possible we'll have to introduce __GFP_MAYFAIL after all, just so we can move forward. As much as I'll hate the churn this will cause - if we ever realize that everything is marked MAYFAIL or NOFAIL, we can change the default to MAYFAIL and drop the flag again. Is it the unavoidable price for the safest course?

Revisiting "too small to fail"

Posted May 20, 2017 18:02 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (16 responses)

I'm not sure I'm following, isn't "may fail" the default? Are you proposing the __GFP_NOFAIL become the default behavior?

Revisiting "too small to fail"

Posted May 21, 2017 5:13 UTC (Sun) by ncm (guest, #165) [Link] (4 responses)

Are you shocked? Yes, it's shocking, as it was to many in 2014. But people can get used to anything.

Version 7 UNIX would ungracefully panic (i.e. crash) on out-of-memory, on the assumption that rebooting would get you back to working again sooner than muddling along. The code to handle failures wouldn't have fit anyway.

Nowadays the kernel is likely to be running on a hypervisor, meaning it's really just a user-space program itself, but with pretensions. Killing it and starting ("spinning up") another is a reasonable alternative to muddling along.

Note that your phone OS is almost certainly running under a hypervisor, and has no direct physical access to the radio transmitter. I don't think any phone runs multiple VMs yet, but it is only a matter of time. Then, probably each app will run in its own VM, with its own toy OS.

It will be interesting when the spooks' exploits to suborn your phone hypervisor leak out.

Revisiting "too small to fail"

Posted May 21, 2017 14:50 UTC (Sun) by excors (subscriber, #95769) [Link] (1 responses)

> Note that your phone OS is almost certainly running under a hypervisor, and has no direct physical access to the radio transmitter.

As far as I'm aware (which might not be far enough), in most cases the closest thing to a hypervisor is TrustZone running exactly two OSes - the 'non-secure' Android one with unimportant stuff like the user's highly private data, and the 'secure' one that does a few vital things like protected DRM video playback. Almost all the hardware is accessed directly by the non-secure OS and is not abstracted enough to let multiple VMs share it. The main thing separating the OS from the radio transmitter is that they're on separate chips (or separate blocks of the same chip) and the only physical connection between them is RAM and interrupts used to implement some message-passing protocol, and if you're very lucky there might be a correctly-configured IOMMU blocking the CPU from accessing the modem OS's region of RAM and vice versa; there's no hypervisor involved in their communication.

> probably each app will run in its own VM, with its own toy OS.

How would that help with out-of-memory problems? If you run out of physical memory, killing an app VM doesn't sound much easier than killing an app process. Maybe you could reduce the risk of multi-page allocation failures caused by fragmentation if you overcommit and give each VM very large amounts of guest-physical address space, so it can easily find guest-physically-contiguous pages, but that'd only really work if the guests can return unused pages to the host with single-page granularity, which I assume isn't easy. (It also seems quite hacky to use a VM just so the kernel can pretend it's doing physical allocations when really they're virtual allocations; surely it'd be better to make the kernel know it's doing virtual allocations and not need the VM at all). What real benefits would come from using per-app VMs?

Revisiting "too small to fail"

Posted May 26, 2017 7:29 UTC (Fri) by Kamilion (guest, #42576) [Link]

> The main thing separating the OS from the radio transmitter is that they're on separate chips (or separate blocks of the same chip) and the only physical connection between them is RAM and interrupts used to implement some message-passing protocol,

"Would you believe two UARTs and a pack of playing cards missing the aces and the kings?" --Agent 86

Revisiting "too small to fail"

Posted May 21, 2017 15:47 UTC (Sun) by jejb (subscriber, #6654) [Link]

> Nowadays the kernel is likely to be running on a hypervisor, meaning it's really just a user-space program itself, but with pretensions. Killing it and starting ("spinning up") another is a reasonable alternative to muddling along.

That's a cop out: while the vast majority of cloud instances (and this ignores all the phones, tablets and laptops) may be running on a hypervisor; in most cloud cases Linux *is* the hypervisor as well. Crashing the kernel on memory failure would take down the whole physical node and thus all the instances. This is seen as undesirable even in the cloud.

Revisiting "too small to fail"

Posted May 22, 2017 6:50 UTC (Mon) by JdGordy (subscriber, #70103) [Link]

> I don't think any phone runs multiple VMs yet, but it is only a matter of time. Then, probably each app will run in its own VM, with its own toy OS.

In a previous life we (ok-labs.com) were running multiple full linux VM's on a single phone, one or 2 with the full android userland and half a dozen or more doing various safety/virtualisation things (network virtualisation, virtualising and sharing the various bits of hardware).

worked pretty well until GD bought the place.

Revisiting "too small to fail"

Posted May 21, 2017 13:47 UTC (Sun) by vbabka (subscriber, #91706) [Link] (10 responses)

Yes, the default is strictly speaking "may fail", but only under very specific conditions - as Michal Hocko mentioned in the thread, e.g. when the task itself is selected as OOM victim. This makes any error handling even more rarely executed, so even less likely to trust. So statistically speaking, the default is NOFAIL. I'm not proposing NOFAIL to become the real default. I'd rather see "may fail" the real default, but just flipping the switch and making the existing error handling more likely to be executed, would be too dangerous IMHO.

Revisiting "too small to fail"

Posted May 21, 2017 22:30 UTC (Sun) by neilbrown (subscriber, #359) [Link] (9 responses)

> I'm not proposing NOFAIL to become the real default.

Sad. I think NOFAIL really should be the default (at least for PAGE_SIZE or less). Almost always kmalloc doesn't fail, so it wouldn't really be a big change in behaviour.
I don't see that an OOM victim needs special treatment. The memory that is freed when a victim is killed is mostly the user-space mappings, and those are freed independently of what the kernel code is doing (aren't they?).

I think we need clearly defined waiting behaviour, and I think the options should be:
1/ don't wait - usable in interrupts and spinlocks
2/ wait for kswapd (or whatever) to make one pass trying to free memory - used when memory would be convenient an easy fall-back is available
3/ wait indefinitely - thread eventually goes onto a (per-cpu?) queue and as memory is made available (possibly by killing mem hogs), it is given to threads on the queue. This must never be used on the write-out path or in shrinkers etc.

> would be too dangerous

ahh for the good old days of even=stable, odd=devel. Then we could break everything in the devel series and clean up the pieces as they were found. Kernel development is just too safe these days!!

Revisiting "too small to fail"

Posted May 23, 2017 8:19 UTC (Tue) by vbabka (subscriber, #91706) [Link] (8 responses)

OK, Michal has reminded me that the idea of making the default to allow fail was already shot down by Linus once, because it would propagate more -ENOMEM to existing userspace, which might be unprepared to handle it. And it's hard to make sure that all allocations that can lead to the userspace ENOMEM are properly covered, before switching the default.
On the other hand here is no issue with TIF_MEMDIE tasks failing an allocation, as that cannot propagate to userspace (the task is killed before returning). So there's no way to change the default to the full "may fail", and little incentive to change the default to the full "can't fail".

Revisiting "too small to fail"

Posted May 23, 2017 10:02 UTC (Tue) by nix (subscriber, #2304) [Link] (6 responses)

Userspace has to be able to handle -ENO-anything-for-any-reason in any case. All sorts of things can already throw errors, and userspace is usually unprepared for all of them and just dies. -ENOMEM would be just the same, only more so: if it wasn't checking, it would immediately try to use whatever it was, dereference a null pointer, and die. And it has to be able to handle -ENOMEM in any case because userspace has no idea if the kernel-side allocation is 'too big' or not, and that can of course change at any time.

Revisiting "too small to fail"

Posted May 23, 2017 12:13 UTC (Tue) by epa (subscriber, #39769) [Link] (5 responses)

It's accepted that userspace code is useless and doesn't check for or handle failure conditions, even for something as basic as reading a file from disk. The only errors handled by userspace are those which tend to occur frequently in practice, and if you add a new error which didn't often happen before, almost all existing programs will fail to deal with it. That's why we have 'hard' mounts for NFS. Even after 30 years or so userspace has not progressed to the point where it's safe to return failure if a remote file server is down. Better to hang indefinitely and hope the problem goes away. (This applies almost as much for reading as for writing.) So it is too for kernel memory allocation failures.

Revisiting "too small to fail"

Posted May 26, 2017 23:39 UTC (Fri) by nix (subscriber, #2304) [Link] (4 responses)

For disk I/O I've been told by a lot of grizzled Unix people that "no, disk I/O can never fail except -ENOSPC or -EIO if the disk is damaged", and that NFS failing is just a sign that NFS is broken.

I asked them what exactly it's meant to do if the server goes down, and they look at me, puzzled, as if this is impossible to conceive of.

(These are grizzled-enough veterans that they probably consider Unix to be what BSD did -- POSIX? Who reads that?)

Revisiting "too small to fail"

Posted May 30, 2017 1:37 UTC (Tue) by neilbrown (subscriber, #359) [Link] (3 responses)

> no, disk I/O can never fail except -ENOSPC or -EIO if the disk is damaged

This must be pre-4.3BSD (or there abouts). They clearly don't know about EDQUOT.

> I asked them what exactly it's meant to do if the server goes down

We are talking about "disk I/O" here. There is no server. NFS just pretends, fakes a lot of stuff, glosses over the differences, mostly works but sometimes doesn't do quite what you want.

If you want a new contract between the application and the storage backend, you need to write one. You cannot just assume that an old contract can magically work in a new market place.

We could invent O_REMOTE which the application uses to acknowledge that the data might be stored in a remote location, and that it is prepare to handle the errors that might be associated with that - e.g. ETIMEDOUT ???
What would it mean to memory-map a file opened with O_REMOTE? That you are happy to receive SIGBUS?
What does it mean to execveat() a file, passing AT_REMOTE?? Should it download the whole file (and libraries?) and cache them locally before succeeding?

It really doesn't help to just whine because something doesn't magically match your perceived use-case. The only sensible way forward is to provide a clear design of a set of semantics that you would like to be available. Then we can have a meaningful discussion.

Revisiting "too small to fail"

Posted May 31, 2017 13:45 UTC (Wed) by nix (subscriber, #2304) [Link] (2 responses)

> This must be pre-4.3BSD (or there abouts). They clearly don't know about EDQUOT.

I suspect they were just shocked to get -EINTR from an NFS disk and were trying to argue that you should never need to check for short reads ever. Running out of quota, rather than disk space, is a sufficiently obscure edge case that even I'd forgotten about it. (They were doubly sure that you shouldn't need to check for -ENOSPC when writing inside files, and were surprised when I mentioned sparse files... sigh.)

Note: I said these people were grizzled, not that they were skilled. This was just a common belief among at least some of the "grunts on the ground" in the Unix parts of the City of London in the late 90s, is all... if they were skilled they would not have been working where they were, but somewhere else in the City that paid a lot more!

> We are talking about "disk I/O" here. There is no server. NFS just pretends, fakes a lot of stuff, glosses over the differences, mostly works but sometimes doesn't do quite what you want.

Given the ubiquity of NFS and Samba, this seems unfortunate, but it is true that the vast majority of applications are not remotely ready to deal with simple network failures, let alone a split-brain situation in a distributed filesystem! (Given the number of errors that even distributed consensus stores make in this area, I'm not sure *anyone* is truly competent to write code that runs atop something like that.)

> It really doesn't help to just whine because something doesn't magically match your perceived use-case. The only sensible way forward is to provide a clear design of a set of semantics that you would like to be available. Then we can have a meaningful discussion.

Agreed, but I'd also like to find one that doesn't break every application out there, nor suddenly stop them working over NFS: that's harder! (My $HOME has been on NFSv3, remote from my desktop, for my entire Unix-using life, so I have a very large and angry dog in this race: without NFS I can't do anything at all. Last week I flipped to NFSv4, and unlike last time, a couple of years back, it worked perfectly.)

Something like your proposed *_REMOTE option would seem like a good idea, but even that has problems: one that springs instantly to mind is libraries that open an fd on behalf of others. That library might be ready to handle errors, but what about the other things that fd gets passed off to? (The converse also exists: maybe the library doesn't use that flag because almost all it does with it is hands the fd back, so it never gets updated, even though the whole of the rest of the application is ready to handle -ESPLITBRAIN or whatever.)

Frankly I'm wondering if we need something better than errno and the ferociously annoying short reads thing in this area: a new set of rules that allows you to guarantee no short reads / writes but comes with extras wrapped around that, perhaps that the writes might later fail and throw an error back at you over netlink or something.

That all seems very asynchronous, but frankly we need that for normal writes too: if a writeback fails it's almost impossible for an application to tell *what* failed without fsync()ing everywhere and spending ages waiting... but this probably requires proper AIO so you can submit IOs, get an fd or other handle to them, then query for the state of that handle later. And we know how good Linux is in *that* area :( just because network I/Os are even more asynchronous by nature than other I/Os, and more likely to have bus faults and the like affecting them, doesn't mean that the same isn't true of disk I/O too. Disks are on little networks, after all, and always have been, and with SANs they're on not-so-little networks too.

(This is not even getting into how much more horrible this all gets when RAID or other 1:N or N:1 stuff gets into the picture. At least RAID split across disks on different machines is relatively rare, though it has saved my bacon in the past to be able to run md partially across NBD for a while during disaster recovery!)

Revisiting "too small to fail"

Posted Jun 2, 2017 18:11 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

I think a lot of people run raid-1 split over machines.

My moan at the moment is people who think that just because raid CAN detect integrity errors, then it SHOULDN'T. Never mind. I ought to use it as an exercise to learn kernel programming.

But it does appear that a lot of what the kernel does is still stuck in the POSIX mindset. It would be nice if people could sit down and say "POSIX is so last century, what should linux do today?". I think the problem is, though, as Linus said, it's like herding cats ...

Cheers,
Wol

Revisiting "too small to fail"

Posted Jun 3, 2017 21:55 UTC (Sat) by nix (subscriber, #2304) [Link]

That's not as much of a problem as I thought it was. raid6check does what's necessary (it's just, uh, not installed by default). Frankly, I'm happy to wait for the immensely rare occasion when mismatch_cnt rises above 0 on a RAID-6 array and then run raid6check on it: complete automation of events as rare as that isn't terribly important to me. (But maybe it is for others...)

Revisiting "too small to fail"

Posted May 24, 2017 1:17 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> So there's no way to change the default ....

Maybe you mean "there's no easy way", but there is a way.

1/ Introduce "GFP_DEFAULT" which does the right thing, and GFP_NOFAIL which really don't fail.
2/ Mark "GFP_KERNEL" as deprecated
3/ Start changing GFP_KERNEL to something else, and nagging others to do the same.

It worked for BKL .... eventually.
We can do this. We should do this! At least we can start doing this!!

Revisiting "too small to fail"

Posted May 20, 2017 20:44 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (1 responses)

Heavier use of kvalloc is probably a good idea, since it uses __GFP_NORETRY internally.

Revisiting "too small to fail"

Posted May 21, 2017 13:56 UTC (Sun) by vbabka (subscriber, #91706) [Link]

You mean kvmalloc? That can help for allocations larger than a single page, yeah, otherwise not.

Revisiting "too small to fail"

Posted May 22, 2017 11:43 UTC (Mon) by mstsxfx (subscriber, #41804) [Link]

JFTR I have tried to change __GFP_REPEAT to have "do not retry for ever" for
requests regardless of their size. The last attempt is
http://lkml.kernel.org/r/20170307154843.32516-1-mhocko@ke...

There doesn't seem to be a huge interest in this flag so far, though.

--
Michal Hocko

Revisiting "too small to fail"

Posted May 23, 2017 9:43 UTC (Tue) by dvrabel (subscriber, #9500) [Link]

32 KiB doesn't seem like a "small" allocation to me -- anything more than PAGE_SIZE can fail (or trigger the OOM killer) due to fragmentation. Perhaps the thing to do here is to reduce the size that is considered small, until small is a page or less?

By way of example, the Xen hypervisor has taken considerable effort to ensure all memory allocation are a page or less, or have fallback paths that do so.

Putting the problem into perspective

Posted May 20, 2017 19:15 UTC (Sat) by ebiederm (subscriber, #35028) [Link]

Several years ago I ran into the problem of code trying forever for small allocations.
The fix of the immediate symptoms was: 96c7a2ff2150 ("fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem")

The basic issue was that the file table expansion code was making a 32KiB allocation. Which is the maximum size at which the code retries forever.

Except the code doesn't exactly try forever. Instead of failing the allocation the code instead triggers the OOM killer. Which in effect shifts where the failure was happening. In the case I was dealing with this caused the OOM killer to be triggerd on a system with roughly 4GiB free memory. The problem was that there were no chunks of memory of size 32KiB large and at that point it had no way to defragment the memory to make a 32KiB chunk of memory available.

The file table code in question had a fallback to handle a large page allocation failure. The code performs a vmalloc instead of a kmalloc.

Which demonstrates two things.
- That all allocations <= PAGE_ALLOC_COSTLY_ORDER won't fail because such pages will always be available is observably wrong.
- That there is actually harm in retrying forever on some code paths. As my fix demonstrated the retrying forever heuristic took a system that would have stayed up and caused it to crash.

That said I don't argue that on most code paths retrying forever is generally harmless as many times there isn't much that can be done except return an error to userspace.

Revisiting "too small to fail"

Posted May 21, 2017 21:49 UTC (Sun) by zlynx (guest, #2285) [Link] (2 responses)

It sounds like we need a fault injection test framework that takes ideas from fuzzers like AFL and continues to try variants until it exercises all of the code paths.

Even fancier, if the test framework could examine branch conditions and work backward to create the conditions to exercise untested branches. Or if it can't work it out, log it for developer attention so a custom test rule can be created.

Although, it'd also need a lot of verification code written. Sure, the memory allocation failure during bad block handling during a btrfs scrub, at the same time the SAS link got hot-removed got handled, but is the filesystem still correct? Someone has to write that.

Or possibly a simpler idea, take one of the existing automatic unit test generators and modify it for kernel code, although there's always the problem of knowing if the hardware is being simulated correctly.

Revisiting "too small to fail"

Posted May 22, 2017 15:28 UTC (Mon) by mageta (subscriber, #89696) [Link]

> Or possibly a simpler idea, take one of the existing automatic unit test generators and modify it for kernel code, although there's always the problem of knowing if the hardware is being simulated correctly.

If there is simulation at all. CPU/MMU alright, even some basic I/O. But I'd wager, for over 90% of the hardware for which the kernel has drivers there is no simulation at all. There is also no trend to make more "simulations", its (understandably) much more interesting to do para-virtualized solutions.

Revisiting "too small to fail"

Posted May 23, 2017 9:34 UTC (Tue) by vegard (subscriber, #52330) [Link]

> It sounds like we need a fault injection test framework that takes ideas from fuzzers like AFL and continues to try variants until it exercises all of the code paths.

I've been running trinity and syzkaller with this patch, which records unique callchains and fails allocations from previous-unseen callers:

https://patchwork.kernel.org/patch/9378219/

It works pretty well and uncovered a handful of cases that I submitted patches for, but far less than I would have expected.