
Allowing small allocations to fail

By Jonathan Corbet
March 11, 2015

LSFMM 2015
As Michal Hocko noted at the beginning of his session at the 2015 Linux Storage, Filesystem, and Memory Management Summit, the news that the memory-management code will normally retry small allocations indefinitely rather than returning a failure status came as a surprise to many developers. Even so, this behavior is far from new; it was first added to the kernel in 2001. At that time, only order-0 (single-page) allocations were treated that way, but, as the years went by, that limit was raised repeatedly; in current kernels, anything that is order-3 (eight pages) or less will not normally be allowed to fail. The code to support this mode of operation has become more complex over time as well.
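As a quick check of the arithmetic above: an order-n allocation covers 2^n contiguous pages. The sketch below is illustrative only; the helper names are hypothetical and the 4KiB page size is an assumption that holds on many, but not all, architectures.

```python
PAGE_SIZE = 4096  # assumption: 4KiB pages; some architectures use other sizes


def pages_for_order(order: int) -> int:
    """Number of contiguous pages in an order-`order` allocation (2**order)."""
    return 1 << order


def bytes_for_order(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of an order-`order` allocation for the given page size."""
    return pages_for_order(order) * page_size
```

Under that assumption, `pages_for_order(3)` is 8 and `bytes_for_order(3)` is 32768 bytes, or 32KiB, which is the largest request currently covered by the "too small to fail" behavior.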

Relatively late in the game, the __GFP_NOFAIL flag was added to specifically annotate the places in the kernel where failure-proof allocations are needed, but the "too small to fail" behavior has never been removed from other allocation operations. After 14 years, Michal said, there will certainly be many places in the code that depend on these semantics. That is unfortunate, since the failure-proof mode is error-prone and unable to deal with real-world situations like infinite retry loops outside of the allocator, locking conflicts, and out-of-memory (OOM) situations. The result is occasional lockups as described in this article.

There have been various attempts to get around the problem, such as adding timeouts to the OOM killer (see this article), but Michal thinks such approaches are "not nice." The proper way to handle that kind of out-of-memory problem is to simply fail allocation requests when the necessary resources are not available. Most of the kernel already has code to check for and deal with such situations; beyond that, the memory-management code should not be attempting to dictate the failure strategy to the rest of the kernel.

Changing the allocator's behavior is relatively easy; the harder question is how to make such a change without introducing all kinds of hard-to-debug problems. The current code has worked for 14 years, so there will be many paths in the kernel that rely on it. Changing its behavior will certainly expose bugs.

Michal posted a patch just before the summit demonstrating the approach to the problem that he is proposing. That patch adds a new sysctl knob that controls how many times the allocator should retry a failed attempt before returning a failure status; setting it to zero disables retries entirely, while a setting of -1 retains the current behavior. There is a three-stage plan for the use of this knob. In the first stage, the default setting would be for indefinite retries, leaving the kernel's behavior unchanged. Developers and other brave people, though, would be encouraged to set the value lower. The hope is to find and fix the worst of the resulting bugs in this stage.
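The semantics of the knob as described above can be modeled in a few lines. This is a userspace sketch, not the patch itself: the function and parameter names are hypothetical, and `reclaim` merely stands in for whatever work the allocator does between attempts.

```python
from typing import Callable, Optional

RETRY_FOREVER = -1  # models the "-1 retains the current behavior" setting


def alloc_with_retries(try_alloc: Callable[[], Optional[object]],
                       reclaim: Callable[[], None],
                       max_retries: int) -> Optional[object]:
    """Attempt an allocation, retrying a failed attempt up to max_retries times.

    max_retries == 0 disables retries entirely; RETRY_FOREVER loops until
    the allocation succeeds. Returns None to signal failure to the caller.
    """
    retries_left = max_retries
    while True:
        obj = try_alloc()
        if obj is not None:
            return obj
        if retries_left == 0:
            return None          # budget exhausted: report the failure
        if retries_left > 0:
            retries_left -= 1    # negative values are never decremented
        reclaim()                # stand-in for reclaim work between attempts
```

With a setting of zero the caller sees the first failure immediately, which is the behavior that testers were being encouraged to try.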

In the second stage, an attempt would be made to get distributors to change the default value. In the third and final stage, the default would be changed in the upstream kernel itself. Even in this stage, where, in theory, the bugs have been found, the knob would remain in place so that especially conservative users could keep the old behavior.

Michal opened up the discussion by asking if the assembled developers thought this was the right approach. Rik van Riel said that most kernel code can handle allocation failure just fine, but a lot of those allocations happen in system calls. In such cases, the failures will be passed back to user space; that is likely to break applications that have never seen certain system calls fail in this way before.

Ted Ts'o added that the kernel would most likely be stuck in the first stage for a very long time. As soon as distributions start changing the allocator's behavior, their phones will start ringing off the hook. In the ext4 filesystem, he has always been nervous about passing out-of-memory failures back to user space because of the potential for application problems. If the system call interface does that instead it won't be his fault, he said, but things will still break.

Peter Zijlstra observed that ENOMEM is a valid return from a system call. Ted agreed, but said that, after all these years, applications will break anyway, and then users will be knocking at his door. He went on to say that in large data-center settings (Google, for example) where the same people control both kernel and user space it should be possible to find and fix the resulting bugs. But just fixing the bugs in open-source programs is going to be a long process. In the end, he said, such a change is going to have to provide a noticeable benefit to users — a much more robust kernel, say — or we will be torturing them for no reason.

Andrew Morton protested that the code we have now seems to work almost all of the time. Given that the reported issues are quite rare, he asked, what problem are we actually trying to solve? Andrea Arcangeli noted that he'd observed lockups and that the OOM killer's relatively unpredictable behavior does not help. He tried turning off the looping in the memory allocator and got errors out of the ext4 filesystem instead. It was a generally unpleasant situation.

Andrew suggested that making the OOM killer work better might be a better place to focus energy, but Dave Chinner disagreed, saying that it was an attempt to solve the wrong problem. Rather than fix the OOM killer, it would be better to not use it at all. We should, he said, take a step back and ask how we got into the OOM situation in the first place. The problem is that the system has been overcommitted. Michal said that overcommitting of resources was just the reality of modern systems, but Dave insisted that we need to look more closely at how we manage our resources.

Andrew returned to the question of improving the OOM killer. Perhaps, he said, it could be made to understand lock dependencies and avoid potential deadlock situations. Rik suggested that was easier said than done, though; for example, an OOM-killed process may need to acquire new locks in order to exit. There will be no way for the OOM killer to know what those locks might be prior to choosing a victim to kill. Andrew acknowledged the difficulties but insisted that not enough time has gone into making the OOM killer work better. Ted said that OOM killer improvements were needed regardless of any other changes; since the allocator's default behavior cannot be changed for years, we will be stuck with the OOM killer for some time.

Michal was nervous about the prospect of messing with the OOM killer. We don't, he said, want to go back to the bad old days when its behavior was far more random than it is now. Dave said, though, that it is not possible to have a truly deterministic OOM killer if the allocation layers above it are not deterministic. It will behave differently every time it is tested. Until things are solidified in the allocator, the OOM killer is, he said, not the place to put effort.

The session wound down with Michal saying that starting to test kernels that fail small allocations will be helpful even if the distributors do not change the default for a long time. Dave said that he would turn off looping in the xfstests suite by default. There was some talk about the best values to use, but it seems it matters little as long as the indefinite looping is turned off. Expect to see a number of interesting bugs once this testing begins.

[Your editor would like to thank LWN subscribers for funding his travel to LSFMM 2015.]

Index entries for this article
Kernel: Memory management/Page allocator
Conference: Storage, Filesystem, and Memory-Management Summit/2015



Allowing small allocations to fail

Posted Mar 11, 2015 1:11 UTC (Wed) by scientes (guest, #83068) [Link]

Won't the OOM killer have to remain around as long as __GFP_NOFAIL exists in non-deterministic ways, even if it is then used far less and under much better understood triggers? Eventually might all the complexity curves of the places that use __GFP_NOFAIL be analyzed so that the amount of reserve memory necessary to avoid such kills might be known?

Allowing small allocations to fail

Posted Mar 11, 2015 2:12 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (14 responses)

I wonder why they don't approach this problem from the other side - add a special flag __GFP_CAN_FAIL and start annotating the kernel with it.

Allowing small allocations to fail

Posted Mar 11, 2015 3:59 UTC (Wed) by neilbrown (subscriber, #359) [Link] (3 responses)

> I wonder why they don't approach this problem from the other side - add a special flag __GFP_CAN_FAIL and start annotating the kernel with it.

The flag exists and is spelt "__GFP_NORETRY".

Setting that everywhere could get very noisy.... might be a good idea though.

Allowing small allocations to fail

Posted Mar 11, 2015 9:21 UTC (Wed) by ewen (subscriber, #4772) [Link] (2 responses)

So that suggests a way forward:

* Document that allocations for up to 8 pages (order=3) implicitly have __GFP_NOFAIL set on them now, and that in N kernel versions only allocations of up to 4 pages (order=2) will have __GFP_NOFAIL *implicitly* set on them.

* Any code that already has its own recovery mechanism for "allocation might fail", can be amended to include __GFP_NORETRY on it -- or some other flag that means "try some simple stuff, but don't invoke the OOM killer" -- and any code that gets fixed up in advance to cope with the reduced "too small to fail" can have that flag added.

* Any code that cannot be so fixed gets updated to explicitly have __GFP_NOFAIL on it, to document that it cannot cope with running out of memory.

* In N kernel versions change the implicit "too small to fail" protection as promised, and publish another "in N kernel versions it will be changed to..." level.

* Lather, rinse and repeat.

If the clutter of all these extra flags becomes too much, create some wrapper functions (macros?) that encapsulate the "do this if possible now" and "keep trying until you can do this", and change to those. As annotations go it doesn't strike me as *that* much more clutter than, eg, likely()/unlikely().

Plus of course test all these changes with the fault injection framework. Because it sounds like a lot of those "allocation failed" handlers probably haven't been tested in anger due to "too small to fail".

Ewen

Allowing small allocations to fail

Posted Mar 11, 2015 11:24 UTC (Wed) by vbabka (subscriber, #91706) [Link] (1 responses)

I don't see the benefit of annotating all places where allocation can fail, if the default is going to change later anyway. Sounds like a lot of work that will in the end just clutter the code needlessly. What's needed is annotating (or fixing, if possible) places where allocation *cannot* fail.

Allowing small allocations to fail

Posted Mar 11, 2015 12:01 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> Sounds like a lot of work that will in the end just clutter the code needlessly.

It's a good way to change a default. In the initial state, you have a lot of code which might rely on the old default. Gradually you change each call site to explicitly state what it really needs (cannot fail, indifferent, can fail, can fail and must do so quickly), so after a while nothing depends on the default (except the ones explicitly verified as "indifferent"). Then you change the default, which will now have no effect (except for the call sites explicitly verified as "don't care"). Finally, you gradually remove the now redundant annotations corresponding to the new default.

Yes, it's more work, but it's also a safer path. Each step is small enough that it can be individually verified, and what would otherwise be a large step (the change in the default) becomes a minor step. It's similar to the concept of a "reversible process" in physics.

Allowing small allocations to fail

Posted Mar 11, 2015 9:31 UTC (Wed) by roblucid (guest, #48964) [Link] (9 responses)

What can most code do on memory allocation failure? Either retry or fail. What value is there in passing the burden to all callers, rather than have generic code for common cases?

Seems to me the filesystem code, which may be used to free memory, needs some kind of reserve on that path for such action on behalf of the MM, so progress is made. As Chinner suggests.

Better than killing a process would be quiescing the ones experiencing the repeated allocation retries and paging out as much of them as possible. That ought to make memory hogs suffer the performance problems, rather than causing unpredictable, hard-to-administer process failures that set off a cascade of hard-to-recover error conditions.

Allowing small allocations to fail

Posted Mar 11, 2015 11:30 UTC (Wed) by vbabka (subscriber, #91706) [Link] (3 responses)

The reservations were discussed the next day, so I expect LWN will cover that soon too :)

As for "paging out as much as possible", that's already done. OOM is only invoked when there's nothing else to page out (memory reclaim cannot make any further progress). However that sometimes leads to thrashing the system for (tens of) minutes with last few remaining reclaimable pages, when OOM killing a runaway process would allow it to recover much more quickly.

Allowing small allocations to fail

Posted Mar 16, 2015 4:02 UTC (Mon) by ploxiln (subscriber, #58395) [Link] (2 responses)

I've acutely experienced that problem, of extended struggling before invoking the OOM killer. I had swap completely disabled on my personal laptop for a few years in an attempt to avoid it, but it still eventually happened (virtual machines, browsers...). I use linux on both personal machines and popular-website-serving servers, and in both cases prefer "fail fast", instead of "go into uselessly slow mode".

I suppose someone will say to disable overcommit, but many runtimes, particularly nodejs, pretty much require it. Browsers of course require it. Lots of threads with large stacks they won't use, also use a lot of virt.

Allowing small allocations to fail

Posted Mar 16, 2015 16:05 UTC (Mon) by vbabka (subscriber, #91706) [Link] (1 responses)

This is a somewhat different problem. If OOM was not invoked, it means the reclaim still succeeded, so it wasn't a fail/nofail situation; however, the system was already perceived as unusable. Reports such as yours appear from time to time; see e.g. the https://lkml.org/lkml/2015/1/23/688 thread, which even contains a candidate patch, but the reporter apparently hasn't tested it so far.

So if we call the OOM killer early, do we still get races?

Posted Mar 17, 2015 13:57 UTC (Tue) by gmatht (subscriber, #58961) [Link]

I think the point is that if the kernel doesn't have 32KiB of clean pages left to dealloc, then from the point of view of an ordinary desktop, the system has already failed. It would already be unusable.

It is very hard to know *exactly* how much memory is needed to free a page. However, even on a low end 2GB netbook, and even if we reserve only 1% of ram for clean pages (and emergency allocations) that still allows a lot of "too small to fail" allocations. On most systems, reserving 10% would seem reasonable. On embedded systems without a disk it may not make much sense, but they may still have a use for Transcendent/ephemeral memory (and they probably aren't using XFS anyway).

Allowing small allocations to fail

Posted Mar 11, 2015 12:09 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> What can most code do on memory allocation failure? Either retry or fail.

Had a flashback to the MS-DOS days here. The full set of options in DOS was "abort, retry, ignore, fail", IIRC.

Which means there are two more options on memory allocation failure. "Abort" would be "panic the kernel". Retry is "try again until it succeeds". "Ignore" would be "fail the allocation, but go ahead with the rest of the work". And "fail" would be "fail the allocation, undo any partial work, and return an error to the caller".
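The four options listed above can be written down as a small policy switch. This is only a userspace illustration of the taxonomy; the `Policy`, `Panic`, and `allocate` names are hypothetical, and `Panic` is a stand-in for a kernel panic.

```python
import enum


class Policy(enum.Enum):
    ABORT = "abort"    # panic
    RETRY = "retry"    # try again until it succeeds
    IGNORE = "ignore"  # fail the allocation but carry on with the work
    FAIL = "fail"      # undo partial work and return an error


class Panic(Exception):
    """Stand-in for a kernel panic in this userspace sketch."""


def allocate(try_alloc, policy, undo=lambda: None):
    """Call try_alloc() and react to a None result according to policy."""
    while True:
        buf = try_alloc()
        if buf is not None:
            return buf
        if policy is Policy.ABORT:
            raise Panic("out of memory")
        if policy is Policy.IGNORE:
            return None  # caller proceeds without the buffer
        if policy is Policy.FAIL:
            undo()       # roll back whatever was done so far
            raise MemoryError("allocation failed")
        # Policy.RETRY: loop around and try again
```

In these terms, the "too small to fail" behavior corresponds to `Policy.RETRY`, while the change under discussion would move more callers to `Policy.FAIL`.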

Allowing small allocations to fail

Posted Mar 11, 2015 21:59 UTC (Wed) by iabervon (subscriber, #722) [Link]

It may be able to release some locks, reacquire them, and then retry. This would make it possible that a process that wants to free memory (or is being forced to) will get the lock, and make the memory the first process needs available. The allocator doesn't know how to retry a complete critical section, which may be the way out of the situation.

Allowing small allocations to fail

Posted Mar 11, 2015 23:12 UTC (Wed) by gdt (subscriber, #6284) [Link] (2 responses)

> What can most code do on memory allocation failure? Either retry or fail.

Yes, but an application being told of a low-memory situation is nicer than being a candidate for the OOM killer in a few seconds time. I write routing code and if I get a malloc() fail then I'm happy enough to drop the neighbour which owns those routes rather than risk losing all neighbours thanks to the OOM killer. I'm sure this is true of a lot of servers with multiple concurrent users. And of course there's a lot of application-layer caching which could use some indicator of platform memory pressure, even if a failed malloc().

Allowing small allocations to fail

Posted Mar 12, 2015 0:25 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

in theory you are right, but in practice so much stuff doesn't check malloc results that it's not going to help you for very long; any little bit of ram that your application releases will be gobbled up by other software.

But that still has nothing to do with this "too small to fail" problem. The "too small to fail" problem is entirely inside the kernel.

Allowing small allocations to fail

Posted Mar 16, 2015 16:12 UTC (Mon) by vbabka (subscriber, #91706) [Link]

> But that still has nothing to do with this "too small to fail" problem. The "too small to fail" problem is entirely inside the kernel.

Not necessarily AFAIK. It won't introduce failures for malloc() and other stuff based on mmap() and friends (i.e. userspace allocations that can already fail), but can introduce ENOMEM for other syscalls, which need to allocate in-kernel objects during their execution. Those wouldn't fail now, but will be allowed to.

Causing small allocations to fail!

Posted Mar 11, 2015 2:32 UTC (Wed) by droundy (guest, #4559) [Link] (1 responses)

Why not add a feature to occasionally fail allocations even when not under memory pressure? For testing purposes, obviously, but it seems like with such a patch one could quickly find a lot of the bugs, far faster than if you needed to create oom conditions just to test the failure handling.

Causing small allocations to fail!

Posted Mar 11, 2015 3:33 UTC (Wed) by josh (subscriber, #17465) [Link]

The kernel does have a fault injection framework; using it to test this kind of thing would make sense.
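The idea of failing allocations on purpose can be illustrated in user space. The sketch below is an analogy only, with hypothetical names; it does not reproduce how the kernel's fault injection framework is configured.

```python
import random


class FailingAllocator:
    """Wrap an allocation function and make a fraction of calls fail.

    `probability` is the chance (0.0 to 1.0) that a call returns None
    instead of reaching the real allocator. A seed makes runs repeatable.
    """

    def __init__(self, alloc, probability, seed=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be between 0.0 and 1.0")
        self._alloc = alloc
        self._probability = probability
        self._rng = random.Random(seed)
        self.injected = 0  # number of failures injected so far

    def __call__(self, size):
        if self._rng.random() < self._probability:
            self.injected += 1
            return None          # injected failure, no memory pressure needed
        return self._alloc(size)
```

Wrapping an allocator this way exercises the error-handling paths on every run rather than only when the system is genuinely out of memory.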

The memory cgroup already allows such failures

Posted Mar 11, 2015 6:56 UTC (Wed) by ebiederm (subscriber, #35028) [Link] (1 responses)

The memory cgroup already allows failures to make it to user space.

In practice bugs happen and everything deals with memory allocator failure better than looping forever in the memory allocator.

I have seen boxes trigger the OOM killer and go splat with gigabytes of memory free because someone figured that 32KiB was an allocation that could never fail, when in fact the odds are very high that memory can be fragmented.

The memory cgroup already allows such failures

Posted Mar 11, 2015 7:36 UTC (Wed) by renox (guest, #23785) [Link]

Interesting, let me guess: this box was in 32 bit mode or the memory needed to be in the first GB (DMA constraint)?
Otherwise this is surprising.

Allowing small allocations to fail

Posted Mar 11, 2015 8:36 UTC (Wed) by epa (subscriber, #39769) [Link] (28 responses)

If userspace isn't designed to handle ENOMEM then a shim in the C library could be used for the transition. If a system call returns ENOMEM then just sleep for ten seconds and retry it - repeat as necessary. That will give something close to the current behaviour (assuming that the kernel is well-behaved and a system call that returns ENOMEM has done nothing - which is a big assumption I admit). All applications could be linked against this shim to begin with and it could gradually be removed as more code is 'certified' as being able to cope with this system call failure.

But really, I think I would prefer to just start returning ENOMEM to user space and let the applications deal with it. It's hard to construe it as a backwards compatibility break since it has always been documented that this return status is possible. But I know Linus takes a more pragmatic view that thinks more about what's visible to the end user.
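The shim idea from the comment above could look roughly like the following. It is a sketch under the same big assumption the comment admits to, namely that a call failing with ENOMEM has had no side effects and is safe to repeat; the `retry_on_enomem` name and its parameters are hypothetical.

```python
import errno
import time


def retry_on_enomem(call, *args, delay=10.0, sleep=time.sleep):
    """Invoke call(*args), sleeping and retrying for as long as it raises ENOMEM.

    Any other OSError is passed through to the caller unchanged.
    The `sleep` parameter exists so the wait can be replaced in tests.
    """
    while True:
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.ENOMEM:
                raise
            sleep(delay)  # wait, then repeat the call as the shim would
```

A wrapped call would then read something like `retry_on_enomem(os.open, path, os.O_RDONLY)`, and removing the wrapper from a call site would mark that code as able to cope with the failure itself.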

Allowing small allocations to fail

Posted Mar 11, 2015 16:40 UTC (Wed) by roblucid (guest, #48964) [Link] (27 responses)

Great, so a temporary condition gets unwound further, and tempts applications to handle, badly and in overall detrimental ways, a rare temporary failure that the kernel's memory management understands better.

If you call a system call, you want errors to be your errors, or hardware errors, not transient system problems caused by over-committing memory!

Seems to me that, whilst I understand passing the buck is convenient, Memory Management is a subsystem precisely because it IS hard. So if there are real problems with the current approach, then stepping back and rethinking is right, rather than proliferating complexity by pushing memory-management issues into the whole of the rest of the system.

Allowing small allocations to fail

Posted Mar 11, 2015 18:26 UTC (Wed) by dlang (guest, #313) [Link] (25 responses)

feel free to step back and think of a better way to get the job done. If you find one, many people are going to be interested.

memory overcommit is not your fundamental problem here. In fact, it helps avoid the problem the vast majority of the time.

The problem is not userspace applications handling memory failures (I don't understand how the discussion here has focused on userspace), the problem is the kernel allocating memory.

The fundamental problem is that memory is finite, when it's full and you want to write something to disk to free the memory, the process of writing the data out can involve memory allocations (in the filesystem, raid layer, lvm layer, network layer if using iscsi, etc). How do you handle this case?

As discussions have shown, you cannot know ahead of time how much memory these can require, so simply having some reserved memory isn't the answer (at least, it's not that simple to figure out how much memory to reserve)

The deeper the stack of layers that you have to go through to write a chunk of memory out to disk, the more likely you are going to run into problems.

If you use a swap partition on a SATA drive directly, there is far less chance of running into problems than if you are using a swap file on ext4 on top of LVM (with snapshots) on a RAID array connected over iSCSI (which is not an unreasonable setup).

Allowing small allocations to fail

Posted Mar 11, 2015 21:19 UTC (Wed) by wahern (subscriber, #37304) [Link] (4 responses)

> memory overcommit is not your fundamental problem here. In fact, it helps avoid the problem the vast majority of the time.

Isn't this basically the same logic used to justify mapping NULL to a read-only page filled with 0s? Which is what BSD did decades ago. It was a similarly poor design decision from the perspective of resilient systems and it took many years to undo.

Yes, the _majority_ of time you end up with benign behavior. The problem is the cases where you end up with completely wrong behavior. And the only way to solve _all_ the problems was for _everybody_ to write more robust code. The only way to coax code to become more robust was to make it fail spectacularly.

Have we forgotten about the very wise advice, fail fast, fail often?

The problem with overcommit and similar measures is that there's no incentive for people to write robust algorithms. Clearly the XFS people are in this camp. They didn't have to worry about it, so they didn't, and now their code is too complex to change.

You _can_ write robust algorithms resilient against low memory situations. Sometimes it's easy. RAII patterns simplify unwinding state, which is just as easy in C as in C++, minus the automatic destructors. Sometimes it's more difficult and might require rethinking your implementation in terms of an explicit state machine. Other times you must make a tradeoff between CPU cost and memory cost. Linux often already does this--for example using an O(2 log N), cache-thrashing red-black tree instead of an O(1), cache-friendly timing wheel for high resolution timers--but nobody really complains about it not being the absolute fastest possible because raw speed is rarely the _only_ consideration.

Not long ago people said that memory was cheap and worrying about memory constraints was a thing of the past. Then the embedded world exploded.

Properly handling OOM situations in core infrastructure code doesn't solve all the problems. But it's a prerequisite for solving all the problems. And we'll never make any headway as long as people can pass the buck to the OOM killer. The hierarchy of abstraction layers might not be so convoluted and poorly composable if they couldn't rely on simplifying assumptions about memory allocation; the interfaces would have had to be better designed to make unwinding state easier and make forward progress yieldable at a more fine-grained level.

Look at block devices. The reason we can't have non-blocking disk I/O is because Unix originally made the simplifying assumption that disk writes were per se non-blocking. The assumption was no doubt worthwhile back when it could reduce the complexity of your code by an order of magnitude. But now block device drivers are so complex anyhow that the cost+benefit tradeoff sucks. Same situation wrt the big kernel lock. I would argue the same thing now applies to the OOM killer. The simplifying assumption has lost its usefulness, presuming it was ever truly useful.

Allowing small allocations to fail

Posted Mar 11, 2015 21:27 UTC (Wed) by dlang (guest, #313) [Link] (3 responses)

The problem here is not userspace and overcommit. The problem is in the kernel when it needs memory in order to write data to disk to allow it to free memory.

Causing user programs to fail more frequently, in the hope that developers will get better at handling the failures, is not going to make the slightest bit of difference to this problem.

Eliminating the OOM killer isn't going to address the problem of needing memory in order to free memory.

I'll also point out the backlash against ext4 for losing data when programmers didn't properly use fsync. This isn't perceived by users as being a problem caused by the programmers of the application they are using, it's perceived as being a failure of the kernel.

Allowing small allocations to fail

Posted Mar 11, 2015 22:46 UTC (Wed) by wahern (subscriber, #37304) [Link] (2 responses)

Non-failing allocations in the kernel is part-and-parcel of the OOM killer. The whole point is that some kernel developers assume that if an allocation request fails, the OOM killer will free up some memory.

Of course failing allocations in user space won't fix XFS or other code. It won't fix user space applications, either. (FWIW, malloc could always fail in Linux because of process resource limits.)

But it does change expectations. In an OOM killer world you can make the assumption that if you're called from user space, then a small allocation could never fail because the OOM killer could always kill at least one user space application and thus free up some memory.

Without the OOM killer, XFS would have had to explicitly make arrangements to reserve a bounded amount of memory ahead of time, rather than implicitly relying on fuzzy assumptions.

XFS and similar code exists, and completely refactoring those things is clearly out of the question. But somebody has to pull the short straw and endure a little more pain than the others if and when Linux moves away from the allocations-cannot-fail simplifying assumption.

Allowing small allocations to fail

Posted Mar 12, 2015 0:08 UTC (Thu) by rgmoore (✭ supporter ✭, #75) [Link]

> Without the OOM killer, XFS would have had to explicitly make arrangements to reserve a bounded amount of memory ahead of time, rather than implicitly relying on fuzzy assumptions.

I think you have the story backward here, at least in regard to XFS. The XFS developers did exactly what you suggested they should do, and reserved a pool of memory in advance in case memory got really tight. XFS would still work in the case that small allocations were allowed to fail, and they're actually the ones who started pushing to allow them to fail. The problem isn't that XFS never considered what to do if memory was really tight, but that their solution never goes into effect because the triggering case- a failed allocation- isn't allowed to happen.

Allowing small allocations to fail

Posted Mar 12, 2015 0:28 UTC (Thu) by dlang (guest, #313) [Link]

It's not so much that filesystem developers assume the OOM killer will free up memory for them. It's far more the case that they are doing some really complex things that are really hard to roll back cleanly, and so they assume that some other kernel thread is going to be able to make progress and free some memory.

I think it's a bad thing that this has grown to be as large and as common as it is. I just don't blame it on overcommit.

Allowing small allocations to fail

Posted Mar 11, 2015 22:12 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> The fundamental problem is that memory is finite,

Amongst the many fundamental problems, the one that stands out to me is that memory is treated as a uniform resource that any code can dip into on a (nearly) equal basis. But different needs really are different, and on more than a high/medium/low priority basis.

There are a class of memory allocations which are only needed for a short period of time. These are the important ones for ensuring forward progress for write-out and freeing memory. They hold network packet headers, or filesystem index blocks that are being updated etc. The grand total amount of memory needed for all of these isn't really very much, and if we had some way to make all these transactions run in sequence - one at a time - you could make solid forward progress with very little memory (it would be slow, but it wouldn't lock-up).

A big issue is that a particular transaction may need to make a sequence of these allocations. The first one will only be genuinely short-lived if all subsequent allocations happen quickly. If a subsequent allocation waits for an earlier allocation to be freed - you deadlock.

mempools are a perfect fit for this need, though they are usually over-provisioned. The pool only really needs one element. Each different allocation needed for a transaction comes from a different mempool. So an allocation from a mempool can only ever block waiting on an allocation in that mempool, or another mempool further downstream, to be freed. This ensures that at least one transaction is always making forward progress.

This is how the block layer works - if you have a stack of block devices (loop over md over dm over SATA), each layer has its own mempools as needed.
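The mempool idea described above can be modeled in a few lines: try the general allocator first, fall back to a small private reserve, and refill the reserve when elements are freed. This is a simplified userspace sketch with hypothetical names, not the kernel's mempool API; in particular, where a real mempool would sleep until an element is freed, this model just returns None.

```python
class Mempool:
    """A tiny model of a memory pool with a guaranteed reserve of elements."""

    def __init__(self, min_nr, make_element):
        self._min_nr = min_nr
        # Pre-allocate the reserve while memory is still plentiful.
        self._reserve = [make_element() for _ in range(min_nr)]

    def alloc(self, try_alloc):
        """Prefer the general allocator; dip into the reserve if it fails."""
        elem = try_alloc()
        if elem is not None:
            return elem
        if self._reserve:
            return self._reserve.pop()
        return None  # a real mempool would wait here for free() to be called

    def free(self, elem):
        """Return an element, topping up the reserve before anything else."""
        if len(self._reserve) < self._min_nr:
            self._reserve.append(elem)
        # Otherwise the element would go back to the general allocator.
```

With one pool per allocation step, a request stalled on an empty reserve waits only for an element from that same pool (or one further downstream) to be freed, which is the forward-progress property the comment relies on.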

But filesystems don't use mempools - mempools have fixed size allocations and filesystems are more complex and can need more variety. So we need something more general.

The big idea here is that the "first" allocation can safely block if there isn't enough space, but "subsequent" allocations must not. "all" we need to do is find a way to associate each allocation with a "transaction". Then we just need to limit the number of transactions that are concurrently active (if memory is tight) and only give the last of the memory to "transaction"-based allocations, and everything will be fine.

The early swap-over-NFS patches had something like this...
See patch 11 of the series linked here: https://lwn.net/Articles/256462/

They were eventually dropped from the series, but the ideas could still be useful.

Allowing small allocations to fail

Posted Mar 11, 2015 22:55 UTC (Wed) by roblucid (guest, #48964) [Link] (1 responses)

> The problem is not userspace applications handling memory failures
> (I don't understand how the discussion here has focused on userspace),
> the problem is the kernel allocating memory.

Simply because in part of the discussion, some suggested system calls returning a failure with ENOMEM, back to userspace.

Allowing small allocations to fail

Posted Mar 12, 2015 0:23 UTC (Thu) by dlang (guest, #313) [Link]

returning ENOMEM to userspace when userspace doesn't know how much memory was needed or why it's needed is a waste of time, and creating a new error code for userspace to look for will mean that existing programs that check error codes now will all be missing the new error code.

Allowing small allocations to fail

Posted Mar 12, 2015 7:44 UTC (Thu) by epa (subscriber, #39769) [Link] (16 responses)

> If you use a swap partition on a SATA drive directly, there is far less chance of running into problems than if you are using a swap file on ext4 on top of LVM (with snapshots) on a RAID array connected over iSCSI (which is not an unreasonable setup)

The disk setup in itself may not be unreasonable, but for swapping? A RAID array over iSCSI implies a moderately expensive server system. Surely if swap space is required, an SATA disk could be plugged into the local motherboard for that purpose.

In other words is the extra complexity of allowing swap files on a filesystem, rather than raw swap partitions, still needed? Swap space was a big deal twenty years ago but while it is still needed today it has diminished in importance a bit.

Allowing small allocations to fail

Posted Mar 12, 2015 7:55 UTC (Thu) by dlang (guest, #313) [Link] (3 responses)

many servers don't have any local storage (I forgot to add the hypervisor layer in my example to top things off)

Personally, I operate servers with minimal or no swap, but the people who are screaming about how evil overcommit and copy-on-write are need to have a LOT of swap so that they can pretend that it's real memory when a program forks.

Oh, by the way, they are still betting that it's never going to be needed, because if it actually was needed, the system would be unusable. I'd rather have a system fail, even if it triggers the OOM killer (which does log what it's doing, so my central log system can detect failures, including the failure of the log forwarder), rather than slow to a crawl but still appear to be working.

Allowing small allocations to fail

Posted Mar 13, 2015 7:03 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

...and so we come back to fork() being the wrong tool for the job 90% of the time (since most fork() is just a precursor to exec()) and how userspace should use posix_spawn() instead where possible, except that sometimes you need to do extra manipulations in the child process before exec(), but even then a whole copy of the parent's address space is not usually needed...
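For readers who have not compared the two approaches, the following sketch runs the same child program both ways. It assumes a POSIX system and Python 3.8 or later (for `os.posix_spawn`); the helper names `run_with_fork`, `run_with_spawn`, and `_exit_code` are made up for this illustration.

```python
import os


def _exit_code(status):
    """Decode a waitpid() status into an exit code (negative for signals)."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def run_with_fork(argv):
    """Classic fork() followed immediately by exec()."""
    pid = os.fork()
    if pid == 0:
        # Child: this is where "extra manipulations" would go before exec.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec itself failed
    _, status = os.waitpid(pid, 0)
    return _exit_code(status)


def run_with_spawn(argv):
    """Same job with posix_spawn(): no user-visible copy of the parent."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return _exit_code(status)
```

The `fork()` version gives the child a place to run arbitrary code before `exec()`, which is the flexibility mentioned above. The `posix_spawn()` version only allows the adjustments its interface offers, such as file actions and attributes.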

Allowing small allocations to fail

Posted Mar 17, 2015 18:54 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

posix_spawn() is insanely complex, hard to use, very *rarely* used and has as a result had serious bugs in the past. It's best avoided unless you expect your program to be useful on a box without an MMU.

Allowing small allocations to fail

Posted Mar 18, 2015 9:58 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> It's best avoided unless you expect your program to be useful on a box without an MMU.

Isn't it also useful if you expect your program to be ported to operating systems without fork()/exec() (Windows) or operating systems where the GUI libraries don't like fork() (from what I've heard, this is the case on Mac)?

Allowing small allocations to fail

Posted Mar 13, 2015 1:02 UTC (Fri) by neilbrown (subscriber, #359) [Link] (11 responses)

> In other words is the extra complexity of allowing swap files on a filesystem, rather than raw swap partitions, still needed?

There is very little extra complexity here.
When you enable swap on a (non-NFS) filesystem, the kernel uses 'bmap' to find where all the blocks in the file are, ignores any fragments that aren't nicely page-sized, and then swaps to the block device using that list of addresses.
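As a rough illustration of that procedure, the sketch below turns a per-block map (the kind of answer 'bmap' gives) into a list of page-sized extents on the device. It is a simplified user-space model, not the kernel's code: the function name `swap_extents`, the list-based block map, the default sizes, and the assumption that the file has no holes are all choices made for this example.

```python
def swap_extents(block_map, block_size=1024, page_size=4096):
    """Build (first_device_page, page_count) extents from a block map.

    block_map[i] is the device block number backing file block i.
    Only runs that are contiguous on the device and aligned to a page
    boundary are usable; anything else is skipped.
    """
    blocks_per_page = page_size // block_size
    extents = []
    probe = 0
    while probe + blocks_per_page <= len(block_map):
        first = block_map[probe]
        contiguous = all(
            block_map[probe + i] == first + i for i in range(blocks_per_page)
        )
        if not contiguous or first % blocks_per_page != 0:
            probe += 1  # not a nicely page-sized fragment: try the next block
            continue
        dev_page = first // blocks_per_page
        if extents and extents[-1][0] + extents[-1][1] == dev_page:
            # Extends the previous extent on the device.
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            extents.append((dev_page, 1))
        probe += blocks_per_page
    return extents
```

Once this list exists, swapping needs no further help from the filesystem: each swap slot maps straight to a device address.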

This is one reason that BTRFS doesn't support swap files. Files don't have a fixed address on just one block device.

The complexity in the example you cite doesn't come from the fact that a 'file' is used. LVM and RAID are pretty safe too - there is complexity there, but they handle it just fine because they have to.

I really don't know about iSCSI. To my mind it is by far the most complex part of the stack. But this may simply be because I haven't looked at the code - maybe it works perfectly already.

Allowing small allocations to fail

Posted Mar 13, 2015 1:29 UTC (Fri) by dlang (guest, #313) [Link] (10 responses)

It's not that the layers are bad, just that getting the data out requires involving all the layers, and any of them can end up needing memory to get things done. At the very least, having many layers involved can cause allocations that would otherwise be short-lived to end up being needed longer.

Similar to the way that these different layers can consume stack space to the point that the system gets in trouble (and working to move things off of the stack requires other allocations)

Allowing small allocations to fail

Posted Mar 13, 2015 1:57 UTC (Fri) by neilbrown (subscriber, #359) [Link] (9 responses)

With respect ... I think you are guessing. You are identifying things that you think could go wrong. Not things that actually go wrong.

Multiple layers of block devices do not use extra stack. When one layer sends a request to the next layer, the request is queued at a higher stack level, and not processed until the first block device's code has vacated the stack.
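The queueing pattern described here can be sketched in a few lines. This is a toy model only: the `Submitter` and `Layer` classes are hypothetical and merely imitate the idea that a nested submission is appended to a list owned by the outermost caller instead of being processed recursively.

```python
from collections import deque


class Submitter:
    """Processes requests iteratively, however many layers are stacked."""

    def __init__(self):
        self.pending = None  # None means no submission is in progress
        self.max_depth = 0   # deepest nesting of handle() calls observed
        self._depth = 0

    def submit(self, device, request):
        if self.pending is not None:
            # Called from inside a layer's handler: queue and return at once,
            # so the calling layer leaves the stack before the next one runs.
            self.pending.append((device, request))
            return
        self.pending = deque([(device, request)])
        try:
            while self.pending:
                dev, req = self.pending.popleft()
                self._depth += 1
                self.max_depth = max(self.max_depth, self._depth)
                dev.handle(self, req)
                self._depth -= 1
        finally:
            self.pending = None


class Layer:
    """A stacked device that either passes requests down or completes them."""

    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower
        self.completed = []

    def handle(self, submitter, request):
        if self.lower is None:
            self.completed.append(request)  # bottom device does the I/O
        else:
            submitter.submit(self.lower, request + [self.name])
```

Stacking loop over md over dm over SATA with these classes leaves `max_depth` at 1: only one layer's handler is ever on the stack at a time.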

"more layers needs more memory" isn't really a good characterization of the sort of problems we can run into. There really is plenty of memory, just like there are plenty of chopsticks for the dining philosophers. A bit of communication and sensible sharing is all you need.

Somewhat tangentially.... I'm also somewhat perplexed by the various mentions of returning ENOMEM errors to user-space. In my mind that is a TOTALLY different conversation from talk about reserving memory to ensure progress when writing out dirty data.

-ENOMEM really isn't something that user-space should ever see for the vast majority of system calls. When code handling a system call needs a modest-sized allocation, waiting indefinitely is the right thing to do (maybe after dropping some locks).

Conversely when handling write-out, it is only appropriate to wait if you *know* that memory will be released that you *will* get access to.

Allowing small allocations to fail

Posted Mar 13, 2015 3:16 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

I use stack space as an example after seeing the repeated "XFS + <layers> runs out of stack space" discussions on linux-kernel. I can't talk in detail about the issue, but I know from watching this that the different layers are not as isolated in their effects as you make it sound.

As for the layers, yes, I am saying that these layers can cause things to go wrong, not that they will cause problems. Raid, encryption, compression, snapshots can all require reading in data from disk in order to write data out to disk. Filesystem operations can require read-modify-write cycles that can require memory for the read, iscsi invokes the entire networking stack and needs memory to encapsulate the I/O data, etc.

When the kernel picks a hunk of memory to output to disk (either swap or pending writes), it has no way of knowing what is going to be involved to write this data out. It may be that all of these layers have sufficient memory pre-allocated that they never, ever need to allocate more during the running of the system, but I really have my doubts. I know that at least some of them have enough reserved memory that they can limp along to complete a single request if they can't get a normal allocation, but with the 'too small to fail' logic having been in place, how many of these emergency codepaths have really been tested? And are the allocations ending up in the 'blocking, waiting to succeed' mode when the programmer has actually coded a good failure mode and way to make at least some progress even without the allocation?

The stack I listed above was an off-the-cuff 'bad case', but as people have been challenging it, I've been thinking and don't think it's anywhere near the real worst case.

I can easily see someone having
filesystem
raid
lvm
snapshot
encryption
fuse
virtualization
network (with connection tracking and encryption on the network)

with the possibility that some of these layers may be repeated on the hypervisor level (which shouldn't contribute to memory issues in the guest, but guests could contribute to issues on the host)

I'm probably still not getting the real worst-case situation (it would be interesting to see not just speculation, but real-world information, I'll bet that real-world examples will make the speculation look good)

Allowing small allocations to fail

Posted Mar 14, 2015 6:05 UTC (Sat) by neilbrown (subscriber, #359) [Link]

> I know that at least some of them have enough reserved memory that they can limp along to complete a single request if they can't get a normal allocation, but with the 'too small to fail' logic having been in place, how many of these emergency codepaths have really been tested?

Lots of them.

"too small to fail" doesn't apply to all kernel allocations. It does apply to those with the GFP_KERNEL flags set, but not, for example, those with GFP_ATOMIC.

Any code that has been written with a clear emergency fallback almost certainly uses an allocation style that can fail - if it didn't there is a very good chance that it would deadlock. mempools, for example, always set __GFP_NORETRY, so normal allocation *will* fail if no memory is easily available, and the fall-back to the pre-allocated pools is often used.
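The fallback behaviour described above can be modelled in user space as follows. This is an assumption-laden sketch, not the kernel's mempool implementation: the `Mempool` class, its `alloc_fn` callback, and the convention of returning `None` where the real code would sleep are all invented for the example.

```python
class Mempool:
    """Toy mempool: try the normal allocator, then a pre-allocated reserve."""

    def __init__(self, min_nr, alloc_fn):
        self.min_nr = min_nr
        self.alloc_fn = alloc_fn  # returns an object, or None on failure
        # Filled at creation time, when memory is assumed to be plentiful.
        self.reserve = [alloc_fn() for _ in range(min_nr)]

    def alloc(self):
        obj = self.alloc_fn()  # a fail-fast attempt, no retrying
        if obj is not None:
            return obj
        if self.reserve:
            return self.reserve.pop()  # emergency path
        return None  # a real pool would sleep here until free() is called

    def free(self, obj):
        if len(self.reserve) < self.min_nr:
            self.reserve.append(obj)  # refill the reserve first
        # otherwise the object would go back to the normal allocator
```

The point of the model is that the emergency path is exercised every time the normal attempt fails, which is why such fallbacks get tested in practice.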

GFP_NOFS allocations are probably the most problematic. I think they can be treated as "too small to fail", but there can be lots of dirty memory that they cannot touch.

Of all the things in your stack, I think filesystems, fuse, and networking are the most likely to have interesting issues: filesystems because they are complex, fuse because it can depend on userspace behaving correctly, and networking because it is highly optimized for speed and tries to avoid special cases on the fast-paths.
All these can be made to work well, but the problems they might face are really quite independent of what other things in the stack might be doing.

Allowing small allocations to fail

Posted Mar 13, 2015 7:08 UTC (Fri) by epa (subscriber, #39769) [Link] (6 responses)

Waiting and retrying for an allocation failure is fair enough, but waiting indefinitely? Surely it's better to inform userspace of the failure so a well-written program can take some appropriate action, rather than just hanging the whole userspace process or thread.

Allowing small allocations to fail

Posted Mar 13, 2015 7:35 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

so what do you do when you can't notify userspace? For example, a memory allocation failure when you are trying to write data to disk and the program that created the data has already exited.

Allowing small allocations to fail

Posted Mar 13, 2015 7:35 UTC (Fri) by dlang (guest, #313) [Link]

remember that these 'too small to fail' allocations are allocations made by the kernel, not ones made by userspace via malloc or similar.

Allowing small allocations to fail

Posted Mar 13, 2015 16:08 UTC (Fri) by epa (subscriber, #39769) [Link]

The general principle is to notify the caller. If the caller was a userspace program that made a system call, and the system call can't complete because there isn't enough memory, you return a failure status such as ENOMEM. If the caller was a kernel routine, again you return the failure status. In neither case is hanging indefinitely really a sensible thing to do - although certainly a limited amount of waiting and retrying can be a good idea.

Allowing small allocations to fail

Posted Mar 14, 2015 5:44 UTC (Sat) by neilbrown (subscriber, #359) [Link] (2 responses)

> Waiting and retrying for an allocation failure is fair enough, but waiting indefinitely?

To know whether or not it is reasonable to wait indefinitely, you need to have a clear idea of what you are waiting for. You are obviously waiting for something to be freed.

We can classify users of kernel memory in various ways, but one would be:

- permanent allocations which will never be freed, normally made at boot time. These don't affect how you wait.
- caches. If the system is working, these will eventually shrink to a reasonable size. You need to wait for that.
- stack/heap memory of processes (i.e. "anonymous memory"). These can be released by the OOM killer as a last resort, which might take a while to decide the time has come.
- other kernel objects that are refcounted and freed when not in use. There are limits on all of these (such as the limit on number of open filedescriptors), so the total memory used by these should not get out of control. We never wait for these.

The only two of these that can really exhaust memory are the caches and the anonymous memory. On a properly managed system, the total of anonymous pages will not approach the total of swap+physical memory, so you just need to wait for the caches to flush. That is an indefinite wait, but not an infinite wait.
On a misbehaving system, you might need to wait for the OOM killer to do its job - which is a hard job and may not be quick.

So yes: I think "indefinite waits" are entirely appropriate for code that is not cleaning out a cache or part of the OOM killer. For code that is, carefully prioritized memory reservations are needed.

Allowing small allocations to fail

Posted Mar 14, 2015 7:36 UTC (Sat) by dlang (guest, #313) [Link] (1 responses)

one other category of allocations is memory used by programs that may finish running and exit, freeing memory

Allowing small allocations to fail

Posted Mar 17, 2015 18:59 UTC (Tue) by nix (subscriber, #2304) [Link]

Neil covered those: they're either anonymous memory or memory in the cache.

Allowing small allocations to fail

Posted Mar 13, 2015 1:25 UTC (Fri) by dlang (guest, #313) [Link]

by the way, something I managed to forget in this discussion.

Clearing memory does not mean writing to swap most of the time. The most effective way to clear memory is to find pending writes to files and flush that data to the disk. This requires going through all the layers in the I/O stack, and there is no cheat along the lines of "just use a local drive"

Straw man of user space unable to cope with errors

Posted Mar 11, 2015 9:34 UTC (Wed) by iq-0 (subscriber, #36655) [Link] (7 responses)

The kernel can change behaviour without breaking its ABI compatibility. We trained hundreds of programmers not to care about data resilience in ext3 by having bad fsync performance and 99% reliable storage through implicit savepoints every 30s.
Now the world is better suited to more advanced filesystems (ext4, xfs, btrfs, ..) that in turn radically improved disk interactions at the expense of implicit data safety, so programs must use f(data)sync for reliable storage. Data was lost because programs followed the old logic, but programs were fixed. Corner cases were worked around (sync on rename). But most people never noticed a thing about these changes as stuff got fixed that mattered (and 99% of the stuff simply doesn't matter).

This change is similar. Most people don't run their systems in OOM scenarios. Most programs will never get an ENOMEM error. If the system was really running out of memory, people don't respond worse to "my program crashed" than to "my system hangs" (and I had to reset it) or "my program was OOM killed".
And most syscalls will never return ENOMEM, and the few that do (and are most likely to do so) can get their corner case weakened by performing N retries of the syscall (or the lower-level operation, depending on what is safer, easier, or better for that case).

But the 'user space coping with errors' is just a straw man. User space copes with what it encounters. If we never return ENOMEM, most programs won't cope with that. Start returning ENOMEM and the few programs/frameworks/libraries/virtual machines that matter will cope with them, just like they cope with the occasional EINTR return from a syscall. And the more they're exposed to it, the better they'll cope with it.

And the most critical applications (like databases, webservers and the like) are cross platform. So they're pretty likely to have proper handling of ENOMEM because I don't think many other systems made the decision to handle scarce resource allocation in the most difficult place to cope with such problems.

Straw man of user space unable to cope with errors

Posted Mar 11, 2015 9:57 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (5 responses)

> But the 'user space coping with errors' is just a straw man. User space copes with what it encounters. If we never return ENOMEM, most programs won't cope with that. Start returning ENOMEM and the few programs/frameworks/libraries/virtual machines that matter will cope with them, just like they cope with the occasional EINTR return from a syscall. And the more they're exposed to it, the better they'll cope with it.

I don't find that convincing. With EINTR you know that retrying makes sense, and it's easy enough to get them during normal testing. Note that EINTR was already annoying enough to many people that SA_RESTART was introduced.

This would essentially mean that you need to wrap every syscall in a retry loop. But you need to have smarts about how these retry loops look, because ENOMEM *does* have other meanings than 'kernel internal memory allocation failed'. E.g. shmget() will return ENOMEM on some platforms if you exceed configured limits.
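To show what such a wrapper might look like, here is a small sketch of a bounded retry on ENOMEM. The function name `call_with_enomem_retry`, the retry count, and the exponential backoff are arbitrary choices for illustration, and the wrapper has exactly the weakness described above: it cannot tell a transient kernel allocation failure from an ENOMEM that means a configured limit was exceeded.

```python
import errno
import time


def call_with_enomem_retry(fn, *args, retries=3, delay=0.01, sleep=time.sleep):
    """Call fn(*args), retrying a bounded number of times on ENOMEM only.

    Any other OSError, or ENOMEM on the final attempt, is re-raised so the
    caller can fail the request or exit.
    """
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except OSError as exc:
            if exc.errno != errno.ENOMEM or attempt == retries:
                raise
            sleep(delay * (2 ** attempt))  # back off before trying again
```

Every call site that wants this behaviour has to opt in and pick its own limits, which is part of why simply exiting is the more likely outcome.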

So realistically what will happen is that userspace programs will just exit with an error.

> And the most critical applications (like databases, webservers and the like) are cross platform. So they're pretty likely to have proper handling of ENOMEM because I don't think many other systems made the decision to handle scarce resource allocation in the most difficult place to cope with such problems

I don't think that's the case for many syscalls.

Straw man of user space unable to cope with errors

Posted Mar 11, 2015 10:56 UTC (Wed) by tao (subscriber, #17563) [Link] (3 responses)

I've long been a proponent for SIGDANGER, as available in AIX. This doesn't solve the ENOMEM risk, to be sure, but it makes it more likely that the apps getting killed are those that do not handle memory properly, and also makes it less likely that the system runs out of memory in the first place (since any low memory conditions will trigger cache flushes, etc.)

Straw man of user space unable to cope with errors

Posted Mar 12, 2015 10:01 UTC (Thu) by iq-0 (subscriber, #36655) [Link] (2 responses)

Please no signal. Handling ENOMEM in multithreaded daemons is often a simple case of aborting or failing a single request (and thus also freeing up resources in the daemon itself, often releasing things like filedescriptors and the like as well). A signal would not confer any specificity about what caused it.

Sure, normal programs should be allowed to be killed, but you'd pretty soon end up with a situation where all daemons block this flag (since they can't really handle it gracefully). And on servers, that is effectively everything running (that really takes up resources).

The only benefit would be if such a signal could be used to release all internal buffers, but that is often a problem that needs temporary resources and often needs to wait for other threads/processes to enter a quiescent state (and that possibly keep claiming resources until then). So that would only work before you have a problem.

Straw man of user space unable to cope with errors

Posted Mar 12, 2015 10:32 UTC (Thu) by tao (subscriber, #17563) [Link] (1 responses)

SIGDANGER isn't intended or used to signal that the system is *out of memory*. SIGDANGER is used to convey to processes (that choose to register a handler for this) that memory is dangerously low.

Imagine how awesome it'd be to have memory hogging browsers (for instance) empty up cached memory? At the same time the signal could also be used by the UI to let the user know that something is a bit wrong (obviously such a feature should be used with a bit of caution -- getting to know that the system is low on memory at the same time Chrome frees up 4GB of cache would be rather pointless).

Straw man of user space unable to cope with errors

Posted Mar 12, 2015 19:42 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

See the memory pressure cgroup mechanism.

Straw man of user space unable to cope with errors

Posted Mar 12, 2015 9:53 UTC (Thu) by iq-0 (subscriber, #36655) [Link]

> This would essentially mean that you need to wrap every syscall in a retry loop. But you need to have smarts about how these retry loops look, because ENOMEM *does* have other meanings than 'kernel internal memory allocation failed'. E.g. shmget() will return ENOMEM on some platforms if you exceed configured limits.

> I don't think that's the case for many syscalls.

Many syscalls don't necessarily require memory allocation. And I'm not suggesting each and every memory allocation should be returned to user-space. Retries in the memory allocating code or at the syscall level inside the kernel (and outside of critical sections so that we can back-off a little and possibly be killable when it's in an infinite retry loop) is a good solution for most syscalls that users won't normally expect ENOMEM from.

But certain interactions with the kernel, mostly related to I/O (where most programs already have to deal with corner cases), should allow for ENOMEM. Userspace can then take appropriate action: retry is a valid option, but deferring or aborting equally so.

> So realistically what will happen is that userspace programs will just exit with an error.

Yup. For programs that don't care about handling this situation (it's not like being killed, they do it themselves). Otherwise we'd get hanging systems, OOM-killed processes and general nastiness.

Straw man of user space unable to cope with errors

Posted Mar 11, 2015 11:05 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

It seems to me that from the point of view of syscall error values being handed to a userspace program, ENOMEM is only marginally milder than ENOSPC, EPIPE, and EIO in terms of "if you get these, you might as well give up", and that temporary resource shortages should be indicated by EAGAIN.

Allowing small allocations to fail

Posted Mar 11, 2015 10:24 UTC (Wed) by andresfreund (subscriber, #69562) [Link] (6 responses)

> Peter Zijlstra observed that ENOMEM is a valid return from a system call.

FWIW, that's not generally true. E.g. close() is not documented to return ENOMEM and afaics could end up returning ENOMEM under the new regimen (close => __close_fd => filp_close => f_op->flush).

Allowing small allocations to fail

Posted Mar 11, 2015 12:22 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (4 responses)

Close returning failure already has confusing semantics. Is the FD closed? So far, Linux won't leave it open while HP-UX will. There was an article on it here not too long ago and I think a bug was filed for POSIX to specify it as "must be closed".

Allowing small allocations to fail

Posted Mar 11, 2015 12:37 UTC (Wed) by andresfreund (subscriber, #69562) [Link]

close() was just the first example that came to mind. There's many more. E.g. fsync(), which certainly looks like it could return ENOMEM after the change given the variety of things done on its behalf in various FSs.

Allowing small allocations to fail

Posted Mar 11, 2015 18:24 UTC (Wed) by gutschke (subscriber, #27910) [Link] (1 responses)

We ran into that problem a while ago. I had gotten into the habit of checking return codes for all system calls, and close() was the one that would occasionally fail for no discernible reason, but still (sometimes?) close the file descriptor.

This makes error handling super difficult -- especially if you need to worry about other threads opening new file handles concurrently. It is then impossible to accurately tell whether the original file descriptor is still open or whether it has been replaced by a completely unrelated one.

I don't recall if we ever managed to pin-point the root cause. It might have only happened with some kernel versions, or with specific file systems (or maybe with non-standard user-space wrappers around the system call). I don't think we ever actually managed to reproduce the failure in house.

If I recall correctly, we ran into particularly nasty errors for a while, as we had wrapped close() into TEMP_FAILURE_RETRY(), which apparently is just not a safe thing to do.

My conclusion was that for close() I should not check error codes, and hope for the best, as I can neither trust the error code nor is there really any way I can handle the error without possibly making things worse. This is particularly frustrating, as the manual page explicitly warns against ignoring error codes...

Allowing small allocations to fail

Posted Mar 12, 2015 23:39 UTC (Thu) by peter-b (guest, #66996) [Link]

It's not possible to handle close(2) errors on Linux. The file descriptor is deallocated *before* any condition that could cause failure can be hit -- so no matter what happens, the file descriptor is invalid when close(2) returns and you can't do anything else with it.
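Given that behaviour, a reasonable pattern on Linux is to call close() exactly once, record any error for logging, and never retry. The helper below is a sketch of that pattern; `close_once` is a hypothetical name, and the policy of returning the errno instead of raising is just one possible choice.

```python
import os


def close_once(fd):
    """Close fd a single time and return 0 or the errno that was reported.

    The descriptor is treated as gone whatever the result, so the caller
    must not retry: another thread may already have reused the number.
    """
    try:
        os.close(fd)
    except OSError as exc:
        return exc.errno
    return 0
```

A caller can log a non-zero result, but it should drop its copy of the descriptor number either way.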

Allowing small allocations to fail

Posted Mar 11, 2015 21:25 UTC (Wed) by wahern (subscriber, #37304) [Link]

http://austingroupbugs.net/view.php?id=529

The resolution is that close will never fail. HP-UX was the odd man out, and it seems that they'll just have to suck it up, although I doubt HP-UX is even compliant to the current POSIX revision, let alone the upcoming one.

Allowing small allocations to fail

Posted Mar 14, 2015 12:48 UTC (Sat) by epa (subscriber, #39769) [Link]

If close() is not documented to return ENOMEM, then the kernel needs to make sure that out-of-memory conditions cannot occur on close(), by reserving enough memory in advance (perhaps on open()). Having it hang forever instead is not really an answer, since close() is not documented to hang either.

Too simple to fail

Posted Mar 11, 2015 10:53 UTC (Wed) by dgm (subscriber, #49227) [Link] (3 responses)

"Everything should be made as simple as possible, but not simpler".

It's a famous Einstein quote that applies so nicely to the case at hand.

Essentially, here we have two problems. One is handling failures at the wrong level, and the other is pretending that you have an infinite amount of something. Both are introduced in the name of simplicity, of course.

It is understandable that people want to simplify their code by ignoring error paths. They are ugly, make otherwise beautiful looking algorithms look ugly too, and worst of all, are difficult to test (and get right).

But errors should be handled by the level that has enough information about what is going on, and the memory allocator does not have enough information to decide by itself what is the correct recovery strategy.

Additionally, endless retrying of anything is not a correct error handling strategy (unless there's absolutely nothing else you can do, including returning an error). This is essentially pretending that you have infinite time, or infinite memory, or both. Any system built on such premise is essentially wrong.

With all that in mind, the only approach that makes sense in my mind is introducing new allocation primitives that do the right thing (fail if no memory is available) and urge everybody to migrate to them, and forbid new use of the old ones. Finally, put a hard line for removal of the faulty allocation primitives.

I think new flags will not cut it because they are optional. Also, a global knob (as suggested by Ewen) would give you a false sense of security. You cannot be certain that all wrong uses have been removed by just observing the system run: you need to read the code and see if there's an error handling path.

Too simple to fail

Posted Mar 11, 2015 15:24 UTC (Wed) by lgeorget (guest, #99972) [Link] (2 responses)

> With all that in mind, the only approach that makes sense in my mind is introducing new allocation primitives that do the right thing (fail if no memory is available) and urge everybody to migrate to them, and forbid new use of the old ones. Finally, put a hard line for removal of the faulty allocation primitives.

First rule of kernel development is not breaking legacy user code. Any new error that would come out of the kernel that did not before is considered "breaking user space". For example, as they say in the article, returning ENOMEM in some system calls that would always succeed (perhaps through triggering the OOM as a side effect) before. New allocation primitives won't help you against that.

Too simple to fail

Posted Mar 12, 2015 0:59 UTC (Thu) by ncm (guest, #165) [Link]

Recall that this is a kernel problem. It is not user-space allocations that never fail.

A new allocation primitive for in-kernel use would be a cleaner alternative to calling the old one with the "no-retry" flag, and would be more greppable, but that's the only difference.

Too simple to fail

Posted Mar 12, 2015 15:20 UTC (Thu) by tdz (subscriber, #58733) [Link]

Any user-space program with that attitude will break sooner or later. The ERRORS sections in the man pages should not be seen as exclusive lists of possible errors. Rather the errno value coming out of the kernel or libc should be seen as a hint on what went wrong.

Allowing small allocations to fail

Posted Mar 12, 2015 6:42 UTC (Thu) by malor (guest, #2973) [Link] (10 responses)

> We should, he said, take a step back and ask how we got into the OOM situation in the first place.

The thought occurs that the underlying problem is really this: the kernel lies to userspace about memory allocation. It makes enormous promises that it can't possibly keep, and then we get into this enormous snarl when programs expect it to deliver what it said it could. Most of the time, programs don't need what they're asking for, so lying to them is okay, but sometimes they're right about what they need, and things go to hell.

I realize that this deception is deeply entwined into the Linux kernel, and into userspace expectations, but it seems to me that rethinking this now, relatively near the beginning of 64-bit computing, is probably the right time. If there will ever be a time when the kernel can start telling the truth about memory, it would be now.

Allowing small allocations to fail

Posted Mar 12, 2015 7:51 UTC (Thu) by dlang (guest, #313) [Link] (3 responses)

except this "too small to fail" has nothing to do with userspace programs, it's all about kernel allocations.

And overcommit is not as firmly baked into Linux as you are saying. It's a simple sysctl setting to completely turn off overcommit, leading to the predictable failures (both in allocations that don't need to be failed, like when you are executing a small program from a massive one, and in application errors because they don't test the failure modes or the recovery code)
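For reference, the setting in question is `vm.overcommit_memory`, exposed on Linux as `/proc/sys/vm/overcommit_memory`. The sketch below only reads and describes the current mode; the helper names are made up for this example, and changing the value (for instance with `sysctl vm.overcommit_memory=2`) requires root.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (the default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit",
}


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current overcommit mode as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def describe_overcommit(mode):
    """Translate a mode number into a short description."""
    return OVERCOMMIT_MODES.get(mode, "unknown mode")
```

On a Linux machine, `describe_overcommit(read_overcommit_mode())` reports which policy is active.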

Allowing small allocations to fail

Posted Mar 13, 2015 1:52 UTC (Fri) by malor (guest, #2973) [Link] (1 responses)

>except this "too small to fail" has nothing to do with userspace programs, it's all about kernel allocations.

Sigh. The only time a kernel memory commit can fail is if too much memory has already been granted to userspace. (or you're on a very, very tiny system to begin with.)

Allowing small allocations to fail

Posted Mar 13, 2015 2:59 UTC (Fri) by dlang (guest, #313) [Link]

> The only time a kernel memory commit can fail is if too much memory has already been granted to userspace.

Only if you include memory used for buffering writes to disk as "granted to userspace". And that memory is not something that overcommit or copy-on-write will affect in any way.

Remember that failures like this are pretty rare; if a system has even a 'little' swap (say a couple GB), it will almost certainly slow to an unusable dead crawl from the disk I/O long before the OOM killer kicks in.

Allowing small allocations to fail

Posted Mar 13, 2015 1:57 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

> except this "too small to fail" has nothing to do with userspace programs, it's all about kernel allocations.

Meh. They currently can often end up as the return value of a system call if the memory allocation failure happened during the execution of a syscall. Such errors happen to be dealt with by returning the error code up the stack...

Allowing small allocations to fail

Posted Mar 12, 2015 12:15 UTC (Thu) by MrWim (subscriber, #47432) [Link] (5 responses)

> relatively near the beginning of 64-bit computing

Time flies when you're having fun :).

According to Wikipedia, Intel's first 32-bit processors were released in April 1989 (26 years ago). The first AMD64 Opteron was released in April 2003 (12 years ago).

Allowing small allocations to fail

Posted Mar 12, 2015 15:06 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (1 responses)

And first 64 bit port of Linux (to Alpha) was released in 1995 – 20 years ago. According to Wikipedia, first 64 bit UNIX was released 10 years earlier: it was Cray UNICOS in 1985.

Allowing small allocations to fail

Posted Mar 12, 2015 22:54 UTC (Thu) by malor (guest, #2973) [Link]

That's why I said 'relatively'; I knew it's been a while in the Linux space, but 64-bit-by-default really only got going in the last few years, because for the first time, we routinely have more than 4 gigs. In fact, this last year or two is probably the first time that 32-bit mode isn't really suitable for even the cheapest desktops.

But that fundamentally means that we're just now reaching the full 32nd bit of memory addressing on cheap machines, where expensive desktops can get to 35 (32 gigs), and high-end workstations are at maybe 38. At 2 years per bit, best case, it'll be a long time before we hit 64.... and assuming that Moore's Law will hold that long is probably ridiculously optimistic.

So, I stand behind the observation: in relative terms, it's very early. 64-bit computing is past its infancy, and might be ready to head off for kindergarten. It's got a long, LONG life ahead of it.
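The back-of-the-envelope arithmetic in this comment can be reproduced with a short sketch. The function names are hypothetical, and the "2 years per bit" rate is the comment's own best-case assumption, not a measured figure.

```python
GIB = 2**30


def address_bits(size_bytes: int) -> int:
    """Smallest number of address bits that can cover size_bytes bytes."""
    return (size_bytes - 1).bit_length()


def years_until(bits_now: int, target_bits: int = 64, years_per_bit: int = 2) -> int:
    """Years to reach target_bits if memory doubles every years_per_bit years."""
    return (target_bits - bits_now) * years_per_bit


if __name__ == "__main__":
    for label, size in (("4 GiB", 4 * GIB), ("32 GiB", 32 * GIB), ("256 GiB", 256 * GIB)):
        bits = address_bits(size)
        print(f"{label}: {bits} bits, {years_until(bits)} years to 64 bits")
```

Under those assumptions, 4 GiB needs 32 address bits and 32 GiB needs 35, matching the figures in the comment; 38 bits corresponds to 256 GiB. At two years per bit, a 38-bit machine would still be 52 years away from filling a 64-bit address space.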

Allowing small allocations to fail

Posted Mar 21, 2015 19:47 UTC (Sat) by jzbiciak (guest, #5246) [Link] (2 responses)

Intel's first 32-bit processor was in 1989? Huh? The 80386 was introduced in 1985, and the iAPX432 was introduced in 1981.

Ok, the iAPX432 wasn't commercially successful, but the 386 was.

Allowing small allocations to fail

Posted Mar 21, 2015 20:37 UTC (Sat) by viro (subscriber, #7872) [Link] (1 responses)

I suspect that he has somewhat mangled recollections of Intel marketing around i860 - it was released in '89 and they used to call it 64bit (64bit registers in FPU and 64bit buses, IIRC).

Allowing small allocations to fail

Posted Mar 21, 2015 20:48 UTC (Sat) by jzbiciak (guest, #5246) [Link]

Or misremembered the 486 as the first 32-bit x86, as it came out in April 1989.

The i860 looks like it was a neat processor. I never had the opportunity to play with one.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds