
Memory management when failure is not an option


By Jonathan Corbet
March 4, 2015
Last December, a discussion of system stalls related to low-memory situations led to the revelation that small memory allocations never fail in the kernel. Since then, the discussion on how to best handle low-memory situations has continued, focusing in particular on situations where the kernel cannot afford to let a memory allocation fail. That discussion has exposed some significant differences of opinion on how memory allocation should work in the kernel.

Some introductory concepts

The kernel's memory-management subsystem is charged with ensuring that memory is available when it is needed by either the kernel or a user-space process. That job is easy when a lot of memory is free, but it gets harder once memory fills up — as it inevitably does. When memory gets tight and somebody is requesting more, the kernel has a couple of options: (1) free some memory currently in use elsewhere, or (2) deny (fail) the allocation request.

The process of freeing (or "reclaiming") memory may involve writing the current contents of that memory to persistent storage. That, in turn, involves calling into the filesystem or block I/O code. But if any of those subsystems are, in fact, the source of the allocation request, calling back into them can lead to deadlocks and other unfortunate situations. For that reason (among others), allocation requests carry a set of flags describing the actions that can be performed in the handling of the request. The two flags of interest in this article are GFP_NOFS (calls back into filesystems are not allowed), and GFP_NOIO (no type of I/O can be started). The former inhibits attempts to write dirty pages back to files on disk; the latter can block activity like writing pages to swap.

Obviously, the more constrained the memory-management subsystem is, the higher the chances of it being unable to satisfy an allocation request at all. Kernel developers have long been told that (almost) any allocation request can fail; as a result, the kernel is full of error-handling paths meant to deal with that eventuality. But it became clear recently that the memory-management code does not actually allow smaller requests to fail; it will, instead, loop indefinitely trying to free some memory. That behavior has been seen to lead occasionally to locked-up systems, despite the fact that the code involved is prepared to deal with allocation failures. The "too small to fail" behavior is controversial, but would prove hard to change at this point.

There are, however, places in the kernel that are simply unprepared to deal with allocation failures, usually because the allocation happens deep within a complex series of operations that would be difficult to unwind. The __GFP_NOFAIL flag exists to explicitly state that failure is not an option for a given request, though its use is heavily discouraged.

The following discussion, in the end, focuses on two related questions: (1) should the kernel really be treating small allocations as if they all had __GFP_NOFAIL set, and (2) should failure-proof allocations be supported at all, and, if so, how can that support be made more robust?

No longer too small to fail

The discussion (re)started when Tetsuo Handa noted that memory allocation behavior had changed in the 3.19 kernel; in particular, small allocations with the GFP_NOIO or GFP_NOFS flags would fail under severe memory pressure. In previous kernels, such allocations would loop indefinitely if no memory was available. Among other things, this change can cause filesystem operations to fail on memory-stressed systems where they would have (eventually) succeeded before.

The behavior change is the result of this patch from Johannes Weiner which was aimed at avoiding the memory-allocation deadlocks that started the December discussion. The intent was to avoid looping forever in an allocation attempt if it appeared that no progress was being made toward freeing some memory for that allocation, but, by accident, it also prevented looping entirely in the GFP_NOIO and GFP_NOFS cases. So those allocations can now fail; that is a significant change from how previous kernels worked.

Johannes initially wanted to keep the new behavior, saying that it "makes more sense." But the filesystem developers disagreed strongly. It seems that there are numerous places in the filesystem code that depend on allocations succeeding reliably, and that many of them are not marked with __GFP_NOFAIL. Ted Ts'o threatened to add a lot of __GFP_NOFAIL flags to allocation calls in the ext4 filesystem if the change were not reverted. The memory-management developers were thus faced with the need to pick the option they disliked least.

In the end, the filesystem developers won out on this one; Johannes merged a change into 4.0-rc2 restoring the looping behavior for those allocation types. This change is likely to end up in the 3.19 stable series as well. The original patch is a good argument for the approach of refusing "cleanup" patches late in the development cycle. It was merged for the 3.19-rc7 prepatch, meaning that there was almost no time for problems to be noticed before the final 3.19 release came out.

The discussion was not limited to the unexpected effects of one late-arriving memory-management patch, though. The bigger problems of how to avoid deadlocks in low-memory situations and how to ensure that important tasks can proceed in those situations remain unsolved.

The OOM killer

The out-of-memory (OOM) killer is implicated in a number of stall scenarios. In the original problem reported last December, the OOM killer would choose a victim that was blocked on a lock, but that lock was held by the process waiting (forever) for a memory allocation to proceed. As a result, the victim could not exit and, thus, could not free its memory. Since the OOM killer only goes after a single process at a time, everything would stop at that point.

Johannes suggested a change to how the OOM killer works: if a targeted process failed to exit after five seconds, the OOM killer would give up and move on to another victim. The idea was not hugely popular, though. David Rientjes pointed out that there was no guarantee that the next victim would be any more appropriate than the one that came before. Dave Chinner claimed more broadly that efforts to tweak the OOM killer are misdirected:

I really don't care about the OOM Killer corner cases - it's completely the wrong line of development to be spending time on and you aren't going to convince me otherwise. The OOM killer is a crutch used to justify having a memory allocation subsystem that can't provide forward progress guarantee mechanisms to callers that need it.

The end result is that OOM-killer timeouts will probably not find their way into the memory-management subsystem anytime soon.

__GFP_NOFAIL and looping

From the point of view of the memory-management developers, many things would get easier if any allocation request could fail when the necessary resources are not available. That would mean getting rid of the implicit "small allocations never fail" rule, but, beyond that, it would also require getting rid of the explicit __GFP_NOFAIL call sites. Michal Hocko was perhaps the most outspoken in this regard, saying that __GFP_NOFAIL "is deprecated and shouldn't be used." He also suggested that existing __GFP_NOFAIL call sites should be reimplemented in a way that allows them to recover from allocation failures.

Dave took issue with that idea, saying that failure-proof allocations are a hard requirement for the XFS filesystem. To rework XFS to be able to roll back dirty transactions in the face of an allocation failure would increase its complexity significantly, he said; the project would take a couple of years to reach a point where it could be put into production use. He summarized by saying "I'm not about to spend a couple of years rewriting XFS just so the VM can get rid of a GFP_NOFAIL user." Strangely enough, there were no other developers volunteering to take on that job either.

Contemporary filesystems are complex beasts that have to meet a wide variety of demands. They incorporate complex transaction mechanisms that help them to maintain filesystem integrity in every situation possible. Implementing such a mechanism in a way that allows it to recover from a memory-allocation failure in the middle of a transaction, after resources have been committed, locks taken, etc., is not a simple task. Filesystem developers on Linux have not taken on that task because, in the end, there has not been a need to. Allocations that cannot be allowed to fail have proved sufficient in almost all situations.

Once one accepts that some sort of failure-proof allocation mechanism is needed, though, the next question is: how should it be done? The __GFP_NOFAIL flag is one solution, but it turns out that quite a bit of code in the kernel does not make use of it. Instead, there are a number of places in the kernel that implement their own retry loops on top of a kmalloc() call without __GFP_NOFAIL. That is something that the memory-management developers don't like; those developers would rather not see __GFP_NOFAIL used at all, but they still prefer its use to retry loops implemented outside of the memory-management subsystem. Consider, for example, this message from Johannes saying that the XFS developers should replace a retry loop with a single __GFP_NOFAIL call.

There are a couple of reasons why such loops exist. One of those is that __GFP_NOFAIL was explicitly deprecated in 2009; the patch (from Andrew Morton) said:

__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either).

After this change went in, it became harder to get code containing __GFP_NOFAIL past reviewers. Whether it is done lamely or not, a hand-coded infinite retry loop is easier to sneak into the kernel than an easily greppable __GFP_NOFAIL use. So that is what developers did.

The memory-management developers dislike must-succeed allocations because they complicate the code and, as has occasionally been seen, create the possibility of deadlocks. If such allocations must be made, they would rather see the looping done in the memory-management code, where behavior can be tweaked and appropriate action taken (starting the OOM killer, for example) if it becomes clear that no progress is being made. In the real world, though, according to both Ted and Dave, looping actually works pretty well. The XFS code has a "canary" that puts out a warning when the looping goes on for too long, but, Dave said:

yet we *rarely* see the canary warnings we emit when we do too many allocation retries, the code has been that way for 13-odd years. Hence, despite your protestations that your way is *better*, we have code that is tried, tested and proven in rugged production environments. That's far more convincing evidence that the *code should not change* than your assertions that it is broken and needs to be fixed.

One might take that as a statement that the XFS developers are currently uninterested in replacing their own loops with __GFP_NOFAIL invocations. But they actually have another reason to maintain a loop outside of the memory-management code: they want to retain control over how the filesystem should respond to low-memory conditions. It is, in their mind, a policy decision that the memory-management code lacks the information to handle. There are currently plans afoot to expose some of that policy to user space, allowing administrators to configure what the filesystem's low-memory response should be.

Reservations

Still, there is no real disagreement over this idea: looping over a failing memory allocation is undesirable and best avoided whenever possible. Thus it may well be that the most useful part of the discussion came when the developers got around to the topic of avoiding allocation failures altogether. There are a few ways of working toward that goal.

One of those is preallocation — allocating all of the needed memory resources before the code gets to a point where it can't back out of a transaction. Preallocation is used in many contexts in the kernel and works well, so it was natural for the memory-management developers to ask whether it can be used in this context. Dave shot that idea down fairly quickly:

However, preallocation is dumb, complex, CPU and memory intensive and will have a *massive* impact on performance. Allocating 10-100 pages to a reserve which we will almost *never use* and then free them again *on every single transaction* is a lot of unnecessary additional fast path overhead. Hence a "preallocate for every context" reserve pool is not a viable solution.

Mempools were also raised as a possibility. They are a form of preallocation that might avoid some of the overhead described above. But they are, according to Dave, poorly suited to the problem at hand. Mempools deal with a single size of object, while a filesystem transaction needs a wide variety of objects; that implies that several mempools would be needed at various levels in the stack. There is also a mismatch between object lifetimes that makes mempools difficult to use across multiple transactions. So mempools do not appear to be an option either.

Dave's suggestion, instead, is to add the concept of "reservations" to the memory-management subsystem. Prior to entering a transaction, the filesystem code would inform the memory-management code that it will need guaranteed access to a certain amount of memory; calculating an approximate memory requirement is, apparently, not that hard. The memory-management code would then ensure that the requisite amount of memory would be available; subsequent allocation requests would dip into the reserve if need be. As long as the estimate for the size of the reserve is sufficient, there should be no problem with failing allocations during the transaction.

Reservations may look a lot like preallocation, but there is a crucial difference. The memory-management code already maintains a "watermark," a level of free memory below which it is unwilling to go unless absolutely necessary. A reservation would simply raise that watermark, making a bit less memory available to the system as a whole. If a reservation would raise the watermark above the amount of memory that is currently free, the request would block until more memory could be reclaimed. In the simplest case, a reservation would be represented as an increased value in a single integer variable.

There seems to be some general support for the addition of a reservation mechanism, but things get less clear once one looks at the details. Andrew Morton suggested a scheme where a process making a reservation would get a number of "tokens"; subsequent allocations done by that process would come from the reserve first. Dave does not like that idea, saying that it fails to account for the fact that many objects allocated during a transaction will be freed (perhaps by others) shortly and, thus, should not come from the reservation. His view of the reservation, instead, is a range of memory that is not touched at all unless there is no alternative; even then, only allocations using the GFP_RESERVE flag would be able to get at that memory. The reservation, in his view, comes into play when the kernel would have otherwise put the OOM killer into action.

Johannes, instead, says that this approach will not work. The problem is that "we simply don't KNOW the exact point when we ran out of reclaimable memory," so the memory-management subsystem cannot easily guarantee the sort of loose reservation that Dave has described. Dave disagreed with that assessment, it almost goes without saying. And that is about where the conversation wound down.

Reservations are a promising idea for a solution to some of the kernel's memory-allocation challenges. But, at this point, it is just an idea; it has neither code nor a design consensus behind it. The discussion has slowed for the moment, but that is almost certain to be a temporary state of affairs. The annual Linux Storage, Filesystem, and Memory Management Summit is less than one week away as of this writing. This subject is on the agenda, and LWN will be there to report on the discussion.



Memory management when failure is not an option

Posted Mar 6, 2015 3:05 UTC (Fri) by reubenhwk (guest, #75803)

I realize this is easier said than done, but why not...

if (!allocate_memory_up_front())
    return -1;
do_stuff();
free();

In what situations is that scheme not possible?

Memory management when failure is not an option

Posted Mar 6, 2015 12:05 UTC (Fri) by vonbrand (guest, #4458)

If it is just one allocation, fine. If it is a string of operations, with allocations (and freeings) dependent on earlier ones...

Preallocation

Posted Mar 6, 2015 14:09 UTC (Fri) by corbet (editor, #1)

That is the preallocation scenario described in the article. In relatively simple situations it can be done, and that pattern appears in a lot of places in the kernel. In more complex situations, though, there is really no way to know how much to preallocate or in what form.

Preallocation

Posted Mar 6, 2015 14:34 UTC (Fri) by cesarb (subscriber, #6266)

> In more complex situations, though, there is really no way to know how much to preallocate or in what form.

You could sort-of say that the kernel already sets aside a few megabytes of preallocation, to be used during interrupts (GFP_ATOMIC).

Memory management when failure is not an option

Posted Mar 6, 2015 21:39 UTC (Fri) by SirCmpwn (subscriber, #99589)

The whole fork/exec model combined with the OOM killer is so ridiculously stupid. I wonder how that ever managed to become the standard in the first place.

Memory management when failure is not an option

Posted Mar 11, 2015 19:20 UTC (Wed) by nix (subscriber, #2304)

Because the alternatives to optimistically hoping that there's enough storage and CoWing as needed, failing only if you actually run out, are also awful.

You can reserve the entire process's dirty memory space on every fork(), as Solaris does, but that makes fork()/exec() require massive amounts of memory that is hardly ever used. As an Emacs user, I used to get really annoyed when I couldn't do a simple C-u M-! blah RET because there were 'only' a few gigabytes of free swap!

Alternatively you can ditch fork()/exec() and go to a spawn-like model, like Windows does, eliminating the need to allocate that extra memory by constraining what you can do between the conceptual fork() and exec(). But that leads to a horrifyingly complex interface which is always too limited, since it's trying to reimplement "everything you might ever want to do in the child before executing" using something simple enough that allocation is constrained.

There is no easy solution here. If there was, people would have found it by now.

(Eliminating fork()/exec() also won't fix the problem -- among other things, stack allocations are still optimistic, even on Windows, IIRC.)

Memory management when failure is not an option

Posted Mar 11, 2015 19:24 UTC (Wed) by SirCmpwn (subscriber, #99589)

I have written an operating system and kernel, and the model I use is to spawn a process (load it up into memory and give it space in the process table and so on) without actually starting it until the person calling it is ready. Then you have any number of interfaces for tweaking the environment it runs in before you finally give the go-ahead to start executing it.

Memory management when failure is not an option

Posted Mar 12, 2015 13:25 UTC (Thu) by cortana (guest, #24596)

I wish Linux had this--particularly if the first system call gave back a file descriptor corresponding to the process that could be used to monitor/kill the child.

Memory management when failure is not an option

Posted Mar 12, 2015 17:14 UTC (Thu) by nix (subscriber, #2304)

It does: fork() gives back the PID, and you can use it to both monitor and kill the child.

The API used to monitor the child (ptrace(), natch) is so terrible that nobody uses it for ordinary post-fork() actions, but still, you already have exactly that.

Memory management when failure is not an option

Posted Mar 12, 2015 18:30 UTC (Thu) by JGR (subscriber, #93631)

The PID that fork gives you only gets you so far. The child process may have died and the PID been recycled before you try to use the PID in the parent process. You can mostly handle this by handling SIGCHLD properly, but that can be fiddly, and doesn't really work when you get to grandchildren and so on.

Memory management when failure is not an option

Posted Mar 12, 2015 21:45 UTC (Thu) by nix (subscriber, #2304)

Nah, it's safe: if you're going to use ptrace() you have to waitpid() anyway, and waitpid() is exactly what you need if you're going to detect the process dying, let alone reap it, which is a precondition for PID recycling to happen.

Memory management when failure is not an option

Posted Mar 12, 2015 18:37 UTC (Thu) by cortana (guest, #24596)

That's... horrible! This is one place that Windows is better than Linux; CreateProcess returns a HANDLE that can be used to monitor/kill the process. Not just a transient identifier that will be reused when it wraps around...

Memory management when failure is not an option

Posted Mar 13, 2015 2:10 UTC (Fri) by mathstuf (subscriber, #69389)

Well, BSD has forkfd which gives you back a file descriptor to a pid you can toss into kqueue and friends. Why Linux hasn't gotten around to forkfd is something I'd like to know :) .

Memory management when failure is not an option

Posted Mar 11, 2015 19:33 UTC (Wed) by nybble41 (subscriber, #55106)

> Alternatively you can ditch fork()/exec() and go to a spawn-like model, like Windows does, eliminating the need to allocate that extra memory by constraining what you can do between the conceptual fork() and exec(). But that leads to a horrifyingly complex interface which is always too limited, since it's trying to reimplement "everything you might ever want to do in the child before executing" using something simple enough that allocation is constrained.

One option here—though extremely disruptive—would be to take all the system calls that modify a process's state and make them take an additional handle/FD identifying the target process. Then you could just spawn a blank, suspended process, obtaining a handle, and use normal system calls to get the process into the right state before making it runnable. With this approach you wouldn't need to implement anything twice; the downside is that it implies a new security domain, permission to perform system calls on behalf of another process. That permission could automatically terminate once the process is made runnable, ending the initialization phase.

Memory management when failure is not an option

Posted Mar 11, 2015 19:53 UTC (Wed) by mathstuf (subscriber, #69389)

What about a fork() taking flags and/or a memfd for writeable memory which indicate things like "replacement" (daemonization), "read only until exec", and other common cases?

Memory management when failure is not an option

Posted Mar 12, 2015 1:54 UTC (Thu) by nix (subscriber, #2304)

That would be the terrifying posix_spawn*() family of APIs. Their sheer number suggests just how unscalable this approach is.

(The 'have everything take a PID' approach suggested by nybble41 is a good one -- as a side benefit it would give us something better than ptrace(). But it's very hard to get there from here.)


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds