Reservations for must-succeed memory allocations
Michal Hocko started off by saying that the memory-management developers would like to deprecate the __GFP_NOFAIL flag, which is used to mark allocation requests that must succeed at any cost. But doing so, it turns out, just drives developers to put infinite retry loops into their own code rather than using the allocator's version. That, he noted dryly, is not a step forward. Retry loops spread throughout the kernel are harder to find and harder to fix, and they hide the "must succeed" nature of the request from the memory-management code.
Getting rid of those loops is thus, from the point of view of the memory-management developers, a desirable thing to do. So Michal asked the gathered developers to work toward their elimination. Whenever such a loop is encountered, he said, it should just be replaced by a __GFP_NOFAIL allocation. Once that's done, the next step is to figure out how to get rid of the must-succeed allocation altogether. Michal has been trying to find ways of locating these retry loops automatically, but attempts to use Coccinelle to that end have shown that the problem is surprisingly hard.
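As a rough sketch (not taken from any particular kernel subsystem), the open-coded pattern the memory-management developers want to see replaced, and the equivalent __GFP_NOFAIL request, look something like this:

    /* Open-coded retry loop: the must-succeed nature of the request is
     * invisible to the memory-management code, which cannot account for it. */
    struct foo *p;   /* "struct foo" stands in for any kernel object */

    do {
        p = kmalloc(sizeof(*p), GFP_KERNEL);
    } while (!p);

    /* The equivalent request, visible to (and handled by) the allocator. */
    p = kmalloc(sizeof(*p), GFP_KERNEL | __GFP_NOFAIL);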
Johannes Weiner mentioned that he has been working recently to improve the out-of-memory (OOM) killer, but that goal proved hard to reach as well. No matter how good the OOM killer is, it is still based on heuristics and will often get things wrong. The fact that almost everything involved with the OOM killer runs in various error paths does not help; it makes OOM-killer changes hard to verify.
The OOM killer is also subject to deadlocks. Whenever code requests a memory allocation while holding a lock, it is relying on there being a potential OOM-killer victim task out there that does not need that particular lock. There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock. On such systems, a low-memory situation that brings the OOM killer into play may well lead to a full system lockup.
Rather than depend on the OOM killer, he said, it is far better for kernel code to ensure that the resources it needs are available before starting a transaction or getting into some other situation where things cannot be backed out. To that end, there has been talk recently of creating some sort of reservation system for memory. Reservations have downsides too, though; they can be more wasteful of memory overall. Some of that waste can be reduced by placing reclaimable pages in the reserve; that memory is in use, but it can be reclaimed and reallocated quickly should the need arise.
James Bottomley suggested that reserves need only be a page or so of memory, but XFS maintainer Dave Chinner was quick to state that this is not the case. Imagine, he said, a transaction to create a file in an XFS filesystem. It starts with allocations to create an inode and update the directory; that may involve allocating memory to hold and manipulate free-space bitmaps. Some blocks may need to be allocated to hold the directory itself; it may be necessary to work through 1MB of stuff to find the directory block that can hold the new entry. Once that happens, the target block can be pinned.
This work cannot be backed out once it has begun. Actually, it might be possible to implement a robust back-out mechanism for XFS transactions, but it would take years and double the memory requirements, making the actual problem worse. All of this is complicated by the fact that the virtual filesystem (VFS) layer will have already taken locks before calling into the filesystem code. It is not worth the trouble to implement a rollback mechanism, he said, just to be able to handle a rare corner case.
Since the amount of work required to execute the transaction is not known ahead of time, it is not possible to preallocate all of the needed memory before crossing the point of no return. It should be possible, though, to produce a worst-case estimate of memory requirements and set aside a reserve in the memory-management layer. The size of that reserve, for an XFS transaction, would be on the order of 200-300KB, but the filesystem would almost never use it all. That memory could be used for other purposes while the transaction is running as long as it can be grabbed if need be.
XFS has a reservation system built into it now, but it manages space in the transaction log rather than memory. The amount of concurrency in the filesystem is limited by the available log space; on a busy system with a large log he has seen 7,000-8,000 transactions active at once. The reservation system works well and is already generating estimates of the amount of space required; all that is needed is to extend it to memory.
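To make that concrete, one could imagine the per-transaction reservation growing a memory field next to the existing log-space numbers; the structure below is purely illustrative and is not actual XFS code:

    /* Illustrative sketch only -- XFS already computes the log-space figures
     * for each transaction type; the idea discussed here is to carry a
     * worst-case memory estimate alongside them. */
    struct fs_tx_reservation {
        unsigned int log_space;       /* bytes of journal space required */
        unsigned int log_count;       /* number of log writes permitted */
        unsigned int mem_worst_case;  /* e.g. 200-300KB for a file create */
    };

Both reservations would be taken before the transaction makes its first modification and released at commit.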
A couple of developers raised concerns about the rest of the I/O stack; even if the filesystem knows what it needs, it has little visibility into what the lower I/O layers will require. But Dave replied that these layers were all converted to use mempools years ago; they are guaranteed to be able to make forward progress, even if it's slow. Filesystems layered on top of other filesystems could add some complication; it may be necessary to add a mechanism where the lower-level filesystem can report its worst-case requirement to the upper-level filesystem.
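For reference, the mempool interface those layers rely on looks roughly like the sketch below; the object type, pool size, and function names (other than the slab and mempool calls themselves) are illustrative:

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    /* "struct io_unit" and the pool size of 16 are illustrative. */
    struct io_unit {
        struct io_unit *next;
        void *payload;
    };

    static struct kmem_cache *io_unit_cache;
    static mempool_t *io_unit_pool;

    static int __init io_unit_pool_init(void)
    {
        io_unit_cache = kmem_cache_create("io_unit", sizeof(struct io_unit),
                                          0, 0, NULL);
        if (!io_unit_cache)
            return -ENOMEM;
        /* Keep at least 16 objects in reserve at all times. */
        io_unit_pool = mempool_create_slab_pool(16, io_unit_cache);
        if (!io_unit_pool) {
            kmem_cache_destroy(io_unit_cache);
            return -ENOMEM;
        }
        return 0;
    }

    static struct io_unit *get_io_unit(void)
    {
        /* May sleep until another user returns an object to the pool, but it
         * will not fail -- this is what guarantees forward progress. */
        return mempool_alloc(io_unit_pool, GFP_NOIO);
    }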
The reserve would be maintained by the memory-management subsystem. Prior to entering a transaction, a filesystem (or other module with similar memory needs) would request a reservation for its worst-case memory use. If that memory is not available, the request would stall at that point, throttling the users of reservations. Thereafter, a special GFP flag would indicate that an allocation should dip into the reserve if memory is tight. There is a slight complication around demand paging, though: as XFS is reading in all of those directory blocks to find a place to put a new file, it will have to allocate memory to hold them in the page cache. Most of the time, though, the blocks are not needed for long and can be reclaimed almost immediately; these blocks, Dave said, should not be counted against the reserve. Actual accounting of reserved memory should, instead, be done when a page is pinned.
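No such interface exists today; as a loose sketch of what was being discussed (every name here -- mem_reserve(), mem_reserve_release(), __GFP_RESERVE, and the size constant -- is invented for illustration), a transaction might be bracketed like this:

    /* Hypothetical sketch only; nothing like this API is in the kernel. */
    struct mem_reserve *res;
    size_t buf_size = PAGE_SIZE;
    void *bp;

    /* Taken before the transaction starts; may block (throttling the caller)
     * until the memory-management layer can back the reservation. */
    res = mem_reserve(XFS_CREATE_WORST_CASE);   /* on the order of 200-300KB */

    /* Allocations made during the transaction carry a flag telling the
     * allocator it may dip into the reserve if reclaim cannot keep up. */
    bp = kmalloc(buf_size, GFP_NOFS | __GFP_RESERVE);

    /* Directory blocks read into the page cache and reclaimed almost
     * immediately would not be charged; accounting would happen only when a
     * page is actually pinned. */

    mem_reserve_release(res);                   /* at transaction commit */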
Johannes pointed out that all reservations would be managed in a single, large pool. If one user underestimates their needs and allocates beyond their reservation, it could ruin the guarantees for all users. Dave answered that this eventuality is what the reservation accounting is for. The accounting code can tell when a transaction overruns its reservation and put out a big log message showing where things went wrong. On systems configured for debugging it could even panic the system, though one would not do that on a production system, of course.
The handling of slab allocations brings some challenges of its own. The way forward there seems to be to assume that every object allocated from a slab requires a full page allocation to support it. That adds a fair amount to the memory requirements — an XFS transaction can require as many as fifty slab allocations.
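Under that assumption the arithmetic is straightforward: with 4KB pages, the slab allocations alone account for most of the 200-300KB worst-case figure quoted earlier (the numbers are simply the ones mentioned in the session):

    /* Worst-case accounting under the "one full page per slab object"
     * assumption described above. */
    #define WORST_CASE_SLAB_ALLOCS  50
    /* 50 allocations * 4KB PAGE_SIZE = 200KB charged to the reservation. */
    size_t slab_reserve = WORST_CASE_SLAB_ALLOCS * PAGE_SIZE;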
Many (or most) transactions will not need to use their full reservation to complete. Given that there may be a large number of transactions running at any given time, it was suggested, perhaps the kernel could get away with a reservation pool that is smaller than the total number of pages requested in all of the active reservations. But Dave was unenthusiastic, describing this as another way of overcommitting memory that would lead to problems eventually.
Johannes worried that a reservation system would add a bunch of complexity to the system. And, perhaps, nobody will want to use it; instead, they will all want to enable overcommitting of the reserve to get their memory and (maybe) performance back. Ted Ts'o also thought that there might not be much demand for this capability; in the real world, deadlocks related to low-memory conditions are exceedingly rare. But Dave said that the extra complexity should be minimal; XFS, in particular, already has almost everything that it needs.
Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time. Do we really want to add the extra complexity just to make things work better on under-resourced systems? Ric Wheeler responded that we really shouldn't have a system where unprivileged users can fire off too much work and crash the box. Dave agreed that such problems can, and should, be fixed.
Even if there is a reserve, Ted said, administrators will often turn it off in order to eliminate the performance hit from the reservation system (which he estimated at 5%); they'll do so with local kernel patches if need be. Dave agreed that it should be possible to turn the reservation system off, but doubted that there would be any significant runtime impact. Chris Mason agreed, saying that there is no code yet, so we should not assume that it will cause a performance hit. Dave said that the real effect of a reservation would be to move throttling from the middle of a transaction to the beginning; the throttling happens in either case. James was not entirely ready to accept that, though; in current systems, he said, we usually muddle through a low-memory situation, while with a reservation we will be actively throttling requests. Throughput could well suffer in that situation.
The only reliable way to judge the performance impact of a reservation system, though, will be to observe it in operation; that will be hard to do until this system is implemented. Johannes closed out the session by stating the apparent consensus: the reservation system should be implemented, but it should be configurable for administrators who want to turn it off. So the next step is to wait for the patches to show up.
Index entries for this article
Kernel: Memory management/Page allocator
Conference: Storage, Filesystem, and Memory-Management Summit/2015
Posted Mar 17, 2015 17:23 UTC (Tue)
by amonnet (guest, #54852)
[Link] (2 responses)
Keeping it dirty, but much simpler: why not reserve a fixed amount of memory that can be used in the failing use case (i.e. allocation while the OOM killer is waiting for a process to be killed)?
+++
Posted Mar 17, 2015 22:11 UTC (Tue)
by fandingo (guest, #67019)
[Link] (1 responses)
Posted Mar 18, 2015 14:23 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link]
In general there's probably enough stuff floating around in the system that there would always be a sizeable reservation area, but the 99.999% occasion could still be problematic, so an API would be needed to check that the reservation area is at least a certain size before XFS or other things go off on a path of no return. The reservation request function could have a blocking variant (which tries to increase the reservation area to meet demand if needed, or waits for other reservation users to complete), or it could return a failure that could be propagated back to userspace well before any critical actions have taken place in the caller; e.g. open() might return ENOMEM if the reservation area isn't sufficiently large to meet the demands needed to ensure that the system call can progress in the worst case. After a critical operation completes, the reservation area request should be released.
Some other API changes may be needed so that a caller can request pages that use the reservation area, and book-keeping to ensure callers don't request more from the reservation than they have previously 'reserved' would be prudent.
Finally, I was also thinking that a simple swap device could also help. Simple swap would mean that pages can be read and written trivially without calling memory allocators or introducing lock dependencies. If a block device could indicate that it was 'simple swap' compatible according to these requirements, then any of its free space could be accounted to the 'reservation area' by enabling dirty pages to be swapped out without ending up in the circular allocation and lock dependency battles which seem to be the cause of all these problems. Directly accessed partitions on a locally attached hard-drive, or zram could probably be made to flag as 'simple swap' compatible.
Posted Mar 17, 2015 18:04 UTC (Tue)
by post-factum (subscriber, #53836)
[Link] (2 responses)
Posted Mar 18, 2015 1:15 UTC (Wed)
by roblucid (guest, #48964)
[Link] (1 responses)
Posted Mar 19, 2015 13:41 UTC (Thu)
by CChittleborough (subscriber, #60775)
[Link]
Posted Mar 17, 2015 18:57 UTC (Tue)
by josh (subscriber, #17465)
[Link] (5 responses)
Does this mean that GFP_NOIO and similar are obsolete?
Posted Mar 17, 2015 19:55 UTC (Tue)
by dlang (guest, #313)
[Link]
Posted Mar 17, 2015 20:41 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (3 responses)
There is some subtlety here...
Superficially: no, mempools don't make GFP_NOIO obsolete. They protect against different things.
GFP_NOIO is all about the locks. GFP_NOIO is called while holding a lock that might be taken during "IO". Reclaim to satisfy such an allocation must not perform "IO" as that could block on a lock that is already held, resulting in a deadlock.
mempools are about which actively used memory you are willing to wait for to become freed. So a mempool allocation does a normal memory allocation which may initiate reclaim and fs-writeback and IO etc. But it will not wait for memory to become free. If nothing is available, it will then use something in the pool, or wait for a previous pool allocation to be returned.
So it is about waiting for memory, not waiting for locks.
However ... direct reclaim doesn't do IO any more at all - it just kicks kswapd and lets it do all the reclaim and IO. So it is possible that GFP_NOIO is obsolete, but not because of mempool.
And it is only a maybe. The change to avoid direct reclaim will have made the role of GFP_NOIO quite different, but I would need to examine code carefully before proclaiming that it was dead.
Posted Mar 19, 2015 1:02 UTC (Thu)
by Paf (subscriber, #91811)
[Link] (2 responses)
In an entirely real example (I've debugged it), kswapd can call shrinkers (in at least some cases registered by file systems) which attempt to clear caches which (if the file system was asking for the memory) can require locks which are not available.
In our specific case (Lustre), we were actually spawning threads rather than allocating memory directly, so we had to spawn our threads without the relevant locks held... But when directly allocating memory, we must be careful to, most of the time, use GFP_NOFS. That's not GFP_NOIO, of course, but I don't see a fundamental difference.
Have I missed something here?
Posted Mar 19, 2015 1:04 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Mar 19, 2015 1:39 UTC (Thu)
by neilbrown (subscriber, #359)
[Link]
No, I was.
kswapd does all the "writeback to filesystems", but direct reclaim can still call the shrinkers.
So GFP_NOFS and GFP_NOIO are still needed (at least) so shrinkers can decide if it is safe to take various locks.
Thanks.
Posted Mar 17, 2015 20:29 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (23 responses)
All of the mm developers, or just some? And do we know why?
> Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time.
So team-red says "It's broken, we need to fix it", and team blue says "it ain't broke, don't fix it".
> There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock.
I suspect this is the elephant in the room - memory control groups. Things work properly 99.9999% of the time .... when you aren't using control groups!?!
So the proposal is to rewrite some filesystems to make implementing memory control groups easier. Did I get that right?
It seems that discussing a solution might be premature - more airtime needed on the problem?
Posted Mar 17, 2015 20:36 UTC (Tue)
by dlang (guest, #313)
[Link]
I think that team blue isn't saying "it ain't broke" but rather "the fix is worse than the problem"
Posted Mar 17, 2015 22:59 UTC (Tue)
by nix (subscriber, #2304)
[Link] (18 responses)
Often (the vast majority of the time, I expect) you're lucky and the big process trips the oom-killer while it's doing other work in the middle of that big I/O (few processes do solid metadata-heavy I/O all the time), but that's *luck*, not judgement. And I don't much like relying on luck to keep my systems from deadlocking! :) particularly given that this sort of situation seems like something it wouldn't be *all* that terribly hard to engineer. It's not like the various contending processes need to run in different privilege domains or anything.
Posted Mar 17, 2015 23:02 UTC (Tue)
by dlang (guest, #313)
[Link] (3 responses)
or am I missing something here?
Posted Mar 17, 2015 23:37 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (2 responses)
That perspective misses the point. The problem isn't exactly being out of memory. The problem is memory allocation requests failing or blocking indefinitely. A memory-constrained process can have a memory allocation fail even when the system as a whole has plenty of free memory. If the code which makes that failing request isn't written to expect that behaviour, it could easily cause further problems.
There is a lot of complexity and subtlety in the VM to try to keep memory balanced between different needs, and to avoid deadlocks and maintain liveness. For memory cgroups to impose limits on in-kernel allocations, it needs to replicate all that subtlety inside the memcg system. Certainly that should be possible, but I doubt it would be easy.
But I think you are suggesting that a memory-constrained process cannot run the whole system out of memory and so cannot cause problems - is that right?
Posted Mar 17, 2015 23:56 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
As long as the overall system isn't out of memory, the fact that a user/container/vm is using all the memory it's allowed shouldn't cause this sort of problem for things outside of that user/container/vm
Posted Mar 18, 2015 11:04 UTC (Wed)
by dgm (subscriber, #49227)
[Link]
Posted Mar 17, 2015 23:28 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (13 responses)
yes, I have too. In those cases they were removed by relatively simple code fixes.
While there are some common pattern, each deadlock is potentially quite different.
Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.
So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.
Whether there are deadlocks that can only (or most easily) be fixed by new memory reservation schemes is the important question. It is one that can only be answered by careful analysis of lots of details.
Posted Mar 18, 2015 15:30 UTC (Wed)
by vbabka (subscriber, #91706)
[Link] (12 responses)
>yes, I have too. In those cases they were removed by relatively simple code fixes.
>While there are some common pattern, each deadlock is potentially quite different.
>Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.
>So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.
Yes, in some cases the fix is simple. But AFAIU in general it's not feasible for OOM killer to know which task is holding which locks (without the kind of overhead that enabling lockdep has), so it's not possible to guarantee it will select victims in a way that guarantees forward progress.
Posted Mar 18, 2015 22:13 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (11 responses)
What I keep wondering is why this matters so much.
Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all its user-space memory. That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.
I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?
Posted Mar 18, 2015 22:26 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Mar 18, 2015 23:31 UTC (Wed)
by nix (subscriber, #2304)
[Link] (5 responses)
Posted Mar 18, 2015 23:31 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
I clearly need to go to sleep...
Posted Mar 19, 2015 1:12 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Uninterruptible sleeping, and sleeping with sigkill blocked. Doing either one in a syscall means the process won't act on sigkill until it is woken up. I believe when sleeping uninterruptibly, sigkill is ignored. (I'm pretty sure.)
One particularly fun thing in multi-threaded systems I've actually seen: The intended waker is killed and the sleeper is now unwakeable and unkillable.
Posted Mar 19, 2015 0:03 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Mar 19, 2015 0:32 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
So either they will have called get_user_pages() and will hold references to the pages which will keep them safe, or it will be calling copy_{to,from}_user which is designed to handle missing addresses and will return an appropriate error status if the memory isn't there.
Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Posted Mar 19, 2015 18:45 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Other than weird zero-copy scenarios I think you're not missing anything.
Wouldn't this require splitting the victim's VMA to free pages that are not pinned (requiring more RAM to do it)? On the other hand, in most cases only a couple of pages are going to be pinned at any given moment.
Posted Mar 19, 2015 8:08 UTC (Thu)
by vbabka (subscriber, #91706)
[Link] (3 responses)
> I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?
Yeah Mel suggested this to Dave before the session, but it didn't seem a sufficient solution to avoid the need for reservations completely.
I'm not sure about the exact reason, but if you think about it, there's not much difference between the pages you can reclaim and pages you can unmap. And as long as you can reclaim, OOM is not invoked.
- file pages that are clean, could have been reclaimed, those that are dirty cannot be simply discarded (maybe except some temporary files that have been already unlinked)
- anonymous pages could have been swapped out. Yes, there might be a difference if your swap is full, or file-backed (thus potentially blocking). Otherwise mempools in I/O layer should have guaranteed progress swapping out during reclaim.
- unevictable pages (mlock) - here unmapping on OOM could help, but we could also maybe just breach mlock guarantees and reclaim the pages if the system is in trouble - at that point, any performance guarantees are probably lost anyway. OK, maybe not, since you might be using mlock to prevent sensitive data in anonymous private mappings to hit persistent storage...
- pages holding the page tables, once you empty them - that will gain you some memory, but likely not guaranteed enough to save the situation
Also did you know that SLE11 (SP1? not sure) kernel already has some limited form of memory reservations? For swap over NFS, I heard :)
Posted Mar 19, 2015 8:30 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (2 responses)
There may still be a need for reservations, but that seems to be a largely separate problem from the OOM killer not being able to free memory from the worst offender.
Posted Mar 19, 2015 19:45 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
Now if XFS could check (and temporarily reserve) how much reclaimable memory is available before starting a transaction, XFS could fail early, or perhaps the OOM killer could be started before the situation deteriorates to the point where no progress can be made due to unreclaimable memory and swap exhaustion.
Posted Mar 21, 2015 11:42 UTC (Sat)
by mtanski (guest, #56423)
[Link]
Think of this as back pressure in a low-resource scenario... and it's the right place to apply back pressure: before the transaction starts, before it's too late (not enough memory to make progress).
The downside is that it will lower concurrency on heavily loaded but under resourced (memory) systems.
Posted Mar 18, 2015 15:23 UTC (Wed)
by vbabka (subscriber, #91706)
[Link] (1 responses)
>All of the mm developers, or just some? And do we know why?
Actually, it has already been deprecated for years. Which is exactly what led to people working around it (literally :) with retry loops. Which means the MM subsystem cannot know (without seeing the flag) that the particular allocation site in fact cannot fail, and cannot treat it specially.
I think the article is a bit misleading on the "would like to deprecate" part here. In fact, Michal has already posted a patch to clarify the wording:
* __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures. This modifier is deprecated and no new
- * users should be added.
+ * cannot handle allocation failures. New users should be evaluated carefuly
+ * (and the flag should be used only when there is no reasonable failure policy)
+ * but it is definitely preferable to use the flag rather than opencode endless
+ * loop around allocator.
So AFAIU the goal now is not to deprecate the flag, but to handle the allocations that do need to use it in a more reliable way than relying just on OOM.
> > Ted insisted, though, that this work is addressing a corner case; things work properly, he said, 99.9999% of the time.
> So team-red says "It's broken, we need to fix it", and team blue says "it ain't broke, don't fix it".
I guess there might be users which want things to work properly 100% of the time, and not rely on luck.
> So the proposal is to rewrite some filesystems to make implementing memory control groups easier. Did I get that right?
As has been already mentioned, this is not at all limited to memcg.
Posted Mar 18, 2015 15:34 UTC (Wed)
by vbabka (subscriber, #91706)
[Link]
> I think the article is a bit misleading on the "would like to deprecate" part here.
More precisely, this was meant in a way that MM devs had wanted to deprecate the __GFP_NOFAIL flag (and even in fact did so, in its description), but later realized that there are allocation sites that do need it; while it would be nice to get rid of it, it doesn't seem to be a realistic goal.
Posted Mar 27, 2015 15:17 UTC (Fri)
by mstsxfx (subscriber, #41804)
[Link]
> > There are some workloads, often involving a small number of processes running in a memory control group, where every task depends on the same lock.
> I suspect this is the elephant in the room - memory control groups. Things work properly 99.9999% of the time .... when you aren't using control groups!?!
> So the proposal is to rewrite some filesystems to make implementing memory control groups easier. Did I get that right?
Not at all. Memory control groups had a similar issue but this has been solved because we now trigger the memcg OOM killer only from the page fault path after all the previous locks were dropped (have a look at pagefault_out_of_memory). It is the !memcg case which is hitting the same problem now. There are certainly ways to make the OOM killer smarter (e.g. tearing down parts of the address space). The point remains, though: there are non-failing allocations (__GFP_NOFAIL) made while holding locks (i_mutex to name the most visible example) which might be preventing further progress. It is surprisingly easy to hit some of those corner cases without privileges.
Now I am not suggesting that the full reservation system is a must, but we might eventually need it if all our other options are not sufficient. Filesystem people are already using a reservation system, so it is not surprising they are pushing for a similar thing in the MM code as well. I suspect that an MM implementation will be tricky, so we will push back and try everything else before going that route.
Posted Mar 23, 2015 16:30 UTC (Mon)
by meyert (subscriber, #32097)
[Link] (1 responses)
That's really an interesting definition of a "transaction"...
Posted Mar 23, 2015 17:36 UTC (Mon)
by pizza (subscriber, #46)
[Link]
Think of it from the perspective of data-on-disk.
Posted Mar 29, 2015 21:26 UTC (Sun)
by toyotabedzrock (guest, #88005)
[Link]
Posted Apr 1, 2015 2:41 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link] (12 responses)
It seems odd to see all the debate and complexity to save a few hundred kB, and I question the priorities of some of these apocryphal "administrators" who would prefer a 5% performance gain over non-deterministic lockups and a visit from the Chaos Monkey. I threw several gigabytes of RAM at this problem years ago and never looked back. I would love to see filesystems just grab a megabyte or ten of RAM at mount time to handle their worst-case peak memory demands, and be done with this kind of problem forever.
IMHO "no-fail" allocations should come out of a previously reserved (and fully committed!) pool. Allocations in excess of the amount reserved should fail in _all_ cases, not just low-memory cases. This eliminates low-memory corner cases by making them identical to the normal cases.
Posted Apr 1, 2015 3:43 UTC (Wed)
by dlang (guest, #313)
[Link] (2 responses)
Even in the server space, a lot of people running VMs are constrained far more by the amount of RAM that they can cram into the system than the CPU cycles available.
It's seldom as simple as "trade 5% speed for frequent random crashes"
Posted Apr 1, 2015 18:04 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link] (1 responses)
The workloads are proportionally smaller too (no multi-layer filesystem + LVM + RAID, maximum burst write size is smaller, etc), so in practice less than 100MB needs to be set aside. It's still 20% of the machine, though, and it wouldn't need to be set aside at all if the kernel could be trusted to manage its own memory.
If the worst-case transaction RAM usage in my favorite filesystem is 20MB, and I have a 16MB machine, I cannot use that filesystem on that machine. Even if the filesystem only requires 1MB 99% of the time, as soon as that 1%-of-the-time case pops up, the filesystem will fail, and it will probably take most or all of the application stack down with it. There is no option in this case that does not lead to failure. The only question is when the failure is detected. I'd much prefer mount to fail at the start because the filesystem can't reserve space for one instance of its worst case space requirement. The alternative is to fail later in the field. Possibly literally in the field, if the Pi has been installed in some kind of autonomous robot...
I don't expect a Pi to be able to sustain 2000 simultaneous writing threads for a dozen reasons, only one of which is not having enough RAM--reserved or otherwise--to handle all the filesystem transactions at once. I'd expect either serialization or ENOMEM, but what I get is a hang or a Chaos Monkey.
> Even in the server space, a lot of people running VMs are constrained far more by the amount of RAM that they can cram into the system than the CPU cycles available.
RAM size determines workload size and vice versa. If the workload exceeds the available RAM the application will fail whether we use a reservation scheme or not. The admin's job is to adjust workload or RAM sizes until there's enough RAM for the workload and not too much workload for the RAM.
In practice, the admin currently has to determine what the workload's worst-case RAM requirement is, and guess how much headroom to add on top to prevent the kernel from randomly failing. Ideally, the kernel would just manage the headroom it needs by itself, and eliminate the guesswork for the admin.
Posted Apr 1, 2015 18:42 UTC (Wed)
by dlang (guest, #313)
[Link]
In general people do just 'throw memory at it', this entire situation only comes up when a system is using all the memory it has, and can't easily free more.
Posted Apr 1, 2015 9:43 UTC (Wed)
by etienne (guest, #25256)
[Link] (8 responses)
But would you love grabbing "a megabyte or ten" each time someone writes one byte to a file/device because the file may be on a userspace filesystem of a unionfs of a complex filesystem on a RAID partition of a...
The reservation shall be at the kernel entry (from userspace) because that is where no locks are taken, that is also where nobody knows how much memory will be needed to write that byte.
Maybe a solution could be a new error code internal to Linux, something like -EKERNELRETRY, where it is like an error and everything is cancelled, but just before returning to the usermode application the request is retried entirely. Still a very complex solution...
Posted Apr 1, 2015 16:50 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link] (5 responses)
If the alternative is a messy sort of failure, then yes. I either need the memory or I don't. If I need the memory, I need the memory. The algorithms implemented in the software I'm running won't work without the memory they require _by definition_. This should not be controversial.
Currently I have to let gigabytes of RAM lie fallow because my kernel can't be trusted to manage its own memory sanely. How could a few hundred or even a few thousand 1MB preallocations make that worse? If anything, such a scheme would _save_ memory for me.
> The reservation shall be at the kernel entry (from userspace) because that is where no locks are taken, that is also where nobody knows how much memory will be needed to write that byte.
The syscall entry is really too late. The reservations should be done much earlier, e.g. when the filesystem is mounted or when files are opened for writing. We should reserve RAM as soon as usage becomes possible so we are not surprised when the bad cases pop up later.
The filesystem should always know what it would need to write _a_ byte. Multiply that amount by the number of simultaneously active writing threads (possibly less if the filesystem can combine similar requests into a single RAM reservation requirement, or serialize large requests to reduce peak RAM usage). Reserve that amount for the filesystem to use. Recurse and repeat for each lower layer until all the memory required to write the byte is reserved.
Posted Apr 1, 2015 18:44 UTC (Wed)
by dlang (guest, #313)
[Link] (2 responses)
where in the world did you get this from?
Posted Apr 1, 2015 19:37 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Apr 1, 2015 21:28 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link]
Most of my big workload applications are based on a single constant-sized blob of data (e.g. a RDBMS server, git repo, or similar) with latency-bound processing requirements (i.e. waiting for disk I/O is not permitted). I'd love to say "processes in group A get access to 75% of RAM all the time, and everything else on the system gets to fight over the remainder." As far as I can tell, cgroups provide only the exact opposite of that: "if I limit everything else to 25% of RAM, group A gets its RAM by default maybe 98% of the time." (the other 2% is a failure mode where all the RAM gets eaten by something invisible to cgroups and slabtop, and the machine watchdog-resets).
So I've got all of userspace throttled by cgroups, and suddenly many of the stupid things that the kernel does when memory is low just go away. No more high CPU usage in kernel space when free RAM is low, no more randomly killed processes, fewer random crashes, fewer spurious I/O errors, and fewer other random and bizarre bugs that only seem to occur when something uses the last free pages of RAM. Occasionally there's a kernel stack trace with "memcg" in it, but that's usually followed a few days later by a kernel patch to fix it.
I've experimentally found that somewhere between 10 and 30% of the RAM has to be inaccessible to userspace before I get predictable performance results, which is a few gigabytes on a typical 16GB system. Most of the variation comes from VFS dentry/inode cache, which isn't directly controlled by cgroups, but uses space roughly proportional to cgroup page cache limits.
Posted Apr 1, 2015 19:40 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
(Note: it might be acceptable to have threads block at the point when they would otherwise be about to initiate an fs write if an allocation cannot be obtained -- but how is that different from what we have now, particularly given that writes that are necessary to resolve memory pressure can still lead to deadlocks in this scenario?)
Posted Apr 1, 2015 21:22 UTC (Wed)
by zblaxell (subscriber, #26385)
[Link]
That number is already bounded by the amount of RAM you have to support those threads. I'd propose just lowering it slightly, e.g. to "only half the number of threads that will cripple the system by exhausting all available memory."
> would require every thread creation to be accompanied by God only knows how much peripheral allocation by everything in the kernel that might potentially need to do work on behalf of that thread in the future
When a thread modifies a file, it is the file (or the filesystem containing it) that ultimately needs the reserved space. The threads were just there incidentally.
In practice the thread doesn't have to own anything. The reserved space for writing the file would be owned by the file or its filesystem. The thread that first wrote something would create the reserved allocation and attach it to the file, and whichever thread got stuck with the job of flushing the page to disk would consume the reserved allocation from the file. There may be recycling of allocations. Or rather the filesystem code executed by the thread would do all that, since the filesystem is the expert on how much memory it needs in the first place.
I am proposing that before we let a thread dirty a page, enough space is reserved to be able to reliably clean it in the future. That doesn't seem unreasonable to me. It may not even be extra work (for the machine) since the filesystem was going to allocate and use that memory anyway, just at a different time.
> Note: it might be acceptable to have threads block at the point when they would otherwise be about to initiate an fs write if an allocation cannot be obtained
We can block earlier, before (many) locks are held.
> but how is that different from what we have now, particularly given that writes that are necessary to resolve memory pressure can still lead to deadlocks in this scenario?
I'd really prefer to not have memory pressure and writes interact at all. That's half the reason why I'm using cgroup hacks right there: to prevent any group of processes from using dirty pages to export memory pressure to the rest of the system. Among many other things, cgroups are a crude way of forcing dirty pages to be counted separately from any other kind of page. "cgroup" is just close enough to "page type" to be effective, since most of my cgroups tend to be dominated by a single page type.
Posted Apr 1, 2015 22:45 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
We already have that.
#define ERESTARTSYS 512
Posted Apr 2, 2015 6:00 UTC (Thu)
by kleptog (subscriber, #1183)
[Link]