Reservations for must-succeed memory allocations

Posted Mar 17, 2015 22:59 UTC (Tue) by nix (subscriber, #2304)
In reply to: Reservations for must-succeed memory allocations by neilbrown
Parent article: Reservations for must-succeed memory allocations

I have seen deadlocks on non-memcg memory-constrained systems when memory runs out. It *can* happen, generally when you have a few big processes eating lots of memory, then doing some simultaneous I/O by mischance and all blocking on an allocation down in the fs layer (metadata allocation is where I've seen it). The oom-killer kicks into action and slaughters the little processes that are lying around (since it can't slaughter the big ones), whereupon one of the big processes fires up, eats that memory too, blocks again... and then everything deadlocks.

Often (the vast majority of the time, I expect) you're lucky and the big process trips the oom-killer while it's doing other work in the middle of that big I/O (few processes do solid metadata-heavy I/O all the time), but that's *luck*, not judgement. And I don't much like relying on luck to keep my systems from deadlocking! :) particularly given that this sort of situation seems like something it wouldn't be *all* that terribly hard to engineer. It's not like the various contending processes need to run in different privilege domains or anything.



Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:02 UTC (Tue) by dlang (guest, #313) (3 responses)

Well, as long as there are limits on the amount of memory the processes are allowed to use, they won't be able to run the system completely out of memory and trigger the problem.

or am I missing something here?

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:37 UTC (Tue) by neilbrown (subscriber, #359) (2 responses)

I'm having a bit of trouble parsing what you wrote, so please forgive me if I misunderstand.
But I think you are suggesting that a memory-constrained process cannot run the whole system out of memory and so cannot cause problems - is that right?

That perspective misses the point. The problem isn't exactly being out of memory. The problem is memory allocation requests failing or blocking indefinitely. A memory-constrained process can have a memory allocation fail even when the system as a whole has plenty of free memory. If the code which makes that failing request isn't written to expect that behaviour, it could easily cause further problems.
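
For illustration, the pattern in-kernel code has to follow is that every allocation can fail and the caller must be able to unwind. A minimal sketch (the structure and function here are made up):

#include <linux/slab.h>
#include <linux/list.h>

/* Hypothetical entry type, for illustration only. */
struct example_entry {
	struct list_head node;
	size_t len;
	char data[];
};

static int example_add_entry(struct list_head *list, size_t len)
{
	struct example_entry *e;

	/*
	 * Under a memcg limit this can fail (or block in reclaim) even
	 * when the system as a whole has plenty of free memory, so the
	 * caller must be prepared to back out.
	 */
	e = kmalloc(sizeof(*e) + len, GFP_KERNEL);
	if (!e)
		return -ENOMEM;

	e->len = len;
	list_add(&e->node, list);
	return 0;
}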

There is a lot of complexity and subtlety in the VM to try to keep memory balanced between different needs, to avoid deadlocks, and to maintain liveness. For memory cgroups to impose limits on in-kernel allocations, all of that subtlety would need to be replicated inside the memcg system. Certainly that should be possible, but I doubt it would be easy.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:56 UTC (Tue) by dlang (guest, #313) (1 response)

I was responding to the portion that seemed to be implying that the problem could be caused by an unprivileged user, or by a user constrained within a container/VM.

As long as the overall system isn't out of memory, the fact that a user/container/VM is using all the memory it's allowed shouldn't cause this sort of problem for things outside of that user/container/VM.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 11:04 UTC (Wed) by dgm (subscriber, #49227)

It concerns me deeply that this situation can happen at the FS level. If a situation can arise where failure is not an option and no progress can be made, the logical conclusion is filesystem corruption that a constrained user/VM can trigger at will.

Reservations for must-succeed memory allocations

Posted Mar 17, 2015 23:28 UTC (Tue) by neilbrown (subscriber, #359) (13 responses)

> I have seen deadlocks on non-memcg memory-constrained systems when memory runs out.

yes, I have too. In those cases they were removed by relatively simple code fixes.

While there are some common patterns, each deadlock is potentially quite different.

Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.

So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.

Whether there are deadlocks that can only (or most easily) be fixed by new memory reservation schemes is the important question. It is one that can only be answered by careful analysis of lots of details.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 15:30 UTC (Wed) by vbabka (subscriber, #91706) (12 responses)

>> I have seen deadlocks on non-memcg memory-constrained systems when memory runs out.

>yes, I have too. In those cases they were removed by relatively simple code fixes.

>While there are some common pattern, each deadlock is potentially quite different.

>Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.

>So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.

Yes, in some cases the fix is simple. But AFAIU it is in general not feasible for the OOM killer to know which task is holding which locks (without the kind of overhead that enabling lockdep has), so it's not possible to guarantee that it will select victims in a way that ensures forward progress.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:13 UTC (Wed) by neilbrown (subscriber, #359) (11 responses)

> But AFAIU in general it's not feasible for OOM killer to know which task is holding which locks

What I keep wondering is why this matters so much.
Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all of its user-space memory? That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?
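
A minimal sketch of the idea, using the 3.19-era mm interfaces (mm->mmap_sem, the vma->vm_next list, zap_page_range()); all the hard parts, such as racing with the exiting task and pinned pages, are waved away here:

#include <linux/mm.h>

/*
 * Sketch only: after SIGKILL has been sent to the victim, walk its
 * address space and discard the private pages without waiting for the
 * task to be scheduled.  Shared and mlocked mappings are skipped.
 */
static void example_reap_mm(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	down_read(&mm->mmap_sem);	/* 3.19-era name for the mmap lock */
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (vma->vm_flags & (VM_SHARED | VM_LOCKED))
			continue;
		zap_page_range(vma, vma->vm_start,
			       vma->vm_end - vma->vm_start, NULL);
	}
	up_read(&mm->mmap_sem);
}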

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) (6 responses)

I remember that it has something to do with threads. A signal must be delivered to all the threads, some of which are quite possibly blocked inside kernel space.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304) (5 responses)

If the process is being SIGKILLed, it cannot receive the signal anyway, so there's no need to queue it and no need to do anything with its userspace component. You should just be able to tear it down, then let the kernel side unwind itself up to the syscall level and go away. I too don't see why this isn't practical.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304) (1 response)

I meant, of course, 'cannot *catch* the signal anyway'.

I clearly need to go to sleep...

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:12 UTC (Thu) by Paf (subscriber, #91811)

Two problems: uninterruptible sleeping, and sleeping with SIGKILL blocked. Doing either one in a syscall means the process won't act on SIGKILL until it is woken up, and when sleeping uninterruptibly, I believe SIGKILL is simply ignored.

One particularly fun thing in multi-threaded systems I've actually seen: The intended waker is killed and the sleeper is now unwakeable and unkillable.
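
For reference, the kernel does have a middle ground for the first of these: a TASK_KILLABLE sleep (wait_event_killable() and friends) ignores ordinary signals but does wake up for a fatal SIGKILL. A minimal sketch, with a made-up wait queue and condition:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);
static int example_done;	/* hypothetical condition */

static int example_wait(void)
{
	/*
	 * wait_event() would sleep in TASK_UNINTERRUPTIBLE and ignore
	 * SIGKILL entirely.  wait_event_killable() sleeps in
	 * TASK_KILLABLE: deaf to ordinary signals, but a pending fatal
	 * signal ends the wait with -ERESTARTSYS, so the task can
	 * unwind and die instead of hanging forever.
	 */
	return wait_event_killable(example_wq, example_done);
}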

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) (2 responses)

Kernel threads might be reading memory that is currently being reclaimed, so you _need_ to deliver the signal to all threads before starting to free the RAM used.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:32 UTC (Thu) by neilbrown (subscriber, #359) (1 response)

> Kernel threads might be reading memory that is currently being reclaimed,

So either they will have called get_user_pages() and will hold references to the pages, which will keep them safe, or they will be calling copy_{to,from}_user(), which is designed to handle missing addresses and will return an appropriate error status if the memory isn't there.

Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
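
For concreteness, the two access patterns in question, as of the 3.19-era interfaces (copy_from_user() and get_user_pages() are real; the wrappers around them are made up):

#include <linux/uaccess.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* copy_from_user() faults the memory in as needed and reports how
 * many bytes it could NOT copy, so a vanished mapping becomes -EFAULT
 * rather than an oops. */
static int example_read_user(void *buf, const void __user *uptr, size_t len)
{
	if (copy_from_user(buf, uptr, len))
		return -EFAULT;
	return 0;
}

/* get_user_pages() takes a reference on each page, so reclaim cannot
 * free them under the caller.  The caller must hold mm->mmap_sem;
 * this is the 3.19-era signature, which has changed several times
 * since. */
static long example_pin_pages(unsigned long start, int nr,
			      struct page **pages)
{
	return get_user_pages(current, current->mm, start, nr,
			      1 /* write */, 0 /* force */, pages, NULL);
}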

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 18:45 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

> So either they will have called get_user_pages() and will hold references to the pages which will keep them safe
Wouldn't this require splitting the victim's VMA to free pages that are not pinned (requiring more RAM to do it)? On the other hand, in most cases only a couple of pages are going to be pinned at any given moment.

> Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Other than weird zero-copy scenarios I think you're not missing anything.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:08 UTC (Thu) by vbabka (subscriber, #91706) (3 responses)

> Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all its user-space memory. That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

> I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?

Yeah, Mel suggested this to Dave before the session, but it didn't seem to be a sufficient solution to avoid the need for reservations completely.

I'm not sure about the exact reason, but if you think about it, there's not much difference between the pages you can reclaim and the pages you can unmap. And as long as you can reclaim, OOM is not invoked:

- File pages that are clean could have been reclaimed already; those that are dirty cannot simply be discarded (except perhaps for temporary files that have already been unlinked).
- Anonymous pages could have been swapped out. Yes, there might be a difference if your swap is full, or file-backed (and thus potentially blocking); otherwise the mempools in the I/O layer should have guaranteed progress in swapping out during reclaim.
- Unevictable (mlocked) pages: here unmapping on OOM could help, but we could also just breach the mlock guarantees and reclaim the pages when the system is in trouble; at that point, any performance guarantees are probably lost anyway. OK, maybe not, since mlock might be there to prevent sensitive data in anonymous private mappings from hitting persistent storage...
- Pages holding the page tables, once you empty them: that will gain you some memory, but likely not enough to be guaranteed to save the situation.

Also, did you know that the SLE11 (SP1? not sure) kernel already has some limited form of memory reservations? For swap over NFS, I heard :)

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:30 UTC (Thu) by neilbrown (subscriber, #359) (2 responses)

Surely an excess of anonymous or mlocked pages while swap is full is the only situation that can trigger OOM? Those are exactly the pages that can be unmapped but not reclaimed.

There may still be a need for reservations, but that seems to be a largely separate problem from the OOM killer not being able to free memory from the worst offender.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 19:45 UTC (Thu) by mm7323 (subscriber, #87386) (1 response)

I think the problems become related when the system deadlocks because the OOM killer cannot make progress: the processes requesting memory are holding locks, or need yet more memory for their transactions to complete!

Now, if XFS could check (and temporarily reserve) how much reclaimable memory is available before starting a transaction, it could fail early; or perhaps the OOM killer could be invoked before the situation deteriorates to the point where no progress can be made because unreclaimable memory and swap are exhausted.

Reservations for must-succeed memory allocations

Posted Mar 21, 2015 11:42 UTC (Sat) by mtanski (guest, #56423)

That is exactly what was proposed in the talk, and what a lot of the commenters are missing. These changes make a reservation before the transaction starts. At that point you have a choice between cleaning up space, returning an error, or waiting for previous transactions to finish.

Think of this as back pressure in a low-resource scenario... and it's the right place to apply back pressure: before the transaction starts, before it's too late (not enough memory to make progress).

The downside is that it will lower concurrency on systems that are heavily loaded but under-resourced in memory.
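
To make the shape of that concrete, a sketch with an entirely hypothetical API — mem_reserve()/mem_unreserve() do not exist in the mainline kernel — showing where the back pressure lands:

/*
 * Sketch only: mem_reserve()/mem_unreserve() are hypothetical.  The
 * point is the placement: fail or block *before* the transaction
 * starts, while the caller holds no locks and can still back out.
 */
static int example_run_transaction(size_t worst_case_bytes)
{
	int err;

	err = mem_reserve(worst_case_bytes);	/* may sleep, or fail early */
	if (err)
		return err;			/* e.g. -ENOMEM: caller retries or errors out */

	/*
	 * ... take locks and run the transaction; allocations up to
	 * worst_case_bytes are now guaranteed to succeed ...
	 */

	mem_unreserve(worst_case_bytes);	/* give back what was not used */
	return 0;
}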

