Reservations for must-succeed memory allocations
Posted Mar 17, 2015 22:59 UTC (Tue) by nix (subscriber, #2304) In reply to: Reservations for must-succeed memory allocations by neilbrown
Parent article: Reservations for must-succeed memory allocations
Often (the vast majority of the time, I expect) you're lucky and the big process trips the oom-killer while it's doing other work in the middle of that big I/O (few processes do solid metadata-heavy I/O all the time), but that's *luck*, not judgement. And I don't much like relying on luck to keep my systems from deadlocking! :) particularly given that this sort of situation seems like something it wouldn't be *all* that terribly hard to engineer. It's not like the various contending processes need to run in different privilege domains or anything.
Posted Mar 17, 2015 23:02 UTC (Tue)
by dlang (guest, #313)
[Link] (3 responses)
or am I missing something here?
Posted Mar 17, 2015 23:37 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (2 responses)
That perspective misses the point. The problem isn't exactly being out of memory. The problem is memory allocation requests failing or blocking indefinitely. A memory-constrained process can have a memory allocation fail even when the system as a whole has plenty of free memory. If the code which makes that failing request isn't written to expect that behaviour, it could easily cause further problems.
There is a lot of complexity and subtlety in the VM to try to keep memory balanced between different needs, and to avoid deadlocks and maintain liveness. For memory cgroups to impose limits on in-kernel allocations, it needs to replicate all that subtlety inside the memcg system. Certainly that should be possible, but I doubt it would be easy.
Posted Mar 17, 2015 23:56 UTC (Tue)
by dlang (guest, #313)
[Link] (1 responses)
As long as the overall system isn't out of memory, the fact that a user/container/vm is using all the memory it's allowed shouldn't cause this sort of problem for things outside of that user/container/vm.
Posted Mar 18, 2015 11:04 UTC (Wed)
by dgm (subscriber, #49227)
[Link]
Posted Mar 17, 2015 23:28 UTC (Tue)
by neilbrown (subscriber, #359)
[Link] (13 responses)
yes, I have too. In those cases they were removed by relatively simple code fixes.
While there are some common patterns, each deadlock is potentially quite different.
Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.
So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.
Whether there are deadlocks that can only (or most easily) be fixed by new memory reservation schemes is the important question. It is one that can only be answered by careful analysis of lots of details.
Posted Mar 18, 2015 15:30 UTC (Wed)
by vbabka (subscriber, #91706)
[Link] (12 responses)
>yes, I have too. In those cases they were removed by relatively simple code fixes.
>While there are some common patterns, each deadlock is potentially quite different.
>Without looking at the precise details of a particular deadlock, you cannot know what sort of approach might be needed to ensure it never happens again.
>So saying "I've seen deadlocks" is like saying "there are bugs". Undoubtedly true, but not very helpful.
Yes, in some cases the fix is simple. But AFAIU in general it's not feasible for the OOM killer to know which task is holding which locks (without the kind of overhead that enabling lockdep has), so it's not possible to guarantee that it will select victims in a way that ensures forward progress.
Posted Mar 18, 2015 22:13 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (11 responses)
What I keep wondering is why this matters so much.
I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?
Posted Mar 18, 2015 22:26 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Mar 18, 2015 23:31 UTC (Wed)
by nix (subscriber, #2304)
[Link] (5 responses)
Posted Mar 18, 2015 23:31 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
I clearly need to go to sleep...
Posted Mar 19, 2015 1:12 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Uninterruptible sleeping, and sleeping with SIGKILL blocked. Doing either one in a syscall means the process won't act on SIGKILL until it is woken up. I believe when sleeping uninterruptibly, SIGKILL is ignored. (I'm pretty sure.)
One particularly fun thing in multi-threaded systems I've actually seen: The intended waker is killed and the sleeper is now unwakeable and unkillable.
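A user-space analogy of that lost-waker case, as a rough sketch only: the names below (`lost_waker_demo`, `waker_survives`) are made up for illustration, and a Python thread blocked on an `Event` with no timeout merely stands in for a task in uninterruptible sleep. It is not how the kernel implements any of this.

```python
import threading


def lost_waker_demo(waker_survives, wait_seconds=0.2):
    """Return True if the sleeper was woken, False if it is still stuck.

    Hypothetical illustration: the Event stands in for a kernel wait
    queue, and an early return stands in for the waker being killed.
    """
    wake = threading.Event()

    def sleeper():
        # No timeout: nothing but the waker can end this wait.
        wake.wait()

    def waker():
        if not waker_survives:
            return  # "killed" before it could wake the sleeper
        wake.set()

    # Daemon thread so a stuck sleeper does not block interpreter exit.
    s = threading.Thread(target=sleeper, daemon=True)
    w = threading.Thread(target=waker)
    s.start()
    w.start()
    w.join()
    s.join(timeout=wait_seconds)
    return not s.is_alive()
```

With `waker_survives=False` the sleeper has nobody left to wake it, which is the shape of the problem described above: killing the wrong task first leaves the other one stuck.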
Posted Mar 19, 2015 0:03 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Mar 19, 2015 0:32 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (1 responses)
So either they will have called get_user_pages() and will hold references to the pages which will keep them safe, or it will be calling copy_{to,from}_user which is designed to handle missing
Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Posted Mar 19, 2015 18:45 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Posted Mar 19, 2015 8:08 UTC (Thu)
by vbabka (subscriber, #91706)
[Link] (3 responses)
> I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?
Yeah, Mel suggested this to Dave before the session, but it didn't seem a sufficient solution to avoid the need for reservations completely.
I'm not sure about the exact reason, but if you think about it, there's not much difference between the pages you can reclaim and pages you can unmap. And as long as you can reclaim, OOM is not invoked.
- file pages that are clean, could have been reclaimed, those that are dirty cannot be simply discarded (maybe except some temporary files that have been already unlinked)
Also did you know that SLE11 (SP1? not sure) kernel already has some limited form of memory reservations? For swap over NFS, I heard :)
Posted Mar 19, 2015 8:30 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (2 responses)
There may still be a need for reservations, but that seems to be a largely separate problem from the OOM killer not being able to free memory from the worst offender.
Posted Mar 19, 2015 19:45 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
Now if XFS could check (and temporarily reserve) how much reclaimable memory is available before starting a transaction, XFS could fail early, or perhaps the OOM killer could be started before the situation deteriorates to the point where no progress can be made because un-reclaimable memory and swap are exhausted.
Posted Mar 21, 2015 11:42 UTC (Sat)
by mtanski (guest, #56423)
[Link]
Think of this as back pressure in a low-resource scenario... and it's the right place to apply back pressure: before the transaction starts, before it's too late (not enough memory to make progress).
The downside is that it will lower concurrency on heavily loaded but under-resourced (memory) systems.
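To make the "reserve before the transaction starts" idea concrete, here is a minimal user-space sketch. Everything in it (`ReservationPool`, `run_transaction`, counting memory in abstract "units") is hypothetical and only models the bookkeeping; it says nothing about how XFS or the kernel would actually implement reservations.

```python
import threading


class ReservationPool:
    """Toy model of a fixed pool of reservable memory units (hypothetical)."""

    def __init__(self, total_units):
        self._free = total_units
        self._cond = threading.Condition()

    def try_reserve(self, units):
        """Fail-early variant: reserve now, or return False without waiting."""
        with self._cond:
            if units > self._free:
                return False
            self._free -= units
            return True

    def reserve(self, units, timeout=None):
        """Back-pressure variant: wait until the worst case fits."""
        with self._cond:
            ok = self._cond.wait_for(lambda: units <= self._free, timeout)
            if ok:
                self._free -= units
            return ok

    def release(self, units):
        """Return units to the pool and wake any waiting reservers."""
        with self._cond:
            self._free += units
            self._cond.notify_all()

    @property
    def free(self):
        with self._cond:
            return self._free


def run_transaction(pool, worst_case_units, work):
    """Reserve the worst case up front, run the work, then give it back."""
    if not pool.try_reserve(worst_case_units):
        return None  # fail early, before the transaction holds anything
    try:
        return work()
    finally:
        pool.release(worst_case_units)
```

The concurrency cost mentioned above shows up directly: a caller of `reserve()` waits while the pool is oversubscribed, even if the running transactions never touch their full worst-case reservation.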
But I think you are suggesting that a memory-constrained process cannot run the whole system out of memory and so cannot cause problems - is that right?
Once the OOM killer has identified a process and sent it SIGKILL, why not just pro-actively unmap all its user-space memory? That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.
addresses and will return an appropriate error status if the memory isn't there.
Wouldn't this require splitting the victim's VMA to free pages that are not pinned (requiring more RAM to do it)? On the other hand, in most cases only a couple of pages are going to be pinned at any given moment.
Other than weird zero-copy scenarios I think you're not missing anything.
- anonymous pages could have been swapped out. Yes, there might be a difference if your swap is full, or file-backed (thus potentially blocking). Otherwise mempools in the I/O layer should have guaranteed progress swapping out during reclaim.
- unevictable pages (mlock) - here unmapping on OOM could help, but we could also maybe just breach mlock guarantees and reclaim the pages if the system is in trouble - at that point, any performance guarantees are probably lost anyway. OK, maybe not, since you might be using mlock to prevent sensitive data in anonymous private mappings from hitting persistent storage...
- pages holding the page tables, once you empty them - that will gain you some memory, but likely not guaranteed enough to save the situation