Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:13 UTC (Wed) by neilbrown (subscriber, #359)
In reply to: Reservations for must-succeed memory allocations by vbabka
Parent article: Reservations for must-succeed memory allocations

> But AFAIU in general it's not feasible for OOM killer to know which task is holding which locks

What I keep wondering is why this matters so much.
Once the OOM killer has identified a process and sent it SIGKILL, why not just proactively unmap all its user-space memory? That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 22:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

I remember that it has something to do with the threads. A signal must be delivered to all the threads, some of which are quite possibly blocked inside the kernel space.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304)

If the process is being SIGKILLed, the process cannot receive the signal anyway, so there's no need to queue it and no need to do anything with its userspace component. You should just be able to tear it down, then let the kernel side unwind itself up to the syscall level and then go away. I too don't see why this isn't practical.

Reservations for must-succeed memory allocations

Posted Mar 18, 2015 23:31 UTC (Wed) by nix (subscriber, #2304)

I meant, of course, 'cannot *catch* the signal anyway'.

I clearly need to go to sleep...

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 1:12 UTC (Thu) by Paf (subscriber, #91811)

Two problems.

Uninterruptible sleeping, and sleeping with SIGKILL blocked. Doing either one in a syscall means the process won't act on SIGKILL until it is woken up. I believe that when sleeping uninterruptibly, SIGKILL is effectively ignored: it stays pending and does not wake the task. (I'm pretty sure.)

One particularly fun thing in multi-threaded systems I've actually seen: The intended waker is killed and the sleeper is now unwakeable and unkillable.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Kernel threads might be reading memory that is currently being reclaimed, so you _need_ to deliver the signal to all threads before starting to free the RAM used.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 0:32 UTC (Thu) by neilbrown (subscriber, #359)

> Kernel threads might be reading memory that is currently being reclaimed,

So either they will have called get_user_pages() and will hold references to the pages, which will keep them safe, or they will be calling copy_{to,from}_user(), which is designed to handle missing addresses and will return an appropriate error status if the memory isn't there.
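
To make the two cases concrete, here is a toy user-space model in Python. It is not kernel code: `Page`, `AddressSpace`, `pin()` and `unmap_all()` are names made up for this sketch, and the real copy_from_user() reports failure as a count of uncopied bytes rather than `None`. It only illustrates the claim that a pinned page outlives the unmap while an unpinned access simply fails.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Page:
    """Toy stand-in for a physical page with a reference count."""
    data: bytes
    refcount: int = 1      # the reference held by the user mapping
    freed: bool = False

    def put(self) -> None:
        """Drop one reference; the page is freed when the count hits zero."""
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


@dataclass
class AddressSpace:
    """Toy stand-in for a process's page tables (hypothetical model)."""
    mappings: Dict[int, Page] = field(default_factory=dict)

    def pin(self, vpn: int) -> Page:
        """Models get_user_pages(): take an extra reference on the page."""
        page = self.mappings[vpn]
        page.refcount += 1
        return page

    def copy_from_user(self, vpn: int) -> Optional[bytes]:
        """Models copy_from_user(): fail cleanly if nothing is mapped."""
        page = self.mappings.get(vpn)
        return None if page is None else page.data

    def unmap_all(self) -> int:
        """Drop every mapping; return how many pages were actually freed."""
        freed = 0
        for page in self.mappings.values():
            page.put()
            if page.freed:
                freed += 1
        self.mappings.clear()
        return freed
```

With three mapped pages and one of them pinned, `unmap_all()` frees two pages immediately; the pinned one stays readable through the reference returned by `pin()` and is only freed on the final `put()`.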

Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 18:45 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

> So either they will have called get_user_pages() and will hold references to the pages which will keep them safe
Wouldn't this require splitting the victim's VMA to free pages that are not pinned (requiring more RAM to do it)? On the other hand, in most cases only a couple of pages are going to be pinned at any given moment.

> Is there some other way to access user memory that I have missed? Or is one of those racy in a way that I cannot see?
Other than weird zero-copy scenarios I think you're not missing anything.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:08 UTC (Thu) by vbabka (subscriber, #91706)

> Once the OOM killer has identified a process and sent it SIGKILL, why not just proactively unmap all its user-space memory? That should immediately resolve the memory problems, and the shell of the old process can be left to sort itself out as locks become available.

> I'm sure this has come up before, but I don't remember why it doesn't happen. Any ideas?

Yeah, Mel suggested this to Dave before the session, but it didn't seem a sufficient solution to avoid the need for reservations completely.

I'm not sure about the exact reason, but if you think about it, there's not much difference between the pages you can reclaim and pages you can unmap. And as long as you can reclaim, OOM is not invoked.

- file pages: those that are clean could already have been reclaimed; those that are dirty cannot simply be discarded (except maybe some temporary files that have already been unlinked)
- anonymous pages could have been swapped out. Yes, there might be a difference if your swap is full, or file-backed (thus potentially blocking). Otherwise mempools in the I/O layer should have guaranteed progress in swapping out during reclaim.
- unevictable pages (mlock): here unmapping on OOM could help, but we could also maybe just breach the mlock guarantees and reclaim the pages if the system is in trouble; at that point, any performance guarantees are probably lost anyway. OK, maybe not, since you might be using mlock to prevent sensitive data in anonymous private mappings from hitting persistent storage...
- pages holding the page tables, once you empty them: that will gain you some memory, but likely not a guaranteed amount large enough to save the situation
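
The comparison in the list above can be written down as a small toy model in Python. Everything here is an assumption made for illustration (the `Kind` categories, the two predicates, the page counts); in particular it treats dirty file pages as not freeable by either path, which glosses over writeback. It only counts which pages unmapping the victim would free that ordinary reclaim could not.

```python
from collections import Counter
from enum import Enum


class Kind(Enum):
    """Page categories from the list above (names are made up)."""
    CLEAN_FILE = "clean file"
    DIRTY_FILE = "dirty file"
    ANON = "anonymous"
    MLOCKED = "mlocked"
    PAGE_TABLE = "page table"


def reclaim_can_free(kind: Kind, swap_full: bool) -> bool:
    """Could ordinary reclaim have freed this page without killing anything?"""
    if kind is Kind.CLEAN_FILE:
        return True
    if kind is Kind.ANON:
        return not swap_full          # needs somewhere to swap out to
    return False                      # dirty, mlocked and page-table pages


def unmap_can_free(kind: Kind) -> bool:
    """Would tearing down the victim's mappings free this page?"""
    return kind is not Kind.DIRTY_FILE  # dirty data still has to be written


def extra_gain_from_unmap(pages, swap_full: bool) -> Counter:
    """Count pages that only unmapping (not reclaim) can free, per kind."""
    return Counter(
        kind for kind in pages
        if unmap_can_free(kind) and not reclaim_can_free(kind, swap_full)
    )
```

In this model, with free swap the only extra gain comes from mlocked and page-table pages; once swap is full, anonymous pages join that set.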

Also, did you know that the SLE11 (SP1? not sure) kernel already has some limited form of memory reservations? For swap over NFS, I heard :)

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 8:30 UTC (Thu) by neilbrown (subscriber, #359)

Surely an excess of anonymous or mlocked pages while swap is full is the only situation that can trigger OOM? Those are exactly the pages that can be unmapped but not reclaimed.

There may still be a need for reservations, but that seems to be a largely separate problem from the OOM killer not being able to free memory from the worst offender.

Reservations for must-succeed memory allocations

Posted Mar 19, 2015 19:45 UTC (Thu) by mm7323 (subscriber, #87386)

I think the problems become related when the system deadlocks because the OOM killer cannot make progress: the memory requesters are holding locks or need more memory for their transactions to complete!

Now if XFS could check (and temporarily reserve) how much reclaimable memory is available before starting a transaction, XFS could fail early, or perhaps the OOM killer could be started before the situation deteriorates to the point where no progress can be made because un-reclaimable memory and swap are exhausted.

Reservations for must-succeed memory allocations

Posted Mar 21, 2015 11:42 UTC (Sat) by mtanski (guest, #56423)

That is exactly what was proposed in the talk and what a lot of the commenters are missing. These changes make a reservation before the transaction starts. At that point you have a choice of cleaning up space, returning an error, or waiting on previous transactions to finish.

Think of this as back pressure in a low-resource scenario... and it's the right place to apply back pressure: before the transaction starts, before it's too late (not enough memory to make progress).

The downside is that it will lower concurrency on heavily loaded but under-resourced (memory) systems.
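
A minimal sketch of that reserve-before-start pattern, as a toy Python model rather than anything from the actual patches: `ReservationPool`, `try_reserve()` and `run_transaction()` are hypothetical names, and a real implementation would also have to offer the "wait" and "clean up" options instead of only failing.

```python
class ReservationPool:
    """Toy accounting of pages set aside for in-flight transactions."""

    def __init__(self, total_pages: int) -> None:
        self.total = total_pages
        self.reserved = 0

    def try_reserve(self, pages: int) -> bool:
        """Reserve the worst-case need up front, or refuse without blocking."""
        if self.reserved + pages > self.total:
            return False
        self.reserved += pages
        return True

    def release(self, pages: int) -> None:
        """Give a reservation back once the transaction has finished."""
        if pages > self.reserved:
            raise ValueError("releasing more than was reserved")
        self.reserved -= pages


def run_transaction(pool: ReservationPool, worst_case_pages: int, work):
    """Apply back pressure before starting: fail early if nothing is left."""
    if not pool.try_reserve(worst_case_pages):
        raise MemoryError("reservation failed; transaction not started")
    try:
        # Past this point the transaction is assumed never to need more
        # than it reserved, so it cannot get stuck halfway for memory.
        return work()
    finally:
        pool.release(worst_case_pages)
```

The concurrency cost shows up directly in the model: with a pool of 10 pages and one transaction holding 6, a second transaction asking for 5 is refused even if the first never ends up using its full reservation.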