In some cases, direct reclaim can have a strong impact on the performance of memory-intensive processes; as a result, this technique is unpopular in some areas. Linus suggested that the real problem is that we still have not learned how to do direct reclaim right. Rather than tossing it out altogether, we should put more effort into figuring out how to get the right degree of throttling without ruining performance altogether.
Linus has already been wrong once in the last three years; this would make it twice ;-)
So-called direct reclaim, i.e., letting any process that fails to get the memory it needs magically transform itself into a memory scanner, was a bad idea from day one, and the incriminating evidence has slowly built up. For me, the glaringly obvious deficiency is the tendency to end up with a thundering herd of memory scanners that interact with each other in no sane way.
This problem reminds me of how the VM was before rmap: we would walk process page tables one mm at a time and unmap a physical page just from one table at a time, and hope that we would manage to catch all shared mappings before some process remaps the page, undoing all our hard work. This approach did not scale: the tendency to livelock increased with the size of memory, and memory, like the universe, always expands. Rmap (and a simpler, more obviously correct scanner) fixed this.
Now the "direct reclaim" idea (I prefer the name "thundering herd of scanners") is starting to crack under the strain of steadily multiplying tasks. It is the same effect, really. It is a random algorithm that hopes to achieve its intended result by side effects of the user space load. It is usually a bad idea to hope that the usage pattern will always cooperate.
The right thing to do is to block a task that cannot get memory, or gets too much. This throttles the task and furthermore makes it easy to apply a true throttling policy to the list of blocked processes if we want to (we probably do). Scanning should be done only by dedicated, non-blocking scanner tasks. We might want to have as many as one scanner per CPU, but it makes absolutely zero sense to have more.
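To make the shape of that proposal concrete, here is a toy user-space model in Python. It is only a sketch under loud assumptions: `PagePool`, `run_demo`, the watermark and the batch size are all hypothetical names and numbers invented for illustration, every allocated page is treated as immediately reclaimable (think clean page cache), and nothing here resembles actual kernel code. The point is just the division of labour: allocating tasks block on a FIFO list and never scan, while a small fixed set of scanner threads (capped by the CPU count) does all the reclaim.

```python
import os
import threading
from collections import deque


class PagePool:
    """Toy model: allocators block, only dedicated scanner threads reclaim."""

    def __init__(self, total_pages):
        self.free = total_pages
        self.reclaimable = 0            # pages in use that a scanner may take back
        self.cond = threading.Condition()
        self.waiters = deque()          # blocked tasks; a throttling policy would hook in here
        self.reclaimed_by = set()       # names of the threads that ever reclaimed pages
        self.stopping = False

    def alloc(self, task_id):
        """Block until a page is free. The caller never scans for memory itself."""
        with self.cond:
            self.waiters.append(task_id)
            # Assumed policy: strict FIFO among the blocked tasks.
            while not (self.free > 0 and self.waiters[0] == task_id):
                self.cond.wait()
            self.waiters.popleft()
            self.free -= 1
            self.reclaimable += 1       # simplification: the page is reclaimable at once
            self.cond.notify_all()      # wake scanners and the next waiter

    def scanner(self, low_watermark, batch):
        """Dedicated reclaimer: sleeps until free memory drops below the watermark."""
        with self.cond:
            while True:
                while not self.stopping and not (
                    self.free < low_watermark and self.reclaimable > 0
                ):
                    self.cond.wait()
                if self.stopping:
                    return
                n = min(batch, self.reclaimable)
                self.reclaimable -= n
                self.free += n
                self.reclaimed_by.add(threading.current_thread().name)
                self.cond.notify_all()  # unblock throttled allocators

    def stop(self):
        with self.cond:
            self.stopping = True
            self.cond.notify_all()


def run_demo(total_pages=4, tasks=8, allocs_per_task=10):
    """Run many allocating tasks against a tiny pool; return (pool, allocations done)."""
    pool = PagePool(total_pages)
    n_scanners = min(2, os.cpu_count() or 1)   # never more than one scanner per CPU
    scanners = [
        threading.Thread(target=pool.scanner, args=(2, 2), name=f"scanner-{i}")
        for i in range(n_scanners)
    ]
    done = [0] * tasks                          # one slot per task, so no shared counter

    def worker(task_id):
        for _ in range(allocs_per_task):
            pool.alloc(task_id)
            done[task_id] += 1

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(tasks)]
    for t in scanners + workers:
        t.start()
    for t in workers:
        t.join()
    pool.stop()
    for t in scanners:
        t.join()
    return pool, sum(done)
```

Calling `run_demo()` pushes 80 allocations through a four-page pool. However many tasks pile up, the number of threads doing reclaim stays fixed, and the `waiters` queue is the single place where a smarter throttling policy than FIFO could be applied.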
Scanner tasks will initiate writeout as now, but since we will no longer co-opt a user task to drive the actual I/O, writeout needs to be driven by another mechanism, a work queue, say. Completing the AIO work to make it really, truly non-blocking (even in the case of filesystem metadata read-before-write) would mean that writeout can be driven directly by a scanning task as a non-blocking state machine, and everything would get that much nicer. But that is just an optimization; the key thing is to get the scanning right.
Implementing this is left as an exercise for the interested reader.
Copyright © 2017, Eklektix, Inc.