Kswapd and high-order allocations
Curious readers can query /proc/buddyinfo to see how fragmented the currently free pages are. On a 1GB system, your editor currently sees the following:
Node 0, zone Normal 258 9 5 0 1 2 0 1 1 0 0
On this system, 258 single pages could be allocated immediately, but only nine contiguous pairs exist, and only five groups of four pages can be found. If something comes along which needs a lot of higher-order allocations, the available memory will be exhausted quickly, and those allocations may start to fail.
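For readers who want to poke at this themselves, a small user-space program along the following lines turns those raw counts into something more readable. It is only a sketch, not anything from the kernel tree; it assumes the usual "Node N, zone NAME" line layout, eleven allocation orders, and a 4KB page size.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_ORDERS 11		/* orders 0..10, as in 2.6 kernels */
#define PAGE_KB   4		/* assume 4KB pages */

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char zone[32];
		int node, order;
		char *p;

		if (sscanf(line, "Node %d, zone %31s", &node, zone) != 2)
			continue;
		/* Skip past the zone name to the per-order block counts. */
		p = strstr(line, zone) + strlen(zone);
		printf("Node %d, zone %s\n", node, zone);
		for (order = 0; order < NR_ORDERS; order++) {
			unsigned long count = strtoul(p, &p, 10);

			printf("  order %2d: %5lu free blocks of %4d KB each\n",
			       order, count, PAGE_KB << order);
		}
	}
	fclose(f);
	return 0;
}
```

Each column in /proc/buddyinfo counts free blocks of 2^order pages, so the order-2 column above (five blocks) represents twenty free pages, but only in four-page chunks.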
Nick Piggin has recently looked at this issue and found one area where improvements can be made. The problem is with the kswapd process, which is charged with running in the background and making free pages available to the memory allocator (by evicting user pages). The current kswapd code only looks at the number of free pages available; if that number is high enough, kswapd takes a rest regardless of whether any of those pages are contiguous with others or not. That can lead to a situation where high-order allocations fail, but the system is not making any particular effort to free more contiguous pages.
Nick's patch is fairly straightforward; it simply keeps kswapd from resting until a sufficient number of higher-order allocations are possible.
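Nick's actual patch is not reproduced here, but the core idea can be sketched in a few lines: rather than looking only at the total free-page count, kswapd would also check, per zone, whether enough blocks of the orders it cares about are free before going back to sleep. The free_area/free_list layout below matches 2.6 kernels of that era; the helper itself, its parameters, and the absence of locking are purely illustrative.

```c
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/list.h>

/*
 * Illustrative only - this is not Nick Piggin's patch, just a sketch of
 * the kind of check it adds.  Returns non-zero if the zone has at least
 * min_blocks free blocks of 2^want_order pages or larger; kswapd could
 * refuse to rest until this holds for the orders it needs to satisfy.
 * (Locking of the zone's free lists is omitted here.)
 */
static int zone_has_contiguity(struct zone *zone, int want_order,
			       unsigned long min_blocks)
{
	unsigned long blocks = 0;
	struct list_head *entry;
	int order;

	/* A free block of a higher order can always be split, so count
	 * every free block of at least the wanted order. */
	for (order = want_order; order < MAX_ORDER; order++)
		list_for_each(entry, &zone->free_area[order].free_list)
			blocks++;

	return blocks >= min_blocks;
}
```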
It has been pointed out, however, that the approach used by kswapd has not really changed: it chooses pages to free without regard to whether those pages can be coalesced into larger groups or not. As a result, it may have to free a great many pages before it, by chance, creates some higher-order groupings of pages. In prior kernels, no better approach was possible, but 2.6 includes the reverse-mapping code. With reverse mapping, it should be possible to target contiguous pages for freeing and vastly improve the system's performance in that area.
Linus's objection to this idea is that it overrides the current page replacement policy, which does its best to evict pages which, with luck, will not be needed in the near future. Changing the policy to target contiguous blocks would make higher-order allocations easier, but it could also penalize system performance as a whole by throwing out useful pages. So, says Linus, if a "defragmentation" mode is to be implemented at all, it should be run rarely and as a separate process.
The other approach to this problem is to simply avoid higher-order allocations in the first place. The switch to 4K kernel stacks was a step in this direction; it eliminated a two-page allocation for every process created. In current kernels, one of the biggest users of high-order allocations would appear to be high-performance network adapter drivers. These adapters can handle large packets which do not fit in a single page, so the kernel must perform multi-page allocations to hold those packets.
Actually, those allocations are only required when the driver (and its hardware) cannot handle "nonlinear" packets which are spread out in memory. Most modern hardware can do scatter/gather DMA operations, and thus does not care whether the packet is stored in a single, contiguous area of memory. Using the hardware's scatter/gather capabilities requires additional work when writing the driver, however, and, for a number of drivers, that work has not yet been done.

Addressing the high-order allocation problem from the demand side may prove to be far more effective than adding another objective to the page reclaim code, however.
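To make the scatter/gather point concrete, here is a rough sketch (not taken from any real driver) of a transmit path for hardware that can handle nonlinear packets. The skb fragment walk and the pci_map_*() calls are the standard 2.6-era APIs; the example_* names and the descriptor-filling helper are invented for illustration. A driver that cannot do this forces the networking stack to hand it packets in one contiguous buffer, and on the receive side it must allocate multi-page buffers itself to hold large incoming frames.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/pci.h>

/* Hypothetical per-device state and descriptor helper. */
struct example_priv {
	struct pci_dev *pdev;
};
static void example_fill_tx_desc(struct example_priv *priv,
				 dma_addr_t addr, unsigned int len);

/*
 * Sketch of a transmit routine for scatter/gather capable hardware.
 * The driver would also advertise NETIF_F_SG (plus checksum offload)
 * at probe time so the stack is allowed to pass nonlinear packets.
 */
static int example_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct example_priv *priv = dev->priv;
	dma_addr_t mapping;
	int i;

	/* Map the linear part of the packet: headers plus some data. */
	mapping = pci_map_single(priv->pdev, skb->data,
				 skb_headlen(skb), PCI_DMA_TODEVICE);
	example_fill_tx_desc(priv, mapping, skb_headlen(skb));

	/* The rest of the data may live in separate, noncontiguous pages. */
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		mapping = pci_map_page(priv->pdev, frag->page,
				       frag->page_offset, frag->size,
				       PCI_DMA_TODEVICE);
		example_fill_tx_desc(priv, mapping, frag->size);
	}

	/* Tell the hardware the descriptors are ready (details omitted). */
	return 0;
}
```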
Why not copy around data in physical RAM?
Posted Sep 9, 2004 10:27 UTC (Thu) by scarabaeus (guest, #7142)

If a request for a contiguous memory area fails, why cannot the kernel copy physical memory around while leaving the virtual addresses unchanged? I.e. if physical page n is free, n+1 is not, but m is, and we want two consecutive pages, copy the content of n+1 to m and update the page table accordingly. I'm not a kernel hacker, but my impression was that the reverse-mapping code enables you to do that.

Why not copy around data in physical RAM?
Posted Sep 9, 2004 10:56 UTC (Thu) by rwmj (subscriber, #5474)

I actually suggested this on kernel-list many years ago. Alan Cox's (correct) objection was that it would require scanning lists to find out where a page was used. As you point out, rmap means you don't need to do that scanning any more.

So the idea would be: if higher-order allocations are not available, pick the largest available allocation, then start evicting physical pages used above this allocation.

The only case where this wouldn't work is when trying to do an atomic allocation - but it's very hard to satisfy large, atomic allocations anyway.

Rich.

Why not copy around data in physical RAM?
Posted Sep 9, 2004 14:47 UTC (Thu) by iabervon (subscriber, #722)

I believe someone mentioned this possibility in the thread, and Linus said that it is much more possible now than it was before rmap, but that rmap isn't actually quite complete, so there are pages you couldn't move. He seemed open to an implementation, but the poster of the original patch says that making kswapd do a better job of making space is orthogonal to making kswapd keep trying until it actually makes space, and is not what he's working on at the moment.

Why not copy around data in physical RAM?
Posted Sep 9, 2004 14:56 UTC (Thu) by joern (guest, #22392)

The big problem with this approach is pointers. Most pointers contain the virtual address of some memory area; those don't matter. But some actually contain the physical or bus address. For example, Ethernet hardware usually has DMA engines and writes to certain pages in main memory. If you tried to move those pages around, you'd have a lot of fun handling the random application coredumps and occasional kernel panics.

Why not copy around data in physical RAM?
Posted Sep 9, 2005 1:14 UTC (Fri) by mmarq (guest, #2332)

I'm not a kernel hacker either... just someone curious trying to understand, so...

"Most pointers contain the virtual address of some memory area, those don't matter. But some actually contain the physical or bus address... If you tried to move those pages around, you'd have a lot of fun handling the random application coredumps and occasional kernel panic."

Just don't move them!... But there are many bits residing in physical memory that are obvious candidates for moving, and they shouldn't get in the way of those that cannot be moved, or vice versa. Is it really stupid to advocate the creation of separate memory pools addressable by the kernel?

My idea (stupid or not) is that pages marked as "obvious candidates" for swap should not be immediately swapped out but instead moved, "defragmented", into a *reserved* portion of physical memory. That could be very useful, because I suspect it is ever more true that what is swappable now could be absolutely required the next second, given the heavy context switching of today's highly threaded world of applications and services... thus making kswapd lazy and stopping it from wasting useful CPU cycles that would be better spent by proper defragmentation code.

Another idea is that the disk cache should always be built as two separate physical memory pools, program and data. Better yet, a *program cache pool* should be created, requiring that program bits enter this pool already in *contiguous order*, that is, defragmented (and this is possible because program bits only change when they are upgraded, i.e. almost never in CPU time!), rather than being thrown into the general pool where everything competes for whatever 4K page of physical memory happens to be available.

This program cache is certainly not a hot requirement for server systems, but it could be a killer feature for workstations and desktops, because, unlike a RAM disk, it would be quicker and more versatile: its size could be adjusted dynamically, and it could hold defragmented program bits not only from any required runtime but also from other executables in /bin, /usr/bin or /usr/sbin, scheduled by a simple algorithm based on simple parameters such as how often they are run and how useful they are.

I believe none of this would hurt server performance, and along the way you should get a much bigger pool of contiguously addressable memory pages.

Which cards can handle scatter DMA?
Posted Sep 10, 2004 23:19 UTC (Fri) by smoogen (subscriber, #97)

Well, which cards are currently written to handle this so as to avoid the high-order allocation problem? And are those cards any good?
