
Toward more robust network-based block I/O

One thing which came out of this year's Kernel Summit is that the kernel still does not deal well with network-based block devices when memory gets tight. If the system is full of dirty memory, the kernel must write some of those dirty pages to their backing store so that the memory may be reused. But the act of writing that data over the network can require the allocation of more memory. Even worse, completing network-based I/O requires the ability to receive the acknowledgment packets back from the remote device. Not only does that packet reception require memory, but the system must contend with the fact that the network could also be the source of vast numbers of packets which are completely unrelated to the problem at hand. If the system cannot find a way to receive the packets it needs while ignoring unrelated packets, extreme memory pressure will eventually lead to a lockup.

Solving this problem is hard. At the Summit, Linus suggested that it might not even make sense to try; instead, users should be directed toward I/O hardware which does not present this sort of problem. In reality, however, Linux will do its best to support network-based block devices. Daniel Phillips has recently been working on a patch which tries to make some progress in that direction.

Like many before him, Daniel bases his approach on the use of preallocated memory pools - chunks of memory set aside for use when no other memory is available. Daniel has tried to take things a little further by quantifying how much memory should be set aside. To that end, each network driver should, when an interface is brought up, make a call to:

    int adjust_memalloc_reserve(int pages);

Where pages is the number of pages required to be able to continue to receive packets on the given interface. A helper function, estimate_skb_pages(), can come up with a guess for how many pages will be required to hold a given number of packets with a specified maximum size. The call to adjust_memalloc_reserve() will cause the virtual memory subsystem to set aside the given number of pages for emergency use by the driver. In this way, it is hoped, the system will reserve a sufficient amount of memory without being overly wasteful.
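
As a rough illustration, a driver's interface-up routine might size and request its reserve along these lines. Only adjust_memalloc_reserve() and estimate_skb_pages() are named by the patch; the argument order, the packet count, and everything else in this sketch are assumptions made for the example:

    static int example_open(struct net_device *dev)
    {
            /* Guess at the memory needed to keep receiving while the
             * rest of the system is starved: here, enough for 50
             * packets of up to MTU size. */
            int pages = estimate_skb_pages(50, dev->mtu);
            int err;

            err = adjust_memalloc_reserve(pages);
            if (err)
                    return err;     /* the reserve could not be grown */

            /* ... the usual interface-up work follows ... */
            return 0;
    }

Presumably the matching interface-down path would hand the pages back with a negative adjustment, i.e. adjust_memalloc_reserve(-pages).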

Memory can be allocated from the reserve by adding the new __GFP_MEMALLOC flag to the allocation request. A new networking helper function, dev_memalloc_skb(), will use that flag if necessary to obtain a packet. Before doing so, however, it checks a count of packets allocated from the reserve; no interface is allowed to allocate beyond a maximum count, which defaults to 50. Unlike previous versions of the patch, the current code does not attempt to track which packets, in particular, were allocated from reserve memory. Any packets which originate from a given device will, when returned to the system, be credited to that device's reserve.
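
A receive path using this helper might look something like the following sketch. The function name, the __GFP_MEMALLOC fallback, and the per-interface cap come from the description above; the calling convention and surrounding details are assumptions:

    static void example_rx(struct net_device *dev, unsigned int len)
    {
            struct sk_buff *skb;

            /* dev_memalloc_skb() dips into the emergency reserve
             * (__GFP_MEMALLOC) only when a normal atomic allocation
             * fails, and only while this interface is under its cap
             * (50 reserve-allocated packets by default). */
            skb = dev_memalloc_skb(dev, len);
            if (!skb)
                    return;         /* out of both memory and reserve: drop */

            /* ... copy or map the received data into skb, set its
             * fields, then hand it to the stack ... */
            netif_rx(skb);
    }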

A longstanding problem with the reserve approach is that, if one is not careful, the reserve simply gets depleted and the system runs out of memory anyway. In a situation where memory use is not entirely within the system's control - when dealing with incoming network data, for example - this sort of depletion is especially likely. Your system may be doing its best to flush dirty pages to your home iSCSI array, but the network memory reserves are full of incoming music being downloaded by your children, so the entire system comes to a halt. Such an outcome may please the RIAA, but the kernel developers are trying to satisfy a different audience.

Daniel's answer to this problem is to add a special flag to network sockets which are involved in block I/O. Only sockets marked with SOCK_MEMALLOC are entitled to use packet memory from the reserves. When the packet arrives on the interface, the system cannot know whether it is useful or not, so that packet must be received (possibly using reserve memory) and fed into the system in the usual way. The protocol code, however, is expected to check each packet to see whether it comes from a device which is currently using reserve memory. If so, and the packet does not belong to a suitably-marked socket, that packet is to be dropped immediately. In this way, it is hoped, the system will be able to focus its remaining resources on recovering from its memory crunch.
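
In code, the protocol-side test might be expressed roughly as follows. SOCK_MEMALLOC comes from the patch; dev_reserve_used(), the use of sock_flag(), and where exactly this check would live are assumptions made for the sketch:

    static int skb_allowed_under_pressure(struct sock *sk, struct sk_buff *skb)
    {
            /* If the receiving device is currently drawing on its
             * emergency reserve, only sockets doing block I/O may
             * keep the packet; everything else is dropped at once. */
            if (dev_reserve_used(skb->dev) && !sock_flag(sk, SOCK_MEMALLOC)) {
                    kfree_skb(skb);
                    return 0;
            }
            return 1;
    }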

This approach may have some promise, but the patch needs some work before it is ready for serious stress testing. Once it has been worked into shape, it can be applied to a suitably-equipped system, which can then be pushed into a state of serious memory pressure. That point has been the downfall of a number of other approaches to this problem; whether Daniel's work is up to this test remains to be seen.

Toward more robust network-based block I/O

Posted Aug 11, 2005 19:20 UTC (Thu) by daniel (guest, #3181)

Thanks, Jon, for once again expressing this all more succinctly than I ever could. A subtle point:

"The protocol code... is expected to check each packet to see whether it comes from a device which is currently using reserve memory. If so, and the packet does not belong to a suitably-marked socket, that packet is to be dropped immediately."

...and the tasks driving the competing traffic are likely to be blocked waiting for memory to be freed, or soon will be. So the competing traffic problem is self-correcting.

Of course we need to worry not just about whether the system avoids deadlock under load, but that it keeps running smoothly. The vm system's highwater/lowwater scheme lets it get out of the way for a while after relieving memory pressure, a natural way of sharing network bandwidth with normal tasks.

However, we are not quite out of the woods yet. As soon as atomic allocs start to fail, the current patch starts dropping non-blockio packets, which might cause user-visible protocol stalls. But the machine does not exist solely to write out dirty memory - we want everything to keep running smoothly, not just block IO. After all, under load these low memory conditions are the rule, not the exception. Fortunately, this behavior is easily tunable: we can adjust the threshold at which non-blockio packets begin to be dropped, effectively giving non-vm traffic access to part of the reserve. When nicely tuned, the vm-related throttling should cause the non-blockio network traffic to taper off just as the vm writeout traffic begins to rise and no packets will ever have to be dropped.
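
Very roughly, the knob amounts to something like this (the names here are invented for illustration, not taken from the patch):

    /* Drop non-blockio packets only once ordinary traffic has used up
     * its configurable share of the emergency reserve. */
    static int should_drop_non_blockio(struct net_device *dev)
    {
            return dev_reserve_used(dev) * 100 >
                   dev_reserve_size(dev) * non_blockio_share_percent;
    }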

Regards,

Daniel

