The return of network block device deadlock prevention
Network-attached drives have an additional problem, however, in that no write can be considered complete until an acknowledgment has been received from the remote device. Receiving that acknowledgment requires that the system be able to receive (and process) network packets - and that can require unbounded amounts of memory. There may be any amount of incoming network data which has nothing to do with outstanding block I/O requests, and that data can make it impossible to receive the packets which the memory-constrained system is so desperately waiting to receive. The deadlock avoidance patch made some changes aimed at ensuring that the system could always receive and process incoming block I/O traffic.
A year later, this patch set has resurfaced. The original author (Daniel Phillips) has stepped aside, and Peter Zijlstra has taken the lead. In many ways, the current version of the patch resembles its predecessors, but there have been enough changes to warrant a new look.
The patch still works by enlarging the emergency reserve area maintained by the core page allocator. There is a GFP flag (__GFP_MEMALLOC) which allows a particular allocation call to be satisfied out of the reserve, if necessary. The core idea is to use this reserve to receive vital incoming network packets without allowing it to be overrun with useless stuff.
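A minimal sketch of how such an allocation might look from a caller's side, assuming only the __GFP_MEMALLOC flag described above (the rx_alloc_skb() function and its retry strategy are hypothetical):

```c
#include <linux/gfp.h>
#include <linux/skbuff.h>

/*
 * Hypothetical caller: try a normal atomic allocation first; only if
 * that fails, retry with __GFP_MEMALLOC so that the emergency reserve
 * may be tapped.  The real patch wires the flag into the sk_buff
 * allocation paths rather than open-coding a retry like this.
 */
static struct sk_buff *rx_alloc_skb(unsigned int size)
{
	struct sk_buff *skb = alloc_skb(size, GFP_ATOMIC);

	if (!skb)
		skb = alloc_skb(size, GFP_ATOMIC | __GFP_MEMALLOC);
	return skb;
}
```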
To that end, code which is performing block I/O over a network connection sets the SOCK_MEMALLOC flag on its socket(s). Previous versions of the patch would then set a flag on any associated network interfaces to indicate that block I/O was passing through that interface, but the current version skips that step. Instead, any attempt to allocate an sk_buff (packet) structure from a network device driver will dip into the memory reserves if need be. Thus, as long as the reserves hold out, the system will always be able to allocate buffers for incoming packets.
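In kernel code, marking such a socket might look roughly like the sketch below; sk_set_memalloc() is the helper name used in later descendants of this work, so treat it as an assumption rather than the 2006 patch's exact interface:

```c
#include <net/sock.h>

/*
 * Sketch: a network block device driver flags its transport socket so
 * that packets destined for it may be allocated from the reserve.
 * sk_set_memalloc() sets SOCK_MEMALLOC on the socket; the helper's
 * name here follows later versions of this patch series.
 */
static void nbd_mark_socket(struct socket *sock)
{
	sk_set_memalloc(sock->sk);
}
```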
The key is to receive the important packets without exhausting the reserves with useless data (streaming video from LinuxWorld keynotes, say). To that end, the networking code is patched to check for the SOCK_MEMALLOC flag as soon as possible after the socket for each incoming packet is identified. If that flag is not set, and the incoming packet is using memory from the reserves, the packet will be dropped immediately, freeing its memory for other uses. So packets related to block I/O are received and processed as usual; just about everything else gets dropped at the earliest possible moment.
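The test itself is conceptually simple. A sketch of the sort of check involved follows; the helper names (skb_pfmemalloc(), sock_flag()) match the form this work eventually took and may differ from the 2006 posting:

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sock.h>

/*
 * Sketch: drop, as early as possible, any packet that was built from
 * reserve memory but is not destined for a SOCK_MEMALLOC (block I/O)
 * socket, so that the reserve pages are freed immediately.
 */
static int memalloc_early_drop(struct sock *sk, struct sk_buff *skb)
{
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
		kfree_skb(skb);		/* reserve memory goes straight back */
		return NET_RX_DROP;
	}
	return NET_RX_SUCCESS;
}
```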
The latest version of the patch includes a new memory allocator, called SROG, which is used for handling reserve memory. It is intended to be fast and simple, and to release memory back to the system as quickly as possible. To that end, it tries to group related allocations together, and it isolates each group of allocations (generally the sk_buff structure and its associated data area) onto its own pages. So every time a packet is released, its associated memory immediately becomes available to the system as a whole.
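The grouping idea can be pictured with a toy allocator in which each group owns its own page, so releasing the group hands a whole page straight back. This is an illustration of the concept only, not the actual SROG code:

```c
#include <linux/gfp.h>
#include <linux/mm.h>

/* Toy illustration of the SROG idea: one page per allocation group. */
struct srog_group {
	struct page *page;
	unsigned int offset;
};

/* Carve the next object for this group out of the group's own page. */
static void *group_alloc(struct srog_group *g, size_t size, gfp_t gfp)
{
	void *obj;

	if (!g->page) {
		g->page = alloc_page(gfp | __GFP_MEMALLOC);
		if (!g->page)
			return NULL;
		g->offset = 0;
	}
	if (g->offset + size > PAGE_SIZE)
		return NULL;	/* toy limit: a single page per group */
	obj = page_address(g->page) + g->offset;
	g->offset += size;
	return obj;
}

/* Releasing the group returns its page to the system at once. */
static void group_free(struct srog_group *g)
{
	if (g->page)
		__free_page(g->page);
	g->page = NULL;
}
```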
This patch set is proving to be a bit of a hard sell, however. The deadlock scenario is seen as being relatively unlikely - there have not been streams of bug reports on this topic - and, in most cases, it can be avoided simply by swapping to a local disk. The set of systems whose owners can afford fancy network storage arrays, but where those same owners are unable to invest in a local disk for swapping, is thought to be small. Making the networking layer more complex to address this particular problem does not appeal to everybody.
Networking maintainer David Miller would like to see a different sort of approach to network memory allocations:
We already limit and control TCP socket memory globally in the system. If we do this for all socket and anonymous network buffer allocations, which is sort of implicit in Evgeniy's network tree allocator design, we can solve this problem in a more reasonable way.
This comment refers to Evgeniy Polyakov's network memory allocator patch, recently posted for consideration. This work is in a highly transitional state and is a little hard to read. The core, however, is this: it is (yet another) separate memory allocator, oriented toward the needs of the networking system. It is designed to keep memory allocations local to a single CPU, so each processor has its own set of pages to hand out. Allocated objects are packed as tightly as possible, minimizing internal fragmentation. There is no recourse to the system memory allocator in the current design, so, when a particular processor runs out, allocations will fail. Memory exhaustion in the rest of the system will not affect the network allocator, however. The author claims improved networking performance.
This code is also written with an eye toward mapping networking buffers directly into user space, perhaps in conjunction with a future network channel implementation.
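As a rough picture of the design described above - and only that; the names and structure here are invented for illustration - a strictly per-CPU allocator might look like:

```c
#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/percpu.h>

/*
 * Toy sketch of a strictly per-CPU allocator: each processor hands out
 * space from its own region and never falls back to the system
 * allocator, so exhaustion on one CPU fails locally.
 */
struct cpu_pool {
	void		*base;		/* this CPU's private region */
	unsigned int	used;
	unsigned int	size;
};

static DEFINE_PER_CPU(struct cpu_pool, net_pool);

/* Give each CPU its own block of sixteen pages at startup. */
static int __init net_pool_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct cpu_pool *pool = &per_cpu(net_pool, cpu);

		pool->base = (void *)__get_free_pages(GFP_KERNEL, 4);
		pool->size = pool->base ? PAGE_SIZE << 4 : 0;
		pool->used = 0;
	}
	return 0;
}

/* Objects are packed back to back; NULL means this CPU's pool is spent. */
static void *net_alloc(unsigned int size)
{
	struct cpu_pool *pool = &get_cpu_var(net_pool);
	void *obj = NULL;

	if (pool->used + size <= pool->size) {
		obj = pool->base + pool->used;
		pool->used += size;
	}
	put_cpu_var(net_pool);
	return obj;
}
```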
The network allocator patch clearly has the eye of the networking maintainer at the moment. That code is fairly far from being ready to merge, however, and not everybody agrees that it solves all of the problems. So this is a discussion which could go on for some time yet.
The return of network block device deadlock prevention

Posted Aug 18, 2006 0:10 UTC (Fri) by nicku (guest, #777)
…without exhausting the reserves with useless data (streaming video from LinuxWorld keynotes, say).

I laughed uncontrollably, and then went to see who is delivering the keynotes. For the most part, the joke looks spot on.
The return of network block device deadlock prevention

Posted Aug 18, 2006 16:47 UTC (Fri) by giraffedata (guest, #1954)
It's worth mentioning that network block devices aren't the only thing with this problem; network filesystems have it too - specifically, any filesystem driver that has to talk to someone over a network in order to write out a dirty page of file data cache.

The article mentions how unimportant the problem is because of the unimportance of swapping to a network block device, but in contrast, buffering writes to a network filesystem is pretty important.

In fact, deadlocks would be easy to come by with network filesystems, were it not for the arbitrary limits placed on the amount of memory that can be used for the file data cache - in effect, a very large reserve is made for network memory requirements. If the resource inversion could actually be fixed, we could free filesystems to be more self-tuning and make more efficient use of available memory.
The return of network block device deadlock prevention

Posted Aug 20, 2006 2:49 UTC (Sun) by pengo (guest, #7787)
How about some easy-to-use tools for allocating swap space and/or warning that memory is low? Or better yet, how about a module to dynamically create a temporary local swap file (rather than a partition) in the user's home directory when memory runs out? No, it's not the same problem as what this patch would address, and this patch isn't without its reasons, but there are broader general-purpose solutions for memory allocation woes. So I'd like to give a gripe...

When I removed one of my hard drives (the drive was making grinding noises and about to die) I failed to notice that it happened to be the drive with my only swap partition on it. However, a gig of RAM is plenty to run Ubuntu, and it ran just fine... for the time being.
When I did run out of memory, though, my system ground to a halt very horribly - which is to be expected when memory fills up, I guess (though I'm not sure why the OOM killer didn't just kill the app, a badly written Python script, that was actually hogging and allocating RAM like it was going out of fashion). The problem for me was that there was still no user-space warning that I was out of memory, no kernel attempt to allocate some sort of temporary swap file, and no indication to the user of what was going on or why. (It was obviously something swap-related to me, but it wouldn't have been obvious if I had been logged in remotely and couldn't hear the machine trying to tip itself over.) From a user perspective, a pretty horrible experience.
So instead of being kernel-space fighter pilots, it would be wonderful if some kernel developers would stick their heads out of the kernel box and look at the bigger picture for a change: consider the actual scenarios where memory runs out, and work out some real-world SOLUTIONS from there, rather than constructing wonderfully intricate mechanisms out of matchsticks. This one certainly has merit, but it has much narrower coverage than any solution that started by looking at actual problem scenarios and didn't limit itself to kernel space.
I would like to see some better user space integration for dealing with this kind of thing, instead of pretending that world doesn't exist.
Greetings from userland. Wish you were here.

Pengo.
Cluster nodes? Thin clients? blades?

Posted Aug 24, 2006 7:29 UTC (Thu) by ringerc (subscriber, #3071)
...there have not been streams of bug reports on this topic - and, in most cases, it can be avoided simply by swapping to a local disk.

This fails to consider diskless thin clients, blade systems, and cluster nodes, where power use, heat, and noise, as well as cost, are major factors in the desire to avoid a local disk.

I suspect the bug reports haven't been coming because "everybody knows Linux can't swap reliably over NFS or NBD" - or users are told so as soon as they do any research on the matter. That rather reduces the test base, and reports will be further reduced by the expectation that they will be dismissed with a comment along the lines of "why are you doing that? It's unreliable and can deadlock."

Whether the statement about reliability in the previous paragraph is actually true, and whether the issues really arise very much, is a good question. I don't think there's any easy way to tell the difference between a feature everybody is told is unreliable and so avoids, and one that's used by a significant number of people without many issues despite theoretical problems.

I personally use LTSP thin clients that swap over NFS. The latest LTSP has moved to swap-over-NBD, so I'll be finding out first hand soon. Experience with LTSP suggests that swap over NFS, at least, is not something you want to rely on for more than the most casual use, as it DOES deadlock.

The set of systems whose owners can afford fancy network storage arrays, but where those same owners are unable to invest in a local disk for swapping, is thought to be small.

This might be true - but there are several types of systems that have this need, as noted above, and the reasons for avoiding local disks certainly aren't limited to cost. I'd argue that the number of people with >16 CPU machines is pretty small too, but the scheduler certainly gets a disproportionate amount of attention and added complexity for this small group.