
Network block devices and OOM safety

iSCSI is, for all practical purposes, a way of attaching storage devices to a fast network interconnect and making them look like local SCSI drives. There is a great deal of interest in iSCSI for high-end "storage area network" applications, and a few competing iSCSI implementations exist for Linux. Top-quality Linux iSCSI support would be a good thing to have; it turns out, however, that iSCSI raises an interesting issue with how the block subsystem works, especially when it must interact with the networking layer.

When the system gets short of memory, one of the things it must do is to force dirty pages to be written to their backing store, so that those pages may be freed. This activity becomes doubly urgent when the system runs completely out of memory. What happens, however, if the act of writing those pages to disk also requires a memory allocation? In the iSCSI case, those pages must be written via a TCP socket, so the networking layer must be able to allocate enough memory to handle the TCP protocol's needs. If the system is completely out of memory, where will this additional allocation come from?

This particular problem was solved for the block layer some time ago with the mempool mechanism. A mempool sets aside a certain amount of memory for emergencies. When all else fails, the block layer can allocate needed memory from the mempool; in that way, it is guaranteed to be able to make at least some progress and free memory for the system.
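As a minimal sketch of how a block driver might use this interface (the slab cache io_req_cache is assumed to exist for the purposes of the example; real users, such as the bio layer, are rather more involved):

    #include <linux/errno.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    /* A slab cache for some per-request structure, assumed to have
     * been created elsewhere. */
    extern struct kmem_cache *io_req_cache;

    static mempool_t *io_req_pool;

    static int io_req_pool_init(void)
    {
            /*
             * Set aside a minimum of 16 pre-allocated objects.  When the
             * slab allocator cannot satisfy a request, mempool_alloc()
             * falls back to this reserve.
             */
            io_req_pool = mempool_create(16, mempool_alloc_slab,
                                         mempool_free_slab, io_req_cache);
            return io_req_pool ? 0 : -ENOMEM;
    }

    static void *io_req_get(void)
    {
            /* GFP_NOIO: reclaim may not recurse back into the I/O path. */
            return mempool_alloc(io_req_pool, GFP_NOIO);
    }

    static void io_req_put(void *req)
    {
            /* Freed objects refill the reserve before going back to the slab. */
            mempool_free(req, io_req_pool);
    }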

A similar mechanism could be put in place for network-based devices, probably through a special socket option which would cause a mempool to be set up for a specific connection. Attaching a mempool to a socket would guarantee that the system could send data through that connection. Unfortunately, using a mempool in this way does not solve the entire problem.
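What such an interface might look like is sketched below; the SO_MEMRESERVE option is invented purely for illustration, as no such socket option actually exists.

    #include <stdio.h>
    #include <sys/socket.h>

    /* Hypothetical option number; SO_MEMRESERVE does not exist. */
    #define SO_MEMRESERVE   99

    static int attach_emergency_reserve(int sock_fd)
    {
            int reserve_bytes = 64 * 1024;  /* size of the per-socket reserve */

            /* Ask the kernel to set up a mempool dedicated to this
             * connection, so that outgoing traffic can always be sent. */
            if (setsockopt(sock_fd, SOL_SOCKET, SO_MEMRESERVE,
                           &reserve_bytes, sizeof(reserve_bytes)) < 0) {
                    perror("setsockopt(SO_MEMRESERVE)");
                    return -1;
            }
            return 0;
    }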

When a block driver writes data to a local device, it can easily tell when the operation has completed (and the relevant memory can be freed). In many cases, it is simply a matter of waiting for an interrupt and querying ports on the host controller. Newer, more complex protocols can be handled by setting aside a small amount of memory for replies from the controller. The controller is unlikely to overwhelm the system with spurious messages; about the only thing that will come back is responses to operations initiated by the system. In the iSCSI case, a write to the device cannot be deemed to have succeeded until the device sends back an acknowledgment, which will arrive as one of possibly many TCP packets. If the system does not have memory available to receive those packets and process the ACKs, it will be unable to complete the write operations and free up more memory. So everything stalls, or, in the worst case, deadlocks completely.
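To make the dependency concrete, here is a heavily simplified and entirely hypothetical sketch of an initiator's receive path; read_response_pdu(), struct iscsi_rsp, and complete_pending_write() are invented names, but they show why completing (and thus freeing) a write first requires being able to receive data on the socket.

    /*
     * Hypothetical sketch of an iSCSI initiator's sk_data_ready
     * callback; the helpers and types used here are invented.
     */
    static void iscsi_data_ready(struct sock *sk, int bytes)
    {
            struct iscsi_rsp rsp;

            /*
             * Pulling the response PDU off the socket only works if the
             * networking layer was able to allocate an sk_buff for the
             * incoming packet in the first place.
             */
            if (read_response_pdu(sk, &rsp) < 0)
                    return;   /* no memory, no completion: writeback stalls */

            /*
             * Only at this point can the pages backing the original
             * write be marked clean and given back to the allocator.
             */
            complete_pending_write(rsp.tag);
    }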

Just creating another mempool for incoming packets is not a solution, however. The number of packets arriving on a network interface can be huge, and the bulk of them are likely to be entirely unrelated to the crucial outstanding iSCSI operations. A system which is in an out-of-memory state simply cannot attempt to keep up with the full flood of packets arriving on its network interfaces. But, if it is unable to deal with the specific packets it is looking for, it may never get out of its memory crunch.

Various possible solutions have been floated. Many network interfaces can be programmed, in great detail, to drop uninteresting packets. So, when the system hits a memory crunch, it could instruct its network drivers to restrict the incoming packet stream to acknowledgments on high-priority connections. This approach would work, but it would require complicated communications between network drivers and the higher layers of the system. Network adaptors are also limited in the amount of programming they can handle; this limitation would restrict the number of iSCSI devices which could be reliably supported by the system.

Another possible solution was posted by Andrea Arcangeli. When an attempt to allocate memory for an incoming packet fails, the system would perform the allocation from one of the mempools (chosen at random) associated with sockets routed through the relevant interface. Once the packet had been fed into the networking layer, a quick check would be made to see whether it was, in fact, associated with one of the high-priority sockets; if not, it would be quickly dropped and the memory returned to the mempool. Packets belonging to high-priority sockets would be processed normally, resulting, hopefully, in the completion of write operations and the freeing of memory.
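In rough code, the idea looks something like the following; this is only a sketch of the approach as described, not Andrea's actual patch, and the helpers and the extra sk_buff field are invented.

    #include <linux/mempool.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Driver receive path: allocate a buffer for an incoming packet. */
    static struct sk_buff *rx_alloc_skb(struct net_device *dev, unsigned int len)
    {
            struct sk_buff *skb = dev_alloc_skb(len);

            if (!skb) {
                    /* Normal allocation failed: borrow from the reserve of
                     * one (randomly chosen) socket routed through this
                     * interface.  pick_random_socket_reserve() and the
                     * "emergency" field are invented for this sketch. */
                    mempool_t *pool = pick_random_socket_reserve(dev);

                    skb = alloc_skb_from_reserve(pool, len);
                    if (!skb)
                            return NULL;            /* drop the packet outright */
                    skb->emergency = pool;          /* remember where it came from */
            }
            return skb;
    }

    /* Called once the stack has matched the packet to its socket. */
    static int rx_check_emergency(struct sk_buff *skb, struct sock *sk)
    {
            if (skb->emergency && !socket_has_reserve(sk)) {
                    /* Not one of the packets being waited for: give the
                     * buffer straight back to the reserve it came from. */
                    return_skb_to_reserve(skb);
                    return -1;                      /* caller drops the packet */
            }
            return 0;                               /* process normally */
    }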

This discussion has not reached any sort of consensus, and has made it clear that a number of issues arise when the block and networking layers interact. The attempt to find a solution, in this case, is likely to be deferred to the Kernel Summit, to be held in Ottawa this July. It should be an interesting session.



Network block devices and OOM safety

Posted Mar 31, 2005 5:25 UTC (Thu) by jwb (guest, #15467)

Is this only a problem for network block devices? Why don't network file systems like NFS, Lustre, StorNext, etc. have this problem? Are file systems not capable of spooling up backlogs of dirty pages the way block devices can?

Network block devices and OOM safety

Posted Mar 31, 2005 9:47 UTC (Thu) by nix (subscriber, #2304)

They do have this problem, I think: this was also the reason why shared writable mmap support was removed from FUSE. (One nice thing about this iSCSI-inspired work is that it might lead to FUSE regaining that capability. I hope so.)

Network block devices and OOM safety

Posted Apr 1, 2005 1:47 UTC (Fri) by giraffedata (subscriber, #1954)

I can confirm that filesystem drivers have this problem. I work on network filesystems, some of them based on iSCSI devices, and memory deadlocks and I have become good friends.

But such deadlocks are a lot less common in filesystems, which is why people didn't demand a fix to the fundamental problem (the network layer being simultaneously above and below the main memory pool) years ago.

Most network filesystem access is with NFS. NFS in its normal configuration always writes synchronously, so the number of dirty pages is very small.

StorNext does relatively little direct network I/O; the majority of its I/O is through block devices. (And since they aren't usually TCP/IP-based devices, the current problem is inapplicable.)

Lustre hasn't seen a large variety of applications; it probably gets lucky.

Most filesystem drivers naturally meter their activity by using the buffer cache, with its somewhat ham-handed limitation on the total amount of memory it's willing to occupy. So, long before memory usage gets critical, file-using processes slow down, waiting for new buffers, thereby giving the system a chance to clean out the old ones.

I work on a filesystem driver that uses its own cache manager instead of the buffer cache -- a cache manager that isn't afraid to use every byte of memory for file cache if that's the optimum use for it. Hence my deep experience with these deadlocks.

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds