Network block devices and OOM safety
[Posted March 30, 2005 by corbet]
iSCSI is, for all practical purposes, a way of attaching storage devices to
a fast network interconnect and making them look like local SCSI drives.
There is a great deal of interest in iSCSI for high-end "storage area
network" applications, and a few competing iSCSI implementations exist for
Linux. Top-quality Linux iSCSI support would be a good thing to
have; it turns out, however, that iSCSI raises an interesting issue with
how the block subsystem works, especially when it must interact with the
networking layer.
When the system gets short of memory, one of the things it must do is to
force dirty pages to be written to their backing store, so that those pages
may be freed. This activity becomes doubly urgent when the system runs
completely out of memory. What happens, however, if the act of writing
those pages to disk also requires a memory allocation? In the iSCSI case,
those pages must be written via a TCP socket, so the networking layer must
be able to allocate enough memory to handle the TCP protocol's needs. If
the system is completely out of memory, where will this additional
allocation come from?
This particular problem was solved for the block layer some time ago with
the mempool mechanism. A mempool sets aside
a certain amount of memory for emergencies. When all else fails, the block
layer can allocate needed memory from the mempool; in that way, it is
guaranteed of being able to make at least some progress and free memory for
the system.
A similar mechanism could be put in place for network-based devices,
probably through a special socket option which would cause a mempool to be
set up for a specific connection. Attaching a mempool to a socket would
guarantee that the system could send data through that connection.
Unfortunately, in this case, using a mempool in this way does not solve the
entire problem.
When a block driver writes data to a local device, it can easily tell when
the operation has completed (and the relevant memory can be freed). In
many cases, it is simply a matter of
waiting for an interrupt and querying ports on the host controller. Newer,
more complex protocols can be handled by setting aside a small amount of
memory for replies from the controller. The controller is unlikely to
overwhelm the system with spurious messages; about the only thing that will
come back is responses to operations initiated by the system.
In the iSCSI case, a write to the device cannot be deemed to have succeeded
until the device sends back an acknowledgment, which will arrive as one of
possibly many TCP packets. If the system does not have memory available to
receive those packets and process the ACKs, it will be unable to complete
the write operations and free up more memory. So everything stalls, or, in
the worst case, deadlocks completely.
Just creating another mempool for incoming packets is not a solution,
however. The number of packets arriving on a network interface can be
huge, and the bulk of them are likely to be entirely unrelated to the
crucial outstanding iSCSI operations. A system which is in an
out-of-memory state simply cannot attempt to keep up with the full flood of
packets arriving on its network interfaces. But, if it is unable to deal
with the specific packets it is looking for, it may never get out of its
memory crunch.
Various possible solutions have been floated. Many network interfaces can
be programmed, in great detail, to drop uninteresting packets. So, when
the system hits a memory crunch, it could instruct its network drivers to
restrict the incoming packet stream to acknowledgments on high-priority
connections. This approach would work, but it would require complicated
communications between network drivers and the higher layers of the
system. Network adaptors are also limited in the amount of programming
they can handle; this limitation would restrict the number of iSCSI devices
which could be reliably supported by the system.
Another possible solution was posted by
Andrea Arcangeli. When an attempt to allocate memory for an incoming
packet fails, the system would perform the allocation from one of the
mempools (chosen at random) associated with sockets routed through the
relevant interface. Once the packet was fed into the networking layer, a
quick check would be made to see if the packet is, in fact, associated with
one of the high-priority sockets; if not, it would be quickly dropped and
the memory returned to the mempool. Packets belonging to high-priority
sockets would be processed normally, resulting, hopefully, in the
completion of write operations and the freeing of memory.
This discussion has not reached any sort of consensus, and has made it
clear that a number of issues arise when the block and networking layers
interact. The attempt to find a solution, in this case, is likely to be
deferred to the Kernel Summit, to be held in Ottawa this July. It should
be an interesting session.
(
Log in to post comments)