KS2008: Filesystem and block layer interaction

By Jonathan Corbet
September 16, 2008

Much is happening with Linux filesystems currently; this is a situation which is likely to persist for some time. As filesystems develop, it is becoming clear that there need to be some changes in the interactions between the filesystem and block I/O layers. This kernel summit session discussed some of the places where changes are needed, but did not get much into their implementation.

Chris Mason is the lead developer of the up-and-coming btrfs filesystem. One of the items on Chris's shopping list is a way for filesystems to obtain a better understanding of the topology and nature of the storage system underneath them. He would like, for example, to be able to determine whether a filesystem is sitting on a solid-state device or on a traditional rotating disk. Certain decisions will be made very differently depending on the nature of the underlying device; filesystems stored on solid-state drives, for example, can be laid out without being concerned about seek times.
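As an illustration of the kind of query being asked for, current Linux kernels expose a per-device `queue/rotational` attribute in sysfs, readable from user space. The sketch below is not the interface discussed at the summit, and the helper name `is_rotational` and its `sysfs_root` parameter are hypothetical, introduced here so the function can be exercised against a test directory.

```python
from pathlib import Path


def is_rotational(device, sysfs_root="/sys/block"):
    """Report whether a block device claims to be rotating media.

    Returns True for a rotating disk, False for a non-rotating device
    (such as a solid-state drive), and None when the attribute is
    missing, unreadable, or holds an unexpected value.
    """
    attr = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        text = attr.read_text().strip()
    except OSError:
        return None
    if text == "1":
        return True
    if text == "0":
        return False
    return None
```

A caller would pass a device name such as `"sda"`; stacked or virtual devices may report a value that says little about the hardware underneath, so the result is best treated as a hint.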

The topology of the device also matters. Especially when multipath storage systems are in use, the filesystem would like to be able to understand what the various paths are, and to be able to partition the storage into truly independent failure domains. With this information, filesystems can find the optimal ways to perform I/O to the underlying devices.

Information needs to flow the other way as well. Upcoming filesystems will perform extensive checksumming on data, so they will be able to inform the storage layer when a block has gone bad. For mirrored devices, that will enable the storage driver to recover the block from an uncorrupted mirror - if the filesystem is able to tell it which mirror went bad.
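The recovery idea can be sketched in user space. The following is a toy model only, assuming each mirror is a Python dict mapping block numbers to bytes and that the filesystem keeps a CRC32 of every block; the function name `read_with_repair` and the data layout are hypothetical and do not reflect any real kernel interface.

```python
import zlib


def read_with_repair(mirrors, block_no, expected_crc):
    """Read a block, using the filesystem's checksum to find bad mirrors.

    mirrors is a list of dicts mapping block number -> bytes (a stand-in
    for mirrored devices). Returns (data, repaired), where repaired lists
    the indexes of mirrors that were overwritten with the good copy.
    """
    good = None
    bad = []
    for index, mirror in enumerate(mirrors):
        data = mirror.get(block_no)
        if data is not None and zlib.crc32(data) == expected_crc:
            if good is None:
                good = data
        else:
            # The checksum tells us exactly which mirror went bad.
            bad.append(index)
    if good is None:
        raise OSError(f"block {block_no}: no mirror holds a valid copy")
    for index in bad:
        mirrors[index][block_no] = good
    return good, bad
```

Without the checksum, the storage layer sees two differing copies and has no way to know which one to trust; that is the information the filesystem would be supplying.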

Chris asked for information on storage latency - how long operations can be expected to last - and the optimal I/O sizes and alignments. The motivation behind this request is to optimize I/O to solid-state devices. Here Linus jumped in and suggested that the filesystem developers should "take a deep breath and wait a year." Solid-state devices will change a lot over that time, and many of the problems which exist now will be gone by then. So filesystems designed for today's solid-state drives will contain a lot of useless code by the time those drives are truly widespread. It is better, Linus says, to just treat them as a fast, random-access disk and not worry about the details.

Another request was for filesystems to be able to allocate their own bio structures, rather than using the block layer's allocation functions. That would allow the filesystems to store their own private data with the bio without the need to tack on a chain of separate structures via the bi_private pointer. There's also a general need to rework the address space operations to facilitate better layout and more rational locking.

The kswapd process is a bit of a problem for contemporary filesystems. Kswapd is charged with freeing up pages for the memory allocator; it needs to be able to get its job done at times when system memory is very tight. Currently kswapd will attempt to write out dirty pages so that they can be freed. The problem is that this writeout can require more memory to carry out; as filesystems become more complex, the amount of extra memory needed seems to be growing. That can lead to deadlocks if that extra memory is not available. So the filesystem developers would like kswapd to concern itself exclusively with clean pages, which can be freed without performing I/O.

One answer that came back was that the writepage() VFS callback can be treated as advisory. That is what btrfs does now; if a writepage() call comes in the context of a process with the PF_MEMALLOC bit set (meaning that the system is trying to free memory), the call will simply fail. That is all legal, but it can hurt performance.
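The advisory behavior can be modeled in a few lines. This is a user-space Python model, not kernel code: the `Page` class, this `writepage()` function, its `do_io` callback, and the flag value are all stand-ins invented for illustration, and the return convention does not mirror the real VFS callback.

```python
PF_MEMALLOC = 0x800  # illustrative flag bit for this model only


class Page:
    """Minimal stand-in for a page cache page."""

    def __init__(self):
        self.dirty = True


def writepage(page, task_flags, do_io):
    """Write a dirty page unless the caller is in memory reclaim.

    Returns True if the page was written and cleaned, False if the
    request was declined and the page was left dirty.
    """
    if task_flags & PF_MEMALLOC:
        # Writeout may itself need memory, so decline rather than
        # risk a deadlock; a later, safer context will write the page.
        return False
    do_io(page)
    page.dirty = False
    return True
```

The performance cost mentioned above shows up here as the declined call: reclaim has done work on a page that it still cannot free.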

In the end, kswapd does writeout because, historically, it was possible for a Linux system to end up with all of its pages being dirty. In that kind of situation, writeout is the only way to make memory available again. But current kernels are able to keep close tabs on how much of memory is dirty at any given time, and they can avoid getting into that kind of situation. So writeout in kswapd is no longer necessary; it can, instead, be handled in contexts where memory is not in critically short supply. This change seems likely to be made in the near future.

The final topic, discussed briefly, was I/O barriers. The filesystem developers would really like it if the more complex storage layers - such as the software RAID and device mapper code - would implement write barriers. That is a hard thing to do with the current concept of barriers, though; the performance costs will be high. James Bottomley noted that a better job could be done with a more complex barrier API. But it is not clear whether the benefits that would come would be worth the extra cost.

Posted Sep 16, 2008 18:19 UTC (Tue) by pj (subscriber, #4506)

Has anyone considered that mkfs could, well, *probe* the block device? Do some basic block i/o and profile the responses, then use that data to optimize filesystem layout. Ideally you'd be able to re-do this probe at any later time - like doing a fsck - if the admin knows that the storage topology changed.

Posted Sep 26, 2008 18:01 UTC (Fri) by Russ.Dill@gmail.com (guest, #52805)

Seems too quick, simple, and well thought out to work right.

Posted Sep 16, 2008 21:36 UTC (Tue) by jengelh (guest, #33263)

>He would like, for example, to be able to determine whether a filesystem is sitting on a solid-state device or on a traditional rotating disk.

But what would we do with loop, crypto, and NBD devices, and lastly, FUSE and union mounts, which can all have various seek times and/or storage characteristics?

Posted Sep 16, 2008 23:55 UTC (Tue) by vomlehn (guest, #45588)

I think the right answer is that *all* of the possible block devices should be considered, not just rotating media and solid-state devices.

Posted Sep 22, 2008 15:49 UTC (Mon) by etienne_lorrain@yahoo.fr (guest, #38022)

Some rotating hard disks are said to be quicker if reads/writes are aligned to big blocks (4096-byte aligned read/write), and unfortunately the usual PC partition table makes the usual first partition start at sector 63 (sector size 512 bytes) - completely unaligned.
Some bootloaders use those sectors at the beginning of the disk (but not all bootloaders use them).
If the filesystem code could detect an unaligned start of partition, it could insert a padding sector at the beginning...
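
The arithmetic behind this suggestion can be checked with a short sketch. The function name `padding_sectors` is hypothetical, and the code assumes the alignment is a whole multiple of the sector size, as it is for the 512-byte sectors and 4096-byte blocks mentioned above.

```python
SECTOR_SIZE = 512  # bytes, as in the comment above


def padding_sectors(start_sector, alignment_bytes=4096,
                    sector_size=SECTOR_SIZE):
    """Return how many sectors of padding align a partition start.

    Assumes alignment_bytes is a whole multiple of sector_size.
    """
    offset = start_sector * sector_size
    misalignment = offset % alignment_bytes
    if misalignment == 0:
        return 0
    return (alignment_bytes - misalignment) // sector_size
```

A partition starting at sector 63 begins at byte 63 * 512 = 32256, which is 3584 bytes past the previous 4096-byte boundary, so `padding_sectors(63)` gives one sector of padding, moving the start of data to sector 64.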