
Sunsetting buffer heads

By Jonathan Corbet
May 18, 2023

LSFMM+BPF
The buffer head is a kernel data structure that dates back to the first Linux release; for much of the time since then, kernel developers have been hoping to get rid of it. Hannes Reinecke started a plenary session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit by saying that everybody agrees that buffer heads are a bad idea, but there is less agreement on how to take them out of the kernel. The core functionality they provide — facilitating sector-size I/O operations to a block device underlying a filesystem — must be provided somehow.

The key question, he said, was whether the existing buffer-head API should be reimplemented using folios or, instead, the best approach would be to just replace buffer heads with folios directly. One problem with the latter approach is that folios only support page-size I/O; pages are usually larger than sectors, so page-size I/O operations will necessarily transfer multiple sectors. It is not clear to him that this matters much, though. Filesystems work hard to pack data on disk, and chances are good that the adjacent sectors will also be accessed soon. With current hardware, he said, I/O size is no longer as important for performance, so filesystem developers should not hesitate to use page-sized I/O operations.

An advantage that would come from using folios is that they can use the page cache directly. The buffer cache duplicates a fair amount of page-cache functionality; that code would be good to drop. Changing filesystems to read page-sized chunks is relatively easy, he said, but the write side is less so. Filesystems often assume that a write will only affect one sector, but that is not the case with folios. There is a patch series from Ritesh Harjani adding support for sub-page dirty tracking that should help with the conversion process.

Luis Chamberlain said that one possible sticking point is the block-device cache, which uses buffer heads for tasks like partition scanning. It's not clear that the benefits of converting this code are worth the risk. Reinecke answered that a full conversion away from buffer heads may never be necessary. Chamberlain said that a number of filesystems use buffer heads for metadata I/O and asked whether they wanted to move away from that; the answer appeared to be "yes". Jan Kara said that moving away from buffer heads was important, but that there is still a need for some sort of intermediate layer; Reinecke agreed that a drop-in replacement for buffer heads is important.

Kara pointed out that filesystems use the buffer-head layer to maintain the association between sectors and the inode of the file containing them. It is, he said, "one of the darker corners" of the buffer-head code. Reinecke questioned whether that functionality was needed, since the dirtying of data happens at the folio level in any case.

Ted Ts'o said that the buffer-head code hides a lot of functionality, and that each filesystem uses a different subset of that functionality. Sub-page tracking only replaces one piece of that puzzle. The ext4 filesystem still supports 1KB block sizes, so sub-page tracking can't be ignored; it needs that piece. Things get more complicated when small block sizes have to be supported on CPUs with 64KB base page sizes.

Some filesystems also use the buffer cache to associate dirty buffers with inodes, Ts'o continued. Filesystem developers are slowly converting over to the iomap support layer, which makes that problem go away, but finishing the conversion is "a big lift". Some other buffer-cache functionality is only used by the journaled block device (JBD2) layer, and might be best just moved into that code.

Yet another problem is that ext4 has to be able to support utilities that open the underlying block device while a filesystem is mounted; keeping everything coherent in that environment is a challenge, but the buffer cache makes it happen for free. He concluded by saying that there is value in replacing the ancient buffer-head code, even if it still provides functionality needed by filesystems, but it will be an incremental task. It will not be possible to just switch to folios for metadata; some sort of intermediate layer will still be needed.

Josef Bacik agreed that "we all hate buffer heads", but said that every filesystem manages metadata in its own special way. Btrfs uses extent buffers that sit on top of struct page and will be converted to folios soon. XFS has its own structure for metadata. And so on. Just dropping buffer heads will not be the big win that everybody seems to think it will be, he said. A common replacement layer for filesystems is not the solution, since every filesystem is different. It would take some coercion to get him to switch Btrfs to a common layer. Reinecke answered that he is trying to outline a path by which filesystems can be converted away from buffer heads and is not trying to force anybody.

Ts'o said that the buffer-head work won't matter for filesystems that do not use buffer heads now, but there are still quite a few that do use them, and the buffer-head layer provides an important support system. Bacik said that any conversion work should just be done within the buffer-head layer so that filesystems wouldn't notice the difference. Matthew Wilcox asked whether there was a desire for large folio support within the buffer-head layer; that would be difficult to add transparently. Ts'o advised keeping things simple and sticking to single-page folios for a replacement buffer-head layer. Filesystems that want multi-page support can switch to iomap.

Bacik brought the session to a close by stating the apparent conclusion from the session: the buffer-head layer will be converted to use folios internally while minimizing changes visible to the filesystems using it. Only single-page folios will be used within this new buffer-head layer. Any other desires, he said, can be addressed later after this problem has been solved.

Index entries for this article
Kernel: Block layer
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



Sunsetting buffer heads

Posted May 18, 2023 14:56 UTC (Thu) by hkario (subscriber, #94864) [Link] (5 responses)

Don't NVMe block devices (basically the only ones where you really care about IOPS) internally work on multiples of the sector size anyway? So the only downside of asking for 4KiB instead of 512 bytes is an eightfold increase in the data exchanged, but how often would that happen in practice? Databases usually work with 4KiB clusters anyway, same for filesystems.

So it looks to me like handling of 512B requests could go through the slow path and still end up with the OS being faster.
But I'm afraid that we won't be able to tell without somebody actually implementing it and then testing it...

Sunsetting buffer heads

Posted May 18, 2023 16:48 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

The only potential issue is systems with larger minimum page size (e.g. Apple Silicon prefers a 16 KiB minimum page size, some POWER systems are happier with 64 KiB minimum page size). Combining a 64 KiB page size with a 1 KiB block size filesystem on a 512 B sector size device means transferring 128 sectors in order to discard 126 sectors.
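A quick standalone sketch of that arithmetic, using the sizes named above (this is plain userspace C, not kernel code):

    /* Back-of-the-envelope check: how many 512 B sectors a 64 KiB page
     * spans, and how many of them are unwanted when only one 1 KiB
     * filesystem block is actually of interest. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned page_size   = 64 * 1024; /* 64 KiB base page */
        const unsigned block_size  = 1024;      /* 1 KiB filesystem block */
        const unsigned sector_size = 512;       /* 512 B device sector */

        unsigned sectors_per_page  = page_size / sector_size;  /* 128 */
        unsigned sectors_per_block = block_size / sector_size; /*   2 */

        printf("transfer %u sectors, discard %u\n",
               sectors_per_page, sectors_per_page - sectors_per_block);
        return 0;
    }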

Sunsetting buffer heads

Posted May 23, 2023 0:20 UTC (Tue) by Fowl (subscriber, #65667) [Link] (3 responses)

Don't many modern storage devices - flash with a fancy controller (i.e. approximately all of it), SMR hard drives - have to do read-modify-write cycles for smaller writes anyway, and therefore a larger write could be faster despite the extra bus traffic?

Sunsetting buffer heads

Posted May 23, 2023 9:55 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

Flash doesn't do read-modify-write cycles; it has a translation layer that maintains a mapping from LBAs to where on the physical flash the data is actually stored, and can thus do small writes in a log-structured fashion. SMR drives can operate in one of two modes; one where the device exposes the write limitations to the host, and you have to comply, and one where it either log-structures things itself or does read-modify-write (whichever is faster).

But note that even if your device does do read-modify-write, working in larger blocks isn't guaranteed to be faster; it may well still be faster to read 512 bytes, modify 16 or 128 bytes (a single partition table entry in MBR or GPT), then write 512 bytes than to read 1024 KiB, do the modification, and write 1024 KiB back. While the 1024 KiB write may take the same time as (or even a little less than) a 512 byte write at the medium, the overall operation will be dominated by the bus transfer time involved in moving all that data to and from the host.
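To put rough numbers on the bus-transfer argument, here is a small sketch; the 3.5 GB/s link speed is purely an assumption chosen for illustration, not a figure from the comment:

    /* Rough illustration of bus-transfer cost: time to move 512 bytes
     * vs. 1024 KiB over a link assumed to run at 3.5 GB/s.  The link
     * speed is an assumption used only for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const double bus_bytes_per_sec = 3.5e9;   /* assumed link speed */
        const double small = 512.0;               /* bytes */
        const double large = 1024.0 * 1024.0;     /* 1024 KiB */

        printf("512 byte transfer: %8.2f us\n", small / bus_bytes_per_sec * 1e6);
        printf("1024 KiB transfer: %8.2f us\n", large / bus_bytes_per_sec * 1e6);
        /* ...and a host-side read-modify-write moves the data twice. */
        return 0;
    }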

And note that if the device has a decent amount of cache on it, it may well service the 512 byte read request by transferring a full physical sector into its cache in anticipation of a 512 byte write - so the read cost of the read-modify-write cycle has already been paid by the time you transfer 512 bytes back to write, and your write is as cheap as a bigger write.

This is not to be confused with the issues with partition alignment on early "advanced format" drives, where the underlying hardware had 4096 byte physical sectors but presented 512 byte logical sectors; the issue there was that the host could be making 4096 byte writes (the optimal size for the hardware) that were misaligned with the physical sectors, so every 4096 byte write turned into a read-modify-write. The same could happen with 1 GiB writes misaligned by 512 bytes, with the consequence that what should have been a single large operation becomes a read-modify-write.

Sunsetting buffer heads

Posted May 24, 2023 18:22 UTC (Wed) by willy (subscriber, #9762) [Link] (1 responses)

You're not wrong, but you're not aware of a more recent problem encountered by flash drives, which is that it takes a lot of memory to maintain the translation layer. If you double the number of LBAs, you double the amount of RAM needed to hold the translation layer. Unless you go past the 32->64 bit threshold, in which case it quadruples. Anyway.

To shrink the size of this table (saving money and hopefully resulting in a cheaper drive), some drives track block mapping on a 4kB or larger boundary instead of the 512 byte LBA boundary. That shrinks the table by a factor of 8! Downside ... we're back to a read-modify-write cycle for single-block writes. So even an NVMe drive may prefer 4kB aligned writes.
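To put a number on the table-size argument, here is a sketch for a hypothetical 1 TiB drive with 4-byte mapping entries; both figures are assumptions chosen only to show the factor-of-8 effect:

    /* FTL mapping-table size for a hypothetical 1 TiB drive, assuming a
     * 4-byte entry per mapped unit.  Drive capacity and entry size are
     * illustrative assumptions; only the ratio matters. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long capacity = 1024ULL * 1024 * 1024 * 1024; /* 1 TiB */
        const unsigned long long entry_size = 4;  /* bytes per mapping entry */

        unsigned long long table_512 = capacity / 512  * entry_size;
        unsigned long long table_4k  = capacity / 4096 * entry_size;

        printf("512 B mapping granularity: %llu MiB of table\n", table_512 >> 20);
        printf("4 kB mapping granularity:  %llu MiB of table\n", table_4k >> 20);
        return 0;
    }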

Sunsetting buffer heads

Posted May 25, 2023 10:32 UTC (Thu) by farnz (subscriber, #17727) [Link]

That doesn't affect what I said at all, though - the difference between a host read-modify-write and a device read-modify-write is unlikely to be in the host's favour, since they involve the same medium accesses, but one also involves a bus transfer and the other doesn't.

And there's other weirdness with FTLs out there - one I've encountered tracked the entire logical device in large chunks, where a chunk could either be a pointer to flash locations, or a pointer to a 32k split. Each 32k split could, in turn, point to a flash location, or to a 512 byte split. And both sizes of split were limited resources, statically allocated; if you ran out of splits of the required size to satisfy a write, the FTL would delay the write while it did the defragmentation needed to free up a split of the required size. In the worst case, a 512 byte write would force you to defrag 32k splits into a large chunk, then use the newly freed 32k split to defrag 512 byte splits, and finally use a newly freed 512 byte split to handle the write. But this process is still quicker than having the host read a large chunk, modify 512 bytes, and write a large chunk back - since in the read-then-write case, the FTL also has to identify the 32k and 512 byte splits affected by the large write and mark them as free for reuse.

Sunsetting buffer heads

Posted May 19, 2023 11:35 UTC (Fri) by heatd (subscriber, #160156) [Link] (3 responses)

Silly question but: why do we hate buffer_heads?

I've been trying to figure out why we hate these (which have, after all, been a part of every UNIX system since ages ago), and why iomap (I think that's the proposed replacement?) is better.

Sunsetting buffer heads

Posted May 19, 2023 15:27 UTC (Fri) by sroracle (guest, #124960) [Link] (1 responses)

Sunsetting buffer heads

Posted May 19, 2023 16:17 UTC (Fri) by heatd (subscriber, #160156) [Link]

I have read that; that's where my questions really began. The article doesn't mention exactly what concerns the mm/fs folks have with buffer heads (it merely says "They also present an obstacle to changes [...]"), which is what I want to know.

Sunsetting buffer heads

Posted May 23, 2023 2:02 UTC (Tue) by dgc (subscriber, #6611) [Link]

> Silly question but: why do we hate buffer_heads?

On my current Debian distro kernel, a buffer head is 104 bytes in size.

Filesystems require one buffer head per filesystem block cached in the page cache. In cases where fs block size = PAGE_SIZE, the only non-redundant information the bufferhead carries is the sector address of the filesystem block the page maps to, i.e. 8 bytes of information. And, really, even this can be considered redundant, because the canonical source of the mapping information is the filesystem, not the bufferhead.

Consider a typical modern server that has >1TB of memory in it. Say for a given workload half of that memory is page cache and we have one buffer head per 4kB page. 500GB of page cache -> ~125 million pages = ~125 million buffer heads = ~10GB of RAM just for bufferheads. IOWs, a machine whose memory is 50% full of cached file data is going to be using at least 1% of the entire machine's RAM just to store bufferheads.
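That back-of-the-envelope estimate is easy to reproduce; a standalone sketch using the figures quoted above (a 104-byte buffer head per 4kB page-cache page, ~500GB of cached data) lands in the same ballpark, roughly 1% of a 1TB machine:

    /* Reproduces the estimate above: one 104-byte buffer head for every
     * 4 kB page-cache page on a machine caching ~500 GB of file data.
     * The result is on the order of 10 GB, i.e. roughly 1% of 1 TB. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long page_cache = 500ULL * 1000 * 1000 * 1000; /* ~500 GB */
        const unsigned long long page_size  = 4096;
        const unsigned long long bh_size    = 104;  /* buffer head size quoted above */

        unsigned long long nr_pages = page_cache / page_size;  /* ~122 million */
        unsigned long long bh_bytes = nr_pages * bh_size;

        printf("~%llu million buffer heads, ~%llu GB of RAM\n",
               nr_pages / 1000000, bh_bytes / 1000000000);
        return 0;
    }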

If you have block size < page size, then you have multiple bufferheads per page, and they typically each carry only 2 bits more of information - the per-block dirty and uptodate state. If you have a 1kB block size on a 64kB page size then there are 64 individual buffer heads attached to that page in a circular linked list. Iterating over the buffer heads (e.g. during writeback) can then cost 50-100 cache misses, depending on what fields in the bufferheads are being accessed....
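A minimal userspace model of that per-page circular list; the structure here is a stand-in, not the kernel's struct buffer_head, but the circular linkage and the walk are the pattern being described:

    /* Stand-in model of the per-page circular list of buffer heads
     * described above: 64 of them for 1 kB blocks on a 64 kB page, each
     * roughly buffer-head sized, walked the way writeback would. */
    #include <stdio.h>
    #include <stdlib.h>

    struct bh_model {
        struct bh_model *b_this_page;  /* next buffer head on this page */
        unsigned long b_state;         /* dirty/uptodate bits live here */
        char pad[88];                  /* pad out to ~104 bytes */
    };

    int main(void)
    {
        const int nr = 64;             /* 64 kB page / 1 kB blocks */
        struct bh_model *head = NULL, *prev = NULL;

        for (int i = 0; i < nr; i++) {
            struct bh_model *bh = calloc(1, sizeof(*bh));
            if (!bh)
                return 1;
            if (!head)
                head = bh;
            else
                prev->b_this_page = bh;
            prev = bh;
        }
        prev->b_this_page = head;      /* close the circle */

        /* A writeback-style walk touches every one of the 64 structures. */
        int dirty = 0;
        struct bh_model *bh = head;
        do {
            if (bh->b_state & 1)       /* treat bit 0 as "dirty" */
                dirty++;
            bh = bh->b_this_page;
        } while (bh != head);

        printf("walked %d buffer heads, %d dirty\n", nr, dirty);
        return 0;
    }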

iomap solves this problem (and others) by querying the filesystem only when mapping information is needed by the IO paths. This reduces the per-block state that needs to be carried in the page cache down to 2 bits - uptodate and dirty state. We currently only carry per-block uptodate state, in a per-folio bitmap, but work is in progress to move from per-folio to per-block dirty tracking using the same technique.

At this point, a single 2MB folio in the page cache with a 4kB block size will only need to carry ~140 bytes of state information to track all the necessary per-block state. Using bufferheads would require the 2MB folio to carry a list of 512 individual bufferheads and hence would use ~50kB of memory to track the same state information. That's a pretty big difference in resource consumption, and it should also demonstrate why it was decided that bufferheads will only be used with PAGE_SIZE sized folios...
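For comparison, a sketch of the kind of per-folio state described above: two bitmaps (uptodate and dirty), one bit per block, attached to the folio. The names and layout here are illustrative rather than the kernel's actual structures, but the sizes show where the ~140 bytes vs. ~50kB difference comes from:

    /* Illustrative model of per-folio sub-block state: one uptodate bit
     * and one dirty bit per block, for a 2 MB folio with 4 kB blocks.
     * Names and layout are illustrative, not the kernel's structures;
     * the real state also includes a small header, hence ~140 bytes. */
    #include <stdio.h>

    #define FOLIO_SIZE   (2u * 1024 * 1024)          /* 2 MB folio */
    #define BLOCK_SIZE   4096u                       /* 4 kB blocks */
    #define BLOCKS       (FOLIO_SIZE / BLOCK_SIZE)   /* 512 blocks */
    #define LONG_BITS    (8 * sizeof(unsigned long))
    #define BITMAP_LONGS ((BLOCKS + LONG_BITS - 1) / LONG_BITS)

    struct folio_block_state {
        unsigned long uptodate[BITMAP_LONGS];        /* 1 bit per block */
        unsigned long dirty[BITMAP_LONGS];           /* 1 bit per block */
    };

    int main(void)
    {
        printf("blocks per folio:        %u\n", BLOCKS);
        printf("bitmap state per folio:  %zu bytes\n",
               sizeof(struct folio_block_state));    /* 128 bytes */
        printf("buffer heads per folio:  %u bytes\n",
               BLOCKS * 104u);                       /* ~52 kB */
        return 0;
    }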

There are plenty of other, more complex and subtle reasons for not using bufferheads, but the compelling one for modern systems is simply that per-block information is expensive to maintain. Filesystems have used extents for over 3 decades for this reason, and the iomap infrastructure leverages the efficiency of the extent-based mapping indexes already implemented in the filesystems themselves to minimise the memory footprint and CPU overhead of caching file data....

-Dave.


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds