A kernel without buffer heads

By Jonathan Corbet
May 1, 2023

No data structures found in the Linux kernel — at least, in any version that escaped from Linus Torvalds's development machine — are older than the buffer head. Like many other legacies from the early days of Linux, buffer heads have been targeted for removal for years. They persist, though, despite the problems they present. Now, Christoph Hellwig has posted a patch series that enables the building of a kernel without buffer heads — but the cost of doing so at this point will be more than most want to pay.

The first public release of the Linux kernel was version 0.01, and struct buffer_head was a part of it:

    struct buffer_head {
	char * b_data;			/* pointer to data block (1024 bytes) */
	unsigned short b_dev;		/* device (0 = free) */
	unsigned short b_blocknr;	/* block number */
	unsigned char b_uptodate;
	unsigned char b_dirt;		/* 0-clean,1-dirty */
	unsigned char b_count;		/* users using this block */
	unsigned char b_lock;		/* 0 - ok, 1 -locked */
	struct task_struct * b_wait;
	struct buffer_head * b_prev;
	struct buffer_head * b_next;
	struct buffer_head * b_prev_free;
	struct buffer_head * b_next_free;
    };

While the best disk drives available decades ago were nominally "fast", accessing data on disk was still slower, by several orders of magnitude, than accessing data in main memory. So the importance of caching file data was well understood long before Linux was born. The approach that was generally in use at that time was to cache disk blocks, with filesystem code operating on data in that cache; Torvalds followed that model with Linux. Thus, from the beginning, the Linux kernel included a "buffer cache" that held copies of blocks found on the system's disks.

The buffer_head structure was the key to managing the buffer cache. The combination of the b_dev and b_blocknr fields uniquely identified which block a given buffer cache entry referred to, while b_data pointed to the cached data itself. The other fields tracked whether the block needed to be written back to disk, how many users it had, and more. It was a core part of the kernel's block I/O subsystem — and of its memory management code as well.

Over time, it became clear that file caching could be done better if it were implemented as a cache of file data, rather than of disk blocks. During the 1.3 development cycle, Torvalds began implementing a new feature known as the "page cache", which would manage pages of data from files, rather than disk blocks. A number of advantages came from that change; many operations on file data could avoid calling into the filesystem code entirely if that data could be found in the cache, for example. Caching data at a higher level better matched how that data was used, and the ability to cache full pages (generally eight times larger than the 512-byte block size typically found at that time) improved efficiency.

The only problem was that the buffer cache was deeply wired into both the block subsystem and the filesystem implementations, so this cache continued to exist, alongside the page cache, for several more years until the two were unified. Even then, the buffer cache was at the core of the API used for block I/O. This was not optimal: filesystems worked hard to store data contiguously on disk, and the page cache could keep that data together in memory with at least page granularity, but the buffer-head interface required every I/O operation to be broken down into 512-byte blocks — each with its own buffer_head structure. That was a lot of overhead, much of which just added work for storage drivers, which had to try to reassemble larger chunks for reasonable I/O performance.

The 2.5 development series (the last of the odd-number development kernels under the older model) addressed this problem by reworking the block layer around a new data structure called the "bio" that could represent block I/O requests more efficiently. Over the years, the bio has evolved considerably as the need to support ever-higher I/O rates has grown, but it still remains the way that block I/O requests are assembled and managed.

Meanwhile, though, struct buffer_head can still be found in current kernels. And, more to the point, a number of filesystems still use it. The role that buffer heads once played in cache management has long since ended, but they still handle an important task in parts of the kernel: tracking the mapping between data cached in memory and the location on persistent storage where that data lives. The kernel has a rather more modern interface (iomap) for this purpose, but not all subsystems are using it.

One of the holdouts is ext4, which still makes heavy use of buffer heads. This filesystem, of course, is derived from ext2, which first entered the kernel with the 0.99.7 release in early 1993. Ext2 was based on block pointers; each file would have a list associated with it containing the numbers of the blocks on disk holding that file's data. Such a layout, where each block on disk is a separate entity (even if the filesystem tries to keep them together) fits the buffer head model reasonably well. So it is not surprising the buffer heads were embedded deeply within ext2, and are still there 30 years later in ext4, even though ext4 gained support for extents — a rather more efficient representation of large files — in 2006.

Buffer heads, clearly, still work, but they still add overhead to file I/O. They also present an obstacle to changes that developers want to make to the memory-management and filesystem layers, including the ongoing folio work. So the desire to get rid of buffer heads, which has been present for a long time, seems to be getting stronger.

But, as Hellwig's patch series shows, ext4 is not the only place where buffer heads persist. That series, after a bit of refactoring, adds a new BUFFER_HEAD configuration option that controls the compilation of buffer-head support. Any code that needs buffer heads will select that option; if a kernel is built without any code needing buffer heads, then the resulting kernel will not have that support. Such a kernel will be lacking a few important features, though, including the ext4 filesystem, but also F2FS, FAT, GFS2, HFS, ISO9660 (CDROM), JFS, NTFS, NTFS3, and the device-mapper layer. On the other hand, it is possible to build a buffer-head-free kernel that supports Btrfs and XFS.

It seems unlikely that there will be many kernels built without buffer-head support in the near future. This work does, however, make it easier to see where the remaining users are, which should help to focus work toward getting rid of buffer heads for real. That job is still likely to take some time — one does not perform major surgery on a heavily used filesystem in a hurry — and it may accelerate the removal of some old and unloved filesystems (like JFS). One of these years, though, it will become possible to drop this core kernel data structure that has been there since the beginning.

Index entries for this article
Kernel	Block layer

A kernel without buffer heads

Posted May 1, 2023 19:31 UTC (Mon) by flussence (guest, #85566) [Link] (4 responses)

Oh! From that list of affected filesystems it looks like I could try out these patches on my desktop for fun… it would have to use FUSE instead for writing to the ESP but that mountpoint doesn't exactly need to win any races. (And honestly, I'm tired of being reminded DOS codepages exist when tweaking .config)

A kernel without buffer heads

Posted May 2, 2023 16:52 UTC (Tue) by jengelh (guest, #33263) [Link] (3 responses)

90s tools for 90s filesystems: might as well use mtools ;-)

A kernel without buffer heads

Posted May 3, 2023 7:58 UTC (Wed) by rsidd (subscriber, #2582) [Link] (2 responses)

I don't think mtools will work for the EFI partition! Not even sure about fuse?

A kernel without buffer heads

Posted May 3, 2023 9:04 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (1 responses)

mtools ought to work just fine? It's a standard fat16 partition with long filename support, you could probably mount it under Windows 95 if you could deal with GPT.

A kernel without buffer heads

Posted May 6, 2023 11:50 UTC (Sat) by Hello71 (subscriber, #103412) [Link]

actually, most Linux distros/tools use mtools to build the EFI partition without root: https://github.com/search?q=%2F%5Cbmcopy%5Cb.*EFI%2F&..., https://codesearch.debian.net/search?q=%5Cbmcopy%5Cb.*EFI...

A kernel without buffer heads

Posted May 1, 2023 23:09 UTC (Mon) by charmitro (subscriber, #164741) [Link] (2 responses)

Which filesystems are widely used and need to adjust quicker than others?

A kernel without buffer heads

Posted May 2, 2023 1:14 UTC (Tue) by jhoblitt (subscriber, #77733) [Link] (1 responses)

My guess would be that ext4 is by far the most widely used filesystem for servers and certainly is for Android. After that, I would guess that fat is still widely used for USB storage and iso9660 for install media (USB) and VMs.

A kernel without buffer heads

Posted May 2, 2023 5:09 UTC (Tue) by ABCD (subscriber, #53650) [Link]

I would also mention that the EFI System Partition is almost always FAT, and the device-mapper layer is used to support LVM and dm-crypt/LUKS (among others), so (while not a filesystem per se) it would also need to be migrated early.