
Filesystem support for block sizes larger than the page size

February 20, 2025

This article was contributed by Pankaj Raghav

The maximum filesystem block size that the Linux kernel can support has always been limited by the host page size, even when filesystems themselves could handle larger block sizes. The large-block-size (LBS) patches that were merged for the 6.12 kernel removed this limitation in XFS, thereby decoupling the page size from the filesystem block size. XFS is the first filesystem to gain this support, with other filesystems likely to add LBS support in the future. In addition, the LBS patches have been used to get the initial atomic-write support into XFS.

LBS is an overloaded term, so it is good to clarify what it means in the context of the kernel. The term LBS is used in this article to refer to places where the filesystem block size is larger than the page size of the system. A filesystem block is the smallest unit of data that the filesystem uses to store file data on the disk. Setting the filesystem block size will only affect the I/O granularity in the data path and will not have any impact on the filesystem metadata.

Long history

The earliest use case for LBS came from the CD/DVD world, where reads and writes had to be performed in 32KB or 64KB chunks. LBS support was proposed to handle these devices and avoid workarounds at the device-driver level, but it was never merged. Beyond these historical needs, LBS enables testing filesystem block sizes larger than the host page size, allowing developers to verify XFS functionality with 64KB blocks on x86_64 systems that do not support 64KB pages. This is particularly valuable given the increasing adoption of architectures with larger page sizes.

Another emerging use case for LBS comes from modern high-capacity solid-state storage devices (SSDs). To support these capacities, storage vendors are increasing the devices' internal mapping unit (commonly called the indirection unit, or IU) beyond 4KB. When I/O operations are not sized for this larger IU, the device must perform read-modify-write operations, increasing the write-amplification factor. LBS enables filesystems to match their block size to the device's IU, avoiding these costly operations.

Although LBS sounds like it has something to do with the block layer, the block-size limit in the kernel actually comes from the page cache, not the block layer. The main requirement to get LBS support for filesystems is the ability to track a filesystem block as a single unit in the page cache. Since a block is the smallest unit of data, the page cache should not partially evict a single block during writeback.

There were multiple attempts in the past to add LBS support. The most recent effort, from Dave Chinner in 2018, worked around the page-cache limitation by adding the IOMAP_F_ZERO_AROUND flag in iomap. This flag pads I/O operations with zeroes if the size is less than a single block. The patches also removed the writepage() callback to ensure that the entire large block was written during writeback. This effort was not upstreamed because folios were gaining traction, and they had the potential to solve the partial-eviction problem directly at the virtual filesystem layer.

Large folio support was added to the page cache, and it was enabled in XFS in 5.18. If a filesystem supports large folios, then the page cache will opportunistically allocate larger folios in the read and write paths based on the size of the I/O operation. The filesystem calls mapping_set_large_folios() during inode initialization to enable the large-folios feature. But the page cache could still fall back to allocating an order-0 folio (a single page) if the system is running low on memory, so there is no guarantee on the size of the folios.
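
As a rough illustration (a sketch rather than XFS's actual code, and with a hypothetical example_setup_inode() helper), opting a file into large folios is a single call on its address space during inode initialization:

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /*
     * Sketch: opt this inode's mapping in to large folios. The page cache
     * may then allocate folios of any order up to MAX_PAGECACHE_ORDER, but
     * it can still fall back to single pages under memory pressure.
     */
    static void example_setup_inode(struct inode *inode)
    {
            mapping_set_large_folios(inode->i_mapping);
    }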

The LBS patches, which were developed by me and some colleagues, add a way for the filesystem to inform the page cache of the minimum and maximum order of folio allocation to match its block size. The page cache will allocate large folios that match the order constraints set by the filesystem and ensure that no partial eviction of blocks occurs. The mapping_set_folio_min_order() and mapping_set_folio_order_range() APIs have been added to control the allocation order in the page cache. The order information is encoded in the flags member of the address_space struct.

Setting the minimum folio order is sufficient for filesystems to add LBS support since they only need a guarantee on the smallest folio allocated in the page cache. Filesystems can set the minimum folio order based on the block size during inode initialization. Existing callers of mapping_set_large_folios() will not notice any change in behavior because that function will now set the minimum order to zero and the maximum order to MAX_PAGECACHE_ORDER.
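
The sketch below (hypothetical code, not XFS's actual implementation) shows how a filesystem might derive the minimum order from its block size during inode initialization; with 64KB blocks on a 4KB-page system, the minimum order works out to four (16 pages):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /*
     * Sketch: tell the page cache that folios for this mapping must never
     * be smaller than one filesystem block. inode->i_blkbits is log2 of
     * the block size, so 64KB blocks on a 4KB-page system give order 4.
     */
    static void example_set_min_folio_order(struct inode *inode)
    {
            unsigned int min_order = 0;

            if (inode->i_blkbits > PAGE_SHIFT)
                    min_order = inode->i_blkbits - PAGE_SHIFT;

            mapping_set_folio_min_order(inode->i_mapping, min_order);
    }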

Under memory pressure, the kernel will try to break up a large folio into individual pages, which could violate the promise of a minimum folio order in the page cache. The main constraint is that folios in the page cache must never become smaller than the minimum order. Since the 6.8 kernel, the memory-management subsystem has been able to split a large folio into folios of any lower order. LBS support uses this feature to maintain the minimum folio order in the page cache even when a large folio is split due to memory pressure or truncation.
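
Conceptually (a simplification for illustration, not the actual memory-management code), the split path clamps the requested order to the mapping's minimum, using the mapping_min_folio_order() helper added by the LBS patches:

    #include <linux/minmax.h>
    #include <linux/pagemap.h>

    /*
     * Simplified idea: when splitting a folio from this mapping (because of
     * memory pressure or truncation), never go below the minimum order
     * that the filesystem requested.
     */
    static unsigned int example_clamp_split_order(struct address_space *mapping,
                                                  unsigned int requested_order)
    {
            return max(requested_order, mapping_min_folio_order(mapping));
    }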

Other filesystems?

Readers might be wondering if it will be trivial to add LBS support to other filesystems since the page cache infrastructure is now in place. The answer unfortunately is: it depends. Even though the necessary infrastructure is in place in the page cache to support LBS in any filesystem, the path toward adding this support depends on the filesystem implementation. XFS has been preparing for LBS support for a long time, which resulted in LBS patches requiring minimal changes in XFS.

The filesystem needs to support large folios to support LBS, so any filesystem that is using buffer heads in the data path cannot support LBS at the moment. XFS developers moved away from buffer heads and designed iomap to address the shortcomings of buffer heads. While there is work underway to support large folios in buffer heads, it might take some time before it gets added. Once large-folio support is added to a filesystem, adding LBS support is mostly a matter of finding the corner cases that assume a particular block size.

LBS has already found a use case in the kernel. The guarantee that the memory in the page cache representing a filesystem block will not be split has been used by the atomic-write support in XFS for 6.13. The initial support in XFS will only allow writing one filesystem block atomically; the filesystem needs to be created with the desired atomic-write size as its block size.
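
As a rough userspace illustration (a sketch with a hypothetical path, minimal error handling, and the assumption of an XFS filesystem created with a 16KB block size, for example with "mkfs.xfs -b size=16k"), a single block can be written atomically by passing RWF_ATOMIC to pwritev2(); the initial support applies to direct-I/O writes of exactly one block:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* RWF_ATOMIC is defined in <linux/fs.h> with sufficiently new headers. */
    #ifndef RWF_ATOMIC
    #define RWF_ATOMIC 0x00000040
    #endif

    int main(void)
    {
            const size_t blksz = 16 * 1024; /* must match the filesystem block size */
            void *buf = NULL;

            /* Hypothetical path on the 16KB-block XFS filesystem. */
            int fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || posix_memalign(&buf, blksz, blksz))
                    return 1;
            memset(buf, 0xab, blksz);

            struct iovec iov = { .iov_base = buf, .iov_len = blksz };

            /* Either the whole 16KB block reaches the media or none of it does. */
            if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != (ssize_t)blksz)
                    return 1;

            close(fd);
            free(buf);
            return 0;
    }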

The next focus in the LBS project is to remove the logical-block-size restriction for block devices. Similar to the filesystem block size, the logical block size, which is the smallest size that a storage device can address, is restricted to the host page size due to limitations in the page cache. Block devices cache data in the page cache when applications perform buffered I/O directly on the device, and they use buffer heads by default to interact with the page cache. So large-folio support is needed in buffer heads to remove this limitation for block devices.

Many core changes, such as large-folio support, XFS's switch from buffer heads to iomap, multi-page bvec support in the block layer, and so on, took place in the 17 years after the first LBS attempt. As a result, LBS support could be added with relatively few changes. With that being done for 6.12, XFS will finally support all the features that it supported in IRIX before it was ported to Linux in 2001.

[I would like to thank Luis Chamberlain and Daniel Gomez for their contributions to both this article and the LBS patches. Special thanks to Matthew Wilcox, Hannes Reinecke, Dave Chinner, Darrick Wong, and Zi Yan for their thorough reviews and valuable feedback that helped shape the LBS patch series.]





Shows how ahead of its time SGI really was

Posted Feb 20, 2025 21:16 UTC (Thu) by jmalcolm (subscriber, #8876) [Link] (2 responses)

We tend to look back to the era of "proprietary" UNIX as a lesser time but the sentence "XFS will finally support all the features that it supported in IRIX before it was ported to Linux in 2001" really hits home. Not only do we have XFS itself only because it was gifted by SGI but the kernel is only now able to offer all the features that XFS could rely on in IRIX almost 30 years ago.

My point is not that IRIX is better than Linux. Only that, if you consider what your home PC would have looked like in 2000, it is kind of mind blowing how advanced the software already was back then. Which makes sense of course because, though it would not hold a candle to today, the proprietary UNIX systems were designed for money-no-object hardware that pushed the technology of the time as far as it could go.

Shows how ahead of its time SGI really was

Posted Feb 21, 2025 1:20 UTC (Fri) by gerdesj (subscriber, #5446) [Link] (1 responses)

"but the kernel is only now able to offer all the features that XFS could rely on in IRIX almost 30 years ago."

reflinks?

XFS nowadays has holes in the knees of its jeans, which it mock derides as "ironic". For me XFS is a safe haven for data, especially for large files.

Veeam has supported reflinks for some time now and it is a bit of a game changer when you are dealing with backups that have a full and incrementals. A "synthetic full" can be generated within seconds by shuffling reflinks instead of blocks. Add a flag to cp and you can create a sort of clone within seconds - a bit like a volume snapshot but for individual files.

The real beauty of Linux and other promiscuous Unixes is that we have a lot of choice and sometimes we as sysadmins pick the right one for the job in hand.

Windows has vFAT, NTFS and ReFS and that's your lot. To be fair ReFS is shaping up nicely these days (it doesn't kick your puppies quite so often) and NTFS is very, very stable - regardless of how much you abuse it. vFAT is FAT - keep it simples.

Apples have files and I'm sure they are lovely.

Shows how ahead of its time SGI really was

Posted Feb 21, 2025 4:38 UTC (Fri) by bmenrigh (subscriber, #63018) [Link]

XFS has reflinks. I’ve been de-duping files on it for a while.

Thanks for the nudge

Posted Feb 20, 2025 21:44 UTC (Thu) by koverstreet (✭ supporter ✭, #4296) [Link] (12 responses)

This was on my todo list, but I'd been too lazy to look up the relevant helper :)

bcachefs now has it in for-next, so it should land in 6.15:
https://evilpiepirate.org/git/bcachefs.git/commit/?h=for-...

bcachefs

Posted Feb 21, 2025 15:16 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (11 responses)

Given that it is CoW, could bcachefs support arbitrary atomic writes, or even (gulp) transactions in the style of NTFS?

bcachefs

Posted Feb 21, 2025 18:10 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Unlikely. And even NTFS' successor removed the transactional support.

The problem is not in the filesystem itself, where transactions are reasonably easy, but in the VFS layer and the page cache. There's no easy way to impose isolation between parts of the page cache, and rollbacks are even more tricky.

bcachefs

Posted Feb 21, 2025 20:05 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (9 responses)

Yes, it could support arbitrary atomic writes (not of infinite size, bcachefs transactions have practical limits). If someone wanted to fund it - it's not a particularly big interest of mine.

Unfamiliar with NTFS style transactions.

bcachefs

Posted Feb 22, 2025 22:41 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (8 responses)

To the best of my understanding, "NTFS style transactions" means, roughly speaking, "you expose full transaction semantics to userspace, so that userspace can construct transactions with arbitrary combinations of writes, including writes that span multiple files or directories." And then, once it exists and works correctly, you write documentation[1] telling userspace not to use it, supposedly because userspace never really wanted it in the first place (which I find hard to believe, personally).

[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...

bcachefs

Posted Feb 22, 2025 23:23 UTC (Sat) by koverstreet (✭ supporter ✭, #4296) [Link] (6 responses)

> Many applications which deal with "document-like" data tend to load the entire document into memory, operate on it, and then write it back out to save the changes. The needed atomicity here is that the changes either are completely applied or not applied at all, as an inconsistent state would render the file corrupt. A common approach is to write the document to a new file, then replace the original file with the new one. One method to do this is with the ReplaceFile API.

Yeah, I tend to agree with Microsoft :) I'm not aware of applications that would benefit, but if you do know of some please let me know.

I'm more interested in optimizations for fsync overhead.

bcachefs

Posted Feb 22, 2025 23:45 UTC (Sat) by intelfx (subscriber, #130118) [Link] (1 responses)

> but if you do know of some please let me know.

Package managers? Text editors? Basically anything that currently has to do the fsync+rename+fsync dance?

Now, I'm not saying that someone should get on coding userspace transactions yesterday™, but at a glance, there are definitely uses for that.

bcachefs

Posted Feb 27, 2025 10:30 UTC (Thu) by koverstreet (✭ supporter ✭, #4296) [Link]

That fsync already isn't needed on bcachefs (nor ext4, and maybe xfs as well) since we do an implicit fsync on an overwrite rename, where we flush the data but not the journal.

That is, you get ordering, not persistence, which is exactly what applications want in this situation.

bcachefs

Posted Feb 23, 2025 9:37 UTC (Sun) by Wol (subscriber, #4433) [Link]

Depends how much state, across how many files, but (if I understand correctly) I'm sure object based databases could benefit.

I would want to update part of a file (maybe two or three blocks, across a several-meg (or more) file) and the ability to rewrite just the blocks of interest, then flush a new inode or whatever, changing just those block pointers, would be wonderful.

Maybe we already have that. Maybe it's too complicated (as in multiple people trying to update the same file at the same time ...)

Cheers,
Wol

bcachefs

Posted Feb 24, 2025 18:52 UTC (Mon) by tim-day-387 (subscriber, #171751) [Link] (2 responses)

Lustre would benefit from a filesystem agnostic transaction API (at least, in kernel space). The OSD layer is essentially implementing that. We're making a push to get Lustre included upstream and the fate of OSD/ldiskfs/ext4 is one of the big open questions. Having a shared transaction API would make that much easier to answer.

bcachefs

Posted Feb 24, 2025 21:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

How does Lustre currently handle transactions? Especially rollbacks?

It looks like transactions in Lustre are more like an atomic group of operations, rather than something long-lived? I.e. you can't start a transaction, spend 2 hours doing something with it, and then commit it?

bcachefs

Posted Feb 25, 2025 16:49 UTC (Tue) by tim-day-387 (subscriber, #171751) [Link]

Currently, Lustre hooks into ext4 transactions in osd_trans_start() and osd_trans_stop() [1]. So the transactions aren't long-lived and are usually scoped to a single function. Lustre patches ext4 (to create ldiskfs) and interfaces with it directly. But it'd probably be better to have a generic way for filesystems to (optionally) expose these primitives. Infiniband has a concept of kverbs - drivers can optionally expose an interface to in-kernel users. We could do something similar for transaction handling.

[1] https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;...

bcachefs

Posted Feb 23, 2025 1:45 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> And then, once it exists and works correctly

Except this never happened :) I wrote an application that actually used distributed transactions with NTFS and SQL Server, for video file management from CCTV cameras, some time around 2008.

There were tons of corner cases that didn't work quite right. For example, if you created a folder and a file within that folder, then nobody else could create files in that folder until the transaction commits. Because the folder had to be deleted during the rollback.

And this at least made some sense within Windows, as it's a very lock-heavy system. It will make much less sense in a Linux FS.

Performance hit?

Posted Feb 21, 2025 4:43 UTC (Fri) by bmenrigh (subscriber, #63018) [Link] (1 responses)

Does this come at a performance cost (CPU-wise, not disk-wise)? From the description of the details it sounds like it doesn't, or the overhead is too tiny to matter.

Performance hit?

Posted Feb 21, 2025 12:18 UTC (Fri) by mcgrof (subscriber, #25917) [Link]

We did measurements of the impact on existing 4k workloads and found the impact to be within the noise. The more interesting things were about how larger block sizes perform against 4k block-size workloads, and while that was also found to be within the noise, I figured I'd share the old results here [0]. The more interesting results were found once one started to use unbounded IO devices such as pmem [0], part of which led to last year's LSFMM topic "Measuring and improving buffered I/O" [1]. One of the key conclusions of that discussion was the prospect of parallelizing writeback, a topic which has been proposed for this year's LSFMM [2] and for which there are RFC patches out for review.

[0] https://docs.google.com/presentation/d/e/2PACX-1vS6jYbdGD...
[1] https://lwn.net/Articles/976856/
[2] https://lore.kernel.org/all/Z6GAYFN3foyBlUxK@dread.disast...

Can't actually use currently?

Posted Mar 9, 2025 10:42 UTC (Sun) by bmenrigh (subscriber, #63018) [Link]

On kernel 6.13.5 I was able to make an XFS filesystem (to a file) with 8192 byte sector sizes but when I try to mount it via loopback (either with losetup or mount -o loop) I run into:

[ 3713.182037] XFS (loop0): Cannot set_blocksize to 8192 on device loop0

Or trying to specify losetup -b 8192:
[ 3770.934627] Invalid logical block size (8192)

Maybe I'm missing something? Otherwise the support is there in XFS, but not in the block layer (at least not in the loopback part).

