Filesystem support for block sizes larger than the page size
On Linux, the maximum filesystem block size that the kernel can support has always been limited by the host page size, even when the filesystem itself could handle larger blocks. The large-block-size (LBS) patches that were merged for the 6.12 kernel removed this limitation for XFS, decoupling the filesystem block size from the page size. XFS is the first filesystem to gain this support, and other filesystems are likely to add LBS support in the future. In addition, the LBS patches have been used to get the initial atomic-write support into XFS.
LBS is an overloaded term, so it is good to clarify what it means in the context of the kernel. The term LBS is used in this article to refer to places where the filesystem block size is larger than the page size of the system. A filesystem block is the smallest unit of data that the filesystem uses to store file data on the disk. Setting the filesystem block size will only affect the I/O granularity in the data path and will not have any impact on the filesystem metadata.
Long history
The earliest use case for LBS came from the CD/DVD world, where reads and writes had to be performed in 32KB or 64KB chunks. LBS support was proposed to handle these devices and avoid workarounds at the device-driver level, but it was never merged. Beyond these historical needs, LBS enables testing filesystem block sizes larger than the host page size, allowing developers to verify XFS functionality with 64KB blocks on x86_64 systems that do not support 64KB pages. This is particularly valuable given the increasing adoption of architectures with larger page sizes.
Another emerging use case for LBS comes from modern high-capacity solid-state storage devices (SSDs). Storage vendors are increasing their internal mapping unit (commonly called the Indirection Unit or IU) beyond 4KB to support these devices. When I/O operations are not sized for this larger IU, the device must perform read-modify-write operations, increasing the write amplification factor. LBS enables filesystems to match their block size with the device's IU, avoiding these costly operations.
Although LBS sounds like it has something to do with the block layer, the block-size limit in the kernel actually comes from the page cache, not the block layer. The main requirement to get LBS support for filesystems is the ability to track a filesystem block as a single unit in the page cache. Since a block is the smallest unit of data, the page cache should not partially evict a single block during writeback.
There were multiple attempts in the past to add LBS support. The most recent effort from Dave Chinner in 2018 worked around the page-cache limitation by adding the IOMAP_F_ZERO_AROUND flag in iomap. This flag pads I/O operations with zeroes if the size is less than a single block. The patches also removed the writepage() callback to ensure that the entire large block was written during writeback. This effort was not upstreamed because folios, which had the potential to solve the partial-eviction problem directly at the virtual-filesystem layer, were gaining traction.
Large folio support was added to the page cache, and it was enabled in XFS in 5.18. If a filesystem supports large folios, then the page cache will opportunistically allocate larger folios in the read and write paths based on the size of the I/O operation. The filesystem calls mapping_set_large_folios() during inode initialization to enable the large-folios feature. But the page cache could still fall back to allocating an order-0 folio (a single page) if the system is running low on memory, so there is no guarantee on the size of the folios.
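As a rough sketch (mapping_set_large_folios() is the real page-cache API; the surrounding helper and its name are made up for illustration), opting in is a single call when the inode's mapping is set up:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Illustrative helper (not from any particular filesystem): opt this
 * inode's page-cache mapping in to large folios.  The page cache may
 * still fall back to single-page (order-0) folios under memory pressure.
 */
static void example_enable_large_folios(struct inode *inode)
{
	mapping_set_large_folios(inode->i_mapping);
}
```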
The LBS patches, which were developed by me and some colleagues, add a way for the filesystem to inform the page cache of the minimum and maximum order of folio allocation to match its block size. The page cache will allocate large folios that match the order constraints set by the filesystem and ensure that no partial eviction of blocks occurs. The mapping_set_folio_min_order() and mapping_set_folio_order_range() APIs have been added to control the allocation order in the page cache. The order information is encoded in the flags member of the address_space struct.
Setting the minimum folio order is sufficient for filesystems to add LBS support since they only need a guarantee on the smallest folio allocated in the page cache. Filesystems can set the minimum folio order based on the block size during inode initialization. Existing callers of mapping_set_large_folios() will not notice any change in behavior because that function will now set the minimum order to zero and the maximum order to MAX_PAGECACHE_ORDER.
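A minimal sketch of what that looks like, for a hypothetical LBS-aware filesystem whose block size is described by blocksize_bits (the mapping_set_folio_*() calls are the real API; the helper itself is illustrative):

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Illustrative inode-initialization helper: folios in this mapping must
 * cover at least one filesystem block.  Order 0 is PAGE_SIZE, so a 64KB
 * block on a 4KB-page system needs a minimum order of 4.
 */
static void example_set_min_folio_order(struct inode *inode,
					unsigned int blocksize_bits)
{
	unsigned int min_order = 0;

	if (blocksize_bits > PAGE_SHIFT)
		min_order = blocksize_bits - PAGE_SHIFT;

	mapping_set_folio_min_order(inode->i_mapping, min_order);

	/*
	 * Alternatively, both bounds can be set at once; this is also what
	 * mapping_set_large_folios() now does with min = 0:
	 *
	 *   mapping_set_folio_order_range(inode->i_mapping, min_order,
	 *				   MAX_PAGECACHE_ORDER);
	 */
}
```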
Under memory pressure, the kernel will try to break up a large folio into individual pages, which could violate the minimum-folio-order promise. The constraint is that folios in the page cache must never become smaller than the minimum order. Since the 6.8 kernel, the memory-management subsystem has had support for splitting a large folio into folios of any lower order. LBS support uses this feature to maintain the minimum folio order in the page cache even when a large folio is split due to memory pressure or truncation.
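Conceptually, that means any split of a page-cache folio is clamped to the mapping's minimum order rather than going all the way down to single pages. A rough sketch of the idea, with locking and error handling omitted (mapping_min_folio_order() is the real accessor; the exact shape of the split helper is assumed here):

```c
#include <linux/huge_mm.h>
#include <linux/pagemap.h>

/*
 * Illustration only: when splitting a locked page-cache folio, never go
 * below the minimum order that the filesystem requested for this mapping.
 */
static int example_split_folio(struct folio *folio, unsigned int new_order)
{
	unsigned int min_order = mapping_min_folio_order(folio->mapping);

	if (new_order < min_order)
		new_order = min_order;

	return split_folio_to_order(folio, new_order);
}
```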
Other filesystems?
Readers might be wondering if it will be trivial to add LBS support to other filesystems since the page cache infrastructure is now in place. The answer unfortunately is: it depends. Even though the necessary infrastructure is in place in the page cache to support LBS in any filesystem, the path toward adding this support depends on the filesystem implementation. XFS has been preparing for LBS support for a long time, which resulted in LBS patches requiring minimal changes in XFS.
A filesystem needs to support large folios before it can support LBS, so any filesystem that uses buffer heads in the data path cannot support LBS at the moment. The XFS developers moved away from buffer heads and designed iomap to address the shortcomings of buffer heads. While there is work underway to support large folios in buffer heads, it might take some time before that is added. Once a filesystem supports large folios, adding LBS support is mostly a matter of finding the corner cases that assume the block size is no larger than the page size.
LBS has already found a use case in the kernel. The guarantee that the page-cache memory representing a filesystem block will not be split has been used by the atomic-write support added to XFS for 6.13. The initial XFS support only allows writing a single filesystem block atomically, so the filesystem must be formatted with a block size equal to the desired atomic-write size.
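From user space, such a write is requested with the RWF_ATOMIC flag to pwritev2(). A minimal sketch, assuming a filesystem formatted with a 16KB block size and a libc new enough to expose the flag (the file name and sizes are illustrative, and error handling is abbreviated):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* value from the kernel's uapi/linux/fs.h */
#endif

int main(void)
{
	size_t blocksize = 16384;	/* assumed filesystem block size */
	void *buf;
	int fd;

	fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || posix_memalign(&buf, blocksize, blocksize))
		return 1;
	memset(buf, 'x', blocksize);

	struct iovec iov = { .iov_base = buf, .iov_len = blocksize };

	/* Either the whole block reaches the file, or none of it does. */
	if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) < 0)
		perror("pwritev2(RWF_ATOMIC)");

	close(fd);
	free(buf);
	return 0;
}
```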
The next focus in the LBS project is to remove the logical-block-size restriction for block devices. Like the filesystem block size, the logical block size, which is the smallest unit that a storage device can address, is restricted to the host page size due to limitations in the page cache. Block devices cache data in the page cache when applications perform buffered I/O directly on the devices, and they use buffer heads by default to interact with the page cache. So large-folio support is needed in buffer heads to remove this limitation for block devices.
Many core changes, such as large-folio support, XFS's switch from buffer heads to iomap, and multi-page bvec support in the block layer, took place in the 17 years since the first LBS attempt. As a result, LBS support could be added with relatively few changes. With that done for 6.12, XFS finally supports all of the features that it had on IRIX before it was ported to Linux in 2001.
[I would like to thank Luis Chamberlain and Daniel Gomez for their contributions to both this article and the LBS patches. Special thanks to Matthew Wilcox, Hannes Reinecke, Dave Chinner, Darrick Wong, and Zi Yan for their thorough reviews and valuable feedback that helped shape the LBS patch series.]
| Index entries for this article | |
|---|---|
| GuestArticles | Raghav, Pankaj |
Posted Feb 20, 2025 21:16 UTC (Thu)
by jmalcolm (subscriber, #8876)
[Link] (2 responses)
My point is not that IRIX is better than Linux. Only that, if you consider what your home PC would have looked like in 2000, it is kind of mind blowing how advanced the software already was back then. Which makes sense of course because, though it would not hold a candle to today, the proprietary UNIX systems were designed for money-no-object hardware that pushed the technology of the time as far as it could go.
Posted Feb 21, 2025 1:20 UTC (Fri)
by gerdesj (subscriber, #5446)
[Link] (1 responses)
reflinks?
XFS nowadays has holes in the knees of its jeans, which it mock derides as "ironic". For me XFS is a safe haven for data, especially for large files.
Veeam has supported reflinks for some time now and it is a bit of a game changer when you are dealing with backups that have a full and incrementals. A "synthetic full" can be generated within seconds by shuffling reflinks instead of blocks. Add a flag to cp and you can create a sort of clone within seconds - a bit like a volume snapshot but for individual files.
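As a minimal illustration of the cloning being described (the cp flag in question is --reflink; underneath, it amounts to a single FICLONE ioctl, supported by XFS and Btrfs, among others):

```c
/* Minimal sketch: reflink-clone src into dst without copying any data. */
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONE */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s src dst\n", argv[0]);
		return 1;
	}

	int src = open(argv[1], O_RDONLY);
	int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Both files now share the same extents; no data is copied. */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("ioctl(FICLONE)");
		return 1;
	}

	close(src);
	close(dst);
	return 0;
}
```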
The real beauty of Linux and other promiscuous Unixes is that we have a lot of choice and sometimes we as sysadmins pick the right one for the job in hand.
Windows has vFAT, NTFS and ReFS and that's your lot. To be fair ReFS is shaping up nicely these days (it doesn't kick your puppies quite so often) and NTFS is very, very stable - regardless of how much you abuse it. vFAT is FAT - keep it simples.
Apples have files and I'm sure they are lovely.
Posted Feb 21, 2025 4:38 UTC (Fri)
by bmenrigh (subscriber, #63018)
[Link]
Posted Feb 20, 2025 21:44 UTC (Thu)
by koverstreet (✭ supporter ✭, #4296)
[Link] (12 responses)
bcachefs now has it in for-next, so it should land in 6.15:
https://evilpiepirate.org/git/bcachefs.git/commit/?h=for-...
Posted Feb 21, 2025 15:16 UTC (Fri)
by DemiMarie (subscriber, #164188)
[Link] (11 responses)
Posted Feb 21, 2025 18:10 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
The problem is not in the filesystem itself, where transactions are reasonably easy, but in the VFS layer and the page cache. There's no easy way to impose isolation between parts of the page cache, and rollbacks are even more tricky.
Posted Feb 21, 2025 20:05 UTC (Fri)
by koverstreet (✭ supporter ✭, #4296)
[Link] (9 responses)
Unfamiliar with NTFS style transactions.
Posted Feb 22, 2025 22:41 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (8 responses)
[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...
Posted Feb 22, 2025 23:23 UTC (Sat)
by koverstreet (✭ supporter ✭, #4296)
[Link] (6 responses)
Yeah, I tend to agree with Microsoft :) I'm not aware of applications that would benefit, but if you do know of some please let me know.
I'm more interested in optimizations for fsync overhead.
Posted Feb 22, 2025 23:45 UTC (Sat)
by intelfx (subscriber, #130118)
[Link] (1 responses)
Package managers? Text editors? Basically anything that currently has to do the fsync+rename+fsync dance?
Now, I'm not saying that someone should get on coding userspace transactions yesterday™, but at a glance, there are definitely uses for that.
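The "dance" in question, for reference, looks roughly like this (a minimal sketch with error handling omitted; the names are illustrative):

```c
/* Sketch of the fsync+rename+fsync pattern: atomically replace a file's
 * contents so that readers see either the old or the new version. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int replace_file(const char *dir, const char *name, const char *data)
{
	char tmp[4096], dst[4096];

	snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
	snprintf(dst, sizeof(dst), "%s/%s", dir, name);

	int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	write(fd, data, strlen(data));
	fsync(fd);		/* 1: make sure the new data is on disk */
	close(fd);

	rename(tmp, dst);	/* 2: atomically swap in the new file */

	int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	fsync(dirfd);		/* 3: persist the directory entry itself */
	close(dirfd);
	return 0;
}
```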
Posted Feb 27, 2025 10:30 UTC (Thu)
by koverstreet (✭ supporter ✭, #4296)
[Link]
That is, you get ordering, not persistence, which is exactly what applications want in this situation.
Posted Feb 23, 2025 9:37 UTC (Sun)
by Wol (subscriber, #4433)
[Link]
I would want to update part of a file (maybe two or three blocks, scattered across a several-meg (or more) file); the ability to rewrite just the blocks of interest, then flush a new inode or whatever that changes just those block pointers, would be wonderful.
Maybe we already have that. Maybe it's too complicated (as in multiple people trying to update the same file at the same time ...)
Cheers,
Wol
Posted Feb 24, 2025 18:52 UTC (Mon)
by tim-day-387 (subscriber, #171751)
[Link] (2 responses)
Posted Feb 24, 2025 21:37 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
It looks like transactions in Lustre are more like an atomic group of operations, rather than something long-lived? I.e. you can't start a transaction, spend 2 hours doing something with it, and then commit it?
Posted Feb 25, 2025 16:49 UTC (Tue)
by tim-day-387 (subscriber, #171751)
[Link]
[1] https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;...
Posted Feb 23, 2025 1:45 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Except this never happened :) I wrote an application that actually used distributed transactions with NTFS and SQL Server, for video file management from CCTV cameras, some time around 2008.
There were tons of corner cases that didn't work quite right. For example, if you created a folder and a file within that folder, then nobody else could create files in that folder until the transaction commits. Because the folder had to be deleted during the rollback.
And this at least made some sense within Windows, as it's a very lock-heavy system. It will make much less sense in a Linux FS.
Posted Feb 21, 2025 4:43 UTC (Fri)
by bmenrigh (subscriber, #63018)
[Link] (1 responses)
Posted Feb 21, 2025 12:18 UTC (Fri)
by mcgrof (subscriber, #25917)
[Link]
[0] https://docs.google.com/presentation/d/e/2PACX-1vS6jYbdGD...
Posted Mar 9, 2025 10:42 UTC (Sun)
by bmenrigh (subscriber, #63018)
[Link]
[ 3713.182037] XFS (loop0): Cannot set_blocksize to 8192 on device loop0
Or trying to specify losetup -b 8192:

[ 3770.934627] Invalid logical block size (8192)

Maybe I'm missing something? Otherwise, the support is there in XFS, but the block layer (at least the loopback part) doesn't yet have the support for it to work.