
Parallelizing filesystem writeback

By Jake Edge
June 12, 2025

LSFMM+BPF

Writeback for filesystems is the process of flushing the "dirty" (written) data in the page cache to storage. At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Anuj Gupta led a combined storage and filesystem session on some work that has been done to parallelize the writeback process. Some of the performance problems that have been seen with the existing single-threaded writeback came up in a session at last year's summit, where the idea of doing writeback in parallel was discussed.

Gupta began by noting that Kundan Kumar, who posted the topic proposal, was supposed to be leading the session, but was unable to attend. Kumar and Gupta have both been working on a prototype for parallelizing writeback; the session was meant to gather feedback on it.

Currently, writeback for buffered I/O is single-threaded, though applications are issuing multithreaded writes, which can lead to contention. The backing storage device is represented in the kernel by a BDI (struct backing_dev_info), and each BDI has a single writeback thread that processes the struct bdi_writeback embedded in it. Each bdi_writeback has a single list of inodes that need to be written back and a single delayed_work instance, which makes the process single-threaded.
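In rough outline, the relevant structures look like the following; this is a heavily abridged sketch of the definitions in include/linux/backing-dev-defs.h, with most fields (and the cgroup-writeback machinery) omitted:

    /* Abridged sketch of the current structures; most fields omitted. */
    struct bdi_writeback {
            struct list_head b_dirty;       /* dirty inodes awaiting writeback */
            struct list_head b_io;          /* inodes parked for writeback */
            struct list_head b_more_io;     /* inodes needing more writeback */
            spinlock_t list_lock;           /* protects the b_* lists */
            struct delayed_work dwork;      /* the single writeback work item */
            /* ... */
    };

    struct backing_dev_info {
            struct bdi_writeback wb;        /* one writeback context per device */
            /* ... */
    };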

[Photo of Anuj Gupta]

Based on ideas from Dave Chinner and Jan Kara in the topic-proposal thread, he and Kumar added a struct bdi_writeback_ctx to contain context information for a bdi_writeback, Gupta said. An array of the writeback contexts was added to the BDI, which allows for multiple threads doing writeback for a BDI. That can be seen in a patch series posted by Kumar at the end of May. They see two levels of parallelism that could be added, but focused on the high-level parallelism in the prototype, leaving low-level parallelism for later work.
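The overall shape of that change might look something like the sketch below; the member names here are illustrative guesses rather than the ones actually used in Kumar's series:

    /* Hypothetical sketch of the prototype's layout; names are illustrative. */
    struct bdi_writeback_ctx {
            struct bdi_writeback wb;                /* own inode lists and delayed_work */
    };

    struct backing_dev_info {
            int nr_wb_ctx;                          /* number of writeback contexts */
            struct bdi_writeback_ctx **wb_ctx;      /* one entry per writeback thread */
            /* ... */
    };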

The high-level parallelism splits up the work based on the filesystem geometry, Gupta said. For example, XFS has allocation groups that can be used to partition the inodes needing writeback into separate contexts of the BDI. The low-level parallelism could be added by having multiple delayed_work entries in the bdi_writeback; different types of filesystem operations, such as block allocation versus pure overwrites for XFS, could be handled by separate workqueues at that level.
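Purely as an illustration of the high-level split for XFS, each inode could be routed to a writeback context based on its allocation group. The pick_wb_ctx() helper below is made up, and the wb_ctx/nr_wb_ctx members come from the hypothetical layout sketched above; XFS_INO_TO_AGNO() is the existing XFS macro that extracts the allocation-group number from an inode number.

    /*
     * Illustration only: choose a writeback context for an XFS inode by
     * allocation group.  This helper is hypothetical, not taken from the
     * posted series.
     */
    static struct bdi_writeback_ctx *
    pick_wb_ctx(struct backing_dev_info *bdi, struct xfs_mount *mp, xfs_ino_t ino)
    {
            xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, ino);

            return bdi->wb_ctx[agno % bdi->nr_wb_ctx];
    }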

Christoph Hellwig was not sure that allocation groups were the right choice for partitioning (sharding) the inodes for writeback. Gupta said that testing to see what kind of workloads benefit when using the allocation groups will be important. Parallel writeback will be an option; users can disable it if it is not beneficial for their workloads.

Amir Goldstein asked how severe the current bottleneck with single-threaded writeback is. Gupta said that he did not have any numbers on that. Chris Mason said that in his testing it largely depends on whether large folios are available; the easiest way to see performance problems with single-threaded writeback is to turn off large folios for XFS. In some simple testing, he could get around 800MB per second on XFS with large folios disabled before the kernel writeback thread was saturated. With large folios enabled, that number goes to around 2.4GB per second.

Hellwig was curious where the cycles were being spent in the no-large-folios case. Mason said that he did not remember the details, though pages were being moved around a lot to different lists. Matthew Wilcox suggested that the XArray API might be part of the problem, since there is no API to clear the dirty bits on a range of folios. If the kernel is clearing those bits one-by-one, that might be a bottleneck; it could be fixed, but he has not found the time to do so.
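In XArray terms, the pattern Wilcox described looks roughly like the loop below: xa_clear_mark() operates on a single index, so clearing a mark over a range means walking the entries one at a time; a range-clearing call would replace the loop, but no such API exists today.

    /*
     * Roughly the pattern in question: the XArray API only clears a mark
     * at one index per call, so a range has to be walked entry by entry.
     */
    static void clear_mark_range(struct xarray *xa, unsigned long first,
                                 unsigned long last, xa_mark_t mark)
    {
            unsigned long index;
            void *entry;

            xa_for_each_range(xa, index, entry, first, last) {
                    if (xa_get_mark(xa, index, mark))
                            xa_clear_mark(xa, index, mark); /* one index at a time */
            }
    }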

Jeff Layton said that the network filesystems would need some way to limit how much parallelization was happening for writeback. There may be underlying network conditions that will necessitate limiting parallel writeback. Gupta acknowledged that need and Hellwig said that local filesystems may also need a way to do that at times. One thing that Hellwig would like to see change is the Btrfs-internal mechanism for having multiple threads doing writeback with compression and checksumming; having a common solution with multiple writeback threads "would remove a lot of crap". He is concerned that other filesystems will pick up that code, so having a common solution would head that off; Mason said he was not opposed to that idea, though there may be large files that need their writeback handled specially.

Over the remote link, Jan Kara said that what he was hearing was that most of the interest was in the low-level parallelism, which would parallelize the writeback of individual inodes. He cautioned that there are various problems with doing that, including potential problems with data-integrity guarantees, since writes are done in a different order.

Hellwig said that the "global-ish linked lists" that are used to track the inodes for writeback are a data structure that is known to cause scalability problems. Adding more threads to work on those lists will make things worse; he wondered if using an XArray instead would be better. Using allocation groups to shard the inode lists (or XArrays) makes sense for XFS—other filesystems may have their own obvious choice for sharding—but the management of the threads can be separate from the management of the lists. Other criteria could be used to determine the thread to handle a specific inode.
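As a small illustration of that separation, the worker that handles a given inode could be chosen by hashing the inode number, independently of whichever sharded list (or XArray) the inode is tracked in; the helper below is hypothetical.

    /*
     * Hypothetical: pick a writeback worker for an inode by hashing its
     * inode number, regardless of how the dirty-inode lists are sharded.
     */
    static unsigned int wb_worker_for_inode(const struct inode *inode,
                                            unsigned int nr_workers)
    {
            return hash_long(inode->i_ino, 32) % nr_workers;
    }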

Mason agreed and wanted to suggest something similar. For the first version of multithreaded writeback, he said that keeping a single inode list would make sense. Rather than work out how to shard the inodes to different lists, simply "focus on being able to farm out multiple inodes at a time under writeback"; that will help determine where the lock contention is. Gupta was concerned about the lock contention for the inode list, but Kara did not think it would be that significant compared to all of the other work needed for writeback.
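A minimal sketch of that first version, assuming the existing single per-BDI list is kept, is shown below: several workers claim inodes from the shared list under its lock, then write them back outside of it. The writeback_one_inode() helper is a stand-in for the real per-inode writeback path, which would also have to honor the I_SYNC flag and other inode-state rules.

    /*
     * Sketch of "farm out multiple inodes at a time": workers pull inodes
     * off the single shared list under list_lock, then write them back
     * outside the lock.  writeback_one_inode() is a stand-in for the real
     * per-inode writeback path.
     */
    static void wb_worker(struct bdi_writeback *wb)
    {
            struct inode *inode;

            for (;;) {
                    spin_lock(&wb->list_lock);
                    if (list_empty(&wb->b_io)) {
                            spin_unlock(&wb->list_lock);
                            break;
                    }
                    inode = list_first_entry(&wb->b_io, struct inode, i_io_list);
                    list_del_init(&inode->i_io_list);       /* claim the inode */
                    spin_unlock(&wb->list_lock);

                    writeback_one_inode(inode);     /* contention is only on the list */
            }
    }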

Once the basics of the feature are working, Goldstein said, it may make sense to provide a means for filesystems to help choose inodes for the different threads. Filesystems may have information that can assist the writeback machinery to achieve better parallelism.

Mason got in the last word of the session by relaying another bottleneck that he had found in his testing but had forgotten earlier: freeing clean pages. Once dirty pages have been written, those clean pages are freed by the kswapd kernel thread, which results in another bottleneck. He suggested that there may need to be some parallelism added there as well.


Index entries for this article
Kernel: Buffered I/O
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



No parallelism for single files?

Posted Jun 16, 2025 13:11 UTC (Mon) by Homer512 (subscriber, #85295) (3 responses)

If the intent is to parallelize based on inodes, does that mean single large files will get slower relative to work spread out over multiple files?

No parallelism for single files?

Posted Jun 16, 2025 19:38 UTC (Mon) by iq-0 (subscriber, #36655) (2 responses)

> If the intent is to parallelize based on inodes, does that mean single large files will get slower relative to work spread out over multiple files?

I would not expect a significant slowdown from this patch. It changes the workload distribution from multi-producer/single-consumer to multi-producer/multi-consumer. The multi-producer side already causes most of the contention; a small, bounded set of consumers usually will not add much more.

But in certain situations the new patch would allow some writes to different inodes to go faster than is currently possible. And if the workload is properly sharded, it could even reduce contention.

So, provided the writeback queue is properly sharded by inode, a host that only does a lot of parallel writes to one inode should be about as fast as with the current logic, while a host doing a lot of parallel writes to a single inode plus some random other I/O to other inodes might even see a bit of an improvement.

No parallelism for single files?

Posted Jun 17, 2025 8:07 UTC (Tue) by parametricpoly (subscriber, #143903) (1 response)

Parallelism at this level makes sense for SSD drives, but what about spinning rust? Repositioning the heads takes a lot of time.

No parallelism for single files?

Posted Jun 17, 2025 11:42 UTC (Tue) by iq-0 (subscriber, #36655)

Depends on the storage stack. Device-mapper, md, or other RAID setups might also benefit, and remote block devices (high latency) or battery-backed, write-caching HDDs probably will too.

Maybe the solution is sequential performance improvement

Posted Jun 19, 2025 16:36 UTC (Thu) by anton (subscriber, #25547)

Caveat: I don't have any knowledge about the topic except what is written in this article.

Still, I wonder if it would not be more effective to improve sequential performance rather than to try parallelization. Some of the reasons why that might be more promising:

  • The write bandwidth numbers given (0.8GB/s, 2.4GB/s) are far below what a single core on recent hardware is capable of for RAM access (e.g., the 50GB/s copy bandwidth given here, although that's probably only 25GB/s in each direction).
  • Of course in the end the result cannot be faster than what the target device is capable of processing, but if the target device is the limit, parallelization at the CPU level will not help. The numbers given are quite a lot lower than what recent SSDs can process, as long as the SLC cache is not full and the SSD does not perform thermal throttling.
  • The fact that large folios provide such a big speedup indicates that processing overhead is what holds back performance. Apparently large folios reduce some overhead that contributes a lot to the run time. After large folios, some of that overhead may remain, and reducing it further might help; or something else becomes the bottleneck, which should then be eliminated or reduced.
  • Given that the discussion is about parallelizing the writes to a single device, if one takes the parallelization approach, there will be some synchronization overhead between the threads doing the writing.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds