
Flexible data placement

By Jake Edge
May 2, 2025

LSFMM+BPF

At the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit (LSFMM+BPF), Kanchan Joshi and Keith Busch led a combined storage and filesystem session on data placement, which concerns how the data on a storage device is actually written. In a discussion that hearkened back to previous summits, the idea is to give hints to enterprise-class SSDs to help them make better choices on where the data should go; hinting was most recently discussed at the summit in 2023. If SSDs can group data with similar lifetimes together, it can lead to longer life for the devices, but the details still need to be worked out.

Joshi began by noting that the logical placement of data provided by the host system is not the same as the physical placement of it on the device. There is a question of where the placement decision is made; if there is a data creator and multiple layers between it and the device (e.g. filesystem, device mapper), it is the piece that is closest to the device that ultimately decides where the data goes, he said. Currently, data is generally written sequentially because there is a single append point in a single open erase block on the device.

[Kanchan Joshi]

Flexible data placement (FDP) is an NVMe SSD feature that allows writes to be tagged to indicate whether they should be grouped together or not. SSDs with FDP can have multiple append points in separate erase blocks in order to group the data based on its tag. It is not an error, however, to write data that is untagged or that carries an invalid tag. It is an open question whether the applications or the layers in between, like filesystems, should be deciding which tags to apply; the device itself does not care, but if the data is tagged, it "can get grouped as originally intended", Joshi said.

Busch said that SSDs generally have a lot of resources to do things in parallel, but that "without any hints, it's not going to know what the separation should be". Hints would allow multiple applications to be writing without sharing the resources. These hints will also help reduce write amplification because data with the same lifetime can be placed (and updated or erased) together.

Josef Bacik said that he knew Busch had run some experiments using Btrfs that would group data writes separately from metadata writes, but that the performance improvements were not that large. Busch agreed, noting that simply separating data from metadata was not granular enough. It would be better if the B-tree writes could be separated from the journal writes, for example.

Bacik suggested that he could tag based on the subvolume ID for a write, which might improve things. Providing a different tag for data based on grouping operations with similar write-time-to-discard (or -overwrite) characteristics will make a difference, Busch said. Bacik asked about the number of tag values available. Busch said the tag is 16 bits, but that today's hardware does not make use of all of them; the range is from eight tags up to the low hundreds. Running out of tags is a real possibility, depending on how they are allocated.

Should filesystem developers just start using the feature and hope that it is helping out, Bacik asked. Previous efforts of this sort lacked any kind of feedback mechanism, Busch said, but the current protocols do provide ways to get an SSD endurance measure, which allows running experiments to see if the tagging is helping. The numbers provided measure the write amplification of a workload, he said; the number of bytes written to the device is reported along with the amount of data that was actually written to get those bytes on the media. Testing should probably be done with identical workloads on two systems, one without tagging and one with a tagging scheme.
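The comparison Busch described can be sketched in a few lines of Python. This is only an illustration: the `EnduranceSample` class, its field names, and the counter values are made up for the example, and how the two counters are actually read from a drive is not covered here. The ratio of media bytes to host bytes over an interval is the write amplification for that interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnduranceSample:
    """One reading of the two counters (hypothetical field names)."""
    host_bytes_written: int   # bytes the host asked the device to write
    media_bytes_written: int  # bytes the device actually wrote to the media


def write_amplification(before: EnduranceSample, after: EnduranceSample) -> float:
    """Write amplification over the interval between two samples."""
    host = after.host_bytes_written - before.host_bytes_written
    media = after.media_bytes_written - before.media_bytes_written
    if host <= 0:
        raise ValueError("no host writes recorded in the interval")
    return media / host


# Made-up counters for an untagged and a tagged run of the same workload.
untagged = write_amplification(EnduranceSample(0, 0), EnduranceSample(1_000, 2_500))
tagged = write_amplification(EnduranceSample(0, 0), EnduranceSample(1_000, 1_250))
print(f"untagged WAF={untagged:.2f}, tagged WAF={tagged:.2f}")
```

A value close to 1.0 means the device wrote little beyond what the host requested; a tagging scheme that helps should move the number toward 1.0 for the same workload.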

All of the SSDs that Meta (where both Busch and Bacik work) is buying have FDP, Busch said, though it is not enabled by default. It takes a few minutes to configure it for a new drive. All of the major vendors are supporting the feature, but each does it a little differently, so the amount of improvement will vary between SSD types.

[Keith Busch]

Chris Mason asked if there had been testing done with other filesystems that have a journal, such as ext4 or XFS. "There are filesystems with a really specific lifetime for a very heavily used part of the SSD and Btrfs is not one." Busch said that the filesystem testing had not been particularly extensive; the focus shifted to the applications, but he agreed that tagging journal writes for those filesystems might have a big impact.

Ted Ts'o said that there had been some testing done long ago, perhaps by Martin Petersen, that would separate database log writes or filesystem journal writes for a particular type of storage device. The results were encouraging, but the storage devices were expensive and hard to obtain, which meant that developers did not have much of a chance to experiment.

Petersen said that the FDP model is fine for use cases where applications or filesystems have tagging added based on known workloads on a particular SSD model. "But that's not really a good model for all the other use cases". The reason that earlier hinting efforts failed also exists for FDP: there are, say, eight tags that need to somehow be split up between the various filesystems that are on the device. The problem is that the tag values are a scarce resource. If they were not, things could always be tagged based on how the data should be grouped, but FDP and the earlier mechanisms are not general enough; each drive, filesystem, and application combination has to be tested individually to see what works.

Zach Brown said that he was happy to see the increased visibility that FDP devices are providing; "we are so used to devices being shitty black boxes". Busch agreed, noting that it is likely that the lack of feedback is part of what caused earlier hinting efforts to fail.

The current patches do not yet plumb the FDP feature through the rest of the system, Busch said. Instead, there is a passthrough that allows user space to write commands directly to the NVMe device, "so you have full control there". It exposes the number of tags available as max_write_streams for the disk in sysfs. "The passthrough interface is not a particularly pleasant interface to use", he said. NVMe commands have to be constructed in user space, which is not the right level of abstraction. So there is a new write-stream command for io_uring that provides a nicer way to access the feature. It is only available for direct I/O and he questions whether it even makes sense to hook it up for buffered I/O.
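As a rough illustration of checking for the feature from user space, the sketch below reads the stream count for a disk. The exact attribute location used here (`queue/max_write_streams` under the disk's sysfs directory) is an assumption, as is the `nvme0n1` device name; the talk only established that the value is exposed in sysfs for the disk.

```python
from pathlib import Path


def max_write_streams(disk: str, sysfs_root: str = "/sys/block") -> int:
    """Return the number of write streams advertised for `disk`, or 0.

    The attribute path (<sysfs_root>/<disk>/queue/max_write_streams) is an
    assumption made for this sketch.
    """
    attr = Path(sysfs_root) / disk / "queue" / "max_write_streams"
    try:
        return int(attr.read_text().strip())
    except (OSError, ValueError):
        # Missing attribute, unreadable file, or unexpected contents:
        # treat the disk as having no write streams.
        return 0


print(max_write_streams("nvme0n1"))
```

Returning 0 when the attribute is absent lets a caller treat older kernels and drives without FDP the same way: no streams, so writes go out untagged.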

Joshi said that connecting up FDP to filesystem I/O in the iomap interface is still an unsolved problem. There are plans of that sort which will need to be discussed, he said. Once a filesystem is mounted, it will own the block device and, thus, will see and can manage all of the write streams (tags). A filesystem can support application-managed write streams, with a mapping to the hardware tags based on its rules, or it can manage the streams itself directly. Those two may not be mutually exclusive, so a filesystem could choose to support a hybrid mode.

It will require a new user interface, as well as per-filesystem enablement, even just for the application-managed case, he said. There is also the question of whether there should be generic application-management code for write streams, so that each filesystem does not need to implement its own.

Stephen Bates asked whether the write streams were per-NVMe-namespace or global to the device. They are global, Busch said, but subsets of the tag values can be specifically assigned to a namespace.

Busch said that he had hoped to use the existing write-hints interface, which is currently a no-op, for FDP. Christoph Hellwig said that the filesystems should still be involved in order to map the application-provided tag values properly. That requires lots of work in the filesystems, Busch said, while the write-hints option does not; it would be better for the filesystems to be involved, but it is an easier path, with some of the benefits, to leave it to the applications.

Bacik would rather not put Btrfs, for example, in the middle as the arbiter of the tags, but filesystems may want to also use the tags. Busch said that filesystems could reserve some of the tag space for themselves if they are the arbiter, but even that may result in tag collisions with filesystems on other partitions. For that reason, Bacik would like the block layer to do the arbitration.

Petersen said that there had been experiments done along the way where filesystems would tag writes based on attributes, such as journal writes versus random writes. A scheduler was added to do the mapping from the filesystem tags to the hardware streams. "It worked. It wasn't pretty, but it had the benefit of being flexible, because you could change that scheduler to match your application workload". Now, developers could do something similar with a BPF program as a scheduler that was tuned for the workload.
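The kind of swappable mapping Petersen described might look something like the following sketch. It is not the code from those experiments; the write classes, the `DEFAULT_TABLE` contents, and the convention that stream 0 means "untagged" are all assumptions made for illustration. The policy folds the preferred streams onto however many hardware streams the device offers.

```python
from typing import Callable, Dict

# Hypothetical write classes a filesystem might attach to its I/O, each with
# a preferred stream number. Stream 0 means "untagged" in this sketch.
DEFAULT_TABLE: Dict[str, int] = {
    "journal": 1,
    "metadata": 2,
    "seq": 3,
    "rand": 4,
}

# A policy maps (write class, number of hardware streams) to a stream.
Policy = Callable[[str, int], int]


def table_policy(table: Dict[str, int]) -> Policy:
    """Build a policy from a class -> preferred-stream table."""
    def pick(write_class: str, nr_streams: int) -> int:
        preferred = table.get(write_class, 0)
        if preferred == 0 or nr_streams <= 0:
            return 0  # unknown class or no streams: leave the write untagged
        # Fold preferred streams onto the streams the hardware really has.
        return (preferred - 1) % nr_streams + 1
    return pick


pick = table_policy(DEFAULT_TABLE)
for write_class in ("journal", "metadata", "seq", "rand", "other"):
    print(write_class, pick(write_class, 2))
```

Because the policy is just a function, tuning for a different workload means supplying a different table or a different `pick` implementation, which is the flexibility the scheduler approach was credited with.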

One of his complaints about the application-driven hints is that the kernel already has the knowledge that it needs. The application should not have to tell the kernel that these writes are for an application and are bound for a particular file—"we know that, we're the kernel". The user interface should not be tied to the hardware; "we can, in the kernel, schedule resources, that's why we exist".

There are applications that could make use of multiple streams within a single file, though, Javier González said. Joshi agreed, saying that the filesystem and block layer cannot necessarily make the best decisions on the allocation of the write streams. Bacik said that he did not want the filesystems getting in the way of applications doing what they need to do.

Ts'o said that the small number of streams available is what makes things difficult; if there were an infinite number, for the sake of argument, the inode number could be used as the tag "and let the block layer sort it all out". Since that is not the case, something needs to allocate the streams, which may mean denying them to some requesters; every application will claim that its files are the most important, of course.
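The allocation problem Ts'o raised can be made concrete with a small sketch. The `StreamAllocator` class below is hypothetical and uses a simple first-come, first-served rule, which is only one of many possible policies; requesters are identified by an integer key such as an inode number, and a denied requester gets `None`, meaning its writes would go out untagged.

```python
from typing import Dict, List, Optional


class StreamAllocator:
    """Hand out a fixed pool of write streams, first come, first served."""

    def __init__(self, nr_streams: int) -> None:
        # Streams are numbered 1..nr_streams; pop() yields the lowest first.
        self._free: List[int] = list(range(nr_streams, 0, -1))
        self._owner: Dict[int, int] = {}  # requester key -> stream

    def acquire(self, key: int) -> Optional[int]:
        """Return the stream for `key`, or None if the pool is exhausted."""
        if key in self._owner:
            return self._owner[key]
        if not self._free:
            return None  # denied: this requester's writes stay untagged
        stream = self._free.pop()
        self._owner[key] = stream
        return stream

    def release(self, key: int) -> None:
        """Give the stream held by `key`, if any, back to the pool."""
        stream = self._owner.pop(key, None)
        if stream is not None:
            self._free.append(stream)


alloc = StreamAllocator(nr_streams=2)
for inode in (100, 200, 300):
    print(inode, alloc.acquire(inode))
```

With two streams and three requesters, the third is turned away until one of the first two releases its stream, which is exactly the arbitration question the session kept returning to.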

The session got a little chaotic toward the end, with multiple sub-conversations that made it hard to follow, perhaps because it had run well into the subsequent break. It turned to ways to measure how well (or poorly) the system is tagging its data. The measurements that can be used are the write-amplification information from the drive along with the 99th-percentile (p99) write latencies, both of which should decrease when data with similar lifetimes is being grouped correctly, Busch said. Joshi finished the session by briefly going over some of the results that came from a recent paper on FDP.
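For the latency side of that measurement, a p99 value can be computed from a list of per-write latencies. The sketch below uses the nearest-rank definition of a percentile, which is one common choice among several; the `percentile` helper and the sample values are made up for the example.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct
    percent of all samples at or below it. `pct` is an integer in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Made-up write latencies in microseconds: 1, 2, ..., 100.
latencies_us = list(range(1, 101))
print("p99 write latency:", percentile(latencies_us, 99), "us")
```

Comparing this number between an untagged run and a tagged run of the same workload, alongside the write amplification reported by the drive, gives the two signals mentioned in the session.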


Index entries for this article
Kernel: Block layer
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



Too many layers of indirection?

Posted May 2, 2025 23:06 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (4 responses)

To me, this looks like there are too many layers of indirection. I expect that the heaviest I/O workloads are things like cluster file systems, databases, and object storage servers. Wouldn’t it be better for these programs to talk directly to the storage devices, bypassing filesystems or even the kernel? They know more about where the data should go than the kernel ever will, because they have knowledge the kernel will never have. Ceph already switched from Filestore to Bluestore because filesystems didn’t provide the features and performance it needed.

Too many layers of indirection?

Posted May 3, 2025 2:04 UTC (Sat) by jason.rahman (subscriber, #120510) [Link] (1 responses)

Many do in fact skip the file system. Why do you think the io_uring pass through APIs for NVMe were a thing in the first place? A lot of these sorts of applications won't see the light of day, being hidden at the bottom of the stack for the largest hyperscalers and storage appliance vendors.

Too many layers of indirection?

Posted May 3, 2025 4:35 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

Are there any serious open-source users of these APIs?

Too many layers of indirection?

Posted May 3, 2025 4:28 UTC (Sat) by butlerm (subscriber, #13312) [Link] (1 responses)

If you need something bad enough, or you need it to be portable, you do it yourself. One prominent database was originally designed to use multiple processes that do only blocking I/O against *raw block devices* for both reads and writes from a memory segment shared by all database processes for that instance. Multithreading was an afterthought, and if I am not mistaken is still the exception rather than the rule for that database.

Anyway, as a consequence the database has more or less complete control over all I/O ordering and buffering as long as there are no other non-trivial I/O-intensive applications on the machine that do not use the database for that, which is more often than not the case. In that mode - which is similar to O_DIRECT except simpler - it generally doesn't use the buffer cache or the filesystem at all, except to start up and log error and auditing information. Control files, redo logs, rollback segments, (table, index, system, and temporary) tablespaces can go and apparently were originally intended to go in raw block devices on as many spindles as possible for raw performance.

And there are well known mainframe relational databases that operate pretty much the same way, and this particular database does run on IBM mainframes, and probably has almost from the beginning. The original version was written in assembly language for the PDP-11 with 128K of RAM in the late 70s, and it took quite a while for other relational databases to more or less catch up, and arguably in some ways they still haven't, other (sometimes major) advantages for most well known rivals aside.

Too many layers of indirection?

Posted May 3, 2025 8:21 UTC (Sat) by Wol (subscriber, #4433) [Link]

That sounds to me like OS/400 "the filesystem is the database".

Could equally be Pick - the permanent storage was accessed as if it was virtual memory. Don't know if it still does (I think it does) but D3 (and successors) certainly used to allow you to put the system on a raw device, even if you chose to put the user accounts on a normal filesystem.

Cheers,
Wol

Zoned storage

Posted May 3, 2025 17:49 UTC (Sat) by bvanassche (subscriber, #90104) [Link] (2 responses)

There are two approaches for improving SSD performance: zoned storage and FDP. In this presentation there is a chart that shows that the performance with zoned storage (ZNS) is better than FDP: Javier González, NAND Data Placement Landscape, Trade-Offs, and Direction, 2023. Zoned storage is supported by multiple Linux kernel filesystems.

Zoned storage

Posted May 5, 2025 13:23 UTC (Mon) by willy (subscriber, #9762) [Link] (1 responses)

Which filesystems? Other than ZoneFS, obviously. Looks like just btrfs and f2fs to me. Technically, that's "multiple", because it's more than 1, but that seems carefully worded to make it sound more widely supported than it really is.

Zoned storage

Posted May 5, 2025 16:38 UTC (Mon) by bvanassche (subscriber, #90104) [Link]

In addition to the filesystems that have already been mentioned, support for zoned block devices has been added recently in XFS. Zoned block device support is on the bcachefs roadmap.

A bit "niche"?

Posted May 3, 2025 21:43 UTC (Sat) by marcH (subscriber, #57642) [Link] (1 responses)

> Testing should probably be done with identical workloads on two systems, one without tagging and one with a tagging scheme.

... and then the results are likely valid only for a workload very similar to the tested one. While the numerous layers of indirections and APIs add more "chaos" in the mathematical sense.

The past history of a storage device could also affect the measurements? What is the opposite of "robust" in this context?

So this all sounds like it will never be "general purpose"? Only for very controlled environments like appliances or very specific workloads in datacenters (database,..). Nothing wrong with that; just wondering.

A bit "niche"?

Posted May 5, 2025 14:25 UTC (Mon) by DemiMarie (subscriber, #164188) [Link]

That’s why I think it would be better to use flexible data placement with NVMe or PCIe passthrough. The command set appears to assume that one application owns the whole device and has complete control of where data is written. That only makes sense with passthrough.

Replication, too!

Posted May 4, 2025 20:09 UTC (Sun) by jreiser (subscriber, #11027) [Link]

It can be helpful to consider not only improved placement, but also improved replication. For example, something like the first 1/log(N) of a large file should be replicated automatically, and dynamically adjusted as the file grows. This can support data recovery as well as improved throughput and reduced access time for the beginning blocks of the file, which often average more lifetime accesses than any particular random block. For another example, any identifiable (or heuristically guessed) index within a file also can have better resilience if replicated. The sore-thumb case is a .zip archive, whose index (table of contents) is at the end of the file. Running out of space for the index (Linux EDQUOT) often is disastrous. The opportunity to create automatically a partial index can be guessed by observing the Read-Only open() operations on other files.

obligatory 🦀

Posted Aug 9, 2025 20:40 UTC (Sat) by Rudd-O (guest, #61155) [Link]

> data with the same lifetime

wen rust compiler automatically inferring data lifetimes in kernel?

;-)


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds