Write-stream IDs

By Jonathan Corbet
April 7, 2015

Storage devices with large physical block sizes — solid-state storage devices (SSDs), for example — are subject to a problem known as "write amplification" that can affect both the performance and lifetime of the device. Applications often have information about the data they write that can be helpful in reducing write amplification problems, but there is currently no way to communicate that information to the relevant parts of the kernel. A new proposed addition to the block-layer API may help to solve that problem in the near future, though.

The kernel typically performs block I/O in units of 4KB, but a typical SSD has an erase-block size of many times that. The firmware in the drive itself performs the impedance matching between the small sector size exposed to the host computer and the real requirements imposed by the hardware. Whenever a sector is written, the firmware must find a home for it in a new erase block, leaving an empty space where the sector used to be. Occasionally, sectors must be shifted and coalesced during a garbage-collection pass to free up the empty spaces for new writes.

"Write amplification" refers to this extra work that must be performed when data is overwritten. It gets worse if short-lived data is mixed with long-lived data in the same erase blocks; garbage collection must happen more often and more data must be moved around. On the other hand, if short-lived data can be kept together, the rewriting of erase blocks and garbage-collection work can be minimized. The kernel could perhaps do this kind of separation for some types of filesystem metadata, but it has no knowledge of how user space plans to use the data that it writes to the filesystem. So, if long-lived user-space data is to be separated from the short-lived variety, user space is going to have to help with the job. That is where Jens Axboe's write-stream IDs patch set comes in.

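The arithmetic behind that claim can be sketched with a toy model. The model is an assumption made for illustration (real flash translation layers are far more involved): reclaiming an erase block means copying its still-valid pages elsewhere, and write amplification is the ratio of total pages written to pages the host asked to write. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Pages the garbage collector must copy to reclaim `blocks` erase blocks,
 * each of which still holds `live_per_block` valid pages.
 */
unsigned gc_copies(unsigned blocks, unsigned live_per_block)
{
    return blocks * live_per_block;
}

/*
 * Toy definition (an assumption, not any vendor's formula):
 * write amplification = (host writes + GC copies) / host writes.
 */
double write_amplification(unsigned host_pages, unsigned gc_copied_pages)
{
    assert(host_pages > 0);
    return (double)(host_pages + gc_copied_pages) / host_pages;
}
```

Suppose an erase block holds 64 pages and the host rewrites 128 pages of short-lived data. If that data was spread over four erase blocks that are each half full of long-lived data, reclaiming them means copying 4 × 32 = 128 live pages, for a factor of (128 + 128) / 128 = 2. If the short-lived data instead filled two erase blocks by itself, nothing needs to be copied and the factor stays at 1.
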
A write-stream ID is simply an eight-bit integer value assigned to block data as it is written. The kernel does not interpret that value in any way other than as a hint that data with the same ID is likely to have approximately the same lifetime. Low-level storage drivers can use this ID to place data with the same life expectancy together on the media, hopefully reducing write-amplification problems in the process.

At the lowest level in the block layer, the stream ID is stored in eight bits of the bi_flags field in the bio structure. It can be set with bio_set_streamid() and queried with bio_get_streamid(). A call to bio_streamid_valid() can be used to determine whether a given bio structure has had its stream ID set; low-level block drivers can use a valid stream ID to instruct the hardware to group similar data on the physical media.
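
As a rough illustration of how eight bits of a flags word can carry the ID, here is a user-space model of the three helpers. The article says only that eight bits of bi_flags are used; the bit position chosen here, the bio_model structure, and the convention that an ID of zero means "not set" are all assumptions made for the sketch, not details taken from the patch set.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed layout: the ID occupies the top eight bits of the flags word. */
#define STREAMID_BITS   8
#define STREAMID_SHIFT  (sizeof(unsigned long) * 8 - STREAMID_BITS)
#define STREAMID_MASK   ((1UL << STREAMID_BITS) - 1)

/* Stand-in for the kernel's struct bio; only the flags word is modeled. */
struct bio_model {
    unsigned long bi_flags;
};

void bio_set_streamid(struct bio_model *bio, unsigned int id)
{
    assert(id <= STREAMID_MASK);
    /* Clear any previous ID, then store the new one; other flags survive. */
    bio->bi_flags &= ~(STREAMID_MASK << STREAMID_SHIFT);
    bio->bi_flags |= (unsigned long)id << STREAMID_SHIFT;
}

unsigned int bio_get_streamid(const struct bio_model *bio)
{
    return (bio->bi_flags >> STREAMID_SHIFT) & STREAMID_MASK;
}

/* Assumption: zero is reserved to mean "no stream ID has been set". */
bool bio_streamid_valid(const struct bio_model *bio)
{
    return bio_get_streamid(bio) != 0;
}
```

Under these assumptions, a driver would check bio_streamid_valid() before passing the value returned by bio_get_streamid() on to the hardware.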

At the user-space level, the stream ID for an open file can be set with the new fadvise(POSIX_FADV_STREAMID) operation. Interestingly, this value is stored in two places: the file structure associated with the given file descriptor, and the inode structure representing the file itself. That might seem like a surprising choice, given that both structures are heavily used and bloating both of them with a new field is not something to be done lightly, but there is a reason for it.

When an application performs direct I/O, the data being written will be placed in a bio structure immediately and passed to the block layer. The file structure corresponding to the file descriptor passed by user space is available then, so the stream ID stored in that file structure can be copied directly into the bio structure.

That option is not available for buffered I/O, though. A buffered write() will simply copy the data into the page cache and mark the relevant page(s) dirty; the actual I/O on those pages will happen at some future time. By the time that the writeback code gets around to those pages, the file structure used to initiate the write may no longer exist. Even if it is still around, though, it is not readily accessible at that level of the kernel. But the inode structure is accessible. So, in this case, the stream ID must be taken from the inode structure.
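
The choice between the two stored copies can be modeled in a few lines of user-space C. This is only a sketch of the logic described above; the structure, field, and function names are invented for illustration and are not taken from the patch set.

```c
#include <assert.h>

/* Stand-ins for the kernel structures; only the stream ID is modeled. */
struct inode_model {
    unsigned char i_streamid;
};

struct file_model {
    unsigned char f_streamid;
    struct inode_model *inode;
};

/* Setting the ID on an open file records it in both places. */
void model_set_streamid(struct file_model *f, unsigned char id)
{
    assert(f && f->inode);
    f->f_streamid = id;
    f->inode->i_streamid = id;
}

/* Direct I/O: the file structure is at hand when the bio is built. */
unsigned char streamid_for_direct_io(const struct file_model *f)
{
    return f->f_streamid;
}

/* Buffered I/O: writeback can only reach the inode. */
unsigned char streamid_for_writeback(const struct inode_model *inode)
{
    return inode->i_streamid;
}
```

With two file_model instances pointing at one inode_model, each keeps its own ID for direct I/O, while the inode holds whichever ID was set most recently.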

One might ask why the inode-stored stream ID is not used all of the time. The patches are silent on this point, but the probable answer is that the more direct control afforded by storing the ID in the file structure is worth having when it is possible. It allows the stream ID to be changed from one I/O operation to the next; different file descriptors referring to the same on-disk file can also have different stream IDs. This flexibility is not available when doing buffered I/O and using the stream ID stored in the inode structure; since it's not possible to know when the actual writeback will happen, the stream ID cannot be changed between writes without the likelihood of affecting writes intended to go under a different ID.
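
The buffered-I/O limitation can be seen in a small simulation. This is a deliberately simplified model, with hypothetical names throughout: dirty pages are remembered without any per-page stream ID, and writeback tags each of them with whatever ID the inode holds at that moment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIRTY 16

struct inode_model {
    unsigned char i_streamid;
    int dirty[MAX_DIRTY];      /* indexes of pages awaiting writeback */
    size_t nr_dirty;
};

void set_streamid(struct inode_model *inode, unsigned char id)
{
    inode->i_streamid = id;
}

/* A buffered write only marks the page dirty; no ID is stored with it. */
void buffered_write(struct inode_model *inode, int page)
{
    assert(inode->nr_dirty < MAX_DIRTY);
    inode->dirty[inode->nr_dirty++] = page;
}

/*
 * Writeback: tags[i] receives the stream ID used for the i-th dirty page.
 * Returns the number of pages written back.
 */
size_t writeback(struct inode_model *inode, unsigned char *tags)
{
    size_t n = inode->nr_dirty;

    for (size_t i = 0; i < n; i++)
        tags[i] = inode->i_streamid;   /* the ID current at writeback time */
    inode->nr_dirty = 0;
    return n;
}
```

If an application sets ID 1, writes page 0, sets ID 2, and writes page 1 before any writeback occurs, both pages go out under ID 2 in this model; the first write has lost the ID it was meant to carry.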

There would be clear value in a closer association between stream IDs and specific buffered-write operations. Getting there would require storing the stream ID with each dirtied page, though; that, in turn, almost certainly implies shoehorning the stream ID into the associated page structure. That would not be an easy task; it is not surprising that it is not a part of this patch set. Should the lack of per-buffered-write stream IDs prove to be a serious constraint in the future, somebody will certainly be motivated to try to find a place to store another eight bits in struct page.

Meanwhile, there does not appear to be any real opposition to the patch set in its current form. Unless that situation changes, write-stream IDs would appear to be a feature headed for the mainline in a near-future development cycle.


Write-stream IDs

Posted Apr 9, 2015 2:10 UTC (Thu) by butlerm (subscriber, #13312) [Link] (2 responses)

Shouldn't filesystems use stream ids set like this to lay out files on the logical volume to reduce fragmentation, improve read ahead, lower seek times, and the like?

Furthermore isn't it a bit of a layering violation if these ids are passed to the block layer without the opportunity for the filesystem to rewrite / remap them? A filesystem could easily support hundreds if not thousands of write streams and reduce them to a much smaller number for the benefit of a supporting block device.

Write-stream IDs

Posted Apr 10, 2015 12:54 UTC (Fri) by dwmw2 (subscriber, #2063) [Link] (1 responses)

The whole "pretend to be a disk" thing is a layering problem. We really want the file system to be able to see the underlying flash and make use of it directly.

So you end up with precisely this kind of layering violation in order to pass additional information down to the "disk" to try to let it do its job better.

The last major layering violation like this was TRIM, which allows us to tell the disk that we no longer care about the contents of certain sectors. Look how well that worked... is it still disabled by default in most file systems because of media corruption and horrid performance? Is there a version of the SATA spec yet that actually allows it to be a tagged command and queue up with other reads/writes?

Write-stream IDs: layering violations

Posted Apr 12, 2015 3:30 UTC (Sun) by giraffedata (guest, #1954) [Link]

Such things are not actually layering violations because of the way they are defined. A command that says, "deallocate this block" is a layering violation. A command that declares, "this block does not back any file" is a layering violation. But a command that says, "Fill this block with zeroes" or even "make this block have unpredictable contents" is not, even if we understand that the device will respond to that information by deallocating the block.

Likewise, declaring that you're going to overwrite two blocks at about the same time in the future does not violate layering. Telling an SSD to put the two blocks in the same erase block would.

What this is instead is messy layering, a form of excessive complexity. That's almost as bad.

Write-stream IDs

Posted Apr 9, 2015 5:11 UTC (Thu) by dlang (guest, #313) [Link] (1 responses)

RAID arrays also end up with large "physical" block sizes and write amplification (a read-modify-write on the parity blocks when any block in the stripe is written).

But trying to control this from tags that the application attaches to writes seems very broken.

With only 8 bits available to play with, it's small enough to run into problems on large systems, and small enough to be abused by processes that want to mess up other processes on the system. The application should also not have any idea how the data it is writing is going to end up on disk. If it is using a log-structured filesystem, the OS may want to mix data from many sources together into one eraseblock (after all, there's no significant performance penalty for reading from all over the drive).

The problem is, when writeback is about to happen, how do you find other blocks of data that the filesystem would like to have together? What the application thinks can be a hint, but should not be more than that.

Write-stream IDs

Posted Apr 9, 2015 18:11 UTC (Thu) by HIGHGuY (subscriber, #62277) [Link]

Interesting idea.

I guess it could be worked around by assigning quasi-random (sequential) numbers to each open file and then having:
fadvise(initial_fd, ... , POSIX_FADVISE_SIMILAR, new_fd, ...);

to assign initial_fd's number to that of new_fd.

Coordination between separate applications

Posted Apr 9, 2015 14:09 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link] (2 responses)

> The idea is that writes with identical stream IDs have similar life times, not that stream ID 'X' has a shorter lifetime than stream ID 'X+1'.

Without a standard definition for stream ID values, there is no way for separate applications to coordinate with one another. For example, if a web browser uses larger stream IDs for files with longer lifetimes, and an email client uses smaller stream IDs for files with longer lifetimes, then it may end up making the situation worse, not better.

Coordination between separate applications

Posted Apr 9, 2015 15:37 UTC (Thu) by droundy (subscriber, #4559) [Link] (1 responses)

I would imagine that applications would each pick two or three tags randomly and just hope that there aren't too many conflicts. It seems that the tags are ephemeral, so if the two applications aren't simultaneously writing there shouldn't be much problem.

Coordination between separate applications

Posted Apr 15, 2015 23:13 UTC (Wed) by brouhaha (subscriber, #1698) [Link]

For that reason and others, it seems like using an 8-bit stream ID is overly restrictive.

Write-stream IDs

Posted Apr 10, 2015 10:16 UTC (Fri) by amonnet (guest, #54852) [Link] (1 responses)

Great, let's wait for all of userspace to make (correct) use of this (Linux-only) feature!!

Assume that programmers have already done their homework, putting different kinds of files in different directories (think log, cache, data, run files ...). If they have not, they will not do it now for you with this obscure feature.

Dirname is your stream id !

+++

Write-stream IDs

Posted Apr 10, 2015 15:23 UTC (Fri) by flussence (guest, #85566) [Link]

Presumably this is something intended for high-level APIs like GTK/Qt, where the toolkit takes care of all the OS-specific details. (They've hopefully gotten better at that job ever since the mass panic where ext4 started eating people's config files...)

I don't like the idea of automatic heuristics based on assumed filesystem usage, but I *would* use a way to declare usage patterns per directory; for example, marking /var/tmp or /home/*/.cache/ as expendable, and then having those behave like a disk-backed tmpfs where only memory pressure or unmounting can trigger writeback.

Write-stream IDs

Posted Apr 16, 2015 0:42 UTC (Thu) by scientes (guest, #83068) [Link] (1 responses)

Why don't we throw the block layer away entirely along with the FTL and use mtd for SSDs?

Write-stream IDs

Posted Apr 17, 2015 13:40 UTC (Fri) by intgr (subscriber, #39733) [Link]

People keep bringing up this question. To me, the answer seems obvious: because without the FTL, software bugs or misconfiguration can damage the hardware (when not doing proper wear levelling). With rare exceptions, hardware is usually designed to be safe from software bugs.

Sure, the firmware-level FTL can also have bugs, but when the firmware is buggy then it can only be the HW vendor's fault. With a fully software FTL, who takes the blame? Who foots the bill for replacing damaged hardware? This is also not a problem with software FTL on tightly integrated embedded devices, such as cell phones, because the hardware and software come from the same place.