The balance between features and performance in the block layer
Block subsystem maintainer Jens Axboe has continued working to make block I/O operations go faster. A recent round of patches tweaked various fast paths, changed the plugging mechanism to use a singly linked list, and made various other little changes. Each is a small optimization, but the work adds up over time; the claimed level of performance is now 8.2 million IOPS — well over September's rate, which looked good at the time. This work has since found its way into the mainline as part of the block pull request for 5.16.
So far, so good; few people will argue with massive performance improvements. But they might argue with changes that threaten to interfere, even in a tiny way, with those improvements.
Consider, for example, this patch set from Jane Chu. It adds a new flag (RWF_RECOVERY_DATA) to the preadv2() and pwritev2() system calls that can be used by applications trying to recover from nonvolatile-memory "poisoning". Implementations of nonvolatile memory have different ways of detecting and coping with data corruption; Intel memory, it seems, will poison the affected range, meaning that it cannot be accessed without generating an error (which then turns into a SIGBUS signal). An application can respond to that error by reading or writing the poisoned range with the new flag; a read will replace the poisoned data with zeroes (allowing the application to obtain whatever data is still readable), while a write will overwrite that data and attempt to clear the poisoned status. Either way, the application can attempt to recover from the problem and continue running.
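For illustration only, a recovery sequence using the proposed flag might look something like the sketch below. This is not code from the patch set; the file path, the offset of the poisoned range, and the assumption that RWF_RECOVERY_DATA is available from the patched kernel's headers are all placeholders.

    /*
     * Illustrative sketch: recover from a poisoned range in a pmem-backed
     * file using the RWF_RECOVERY_DATA flag proposed in Jane Chu's patch
     * set.  The flag is not in mainline, so this only builds against
     * headers from the patched tree; the path and offsets are made up.
     */
    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #ifndef RWF_RECOVERY_DATA
    #error "RWF_RECOVERY_DATA comes from the patched kernel's UAPI headers"
    #endif

    #define POISONED_OFFSET 0       /* assumed start of the poisoned range */
    #define RANGE_LEN       4096    /* assumed length of the poisoned range */

    int main(void)
    {
        static char buf[RANGE_LEN];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("/mnt/pmem/data", O_RDWR);   /* hypothetical DAX file */

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /*
         * Read the affected range.  With RWF_RECOVERY_DATA, poisoned bytes
         * come back as zeroes instead of raising SIGBUS, so whatever data
         * is still intact can be salvaged.
         */
        if (preadv2(fd, &iov, 1, POISONED_OFFSET, RWF_RECOVERY_DATA) < 0) {
            perror("preadv2");
            return EXIT_FAILURE;
        }

        /* ... reconstruct the lost bytes from a backup or other replica ... */
        memset(buf, 0, sizeof(buf));

        /*
         * Write the range back with the same flag; this overwrites the
         * poisoned data and asks the hardware to clear the poison state.
         */
        if (pwritev2(fd, &iov, 1, POISONED_OFFSET, RWF_RECOVERY_DATA) < 0) {
            perror("pwritev2");
            return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
    }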
Christoph Hellwig objected to this new flag on the grounds that it would slow down the I/O fast path:
Well, my point is doing recovery from bit errors is by definition not the fast path. Which is why I'd rather keep it away from the pmem read/write fast path, which also happens to be the (much more important) non-pmem read/write path.
Pavel Begunkov also complained, saying that each flag adds a bit of overhead that piles up over time: "default config kernels are already sluggish when it comes to really fast devices and it's not getting better". That caused Darrick Wong to ask: "So we can't have data recovery because moving fast [is] the only goal?". Begunkov denied saying that, but wasn't really clear on what he was saying.
The cost of this flag is tiny — perhaps not even measurable — in cases where it is not used. But even that overhead can look unacceptable to developers who are working to get the sustained IOPS numbers as high as possible. One flag leads to another and another, and someday in the future the performance cost becomes significant — or that is the argument, anyway. To avoid this kind of problem, the argument continues, niche features like nonvolatile memory poison recovery should be restricted to parts of the kernel that are not seen as being the fast path. In this case, adding the needed functionality to fallocate() has been suggested and tried, but it was eventually decided that fallocate() is not the right place for hardware-management features like poison-clearing.
Thus the current implementation, which has run into fast-path concerns. That, in turn, has provoked an extended outburst from Dave Chinner, who thinks that much of the current optimization work is misplaced:
The current approach of hyper-optimising the IO path for maximum per-core IOPS at the expense of everything else has been proven in the past to be exactly the wrong approach to be taking for IO path development. Yes, we need to be concerned about performance and work to improve it, but we should not be doing that at the cost of everything else that the IO stack needs to be able to do.
Optimization, he continued, should come after the needed functionality is present; "using 'fast path optimisation' as a blunt force implement to prevent new features from being implemented is just ... obnoxious".
The conversation stalled out shortly after that. This kind of disagreement over features and performance has been heard in the kernel community many times over the decades, though. Going faster is a popular goal, and the developers who are focused on performance have been known to get tired of working through performance regressions caused by feature additions that accumulate over time. But functionality, too, is important, and developers tire of having needed features blocked on seemingly insignificant performance concerns.
Often, performance-related pushback leads to revised solutions that do not have the same performance cost; along those lines, it's worth noting that Hellwig has some ideas for improved ways of handling I/O to nonvolatile memory. Other times, it just leads to the delay or outright blocking of needed functionality. What will happen in this case is not clear at this point, but debates like this will almost certainly be with us in the coming decades as well. It is, in the end, how the right balance between performance and functionality is (hopefully) found.
Posted Nov 5, 2021 16:07 UTC (Fri) by Wol (subscriber, #4433)

Would you call raid an application? This looks at first glance like EXACTLY the functionality raid would welcome. We don't give a monkeys about the distinction between reading and writing at this point - the raid layer, on hitting an error, can do both.
Provided the io layer returns an error that we know about, we can try re-reading or re-writing whatever seems appropriate, and the actual application won't even realise anything out of the ordinary has happened.
Likewise databases, and many other applications where data integrity is king - they will welcome this functionality.
That's why my home raid now has dm-integrity as part of my stack - I don't care about the cost (which is real, but not noticeable to my lightly-used home system) - because my mirror-raid can now recover from corruption. Having integrity in the stack will cause a read of damaged data to blow up.
Cheers,
Wol
Posted Nov 5, 2021 17:34 UTC (Fri) by Sesse (subscriber, #53779)

5.15 supposedly is good for 9M (there are two patches that are not in mainline).
Posted Nov 8, 2021 12:14 UTC (Mon) by k3ninho (subscriber, #50375)

That sounds like premature optimisation -- but the point I'd rather make here is about whole-context optimisation, where we must make a habit of improving the system-as-a-whole. Especially when you're only optimising one measure without clarifying the assumption that it's the best proxy for all the other things you're not taking into account.

K3n.
Posted Nov 11, 2021 9:54 UTC (Thu) by wtarreau (subscriber, #51152)

In haproxy we're facing this dilemma all the time, but we try to stay reasonable. We know that users want features, and we try to group slow operations in slow paths, or to implement bypass mechanisms. Sometimes the cost of checking one flag is okay but not two or three, so we arrange for grouping them under a same mask and use a slow path to test each of them. Other times we have high-level checks that decide what path to take, with some partially redundant code, which is more of a pain but occasionally needed. And we try to always keep in mind that saved performance is not just there to present numbers, but also to leave more room to welcome new features at zero cost. For sure it's never pleasant to work 3 months to save 5% and see those 5% disappear 3 months later, but if we're back to previous performance numbers for a nice feature improvement, well, it's not that bad.
One thing that developers tend to forget is that doing nothing can be extremely fast, but in the real world there are more operations around what they've optimized, so savings that double the performance have in fact only cut the overhead in half, and when placed in the field, a lot more overhead will replace the one they removed. So their savings only become a few percent in the end. That's what I'm often trying to explain: "in practice nobody runs at this level of performance due to other factors so the loss will be much lower".
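A minimal sketch of the flag-grouping pattern described in the comment above (the flag names are hypothetical; this is neither haproxy nor kernel code): every rarely-used flag is folded into a single mask, so the common case pays for one test and the individual flags are only examined on the slow path.

    #include <stdint.h>

    /* rarely-used request flags (names made up for illustration) */
    #define FLAG_TRACE     0x01u
    #define FLAG_RECOVER   0x02u
    #define FLAG_THROTTLE  0x04u

    /* every flag that forces a request off the fast path */
    #define SLOW_PATH_MASK (FLAG_TRACE | FLAG_RECOVER | FLAG_THROTTLE)

    static void handle_request_slow(uint32_t flags)
    {
        /* individual flags are only examined here, off the hot path */
        if (flags & FLAG_TRACE)    { /* ... emit tracing ... */ }
        if (flags & FLAG_RECOVER)  { /* ... take the recovery path ... */ }
        if (flags & FLAG_THROTTLE) { /* ... apply rate limiting ... */ }
    }

    static void handle_request(uint32_t flags)
    {
        /* one test on the fast path, however many rare flags exist */
        if (flags & SLOW_PATH_MASK) {
            handle_request_slow(flags);
            return;
        }
        /* ... fast path: no per-flag checks ... */
    }

    int main(void)
    {
        handle_request(0);            /* common case: a single branch */
        handle_request(FLAG_RECOVER); /* rare case: diverted to the slow path */
        return 0;
    }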