The balance between features and performance in the block layer
Block subsystem maintainer Jens Axboe has continued working to make block I/O operations go faster. A recent round of patches tweaked various fast paths, changed the plugging mechanism to use a singly linked list, and made various other little changes. Each is a small optimization, but the work adds up over time; the claimed level of performance is now 8.2 million IOPS — well over September's rate, which looked good at the time. This work has since found its way into the mainline as part of the block pull request for 5.16.
So far, so good; few people will argue with massive performance improvements. But they might argue with changes that threaten to interfere, even in a tiny way, with those improvements.
Consider, for example, this patch set from Jane Chu. It adds a new flag (RWF_RECOVERY_DATA) to the preadv2() and pwritev2() system calls that can be used by applications trying to recover from nonvolatile-memory "poisoning". Implementations of nonvolatile memory have different ways of detecting and coping with data corruption; Intel memory, it seems, will poison the affected range, meaning that it cannot be accessed without generating an error (which then turns into a SIGBUS signal). An application can respond to that error by reading or writing the poisoned range with the new flag; a read will replace the poisoned data with zeroes (allowing the application to obtain whatever data is still readable), while a write will overwrite that data and attempt to clear the poisoned status. Either way, the application can attempt to recover from the problem and continue running.
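For illustration only, a recovery sequence using the proposed flag might look something like the sketch below. This is not code from the patch set; the file path, the offset of the poisoned range, and the assumption that RWF_RECOVERY_DATA is available from the patched kernel's headers are all placeholders.

    /*
     * Illustrative sketch: recover from a poisoned range in a pmem-backed
     * file using the RWF_RECOVERY_DATA flag proposed in Jane Chu's patch
     * set.  The flag is not in mainline, so this only builds against
     * headers from the patched tree; the path and offsets are made up.
     */
    #define _GNU_SOURCE
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #ifndef RWF_RECOVERY_DATA
    #error "RWF_RECOVERY_DATA comes from the patched kernel's UAPI headers"
    #endif

    #define POISONED_OFFSET 0       /* assumed start of the poisoned range */
    #define RANGE_LEN       4096    /* assumed length of the poisoned range */

    int main(void)
    {
        static char buf[RANGE_LEN];
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("/mnt/pmem/data", O_RDWR);   /* hypothetical DAX file */

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /*
         * Read the affected range.  With RWF_RECOVERY_DATA, poisoned bytes
         * come back as zeroes instead of raising SIGBUS, so whatever data
         * is still intact can be salvaged.
         */
        if (preadv2(fd, &iov, 1, POISONED_OFFSET, RWF_RECOVERY_DATA) < 0) {
            perror("preadv2");
            return EXIT_FAILURE;
        }

        /* ... reconstruct the lost bytes from a backup or other replica ... */
        memset(buf, 0, sizeof(buf));

        /*
         * Write the range back with the same flag; this overwrites the
         * poisoned data and asks the hardware to clear the poison state.
         */
        if (pwritev2(fd, &iov, 1, POISONED_OFFSET, RWF_RECOVERY_DATA) < 0) {
            perror("pwritev2");
            return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
    }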
Christoph Hellwig objected to this new flag on the grounds that it would slow down the I/O fast path:
Well, my point is doing recovery from bit errors is by definition not the fast path. Which is why I'd rather keep it away from the pmem read/write fast path, which also happens to be the (much more important) non-pmem read/write path.
Pavel Begunkov also complained, saying that each flag adds a bit of overhead that piles up over time: "default config kernels are already sluggish when it comes to really fast devices and it's not getting better". That caused Darrick Wong to ask: "So we can't have data recovery because moving fast [is] the only goal?". Begunkov denied saying that, but wasn't really clear on what he was saying.
The cost of this flag is tiny — perhaps not even measurable — in cases where it is not used. But even that overhead can look unacceptable to developers who are working to get the sustained IOPS numbers as high as possible. One flag leads to another and another, and someday in the future the performance cost becomes significant — or that is the argument, anyway. To avoid this kind of problem, the argument continues, niche features like nonvolatile memory poison recovery should be restricted to parts of the kernel that are not seen as being the fast path. In this case, adding the needed functionality to fallocate() has been suggested and tried, but it was eventually decided that fallocate() is not the right place for hardware-management features like poison-clearing.
Thus the current implementation, which has run into fast-path concerns. That, in turn, has provoked an extended outburst from Dave Chinner, who thinks that much of the current optimization work is misplaced:
The current approach of hyper-optimising the IO path for maximum per-core IOPS at the expense of everything else has been proven in the past to be exactly the wrong approach to be taking for IO path development. Yes, we need to be concerned about performance and work to improve it, but we should not be doing that at the cost of everything else that the IO stack needs to be able to do.
Optimization, he continued, should come after the needed functionality is present; "using 'fast path optimisation' as a blunt force implement to prevent new features from being implemented is just ... obnoxious".
The conversation stalled out shortly after that. This kind of disagreement over features and performance has been heard in the kernel community many times over the decades, though. Going faster is a popular goal, and the developers who are focused on performance have been known to get tired of working through performance regressions caused by feature additions that accumulate over time. But functionality, too, is important, and developers tire of having needed features blocked on seemingly insignificant performance concerns.
Often, performance-related pushback leads to revised solutions that do not have the same performance cost; along those lines, it's worth noting that Hellwig has some ideas for improved ways of handling I/O to nonvolatile memory. Other times, it just leads to the delay or outright blocking of needed functionality. What will happen in this case is not clear at this point, but debates like this will almost certainly be with us in the coming decades as well. It is, in the end, how the right balance between performance and functionality is (hopefully) found.
Posted Nov 5, 2021 16:07 UTC (Fri) by Wol (subscriber, #4433)

Would you call raid an application? This looks at first glance like EXACTLY the functionality raid would welcome. We don't give a monkeys about the distinction between reading and writing at this point - the raid layer, on hitting an error, can do both.
Provided the io layer returns an error that we know about, we can try re-reading or re-writing whatever seems appropriate, and the actual application won't even realise anything out of the ordinary has happened.
Likewise databases, and many other applications where data integrity is king - they will welcome this functionality.
That's why my home raid now has dm-integrity as part of my stack - I don't care about the cost (which is real, but not noticeable to my lightly-used home system) - because my mirror-raid can now recover from corruption. Having integrity in the stack will cause a read of damaged data to blow up.
Cheers,
Wol
Posted Nov 5, 2021 17:34 UTC (Fri) by Sesse (subscriber, #53779)

5.15 supposedly is good for 9M (there are two patches that are not in mainline).
Posted Nov 8, 2021 12:14 UTC (Mon) by k3ninho (subscriber, #50375)

That sounds like premature optimisation -- but the point I'd rather make here is about whole-context optimisation, where we must make a habit of improving the system-as-a-whole. Especially when you're only optimising one measure without clarifying the assumption that it's the best proxy for all the other things you're not taking into account.

K3n.
Posted Nov 11, 2021 9:54 UTC (Thu) by wtarreau (subscriber, #51152)

In haproxy we're facing this dilemma all the time, but we try to stay reasonable. We know that users want features, and we try to group slow operations in slow paths, or to implement bypass mechanisms. Sometimes the cost of checking one flag is okay but not two or three, so we arrange for grouping them under a same mask and use a slow path to test each of them. Other times we have high-level checks that decide what path to take, with some partially redundant code, which is more of a pain but occasionally needed. And we try to always keep in mind that saved performance is not just there to present numbers, but also to leave more room to welcome new features at zero cost. For sure it's never pleasant to work 3 months to save 5% and see those 5% disappear 3 months later, but if we're back to previous performance numbers for a nice feature improvement, well, it's not that bad.
One thing that developers tend to forget is that doing nothing can be extremely fast, but in the real world there are more operations around what they've optimized, so savings that double the performance have in fact only cut the overhead in half, and when placed in the field, a lot more overhead will replace the one they removed. So their savings only become a few percent in the end. That's what I'm often trying to explain: "in practice nobody runs at this level of performance due to other factors so the loss will be much lower".
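A minimal sketch of the flag-grouping pattern described in the comment above (the flag names are hypothetical; this is neither haproxy nor kernel code): every rarely-used flag is folded into a single mask, so the common case pays for one test and the individual flags are only examined on the slow path.

    #include <stdint.h>

    /* rarely-used request flags (names made up for illustration) */
    #define FLAG_TRACE     0x01u
    #define FLAG_RECOVER   0x02u
    #define FLAG_THROTTLE  0x04u

    /* every flag that forces a request off the fast path */
    #define SLOW_PATH_MASK (FLAG_TRACE | FLAG_RECOVER | FLAG_THROTTLE)

    static void handle_request_slow(uint32_t flags)
    {
        /* individual flags are only examined here, off the hot path */
        if (flags & FLAG_TRACE)    { /* ... emit tracing ... */ }
        if (flags & FLAG_RECOVER)  { /* ... take the recovery path ... */ }
        if (flags & FLAG_THROTTLE) { /* ... apply rate limiting ... */ }
    }

    static void handle_request(uint32_t flags)
    {
        /* one test on the fast path, however many rare flags exist */
        if (flags & SLOW_PATH_MASK) {
            handle_request_slow(flags);
            return;
        }
        /* ... fast path: no per-flag checks ... */
    }

    int main(void)
    {
        handle_request(0);            /* common case: a single branch */
        handle_request(FLAG_RECOVER); /* rare case: diverted to the slow path */
        return 0;
    }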