It's an experience many of us have had: write a bunch of data to a relatively slow block device, then try to get some other work done. In many cases, the system will slow to a crawl or even appear to freeze for a while; things do not recover until the bulk of the data has been written to the device. On a system with a lot of memory and a slow I/O device, getting things back to a workable state can take a long time, sometimes measured in minutes. Linux users are understandably unimpressed by this behavior pattern, but it has been stubbornly present for a long time. Now, perhaps, a new patch set will improve the situation.
That patch set, from block subsystem maintainer Jens Axboe, is titled "Make background writeback not suck." "Background writeback" here refers to the act of flushing block data from memory to the underlying storage device. With normal Linux buffered I/O, a write() call simply transfers the data to memory; it's up to the memory-management subsystem to, via writeback, push that data to the device behind the scenes. Buffering writes in this manner enables a number of performance enhancements, including allowing multiple operations to be combined and enabling filesystems to improve layout locality on disk.
So how is it that a performance-enhancing technique occasionally leads to such terrible performance? Jens's diagnosis is that it has to do with the queuing of I/O requests in the block layer. When the memory-management code decides to write a range of dirty data, the result is an I/O request submitted to the block subsystem. That request may spend some time in the I/O scheduler, but it is eventually dispatched to the driver for the destination device. Getting there requires passing through a series of queues.
The problem is that, if there is a lot of dirty data to write, there may end up being vast numbers (as in thousands) of requests queued for the device. Even a reasonably fast drive can take some time to work through that many requests. If some other activity (clicking a link in a web browser, say, or launching an application) generates I/O requests on the same block device, those requests go to the back of that long queue and may not be serviced for some time. If multiple, synchronous requests are generated — page faults from a newly launched application, for example — each of those requests may, in turn, have to pass through this long queue. That is the point where things appear to just stop.
In other words, the block layer has a bufferbloat problem that mirrors the issues that have been seen in the networking stack. Lengthy queues lead to lengthy delays.
As with bufferbloat, the answer lies in finding a way to reduce the length of the queues. In the networking stack, techniques like byte queue limits and TCP small queues have mitigated much of the bufferbloat problem. Jens's patches attempt to do something similar in the block subsystem.
Like networking, the block subsystem has queuing at multiple layers. Requests start in a submission queue and, perhaps after reordering or merging by an I/O scheduler, make their way to a dispatch queue for the target device. Most block drivers also maintain queues of their own internally. Those lower-level queues can be especially problematic since, by the time a request gets there, it is no longer subject to the I/O scheduler's control (if there is an I/O scheduler at all).
Jens's patch set aims to reduce the amount of data "in flight" through all of those queues by throttling requests when they are first submitted. To put it simply, each device has a maximum number of buffered-write requests that can be outstanding at any given time. If an incoming request would cause that limit to be exceeded, the process submitting the request will block until the length of the queue drops below the limit. That way, other requests will never be forced to wait for a long queue to drain before being acted upon.
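To make the idea concrete, here is a minimal sketch of such a throttle in kernel-style C. All names here are hypothetical, invented for illustration; the actual patches integrate with the block layer's writeback paths and are considerably more involved.

    #include <linux/atomic.h>
    #include <linux/wait.h>

    /* Sketch: per-device limit on in-flight buffered writebacks. */
    struct wb_throttle {
            atomic_t inflight;              /* writebacks outstanding */
            int limit;                      /* maximum allowed in flight */
            wait_queue_head_t wait;         /* submitters sleep here */
    };

    /* Called before a buffered-write request is dispatched. */
    static void wb_throttle_get(struct wb_throttle *wt)
    {
            /* Simplified: a real implementation would close the race
             * between this check and the increment. */
            wait_event(wt->wait, atomic_read(&wt->inflight) < wt->limit);
            atomic_inc(&wt->inflight);
    }

    /* Called from the request-completion path. */
    static void wb_throttle_put(struct wb_throttle *wt)
    {
            if (atomic_dec_return(&wt->inflight) < wt->limit)
                    wake_up(&wt->wait);
    }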
In the real world, of course, things are not quite so simple. Writeback is not just important for ensuring that data makes it to persistent storage (though that is certainly important enough); it is also a key activity for the memory-management subsystem. Writeback is how dirty pages are made clean and, thus, available for reclaim and reuse; if writeback is impeded too much, the system could find itself in an out-of-memory situation. Running out of memory can lead to other user-disgruntling delays, along with unleashing the OOM killer. So any writeback throttling must be sure to not throttle things too much.
The patch set tries to avoid such unpleasantness by tracking the reason behind each buffered-write operation. If the memory-management subsystem is just pushing dirty pages out to disk as part of the regular task of making their contents persistent, the queue limit applies. If, instead, pages are being written to make them free for reclaim — if the system is running short of memory, in other words — the limit is increased. A higher limit also applies if a process is known to be waiting for writeback to complete (as might be the case for an fsync() call). On the other hand, if there have been any non-writeback requests within the last 100ms, the limit is reduced below the default for normal writeback requests.
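As a rough illustration of how such reason-dependent limits might look, continuing the hypothetical sketch above (the names and scaling factors are invented; the patch set's actual heuristics differ in detail):

    #include <linux/types.h>

    enum wb_reason_class {
            WB_BACKGROUND,          /* routine flushing of dirty data */
            WB_RECLAIM,             /* writing pages to free memory */
            WB_SYNC,                /* a task is waiting, e.g. fsync() */
    };

    static int wb_effective_limit(const struct wb_throttle *wt,
                                  enum wb_reason_class reason,
                                  bool foreground_io_last_100ms)
    {
            switch (reason) {
            case WB_RECLAIM:
            case WB_SYNC:
                    /* Urgent writeback gets more headroom. */
                    return wt->limit * 2;
            case WB_BACKGROUND:
            default:
                    /* Recent non-writeback activity pushes the
                     * limit below the normal default. */
                    return foreground_io_last_100ms ? wt->limit / 2
                                                    : wt->limit;
            }
    }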
There is also a potential trap in the form of drives that do their own write caching. Such drives will indicate that a write request has completed once the data has been transferred, but that data may just be sitting in a cache within the drive itself. In other words, the drive, too, may be maintaining a long queue. In an attempt to avoid overfilling that queue, the block layer will impose a delay between write operations on drives that are known to do caching. That delay is 10ms by default, but can be tweaked via a sysfs knob.
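The pacing could be sketched like this (hypothetical code; only the 10ms default and the sysfs tunability come from the patches, and last_issue is an invented field):

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    static unsigned int wb_cache_delay_ms = 10;     /* sysfs-tunable */

    /* Enforce a minimum gap between writes to a drive that is
     * known to cache writes internally. */
    static void wb_pace_cached_drive(struct wb_throttle *wt)
    {
            unsigned long next = wt->last_issue +
                                 msecs_to_jiffies(wb_cache_delay_ms);

            if (time_before(jiffies, next))
                    schedule_timeout_uninterruptible(next - jiffies);
            wt->last_issue = jiffies;
    }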
Jens tested this work by having one process write 100MB each to 50 files while another process tries to read a file. The reading process will, on current kernels, be penalized by having each successive read request placed at the end of a long queue created by all those write requests; as might be expected, it performs poorly. With the patches applied, the writing processes take a little longer to complete, but the reader runs much more quickly, with far fewer requests taking an inordinately long period of time.
This is an early-stage patch set; it is not expected to go upstream in the near future. Patches that change memory-management behavior can often cause unexpected problems with different workloads, so it takes a while to build confidence in a significant change, even after the development work is deemed to be complete (which is not the case here). Indeed, Dave Chinner has already reported a performance regression with one of his testing workloads. The tuning of the queue-size limits also needs to be made automatic if possible. There is clearly work still to be done here; the patch set is also likely to be a subject of discussion at the upcoming Linux Storage, Filesystem, and Memory-Management Summit. So users will have to wait a bit longer for this particular annoyance to be addressed.
Toward less-annoying background writeback
Posted Apr 14, 2016 19:57 UTC (Thu) by fandingo (guest, #67019) [Link]
Yuck. I don't see what useful purpose this artificial latency is supposed to serve. Surely, after all these years, we've realized that people who use writeback caching devices desire that behavior. Why get in their way? Users know what they're doing. And if they don't? They sure as shit don't know about a sysfs knob. Fight the urge to add these silly, never-to-be-used options.
I don't want 10ms of latency for no reason and from the OS layer that can't possibly know whether 10ms of latency per write provides any benefit whatsoever. What's the derogatory phrase people like to hurl about these sorts of things? Layering violation! OS manage your buffers; disk manage yours; don't try to do sneaky stuff to manipulate the other's.
Toward less-annoying background writeback
Posted Apr 14, 2016 20:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
It's not perfect, they should use something like CoDel to modulate the delay, but it's a start.
Toward less-annoying background writeback
Posted Apr 15, 2016 8:19 UTC (Fri) by roblucid (subscriber, #48964) [Link]
This sounds a bit like a producer-consumer problem; perhaps the code could learn where the cliff is, then start to back off when outstanding writes are 2/3rds of the way to the cliff, leaving some room for priority fsync-type stuff? If these in-drive queues do work to improve performance, then starting to re-fill the queue after a drain period, when it should be 1/3rd full, allows the disk cache to do its job.
Toward less-annoying background writeback
Posted Apr 16, 2016 2:28 UTC (Sat) by axboe (subscriber, #904) [Link]
For the stats, I wrote this: http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-th...
And I did actually implement something resembling CoDel, on top of the stats patch: http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-th...
Currently experimenting with the latter; it'll be folded into the parent patch that introduces the throttling. Specifically for writeback caching, the current approach is to allow the queue to drain to 0, once the allowed writes have been dispatched, before allowing more IOs. It might need to be a bit more aggressive, but I'm not going to re-add the explicit delay.
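As a rough idea of what a CoDel-like controller means in this context (an illustrative sketch only, not the code in the branches above): track the minimum completion latency over an observation window, and shrink the allowed queue depth while that minimum stays above a target.

    #include <linux/types.h>

    #define WB_TARGET_LAT_US        1000    /* illustrative target */

    /* Called once per window with the smallest request-completion
     * latency seen during that window; max_limit is hypothetical. */
    static void wb_codel_update(struct wb_throttle *wt, u64 min_lat_us)
    {
            if (min_lat_us > WB_TARGET_LAT_US) {
                    /* Even the best-case request waited too long:
                     * the queue is too deep, so back off. */
                    if (wt->limit > 1)
                            wt->limit /= 2;
            } else if (wt->limit < wt->max_limit) {
                    /* Latency is acceptable: probe upward slowly. */
                    wt->limit++;
            }
    }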
Toward less-annoying background writeback
Posted Apr 16, 2016 9:33 UTC (Sat) by roblucid (subscriber, #48964) [Link]
Toward less-annoying background writeback
Posted Apr 15, 2016 18:03 UTC (Fri) by nix (subscriber, #2304) [Link]
(However, this is clearly not always desirable. E.g. my Areca controller aggressively promotes reads in front of writes in the queue; the only situation in which you cannot issue a read and get an immediate response is when an actual write is physically going on -- making it a device for which the desired delay would be zero. But whether this applies to any given device is a per-device property. It's clearly possible for the kernel to discern that read latency is apparently being worsened by the presence of outstanding writes -- hence, I suppose, the eventual desire to discard this knob and go to self-tuning.)
Toward less-annoying background writeback
Posted Apr 15, 2016 20:41 UTC (Fri) by magila (subscriber, #49627) [Link]
Toward less-annoying background writeback
Posted Apr 16, 2016 0:02 UTC (Sat) by fandingo (guest, #67019) [Link]
I don't know how you arrive at this conclusion. This user likes writeback caching very much. Furthermore, for over a decade, well over half of computer sales have been for models that include a UPS: laptops. Plus, not all data are critical; non-critical data can be safely cached. Data only needs to become persistent when synced, which should be causing ATA CACHE FLUSH commands to be dispatched.
> Most enterprise drives already ship with write caching disabled by default
Because data integrity needs are different for enterprise systems, especially since they're often configured in RAIDs.
> unfortunately consumer grade devices tend to follow a policy of speed above all else.
Considering how egregiously slow storage, even SSDs, is compared to the rest of the hardware, performance should be a high concern. As stated previously, there are plenty of strategies to ensure data integrity for data that matters, but we shouldn't blindly apply those strategies to all IO.
Toward less-annoying background writeback
Posted Apr 16, 2016 1:48 UTC (Sat) by magila (subscriber, #49627) [Link]
Historically, write caching was a big deal because ATA drives didn't support command queuing at all. Write caching allowed drives to queue writes even without queuing support at the host interface. The performance difference between a queue depth of 1 and anything >1 is quite large, so it was judged worth the trade-offs to enable it. Now even low-end consumer drives support 32 queued commands, which is more than enough to reach the point of diminishing returns. So enabling write caching really isn't doing much for most people these days other than causing problems like the one mentioned in the article.
Enterprise drives using SCSI have had proper command queuing a lot longer than ATA drives have, so there's been less incentive for them to use write caching, even aside from being more conservative about data integrity.
Toward less-annoying background writeback
Posted Apr 17, 2016 22:55 UTC (Sun) by nix (subscriber, #2304) [Link]
Toward less-annoying background writeback
Posted Apr 18, 2016 0:32 UTC (Mon) by magila (subscriber, #49627) [Link]
In your example, it would only make a difference if creating hard links was synchronous all the way down to the underlying storage device. I'm nearly 100% certain it's not.
Toward less-annoying background writeback
Posted Apr 18, 2016 10:33 UTC (Mon) by nix (subscriber, #2304) [Link]
I don't see any way the kernel can determine this: it needs per-driver knowledge, which is convenient because most controllers with this much RAM are RAID controllers which have specific per-controller kernel drivers in any case.
Toward less-annoying background writeback
Posted Apr 18, 2016 16:32 UTC (Mon) by magila (subscriber, #49627) [Link]
Toward less-annoying background writeback
Posted Apr 20, 2016 10:58 UTC (Wed) by Jonno (subscriber, #49613) [Link]
Toward less-annoying background writeback
Posted Apr 18, 2016 0:52 UTC (Mon) by magila (subscriber, #49627) [Link]
Toward less-annoying background writeback
Posted Apr 16, 2016 2:30 UTC (Sat) by axboe (subscriber, #904) [Link]
If you had looked at the code and original posting, you would have seen both that these options are going away, and why they currently exist. And no, most users don't know about their writeback caching, let alone how to turn it off. That's another knob.
Toward less-annoying background writeback
Posted Apr 14, 2016 21:27 UTC (Thu) by mtanski (guest, #56423) [Link]
Toward less-annoying background writeback
Posted Apr 15, 2016 9:12 UTC (Fri) by roblucid (subscriber, #48964) [Link]
Conceptually, each device could have priority queues: an urgent queue (for reads, filesystem/RAID journal/metadata, fsync, and reclaim writes) and a when-idle queue, allowing background I/O to be starved until there's a backlog; when little urgent I/O is going on, background writes can be started once they're mature enough to have had a chance to be combined or eliminated (in the case of temporary files). That's analogous to a network router with multiple interfaces, where you want the queues to always be on the output, not the input.
The memory layer could try to clean older pages as non-urgent requests, rather than relying on general syncs every 5s to flush un-fsynced data to disk; such requests could later be reprioritized when a process fsyncs.
That could allow more direct tunables, with suitable queue depths, to suit the different characteristics of SSDs, persistent memory, normal disks, shingled drives, and so on, avoiding unnecessary wear and read-plus-rewrite cycles. Then the buffer cache available to a process could be sized per device, rather than depending on queue-depth heuristics, which in some cases are likely to block user processes sooner than really necessary.
Toward less-annoying background writeback
Posted Apr 15, 2016 23:00 UTC (Fri) by dlang (subscriber, #313) [Link]
you need queues to be long enough to be able to keep the hardware busy, but beyond that you are adding latency for new things that are added to the queue without a benefit.
It's been observed many times over the last few years that storage performance problems are starting to look more like network performance problems, so it makes sense that the same types of solutions will show up.
On the network side, we used to have queues of a (large) fixed number of packets, no matter how much data the packets contained. BQL has moved that so that the queues are based on the amount of data to be sent, and this has made a huge improvement.
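The core of the BQL idea reduces to a small sketch (toy code, not the kernel's actual implementation): meter the queue in bytes, and let completions grow or shrink the byte budget depending on whether the hardware ran dry.

    #include <linux/types.h>

    struct byte_queue_limit {
            unsigned long inflight_bytes;   /* queued, not completed */
            unsigned long limit_bytes;      /* current byte budget */
    };

    static bool bql_may_queue(const struct byte_queue_limit *bql,
                              unsigned long bytes)
    {
            return bql->inflight_bytes + bytes <= bql->limit_bytes;
    }

    static void bql_complete(struct byte_queue_limit *bql,
                             unsigned long bytes, bool hw_starved)
    {
            bql->inflight_bytes -= bytes;
            /* If the hardware went idle while data was still
             * waiting, the budget was too small; otherwise trim
             * it gently back toward the minimum needed. */
            if (hw_starved)
                    bql->limit_bytes += bytes;
            else if (bql->limit_bytes > 2 * bytes)
                    bql->limit_bytes -= bytes / 2;
    }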
WiFi and storage aren't as simple, because how long it takes to process a given amount of data depends on so many variables (some of which are unpredictable without knowledge of the future), but reducing queue sizes to the minimums that are needed to keep the hardware busy, and allowing better interleaving of requests (which also allows for better handling of priority requests) will help a lot.
Toward less-annoying background writeback
Posted Apr 16, 2016 9:42 UTC (Sat) by roblucid (subscriber, #48964) [Link]
Toward less-annoying background writeback
Posted Apr 20, 2016 14:23 UTC (Wed) by farnz (subscriber, #17727) [Link]
That's an important part of the patchset. The commands for device I/O are fairly restricted - at best, you get three commands (buffered write, direct write bypassing buffering, flush all buffers), and three queueing options (skip the queue and execute now, execute when all previous commands are completed, execute when it's convenient); you do, of course, get notified as commands complete. Direct writes are inconvenient to use, because of the interactions with previously sent buffered writes, and are thus only really used by filesystem code (where we expect the experts to get it right).
If we send data from your job down to the device, then another job does some writes which also get sent to the device, and then the second job wants to fsync() its data (e.g. because it's a database that needs to meet ACID guarantees), we have a problem; there is no way to tell the device "these previous commands can stay in the buffer if that's better for performance, but this subset needs to be executed before this flush"; instead, we have to flush the lot in one big go, both the stuff your first job wrote (that doesn't need to go yet) and the stuff the database wants safely in persistent storage. Alternatively, we can waste device performance by sending the fsync() writes again, bypassing the queue - but this means that all the stuff the device does to reorder commands to get higher performance goes to waste.
If the queues are long (e.g. 5 seconds to fully flush the disk buffer), then we have a problem; when the database asks for its ACID flush, it takes 5 seconds, as it has to flush both the buffered data from your less important I/O and the buffered data from the database to disk. Further, during this 5-second flush, we can't service reads - the device is busy writing out its buffer. If the queue is too short, we also have a problem - the device doesn't have enough to do, so it idles while we send more commands down to it, and Linux I/O is slow - a write-out that could have completed in 1 second takes 20.
The patchset aims to fix this - by keeping the amount queued on the device as small as possible without letting the device idle, it ensures that the kernel gets to pick and choose what data actually gets sent down. Thus the kernel can decide that it's not going to send all of your first job's data just yet, just a few microseconds worth at a time to keep the device busy; when the database does fsync(), it can send the appropriate set of buffered writes and buffer flushes to meet the database's ACID requirement ASAP, ignoring the long tail of data from your first job (which stays in kernel buffers until the device is at risk of going idle if it's not sent down). The result is that your big background I/O job no longer affects database latency to such a huge degree.
Toward less-annoying background writeback
Posted Apr 20, 2016 20:00 UTC (Wed) by dlang (subscriber, #313) [Link]
actually, this depends on the filesystem. Ext3 worked the way you describe (and IIRC data written after the flush was issued could get caught up in the flush as well). But other filesystems have been able to isolate the writes/fsyncs done by one program on its files from writes done to other, unrelated files.
now, once things actually hit the disk queues, you do lose that information, but if the queues are kept short, then you don't lose it until the point where you really don't care that much.
Toward less-annoying background writeback
Posted Apr 21, 2016 19:48 UTC (Thu) by farnz (subscriber, #17727) [Link]
I don't see how the filesystem affects background writeback - once we've chosen to send data down to the device queues, we have to flush it out to disk to complete an fsync() regardless of filesystem. The patchset "fixes" this, by ensuring that background writeback does not send much data down to the device queues.
Toward less-annoying background writeback
Posted Apr 21, 2016 21:40 UTC (Thu) by dlang (subscriber, #313) [Link]
If you have two different partitions on one drive and do a fsync on one of them, it has no effect on the other one.
I fully agree that the queues of work to be done by the disk need better management than the current process. These patches may be the beginning of this.
but I think the better beginning would be along the lines of the BQL changes in the networking stack: completely get away from the idea of a queue of X packets/writes and instead try to look at it as Y ms worth of work, which may be one large write, or may be a bunch of small things.
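A time-based limit of that kind might be sketched as follows (hypothetical names throughout): convert the bytes in flight into an estimated service time using a running throughput estimate, and refuse to queue past a fixed time budget of work.

    #include <linux/types.h>

    struct time_queue_limit {
            u64 budget_us;          /* allowed work, in microseconds */
            u64 est_bytes_per_us;   /* running throughput estimate */
            u64 inflight_bytes;     /* queued, not completed */
    };

    static bool tql_may_queue(const struct time_queue_limit *tql,
                              u64 bytes)
    {
            u64 est_us;

            if (!tql->est_bytes_per_us)
                    return true;    /* no estimate yet: allow */
            est_us = (tql->inflight_bytes + bytes) /
                     tql->est_bytes_per_us;
            return est_us <= tql->budget_us;
    }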
Then you can add smarter queue management on top of that, the way that fq_codel layered on top of BQL has been so successful. Doing fq_codel without BQL on network connections helps, as does BQL by itself, but neither alone is nearly as good as the two combined.
Toward less-annoying background writeback
Posted Apr 26, 2016 10:26 UTC (Tue) by farnz (subscriber, #17727) [Link]
fsync() may be a filesystem action, but it's implemented in terms of what you can actually do with the devices attached to the system; unfortunately, everything (bar possibly NVMe, as I've not looked at that in depth) storage-related has "big hammer" flush mechanisms - once you've submitted work to the device, it will be included in the next flush whether you need it to be or not.
Have you read the patchset yet? The change it makes is to look at the latency of blocking I/O (reads, syncs etc.) - if the latency of blocking I/O increases beyond a tolerable threshold (initial patches have it at 10ms, later patches change it to autotune to match the device's latency), you've got too much queueing, and should queue less non-blocking I/O.
And storage *already* does BQL type metering for the queues - because the link MTUs are so large (4 GiB for ATA, for example), nothing else has made sense for a long time. This patchset teaches the block layer to limit background transfer if it's causing excessive latency on blocking transfers.
Toward less-annoying background writeback
Posted Apr 15, 2016 0:49 UTC (Fri) by dlang (subscriber, #313) [Link]
wired networks have a huge advantage in that the data transmission rate is pretty constant (some variation with different packet sizes, but that can be accounted for if you know how many packets you are dealing with)
disk I/O is much more like WiFi where sending related things at once can be far more efficient than sending the same things at different times. this requires having expected data rates (BQL equivalents) that are more complex to take these sorts of things into account.
It will be good to see progress here. As has been noted before, reads tend to be synchronous (the app can't continue until the read is complete) while writes tend to be far more tolerant of delays. This is similar to how delays to some network traffic (e.g. DNS) have far more impact on the user experience than others (e.g. HTTP packets in a long-running connection). The solution on the network side has been multiple, small queues so that different flows don't block each other. Adapting this to both WiFi and I/O scheduling, where the rate varies and aggregation drastically affects the rate, will be an interesting challenge, but the rewards will be very large.
David Lang
Toward less-annoying background writeback
Posted Apr 15, 2016 17:42 UTC (Fri) by Beolach (subscriber, #77384) [Link]
This is basically the same issue the BFQ IO scheduler addresses, right? I thought BFQ was finally getting close to being upstreamed (as a modification of a trimmed-down CFQ). How does Jens' patch work when used in conjunction w/ BFQ?
Toward less-annoying background writeback
Posted Apr 20, 2016 9:59 UTC (Wed) by jospoortvliet (subscriber, #33164) [Link]
Well that's a start.
Posted Apr 17, 2016 6:35 UTC (Sun) by ksandstr (guest, #60862) [Link]
Tape drives aside, if there's a performance benefit[1] to sitting on things until they (possibly) choke up all queues, then it's certainly outweighed by resource underutilization. It's not like disks are subject to cablemodem-like timeslice arbitration.
Worse, in the meantime all spare memory has been captured for dirty or writeback pages, which the kernel prefers to flush out rather than go to swap (which might be an lz4 compressor, i.e. plenty fast, especially when waiting for I/O), instead of not replacing useful cache data with eventual writeback in the first place. And so the impatient console user's terminals fail to refresh and eventually the entire X session jams up -- all because of a copy to a USB2 storage device or SD card. If there's a management algorithm that alleviates unnecessary memory pressure, please pass it around & share the glass dick as well.
So the question is: why is there an apparent case of "fire and forget queueing"[2] in the kernel? Surely it was known[3] that, fancy policies or not, in order to keep latency from climbing, the maximum size of a queue must be restricted according to observed throughput? And that pretending that dirty data isn't de facto a queue doesn't help? There's an unkind comparison to a particular NoSQL database in here somewhere: insufficient design causing worst-case performance consistently when variables (predictably) get big enough.
[0] latency depending on writeback delay setting due to laptop-mode-utils, in itself rendered ineffective by synchronous filesystem journal commits anyway. Still, the default is five seconds, or 250 megs for a 2000-era hard disk; up to 3GB in the buffer if reading from a whiz-bang SSD.
[1] such as from elevator sorting; 2.2 made the hard disk so quiet...
[2] always a fuckup: proper queues have feedback, so properly-implemented clients have behaviour in response.
[3] e.g. from a CS class on the topic. Or a book, possibly. Queueing theory came from like the nineteen-aughties, so it's not exactly esoteric.
Well that's a start.
Posted Apr 20, 2016 14:32 UTC (Wed) by farnz (subscriber, #17727) [Link]
Fire-and-forget queueing is used because it's simple and it works up to a point - network stacks used to use it, too. The assumption behind it is that the queue is sufficiently small that feedback from a "proper" queue wouldn't adjust its length anyway; once this assumption is broken (as it has been in both networks and block devices), you need feedback to adjust the queue length. But, in the short term, just doing a fixed-size fire-and-forget queue can be a good-enough approximation of a variable-length queue with feedback.
Toward less-annoying background writeback
Posted Apr 18, 2016 8:14 UTC (Mon) by rmano (guest, #49886) [Link]
Thanks for the effort!
Toward less-annoying background writeback
Posted Apr 21, 2016 7:01 UTC (Thu) by kevinm (guest, #69913) [Link]
In particular, the network is hostile (or at least, self-interested) - network queuing can't really trust the endpoints: if you prioritize a flow that the endpoint has described as "important" then everyone will just start doing that.
That's not the case in block i/o - we control the whole kernel, so we can prioritize different sorts of flows based on their source and trust that we aren't being gamed by another part of the kernel.
Toward less-annoying background writeback
Posted Dec 15, 2016 12:36 UTC (Thu) by jubal (subscriber, #67202) [Link]
Unless what you perceive as a block device is, in fact, a network device. Case in point, Amazon's EBS. :-)
Toward less-annoying background writeback
Posted Apr 27, 2016 6:31 UTC (Wed) by loa (guest, #108477) [Link]
Toward less-annoying background writeback
Posted Apr 29, 2016 10:41 UTC (Fri) by jospoortvliet (subscriber, #33164) [Link]
Toward less-annoying background writeback
Posted Dec 28, 2016 18:08 UTC (Wed) by darkbasic (guest, #107872) [Link]
This is the bug report: https://bugzilla.kernel.org/show_bug.cgi?id=191391
Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds