
Toward less-annoying background writeback

By Jonathan Corbet
April 13, 2016
It's an experience many of us have had: write a bunch of data to a relatively slow block device, then try to get some other work done. In many cases, the system will slow to a crawl or even appear to freeze for a while; things do not recover until the bulk of the data has been written to the device. On a system with a lot of memory and a slow I/O device, getting things back to a workable state can take a long time, sometimes measured in minutes. Linux users are understandably unimpressed by this behavior pattern, but it has been stubbornly present for a long time. Now, perhaps, a new patch set will improve the situation.

That patch set, from block subsystem maintainer Jens Axboe, is titled "Make background writeback not suck." "Background writeback" here refers to the act of flushing block data from memory to the underlying storage device. With normal Linux buffered I/O, a write() call simply transfers the data to memory; it's up to the memory-management subsystem to, via writeback, push that data to the device behind the scenes. Buffering writes in this manner enables a number of performance enhancements, including allowing multiple operations to be combined and enabling filesystems to improve layout locality on disk.

So how is it that a performance-enhancing technique occasionally leads to such terrible performance? Jens's diagnosis is that it has to do with the queuing of I/O requests in the block layer. When the memory-management code decides to write a range of dirty data, the result is an I/O request submitted to the block subsystem. That request may spend some time in the I/O scheduler, but it is eventually dispatched to the driver for the destination device. Getting there requires passing through a series of queues.

The problem is that, if there is a lot of dirty data to write, there may end up being vast numbers (as in thousands) of requests queued for the device. Even a reasonably fast drive can take some time to work through that many requests. If some other activity (clicking a link in a web browser, say, or launching an application) generates I/O requests on the same block device, those requests go to the back of that long queue and may not be serviced for some time. If multiple, synchronous requests are generated — page faults from a newly launched application, for example — each of those requests may, in turn, have to pass through this long queue. That is the point where things appear to just stop.

In other words, the block layer has a bufferbloat problem that mirrors the issues that have been seen in the networking stack. Lengthy queues lead to lengthy delays.

As with bufferbloat, the answer lies in finding a way to reduce the length of the queues. In the networking stack, techniques like byte queue limits and TCP small queues have mitigated much of the bufferbloat problem. Jens's patches attempt to do something similar in the block subsystem.

Taming the queues

Like networking, the block subsystem has queuing at multiple layers. Requests start in a submission queue and, perhaps after reordering or merging by an I/O scheduler, make their way to a dispatch queue for the target device. Most block drivers also maintain queues of their own internally. Those lower-level queues can be especially problematic since, by the time a request gets there, it is no longer subject to the I/O scheduler's control (if there is an I/O scheduler at all).

Jens's patch set aims to reduce the amount of data "in flight" through all of those queues by throttling requests when they are first submitted. To put it simply, each device has a maximum number of buffered-write requests that can be outstanding at any given time. If an incoming request would cause that limit to be exceeded, the process submitting the request will block until the length of the queue drops below the limit. That way, other requests will never be forced to wait for a long queue to drain before being acted upon.
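The mechanism is easy to model outside the kernel. The sketch below is a user-space Python toy, not the patch set's actual code; the class name, the default limit, and the method names are all invented for illustration:

```python
import threading

class WritebackThrottle:
    """Toy model of per-device background-writeback throttling.

    The limit value and the identifiers here are made up for
    illustration; they do not correspond to names in the patch set.
    """
    def __init__(self, limit=8):
        self.limit = limit      # max buffered-write requests in flight
        self.in_flight = 0
        self.cond = threading.Condition()

    def submit(self):
        # A submitter blocks here until the device queue drains below
        # the limit, instead of piling more requests up behind it.
        with self.cond:
            while self.in_flight >= self.limit:
                self.cond.wait()
            self.in_flight += 1

    def complete(self):
        # Called when the device reports a request as done.
        with self.cond:
            self.in_flight -= 1
            self.cond.notify()
```

A submitter that would push the device over its limit simply sleeps in submit() until a completion makes room, so a foreground request never finds thousands of background writes queued ahead of it.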

In the real world, of course, things are not quite so simple. Writeback is not just important for ensuring that data makes it to persistent storage (though that is certainly important enough); it is also a key activity for the memory-management subsystem. Writeback is how dirty pages are made clean and, thus, available for reclaim and reuse; if writeback is impeded too much, the system could find itself in an out-of-memory situation. Running out of memory can lead to other user-disgruntling delays, along with unleashing the OOM killer. So any writeback throttling must be sure to not throttle things too much.

The patch set tries to avoid such unpleasantness by tracking the reason behind each buffered-write operation. If the memory-management subsystem is just pushing dirty pages out to disk as part of the regular task of making their contents persistent, the queue limit applies. If, instead, pages are being written to make them free for reclaim — if the system is running short of memory, in other words — the limit is increased. A higher limit also applies if a process is known to be waiting for writeback to complete (as might be the case for an fsync() call). On the other hand, if there have been any non-writeback requests within the last 100ms, the limit is reduced below the default for normal writeback requests.
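The reason-dependent limits might be modeled as below; the reason names and scaling factors are invented for illustration, not taken from the patches:

```python
def effective_limit(base_limit, reason, recent_foreground_io):
    """Pick a queue limit for a buffered write, given why it was issued.

    Hypothetical sketch: the reason strings and the doubling/halving
    factors are invented; the real patch set derives its numbers
    differently.
    """
    if reason in ("reclaim", "sync_wait"):
        # Memory pressure, or a process waiting on writeback
        # (e.g. fsync()): loosen the throttle.
        return base_limit * 2
    if recent_foreground_io:
        # Non-writeback requests seen recently (the article's 100ms
        # window): back off below the normal limit.
        return max(1, base_limit // 2)
    return base_limit  # plain background writeback
```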

There is also a potential trap in the form of drives that do their own write caching. Such drives will indicate that a write request has completed once the data has been transferred, but that data may just be sitting in a cache within the drive itself. In other words, the drive, too, may be maintaining a long queue. In an attempt to avoid overfilling that queue, the block layer will impose a delay between write operations on drives that are known to do caching. That delay is 10ms by default, but can be tweaked via a sysfs knob.
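That pacing amounts to inserting a small delay between successive writes; a user-space toy version might look like this (only the 10ms default comes from the article, the function itself is invented):

```python
import time

def paced_dispatch(requests, issue, delay_s=0.010):
    """Issue writes with a fixed inter-write delay, as the early patch
    set did for drives known to cache writes internally.  Illustrative
    only; 'issue' stands in for handing a request to the driver."""
    for req in requests:
        issue(req)
        time.sleep(delay_s)  # give the drive's cache time to drain
```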

Jens tested this work by having one process write 100MB each to 50 files while another process tries to read a file. The reading process will, on current kernels, be penalized by having each successive read request placed at the end of a long queue created by all those write requests; as might be expected, it performs poorly. With the patches applied, the writing processes take a little longer to complete, but the reader runs much more quickly, with far fewer requests taking an inordinately long period of time.

This is an early-stage patch set; it is not expected to go upstream in the near future. Patches that change memory-management behavior can often cause unexpected problems with different workloads, so it takes a while to build confidence in a significant change, even after the development work is deemed to be complete (which is not the case here). Indeed, Dave Chinner has already reported a performance regression with one of his testing workloads. The tuning of the queue-size limits also needs to be made automatic if possible. There is clearly work still to be done here; the patch set is also likely to be a subject of discussion at the upcoming Linux Storage, Filesystem, and Memory-Management Summit. So users will have to wait a bit longer for this particular annoyance to be addressed.



Toward less-annoying background writeback

Posted Apr 14, 2016 19:57 UTC (Thu) by fandingo (guest, #67019) [Link]

> In an attempt to avoid overfilling that queue, the block layer will impose a delay between write operations on drives that are known to do caching. That delay is 10ms by default, but can be tweaked via a sysfs knob.

Yuck. I don't see what useful purpose this artificial latency is supposed to serve. Surely, after all these years, we've realized that people who use writeback caching devices desire that behavior. Why get in their way? Users know what they're doing. And if they don't? They sure as shit don't know about a sysfs knob. Fight the urge to add these silly, never-to-be-used options.

I don't want 10ms of latency for no reason, imposed from an OS layer that can't possibly know whether 10ms of latency per write provides any benefit whatsoever. What's the derogatory phrase people like to hurl about these sorts of things? Layering violation! OS, manage your buffers; disk, manage yours; don't try to do sneaky stuff to manipulate the other's.

Toward less-annoying background writeback

Posted Apr 14, 2016 20:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

If you don't impose a delay then you will get hit by an unexpected cliff once the device's buffer fills.

It's not perfect; they should use something like CoDel to modulate the delay. But it's a start.

Toward less-annoying background writeback

Posted Apr 15, 2016 8:19 UTC (Fri) by roblucid (subscriber, #48964) [Link]

A fixed 10ms does sound like the kind of arbitrary restriction that middle layers impose, and which may become very, very wrong with exotic technology... persistent memory or cache-backed devices, for example.

This sounds a bit like a producer-consumer problem. Perhaps the code could learn where the cliff is, then start to back off when outstanding writes are two-thirds of the way to it, leaving some room for priority fsync-type traffic? If these in-drive queues do improve performance, then starting to re-fill the queue after a drain period, when it should be one-third full, allows the disk cache to do its job.

Toward less-annoying background writeback

Posted Apr 16, 2016 2:28 UTC (Sat) by axboe (subscriber, #904) [Link]

This is the intent: to make it auto-tuning. The cache delay etc. aren't really great solutions; as mentioned in the original posting of the patch set, they're more to highlight the problem than to be a proper final solution. Network scheduling has the luxury of being able to drop packets; that's not something I can do, at least not without corrupting data. So backing off is the best solution we have. I am working on improving it, so that the final tunable will just be a latency metric. Or maybe not even that; we can track most of that and store it.

For the stats, I wrote this: http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-th...

And I did actually implement something resembling CoDel, on top of the stats patch: http://git.kernel.dk/cgit/linux-block/commit/?h=wb-buf-th...

Currently experimenting with the latter; it'll be folded into the parent patch that introduces the throttling. For writeback caching specifically, the current approach is to allow the queue to drain to zero before allowing more I/Os, once the allowed writes have been dispatched. It might need to be a bit more aggressive, but I'm not going to re-add the explicit delay.

Toward less-annoying background writeback

Posted Apr 16, 2016 9:33 UTC (Sat) by roblucid (subscriber, #48964) [Link]

Thank you! This sounds very promising :)

Toward less-annoying background writeback

Posted Apr 15, 2016 18:03 UTC (Fri) by nix (subscriber, #2304) [Link]

The artificial latency is supposed to be roughly as long as it takes the drive to commit a write to storage. The idea is that, as at other layers, we prevent the drive queueing up vast numbers of writes and sticking reads at the back of that queue.

(However, this is clearly not always desirable. E.g. my Areca controller aggressively promotes reads in front of writes in the queue; the only situation in which you cannot issue a read and get an immediate response is when an actual write is physically going on -- making it a device for which the desired delay would be zero. But whether this applies to any given device is a per-device property. It's clearly possible for the kernel to discern that read latency is apparently being worsened by the presence of outstanding writes -- hence, I suppose, the eventual desire to discard this knob and go to self-tuning.)

Toward less-annoying background writeback

Posted Apr 15, 2016 20:41 UTC (Fri) by magila (subscriber, #49627) [Link]

What would be really nice is if distros would start adding a udev rule to disable write caching in devices by default. With storage devices that have robust command queuing (pretty much anything made in the last eight years or so), the vast majority of users would be better off without the drive doing writeback caching. Most enterprise drives already ship with write caching disabled by default; unfortunately, consumer-grade devices tend to follow a policy of speed above all else.

Toward less-annoying background writeback

Posted Apr 16, 2016 0:02 UTC (Sat) by fandingo (guest, #67019) [Link]

> the vast majority of users would be better off without them doing writeback caching.

I don't know how you arrive at this conclusion. This user likes writeback caching very much. Furthermore, for over a decade, well over half of computer sales have been for models that include a UPS: laptops. Plus, not all data is critical; much of it can be safely cached. Data only needs to become persistent when synced, which should be causing ATA CACHE FLUSH commands to be dispatched.

> Most enterprise drives already ship with write caching disabled by default

Because data integrity needs are different for enterprise systems, especially since they're often configured in RAIDs.

> unfortunately consumer grade devices tend to follow a policy of speed above all else.

Considering how egregiously slow storage, even SSDs, is compared to the rest of the hardware, performance should be a high concern. As stated previously, there are plenty of strategies to ensure data integrity for data that matters, but we shouldn't blindly apply those strategies to all I/O.

Toward less-annoying background writeback

Posted Apr 16, 2016 1:48 UTC (Sat) by magila (subscriber, #49627) [Link]

The point I was perhaps making too subtly is that, with a drive that has proper command queuing, the performance gain from write caching is actually pretty marginal. It's certainly too small for the average user to be able to tell the difference outside of benchmarks.

Historically, write caching was a big deal because ATA drives didn't support command queuing at all. Write caching allowed drives to queue writes even without queuing support at the host interface. The performance difference between a queue depth of 1 and anything >1 is quite large, so it was judged worth the trade-offs to enable it. Now even low-end consumer drives support 32 queued commands, which is more than enough to reach the point of diminishing returns. So enabling write caching really isn't doing much for most people these days, other than causing problems like the one mentioned in the article.

Enterprise drives using SCSI have had proper command queuing a lot longer than ATA drives have, so there's been less incentive for them to use write caching, even aside from their being more conservative about data integrity.

Toward less-annoying background writeback

Posted Apr 17, 2016 22:55 UTC (Sun) by nix (subscriber, #2304) [Link]

I have no idea what makes you think that write caching isn't doing much. The difference between taking milliseconds to (say) cp -al a large directory tree with a hot cache (all done in-core) and taking dozens to hundreds of seconds blocked because of a massively seeky wait for thousands of tiny writes is *huge*.

Toward less-annoying background writeback

Posted Apr 18, 2016 0:32 UTC (Mon) by magila (subscriber, #49627) [Link]

I think you misunderstood me. I'm only talking about write caching in the drive itself, not in the OS. Write caching at the OS level is worthwhile for the same reason it used to be for drives: It massively speeds up synchronous write requests. But now that every drive has support for command queuing, the kernel is always issuing asynchronous requests to them. The main thing write caching in the drive does at this point is deny the kernel knowledge of when writes have actually been completed, which is useful information to have when you're trying to do flow control.

In your example, it would only make a difference if creating hard links was synchronous all the way down to the underlying storage device. I'm nearly 100% certain it's not.

Toward less-annoying background writeback

Posted Apr 18, 2016 10:33 UTC (Mon) by nix (subscriber, #2304) [Link]

Aha: agreed. There is one exception (which I note is allowed for in the patch): drives with battery-backed storage, which can safely be mounted "nobarrier". These often have really quite big caches: e.g. my very old Areca controller has a half-gigabyte of storage and can easily cache up to four or five seconds of RAID writes in there. It's probably worth not throwing *that* away -- but it may still be worth doing queue control on it if the controller is sufficiently stupid that it allows writes in the queue to dominate reads. (Most controllers with that much RAM are not that stupid; in particular, mine isn't.)

I don't see any way the kernel can determine this: it needs per-driver knowledge, which is convenient because most controllers with this much RAM are RAID controllers which have specific per-controller kernel drivers in any case.

Toward less-annoying background writeback

Posted Apr 18, 2016 16:32 UTC (Mon) by magila (subscriber, #49627) [Link]

Good point about RAID controllers. The udev rule I was envisioning would only apply to SATA drives. SCSI targets can generally be trusted to come with a reasonable default when it comes to write caching.

Toward less-annoying background writeback

Posted Apr 20, 2016 10:58 UTC (Wed) by Jonno (subscriber, #49613) [Link]

> The udev rule I was envisioning would only apply to SATA drives.
Please limit such a rule to SATA drives with NCQ support (connected to SATA controllers with NCQ support). I still have a couple of older SATA drives without NCQ support lying around, and trying to use them without write cache is *painfully* slow.

Toward less-annoying background writeback

Posted Apr 18, 2016 0:52 UTC (Mon) by magila (subscriber, #49627) [Link]

I should also point out that even if you were making requests which were not cached by the kernel, you would still be held up waiting for the drive to write (some of) the data. Drives have a limited number of outstanding I/O requests they can keep track of internally, typically between 128 and 256, maybe less for the lowest end drives. Once you hit that limit your writes are going to start to block even with write caching enabled.

Toward less-annoying background writeback

Posted Apr 16, 2016 2:30 UTC (Sat) by axboe (subscriber, #904) [Link]

> Yuck. I don't see what useful purpose this artificial latency is supposed to serve. Surely, after all these years, we've realized that people who use writeback caching devices desire that behavior. Why get in their way? Users know what they're doing. And if they don't? They sure as shit don't know about a sysfs knob. Fight the urge to add these silly, never-to-be-used options.

If you had looked at the code and original posting, you would have seen both that these options are going away, and why they currently exist. And no, most users don't know about their writeback caching, let alone how to turn it off. That's another knob.

Toward less-annoying background writeback

Posted Apr 14, 2016 21:27 UTC (Thu) by mtanski (guest, #56423) [Link]

I felt like the patch should have been named "Make background writeback great again for the first time"

Toward less-annoying background writeback

Posted Apr 15, 2016 9:12 UTC (Fri) by roblucid (subscriber, #48964) [Link]

Not really happy that queue-depth heuristics and arbitrary time delays are the best or most direct solution. It isn't really the queue depth that's the problem but poor prioritization; if non-synchronous writes block earlier, then a system is more likely to have unused (i.e. wasted) memory, and a sequence of jobs (read, process in memory, write results) may slow significantly through making less use of buffering. But I guess directly addressing this would be impractical because of the wholesale impact on the mm and block parts of the kernel.

Conceptually, each device could have priority queues: an urgent one (for reads, filesystem/RAID journal and metadata, fsync, and reclaim writes) and a when-idle one, allowing the background I/O to be starved until there's a backlog. When little urgent I/O is going on, background writes can be started once they're mature enough to have had a chance to be combined, or eliminated in the case of temporary files. That's analogous to a network router with multiple interfaces, where you want the queues to always be on the output, not the input.

The memory layer could try to clean older pages as non-urgent requests, rather than relying on the general sync every 5s to limit the amount of un-fsynced disk data; requests could later be re-prioritized when a process fsyncs.

That could allow more direct tunables, with suitable queue depths, to suit the different characteristics of SSDs, persistent memory, normal disks, shingled drives, and so on, avoiding unnecessary wear and needless read-plus-rewrite cycles. Then the buffer cache available to a process could be sized per device, rather than depending on queue-depth heuristics, which in some cases are likely to block user processes sooner than really necessary.

Toward less-annoying background writeback

Posted Apr 15, 2016 23:00 UTC (Fri) by dlang (subscriber, #313) [Link]

There is a point beyond which additional queue depth doesn't help at all.

You need queues to be long enough to keep the hardware busy, but beyond that you are adding latency for new things added to the queue, without any benefit.

It's been observed many times over the last few years that storage performance problems are starting to look more like network performance problems, so it makes sense that the same types of solutions will show up.

On the network side, we used to have queues of a (large) fixed number of packets, no matter how much data the packets contained. BQL has moved that so that the queues are based on the amount of data to be sent, and this has made a huge improvement.

WiFi and storage aren't as simple, because how long it takes to process a given amount of data depends on so many variables (some of which are unpredictable without knowledge of the future), but reducing queue sizes to the minimums that are needed to keep the hardware busy, and allowing better interleaving of requests (which also allows for better handling of priority requests) will help a lot.
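The BQL idea mentioned above, bounding a queue by an amount of data (really, a time's worth of work for the device) rather than by a request count, can be sketched as follows; the function names and numbers are illustrative, not from any kernel code:

```python
def queue_byte_limit(observed_bytes_per_sec, target_ms=2):
    """Translate 'keep roughly target_ms of work queued' into a byte
    limit, in the spirit of BQL.  Purely illustrative numbers."""
    return int(observed_bytes_per_sec * target_ms / 1000)

def admit(queued_bytes, request_bytes, observed_bytes_per_sec, target_ms=2):
    # Admit a new request only if the queue would still hold no more
    # than target_ms worth of work for this device.
    limit = queue_byte_limit(observed_bytes_per_sec, target_ms)
    return queued_bytes + request_bytes <= limit
```

The key point is that the limit is derived from observed device throughput, so a fast device gets a deeper queue (in bytes) than a slow one, while both keep the same bounded latency.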

Toward less-annoying background writeback

Posted Apr 16, 2016 9:42 UTC (Sat) by roblucid (subscriber, #48964) [Link]

Considering the I/O throughput point of view... yes.
BUT if I'm running a mix of jobs in a sequence, parts of which may generate I/O but are NOT a database needing ACID, then I'd rather like the process to get on with it, finish, and NOT care that the data is only in memory. Then the next part of the processing could be done, perhaps accessing a different disk, for example.

Toward less-annoying background writeback

Posted Apr 20, 2016 14:23 UTC (Wed) by farnz (subscriber, #17727) [Link]

That's an important part of the patchset. The commands for device I/O are fairly restricted - at best, you get three commands (buffered write, direct write bypassing buffering, flush all buffers), and three queueing options (skip the queue and execute now, execute when all previous commands are completed, execute when it's convenient); you do, of course, get notified as commands complete. Direct writes are inconvenient to use, because of the interactions with previously sent buffered writes, and are thus only really used by filesystem code (where we expect the experts to get it right).

If we send data from your job down to the device, then another job does some writes that also get sent to the device, and then the second job wants to fsync() its data (e.g. because it needs to meet ACID guarantees), we have a problem; there is no way to tell the device "these previous commands can stay in the buffer if that's better for performance, but this subset needs to be executed before this flush". Instead, we have to flush the lot in one big go: both the stuff your first job wrote (that doesn't need to go yet) and the stuff the database wants safely in persistent storage. Alternatively, we can waste device performance by sending the fsync() writes again, bypassing the queue - but this means that all the work the device does to reorder commands for higher performance goes to waste.

If the queues are long (e.g. 5 seconds to fully flush the disk buffer), then we have a problem; when the database asks for its ACID flush, it takes 5 seconds, as it has to flush both the buffered data from your less-important I/O and the buffered data from the database to disk. Further, during this 5-second flush, we can't service reads - the device is busy writing out its buffer. If the queue is too short, we also have a problem - the device doesn't have enough to do, so it idles while we send more commands down to it, and Linux I/O is slow - a writeout that could have completed in 1 second takes 20.

The patchset aims to fix this - by keeping the amount queued on the device as small as possible without letting the device idle, it ensures that the kernel gets to pick and choose what data actually gets sent down. Thus the kernel can decide that it's not going to send all of your first job's data just yet, just a few microseconds worth at a time to keep the device busy; when the database does fsync(), it can send the appropriate set of buffered writes and buffer flushes to meet the database's ACID requirement ASAP, ignoring the long tail of data from your first job (which stays in kernel buffers until the device is at risk of going idle if it's not sent down). The result is that your big background I/O job no longer affects database latency to such a huge degree.

Toward less-annoying background writeback

Posted Apr 20, 2016 20:00 UTC (Wed) by dlang (subscriber, #313) [Link]

> If we send data from your job down to the device), then another job does some writes which also get sent to the device, then the second job wants to fsync() its data (e.g. because it needs to meet ACID guarantees), we have a problem; there is no way to tell the device "these previous commands can stay in the buffer if that's better for performance, but this subset need to be executed before this flush"; instead, we have to flush the lot in one big go, both the stuff your first job wrote (that doesn't need to go yet) and the stuff the database wants safely in persistent storage.

Actually, this depends on the filesystem. Ext3 worked the way you describe (and, IIRC, data written after the flush was issued could get caught up in the flush as well). But other filesystems have been able to isolate the writes/fsyncs done by one program on its files from writes done to other, unrelated files.

Now, once things actually hit the disk queues, you do lose that information; but if the queues are kept short, then you don't lose it until the point where you really don't care that much.

Toward less-annoying background writeback

Posted Apr 21, 2016 19:48 UTC (Thu) by farnz (subscriber, #17727) [Link]

I don't see how the filesystem affects background writeback - once we've chosen to send data down to the device queues, we have to flush it out to disk to complete an fsync() regardless of filesystem. The patchset "fixes" this, by ensuring that background writeback does not send much data down to the device queues.

Toward less-annoying background writeback

Posted Apr 21, 2016 21:40 UTC (Thu) by dlang (subscriber, #313) [Link]

fsync is a filesystem level action, not a device level action.

If you have two different partitions on one drive and do a fsync on one of them, it has no effect on the other one.

I fully agree that the queues of work to be done by the disk need better management than they currently get. These patches may be the beginning of that.

But I think the better beginning would be along the lines of the BQL changes in the networking stack: completely get away from the idea of a queue of X packets/writes and instead look at it as Y ms worth of work, which may be one large write, or may be a bunch of small things.

Then you can add smarter queue management on top of that, the way that fq_codel added on top of BQL has been so successful. Doing fq_codel without BQL on network connections helps, as does BQL by itself, but neither alone is nearly as good as the two combined.

Toward less-annoying background writeback

Posted Apr 26, 2016 10:26 UTC (Tue) by farnz (subscriber, #17727) [Link]

fsync() may be a filesystem action, but it's implemented in terms of what you can actually do with the devices attached to the system; unfortunately, everything storage-related (bar possibly NVMe, as I've not looked at that in depth) has "big hammer" flush mechanisms - once you've submitted work to the device, it will be included in the next flush whether you need it to be or not.

Have you read the patch set yet? The change it makes is to look at the latency of blocking I/O (reads, syncs, etc.) - if the latency of blocking I/O increases beyond a tolerable threshold (initial patches have it at 10ms; later patches change it to autotune to match the device's latency), you've got too much queueing and should queue less non-blocking I/O.
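A CoDel-flavored version of that feedback loop could look like the following user-space sketch; the thresholds, step sizes, and function name are invented for illustration, not taken from the patches:

```python
def adjust_limit(limit, observed_latency_ms, target_ms=10,
                 min_limit=1, max_limit=64):
    """CoDel-flavored adjustment of the background-write queue limit.

    If blocking I/O (reads, syncs) is seeing more than target_ms of
    latency, too much background data is queued: halve the limit.
    If latency is comfortably under target, cautiously grow it.
    All constants here are hypothetical.
    """
    if observed_latency_ms > target_ms:
        return max(min_limit, limit // 2)   # multiplicative decrease
    if observed_latency_ms < target_ms / 2:
        return min(max_limit, limit + 1)    # additive increase
    return limit                            # in the comfort zone
```

Running this on every latency sample converges the queue toward the deepest level that still keeps blocking I/O under the target, much as AIMD does for TCP congestion windows.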

And storage *already* does BQL-type metering for its queues - because the link MTUs are so large (4GiB for ATA, for example), nothing else has made sense for a long time. This patch set teaches the block layer to limit background transfers when they cause excessive latency on blocking transfers.

Toward less-annoying background writeback

Posted Apr 15, 2016 0:49 UTC (Fri) by dlang (subscriber, #313) [Link]

sounds like we need the equivalent of a fq_codel I/O scheduler.

Wired networks have a huge advantage in that the data transmission rate is pretty constant (there's some variation with different packet sizes, but that can be accounted for if you know how many packets you are dealing with).

Disk I/O is much more like WiFi, where sending related things together can be far more efficient than sending the same things at different times. This requires expected data rates (BQL equivalents) that are more complex, to take these sorts of things into account.

It will be good to see progress here. As has been noted before, reads tend to be synchronous (the app can't continue until the read completes) while writes tend to be far more tolerant of delays. This is similar to how delays to some network traffic (e.g. DNS) have far more impact on the user experience than delays to other traffic (e.g. HTTP packets in a long-running connection). The solution on the network side has been multiple, small queues, so that different flows don't block each other. Adapting this to both WiFi and I/O scheduling, where the rate varies and aggregation drastically affects it, will be an interesting challenge, but the rewards will be very large.

David Lang

Toward less-annoying background writeback

Posted Apr 15, 2016 17:42 UTC (Fri) by Beolach (subscriber, #77384) [Link]

This is basically the same issue the BFQ IO scheduler addresses, right? I thought BFQ was finally getting close to being upstreamed (as a modification of a trimmed-down CFQ). How does Jens' patch work when used in conjunction w/ BFQ?

Toward less-annoying background writeback

Posted Apr 20, 2016 9:59 UTC (Wed) by jospoortvliet (subscriber, #33164) [Link]

The problems seem orthogonal - BFQ suffers from these problems too, I understand from the conversation on the kernel ML. It isn't so much a scheduling issue as a "all buffers are full and there's nothing the scheduler can do" problem.

Well that's a start.

Posted Apr 17, 2016 6:35 UTC (Sun) by ksandstr (guest, #60862) [Link]

There's a second problem that's readily observable from userspace: when copying data from a fast device to a slow one, the kernel accumulates a fantastic amount of writeback before ever starting to write to the target device. This results in device bandwidth that's going unused, leading to (say) a write-limited 5GB copy to a 50MB/s device taking not just the requisite 100s but also what's possibly tens of seconds until the kernel finally deigns to chooch along[0] -- instead of having both the read and write sides go at write speed, as PCs did back in the nineties. (yes yes, TCQ and NCQ happened, but is it intended that those amount to a degraded level of service?)

Tape drives aside, if there's a performance benefit[1] to sitting on things until they (possibly) choke up all queues then it's certainly outweighed by resource underutilization. It's not like disks are subject to a cablemodem-like timeslice arbitration.

Worse, in the meantime all spare memory has been captured for dirty or writeback pages, which the kernel prefers to flush out rather than go to swap (which might be an lz4 compressor, i.e. plenty fast, especially when waiting for I/O), instead of not replacing useful cache data with eventual writeback in the first place. And so the impatient console user's terminals fail to refresh and eventually the entire X session jams up -- all because of a copy to a USB2 storage device, or SD card. If there's a management algorithm that alleviates unnecessary memory pressure, please pass it around & share the glass dick as well.

So the question is: why is there an apparent case of "fire and forget queueing"[2] in the kernel? Surely it was known[3] that, fancy policies or not, in order to keep latency from climbing, the maximum size of a queue must be restricted according to observed throughput? And that pretending that dirty data isn't de facto a queue doesn't help? There's an unkind comparison to a particular NoSQL database in here somewhere: insufficient design causing worst-case performance consistently when variables (predictably) get big enough.

[0] latency depending on writeback delay setting due to laptop-mode-utils, in itself rendered ineffective by synchronous filesystem journal commits anyway. Still, the default is five seconds, or 250 megs for a 2000-era hard disk; up to 3GB in the buffer if reading from a whiz-bang SSD.
[1] such as from elevator sorting; 2.2 made the hard disk so quiet...
[2] always a fuckup: proper queues have feedback, so properly-implemented clients adjust their behaviour in response.
[3] e.g. from a CS class on the topic. Or a book, possibly. Queueing theory came from like the nineteen-aughties, so it's not exactly esoteric.
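
The latency cost described in this comment is just backlog divided by device throughput (Little's law in its simplest form). A back-of-the-envelope check, using the comment's own numbers, shows why the default dirty limits hurt so much on a big-memory box with a slow device:

```python
def drain_time_seconds(dirty_bytes, device_bytes_per_sec):
    """Time a full flush of accumulated dirty data will monopolize the
    device; every new synchronous request waits behind this backlog."""
    return dirty_bytes / device_bytes_per_sec

MB = 1024 * 1024
GB = 1024 * MB

# A USB2-class device doing roughly 50 MB/s, as in the comment above.
usb = 50 * MB

print(drain_time_seconds(250 * MB, usb))   # 5.0 seconds: tolerable
print(drain_time_seconds(3 * GB, usb))     # ~61 seconds: the frozen desktop
```

The arithmetic is the whole argument: a queue sized in bytes, with no reference to observed throughput, turns directly into tens of seconds of latency on a slow device.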

Well that's a start.

Posted Apr 20, 2016 14:32 UTC (Wed) by farnz (subscriber, #17727) [Link]

Fire-and-forget queueing is used because it's simple and it works up to a point - network stacks used to use it, too. The assumption behind it is that the queue is small enough that feedback from a "proper" queue would never need to shorten it; once that assumption is broken (as it has been for both networks and block devices), you need feedback to adjust the queue length. But, in the short term, a fixed-size fire-and-forget queue can be a good enough approximation of a variable-length queue with feedback.
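
The feedback idea described here can be sketched as a latency-driven depth controller, roughly the AIMD shape familiar from TCP congestion control (and similar in spirit to what Jens's writeback-throttling patches do with device feedback). This is an illustrative toy, not the actual kernel logic; the parameter names are invented:

```python
class FeedbackQueueLimit:
    """Adjust the permitted number of in-flight writeback requests from
    observed completion latency, instead of using a fixed fire-and-forget
    depth. Additive increase while latency is under target, multiplicative
    decrease when it is over (the classic AIMD shape)."""

    def __init__(self, target_latency_ms, min_depth=1, max_depth=128):
        self.target = target_latency_ms
        self.min_depth = min_depth
        self.max_depth = max_depth
        self.depth = min_depth

    def on_completion(self, observed_latency_ms):
        if observed_latency_ms > self.target:
            # Device is congested: back off sharply.
            self.depth = max(self.min_depth, self.depth // 2)
        else:
            # Latency is fine: probe for more throughput, one slot at a time.
            self.depth = min(self.max_depth, self.depth + 1)
        return self.depth
```

The point of the shape is that the queue depth converges to whatever the device can service within the latency target, rather than to a constant chosen at design time.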

Toward less-annoying background writeback

Posted Apr 18, 2016 8:14 UTC (Mon) by rmano (guest, #49886) [Link]

Nice --- I look forward to seeing it. This problem bites a lot of people --- this: http://unix.stackexchange.com/questions/107703/why-is-my-... is my most-viewed answer on unix.stackexchange, and I suspect it is the reason why interactivity goes nuts while Recoll is crawling my drives...

Thanks for the effort!

Toward less-annoying background writeback

Posted Apr 21, 2016 7:01 UTC (Thu) by kevinm (guest, #69913) [Link]

It's important not to take the analogy with networking too seriously.

In particular, the network is hostile (or at least, self-interested) - network queuing can't really trust the endpoints: if you prioritize a flow that the endpoint has described as "important" then everyone will just start doing that.

That's not the case in block I/O: we control the whole kernel, so we can prioritize different sorts of flows based on their source and trust that we aren't being gamed by another part of the kernel.

Toward less-annoying background writeback

Posted Dec 15, 2016 12:36 UTC (Thu) by jubal (subscriber, #67202) [Link]

Unless what you perceive as a block device is, in fact, a network device. Case in point, Amazon's EBS. :-)

Toward less-annoying background writeback

Posted Apr 27, 2016 6:31 UTC (Wed) by loa (guest, #108477) [Link]

Why not implement the obvious solution:
make the queues at all levels prioritize reads over writes, and reorder the queue (almost) every time a read comes in and the queue is full of writes?
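
The read-over-write priority suggested here can be sketched as a two-class software queue. This toy model (names invented) also shows why such a scheme needs a starvation guard, so that a steady stream of reads cannot stall writes forever:

```python
from collections import deque

class ReadPriorityQueue:
    """Two-class dispatch: reads always go ahead of queued writes,
    with a starvation guard so writes still make progress."""

    def __init__(self, max_writes_skipped=16):
        self.reads = deque()
        self.writes = deque()
        self.skipped = 0                   # writes passed over by reads
        self.max_writes_skipped = max_writes_skipped

    def submit(self, request, is_read):
        (self.reads if is_read else self.writes).append(request)

    def dispatch(self):
        # Serve a write occasionally, even under a flood of reads.
        if self.writes and self.skipped >= self.max_writes_skipped:
            self.skipped = 0
            return self.writes.popleft()
        if self.reads:
            if self.writes:
                self.skipped += 1
            return self.reads.popleft()
        if self.writes:
            return self.writes.popleft()
        return None
```

Even in this toy form, a synchronous read submitted behind thousands of queued writes is dispatched next rather than waiting for the backlog to drain, which is exactly the page-fault case the article describes.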

Toward less-annoying background writeback

Posted Apr 29, 2016 10:41 UTC (Fri) by jospoortvliet (subscriber, #33164) [Link]

The queue being dealt with here is in hardware, so that means changing the hardware and swapping it out, or at least getting the manufacturer to supply new firmware. Plus, it still doesn't solve all the issues, like urgent vs. non-urgent writes.

Toward less-annoying background writeback

Posted Dec 28, 2016 18:08 UTC (Wed) by darkbasic (guest, #107872) [Link]

I wanted to test kernel 4.10 because of the automatic throttling of writeback queues on the block side; with 4.8.13, every time I copy big files to my USB stick the system becomes unresponsive.
Unfortunately, with 4.10, when I try to write the Arch image to a USB stick using "sudo dd if=archlinux-2016.12.01-dual.iso of=/dev/sdb bs=1 status=progress", it finishes instantly (as if it were writing to the cache with a sync still pending). It wrote something to the stick, but the image doesn't boot, and manually syncing does not help. Everything works flawlessly with kernel 4.8.

This is the bug report: https://bugzilla.kernel.org/show_bug.cgi?id=191391


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds