LWN: Comments on "The creation of the io.latency block I/O controller" https://lwn.net/Articles/782876/ This is a special feed containing comments posted to the individual LWN article titled "The creation of the io.latency block I/O controller". en-us Mon, 03 Nov 2025 18:17:46 +0000 Mon, 03 Nov 2025 18:17:46 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Re: BFQ (The creation of the io.latency block I/O controller) https://lwn.net/Articles/785175/ https://lwn.net/Articles/785175/ frr <div class="FormattedComment"> I remember about a decade ago, when "dropbox style" services were all the rage, several "operators" of such services were among our potential customers for traditional RAID boxes with inexpensive SATA drives. One of the first morales was, that RAID makes the IOps math even more complicated, over bare disk drives :-) In the following text, I will abstract from the striped RAID nastiness, I'll refer to bare drives - and specifically spinning rust, as it has some cheap volume to offer.<br> <p> Our potential customers' traffic pattern was: read-mostly, with numerous clients trying to download the bulk content from the servers. I.e., imagine a hundred HTTP server threads, each performing a sequential download of a large file - but in time, their individual read requests get interleaved, leading to a pretty good approximation of random access to the LBA address space of the disk (or RAID volume).<br> <p> BFQ seems to promise to solve something similar: apparently it tries to keep some "flow stats" per "control group" (?) and if the summary pattern for the group amounts to sequential access, it can do transaction combining to achieve performace benefits. Similar but not the same, I suspect.<br> <p> The "parallel readers" pattern is probably not what Facebook are facing - by the description they have a more varied mix of different load patterns, and basically consider the access paterns to be mostly random.<br> <p> I also tested a scenario with many parallel *writing* threads. Imagine a storage back end for surveilance video recording from many cameras. Here the Linux kernel can offer a configurable and huge dirty cache, suitable for write-combining during write-back caching. Especially promising with the "deadline" scheduler. But, as you increase the load, the write-back pending times start growing, and unfortunately the "deadline" scheduler employs a strict timeout, after which the pending request is scheduled for "immediate" writeback = into a FIFO queue with no more ordering :-( So the promise of elevator-based optimization (albeit geologically slow) collapses into a pure FIFO and the show basically stops.<br> <p> In the days when I was playing with this, the cheap bulky SATA drives could do about 100-130 MBps (nowadays north of 200 MBps) but during the years, the truly random IOps remains at just about 75. That's 75 random seeks per second. And to squeeze "close to sequential" MBps performance out of a cheap disk drive, you would need to combine the traffic into transactions no smaller than say 4 MB (10 MB would be even better). 
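<p> As a rough back-of-the-envelope model of why the transaction size matters so much (a sketch with assumed numbers - roughly 13 ms per random seek and 120 MB/s of sequential transfer - rather than actual measurements):<br>
<pre>
/* Toy model: effective throughput of a seek-then-transfer disk as a
 * function of transaction size.  The seek time and sequential rate are
 * assumptions for illustration, not measured values. */
#include <stdio.h>

int main(void)
{
    const double seek_s = 1.0 / 75.0;   /* ~13 ms average positioning time (~75 seeks/s) */
    const double seq_mbps = 120.0;      /* assumed sequential transfer rate, MB/s */
    const double sizes_mb[] = { 0.064, 0.5, 1.0, 4.0, 10.0, 64.0 };

    for (unsigned i = 0; i < sizeof(sizes_mb) / sizeof(sizes_mb[0]); i++) {
        double t = seek_s + sizes_mb[i] / seq_mbps;   /* time spent per transaction */
        printf("%6.3f MB per seek -> %5.1f MB/s (%3.0f%% of sequential)\n",
               sizes_mb[i], sizes_mb[i] / t, 100.0 * (sizes_mb[i] / t) / seq_mbps);
    }
    return 0;
}
</pre>
With 4 MB per seek this toy model lands at roughly 70% of the sequential rate, and around 85% at 10 MB.<br>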
Here is a graph based on my simple measurements:<br> <a rel="nofollow" href="http://support.fccps.cz/download/adv/frr/hdd/perf_vs_tsize.gif">http://support.fccps.cz/download/adv/frr/hdd/perf_vs_tsiz...</a><br> <p> Also, the elevator-based ordering only has limited room to scale the IOPS throughput - possibly bounded by the IOPS-processing horsepower of the drive's electronics:<br> <a rel="nofollow" href="http://support.fccps.cz/download/adv/frr/hdd/IOps_vs_qdepth.gif">http://support.fccps.cz/download/adv/frr/hdd/IOps_vs_qdep...</a><br> <p> Once we get to SSDs, at first glance there's no point in trying to order and combine the transactions: there's no mechanical arm with magnetic heads to be moved about, no track-seeking inertia, and the elevator sorting algorithms and "stream tracking stats" are too expensive (in terms of latency and crunching horsepower) to be of any use. <br> The basic "page" or "row" for NAND flash reads and writes is about 4 kB, i.e. just about one memory page of the host CPU.<br> Still, beware of writing. The erase-block size (a power of two on the order of megabytes) may again be a good clue.<br> Even sequential writing may hit a hidden boundary after a couple hundred megabytes (or maybe gigabytes on bigger drives), long before the drive's payload capacity has been fully overwritten - allegedly because modern drives first use the MLC/TLC chips in "SLC mode" (which is faster), and only when that space runs out do they fall back to the native "symbol depth".<br> Random writing of much less than an erase block per transaction is generally not a good idea, because it requires more "flash janitoring" to take place (allocation of pages from different erase blocks, wear leveling having to "shuffle pages out of the way", etc.) and can really slow things down...<br> <p> That's a heap of factors to consider between the application-level abstraction of a file (or database, or whatever) and the physical sectors on spinning rust or pages in NAND flash chips. I've already hinted at the fun available in striped RAID, and note that even sequential files need to store metadata, which may add some "out of sequence" seeks as well. Modulo barrier operations. Admittedly this is difficult to optimize for at the block level or just slightly above it, while the apps don't necessarily even bother to use the "hinting" about their future intentions (fadvise()) that the kernel readily offers. Does the httpd even have useful clues about the traffic pattern it is serving? Well, at least it could know the size of the file being read from disk...<br> <p> IMO it's really the user-space apps that should be aware of their respective buffering needs and should do the optimal stream buffering internally, keeping a suitably large buffer in user space. It makes me wonder whether mmapping the source file of a stream (rather than an explicit read() = copy into a user-space buffer) would give the kernel more opportunities for merging multiple parallel sessions that are trying to "stream" the same popular file down a myriad of independent TCP sessions running at different transfer rates... Maybe someone has already written an optimized "file-serving back-end httpd" during the decade that has passed; I haven't been keeping an eye on it. People seem to be focusing more on building clouds of Java VMs, using JavaScript on the server side, etc., and the most popular HTTP daemons apparently still have buckets or buffers sized in kilobytes. 
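<p> To make the fadvise() hint mentioned above concrete, here is a minimal sketch of a streaming reader declaring its sequential intent to the kernel before serving a large file (the file path and the 1 MB buffer size are made up for the example):<br>
<pre>
/* Minimal illustration: hint the kernel about sequential access before
 * streaming a file.  The file name and buffer size are arbitrary choices
 * for the example, not recommendations. */
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise() */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/srv/files/popular.iso", O_RDONLY);   /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Declare that we will read the whole file sequentially, so the kernel
     * can apply larger, better-ordered readahead to this stream. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    static char buf[1 << 20];             /* 1 MB user-space streaming buffer */
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* ... hand the n bytes just read to the client's socket ... */
    }

    close(fd);
    return 0;
}
</pre>
Whether mmap() plus madvise(MADV_SEQUENTIAL) would let the page cache serve many parallel readers of the same popular file even better is exactly the open question above.<br>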
And the "file download services" are tackling the issue using automatic Flash-SSD-based cache that caters for the popular downloads, so that the spinning rust does not need to be accessed so intensively.<br> <p> This debate about storage IO bandwidth reservations feels like the "kernel and sysadmin team" fighting the "user-space apps team". Sounds odd to me. There should be an architect who would know better than play chess with himself on both sides of the checkerboard (on each side frantically trying to beat the opponent).<br> </div> Mon, 08 Apr 2019 14:50:21 +0000 The creation of the io.latency block I/O controller https://lwn.net/Articles/783374/ https://lwn.net/Articles/783374/ josefbacik <div class="FormattedComment"> At the time I was doing io.latency bfq wasn't mature enough to use. bfq is more akin to our current io.weight work, however we have found in testing that the latency induced by bfq is way more than we are willing to pay for. The io scheduler infrastructure currently only operates on requests, which means they get a request and that request is holding resources up for the entirety of its lifetime. This is why io.latency/wbt operate above the io scheduler, we can throttle all we want and not affect other workloads. Throttling at the io scheduler level means we're still holding on to that extra resource and punishing all the other workloads because of this lack of resource.<br> <p> This isn't an impossible problem to solve by any means, and is not a complaint against bfq itself. We just know this method works, and it works extremely well, and then allows us to run whatever io scheduler we want underneath it, wether it's kyber or mq-deadline or whatever.<br> </div> Mon, 18 Mar 2019 15:37:15 +0000 The creation of the io.latency block I/O controller https://lwn.net/Articles/783328/ https://lwn.net/Articles/783328/ juril <div class="FormattedComment"> Hi Josef. I was wondering how your work relates to BFQ (<a href="https://lwn.net/Articles/601799/">https://lwn.net/Articles/601799/</a>).<br> Could you comment on differences?<br> Thanks!<br> </div> Mon, 18 Mar 2019 13:30:06 +0000 The creation of the io.latency block I/O controller https://lwn.net/Articles/783214/ https://lwn.net/Articles/783214/ josefbacik <div class="FormattedComment"> The 1s timer is only armed while there’s IO happening. No IO, no periodic timer.<br> </div> Sat, 16 Mar 2019 07:35:05 +0000 The creation of the io.latency block I/O controller https://lwn.net/Articles/783206/ https://lwn.net/Articles/783206/ martin.langhoff <div class="FormattedComment"> How does the 1s tick play with power management? The progress towards tickless systems makes for visibly better battery runtime, and better VM behavior on the server side...<br> </div> Sat, 16 Mar 2019 02:49:11 +0000 The creation of the io.latency block I/O controller https://lwn.net/Articles/783102/ https://lwn.net/Articles/783102/ unixbhaskar <div class="FormattedComment"> Well, sounds good. Many many moons ago, I used to handle throttled web server, specifically runs Apache, and I had encountered quite a few oom. Due to my lack of understanding and precautions, I failed to get over it. Looks like something going to assist people like me in a big way. Thank you and the entire fellas for the hard work.<br> </div> Fri, 15 Mar 2019 03:41:07 +0000