|
|
Log in / Subscribe / Register

Re: BFQ (The creation of the io.latency block I/O controller)

Re: BFQ (The creation of the io.latency block I/O controller)

Posted Apr 8, 2019 14:50 UTC (Mon) by frr (guest, #74556)
In reply to: The creation of the io.latency block I/O controller by juril
Parent article: The creation of the io.latency block I/O controller

I remember about a decade ago, when "dropbox style" services were all the rage, several "operators" of such services were among our potential customers for traditional RAID boxes with inexpensive SATA drives. One of the first morales was, that RAID makes the IOps math even more complicated, over bare disk drives :-) In the following text, I will abstract from the striped RAID nastiness, I'll refer to bare drives - and specifically spinning rust, as it has some cheap volume to offer.

Our potential customers' traffic pattern was: read-mostly, with numerous clients trying to download the bulk content from the servers. I.e., imagine a hundred HTTP server threads, each performing a sequential download of a large file - but in time, their individual read requests get interleaved, leading to a pretty good approximation of random access to the LBA address space of the disk (or RAID volume).

BFQ seems to promise to solve something similar: apparently it tries to keep some "flow stats" per "control group" (?) and if the summary pattern for the group amounts to sequential access, it can do transaction combining to achieve performace benefits. Similar but not the same, I suspect.

The "parallel readers" pattern is probably not what Facebook are facing - by the description they have a more varied mix of different load patterns, and basically consider the access paterns to be mostly random.

I also tested a scenario with many parallel *writing* threads. Imagine a storage back end for surveilance video recording from many cameras. Here the Linux kernel can offer a configurable and huge dirty cache, suitable for write-combining during write-back caching. Especially promising with the "deadline" scheduler. But, as you increase the load, the write-back pending times start growing, and unfortunately the "deadline" scheduler employs a strict timeout, after which the pending request is scheduled for "immediate" writeback = into a FIFO queue with no more ordering :-( So the promise of elevator-based optimization (albeit geologically slow) collapses into a pure FIFO and the show basically stops.

In the days when I was playing with this, the cheap bulky SATA drives could do about 100-130 MBps (nowadays north of 200 MBps) but during the years, the truly random IOps remains at just about 75. That's 75 random seeks per second. And to squeeze "close to sequential" MBps performance out of a cheap disk drive, you would need to combine the traffic into transactions no smaller than say 4 MB (10 MB would be even better). Here is a graph based on my simple measurements:
http://support.fccps.cz/download/adv/frr/hdd/perf_vs_tsiz...

And also the elevator-based ordering only has a limited room to scale the IOps thoughput - possibly bounded by the disk drive electronics processing horsepower for IOps:
http://support.fccps.cz/download/adv/frr/hdd/IOps_vs_qdep...

Once we get into SSD's, at a first glance there's no point in trying to order and combine the transactions, as there's no mechanical arm with magnetic heads to be moved about, there's no track seeking inertia, and the elevator sorting algorithms and "stream tracking stats" are too expensive (in terms of latency / crunching horsepower) to be any use.
The basic "page" or "row" for NAND Flash reading / writing is about 4 kB, i.e. just about one memory page of the host CPU.
Still, beware of writing. The erase block size (a binary number on the order of megabytes) may again be a good clue.
Even sequential writing may hit a hidden boundary after a couple hundred megabytes (or maybe gigabytes in the bigger drives), long before the drive's payload capacity is fully overwritten - allegedly because modern drives first start using the MLC/TLC chips in "SLC mode" (which is faster) and only when that space runs out, they need to fall back to the native "symbol depth".
Random writing of "much less than an erase block per transaction" is generally not a good idea, because it requires more "flash janitoring" to take place (allocation of pages from different erase blocks, wear leveling must "shuffle pages out of the way" etc) and can really slow things down...

It's a heap of factors to consider, between the application-level abstraction of a file (or database, or whatever) and the physical sectors on spinning rust or pages in NAND Flash chips. I've already hinted at the fun available in striped RAID, and also note that even sequential files need to store metadata, which may add some "out of sequence" seeks as well. Modulo barrier operations. Admittedly difficult to optimize for at the block level or just slightly above, while the apps not necessarily even care to use "hinting" about their future intentions ( fadvise() ) which is readily available from the kernel. Does the httpd even have some useful clues about the traffic pattern it is serving? Well at least it could know the size of a file being read from disk...

IMO it's really the user space apps that should be aware of their respective buffering needs, and should do the optimal stream buffering internally, keeping a suitably large buffer in the user space. Makes me wonder if mmapping the source file of a stream (rather than an explicit read() = copy into a user-space buffer) would give the kernel more opportunities for merging multiple parallel sessions, that are trying to "stream" the same popular file down a myriad independent TCP sessions running at different transfer rates... Not sure, maybe someone has already written an optimized "file-serving back-end httpd" during the decade that has passed, I'm not keeping an eye on it. Not sure though, people seem to be focusing more on building clouds of Java VM's, using JavaScript on the server side etc... and the most popular HTTP daemons apparently still have buckets or buffers in kilobyte sizes. And the "file download services" are tackling the issue using automatic Flash-SSD-based cache that caters for the popular downloads, so that the spinning rust does not need to be accessed so intensively.

This debate about storage IO bandwidth reservations feels like the "kernel and sysadmin team" fighting the "user-space apps team". Sounds odd to me. There should be an architect who would know better than play chess with himself on both sides of the checkerboard (on each side frantically trying to beat the opponent).


to post comments


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds