
Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Over at the Scylla blog, Glauber Costa looks at why a high-performance datastore application might want to do its own I/O scheduling. "If one is using a threaded approach for managing I/O, a thread can be assigned to a different priority group by tools such as ionice. However, ionice only allows us to choose between general concepts like real-time, best-effort and idle. And while Linux will try to preserve fairness among the different actors, that doesn’t allow any fine tuning to take place. Dividing bandwidth among users is a common task in network processing, but it is usually not possible with disk I/O without resorting to infrastructure like cgroups. More importantly, modern designs like the Seastar framework used by Scylla to build its infrastructure may stay away from threads in favor of a thread-per-core design in the search for better scalability. In the light of these considerations, can a userspace application like Scylla somehow guarantee that all actors are served according to the priorities we would want them to obey?"
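As a rough illustration of how coarse those knobs are, here is a minimal sketch (not from the article, and not Scylla code) of the ioprio_set() system call that ionice wraps; three scheduling classes and eight levels are the entire vocabulary, with no way to express per-actor bandwidth shares. The constants are reproduced from the kernel UAPI so the snippet is self-contained.

    // Minimal sketch of the ioprio_set() interface that ionice wraps.
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>

    // From the kernel UAPI (<linux/ioprio.h>); reproduced for self-containment.
    constexpr int IOPRIO_CLASS_SHIFT = 13;
    constexpr int IOPRIO_WHO_PROCESS = 1;
    enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };

    static int ioprio_value(int io_class, int level) {   // level: 0 (highest) .. 7
        return (io_class << IOPRIO_CLASS_SHIFT) | level;
    }

    int main() {
        // Put the calling process in the best-effort class at level 4; this is
        // as fine-grained as the interface gets.
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    ioprio_value(IOPRIO_CLASS_BE, 4)) != 0) {
            perror("ioprio_set");
            return 1;
        }
        std::puts("I/O priority set to best-effort, level 4");
        return 0;
    }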


Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 17, 2016 19:24 UTC (Sun) by jospoortvliet (guest, #33164) [Link] (10 responses)

Doesn't this point to issues in the kernel's I/O scheduler? It is supposed to handle large numbers of requests, is it not?

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 18, 2016 18:39 UTC (Mon) by Wol (subscriber, #4433) [Link] (8 responses)

Well, from following the kernel raid mailing list, it looks like the kernel has a massive legacy clusterfsck to deal with, so it's no wonder I/O can be a real pain ...

Let's say we've got a distributed filesystem, with LVM on top, then RAID on top of that, then finally ext4 or whatever filesystem userspace sees. And I think that's simple compared to some ... (that distributed filesystem means it's split across multiple systems, which may themselves be ext over RAID or somesuch ...)

And isn't scheduling quite the black art anyway? If you have to schedule through multiple layers of indirection, that makes matters worse (a lot worse :-)

That's a major argument for the new file systems like btrfs, zfs etc. While it makes *logical* sense to separate out volume management, raid, the filesystem itself, it makes things much messier in a *practical* sense. Far better, performance-wise, to just give the low-level disk to the high-level filesystem, and let it manage everything. It's the microkernel/monolithic kernel argument all over again, except with filesystems not kernels.

Cheers,
Wol

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 18, 2016 20:46 UTC (Mon) by jospoortvliet (guest, #33164) [Link] (3 responses)

Yeah, when you lay it out like that I see the appeal of btrfs... It's why I installed it: data security without the complexity.

So what you were saying is: the many layers often make the kernel scheduler ineffective. I wonder a bit how the app can then do better by adding another internal abstraction. But... Thanks for the answer :-)

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 6:26 UTC (Tue) by joib (subscriber, #8541) [Link] (2 responses)

Well, they want to do async block I/O, which means they must use the (Linux-specific) io_submit interface, which in turn implies O_DIRECT, page-aligned I/O, and so on. So it's not so much adding yet another layer on top of the kernel as replacing the kernel's machinery with their own I/O scheduler, page cache, etc.
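For readers unfamiliar with that interface, the sketch below shows the shape of the io_submit path the comment describes, with the O_DIRECT alignment requirement made explicit. It is not Scylla's code; it assumes libaio is installed (link with -laio) and that a hypothetical /tmp/testfile of at least 4 KiB already exists.

    // Minimal libaio + O_DIRECT read: one request submitted, one completion reaped.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // for O_DIRECT
    #endif
    #include <libaio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);   // hypothetical test file
        if (fd < 0) { perror("open"); return 1; }

        // O_DIRECT wants buffer, offset and length aligned to the logical block
        // size; 4096 is a safe choice on most devices.
        void *buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

        io_context_t ctx = nullptr;
        if (io_setup(8, &ctx) < 0) { std::fputs("io_setup failed\n", stderr); return 1; }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);                   // read 4 KiB at offset 0

        if (io_submit(ctx, 1, cbs) != 1) { std::fputs("io_submit failed\n", stderr); return 1; }

        struct io_event ev;
        if (io_getevents(ctx, 1, 1, &ev, nullptr) != 1) {
            std::fputs("io_getevents failed\n", stderr);
            return 1;
        }
        std::printf("read completed: %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }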

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 12:01 UTC (Tue) by rnsanchez (guest, #32570) [Link] (1 responses)

You still have the kernel's page cache if you make your own userspace I/O scheduler, which is a tremendous help if the page cache is not the problem you're trying to solve.

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 16:26 UTC (Tue) by mtanski (guest, #56423) [Link]

But without O_DIRECT and io_submit you lose the asynchronous behavior of reads and writes, which many people care about.

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 10:30 UTC (Tue) by nye (subscriber, #51576) [Link] (3 responses)

>That's a major argument for the new file systems like btrfs, zfs etc. While it makes *logical* sense to separate out volume management, raid, the filesystem itself, it makes things much messier in a *practical* sense. Far better, performance-wise, to just give the low-level disk to the high-level filesystem, and let it manage everything.

I don't know about btrfs etc, but that's definitely not how ZFS works. ZFS has clearly separated volume and filesystem layers, and I'd be fairly surprised if it were alone in that.

The difference from my perspective is that the volume management layer is bigger, incorporating the moral equivalent of LVM and RAID (and checksumming, and encryption if you're using Oracle's proprietary ZFS), and that the filesystem layer is not intended to be layered on top of anything else, so can be tailored to fit the layer beneath it.

It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 15:17 UTC (Tue) by Wol (subscriber, #4433) [Link] (2 responses)

> The difference from my perspective is that the volume management layer is bigger, incorporating the moral equivalent of LVM and RAID (and checksumming, and encryption if you're using Oracle's proprietary ZFS), and that the filesystem layer is not intended to be layered on top of anything else, so can be tailored to fit the layer beneath it.

As I understand it (and no way am I an expert in filesystems :-) it comes across that the more the top (the user interface) knows about the bottom (the disk interface), the easier it is to make sensible decisions that don't let performance degrade exponentially as the load rises ...

> It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.

:-)

If ZFS is split into two tightly-coupled components, that sounds in line with my understanding. Could you put a ZFS filesystem layer over a linux lvm layer? Or a linux ext4 over the ZFS volume layer? I guess the answer's "no", so ZFS as a whole has control over the whole stack :-) and hence should achieve far better performance.

Cheers,
Wol

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 16:41 UTC (Tue) by zdzichu (subscriber, #17118) [Link] (1 responses)

You can't put ZPL (ZFS Posix Layer, the filesystem) on top of LVM.
You CAN put ext4 on ZVOLs. Or swap. Or export it via iSCSI.

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 20, 2016 11:07 UTC (Wed) by nye (subscriber, #51576) [Link]

There are even circumstances where it would actually be reasonably sane to use ZFS (the volume layer) for volume management and ext4 as your filesystem layer, aside from obvious cases like exporting the volume to a VM or to an iSCSI client that just does its own thing.

I've seriously considered it for storing my email backups, where more than 50% of the space usage is wasted due to internal fragmentation, an area where ZFS (the filesystem layer) suffers fairly badly compared to ext4. Ultimately, losing low-double-digit gigabytes isn't a pressing concern these days, so I've not bothered, but I can well imagine there are circumstances where ext4 or some other filesystem would be a sensible choice.

Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)

Posted Apr 19, 2016 12:09 UTC (Tue) by rnsanchez (guest, #32570) [Link]

The kernel's I/O scheduler is supposed to be one-size-fits-all. For heavy workloads, it is common to run into conflicts over the metrics (i.e., what to prioritize when the world is collapsing), and the tunables are of little help. Not to mention that it is rather troublesome to cancel async I/O: not impossible, just not fast enough when you put enough pressure on the entire I/O subsystem. Also, it is a good scheduler for throughput, NOT latency.

They know their workload better than the kernel's I/O scheduler, so it makes sense for them to schedule on their own (according to whatever metrics might be critical at a specific point in time), and then submit to the kernel what they really need and when.
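As a toy illustration of that principle (this is not Scylla's or Seastar's actual scheduler, and the class names are made up), the sketch below keeps one queue per priority class with a weight, always dispatches from the backlogged class that has received the least weight-normalized service, and hands the chosen request to whatever submission path (io_submit or otherwise) the application uses; here the "submission" is just a printf.

    // Toy weighted-fair dispatcher: the application decides what to hand to the
    // kernel, and when. Requires C++14 or later.
    #include <deque>
    #include <string>
    #include <vector>
    #include <cstdio>

    struct Request { std::string name; unsigned bytes; };

    struct PriorityClass {
        std::string name;
        unsigned weight;                 // relative share of bandwidth
        double consumed = 0;             // weight-normalized service received so far
        std::deque<Request> queue;
    };

    // Pick the backlogged class that is furthest behind its fair share,
    // pop one request from it, and "submit" it (here: just print it).
    bool dispatch_one(std::vector<PriorityClass> &classes) {
        PriorityClass *best = nullptr;
        for (auto &c : classes)
            if (!c.queue.empty() && (!best || c.consumed < best->consumed))
                best = &c;
        if (!best) return false;

        Request r = best->queue.front();
        best->queue.pop_front();
        best->consumed += double(r.bytes) / best->weight;   // charge the class
        std::printf("submit %-10s (%u bytes) from class '%s'\n",
                    r.name.c_str(), r.bytes, best->name.c_str());
        return true;
    }

    int main() {
        std::vector<PriorityClass> classes = {
            {"commitlog", 4}, {"query", 2}, {"compaction", 1},   // illustrative names
        };
        classes[0].queue = {{"log-1", 4096}, {"log-2", 4096}};
        classes[1].queue = {{"read-1", 65536}, {"read-2", 65536}};
        classes[2].queue = {{"compact-1", 131072}};

        while (dispatch_one(classes)) {}
        return 0;
    }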


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds