Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)
If one is using a threaded approach for managing I/O, a thread can be assigned to a different priority group by tools such as ionice. However, ionice only allows us to choose between general concepts like real-time, best-effort and idle. And while Linux will try to preserve fairness among the different actors, that doesn’t allow any fine tuning to take place. Dividing bandwidth among users is a common task in network processing, but it is usually not possible with disk I/O without resorting to infrastructure like cgroups. More importantly, modern designs like the Seastar framework used by Scylla to build its infrastructure may stay away from threads in favor of a thread-per-core design in the search for better scalability. In the light of these considerations, can a userspace application like Scylla somehow guarantee that all actors are served according to the priorities we would want them to obey?
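For context on the ionice mention in the quote above: ionice is essentially a front end to the ioprio_set() system call, so a thread can place itself in one of those coarse classes directly. Here is a minimal sketch of that (assuming Linux; the constants are copied by hand from the kernel's ioprio ABI rather than pulled in from linux/ioprio.h, and this is an illustration, not code from the article):

#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>

// Values mirror the kernel's ioprio ABI (see linux/ioprio.h).
static constexpr int IOPRIO_CLASS_SHIFT = 13;
static constexpr int IOPRIO_CLASS_BE    = 2;  // the "best-effort" class
static constexpr int IOPRIO_WHO_PROCESS = 1;  // target a single thread/process

static int ioprio_value(int cls, int level) {
    return (cls << IOPRIO_CLASS_SHIFT) | level;
}

int main() {
    // Put the calling thread into the best-effort class at the lowest level (7).
    // This is the coarse knob the article argues is too blunt for dividing
    // bandwidth among many actors inside a single process.
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* calling thread */,
                ioprio_value(IOPRIO_CLASS_BE, 7)) != 0) {
        perror("ioprio_set");
        return 1;
    }
    std::puts("I/O priority set to best-effort, level 7");
    return 0;
}

Everything here operates at thread or process granularity, which is exactly the limitation the quoted paragraph is pointing at.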
Posted Apr 17, 2016 19:24 UTC (Sun)
by jospoortvliet (guest, #33164)
[Link] (10 responses)
Posted Apr 18, 2016 18:39 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (8 responses)
Let's say we've got a distributed file system, with lvm on top, then raid on top of that, then finally the ext or whatever file system that userspace sees. And I think that's simple compared to some ... (that distributed filesystem means it's split across multiple systems, which may themselves be ext over raid or somesuch ...)
And isn't scheduling quite the black art anyway? If you have to schedule through multiple layers of indirections, that makes matters worse (a lot worse :-)
That's a major argument for the new file systems like btrfs, zfs etc. While it makes *logical* sense to separate out volume management, raid, the filesystem itself, it makes things much messier in a *practical* sense. Far better, performance-wise, to just give the low-level disk to the high-level filesystem, and let it manage everything. It's the microkernel/monolithic kernel argument all over again, except with filesystems not kernels.
Cheers,
Wol
Posted Apr 18, 2016 20:46 UTC (Mon)
by jospoortvliet (guest, #33164)
[Link] (3 responses)
So what you were saying is: the many layers often make the kernel scheduler ineffective. I wonder a bit how the app can then do better by adding another internal abstraction. But... Thanks for the answer :-)
Posted Apr 19, 2016 6:26 UTC (Tue)
by joib (subscriber, #8541)
[Link] (2 responses)
Posted Apr 19, 2016 12:01 UTC (Tue)
by rnsanchez (guest, #32570)
[Link] (1 response)
Posted Apr 19, 2016 16:26 UTC (Tue)
by mtanski (guest, #56423)
[Link]
Posted Apr 19, 2016 10:30 UTC (Tue)
by nye (subscriber, #51576)
[Link] (3 responses)
I don't know about btrfs etc, but that's definitely not how ZFS works. ZFS has clearly separated volume and filesystem layers, and I'd be fairly surprised if it were alone in that.
The difference from my perspective is that the volume management layer is bigger, incorporating the moral equivalent of LVM and RAID (and checksumming, and encryption if you're using Oracle's proprietary ZFS), and that the filesystem layer is not intended to be layered on top of anything else, so can be tailored to fit the layer beneath it.
It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.
Posted Apr 19, 2016 15:17 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (2 responses)
As I understand it (and no way am I an expert in filesystems :-) it comes across that the more the top (the user interface) knows about the bottom (the disk interface), the easier it is to make sensible decisions whose performance doesn't degrade exponentially as the load rises ...
> It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.
:-)
If ZFS is split into two tightly-coupled components, that sounds in line with my understanding. Could you put a ZFS filesystem layer over a linux lvm layer? Or a linux ext4 over the ZFS volume layer? I guess the answer's "no", so ZFS as a whole has control over the whole stack :-) and hence should achieve far better performance.
Cheers,
Wol
Posted Apr 19, 2016 16:41 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (1 response)
You CAN put ext4 on ZVOLs. Or swap. Or export it via iSCSI.
Posted Apr 20, 2016 11:07 UTC (Wed)
by nye (subscriber, #51576)
[Link]
I've seriously considered it for storing my email backups where more than 50% of the space usage is waste due to internal fragmentation, where ZFS (the filesystem layer) suffers fairly badly compared to ext4. Ultimately, losing low double digit gigabytes isn't a pressing concern these days so I've not bothered, but I can well imagine there would be circumstances where ext4 or some other filesystem would be a sensible choice.
Posted Apr 19, 2016 12:09 UTC (Tue)
by rnsanchez (guest, #32570)
[Link]
They know their workload better than the kernel's I/O scheduler, so it makes sense for them to schedule on their own (according to whatever metrics might be critical at a specific point in time), and then submit to the kernel what they really need and when.
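To make that idea concrete, here is a toy sketch (nothing to do with Seastar's or Scylla's actual implementation; the class names and shares are invented): keep per-class queues in userspace, pick the next request according to each class's relative share, and only then hand it to the kernel.

#include <cstddef>
#include <cstdio>
#include <functional>
#include <queue>
#include <string>
#include <vector>

struct Request {
    std::string description;
    std::function<void()> submit;   // would wrap the actual I/O submission
};

struct IoClass {
    std::string name;
    unsigned shares;                // relative weight chosen by the application
    unsigned served = 0;            // requests dispatched so far
    std::queue<Request> pending;
};

class UserspaceScheduler {
public:
    explicit UserspaceScheduler(std::vector<IoClass> classes)
        : classes_(std::move(classes)) {}

    void enqueue(std::size_t cls, Request r) {
        classes_[cls].pending.push(std::move(r));
    }

    // Dispatch the queued request whose class is furthest below its fair
    // share, i.e. the class with the smallest served/shares ratio.
    bool dispatch_one() {
        IoClass* best = nullptr;
        for (auto& c : classes_) {
            if (c.pending.empty())
                continue;
            if (best == nullptr ||
                c.served * best->shares < best->served * c.shares)
                best = &c;
        }
        if (best == nullptr)
            return false;           // nothing left to do
        Request r = std::move(best->pending.front());
        best->pending.pop();
        ++best->served;
        std::printf("dispatching [%s] %s\n",
                    best->name.c_str(), r.description.c_str());
        r.submit();                 // only now does the kernel see the request
        return true;
    }

private:
    std::vector<IoClass> classes_;
};

int main() {
    // Two made-up classes: commit-log writes get three times the share of
    // background compaction reads.
    UserspaceScheduler sched({{"commitlog", 3}, {"compaction", 1}});
    sched.enqueue(0, {"write segment", [] {}});
    sched.enqueue(0, {"write segment", [] {}});
    sched.enqueue(1, {"read sstable chunk", [] {}});
    while (sched.dispatch_one()) {
    }
    return 0;
}

The essential point is the final step of dispatch_one(): the kernel only ever sees requests that the application has already ordered according to its own priorities.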