Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)
Posted Apr 17, 2016 19:24 UTC (Sun) by jospoortvliet (guest, #33164)
Parent article: Costa: Designing a Userspace Disk I/O Scheduler for Modern Datastores: the Scylla example (Part 1)
Posted Apr 18, 2016 18:39 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (8 responses)
Let's say we've got a distributed file system, with lvm on top, then raid on top of that, then finally the ext or whatever file system that userspace sees. And I think that's simple compared to some ... (that distributed filesystem means it's split across multiple systems, which may themselves be ext over raid or somesuch ...)
And isn't scheduling quite the black art anyway? If you have to schedule through multiple layers of indirection, that makes matters worse (a lot worse :-)
That's a major argument for the new filesystems like btrfs, zfs etc. While it makes *logical* sense to separate out volume management, raid, and the filesystem itself, it makes things much messier in a *practical* sense. Far better, performance-wise, to just give the low-level disk to the high-level filesystem and let it manage everything. It's the microkernel/monolithic kernel argument all over again, except with filesystems, not kernels.
Cheers,
Wol
Posted Apr 18, 2016 20:46 UTC (Mon)
by jospoortvliet (guest, #33164)
[Link] (3 responses)
So what you were saying is: the many layers often make the kernel scheduler ineffective. I wonder a bit how the app can then do better by adding another internal abstraction. But... Thanks for the answer :-)
Posted Apr 19, 2016 6:26 UTC (Tue)
by joib (subscriber, #8541)
[Link] (2 responses)
Posted Apr 19, 2016 12:01 UTC (Tue)
by rnsanchez (guest, #32570)
[Link] (1 responses)
Posted Apr 19, 2016 16:26 UTC (Tue)
by mtanski (guest, #56423)
[Link]
Posted Apr 19, 2016 10:30 UTC (Tue)
by nye (subscriber, #51576)
[Link] (3 responses)
I don't know about btrfs etc, but that's definitely not how ZFS works. ZFS has clearly separated volume and filesystem layers, and I'd be fairly surprised if it were alone in that.
The difference from my perspective is that the volume management layer is bigger, incorporating the moral equivalent of LVM and RAID (and checksumming, and encryption if you're using Oracle's proprietary ZFS), and that the filesystem layer is not intended to be layered on top of anything else, so can be tailored to fit the layer beneath it.
It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.
Posted Apr 19, 2016 15:17 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (2 responses)
As I understand it (and no way am I an expert in filesystems :-) the impression I get is that the more the top layer (the user interface) knows about the bottom layer (the disk interface), the easier it is to make sensible decisions that avoid performance degrading exponentially as the load rises ...
> It's not that a layered stack is the problem per se, but more that allowing arbitrary selections of layers chosen from a generous smorgasbord and combined in arbitrary orders dramatically expands the scope of the problem.
:-)
If ZFS is split into two tightly-coupled components, that sounds in line with my understanding. Could you put a ZFS filesystem layer over a linux lvm layer? Or a linux ext4 over the ZFS volume layer? I guess the answer's "no", so ZFS as a whole has control over the whole stack :-) and hence should achieve far better performance.
Cheers,
Wol
Posted Apr 19, 2016 16:41 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
You CAN put ext4 on ZVOLs. Or swap. Or export it via iSCSI.
Posted Apr 20, 2016 11:07 UTC (Wed)
by nye (subscriber, #51576)
[Link]
I've seriously considered it for storing my email backups where more than 50% of the space usage is waste due to internal fragmentation, where ZFS (the filesystem layer) suffers fairly badly compared to ext4. Ultimately, losing low double digit gigabytes isn't a pressing concern these days so I've not bothered, but I can well imagine there would be circumstances where ext4 or some other filesystem would be a sensible choice.
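For a rough sense of the arithmetic behind that kind of waste, here is a back-of-the-envelope sketch of internal fragmentation; the 8 KiB allocation unit and 5 KiB message size are made-up numbers for illustration, not a claim about how ZFS (or ext4) actually allocates:

    #include <cstdio>

    int main() {
        const long alloc_unit = 8 * 1024;   // hypothetical minimum allocation unit (bytes)
        const long file_size  = 5 * 1024;   // hypothetical small message (bytes)

        // Every file is rounded up to a whole number of allocation units.
        const long allocated = ((file_size + alloc_unit - 1) / alloc_unit) * alloc_unit;
        const long wasted    = allocated - file_size;

        std::printf("%ld bytes allocated for a %ld-byte file: %.1f%% waste\n",
                    allocated, file_size, 100.0 * wasted / allocated);
        // With these made-up numbers: 8192 bytes for 5120 -> 37.5% waste; a larger
        // allocation unit with mostly-small files can easily push past 50%.
    }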
Posted Apr 19, 2016 12:09 UTC (Tue)
by rnsanchez (guest, #32570)
[Link]
They know their workload better than the kernel's I/O scheduler, so it makes sense for them to schedule on their own (according to whatever metrics might be critical at a specific point in time), and then submit to the kernel what they really need and when.
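To make that concrete, here is a deliberately tiny sketch of the idea: the application queues requests with its own notion of priority, and a userspace scheduler decides what actually reaches the kernel, and when. The class names and priority values are invented for illustration; Scylla's real scheduler (built on the Seastar framework) is far more sophisticated and fully asynchronous.

    #include <cstddef>
    #include <queue>
    #include <string>
    #include <fcntl.h>
    #include <unistd.h>

    struct IoRequest {
        int priority;          // higher value = more urgent (e.g. commitlog vs. compaction)
        int fd;                // target file descriptor
        off_t offset;          // offset within the file
        std::string data;      // payload to write

        bool operator<(const IoRequest& other) const {
            return priority < other.priority;   // std::priority_queue is a max-heap
        }
    };

    class ToyScheduler {
        std::priority_queue<IoRequest> pending_;
    public:
        // The application queues work here instead of issuing syscalls directly...
        void submit(IoRequest req) { pending_.push(std::move(req)); }

        // ...and the scheduler decides what reaches the kernel, and when.
        void drain(std::size_t max_to_issue) {
            for (std::size_t issued = 0; issued < max_to_issue && !pending_.empty(); ++issued) {
                const IoRequest& req = pending_.top();
                // A real implementation would submit asynchronously (libaio, io_uring);
                // a blocking pwrite() keeps the sketch short.
                if (::pwrite(req.fd, req.data.data(), req.data.size(), req.offset) < 0)
                    break;   // real code would retry or report the error
                pending_.pop();
            }
        }
    };

    int main() {
        int fd = ::open("toy.dat", O_WRONLY | O_CREAT, 0644);
        if (fd < 0) return 1;
        ToyScheduler sched;
        sched.submit({1, fd, 0,    "bulk background write"});
        sched.submit({9, fd, 4096, "latency-sensitive write"});
        sched.drain(2);   // the priority-9 request is issued first
        ::close(fd);
    }

In the example, the latency-sensitive write is issued ahead of the bulk write even though it was queued later, which is exactly the kind of decision the kernel's scheduler cannot make on the application's behalf.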