Crash recovery for user-space block drivers
If an in-kernel block driver crashes, it is likely to bring down the entire kernel with it. Putting those drivers into user space can, theoretically, result in a more robust system, since the kernel can now survive a driver crash. With ublk as found in the 6.0 kernel, though, a driver crash will result in the associated devices disappearing and all outstanding I/O requests failing. From a user's point of view, this result may be nearly indistinguishable from a complete crash of the system. As patch author Ziyang Zhang notes in the cover letter, some users might be disappointed by this outcome:
This is not a good choice in practice because users do not expect aborted requests, I/O errors and a released device. They may want a recovery mechanism so that no requests are aborted and no I/O error occurs. Anyway, users just want everything works as usual.
The goal of this patch set is to grant this wish.
A user-space block driver that implements crash recovery should set up its ublk devices with the new UBLK_F_USER_RECOVERY flag. There is also an optional flag, UBLK_F_USER_RECOVERY_REISSUE, that controls how recovery is done; more on that below. After setup, no other changes are needed for normal driver operation.
Should a recovery-capable ublk driver crash, the kernel will stop the associated I/O request queues to prevent the addition of future requests, then wait patiently for a new driver process to come along. That wait can be indefinite; if a driver claims to be able to do recovery, then the kernel will expect it to live up to that claim. There is no notification mechanism for a driver crash; user space is required to notice on its own that the driver has come to an untimely end and start a new one.
That new driver process will connect to the ublk subsystem and issue the START_USER_RECOVERY command. That causes ublk to verify that the old driver is really gone and clean up after it, including dealing with all of the outstanding I/O requests. Any requests that showed up after the crash and were not accepted by the old driver can simply be requeued to the new one. Requests that were accepted may have to be handled a bit more carefully, though, since the kernel does not know if they were actually executed or not.
There are, evidently, some ublk back-ends that cannot properly deal with duplicated writes; such writes must be avoided in that case. That is what the UBLK_F_USER_RECOVERY_REISSUE flag is for; if it is present, all outstanding requests will be reissued. Otherwise, requests that had been picked up by the driver, but for which no completion status had been posted, will fail with an error status. This will happen even with read requests, which one would normally expect to be harmless if repeated.
After starting the recovery process, the new driver should reconnect to each device and issue a new FETCH_REQ command on each to enable the flow of I/O requests. Once all of the devices have been set up, an END_USER_RECOVERY command will restart the request queue and get everything moving again. With luck, users may not even notice that the block driver crashed and was replaced.
The ublk subsystem came out of Red Hat and includes only a simple file-backed driver, essentially replicating the loop driver, as an example. When it was merged, various use cases were mentioned in a vague way, but it was not clear how (or whether) it was being used outside of a demonstration mode. It looked a bit like an interesting solution waiting for a problem.
The appearance of this recovery mechanism from a different company (Alibaba), just a few weeks after ublk was merged, suggests that more advanced use cases exist, and that ublk is, indeed, already in active use. This sort of recovery mechanism tends not to be developed in the absence of some hard experience indicating that it is necessary. Hopefully some of these real-world use cases will come to light — with code — so that the rest of the world can benefit from this work.
Just as usefully, this information might give some clues about where Linux is headed in the coming years. The effort to blur the boundaries between kernel tasks and those handled in user space shows no signs of slowing down; it would not be surprising to see more ublk-like mechanisms in the future. It would be interesting indeed to have an idea of where these changes are taking us — and to be shown that it isn't a world where development moves to proprietary, user-space drivers.
Index entries for this article:
Kernel: Block layer/Block drivers
Kernel: io_uring
Posted Aug 29, 2022 16:27 UTC (Mon)
by developer122 (guest, #152928)
[Link] (4 responses)
GPU drivers are a bit weird, since Mesa is a library and programs call into it directly instead of going down through the kernel and up into the user-space portion of the driver. This system seems like a place where an even stronger argument could be made: the user-space driver is needed to make the kernel's own driver interface work, and therefore an open-source reference implementation should definitely be provided.
Posted Aug 29, 2022 17:02 UTC (Mon)
by iabervon (subscriber, #722)
[Link]
Posted Aug 30, 2022 4:20 UTC (Tue)
by riking (guest, #95706)
[Link] (2 responses)
Posted Aug 30, 2022 4:21 UTC (Tue)
by riking (guest, #95706)
[Link]
Posted Aug 30, 2022 10:05 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Aug 29, 2022 16:43 UTC (Mon)
by ejr (subscriber, #51652)
[Link]
Actually, could you expose a PostgreSQL table as a "block device" over this? That'd be cute although likely silly.
Posted Aug 29, 2022 20:08 UTC (Mon)
by storner (subscriber, #119)
[Link] (4 responses)
I don't want to reignite that discussion, but it feels like Linux is becoming more and more supportive of a hybrid monolithic/microkernel design, where features are implemented in user space with only a minimum of hooks into the "core" kernel to get reasonable performance.
Posted Aug 30, 2022 0:32 UTC (Tue)
by developer122 (guest, #152928)
[Link]
[1] https://www.youtube.com/watch?v=cypmufnPfLw around 37:20 according to the shownotes
Posted Aug 30, 2022 0:45 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (2 responses)
Posted Aug 30, 2022 5:50 UTC (Tue)
by zev (subscriber, #88455)
[Link] (1 responses)
Posted Aug 30, 2022 10:09 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
But monolithic IMPLEMENTATION brings a lot of advantages too, not least speed.
That's why Linux has always been a modular system.
Whatever floats your boat, and if pushing stuff into user space brings advantages (which it clearly does in many cases, not least security), go for it!
Cheers,
Posted Aug 30, 2022 11:08 UTC (Tue)
by bbockelm (subscriber, #71069)
[Link] (3 responses)
https://www.usenix.org/system/files/atc20-li-huiba.pdf
The above paper describes a virtual block device, implemented in user space, and including crash recovery. The overall application is to deliver block-based container images to their platform.
Posted Aug 30, 2022 11:55 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (1 responses)
Posted Aug 30, 2022 12:07 UTC (Tue)
by bbockelm (subscriber, #71069)
[Link]
Posted Aug 30, 2022 12:29 UTC (Tue)
by old-memories (guest, #160155)
[Link]
By the way, ublk is not for container images. It is a generic user-space block driver. Please read the related articles published earlier more patiently, really.
Posted Aug 30, 2022 18:51 UTC (Tue)
by lperkov (guest, #136950)
[Link]
Posted Sep 1, 2022 11:17 UTC (Thu)
by rwmj (subscriber, #5474)
[Link]
Posted Sep 3, 2022 2:12 UTC (Sat)
by dacut (subscriber, #131937)
[Link] (5 responses)
This was before AWS had their NFS solution (Elastic File System/EFS) and EBS volumes were limited in size to 1 TB, so I demoed an absurd idea: a tiny host (t2.micro) with a 1 PB XFS-formatted UBD volume attached. Blocks were sparsely allocated, so only the superblocks were written to S3. Still, it took overnight just to format the filesystem.
It was a fun demo to write over a few weekends, but so slow as to be completely impractical.
Posted Sep 3, 2022 15:54 UTC (Sat)
by rwmj (subscriber, #5474)
[Link] (1 responses)
Anyway, here's our attempt: https://libguestfs.org/nbdkit-S3-plugin.1.html It's written in Python and apparently used somewhat widely (judging by bug reports etc).
Posted Sep 5, 2022 2:08 UTC (Mon)
by dacut (subscriber, #131937)
[Link]
Glad that your engineered solution (vs. my toy weekend project) is getting usage!
Posted Sep 5, 2022 8:23 UTC (Mon)
by cortana (subscriber, #24596)
[Link] (1 responses)
Posted Sep 7, 2022 3:17 UTC (Wed)
by dacut (subscriber, #131937)
[Link]
Thank you!
Posted Sep 10, 2022 14:57 UTC (Sat)
by ming.lei (guest, #74703)
[Link]
https://dspace.cuni.cz/bitstream/handle/20.500.11956/1487...
Posted Sep 5, 2022 15:27 UTC (Mon)
by pturmel (guest, #95781)
[Link] (3 responses)
I often set up custom bridge-vlan-VM network environments for my laboratory of industrial hardware (mostly Allen-Bradley and Omron products). Until recently, it was done with native Linux bridges and native Linux vlans.
Then I discovered that KVM can tap into an OpenVSwitch bridge with a specified vlan tag. Greatly simplified the rats' nest of configurations I've been switching around for various simulations.
But some sibling VMs on the same infrastructure had their network performance noticeably improve. I didn't think to benchmark it all, but I'm very happy with the results.
Bring on more userspace drivers, thank you.
Posted Sep 6, 2022 12:34 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
Posted Sep 6, 2022 12:53 UTC (Tue)
by pturmel (guest, #95781)
[Link] (1 responses)
Posted Sep 8, 2022 10:40 UTC (Thu)
by cortana (subscriber, #24596)
[Link]
Posted Sep 10, 2022 14:51 UTC (Sat)
by ming.lei (guest, #74703)
[Link]
AFAIK, the above seems not true.
IMO, there are lots of jobs which can be done by ublk, such as some generic things:
- re-implement nbd-client with ublk and io_uring, especially applying the recent io_uring/net zero-copy feature
Linux microkernel?
[2] https://github.com/oxidecomputer/twitter-spaces/blob/mast...
there's enough power to let people do things the easy, flexible way or the maximally efficient way if they need that.
Agreed -- I think the general trend is clearly toward increasing flexibility and having the option (for a growing set of things) of either an in-kernel implementation or a userspace one. Though for whatever reason, it seems like whenever a new userspace option pops up the "will Linux become a microkernel?" comments inevitably appear, whereas I can't recall ever seeing the inverse when Linux grows things like (to pick some recent examples) in-kernel TLS support or KSMBD.
Instead, for container-image scenarios, we will try to write a new paper based on the filesystem approach, and hopefully publish it later.
nbdublk uses this to reimplement nbd.ko in userspace. There's a very long thread on our experience: here and continued here.
NBD driver
Had another demo driver for S3
Harder drives
- ublk export for qemu-storage-daemon
- SPDK backend for UBD mod
- Ceph's RBD
- ublk-based zoned target
- ublk-based compression target
...
Some of these are already in progress. Userspace development is easier and more efficient, but it still takes a while, especially since the first (RFC) version of the ublk driver was posted only four months ago.