Crash recovery for user-space block drivers
If an in-kernel block driver crashes, it is likely to bring down the entire kernel with it. Putting those drivers into user space can, theoretically, result in a more robust system, since the kernel can now survive a driver crash. With ublk as found in the 6.0 kernel, though, a driver crash will result in the associated devices disappearing and all outstanding I/O requests failing. From a user's point of view, this result may be nearly indistinguishable from a complete crash of the system. As patch author Ziyang Zhang notes in the cover letter, some users might be disappointed by this outcome:
This is not a good choice in practice because users do not expect aborted requests, I/O errors and a released device. They may want a recovery mechanism so that no requests are aborted and no I/O error occurs. Anyway, users just want everything works as usual.
The goal of this patch set is to grant this wish.
A user-space block driver that implements crash recovery should set up its ublk devices with the new UBLK_F_USER_RECOVERY flag. There is also an optional flag, UBLK_F_USER_RECOVERY_REISSUE, that controls how recovery is done; more on that below. After setup, no other changes are needed for normal driver operation.
Should a recovery-capable ublk driver crash, the kernel will stop the associated I/O request queues to prevent the addition of future requests, then wait patiently for a new driver process to come along. That wait can be indefinite; if a driver claims to be able to do recovery, then the kernel will expect it to live up to that claim. There is no notification mechanism for a driver crash; user space is required to notice on its own that the driver has come to an untimely end and start a new one.
That new driver process will connect to the ublk subsystem and issue the START_USER_RECOVERY command. That causes ublk to verify that the old driver is really gone and clean up after it, including dealing with all of the outstanding I/O requests. Any requests that showed up after the crash and were not accepted by the old driver can simply be requeued to the new one. Requests that were accepted may have to be handled a bit more carefully, though, since the kernel does not know if they were actually executed or not.
There are, evidently, some ublk back-ends that cannot properly deal with duplicated writes; such writes must be avoided in that case. That is what the UBLK_F_USER_RECOVERY_REISSUE flag is for; if it is present, all outstanding requests will be reissued. Otherwise, requests that had been picked up by the driver, but for which no completion status had been posted, will fail with an error status. This will happen even with read requests, which one would normally expect to be harmless if repeated.
After starting the recovery process, the new driver should reconnect to each device and issue a new FETCH_REQ command on each to enable the flow of I/O requests. Once all of the devices have been set up, an END_USER_RECOVERY command will restart the request queue and get everything moving again. With luck, users may not even notice that the block driver crashed and was replaced.
The ublk subsystem came out of Red Hat and includes only a simple file-backed driver, essentially replicating the loop driver, as an example. When it was merged, various use cases were mentioned in a vague way, but it was not clear how (or whether) it was being used outside of a demonstration mode. It looked a bit like an interesting solution waiting for a problem.
The appearance of this recovery mechanism from a different company (Alibaba), just a few weeks after ublk was merged, suggests that more advanced use cases exist, and that ublk is, indeed, already in active use. This sort of recovery mechanism tends not to be developed in the absence of some hard experience indicating that it is necessary. Hopefully some of these real-world use cases will come to light — with code — so that the rest of the world can benefit from this work.
Just as usefully, this information might give some clues about where Linux is headed in the coming years. The effort to blur the boundaries between kernel tasks and those handled in user space shows no signs of slowing down; it would not be surprising to see more ublk-like mechanisms in the future. It would be interesting indeed to have an idea of where these changes are taking us — and to be shown that it isn't a world where development moves to proprietary, user-space drivers.
Index entries for this article:
Kernel: Block layer/Block drivers
Kernel: io_uring
Posted Aug 29, 2022 16:27 UTC (Mon)
by developer122 (guest, #152928)
[Link] (4 responses)
GPU drivers are a bit weird, since Mesa is a library and programs call into it directly instead of going down through the kernel and up into the user-space portion of the driver. This system seems like a place where an even stronger argument could be made: the user-space driver is needed to make the kernel's own driver interface work, and therefore an open-source reference implementation should definitely be provided.
Posted Aug 29, 2022 17:02 UTC (Mon)
by iabervon (subscriber, #722)
[Link]
Posted Aug 30, 2022 4:20 UTC (Tue)
by riking (guest, #95706)
[Link] (2 responses)
Posted Aug 30, 2022 4:21 UTC (Tue)
by riking (guest, #95706)
[Link]
Posted Aug 30, 2022 10:05 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Posted Aug 29, 2022 16:43 UTC (Mon)
by ejr (subscriber, #51652)
[Link]
Actually, could you expose a PostgreSQL table as a "block device" over this? That'd be cute although likely silly.
Posted Aug 29, 2022 20:08 UTC (Mon)
by storner (subscriber, #119)
[Link] (4 responses)
I don't want to reignite that discussion, but it feels like Linux is becoming more and more supportive of a hybrid monolithic/microkernel design, where features are implemented in user space with only a minimum of hooks into the "core" kernel to get reasonable performance.
Posted Aug 30, 2022 0:32 UTC (Tue)
by developer122 (guest, #152928)
[Link]
[1] https://www.youtube.com/watch?v=cypmufnPfLw around 37:20 according to the shownotes
Posted Aug 30, 2022 0:45 UTC (Tue)
by dvdeug (guest, #10998)
[Link] (2 responses)
Posted Aug 30, 2022 5:50 UTC (Tue)
by zev (subscriber, #88455)
[Link] (1 responses)
Posted Aug 30, 2022 10:09 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
But monolithic IMPLEMENTATION brings a lot of advantages too, not least speed.
That's why Linux has always been a modular system.
Whatever floats your boat, and if pushing stuff into user space brings advantages (which it clearly does in many cases, not least security), go for it!
Cheers,
Posted Aug 30, 2022 11:08 UTC (Tue)
by bbockelm (subscriber, #71069)
[Link] (3 responses)
https://www.usenix.org/system/files/atc20-li-huiba.pdf
The above paper describes a virtual block device, implemented in user space, and including crash recovery. The overall application is to deliver block-based container images to their platform.
Posted Aug 30, 2022 11:55 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (1 responses)
Posted Aug 30, 2022 12:07 UTC (Tue)
by bbockelm (subscriber, #71069)
[Link]
Posted Aug 30, 2022 12:29 UTC (Tue)
by old-memories (guest, #160155)
[Link]
By the way, ublk is not for container images. It is a generic user-space block driver. Please read the related articles published earlier more patiently, really.
Posted Aug 30, 2022 18:51 UTC (Tue)
by lperkov (guest, #136950)
[Link]
Posted Sep 1, 2022 11:17 UTC (Thu)
by rwmj (subscriber, #5474)
[Link]
Posted Sep 3, 2022 2:12 UTC (Sat)
by dacut (subscriber, #131937)
[Link] (5 responses)
This was before AWS had their NFS solution (Elastic File System/EFS) and EBS volumes were limited in size to 1 TB, so I demoed an absurd idea: a tiny host (t2.micro) with a 1 PB XFS-formatted UBD volume attached. Blocks were sparsely allocated, so only the superblocks were written to S3. Still, it took overnight just to format the filesystem.
It was a fun demo to write over a few weekends, but so slow as to be completely impractical.
Posted Sep 3, 2022 15:54 UTC (Sat)
by rwmj (subscriber, #5474)
[Link] (1 responses)
Anyway, here's our attempt: https://libguestfs.org/nbdkit-S3-plugin.1.html It's written in Python and apparently used somewhat widely (judging by bug reports etc).
Posted Sep 5, 2022 2:08 UTC (Mon)
by dacut (subscriber, #131937)
[Link]
Glad that your engineered solution (vs. my toy weekend project) is getting usage!
Posted Sep 5, 2022 8:23 UTC (Mon)
by cortana (subscriber, #24596)
[Link] (1 responses)
Posted Sep 7, 2022 3:17 UTC (Wed)
by dacut (subscriber, #131937)
[Link]
Thank you!
Posted Sep 10, 2022 14:57 UTC (Sat)
by ming.lei (guest, #74703)
[Link]
https://dspace.cuni.cz/bitstream/handle/20.500.11956/1487...
Posted Sep 5, 2022 15:27 UTC (Mon)
by pturmel (guest, #95781)
[Link] (3 responses)
I often set up custom bridge-vlan-VM network environments for my laboratory of industrial hardware (mostly Allen-Bradley and Omron products). Until recently, it was done with native Linux bridges and native Linux vlans.
Then I discovered that KVM can tap into an OpenVSwitch bridge with a specified vlan tag. Greatly simplified the rats' nest of configurations I've been switching around for various simulations.
But some sibling VMs on the same infrastructure had their network performance noticeably improve. I didn't think to benchmark it all, but I'm very happy with the results.
Bring on more userspace drivers, thank you.
Posted Sep 6, 2022 12:34 UTC (Tue)
by cortana (subscriber, #24596)
[Link] (2 responses)
Posted Sep 6, 2022 12:53 UTC (Tue)
by pturmel (guest, #95781)
[Link] (1 responses)
Posted Sep 8, 2022 10:40 UTC (Thu)
by cortana (subscriber, #24596)
[Link]
Posted Sep 10, 2022 14:51 UTC (Sat)
by ming.lei (guest, #74703)
[Link]
AFAIK, the above seems not true.
IMO, there are lots of jobs which can be done by ublk, such as some generic things:
- re-implement nbd-client with ublk and io_uring, especially applying the recent io_uring/net zero-copy feature
Linux microkernel?
[2] https://github.com/oxidecomputer/twitter-spaces/blob/mast...
there's enough power to let people do things the easy, flexible way or the maximally efficient way if they need that.
Agreed -- I think the general trend is clearly toward increasing flexibility and having the option (for a growing set of things) of either an in-kernel implementation or a userspace one. Though for whatever reason, it seems like whenever a new userspace option pops up the "will Linux become a microkernel?" comments inevitably appear, whereas I can't recall ever seeing the inverse when Linux grows things like (to pick some recent examples) in-kernel TLS support or KSMBD.
Instead, for container-image scenarios, we will try to write a new paper based on the filesystem approach, and hopefully publish it later.
nbdublk uses this to reimplement nbd.ko in userspace. There's a very long thread on our experience: here and continued here.
NBD driver
Had another demo driver for S3
Harder drives
- ublk export for qemu-storage-daemon
- SPDK backend for UBD mod
- Ceph's RBD
- ublk-based zoned target
- ublk-based compression target
...
Some of these are already in progress. Userspace development is easier and more efficient, but it still takes a while, especially since the first (RFC) version of the ublk driver was posted only four months ago.