From: Ming Lei <ming.lei-AT-redhat.com>
To: Jens Axboe <axboe-AT-kernel.dk>
Subject: [PATCH] Docs: ublk: add ublk document
Date: Sun, 28 Aug 2022 12:50:03 +0800
Message-ID: <20220828045003.537131-1-ming.lei@redhat.com>
Cc: linux-doc-AT-vger.kernel.org, linux-block-AT-vger.kernel.org, Christoph Hellwig <hch-AT-lst.de>, Ming Lei <ming.lei-AT-redhat.com>, Jonathan Corbet <corbet-AT-lwn.net>, "Richard W . M . Jones" <rjones-AT-redhat.com>, ZiyangZhang <ZiyangZhang-AT-linux.alibaba.com>, Stefan Hajnoczi <stefanha-AT-redhat.com>, Xiaoguang Wang <xiaoguang.wang-AT-linux.alibaba.com>

The ublk document was missed when the ublk driver was merged, so add it now.
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Richard W.M. Jones <rjones@redhat.com>
Cc: ZiyangZhang <ZiyangZhang@linux.alibaba.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
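Illustration only, not part of the patch: the per-tag command cycle described
in the "Data plane" section of the document below can be modelled with a small
Python toy. This is a sketch under stated assumptions, not the driver or the
UAPI. The class ToyUblkQueue, the IoDesc fields and every method name are
hypothetical, and the model assumes one UBLK_IO_FETCH_REQ per tag, as the
per-tag wording of UBLK_IO_COMMIT_AND_FETCH_REQ suggests.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IoDesc:
    """Hypothetical stand-in for ublksrv_io_desc; field names are illustrative."""
    op: str
    offset: int
    length: int


class ToyUblkQueue:
    """Toy model of one ublk queue. Not the real driver or UAPI."""

    def __init__(self, depth: int) -> None:
        # Stands in for the mmapped descriptor array, indexed by tag.
        self.descs: List[Optional[IoDesc]] = [None] * depth
        # True while a fetch command is pending in the driver for the tag.
        self.armed: List[bool] = [False] * depth
        # tag -> result of the last completed block request.
        self.completed: Dict[int, int] = {}

    def fetch_req(self, tag: int) -> None:
        """Models UBLK_IO_FETCH_REQ: sent once to set up forwarding for a tag."""
        if self.armed[tag] or self.descs[tag] is not None:
            raise RuntimeError("tag already in use")
        self.armed[tag] = True

    def block_request(self, tag: int, desc: IoDesc) -> IoDesc:
        """A request hits /dev/ublkbN: store its descriptor, complete the
        pending command, and hand the descriptor to the server."""
        if not self.armed[tag]:
            raise RuntimeError("no pending fetch command for this tag")
        self.descs[tag] = desc
        self.armed[tag] = False
        return desc

    def commit_and_fetch_req(self, tag: int, result: int) -> None:
        """Models UBLK_IO_COMMIT_AND_FETCH_REQ: commit the result for the
        in-flight request and re-arm the tag for the next one."""
        if self.descs[tag] is None:
            raise RuntimeError("nothing to commit for this tag")
        self.completed[tag] = result
        self.descs[tag] = None
        self.armed[tag] = True
```

A server loop in this model would call fetch_req() once per tag at startup,
then alternate between receiving a descriptor from block_request() and
answering with commit_and_fetch_req(), which is why no separate fetch is
needed after the first one.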
Documentation/block/ublk.rst | 203 +++++++++++++++++++++++++++++++++++
1 file changed, 203 insertions(+)
create mode 100644 Documentation/block/ublk.rst
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
new file mode 100644
index 000000000000..9e8f7ba518a3
--- /dev/null
+++ b/Documentation/block/ublk.rst
@@ -0,0 +1,203 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Userspace block device driver (ublk driver)
+===========================================
+
+Overview
+========
+
+ublk is a generic framework for implementing block device logic from
+userspace. It makes it easy to move virtual block drivers such as loop,
+nbd and similar drivers into userspace. It also helps in implementing new
+virtual block devices such as ublk-qcow2; there have been several attempts
+to implement a qcow2 driver in the kernel.
+
+The ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO
+request submitted to a ublk device is forwarded to ublk's userspace part
+(ublksrv [1]). After ublksrv handles the IO, the result is committed back
+to the ublk driver, which then completes the ublk IO request. This way,
+all device-specific IO handling logic lives inside ublksrv, and the ublk
+driver does _not_ handle any device-specific IO logic, such as loop's IO
+handling, NBD's IO communication, or qcow2's IO mapping.
+
+/dev/ublkbN is driven by a blk-mq request-based driver. Each request is
+assigned a queue-wide unique tag. ublksrv assigns a unique tag to each
+IO too, which maps 1:1 to the IO of /dev/ublkb*.
+
+Both IO request forwarding and IO result committing are done via
+io_uring passthrough commands, which is why ublk is also an io_uring based
+block driver. It has been observed that io_uring passthrough commands can
+achieve better IOPS than block IO, so ublk is a high-performance
+implementation of a userspace block device. Not only is IO request
+communication done via io_uring, but the preferred IO handling approach
+in ublksrv is io_uring based too.
+
+ublk provides a control interface to set/get ublk block device parameters.
+The interface is extendable and kABI compatible, so basically any ublk
+request queue parameter or ublk generic feature parameter can be set/get
+via this interface. ublk is therefore a generic userspace block device
+framework; for example, it is easy to set up a ublk device with specified
+block parameters from userspace.
+
+How to use ublk
+===============
+
+After building ublksrv [1], ublk block devices (``/dev/ublkb*``) can be added
+and deleted with the utility, and existing block IO applications can then
+talk to them.
+
+See the README [2] of ublksrv for usage details. For example, with ublk-loop:
+
+- add ublk device:
+ ublk add -t loop -f ublk-loop.img
+
+- use it:
+ mkfs.xfs /dev/ublkb0
+ mount /dev/ublkb0 /mnt
+ .... # all IOs are handled by io_uring!!!
+ umount /mnt
+
+- get ublk dev info:
+ ublk list
+
+- delete ublk device:
+ ublk del -a
+ ublk del -n $ublk_dev_id
+
+Design
+======
+
+Control plane
+-------------
+
+The ublk driver provides a global misc device node (``/dev/ublk-control``)
+for managing and controlling ublk devices via several control commands:
+
+- UBLK_CMD_ADD_DEV
+ Add a ublk char device (``/dev/ublkc*``), which ublksrv uses for IO command
+ communication. Basic device info is sent together with this command; see
+ the UAPI structure ublksrv_ctrl_dev_info, which covers nr_hw_queues,
+ queue_depth, and max IO request buffer size. This info is negotiated with
+ the ublk driver and sent back to ublksrv. After this command completes, the
+ basic device info can't be changed any more.
+
+- UBLK_CMD_SET_PARAMS / UBLK_CMD_GET_PARAMS
+ Set or get the ublk device's parameters, which can be generic feature related
+ or request queue limit related, but can't be IO logic specific, because the
+ ublk driver does not handle any IO logic. This command has to be sent before
+ sending UBLK_CMD_START_DEV.
+
+- UBLK_CMD_START_DEV
+ After ublksrv prepares userspace resources, such as creating the per-queue
+ pthread & io_uring for handling ublk IO, this command is sent for the ublk
+ driver to allocate & expose /dev/ublkb*. Parameters set via
+ UBLK_CMD_SET_PARAMS are applied when creating /dev/ublkb*.
+
+- UBLK_CMD_STOP_DEV
+ Quiesce IO on /dev/ublkb* and delete the disk. After this command returns,
+ ublksrv can release resources, such as destroying the per-queue pthread &
+ io_uring for handling IO commands.
+
+- UBLK_CMD_DEL_DEV
+ Delete /dev/ublkc*. After this command returns, the allocated ublk device
+ number can be reused.
+
+- UBLK_CMD_GET_QUEUE_AFFINITY
+ After /dev/ublkc* is added, the ublk driver creates the block layer tagset,
+ so each queue's affinity info is available. ublksrv sends
+ UBLK_CMD_GET_QUEUE_AFFINITY to retrieve the queue affinity info, so that it
+ can set up the per-queue context efficiently, such as binding affine CPUs
+ to the IO pthread and trying to allocate buffers in IO thread context.
+
+- UBLK_CMD_GET_DEV_INFO
+ Retrieve the device info in ublksrv_ctrl_dev_info. It is ublksrv's
+ responsibility to save IO target specific info in userspace.
+
+Data plane
+----------
+
+ublksrv needs to create a per-queue IO pthread & io_uring for handling IO
+commands (io_uring passthrough commands). The per-queue IO pthread
+focuses on IO handling and shouldn't handle any control & management
+tasks.
+
+Each ublksrv IO is assigned a unique tag, which maps 1:1 to the IO
+request of /dev/ublkb*.
+
+The UAPI structure ublksrv_io_desc is defined for describing each IO from
+the ublk driver. A fixed mmapped area (array) on /dev/ublkc* is provided for
+exporting IO info to ublksrv, such as IO offset, length, OP/flags and
+buffer address. Each ublksrv_io_desc instance can be indexed via queue id
+and IO tag directly.
+
+The following IO commands are communicated via io_uring passthrough commands,
+and each command is only for forwarding ublk IO and committing the IO result
+with the specified IO tag in the command data:
+
+- UBLK_IO_FETCH_REQ
+ Sent from the ublksrv IO pthread to fetch future incoming IO requests
+ issued to /dev/ublkb*. This command is sent only once from the ublksrv IO
+ pthread, for the ublk driver to set up the IO forwarding environment.
+
+- UBLK_IO_COMMIT_AND_FETCH_REQ
+ After an IO request is issued to /dev/ublkb*, the ublk driver stores this
+ IO's ublksrv_io_desc to the specified mapped area; then the previously
+ received IO command of this IO tag, either UBLK_IO_FETCH_REQ or
+ UBLK_IO_COMMIT_AND_FETCH_REQ, is completed, so ublksrv gets the IO
+ notification via io_uring.
+
+ After ublksrv handles this IO, its result is committed back to the ublk
+ driver by sending UBLK_IO_COMMIT_AND_FETCH_REQ back. Once ublkdrv receives
+ this command, it parses the IO result and completes the IO request to
+ /dev/ublkb*. Meanwhile it sets up the environment for fetching future IO
+ requests with this IO tag. So UBLK_IO_COMMIT_AND_FETCH_REQ is reused for
+ both fetching requests and committing back IO results.
+
+- UBLK_IO_NEED_GET_DATA
+ ublksrv pre-allocates an IO buffer for each IO by default. Any new project
+ should use this IO buffer to communicate with the ublk driver. But an
+ existing project may not work this way, or may be hard to change to it, so
+ this command gives userspace a chance to use its existing buffer for
+ handling IO.
+
+- data copy between ublksrv IO buffer and ublk block IO request
+ The ublk driver needs to copy the ublk block IO request pages into the
+ ublksrv buffer (pages) first for WRITE before notifying ublksrv of the
+ incoming IO, so ublksrv can handle the WRITE request.
+
+ After ublksrv handles a READ request and sends UBLK_IO_COMMIT_AND_FETCH_REQ
+ to ublkdrv, ublkdrv needs to copy the read ublksrv buffer (pages) to the
+ ublk IO request pages.
+
+Future development
+==================
+
+Container-aware ublk device
+---------------------------
+
+The ublk driver doesn't handle any IO logic. Its function is well defined
+so far, and very limited userspace interfaces are needed, each of which is
+well defined too. It is therefore very likely that the ublk device can be
+made a container-aware block device in future, as Stefan Hajnoczi
+suggested [3], by removing ADMIN privilege.
+
+Zero copy
+---------
+
+Zero copy is a generic requirement for nbd, fuse or similar drivers. A
+problem Xiaoguang mentioned is that pages mapped to userspace can't be
+remapped any more in the kernel with existing mm interfaces. This can
+occur when direct IO is submitted to /dev/ublkb*. Also, Xiaoguang
+reported that big requests (IO size >= 256 KB) may benefit a lot from
+zero copy.
+
+
+References
+==========
+
+[1] https://github.com/ming1/ubdsrv
+
+[2] https://github.com/ming1/ubdsrv/blob/master/README
+
+[3] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stef...
--
2.31.1