First of all: Neil, thanks for that excellent article.
As the section on DRBD is considerably shorter than the sections
on DM/MD, I can add a bit for DRBD here.
DRBD used to have an ioctl() based interface (drbd-0.7 releases and
The main issue we stumbled across was that this ioctl interface was not
designed with later extensibility in mind, i.e. we required our users
to update the user land programs together with the kernel module. Our
code enforced this: the module refused to talk to older or newer user
space programs. Since at that time we were an out-of-mainline module,
this was never an issue for our users. The users were used to this, and
it was also expressed in the packages' dependencies.
The other issue we had appeared with kernel and user land running with
different word sizes, i.e. 64 bit kernel and 32 bit user land. While this
is not an issue with ioctls in principle, we got that wrong in the beginning.
As we realised at that point that ioctls are frowned upon in the kernel
community, and we had the plan to go mainline, I decided that we needed
a new interface.
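To illustrate the word-size problem mentioned above, here is a minimal
sketch of one common way such an interface goes wrong, and of the usual
remedy. It is not taken from the DRBD sources; the struct and field names
are hypothetical. A request struct that contains a long or a pointer has
a different size and layout for a 32 bit caller than for a 64 bit kernel,
while a struct built from fixed-width fields with explicit padding does
not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fragile (hypothetical): sizeof(long) and sizeof(char *) are 4 on a
 * 32 bit user land and 8 on a 64 bit kernel, so both sides disagree
 * about the size of the struct and the offsets of its members. */
struct fragile_request {
    int   minor;
    long  size;
    char *name;
};

/* Portable (hypothetical): only fixed-width fields, with the padding
 * written out so the 64 bit members sit at the same offsets everywhere. */
struct portable_request {
    uint32_t minor;
    uint32_t pad;       /* explicit padding, always zero */
    uint64_t size;
    uint64_t name_ptr;  /* user space pointer carried as a 64 bit integer */
};

/* Compile-time checks: the layout is the same for either word size. */
static_assert(sizeof(struct portable_request) == 24,
              "portable_request must be 24 bytes on every ABI");
static_assert(offsetof(struct portable_request, size) == 8,
              "size must start at offset 8 on every ABI");
```

With a layout like this the kernel needs no translation layer for 32 bit
callers, since both sides see the same bytes.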
As genetlink was not yet in the kernel, and bare netlink is not usable
for external modules, we ended up using the connector. (Connector is
just a thin layer on top of netlink.)
So connector seemed to be a good choice, since it is not an ioctl, and
it avoids the catch-22 issue mentioned by Neil. A nice byproduct
is that netlink can also be used to inform userspace about random
events in the kernel.
Unfortunately it turned out that connector has its own issues, and
that Linus is not a friend of the whole netlink idea either.
DRBD's connector interface: The good stuff.
Now let me point out how DRBD's netlink interface is extensible.
The netlink packet is not based on a fixed layout (i.e. a C struct),
but it is a "tag list". Imagine it as a list of labels and values,
each typed (available types are: bit, int32, int64, string/blob)
and each marked as either mandatory or optional. In the implementation
the labels are numbers, with the convention that such a number
will never be re-used.
On the kernel side, such a packet gets processed only if all the
label numbers of the mandatory tags are known, otherwise the
operation is refused, and the user is informed that the kernel
component is too old for the desired operation.
With that scheme it is possible to use older user space tools to
configure newer DRBD drivers. Using newer user space tools on older
kernels works as well, as long as you do not request any feature
not supported by the older DRBD driver.
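The acceptance rule described above can be sketched in a few lines of C.
This is only an illustration under my own assumptions, not DRBD's actual
wire format or code: the struct layout, the TAG_MANDATORY bit, the list
of known labels and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: the top bit of the number marks a tag as
 * mandatory, the remaining bits are the label. */
#define TAG_MANDATORY 0x8000u
#define TAG_LABEL(n)  ((uint16_t)((n) & 0x7fffu))

enum tag_type { TAG_BIT, TAG_INT32, TAG_INT64, TAG_STRING };

struct tag {
    uint16_t      number;  /* label, possibly OR-ed with TAG_MANDATORY */
    enum tag_type type;
    uint16_t      len;     /* length of the value in bytes */
    const void   *value;
};

enum { PARSE_OK = 0, PARSE_KERNEL_TOO_OLD = -1 };

/* Labels understood by this (imaginary) driver version. */
static const uint16_t known_labels[] = { 1, 2, 3 };

static int label_is_known(uint16_t label)
{
    size_t i;

    for (i = 0; i < sizeof(known_labels) / sizeof(known_labels[0]); i++)
        if (known_labels[i] == label)
            return 1;
    return 0;
}

/* Refuse the packet if any mandatory tag carries an unknown label.
 * Unknown optional tags are ignored and counted in *skipped. */
int check_tag_list(const struct tag *tags, size_t n, size_t *skipped)
{
    size_t i;

    assert(tags != NULL || n == 0);
    assert(skipped != NULL);
    *skipped = 0;
    for (i = 0; i < n; i++) {
        if (label_is_known(TAG_LABEL(tags[i].number)))
            continue;
        if (tags[i].number & TAG_MANDATORY)
            return PARSE_KERNEL_TOO_OLD;
        (*skipped)++;
    }
    return PARSE_OK;
}
```

A newer tool that sends an unknown optional tag is still served, while
one that sends an unknown mandatory tag gets the "kernel too old" answer.
Because a label is never re-used, an unknown number can only mean a
feature that the driver does not have yet.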
That proved to be fairly usable.
How we got it wrong again: call_usermodehelper()
At various events we call out to user space.
To give you an example: When we are about to start a resynchronisation
we call such a user helper, because the user may want to create a
snapshot of the block device on the resync target node just before
the resync begins. Of course there is also a user helper that
gets called after the resync has finished, because the user might want
to drop that snapshot automatically.
The mistake was that our user space tools return an error to the
kernel if the particular user space helper is not known. When we
introduced the new before-resync-target user space hook, we broke
installations that updated to the new DRBD driver but kept the old
user space tools, because when the before-resync-target handler
returns an error to the DRBD driver, the driver does not do the resync
but abandons the connection to the peer.
We can avoid that in the future by defining the return code
conventions of further user space helpers more carefully.
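One possible shape for such a convention, as a sketch only: reserve one
exit code that means "this handler is not known to the tools", and let
the driver treat it like success instead of like a veto. The reserved
value 20 and all names below are hypothetical and not what DRBD uses.

```c
#include <assert.h>

/* Hypothetical reserved exit code: the user space tools do not know
 * the requested handler at all (e.g. they are older than the driver). */
#define HELPER_EXIT_UNKNOWN_HANDLER 20

enum helper_verdict {
    HELPER_PROCEED,  /* go on with the operation */
    HELPER_ABORT     /* the handler ran and vetoed the operation */
};

/* Map the exit code of a user space helper to a decision in the driver. */
enum helper_verdict interpret_helper_exit(int exit_code)
{
    /* Process exit codes are in the range 0..255. */
    assert(exit_code >= 0 && exit_code <= 255);

    if (exit_code == 0)
        return HELPER_PROCEED;
    /* Old tools that have never heard of this hook must not be able
     * to block the operation by accident. */
    if (exit_code == HELPER_EXIT_UNKNOWN_HANDLER)
        return HELPER_PROCEED;
    return HELPER_ABORT;
}
```

The point is only that "handler unknown" and "handler failed" need to be
distinguishable; which number is reserved matters much less than fixing
it before the first hook ships.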
New kid on the block: Configfs (or other vfs based approaches)
The idea seems intriguing at first. A new virtual file system, each
subsystem has its sub directory there. Let the user create new
sub directories in there, below the subsystem level. The kernel
side populates them with virtual files (=attributes), which in turn
can be tuned from user space.
When I try to envision how to use this interface for DRBD..
* Missing here is transactional semantics. In the ioctl()/netlink world,
you send a request towards your kernel driver, and your user space
tool gets a response back.
In case two instances of that user space tool get invoked, and they
have to modify lots of attributes, they would step on each other's toes.
* Quite frequently it is necessary to change multiple attributes in one
* Along the same lines is the issue of error reporting. Writing to
an attribute should fail with some errno, if an invalid value
gets written. In DRBD's netlink protocol we have about 150
error codes defined; mapping those to errno codes is not possible
in a sane way. It would just deliver another number space over
the errno channel.
The committable items envisioned in configfs' documentation, which are
currently unimplemented, handle only the issue of setting multiple
attributes at creation time of the object, and the error reporting
has to be done through the errno of the rename system call.
For me the root of the issue is that the interface to the filesystem
was never intended to be a transactional interface.
I see the patchgroups interface presented by the Featherstitch project
as a clean way to add transactional semantics to a filesystem, so it
could also bring sane transactional semantics to the configfs interface.
So, I am suggesting that, to get the configuration interface right, we
bring transactions to the filesystem first. Sounds crazy? Maybe it is.
Maybe transactional semantics for the filesystem are an
important thing we currently miss in Linux!