
A few words on DRBD and user space - kernel interfaces

Posted Oct 8, 2009 13:41 UTC (Thu) by philipp (guest, #8960)
Parent article: Infrastructure unification in the block layer

First of all: Neil, thanks for that excellent article.

As the section on DRBD is considerably shorter than the sections
on DM/MD, I can add a bit about DRBD here.

DRBD used to have an ioctl()-based interface (drbd-0.7 releases and
before). The main issue we stumbled across was that this ioctl interface
was not designed with later extensibility in mind, i.e. we required our
users to update the user land programs together with the kernel module.
Our code enforced this: the module refused to talk to older or newer
user space programs. Since at that time we were an out-of-mainline
module, this was never an issue for our users. The users were used to
this, and it was also expressed in the packages' dependencies.

The other issue we had appeared with kernel and user land running with
different word sizes, i.e. a 64-bit kernel and a 32-bit user land. While
this is not a fundamental issue with ioctls, we got that wrong in the
beginning.
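
One common way the word-size problem shows up (not necessarily the exact mistake DRBD made) is a struct whose padding differs between the two ABIs. The sketch below models a hypothetical ioctl argument, `struct { int32_t flags; int64_t size; }`, with Python's `struct` module; the field names are assumptions for illustration.

```python
import struct

# Hypothetical ioctl argument: struct { int32_t flags; int64_t size; }
# 32-bit x86 user space: the 64-bit field is only 4-byte aligned inside
# a struct, so no padding is inserted.
LAYOUT_I386 = struct.Struct("<iq")
# x86-64 kernel: 4 bytes of padding are inserted so that the 64-bit
# field is 8-byte aligned.
LAYOUT_X86_64 = struct.Struct("<i4xq")

# What a 32-bit tool would hand to the kernel.
packed_by_user = LAYOUT_I386.pack(1, 4096)

# If the kernel naively reads its own layout from that buffer (padded
# here so the read does not run short), the size field is misread.
flags, size = LAYOUT_X86_64.unpack(packed_by_user + b"\0" * 4)
```

The two layouts differ in total size as well as in the offset of `size`, which is why such structs need either explicit padding or a compat translation layer.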

As we realised at that point that ioctls are frowned upon in the kernel
community, and we had the plan to go mainline, I decided that we needed
a new interface.

As genetlink was not yet in the kernel, and bare netlink is not usable
for external modules, we ended up using the connector. (Connector is
just a thin layer on top of netlink.)

So connector seemed to be a good choice, since it is not ioctl, and
it avoids the catch-22 issue mentioned by Neil. A nice byproduct
is that netlink can also be used to inform user space about arbitrary
events in the kernel.

Unfortunately it turned out that connector has its own issues, and
that Linus is not a friend of the whole netlink idea either.

DRBD's connector interface: The good stuff.

Now let me point out how DRBD's netlink interface is extensible.
The netlink packet is not based on a fixed layout (i.e. a C struct),
but is a "tag list". Imagine it as a list of labels and values.
Each value is typed (available types are: bit, int32, int64, string/blob),
and each attribute is either mandatory or optional. In the implementation
the labels are numbers, with the convention that such a number
will never be re-used.

On the kernel side, such a packet gets processed only if all the
label numbers of the mandatory tags are known, otherwise the
operation is refused, and the user is informed that the kernel
component is too old for the desired operation.

With that scheme it is possible to use older user space tools to
configure newer DRBD drivers. Using newer user space tools on older
kernels works as well, as long as you do not request any feature
not supported by the older DRBD driver.
That proved to be fairly usable.
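
The tag-list idea described above can be sketched roughly as follows. This is not DRBD's real wire format: the 16-bit tag, the 16-bit length, the use of the top bit as a "mandatory" flag, and the tag names are all assumptions made for illustration.

```python
import struct

# Hypothetical layout: 16-bit tag, 16-bit length, then the value bytes.
# The top bit of the tag marks it as mandatory. None of this is DRBD's
# real wire format; it only illustrates the idea.
MANDATORY = 0x8000
HEADER = struct.Struct("<HH")

# Tags this (imaginary) kernel version understands: number -> name.
KNOWN_TAGS = {1: "disk_size", 2: "on_io_error"}


class UnknownMandatoryTag(Exception):
    """Raised when the receiver is too old for the request."""


def encode(tags):
    """Pack (number, mandatory, value_bytes) triples into one packet."""
    out = bytearray()
    for number, mandatory, value in tags:
        tag = number | (MANDATORY if mandatory else 0)
        out += HEADER.pack(tag, len(value)) + value
    return bytes(out)


def decode(packet, known=KNOWN_TAGS):
    """Return {name: value}, or refuse the whole packet.

    Unknown optional tags are skipped; an unknown mandatory tag aborts
    decoding before anything is returned to the caller.
    """
    result, offset = {}, 0
    while offset < len(packet):
        tag, length = HEADER.unpack_from(packet, offset)
        offset += HEADER.size
        value = packet[offset:offset + length]
        offset += length
        number = tag & ~MANDATORY
        if number in known:
            result[known[number]] = value
        elif tag & MANDATORY:
            raise UnknownMandatoryTag(number)
        # Unknown optional tag: ignore it, so newer tools keep working.
    return result
```

Because tag numbers are never re-used, an old receiver can safely skip any optional tag it does not recognise, and a new receiver can still parse everything an old sender emits.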

How we got it wrong again: call_usermodehelper()

At various events we call out to user space.

To give you an example: when we are about to start a resynchronisation
we call such a user helper, because the user may want to create a
snapshot of the block device on the resync target node just before
the resync begins. Of course there is also a user helper that
gets called after the resync has finished, because the user might want
to drop that snapshot automatically.

The mistake was that our user space tools return an error to the
kernel if the particular user space helper is not known. When we
introduced the new before-resync-target user space hook, we broke
installations that updated to the new DRBD driver but kept the old
user space tools, because when the before-resync-target handler
returns an error to the DRBD driver, the driver does not do the resync
but abandons the connection to the peer.

We can avoid that in the future by more carefully defining the return
code conventions of future user space helpers.
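
One possible convention, sketched below, is that a helper dispatcher treats an event name it has never heard of as success rather than failure. The exit-code values and the dispatcher itself are assumptions for illustration, not DRBD's actual helper code; only the before-resync-target name comes from the text above.

```python
# Hypothetical exit-code convention; the real DRBD helpers may differ.
EXIT_OK = 0
EXIT_HANDLER_FAILED = 1


def snapshot_before_resync():
    """Placeholder for whatever the administrator configured."""
    return EXIT_OK


def broken_handler():
    """Placeholder for a configured handler that really fails."""
    return EXIT_HANDLER_FAILED


HANDLERS = {
    "before-resync-target": snapshot_before_resync,
    "always-fails": broken_handler,  # hypothetical, for demonstration only
}


def run_helper(event):
    """Return the exit code a helper would report for this event."""
    handler = HANDLERS.get(event)
    if handler is None:
        # An event this tool version has never heard of: report success,
        # so a newer driver does not treat "unknown" as "failed".
        return EXIT_OK
    return handler()
```

With this convention a driver that introduces a new hook keeps working with older tools, while a handler that is known and genuinely fails is still reported.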

New kid on the block: Configfs (or other vfs based approaches)

The idea seems intriguing at first. A new virtual file system, each
subsystem has its sub directory there. Let the user create new
sub directories in there, below the subsystem level. The kernel
side populates them with virtual files (=attributes), which in turn
can be tuned from user space.

When I try to envision how to use this interface for DRBD, I see several problems:

* Transactional semantics are missing here. In the ioctl()/netlink world,
you send a request to your kernel driver, and your user space
tool gets a response back.

If two instances of that user space tool get invoked, and they
have to modify lots of attributes, they would step on each other's toes.

* Quite frequently it is necessary to change multiple attributes in one

* Along the same lines is the issue of error reporting. Writing to
an attribute should fail with some errno if an invalid value
gets written. In DRBD's netlink protocol we have about 150
error codes defined; mapping those to errno codes is not possible
in a sane way. It would just deliver another number space over
the errno channel.

Configfs' documentation envisions committable items, which are
currently unimplemented. They handle only the issue of setting multiple
attributes at creation time of the object, and the error reporting
has to be done through the errno of the rename system call.

For me the root of the issue is that the interface to the filesystem
was never intended to be a transactional interface.

I see the patchgroups interface presented by the Featherstitch project as
a clean way to add transactional semantics to a filesystem; that
could also bring sane transactional semantics to the configfs interface.

So, I am suggesting that to get the configuration interface right, we
bring transactions to the filesystem first. Sounds crazy? Maybe it is.
Or maybe transaction semantics for the filesystem are an
important thing we currently miss in Linux!

-Philipp Reisner


A few words on DRBD and user space - kernel interfaces

Posted Oct 8, 2009 23:36 UTC (Thu) by neilbrown (subscriber, #359)

Hi Philipp,
thanks for the extra historical background on DRBD. It does serve to highlight that getting interfaces "right" really is hard.

I don't believe configfs is a useful answer for anything. I believe the supposed distinction between it and sysfs is an imaginary distinction. It is a bit like the distinction between DM and MD - superficially different but fundamentally the same.

I think 'transaction semantics' are quite achievable in sysfs - I have them for some aspects of MD. The basic idea is that updating some attributes is not immediately effective, but requires a write to a 'commit' attribute.

E.g. I can change the layout, chunksize, and/or the number of devices in a RAID5 by updating any of those attributes, and then writing "reshape" to the 'sync_action' attribute.

This does raise the question of what should be seen when you read from one of these non-immediate attributes. One option is to read both values (old and new). Another is to have two attributes "X" and "X_new" - writing the 'commit' command copies all the X_new to X. I currently prefer the former.
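
The staged-then-commit idea can be modelled in a few lines. The sketch below is a user-space toy, not MD's implementation; the class, the attribute names, and the `validate` callback are assumptions for illustration. Reading returns both the old and the staged value, matching the first option described above.

```python
class StagedAttributes:
    """Attributes whose writes take effect only on commit (toy model)."""

    def __init__(self, **initial):
        self.active = dict(initial)   # values currently in effect
        self.pending = {}             # values written but not yet committed

    def write(self, name, value):
        if name not in self.active:
            raise KeyError(name)
        self.pending[name] = value    # not effective yet

    def read(self, name):
        """Return (current, pending); pending is None if nothing is staged."""
        return self.active[name], self.pending.get(name)

    def commit(self, validate):
        """Apply all staged values at once, or none if validation fails."""
        candidate = {**self.active, **self.pending}
        if not validate(candidate):
            return False              # staged values are kept for correction
        self.active, self.pending = candidate, {}
        return True
```

The validation step sees the complete candidate configuration, so constraints that span several attributes can be checked before anything changes.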

Your concern about multiple userspace tools being invoked can, I believe, be answered by a userspace solution, probably involving a lockfile in /var/lock or /var/run or similar.
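
A minimal sketch of such a userspace lock, assuming an advisory `flock()` lock is acceptable and that every cooperating tool uses the same lock file. The function name and the choice of path are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def config_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until other holders release
        yield
    finally:
        os.close(fd)                     # closing the descriptor drops the lock
```

A tool would wrap its whole sequence of attribute writes in `with config_lock(...):`. The lock is only advisory: it protects against other instances of cooperating tools, not against anyone writing to sysfs directly.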

Getting notifications to userspace through sysfs is quite easy using 'poll'. Userspace then re-reads the attribute and decides what to do based on the current state.
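
As a rough sketch of that pattern: sysfs signals a changed attribute through `poll()` as an exceptional condition, after which the file has to be re-read from offset zero. The helper names below are hypothetical, and whether a given attribute actually generates notifications depends on the driver.

```python
import os
import select


def read_attr(fd):
    """Re-read an attribute from the start; sysfs needs the seek (or a reopen)."""
    os.lseek(fd, 0, os.SEEK_SET)
    return os.read(fd, 4096).decode().strip()


def wait_for_change(fd, timeout_ms):
    """Block until the kernel signals a change on fd, or the timeout expires."""
    poller = select.poll()
    # sysfs reports attribute changes as POLLPRI | POLLERR, not POLLIN.
    poller.register(fd, select.POLLPRI | select.POLLERR)
    return bool(poller.poll(timeout_ms))
```

A monitoring loop would open the attribute once, call `read_attr()` to get the initial state, then alternate `wait_for_change()` and `read_attr()`, acting on the current state each time rather than on the event itself.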

I had noted that DRBD uses a lot of different error codes and wondered about that - is it really necessary?
- some of them translate directly to 'standard' error codes
- some of them seem to be reporting that a request was illegal, in which case correctly working code should never have made the request, and a simple EINVAL will do.
- some seem to differentiate which part of a request was illegal (CSUMS_ALG vs VERIFY_ALG?? - I'm guessing here). With the proposed sysfs interface, you wouldn't need that differentiation because you could tell which part of the request was in error by the attribute that was being written to at the time.

So I'm not convinced there is really a need for a lot of error codes, particularly when the interface allows (and requires) a separate status report for each attribute changed.


A few words on DRBD and user space - kernel interfaces

Posted Oct 17, 2009 18:39 UTC (Sat) by jageorge (guest, #61413)

Please, not the horror of more non-atomic interfaces to necessarily atomic operations just to avoid ioctl(). As someone who has written sysfs scanners which sometimes result in bizarre side effects from abuse of sysfs (including non-atomic setup activities), let's be clear about the problem with ioctl(). #1: It creates an unenforceable binary interface which tends not to work well with enhancements or architecture variations. #2: See #1. (BTW, I tend to agree about configfs being another name for the same animal - sysfs.) My proposition is to use a sysfs handle that accepts multiple elements in a single atomic operation. That data can be ASCII-fied (or not) and involve name-value pairs or simply LF-separated data elements in a known order. Sure, it violates the one-element-per-handle rule, but for processing atomic operations all elements _must_ be presented atomically.
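
The receiving side of such a multi-element write could look roughly like the sketch below: every line is parsed and validated first, and only then is anything applied. The `name=value` syntax, the function name, and the validators are assumptions for illustration.

```python
def apply_atomically(current, blob, validators):
    """Parse 'name=value' lines; return a new settings dict or raise ValueError.

    Nothing is applied unless every line parses and validates, and the
    `current` dict is never modified in place.
    """
    staged = {}
    for line in blob.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if not sep or name not in validators:
            raise ValueError(f"bad line: {line!r}")
        if not validators[name](value):
            raise ValueError(f"bad value for {name}: {value!r}")
        staged[name] = value
    return {**current, **staged}
```

Because the whole request arrives in a single write, the error can name the offending element while leaving the previous configuration untouched.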

I do have one alternative in mind which actually can be considered a fork of your proposal, but the Linux infrastructure to implement it is not yet in place: having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device. This solution would be atomic, non-public, and follow the recommended sysfs element setting process as far as possible... Ultimately a pretty cool solution to the crappy non-atomic or public aggregation problem and perhaps a good long-term solution, but one way or the other your solution should have the atomic interface benefits of ioctl() without the binary limitations on portability.

sysfs is dumb

Posted Oct 17, 2009 19:54 UTC (Sat) by quotemstr (subscriber, #45331)


Another reason people don't like ioctl is that it's not generically scriptable: to use an interface exposed by an ioctl, a C program must be written that understands the appropriate structure definitions. Scripts can then only run these wrapper programs, and I suppose people didn't want to undertake the chore of wrapper writing. At first, sysfs seems to solve that problem, but the necessary filesystem structure is so hairy, and the ordering and atomicity requirements are so arcane, that people end up writing wrappers anyway! (Consider lspci and lsof.)

Serious question: how is sysfs better than sysctl? Both give you hierarchically-organized human readable ASCII-based cross-architecture key-value pairs that can be manipulated by scripts, but because sysctl is a single system call, there's at least a possibility of making atomic changes without disgusting hacks or having to implement a full filesystem transaction layer.

I don't see sysfs's filesystem interfaces as much of an advantage. You can grep sysctl output even more easily than you can grep /sys; and speaking of the name /sys: it's a de facto standard. Mounting it elsewhere isn't particularly useful except in the chroot case, and with a sysctl interface, you wouldn't have to mount anything at all!

Sure, you might be able to eventually do something Plan9-like and mount /sys and /proc over NFS, but the last mention I can find of anyone actually attempting that is from 1998. It doesn't seem terribly useful, and besides, the security implications scare the bejeesus out of me.

Besides: using sysctl is simpler! You don't have to worry about opening files, closing them, and so on. And the BSD people seem to get along fine without a sysfs, after all.

"Having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device."

This approach won't be particularly popular with people who like to manipulate sysfs with shell scripts.

sysfs is dumb - that depends

Posted Oct 18, 2009 1:37 UTC (Sun) by jageorge (guest, #61413)

Sysctl under Linux is just a wrapper around /proc... and I'm not saying that the BSD guys got it wrong, but sysfs IS the strategic direction already taken by Linux. However, there are clearly problems with the status quo especially when it comes to atomic operations. Both of my proposals (multi-element sysfs nodes, and process private staging sysfs directory) are compatible with the evolving direction of Linux system resource management from userspace.

The scriptability issue around my private staging tree proposal is easily addressable by using some sort of token-based (futex/mutex/semaphore) approach to opening the staging directory instead of a purely PID-based approach. Perhaps I'll try a kernel patch to illustrate what I mean ... if I can drum up some interest.

One way or the other private staging of atomic operations (whether ioctl() or some variation on my proposals) is essential for certain operations, and trying to avoid it _will_ result in race conditions many of which have security implications as well... now that I think about it token based private directories would be cool from a temporary directory perspective as well especially if the OS automatically reaped the result after the last token holder exited... so many cool implications... :-)

sysfs is dumb - that depends

Posted Oct 18, 2009 20:03 UTC (Sun) by quotemstr (subscriber, #45331)

"sysfs IS the strategic direction already taken by Linux"

It does seem that we're stuck with it for now, though it could be deprecated as many other interfaces have been.

So I agree, there's a need for atomic operations on sysfs. Your ideas seem over-engineered to me though. What's wrong with the following scheme? An application would create a temporary directory anywhere it liked. Under this temporary directory, the application would create a sysfs tree corresponding to the nodes to change, and after that, would write the name of the temporary directory to a new special file, /sys/commit. If the commit is successful, the kernel would remove the temporary directory; if there's an error, it would leave the directory in place and return an error from write, or leave an error file in the temporary directory describing what went wrong.

This scheme doesn't require any new system calls or VFS infrastructure, and it's shell-script compatible.
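
From the application's side, the proposed scheme might look like the sketch below. Note that /sys/commit does not exist; it is the hypothetical special file from the proposal above, and both function names are invented for illustration.

```python
import os


def stage(staging_dir, changes):
    """Mirror the sysfs paths to change as plain files under staging_dir."""
    for rel_path, value in changes.items():
        target = os.path.join(staging_dir, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w") as f:
            f.write(f"{value}\n")


def commit(staging_dir, commit_file="/sys/commit"):
    """Ask the kernel to apply the staged tree (hypothetical interface)."""
    with open(commit_file, "w") as f:
        # Under the proposal, a failed commit would surface as an error
        # from this write, i.e. as an OSError here.
        f.write(staging_dir)
```

The same two steps are easy to express in a shell script with `mkdir -p` and `echo`, which is the point of the proposal.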

sysfs is dumb - that depends

Posted Oct 19, 2009 14:40 UTC (Mon) by jageorge (guest, #61413)

Your suggestion is essentially where I started, but there appear to be a couple of potential issues. 1. The commit from physical file system to sysfs seemed as if it could be expensive and/or racy. 2. Anything that exists in the normal file system environment is potentially vulnerable from a security/race standpoint (even to multiple instances of the same monitoring/management software).

Nevertheless, I don't want to over-complicate the implementation, and it is possible that there are already security facilities in the kernel which could serve to isolate something as process private. Furthermore, I agree that shell scripting should be relatively simple with any solution to this problem... to some extent that's one of the key ideas behind sysfs. An obvious first step would be to stage something without resolving the private view security question... perhaps even something like staging from a normal physical file system and using mv to flatten the directory structure into a text file which would be fed into a writable sysfs inode.

Basically the problem space is pretty clear (non-trivial atomic operations on IO devices) as is the high level of how to address it (sysfs nodes in the correct context which manage security and race problems). Once someone (possibly me) creates an implementation I expect many of the details to fall into place pretty quickly... and then it's just a matter of getting it past Greg and Al (shudder). The sad thing is that after 6 years of sysfs/udev as a "production" solution no one has done anything other than ducking the problem.

A few words on DRBD and user space - kernel interfaces

Posted Oct 20, 2009 19:18 UTC (Tue) by valyala (guest, #41196)

Why not use protocol buffers for all complex extensible APIs between kernel and userspace? .proto files could be shipped together with kernel headers, so userspace programs could uniformly use them for talking to the kernel. strace-like programs could dynamically decode proto messages using the corresponding .proto files.

Here is a list of protobuf features.
- protobuf messages are designed to be extensible and backwards-compatible;
- protobuf encoding is architecture-independent;
- protobuf encoding is space-efficient;
- protobuf encoding is quite simple, so it is easy to write CPU-efficient codecs with a small footprint in an arbitrary language;
- protobuf messages can contain other protobuf messages;
- .proto files can include other .proto files;
- encoded protobuf messages can be easily stored to files (space-efficient binary logs), which then can be easily decoded into human-readable text by universal decoders using corresponding .proto definitions;
- according to the protobuf documentation, "Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats" ;)
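
To give a feel for how simple the encoding is, here is a sketch of the base-128 "varint" that protobuf uses for integers and for field headers. It covers only non-negative integers and is not a complete protobuf codec.

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if n < 0:
        raise ValueError("only non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F               # low 7 bits go out first
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode one varint from data; return (value, next_offset)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, offset
        shift += 7
```

Small values take a single byte, and the format does not depend on the word size or byte order of either side.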

A few words on DRBD and user space - kernel interfaces

Posted Oct 24, 2009 16:59 UTC (Sat) by jengelh (subscriber, #33263)

I looked at protobufs about a year ago, and it seems like libnl is doing almost the same (minus (un)serialization).
