A few words on DRBD and user space - kernel interfaces

Posted Oct 8, 2009 23:36 UTC (Thu) by neilbrown (subscriber, #359)
In reply to: A few words on DRBD and user space - kernel interfaces by philipp
Parent article: Infrastructure unification in the block layer

Hi Philipp,
thanks for the extra historical background on DRBD. It does serve to highlight that getting interfaces "right" really is hard.

I don't believe configfs is a useful answer for anything. I believe the supposed distinction between it and sysfs is an imaginary distinction. It is a bit like the distinction between DM and MD - superficially different but fundamentally the same.

I think 'transaction semantics' are quite achievable in sysfs - I have them for some aspects of MD. The basic idea is that updating some attributes is not immediately effective, but requires a write to a 'commit' attribute.

E.g. I can change the layout, chunksize, and/or the number of devices in a RAID5 by updating any of those attributes, and then writing "reshape" to the 'sync_action' attribute.

This does raise the question of what should be seen when you read from one of these non-immediate attributes. One option is to read both values (old and new). Another is to have two attributes "X" and "X_new" - writing the 'commit' command copies all the X_new to X. I currently prefer the former.
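To make the staged-write idea concrete, here is a minimal userspace sketch in Python. It is only a toy model, not MD's actual implementation: the class name, the attribute names, and the "old (new)" read format are all assumptions chosen for illustration.

```python
class StagedAttributes:
    """Toy model of attributes whose writes only take effect on commit."""

    def __init__(self, **initial):
        self.active = dict(initial)   # values currently in effect
        self.pending = {}             # values written but not yet committed

    def write(self, name, value):
        if name not in self.active:
            raise KeyError(name)
        self.pending[name] = value    # staged only; nothing changes yet

    def read(self, name):
        # "Read both values": the old value, plus the new one if staged.
        # The "old (new)" layout is a hypothetical format.
        old = self.active[name]
        if name in self.pending:
            return f"{old} ({self.pending[name]})"
        return str(old)

    def commit(self):
        # Apply every staged value in one step, as a write to a
        # 'commit' attribute would.
        self.active.update(self.pending)
        self.pending.clear()


raid = StagedAttributes(chunk_size=65536, raid_disks=4)
raid.write("chunk_size", 131072)
raid.write("raid_disks", 5)
before = raid.read("raid_disks")   # shows both old and staged values
raid.commit()
after = raid.read("raid_disks")    # shows only the committed value
```

The "X" and "X_new" alternative would simply expose `active` and `pending` as two separate attributes instead of combining them in one read.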

Your concern about multiple userspace tools being invoked can, I believe, be answered by a userspace solution, probably involving a lockfile in /var/lock or /var/run or similar.
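A lockfile of that kind might look roughly like the following Python sketch, which uses an advisory flock() lock. The helper name and the lock path are hypothetical; a real tool would pick a path under its own name.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def config_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks while another holder exists
        yield fd
    finally:
        os.close(fd)                     # closing the descriptor drops the lock


# Hypothetical lock path, placed in the temp directory so the sketch runs
# without privileges.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-config.lock")

with config_lock(LOCK_PATH):
    held = True  # a tool would perform its multi-step reconfiguration here
```

The lock is advisory, so it only helps if every tool that touches the device takes it.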

Getting notifications to userspace through sysfs is quite easy using 'poll'. Userspace then re-reads the attribute and decides what to do based on the current state.
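As a rough illustration of that poll-then-re-read loop, here is a Python sketch. The function name is made up, and it assumes the attribute signals changes as POLLPRI/POLLERR events, which is how sysfs notifications are normally delivered as far as I know.

```python
import os
import select


def wait_for_change(path, timeout_ms):
    """Wait for an attribute to signal a change, then re-read it.

    Returns (notified, current_value); notified is False on timeout.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.read(fd, 4096)                     # read once before polling
        poller = select.poll()
        poller.register(fd, select.POLLPRI | select.POLLERR)
        events = poller.poll(timeout_ms)      # empty list on timeout
        os.lseek(fd, 0, os.SEEK_SET)          # rewind before re-reading
        value = os.read(fd, 4096).decode().strip()
        return bool(events), value
    finally:
        os.close(fd)
```

A caller would pass the path of whatever attribute it cares about (an MD array's sync_action, say) and act on the returned value rather than on the event itself.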

I had noted that DRBD uses a lot of different error codes and wondered about that - is it really necessary?
- some of them translate directly to 'standard' error codes
- some of them seem to be reporting that a request was illegal, in which case correctly working code should never have made the request, and a simple EINVAL will do.
- some seem to differentiate which part of a request was illegal (CSUMS_ALG vs VERIFY_ALG?? - I'm guessing here). With the proposed sysfs interface, you wouldn't need that differentiation because you could tell which part of the request was in error by the attribute that was being written to at the time.

So I'm not convinced there is really a need for a lot of error codes, particularly when the interface allows (and requires) a separate status report for each attribute changed.
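A small Python sketch of what "a separate status report for each attribute" could look like from userspace follows. The helper is hypothetical; the point is only that the failing attribute is identified by which write failed, so a standard errno is enough.

```python
import errno
import os


def write_attributes(base, values):
    """Write each attribute separately; return {name: errno name} for failures."""
    failures = {}
    for name, value in values.items():
        try:
            with open(os.path.join(base, name), "w") as f:
                f.write(value)
        except OSError as e:
            # The attribute name already says which part of the request
            # was rejected; the code only needs to say why.
            failures[name] = errno.errorcode.get(e.errno, str(e.errno))
    return failures
```

With an interface like this, something such as a bad checksum algorithm and a bad verify algorithm would both come back as EINVAL, distinguished by the attribute name in the result.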


A few words on DRBD and user space - kernel interfaces

Posted Oct 17, 2009 18:39 UTC (Sat) by jageorge (guest, #61413) [Link]

Please not the horror of more non-atomic interfaces to needfully atomic operations just to avoid ioctl(). As someone who has written sysfs scanners which sometimes result in bizarre side effects resulting from abuse of sysfs (including non-atomic setup activities) let's be clear about the problem with ioctl(). #1 It creates an unenforceable binary interface which tends to not work well with enhancements or architecture variations. #2 see #1. (BTW I tend to agree about configfs being another name for the same animal - sysfs). My proposition is using a sysfs handle to accept multiple elements in a single atomic operation. That data can be ascii'fied (or not) and involve name-value pairs or simply field (lf) separated data elements in a known order. Sure it violates the one element per handle rule, but for processing atomic operations all elements _must_ be presented atomically.
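For what it's worth, the single-write, multi-element idea can be sketched in a few lines of Python. This models only the parsing side, with a made-up "name=value" line format and made-up attribute names; the important property is that nothing is applied unless every element is accepted.

```python
def parse_atomic_write(buffer, allowed):
    """Parse LF-separated name=value pairs; reject the whole buffer on any error."""
    staged = {}
    for line in buffer.splitlines():
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep or name not in allowed:
            raise ValueError(f"bad element: {line!r}")
        staged[name] = value
    return staged


def apply_atomically(config, buffer):
    """Update config from one buffer, all elements or none."""
    staged = parse_atomic_write(buffer, allowed=config.keys())
    config.update(staged)   # only reached if every element parsed cleanly


config = {"layout": "left-symmetric", "chunk_size": "65536"}
apply_atomically(config, "layout=right-symmetric\nchunk_size=131072\n")
```

A buffer containing one unknown name raises before `config` is touched, which is the all-or-nothing behaviour the proposal is after.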

I do have one alternative in mind which actually can be considered a fork of your proposal, but the linux infrastructure to implement it is not yet in place. Having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device. This solution would be atomic, non-public, and follow the recommended sysfs element setting process as far as possible... Ultimately a pretty cool solution to the crappy non-atomic or public aggregation problem and perhaps a good long term solution, but one way or the other your solution should have the atomic interface benefits of ioctl() without the binary limitations on portability.

sysfs is dumb

Posted Oct 17, 2009 19:54 UTC (Sat) by quotemstr (subscriber, #45331) [Link]


Another reason people don't like ioctl is that it's not generically scriptable: to use an interface exposed by an ioctl, a C program must be written that understands the appropriate structure definitions. Scripts can then only run these wrapper programs, and I suppose people didn't want to undertake the chore of wrapper writing. At first, sysfs seems to solve that problem, but the necessary filesystem structure is so hairy, and the ordering and atomicity requirements are so arcane, that people end up writing wrappers anyway! (Consider lspci and lsof.)

Serious question: how is sysfs better than sysctl? Both give you hierarchically-organized human readable ASCII-based cross-architecture key-value pairs that can be manipulated by scripts, but because sysctl is a single system call, there's at least a possibility of making atomic changes without disgusting hacks or having to implement a full filesystem transaction layer.

I don't see sysfs's filesystem interfaces as much of an advantage. You can grep sysctl output even more easily than you can grep /sys; and speaking of the name /sys: it's a de-facto standard. Mounting it elsewhere isn't particularly useful except in the chroot case, and with a sysctl interface, you wouldn't have to mount anything at all!

Sure, you might be able to eventually do something Plan9-like and mount /sys and /proc over NFS, but the last mention I can find of anyone actually attempting that is from 1998. It doesn't seem terribly useful, and besides, the security implications scare the bejeesus out of me.

Besides: using sysctl is simpler! You don't have to worry about opening files, closing them, and so on. And the BSD people seem to get along fine without a sysfs, after all.

"Having a private (per process) sysfs (or procfs) directory where any sysfs hierarchy can be created and later pushed into place (mv?) under a "magic" subdirectory entry in sysfs under your device."

This approach won't be particularly popular with people who like to manipulate sysfs with shell scripts.

sysfs is dumb - that depends

Posted Oct 18, 2009 1:37 UTC (Sun) by jageorge (guest, #61413) [Link]

Sysctl under Linux is just a wrapper around /proc... and I'm not saying that the BSD guys got it wrong, but sysfs IS the strategic direction already taken by Linux. However, there are clearly problems with the status quo especially when it comes to atomic operations. Both of my proposals (multi-element sysfs nodes, and process private staging sysfs directory) are compatible with the evolving direction of Linux system resource management from userspace.

The scriptability issue around my private staging tree proposal is easily addressable by using some sort of token (futex/mutex/semaphore) based approach to opening the staging directory instead of a purely PID based approach. Perhaps I'll try a kernel patch to illustrate what I mean ... if I can drum up some interest.

One way or the other private staging of atomic operations (whether ioctl() or some variation on my proposals) is essential for certain operations, and trying to avoid it _will_ result in race conditions many of which have security implications as well... now that I think about it token based private directories would be cool from a temporary directory perspective as well especially if the OS automatically reaped the result after the last token holder exited... so many cool implications... :-)

sysfs is dumb - that depends

Posted Oct 18, 2009 20:03 UTC (Sun) by quotemstr (subscriber, #45331) [Link]

"sysfs IS the strategic direction already taken by Linux"

It does seem that we're stuck with it for now, though it could be deprecated as many other interfaces have been.

So I agree, there's a need for atomic operations on sysfs. Your ideas seem over-engineered to me though. What's wrong with the following scheme? An application would create a temporary directory anywhere it liked. Under this temporary directory, an application would create a sysfs tree corresponding to the nodes to change, and after that, would write the name of the temporary directory to a new special file, /sys/commit. If the commit is successful, the kernel would remove the temporary directory; if there's an error, it would leave the directory in place and return an error from write, or leave an error file in the temporary directory describing what went wrong.

This scheme doesn't require any new system calls or VFS infrastructure, and it's shell-script compatible.
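Since /sys/commit does not exist, the scheme can only be simulated. The Python sketch below plays the kernel's part with an ordinary dictionary standing in for the live attributes; the function name, the attribute names, and the name of the error file are all assumptions.

```python
import os
import shutil
import tempfile


def commit_staging(staging_dir, live):
    """Simulate the proposed commit: apply every staged file or none of them."""
    staged = {}
    for root, _dirs, files in os.walk(staging_dir):
        for name in files:
            full = os.path.join(root, name)
            key = os.path.relpath(full, staging_dir)
            with open(full) as f:
                staged[key] = f.read().strip()
    unknown = sorted(k for k in staged if k not in live)
    if unknown:
        # On error, leave the directory in place with a description.
        with open(os.path.join(staging_dir, "error"), "w") as f:
            f.write("unknown attributes: " + ", ".join(unknown) + "\n")
        return False
    live.update(staged)            # all staged values take effect together
    shutil.rmtree(staging_dir)     # on success the staging tree is removed
    return True


live = {"md/chunk_size": "65536", "md/raid_disks": "4"}
staging = tempfile.mkdtemp()
os.makedirs(os.path.join(staging, "md"))
with open(os.path.join(staging, "md", "chunk_size"), "w") as f:
    f.write("131072\n")
ok = commit_staging(staging, live)
```

From a shell, the staging half is just mkdir and echo, which is what makes the scheme script-friendly.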

sysfs is dumb - that depends

Posted Oct 19, 2009 14:40 UTC (Mon) by jageorge (guest, #61413) [Link]

Your suggestion is essentially where I started, but there appear to be a couple of potential issues. 1. The commit from physical file system to sysfs seemed as if it could be expensive and/or racy. 2. Anything that exists in the normal file system environment is potentially vulnerable from a security/race (even multiple instances of the same monitoring/management software) standpoint.

Nevertheless, I don't want to over-complicate the implementation, and it is possible that there are already security facilities in the kernel which could serve to isolate something as process private. Furthermore, I agree that shell scripting should be relatively simple with any solution to this problem... to some extent that's one of the key ideas behind sysfs. An obvious first step would be to stage something without resolving the private view security question... perhaps even something like staging from a normal physical file system and using mv to flatten the directory structure into a text file which would be fed into a writable sysfs inode.

Basically the problem space is pretty clear (non-trivial atomic operations on IO devices) as is the high level of how to address it (sysfs nodes in the correct context which manage security and race problems). Once someone (possibly me) creates an implementation I expect many of the details to fall into place pretty quickly... and then it's just a matter of getting it past Greg and Al (shudder). The sad thing is that after 6 years of sysfs/udev as a "production" solution no one has done anything other than ducking the problem.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds