A new API for mounting filesystems

By Jake Edge
April 2, 2017

LSFMM 2017

The mount() system call tries to do too many things, Miklos Szeredi said at the start of a filesystem-only discussion at LSFMM 2017. He has been interested in cleaning that up for a long time. So he wanted to discuss some ideas he had for a new interface to mount filesystems.

mount() rolls many operations into one call; its arguments are interpreted differently depending on what needs to be done, and it has almost run out of flag bits. It supports regular mounts, bind mounts, remounting, moving a mount, and changing the propagation type for a mount, but those operations are mutually exclusive, so some tasks take more than one call. For example, you cannot create a read-only bind mount in one step; you must first do the bind mount, then remount it read-only. Similarly, you cannot change the propagation parameters while doing a bind mount or changing other mount flags.
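As a minimal sketch of the read-only bind mount case, the following C function uses the existing mount() call on Linux. The function name bind_mount_readonly() is made up for illustration, and the caller needs CAP_SYS_ADMIN for either call to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Create a read-only bind mount of src at dst with the existing mount()
 * call. Returns 0 on success, -1 on failure with errno set.
 * Hypothetical helper, not part of any proposed interface.
 */
int bind_mount_readonly(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    /* Step 1: the bind mount itself. Flags such as MS_RDONLY are
     * ignored in this call, so the mount starts out read-write. */
    if (mount(src, dst, NULL, MS_BIND, NULL) == -1)
        return -1;

    /* Step 2: remount the new bind mount read-only. */
    if (mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) == -1) {
        int saved = errno;

        umount(dst);            /* do not leave a writable mount behind */
        errno = saved;
        return -1;
    }
    return 0;
}
```

Between the two calls the bind mount is briefly writable, which is one reason a single-step operation would be attractive.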

Szeredi has come up with a proposed solution with several new system calls, starting with:

    int fsopen(const char *fstype, unsigned int flags);

It would be used to get a file descriptor to communicate with a filesystem driver and might be called as follows:

    fsfd = fsopen("ext4", 0);

That would provide a connection to the ext4 filesystem driver so that parameters could be set via a protocol.

The talk of a protocol prompted Jeff Layton to ask about using a netlink socket instead. But Al Viro said that a netlink protocol would need to be fully specified right from the start, which would not fit well. Josef Bacik said that he thought netlink would allow adding new attributes and values after the fact. There could be a different protocol specification for each filesystem type, perhaps based on a common set for all filesystems with extensions for specific types. Layton agreed but said the mechanism for the protocol could be determined at a later point.

The protocol Szeredi is envisioning would have a set of configuration commands, each with a NUL-delimited set of parameters. It might look something like:

    SETDEV\0/dev/sda1
    SETOPTS\0seclabel\0data=ordered
    ...

That data would be written to the filesystem file descriptor returned from fsopen().
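To make the framing concrete, here is a rough sketch of how user space might pack one such command into a buffer before writing it. The helper name pack_fs_command() and the choice to omit a trailing NUL are assumptions; the discussion did not settle the exact wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pack a command and its parameters into buf, separated by NUL bytes.
 * For example {"SETOPTS", "seclabel", "data=ordered"} becomes
 * "SETOPTS\0seclabel\0data=ordered" (no trailing NUL; that is an assumption).
 * Returns the number of bytes used, or 0 if buf is too small or there
 * are no fields.
 */
size_t pack_fs_command(char *buf, size_t cap,
                       const char *const *fields, size_t nfields)
{
    size_t used = 0;

    assert(buf != NULL && fields != NULL);
    for (size_t i = 0; i < nfields; i++) {
        size_t len = strlen(fields[i]);
        size_t sep = (i > 0) ? 1 : 0;   /* NUL separator between fields */

        if (len + sep > cap - used)
            return 0;
        if (sep)
            buf[used++] = '\0';
        memcpy(buf + used, fields[i], len);
        used += len;
    }
    return used;
}
```

The returned length would then be passed to write() on the descriptor from fsopen(); it has to be carried separately because strlen() stops at the first embedded NUL.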

Jeff Mahoney asked if there was a need for a system call at all. Perhaps sysfs or the /proc filesystem could be used instead. One attendee pointed out that would mean that some other mechanism would need to be used to mount /proc or /sys. There might also be implications for booting, since those filesystems may not be available early enough to mount the boot partition.

Additional system calls would be needed, Szeredi said, moving back to his proposed interface. Attaching a filesystem to a mount point would be done with mountat(), changes to a mount would be done using mountupdate(), while mountmove() to move a mount and mountclone() to clone one round out the set. There were suggestions that some of those could be combined into one call, mountmove() and mountclone() in particular.

Szeredi said that he would look into using a netlink socket rather than fsopen(). One attendee said that netlink would make the simple case of a straightforward mount more complicated, but Szeredi said that the existing mount() would not be going away.

David Howells wondered if netlink was an optional kernel component; if so, mounting using this new mechanism would be impossible in some cases, another attendee said. But, again, Szeredi said that the existing mount() system call could be used. There was some concern that filesystems will come to depend on these new interfaces, so that using mount() won't work well.

Layton noted that there have been requests for better error messages from mounting operations; often there is not enough detail in the error code returned. Szeredi said that more detailed information could potentially be read from the descriptor returned by fsopen().

Overall, the attendees seemed interested in having a better API for mounting filesystems, but it would seem there is a ways to go before there is something concrete to ponder.

Index entries for this article
Kernel	Filesystems/Mounting
Conference	Storage, Filesystem, and Memory-Management Summit/2017

A new API for mounting filesystems

Posted Apr 2, 2017 21:52 UTC (Sun) by neilbrown (subscriber, #359)

> It might look something like:
>
> SETDEV\0/dev/sda1
> SETOPTS\0seclabel\0data=ordered
> ...

I'm not the world's biggest fan of configfs, but this appears to be duplicating exactly that functionality.

> One attendee pointed out that would mean that some other mechanism would need to be used to mount /proc or /sys.

Clearly a non-issue given:

> Szeredi said that the existing mount() would not be going away.

Both md and dm already have mechanisms for "build some kernel object using these storage devices". While I don't claim either of them represents perfection, it would seem sensible to use existing practice as the starting point for a discussion, rather than creating something totally different.

A new API for mounting filesystems

Posted Apr 3, 2017 3:20 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (16 responses)

If you are designing a protocol from scratch, PLEASE, don't use zeros as field separators.

A new API for mounting filesystems

Posted Apr 3, 2017 4:31 UTC (Mon) by simcop2387 (subscriber, #101710) (13 responses)

Doesn't look like they're using zeros, but are using NUL bytes, 0x00. Which will play weirdly with C strings.

A new API for mounting filesystems

Posted Apr 3, 2017 4:59 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (12 responses)

Yeah, I meant NUL bytes. They're a PITA in many languages, including good old C.

A new API for mounting filesystems

Posted Apr 3, 2017 6:30 UTC (Mon) by Homer512 (subscriber, #85295) (11 responses)

When you want to be able to handle all POSIX-compatible file names, e.g. for loop mounting, you pretty much have to use \0 terminators or come up with escape characters. Given this alternative, I think \0 terminators are preferable, especially for C.

A new API for mounting filesystems

Posted Apr 3, 2017 6:34 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (10 responses)

If you want to stay with pure text then you can do length-tagged pairs, something like:

    12\n
    INSTRUCTION1\n
    25\n
    /a/25/character/file/name\n

It's not terribly difficult to parse in pure C either.
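As a rough illustration of that claim, a parser for one such field might look like the sketch below. It assumes the framing is exactly "decimal length, newline, payload, newline"; the function name parse_field() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Parse one "<decimal length>\n<payload>\n" field from in[0..inlen).
 * On success, stores the payload start and length and returns the number
 * of input bytes consumed; returns 0 on malformed or truncated input.
 * The payload may contain any byte, including '\n' and NUL.
 */
size_t parse_field(const char *in, size_t inlen,
                   const char **payload, size_t *paylen)
{
    size_t pos = 0, len = 0;

    assert(in != NULL && payload != NULL && paylen != NULL);

    /* Read the decimal length prefix. */
    while (pos < inlen && in[pos] >= '0' && in[pos] <= '9') {
        if (len > inlen / 10)           /* would exceed the input anyway */
            return 0;
        len = len * 10 + (size_t)(in[pos] - '0');
        if (len > inlen)
            return 0;
        pos++;
    }

    /* Require at least one digit, followed by the decorative '\n'. */
    if (pos == 0 || pos >= inlen || in[pos] != '\n')
        return 0;
    pos++;

    /* The payload plus its trailing '\n' must fit in what is left. */
    if (len >= inlen - pos || in[pos + len] != '\n')
        return 0;

    *payload = in + pos;
    *paylen = len;
    return pos + len + 1;
}
```

Calling it in a loop, advancing by the return value each time, walks a whole message one field at a time.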

A new API for mounting filesystems

Posted Apr 3, 2017 7:57 UTC (Mon) by dany (guest, #18902) (1 response)

This works unless there is a "\n" in the filename, which is a valid character.

A new API for mounting filesystems

Posted Apr 3, 2017 7:59 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

The \n is not a separator - it's just a decoration to avoid long strings. The length of the key/value is specified explicitly.

A new API for mounting filesystems

Posted Apr 3, 2017 14:29 UTC (Mon) by sorokin (guest, #88478) (6 responses)

~~Shouldn't we stop reinventing the wheel and go directly for XML? (trolling)~~

You mentioned that having '\0' has some downsides, but you did not specify which ones exactly. On the other hand I see a (in my opinion) significant downside of your encoding. With known lengths of the values it is very easy to calculate the required buffer size with zero terminated strings. In your case calculating the buffer size in advance is complicated.

You could argue that we don't care about the performance of the mount system call. I would say that later the same format can be used in other places too. Making a format difficult to generate and to parse should not be justified by the fact that it will rarely be used.

A new API for mounting filesystems

Posted Apr 3, 2017 17:00 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (5 responses)

XML is way too complex (namespaces, entities, ...) but JSON (or one of its binary counterparts like BSON) would indeed be a welcome addition.

> You mentioned that having '\0' has some downsides, but you did not specify which ones exactly.
Several languages make it difficult to work with \0-containing strings, like good old bash or C.

> On the other hand I see a (in my opinion) significant downside of your encoding. With known lengths of the values it is very easy to calculate the required buffer size with zero terminated strings. In your case calculating the buffer size in advance is complicated.
I don't follow. In the case of a \0-terminated array of dynamically-sized strings, calculating the total length would require traversal of the whole array with parsing along the way.

A new API for mounting filesystems

Posted Apr 3, 2017 17:33 UTC (Mon) by Homer512 (subscriber, #85295) (3 responses)

>> On the other hand I see a (in my opinion) significant downside of your encoding. With known lengths of the values it is very easy to calculate the required buffer size with zero terminated strings. In your case calculating the buffer size in advance is complicated.
> I don't follow. In the case of a \0-terminated array of dynamically-sized strings, calculating the total length would require traversal of the whole array with parsing along the way.

I think the point is that with \0-terminated strings, you can precalculate the buffer size by summing strlen() of the parts. With a prefixed string, you have to add ceil(log10(strlen())) + 1 or something like that to account for the length.
The simplest workaround in plain old C is to use snprintf(NULL, 0, format, …) and use the return value to allocate the actual buffer.
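A minimal sketch of that snprintf() trick, applied to a single length-prefixed field, might look like this. The helper name encode_field() is made up, and the "length, newline, value, newline" layout is taken from the example earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Encode value as "<decimal length>\n<value>\n" in a freshly allocated,
 * NUL-terminated buffer. Returns NULL on failure; the caller frees it.
 * Works for C strings only, since "%s" stops at the first NUL byte.
 */
char *encode_field(const char *value)
{
    assert(value != NULL);
    size_t len = strlen(value);

    /* First pass: a NULL buffer of size 0 only reports the needed length. */
    int need = snprintf(NULL, 0, "%zu\n%s\n", len, value);
    if (need < 0)
        return NULL;

    char *buf = malloc((size_t)need + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: format for real into the correctly sized buffer. */
    snprintf(buf, (size_t)need + 1, "%zu\n%s\n", len, value);
    return buf;
}
```

The cost is formatting everything twice, but it avoids computing the number of digits in the length prefix by hand.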

A new API for mounting filesystems

Posted Apr 3, 2017 17:34 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) (2 responses)

With my scheme there are no \0 symbols in the mix. So just get strlen() of the whole thing.

A new API for mounting filesystems

Posted Apr 3, 2017 20:58 UTC (Mon) by khim (subscriber, #9252) (1 response)

It's impossible to call strlen until you allocate a buffer and put all the data into it. And calculating the size of the buffer is not trivial.

A new API for mounting filesystems

Posted Apr 3, 2017 21:02 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

And how is that different from a bag of \0-terminated key-value pairs?

A new API for mounting filesystems

Posted Apr 4, 2017 5:01 UTC (Tue) by jhoblitt (subscriber, #77733)

msgpack is probably the most consistent and performant binary encoded equivalent of JSON. It would certainly be a great thing to start using a standardized data interchange format rather than using a hand rolled format for anything that isn't performance critical (i.e., called 1000+ times per second). However, I wouldn't hold my breath, the kernel community is still holding on to C90.

A new API for mounting filesystems

Posted May 1, 2017 3:11 UTC (Mon) by spigot (subscriber, #50709)

Sounds similar to djb's netstrings proposal.

A new API for mounting filesystems

Posted Apr 3, 2017 6:25 UTC (Mon) by wichert (guest, #7115) (1 response)

Considering a path is allowed to contain newlines, using a NUL byte does make sense. If I remember correctly the only illegal characters in a file name are NUL and the path separator ("/" generally).

A new API for mounting filesystems

Posted Apr 3, 2017 16:02 UTC (Mon) by kh (guest, #19413)

I really wish filenames had more limitations; maybe this would be a good time to revisit that idea.

https://www.dwheeler.com/essays/fixing-unix-linux-filenam...

A new API for mounting filesystems

Posted Apr 3, 2017 13:21 UTC (Mon) by fishface60 (subscriber, #88700) (4 responses)

It might be interesting if the openat family of system calls could be used on an open file system without it being attached to an existing file system tree.

A new API for mounting filesystems

Posted Apr 3, 2017 18:43 UTC (Mon) by droundy (subscriber, #4559) (2 responses)

Indeed, that sounds lovely!

However, it seems like you'd effectively be implementing a second version of mount namespaces, which while possibly elegant doesn't actually sound like a great idea. :(

If I were to redesign linux (which I won't be doing for many good reasons) I'd be highly tempted to remove all the non-"at" system calls, and move the whole concept of the root directory and working directory into user space (as two file descriptors). Things would be simpler and more elegant, and we wouldn't need special system calls like `chdir` or `pivot_root`. But that's only a daydream...

A new API for mounting filesystems

Posted Apr 3, 2017 19:27 UTC (Mon) by josh (subscriber, #17465) (1 response)

> If I were to redesign linux (which I won't be doing for many good reasons) I'd be highly tempted to remove all the non-"at" system calls, and move the whole concept of the root directory and working directory into user space (as two file descriptors). Things would be simpler and more elegant, and we wouldn't need special system calls like `chdir` or `pivot_root`. But that's only a daydream...

Agreed completely for working directory. File-descriptor inheritance seems quite appropriate for that.

For the root directory, that would raise some security implications. Any setuid/setgid/privilege-raising application would need a secure way of running/loading out of its own root, so it can't get fooled by being run within a custom root. Otherwise, you could trivially use that to become root by running a setuid program with a root that contained your own dynamic libraries.

As long as we're talking about fundamental redesigns, I'd also eliminate all "raise privilege when run" mechanisms, and instead require handling that in userspace. Make a user/group privilege something you can have 0 or more of, with the privilege token for root giving you the ability to mint more such tokens. And then handle sudo/su/etc by running a server in userspace that can spawn more privileged processes on behalf of less privileged processes given appropriate controls.

A new API for mounting filesystems

Posted Apr 4, 2017 14:00 UTC (Tue) by jugglerchris (guest, #114208)

That sounds a lot like userv:

https://www.gnu.org/software/userv/

A new API for mounting filesystems

Posted Apr 3, 2017 19:18 UTC (Mon) by josh (subscriber, #17465)

I'd like to see that as well. Anything that took a "dirfd" could use this as the directory.