A new API for mounting filesystems
The mount() system call tries to do too many things, Miklos Szeredi said at the start of a filesystem-only discussion at LSFMM 2017. He has been interested in cleaning that up for a long time. So he wanted to discuss some ideas he had for a new interface to mount filesystems.
mount() rolls many operations into one call: various fields are used in different ways depending on what needs to be done, and the call has almost run out of flag bits. It supports regular mounts, bind mounts, remounting, moving a mount, and changing the propagation type for a mount, but they are mutually exclusive and some operations require a remount. For example, you cannot create a read-only bind mount; you must first do the bind mount, then remount it read-only. Similarly, you cannot change the propagation parameters while doing a bind mount or changing other mount flags.
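The two-step sequence for a read-only bind mount can be sketched with the existing mount(2) interface. This is only a minimal sketch, assuming Linux and sufficient privileges (typically CAP_SYS_ADMIN); bind_mount_readonly() is a hypothetical helper name, not an existing API.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Create a read-only bind mount of src at dst with the classic mount(2)
 * call. Two calls are needed: the first creates the bind mount, the second
 * remounts it read-only.
 * Returns 0 on success, -1 on failure with errno set by mount(2).
 * (Hypothetical helper for illustration only.)
 */
int bind_mount_readonly(const char *src, const char *dst)
{
    /* Step 1: plain bind mount; filesystem type and data are unused here. */
    if (mount(src, dst, NULL, MS_BIND, NULL) != 0)
        return -1;

    /* Step 2: remount the new bind mount with the read-only flag added. */
    if (mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) != 0) {
        int saved = errno;
        umount(dst);        /* undo step 1 so no writable mount is left behind */
        errno = saved;
        return -1;
    }
    return 0;
}
```

A caller would invoke something like bind_mount_readonly("/srv/data", "/mnt/ro") (both paths are made up) and check errno on failure.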
Szeredi has come up with a proposed solution with several new system calls, starting with:
    int fsopen(const char *fstype, unsigned int flags);
It would be used to get a file descriptor to communicate with a filesystem driver and might be called as follows:
    fsfd = fsopen("ext4", 0);
That would provide a connection to the ext4 filesystem driver so that parameters could be set via a protocol.
![Miklos Szeredi [Miklos Szeredi]](https://static.lwn.net/images/2017/lsfmm-szeredi-sm.jpg)
The talk of a protocol prompted Jeff Layton to ask about using a netlink socket instead. But Al Viro said that a netlink protocol would need to be fully specified right from the start, which would not fit well. Josef Bacik said that he thought netlink would allow adding new attributes and values after the fact. There could be a different protocol specification for each filesystem type, perhaps based on a common set for all filesystems with extensions for specific types. Layton agreed but said the mechanism for the protocol could be determined at a later point.
The protocol Szeredi is envisioning would have a set of configuration commands, each with a NUL-delimited set of parameters. It might look something like:
    SETDEV\0/dev/sda1
    SETOPTS\0seclabel\0data=ordered
    ...
That data would be written to the filesystem file descriptor returned from fsopen().
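As an illustration of how user space might assemble one such command, here is a minimal sketch. The wire format was only outlined in the session, so the details are assumptions: fields are joined by single NUL bytes, no trailing NUL is added, and each command is built on its own. pack_command() is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/*
 * Pack a command name and its parameters into buf, separated by single
 * NUL bytes. No trailing NUL is written (an assumption; the proposal did
 * not specify this).
 * name must be non-empty. Returns the number of bytes used, or 0 if buf
 * is too small.
 */
size_t pack_command(char *buf, size_t bufsize, const char *name,
                    const char *const *params, size_t nparams)
{
    size_t used = strlen(name);

    if (used > bufsize)
        return 0;
    memcpy(buf, name, used);

    for (size_t i = 0; i < nparams; i++) {
        size_t len = strlen(params[i]);

        /* Need one byte for the NUL separator plus the parameter itself. */
        if (len + 1 > bufsize - used)
            return 0;
        buf[used++] = '\0';
        memcpy(buf + used, params[i], len);
        used += len;
    }
    return used;
}
```

Under the proposal as described, the resulting bytes would then be passed to write() on the descriptor returned by fsopen().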
Jeff Mahoney asked if there was a need for a system call at all. Perhaps sysfs or the /proc filesystem could be used instead. One attendee pointed out that would mean that some other mechanism would need to be used to mount /proc or /sys. There might also be implications for booting, since those filesystems may not be available early enough to mount the boot partition.
Additional system calls would be needed, Szeredi said, moving back to his proposed interface. Attaching a filesystem to a mount point would be done with mountat(), changes to a mount would be done using mountupdate(), while mountmove() to move a mount and mountclone() to clone one round out the set. There were suggestions that some of those could be combined into one call, mountmove() and mountclone() in particular.
Szeredi said that he would look into using a netlink socket rather than fsopen(). One attendee said that netlink would make the simple case of a straightforward mount more complicated, but Szeredi said that the existing mount() would not be going away.
David Howells wondered if netlink was an optional kernel component; if so, mounting using this new mechanism would be impossible in some cases, another attendee said. But, again, Szeredi said that the existing mount() system call could be used. There was some concern that filesystems will come to depend on these new interfaces, so that using mount() won't work well.
Layton noted that there have been requests for better error messages from mounting operations; often there is not enough detail in the error code returned. Szeredi said that more detailed information could potentially be read from the descriptor returned by fsopen().
Overall, the attendees seemed interested in having a better API for mounting filesystems, but it would seem there is a ways to go before there is something concrete to ponder.
Index entries for this article | |
---|---|
Kernel | Filesystems/Mounting |
Conference | Storage, Filesystem, and Memory-Management Summit/2017 |
Posted Apr 2, 2017 21:52 UTC (Sun)
by neilbrown (subscriber, #359)
I'm not the world's biggest fan of configfs, but this appears to be duplicating exactly that functionality.
> One attendee pointed out that would mean that some other mechanism would need to be used to mount /proc or /sys.
Clearly a non-issue given:
> Szeredi said that the existing mount() would not be going away.
Both md and dm already have mechanisms for "build some kernel object using these storage devices". While I don't claim either of them represents perfection, it would seem sensible to use existing practice as the starting point for a discussion, rather than creating something totally different.
Posted Apr 3, 2017 3:20 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 3, 2017 4:31 UTC (Mon)
by simcop2387 (subscriber, #101710)
Posted Apr 3, 2017 4:59 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 3, 2017 6:30 UTC (Mon)
by Homer512 (subscriber, #85295)
Posted Apr 3, 2017 6:34 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
12\n
It's not terribly difficult to parse in pure C either.
Posted Apr 3, 2017 7:57 UTC (Mon)
by dany (guest, #18902)
Posted Apr 3, 2017 7:59 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 3, 2017 14:29 UTC (Mon)
by sorokin (guest, #88478)
Posted Apr 3, 2017 17:00 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
> You mentioned that having '\0' has some downsides, but you did not specify which ones exactly.
> On the other hand I see a (in my opinion) significant downside of your encoding. With known lengths of the values it is very easy to calculate the required buffer size with zero terminated strings. In your case calculating the buffer size in advance is complicated.
Posted Apr 3, 2017 17:33 UTC (Mon)
by Homer512 (subscriber, #85295)
I think the point is that with \0-terminated strings, you can precalculate the buffer size by summing strlen() of the parts. With a prefixed string, you have to add ceil(log10(strlen())) + 1 or something like that to account for the length.
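The size arithmetic being discussed can be sketched as follows. It assumes the length-prefixed layout "<decimal length>\n<bytes>\n" that appears in the thread's example; the function names are hypothetical. Counting digits with an integer loop sidesteps floating-point log10() and its edge cases at exact powers of ten.

```c
#include <stddef.h>
#include <string.h>

/* Number of decimal digits needed to print n (1 for n == 0). */
size_t decimal_digits(size_t n)
{
    size_t digits = 1;

    while (n >= 10) {
        n /= 10;
        digits++;
    }
    return digits;
}

/* Bytes needed to store s followed by a single NUL delimiter. */
size_t nul_delimited_size(const char *s)
{
    return strlen(s) + 1;
}

/*
 * Bytes needed to encode s as "<length>\n<bytes>\n" (assumed layout),
 * not counting any terminating NUL for the buffer itself.
 */
size_t length_prefixed_size(const char *s)
{
    size_t len = strlen(s);

    return decimal_digits(len) + 1 + len + 1;
}
```

Either way the caller walks each string once with strlen(); the prefixed form only adds the digit count and one extra delimiter per value.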
Posted Apr 3, 2017 17:34 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 3, 2017 20:58 UTC (Mon)
by khim (subscriber, #9252)
Posted Apr 3, 2017 21:02 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 4, 2017 5:01 UTC (Tue)
by jhoblitt (subscriber, #77733)
Posted May 1, 2017 3:11 UTC (Mon)
by spigot (subscriber, #50709)
Posted Apr 3, 2017 6:25 UTC (Mon)
by wichert (guest, #7115)
Posted Apr 3, 2017 16:02 UTC (Mon)
by kh (guest, #19413)
https://www.dwheeler.com/essays/fixing-unix-linux-filenam...
Posted Apr 3, 2017 13:21 UTC (Mon)
by fishface60 (subscriber, #88700)
Posted Apr 3, 2017 18:43 UTC (Mon)
by droundy (subscriber, #4559)
However, it seems like you'd effectively be implementing a second version of mount namespaces, which while possibly elegant doesn't actually sound like a great idea. :(
If I were to redesign linux (which I won't be doing for many good reasons) I'd be highly tempted to remove all the non-"at" system calls, and move the whole concept of the root directory and working directory into user space (as two file descriptors). Things would be simpler and more elegant, and we wouldn't need special system calls like `chdir` or `pivot_root`. But that's only a daydream...
Posted Apr 3, 2017 19:27 UTC (Mon)
by josh (subscriber, #17465)
Agreed completely for working directory. File-descriptor inheritance seems quite appropriate for that.
For the root directory, that would raise some security implications. Any setuid/setgid/privilege-raising application would need a secure way of running/loading out of its own root, so it can't get fooled by being run within a custom root. Otherwise, you could trivially use that to become root by running a setuid program with a root that contained your own dynamic libraries.
As long as we're talking about fundamental redesigns, I'd also eliminate all "raise privilege when run" mechanisms, and instead require handling that in userspace. Make a user/group privilege something you can have 0 or more of, with the privilege token for root giving you the ability to mint more such tokens. And then handle sudo/su/etc by running a server in userspace that can spawn more privileged processes on behalf of less privileged processes given appropriate controls.
Posted Apr 4, 2017 14:00 UTC (Tue)
by jugglerchris (guest, #114208)
Posted Apr 3, 2017 19:18 UTC (Mon)
by josh (subscriber, #17465)
>
> SETDEV\0/dev/sda1
> SETOPTS\0seclabel\0data=ordered
> ...
INSTRUCTION1\n
23\n
/a/23/character/file/name\n
Shouldn't we stop reinventing the wheel and go directly for XML? (trolling)
You mentioned that having '\0' has some downsides, but you did not specify which ones exactly. On the other hand I see a (in my opinion) significant downside of your encoding. With known lengths of the values it is very easy to calculate the required buffer size with zero terminated strings. In your case calculating the buffer size in advance is complicated.
You can argue that we don't care about the performance of the mount system call. I would say that later the same format can be used in other places too. The fact that a format will rarely be used does not justify making it difficult to generate and to parse.
Several languages make it difficult to work with \0-containing strings, like good old bash or C.
I don't follow. In the case of a \0-terminated array of dynamically-sized strings, calculating the total length would require traversal of the whole array with parsing along the way.
> I don't follow. In the case of a \0-terminated array of dynamically-sized strings, calculating the total length would require traversal of the whole array with parsing along the way.
The simplest workaround in plain old C is to use snprintf(NULL, 0, format, …) and use the return value to allocate the actual buffer.
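A minimal sketch of that snprintf() sizing idiom, applied to the "<decimal length>\n<bytes>\n" layout from earlier in the thread (the layout and the encode_prefixed() name are assumptions for illustration). Note that %s stops at the first NUL, so this only works for values without embedded NUL bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a freshly malloc()ed, NUL-terminated "<length>\n<value>\n"
 * encoding of value, or NULL on failure. The caller must free() it.
 */
char *encode_prefixed(const char *value)
{
    size_t len = strlen(value);

    /*
     * First pass: with a NULL buffer and size 0, snprintf() writes nothing
     * and returns the number of characters the output would need
     * (excluding the terminating NUL).
     */
    int needed = snprintf(NULL, 0, "%zu\n%s\n", len, value);
    if (needed < 0)
        return NULL;

    char *buf = malloc((size_t)needed + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: format for real into the exactly sized buffer. */
    snprintf(buf, (size_t)needed + 1, "%zu\n%s\n", len, value);
    return buf;
}
```

The cost is formatting twice, which is unlikely to matter for something as infrequent as a mount request.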
Sounds similar to djb's netstrings proposal.
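For comparison, a netstring encodes a byte string as its decimal length, a colon, the bytes, and a trailing comma, so "hello world!" becomes "12:hello world!,". The parser below is a minimal sketch of reading one netstring from a buffer; netstring_parse() is a hypothetical name and nothing here was proposed for the mount interface itself.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Parse one netstring ("<decimal length>:<bytes>,") from in[0..inlen).
 * On success, stores a pointer to the payload and its length and returns
 * the total number of input bytes consumed. Returns 0 on malformed or
 * truncated input.
 */
size_t netstring_parse(const char *in, size_t inlen,
                       const char **payload, size_t *payload_len)
{
    size_t i = 0, len = 0;

    /* The length must start with a digit... */
    if (inlen == 0 || in[0] < '0' || in[0] > '9')
        return 0;
    /* ...and may not carry extra leading zeros. */
    if (in[0] == '0' && inlen > 1 && in[1] >= '0' && in[1] <= '9')
        return 0;

    while (i < inlen && in[i] >= '0' && in[i] <= '9') {
        size_t digit = (size_t)(in[i] - '0');

        if (len > (SIZE_MAX - digit) / 10)
            return 0;               /* length would overflow size_t */
        len = len * 10 + digit;
        i++;
    }

    if (i >= inlen || in[i] != ':')
        return 0;
    i++;

    /* Need len payload bytes plus the trailing comma. */
    if (inlen - i <= len)
        return 0;
    if (in[i + len] != ',')
        return 0;

    *payload = in + i;
    *payload_len = len;
    return i + len + 1;
}
```

Because the length is known before the payload is read, the payload may contain any byte, including NUL and newline, without escaping.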