|
|
Log in / Subscribe / Register

Six (or seven) new system calls for filesystem mounting

By Jonathan Corbet
July 12, 2018
Mounting filesystems is a complicated business. The kernel supports a wide variety of filesystem types, and each has its own, often extensive set of options. As a result, the mount() system call is complex, and the list of mount options is a rather long read. But even with all of that complexity, mount() does not do everything that users would like. For example, the options for a mount operation must all fit within a single 4096-byte page — the fact that this is a problem for some users is illustrative in its own right. The problems with mount() have come up at various meetings, including at the 2018 Linux Storage, Filesystem, and Memory-Management Summit. A set of patches implementing a new approach is getting closer to being ready, but it features some complexity of its own and there are some remaining concerns about the proposed system-call API.

This patch set, from David Howells, is in its ninth revision. It makes extensive changes within the virtual filesystem layer to create the concept of a "filesystem context" that describes a specific mount operation. The questions about the internal changes have mostly been resolved at this point; things seem about ready to go in at that level. But the patch set also replaces the mount() system call with a rather more complex set of operations. (To be precise, mount() would not go away as long as it is needed, but it is unlikely to gain new functionality after the new system calls go in.)

The new way of mounting

In current kernels, a single mount() call does everything required to mount a filesystem at a specific location in the system hierarchy. With these patches applied, instead, the process would begin with a call to the new fsopen() system call:

    int fsopen(const char *fsname, unsigned int flags);

The fsname parameter identifies the type of the filesystem to be mounted — ext4 or nfs, for example — while flags is either zero or FSOPEN_CLOEXEC. This call doesn't mount any filesystems, it just creates the context in which the mount operation can be described and carried out. The return value is a file descriptor representing that context.

The next step is to provide the details for the mount to be performed; this is done by writing a series of strings to that file descriptor. The first character of the string is either "s" (to specify the source filesystem), "o" (to provide a mount option), or "x" (to execute a command). So a reasonable series of writes could be:

    s /dev/sda1
    o noatime
    x create

Note that these strings are not terminated by newlines; each write() call is supposed to convey exactly one of these strings. In this case, the strings written say that the filesystem found on /dev/sda1 should be mounted with the noatime option. The final line (with the create command) brings the filesystem context into fully formed existence, but does not actually mount it anywhere. There is also a reconfigure command that can be used to change the settings in an existing context.

Things can go wrong at any step, in which case the write() call will return an error. More detailed information about the problem can be had by reading from the file descriptor. This feature addresses one of the other problems with mount(): the inability to communicate the details of a problem to user space.

Assuming all goes well, the next step is to mount the filesystem with a call to:

    int fsmount(int fd, unsigned int flags, unsigned int ms_flags);

The filesystem-context file descriptor created by fsopen() is passed as fd to fsmount(). Once again, the only flag for flags is FSMOUNT_CLOEXEC, while ms_flags describe how the mount is to be performed. They can be used to create an unbindable or slave mount, for example (see this article for details on mount types). Some of those flags, though, duplicate options like noatime or read-only.

fsmount() returns another file descriptor corresponding to the newly mounted filesystem. Do note, though, that while the filesystem is "mounted", it has not been mounted at any specific location in the filesystem tree, so it will not be visible to users. Actually placing the filesystem into a mount namespace requires yet another system call:

    int move_mount(int from_dfd, const char *from_path,
                   int to_dfd, const char *to_path, unsigned int flags);

To put a mounted filesystem into a spot in the hierarchy, move_mount() would be called with the file descriptor from fsmount() passed as from_dfd (from_path would be NULL). The location where the filesystem should be placed is described by to_dfd and to_path in the usual manner for *at() system calls. Among other things, the to_dfd file descriptor will identify the mount namespace in which the mount appears — something that can be tricky to do currently. The flags argument is used to control behavior like following symbolic links or whether to automount filesystems when determining the source and destination locations.

As might be expected, move_mount() can also be used to relocate a fully mounted filesystem within the tree.

Other operations

That is the basic sequence of operations to mount a filesystem in the new order. But, of course, the real world is more complex than that. Users want to query filesystems, remount them into different namespaces, remount them with different options, and more. Three more system calls have been provided to make these actions possible; the first of those is fsinfo():

    int fsinfo(int dfd, const char *filename,
	       const struct fsinfo_params *params,
	       void *buffer, size_t buf_size);

This call can be used to query just about any attribute of a mounted filesystem. It is somewhat complex; interested readers can see the patch changelog for details, or the man page patch for a lot of details.

If the goal is to create a new mount of an existing filesystem, a more straightforward path is to use open_tree():

    int open_tree(unsigned int dfd, const char *pathname, unsigned int flags);

Without special flags, this call is similar to calling open() on a directory with the O_PATH flag set. It returns a file descriptor corresponding to that directory that can only be used for a small set of operations — move_mount(), for example. But with the OPEN_TREE_CLONE flag, it will make a copy of the filesystem mount that can then be mounted elsewhere; it can thus be used to create a bind mount. Add the AT_RECURSIVE flag, and a whole hierarchy can be cloned and made ready for mounting in a different context.

Finally, there is fspick():

    int fspick(unsigned int dirfd, const char *path, unsigned int flags);

This system call can be thought of as the equivalent of fsopen() for an existing mount point. It returns a file descriptor that can be written to in the same way to change the mount parameters; the "x reconfigure" string at the end creates the equivalent of a remount operation.

Playing with fire

There is relatively little controversy around most of this work, perhaps because few people have the stamina to plow through a 32-part patch set deep in the virtual filesystem layer. The concerns that have been raised have to do with the configuration API for file descriptors returned by fsopen() and fspick(). Andy Lutomirski was clear about his concerns, saying: "I think you’re seriously playing with fire with the API". His worry, echoed by Linus Torvalds, is that the API based on write() calls could be dangerous.

In particular, Lutomirski worried that an attacker might succeed in getting a setuid program to write to one of these file descriptors, giving that attacker access to files or devices that would otherwise be protected. This problem could be avoided by using the credentials of the process that created the file descriptor for all subsequent operations — something that is supposed to happen already — but that is not seen as a practical possibility; as Torvalds noted, even code that tries to get that right often makes mistakes and ends up using the credentials of the process calling write() instead.

Solving this problem requires changing the API so that a call to write() does not have arbitrary side effects in the kernel. One possibility is to create yet another system call and use it to communicate the mount parameters to the kernel; that would prevent problems resulting from a redirected write. The alternative, which seems likely to be the way things will go in the end, is to add a different system call to replace the "x" operation at the end of that series of writes. It would look something like:

    int fscommit(unsigned int fd, unsigned int cmd);

Where fd is the file descriptor for the under-construction mount point, and cmd is either FSCOMMIT_CREATE or FSCOMMIT_RECONFIGURE. The CAP_SYS_ADMIN capability would be required to perform this operation. The end result would be that, while an attacker might be able to convince a setuid program to write to the file descriptor, that attacker would not be able to actually make the changes effective without having already gained a high level of privilege.

Regardless of the final conclusion, this patch set will need to go through at least one more round before it can be merged. Torvalds has also complained that the motivation behind this work is not well described: "I sure want to see an explanation for *WHY* it adds 5000+ lines of core code" (a lot of interesting information can be found in Howells's response to that request). There is clearly some work to be done still, so this work will probably not be ready for the next merge window. In the not-too-distant future, though, the mount() system call seems likely to become obsolete.

Index entries for this article
KernelFilesystems/Mounting


to post comments

Six (or seven) new system calls for filesystem mounting

Posted Jul 12, 2018 18:14 UTC (Thu) by josh (subscriber, #17465) [Link] (6 responses)

Leaving aside the various issues with the API...

I'm really hopeful that it'll be possible to take the file descriptor returned by fsmount and use it as a directory fd with openat and similar, without putting it anywhere in the filesystem hierarchy.

Six (or seven) new system calls for filesystem mounting

Posted Jul 12, 2018 18:39 UTC (Thu) by ssl (guest, #98177) [Link] (2 responses)

Every day we stray closer and closer to plan9

Six (or seven) new system calls for filesystem mounting

Posted Jul 12, 2018 19:11 UTC (Thu) by giantcheeseburger (guest, #125574) [Link] (1 responses)

You say that like it's a bad thing.

Six (or seven) new system calls for filesystem mounting

Posted Jul 13, 2018 12:40 UTC (Fri) by k3ninho (subscriber, #50375) [Link]

As soon as we were networked with programs that could accept remote procedure calls (RPC), we were in the world of 'everything is a file' / 'every action is a server' that's core to Plan9 and HURD. The uptake of chmod/cgroups/netns/containerisation is just slow reinvention on the path to uptake of these ideas.

K3n.

Six (or seven) new system calls for filesystem mounting

Posted Jul 26, 2018 12:18 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 responses)

We seem already to do that with USB (I guess it's using FUSE or something). I regularly want to cd into a directory and use the command line, only to discover that there seems to be no path to the folder I'm looking at in Dolphin.

Cheers,
Wol

Six (or seven) new system calls for filesystem mounting

Posted Jul 8, 2019 14:05 UTC (Mon) by natkr (subscriber, #123377) [Link]

KDE and GNOME both have their own (mutually incompatible) virtual filesystem systems: KIO and GVFS, respectively.

Six (or seven) new system calls for filesystem mounting

Posted Aug 7, 2018 12:04 UTC (Tue) by sourcejedi (guest, #45153) [Link]

That's how I read the code. It should work very similarly to an FD pointing into a detached mount (see "umount -l"). Use fchdir() on it or whatever you like :).

I note you can't do mount manipulation inside a detached tree. A detached tree is considered to have no mount ns. In principle, mount requires CAP_SYS_ADMIN in the user namespace which owns the mount namespace. In practice it currently tests that the mount namespace is the same by calling check_mnt(); this is a strictly tighter check.

I also notice you can't nest this technique, at least in the current patch set. OPEN_TREE_CLONE calls check_mnt() on the source directory, so you can't apply it inside a detached tree.

Six (or seven) new system calls for filesystem mounting

Posted Jul 13, 2018 10:41 UTC (Fri) by dgm (subscriber, #49227) [Link] (1 responses)

I wonder why they propose fspick() as a new syscall, for what is in essence linkat() with flags. Why not just add the flags to linkat?

Six (or seven) new system calls for filesystem mounting

Posted Jul 14, 2018 18:34 UTC (Sat) by simcop2387 (subscriber, #101710) [Link]

linkat's semantics wouldn't allow that to work properly. Linkat is supposed to fail when given a path to an existing file. Of course this could be dealt with by adding a flag, saying not to do that. And then there's the fact that linkat is currently only dealing with files, not directories. You can hardlink a directory with it, and you can't mount a filesystem onto a file. It doesn't quite match up properly.

Six (or seven) new system calls for filesystem mounting

Posted Jul 13, 2018 10:59 UTC (Fri) by meyert (subscriber, #32097) [Link] (2 responses)

Isn't the name fscommit misleading; people starting to use Linux may mix it up with some commit operation on the file system itself, but the actually want fsync.
Also he prefix fs is very generic. Shouldn't it be fsm (for file system mount) to make the intention and the context of the syscalls more obvious?!
<Bike shedding end>

Six (or seven) new system calls for filesystem mounting

Posted Jul 16, 2018 14:22 UTC (Mon) by nilsmeyer (guest, #122604) [Link] (1 responses)

It doesn't really matter, once you start coding you'll quickly realize that you're looking at the wrong syscall. With regards to consistency and logical naming for system calls, that ship has sailed long ago.
Ken Thompson was once asked what he would do differently if he were redesigning the UNIX system. His reply: "I'd spell creat with an e."

unix speling

Posted Jul 18, 2018 7:49 UTC (Wed) by sdalley (subscriber, #18550) [Link]

And maybe umount with 2 n's.

The Plan 9 guys are going to love this

Posted Jul 16, 2018 18:03 UTC (Mon) by mirabilos (subscriber, #84359) [Link]

Six (or seven) new system calls for filesystem mounting

Posted Jul 18, 2018 15:55 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

I really don't like the use of structured strings in this context. What is it adding? A dedicated system call to set mount attributes seems much more robust.

Six (or seven) new system calls for filesystem mounting

Posted Jul 29, 2018 17:29 UTC (Sun) by mcortese (guest, #52099) [Link]

Isn't a string more error prone than traditional flags? A typo in MS_NOATIME is detected by the compiler, a typo in "o noatime" can only be detected at run time.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds