configfd() and shifting bind mounts
The mount API work replaces the existing, complex mount() system call with a half-dozen or so new system calls. An application would call fsopen() to open a filesystem stored somewhere or fspick() to open an already mounted filesystem. Calls to fsconfig() set various parameters related to the mount; fsmount() is then called to mount a filesystem within the kernel and move_mount() to attach the result to the filesystem hierarchy somewhere. There are a couple more calls to fill in other parts of the interface as well. The intent is for this set of system calls to be able to replace mount() entirely with something that is more flexible, capable, and maintainable.
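The sequence described above can be sketched in C roughly as follows. This is a minimal illustration rather than code from the patches: it invokes the system calls through syscall(2), assumes kernel headers from Linux 5.2 or later, and needs CAP_SYS_ADMIN to succeed. The function name mount_tmpfs_at() and the "size" value are assumptions chosen for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* AT_FDCWD */
#include <linux/mount.h>    /* FSOPEN_*, FSCONFIG_*, FSMOUNT_*, MOUNT_ATTR_*, MOVE_MOUNT_* */
#include <stddef.h>         /* NULL */
#include <sys/syscall.h>    /* SYS_fsopen and friends */
#include <unistd.h>         /* syscall(), close() */

/*
 * Hypothetical helper: mount a fresh tmpfs instance at `target`.
 * Returns 0 on success and -1 on any failure.
 */
int mount_tmpfs_at(const char *target)
{
    int ret = -1;
    int mfd = -1;

    /* Step 1: open a configuration context for the tmpfs filesystem type. */
    int fsfd = (int)syscall(SYS_fsopen, "tmpfs", FSOPEN_CLOEXEC);
    if (fsfd < 0)
        return -1;

    /* Step 2: set a parameter (illustrative value), then create the superblock. */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "size", "16m", 0) < 0)
        goto out;
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        goto out;

    /* Step 3: turn the context into a detached mount inside the kernel. */
    mfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC,
                       MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC);
    if (mfd < 0)
        goto out;

    /* Step 4: attach the detached mount to the filesystem hierarchy. */
    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                MOVE_MOUNT_F_EMPTY_PATH) == 0)
        ret = 0;

out:
    if (mfd >= 0)
        close(mfd);
    close(fsfd);
    return ret;
}
```

A privileged caller would use it as mount_tmpfs_at("/mountpoint"); without the needed capability, or with a target that does not exist, it returns -1.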
Back in November, Bottomley discovered one significant gap with the new API: it is not possible to use it to set up a read-only bind mount. The problem is that bind mounts are special; they do not represent a filesystem directly. Instead, they can be thought of as a view of a filesystem that is mounted elsewhere. There is no superblock associated with a bind mount, which turns out to be a problem where the new API is concerned, since fsconfig() is designed to operate on superblocks. An attempt to call fsconfig() on a bind mount will end up modifying the original mount, which is almost certainly not what the caller had in mind. So there is no way to set the read-only flag for a bind mount.
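For comparison, the classic mount() interface handles the read-only case in two steps, as documented in the mount(2) man page: create the bind mount, then remount it with the read-only flag. The sketch below shows that older approach, not the new API; the helper name bind_mount_readonly() is an assumption made for the example, and the call needs CAP_SYS_ADMIN.

```c
#include <stddef.h>      /* NULL */
#include <sys/mount.h>   /* mount(), umount(), MS_* flags */

/*
 * Hypothetical helper: bind-mount `src` onto `dst` and make only the
 * new mount read-only. Returns 0 on success and -1 on failure.
 */
int bind_mount_readonly(const char *src, const char *dst)
{
    /* Step 1: create the bind mount. Flags such as MS_RDONLY are ignored here. */
    if (mount(src, dst, NULL, MS_BIND, NULL) < 0)
        return -1;

    /* Step 2: remount the bind mount itself read-only; the original
     * mount of the filesystem keeps its own read/write state. */
    if (mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL) < 0) {
        umount(dst);   /* do not leave a writable bind mount behind */
        return -1;
    }
    return 0;
}
```

The second step is the per-mount attribute change that, according to the article, fsconfig() cannot express because it operates on the superblock.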
David Howells, the creator of the new mount API, responded that what is needed is yet another system call, mount_setattr(), which would change attributes of mounts. That would work for the read-only case, Bottomley said, but it falls down when it comes to more complex situations, such as his proposed UID-shifting bind mount. Instead, he said, the file-descriptor-based configuration mechanism provided by fsconfig() is well suited to this job, but it needs to be made more widely applicable. He suggested that this interface be made more generic so that it could be used in both situations (and beyond).
He posted an initial version of this proposed interface in November, and has recently come back with an updated version. It adds two new system calls:
    int configfd_open(const char *name, unsigned int flags, unsigned int op);
    int configfd_action(int fd, unsigned int cmd, const char *key, void *value, int aux);
A call to configfd_open() would open a new file descriptor intended for the configuration of the subsystem identified by name; the usual open() flags would appear in flags, and op defines whether a new configuration instance is to be created or an existing one modified. configfd_action() would then be used to make changes to the returned file descriptor. The fsconfig() system call (along with related parts like fsopen() and fspick()) is reimplemented using the new calls. Bottomley provides an example for mounting a tmpfs filesystem:
    fd = configfd_open("tmpfs", O_CLOEXEC, CONFIGFD_CMD_CREATE);
    configfd_action(fd, CONFIGFD_SET_INT, "mount_attrs", NULL, MOUNT_ATTR_NODEV|MOUNT_ATTR_NOEXEC);
    configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
    configfd_action(fd, CONFIGFD_GET_FD, "mountfd", &mfd, O_CLOEXEC);
    move_mount("", mfd, AT_FDCWD, "/mountpoint", MOVE_MOUNT_F_EMPTY_PATH);
The configfd_open() call creates a new tmpfs instance; the first configfd_action() call is then used to set the nodev and noexec mount flags on that instance. The filesystem mount is actually created with another configfd_action() call, and the third such call is used to obtain a file descriptor for the mount that can be used with move_mount() to make the filesystem visible.
With that infrastructure in place, Bottomley is able to reimplement his shiftfs filesystem as a type of bind mount. A shifting bind mount will apply a constant offset to user and group IDs before forwarding operations to the underlying mount; this is useful to safely allow true-root access to an on-disk filesystem from within a user namespace.
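The ID translation itself is simple arithmetic. The following sketch models it in user space under stated assumptions: the struct id_shift type, the function names, and the example offset are all hypothetical and are not taken from the shiftfs or configfd patches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the constant offset held by a shifting bind mount. */
struct id_shift {
    int32_t offset;   /* added to an ID before the operation is forwarded */
};

/* Common helper: add `delta` to `id`, rejecting results that are not
 * valid 32-bit IDs (negative, or the reserved value (uint32_t)-1). */
static bool apply_delta(uint32_t id, int64_t delta, uint32_t *out)
{
    int64_t shifted = (int64_t)id + delta;

    if (shifted < 0 || shifted >= (int64_t)UINT32_MAX)
        return false;
    *out = (uint32_t)shifted;
    return true;
}

/* Translate an ID on its way down to the underlying mount. */
bool shift_id(const struct id_shift *s, uint32_t id, uint32_t *out)
{
    return apply_delta(id, s->offset, out);
}

/* Translate an ID coming back up from the underlying mount. */
bool unshift_id(const struct id_shift *s, uint32_t id, uint32_t *out)
{
    return apply_delta(id, -(int64_t)s->offset, out);
}
```

With an illustrative offset of -100000, an operation issued as ID 100000 reaches the underlying filesystem as ID 0, and a file owned by ID 0 on disk is reported back as ID 100000. IDs that would fall outside the valid range are rejected rather than wrapped.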
Only one developer, Christian Brauner, has responded to this patch series so far; he doesn't like it. It is an excessive collection of abstraction layers, he said, and it creates another set of multiplexing system calls, a design approach that is out of favor these days.
Unsurprisingly, Bottomley disagreed. He argued that there is a common pattern that arises in kernel development: a subsystem that is complicated to configure, but then relatively simple to use. Filesystem mounts are an example of this pattern; the setup is hard, but then they can all be accessed through the same virtual filesystem interfaces. Cryptographic keys and storage devices were also mentioned. It would be better, he said, to figure out a common way of interfacing with these subsystems rather than inventing slightly different interfaces every time. The configuration file descriptor approach may be a good solution for that common way, he said.
The conversation appears to have stalled out at this point. It is hard to guess how this disagreement will be resolved, but one thing is fairly straightforward to point out: if the configfd approach is deemed unacceptable for the kernel, then somebody needs to come up with a better idea for how the problems addressed by configfd will be solved. Thus far, that better idea has not yet shown up on the mailing lists.
Posted Jan 10, 2020 21:30 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (20 responses)
Instead of
configfd_open("tmpfs", O_CLOEXEC, CONFIGFD_CMD_CREATE)
write
open("/dev/config/fs/tmpfs/create", O_CLOEXEC | O_RDWR)
Posted Jan 10, 2020 21:53 UTC (Fri)
by josh (subscriber, #17465)
[Link] (19 responses)
If you're binding a filesystem, though, I wonder why there isn't a way to change the fd you get from fsopen of the existing filesystem into a separate filesystem with separate options for "bind"?
Posted Jan 10, 2020 22:24 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (15 responses)
Not a problem for /proc
And if it *is* a problem, the right approach isn't some random new twist on open(2), but a system call that retrieves a directory file descriptor for /dev/config or whatever, one that you could then use with openat():
openat(get_configfs_fd(), "fs/tmpfs/create", O_CLOEXEC | O_RDWR)
Posted Jan 10, 2020 22:41 UTC (Fri)
by cyphar (subscriber, #110703)
[Link] (1 responses)
Posted Jan 10, 2020 23:35 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
Posted Jan 10, 2020 23:26 UTC (Fri)
by roc (subscriber, #30627)
[Link] (11 responses)
As a solution to the "what if procfs isn't mounted?" problem, that seems far more elegant than the alternative of creating new syscall APIs for every single feature in procfs that someone might need to use without procfs mounted. (Same goes for other magic filesystems.)
Posted Jan 10, 2020 23:36 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Why would it cause problems? The actual FD would refer to a magical internal non-rooted mount, e.g., like the one the kernel sets up for pipefs on boot.
Posted Jan 11, 2020 1:12 UTC (Sat)
by roc (subscriber, #30627)
[Link]
Posted Jan 10, 2020 23:37 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Agreed. We don't need duplicate APIs. We just need some way to get a directory FD for /proc, /sys, whatever without going through the mount table.
Posted Jan 11, 2020 17:58 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
Posted Jan 12, 2020 14:57 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link] (6 responses)
Posted Jan 13, 2020 3:34 UTC (Mon)
by roc (subscriber, #30627)
[Link] (5 responses)
Posted Jan 13, 2020 21:40 UTC (Mon)
by mirabilos (subscriber, #84359)
[Link] (4 responses)
Posted Jan 13, 2020 21:45 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Perhaps it would be better to add a new syscall like 'open_special(fs_type)' to open '/proc', '/sys', '/sys/fs/...' directories without them being mounted.
Posted Jan 14, 2020 6:44 UTC (Tue)
by smurf (subscriber, #17840)
[Link]
Posted Jan 14, 2020 21:06 UTC (Tue)
by cyphar (subscriber, #110703)
[Link] (1 responses)
The problem is that there is a security issue with giving a program access to a /proc without any over-mounts if the /proc they already have access to has locked mounts on top of it (container runtimes use this technique to mask certain dangerous procfs files from containers). If we want to have a simple API that gives us a /proc handle, we'll need to make some kind of procfs2 (which has been suggested several times in the past) which removes all of the patently unsafe files so that untrusted programs can get access to all of it.
Posted Jan 14, 2020 21:09 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Probably at this point creating something like procfs2 and then mandating it would be the best approach. But then there's a question of what exactly is an "unsafe file"...
Posted Jan 12, 2020 14:56 UTC (Sun)
by mirabilos (subscriber, #84359)
[Link]
Posted Jan 10, 2020 22:38 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Bind mounts can point to any file, even one that is not a mount point--or even one that isn't a directory.
However, it does seem to me that passing an O_PATH file descriptor to fspick, plus a new flag for fspick that says "create a bind mount", would be a good API. The article hints that "fsconfig() is designed to work with superblocks" but it's not clear why.
Posted Jan 11, 2020 16:51 UTC (Sat)
by jejb (subscriber, #6654)
[Link] (1 responses)
I did explain that problem in the original email: all the hooks for fsconfig actions are in sb->fs_type->init_fs_context(), which the fs_context allocation uses. It is possible to special-case this for bind mounts, but you also have to special-case fsmount and fsconfig/reconfigure. By the time you've done all that, you've effectively got two separate paths through the same code, which isn't really a good idea. That is why I asked the question "what would the generalisation of fsconfig look like".
Posted Jan 11, 2020 20:09 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link]
Alternately, the new "mount" syscalls can give you a handle to /proc or /sys without actually mounting them.
Alternately, just acknowledge that not mounting /dev, /proc and /sys is not supported and going to cause problems, and leave it at that.