The mismatched mount mess

By Jonathan Corbet
August 10, 2018

"Mounting" a filesystem is the act of making it available somewhere in the system's directory hierarchy. But a mount operation doesn't just glue a device full of files into a specific spot in the tree; there is a whole set of parameters controlling how that filesystem is accessed that can be specified at mount time. The handling of these mount parameters is the latest obstacle to getting the proposed new mounting API into the mainline; should the new API reproduce what is arguably one of the biggest misfeatures of the current mount() system call?

The list of possible mount options is quite long. Some of them, like relatime, control details of how the filesystem metadata is managed internally. The dos1xfloppy option can be used with the FAT filesystem for that all-important compatibility with DOS 1.x systems. The ext4 bsddf option tweaks how free space is reported in the statfs() system call. But some options can have significant security implications. For example, the acl and noacl options control whether access control lists (ACLs) are used on the filesystem; turning off ACLs by accident on the wrong filesystem risks exposing files that should not be accessible.

It turns out that turning off ACLs by accident is indeed something that can happen on Linux systems. Eric Biederman, who has been on a bit of a crusade to force changes to the proposed new mount API, has described how that can happen. In a simplified form, consider this set of actions:

Create a large scratch file and set it up as a loopback device with losetup.
Create an ext4 filesystem on the device.
Mount that device with the noacl option somewhere in the filesystem hierarchy.
In another spot, mount that same filesystem with the acl option.

The user who performed the second mount would naturally expect to get a filesystem with ACLs enabled — that behavior was explicitly requested, after all. But the kernel will, instead, silently apply the options used in the first mount to the second, resulting in an apparently successful mount with parameters other than those that were requested. Biederman's chief complaint is that the new API will behave in the same way; he has stated his intent to block the merging of that code until this issue is fixed.
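The sequence above can be illustrated with a toy model. This is only a sketch of the behavior described here, not kernel code: the `ToyKernel` and `Superblock` classes, the device name, and the mount points are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """One in-kernel instance of a mounted filesystem (toy model)."""
    device: str
    options: frozenset


@dataclass
class ToyKernel:
    """Hypothetical stand-in for the kernel's mount bookkeeping."""
    superblocks: dict = field(default_factory=dict)  # device -> Superblock
    mounts: dict = field(default_factory=dict)       # mount point -> Superblock

    def mount(self, device, mountpoint, options):
        sb = self.superblocks.get(device)
        if sb is None:
            # First mount of this device: the requested options take effect.
            sb = Superblock(device, frozenset(options))
            self.superblocks[device] = sb
        # Later mounts reuse the existing superblock; the options they asked
        # for are dropped without any error being reported.
        self.mounts[mountpoint] = sb
        return sb.options


kernel = ToyKernel()
first = kernel.mount("/dev/loop0", "/mnt/a", {"noacl"})
second = kernel.mount("/dev/loop0", "/mnt/b", {"acl"})
print(sorted(first), sorted(second))
```

In this model the second call reports success, yet the options in effect are still those of the first mount, which is the surprise Biederman is objecting to.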

The source of this problem is that, in the kernel, it's really only possible to mount a filesystem once. The kernel is able to create new mount points that look like independent mounts, but it's all a single mounted filesystem underneath the cover. That means that only a single set of mount options can apply. So, as Ted Ts'o explained, there aren't a whole lot of options for changing this behavior:

So if the file system has been mounted with one set of mount options, and you want to try to mount it with a conflicting set of mount options and you don't want it to silently ignore the mount options, the *only* thing we can [do] today is to refuse the mount and return an error.

Some developers, including Biederman, are arguing that refusing the mount would indeed be better than ignoring the requested mount parameters. Andy Lutomirski said that this sort of multiple mount can go wrong in a number of ways and probably should not be allowed at all: "It seems to me that the current approach mostly involves crossing our fingers." There is, however, little prospect of changing how mount() works now, given the risk of breaking no end of administrative scripts.

That does leave open the question of whether the new API should allow this type of mount. Biederman feels strongly that incompatible shared mounts should be disallowed before the new API makes it into a kernel release, since it will become much harder to change afterward:

The fact that these things happen silently and you have to be on your toes to catch them is fundamentally a bug in the API. If the mount request had simply failed people would have noticed the issues much sooner and silently b0rkend configuration would not have propagated. As such I do not believe we should propagate this misfeature from the old API into the new API.

David Howells, the developer behind the new mount API, has stated that, since the current code does not break any existing user behavior, there is no urgent need to add restrictions. But he is looking into doing so anyway, adding options so that user space can specify whether no sharing should be allowed at all, whether it should only be allowed with the same mount parameters, or whether the current behavior should apply. Others have suggested a variant on the middle case, where the mount options would just have to be "compatible" with each other, not identical.
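The three sharing behaviors under discussion can be written down as a small decision function. This is a sketch under assumptions: the `Sharing` names, `MountError`, and `resolve_options()` are hypothetical and do not correspond to anything in the proposed API.

```python
from enum import Enum


class Sharing(Enum):
    NONE = "none"          # never share an already-mounted filesystem
    SAME_OPTIONS = "same"  # share only when the options match exactly
    ANY = "any"            # current behavior: share and ignore new options


class MountError(Exception):
    """Raised when the toy policy refuses a mount request."""


def resolve_options(existing, requested, policy):
    """Return the options the mount ends up with, or raise MountError.

    `existing` is None when the filesystem is not mounted yet.
    """
    requested = frozenset(requested)
    if existing is None:
        return requested
    existing = frozenset(existing)
    if policy is Sharing.NONE:
        raise MountError("filesystem is already mounted")
    if policy is Sharing.SAME_OPTIONS and requested != existing:
        raise MountError("options differ from the existing mount")
    # Sharing is allowed; the existing superblock's options stay in force.
    return existing
```

With `Sharing.ANY`, a request for `{"acl"}` against an existing `{"noacl"}` mount returns `{"noacl"}`; with either of the other two policies the same request raises `MountError` instead of succeeding quietly.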

It turns out, though, that this limited sharing is not easy to implement either way. The core filesystem layer has no idea which mount options are compatible with each other, so there would have to be a new callback added to each filesystem implementation to answer that question. That answer doesn't just depend on the actual options; things like the security-module policy in force also have to be taken into account. It is a thorny problem, and any solution seems likely to be prone to errors. It is thus unsurprising that developers like Ts'o are asking whether it is worth the effort at all.
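To make the shape of such a callback concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the registry, the `register()` decorator, and the "toyfs" filesystem with its single rule are invented, and the security-module side of the question is not modeled at all.

```python
from typing import Callable, Dict, FrozenSet

Options = FrozenSet[str]
CompatCheck = Callable[[Options, Options], bool]

# Filesystem type -> compatibility callback (hypothetical registry).
compat_checks: Dict[str, CompatCheck] = {}


def register(fstype: str):
    """Decorator that records a filesystem's compatibility callback."""
    def decorator(func: CompatCheck) -> CompatCheck:
        compat_checks[fstype] = func
        return func
    return decorator


@register("toyfs")
def toyfs_compatible(existing: Options, requested: Options) -> bool:
    # Illustrative rule for an invented filesystem: "acl" and "noacl"
    # contradict each other; every other combination is accepted.
    combined = existing | requested
    return not {"acl", "noacl"} <= combined


def options_compatible(fstype: str, existing, requested) -> bool:
    existing, requested = frozenset(existing), frozenset(requested)
    check = compat_checks.get(fstype)
    if check is None:
        # Without a callback, the core can only compare for equality.
        return existing == requested
    return check(existing, requested)
```

Even this toy version shows where the errors would creep in: each filesystem has to enumerate its own conflicts correctly, and any filesystem that does not provide a callback falls back to the strict "identical options" rule.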

That is a question that has not been answered as of this writing. Assuming that Biederman doesn't back down, there will probably need to be some way of preventing shared mounts when the options are not compatible; that could come down to preventing (or at least giving an option to prevent) shared mounts entirely. Such an outcome will do little good, though, if there are enough users out there who depend on this type of shared mount. If the new API prevents them from getting their work done, they will simply stick with the old one, which will then become difficult to ever remove from the kernel.

Biederman is right in saying that, had this particular behavior never been allowed, there would not be users who are dependent on it now. But that ship sailed a long time ago. What's left now is a situation in which developers are trying to figure out what the correct behavior is without causing pain to system administrators. It is a bit of a mess, with no obvious solution.


The mismatched mount mess

Posted Aug 11, 2018 1:10 UTC (Sat) by zblaxell (subscriber, #26385) [Link] (9 responses)

> The kernel is able to create new mount points that look like independent mounts, but it's all a single mounted filesystem underneath the cover.

...except for all the strange, undocumented, and probably unintentional places where it's not, and the mount points behave like separate filesystems (or at least separate VFS layers with different parameters running as an overlay on top of a single filesystem). If it works accidentally some of the time, it could be made to work all of the time, given sufficient debugging.

It seems to me that the way to resolve this issue properly is to "simply" make the example with simultaneous noacl and acl mounts *work*. Add a context pointer in VFS somewhere so the filesystem can tell which mount point the request is coming from, and behave accordingly.

The mismatched mount mess

Posted Aug 11, 2018 6:37 UTC (Sat) by smurf (subscriber, #17840) [Link]

… and then gradually move all the other potentially-context-sensitive options over.

The mismatched mount mess

Posted Aug 11, 2018 7:17 UTC (Sat) by TheJH (subscriber, #101155) [Link] (7 responses)

Maaaaybe you could make it work for "acl". But then what about "sb="? "journal_path="? "data="?

The mismatched mount mess

Posted Aug 11, 2018 16:22 UTC (Sat) by nybble41 (subscriber, #55106) [Link] (6 responses)

To me it seems like the ideal state would be to have two distinct sets of options: filesystem options which can only be set once per filesystem and are common across all shared mounts, like "sb" and "journal_path", and mount point options which may be set differently for each mount point, like "acl". Opening a filesystem and mounting it at a particular path would be two separate actions, so you could pass the appropriate sets of options to each of them. If a filesystem is already mounted then you wouldn't open it again, and attempting to do so should fail; instead, you would use a different call to obtain a reference to the already-open filesystem, which you can then mount. This call would not take any filesystem options.
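A rough sketch of that two-step split, as a toy model only. The names `open_filesystem()`, `get_filesystem()`, `OpenFilesystem`, and `AlreadyOpen` are hypothetical, as are the device names and the option value used in the example.

```python
class AlreadyOpen(Exception):
    """Raised when a filesystem is opened a second time."""


class OpenFilesystem:
    def __init__(self, device, fs_options):
        self.device = device
        self.fs_options = frozenset(fs_options)  # fixed once, shared by all mounts
        self.mounts = {}                         # path -> per-mount options

    def mount(self, path, mount_options=()):
        # Per-mount options are recorded separately for each mount point.
        self.mounts[path] = frozenset(mount_options)


_open_filesystems = {}  # device -> OpenFilesystem


def open_filesystem(device, fs_options=()):
    """Open a filesystem; fails if it is already open."""
    if device in _open_filesystems:
        raise AlreadyOpen(device)
    fs = OpenFilesystem(device, fs_options)
    _open_filesystems[device] = fs
    return fs


def get_filesystem(device):
    """Return the already-open filesystem; takes no filesystem options."""
    return _open_filesystems[device]


fs = open_filesystem("/dev/loop0", {"journal_path=/dev/loop1"})
fs.mount("/mnt/a", {"noacl"})
get_filesystem("/dev/loop0").mount("/mnt/b", {"acl"})
```

In this model there is nothing to ignore silently: filesystem options can only be given once, and the second mount has no way to pass conflicting ones. Whether "acl" can really be treated as a per-mount option is exactly what the replies that follow dispute.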

The mismatched mount mess

Posted Aug 11, 2018 17:55 UTC (Sat) by josh (subscriber, #17465) [Link]

That seems plausible, and with some slight tweaks to the new API, userspace could provide those two sets of options to the fsopen fd and the fsmount fd, respectively.

The mismatched mount mess

Posted Aug 11, 2018 18:42 UTC (Sat) by zblaxell (subscriber, #26385) [Link] (4 responses)

That ends up propagating down to individual system calls and also has impact on caching. What happens when someone does a stat() through a mount point with the vfat 'uid=1000' option, then someone else does a stat() of the same file on a different mount point with the vfat 'uid=9999' option?

This opens up philosophical questions like "is the ownership of a file exclusively a property of its inode, or a property of something else? e.g. directory entry, parent directory, root of the filesystem, mount point" and "are inodes sufficient as file identifiers" and "who caches all this anyway?" VFS isn't good at dealing with filesystems where the filesystem's answer to this kind of question is different from ext4's answer.

So maybe it's only possible for cases where the VFS layer can implement the option without asking the filesystem (e.g. 'ro' or 'nosuid') or where the result doesn't have any effect VFS cares about (e.g. 'compress' or 'nodatasum' which affect only default parameters for new objects and data but don't create multiple views of the same object from different mount points).

But then what happens when someone mounts a filesystem once with nosuid, once with suid, and once with neither option? "nosuid" and "suid" clearly conflict, but does neither option mean "keep the value from the previous mount (but _which_ previous mount)" or does it mean "suid"?

The mismatched mount mess

Posted Aug 12, 2018 4:44 UTC (Sun) by viro (subscriber, #7872) [Link] (2 responses)

You know, what annoys me about that fsdevel thread is that presumably clued people do not bother to RTFM. Or RTFS. Or directly experiment. FYI: suid/nosuid is mount property. There is nothing to "conflict" - you either pass MS_NOSUID to mount(2), or you do not. Either way, it affects that mount and nothing else. Filesystem itself doesn't know and doesn't care.

All of the above could be found in a couple of minutes by reading through mount(2) or mount(8) and experimenting.
As in,
root@kvm1:~# dd if=/dev/zero of=/tmp/foo bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00372987 s, 562 MB/s
root@kvm1:~# mkfs /tmp/foo
mke2fs 1.44.3 (10-July-2018)
Discarding device blocks: done
Creating filesystem with 2048 1k blocks and 256 inodes

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

root@kvm1:~# losetup /dev/loop0 /tmp/foo
root@kvm1:~# mkdir /tmp/a /tmp/b
root@kvm1:~# mount /dev/loop0 /tmp/a
root@kvm1:~# mount /dev/loop0 /tmp/b -o nosuid
root@kvm1:~# cp /usr/bin/whoami /tmp/a/
root@kvm1:~# chown lp /tmp/a/whoami
root@kvm1:~# chmod +s /tmp/a/whoami
root@kvm1:~# /tmp/a/whoami
lp
root@kvm1:~# /tmp/b/whoami
root
root@kvm1:~# umount /tmp/a /tmp/b
root@kvm1:~# losetup -d /dev/loop0
root@kvm1:~# rm -rf /tmp/foo /tmp/a /tmp/b

Two minutes, beginning to end. Sigh...

The mismatched mount mess

Posted Aug 12, 2018 5:01 UTC (Sun) by viro (subscriber, #7872) [Link] (1 response)

... and if you are asking if previous mounts would for some reason affect the suid/nosuid state on subsequent ones, the answer is (a) no; (b) check and see.

Incidentally, the worst part of cross-namespace sharing of mounts is not just the fs-level mount options being ignored rather than having mount(2) fail on their mismatch. The real nastiness comes from mount -o remount done in one place and affecting every mount of that sucker. The mount-level options (nosuid among them) are not an issue - they won't do anything to other mounts. Per-fs ones bloody well will. And no, we really can't make each option per-mount - I dare you to try and handle the things like -o errors=panic or -o errors=remount-ro on per-mountpoint basis; good luck propagating the "originating" mount towards each ext4_error() in there. Especially fun when it comes to errors on e.g. writeback. Or -o data=... for the same ext4...
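The split between mount-level and per-fs options is visible from user space in /proc/self/mountinfo, where each line carries the per-mount options before the " - " separator and the per-superblock options at the very end. The sketch below parses that layout; the two sample lines are made up to resemble the nosuid experiment above and are not captured output.

```python
# Invented sample in the /proc/self/mountinfo layout: one loop device
# mounted at two places, the second time with nosuid.
SAMPLE = """\
100 25 7:0 / /tmp/a rw,relatime - ext2 /dev/loop0 rw
101 25 7:0 / /tmp/b rw,nosuid,relatime - ext2 /dev/loop0 rw
"""


def parse_mountinfo_line(line):
    """Split one mountinfo line into per-mount and per-superblock parts."""
    left, right = line.split(" - ", 1)
    fields = left.split()
    fstype, source, super_options = right.split()[:3]
    return {
        "device": fields[2],                          # major:minor
        "mount_point": fields[4],
        "mount_options": set(fields[5].split(",")),   # per-mount
        "fstype": fstype,
        "source": source,
        "super_options": set(super_options.split(",")),  # per-superblock
    }


def mounts_by_device(text):
    """Group parsed mountinfo entries by their major:minor device number."""
    grouped = {}
    for line in text.splitlines():
        if line.strip():
            entry = parse_mountinfo_line(line)
            grouped.setdefault(entry["device"], []).append(entry)
    return grouped


for device, entries in mounts_by_device(SAMPLE).items():
    for entry in entries:
        print(device, entry["mount_point"],
              sorted(entry["mount_options"]), sorted(entry["super_options"]))
```

Passing the contents of /proc/self/mountinfo to `mounts_by_device()` instead of `SAMPLE` gives the same view of a live system: mounts sharing a device may differ in the per-mount column, while the trailing superblock options are the same for all of them.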

The mismatched mount mess

Posted Aug 12, 2018 5:28 UTC (Sun) by bof (subscriber, #110741) [Link]

wrt. the fs-level shared mount options - does the shared data for that "know" how often the thing is mounted? If that is the case, maybe a general approach of only permitting change (on remount or additional mount) if it's not mounted more than once already, would work?

The mismatched mount mess

Posted Aug 13, 2018 13:39 UTC (Mon) by bandrami (guest, #94229) [Link]

At that point I feel the need to imagine a lecture by Jeff Goldblum about the difference between what we can do and what we should do. Wouldn't it make more sense to design a new userspace layer that simulates all this instead?

The mismatched mount mess

Posted Aug 11, 2018 13:45 UTC (Sat) by smoogen (subscriber, #97) [Link] (1 responses)

So it sounds like one would need to have a BPF 'engine' which would view what was enabled in the kernel, what the options were and then return an intelligent answer whether the mount options being requested conflict. A more generic sort of engine would be useful in other areas where networking options for containers could cause conflicts and probably other components also. However it would be a big undertaking to make everything report an api which could be parsed by the engine.

[I went with BPF because it could possibly be programmed and updated to match specific needs of a set of systems.]

The mismatched mount mess

Posted Aug 11, 2018 15:17 UTC (Sat) by grawity (subscriber, #80596) [Link]

But it's not knowledge that needs to be imported from the userspace; it's something that already directly involves kernel code (filesystems). There's no point in having an isolated engine, it could just as well be regular kernel code.

The mismatched mount mess

Posted Aug 12, 2018 8:28 UTC (Sun) by ncm (guest, #165) [Link] (2 responses)

Any correct solution may be recognized by whether it reduces complexity and likelihood of mistakes. Figuring out what are "compatible" options obviously fails that test.

The correct solution will be one of: (1) by default, refuse to mount again, or (2) by default, refuse to mount again with different options. In either case, backward compatibility is achieved with a "foolish" flag, which permits the operation anyway. Nobody turns on "foolish" mode by accident, although some will certainly leave it on by accident. In that instance they are exactly where everyone is now.

Probably remounting one of the mounts with different options should fail, too, subject again to the foolish flag. Mounts that are allowed to users add an interesting wrinkle: if root mounts with one set of options, and fstab permits user mounting with different options, does it go? I expect so. That seems worth fixing too.

The mismatched mount mess

Posted Aug 13, 2018 23:42 UTC (Mon) by jreiser (subscriber, #11027) [Link] (1 responses)

After a filesystem error, it would be nice to still allow re-mounting the filesystem read-only. That is currently a common way to allow for diagnosis and a semi-orderly shutdown of applications that are aware of the convention.

The mismatched mount mess

Posted Aug 15, 2018 0:33 UTC (Wed) by ncm (guest, #165) [Link]

A read-only re-mount is a great place to use the "foolish" flag.

The mismatched mount mess

Posted Aug 13, 2018 4:56 UTC (Mon) by jjm@pk28.nl (guest, #111434) [Link] (1 responses)

[...] developers are trying to figure out what the correct behavior [is] while avoiding causing pain to system administrators.

Seems like the "Don't break userspace" rule is being interpreted here as "Userspace must remain broken forever".

The mismatched mount mess

Posted Aug 16, 2018 17:03 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]

"Once abuse has occurred, the only choice is to perpetuate it" seems an impediment to real improvement.