The mismatched mount mess
The list of possible mount options is quite long. Some of them, like relatime, control details of how the filesystem metadata is managed internally. The dos1xfloppy option can be used with the FAT filesystem for that all-important compatibility with DOS 1.x systems. The ext4 bsddf option tweaks how free space is reported in the statfs() system call. But some options can have significant security implications. For example, the acl and noacl options control whether access control lists (ACLs) are used on the filesystem; turning off ACLs by accident on the wrong filesystem risks exposing files that should not be accessible.
It turns out that turning off ACLs by accident is indeed something that can happen on Linux systems. Eric Biederman, who has been on a bit of a crusade to force changes to the proposed new mount API, has described how that can happen. In a simplified form, consider this set of actions:
- Create a large scratch file and set it up as a loopback device with losetup.
- Create an ext4 filesystem on the device.
- Mount that device with the noacl option somewhere in the filesystem hierarchy.
- In another spot, mount that same filesystem with the acl option.
The user who performed the second mount would naturally expect to get a filesystem with ACLs enabled — that behavior was explicitly requested, after all. But the kernel will, instead, silently apply the options used in the first mount to the second, resulting in an apparently successful mount with parameters other than those that were requested. Biederman's chief complaint is that the new API will behave in the same way; he has stated his intent to block the merging of that code until this issue is fixed.
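Since the second mount succeeds silently, the only way to notice the mismatch is to look at what the kernel recorded afterward. What follows is a minimal sketch, assuming the /proc/self/mountinfo layout described in proc(5), where the per-superblock options are the third field after a lone "-" separator. The function names super_opts and has_opt are invented for illustration, and whether a given option (noacl, say) shows up in that field at all depends on the filesystem and its defaults.

```shell
#!/bin/sh
# Print the per-superblock options recorded for a mount point.
# Usage: super_opts MOUNTPOINT [MOUNTINFO_FILE]
# (hypothetical helper; reads /proc/self/mountinfo by default)
super_opts() {
    awk -v mp="$1" '
        $5 == mp {
            # Optional fields start at field 7 and end at a lone "-";
            # the super options are the third field after it.
            for (i = 7; i <= NF; i++)
                if ($i == "-") { opts = $(i + 3); break }
        }
        # If several entries match, the last one listed is reported.
        END { if (opts != "") print opts }
    ' "${2:-/proc/self/mountinfo}"
}

# Succeed if option OPT appears in the comma-separated list OPTS.
# Usage: has_opt OPT OPTS
has_opt() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use after "mount -o acl /dev/loop0 /mnt/second":
#   if has_opt noacl "$(super_opts /mnt/second)"; then
#       echo "ACLs were requested but noacl is in effect" >&2
#   fi
```

Checking for the unwanted option (noacl) rather than the requested one is a deliberate choice here: filesystems tend to omit options that match their defaults, so the absence of "acl" in the output would not prove anything.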
The source of this problem is that, in the kernel, it's really only possible to mount a filesystem once. The kernel is able to create new mount points that look like independent mounts, but it's all a single mounted filesystem under the covers. That means that only a single set of mount options can apply. So, as Ted Ts'o explained, there aren't a whole lot of options for changing this behavior:
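The sharing can be made visible from user space. The sketch below lists every device ID that shows up at more than one mount point, again assuming the mountinfo layout from proc(5) (field 3 is major:minor, field 5 is the mount point). Using the device ID as a stand-in for "same underlying mounted filesystem" is an assumption of this example, bind mounts will be listed too, and the name shared_mounts is invented here.

```shell
#!/bin/sh
# Print "MAJOR:MINOR MOUNTPOINT MOUNTPOINT..." for each device ID
# that is visible at two or more mount points.
# Usage: shared_mounts [MOUNTINFO_FILE]
# (hypothetical helper; reads /proc/self/mountinfo by default)
shared_mounts() {
    awk '
        # Collect mount points per device ID, in the order listed.
        { points[$3] = points[$3] " " $5; count[$3]++ }
        END {
            for (dev in count)
                if (count[dev] > 1) print dev points[dev]
        }
    ' "${1:-/proc/self/mountinfo}" | sort
}
```

Every mount point printed on one line is served by the same mounted filesystem, so any per-filesystem option applies to all of them at once.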
Some developers, including Biederman, are arguing that refusing the mount would indeed be better than ignoring the requested mount parameters. Andy Lutomirski said that this sort of multiple mount can go wrong in a number of ways and probably should not be allowed at all: "It seems to me that the current approach mostly involves crossing our fingers." There is, however, little prospect of changing how mount() works now, given the risk of breaking no end of administrative scripts.
That does leave open the question of whether the new API should allow this type of mount. Biederman feels strongly that incompatible shared mounts should be disallowed before the new API makes it into a kernel release, since it will become much harder to change afterward:
David Howells, the developer behind the new mount API, has stated that, since the current code does not break any existing user behavior, there is no urgent need to add restrictions. But he is looking into doing so anyway, adding options so that user space can specify whether no sharing should be allowed at all, whether it should only be allowed with the same mount parameters, or whether the current behavior should apply. Others have suggested a variant on the middle case, where the mount options would just have to be "compatible" with each other, not identical.
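The three choices under discussion can be written down as a small decision function. This is only a sketch of the policy logic, not of anything in the proposed API: the policy names (none, same, any), the function names, and the idea of comparing options as plain strings are all assumptions made for illustration. The "compatible" variant is deliberately left out, since deciding compatibility needs knowledge that only each filesystem has.

```shell
#!/bin/sh
# Normalize a comma-separated option list so that ordering and
# duplicates do not affect the comparison.
normalize_opts() {
    printf '%s\n' "$1" | tr ',' '\n' | sort -u | paste -sd, -
}

# Decide whether a second mount may share an existing superblock.
# POLICY is one of (hypothetical names):
#   none  never share
#   same  share only when the option sets are identical
#   any   always share; the first mount's options stay in force
# Prints "share" or "refuse"; returns 2 for an unknown policy.
# Usage: share_decision POLICY EXISTING_OPTS REQUESTED_OPTS
share_decision() {
    case "$1" in
        none) echo refuse ;;
        same)
            if [ "$(normalize_opts "$2")" = "$(normalize_opts "$3")" ]; then
                echo share
            else
                echo refuse
            fi ;;
        any) echo share ;;
        *) echo "unknown policy: $1" >&2; return 2 ;;
    esac
}

# Example: the acl/noacl case under each policy.
#   share_decision any  rw,noacl rw,acl   # prints "share"
#   share_decision same rw,noacl rw,acl   # prints "refuse"
```

The "any" branch corresponds to what mount() does today, which is why it reports success even though the requested options are not the ones in effect.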
It turns out, though, that this limited sharing is not easy to implement either way. The core filesystem layer has no idea which mount options are compatible with each other, so there would have to be a new callback added to each filesystem implementation to answer that question. That answer doesn't just depend on the actual options; things like the security-module policy in force also have to be taken into account. It is a thorny problem, and any solution seems likely to be prone to errors. It is thus unsurprising that developers like Ts'o are asking whether it is worth the effort at all.
That is a question that has not been answered as of this writing. Assuming that Biederman doesn't back down, there will probably need to be some way of preventing shared mounts when the options are not compatible; that could come down to preventing (or at least giving an option to prevent) shared mounts entirely. Such an outcome will do little good, though, if there are enough users out there who depend on this type of shared mount. If the new API prevents them from getting their work done, they will simply stick with the old one, which will then become difficult to ever remove from the kernel.
Biederman is right in saying that, had this particular behavior never been allowed, there would not be users who are dependent on it now. But that ship sailed a long time ago. What's left now is a situation where developers are trying to figure out what the correct behavior is while avoiding causing pain to system administrators. It is a bit of a mess lacking an obvious solution.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/Mounting |
Posted Aug 11, 2018 1:10 UTC (Sat)
by zblaxell (subscriber, #26385)
[Link] (9 responses)
...except for all the strange, undocumented, and probably unintentional places where it's not, and the mount points behave like separate filesystems (or at least separate VFS layers with different parameters running as an overlay on top of a single filesystem). If it works accidentally some of the time, it could be made to work all of the time, given sufficient debugging.
It seems to me that the way to resolve this issue properly is to "simply" make the example with simultaneous noacl and acl mounts *work*. Add a context pointer in VFS somewhere so the filesystem can tell which mount point the request is coming from, and behave accordingly.
Posted Aug 11, 2018 6:37 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
Posted Aug 11, 2018 7:17 UTC (Sat)
by TheJH (subscriber, #101155)
[Link] (7 responses)
Posted Aug 11, 2018 16:22 UTC (Sat)
by nybble41 (subscriber, #55106)
[Link] (6 responses)
Posted Aug 11, 2018 17:55 UTC (Sat)
by josh (subscriber, #17465)
[Link]
Posted Aug 11, 2018 18:42 UTC (Sat)
by zblaxell (subscriber, #26385)
[Link] (4 responses)
This opens up philosophical questions like "is the ownership of a file exclusively a property of its inode, or a property of something else? e.g. directory entry, parent directory, root of the filesystem, mount point" and "are inodes sufficient as file identifiers" and "who caches all this anyway?" VFS isn't good at dealing with filesystems where the filesystem's answer to this kind of question is different from ext4's answer.
So maybe it's only possible for cases where the VFS layer can implement the option without asking the filesystem (e.g. 'ro' or 'nosuid') or where the result doesn't have any effect VFS cares about (e.g. 'compress' or 'nodatasum' which affect only default parameters for new objects and data but don't create multiple views of the same object from different mount points).
But then what happens when someone mounts a filesystem once with nosuid, once with suid, and once with neither option? "nosuid" and "suid" clearly conflict, but does neither option mean "keep the value from the previous mount (but _which_ previous mount)" or does it mean "suid"?
Posted Aug 12, 2018 4:44 UTC (Sun)
by viro (subscriber, #7872)
[Link] (2 responses)
All of the above could be found in a couple of minutes by reading through mount(2) or mount(8) and experimenting.
Allocating group tables: done
root@kvm1:~# losetup /dev/loop0 /tmp/foo
Two minutes, beginning to end. Sigh...
Posted Aug 12, 2018 5:01 UTC (Sun)
by viro (subscriber, #7872)
[Link] (1 responses)
Incidentally, the worst part of cross-namespace sharing of mounts is not just the fs-level mount options being ignored rather than having mount(2) fail on their mismatch. The real nastiness comes from mount -o remount done in one place and affecting every mount of that sucker. The mount-level options (nosuid among them) are not an issue - they won't do anything to other mounts. Per-fs ones bloody well will. And no, we really can't make each option per-mount - I dare you to try and handle the things like -o errors=panic or -o errors=remount-ro on per-mountpoint basis; good luck propagating the "originating" mount towards each ext4_error() in there. Especially fun when it comes to errors on e.g. writeback. Or -o data=... for the same ext4...
Posted Aug 12, 2018 5:28 UTC (Sun)
by bof (subscriber, #110741)
[Link]
Posted Aug 13, 2018 13:39 UTC (Mon)
by bandrami (guest, #94229)
[Link]
Posted Aug 11, 2018 13:45 UTC (Sat)
by smoogen (subscriber, #97)
[Link] (1 responses)
[I went with BPF because it could possibly be programmed and updated to match specific needs of a set of systems.]
Posted Aug 11, 2018 15:17 UTC (Sat)
by grawity (subscriber, #80596)
[Link]
Posted Aug 12, 2018 8:28 UTC (Sun)
by ncm (guest, #165)
[Link] (2 responses)
The correct solution will be one of: (1) by default, refuse to mount again, or (2) by default, refuse to mount again with different options. In either case, backward compatibility is achieved with a "foolish" flag, which permits the operation anyway. Nobody turns on "foolish" mode by accident, although some will certainly leave it on by accident. In that instance they are exactly where everyone is now.
Probably remounting one of the mounts with different options should fail, too, subject again to the foolish flag. Mounts that are allowed to users add an interesting wrinkle: if root mounts with one set of options, and fstab permits user mounting with different options, does it go? I expect so. That seems worth fixing too.
Posted Aug 13, 2018 23:42 UTC (Mon)
by jreiser (subscriber, #11027)
[Link] (1 responses)
Posted Aug 15, 2018 0:33 UTC (Wed)
by ncm (guest, #165)
[Link]
Posted Aug 13, 2018 4:56 UTC (Mon)
by jjm@pk28.nl (guest, #111434)
[Link] (1 responses)
Seems like the "Don't break userspace" rule is being interpreted here as "Userspace must remain broken forever".
Posted Aug 16, 2018 17:03 UTC (Thu)
by smitty_one_each (subscriber, #28989)
[Link]
As in,
root@kvm1:~# dd if=/dev/zero of=/tmp/foo bs=1M count=2
2+0 records in
2+0 records out
2097152 bytes (2.1 MB, 2.0 MiB) copied, 0.00372987 s, 562 MB/s
root@kvm1:~# mkfs /tmp/foo
mke2fs 1.44.3 (10-July-2018)
Discarding device blocks: done
Creating filesystem with 2048 1k blocks and 256 inodes
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
root@kvm1:~# losetup /dev/loop0 /tmp/foo
root@kvm1:~# mkdir /tmp/a /tmp/b
root@kvm1:~# mount /dev/loop0 /tmp/a
root@kvm1:~# mount /dev/loop0 /tmp/b -o nosuid
root@kvm1:~# cp /usr/bin/whoami /tmp/a/
root@kvm1:~# chown lp /tmp/a/whoami
root@kvm1:~# chmod +s /tmp/a/whoami
root@kvm1:~# /tmp/a/whoami
lp
root@kvm1:~# /tmp/b/whoami
root
root@kvm1:~# umount /tmp/a /tmp/b
root@kvm1:~# losetup -d /dev/loop0
root@kvm1:~# rm -rf /tmp/foo /tmp/a /tmp/b
But it's not knowledge that needs to be imported from the userspace; it's something that already directly involves kernel code (filesystems). There's no point in having an isolated engine, it could just as well be regular kernel code.
[...] developers are trying to figure out what the correct behavior [is] while avoiding causing pain to system administrators.