Unprivileged filesystem mounts, 2018 edition
Attempts to make the mount operation safe for ordinary users are nothing new; LWN covered one patch set back in 2008. That work was never merged, but the effort to allow unprivileged mounts picked up again in 2015, when Eric Biederman (along with others, Seth Forshee in particular) got serious about allowing user namespaces to perform filesystem mounts. The initial work was merged in 2016 for the 4.8 kernel, but it was known not to be a complete solution to the problem, so most filesystems can still only be mounted by users who are privileged in the initial namespace.
Biederman has recently posted a new patch set "wrapping up" support for unprivileged mounts. It takes care of a number of details, such as allowing the creation of device nodes on filesystems mounted in user namespaces — an action that is deemed to be safe because the kernel will not recognize device nodes on such filesystems. He clearly thinks that this feature is getting closer to being ready for more general use.
The plan is not to allow the unprivileged mounting of any filesystem, though. Only filesystem types that have been explicitly marked as being safe for mounting in this mode will be allowed. The intended use case is evidently to allow mounting of filesystems via the FUSE mechanism, meaning that the actual implementation will be running in user space. That should shield the kernel from vulnerabilities in the filesystem code itself, which turns out to be a good thing.
In a separate discussion, the "syzbot" fuzzing project recently reported a problem with the XFS filesystem; syzbot has been doing some fuzzing of on-disk data and a number of bugs have turned up as a result. In this case, though, XFS developer Dave Chinner explained that the problem would not be fixed. It is a known problem that only affects an older ("version 4") on-disk format and which can only be defended against at the cost of breaking an unknown (but large) number of otherwise working filesystems. Beyond that, XFS development is focused on the version 5 format, which has checksumming and other mechanisms that catch most metadata corruption problems.
There was an extensive discussion over whether the XFS developers are taking the right approach, but it took a bit of a diversion after Eric Sandeen complained about bugs that involve "merely mounting a crafted filesystem that in reality would never (until the heat death of the universe) corrupt itself into that state on its own". Ted Ts'o pointed out that such filesystems (and the associated crashes) can indeed come about in real life if an attacker creates one and somehow convinces the system to mount it. He named Fedora and Chrome OS as two systems that facilitate this kind of attack by automatically mounting filesystems found on removable media — USB devices, for example.
There is a certain class of user that enjoys the convenience of automatically mounted filesystems, of course. There is also the container use case, where there are good reasons for allowing unprivileged users to mount filesystems on their own. So, one might think, it is important to fix all of the bugs associated with on-disk format corruption to make this safe. Chinner, though, has bad news for anybody who is waiting for that to happen.
Many types of corruption can be caught with checksums and such. Other types are more subtle, though; Chinner mentioned linking important metadata blocks into an ordinary file as an example. Defending the system fully against such attacks would be difficult to do, to say the least, and would likely slow the filesystem to a crawl.
That said, Chinner doesn't expect distributors like Fedora to stop mounting filesystems automatically: "They'll do that when we provide them with a safe, easy to use solution to the problem. This is our problem to solve, not blame-shift it away." That, obviously, leaves open the question of how to solve a problem that has just been described as unsolvable.
To Chinner, the answer is clear, at least in general terms: "We've learnt this lesson the hard way over and over again: don't parse untrusted input in privileged contexts". The meaning is that, if the contents of a particular filesystem image are not trusted (they come from an unprivileged user, for example), that filesystem should not be managed in kernel space. In other words, FUSE should be the mechanism of choice for any sort of unprivileged mount operation.
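Chinner's rule can be followed even outside of FUSE by simply refusing to parse image data with privileges. A minimal sketch of the pattern (the `inspect-image` tool here is hypothetical; `setpriv` is the real util-linux utility):

```shell
# Parse the untrusted image under a throwaway, unprivileged identity so
# that a bug in the parser cannot compromise the privileged caller.
setpriv --reuid=nobody --regid=nogroup --clear-groups \
    ./inspect-image --json /var/tmp/untrusted.img > /var/tmp/metadata.json

# The privileged side then acts only on the small, structured result,
# never on the raw image bytes.
```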
Ts'o protested that FUSE is "a pretty terrible security boundary" and that it lacks support for many important filesystem types. But FUSE is what we have for now, and it does move the handling of untrusted filesystems out of the kernel. The fusefs-lkl module (which seems to lack a web site of its own, but is built using the Linux kernel library project) makes any kernel-supported filesystem accessible via FUSE.
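In practice, that lets an untrusted ext4 image be handled entirely in user space. A sketch, assuming an `lklfuse` binary built from the LKL tree (option syntax may differ between versions):

```shell
# The ext4 code runs inside the lklfuse process, not the host kernel;
# a corrupted image can, at worst, take down that process.
lklfuse -o type=ext4 untrusted.img /mnt/img

# Unmount like any other FUSE filesystem.
fusermount -u /mnt/img
```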
When asked (by Ts'o) about making unprivileged filesystem mounts safe, Biederman made it clear that he, too, doesn't expect most kernel filesystems to be safe to use in this mode anytime soon.
It would thus seem that there is a reasonably well understood path toward
finally allowing unprivileged users to mount filesystems without
threatening the integrity of the system as a whole. There is clearly some
work yet to be done to fit all of the pieces together. Once that is done,
we may finally have a solution to a problem that developers have been
working on for at least a decade.
| Index entries for this article | |
|---|---|
| Kernel | Namespaces/User namespaces |
Posted May 30, 2018 16:07 UTC (Wed)
by ms-tg (subscriber, #89231)
On Mac OS in particular, is it possible to construct a malicious .dmg file using these principles, since Mac users typically mount those disk images to install software?
Posted Jun 1, 2018 8:23 UTC (Fri)
by ehiggs (subscriber, #90713)
https://www.cvedetails.com/cve/CVE-2018-4176/
Posted May 30, 2018 16:25 UTC (Wed)
by rahvin (guest, #16953)
Given the advancement in older filesystems over the last few years, how would developers rate filesystems like JFS, XFS, ext4, and others for being the most advanced and having the most development mind-share? XFS appears to have the most mind-share and to be advancing the fastest, but this could be because of Red Hat's other efforts; I'm curious what others think.
My concern is that there is a LOT of older information out there on which filesystem is best in which circumstances, and it may no longer be relevant now that some filesystems have seen more work than others.
Posted May 30, 2018 22:30 UTC (Wed)
by Paf (subscriber, #91811)
I can think of three broad types worth addressing.
For traditional extent based file systems on Linux, EXT4 and XFS are clearly best of breed. There is an emerging consensus among enterprise distributions in favor of XFS as the default, if that helps, but neither is dramatically superior in general.
I can’t speak to log structured except to say that those are mostly built in to flash devices rather than used directly.
For copy-on-write, there are three real choices: ZFS, almost certainly best of breed but with complex legal issues; BTRFS, on whose readiness you can get various answers; and bcachefs, which is compelling but pretty clearly still too new.
Posted Jun 21, 2018 21:33 UTC (Thu)
by philipstorry (subscriber, #45926)
It's curious to hear you call JFS a "me-too", as it predates the Linux kernel by over a year. (It originated with IBM's AIX systems in 1990, was later ported to OS/2, and finally to Linux.)
It's actually quite a nice filesystem for general use. It's got metadata journalling, uses extents and allocation groups, and has a reputation for being fast even under heavy loads whilst not consuming much CPU or memory itself.
XFS is probably the filesystem it's most natural to compare JFS to, as they have similar core features and were both ported to Linux at around the same time in 2001. It also came from an old UNIX (IRIX) and is only three years younger than JFS, so understandably it has a number of similar design decisions. It seems both were pretty cutting edge for the early 1990s!
I wasn't terribly involved with Linux back in 2001 when they were both ported, but it seems that XFS rapidly won the mindshare battle - it accrued more developers around it. Perhaps that's because SGI were more open to contributions from other developers than IBM were? Or maybe it's because its 64-bit on-disk structure gave it higher headline stats in terms of maximum sizes?
Certainly one of the things I've recently admired about JFS is that it's very much in "maintenance mode" these days. That may not be exciting or sexy, but it does make it attractive if you're looking for reliability. I suspect that the unchanging nature of JFS is why it tends to get discounted - it's not adding new features, but the ones it has are well implemented and reliable. But the tech industry and community likes the shiny new things, and JFS lost its shiny new feel over a decade ago.
Now it's simply a reliable workhorse.
The main reasons to avoid it are either feature requirements (and they're more likely to be COW based) or simply the concern that at some point it may be deprecated due to its inactivity. That sort of concern is a self-reinforcing feedback loop, really, and I suspect it's started to happen already.
However, it's served three different operating systems well, and is still a viable choice for many purposes. It's a pity JFS doesn't get a little more respect...
Posted May 30, 2018 16:30 UTC (Wed)
by phh (guest, #112196)
On the list of filesystems supported by FUSE: technically there is lklfuse, which makes it possible to mount any filesystem supported by Linux.
Posted Jun 7, 2018 7:08 UTC (Thu)
by Wol (subscriber, #4433)
Pr1mos was a Multics derivative, and I've always felt that in MANY ways it was better than Unix. Unix (in the form of BSD) just happened to be free and gained traction, and, well, we all know that "the good enough is the enemy of the best".
Cheers,
Wol
Posted Aug 13, 2018 3:52 UTC (Mon)
by fest3er (guest, #60379)
Brings to mind Brian Wilson's quip: "Beware the lollipop of mediocrity; lick it once and you'll suck forever."
Posted May 31, 2018 1:50 UTC (Thu)
by ncm (guest, #165)
Sshd (with password and challenge-response authentication turned off) might be in the set. Anything not specifically designed to be in Z can safely be assumed not to be.
Posted Jun 4, 2018 14:55 UTC (Mon)
by david.a.wheeler (subscriber, #72896)
FUSE, as far as I know, doesn't support all the options and features you'd want, and it's always playing catch-up. You can't run a different kernel in a container, either.
The only "easy safe way" I see to access a disk image you don't trust is to run a VM that accesses the drive and has no other access (in particular, no external network). Then "share" it over a simulated network that only has internal access. This does trust that the VMM is adequately protected, but that has a chance of holding. Then you can run the *native* kernel code to read it. If the system gets broken into, the attacker only gets the VM and what it can see.
That's a pretty heavyweight approach. Is there a better one?
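One packaged implementation of roughly this idea is libguestfs, which boots a small appliance VM around the kernel's filesystem code and re-exports the result to the host over FUSE. A sketch, assuming the libguestfs tools are installed:

```shell
# The filesystem parsing happens inside the throwaway appliance VM;
# the host only ever sees a FUSE mount of the result.
guestmount -a untrusted.img -m /dev/sda1 --ro /mnt/img

# Detach the appliance when done.
guestunmount /mnt/img
```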
Posted Jun 5, 2018 12:49 UTC (Tue)
by robbe (guest, #16131)
As to FUSE always playing catch-up, why not flip that around for filesystems like FAT, which are mounted untrusted in the *majority* of uses (e.g. the mentioned automount-my-usb-stick)? The in-kernel FAT implementation would be relegated to legacy status, while distributors made sure that the automount would set up a userspace equivalent (FUSE, or gvfs, or whatever).
That wouldn’t work out for the container case, though.
I like the FUSE-only approach, because it makes the attack surface fairly small. Ts'o's suggestion is basically to replace FUSE with 9P. Yeah sure, whatever works, I guess.