Defending mounted filesystems from the root user

Posted Aug 22, 2023 2:04 UTC (Tue) by geofft (subscriber, #59789)
In reply to: Defending mounted filesystems from the root user by kmeyer
Parent article: Defending mounted filesystems from the root user

I think sufficiently-Byzantine filesystems just need to get deprivileged, honestly.

https://github.com/lkl/linux is a fork of the Linux kernel that turns all the interesting routines into a library, with a couple of neat tech demos of what you can do with it - including a FUSE wrapper for the filesystems in the kernel. So any filesystem that's already been implemented once, in the kernel, now has a userspace version.

There are also other ways to do it, such as UML (User-Mode Linux) or hardware-assisted virtualization.

Yes, you will lose some performance. I think the triangle of security, performance, and nicheness is a "pick two" situation - if you want both security and performance, you will need to attract enough interest and enthusiasm to pick up the work, possibly defining newer and easier-to-handle on-disk formats as the work happens. Otherwise, you can use an old implementation that made sense in the '90s at full performance with the security of the '90s, or you can use it at the performance of the '90s (which should be enough, honestly!).

(I'd also be very curious to see what the actual performance loss is even for day-to-day filesystems, and whether there are things that can be done to address performance like reviving the zero-copy FUSE patchset. I think I actually do very few things that are ridiculously sensitive to filesystem performance per se: most of the time I'm either working with large single files like giant CSVs or git pack files or game textures, for which the filesystem is essentially a constant factor and it's the raw I/O performance that matters, or reading and writing lots of small files like source code, which can mostly stay in the VFS cache, in theory. Applications that care very much about disk performance, like databases, tend to make a large contiguously-allocated single file anyway - and they subdivide it in userspace.)
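For anyone who wants to put rough numbers on the two workloads described above, here is a minimal sketch of a timing harness. It is not a rigorous benchmark: the function names and default sizes are made up for illustration, and the read-back passes will mostly be served from the page cache unless caches are dropped between runs.

```python
import os
import tempfile
import time


def time_small_files(directory: str, count: int = 1000, size: int = 4096) -> float:
    """Create, read back, and delete `count` small files; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    # The temporary directory (and everything in it) is removed on exit,
    # so deletion cost is included in the measurement.
    with tempfile.TemporaryDirectory(dir=directory) as work:
        paths = [os.path.join(work, f"f{i}") for i in range(count)]
        for path in paths:
            with open(path, "wb") as f:
                f.write(payload)
        for path in paths:
            with open(path, "rb") as f:
                f.read()
    return time.perf_counter() - start


def time_large_file(directory: str, size_mib: int = 64) -> float:
    """Write one large file, fsync it, and read it back; return elapsed seconds."""
    chunk = b"x" * (1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(size_mib):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data through the filesystem to storage
        f.seek(0)
        while f.read(len(chunk)):
            pass
    return time.perf_counter() - start
```

Pointing both functions at a directory on an in-kernel mount and then at the same filesystem type mounted through FUSE would give a first, crude comparison of the small-file and large-file cases.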

Defending mounted filesystems from the root user

Posted Aug 22, 2023 9:48 UTC (Tue) by khim (subscriber, #9252)

In a sane world we would have both: a FUSE filesystem to deal with USB or other untrusted sources, and an in-kernel implementation for the root fs.

Defending mounted filesystems from the root user

Posted Aug 24, 2023 10:29 UTC (Thu) by farnz (subscriber, #17727)

The only significant gotcha is that which implementation to use (userspace or kernelspace) is not about the filesystem in use, but rather about the degree to which the backing storage and the user are trustworthy.

In one system, I might want to use both the kernelspace implementation of xfs for my root FS, using something like fs-verity to protect against a malicious root user, and the userspace implementation for home directories. For added fun, I might want to run multiple instances of the userspace implementation, so that an exploit in one instance is less likely to affect the others (it can only do so if it can be used to write to the backing store). This comes in handy with something along the lines of Android's user-per-application model, where an exploit running as one application can't mutate in-memory state that affects another application.

Defending mounted filesystems from the root user

Posted Aug 27, 2023 23:32 UTC (Sun) by kmeyer (subscriber, #50720)

I mean, we're talking about ext4, xfs, and btrfs. Punting them to FUSE + LKL isn't really an option, IMO.

Defending mounted filesystems from the root user

Posted Aug 28, 2023 0:31 UTC (Mon) by geofft (subscriber, #59789)

I was mostly replying to the comment talking about historical filesystems - if we need to write an ext5 that makes this sort of robustness easier, that's certainly an option. We are, in any case, mostly just talking about lockdown/secure boot configs - it seems pretty reasonable to say that the upstream kernel will only let you use a very small set of filesystems when lockdown is on, and that if distros want to take the risk of filesystems that haven't been audited, they can patch this check out. A lot of people run without kernel lockdown enabled (perhaps enforcing secure boot in some other way, like verifying a signature on an entire read-only rootfs) and won't be impacted by this. The people who do run with kernel lockdown should be able to expect that the kernel isn't just writing off ring-0 escalation risks.
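To make the lockdown idea a bit more concrete, here is a rough userspace sketch of that policy. The real check would live in the kernel's mount path, which this does not attempt to show. The allowlist contents and the function names are hypothetical; the only things taken from real systems are the text formats of /sys/kernel/security/lockdown (the active mode appears in square brackets) and /proc/filesystems (pseudo-filesystems are flagged with "nodev").

```python
from typing import FrozenSet, List, Optional

# Hypothetical allowlist: which filesystems count as "audited" is an
# assumption made purely for illustration.
AUDITED_FILESYSTEMS: FrozenSet[str] = frozenset({"ext4", "vfat"})


def parse_lockdown_mode(text: str) -> Optional[str]:
    """Return the active mode from text like 'none [integrity] confidentiality'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def parse_block_filesystems(text: str) -> List[str]:
    """List filesystems from /proc/filesystems-style text that need a block device.

    Lines flagged 'nodev' (sysfs, tmpfs, ...) are skipped, since those
    filesystems do not parse an on-disk image.
    """
    names = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 1:  # no 'nodev' flag in front of the name
            names.append(fields[0])
    return names


def mount_allowed(
    fstype: str,
    lockdown_mode: Optional[str],
    allowlist: FrozenSet[str] = AUDITED_FILESYSTEMS,
) -> bool:
    """Allow anything when lockdown is off; otherwise only allowlisted types."""
    if lockdown_mode in (None, "none"):
        return True
    return fstype in allowlist
```

On a real machine, feeding the contents of those two files into the parsers would show which registered block filesystems such a policy would refuse under lockdown.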

But also I don't think punting major filesystems to FUSE is really out of the question. It was the vision of the microkernels of the '90s, which failed not because there was anything fundamentally wrong with microkernels but because overhead was high. We've learned a lot about writing efficient software that spans multiple address spaces since then (it's in many senses similar to HPC work or GPU programming), and also the physical computers are way faster. As I mentioned, without an actual benchmark, I think saying that this just has to be done in kernelspace is premature optimization.

(We also know a lot more about software fault isolation now than we did in the '90s - we could use something like eBPF or wasm or Native Client to keep these filesystems in the kernel but limit the impact of bugs.)

We Linux folks rightly make fun of Windows for having done font rendering in the kernel for so long and having had a bunch of ring-0 privilege escalation bugs as a result. It made sense in the '90s when they cared a lot about font rendering performance and basically not at all about malicious fonts; it doesn't make sense today. I don't think filesystems are a fundamentally different story.