
Mounting images inside a user namespace

By Jake Edge
June 13, 2023

LSFMM+BPF

There has long been a desire to enable users to mount filesystem images without requiring privileges, but the security implications of allowing it are seriously concerning. Few, if any, kernel filesystems are hardened against maliciously crafted images, after all. Lennart Poettering led a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit where he presented a possible path forward.

He started with an overview of the problem, noting that "everybody wants to be able to mount disk images that contain arbitrary filesystems" in user space, without needing to be root. Since malicious images could crash the kernel—or worse—the only way to do that is to establish some trust in the image before it gets mounted. He talked about some components that the systemd developers want to add that would allow container managers and other unprivileged user-space programs to accomplish this.

More specifically, code that is running in a user namespace can ask the host operating system to mount a filesystem stored in the contents of a particular file. It will require that containers have some limited access to an interprocess communication (IPC) mechanism to talk to the host OS. That is different than today's containers, which generally can only use the kernel API and, perhaps, communicate in a limited way with their container manager, he said.

There are multiple use cases for this feature, including unprivileged container managers that want to run containers from disk images, but also for tools that build container images. There are desktop application runtimes that want to be able to run apps from images as well. Essentially, any tool that wants to be able to work with disk images, but not have special privileges, could benefit.

There are a number of complexities for any solution. Some kind of trust needs to be established in the images before they get mounted; immutable images using dm-verity are easier in that regard, but there is a desire to also have limited support for writable images. Minimizing or eliminating the need for the host to enter the caller's namespace in order to perform the mount is also desirable. Recursion in the form of nested containers should be supported without needing to resort to special cases, as well, he said.

Poettering described how this all might work. An unprivileged process P, which might be a container manager, creates a user namespace U, but does not give U any user/group mappings. It then passes a file descriptor for U through an IPC mechanism to a service on the host, X, which could be a privileged process provided by systemd; X assigns a transient UID/GID range (64K of each, for example) to U. These transient ranges are a "key idea" of the feature; the transient ranges only last as long as the user namespace does and they are recycled when it goes away, unlike persistent UID/GID ranges. It is "dramatically different" to the way these ranges are handled today, he said.
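The lifecycle of those transient ranges can be modeled in a few lines of Python. This is only a toy sketch of the idea, not how systemd implements it: the class name, its methods, and the pool's base ID are assumptions made up for illustration; the only figure taken from the talk is the 64K range size.

```python
from dataclasses import dataclass, field

RANGE_SIZE = 65536  # "64K of each, for example", as mentioned in the talk


@dataclass
class TransientRangePool:
    """Toy model of a pool of transient UID/GID ranges (hypothetical names)."""
    base: int   # first ID the pool may hand out (assumed value)
    count: int  # number of 64K ranges in the pool (assumed value)
    free: list = field(init=False, default_factory=list)
    owners: dict = field(init=False, default_factory=dict)

    def __post_init__(self):
        # Every range starts out free; the lowest one is handed out first.
        self.free = [self.base + i * RANGE_SIZE for i in range(self.count)]

    def assign(self, namespace_id):
        """Give a user namespace exactly one range for its lifetime."""
        if namespace_id in self.owners:
            raise ValueError("namespace already has a range")
        if not self.free:
            raise RuntimeError("no transient ranges left")
        start = self.free.pop(0)
        self.owners[namespace_id] = start
        return start, RANGE_SIZE

    def release(self, namespace_id):
        """Called when the namespace goes away; its range becomes reusable."""
        start = self.owners.pop(namespace_id)
        self.free.insert(0, start)
```

In this model, a range released by one namespace is immediately available to the next caller, which is the property that distinguishes transient ranges from persistent ones.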

X enforces a security policy on U that only allows a small subset of filesystem operations (open() for create, chmod(), and "a couple of other things") and only on mounts that appear in an allowlist, which is initially empty. So, initially, P cannot create any files. P can talk to Y, which is a different service, via IPC, passing it a file descriptor to U and another descriptor of an image file it would like to mount. It gets back a file descriptor, like one returned from fsmount() (in the new mount API), that corresponds to the mounted image with the ID-mapping from U already applied (using ID-mapped mounts). Y talks to X to get this new mount added to the allowlist and P can attach the mount file descriptor wherever it wants and join U if it has not already done so.
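The policy described above amounts to a two-part check: the operation must be in a small permitted set, and the target mount must be on the namespace's allowlist. A minimal Python model of that check follows; the class, the operation names, and the integer mount IDs are hypothetical, and the real enforcement would happen in the kernel rather than in user-space code like this.

```python
# Operations the talk named; "a couple of other things" are left out here.
ALLOWED_OPS = {"create", "chmod"}


class NamespacePolicy:
    """Toy model of the per-namespace policy enforced by X (hypothetical)."""

    def __init__(self):
        self.allowlist = set()  # mount IDs; initially empty

    def add_mount(self, mount_id):
        # In the described flow, Y asks X to do this after mounting an image.
        self.allowlist.add(mount_id)

    def permits(self, op, mount_id):
        # Both conditions must hold: a permitted operation on a listed mount.
        return op in ALLOWED_OPS and mount_id in self.allowlist
```

With an empty allowlist, `permits()` returns False for every request, which matches the observation that P initially cannot create any files.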

It looks like a lot of steps, he said, but for a client application it is fairly easy. The client simply makes an IPC call to get the user namespace set up and then a second one to get the mount. It can pass multiple images to Y to get multiple mounted filesystems and then it can attach them wherever makes sense in its directory hierarchy.

He then got more specific about X and Y; he had used the placeholders because the concept is entirely generic and could be implemented in other ways. For systemd, X would be systemd-userdbd and Y would be a new systemd-mntfsd service. The security policy he described for systemd-userdbd would be implemented using the BPF Linux security module (BPF-LSM). The images to be mounted by systemd-mntfsd would be in the discoverable disk image (DDI) format. More information about DDI (and other surrounding efforts) can be found in the report from last year's Image-Based Linux Summit.

These images have a GPT partition table and are separated into several partitions. One partition is for the filesystem, while another has the dm-verity information. There is a third partition with a signature for the root-level hash of the filesystem, which gets verified by the kernel using its internal keyring. If it passes, systemd-mntfsd will set up the filesystem and dm-verity, apply the user mapping, and return it to the requesting process. DDI makes it convenient to wrap each of those three parts together into a single image.
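The reason a single signed root hash can vouch for a whole filesystem image is that it sits at the top of a hash tree over the image's blocks. The Python sketch below shows that principle with a simple binary tree of SHA-256 hashes. It is a simplification under stated assumptions: real dm-verity uses salted hashes, a much wider tree, and its own on-disk layout, and the function names and block size here are chosen for illustration only.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def leaf_hashes(data: bytes) -> list:
    """Hash each fixed-size block of the image separately."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    if not blocks:
        blocks = [b""]  # an empty image still gets one leaf
    return [hashlib.sha256(block).digest() for block in blocks]


def root_hash(data: bytes) -> bytes:
    """Fold the leaf hashes pairwise until a single root hash remains."""
    level = leaf_hashes(data)
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [hashlib.sha256(b"".join(pair)).digest() for pair in pairs]
    return level[0]
```

Changing any byte of the image changes the leaf hash for its block and therefore the root, so a signature over the root hash covers every block.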

Another mechanism for trusting images would be to have a trusted directory on the host. Since only privileged processes should be able to write into that directory, systemd-mntfsd could be configured to allow requests to mount images from there. That provides a weaker level of trust but may be fine for some systems, he said.

Those two options (signed DDI and trusted directories) are already implemented and should appear in the next release of systemd. Another mechanism, which would allow mounting writable filesystems, is still being worked on. The idea would be that the requester (perhaps a tool building images) asks for a filesystem of a certain type and size that would be stored in a provided image file, which systemd-mntfsd would create (using mkfs) in the file; it would then add a dm-integrity sidecar file that tracks the changes to the filesystem image. Dm-integrity would use a hash with a key that is not accessible to the caller, so the sidecar file can only be (correctly) updated by the kernel. The caller can provide the image and the sidecar file at a later point and the mount service will be willing to mount it again. If the sidecar file is not present (or is corrupted), the image will not be mounted.
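The keyed-hash idea behind the sidecar file can be illustrated with HMAC: without the key, a caller can modify the image but cannot produce matching tags. The Python sketch below is a model of that property only. The function names, block size, and the list-of-tags representation are assumptions for illustration; the actual dm-integrity metadata format is different.

```python
import hashlib
import hmac

BLOCK_SIZE = 4096  # assumed block size for this sketch


def _tag(key: bytes, index: int, block: bytes) -> bytes:
    # The block index is included so that blocks cannot be silently reordered.
    message = index.to_bytes(8, "big") + block
    return hmac.new(key, message, hashlib.sha256).digest()


def make_sidecar(key: bytes, image: bytes) -> list:
    """Compute one keyed tag per block; only the key holder can do this."""
    return [
        _tag(key, i // BLOCK_SIZE, image[i:i + BLOCK_SIZE])
        for i in range(0, len(image), BLOCK_SIZE)
    ]


def verify(key: bytes, image: bytes, sidecar: list) -> bool:
    """Refuse the image if the sidecar is missing or any tag mismatches."""
    if not sidecar:
        return False
    expected = make_sidecar(key, image)
    if len(expected) != len(sidecar):
        return False
    return all(hmac.compare_digest(a, b) for a, b in zip(expected, sidecar))
```

In this model, a sidecar forged with any other key fails verification, as does an image changed after its tags were written, which mirrors the rule that a missing or corrupted sidecar means the image will not be mounted.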

He was asked about using signed fs-verity files as well. He said that it is all being done in user space, so other mechanisms could be added if they make sense. His goal is generally to let the kernel make these trust decisions based on keys on its keyring, rather than "doing trust enforcement in user space", but others may want to do things differently.

Ted Ts'o suggested that systemd-mntfsd could copy an image file to a block device that is inaccessible to the requester, then run fsck on the filesystem image. If it passes that check, it could be mounted in a suitable fashion (e.g. nosuid, nodev) and handed off to the container without needing to use dm-verity. Poettering said that fsck is already being used in the writable case, "but it was news to me that this is the philosophy that filesystem engineers subscribe to". He noted that other filesystem developers were "shaking their heads", so he did not think that there was universal agreement that fsck was sufficient to detect malicious images.

Ts'o said that it would depend on the filesystem, so Poettering tried to get a commitment about ext4, but Ts'o hedged things a bit. He is "reasonably confident" that it is not possible to cause a "buffer overrun or privilege-escalation attack" with an ext4 filesystem that passes fsck. Denial-of-service due to an overly fragmented filesystem would be a possibility, though, so it "depends on what your threat model is", he said. Josef Bacik said that he just comes from a standpoint of being paranoid. He trusts that the Btrfs fsck does a good job to ensure that there is a valid filesystem, but it, like him, is imperfect. It sounds like a good solution, but he would be leery of trusting it in a high-security situation.

Jeff Layton asked about network filesystems. Poettering thought that might be less worrisome, but Layton assured him that it would not be. There is interest in being able to pass a directory file descriptor to systemd-mntfsd, which will bind-mount to that directory, apply the UID mapping, and return that to the requester, Poettering said. That is not particularly risky because the filesystem is already mounted in the system, which is perhaps analogous to the network-filesystem case. But it turns out that none of the network filesystems implement ID mapping, though Christian Brauner said that he had gotten it working for CephFS (with some caveats).

Layton said that a malicious server was just as bad as, or worse than, a malicious image, but that NFS had recently added TLS support. One way to establish trust in that environment would be to only allow servers that can present a properly signed TLS certificate. David Howells raised the automounter as another thing to consider, while Steve French mentioned SMB. Poettering said that if there is a need to mount these kinds of things in containers, they can be added, "as long as there's some kind of sensible security story in place".

There is an unresolved problem that has cropped up, he said. LSMs cannot restrict manipulations of access-control lists (ACLs), so ACLs are a way that the transient IDs in the user namespace (U above) could leak out into the rest of the system in a persistent fashion. Perhaps it is not a big problem, he said, but all of the other ways that these IDs can be persistently associated with filesystem objects (e.g. chown()) are being blocked. He is not too concerned, though it is still a low-severity vulnerability.

He gave a demo at around 19:10 in the video of the session. He started systemd-userdbd in one window, systemd-mntfsd in another, and then handed a disk image to systemd-dissect, which mounted it using the new mechanism and then pulled it all apart. He ran it as an unprivileged user "and it just works". The user IDs are handled correctly and it is all "extremely simple". Furthermore, it is something of a showcase of recent kernel features, such as the new mount API (across namespaces) and BPF-LSM; they and a few others can be combined to provide this long-sought feature.

He is pleased with the result, because "it is tiny", is socket-activated so it is not running all of the time, and there is just a single socket for IPC that needs to be bind-mounted into the container to make it all work. Brauner pointed out that the superblock is not owned by the user namespace where the mount is being done, "which means that all of the destructive ioctl()s" that exist for Btrfs or XFS are not available to the container. But the container does own the mount, which means it can unmount it. The ownership of the mount is separate from the ownership of the superblock, he said, which is a nice side effect.

An attendee asked whether the containers would have access to the image files after the mount had been done. If so, a container could modify the image and thus potentially crash or compromise the kernel that way. Poettering said that the containers may have access to those files, since they might own them, but that dm-verity is meant to prevent any changes; if the image file is changed, any read of that region will return an error. Other mechanisms, such as fs-verity and dm-integrity, would also provide that kind of protection. He noted that in the fsck scenario, Ts'o had said that the image would need to be copied to a location inaccessible to the container.

The session ended with a quick discussion of how a network filesystem might be mounted in a separate network namespace for the container. Poettering said that it was something to work out with the network-filesystem developers, since it would need to be a mount option of some sort. Howells said that it would be straightforward to do that using the new mount API if it were deemed desirable.


Index entries for this article
Kernel: Containers
Kernel: Filesystems/Mounting
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2023



Mounting images inside a user namespace

Posted Jun 13, 2023 14:46 UTC (Tue) by bluca (subscriber, #118303) [Link] (11 responses)

systemd-backdoord go brrrr!

Mounting images inside a user namespace

Posted Jun 13, 2023 17:47 UTC (Tue) by vadim (subscriber, #35271) [Link] (7 responses)

What's that supposed to mean?

It's good, welcome functionality. Think for instance of things like AppImage and Flatpak -- applications in a container that an unprivileged user would want to be able to mount easily and safely.

I think ideally we'd have a secure, privilege-separated system for doing such things. A very simple, read-only filesystem that is easy to validate for correctness. A sandboxed userspace driver to interpret it for good measure. And a good sandbox for the actual application running within.

Mounting images inside a user namespace

Posted Jun 13, 2023 22:28 UTC (Tue) by bluca (subscriber, #118303) [Link]

Mounting images inside a user namespace

Posted Jun 14, 2023 10:58 UTC (Wed) by tao (subscriber, #17563) [Link] (4 responses)

It's probably just yet another immature person who refuses to see any value in anything systemd-related.

Mounting images inside a user namespace

Posted Jun 14, 2023 11:40 UTC (Wed) by mezcalero (subscriber, #45103) [Link] (3 responses)

Luca is a systemd maintainer. A very funny one, apparently. ;-)

Lennart

Mounting images inside a user namespace

Posted Jun 14, 2023 12:09 UTC (Wed) by bluca (subscriber, #118303) [Link] (2 responses)

My top-notch humor is wasted it seems.

Mounting images inside a user namespace

Posted Jun 14, 2023 16:08 UTC (Wed) by Karellen (subscriber, #67644) [Link]

Without a winking smiley or other blatant display of humor, it is utterly impossible to parody a [Creationist] anti-systemd troll in such a way that someone won't mistake for the genuine article.

-- Poe's Law

It's not that your humour is bad, it's that it's so on point that it's hard to distinguish it from the thing it's making fun of.

Mounting images inside a user namespace

Posted Aug 1, 2023 11:24 UTC (Tue) by tao (subscriber, #17563) [Link]

Ah, sorry, my apologies for not being able to recognise that you were joking. Keep up the good work on systemd!

Mounting images inside a user namespace

Posted Jun 15, 2023 9:14 UTC (Thu) by gray_-_wolf (subscriber, #131074) [Link]

Ignoring the previous comment (I think it was somewhat funny, but that is a matter of taste, and I have a poor one), I have to admit I am not sure why this is required. You mention AppImage and Flatpak, but why do they need disk images? Rootless podman works on my machine today, so I am not sure why they cannot use the same mechanism for distributing the files they need. Why does it need to be a disk image?

Mounting images inside a user namespace

Posted Jun 14, 2023 23:30 UTC (Wed) by motk (guest, #51120) [Link] (2 responses)

Yeah, this isn't helpful

Mounting images inside a user namespace

Posted Jun 17, 2023 22:00 UTC (Sat) by jschrod (subscriber, #1646) [Link] (1 responses)

Are you aware that "bluca" is an important systemd developer?

The beauty of LWN.net is that the developers are directly involved in the discussions about their work. For that, we bystanders have to do our own work to decipher their in-jokes. The former is more important than the latter.

Please deal with it.

Mounting images inside a user namespace

Posted Jun 18, 2023 6:58 UTC (Sun) by smurf (subscriber, #17840) [Link]

> Are you aware that "bluca" is an important systemd developer?

Apparently not.

I wasn't either until Lennart told us, but so what, snarky comments are par for the course around here; we've heard far worse during the systemd wars of yesteryear.

Mounting images inside a user namespace

Posted Jun 13, 2023 15:33 UTC (Tue) by elaforma (subscriber, #165356) [Link] (7 responses)

> Few, if any, kernel filesystems are hardened against maliciously crafted images, after all.

As someone who doesn't know much about file systems, that sounds very concerning. How does Linux handle removable drives?

Mounting images inside a user namespace

Posted Jun 13, 2023 15:48 UTC (Tue) by pj (subscriber, #4506) [Link]

Linux handles removable drives by putting the onus on root (aka whoever has physical access) to make sure the removable drives are trustable.

Mounting images inside a user namespace

Posted Jun 13, 2023 15:49 UTC (Tue) by farnz (subscriber, #17727) [Link] (1 responses)

Same way as Windows and macOS - hope that the file system on the drive is not maliciously crafted, and use the in-kernel driver to mount it.

There's a special case here, where the in-kernel driver is FUSE, which improves safety considerably (since the filesystem image is interpreted by userspace code, not kernel code), but otherwise, yeah, it's all about hope.

Mounting images inside a user namespace

Posted Jun 14, 2023 16:37 UTC (Wed) by zeha (subscriber, #61580) [Link]

macOS recently started to use userspace filesystem drivers for many formats; maybe not for apfs though.

Mounting images inside a user namespace

Posted Jun 13, 2023 15:55 UTC (Tue) by artizirk (subscriber, #113443) [Link]

> How does Linux handle removable drives?
As well as you might think (aka badly). Tho targeting the USB stack itself or some other less used USB device driver with USB Rubber Ducky type tool will probably be easier.

Anything that connects to your USB data lines can potentially take over your system. (BadUSB, Juice jacking)

Anything with a Radio is also fun.
* https://8051enthusiast.github.io/2021/07/05/002-wifi_fun....
* https://thehackernews.com/2018/11/bluetooth-chip-hacking....

Mounting images inside a user namespace

Posted Jun 13, 2023 19:24 UTC (Tue) by MattBBaker (guest, #28651) [Link] (2 responses)

Same way that all other software stacks deal with the problem. Assume that physical access == admin access and ignore it.
There is a reason why secure environments will physically remove USB chips where possible, rewrite the BIOS/EFI/firmware to not even recognize USB and other external ports in general, and destroy and epoxy over the physical connector.

Mounting images inside a user namespace

Posted Jun 14, 2023 11:38 UTC (Wed) by mezcalero (subscriber, #45103) [Link]

That's not true btw. AFAIK at least ChromeOS mounts removable USB media via an unprivileged/sandboxed FUSE VFAT driver, not via a kernel driver.

And frankly, we should do the same on regular Linux desktops too.

Lennart

Mounting images inside a user namespace

Posted Jun 14, 2023 12:09 UTC (Wed) by Wol (subscriber, #4433) [Link]

> rewrite the BIOS/EFI/firmware to not even recognize USB and other external ports in general, and destroy and epoxy over the physical connector.

And then you get laptops where the ONLY port is USB-(C), and on my work laptop in particular usually has power, 2nd screen, external mouse and keyboard connected (plus cat-5 at home). Plugging in cards was admin-only, but I think that got lifted because I was having loads of grief for some reason ...

Cheers,
Wol

Surely you need to handle corrupt data anyway

Posted Jun 13, 2023 16:30 UTC (Tue) by epa (subscriber, #39769) [Link] (36 responses)

What the article calls a “malicious image” I call one that has been corrupted in some way, or just exercises a case the filesystem programmer hadn’t thought of. And instead of “attack” I would call it something that will go wrong sooner or later. Surely with all the testing and fuzzing that is done against Linux kernels, somebody is trying to provoke bugs by mutating filesystem images?

Of course a lot of the time there is no right answer. If the file’s length is recorded as one value in one field but another field implies a different length, there is no way to decide what the file “should” be. The best you can do will often be to detect the error and remount the filesystem read-only. But certainly I would not expect the kernel itself to be exploitable and start trampling memory or panicking the system.

If I were to play devil’s advocate I would say that providing restricted access to filesystem mounting is the wrong direction to go in. As long as suid binaries and device files are disabled, any user ought to be able to mount any filesystem (at least by loopback) and this should be enabled by default. Otherwise there isn’t much hope the bugs in filesystem code will get fixed. They are liable to be met with the classic useless response: don’t do that.

Surely you need to handle corrupt data anyway

Posted Jun 13, 2023 17:34 UTC (Tue) by smurf (subscriber, #17840) [Link] (32 responses)

Unfortunately there's no way to ensure, in the general case, that a file system doesn't contain malicious data. File systems are just too big and complicated to allow fuzzers to discover their holes.

Worse, there's no clear separation of metadata and files' content because on a corrupt file system any data block can be anywhere. This means that you can't even rely on a check you did a millisecond ago. Protecting against such an attack is somewhat unlikely to speed up a file system, so you can assume that the kernel won't protect against it any time soon.

If you need that kind of guarantee, use a userspace file system driver and FUSE. If I remember correctly, there's already code out there to do this with a user-mode linux kernel, so you don't even have to write new drivers.

Surely you need to handle corrupt data anyway

Posted Jun 13, 2023 18:36 UTC (Tue) by mb (subscriber, #50428) [Link] (30 responses)

> This means that you can't even rely on a check you did a millisecond ago.

Yes, you can. Unless you read the same block(s) of data that you based your checks on from the disk again, of course.

I don't see why it should not be possible to write a completely memory-safe fs in the kernel.

Memory safety is about not overrunning your own allocated buffers. Therefore the *kernel* has to track its allocation sizes and check them every time an index-like element is read from the disk.

Memory safety is about not dereferencing invalid and NULL pointers. That one is trivial. Just check em all.

Yes, this is very complex, because filesystems are complex data structures and the filesystem drivers are complex parsers for these structures. But there is no excuse for using data from the disk as a pointer(-offset) before validating that the pointer points into a valid range.

But remember that this is not about trying to interpret or even fix corrupt filesystems. It's only about avoiding memory safety issues in case of corrupt filesystems.

And I don't buy the performance argument either. That would mean that memory-safe by-design languages like Rust would be slow. They are not.

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 10:06 UTC (Wed) by SLi (subscriber, #53131) [Link] (29 responses)

Or you could run the sensitive code with the least set of privileges it needs. That's what FUSE does. Then you just don't need to care about buffer overflows.

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 11:17 UTC (Wed) by bluca (subscriber, #118303) [Link] (28 responses)

That's way too simplistic. Just because it runs in a user process doesn't magically make it ok to be subject to malicious attacks. The number one attack vector on any modern computer is the web browser, and that's an unprivileged userspace process. That's why establishing trust _before_ use is fundamental.

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 19:50 UTC (Wed) by SLi (subscriber, #53131) [Link] (1 responses)

I believe the threat model here was that the unprivileged user attacker controls the disk image and wants to compromise the kernel to elevate privileges. I have slightly hard time imagining a non-contrived scenario where an attacker would control a disk image yet need to trigger a vulnerability in a process that could run with effectively the same or fewer privileges as the container.

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 21:05 UTC (Wed) by mb (subscriber, #50428) [Link]

>I have slightly hard time imagining a non-contrived scenario where an attacker would control a disk image yet need to trigger a vulnerability in a process that could run with effectively the same or fewer privileges as the container.

Automount of a USB stick on a screen-locked machine or similar.

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 22:51 UTC (Wed) by p2mate (guest, #51563) [Link] (25 responses)

That's also way too simplistic. Private keys have leaked in the past and will in the future. Key management isn't perfect either. Hence you need *and* user space filesystems *and* a filesystem implementation in a memory safe language *and* trust.

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 5:59 UTC (Thu) by mb (subscriber, #50428) [Link] (23 responses)

In the first step we would only need a simple commitment that we want to have some memory safe kernel fs drivers in the future.
That we don't have, but it is the base for so many things.

Then we can go from there and think about the steps to actually get there.
There certainly is no one-fits-all solution and we will probably never fix all fs implementations.
But what matters is the commitment to start the process.

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 11:43 UTC (Thu) by bluca (subscriber, #118303) [Link] (22 responses)

Well, what pretty much all the actual filesystem developers seem to be saying is that there is no such "simple commitment", so...

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 17:36 UTC (Thu) by mb (subscriber, #50428) [Link] (21 responses)

We can't have a commitment, because there is no commitment?

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 18:32 UTC (Thu) by bluca (subscriber, #118303) [Link] (20 responses)

Well, no, it's because those that would have to make such commitment don't want to... if you watch the recording of the session it is quite clear where the wind blows

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 19:38 UTC (Thu) by mb (subscriber, #50428) [Link] (19 responses)

So we agree, that this is not a technical problem. That was my whole point.

We should refrain from reproducing the same "we can't have safe fs in the kernel" and instead say "we don't want to have safe fs in the kernel, because it's too much work, etc...".
Because that's actually what currently happens.

Developing a technical workaround for that is the *wrong* decision.

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 20:20 UTC (Thu) by bluca (subscriber, #118303) [Link] (18 responses)

We don't agree at all - it is a technical problem. The technical problem is that, according to the experts, it's not safe to do so. I mean it's open source, so if you believe you can build a perfectly safe filesystem that can resist to attacks via malicious images, by all means do so and publish the patches.

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 21:19 UTC (Thu) by mb (subscriber, #50428) [Link] (17 responses)

>according to the experts, it's not safe to do so.

Sure. And I doubt it. What's your point?

>if you believe you can build a perfectly safe filesystem

Where did I say that?

Surely you need to handle corrupt data anyway

Posted Jun 17, 2023 22:11 UTC (Sat) by jschrod (subscriber, #1646) [Link] (16 responses)

> >according to the experts, it's not safe to do so.

> Sure. And I doubt it. What's your point?

I think his point is clear:
If the subject matter experts tell you that it's not safe and you disagree - then *you* have to prove your point.

Because the rest of the Linux developers will trust their file system developers more than you, without decades of track record in that area.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 10:34 UTC (Sun) by mb (subscriber, #50428) [Link] (15 responses)

Ok, so I prove my point: Rewrite the fs in Rust. Then it immediately becomes memory safe. Therefore, the claim that an fs can't be safe in the kernel is wrong.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 11:13 UTC (Sun) by smurf (subscriber, #17840) [Link] (6 responses)

So which FS are you going to rewrite in Rust?

And how do you propose we should handle all the others?

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 12:33 UTC (Sun) by mb (subscriber, #50428) [Link] (5 responses)

>So which FS are you going to rewrite in Rust?

I did never say that I was rewriting any fs in Rust.

I just said that if one would rewrite an fs in Rust, it would automatically be memory safe. Therefore, the claim that it was not possible to have safe in-kernel fs implementations must be wrong.
Of course, you can also have a reasonably safe implementation in C. The developer just has to commit to that goal. But for whatever reason "the experts" don't want to.

Saying that we can't have safe fs in the kernel "because the expert say so" has just been proven wrong.
We should get over that false claim.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 12:40 UTC (Sun) by SLi (subscriber, #53131) [Link] (3 responses)

You do realize that it's actually almost impossible (or at least used to be) to write even a linked list, let alone some more complicated structure, in safe rust? Some things may just not be possible in it.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 13:51 UTC (Sun) by mb (subscriber, #50428) [Link] (2 responses)

>You do realize that it's actually almost impossible (or at least used to be)
>to write even a linked list, [...] in safe rust?

You do realize that saying it's *almost* impossible means that it actually is possible?
But besides that: It's not true. A linked list can easily be implemented in safe Rust. It just has a small runtime cost. If you don't want to pay that cost, you can use unsafe Rust and audit it for memory safety problems.
Both versions would be fine.
Remember what I said: You can also use C to implement safe fs drivers in the kernel. It's just much harder.

>Some things may just not be possible in it.

So if it might be hard, then better give up?
Not my way of thinking.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 14:56 UTC (Sun) by pizza (subscriber, #46) [Link] (1 responses)

> So if it might be hard, then better give up?
> Not my way of thinking.

I suggest you look up the definition of "cost-benefit analysis".

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 17:06 UTC (Sun) by mb (subscriber, #50428) [Link]

> I suggest you look up the definition of "cost-benefit analysis".

Ah. So now we get to a common understanding of the situation.
So it is indeed possible to have safe fs code in the kernel.

It's just that the "experts" say that a kernel memory error induced by plugging a disk is not worth fixing.

As I said, I beg to differ. But I'm fine, if you have a different opinion.
Just please be honest and say "it costs too much", or "we don't care about users" or anything but "it's not possible". Because that's false.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 19:28 UTC (Sun) by smurf (subscriber, #17840) [Link]

> Saying that we can't have safe fs in the kernel "because the expert say so" has just been proven wrong.

The experts said no such thing. They said that the *current* crop of FSes are inherently unsafe and unlikely to be "fixed".

Thus if somebody wants one they get to write one themselves.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 17:35 UTC (Sun) by jschrod (subscriber, #1646) [Link] (7 responses)

So you think that writing a file system in Rust would result in a safe filesystem. That is so absurd, it's at the level of being comedy. Rust doesn't have any magic pixie dust that is automatically sprinkled over a program written in it and makes it safe at a level that is the point of discussion in this article. Everybody who thinks so, is, IMNSHO, not a professional programmer.

Providing memory safety doesn't result in an implementation that can be used to "mount an arbitrary image inside a user namespace" (what this article is about). It just makes it harder to get *one* sub-class of errors, caused by memory allocation and access. All other security-relevant error causes are still there. It doesn't even make memory errors impossible -- since a filesystem has to handle arbitrary binary data on devices, some parts of that fs implementation would be in unsafe Rust.

> Ok, so I prove my point

Well, actually you prove my point: You seem to have no clue at all what it means to write any safe program, even less so a program that realizes a file system. And, as I wrote, that's why your opinion on that topic is not trusted by the participants in that thread.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 18:00 UTC (Sun) by mgb (guest, #3226) [Link]

There are shades of gray between black and white.

I haven't used Rust yet - today I use mostly C++ and Perl - but I am trying to get an idea as to the tradeoffs Rust offers between programmer effort, runtime performance, and bug prevention.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 18:38 UTC (Sun) by mb (subscriber, #50428) [Link]

>Providing memory safety doesn't result in an implementation that can be used to
>"mount an arbitrary image inside a user namespace"

If you would have read what I had written before, then you would have noticed that I never ever made that claim.

The rest of what you wrote can be summarized in your own words:

>absurd

There is nothing but personal attacks.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 19:45 UTC (Sun) by smurf (subscriber, #17840) [Link] (4 responses)

> Rust doesn't have any magic pixie dust

Well … if you read Asahi Lina's (and her friends') accounts of how absurdly easy(*) it is for them to write an entire class of graphics drivers in Rust, you might change your opinion.

(*) compared to doing the same thing in C and then spending months searching for the inevitable memory reference count errors, fencepost bugs, and related issues

Rust has plenty of magic pixie dust. The point is, to apply it you need a competent magician (or three). And somebody needs to pre-mix the glue that holds the magic without disrupting it. And so on.

> All other security-relevant error causes are still there

Possibly true, but at least they don't crash the kernel any more; also, when you have code whose data structures can no longer get corrupted you can spend more time on higher-level issues. Like, ensuring that no data block points to metadata.

> It doesn't even make memory errors impossible -- since a filesystem has to handle arbitrary binary data on devices, some parts of that fs implementation would be in unsafe Rust.

Part of the point of Rust is that the unsafe part can be contained and scrutinized for that kind of errors, while the compiler handles safeguarding the rest of the code. In C there is no "rest of the code", it's *all* unsafe.

Surely you need to handle corrupt data anyway

Posted Jun 18, 2023 20:08 UTC (Sun) by mb (subscriber, #50428) [Link]

>Part of the point of Rust is that the unsafe part can be contained and scrutinized for that kind of errors

Exactly right.

And - what lots of people get wrong all the time - using the unsafe keyword in Rust does not mean that the program loses memory safety.
Quite the contrary. Unsafe blocks have to commit to memory safety. In unsafe blocks it's also not allowed to write memory-unsafe code.

unsafe blocks are all around in Rust. All low level primitives are unsafe, basically.
But safe wrappers are built around them.

Therefore, of course a file system can be implemented in safe rust. Of course there is a low level data access part where raw bits have to be transmuted to safe structures. But that's a small part that can be audited for memory safety much more easily than a C-only program.
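
A minimal sketch of that pattern, assuming a made-up on-disk header of two little-endian `u32` fields: the single `unsafe` block sits behind a bounds check, and everything that calls the wrapper is safe code. The layout, the `MAGIC` value, and the function names are hypothetical and only serve the illustration.

```rust
// Hypothetical header layout: magic (u32 LE) followed by block_count (u32 LE).
const MAGIC: u32 = 0x4653_5F31;

#[derive(Debug, PartialEq)]
pub struct Header {
    pub magic: u32,
    pub block_count: u32,
}

/// Safe wrapper around one raw read. Callers cannot cause an
/// out-of-bounds access, whatever `offset` they pass in.
pub fn read_u32_le(buf: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    if end > buf.len() {
        return None;
    }
    // SAFETY: `offset..offset + 4` was just checked to lie inside `buf`,
    // and `read_unaligned` has no alignment requirement.
    let raw = unsafe { std::ptr::read_unaligned(buf.as_ptr().add(offset) as *const u32) };
    Some(u32::from_le(raw))
}

/// Entirely safe code: turns raw bytes into a typed structure or fails.
pub fn parse_header(buf: &[u8]) -> Option<Header> {
    let magic = read_u32_le(buf, 0)?;
    let block_count = read_u32_le(buf, 4)?;
    if magic != MAGIC {
        return None;
    }
    Some(Header { magic, block_count })
}

fn main() {
    let mut image = Vec::new();
    image.extend_from_slice(&MAGIC.to_le_bytes());
    image.extend_from_slice(&7u32.to_le_bytes());
    println!("{:?}", parse_header(&image));
    println!("{:?}", parse_header(&image[..7])); // truncated image
}
```

An audit for memory safety only has to look at the few lines around the `unsafe` block. In this particular case `u32::from_le_bytes` would avoid `unsafe` altogether; the raw read is kept only to show the containment.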

That said, it's also possible to write a safe C program, of course. It's just very much harder.
And that is the real reason why we have unsafe fs implementations. Saying that some "experts" evaluate it as being impossible is false. No real expert says that.

Surely you need to handle corrupt data anyway

Posted Jun 19, 2023 1:27 UTC (Mon) by jschrod (subscriber, #1646) [Link] (2 responses)

I'm aware of Asahi Lina's work and think it's a great showpiece that writing kernel code in Rust is a big step forward to more secure code.

OK, Rust gives you memory-safe code. Well, for decades most of my programming has been done in programming languages with managed memory. There the kind of problems that you cite "compared to doing the same thing in C" don't happen.

But: Do you really think that these systems are free of security bugs? I have CVEs for code, written by me in Lisp, that prove this wrong. In this thread the focus is on the obvious problems with memory management and access. But there is a huge domain of bugs *in the logic of our programs*. These logic errors are all memory safe and won't be prevented by Rust, but they bite you dearly. If the logic of an interpretation of a binary structure on disk is wrong, you're out. No safe filesystem any more.

And that is what this article and the discussion are about: are we willing to allow random people to mount filesystems with kernel code in user space? The proposal of "mb" about a `safe file-system' that someone else shall write is counterproductive, IMNSHO. "mb" argues that it's possible in principle, but no one wants to do that. I have raised my voice in an opposing argument: it's not even possible in principle. Murphy bites, and he does it all the time.

Lennart and others made, in this article discussion, a compelling argument for using FUSE. I don't even understand why an automount of a USB stick happens with udisk if it's not in VFAT format. (Those eccentric people who want to mount ext4 USB sticks -- and I belong to them! -- are not worth to make that the default behaviour.) IMHO, these are the directions one has to follow to constrain that attack vector. Gambling on a "safe filesystem" is not an option, neither technically nor socially.

FWIW, my background: I have never been involved in filesystem development. But for the last 40(!) years, I wrote printer and graphics drivers. From that background of hardware-near software development, I know that Rust will be a *VERY BIG* help in better security, but it will not be a panacea that guarantees we are safe from any security problem.

Surely you need to handle corrupt data anyway

Posted Jun 19, 2023 11:32 UTC (Mon) by farnz (subscriber, #17727) [Link]

One thing that's not inherent to C versus Rust, but is a cultural difference that tends to go in Rust's favour, is Rust developers' preference for parsing over validating, and for abstractions over raw integers that prevent you making the security errors to begin with. This isn't a guarantee (unlike memory safety in safe Rust), but it does tend to result in fewer security issues.

This, in turn, leads to bounds on the danger level of accessing a filesystem; if the worst case possible by supplying a corrupt filesystem to a driver is that it'll read parts of the block device it wouldn't have read without the corruption, then you have a very different issue to one where a corrupt filesystem will cause the driver to read from an entirely different partition.
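
A minimal sketch of the "parse rather than validate" style, under assumed names (`Partition`, `BlockNo`): a raw integer from disk is turned into a `BlockNo` exactly once, at the boundary, and the rest of the code can only address the device through that type.

```rust
// `Partition` and `BlockNo` are hypothetical names for this sketch.

/// A block number already known to lie inside the partition.
/// The field is private, so the only way to obtain one is `Partition::block`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BlockNo(u64);

pub struct Partition {
    start: u64,  // first device sector of this partition
    blocks: u64, // number of blocks in this partition
}

impl Partition {
    pub fn new(start: u64, blocks: u64) -> Option<Self> {
        // Reject geometries whose end would overflow.
        start.checked_add(blocks)?;
        Some(Partition { start, blocks })
    }

    /// Parse an untrusted on-disk integer into a `BlockNo`, or fail.
    pub fn block(&self, raw: u64) -> Option<BlockNo> {
        if raw < self.blocks {
            Some(BlockNo(raw))
        } else {
            None
        }
    }

    /// Cannot point outside the partition as long as `b` came from
    /// this partition's `block()` (a real design would also tie the
    /// two together, for example with a lifetime or an id).
    pub fn device_sector(&self, b: BlockNo) -> u64 {
        self.start + b.0
    }
}

fn main() {
    let part = Partition::new(2048, 1000).expect("valid geometry");
    match part.block(5_000_000) {
        Some(b) => println!("sector {}", part.device_sector(b)),
        None => println!("corrupt pointer rejected"),
    }
}
```

With this shape, a corrupt pointer can at worst make the driver read a different block inside its own partition, which is the bounded failure mode described above.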

Surely you need to handle corrupt data anyway

Posted Jun 19, 2023 12:26 UTC (Mon) by pizza (subscriber, #46) [Link]

> I don't even understand why an automount of a USB stick happens with udisk if it's not in VFAT format. (Those eccentric people who want to mount ext4 USB sticks -- and I belong to them! -- are not worth to make that the default behaviour.)

For Fedora/GNOME, EXT4 and XFS filesystems are only sometimes automounted. SD cards seem to be, but "real" external drives required me to explicitly mount the drive, either manually or by clicking on it in GNOME's file browser. Though now that I think about it, it's possible there are LVM interactions involved here too. (yay, another layer of things to go very wrong..)

(I don't recall if NTFS volumes behave the same way -- I have a 1TB SSD set up that way, but it's not stashed somewhere convenient)

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 11:42 UTC (Thu) by bluca (subscriber, #118303) [Link]

Private keys are in the same family as those used for secure boot, in fact the primary means to use it is via MOK and the machine keyring. There are revocations and blocklisting and rotation to deal with that, it is already covered in the existing threat model.

Surely you need to handle corrupt data anyway

Posted Jun 13, 2023 20:01 UTC (Tue) by epa (subscriber, #39769) [Link]

I wasn’t saying to guarantee it does not contain malicious data, or even to still function if it does. If the filesystem is corrupt it would be fine for all operations on that filesystem to fail. Just don’t do something wildly buggy like overrunning an array, converting a filesystem error into a whole-machine crash or a local root exploit.

Naively if you “trust the filesystem” you might convert offsets read from disk into pointers or array indexes without further checking. And it’s quite easy for bugs like that to sneak in, if you don’t do fuzz testing. It’s a similar situation to a lot of userspace software twenty years ago, when a corrupt or specially crafted input file might let an attacker exploit your mail client or music player. The kernel needs to hold itself to a higher standard — and if it does, there is not much need for elaborate permission mechanisms around mounting a filesystem, because (apart from suid etc) it should be a safe operation.
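
To make the point about offsets concrete, here is a small sketch of the checked version. The function name, the error type, and the idea of an "extent" described by an offset and a length are assumptions for the example; the point is only that a bad value turns into an error for that one operation.

```rust
#[derive(Debug, PartialEq)]
pub enum FsError {
    Corrupt,
}

/// `offset` and `len` are read straight from the untrusted image.
/// A bad value fails this one operation instead of overrunning memory.
pub fn read_extent(image: &[u8], offset: u32, len: u32) -> Result<&[u8], FsError> {
    let start = offset as usize;
    let end = start.checked_add(len as usize).ok_or(FsError::Corrupt)?;
    // `get` returns None instead of panicking when the range is out of bounds.
    image.get(start..end).ok_or(FsError::Corrupt)
}

fn main() {
    let image = [0u8, 1, 2, 3, 4, 5, 6, 7];
    println!("{:?}", read_extent(&image, 2, 3));
    println!("{:?}", read_extent(&image, 6, 100));
}
```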

Surely you need to handle corrupt data anyway

Posted Jun 14, 2023 8:16 UTC (Wed) by dgc (subscriber, #6611) [Link] (2 responses)

> What the article calls a “malicious image” I call one that has been corrupted in some way,
> or just exercises a case the filesystem programmer hadn’t thought of. And instead of
> “attack” I would call it something that will go wrong sooner or later. Surely with all
> the testing and fuzzing that is done against Linux kernels, somebody is trying to
> provoke bugs by mutating filesystem images?

Ok, I'll bite. There's a difference between real world corruption vectors and what fuzzers like syzbot do.

Real world corruption results from storage errors, firmware bugs, kernel bugs, filesystem bugs, memory corruption, etc. Not to mention less obvious stuff like writing something to the wrong place (yes, we had an XFS bug that did that recently).

In most cases, we can detect and defend against these real world corruption vectors - XFS recently detected that metadata was being corrupted by data being written to the wrong location. We've known about these vectors for decades and we've defended against them for almost as long, too.

e.g. the XFS v5 format is entirely based around being able to identify and validate that the metadata block is owned by the filesystem, is located in the right place, is not stale, has not had random bit corruptions and does not contain bad pointers to other metadata or outside the scope of the structure. This defends against 99.99% of real world corruptions that occur while data is in flight to/from storage or at rest in storage.

Anyone who thinks that filesystems are not actually trying to validate that their structure is good before it gets parsed really does not understand the state of the art. On XFS, the v5 format largely stopped random bit corrupting fuzzer development dead for several years - it's only relatively recently that fuzzer authors have realised they could -recalculate CRCs- after corrupting blocks to stop the filesystems rejecting blocks with injected random bit errors. They learnt that trick from the fuzzers we built ourselves that use xfs_db to recalculate CRCs after rewriting values....

IOWs, fuzzer projects learnt that they can go back to injecting random errors into filesystems by having -detailed knowledge of the filesystem structure-. By knowing exactly where the CRC fields in the metadata structures are and how they are calculated, they can corrupt random bits in the metadata and then recalculate the CRC to hide the random bit corruption.
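
The mechanism can be sketched in a few lines. This is a toy model, not the XFS on-disk format: the block layout (a 4-byte checksum followed by the payload) and the `seal`/`verify` names are assumptions for the example, and the checksum is a plain bitwise CRC-32C.

```rust
// Toy block layout (an assumption, not XFS): [crc: u32 LE][payload...]
const CRC_LEN: usize = 4;

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Store the checksum of the payload in the first four bytes.
pub fn seal(block: &mut [u8]) {
    assert!(block.len() >= CRC_LEN, "block too short to hold a checksum");
    let crc = crc32c(&block[CRC_LEN..]);
    block[..CRC_LEN].copy_from_slice(&crc.to_le_bytes());
}

/// Check the stored checksum against the payload.
pub fn verify(block: &[u8]) -> bool {
    if block.len() < CRC_LEN {
        return false;
    }
    let mut stored = [0u8; CRC_LEN];
    stored.copy_from_slice(&block[..CRC_LEN]);
    u32::from_le_bytes(stored) == crc32c(&block[CRC_LEN..])
}

fn main() {
    let mut block = vec![0u8; 32];
    block[10] = 42;
    seal(&mut block);
    println!("freshly sealed:       {}", verify(&block));

    block[10] ^= 0x01; // accidental bit flip, as storage media might do
    println!("after bit flip:       {}", verify(&block));

    seal(&mut block); // what a structure-aware fuzzer or attacker does
    println!("after recalculation:  {}", verify(&block));
}
```

Once the checksum is recalculated, the checksum alone can no longer tell the block apart from a legitimate one; only the structural checks on the payload are left to catch it.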

The monkeys worked out how to make bashing millions of keyboards work again...

However, there is a difference this time: rather than the old fuzzers being an analog of random bit corruption such as can happen in storage media, what the fuzzers are now doing *will not happen* in the real world *by accident*. The fuzzers are performing intentional, premeditated corruption of the filesystem structure by violating the trust model they operate in with the intent of making things break. That's pretty much the definition of "malicious damage", and the only real-world analog we have for this behaviour is an attacker trying to break into a system....

Hence, from a filesystem engineering point of view, there is no practical difference between the actions of a fuzzer like syzbot and a malicious attacker trying to attack the kernel using a filesystem image trust model violation technique.

Hopefully this might explain why filesystem engineers see a significant difference between real world production system corruption vectors versus fuzzer corruption vectors, and why the words "malicious actor" appear in discussions of these fuzzing techniques.

-Dave.

Surely you need to handle corrupt data anyway

Posted Jun 15, 2023 5:34 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

Thanks for the explanation. You make a good point that if the data is protected by a CRC, then it cannot be corrupted by hardware failure without that being obvious. However, it's still possible for the metadata to be corrupted by a software bug. The "malicious actor" could simply be somebody running an older, buggy version of the filesystem that writes bad metadata in some cases (but still calculates the CRC). Or even a filesystem developer running a newer, buggy version...

So I would still tend to err on the side of "anything that can go wrong, will go wrong" and assume that any kind of bad data will, sooner or later, turn up in a production system, even without it being deliberately created by an attacker. Just as when writing userspace code I have to be careful to defend against buffer overruns, even if I know that the only data it's reading is something previously saved from the same program.

Surely you need to handle^W -correct- corrupt data anyway

Posted Jun 15, 2023 11:15 UTC (Thu) by dgc (subscriber, #6611) [Link]

> You make a good point that if the data is protected by a CRC, then it cannot
> be corrupted by hardware failure without that being obvious. However, it's
> still possible for the metadata to be corrupted by a software bug.

Yes. We've thought about that, too. After all, I'm writing some of the code and I don't trust myself to write bug free code. :)

To mitigate such problems, XFS also runs metadata structure verification functions on the structures it is about to write to disk. It does this before it recalculates the CRCs to ensure what we are sealing with the CRC and writing to disk has not been corrupted by some in-memory issue. We've caught all sorts of kernel memory corruption bugs, hardware memory corruption bugs (i.e. buggy CPUs - we've caught more than one) and other software (XFS) bugs with this pre-write IO verification architecture.
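
The ordering described here (verify the structure, then seal it, then write it) can be sketched as follows. Everything in the sketch is hypothetical: the `MetaBlock` fields, the three checks, and the `checksum` function, which is a trivial stand-in for a real CRC. It is not how XFS implements its verifiers.

```rust
const MAGIC: u32 = 0x4D45_5441; // hypothetical magic number

#[derive(Debug, Clone, PartialEq)]
pub struct MetaBlock {
    pub magic: u32,
    pub blkno: u64, // where this block believes it lives
    pub next: u64,  // pointer to another metadata block
    pub checksum: u32,
}

#[derive(Debug, PartialEq)]
pub enum VerifyError {
    BadMagic,
    WrongLocation,
    PointerOutOfRange,
}

/// Structural checks run on the in-memory block before it is sealed.
fn verify(b: &MetaBlock, target: u64, fs_blocks: u64) -> Result<(), VerifyError> {
    if b.magic != MAGIC {
        return Err(VerifyError::BadMagic);
    }
    if target >= fs_blocks || b.blkno != target {
        return Err(VerifyError::WrongLocation);
    }
    if b.next >= fs_blocks {
        return Err(VerifyError::PointerOutOfRange);
    }
    Ok(())
}

/// Stand-in for a real CRC; only here so the example is self-contained.
fn checksum(b: &MetaBlock) -> u32 {
    let mut h = b.magic;
    for &w in &[b.blkno, b.next] {
        h = h.rotate_left(5) ^ (w as u32) ^ ((w >> 32) as u32);
    }
    h
}

/// Verify first, then seal, then write. A block that fails verification
/// never gets a valid checksum and never reaches the "disk".
pub fn write_block(
    disk: &mut Vec<Option<MetaBlock>>,
    mut b: MetaBlock,
    target: u64,
) -> Result<(), VerifyError> {
    let fs_blocks = disk.len() as u64;
    verify(&b, target, fs_blocks)?;
    b.checksum = checksum(&b);
    disk[target as usize] = Some(b);
    Ok(())
}

fn main() {
    let mut disk: Vec<Option<MetaBlock>> = vec![None; 8];
    let block = MetaBlock { magic: MAGIC, blkno: 3, next: 99, checksum: 0 };
    // `next` was corrupted in memory; the write is refused.
    println!("{:?}", write_block(&mut disk, block, 3));
}
```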

> So I would still tend to err on the side of "anything that can go wrong,
> will go wrong"

It should be clear by now that we already make this assumption, and that we assume really bad things *will* happen.

If you want to learn more about the metadata verification architecture in XFS, read this design document that was added to the 3.10 kernel (yes, a decade ago):

https://docs.kernel.org/filesystems/xfs-self-describing-m...

If that doesn't scare you off, then try this one on for size, the online repair design document goes into much more detail about how this metadata structure is leveraged for total structure verification and repair:

https://docs.kernel.org/filesystems/xfs-online-fsck-desig...

The complexity of the verification and repair problem might also give you some insight into why most filesystem engineers don't think robust runtime defense against malicious attack is possible for anything other than the simplest of filesystems. It should also go some way to explain why we also think that fsck doesn't provide any guarantee that a filesystem image is "safe"...

-Dave.

Mounting images inside a user namespace

Posted Jun 13, 2023 17:17 UTC (Tue) by mcon147 (subscriber, #56569) [Link] (2 responses)

Can we have a 'known-good' file-format that is easy to validate/deny?

Mounting images inside a user namespace

Posted Jun 13, 2023 17:36 UTC (Tue) by smurf (subscriber, #17840) [Link] (1 responses)

It should be possible to harden one of the read-only formats.

Feel free to do the work …

Mounting images inside a user namespace

Posted Jun 14, 2023 2:54 UTC (Wed) by hsiangkao (subscriber, #123981) [Link]

> It should be possible to harden one of the read-only formats.

There is already a fuzzing syzbot to exercise filesystems with crafted images continuously and EROFS known issues are already handled upstream:
https://syzkaller.appspot.com/upstream/s/erofs

If some new reasonable attack vector (not only a crash but also an infinite loop, deadlock, etc.) is reported upstream, we'd like to address it soon. In the long term (someday after the Rust infrastructure really becomes mature), a pure-Rust EROFS kernel implementation could land as well (maybe just starting from a small on-disk subset); at least I don't see any real barrier to this. Yet currently it lacks too many important API parts and I don't expect many of the kernel Rust-binding issues to be resolved easily.

As for FS_USERNS_MOUNT, I think that is mostly a policy issue. Honestly, I don't see any real barrier to enabling this for a compression-disabled EROFS (compression can be explicitly disabled to avoid attack vectors arising from compression, since the EROFS metadata format doesn't rely on compression). Yet before such a policy is written, I tend to be neutral (even conservative) about this rather than too aggressive in enabling it first.

Mounting images inside a user namespace

Posted Jun 13, 2023 20:36 UTC (Tue) by geofft (subscriber, #59789) [Link] (8 responses)

This is all very exciting, but userspace can't always expect to be able to talk to systemd, even on a machine running systemd. The most straightforward case (and the one I care the most about, personally) is Docker / Kubernetes / etc. If I "docker run" or "kubectl run" something, I'm in an environment where I don't have access to the host init, so even if the host happens to be running systemd (as is common these days), I can't use this. But I do have access to the kernel API, and I can and do create nested user namespaces inside the container to access various kernel features.

So I think work should continue on making filesystems that really are safe for FS_USERNS_MOUNT, even if it's just read-only filesystems or restricted functionality (e.g., ext2 instead of ext3/ext4 seems totally fine).

To be clear, I'd probably still use this in uncontainerized userspace, and the idea of having systemd dynamically assign any user on the system some subuids/subgids without having to maintain a static mapping in /etc (and without the setuid binaries!!!) is also really exciting.

Separately, I am really skeptical about Ts'o's claim about fsck - it seems to rely on fsck itself not being compromisable. I would think it's probably as easy, if not easier, to write a malicious filesystem that overflows something in e2fsck and causes it to exit(0) as to write a malicious filesystem that compromises the ext4 driver.

Mounting images inside a user namespace

Posted Jun 13, 2023 21:00 UTC (Tue) by bluca (subscriber, #118303) [Link] (4 responses)

> This is all very exciting, but userspace can't always expect to be able to talk to systemd, even on a machine running systemd. The most straightforward case (and the one I care the most about, personally) is Docker / Kubernetes / etc. If I "docker run" or "kubectl run" something, I'm in an environment where I don't have access to the host init, so even if the host happens to be running systemd (as is common these days), I can't use this. But I do have access to the kernel API, and I can and do create nested user namespaces inside the container to access various kernel features.

It's not the host's init, it's some services that are accessible via IPC. It's up to the container manager to manage those accesses. There is nothing stopping them from doing so.

Mounting images inside a user namespace

Posted Jun 13, 2023 21:39 UTC (Tue) by geofft (subscriber, #59789) [Link] (3 responses)

Sure, but there's nothing causing them to do so, either, and the whole point of a container is that it gives you access to the host kernel but not to the host userspace.

(Do you know any container runtimes/managers that currently expose host IPC services like D-Bus into the guest? I think even systemd-nspawn does not.)

My point is not that it's technically impossible; of course it's technically possible. My point is it won't happen in practice for a very large and common use case, for good engineering reasons.

Mounting images inside a user namespace

Posted Jun 13, 2023 22:27 UTC (Tue) by bluca (subscriber, #118303) [Link]

It's not D-Bus, it's varlink, ie: allowing access to the socket only allows access to _that_ socket, it's not a bus. Of course it has to be mediated. I don't see why it wouldn't be used by container managers, and nesting is an explicitly supported use case. Certainly nspawn will use it, and I'd guess podman and lxc will too once it's available.

Mounting images inside a user namespace

Posted Jun 14, 2023 5:34 UTC (Wed) by zdzichu (subscriber, #17118) [Link] (1 responses)

Isn't it the container runtime's job to set up the environment for "guests", including mounts? The container just runs, it doesn't have to mount anything, thus no need for IPC with the host.

Mounting images inside a user namespace

Posted Jun 14, 2023 21:30 UTC (Wed) by geofft (subscriber, #59789) [Link]

Yeah, it depends what you're using it for. I think the envisioned setup in the article is indeed where you request something like "I would like to run this disk image as a container please" and there's a way to do that. But one of my use cases is running integration tests for software on fairly generic container images (e.g., GitHub runners). My software might have its own disk image with some test files, and I might want to mount that image and use it as part of my application code. This disk image is separate from the container image (i.e., the environment where the container runs), and it might be calculated dynamically in some way - e.g. the application might download it from a URL that's in the test suite. We wouldn't want to teach the container infrastructure about this image or build a custom container image for this application.

In fact, a feature request over the years for Kubernetes has been a way to mount a second container image in a container (at some path other than /). This doesn't require any kernel support, it just requires the container runtime to unpack two images where it was previously unpacking one. Despite being one of the oldest Kubernetes feature requests, from one of the developers (https://github.com/kubernetes/kubernetes/issues/831) it hasn't really been implemented yet (there's a pointer there to https://github.com/kubernetes-retired/csi-driver-image-po..., but note the URL...).

(At my day job we worked around this exact limitation by setting up additional containers with the data we need and having them do a cp -a / into a shared volume, and having our actual application block until the copy is done.)

In other words I think there's a conceptual separation between "application space" and "infrastructure space", analogous to the much more obvious separation between kernelspace and userspace. New unprivileged kernel features get to be used by applications, even if the infrastructure doesn't explicitly set it up. (We make extensive use of unprivileged user namespaces inside containers for all sorts of things!)

Mounting images inside a user namespace

Posted Jun 14, 2023 11:32 UTC (Wed) by mezcalero (subscriber, #45103) [Link] (2 responses)

Note that container environments don't actually have to talk to systemd in any way for this all to work. It's entirely sufficient to bind mount one AF_UNIX socket inode from the host into the container, that's all. It's not too different to how access to syscalls or device nodes is controlled: the container policy allows or denies syscalls, device nodes and whether the socket is bind mounted or not.
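
From the container's side, that could look roughly like the sketch below: the program tries to connect to a socket path and treats a missing socket as "feature not offered here". The socket path and the function name are placeholders invented for the example, and no wire protocol is shown; the listener in `main` merely stands in for a host service so that the example runs on its own.

```rust
use std::io::{Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;

/// Try to reach the mount service. `None` means the container manager
/// did not bind mount the socket into this container (or nothing is
/// listening), so the caller has to fall back or give up.
/// The path passed in is a placeholder, not a real well-known location.
pub fn connect_mount_service(path: &Path) -> Option<UnixStream> {
    UnixStream::connect(path).ok()
}

fn main() -> std::io::Result<()> {
    // Stand-in for the host side: a listener on a socket in a temp dir.
    let dir = std::env::temp_dir().join(format!("mountsock-demo-{}", std::process::id()));
    std::fs::create_dir_all(&dir)?;
    let socket_path = dir.join("service.sock");
    let listener = UnixListener::bind(&socket_path)?;

    // "Container" side: connect if the socket is there.
    match connect_mount_service(&socket_path) {
        Some(mut stream) => {
            stream.write_all(b"ping")?;
            // Host side accepts and reads the request.
            let (mut peer, _) = listener.accept()?;
            let mut buf = [0u8; 4];
            peer.read_exact(&mut buf)?;
            println!("service received {:?}", String::from_utf8_lossy(&buf));
        }
        None => println!("mount service not available in this container"),
    }

    drop(listener);
    std::fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Whether the socket inode is visible at that path is then purely a matter of container policy, in the same way as access to a device node.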

I mean, sure a container that wants this functionality will only work on a host OS that provides this functionality, but that's not too different from some arbitrary kernel feature which typically are only available on some kernels and not others.

Lennart

Mounting images inside a user namespace

Posted Jun 14, 2023 21:56 UTC (Wed) by geofft (subscriber, #59789) [Link] (1 responses)

Yeah, but in practice if a common filesystem is marked FS_USERNS_MOUNT then I can use it on both GKE (containerd on Google COS) and GitHub Actions (Docker on Ubuntu) within basically a year.

In other words, there is a subset of fun kernel features available to users - including unprivileged user namespaces, but also including stuff like seccomp - that is in practice made available to container users simply by virtue of Linux in normal configurations making them available. It helps a lot that the hosting userspace doesn't have to do anything to control unprivileged kernel features; they're just there. Very rarely I'd have to file a feature request with the COS team or the GitHub Actions teams to add a certain kernel module or maybe enable a configuration option, but it would just be build configuration - they wouldn't be changing any code or writing anything to plumb things through from the host - so they'd be likely to say yes.

You could imagine a world where the kernel didn't implement, say, timerfd for unprivileged users, and it said "Only root can create kernel timers, but root is free to run a daemon to configure timers requested by other processes and send them stuff over pipes when the timers expire." (Obviously there's no technical reason for this, but bear with me.) In practice that would mean that writing unprivileged programs that use timers would be difficult. You could do it on certain OSes, and you could maybe reconfigure certain environments to make it work, but it would be a much less reliable experience than having timerfd in its current form.

(And it also doesn't require making any assumptions about what the guest container looks like and whether it follows a normal filesystem layout, so you don't have to figure out where that UNIX socket gets bind-mounted to. You can have one-file containers with a statically linked binary and no directories - and I think a lot of Go folks do exactly this - that use random kernel features, because a container keeps the kernel the same but changes out the userspace.)

The whole container ecosystem more or less relies on the idea (wildly technically invalid, but remarkably true in practice) that there is indeed a common Linux ABI available to userspace, and you can go from distro to distro or provider to provider and expect the same things to be available. And I think there is a sense on the kernel side that this should indeed be true - see e.g. the pushback to the "optional patches" in https://lwn.net/ml/linux-kernel/CAG_fn=WR3s3UMh76+bibN0nU... or the objection to mutually-exclusive major features in https://lwn.net/Articles/858023/ , both of which would have risked breaking the commonality of "Linux" as seen from userspace.

(Again I want to be clear that I'm not arguing against the work you're describing here. I'm excited for it! But I think we should _also_ have FS_USERNS_MOUNT or something like it.)

Mounting images inside a user namespace

Posted Jun 15, 2023 10:14 UTC (Thu) by SLi (subscriber, #53131) [Link]

Thanks, this was insightful! First I was baffled about why you want what you want, but I think you make an important point. It's kind of a mixed social (or standardization) and technical problem, and it makes sense to say that there is a de facto base standard (the kernel ABI), whether it's good or not.

Mounting images inside a user namespace

Posted Jun 13, 2023 20:51 UTC (Tue) by flussence (guest, #85566) [Link] (10 responses)

If I could use FUSE for every non-boot filesystem, I would. There's no reason to be shovelling gigabytes of arbitrary data into a ring 0 driver and architecture-astronaut stunts like this are actively dragging us further from a secure world.

Mounting images inside a user namespace

Posted Jun 13, 2023 21:00 UTC (Tue) by dezgeg (guest, #92243) [Link] (1 responses)

Give lklfuse from https://github.com/lkl/linux a try

LKL

Posted Jun 14, 2023 17:49 UTC (Wed) by DemiMarie (subscriber, #164188) [Link]

Hopefully LKL evolves to the point that non-Linux based OSs have a chance at running on the hardware people actually have.

Mounting images inside a user namespace

Posted Jun 13, 2023 22:01 UTC (Tue) by pizza (subscriber, #46) [Link] (2 responses)

> There's no reason to be shovelling gigabytes of arbitrary data into a ring 0 driver

Uh, sure there is, and it's a _very_ good one: Performance.

Sure, FUSE is "good enough" for many use cases, but it's not so great for I/O intensive workloads. (and of course that's expected, given that it triples the number of kernel/userspace transitions vs an in-kernel driver)

Mounting images inside a user namespace

Posted Jun 14, 2023 5:50 UTC (Wed) by smurf (subscriber, #17840) [Link]

You can work around (most of) the performance issue by keeping the data blocks in the kernel, and use io_uring in the FUSE server. There's a "play" server that exports a single file as a block device via FUSE, at https://github.com/uroni/fuseuring and there's no reason (in principle anyway) why this shouldn't work for a "real" FUSE-exported file system.

Of course you still have more latency, but as long as the actual data stays within the kernel that should be OK for many/most? workloads.

Mounting images inside a user namespace

Posted Jun 14, 2023 7:43 UTC (Wed) by leromarinvit (subscriber, #56850) [Link]

> Uh, sure there is, and it's a _very_ good one: Performance.

At least for untrusted removable media, trading some performance for better security is probably a worthwhile tradeoff. The difference will probably not even be noticeable with typical thumb drives. I wouldn't mind a solution where the "default" way to mount removable media (e.g. clicking a USB drive in the file manager) is via FUSE and you have to jump through extra hoops to mount it natively (extra points if that's easily available as well, for those cases where you're using e.g. an NVMe SSD via Thunderbolt and want it to be fast).

Mounting images inside a user namespace

Posted Jun 14, 2023 11:28 UTC (Wed) by mezcalero (subscriber, #45103) [Link] (4 responses)

I fully agree. The fact that desktop infrastructure such as udisks just willy-nilly mount random USB sticks you plug in natively is a big security problem. We should do what ChromeOS does: restrict which file systems can be mounted like that (i.e. vfat only pretty much), and implement the fs driver for that as unpriv fuse. And then not use a kernel fs for such an untrusted fs.

Lennart
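The allowlist idea described above (only vfat, and only through an unprivileged FUSE driver) can be sketched as a small policy check. This is a minimal illustration, not how ChromeOS or udisks actually implement it: the `POLICY` table and the `sniff_fs_type` and `mount_decision` names are hypothetical, and a real tool would rely on a proper prober such as libblkid rather than hand-rolled magic-number checks.

```python
from typing import Optional

# Hypothetical policy table: filesystem type -> how the desktop may mount it.
# Anything not listed here is refused outright.
POLICY = {"vfat": "fuse"}


def sniff_fs_type(header: bytes) -> Optional[str]:
    """Guess the filesystem type from the first bytes of a device (simplified)."""
    # ext2/3/4: the superblock starts at byte 1024 and carries the magic
    # 0xEF53 (little-endian) at offset 56 inside it, i.e. bytes 1080..1081.
    if len(header) >= 1082 and header[1080:1082] == b"\x53\xef":
        return "ext"
    # FAT: the boot sector ends in 0x55 0xAA, and the type string sits at
    # offset 54 (FAT12/FAT16) or offset 82 (FAT32).
    if len(header) >= 512 and header[510:512] == b"\x55\xaa":
        if header[54:57] == b"FAT" or header[82:87] == b"FAT32":
            return "vfat"
    return None


def mount_decision(header: bytes) -> str:
    """Return "fuse" for allowlisted filesystems and "refuse" for the rest."""
    return POLICY.get(sniff_fs_type(header), "refuse")
```

With this policy, a FAT32 stick would be handed to a FUSE driver, while an ext4 or unrecognized device would not be mounted automatically at all.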

Mounting images inside a user namespace

Posted Jun 14, 2023 17:46 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Verified boot (on both Android and ChromiumOS) requires that the kernel’s implementation of the FS chosen for the writable partition is secure against maliciously crafted images. The threat model includes someone who has obtained kernel privileges and keeps those privileges across OS upgrades. This means that the upgraded OS must be secure no matter what the contents of the writable partition are. Ideally, that writable partition would be mounted via FUSE, but then ideally Linux would be replaced by a microkernel.

Mounting images inside a user namespace

Posted Jun 15, 2023 8:07 UTC (Thu) by smurf (subscriber, #17840) [Link]

The writable partition is usually encrypted. This means that any offline change either results in random data (which are typically not useful) OR requires the encryption key.

However, your ability to access that key implies that the device is already rooted, thus you don't need to fudge with the file system in the first place.

Mounting images inside a user namespace

Posted Jun 24, 2023 0:08 UTC (Sat) by Kamilion (subscriber, #42576) [Link] (1 responses)

I should point out, as a heavy user of udisks2 and Lubuntu -- **udisks** is not the one triggering the automount. On my LXQt desktop, the file manager has a checkbox for enabling automounting (which defaults to on).

One of my long-time modifications while rolling my own livecd has been disabling it, as one of my primary dayjob tasks in ewaste happens to be erasing media, and it's incredibly annoying to have something trying to automount a disk you're just about to scramble.

"oooh, I see a lvm signature! please may i?"
"No, I took the LVM tools away from you for a reason, don't touch it."
"awwww."

(I do, in fact, support the idea of mounting VFAT through FUSE for additional safety, though. I *would*, however, appreciate someone finally getting around to a sane way to inform graphical console users that something has gone wrong with storage processes. There have been many times when I've briefly plugged a device in and unplugged it, and this has angered the kernel or userspace, with no recourse but to stare at dmesg for a while: "oh, kernel was still retry looping on link rate for 18 seconds before giving up on the SATA endpoint or noticing a serial change".)

Mounting images inside a user namespace

Posted Jul 2, 2023 11:48 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

> I should point out, as a heavy user of udisks2 and Lubuntu -- **udisks** is not the one triggering the automount. On my LXQt desktop, the file manager has a checkbox for enabling automounting (which defaults to on).

Yes, `udiskie` needs to be running for `udisks2` to do anything on my machines. One can also set up automounting behaviors via udev properties (I use automount for many things, so a lot of udiskie's feature set is asking for LUKS passwords and then mediating locking again).

Mounting images inside a user namespace

Posted Jun 14, 2023 11:59 UTC (Wed) by karim (subscriber, #114) [Link] (25 responses)

Why not just run user-mode Linux to mount the image and then make its contents available over a "remote network connection"? The UML instance can be constrained as-is with existing mechanisms and all the fs utilities can be used as-is within UML.

Or maybe I'm missing something.

Mounting images inside a user namespace

Posted Jun 14, 2023 12:11 UTC (Wed) by bluca (subscriber, #118303) [Link] (24 responses)

Why maintain and deploy an _entire new_ kernel, that needs to be set up and configured and updated, that will get out of sync with the actual filesystem capabilities provided by the real one (filesystems do change), when the real one is there to be used?

Mounting images inside a user namespace

Posted Jun 14, 2023 12:24 UTC (Wed) by karim (subscriber, #114) [Link] (23 responses)

Hmm. I've been using Linux since ~1994/5. I can't recall having been bitten by such filesystem changes. I could have selective memory or maybe I'm too conservative in my filesystem choices. Still, my recollection is that most of the time Linux does a better job at handling corrupted/malformed filesystems than most other workstation-grade OSes.

I mean, ultimately, I don't want to have to trust the images in any way, shape or form. I just want a mechanism that enables me to mount/manipulate an FS image without having to care where it came from or who might've signed it. What's being proposed above wouldn't solve that problem for me. UML, or a set of tools built around it, possibly would.

Mounting images inside a user namespace

Posted Jun 14, 2023 12:29 UTC (Wed) by bluca (subscriber, #118303) [Link] (22 responses)

New filesystem features get added all the time, there are feature flags and so on. Having to maintain an entirely separate kernel, with its own lifecycle, security fixes, deployments, etc, sounds like an absolute nightmare to me. It's already costly enough to maintain one.

Mounting images inside a user namespace

Posted Jun 14, 2023 12:34 UTC (Wed) by karim (subscriber, #114) [Link]

I guess we're just solving for different use-cases.

Personally, if that UML instance is wrapped in tools meant just for filesystem manipulation, then I honestly don't mind grabbing the latest stable release from kernel.org, building it, and deploying it for that purpose. Again, with just that use-case in mind: accessing random FS images.

For me, having to actually think about the trustability of the FS image in question is a step too many. For all I care, the image could be maliciously crafted and I want to not think about it.

Mounting images inside a user namespace

Posted Jun 14, 2023 13:10 UTC (Wed) by smurf (subscriber, #17840) [Link] (20 responses)

The maintenance cost for "make ARCH=um", on top of "make ARCH=$(uname -m)", is pretty much zero.

Hardening the file systems in question appears to be more expensive than that, esp. given that not even the file systems' fsck maintainers are willing to guarantee that an fsck-clean file system won't crash the kernel.

Mounting images inside a user namespace

Posted Jun 14, 2023 13:43 UTC (Wed) by bluca (subscriber, #118303) [Link] (19 responses)

> The maintenance cost for "make ARCH=um", on top of "make ARCH=$(uname -m)", is pretty much zero.

Except of course nobody really does that, apart from a handful of hackers on their 'pet' systems.

> Hardening the file systems in question appears to be more expensive than that, esp. given that not even the file systems' fsck maintainers are willing to guarantee that an fsck-clean file system won't crash the kernel.

Hence why the rest of the article

Mounting images inside a user namespace

Posted Jun 14, 2023 13:53 UTC (Wed) by karim (subscriber, #114) [Link] (14 responses)

> Except of course nobody really does that, apart from a handful of hackers on their 'pet' systems.

Oh, now that's very constructive. Pigeonholing an approach by marginalizing an audience. Do you feel better now? Do you think you've actually achieved anything?

Now, please explain why Ubuntu, Debian, Fedora, etc. couldn't just do "make" twice on the same kernel, as was precisely suggested to you before, and ship that UML version with possibly some scripts around it to facilitate looking at any image that the Linux kernel already supports, without any changes to any fsck or requirement for any trust on any images. How much harder would that be? In fact, how hard would that be for Joe User if they had to just grab the sources of the distro kernel they're already using and rebuild it for UML even if the distro didn't do this?

Mounting images inside a user namespace

Posted Jun 14, 2023 14:03 UTC (Wed) by zdzichu (subscriber, #17118) [Link] (7 responses)

For starters, it would double the number of combinations to be tested before release. Distributions do not have an abundance of QA resources :(

Mounting images inside a user namespace

Posted Jun 14, 2023 15:05 UTC (Wed) by geert (subscriber, #98403) [Link]

Not double, as ARCH=um is still limited to x86...

Mounting images inside a user namespace

Posted Jun 15, 2023 7:59 UTC (Thu) by smurf (subscriber, #17840) [Link] (5 responses)

Why would it double anything? There's no reason to build multiple UML kernels. It'd add one.

Mounting images inside a user namespace

Posted Jun 15, 2023 8:51 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

In most distros today, you build one kernel per CPU architecture you support - so there's one AMD64 kernel, one AArch64 kernel, one RISC-V 64GC kernel etc.

You'd need to repeat that with UML for each CPU architecture you support, thus doubling the number of kernels you build - instead of building one kernel per supported architecture, you'd build two.

Mounting images inside a user namespace

Posted Jun 15, 2023 12:19 UTC (Thu) by karim (subscriber, #114) [Link] (2 responses)

So what's the best tradeoff? Adding to the release testing burden or adding to code maintenance and security burden? I get that this isn't "free" for release management. But, personally, I'd rather trust the code that's already there and has a lot of road mileage than devise new code that will only exceptionally be used just to avoid burdening the release loop.

Mounting images inside a user namespace

Posted Jun 15, 2023 13:40 UTC (Thu) by farnz (subscriber, #17727) [Link] (1 responses)

For starters, UML is currently x86-only; so using UML as your filesystem handler means porting UML to the other architectures, and taking on the security and code maintenance burden that represents.

Then you need a trustworthy IPC mechanism with the host kernel; something that allows the UML kernel to read/write the block devices it's managing, and a trustworthy filesystem protocol to share the FS from the UML kernel to the host kernel. Again, both of these add to the security and maintenance burdens.

Alternatively, you could go the ChromeOS route - the only filesystem used on external devices is FUSE from the kernel point of view, and you run userspace filesystem implementations that have been deliberately and intentionally hardened, and can be sandboxed heavily - you can design the userspace component to accept the FS as one file descriptor, and the FUSE interface as the other, and sandbox it such that all you have is anonymous memory plus accesses to those two file descriptors. This has the advantage over UML that the process that mounts the filesystem is designed to only read that one FS, and no others; so there's a much smaller attack surface to begin with.
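The two-descriptor design described above can be sketched as a server loop that touches nothing except the descriptors it was handed. This is only an illustration under stated assumptions: the request format (`REQUEST`, a little-endian offset and length) and the `serve_reads` name are hypothetical stand-ins for the real FUSE protocol, and the actual sandboxing step (for example a seccomp filter applied before the loop starts) is not shown.

```python
import os
import struct

# Hypothetical wire format for one read request: offset (u64), length (u32).
# This stands in for the FUSE protocol, which is far richer.
REQUEST = struct.Struct("<QI")
MAX_READ = 64 * 1024  # refuse oversized reads instead of trusting the caller


def serve_reads(image_fd: int, channel_fd: int) -> None:
    """Answer read requests arriving on channel_fd using only image_fd."""
    image_size = os.fstat(image_fd).st_size
    while True:
        raw = os.read(channel_fd, REQUEST.size)
        if len(raw) < REQUEST.size:
            return  # peer closed the channel (or sent a truncated request)
        offset, length = REQUEST.unpack(raw)
        if length > MAX_READ or offset > image_size:
            data = b""  # out-of-range request: reply with an empty payload
        else:
            # Clamp the read to the image so nothing past its end is touched.
            data = os.pread(image_fd, min(length, image_size - offset), offset)
        # Reply: payload length (u32) followed by the payload itself.
        os.write(channel_fd, struct.pack("<I", len(data)) + data)
```

Because the loop needs no paths, no network, and no further `open()` calls, a sandbox could in principle deny everything except reads and writes on those two descriptors, which is the small attack surface the comment is pointing at.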

Mounting images inside a user namespace

Posted Jun 15, 2023 13:43 UTC (Thu) by karim (subscriber, #114) [Link]

Thanks for sharing this. This is useful.

Mounting images inside a user namespace

Posted Jun 15, 2023 12:53 UTC (Thu) by geert (subscriber, #98403) [Link]

So there will be one UML kernel for x86, one for x86_64, and all other architectures are blocked on gaining UML support first...

Mounting images inside a user namespace

Posted Jun 14, 2023 14:49 UTC (Wed) by bluca (subscriber, #118303) [Link] (5 responses)

> Oh, now that's very constructive. Pigeonholing an approach by marginalizing an audience. Do you feel better now? Do you think you've actually achieved anything?

Pet vs cattle is a well-known and widely used phrasing to describe the dichotomy between power users' custom systems vs large scale deployments. If you take offence at that categorization go complain to the tech press industry, as they're the ones who came up with it. Or not, but in either case, please stop bothering me.

> Now, please explain why Ubuntu, Debian, Fedora, etc. couldn't just do "make" twice on the same kernel, as was precisely suggested to you before, and ship that UML version with possibly some scripts around it to facilitate looking at any image that the Linux kernel already supports, without any changes to any fsck or requirement for any trust on any images. How much harder would that be? In fact, how hard would that be for Joe User if they had to just grab the sources of the distro kernel they're already using and rebuild it for UML even if the distro didn't do this?

Given the number of times that has happened is precisely 0, I'd wager the answer lies somewhere between "harder than you'd think" and "it's just such a stupid idea that nobody could ever be bothered". Your pick.

Mounting images inside a user namespace

Posted Jun 14, 2023 14:52 UTC (Wed) by karim (subscriber, #114) [Link] (4 responses)

> Pet vs cattle is a well-known and widely used phrasing to describe the dichotomy between power users' custom systems vs large scale deployments. If you take offence at that categorization go complain to the tech press industry, as they're the ones who came up with it. Or not, but in either case, please stop bothering me.
...
> Given the number of times that has happened is precisely 0, I'd wager the answer lies somewhere between "harder than you'd think" and "it's just such a stupid idea that nobody could ever be bothered". Your pick.

I see. Thank you for clarifying the type of debate you like engaging in. I, for one, have better things to do in life.

FWIW, you've just helped me convince myself that the UML approach is likely the better one. Thank you.

Mounting images inside a user namespace

Posted Jun 14, 2023 18:30 UTC (Wed) by bluca (subscriber, #118303) [Link] (3 responses)

Given you are the one who completely misunderstood a reference and immediately dialed it up to 10000, you can save the victim act for another occasion

Mounting images inside a user namespace

Posted Jun 14, 2023 18:55 UTC (Wed) by karim (subscriber, #114) [Link] (1 responses)

> Given you are the one who completely misunderstood a reference and immediately dialed it up to 10000, you can save the victim act for another occasion

<spits-coffee-laughing/>

Buddy, I don't think you understand. I have no compunction about butting heads with anyone. Let's just say that I've had my fingers caught in a few lkml flame wars circa 20 years ago and it was great fun. But, if you will, blame it on getting old ... I don't have the time to waste on the sort of debate you seem to want to have.

My position still stands. UML is the cleanest way to do this because: a) it works today, b) it doesn't need any changes to any tools, c) it doesn't require me to make any assumptions regarding the safety of the images I want to access.

Your semantic side-show changes nothing to this.

Mounting images inside a user namespace

Posted Jun 15, 2023 10:36 UTC (Thu) by bluca (subscriber, #118303) [Link]

Sure thing, have fun!

Mounting images inside a user namespace

Posted Jul 19, 2023 9:36 UTC (Wed) by nye (guest, #51576) [Link]

I know this is a month old but your behaviour across these comments is so shocking that I just can't stay quiet. You should be ashamed.

Mounting images inside a user namespace

Posted Jun 14, 2023 14:20 UTC (Wed) by leromarinvit (subscriber, #56850) [Link] (3 responses)

> Except of course nobody really does that, apart from a handful of hackers on their 'pet' systems.

I assume most people use distro kernels. So the obvious way to get something like that into the hands of average users is for distros to package it. For the user, the cost of having a UML kernel that's in sync with their regular kernel would just be a bit of disk space and network traffic for updates.

People who build their own kernels will probably be technical enough to understand how this works, and to make an informed decision whether they want to also build a UML image when they upgrade (the alternatives would be either continuing to use an older one, with all the potential pitfalls this entails, or disabling this mechanism entirely).

Mounting images inside a user namespace

Posted Jun 14, 2023 15:06 UTC (Wed) by bluca (subscriber, #118303) [Link] (2 responses)

It's not that obvious, as it's orders of magnitude more complicated, and doesn't even fully take care of the security problems. What would that buy over the approach described in the article?

Mounting images inside a user namespace

Posted Jun 14, 2023 15:58 UTC (Wed) by leromarinvit (subscriber, #56850) [Link] (1 responses)

With "obvious", I just meant compared to every user building a UML image themselves.

As for the merits of the UML approach (or something like lklfuse) vs what's described in the article, for one, it could be used with arbitrary file systems. Special trusted images seem like a non-starter for random removable media. (But of course that's an entirely different use case than for containers - I see no reason both approaches can't live side by side.)

What are your security concerns with the UML approach? The worst thing a malicious fs image could do is compromise the UML kernel - so a user-space process. What that process can do can be suitably constrained using the same methods as for any other user space application - of course it shouldn't be allowed to access any other device than the image in question, or access the network or any such things. The security boundary would then be the interface used to connect to the "host" kernel (be it 9p, NFS, FUSE, or something else), and whatever is used to sandbox the UML process.

It seems the alternative most everyone is using today (for removable media) is to just look the other way and mount it directly in-kernel. This seems to be strictly worse to me, security-wise.

Mounting images inside a user namespace

Posted Jun 14, 2023 17:58 UTC (Wed) by bluca (subscriber, #118303) [Link]

Those are different use cases though, as you noted.

In the "user plugs USB drive" use case, the status quo is terrible, as the kernel developers go 'lalala can't hear you', and desktop developers simply stick udisks in and allow mounting anything that is plugged in. This is of course bad, so any movement toward improving, like the already suggested FUSE approach, seems good, and the proposal in the article is not aimed at this.

The proposal in the article is aimed at the use cases where the status quo is simply: you cannot do that, full stop. So to enable that use case, we need to take security seriously. And that's where establishing trust _before_ use is fundamental - if you 'only' compromise your local container manager instead of the kernel, sure it's less bad, but it's simply still unacceptable where this matters. UML uses the exact same drivers as the normal kernel, which are just as, let's say, not guaranteed to be robust against malicious images, so just shouting SANDBOXING and turning the other way, while a step up, is nowhere near good enough.

Not to mention the fact that someone mentioned UML is x86 only, which again is a non-starter - arm64 is not only a thing, it's an important thing. And the whole maintenance angle, of course.
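The "establish trust before use" step can be illustrated with a hash check that runs before any filesystem driver sees the image. This is a rough sketch in the spirit of dm-verity, not its actual on-disk format: `merkle_root` and `may_mount` are hypothetical names, the tree layout is simplified, and a real deployment would verify a signature over the root hash instead of consulting an in-memory list.

```python
import hashlib
import hmac
from typing import Iterable

BLOCK_SIZE = 4096  # assumed block size for this sketch


def merkle_root(image: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Hash each block, then hash pairs of hashes until one root remains."""
    level = [hashlib.sha256(image[i:i + block_size]).digest()
             for i in range(0, len(image), block_size)]
    if not level:
        level = [hashlib.sha256(b"").digest()]  # empty image: hash of nothing
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def may_mount(image: bytes, trusted_roots: Iterable[bytes]) -> bool:
    """Allow the mount only if the image's root hash is already trusted."""
    root = merkle_root(image)
    return any(hmac.compare_digest(root, trusted) for trusted in trusted_roots)
```

Flipping a single byte anywhere in the image changes the root, so a tampered image is rejected before the kernel ever parses it.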


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds