|
|
Log in / Subscribe / Register

A decision on composefs

By Jake Edge
June 7, 2023

LSFMM+BPF

At the end of our February article about the debate around the composefs read-only, integrity-protected filesystem, it was predicted that the topic would come up at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. That happened on the second day of the summit when Alexander Larsson led a session on composefs. While the mailing-list discussion was somewhat contentious, the session was less so, since overlayfs can be made to fit the needs of the composefs use cases. It turns out that an entirely new filesystem is not really needed.

Larsson began by looking at the use case that spurred the creation of composefs. At Red Hat, image-based Linux systems are created using OSTree/libostree; they are not the typical physical block-device images, however, as they are more like "virtual images". There is a content-addressed store (CAS) that contains all of the file content for all of the images. In order to build a directory hierarchy for the virtual image, a branch gets checked out from the OSTree repository, which contains the metadata and directory information for the image; OSTree then builds the directory structure using hard links to the CAS entities.

[Alexander Larsson]

A system created this way "is very lean", because it is flexible and easy to update, so the Red Hat developers want to use OSTree-based images for containers and other types of systems. But there is a missing feature that they would like to have: some kind of tampering prevention as with dm-verity. There are two main reasons that a tamper-proof filesystem is desired: security, to provide a trusted boot, for example, and safety, such as protecting the data used by a self-driving car. Fs-verity provides much of what they are looking for, but it does not go far enough; it is concerned only with protecting the file contents, not the file metadata or the directory structure.

So, back in November, he and Giuseppe Scrivano posted composefs, which is like a combination of a read-only filesystem, such as SquashFS or EROFS, with overlayfs. Composefs just contains the metadata and the directory structure; it gets mounted as the equivalent of the overlayfs upper layer, with the CAS as the lower layer. So referencing a file by its name actually resolves to the entry in the CAS.

If all of the files in the CAS have fs-verity enabled for them, those digest values can be used in the creation of the image for the composefs metadata, which itself is protected with fs-verity. When composefs is mounted, the expected digest for the metadata image is passed to the mount command, so that it can be verified; the Merkle tree of digests for the CAS is part of that image, so everything is protected against any kind of change (be it malicious or cosmic-ray induced).

In the "sometimes heated discussions about this", it turns out that there are already features in overlayfs "that sort of make this possible". Files can have extended attributes (xattrs) that specify that the metadata is separate from the file data (overlay.metacopy) and that the names are different in the layers (overlay.redirect). The idea would be to create an overlayfs with two filesystems: the CAS is the lowest layer and a read-only EROFS loopback image would be above that with its xattrs pointing into the CAS.

There are some missing features, though. In order to support fs-verity on the file contents, there needs to be a overlay.verity xattr to tell overlayfs to verify the file contents based on the digest in the xattr. There also needs to be a mount option to specify that every file must have a digest. There are pending patches to add those features to overlayfs.

There were some performance measurements ("of dubious quality") that were done using ls -lR which showed some lookup amplification in overlayfs. For no real reason, overlayfs was looking up the underlying CAS file; Amir Goldstein called overlayfs "too eager" and has posted patches to support lazy lookups, so that the lower-layer file is not looked up until it is actually needed. An overlay filesystem is a union filesystem, which combines the entries of all of the layers, but that is not needed for this use case, Larsson said. You could use overlay whiteouts to hide the underlying CAS files, but Goldstein's lazy-lookup patches also add data-only layers, which do not have the file names from their filesystems visible in the combined overlayfs.

The basic question for the room, Larsson said, was what the approach should be to getting something upstream to solve his (and others') use cases. "This where the talk gets kind of short because I think most people are leaning toward using the existing code in overlayfs", rather than add a new filesystem. It is less code to maintain, which is always beneficial, but also the features that would be added to overlayfs are useful in their own right.

There are a few cons, but "I think they're pretty minor"; he does not like loopback mounts because they are global and are not namespaced. In addition, the performance of the overlayfs version is roughly the same as composefs (after the lazy lookup is added), but having two filesystems does double the number of inodes and directory entries (dentries) that are in use. He wondered if anyone thought that a custom filesystem was the right approach.

If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter; "you made our argument for us" since an existing filesystem can cover the needed features. Larsson agreed that he would rather not have to maintain a filesystem; "I have enough code to maintain".

Larsson was asked about the problems with loopback mounts. He said that there are people working on solutions, but a container must have a loopback device available. The loopback device is global, however, so it can see all of the loopback mounts in the entire system. Christian Brauner said that he has working patches for doing proper namespacing for global system devices like loopback; there are iSCSI people who are interested in it as well. He hopes that it is just a matter of time before that problem is solved.

There are two facets to the problem of global devices; they appear as device nodes in the /dev tree but are also sysfs entries, Brauner said. He did not want to do a "half-assed namespacing" where he only dealt with the device nodes and did not handle the sysfs entries, but he had to step carefully in order to avoid breaking backward compatibility.

Other mechanisms for handling the integrity and/or trust level of container images were discussed, some of which overlapped the FUSE passthrough discussion the previous day or the session on mounting filesystems in user namespaces coming later in the day. Allowing unprivileged users to mount random images from, say, DockerHub, is not something that will ever be supported, Brauner said. Goldstein agreed, noting that something like a BPF verifier for filesystem images would be needed to ensure that they would not crash the kernel. James Bottomley thought there was a class of simple filesystem images that the kernel could verify before mounting, even for unprivileged users. But Bottomley's idea was not entirely well-received in the room.


Index entries for this article
KernelFilesystems/composefs
ConferenceStorage, Filesystem, Memory-Management and BPF Summit/2023


to post comments

A decision on composefs

Posted Jun 7, 2023 15:50 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think it's sometimes worthwhile to write a new filesystem (or whatever) in order to get a working prototype that has all the features needed for your use case to see that it's possible to meet your needs in a consistent way. Then, when the features you want are demonstrated to have tangible benefits, people will be able to judge whether it's useful to integrate them into filesystems that also support other use cases. 90% of the work is really getting fully-specified semantics that solve your problem, and that part of the process is reused in a different implementation. (And the other 90% is making your implementation robust, but if you put that off until after the demo, it can be skipped entirely if the functionality migrates to a pre-existing filesystem.)

A decision on composefs

Posted Jun 7, 2023 17:22 UTC (Wed) by developer122 (guest, #152928) [Link] (5 responses)

>James Bottomley thought there was a class of simple filesystem images that the kernel could verify before mounting, even for unprivileged users.

If this doesn't exist I'm surprised it hasn't been created. It is very sorely needed and has been for decades.

The only candidate I can think of where I don't know if it's possible to craft a malicious image is tar. There you're essentially relying on a userspace tool to unpack the files onto another existing filesystem for you. I guess there's also FUSE which just skips over verifying the image at all, but I don't know that it can guarantee protection for the system.

A decision on composefs

Posted Jun 7, 2023 17:45 UTC (Wed) by hsiangkao (subscriber, #123981) [Link] (4 responses)

> If this doesn't exist I'm surprised it hasn't been created. It is very sorely needed and has been for decades. The only candidate I can think of where I don't know if it's possible to craft a malicious image is tar.

IMHO, the problem is not about on-disk format but the kernel implementation changes (even some images are simple enough so that we could verify each field to ensure all fields with the sane data, but that doesn't mean it could then be bug-free on malicious crafted images so they still can crash the kernel when running with the implementation. The same to userspace fsck tools, some fsck bugs could do stack-overflow then used for executing malicious code).

Of course you could dump a simple filesystem with a very simple format (maybe even simpler than tar) and fuzzed for long time. But after you update with some new feature or just code update, such guarantee will be no longer true in principle.
I think we could do our best to avoid unexpected behaviors (most impacted stuff is crash of course) due to malicious crafted images. but it's not an absolute word in principle.

A decision on composefs

Posted Jun 7, 2023 21:08 UTC (Wed) by atnot (guest, #124910) [Link] (2 responses)

> I think we could do our best to avoid unexpected behaviors (most impacted stuff is crash of course) due to malicious crafted images. but it's not an absolute word in principle.

This is no different from the rest of the Kernel really, plenty of subsystems deal with data structures that are at least as complicated as filesystems from very risky sources like the network.

The big difference is that in filesystem code, issues resulting from invalid or malicious data have never been considered bugs or vulnerabilities historically. This is understandable since the assumption was that it is pointless to defend against the very disk you are running from, even as that has become less and less true these days. Making all of that code safe against untrusted inputs retroactively would be an absolutely mammoth undertaking.

Hence the suggestion that specific filesystems can opt into promising they are hardened against malicious images, which would allow them to be mounted by non-root users.

A decision on composefs

Posted Jun 7, 2023 23:08 UTC (Wed) by hsiangkao (subscriber, #123981) [Link]

> This is no different from the rest of the Kernel really, plenty of subsystems deal with data structures that are at least as complicated as filesystems from very risky sources like the network.

Yes, yet I think it will be a useful feature for each filesystem (already have dozens now, I do think many filesystems with on-disk format will benefit from mounting as unprivileged users directly if that kicks off from one) but objectively speaking it's quite hard to set some standards in what degree it should be safe.

> Hence the suggestion that specific filesystems can opt into promising they are hardened against malicious images, which would allow them to be mounted by non-root users.

Note that currently syzbot is already heaily fuzzing several filesystems uninterruptedly,
https://syzkaller.appspot.com/upstream/s/fs

A decision on composefs

Posted Jun 8, 2023 15:11 UTC (Thu) by gscrivano (subscriber, #74830) [Link]

and that was the whole idea with composefs. The format is so simple and was designed this way, that it would be very difficult to abuse it, and that at some point could be even used from a user namespace. I think root containers could take advantage of that simple format too, just because root can mount a file system then it doesn't make it any safer and we have the burden to validate the image in one way or another.

Filesystem security against malicious images

Posted Jun 10, 2023 20:53 UTC (Sat) by DemiMarie (subscriber, #164188) [Link]

bcachefs is (per discussions in IRC) intended to be secure against malicious filesystems, even if the block device gets modified behind the kernel’s back. That said, I think FUSE (or something like it) is the long-term answer here. We need to be moving more stuff out of the kernel, not exposing more kernel attack surface.

A decision on composefs

Posted Jun 22, 2023 0:34 UTC (Thu) by Kamilion (subscriber, #42576) [Link] (1 responses)

I realize this may be a bit of a curveball here, but has anyone for a moment considered that the FHS is the bug here? Why is everyone stuck in thinking that we're somehow limited to 1970s three letter identifiers here?

Why are we still stuck with the notion that the root itself must be a physical filesystem?

We rely on the initramfs infrastructure, and in many cases, simply throw it away after pivot_root.

"we have legacy software"
okay, so, run it in a chroot or namespace that dedicates itself to strict FHS adherence.

Everything else has changed.

We don't typically use xorg now, and even when we did, it was effectively just a wrapper to get at the GLX extensions for the last decade. Can't remember the last time I've launched a proper X application short of using xEvil to validate all the aspects of the stack are still functional.

We don't use init.d now; systemd has effectively solved the 'script interpreter required for boot' issues of the past.

We have the android binder in mainline.

We have io_uring, BPF, and a large quantity of components that no longer attempt to apply "everything is a file".

I'm not an OSX user or an apple owner, but I still think /System/ is a far better arrangement for embedded systems and appliance images that still rely on autoconf and makefiles to decide paths.

I'm not a NIX user, but I recognize the factors it's utility brings.

I'm not a GoboLinux user, but the community has supported it's compile-time requirements for twenty years already.

*Personally* I think it's time we got rid of everything in / and moved the concept of where "local system" lives two nodes down in the filesystem tree, to start on the process of supporting new systems hierarchies that CXL enables, and started making a concerted push towards memory mapped I/O, cross-domain DMA rings, and hypervisor-mediated message passing via "well known" PCIe endpoints.

What I'm stuck on is what to paint that particular bikeshed -- /System/this/usr/ ?
We've entered the unicode age and that's one of the very few reasons I can think of for retaining /usr/ in it's short form at all, outside of 'legacy' containment, is it's obstinance against localized translation for compatibility's sake. Most of those localizations have already accepted they'll have to interact with elements outside of their localization environment due to all the accrued history of the companies developing software using latin-based charactersets.

Home systems are looking more and more like we think of 'classic' supercomputer environments. We should be collectively thinking about getting ahead of the wave, with the hardware we have, and what's on the way to the enterprise and the consumer. Beowulf clusters and OpenMPI was obviously not the correct answer to every question.

Fuchsia's 'start with nothing and ask for enablements from a parent' approach seems far more sane in the modern age than clinging to explicit fork/exec core duplication semantics.

Kernel Samepage Merging statistics also tend to back me up as well; having been on a downward trend for a number of years now. Less components are shared now.

Sorry if this has become a bit of an essay, but these opinions have been screeching to escape my brain louder and louder in the last five years. Given the age of this article; I don't expect much in the way of replies, but I just felt I needed to scream into the void at least once or the concepts might continue to be overlooked.

A decision on composefs

Posted Jun 22, 2023 1:41 UTC (Thu) by Kamilion (subscriber, #42576) [Link]

You know, it's only when you read back what you write yourself that you realize that people might not be on the same page as you are.

So, minor clarifications.
I'm basing this on the assumption that CXL "will be" a revolution in how hardware components interact, and that we will be encountering more systems where it makes more sense to treat large regions of memory as filesystems than having to jump through a throughput hoop using a "legacy" hardware bridge through the windowed BAR approach. (Think AHCI, EHCI, UHCI, OHCI(1394), SAS, and even NVME in it's current window-on-an-independent-address-space incarnation)

If we wish to continue with the notion of nodes in the filesystem representing devices; something like qemu's doorbell PCIe devices will eventually make it into hardware.
Probably some hardware-enforced ring queues as well which may map favorably with io_uring semantics.

We will require the ability for one node to peer into the other in a relatively safe manner.
The VFS still has not been dethroned as the administrator's window into the operation of the system, but outside of "niche" clustering filesystems, having a single system image is *rare*.

Even "today", if I pick up something like the MI300A that fits in an existing epyc socket, tomorrow I might be picking up a scaled down consumer version of it, with 4GB of HBM acting as the legacy 32bit address space, and CXL traffic comprising access to the rest of the 64bit address space. It may not even require DDR to be directly connected at all anymore, as it's already on the "far side" of infinity fabric in the ryzen-derived designs.

It won't be long now that we start hitting limits of NVME and wishing we could just treat this 2TB m.2 as a direct map, and zone it to apply runtime variable hardware permissions. (We already enforce executable memory being non-writable, for example using those hardware permissions...)

Most of the changes I've outlined would definitely make massive changes to the way everyday people looked at a linux system; but as far as keeping CDE packages from 1995 still runnable under the 'default personality' in 2025, I think we could eventually agree on a reasonable hard-switch if we retained the legacy strict-posix personality. It's not like we have any competition from commercial UNIX suppliers anymore. And much of that is better suited towards being thrown in a virtualization box and only permitted host-only IP access to a sanitization proxy/gateway, or completely restricted from modern IPv6 access entirely.

If this was orchestrated correctly, "things would change" and "new interfaces would be added" but "nothing would break", as per linux's longtime vows.

A decision on composefs

Posted Sep 26, 2023 14:03 UTC (Tue) by alexl (guest, #19068) [Link]

All the remaining necessary kernel changes are now upstream as of 6.6-rc1, and composefs 1.0 is out:

https://blogs.gnome.org/alexl/2023/09/26/announcing-compo...


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds