Debating composefs

By Jonathan Corbet
February 16, 2023

When LWN looked at the composefs filesystem in December, we reported that there had been "little response" to the patches. That is no longer the case. Whether composefs (or something like it) should be merged has become the subject of an extended debate; at its core, the discussion is over just how Linux should support certain types of container workloads.

Composefs is an interesting sort of filesystem, in that a mounted instance is an assembly of two independent parts. One of those, an "object store", is a directory tree filled with files of interest, perhaps with names that reflect the hash of their contents; the object store normally lives on a read-only filesystem of its own. The other is a "manifest" that maps human-readable names to the names of files in the object store. Composefs uses the manifest to create the filesystem that is visible to users while keeping the object store hidden from view. The resulting filesystem is read-only.

This mechanism is intended to create the system image for containers. When designing a container, one might start with a base operating-system image, then add a number of layers containing the packages needed for that specific container's job. With composefs, the object store contains all of the files that might be of interest to any container the system might run, and the composition of the image used for any specific container is done in the manifest file. The result is a flexible mechanism that can mount a system image more quickly than the alternatives while allowing the object store to be verified with fs-verity and shared across all containers in the system.

The v3 discussion

Version 3 of the composefs patches was posted in late January; it included a number of improvements requested by reviewers of the previous versions. Amir Goldstein was not entirely happy with the work that had been done, though; he suggested that, rather than proposing composefs, the developers (Alexander Larsson and Giuseppe Scrivano) should put their efforts into improving the existing kernel subsystems (specifically overlayfs and the EROFS filesystem) instead.

A long discussion of the value (or lack thereof) of composefs followed. There are currently two alternatives to composefs that are used for container use cases:

At container boot time, the system image is assembled by creating a large set of symbolic links to the places where the target files actually live. This approach suffers from a lengthy startup time; at least one system call is required to create each of thousands of symbolic links, and that takes time.
The container is given a base image on a read-only filesystem; overlayfs is then used to overlay one or more layers on top to create the container-specific image.

The consensus seems to be that the symbolic-link approach, due to its startup cost, is not a viable alternative to composefs. Goldstein and others do think that the overlayfs approach could be viable, though, perhaps with a few changes to that filesystem. The composefs developers are not so sure.

Considering overlayfs

One current shortcoming with overlayfs, Scrivano said, is that, unlike composefs, it is unable to provide fs-verity protection for the filesystem as a whole. Any changes that affect the overlay layer would bypass that protection. Larsson described an overlayfs configuration that could work (with some improvements to overlayfs), but was unenthusiastic about that option:

However, the main issue I have with the overlayfs approach is that it is sort of clumsy and over-complex. Basically, the composefs approach is laser focused on read-only images, whereas the overlayfs approach just chains together technologies that happen to work, but also do a lot of other stuff.

Among other things, he said, the extra complexity in the overlayfs solution leads to worse performance. The benchmark he used to show this was to create a filesystem using both approaches, then measure the time required to execute an "ls -lR" of the whole thing. In the cold-cache case, the overlayfs solution took about four times as long to run; the performance in the warm-cache case was more comparable, but composefs was still faster.

Goldstein strongly contested the characterization of the overlayfs solution; he also started an extended sub-thread on whether the "ls -lR" benchmark made sense. He added: "I see. composefs is really very optimized for ls -lR. Now only need to figure out if real users start a container and do ls -lR without reading many files is a real life use case." Dave Chinner jumped in to defend this test:

Cold cache performance dominates the runtime of short lived containers as well as high density container hosts being run to their container level memory limits. `ls -lR` is just a microbenchmark that demonstrates how much better composefs cold cache behaviour is than the alternatives being proposed.

He added that he has used similar tests to benchmark filesystems for many years and has never had to justify it to anybody. Larsson, meanwhile, explained the emphasis that is being placed on this performance metric (or "key performance indicator" — KPI) this way:

My current work is in automotive, which wants to move to a containerized workload in the car. The primary KPI is cold boot performance, because there are legal requirements for the entire system to boot in 2 seconds. It is also quite typical to have shortlived containers in cloud workloads, and startup time there is very important. In fact, the last few months I've been primarily spending on optimizing container startup performance (as can be seen in the massive improvements to this in the upcoming podman 4.4).

Goldstein finally accepted the importance of this metric and suggested that overlayfs could be changed to provide better cold-cache performance as well. Larsson answered that, if overlayfs could be modified to address the performance gap, it might be the better solution; he also raised doubts as to whether the performance gap could really be closed and whether it made sense to add more complexity to overlayfs.

The conclusion from this part of the discussion was that some experimentation with overlayfs made sense to see whether it is a viable option or not. Overlayfs maintainer Miklos Szeredi has been mostly absent from the discussion, but did briefly indicate that some of the proposed changes might make sense.

The EROFS option

There was another option that came up a number of times in the discussion, though: enhance the EROFS filesystem to include the functionality provided by composefs. This filesystem, which is designed for embedded, read-only operation, already has fs-verity support. EROFS developer Gao Xiang has repeatedly said that the filesystem could be enhanced to implement the features provided by composefs; indeed, much of that functionality is already there as part of the Nydus mechanism. Scrivano has questioned this idea, though:

Sure composefs is quite simple and you could embed the composefs features in EROFS and let EROFS behave as composefs when provided a similar manifest file. But how is that any better than having a separate implementation that does just one thing well instead of merging different paradigms together?

Gao has suggested that the composefs developers will, sooner or later, want to add support for storing file data (rather than just the manifest metadata), at which point composefs will start to look more like an ordinary filesystem. At such a point, the argument for separating it from other filesystems would not be so strong.

A few performance issues in EROFS (for this use case) were identified in the course of the discussion, and various fixes have been implemented. Jingbo Xu has run a number of benchmarks to measure the results of patches to both EROFS and overlayfs, but none yet have shown that either of the other two options can outperform composefs. That work is still in an early state, though.

As might be imagined, this sprawling conversation did not come to any sort of consensus with regard to whether it makes sense to merge composefs or to, instead, put development effort into one of the alternatives. Chances are that no such conclusion will be reached for the next few months. This is just the sort of decision that the upcoming LSFMM/BPF summit was created to resolve; chances are that there will be an interesting discussion at that venue.

Index entries for this article
Kernel	Filesystems/composefs

Debating composefs

Posted Feb 16, 2023 16:07 UTC (Thu) by bluca (subscriber, #118303) [Link] (9 responses)

> One current shortcoming with overlayfs, Scrivano said, is that, unlike composefs, it is unable to provide fs-verity protection for the filesystem as a whole. Any changes that affect the overlay layer would bypass that protection.

I don't follow this comment - you can very trivially setup a read-only overlayfs (by using multiple lowerdir=, and no upperdir=/workdir=) from multiple read-only volumes, each protected by dm-verity. What changes are there that could bypass this protection? I'm not aware of any.

Debating composefs

Posted Feb 16, 2023 17:20 UTC (Thu) by gscrivano (subscriber, #74830) [Link] (8 responses)

I was commenting on using overlayfs alone, even if it gains support to fs verify the file payload referred by metacopy. As you've pointed out, there is still a need for another read-only file system to protect the image.

Debating composefs

Posted Feb 16, 2023 20:28 UTC (Thu) by bluca (subscriber, #118303) [Link] (7 responses)

It depends on what you define as 'protection'. Even if the filesystems are writable, a read-only overlayfs will be, well, read-only (ie: no workdir/upperdir, only a series of lowerdir), so a tenant that only sees the overlay won't be able to modify it.
If you want integrity and authentication on top of that, yes, you need the lower layers to provide that. But even there, with overlayfs + dm-verity the protection is so much stronger: you get cryptographic authentication enforced by the kernel (via roothash signature), which on top of everything else also allows to verify the volume _before_ using it, and without being vulnerable to TOCTOU/symlink chasing shenanigans. As the kernel devs love to remind us every time they can, fs drivers are not hardened against rogue filesystems, so for use cases where security is important, this is kinda fundamental to have these layers of defence in place.

Debating composefs

Posted Feb 16, 2023 20:47 UTC (Thu) by walters (subscriber, #7396) [Link] (4 responses)

> fs drivers are not hardened against rogue filesystems,

Yes, but you have that issue if you have *any* persistent locally mutable mounted Linux filesystems, whether or not they are used as the backing storage for container images. I think in cases that would be deploying composefs, they'd already be using those anyways for local system state.

I suspect going forward though what may be viable for these sorts of things is to do a fast "metadata only fsck" on boot - perhaps even from an isolated userspace process written in a memory-safe programming language.

> for use cases where security is important,

Security is not a binary thing. It depends on your threat model, and there's always costs/benefits to different approaches. Let's please not imply that security isn't important to the people working on this. It is. You could instead say e.g. "Is vulnerable to local filesystem corruption attacks", not that it's "not secure".

Debating composefs

Posted Feb 16, 2023 22:17 UTC (Thu) by bluca (subscriber, #118303) [Link] (3 responses)

> Yes, but you have that issue if you have *any* persistent locally mutable mounted Linux filesystems, whether or not they are used as the backing storage for container images. I think in cases that would be deploying composefs, they'd already be using those anyways for local system state.
I suspect going forward though what may be viable for these sorts of things is to do a fast "metadata only fsck" on boot - perhaps even from an isolated userspace process written in a memory-safe programming language.

Yes indeed, verification needs to extend to all volumes. On an immutable OS it's important that the root/usr partition is also dm-verity verified for the same reason. And writable storage should get the luks + attestation treatment. The only real gap we have is the ESP, because it needs to be readable by firmware - the hope there is that fat32 is so simple as a filesystem, that it's difficult to use such an attack vector.

> Security is not a binary thing. It depends on your threat model, and there's always costs/benefits to different approaches. Let's please not imply that security isn't important to the people working on this. It is. You could instead say e.g. "Is vulnerable to local filesystem corruption attacks", not that it's "not secure".

Where did I say that it's not secure? As you correctly wrote, security is not binary, it's a spectrum. What I said is that the overlay + dm-verity model has stronger guarantees, ie, it sits further on that spectrum than what is being described in the article, and I've explained why I think so - which does not mean what the article describes is 'not secure' tout court.

Debating composefs

Posted Feb 16, 2023 22:52 UTC (Thu) by walters (subscriber, #7396) [Link] (2 responses)

> And writable storage should get the luks + attestation treatment.

The threat model I think many are concerned with (and definitely is in the composefs threat model AFAIK) is specifically "anti-persistence". Assuming an attacker gets root on the node with e.g. a chain from web browser flaw to kernel flaw, can they persist across a reboot by mutating persistent state that will be read on the next boot?

The https://chromium.googlesource.com/chromiumos/docs/+/maste... gives one example.

LUKS doesn't prevent code that has gained kernel mode execution from directly mutating the block device in such a way that the ext4/xfs/btrfs/whatever filesystem would get confused on the next boot and perform a double free or whatever and get chained into regaining kernel mode memory execution. I'm not sure which type of attestation you're referring to but I don't think it will either.

The other main threat is "evil maid" attacks (I really dislike that term but like "pets versus cattle" it's gained too much industry prominence to ignore) - i.e. the local attacker gaining access to the disk or physical console cases? I believe this is covered too.

What I'm saying in short is that I think composefs on a single persistent Linux filesystem provides the same protection that a system using a hybrid of dm-verity for OS state and a distinct persistent volume would. (Applying the same chosen LUKS/dm-integrity/etc stack to the persistent volume(s))

> Where did I say that it's not secure?

You didn't, but I think an objective observer would certainly feel it was reasonable to conclude you were strongly implying it by saying "so for use cases where security is important".

Overall, I do enjoy talking with you in these LWN comment threads periodically, but it seems to me we have been getting a bit repetitive and I'm sure some readers feel the same way. There's definitely been some email threads on e.g. fedora-devel@ where I could see the Subject: line and just from the names of the people replying I knew *exactly* what they were going to say (and was right 70% of the time). Let's aim to not be those people ;)

> What I said is that the overlay + dm-verity model has stronger guarantees

Yes, but I'm again saying that's moot if you have other persistent volumes that aren't dm-verity. That's the specific claim, so let's try to really narrow in on this specifically - I admit I could be wrong, there is a lot of nuance and subtlety in all of this.

Debating composefs

Posted Feb 17, 2023 19:46 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

> What I'm saying in short is that I think composefs on a single persistent Linux filesystem provides the same protection that a system using a hybrid of dm-verity for OS state and a distinct persistent volume would. (Applying the same chosen LUKS/dm-integrity/etc stack to the persistent volume(s))

It is absolutely true that if you can modify a local disk's filesystem superblock then you can in theory backdoor your way in. But that assumes you already have root privileges and arbitrary code execution on that system - ie, you need a separate, pre-existing attack vector. What I am pointing out is that images downloaded from the network and opened (like for containers) can be said attack vectors, when they are not accompanied with a signature that is checked by the kernel before opening them.

The local persistent volume for local state is created from a known-good state (ie, by the trusted kernel/userspace on firstboot or provision), so the threat model is slightly different. That is not to say it's not a risk - it absolutely is, and any mitigations for that scenario are great. But it's a separate scenario.

Debating composefs

Posted Feb 17, 2023 22:06 UTC (Fri) by walters (subscriber, #7396) [Link]

> What I am pointing out is that images downloaded from the network and opened (like for containers) can be said attack vectors, when they are not accompanied with a signature that is checked by the kernel before opening them.

Hmm. Re-reading the threads, sorry - I think it may have been me that took the conversation off in the wrong direction. The overall article was about composefs, then you replied to a comment from giuseppe which was actually about the "overlayfs+erofs instead of composefs", and I sort of misread the comments as being about composefs.

I honestly haven't dug completely into the overlayfs+erofs suggestion; it strikes me as a bit hacky but I also do see Amir's point about maintenance long term. For composefs the semantics around verifying the image seem to me to be much clearer, and that's what I thought we were debating - sorry!

Debating composefs

Posted Feb 16, 2023 22:04 UTC (Thu) by walters (subscriber, #7396) [Link] (1 responses)

> without being vulnerable to TOCTOU/symlink chasing shenanigans.

Can you elaborate on this? Where would be the check versus use here?

If you're talking about offline code swapping or creating a symlink for the underlying content files, composefs is going to detect that because it compares the expected fsverity digest with the one it found when opening the file, before returning to userspace.

Looking at code and thinking about symlinks, perhaps https://github.com/containers/composefs/blob/18a6301a40aa... should be passing O_NOFOLLOW, but per above I don't think that matters much.

Debating composefs

Posted Feb 16, 2023 22:24 UTC (Thu) by bluca (subscriber, #118303) [Link]

I am specifically referring to signature checks. There is a school of thought (not related to the composefs work, recently I had to object to a proposal to remove/deprecate kernel signature support for fs-verity) that says it's enough to verify a verity roothash signature in userspace, and then later pass the verity object to the kernel for opening and using. To me this seems like a textbook case of toctou waiting to happen...

Debating composefs

Posted Feb 16, 2023 19:10 UTC (Thu) by jhoblitt (subscriber, #77733) [Link] (3 responses)

Regardless of the outcome, I'm pleased to see attention to container start up times. I have a significant workload that uses 20GIB OCI images.

Debating composefs

Posted Feb 17, 2023 12:14 UTC (Fri) by benlongo (guest, #132381) [Link] (1 responses)

I’m curious what comprises that 20G - is it data or all binaries?

Debating composefs

Posted Feb 18, 2023 21:58 UTC (Sat) by jhoblitt (subscriber, #77733) [Link]

There is some test data but it is mostly internally built conda packages.

Debating composefs

Posted Feb 21, 2023 13:29 UTC (Tue) by amarao (guest, #87073) [Link]

If someone makes containers starting 10 times faster, someone will definitively find idea of running whole zoo of images at once attractive, and we will have 200Gb of images with the same start time constrains as now.

Debating composefs

Posted Feb 17, 2023 2:37 UTC (Fri) by hsiangkao (guest, #123981) [Link] (2 responses)

> This filesystem, which is designed for embedded, read-only operation, already has fs-verity support.

Some little correction:
"Currently, EROFS doesn't have a self-contained fs-verity like other local filesystems, but we'd like to support it later with some further discussion (like we'd like to provide a choise to verify bdev inodes or likewise rather than a regular inode). since it could make merkle-tree built-in (Alexander Larsson have pointed out they also may dislike DM distribution) and shipped together with block/non-block-device (like pmem or file-based, or later mtd-based) cases.

EROFS is not only designed for embedded use cases in the beginning and we've always landed EROFS+fsdax for secure containers (kata-containers) and EROFS+fscache for runC containers. Nydus is just one of a current user-space example for such container use cases, but EROFS filesystem can also be used without Nydus as well."

I really would like that (one day) in-kernel iomap infrastructure finally would support block-based, file-based distribution, and finally it could have a friendly connection with fscache + cachefiles to support in-kernel caching in addition to netfs. That makes the whole things less fragmented (many users don't like loopback devices for whatever reasons) and simplify the current EROFS fscache implementation a lot.

Debating composefs

Posted Feb 17, 2023 19:22 UTC (Fri) by bluca (subscriber, #118303) [Link] (1 responses)

Is in-memory deduplication on your road map? E.g., imagine two processes running from two different EROFS images, that happen to both contain the exact same libc6. It would be really nice if only one copy of said libc6 was held in memory.

Debating composefs

Posted Feb 18, 2023 0:58 UTC (Sat) by hsiangkao (guest, #123981) [Link]

Yes, we’ve already sent several versions of this feature,
https://lore.kernel.org/linux-fsdevel/20230203030143.7310...

Due to lunar new year vacation, I don’t have enough slot to review it. We have to delay it for the next merge window.