LWN: Comments on "Debating composefs"

Debating composefs

amarao — Tue, 21 Feb 2023 13:29:06 +0000

If someone makes containers start 10 times faster, someone will definitely find the idea of running a whole zoo of images at once attractive, and we will have 200GB of images with the same start time constraints as now.

Debating composefs

jhoblitt — Sat, 18 Feb 2023 21:58:38 +0000

There is some test data but it is mostly internally built conda packages.

Debating composefs

hsiangkao — Sat, 18 Feb 2023 00:58:12 +0000

Yes, we’ve already sent several versions of this feature,
https://lore.kernel.org/linux-fsdevel/20230203030143.7310...

Due to the Lunar New Year vacation, I didn’t have enough time to review it. We have to delay it to the next merge window.

Debating composefs

walters — Fri, 17 Feb 2023 22:06:52 +0000

> What I am pointing out is that images downloaded from the network and opened (like for containers) can be said attack vectors, when they are not accompanied with a signature that is checked by the kernel before opening them.

Hmm. Re-reading the threads, sorry - I think it may have been me that took the conversation off in the wrong direction. The overall article was about composefs, then you replied to a comment from giuseppe which was actually about the "overlayfs+erofs instead of composefs", and I sort of misread the comments as being about composefs.

I honestly haven't dug completely into the overlayfs+erofs suggestion; it strikes me as a bit hacky but I also do see Amir's point about maintenance long term. For composefs the semantics around verifying the image seem to me to be much clearer, and that's what I thought we were debating - sorry!

Debating composefs

bluca — Fri, 17 Feb 2023 19:46:57 +0000

> What I'm saying in short is that I think composefs on a single persistent Linux filesystem provides the same protection that a system using a hybrid of dm-verity for OS state and a distinct persistent volume would. (Applying the same chosen LUKS/dm-integrity/etc stack to the persistent volume(s))

It is absolutely true that if you can modify a local disk's filesystem superblock then you can in theory backdoor your way in. But that assumes you already have root privileges and arbitrary code execution on that system - ie, you need a separate, pre-existing attack vector. What I am pointing out is that images downloaded from the network and opened (like for containers) can be said attack vectors, when they are not accompanied with a signature that is checked by the kernel before opening them.

The local persistent volume for local state is created from a known-good state (ie, by the trusted kernel/userspace on firstboot or provision), so the threat model is slightly different. That is not to say it's not a risk - it absolutely is, and any mitigations for that scenario are great. But it's a separate scenario.

Debating composefs

bluca — Fri, 17 Feb 2023 19:22:53 +0000

Is in-memory deduplication on your road map? E.g., imagine two processes running from two different EROFS images, that happen to both contain the exact same libc6. It would be really nice if only one copy of said libc6 was held in memory.

Debating composefs

benlongo — Fri, 17 Feb 2023 12:14:38 +0000

I’m curious what comprises that 20G - is it data or all binaries?

Debating composefs

hsiangkao — Fri, 17 Feb 2023 02:37:28 +0000

> This filesystem, which is designed for embedded, read-only operation, already has fs-verity support.

A small correction:
"Currently, EROFS doesn't have a self-contained fs-verity like other local filesystems, but we'd like to support it later after some further discussion (for example, we'd like to provide a choice to verify bdev inodes or the like rather than a regular inode), since it could make the Merkle tree built-in (Alexander Larsson has pointed out they may also dislike DM distribution) and shipped together with block/non-block-device (like pmem or file-based, or later mtd-based) cases.

EROFS was not designed only for embedded use cases in the beginning, and we've always landed EROFS+fsdax for secure containers (kata-containers) and EROFS+fscache for runC containers. Nydus is just one current user-space example for such container use cases, but the EROFS filesystem can also be used without Nydus."

I really would like the in-kernel iomap infrastructure to (one day) finally support block-based and file-based distribution, and to have a friendly connection with fscache + cachefiles to support in-kernel caching in addition to netfs. That would make the whole thing less fragmented (many users don't like loopback devices for whatever reason) and simplify the current EROFS fscache implementation a lot.

Debating composefs

walters — Thu, 16 Feb 2023 22:52:58 +0000

> And writable storage should get the luks + attestation treatment.

The threat model I think many are concerned with (and definitely is in the composefs threat model AFAIK) is specifically "anti-persistence". Assuming an attacker gets root on the node with e.g. a chain from web browser flaw to kernel flaw, can they persist across a reboot by mutating persistent state that will be read on the next boot?

The https://chromium.googlesource.com/chromiumos/docs/+/maste... gives one example.

LUKS doesn't prevent code that has gained kernel mode execution from directly mutating the block device in such a way that the ext4/xfs/btrfs/whatever filesystem would get confused on the next boot and perform a double free or whatever and get chained into regaining kernel mode memory execution. I'm not sure which type of attestation you're referring to but I don't think it will either.

The other main threat is "evil maid" attacks (I really dislike that term but like "pets versus cattle" it's gained too much industry prominence to ignore) - i.e. cases where a local attacker gains access to the disk or the physical console. I believe this is covered too.

What I'm saying in short is that I think composefs on a single persistent Linux filesystem provides the same protection that a system using a hybrid of dm-verity for OS state and a distinct persistent volume would. (Applying the same chosen LUKS/dm-integrity/etc stack to the persistent volume(s))

> Where did I say that it's not secure?

You didn't, but I think an objective observer would certainly feel it was reasonable to conclude you were strongly implying it by saying "so for use cases where security is important".

Overall, I do enjoy talking with you in these LWN comment threads periodically, but it seems to me we have been getting a bit repetitive and I'm sure some readers feel the same way. There's definitely been some email threads on e.g. fedora-devel@ where I could see the Subject: line and just from the names of the people replying I knew *exactly* what they were going to say (and was right 70% of the time). Let's aim to not be those people ;)

> What I said is that the overlay + dm-verity model has stronger guarantees

Yes, but I'm again saying that's moot if you have other persistent volumes that aren't dm-verity. That's the specific claim, so let's try to really narrow in on this specifically - I admit I could be wrong, there is a lot of nuance and subtlety in all of this.

Debating composefs

bluca — Thu, 16 Feb 2023 22:24:10 +0000

I am specifically referring to signature checks. There is a school of thought (not related to the composefs work, recently I had to object to a proposal to remove/deprecate kernel signature support for fs-verity) that says it's enough to verify a verity roothash signature in userspace, and then later pass the verity object to the kernel for opening and using. To me this seems like a textbook case of TOCTOU waiting to happen...
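
To make the check-versus-use gap concrete, here is a rough userspace analogy in Python. It is not the fs-verity signature mechanism: a plain SHA-256 of the file stands in for the verified roothash, and the `between` callback is a hypothetical hook that simulates an attacker winning the race. The first function checks by path and then reopens by path; the second reads once and verifies the same bytes it returns.

```python
import hashlib
import os
import tempfile


def digest_of_path(path):
    """SHA-256 of a file's contents (a stand-in for a verified roothash)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def racy_check_then_use(path, expected, between=lambda: None):
    """Check by path, then reopen by path: the file can change in between."""
    if digest_of_path(path) != expected:
        raise ValueError("digest mismatch")
    between()  # simulated attacker action between check and use
    with open(path, "rb") as f:
        return f.read()


def read_once_then_check(path, expected, between=lambda: None):
    """Read once, then verify and return those same bytes."""
    with open(path, "rb") as f:
        data = f.read()
    between()  # swapping the path now cannot affect `data`
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("digest mismatch")
    return data


def toctou_demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "image")
        expected = hashlib.sha256(b"trusted").hexdigest()

        def write_trusted():
            with open(path, "wb") as f:
                f.write(b"trusted")

        def swap():
            # Atomically replace the checked file with different content.
            tmp = path + ".new"
            with open(tmp, "wb") as f:
                f.write(b"evil")
            os.replace(tmp, path)

        write_trusted()
        racy = racy_check_then_use(path, expected, swap)
        write_trusted()
        safe = read_once_then_check(path, expected, swap)
        return racy, safe
```

In this sketch the racy variant ends up returning the swapped content even though its check passed, which is the pattern being objected to; having the component that uses the object also perform the check closes that window.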

Debating composefs

bluca — Thu, 16 Feb 2023 22:17:59 +0000

> Yes, but you have that issue if you have *any* persistent locally mutable mounted Linux filesystems, whether or not they are used as the backing storage for container images. I think in cases that would be deploying composefs, they'd already be using those anyways for local system state.
> I suspect going forward though what may be viable for these sorts of things is to do a fast "metadata only fsck" on boot - perhaps even from an isolated userspace process written in a memory-safe programming language.

Yes indeed, verification needs to extend to all volumes. On an immutable OS it's important that the root/usr partition is also dm-verity verified for the same reason. And writable storage should get the luks + attestation treatment. The only real gap we have is the ESP, because it needs to be readable by firmware - the hope there is that fat32 is so simple as a filesystem, that it's difficult to use such an attack vector.

> Security is not a binary thing. It depends on your threat model, and there's always costs/benefits to different approaches. Let's please not imply that security isn't important to the people working on this. It is. You could instead say e.g. "Is vulnerable to local filesystem corruption attacks", not that it's "not secure".

Where did I say that it's not secure? As you correctly wrote, security is not binary, it's a spectrum. What I said is that the overlay + dm-verity model has stronger guarantees, ie, it sits further on that spectrum than what is being described in the article, and I've explained why I think so - which does not mean what the article describes is 'not secure' tout court.

Debating composefs

walters — Thu, 16 Feb 2023 22:04:55 +0000

> without being vulnerable to TOCTOU/symlink chasing shenanigans.

Can you elaborate on this? Where would be the check versus use here?

If you're talking about offline code swapping or creating a symlink for the underlying content files, composefs is going to detect that because it compares the expected fsverity digest with the one it found when opening the file, before returning to userspace.
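
A minimal userspace sketch of that "compare on open, before handing anything back" idea, in Python. This is only an analogy under loud assumptions: real fs-verity digests are derived from a Merkle tree and maintained by the kernel, whereas here a plain SHA-256 over the whole file is a stand-in, and `open_verified` is a hypothetical helper, not a composefs API.

```python
import hashlib
import os


def open_verified(path, expected_hex):
    """Open `path` and return the fd only if its digest matches.

    The digest is computed through the same fd that is returned, so the
    caller never receives a handle to content that failed the check.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        h = hashlib.sha256()  # stand-in for the fs-verity digest
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            h.update(chunk)
        if h.hexdigest() != expected_hex:
            raise PermissionError("digest mismatch for " + path)
        os.lseek(fd, 0, os.SEEK_SET)  # rewind for the caller
        return fd
    except BaseException:
        os.close(fd)
        raise
```

Note that for an ordinary mutable file this check would still be racy after the fact, since the content could be rewritten through another handle; the scheme relies on fs-verity files being immutable once verity is enabled.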

Looking at the code and thinking about symlinks, perhaps https://github.com/containers/composefs/blob/18a6301a40aa... should be passing O_NOFOLLOW, but per above I don't think that matters much.
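
For readers unfamiliar with the flag, here is a small Python sketch of what O_NOFOLLOW does from userspace on Linux. The file names are made up for the demo, and this says nothing about how the composefs code linked above is structured; it only shows the open(2) behaviour.

```python
import errno
import os
import tempfile


def open_no_symlink(path):
    """Open read-only, refusing if the final path component is a symlink."""
    return os.open(path, os.O_RDONLY | os.O_NOFOLLOW)


def symlink_is_refused():
    with tempfile.TemporaryDirectory() as d:
        real = os.path.join(d, "object")
        link = os.path.join(d, "link")
        with open(real, "wb") as f:
            f.write(b"payload")
        os.symlink(real, link)

        # A regular file opens normally.
        os.close(open_no_symlink(real))

        # A symlink as the last component is rejected (ELOOP on Linux).
        try:
            fd = open_no_symlink(link)
        except OSError as e:
            return e.errno == errno.ELOOP
        os.close(fd)
        return False
```

O_NOFOLLOW only guards the final component of the path; symlinks in earlier directory components are still followed.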

Debating composefs

walters — Thu, 16 Feb 2023 20:47:22 +0000

> fs drivers are not hardened against rogue filesystems,

Yes, but you have that issue if you have *any* persistent locally mutable mounted Linux filesystems, whether or not they are used as the backing storage for container images. I think in cases that would be deploying composefs, they'd already be using those anyways for local system state.

I suspect going forward though what may be viable for these sorts of things is to do a fast "metadata only fsck" on boot - perhaps even from an isolated userspace process written in a memory-safe programming language.

> for use cases where security is important,

Security is not a binary thing. It depends on your threat model, and there's always costs/benefits to different approaches. Let's please not imply that security isn't important to the people working on this. It is. You could instead say e.g. "Is vulnerable to local filesystem corruption attacks", not that it's "not secure".

Debating composefs

bluca — Thu, 16 Feb 2023 20:28:06 +0000

It depends on what you define as 'protection'. Even if the filesystems are writable, a read-only overlayfs will be, well, read-only (ie: no workdir/upperdir, only a series of lowerdir), so a tenant that only sees the overlay won't be able to modify it.
If you want integrity and authentication on top of that, yes, you need the lower layers to provide that. But even there, with overlayfs + dm-verity the protection is so much stronger: you get cryptographic authentication enforced by the kernel (via roothash signature), which on top of everything else also allows verifying the volume _before_ using it, and without being vulnerable to TOCTOU/symlink chasing shenanigans. As the kernel devs love to remind us every time they can, fs drivers are not hardened against rogue filesystems, so for use cases where security is important, it is kinda fundamental to have these layers of defence in place.

Debating composefs

jhoblitt — Thu, 16 Feb 2023 19:10:10 +0000

Regardless of the outcome, I'm pleased to see attention to container start up times. I have a significant workload that uses 20 GiB OCI images.

Debating composefs

gscrivano — Thu, 16 Feb 2023 17:20:31 +0000

I was commenting on using overlayfs alone, even if it gains support for fs-verity verification of the file payload referred to by metacopy. As you've pointed out, there is still a need for another read-only file system to protect the image.

Debating composefs

bluca — Thu, 16 Feb 2023 16:07:01 +0000

> One current shortcoming with overlayfs, Scrivano said, is that, unlike composefs, it is unable to provide fs-verity protection for the filesystem as a whole. Any changes that affect the overlay layer would bypass that protection.

I don't follow this comment - you can very trivially set up a read-only overlayfs (by using multiple lowerdir=, and no upperdir=/workdir=) from multiple read-only volumes, each protected by dm-verity. What changes are there that could bypass this protection? I'm not aware of any.
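
As a rough illustration of the setup described, the Python sketch below only builds the mount option string and command line; it does not mount anything. The layer paths and mount point are hypothetical, and each layer is assumed to be an already-mounted, dm-verity-backed read-only volume.

```python
def readonly_overlay_options(lowerdirs):
    """Option string for an overlayfs mount made only of lower layers.

    No upperdir= or workdir= is given, so the overlay has nowhere to
    record changes and is read-only.
    """
    if not lowerdirs:
        raise ValueError("at least one lower layer is required")
    for d in lowerdirs:
        # Keep the sketch simple: reject paths that would need escaping.
        if ":" in d or "," in d:
            raise ValueError("unsupported character in path: " + d)
    # Layers are listed from topmost to bottommost.
    return "ro,lowerdir=" + ":".join(lowerdirs)


def overlay_mount_command(lowerdirs, target):
    """argv for mount(8); run as root if you actually want to mount."""
    return ["mount", "-t", "overlay", "overlay",
            "-o", readonly_overlay_options(lowerdirs), target]


# Hypothetical dm-verity-backed layers and mount point.
cmd = overlay_mount_command(["/run/layers/app", "/run/layers/base"], "/run/merged")
```

Here `cmd` carries the option string `ro,lowerdir=/run/layers/app:/run/layers/base`, with the app layer stacked on top of the base layer.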