The PuzzleFS container filesystem
PuzzleFS, Miculas began, is an immutable (and thus read-only) filesystem that shares design goals with the Open Container Initiative (OCI) v2 image specification. It uses content-defined chunking (discussed further shortly) and a content-addressed data store, with file data and metadata stored separately from each other. The project was started by Tycho Andersen in 2021 as an attempt to create a successor to atomfs.
The first version of the OCI image specification, he said, had a number of problems, many of which are described in this 2019 blog post by Aleksa Sarai. At the base of those problems is the dependence on tar archives to hold the layers in the filesystem. Tar, as it turns out, is not well suited to the container filesystem problem.
The format, he said, is poorly defined. It has no index; instead, there is just a header leading directly into the content. The compression mechanism used means that the filesystem image is not seekable; as a result, the entire filesystem must be decompressed even to extract a single file. There is no deduplication; even a small change means re-downloading the entire thing, though layers can be used to work around that problem to an extent. It is machine-dependent, in that directory entries can appear in a different order on different systems. The lack of a canonical representation has led to a lot of extensions, many of which solve the same problem.
PuzzleFS is intended to solve these problems. A filesystem image itself consists of a set of files placed on an underlying filesystem. As with the OCI image format, there is a top-level index.json file that contains a set of tags, each of which represents a versioned filesystem and points to a manifest file. The manifest file, in turn, points to an image configuration and the data stored in the actual image layers. Everything else is stored as a set of blobs in the blobs/sha256 directory.
Most data in the filesystem is broken into variable-size chunks, then stored as blobs using the SHA256 hash of the content as the file name. The chunking itself is done with the FastCDC algorithm, which finds "cut points" where a data stream can be split into blobs of varying sizes. Any given stream (the contents of a file, for example) might be split into five or 50 chunks, depending on how those cut points are determined; each chunk then lands as a separate blob under blobs/sha256, and its hash is added to the manifest.
The cut-point algorithm itself uses a sliding-window technique. The data is stepped through, byte by byte, and the hash of the last 48 bytes (for example) is calculated at each position. If the N least-significant bits of that hash are zero, then a cut point has been found; the data up to that point is split off into a separate chunk and the process starts anew.
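This search can be sketched in a few lines. Real FastCDC uses a "gear" rolling hash and normalized chunk sizes; the sketch below simply rehashes a fixed window at every position and declares a cut when the low-order bits of the hash are zero. The window size and mask are illustrative values, not PuzzleFS's actual parameters.

```python
# Toy content-defined chunker: a cut point is declared wherever the
# low-order bits of the hash of the trailing window are all zero.
# (FastCDC uses a much faster rolling "gear" hash; SHA-256 here is
# only for illustration.)
import hashlib

WINDOW = 48             # bytes hashed at each position (illustrative)
MASK = (1 << 13) - 1    # 13 low bits zero => ~8KiB average chunk size

def chunk(data: bytes, window: int = WINDOW, mask: int = MASK):
    """Yield chunks of `data`, split at content-defined cut points."""
    start = 0
    for i in range(window, len(data)):
        # Hash the window ending at position i.
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:8], "big")
        if h & mask == 0:           # low-order bits all zero: cut here
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]          # final, possibly short, chunk
```

Because the cut decision depends only on the bytes in the window, the same data always splits the same way, regardless of where it sits in the stream.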
This algorithm has some interesting characteristics, perhaps most notably its ability to perform deduplication and compression. Since each chunk is stored using its hash as the file name, chunks that are common to multiple files will automatically be shared. In traditional schemes, an update to a file will cause the entire new file to be stored; this is especially true if any bytes are inserted or deleted. Inserting a single byte into a traditionally compressed file will make the entire file after the insertion look different. With content-defined chunking, only the chunk containing the change will differ, while the rest of the file will contain the same chunks, perhaps at a different offset.
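The dedup effect is easy to demonstrate with a toy chunker and a content-addressed store keyed by SHA-256, as in the blobs/sha256 directory. The window and mask values here are illustrative, not PuzzleFS's actual parameters.

```python
# Insert one byte into a stream and most chunks, identified by their
# SHA-256 blob names, stay identical: cut points depend only on local
# window content, so only the chunk(s) around the insertion change.
import hashlib
import random

def cut_points(data, window=32, mask=(1 << 10) - 1):
    """Return (start, end) chunk boundaries found by the rolling window."""
    cuts, start = [], 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha256(data[i - window:i]).digest()[:8], "big")
        if h & mask == 0:
            cuts.append((start, i))
            start = i
    cuts.append((start, len(data)))
    return cuts

def blob_hashes(data):
    """A content-addressed store: one SHA-256 name per chunk."""
    return {hashlib.sha256(data[a:b]).hexdigest() for a, b in cut_points(data)}

random.seed(1)
original = bytes(random.getrandbits(8) for _ in range(100000))
modified = original[:50000] + b"\x00" + original[50000:]  # insert one byte

# Chunks before and after the insertion keep the same content, so their
# blob names are unchanged and are stored only once.
shared = blob_hashes(original) & blob_hashes(modified)
```

With a traditional whole-file compression scheme, everything after the inserted byte would look different; here, nearly every blob is shared between the two versions.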
The results can be seen in an experiment that Miculas carried out. He downloaded ten different versions of Ubuntu 22.04 from the Docker Hub; they required 766MB to store in that form. Putting them into the OCI image format with compression reduced that size to 282MB. Placing them all into a PuzzleFS instance, instead, reduced the size to 130MB — without using compression. Adding compression cut the whole thing down to 53MB, a savings of 93% from the original size.
A goal of PuzzleFS was to always provide a canonical representation of the filesystem. Thus, for example, the traversal order of the source from which it is generated is defined, with both directories and extended attributes being sorted lexicographically. Another goal was direct mounting support. With tar-based formats, the files must first be extracted to an on-disk representation, creating a window where things could be changed before mounting the image. Thus, there is no guarantee that the files seen by the kernel are the same as those that were in the tar archive. PuzzleFS doesn't have that extraction step, so that problem does not exist.
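The canonical-ordering idea can be illustrated with a deterministic tree walk; this is a sketch of the principle, not PuzzleFS's actual builder code.

```python
# Canonical directory traversal: entries are visited in sorted
# (lexicographic) order, so two machines building an image from the
# same tree always see files in the same sequence.
import os

def canonical_walk(root):
    """Yield file paths under `root` in a deterministic, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()              # os.walk honors in-place sorting
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Extended attributes would similarly be emitted in sorted order, so that the same source tree always produces byte-identical image metadata.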
Data integrity is an important objective in general. It is not possible to use dm-verity in this case to protect the whole volume; while the filesystem is immutable, the underlying data store is not, since adding a new version or layer requires the ability to add new data. So, instead, fs-verity is used to verify the integrity of the individual files in the data store. When mounting a specific image, the hash of the manifest of interest is given to the mount for verification.
An important objective behind this project was the avoidance of memory-safety bugs. For that reason, the filesystem implementation has been written in Rust. That choice has, he said, removed a lot of pain from the development process.
There is a working FUSE implementation, and a kernel implementation in a proof-of-concept state. The kernel side depends on a set of filesystem-interface abstractions that are being developed separately, and which should be headed toward the mainline at some point. Some other work is needed to get other dependencies, including the Cap'n Proto library used for metadata storage, into proper shape for the kernel. The work is progressing, though; interested folks can find the current code in this repository.
One topic Miculas did not address is the resemblance between PuzzleFS and composefs, which shares some similar goals. Composefs has run into difficulties getting into the mainline kernel — though other changes intended to support those goals are going in. PuzzleFS has some features that composefs lacks; whether those will be enough to make its upstream path easier is unclear.
See the slides from this talk for more information and details of the on-disk format.
[Thanks to the Linux Foundation for supporting our travel to this event.]
Index entries for this article:
Kernel: Filesystems
Conference: Kangrejos/2023
Posted Sep 26, 2023 5:49 UTC (Tue)
by josh (subscriber, #17465)
[Link]
I'm wondering how easily PuzzleFS could build atop a remote object store rather than a local one?
Posted Sep 26, 2023 7:31 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (18 responses)
A side note is that EROFS has already supported a variant of CDC with compression since Linux v6.1. It would be nice to make a comparison to some extent too, so I could know what people care about more for our further enhancements. I know people enjoy new stuff all the time, but it would be nice to get more hints about why state-of-the-art projects don't work well, apart from Rust adoption.
Similar to what Christian said before [3], AFAIK, there were already four container image approaches proposed for the Linux kernel in addition to Composefs:
[1] https://0pointer.net/blog/casync-a-tool-for-distributing-...
Posted Sep 26, 2023 8:05 UTC (Tue)
by alexl (subscriber, #19068)
[Link] (6 responses)
PuzzleFS (as well as e.g. nydus, etc) does block-chunking de-duplication, whereas composefs does entire-file de-duplication. Yes, the block-chunking approach is more efficient during download (as you may reuse blocks when files are only partially the same), and will use less disk-space.
However the entire-file dedup approach is more efficient at runtime.
This is because the page cache, and things like VM address spaces, are tied to a single inode, and you can't get a single inode out of multiple chunk files. This means the chunked approach doubles the page-cache use: once to cache the chunks, and once to cache the merged file. You also can't share page cache between different PuzzleFS mounts even if they use shared base files. With composefs, the page caching *only* happens on the base file, and is shared between any composefs mounts using the same backing file. This allows better container density, as e.g. the glibc mmaps shared between all running containers are the same in the page cache.
I guess an optimal system would use a block-chunked approach for downloads, but expand into full-file dedup in the local storage.
Posted Sep 26, 2023 8:18 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (1 responses)
Yes, but in that case there are more effective backup/restore technologies to reduce I/O than just doing content-defined chunking, such as delta compression (frequently discussed at academic conferences such as ATC). And such a delta-compression archival format doesn't necessarily need to take the form of a real filesystem.
Posted Sep 26, 2023 13:52 UTC (Tue)
by alexl (subscriber, #19068)
[Link]
Also, you can use filesystem-level compression (in e.g. btrfs) to compress the files in the backing dir. Or even use reflink copies to create backing files that share some (but not all) blocks.
The sky is the limit, and all these approaches are compatible with using composefs to mount them.
Posted Sep 26, 2023 14:47 UTC (Tue)
by tych0 (subscriber, #105844)
[Link] (1 responses)
This is a great point. I guess it could be worked around with some core mm+fs fu, but it would definitely not be simple.
> I guess an optimal system would use a block-chunked approach for downloads, but expand into full-file dedup in the local storage.
It was an explicit design goal of PuzzleFS not to have any translation step between the image that is pushed to the container registry and the one that's mounted and run on the host, so any translation step here would be a non-starter; it seems like you have to pick one or the other, unfortunately.
Posted Sep 27, 2023 19:01 UTC (Wed)
by calumapplepie (guest, #143655)
[Link]
Could KSM be extended to work on file-backed pages?
Posted Sep 26, 2023 16:11 UTC (Tue)
by walters (subscriber, #7396)
[Link] (1 responses)
Posted Sep 26, 2023 20:28 UTC (Tue)
by rcampos (subscriber, #59737)
[Link]
There were some patches to FUSE to allow passthrough in some cases; that is what I wish were used to do this in a reasonable way.
Posted Sep 26, 2023 12:45 UTC (Tue)
by bluca (subscriber, #118303)
[Link] (10 responses)
Posted Sep 26, 2023 13:40 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (9 responses)
Hi! Using EROFS + dm-verity + blockdevice + overlayfs (composefs-like model), you could already implement a clean file-based runtime dedup by using overlayfs.
I understand that you mean EROFS self-contained file-based runtime dedupe across images, as you mentioned in a previous article. Actually we already have an internal version, done by Jingbo for internal products, but it's still not clean for now. We will try to clean it up and post it to the fsdevel mailing list later, but we're not quite sure it will land smoothly, anyway.
Posted Sep 26, 2023 14:17 UTC (Tue)
by bluca (subscriber, #118303)
[Link] (8 responses)
Posted Sep 26, 2023 14:30 UTC (Tue)
by alexl (subscriber, #19068)
[Link] (3 responses)
Posted Sep 26, 2023 14:52 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (2 responses)
I think if some blob storage layer is just an EROFS image, you could directly apply dm-verity on these layers.
We can also use fs-verity to verify the blob storage layers if these layers are on a RW fs (as the current composefs does).
Posted Sep 26, 2023 15:02 UTC (Tue)
by alexl (subscriber, #19068)
[Link] (1 responses)
Posted Sep 26, 2023 15:09 UTC (Tue)
by hsiangkao (guest, #123981)
[Link]
Posted Sep 26, 2023 14:35 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (3 responses)
Alternatively, as an EROFS self-contained approach, EROFS could share page cache for files with the same data across images without relying on overlayfs, anyway.
Posted Sep 26, 2023 15:57 UTC (Tue)
by bluca (subscriber, #118303)
[Link] (2 responses)
Posted Sep 26, 2023 16:08 UTC (Tue)
by hsiangkao (guest, #123981)
[Link] (1 responses)
In that case, memory of /usr/foo/a is deduplicated according to how overlayfs works since /usr/foo/a is on the same EROFS instance.
Posted Sep 26, 2023 16:57 UTC (Tue)
by bluca (subscriber, #118303)
[Link]
Posted Sep 26, 2023 8:39 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Sep 26, 2023 10:02 UTC (Tue)
by bluca (subscriber, #118303)
[Link]
Posted Sep 26, 2023 12:28 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Sep 26, 2023 12:45 UTC (Tue)
by smcv (subscriber, #53363)
[Link]
Posted Sep 26, 2023 13:04 UTC (Tue)
by walters (subscriber, #7396)
[Link] (2 responses)
Not at all - the decision was that the *ingredients* for composefs (mainly overlayfs extensions) would be in the kernel, and "composefs" would be a userspace concept wiring those ingredients together - and this has all happened!
Please see the README.md for https://github.com/containers/composefs
Posted Sep 26, 2023 13:38 UTC (Tue)
by alexl (subscriber, #19068)
[Link] (1 responses)
Posted Sep 26, 2023 14:17 UTC (Tue)
by bluca (subscriber, #118303)
[Link]
1. the original Nydus [4], also called the Dragonfly image service in the OCIv2 brainstorm [5] --- actually they already had an in-kernel implementation of their RAFS v5 before I joined Alibaba Cloud; if needed, they could post that original kernel implementation to the mailing list;
2. TarFS [6];
3. Overlaybd [7] --- Another QCOW2-like block driver approach from another team in Alibaba Cloud;
4. Puzzlefs.
I'm not sure if Linux could accept all of them for the same use case, anyway.
[2] https://www.usenix.org/system/files/conference/atc16/atc1...
[3] https://lore.kernel.org/r/20230609-nachrangig-handwagen-3...
[4] https://github.com/opencontainers/.github/blob/master/mee...
https://github.com/dragonflyoss/image-service/blob/master...
[5] https://hackmd.io/@cyphar/ociv2-brainstorm
[6] https://github.com/kata-containers/kata-containers/pull/7106
https://kangrejos.com/2023/Read-only%20FS%20abstractions%...
[7] https://lore.kernel.org/r/9505927dabc3b6695d62dfe1be371b1...
And check dm-verity root digests of these layers before mounting. I think that would be in the same effect, anyway.
I'm not sure if some people really rely on the layering concept (such as systems using raw partitions/devices without real filesystem storage), anyway.
- EROFS + dm-verity block-device blobs as data-only layers;
- a small overlayfs metadata layer (with EROFS + dm-verity) to merge these data blobs into a merged rootfs.
Thus all layers are under dm-verity protection, so the whole image can't be tampered with.
That already works without any extra built-in feature.
https://blogs.gnome.org/alexl/2023/09/26/announcing-compo...