An introduction to EROFS

By Jake Edge
June 7, 2023

Gao Xiang gave an overview of the Extended Read-Only File System (EROFS) in a filesystem session at the 2023 Linux Storage, Filesystem, Memory-Management and BPF Summit. EROFS was added to Linux 5.4 in 2019 and has been increasingly used in places beyond its roots as a filesystem for Android and embedded devices. Container images based on EROFS are being used in many places these days, for example.

Unfortunately, this session was quite difficult for me to follow, so the report below is fragmentary and incomplete. There is a YouTube video of the session, but it suffers from nearly inaudible audio, though perhaps that will be addressed before long. The slides from the session are also available.

EROFS is a block-based, read-only filesystem with a "very simple" format, Xiang began. The earlier read-only filesystems had many limitations, such as not supporting compression, which is part of why EROFS was developed. EROFS stores its data in a block-aligned fashion, which is page-cache friendly; that alignment also allows direct I/O and DAX filesystem access.

SquashFS is another read-only filesystem, but it does not store its compressed data in a block-aligned fashion, which increases the I/O overhead. EROFS does its compression into fixed 4KB blocks in the filesystem, while SquashFS uses fixed-size blocks of uncompressed data. In addition, SquashFS does not allow random access in its directories, unlike EROFS; that means SquashFS requires linear searches for directory entries.
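The idea of compressing into fixed-size output blocks can be illustrated with a small sketch. It is not how EROFS is implemented: it uses zlib from the Python standard library rather than LZ4 or LZMA, and it finds the input length by a brute-force binary search, whereas real fixed-output compressors do this in a single pass. The names compress_fixed_output() and decompress_block() are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096  # block size assumed for this sketch


def compress_fixed_output(data: bytes, block: int = BLOCK_SIZE) -> list[tuple[int, bytes]]:
    """Consume a variable amount of input so each output fills one block.

    Returns (uncompressed_length, zero_padded_block) pairs.
    """
    out = []
    pos = 0
    while pos < len(data):
        lo, hi = 1, len(data) - pos
        best = None
        # Binary-search for a long prefix whose compressed form fits one block.
        while lo <= hi:
            mid = (lo + hi) // 2
            packed = zlib.compress(data[pos:pos + mid])
            if len(packed) <= block:
                best = (mid, packed)
                lo = mid + 1
            else:
                hi = mid - 1
        if best is None:
            raise ValueError("block too small to hold any compressed data")
        length, packed = best
        # Pad to the block boundary so every unit starts block-aligned.
        out.append((length, packed.ljust(block, b"\0")))
        pos += length
    return out


def decompress_block(padded: bytes) -> bytes:
    # decompressobj stops at the end of the zlib stream; the zero padding
    # ends up in unused_data and is not returned.
    return zlib.decompressobj().decompress(padded)
```

Every stored unit here is exactly one block long, so reading it never touches a partial neighboring block; the variable quantity is how much uncompressed data each block represents.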

Replacing tar or cpio archives with a filesystem is a potential use case for EROFS. There has been a proposal from the confidential-computing community for a kernel tarfs filesystem, which would allow guest VMs to efficiently mount a tar file directly. But EROFS would be a better choice, he said. There is a proof-of-concept patch set that allows directly mounting a downloaded tar file using EROFS that performs better than unpacking the tarball to ext4, then mounting it in the guest using overlayfs.

There are still problems with this approach, including a lack of sharing in the page cache between guests that are using the same tar archive. Aleksa Sarai agreed that there was a problem with that, but thought that eliminating tar archives as the underlying format would go a long way toward fixing it, along with a bunch of other problems. He also said that the EROFS approach is better than what's being done today, but believes that replacing the tar format in container images is needed.

There is currently a lot of effort that goes into optimizing image layout that is all needed solely due to the tar format; "in my mind, this is insanity", Sarai said. The community needs to stop expending so much energy working around the limitations of the tar format. There may be 500 instances of Bash in the guests on a system, but they cannot share the same inode in a tar-based format, so they are treated as distinct files. But the tar format is going to continue to need to be supported, Xiang said, so a compatible solution is needed.

He continued with features of EROFS, including the ability to do chunk-based deduplication of file data. The typical use case is for systems using EROFS with Nydus. EROFS optionally supports per-file compression with LZ4/LZMA, but uses smaller compression block sizes, which reduces the memory amplification that occurs with SquashFS. The data is decompressed in-place in order to reduce extra copies.
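As a rough illustration of chunk-based deduplication in general, the following sketch splits files into fixed-size chunks and stores each distinct chunk only once. It says nothing about the EROFS on-disk format or about how Nydus builds images; the 4KB chunk size, the use of SHA-256, and the names dedup_chunks() and read_file() are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # chunk size assumed for this sketch


def dedup_chunks(files: dict[str, bytes], chunk_size: int = CHUNK_SIZE):
    """Store each distinct chunk once; describe files as lists of chunk indexes."""
    store: list[bytes] = []          # unique chunk payloads, in first-seen order
    index_of: dict[bytes, int] = {}  # SHA-256 digest -> position in store
    layout: dict[str, list[int]] = {}
    for name, data in files.items():
        refs = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest not in index_of:
                index_of[digest] = len(store)
                store.append(chunk)
            refs.append(index_of[digest])
        layout[name] = refs
    return store, layout


def read_file(store: list[bytes], layout: dict[str, list[int]], name: str) -> bytes:
    # Reassemble a file by following its chunk references.
    return b"".join(store[i] for i in layout[name])
```

Two files that share a chunk end up pointing at the same entry in the store, which is the property that lets identical data be kept (and cached) once.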

Recent use cases for EROFS take three basic forms. The first is an EROFS full image; those are used in compressed form for space saving at the cost of some performance, or uncompressed and shared among guests with DAX or FS-Cache. The second is to have an EROFS metadata-only image with an external source for the file data, such as a tar archive or other binary format. The third is to use EROFS with overlayfs as described in the previous session on composefs.

Using EROFS could potentially increase performance for machine-learning data sets, Xiang said. These data sets often have millions of small files in a single directory; the training process will read the entire directory and choose files randomly from the list. Because of its compact layout, EROFS is potentially twice as fast as ext4 for those kinds of operations.

The session wound down with some discussion about using the clone-file-range ioctl() operation to do an overlayfs copy_up on files. A copy_up is performed when the lower-layer file is accessed for write; the file gets copied to the upper layer before it can be modified. If the layers are loopback-mounted files from the same filesystem, a copy-on-write operation could be done instead. Amir Goldstein seemed to think that something like that is possible and would be useful, but there is work needed to get there.
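For readers unfamiliar with the clone-file-range operation, here is a user-space sketch of the general idea: try to share extents with the FICLONERANGE ioctl() and fall back to an ordinary copy when the filesystem cannot do that. It is only an analogy for the in-kernel copy_up path and does not reflect the overlayfs code. The function name copy_up() is hypothetical, and the ioctl number is written out by hand on the assumption of the generic Linux ioctl encoding used on x86 and arm64.

```python
import fcntl
import struct

# FICLONERANGE from <linux/fs.h>: _IOW(0x94, 13, struct file_clone_range).
# Assumption: generic Linux ioctl encoding (x86, arm64, ...).
FICLONERANGE = 0x4020940D


def copy_up(src_path: str, dst_path: str) -> str:
    """Copy src to dst, sharing extents when the filesystem allows it.

    Returns "cloned" or "copied" so callers can see which path was taken.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # struct file_clone_range { s64 src_fd; u64 src_offset;
        #                           u64 src_length; u64 dest_offset; }
        # A src_length of 0 means "clone through the end of the file".
        arg = struct.pack("qQQQ", src.fileno(), 0, 0, 0)
        try:
            fcntl.ioctl(dst.fileno(), FICLONERANGE, arg)
            return "cloned"
        except OSError:
            # No reflink support, or the files are on different
            # filesystems: fall back to copying the bytes.
            while chunk := src.read(1 << 20):
                dst.write(chunk)
            return "copied"
```

On a filesystem with reflink support (Btrfs, or XFS with reflinks enabled), the first branch completes without copying any file data; elsewhere the result is the same, just slower.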


Posted Jun 7, 2023 15:51 UTC (Wed) by hsiangkao (guest, #123981)

Very sorry about my spoken English. I might need to add some words:

> The earlier read-only filesystems had many limitations, such as not supporting compression,

Here I meant they can be worked effectively without compression.

ROMFS might be something, but as far as I understand it doesn't have a block concept, so we still need to do an extra memcpy for buffered I/O; see:
romfs_read_folio() -> romfs_dev_read() -> romfs_blk_read().
That makes direct I/O and FSDAX pointless as well. The ROMFS and CRAMFS on-disk formats themselves are also quite limited.

> uses smaller compression block sizes, which reduces the memory amplification that occurs with SquashFS.

EROFS can use a 1 MiB pcluster size just as Squashfs can, but the scenarios EROFS was originally proposed for effectively use smaller pcluster sizes (4/8/16 KiB, for example; EROFS uses a 4 KiB pcluster by default), because we'd like to enable compression for users without an extra memory footprint. Yet the previous approaches (I mean their indexes) are not very good at these small compression units (you could benchmark with a 4/8/16 KiB compression unit instead of the typical 128 KiB, for example).

Finally, I'd like to mention that EROFS now also supports global compressed-data deduplication with a rolling hash, so if there is similar data that is not block-aligned (text such as source code, or similar Wikipedia versions), it might be useful to combine deduplication and compression this way...

Posted Jun 8, 2023 9:23 UTC (Thu) by gmgod (guest, #143864)

Hello, this looks like very exciting work that seems to better fit lots of use cases people currently have (from initramfs to specific-need archiving, to a base for "immutable" OS, VMs and containers).

Two questions as someone who has not followed the advent of EROFS:

1. Do you have strong tampering prevention guarantees built-in (beyond being immutable of course) or is that something people have to figure out outside of EROFS?

2. Is EROFS agnostic of compression methods? Or said otherwise is it modular enough to use different compression/filtering methods? (I am aware that you are covering the two main cases people would want with your current choice: I am not questioning that.)

Posted Jun 8, 2023 10:08 UTC (Thu) by hsiangkao (guest, #123981)

> Two questions as someone who has not followed the advent of EROFS:
> 1. Do you have strong tampering prevention guarantees built-in (beyond being immutable of course) or is that something people have to figure out outside of EROFS?

You mean resistance to malicious images? We always try our best to deal with fuzzing issues and fix them as quickly as possible, and we currently have no remaining fuzzing issues at hand. That is the only guarantee I can give for this.

> 2. Is EROFS agnostic of compression methods? Or said otherwise is it modular enough to use different compression/filtering methods? (I am aware that you are covering the two main cases people would want with your current choice: I am not questioning that.)

It depends. In principle, any compression method could be added to EROFS directly with no modification, but EROFS data, including compressed data, is block-aligned (IMHO like btrfs and f2fs compression, but unlike squashfs). So if a compression method doesn't support the optimized fit-block approach (a.k.a. fixed-size output compression; currently only lz4 and lzma have it, and I'm working on deflate now), the last block (usually with a 4k block size) of each pcluster (4k, 8k, ... up to 1m) will not be completely filled with compressed data. That causes some compression-ratio loss if the pcluster is small (like 4k or 8k), but I think it can be ignored if the pcluster size itself is large, like 128k or more.

In practice, I tend to avoid adding new algorithms to EROFS at random before I have designed them in carefully, since that could cause compatibility problems and a maintenance burden if I later change to the optimal approach. In short, this year I will land the deflate algorithm to enable deflate hardware accelerators (and maybe more; I'm still planning with the compression algorithm guys).

Posted Jun 8, 2023 10:39 UTC (Thu) by hsiangkao (guest, #123981)

> like btrfs and f2fs compression

To add some words: I just meant that compressed data is block-aligned like in those filesystems, as far as I understand, but EROFS can actually handle arbitrary decompressed offsets/lengths instead of only block-aligned ones, unlike f2fs/btrfs. That is why EROFS has been able to do block-unaligned, rolling-hash compressed data deduplication (also called CDC) since Linux v6.1.

In principle, we could record a byte-granularity decompressed offset/length pair and a byte-granularity arbitrary compressed offset/length pair for each compression unit, but that makes the on-disk indexes inefficient (more metadata I/O) and even makes random access to the on-disk index impossible. In addition, unaligned compressed data is unfriendly to caching and in-place I/O strategies.

For more details on the design, you could also refer to the EROFS ATC19 paper and the kernel documentation if needed.

Posted Jun 9, 2023 15:48 UTC (Fri) by bobolopolis (subscriber, #119051)

> 1. Do you have strong tampering prevention guarantees built-in (beyond being immutable of course) or is that something people have to figure out outside of EROFS?

dm-verity is probably your best bet for this, which would let you use erofs, squashfs, or whatever other read-only filesystem you want. I've been pretty happy with dm-verity + squashfs in past projects, I'm sure erofs would work great too.

Posted Jun 9, 2023 16:35 UTC (Fri) by hsiangkao (guest, #123981)

> dm-verity is probably your best bet for this, which would let you use erofs, squashfs, or whatever other read-only filesystem you want. I've been pretty happy with dm-verity + squashfs in past projects, I'm sure erofs would work great too.

Signed, verified images are fine for this (if users just trust the signature). I think LWN will later post the following LSF/MM FS-track topics; the related stuff was discussed several times in several separate topics.