
The ABI status of filesystem formats

By Jonathan Corbet
October 8, 2020
One of the key rules of Linux kernel development is that the ABI between the kernel and user space cannot be broken; any change that breaks previously working programs will, outside of exceptional circumstances, be reverted. The rule seems clear, but there are ambiguities when it comes to determining just what constitutes the kernel ABI; tracepoints are a perennial example of this. A recent discussion has brought another one of those ambiguities to light: the on-disk format of Linux filesystems.

Users reporting kernel regressions will receive a varying amount of sympathy, depending on where the regression is. For normal user-space programs using the system-call API, that sympathy is nearly absolute, and changes that break things will need to be redone. This view of the ABI also extends to the virtual filesystems, such as /proc and sysfs, exported by the kernel. Changes that break things are deemed a little more tolerable when they apply to low-level administrative tools; if there is only one program that is known to use an interface, and that program has been updated, the change may be allowed. On the other hand, nobody will be concerned about changes that break out-of-tree kernel modules; the interface they used is considered to be internal to the kernel and not subject to any stability guarantee.

But those are not the only places where user space interfaces with the kernel. Consider, for example, this regression report from Josh Triplett. It seems that an ext4 filesystem bug fix merged for 5.9-rc2 breaks the mounting of some ext4 filesystems that he works with.

These filesystems are created for some unspecified internal purpose with an unreleased internal tool. They are meant to be read-only, so there will never be any need (or ability) to create new files on them. As a space-saving measure, this tool overlays all of the block and inode bitmaps onto a single block, set to all ones, indicating that all blocks and inodes are already allocated. To indicate that this has been done, this tool marks the filesystem with a special flag (EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS). This flag is defined by the ext4 tools, but is not used by the kernel in any way. It is placed in the set of read-only compatibility flags, though, meaning that a kernel that sees it in a filesystem will know that said filesystem can be safely mounted, but only in read-only mode.
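The read-only compatibility rule described above can be illustrated with a small sketch. It assumes the superblock layout given in the kernel's ext4 documentation (primary superblock at byte 1024, s_magic at offset 0x38, s_feature_ro_compat at offset 0x64) and the value 0x4000 for the flag as found in the e2fsprogs headers. The helper names (read_ro_compat, has_shared_blocks, may_mount, make_fake_image) are hypothetical and do not come from the kernel or from e2fsprogs.

```python
import struct

# Offsets as described in the kernel's ext4 on-disk format documentation.
SUPERBLOCK_OFFSET = 1024            # primary superblock starts 1024 bytes into the image
S_MAGIC_OFFSET = 0x38               # __le16 s_magic
S_FEATURE_RO_COMPAT_OFFSET = 0x64   # __le32 s_feature_ro_compat
EXT4_MAGIC = 0xEF53
# Assumed value of the flag, as defined in the e2fsprogs headers.
EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS = 0x4000


def read_ro_compat(image: bytes) -> int:
    """Return the read-only compatibility feature mask from a filesystem image."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 1024:
        raise ValueError("image too small to contain a superblock")
    (magic,) = struct.unpack_from("<H", sb, S_MAGIC_OFFSET)
    if magic != EXT4_MAGIC:
        raise ValueError("bad superblock magic")
    (ro_compat,) = struct.unpack_from("<I", sb, S_FEATURE_RO_COMPAT_OFFSET)
    return ro_compat


def has_shared_blocks(image: bytes) -> bool:
    """True if the image is marked with the shared-blocks flag."""
    return bool(read_ro_compat(image) & EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS)


def may_mount(ro_compat: int, known: int, read_only: bool) -> bool:
    """Model of the ro_compat rule: unknown flags permit read-only mounts only."""
    unknown = ro_compat & ~known
    return unknown == 0 or read_only


def make_fake_image(ro_compat: int) -> bytes:
    """Build a zero-filled buffer with just the two fields this sketch reads."""
    buf = bytearray(2048)
    struct.pack_into("<H", buf, SUPERBLOCK_OFFSET + S_MAGIC_OFFSET, EXT4_MAGIC)
    struct.pack_into("<I", buf, SUPERBLOCK_OFFSET + S_FEATURE_RO_COMPAT_OFFSET, ro_compat)
    return bytes(buf)
```

With a kernel that does not recognize the flag (known=0), may_mount(0x4000, 0, read_only=True) is true while the read-write case is false, which is the behavior the flag's placement is meant to guarantee.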

Until 5.9-rc2, mainline kernels were happy to mount these filesystems. The commit highlighted by Triplett changed that situation by adding some checks to the mount-time ext4 verifier that ensures that the filesystem image is valid. As a result of these new checks, the overlapping bitmaps are detected, the kernel complains, and any attempt to mount the filesystem fails. Something that used to work no longer does — the definition of a regression. Triplett included in his report a small patch that disables the validity check when the EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS flag is present, rendering his filesystems mountable again.
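The interaction between the new check and the proposed fix can be modeled in a few lines. This is a toy model, not the kernel's verifier: the GroupDesc type and check_bitmaps function are hypothetical, and the real mount-time checks look at more than bitmap locations. It only shows why overlapping bitmaps trip a uniqueness check and how testing the flag first bypasses it.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed value of the flag, as defined in the e2fsprogs headers.
EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS = 0x4000


@dataclass
class GroupDesc:
    """Hypothetical, stripped-down block-group descriptor."""
    block_bitmap: int   # block number holding this group's block bitmap
    inode_bitmap: int   # block number holding this group's inode bitmap


def check_bitmaps(groups: List[GroupDesc], ro_compat: int) -> List[str]:
    """Return a list of overlap errors; an empty list means the image passes."""
    # The proposed fix: skip the overlap check when the image declares shared blocks.
    if ro_compat & EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS:
        return []

    seen: Dict[int, str] = {}   # block number -> first structure found there
    errors: List[str] = []
    for i, g in enumerate(groups):
        for kind, blk in (("block bitmap", g.block_bitmap),
                          ("inode bitmap", g.inode_bitmap)):
            owner = f"group {i} {kind}"
            if blk in seen:
                errors.append(f"{owner} at block {blk} overlaps {seen[blk]}")
            else:
                seen[blk] = owner
    return errors
```

An image where every bitmap points at the same block fails the check when ro_compat is 0 and passes when the shared-blocks bit is set, matching the before and after behavior described in the report.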

That change is likely to be merged, but it has not brought great joy to the filesystem developers, who see it as a sign of things going wrong. XFS maintainer Darrick Wong argued that "unofficial" filesystem variants should not be supported:

I disagree; creating undocumented forks of an existing ondisk format (especially one that presents as inconsistent metadata to regular tools) is creating a ton of pain for future users and maintainers when the incompat forks collide with the canonical implementation(s).

Triplett responded that filesystem images should fall within the realm of the kernel ABI:

I was generally under the impression that mounting existing root filesystems fell under the scope of the kernel<->userspace or kernel<->existing-system boundary, as defined by what the kernel accepts and existing userspace has used successfully, and that upgrading the kernel should work with existing userspace and systems. If there's some other rule that applies for filesystems, I'm not aware of that.

Ext4 maintainer Ted Ts'o said that he was not opposed to Triplett's patch, but suggested that further patches to support this tool might receive a chillier reception. He then told the story of the make_ext4fs tool, an independently written ext4 filesystem creator used for years by Android. It created a long list of compatibility and corruption problems, he said, that took years to iron out. Third-party filesystem tools are prone to such problems, he said:

As far as I'm concerned, it's not just about on-disk file system format, it's also about the official user space tools. If you create a file system which the kernel is happy with, but which wasn't created using the official user space tools, file systems are so full of state and permutations of how things should be done that the opportunities for mischief are huge.

Filesystem developers have a certain natural aversion to "mischief", so it is unsurprising that Ts'o would prefer that these outside tools simply not exist. At a minimum, he suggested, future policy should say that filesystem-image regressions would only need to be addressed for images that were created and managed by the designated official tools. He requested that Triplett find a way to get rid of his custom tool.

Triplett disagreed with that policy suggestion, saying that a more reasonable approach would be that any filesystem images that pass the e2fsck checker should be supported. Not all tools that work with the ext4 format should have to live in the e2fsprogs repository; to say otherwise, he added, would be tantamount to saying that the FreeBSD kernel, which has an ext4 filesystem driver, should live in e2fsprogs too. The tool in question here, which Triplett finally described late in the conversation, appears equally unsuited to inclusion in the e2fsprogs repository.

Triplett concluded that message with a restatement of his request:

The *only* thing I'm asking, here, is "don't break things that worked". And after this particular item, I'd be happy to narrow that to "don't break things that e2fsck was previously happy with".

Ts'o's response made it clear that he is uninclined to grant that wish; that would, he said, make it impossible to fix security-related problems related to invalid filesystem images, of which there are many. Many of these invalid images — often generated by fuzzing tools or attackers — pass e2fsck until the problems are found and fixed. Grandfathering in any image that passes e2fsck would thus, he said, require invalid filesystems to be supported forever. He concluded with some suggestions for other ways to solve Triplett's problem and requested that Triplett work more closely with the ext4 developers in the future.

As of this writing, that is where the discussion has stopped. Ts'o's willingness to apply the fix for the immediate problem means that there is no pressing need to resolve the larger issue of regressions involving filesystems; that, in turn, means that the issue is likely to come up again at some point. The kernel ABI is a large and amorphous thing, and many of the boundaries are fuzzy at best. Filesystems are one area where that boundary has not yet been fully explored; somebody is likely to inadvertently end up on the wrong side of it sooner or later.

Index entries for this article
Kernel: Development model/User-space ABI
Kernel: Filesystems/ext4



The ABI status of filesystem formats

Posted Oct 8, 2020 17:46 UTC (Thu) by sbaugh (guest, #103291) [Link] (2 responses)

Triplett's library sounds interesting and somewhat-generally-useful, actually. Repeating the description here:

>The short version is that I needed a library to rapidly turn
>dynamically-obtained data into a set of disk blocks to be served
>on-the-fly as a software-defined disk, and then mounted on the other
>side of that interface by the Linux kernel. Turns out that's *many
>orders of magnitude* faster than any kind of network filesystem like
>NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
>generate and account for and cache, the faster it can run, and
>microseconds matter.

It's an unusual use case, but it doesn't seem like one which is totally unacceptable. I could imagine using it myself. It would be nice to see it upstreamed to e2fsprogs.

The ABI status of filesystem formats

Posted Oct 8, 2020 19:51 UTC (Thu) by khim (subscriber, #9252) [Link]

Agree 100% there. I'm not sure I would want to push e2fsprogs to support it, but it sounds precisely like something much more useful than the author who made this tool thinks.

I only hope that's not something which differentiates the product which he is developing from other such tools and it could be open-sourced and shared.

The ABI status of filesystem formats

Posted Oct 11, 2020 21:23 UTC (Sun) by WolfWings (subscriber, #56790) [Link]

...this seems like they're simply using the wrong filesystem entirely. ROMFS has been around since somewhere in the 2.6's and compiles down to literally a couple kilobytes of total binary size in most cases.

The ABI status of filesystem formats

Posted Oct 8, 2020 18:04 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (9 responses)

My 2¢ (which should not count for much, seeing as I am neither a Linux developer nor someone who regularly works with filesystem images):

This problem would be a lot easier if ext4 had a formal specification saying "This is what a valid filesystem should look like." I'm not really happy with the idea that "valid" just means "something you got from mke2fs(8)." Obviously, people are going to want to do experimental things that are not within the scope of a general-purpose tool like mke2fs. A formal spec would help to delimit the lines between "weird, but supported" and "don't do that, or else you're on your own."

I get the sense that some developers* don't like formal specs. A formal spec does not need to be huge or complicated like the C or POSIX standards; it could just be a short description of what e2fsprogs currently expects and supports, in plain and simple language. In fact, I imagine they could take https://www.kernel.org/doc/html/latest/filesystems/ext4/i..., sweep it for bugs, ambiguities, or outdated information, and then slap a "this is the formal spec" notice on it. This would not necessarily be easy (ambiguities in particular are hard to find and hard to fix), but it wouldn't require convening a huge standards committee or something ridiculous like that.

* This is not a reference to any individual, and it is especially not a reference to Mr. Ts'o. I do not know what Mr. Ts'o thinks of formal specs.

The ABI status of filesystem formats

Posted Oct 8, 2020 20:49 UTC (Thu) by khim (subscriber, #9252) [Link] (1 responses)

Format specification only works for simple data structures. Look at C++: it has a formal spec, an ISO-approved one… yet discussions about whether something is valid C++ or not are not uncommon.

The ABI status of filesystem formats

Posted Oct 8, 2020 22:10 UTC (Thu) by roc (subscriber, #30627) [Link]

C++ is an extreme example. Filesystem formats are vastly simpler than C++. We know how to write formal specifications for quite complicated things.

The ABI status of filesystem formats

Posted Oct 9, 2020 7:11 UTC (Fri) by bangert (subscriber, #28342) [Link] (3 responses)

A formal spec is not the panacea that one may be inclined to assume.

The way I see it, the only thing a formal spec does is to force upon (someone) to take ALL discussions about the subject of the article up front. That's practically impossible to get right and specifically when done well takes a huge amount of effort. If the developers were forced to do it, it would consume their time for a very long time - if someone else were to do it it would/could have a higher degree of defects.

You are highly likely to just debate the formal specification instead of the implementation.

While these objections apply to many situations the extra effort a formal spec introduces is worth it in (some|other) cases.

And although the conclusion of the article, that this subject will likely be discussed again, may be true, that is not in itself a bad thing (and the article does not say so).

The ABI status of filesystem formats

Posted Oct 9, 2020 10:33 UTC (Fri) by roc (subscriber, #30627) [Link] (1 responses)

One nice thing about really formal specs (as opposed to careful English specs, which aren't really formal) is that for many specification languages you can find (or write) tools that leverage the specs to do useful work. For example, for a language like Alloy, fuzzing and model-checking tools can automatically generate testcases that are valid according to the spec and exercise all or most of the interesting edge cases (including many that the spec authors didn't think of).

You may also be able to prove useful consistency properties from the spec, e.g. that a "valid" filesystem doesn't have unaccounted-for blocks, etc.

The ABI status of filesystem formats

Posted Oct 9, 2020 10:35 UTC (Fri) by roc (subscriber, #30627) [Link]

What I meant to add was: if you write formal specifications the right way, you can use these tools to get more value out of the spec, which tilts the cost/benefit tradeoff towards formal specifications.

The ABI status of filesystem formats

Posted Oct 13, 2020 3:29 UTC (Tue) by marcH (subscriber, #57642) [Link]

> The way I see it, the only thing a formal spec does is to force upon (someone) to take ALL discussions about the subject of the article up front.

No, you can write a spec that formalizes an implementation. This is how most RFCs are written.

The ABI status of filesystem formats

Posted Oct 10, 2020 19:43 UTC (Sat) by error27 (subscriber, #8346) [Link] (2 responses)

The problem is that we don't care about a spec, we care about "User space used to work and now it doesn't." So the fix is to program a lot of validation into the kernel so that Josh's filesystem never works from square one. If it never worked to begin with then it can never be broken. It's a lot of work to code this and it doesn't necessarily make Josh any happier.

So probably that approach is a waste of time.

The ABI status of filesystem formats

Posted Oct 11, 2020 19:55 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

> So probably that approach is a waste of time.

I find this remark deeply confusing, so much so that I think I must have misunderstood you.

I proposed creating a formal spec, and declaring that to be the defining line of "what the kernel is prepared to support." In response to this, you described what happens if you don't have a spec and/or don't care about said spec, and then ambiguously said "that approach is a waste of time."

It sounds as if you were trying to refute my suggestion, but your argument before this line appears to *agree* with me, because it describes problems that arise when you don't have a spec. So, if you could clarify your position, that would help me to understand your argument a lot better.

The ABI status of filesystem formats

Posted Oct 12, 2020 13:46 UTC (Mon) by error27 (subscriber, #8346) [Link]

I didn't mean writing a spec is a waste of time. Specs are good because they encourage rigorous thinking. I meant it would be very difficult and a waste of time to try to detect filesystems like Josh's in advance and prevent them from mounting.

But although specs have their uses, in the kernel, specs cannot work in the way you have said. The rule is never "You can't break the spec." The rule is "You can't break userspace."

Here is an example. Ten years ago glibc changed memcpy(). https://lwn.net/Articles/414467/ The spec said that the new implementation of memcpy() was valid but kernel developers find this attitude rage inducing. We would never do that in the kernel. You could theoretically still change memcpy(), but first you would have to fix all the applications that rely on the old behavior.

In the glibc example, it's not the Flash developers who suffered, it's the regular Linux users. For the users, they had a program which worked and now it doesn't work. They don't care about specs. As kernel developers we care about users first and the spec second.

Why ext4?

Posted Oct 8, 2020 19:45 UTC (Thu) by Yenya (subscriber, #52846) [Link] (12 responses)

There are filesystems which are designed to be read-only, such as ISO9660. Why abuse ext4?

Why ext4?

Posted Oct 8, 2020 20:19 UTC (Thu) by sfeam (subscriber, #2841) [Link] (5 responses)

"There are filesystems which are designed to be read-only, such as ISO9660. Why abuse ext4?"

Choosing a different filesystem might lessen Ts'o's personal stake in the push-back but the same general arguments and principles would apply to any filesystem, right? The most solid argument Ts'o raises is that a read-only file system may still be a malware vector if the kernel code handling it fails to consider some edge case or unforeseen content in a data structure. Any patch to fix that vulnerability could hypothetically break an out-of-tree application that had been relying on the previous behavior. Is that to be considered a regression or not? In fact the general argument applies even if the filesystem is not read-only.

Why ext4?

Posted Oct 8, 2020 20:23 UTC (Thu) by Yenya (subscriber, #52846) [Link]

Well, a read-only by-design filesystem would not need a bitmap of free blocks and free inodes. And being able to make the r/only filesystem smaller for the same amount of data makes sense, unlike that abuse of ext4.

Why ext4?

Posted Oct 9, 2020 1:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

But that argument even applies to e2fsprogs itself. How do we know there isn't some weird combination of flags you can pass to tune2fs or mke2fs which somehow makes an evil filesystem? At that point, you would have to break e2fsprogs backcompat anyway. This is probably a lot less *likely* than the "some third-party tool generates a weird filesystem image" case, but some of those third-party tools are actually rather popular (e.g. the article mentions FreeBSD), and breaking them would also be unfortunate. So I'm not sure how much sense it makes to draw a bright line around e2fsprogs as the single point of compatibility.

Why ext4?

Posted Oct 9, 2020 3:45 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

ISO 9660 has a standard formally specifying it. That means you can decide whether a given file system image is valid, and if the kernel doesn’t work with a valid image, that is clearly a kernel bug. So no, it doesn’t quite have the same general problem as ext4.

In principle the point still stands: there could be a bug in the kernel code when handling a malformed file system image, and that bug could cause security problems, yet somebody might be relying on the existing behaviour because they use one of these malformed images. But that seems much less likely with ISO 9660, partly because it’s much simpler than ext4, and partly because it’s understood by many different OSes, so you’d soon find out if your image depended on a Linux-specific quirk.

I have a CD-ROM that’s meant to be in ISO 9660 format but many of the filenames contain the / character. Linux does not handle this well. But it’s clearly incorrect by the standard, so you can say this is no bug in Linux.

Why ext4?

Posted Oct 9, 2020 6:38 UTC (Fri) by Yenya (subscriber, #52846) [Link] (1 responses)

Well, my (rhetorical) question was, why Josh Triplett even decided to use this "optimized" ext4 image instead of going for some filesystem which is read-only by design. Not whether there can arise the same kind of problems with ISO 9660.

Why ext4?

Posted Oct 9, 2020 8:06 UTC (Fri) by epa (subscriber, #39769) [Link]

Yes, I wasn't replying to your comment, but to sfeam's one.

Why ext4?

Posted Oct 9, 2020 12:02 UTC (Fri) by eru (subscriber, #2753) [Link] (5 responses)

>There are filesystems which are designed to be read-only, such as ISO9660. Why abuse ext4?

Maybe he needed a file system that supports all features expected of a native Linux file system. ISO9660 does not qualify. (But squashfs would? It would also automatically save space).

Why ext4?

Posted Oct 11, 2020 21:27 UTC (Sun) by WolfWings (subscriber, #56790) [Link] (3 responses)

SquashFS _requires_ compression, so there'd be a compression -> decompression layer involved. CRAMFS has a ~256MB total size limit. ROMFS is 32-bit so it'd run into obstacles around the 4GB mark (or maybe 2GB, I don't think it's been tested for >2GB total size to my knowledge) but it would be the lightest weight option IMHO.

But this does boil down to them abusing an invalid filesystem data structure configuration that was not previously checked. Declaring that a particular flag marks the filesystem as such a monstrosity doesn't change the core invalidity: they were laying multiple copies of a data structure on top of each other instead of pushing upstream for a proper "read only" flag that did away with those structures entirely.

Why ext4?

Posted Oct 13, 2020 15:30 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

What about EROFS? It doesn't look very complicated, but it does appear to support most things you might reasonably want to do.

https://www.kernel.org/doc/html/latest/filesystems/erofs....

Why ext4?

Posted Oct 18, 2020 16:19 UTC (Sun) by WolfWings (subscriber, #56790) [Link]

Oh, yup, I'd missed that one going through the main filesystems, that one looks to nail it and be "ROMFS but a full Linux filesystem" so it's just about ideal.

Why ext4?

Posted Oct 20, 2020 5:38 UTC (Tue) by eru (subscriber, #2753) [Link]

Squashfs, unlike cramfs, does not have problematic size limitations: "Files up to 2^64 bytes are supported; file systems can be up to 2^64 bytes" (https://tldp.org/HOWTO/SquashFS-HOWTO/whatis.html).

Why ext4?

Posted Oct 26, 2020 20:57 UTC (Mon) by mina86 (guest, #68442) [Link]

FYI, there’s Rock Ridge extension which adds POSIX semantics to an ISO file-system.

make_ext4fs to generate filesystem

Posted Oct 8, 2020 22:39 UTC (Thu) by jhhaller (guest, #56103) [Link] (5 responses)

I wrote an alternate version of this tool at Alcatel-Lucent (twice because the first one got lost). Historically, Unix mkfs could make a filesystem from a list of files and contents, and write that filesystem to a file or device. Most any team generating filesystem images needs such a tool if they want to generate that ISO without needing root access. And, who really wants to give root access to one's CI system? We typically used this as a TFTP boot image for embedded Unix systems, and the filesystem was in RAM. That was another story, as the TFTP large block option was required, and no one had implemented a TFTP server with large block support at that time, and the filesystems were typically more than the 64K 512 byte block TFTP transfer size limit.

It was not hard to slightly modify mke2fs to take a manifest of files (and file of contents for regular files), and use the functions available to mke2fs to write the special files, permissions, directories, and file content to the ISO. I based the manifest loosely on the syntax of the contents of the proto argument of Unix 6th edition mkfs command - http://man.cat-v.org/unix-6th/8/mkfs - and added options for other file types including both symbolic and hard links. But, by using mke2fs, it at least used the functions built into the filesystem to populate the parts of the filesystem, making corruption significantly less likely. It would make two passes through the internal data model of the manifest, the first time writing inodes, and the second time writing the file contents. I don't remember if all-zero blocks were suppressed to support holes, I suspect not, as few files we installed had large all-zero blocks.

It was just more work than I wanted to do to get approval for releasing the changes externally, and get it accepted into ext4 source code, so it stayed internal. Since we never redistributed the tool externally, there was no need to share it, even though the change was under GPL. Sorry, Google and Ted Ts'o.

make_ext4fs to generate filesystem

Posted Oct 9, 2020 2:58 UTC (Fri) by djwong (subscriber, #23506) [Link] (2 responses)

Yeah... mke2fs -d takes care of slurping a directory tree into the new filesystem now.

Also, that's a very interesting man page link-- now I've learned where the mkfs.xfs "protofile" format comes from! The manpage for mkfs.xfs even features a very similar example file.

Er, thanks!

make_ext4fs to generate filesystem

Posted Oct 9, 2020 15:05 UTC (Fri) by jhhaller (guest, #56103) [Link] (1 responses)

The -d option would still require root access to allow creating device file entries which may not match the host's devices. Setting the correct file ownership is another aspect requiring root access. The protofile addresses both of these issues, as it overwrites the filesystem owner, permissions, and allows creating device files. I doubt mkfs was originally concerned with CI systems, but creating filesystem images on a non-native host was likely an early concern.

make_ext4fs to generate filesystem

Posted Oct 12, 2020 16:36 UTC (Mon) by mebrown (subscriber, #7960) [Link]

The yocto build system has a utility called 'pseudo'.

It uses an LD_PRELOAD library such that anything doing a mknod() libc call has the request saved to a database instead of invoking the underlying mknod() call; a regular placeholder file is created in its place.

Then when you run mke2fs, similarly the LD_PRELOAD then hooks stat() such that mke2fs reading the placeholder file actually reads the details from the database instead of the filesystem.

Seems a little hacky at first, but seems to work well enough in practice.

make_ext4fs to generate filesystem

Posted Oct 9, 2020 17:40 UTC (Fri) by tytso (✭ supporter ✭, #9993) [Link] (1 responses)

Interesting. I didn't know about this feature in the 6th edition Unix's mkfs. It's not something which the BSD Fast File System supported, and my involvement with Unix started with BSD 4.3.

Using a proto file so that the system can more easily create device files does make sense, and of course these days we need to have something which supports SELinux attributes as well. That's one of the reasons why contrib/e2fsdroid exists. (The other is that e2fsdroid uses the Android libsparse file to create an image which omits the all zero blocks and which is compatible with the existing Android image creation tools, and I didn't want to drag that into mke2fs as a shared library dependency, since that would make distro installer release engineers.... cranky.)

Patches to support some kind of proto file which supports SELinux attributes, and which uses the qemu image file creation functions in libext2fs for mke2fs would certainly be gratefully accepted. If we had that, plus the per-block dedup functionality to create shared blocks file system added, then yes, we could upstream all of that functionality into mke2fs, and allow other embedded systems developers access to e2fsdroid functionality without having to build AOSP....

make_ext4fs to generate filesystem

Posted Oct 9, 2020 17:42 UTC (Fri) by tytso (✭ supporter ✭, #9993) [Link]

Sorry: s/libsparse file/libsparse library/

The ABI status of filesystem formats

Posted Oct 12, 2020 16:42 UTC (Mon) by ncm (guest, #165) [Link]

I understand the desire to use ext4 for a read-only fs as a need for features maybe not-supported in certain other filesystems.

I don't understand a reluctance to certify e2fsck as the arbiter of what may be mounted as an ext2/3/4 filesystem. If it is discovered that some bit pattern is a security hole, surely e2fsck will be updated to patch up an image that tickles that security hole?

It seems not necessary to remain bitwise compatible with filesystems that *current* e2fsck would alter, but that older e2fsck would allow. e2fsck should continue to accept non-broken images, and to fix known-broken ones. So, we could accept a future kernel requiring that a file system image that used to mount as-is be fixed by a new e2fsck before it may be mounted. We could even have e2fsck refuse to fix certain old images it finds too confusing, as it might for any damaged image.

The ABI status of filesystem formats

Posted Oct 15, 2020 11:11 UTC (Thu) by rwmj (subscriber, #5474) [Link]

I'm wondering if what Josh really wants is some kind of deduplicating block layer. I have a (non-upstream) nbdkit allocator for this. The devil may be in the details however since it may be possible that the bitmaps being combined are not actually identical at the block size level.


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds