New APIs for filesystems
A discussion of extensions to the statx() system call comes up frequently at the Linux Storage, Filesystem, Memory Management, and BPF Summit; this year's edition was no exception. Kent Overstreet led the first filesystem-only session at the summit on querying information about filesystems that have subvolumes and snapshots. While it was billed as a discussion on statx() additions, it ranged more widely over new APIs needed for modern filesystems.
Brainstorming
Overstreet began the session with the idea that it would be something of a brainstorming exercise to come up with additions to the filesystem APIs. He had some thoughts on new features, but wanted to hear what other attendees were thinking so that a list of tasks could be gathered. He said that he did not plan to do all of the work on that list himself, but he would help coordinate it.
He has started thinking about per-subvolume disk accounting for bcachefs, which led him to the need for a way to iterate over subvolumes. He mentioned some previous discussion where Al Viro had an idea for an iterator API that would return an open file descriptor for each subvolume. "That was crazy and cool", Overstreet said; it also fits well with various openat()-style interfaces. He thinks there is a simpler approach, however.
![Kent Overstreet](https://static.lwn.net/images/2024/lsfmb-overstreet-sm.png)
Adding a flags parameter to opendir() would allow creating a flag for iterating over subvolumes and submounts. Subvolumes and mounts have a lot in common, he has noticed recently; user-space developers would like to have ways to work with them, which this would provide.
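The proposed flag does not exist yet; as a sketch of where it would slot in, the closest existing pattern already separates "open the directory" from "iterate it" by combining openat()-style open flags with fdopendir(). The helper name below is invented for illustration:

```c
/* Sketch: where an "opendir() with flags" would fit today. There is
 * no subvolume-iteration flag yet; this uses the existing
 * open()/fdopendir() combination, whose flags argument is the
 * natural place for such a flag to land. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Count the entries in the directory at 'path'; returns -1 on error. */
int count_entries(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    DIR *d = fdopendir(fd);   /* takes ownership of fd */
    if (!d) {
        close(fd);
        return -1;
    }

    int n = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
        n++;                  /* includes "." and ".." */

    closedir(d);
    return n;
}
```

A hypothetical subvolume flag would simply be OR-ed into the open() flags, leaving the readdir() loop unchanged.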
Extended attributes (xattrs) on files are also in need of an iterator interface of some kind, he said. Those could smoothly fit into the scheme he is proposing. The existing getdents() interface is "nice and clean", he said, so it could be used for xattrs as well.
The stx_subvol field has recently been added to statx() for subvolume identifiers. Another statx() flag will be needed to identify whether a file is in a snapshot; with it, coreutils could filter out snapshots by default, so that someone working through the filesystem does not see the same files over and over again.
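Querying the new field looks like the sketch below. It assumes headers recent enough to define STATX_SUBVOL (Linux 6.10 era) and falls back gracefully when they are older; the helper name is invented here:

```c
/* Sketch: read the per-subvolume identifier via statx(). Assumes
 * kernel headers that define STATX_SUBVOL; compiles (and reports
 * "unsupported") on older systems. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* Fetch the subvolume ID for 'path' into *subvol.
 * Returns 0 on success, -1 if unsupported or on error. */
int get_subvol(const char *path, unsigned long long *subvol)
{
#ifdef STATX_SUBVOL
    struct statx stx;

    if (statx(AT_FDCWD, path, 0, STATX_SUBVOL, &stx) != 0)
        return -1;
    if (!(stx.stx_mask & STATX_SUBVOL))
        return -1;              /* filesystem has no subvolumes */
    *subvol = stx.stx_subvol;
    return 0;
#else
    (void)path;
    (void)subvol;
    return -1;                  /* headers predate STATX_SUBVOL */
#endif
}
```

On ext4 or other non-subvolume filesystems the kernel simply leaves STATX_SUBVOL out of stx_mask, which is why the mask check is needed.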
Steve French asked a "beginner question" about how to list the snapshots for a given mount in a generic fashion on Linux. Overstreet said that a snapshot is a type of subvolume and that "a subvolume is a fancy directory". This new opendir() interface could be used to iterate over the subvolumes and the new statx() flag could be used to check which are snapshots.
All of the information that statfs() returns for a mounted filesystem should also be available for subvolumes, he said, "continuing with the theme that subvolumes and mount points actually have a lot in common". That includes things like disk-space usage and the number of inodes used.
Dave Chinner said that XFS already has a similar interface based on project IDs, where a directory entry that corresponds to a particular project can be passed to statfs() to retrieve the information specific to that project. He said that filesystems could examine the passed-in directory and decide what to report based on that, so no new system call would be needed. Overstreet was skeptical that users who type df in their home directory would expect to only get information for the subvolume it is in, rather than the whole disk, as they do now. He thought a new system call would be the right way to approach it.
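What makes the project-ID trick possible is that statfs() already takes an arbitrary path, so a filesystem can key its answer off the directory it is handed. A minimal caller, with an invented helper name, looks like this:

```c
/* Sketch: statfs() on a path. With per-subvolume (or per-project)
 * reporting, the same call on a subvolume's directory would return
 * that subvolume's numbers rather than the whole filesystem's. */
#include <sys/vfs.h>    /* statfs(2) */

/* Bytes available to unprivileged users on the filesystem
 * containing 'path'; returns -1 on error. */
long long fs_free_bytes(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
    return (long long)sfs.f_bavail * (long long)sfs.f_bsize;
}
```

Overstreet's objection is about defaults, not mechanism: df calls exactly this, and changing what the answer means for a path inside a subvolume would surprise users.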
French said that other operating systems have a way to simply open a version of a file from a snapshot without actually having to directly work with the entire snapshot subvolume itself. A user can simply open a file from a given snapshot identifier, which is convenient and not really possible on Linux. Overstreet acknowledged the problem, but said that he did not think a new system call was needed to support that use case. Using the new interfaces that are being discussed, user space should be able to handle that functionality, perhaps using read-only mounts of snapshots in such a way that the user does not directly have to work with them.
User-space concerns
But Lennart Poettering said: "as a user-space person, I find it a bit weird" that opendir() is seen as a good interface for this functionality. In many ways, he finds opendir() to be "a terrible API" because it gives you a filename, but then you have to open the file to get more information, which does not necessarily match up because there can be a race between the two operations. He would much prefer to get a file descriptor when enumerating things so that the state cannot change between the two.
There are some other mismatches between opendir() and subvolumes, he continued. Right now, user space expects to get filenames from readdir(), which means they do not contain the slash ("/") character, but subvolume path names do. In addition, the filename returned in the struct dirent can only be 255 characters long, which is too restrictive for subvolume names.
In the end, Poettering thinks that user-space programs do not want to get filenames, they want something that cannot change out from under them. Jeff Layton suggested using file handles instead, which Poettering agreed would be better still. Christian Brauner noted that the listmount() system call uses a new 64-bit mount ID, but there is no way to go from that mount ID to a file descriptor or handle. It would be easy to add, however.
Overstreet said that he plans to add firmlinks, which is an Apple filesystem (APFS) feature that fits in between hard links and symbolic links. It would use a file handle and filesystem universally unique ID (UUID) to identify a particular file. Amir Goldstein said that overlayfs also uses those two IDs to identify its files, so Overstreet thought that perhaps that scheme should become a standard for Linux filesystems.
There are some other missing pieces for file handles, though, he said. There is no system call to go from a file handle to a path. Goldstein said that the ability exists, but it is only reliable for directories. "That's because hard links suck", Overstreet said; Goldstein agreed that was part of it, but Jan Kara said that there are some filesystems that cannot provide that mapping.
It is getting increasingly difficult to guarantee inode-number uniqueness, Overstreet said. Most of the discussion about his proposal for a session at LSFMM+BPF revolved around the problem and its solutions; it has come up at earlier summits, as well. The basic problem is that more-recent filesystems (Btrfs, bcachefs) have lots of trouble ensuring that inode numbers are unique across all of the subvolumes/snapshots in a mounted filesystem, which confuses tools like rsync and tar.
The 64-bit inode space is simply too small to guarantee uniqueness, he said, but there are various schemes that have been used to make things work. He would rather not be "kicking cans down the road" and thinks filesystem developers need to nudge user-space developers to start using file handles for uniqueness detection "sooner, rather than later".
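The kernel half of this already exists: name_to_handle_at(2) hands an opaque, filesystem-chosen handle to unprivileged callers (only the reverse direction, open_by_handle_at(2), is privileged). A sketch of the usual two-call size-discovery dance, with an invented helper name:

```c
/* Sketch: obtain a file handle for uniqueness comparison. Uses the
 * documented pattern of calling name_to_handle_at() once with
 * handle_bytes == 0 to learn the required size. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Get a file handle for 'path'. Returns a malloc'd handle (caller
 * frees) or NULL if the filesystem does not support handles. */
struct file_handle *get_handle(const char *path, int *mount_id)
{
    struct file_handle probe = { .handle_bytes = 0 };

    /* Zero-size call fails with EOVERFLOW and fills in the size. */
    if (name_to_handle_at(AT_FDCWD, path, &probe, mount_id, 0) == 0)
        return NULL;            /* should not happen */
    if (errno != EOVERFLOW)
        return NULL;            /* e.g. EOPNOTSUPP */

    struct file_handle *fh = malloc(sizeof(*fh) + probe.handle_bytes);
    if (!fh)
        return NULL;
    fh->handle_bytes = probe.handle_bytes;

    if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) != 0) {
        free(fh);
        return NULL;
    }
    return fh;
}
```

Two files are the same object exactly when their handles (and filesystem identity, e.g. the returned mount ID or a UUID) compare equal, with no inode-number aliasing possible.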
Inode zero
He noted a recent "kerfuffle" regarding filesystems that return all inode numbers as zero values, which broke lots of user-space tools. That will become more prevalent over time, so he wondered if it made sense to add a mount option that would intentionally report the same inode number for every file in order to shake out those kinds of problems. Chinner suggested using a sysctl instead, which Overstreet agreed would be a better choice.
Ted Ts'o said that in order to get user space on board with a switch to using file handles, it is important to make it a cross-OS initiative. Lots of maintainers of user-space tools want to ensure that they work on macOS and the BSDs. If it can get to a point where using file handles is "supported by more forward-leaning, POSIX-like filesystems", the chances will be much better for getting enough of user space converted so that it is possible to return zeroes for inode numbers without breaking everything. It will still be a multi-year effort, which means that it is worth taking the time to try to ensure that it can be adopted more widely than just on Linux.
Overstreet asked about support for file handles in the other operating systems; Chinner said that anything that supports NFS must have some form of file-handle support. Ts'o agreed but said that the others may not export file-handle information to user space.
As part of the conversion process, a list of affected programs should be created, Ts'o said. To his "total shock", he learned the hard way that the shared-library loader needs unique inode numbers, because that is how it distinguishes different libraries. Overstreet wanted to hear that story, but it is a long one that might need a more casual setting to relate, Ts'o said.
This problem will only get worse in, say, 20 years when 64 bits is even less able to handle the number of inodes, Overstreet said. If the right tools are provided to user-space developers, they will help find and fix all of the problems. But Poettering cautioned that getting rid of the reliance on the uniqueness of inode numbers is going to be extremely difficult. It is used to ensure that the same resources are not loaded multiple times, for example, so it would be better to provide user-space APIs that directly address that problem.
There was some discussion of various ways to try to add information to the inode number to solve that problem, but there is nothing generalized for all filesystems; it is fair to say there is not any real agreement on how to do that within the filesystem community. Ts'o asked Poettering if file handles, which have more bits to work with, would solve his problems. Poettering said that "file handles are great", but it requires different privileges to query them and they are not implemented on all filesystems, so he still needs a fallback.
For example, he wondered about getting file handles for procfs files, though it was not entirely clear what the answer to that was. Beyond that, he asked if there was a limit on the size of a file handle; Overstreet said it was a string, so there was no limit. There was some mention of using a hash on the file handles to create a fixed-length quantity, but the end of the session was a bit chaotic, with multiple side discussions all going on at once.
Brauner got the last word in, pretty much, when he said that he originally had been scared of adding an option to return zero for all inode numbers. But he sees that it makes sense as a tool for educating user space that inode numbers are not unique. There is still a need to provide user space with some kind of API to determine whether two files are actually the same, but that will have to be worked out later—on the mailing list or perhaps at a future summit.
Index entries for this article:
Kernel: Filesystems/APIs
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024
Posted May 30, 2024 13:49 UTC (Thu) by pj (subscriber, #4506) [Link] (9 responses)
Posted May 30, 2024 15:38 UTC (Thu) by tux3 (subscriber, #101245) [Link] (6 responses)
Posted May 30, 2024 17:07 UTC (Thu) by jra (subscriber, #55261) [Link] (4 responses)
There are two missing pieces with race conditions though. One is unlinkat(fd, name, flags). The problem is "name" may be rename-raced for the unlink. One way to fix that would be to add these semantics to unlinkat():
unlinkat(fd, NULL, flags)
where fd becomes a handle pointing to the object to unlink. Hard links are the problem here.
The second is renameat(), which suffers from the same rename-race problem for the source name. A similar solution,
renameat(fd, NULL, dirfd, newpath)
where fd becomes a handle pointing to the object to rename works. If we can work out the hardlink problem, this would fortify the API against races considerably.
Posted May 30, 2024 17:14 UTC (Thu) by jra (subscriber, #55261) [Link]
Posted Jun 1, 2024 14:26 UTC (Sat) by jengelh (guest, #33263) [Link] (1 response)
If you unlinkat(open("foo.txt",O_RDONLY),NULL,0), sure, that's ambiguous, because the fd refers to the inode, which may have anywhere from *0* to UINT64_MAX hardlinks and you're not saying which of the hardlinks you want gone.
Posted Jun 1, 2024 21:04 UTC (Sat) by dezgeg (subscriber, #92243) [Link]
Posted Jun 1, 2024 19:04 UTC (Sat) by Wol (subscriber, #4433) [Link]
Except surely the whole point of hardlinks is to provide multiple names to one object. The hardlink "problem" simply says to me that nobody has worked out the difference between a name (reference) and an object (inode). What exactly are you trying to do - do you want to delete the *reference* (in which case you have to use the filename), or do you want to delete the *thing* in which case you use the handle.
So we have two completely different possibilities - "delete by name" which deletes the object if all references to it have disappeared, and "delete the thing" which then has to clean up dangling references. Possibly by "null"ing the object, and fixing the filesystem driver to clean up null'd inode references when it scans a directory. After all, "file close" will autodelete a file with no references.
Cheers,
Posted May 31, 2024 0:36 UTC (Fri) by Paf (guest, #91811) [Link]
The point is, unless I've missed something, they're not at all suitable for normal use without further refinement. As of today, open by handle is a privileged operation due to the permission business, but that is the only way to actually use a handle.
Posted May 30, 2024 20:24 UTC (Thu) by shironeko (subscriber, #159952) [Link] (1 response)
Posted May 31, 2024 7:13 UTC (Fri) by mb (subscriber, #50428) [Link]
Posted May 30, 2024 15:01 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 response)
One of the reasons Linux doesn't (as far as I can make out) have decent ACLs is exactly this. Plus it might have copied Windows ...
But as part of the file identifier, is it possible to include the id of the canonical parent directory? This is categorically NOT defined as always being valid, but inasmuch as the system hasn't engaged in hard-link shenanigans, it should provide a valid canonical path. Creating, moving, and copying files would update this parent-id, but hardlinking wouldn't. So hardlinking and then deleting the original parent directory would break the link.
But provided that parent-id can be queried regardless of snapshots, subvolumes, etc etc, it should be possible to declare two files as the same by following that path.
And it might result in a decent ACL system :-)
Cheers,
Posted May 30, 2024 15:15 UTC (Thu) by koverstreet (✭ supporter ✭, #4296) [Link]
Posted May 30, 2024 19:23 UTC (Thu) by geofft (subscriber, #59789) [Link] (2 responses)
For the inode problem, I think almost all uses of inode numbers in userspace are to tell if two files are the same. You can imagine userspace APIs to do this specific task that use handles within the kernel. For instance, imagine something kind of like kcmp(2), with *at-style arguments for filenames (and without the pid arguments):
This API is enough to implement the commonly-supported shell syntax [ file1 -ef file2 ], though in practice I see a lot of shell users writing something like [ "$(stat -c %i file1 )" -eq "$(stat -c %i file2)" ] (possibly because -ef is non-POSIX... but it's supported in ksh, bash, and dash, and stat -c isn't the right syntax on e.g. macOS, so -ef would be more portable). So there will need to be some publicity so people start migrating.
You could also imagine some API where the kernel maintained a set of file handles and told you if a handle was already in it, with some API kind of like epoll:
For opendir would it work to return O_PATH file descriptors? Now that we have close_range(2), it's less annoying to have to close a pile of file descriptors.
Posted May 31, 2024 14:01 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (1 response)
Posted May 31, 2024 14:23 UTC (Fri) by geofft (subscriber, #59789) [Link]
Posted May 30, 2024 20:01 UTC (Thu) by roc (subscriber, #30627) [Link]
Posted May 30, 2024 20:06 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link]
https://git.yoctoproject.org/pseudo/
If you delete a file outside of pseudo and then create another file that happens to have the same inode number, pseudo becomes confused:
https://wiki.yoctoproject.org/wiki/Pseudo_Abort
Posted May 31, 2024 0:48 UTC (Fri) by Paf (guest, #91811) [Link] (2 responses)
Kent, today, open by handle is a privileged op because it intrinsically bypasses the directory hierarchy (and the associated permissions). What are the ideas for dealing with that? I had one up thread which would be an API for getting an open handle from an fd, so you keep *open by handle* privileged. Is that the thought or is there something else (or have I missed something and that already exists)?
Posted May 31, 2024 13:59 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link]
But mostly that's not what we want to expose filehandles for. It's more just for uniquely identifying files for detecting hardlinks and rigorously avoiding TOCTOU races.
Posted Jun 1, 2024 14:37 UTC (Sat) by cesarb (subscriber, #6266) [Link]
As a curiosity: according to what I have read, Microsoft Windows has a privilege called "Bypass Traverse Checking" which is enabled by default, and which when enabled bypasses the directory permission checks, and also allows the use of the Windows equivalent of "open by handle" (OpenFileById). So it's a "privileged operation", but most users there already have the necessary privilege (and disabling it breaks other things, since the same privilege is also used for directory change notification).
Posted May 31, 2024 7:42 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (9 responses)
Is this one of these situations where "We do not break userspace, except when we do"?
In the cache that weborf (my thesis, from a long time ago) generates, the inode number is part of the cache key. Setting all inode numbers to 0, or having collisions for inodes on the same filesystem, could certainly cause the server to send wrong content to a client.
I think a web server is "userspace"… So I'm baffled at how this is even being considered.
Posted May 31, 2024 10:45 UTC (Fri) by taladar (subscriber, #68407) [Link] (8 responses)
Posted May 31, 2024 16:58 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (7 responses)
Or is this about efficiency? I.e. you don't want to keep a central store of the available inode numbers because it's too slow, so you would rather do something decentralized like a UUID? In that case... I really don't think it is reasonable to break userspace on the basis of efficiency, unless it's in a new (or at least currently-unstable) filesystem, with appropriate "nothing is going to be compatible with this" warnings in the documentation.
(It is not "breaking userspace" to provide a filesystem that does weird things and isn't compatible with commonly-used software, or else NFS would not be allowed to exist. The sysadmin can decide whether they want to deal with the compatibility issues, and keep using ext4 if they prefer stability.)
Posted May 31, 2024 17:19 UTC (Fri) by pizza (subscriber, #46) [Link]
Just feel the need to point out that NFS predates Linux and that commonly-used software.
...This is another scenario where the "everything is a simple file" abstraction leaks all over one's shoes.
Posted May 31, 2024 17:45 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (1 response)
And bcachefs shards inode numbers by CPU id, that takes more bits.
Posted Jun 3, 2024 8:42 UTC (Mon) by DemiMarie (subscriber, #164188) [Link]
Posted Jun 1, 2024 14:14 UTC (Sat) by Paf (guest, #91811) [Link] (3 responses)
That doesn’t make *me* super comfortable over the medium to long term. Especially if you want to offer file system lifetime uniqueness rather than allow reuse. Which is why those systems generally already have much larger internal handles (which they tend to expose as file handles if they support those APIs).
Posted Jun 2, 2024 17:29 UTC (Sun) by willy (subscriber, #9762) [Link] (2 responses)
There are a lot of tradeoffs in FS development which aren't immediately apparent to people who aren't FS developers. It's what keeps it interesting!
Posted Jun 4, 2024 10:00 UTC (Tue) by paulj (subscriber, #341) [Link]
Empirical evidence in telephony and networking suggests it is hard to get past an HD-ratio of 87% (which may correspond to a "direct" utilization of less than 20% of usable versus total address space), at least in those contexts. See RFC 3194:
Posted Jun 4, 2024 18:02 UTC (Tue) by Paf (guest, #91811) [Link]
Posted May 31, 2024 19:43 UTC (Fri) by gray_-_wolf (subscriber, #131074) [Link] (2 responses)
Posted May 31, 2024 22:27 UTC (Fri) by neilbrown (subscriber, #359) [Link] (1 response)
'tar' keeps a list of inode numbers for every file that has two or more links; usually that is a small number of files. If we instead asked it to keep a full file handle for every linked file, and used it on a directory tree where everything was a hard link (which certainly does happen in practice), then tar would use more RAM. File handles might be double the size of dev+inode, so it wouldn't be enormously more RAM.
Posted Jun 3, 2024 19:40 UTC (Mon) by Tobu (subscriber, #24111) [Link]
Re alloc efficiency, is the size of a filehandle fixed for a given mount (mount_id or dev_id)? Going by the name_to_handle_at manpage (paragraph starting with “The caller can discover the required size”), it seems to be, but I don't know if it's an actual guarantee. If it's basically ino_t but slightly expanded so there's bits for snapshots, cpu sharding, layered mounts, and future proofing, it doesn't seem like a big impact, but a symbolic key of variable length feels heavier, for use cases like backups or indexing. Layered mounts do seem to imply you have to add to the largest of the lower mounts' filehandle size.
On a related note, is POSIX the place to discuss APIs with other platforms these days?
There's this grammatical phenomenon where people talk about POSIX in the past tense, not so much about what's up and coming from them. Is that just a misconception I have?
with the old familiar integer fd's. I'm not aware of any common userspace code that uses struct file_handle though.
>>unlinkat(fd, NULL, flags)
>Hard links are the problem here.
But surely that's what unlinkat(open("foo.txt",O_PATH),NULL,0) is supposed to address.
How will the proposal to use file handles get over the current restriction in open_by_handle_at(2) that only privileged users can call it? Right now you're able to have a world-readable file in a non-world-accessible directory and that's enough to make sure other users can't get to it. This is useful in practice to e.g. unpack/rsync/etc. files with arbitrary permissions somewhere in your home directory, when your home directory is mode 700. So there isn't an unprivileged API to work with file handles because that would bypass that protection.
int compare_files_at(int dirfd1, const char *path1, int flags1, int dirfd2, const char *path2, int flags2);
which would tell you if the files you'd get from openat(dirfd1, path1, flags1) and openat(dirfd2, path2, flags2) are the same file. Like kcmp it would give you some sort order if they weren't equal. This is easy to implement with actual file handles, but if that's hard for a filesystem, it can do something else.
int fhset_create(int flags);
int fhset_ctl(int fhset, int op, int dirfd, const char *path, int flags, u64 *userdata);
where, if you add a file that you've seen before, you get EEXIST and userdata gets set to the value that was previously passed when you added it originally. Then you can have some userspace list of the filename you previously used, or whatever other metadata you find useful. I think this is sufficient for tools like tar and find and the glibc loader (whose deduping on inodes I have also had the misfortune of running into).