New APIs for filesystems
A discussion of extensions to the statx() system call comes up frequently at the Linux Storage, Filesystem, Memory Management, and BPF Summit; this year's edition was no exception. Kent Overstreet led the first filesystem-only session at the summit on querying information about filesystems that have subvolumes and snapshots. While it was billed as a discussion on statx() additions, it ranged more widely over new APIs needed for modern filesystems.
Brainstorming
Overstreet began the session with the idea that it would be something of a brainstorming exercise to come up with additions to the filesystem APIs. He had some thoughts on new features, but wanted to hear what other attendees were thinking so that a list of tasks could be gathered. He said that he did not plan to do all of the work on that list himself, but he would help coordinate it.
He has started thinking about per-subvolume disk accounting for bcachefs, which led him to the need for a way to iterate over subvolumes. He mentioned some previous discussion where Al Viro had an idea for an iterator API that would return an open file descriptor for each subvolume. "That was crazy and cool", Overstreet said; it also fits well with various openat()-style interfaces. He thinks there is a simpler approach, however.
![Kent Overstreet](https://static.lwn.net/images/2024/lsfmb-overstreet-sm.png)
Adding a flags parameter to opendir() would allow creating a flag for iterating over subvolumes and submounts. Subvolumes and mounts have a lot in common, he has noticed recently; user-space developers would like to have ways to work with them, which this would provide.
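The proposed flag does not exist yet; as a sketch of where it would slot in, the closest existing pattern already separates "open the directory" from "iterate it" by combining openat()-style open flags with fdopendir(). The helper name below is invented for illustration:

```c
/* Sketch: where an "opendir() with flags" would fit today. There is
 * no subvolume-iteration flag yet; this uses the existing
 * open()/fdopendir() combination, whose flags argument is the
 * natural place for such a flag to land. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Count the entries in the directory at 'path'; returns -1 on error. */
int count_entries(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    DIR *d = fdopendir(fd);   /* takes ownership of fd */
    if (!d) {
        close(fd);
        return -1;
    }

    int n = 0;
    struct dirent *de;
    while ((de = readdir(d)) != NULL)
        n++;                  /* includes "." and ".." */

    closedir(d);
    return n;
}
```

A hypothetical subvolume flag would simply be OR-ed into the open() flags, leaving the readdir() loop unchanged.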
Extended attributes (xattrs) on files are also in need of an iterator interface of some kind, he said. Those could smoothly fit into the scheme he is proposing. The existing getdents() interface is "nice and clean", he said, so it could be used for xattrs as well.
The stx_subvol field has recently been added to statx() for subvolume identifiers. Another statx() flag will be needed to identify whether a file is in a snapshot; with it, coreutils could filter out snapshots by default, so that someone working through the filesystem does not see the same files over and over again.
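Querying the new field looks like the sketch below. It assumes headers recent enough to define STATX_SUBVOL (Linux 6.10 era) and falls back gracefully when they are older; the helper name is invented here:

```c
/* Sketch: read the per-subvolume identifier via statx(). Assumes
 * kernel headers that define STATX_SUBVOL; compiles (and reports
 * "unsupported") on older systems. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* Fetch the subvolume ID for 'path' into *subvol.
 * Returns 0 on success, -1 if unsupported or on error. */
int get_subvol(const char *path, unsigned long long *subvol)
{
#ifdef STATX_SUBVOL
    struct statx stx;

    if (statx(AT_FDCWD, path, 0, STATX_SUBVOL, &stx) != 0)
        return -1;
    if (!(stx.stx_mask & STATX_SUBVOL))
        return -1;              /* filesystem has no subvolumes */
    *subvol = stx.stx_subvol;
    return 0;
#else
    (void)path;
    (void)subvol;
    return -1;                  /* headers predate STATX_SUBVOL */
#endif
}
```

On ext4 or other non-subvolume filesystems the kernel simply leaves STATX_SUBVOL out of stx_mask, which is why the mask check is needed.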
Steve French asked a "beginner question" about how to list the snapshots for a given mount in a generic fashion on Linux. Overstreet said that a snapshot is a type of subvolume and that "a subvolume is a fancy directory". This new opendir() interface could be used to iterate over the subvolumes and the new statx() flag could be used to check which are snapshots.
All of the information that statfs() returns for a mounted filesystem should also be available for subvolumes, he said, "continuing with the theme that subvolumes and mount points actually have a lot in common". That includes things like disk-space usage and the number of inodes used.
Dave Chinner said that XFS already has a similar interface based on project IDs, where a directory entry that corresponds to a particular project can be passed to statfs() to retrieve the information specific to that project. He said that filesystems could examine the passed-in directory and decide what to report based on that, so no new system call would be needed. Overstreet was skeptical that users who type df in their home directory would expect to only get information for the subvolume it is in, rather than the whole disk, as they do now. He thought a new system call would be the right way to approach it.
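What makes the project-ID trick possible is that statfs() already takes an arbitrary path, so a filesystem can key its answer off the directory it is handed. A minimal caller, with an invented helper name, looks like this:

```c
/* Sketch: statfs() on a path. With per-subvolume (or per-project)
 * reporting, the same call on a subvolume's directory would return
 * that subvolume's numbers rather than the whole filesystem's. */
#include <sys/vfs.h>    /* statfs(2) */

/* Bytes available to unprivileged users on the filesystem
 * containing 'path'; returns -1 on error. */
long long fs_free_bytes(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
    return (long long)sfs.f_bavail * (long long)sfs.f_bsize;
}
```

Overstreet's objection is about defaults, not mechanism: df calls exactly this, and changing what the answer means for a path inside a subvolume would surprise users.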
French said that other operating systems have a way to simply open a version of a file from a snapshot without actually having to directly work with the entire snapshot subvolume itself. A user can simply open a file from a given snapshot identifier, which is convenient and not really possible on Linux. Overstreet acknowledged the problem, but said that he did not think a new system call was needed to support that use case. Using the new interfaces that are being discussed, user space should be able to handle that functionality, perhaps using read-only mounts of snapshots in such a way that the user does not directly have to work with them.
User-space concerns
But Lennart Poettering said: "as a user-space person, I find it a bit weird" that opendir() is seen as a good interface for this functionality. In many ways, he finds opendir() to be "a terrible API" because it gives you a filename, but then you have to open the file to get more information, which does not necessarily match up because there can be a race between the two operations. He would much prefer to get a file descriptor when enumerating things so that the state cannot change between the two.
There are some other mismatches between opendir() and subvolumes, he continued. Right now, user space expects to get filenames from readdir(), which means they do not contain the slash ("/") character, but subvolume path names do. In addition, the filename returned in the struct dirent can only be 255 characters long, which is too restrictive for subvolume names.
In the end, Poettering thinks that user-space programs do not want to get filenames, they want something that cannot change out from under them. Jeff Layton suggested using file handles instead, which Poettering agreed would be better still. Christian Brauner noted that the listmount() system call uses a new 64-bit mount ID, but there is no way to go from that mount ID to a file descriptor or handle. It would be easy to add, however.
Overstreet said that he plans to add firmlinks, which is an Apple filesystem (APFS) feature that fits in between hard links and symbolic links. It would use a file handle and filesystem universally unique ID (UUID) to identify a particular file. Amir Goldstein said that overlayfs also uses those two IDs to identify its files, so Overstreet thought that perhaps that scheme should become a standard for Linux filesystems.
There are some other missing pieces for file handles, though, he said. There is no system call to go from a file handle to a path. Goldstein said that the ability exists, but it is only reliable for directories. "That's because hard links suck", Overstreet said; Goldstein agreed that was part of it, but Jan Kara said that there are some filesystems that cannot provide that mapping.
It is getting increasingly difficult to guarantee inode-number uniqueness, Overstreet said. Most of the discussion about his proposal for a session at LSFMM+BPF revolved around the problem and its solutions; it has come up at earlier summits, as well. The basic problem is that more-recent filesystems (Btrfs, bcachefs) have lots of trouble ensuring that inode numbers are unique across all of the subvolumes/snapshots in a mounted filesystem, which confuses tools like rsync and tar.
The 64-bit inode space is simply too small to guarantee uniqueness, he said, but there are various schemes that have been used to make things work. He would rather not be "kicking cans down the road" and thinks filesystem developers need to nudge user-space developers to start using file handles for uniqueness detection "sooner, rather than later".
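The kernel half of this already exists: name_to_handle_at(2) hands an opaque, filesystem-chosen handle to unprivileged callers (only the reverse direction, open_by_handle_at(2), is privileged). A sketch of the usual two-call size-discovery dance, with an invented helper name:

```c
/* Sketch: obtain a file handle for uniqueness comparison. Uses the
 * documented pattern of calling name_to_handle_at() once with
 * handle_bytes == 0 to learn the required size. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Get a file handle for 'path'. Returns a malloc'd handle (caller
 * frees) or NULL if the filesystem does not support handles. */
struct file_handle *get_handle(const char *path, int *mount_id)
{
    struct file_handle probe = { .handle_bytes = 0 };

    /* Zero-size call fails with EOVERFLOW and fills in the size. */
    if (name_to_handle_at(AT_FDCWD, path, &probe, mount_id, 0) == 0)
        return NULL;            /* should not happen */
    if (errno != EOVERFLOW)
        return NULL;            /* e.g. EOPNOTSUPP */

    struct file_handle *fh = malloc(sizeof(*fh) + probe.handle_bytes);
    if (!fh)
        return NULL;
    fh->handle_bytes = probe.handle_bytes;

    if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) != 0) {
        free(fh);
        return NULL;
    }
    return fh;
}
```

Two files are the same object exactly when their handles (and filesystem identity, e.g. the returned mount ID or a UUID) compare equal, with no inode-number aliasing possible.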
Inode zero
He noted a recent "kerfuffle" regarding filesystems that return all inode numbers as zero values, which broke lots of user-space tools. That will become more prevalent over time, so he wondered if it made sense to add a mount option that would intentionally report the same inode number for every file in order to shake out those kinds of problems. Chinner suggested using a sysctl instead, which Overstreet agreed would be a better choice.
Ted Ts'o said that in order to get user space on board with a switch to using file handles, it is important to make it a cross-OS initiative. Lots of maintainers of user-space tools want to ensure that they work on macOS and the BSDs. If it can get to a point where using file handles is "supported by more forward-leaning, POSIX-like filesystems", the chances will be much better for getting enough of user space converted so that it is possible to return zeroes for inode numbers without breaking everything. It will still be a multi-year effort, which means that it is worth taking the time to try to ensure that it can be adopted more widely than just on Linux.
Overstreet asked about support for file handles in the other operating systems; Chinner said that anything that supports NFS must have some form of file-handle support. Ts'o agreed but said that the others may not export file-handle information to user space.
As part of the conversion process, a list of affected programs should be created, Ts'o said. To his "total shock", he learned the hard way that the shared-library loader needs unique inode numbers, because that is how it distinguishes different libraries. Overstreet wanted to hear that story, but it is a long one that might need a more casual setting to relate, Ts'o said.
This problem will only get worse in, say, 20 years when 64 bits is even less able to handle the number of inodes, Overstreet said. If the right tools are provided to user-space developers, they will help find and fix all of the problems. But Poettering cautioned that getting rid of the reliance on the uniqueness of inode numbers is going to be extremely difficult. It is used to ensure that the same resources are not loaded multiple times, for example, so it would be better to provide user-space APIs that directly address that problem.
There was some discussion of various ways to try to add information to the inode number to solve that problem, but there is nothing generalized for all filesystems; it is fair to say there is not any real agreement on how to do that within the filesystem community. Ts'o asked Poettering if file handles, which have more bits to work with, would solve his problems. Poettering said that "file handles are great", but it requires different privileges to query them and they are not implemented on all filesystems, so he still needs a fallback.
For example, he wondered about getting file handles for procfs files, though it was not entirely clear what the answer to that was. Beyond that, he asked if there was a limit on the size of a file handle; Overstreet said it was a string, so there was no limit. There was some mention of using a hash on the file handles to create a fixed-length quantity, but the end of the session was a bit chaotic, with multiple side discussions all going on at once.
Brauner got the last word in, pretty much, when he said that he originally had been scared of adding an option to return zero for all inode numbers. But he sees that it makes sense as a tool for educating user space that inode numbers are not unique. There is still a need to provide user space with some kind of API to determine whether two files are actually the same, but that will have to be worked out later—on the mailing list or perhaps at a future summit.
Index entries for this article:
Kernel: Filesystems/APIs
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2024
Posted May 30, 2024 13:49 UTC (Thu) by pj (subscriber, #4506) [Link] (9 responses)
Posted May 30, 2024 15:38 UTC (Thu) by tux3 (subscriber, #101245) [Link] (6 responses)
Posted May 30, 2024 17:07 UTC (Thu) by jra (subscriber, #55261) [Link] (4 responses)
There are two missing pieces with race conditions though. One is unlinkat(fd, name, flags). The problem is "name" may be rename-raced for the unlink. One way to fix that would be to add these semantics to unlinkat():
unlinkat(fd, NULL, flags)
where fd becomes a handle pointing to the object to unlink. Hard links are the problem here.
The second is renameat(), which suffers from the same rename-race problem for the source name. A similar solution,
renameat(fd, NULL, dirfd, newpath)
where fd becomes a handle pointing to the object to rename works. If we can work out the hardlink problem, this would fortify the API against races considerably.
Posted May 30, 2024 17:14 UTC (Thu) by jra (subscriber, #55261) [Link]
Posted Jun 1, 2024 14:26 UTC (Sat) by jengelh (guest, #33263) [Link] (1 response)
If you unlinkat(open("foo.txt",O_RDONLY),NULL,0), sure, that's ambiguous, because the fd refers to the inode, which may have anywhere from *0* to UINT64_MAX hardlinks and you're not saying which of the hardlinks you want gone.
Posted Jun 1, 2024 21:04 UTC (Sat) by dezgeg (subscriber, #92243) [Link]
Posted Jun 1, 2024 19:04 UTC (Sat) by Wol (subscriber, #4433) [Link]
Except surely the whole point of hardlinks is to provide multiple names to one object. The hardlink "problem" simply says to me that nobody has worked out the difference between a name (reference) and an object (inode). What exactly are you trying to do - do you want to delete the *reference* (in which case you have to use the filename), or do you want to delete the *thing* in which case you use the handle.
So we have two completely different possibilities - "delete by name" which deletes the object if all references to it have disappeared, and "delete the thing" which then has to clean up dangling references. Possibly by "null"ing the object, and fixing the filesystem driver to clean up null'd inode references when it scans a directory. After all, "file close" will autodelete a file with no references.
Cheers,
Posted May 31, 2024 0:36 UTC (Fri) by Paf (guest, #91811) [Link]
The point is, unless I've missed something, they're not at all suitable for normal use without further refinement. As of today, open by handle is a privileged operation due to the permission business, but that is the only way to actually use a handle.
Posted May 30, 2024 20:24 UTC (Thu) by shironeko (subscriber, #159952) [Link] (1 response)
Posted May 31, 2024 7:13 UTC (Fri) by mb (subscriber, #50428) [Link]
Posted May 30, 2024 15:01 UTC (Thu) by Wol (subscriber, #4433) [Link] (1 response)
One of the reasons Linux doesn't (as far as I can make out) have decent ACLs is exactly this. Plus it might have copied Windows ...
But as part of the file identifier, is it possible to include the id of the canonical parent directory? This is categorically NOT defined as always being valid, but inasmuch as the system hasn't engaged in hard-link shenanigans, it should provide a valid canonical path. Creating, moving, and copying files would update this parent-id, but hardlinking wouldn't. So hardlinking and then deleting the original parent directory would break the link.
But provided that parent-id can be queried regardless of snapshots, subvolumes, etc etc, it should be possible to declare two files as the same by following that path.
And it might result in a decent ACL system :-)
Cheers,
Posted May 30, 2024 15:15 UTC (Thu) by koverstreet (✭ supporter ✭, #4296) [Link]
Posted May 30, 2024 19:23 UTC (Thu) by geofft (subscriber, #59789) [Link] (2 responses)
For the inode problem, I think almost all uses of inode numbers in userspace are to tell if two files are the same. You can imagine userspace APIs to do this specific task that use handles within the kernel. For instance, imagine something kind of like kcmp(2), with *at-style arguments for filenames (and without the pid arguments):
This API is enough to implement the commonly-supported shell syntax [ file1 -ef file2 ], though in practice I see a lot of shell users writing something like [ "$(stat -c %i file1 )" -eq "$(stat -c %i file2)" ] (possibly because -ef is non-POSIX... but it's supported in ksh, bash, and dash, and stat -c isn't the right syntax on e.g. macOS, so -ef would be more portable). So there will need to be some publicity so people start migrating.
You could also imagine some API where the kernel maintained a set of file handles and told you if a handle was already in it, with some API kind of like epoll:
For opendir would it work to return O_PATH file descriptors? Now that we have close_range(2), it's less annoying to have to close a pile of file descriptors.
Posted May 31, 2024 14:01 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (1 response)
Posted May 31, 2024 14:23 UTC (Fri) by geofft (subscriber, #59789) [Link]
Posted May 30, 2024 20:01 UTC (Thu) by roc (subscriber, #30627) [Link]
Posted May 30, 2024 20:06 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link]
https://git.yoctoproject.org/pseudo/
If you delete a file outside of pseudo and then create another file that happens to have the same inode number, pseudo becomes confused:
https://wiki.yoctoproject.org/wiki/Pseudo_Abort
Posted May 31, 2024 0:48 UTC (Fri) by Paf (guest, #91811) [Link] (2 responses)
Kent, today, open by handle is a privileged op because it intrinsically bypasses the directory hierarchy (and the associated permissions). What are the ideas for dealing with that? I had one up thread which would be an API for getting an open handle from an fd, so you keep *open by handle* privileged. Is that the thought or is there something else (or have I missed something and that already exists)?
Posted May 31, 2024 13:59 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link]
But mostly that's not what we want to expose filehandles for. It's more just for uniquely identifying files for detecting hardlinks and rigorously avoiding TOCTOU races.
Posted Jun 1, 2024 14:37 UTC (Sat) by cesarb (subscriber, #6266) [Link]
As a curiosity: according to what I have read, Microsoft Windows has a privilege called "Bypass Traverse Checking" which is enabled by default, and which when enabled bypasses the directory permission checks, and also allows the use of the Windows equivalent of "open by handle" (OpenFileById). So it's a "privileged operation", but most users there already have the necessary privilege (and disabling it breaks other things, since the same privilege is also used for directory change notification).
Posted May 31, 2024 7:42 UTC (Fri) by LtWorf (subscriber, #124958) [Link] (9 responses)
Is this one of these situations where "We do not break userspace, except when we do"?
In the cache that weborf (my thesis, from a long time ago) generates, the inode number is part of the cache key. Setting all inode numbers to 0, or having collisions for inodes on the same filesystem, could certainly cause the server to send wrong content to a client.
I think a web server is "userspace"… So I'm baffled at how this is even being considered.
Posted May 31, 2024 10:45 UTC (Fri) by taladar (subscriber, #68407) [Link] (8 responses)
Posted May 31, 2024 16:58 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (7 responses)
Or is this about efficiency? I.e. you don't want to keep a central store of the available inode numbers because it's too slow, so you would rather do something decentralized like a UUID? In that case... I really don't think it is reasonable to break userspace on the basis of efficiency, unless it's in a new (or at least currently-unstable) filesystem, with appropriate "nothing is going to be compatible with this" warnings in the documentation.
(It is not "breaking userspace" to provide a filesystem that does weird things and isn't compatible with commonly-used software, or else NFS would not be allowed to exist. The sysadmin can decide whether they want to deal with the compatibility issues, and keep using ext4 if they prefer stability.)
Posted May 31, 2024 17:19 UTC (Fri) by pizza (subscriber, #46) [Link]
Just feel the need to point out that NFS predates Linux and that commonly-used software.
...This is another scenario where the "everything is a simple file" abstraction leaks all over one's shoes.
Posted May 31, 2024 17:45 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (1 response)
And bcachefs shards inode numbers by CPU id, that takes more bits.
Posted Jun 3, 2024 8:42 UTC (Mon) by DemiMarie (subscriber, #164188) [Link]
Posted Jun 1, 2024 14:14 UTC (Sat) by Paf (guest, #91811) [Link] (3 responses)
That doesn’t make *me* super comfortable over the medium to long term. Especially if you want to offer file system lifetime uniqueness rather than allow reuse. Which is why those systems generally already have much larger internal handles (which they tend to expose as file handles if they support those APIs).
Posted Jun 2, 2024 17:29 UTC (Sun) by willy (subscriber, #9762) [Link] (2 responses)
There are a lot of tradeoffs in FS development which aren't immediately apparent to people who aren't FS developers. It's what keeps it interesting!
Posted Jun 4, 2024 10:00 UTC (Tue) by paulj (subscriber, #341) [Link]
Empirical evidence in telephony and networking suggests it is hard to get past an HD-ratio of 87% (which may correspond to a "direct" utilization of less than 20% of usable versus total address space), at least in those contexts. See RFC 3194:
Posted Jun 4, 2024 18:02 UTC (Tue) by Paf (guest, #91811) [Link]
Posted May 31, 2024 19:43 UTC (Fri) by gray_-_wolf (subscriber, #131074) [Link] (2 responses)
Posted May 31, 2024 22:27 UTC (Fri) by neilbrown (subscriber, #359) [Link] (1 response)
'tar' keeps a list of inode numbers for every file that has two or more links; usually that is a small number of files. If we instead asked it to keep a full file handle for every linked file, and used it on a directory tree where everything was a hard link (which certainly does happen in practice), then tar would use more RAM. File handles might be double the size of dev+inode, so it wouldn't be enormously more RAM.
Posted Jun 3, 2024 19:40 UTC (Mon) by Tobu (subscriber, #24111) [Link]
Re alloc efficiency, is the size of a filehandle fixed for a given mount (mount_id or dev_id)? Going by the name_to_handle_at manpage (paragraph starting with “The caller can discover the required size”), it seems to be, but I don't know if it's an actual guarantee. If it's basically ino_t but slightly expanded so there's bits for snapshots, cpu sharding, layered mounts, and future proofing, it doesn't seem like a big impact, but a symbolic key of variable length feels heavier, for use cases like backups or indexing. Layered mounts do seem to imply you have to add to the largest of the lower mounts' filehandle size.
On a related note, is POSIX the place to discuss APIs with other platforms these days?
There's this grammatical phenomenon where people talk about POSIX in the past tense, not so much about what's up and coming from them. Is that just a misconception I have?
with the old familiar integer fd's. I'm not aware of any common userspace code that uses struct file_handle though.
>>unlinkat(fd, NULL, flags)
>Hard links are the problem here.
But surely that's what unlinkat(open("foo.txt",O_PATH),NULL,0) is supposed to address.
How will the proposal to use file handles get over the current restriction in open_by_handle_at(2) that only privileged users can call it? Right now you're able to have a world-readable file in a non-world-accessible directory and that's enough to make sure other users can't get to it. This is useful in practice to e.g. unpack/rsync/etc. files with arbitrary permissions somewhere in your home directory, when your home directory is mode 700. So there isn't an unprivileged API to work with file handles because that would bypass that protection.
int compare_files_at(int dirfd1, const char *path1, int flags1, int dirfd2, const char *path2, int flags2);
which would tell you if the files you'd get from openat(dirfd1, path1, flags1) and openat(dirfd2, path2, flags2) are the same file. Like kcmp it would give you some sort order if they weren't equal. This is easy to implement with actual file handles, but if that's hard for a filesystem, it can do something else.
int fhset_create(int flags);
int fhset_ctl(int fhset, int op, int dirfd, const char *path, int flags, u64 *userdata);
where, if you add a file that you've seen before, you get EEXIST and userdata gets set to the value that was previously passed when you added it originally. Then you can have some userspace list of the filename you previously used, or whatever other metadata you find useful. I think this is sufficient for tools like tar and find and the glibc loader (whose deduping on inodes I have also had the misfortune of running into).