Toward a better definition for i_version

By Jonathan Corbet
August 26, 2022

Filesystems maintain a lot of metadata about the files they hold; most of this metadata is for consumption by user space. Some metadata, though, stays buried within the filesystem and is not visible outside of the kernel. One such piece of metadata is the file version count, known as i_version. Current efforts to change how i_version is managed — and to make it visible to user space — have engendered a debate on what i_version actually means and what its behavior should be.

Early versions of `i_version`

Version 0.99.7 of the kernel was released on March 13, 1993. Those were exciting times; among other things, this release included a version of the mmap() system call that was, according to a young Linus Torvalds, "finally starting to really happen". This release also brought a new filesystem by Rémy Card called "ext2fs" — the distant ancestor of the ext4 filesystem currently used by many Linux systems.

As part of the ext2fs addition, the kernel's inode structure was augmented with a field called i_version, which was noted in a comment as being for the NFS filesystem. Nothing actually used that field until the 0.99.14 release in November of that year, when an ioctl() call was added to provide access to i_version. Those of us who were valiantly trying to use NFS on Linux in those days will remember that the server ran in user space then, so this ioctl() call was needed for i_version to be useful for NFS.

Initially, i_version was incremented whenever a given inode number was reused for a new file. This is an event that the NFS server needs to know about; otherwise a file handle created for one file could be used to access a completely different file that happened to end up with the same inode number, with aesthetically displeasing results. Version 2.2.3pre1 in 1999 added a new i_generation field to be used for this purpose instead, though it was not actually used until the 2.3.1pre1 development kernel in May of that year. When i_generation took over this role, i_version became a sort of counter for versions of the same file, incremented on changes in a filesystem-specific way (for filesystems that managed i_version at all).
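The role of a generation number in catching inode-number reuse can be illustrated with a small toy model. This is only a sketch, not kernel or NFS server code: `ToyFilesystem`, `FileHandle`, and their methods are hypothetical names invented for the example, and it assumes a file handle is simply the pair (inode number, generation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandle:
    """Hypothetical handle: inode number plus the generation seen at creation."""
    ino: int
    generation: int


class ToyFilesystem:
    """Toy model of inode-number reuse; not how any real filesystem is coded."""

    def __init__(self):
        self._generation = {}  # inode number -> current generation
        self._live = set()     # inode numbers currently in use

    def create(self, ino):
        # Each (re)use of an inode number gets a new generation.
        self._generation[ino] = self._generation.get(ino, 0) + 1
        self._live.add(ino)
        return FileHandle(ino, self._generation[ino])

    def unlink(self, ino):
        self._live.discard(ino)

    def is_stale(self, handle):
        # A handle is stale if its inode is gone or has been reused since.
        return (handle.ino not in self._live
                or self._generation[handle.ino] != handle.generation)


fs = ToyFilesystem()
old = fs.create(42)   # first file to use inode 42
fs.unlink(42)
new = fs.create(42)   # a different file that ends up with the same inode number
print(fs.is_stale(old), fs.is_stale(new))  # True False
```

Without the generation check, the old handle would silently resolve to the new file, which is the failure described above.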

While i_generation was all that the NFS server needed to carry out its task of creating the dreaded "stale file handle" errors when a file is replaced, there was still a role for i_version. NFS will perform far better if it can cache data locally, but doing so safely requires knowledge of when a file's contents change; i_version can be used for that purpose. Those who are interested in the details can read this article by Neil Brown on how cache consistency is maintained in current versions of NFS.

The trouble with `i_version` now

In the nearly 30 years since i_version was introduced, there has been little in the way of formal description of what the field is supposed to mean. In 2018, Jeff Layton added some comments describing how i_version was meant to be used, which clarified some details. As it turns out, though, some details remain to be nailed down, and they are creating trouble now.

Layton's text says: "The i_version must appear different to observers if there was a change to the inode's data or metadata since it was last queried". That has been the deal between the virtual filesystem (VFS) layer and the filesystems for years, but now there is a desire to alter it. In its current form, it seems that i_version is creating some performance difficulties.

As described above, NFS uses i_version to detect when a file has changed. If an NFS client has portions of a file cached, an i_version change will cause it to discard those caches, leading to more traffic with the server. The kernel's integrity measurement architecture (IMA), which ensures that files have not been tampered with by comparing them against trusted checksums, also uses i_version; if a file has changed, it must be re-checksummed before access can be allowed. In either case, spurious i_version increments will cause needless extra work to be done, hurting performance.
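The cost of a spurious increment can be sketched with a toy client cache. This is a simplified model under stated assumptions, not the NFS protocol: `ToyServerFile`, `ToyClient`, and `spurious_bump()` are hypothetical names, and reading `change_attr` stands in for a cheap attribute query to the server.

```python
class ToyServerFile:
    """Toy file with a change counter standing in for i_version."""

    def __init__(self, data):
        self.data = data
        self.change_attr = 1

    def write(self, data):
        self.data = data
        self.change_attr += 1

    def spurious_bump(self):
        # The counter changes although the data does not.
        self.change_attr += 1


class ToyClient:
    """Caches file data and revalidates it by comparing change counters."""

    def __init__(self, server_file):
        self.server_file = server_file
        self.cached_data = None
        self.cached_attr = None
        self.fetches = 0  # number of (expensive) full data transfers

    def read(self):
        current = self.server_file.change_attr
        if self.cached_attr != current:
            # Counter differs: throw away the cache and refetch everything.
            self.cached_data = self.server_file.data
            self.cached_attr = current
            self.fetches += 1
        return self.cached_data


f = ToyServerFile(b"hello")
client = ToyClient(f)
client.read()
client.read()        # served from cache, no refetch
f.spurious_bump()    # data is unchanged, but the counter moved
client.read()        # forced to refetch identical data
print(client.fetches)  # 2
```

The second fetch transfers data the client already had; that wasted work is what a needless counter increment costs.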

These unwanted increments are indeed happening, as it turns out, and the cause is an old villain: access-time (atime) tracking. By default, Unix filesystems record the time of each read of a file in that file's atime field. This record-keeping turns an otherwise read-only operation into a filesystem write and can be bad for performance on its own; for this reason, there are a number of options for disabling atime updates. If atime updates are enabled, though, each one changes the metadata in the file's inode and therefore increments i_version, with the bad results described above.

Rethinking `i_version`

Layton has decided to do something about that problem, resulting in a number of related patch sets. This patch, for example, makes i_version visible in the statx() system call, exposing it to user space for the first time (the old ext2 ioctl() command still exists, but it returns i_generation rather than i_version). The stated purpose is to make it easier to test its behavior and to facilitate the writing of user-space NFS servers. Another patch causes the XFS filesystem to not update i_version for atime updates; there is a similar patch for ext4. Finally, there is an update to the i_version comments making it explicit that atime updates should not increment that field.

Resistance to this work has come primarily from XFS developer Dave Chinner, who called the changed i_version rules "misguided". He had a number of complaints, starting with the fact that XFS sees i_version rather differently and updates it frequently:

In case you didn't realise, XFS can bump iversion 500+ times for a single 1MB write() on a 4kB block size filesystem, and only one of them is initial write() system call that copies the data into the page cache. The other 500+ are all the extent allocation and manipulation transactions that we might run when persisting the data to disk tens of seconds later.

This behavior, he said, is tied to how i_version is stored on-disk, meaning that changes to its semantics need to be treated like a disk-format change. He argued that what is being requested is essentially the lazytime mount option, which is implemented at the VFS level. If NFS needs lazytime-like semantics for i_version, he said, that should also be implemented at the VFS level so that all filesystems will behave in the same way.

Layton responded that lazytime semantics don't really help, since they simply defer the atime updates and will still result in unwanted i_version bumps. He also said that, since the only consumers for i_version are in the kernel, its semantics can be changed without creating further problems. Chinner disagreed with that claim, saying that his forensic-analysis tools make heavy use of that field in the on-disk images. It might not be possible to change the behavior of i_version in XFS without an on-disk format change.

Despite all of this, Chinner has let it be known that he is not really opposed to the change, except for one thing: he wants a tight specification of just how i_version is meant to behave, especially if it will be exposed to user space. Trond Myklebust suggested that i_version should only change in response to explicit operations — those in which user space has requested a change to the file. Changes to atime are, instead, implicit since user space has not asked for them, so they should not result in i_version updates. Layton said that it could simply be defined as any operation that updates an inode's mtime or ctime fields. Neil Brown had a more complex proposal that would use the ctime field directly while providing the higher resolution needed for NFS.
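Layton's suggested definition (bump i_version only for operations that update mtime or ctime) can be written down as a small predicate. This is an illustrative sketch rather than anything from the patches: the `Op` enum, the `TIMESTAMPS_TOUCHED` table, and the function names are hypothetical, and the table covers only three representative operations.

```python
from enum import Enum, auto


class Op(Enum):
    READ = auto()    # implicit side effect: atime only
    WRITE = auto()   # explicit change: mtime and ctime
    CHMOD = auto()   # explicit change: ctime


# Which timestamps each operation updates (simplified for illustration).
TIMESTAMPS_TOUCHED = {
    Op.READ: {"atime"},
    Op.WRITE: {"mtime", "ctime"},
    Op.CHMOD: {"ctime"},
}


def bumps_i_version(op):
    """Proposed rule: bump only if mtime or ctime would be updated."""
    return bool(TIMESTAMPS_TOUCHED[op] & {"mtime", "ctime"})


def count_bumps(ops):
    return sum(bumps_i_version(op) for op in ops)


workload = [Op.READ, Op.READ, Op.WRITE, Op.READ, Op.CHMOD]
print(count_bumps(workload))  # 2
```

Under the older "any metadata change" reading, all five operations in this workload would have bumped the counter when atime updates are enabled; under the proposed rule only the write and the chmod do.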

In the end, though, Layton argued that "the time to write a specification for i_version was when it was created" and that he's doing his best to fix the problems long after that time. But, he said, it is "probably best to define this as loosely as possible so that we can make it easier for a broad range of filesystems to implement it". An occasional spurious bump is not a huge problem, but the regular increments caused by atime updates are. Fixing that problem should be good enough.

For all the noise of the discussion, the disagreements are likely smaller than they seem. It is a good opportunity to get a better understanding of what this 30-year-old field really means, and to adjust its behavior to the benefit of Linux users. The next step would appear to be the posting of another version of the patches by Layton, at which point we will get a sense for whether there is enough of a consensus around the proposed changes to get them merged.


Toward a better definition for i_version

Posted Aug 27, 2022 2:45 UTC (Sat) by bartoc (guest, #124262) [Link] (5 responses)

Isn't updating the atime absolutely a change in the metadata, meaning NFS really _should_ discard at least the metadata cache in that case? Yeah, it's horrible for performance, but if you have atime turned on you presumably care about it! After all, it's horrible for performance in many other ways that have nothing to do with i_version.

Toward a better definition for i_version

Posted Aug 27, 2022 13:03 UTC (Sat) by jhoblitt (subscriber, #77733) [Link]

^This. Are high-performance NFS exports with atime enabled an important use case?

Toward a better definition for i_version

Posted Aug 27, 2022 22:47 UTC (Sat) by jlayton (subscriber, #31672) [Link] (2 responses)

It's technically a metadata change, but it's not one that's "interesting" for NFSv4 or IMA (or really, anyone). We don't want to invalidate any caches or anything due to an atime change, since that is _all_ that has changed and there were no other observable effects.

Furthermore, the only way to _see_ the atime is to do a stat or a statx, and in that case you'll almost certainly need to issue a GETATTR on NFS anyway, in which case you'll get the new atime that you're interested in.

The best I've come up with so far is to define this such that when POSIX (or whatever) mandates that the ctime be changed, we would also want to change the i_version.

Toward a better definition for i_version

Posted Aug 28, 2022 18:15 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (1 responses)

Given that i_version is (should be) bumped iff mtime or ctime would be bumped, then why does i_version need to exist? Can't userspace just query mtime and ctime directly?

mtime and ctime

Posted Aug 28, 2022 22:44 UTC (Sun) by corbet (editor, #1) [Link]

The problem with mtime and ctime is that they don't have anything near the required resolution to catch every change. Neil's article, linked in this article, describes those issues in detail.

Toward a better definition for i_version

Posted Sep 12, 2022 14:45 UTC (Mon) by roblucid (guest, #48964) [Link]

It was good practice to mount NFS ro for shared executable filesystems to avoid in-service metadata changes; it worked well because they were not updated frequently. Originally in UNIX, it was the update(8) process, which synced the disks every 30s, that saved inodes in batches, stored together in dense blocks. Smaller, slower disks perhaps deterred careless use of commands like find over /; if you were compiling, it wasn't that fast either.

From a sysadmin point of view, too much is made of atime updates; the lazy 24-hour cooldown works well enough in practice, and you can turn atimes off. Secondly, filesystem implementors could choose a way to save atime updates to disk more efficiently, separate from priority disk updates (a lost atime won't cause filesystem data corruption worse than what Linux already allows by ignoring file access). They don't care, as developers don't need atimes; it's sysadmins who use them, to track the scope of activity if, say, some disaffected user maliciously deleted or corrupted data.

If you're relying on NFS client file systems file times for some distributed network program, your system is effectively broken.

Toward a better definition for i_version

Posted Aug 28, 2022 11:34 UTC (Sun) by meyert (subscriber, #32097) [Link] (2 responses)

Isn't the obvious solution to introduce a new attribute, "i_version_content", that is incremented only when the actual file content changes, rather than trying to change the semantics of an existing attribute whose semantics are unclear?

Toward a better definition for i_version

Posted Aug 28, 2022 16:32 UTC (Sun) by jlayton (subscriber, #31672) [Link] (1 responses)

NFSv4 also needs it to be incremented on certain types of metadata changes. Size changes (which would imply a content change), but also permissions, ownership, link count, ACLs etc. are all supposed to result in a change to the change attribute.

In point of fact, AFS has a data version counter that is only bumped on content changes, but its semantics don't really match what NFS needs.

Toward a better definition for i_version

Posted Sep 2, 2022 13:45 UTC (Fri) by trondmy (subscriber, #28934) [Link]

To be strictly fair, in an ideal world, we shouldn't really need the i_version to tell us that the permissions, owner, link count, etc. have changed. All those values can be directly retrieved at low cost to the client+server.

The objects that really do need to be tracked by i_version, or something equivalent, are the ones that can otherwise only be revalidated at a very high cost to the client and/or server. So typically that list would mean objects with unbounded size, such as file data, acls, and to a certain extent xattrs. Also data or metadata that might be unavailable when offlined in a hierarchical storage. If you wanted to revalidate your cache by directly comparing the cache to the object on the server, then the time it would take to do so is effectively unbounded.

Now that said, the NFS spec was trying to make i_version easy to implement, so it went with the features that typically are tracked by ctime, so that the latter could be used in cases where the time resolution is good enough to track all changes.
