Toward a better definition for i_version
Early versions of i_version
Version 0.99.7 of the kernel was released on March 13, 1993. Those were exciting times; among other things, this release included a version of the mmap() system call that was, according to a young Linus Torvalds, "finally starting to really happen". This release also brought a new filesystem by Rémy Card called "ext2fs", the distant ancestor of the ext4 filesystem currently used by many Linux systems.
As part of the ext2fs addition, the kernel's inode structure was augmented with a field called i_version, which was noted in a comment as being for the NFS filesystem. Nothing actually used that field until the 0.99.14 release in November of that year, when an ioctl() call was added to provide access to i_version. Those of us who were valiantly trying to use NFS on Linux in those days will remember that the server ran in user space then, so this ioctl() call was needed for i_version to be useful for NFS.
Initially, i_version was incremented whenever a given inode number was reused for a new file. This is an event that the NFS server needs to know about; otherwise a file handle created for one file could be used to access a completely different file that happened to end up with the same inode number, with aesthetically displeasing results. Version 2.2.3pre1 in 1999 added a new i_generation field to be used for this purpose instead, though it was not actually used until the 2.3.1pre1 development kernel in May of that year. When i_generation took over this role, i_version became a sort of counter for versions of the same file, incremented on changes in a filesystem-specific way (for filesystems that managed i_version at all).
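The inode-reuse problem can be illustrated with a small user-space model. This is only a sketch: the names ToyFilesystem, FileHandle, and StaleFileHandle are hypothetical, and the simple per-inode counter is an assumption made for illustration, not a description of how any kernel filesystem assigns generation numbers.

```python
from dataclasses import dataclass


class StaleFileHandle(Exception):
    """Raised when a handle refers to an inode number that has been reused."""


@dataclass(frozen=True)
class FileHandle:
    # What the (toy) server hands to clients: inode number plus generation.
    ino: int
    generation: int


@dataclass
class Inode:
    ino: int
    generation: int
    data: bytes


class ToyFilesystem:
    """Hypothetical model: each reuse of an inode number gets a new generation."""

    def __init__(self):
        self.inodes = {}            # live inodes, keyed by inode number
        self.last_generation = {}   # last generation issued per inode number

    def create(self, ino, data):
        generation = self.last_generation.get(ino, 0) + 1
        self.last_generation[ino] = generation
        self.inodes[ino] = Inode(ino, generation, data)
        return FileHandle(ino, generation)

    def unlink(self, ino):
        del self.inodes[ino]

    def read(self, handle):
        inode = self.inodes.get(handle.ino)
        # Without the generation check, an old handle would silently
        # read whatever file now occupies the same inode number.
        if inode is None or inode.generation != handle.generation:
            raise StaleFileHandle(handle)
        return inode.data
```

In this model, a handle created before an unlink() and re-create of the same inode number fails with StaleFileHandle instead of returning the new file's contents.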
While i_generation was all that the NFS server needed to carry out its task of creating the dreaded "stale file handle" errors when a file is replaced, there was still a role for i_version. NFS will perform far better if it can cache data locally, but doing so safely requires knowledge of when a file's contents change; i_version can be used for that purpose. Those who are interested in the details can read this article by Neil Brown on how cache consistency is maintained in current versions of NFS.
The trouble with i_version now
In the nearly 30 years since i_version was introduced, there has been little in the way of formal description of what the field is supposed to mean. In 2018, Jeff Layton added some comments describing how i_version was meant to be used, which clarified some details. As it turns out, though, some details remain to be nailed down, and they are creating trouble now.
Layton's text says: "The i_version must appear different to observers if there was a change to the inode's data or metadata since it was last queried". That has been the deal between the virtual filesystem (VFS) layer and the filesystems for years, but now there is a desire to alter it. In its current form, it seems that i_version is creating some performance difficulties.
As described above, NFS uses i_version to detect when a file has changed. If an NFS client has portions of a file cached, an i_version change will cause it to discard those caches, leading to more traffic with the server. The kernel's integrity measurement architecture (IMA), which ensures that files have not been tampered with by comparing them against trusted checksums, also uses i_version; if a file has changed, it must be re-checksummed before access can be allowed. In either case, spurious i_version increments will cause needless extra work to be done, hurting performance.
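A toy model may make the cost of a spurious increment more concrete. The sketch below assumes a client that revalidates its cache by comparing a change counter on every read; ToyServer, ToyClient, and get_change_attribute are hypothetical names, and real NFS revalidation is considerably more involved.

```python
class ToyServer:
    """Hypothetical server holding one file and a change counter."""

    def __init__(self, data):
        self.data = data
        self.i_version = 1
        self.reads_served = 0   # counts full data transfers to clients

    def write(self, data):
        self.data = data
        self.i_version += 1

    def get_change_attribute(self):
        # Cheap metadata query; no file data is transferred.
        return self.i_version

    def read(self):
        self.reads_served += 1
        return self.data


class ToyClient:
    """Hypothetical client that caches data keyed on the change counter."""

    def __init__(self, server):
        self.server = server
        self.cached_data = None
        self.cached_version = None

    def read(self):
        version = self.server.get_change_attribute()
        if version != self.cached_version:
            # The counter moved: the cache can no longer be trusted,
            # so the data must be fetched again.
            self.cached_data = self.server.read()
            self.cached_version = version
        return self.cached_data
```

Repeated reads of an unchanged file cost one data transfer in this model. Any increment of the counter, whether or not the data actually changed, forces another transfer on the next read.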
These unwanted increments are indeed happening, as it turns out, and the cause is an old villain: access-time (atime) tracking. By default, Unix filesystems will note every time that a file is read in that file's atime field. This record-keeping turns an otherwise read-only operation into a filesystem write and can be bad for performance on its own; for this reason, there are a number of options for disabling atime updates. If they are enabled, though, every atime update will, since it changes the metadata in a file's inode, increment i_version, with the bad results described above.
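The interaction between atime tracking and i_version can be sketched as follows. The model assumes the rule quoted from Layton's comments (any data or metadata change bumps the counter) and uses a fake clock; ToyInode and count_read_bumps are hypothetical names, not kernel interfaces.

```python
import itertools


class ToyInode:
    """Hypothetical inode where every metadata change bumps i_version."""

    def __init__(self, track_atime=True):
        self.track_atime = track_atime
        self.atime = self.mtime = self.ctime = 0
        self.i_version = 0
        self._clock = itertools.count(1)   # fake, strictly increasing time

    def _metadata_changed(self):
        self.i_version += 1

    def read(self):
        # A read does not modify the data, but with atime tracking
        # enabled it still modifies the inode.
        if self.track_atime:
            self.atime = next(self._clock)
            self._metadata_changed()

    def write(self):
        self.mtime = self.ctime = next(self._clock)
        self._metadata_changed()


def count_read_bumps(track_atime, reads):
    """Return how far i_version moves across a series of pure reads."""
    inode = ToyInode(track_atime)
    inode.write()
    before = inode.i_version
    for _ in range(reads):
        inode.read()
    return inode.i_version - before
```

With atime tracking enabled, count_read_bumps() returns one increment per read in this model; with it disabled, the counter does not move at all, even though the file's contents are identical in both cases.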
Rethinking i_version
Layton has decided to do something about that problem, resulting in a number of related patch sets. This patch, for example, makes i_version visible in the statx() system call, exposing it to user space for the first time (the old ext2 ioctl() command still exists, but it returns i_generation rather than i_version). The stated purpose is to make it easier to test its behavior and to facilitate the writing of user-space NFS servers. Another patch causes the XFS filesystem to not update i_version for atime updates; there is a similar patch for ext4. Finally, there is an update to the i_version comments making it explicit that atime updates should not increment that field.
Resistance to this work has come primarily from XFS developer Dave Chinner, who called the changed i_version rules "misguided". He had a number of complaints, starting with the fact that XFS sees i_version rather differently and updates it frequently:
In case you didn't realise, XFS can bump iversion 500+ times for a single 1MB write() on a 4kB block size filesystem, and only one of them is initial write() system call that copies the data into the page cache. The other 500+ are all the extent allocation and manipulation transactions that we might run when persisting the data to disk tens of seconds later.
This behavior, he said, is tied to how i_version is stored on-disk, meaning that changes to its semantics need to be treated like a disk-format change. He argued that what is being requested is essentially the lazytime mount option, which is implemented at the VFS level. If NFS needs lazytime-like semantics for i_version, he said, that should also be implemented at the VFS level so that all filesystems will behave in the same way.
Layton responded that lazytime semantics don't really help, since they simply defer the atime updates and will still result in unwanted i_version bumps. He also said that, since the only consumers for i_version are in the kernel, its semantics can be changed without creating further problems. Chinner disagreed with that claim, saying that his forensic-analysis tools make heavy use of that field in the on-disk images. It might not be possible to change the behavior of i_version in XFS without an on-disk format change.
Despite all of this, Chinner has let it be known that he is not really opposed to the change, except for one thing: he wants a tight specification of just how i_version is meant to behave, especially if it will be exposed to user space. Trond Myklebust suggested that i_version should only change in response to explicit operations — those in which user space has requested a change to the file. Changes to atime are, instead, implicit since user space has not asked for them, so they should not result in i_version updates. Layton said that it could simply be defined as any operation that updates an inode's mtime or ctime fields. Neil Brown had a more complex proposal that would use the ctime field directly while providing the higher resolution needed for NFS.
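The candidate definitions can be compared side by side as simple predicates. This is a loose sketch of the rules as summarized above, not of any posted patch; the Operation type, its fields, and the three function names are hypothetical, and Brown's ctime-based proposal is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    change_requested: bool       # did user space ask for the file to change?
    changes_atime: bool = False
    changes_mtime: bool = False
    changes_ctime: bool = False


def bump_any_change(op):
    """Existing comment text: any data or metadata change is visible."""
    return op.changes_atime or op.changes_mtime or op.changes_ctime


def bump_explicit_only(op):
    """Myklebust's suggestion: only explicitly requested changes count."""
    return op.change_requested


def bump_mtime_or_ctime(op):
    """Layton's suggestion: anything that updates mtime or ctime."""
    return op.changes_mtime or op.changes_ctime


OPERATIONS = [
    Operation("read", change_requested=False, changes_atime=True),
    Operation("write", change_requested=True,
              changes_mtime=True, changes_ctime=True),
    Operation("chmod", change_requested=True, changes_ctime=True),
]

if __name__ == "__main__":
    for op in OPERATIONS:
        print(op.name, bump_any_change(op),
              bump_explicit_only(op), bump_mtime_or_ctime(op))
```

For these three sample operations, the two proposed rules agree with each other and differ from the existing rule only on the read that updates atime, which is the case the patches are trying to address.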
In the end, though, Layton argued that "the time to write a specification for i_version was when it was created" and that he's doing his best to fix the problems long after that time. But, he said, it is "probably best to define this as loosely as possible so that we can make it easier for a broad range of filesystems to implement it". An occasional spurious bump is not a huge problem, but the regular increments caused by atime updates are. Fixing that problem should be good enough.
For all the noise of the discussion, the disagreements are likely smaller than they seem. It is a good opportunity to get a better understanding of what this 30-year-old field really means, and to adjust its behavior to the benefit of Linux users. The next step would appear to be the posting of another version of the patches by Layton, at which point we will get a sense for whether there is enough of a consensus around the proposed changes to get them merged.
Index entries for this article | |
---|---|
Kernel | Filesystems |
Posted Aug 27, 2022 2:45 UTC (Sat)
by bartoc (guest, #124262)
[Link] (5 responses)
Posted Aug 27, 2022 13:03 UTC (Sat)
by jhoblitt (subscriber, #77733)
[Link]
Posted Aug 27, 2022 22:47 UTC (Sat)
by jlayton (subscriber, #31672)
[Link] (2 responses)
Furthermore, the only way to _see_ the atime is to do a stat or a statx, and in that case you'll almost certainly need to issue a GETATTR on NFS anyway, in which case you'll get the new atime that you're interested in.
The best I've come up with so far is to define this such that when POSIX (or whatever) mandates that the ctime be changed, we would also want to change the i_version.
Posted Aug 28, 2022 18:15 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Aug 28, 2022 22:44 UTC (Sun)
by corbet (editor, #1)
[Link]
Posted Sep 12, 2022 14:45 UTC (Mon)
by roblucid (guest, #48964)
[Link]
From a sysadmin point of view, too much is made of atime updates, the lazy 24 hour cooldown works well enough in practice and you can turn atimes off. Secondly filesystem implementors could choose a way to save atime updates to disks more efficiently separate from priority disk updates (lost atime won't cause fs data corruption worse than what Linux allows already by ignoring file access), they don't care as developers don't need atimes, it's sysadmin's tracking scope of activity if some disaffected user say maliciously deleted/corrupted data.
If you're relying on NFS client file systems file times for some distributed network program, your system is effectively broken.
Posted Aug 28, 2022 11:34 UTC (Sun)
by meyert (subscriber, #32097)
[Link] (2 responses)
Posted Aug 28, 2022 16:32 UTC (Sun)
by jlayton (subscriber, #31672)
[Link] (1 responses)
In point of fact, AFS has a data version counter that is only bumped on content changes, but its semantics don't really match what NFS needs.
Posted Sep 2, 2022 13:45 UTC (Fri)
by trondmy (subscriber, #28934)
[Link]
The objects that really do need to be tracked by i_version, or something equivalent, are the ones that can otherwise only be revalidated at a very high cost to the client and/or server. So typically that list would mean objects with unbounded size, such as file data, acls, and to a certain extent xattrs. Also data or metadata that might be unavailable when offlined in a hierarchical storage. If you wanted to revalidate your cache by directly comparing the cache to the object on the server, then the time it would take to do so is effectively unbounded.
Now that said, the NFS spec was trying to make i_version easy to implement, so it went with the features that typically are tracked by ctime, so that the latter could be used in cases where the time resolution is good enough to track all changes.
The problem with mtime and ctime is that they don't have anything near the required resolution to catch every change. Neil's article, linked in this article, describes those issues in detail.
mtime and ctime