The hard life of a virtual-filesystem developer
The longstanding "tracefs" virtual filesystem provides access to the ftrace tracing system; among other things, it implements a directory with a set of control files for every tracepoint known to the system. A quick look on a 6.6 kernel shows over 2,800 directories containing over 16,000 files. Until the 6.6 release, the kernel had to maintain directory entry ("dentry") and inode structures for each of those directories and files; all of those structures consumed quite a bit of memory.
For added fun, multiple instances of the tracepoint hierarchy can be mounted, with each one causing the kernel to duplicate all of that memory overhead. Even with a single tracepoint hierarchy, the chances are that almost none of the files contained within it will be accessed over the life of the system, so the memory is simply wasted.
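The directory and file counts quoted above can be approximated with a short script. This is only a sketch: the path `/sys/kernel/tracing/events` is an assumed location (tracefs is commonly mounted under `/sys/kernel/tracing`, and reading it normally requires root), and the article does not say how its own numbers were obtained.

```python
import os


def count_tree(root):
    """Return (directories, files) found beneath root, not counting root itself."""
    n_dirs = 0
    n_files = 0
    # os.walk() silently skips directories it cannot read, so a missing or
    # unreadable root simply yields (0, 0).
    for _path, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        n_files += len(filenames)
    return n_dirs, n_files


if __name__ == "__main__":
    # Assumed mount point; adjust for the system at hand.
    root = "/sys/kernel/tracing/events"
    dirs, files = count_tree(root)
    print(f"{root}: {dirs} directories, {files} files")
```

The results will vary with the kernel version and configuration, since the set of tracepoints differs from one build to the next.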
Eventfs was merged in 6.6 as a way of eliminating this waste. It is a reimplementation of the portion of tracefs that represents the actual tracepoints, but optimized so that dentries and inodes are only allocated when a file is actually accessed. Vast amounts of memory were returned to the system for better use, and there was widespread rejoicing.
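The idea of allocating dentries and inodes only on access can be illustrated with a toy model. This is emphatically not how eventfs is implemented; `LazyDir` and `Node` are hypothetical names invented here, and `Node` merely stands in for the dentry and inode pair that the kernel would allocate.

```python
class Node:
    """Hypothetical stand-in for an allocated dentry/inode pair."""

    def __init__(self, name):
        self.name = name


class LazyDir:
    """Toy model: keep only names up front, build heavy objects on demand."""

    def __init__(self, names):
        # Only the cheap, static description of each file is stored.
        self._names = frozenset(names)
        # Heavy objects are created on first lookup and cached here.
        self._live = {}

    def lookup(self, name):
        if name not in self._names:
            raise FileNotFoundError(name)
        if name not in self._live:
            self._live[name] = Node(name)
        return self._live[name]

    def evict(self, name):
        # Models the cache dropping an unused entry and freeing its memory.
        self._live.pop(name, None)

    def allocated(self):
        return len(self._live)


if __name__ == "__main__":
    events = LazyDir(f"event{i}/enable" for i in range(16000))
    print(events.allocated())   # 0: nothing has been accessed yet
    events.lookup("event42/enable")
    print(events.allocated())   # 1: only the accessed file costs memory
```

The hard part in a real filesystem, as the discussion below suggests, is not the lazy creation itself but keeping the lifetimes of such on-demand objects consistent with the VFS layer's expectations.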
That rejoicing would have been more enthusiastic, though, had not a series of bugs, some with security implications, turned up in eventfs. This filesystem has required a long series of fixes, a process that is ongoing as of this writing. As all this has unfolded, there has been an extensive series of long threads involving tracing maintainer Steve Rostedt, Linus Torvalds, and others. Among other things, there have been discussions on the size reported for virtual files, whether those files should have unique inode numbers, and many conversations on the details of interfacing with the kernel's virtual filesystem (VFS) layer. Eventually, Torvalds created a patch series addressing a number of problems in eventfs.
For those wanting rather more sensational coverage than is LWN's habit, now might be a good time to search out the articles published elsewhere. The focus here will be on two points that came out in the discussions.
One of those is that documentation for would-be filesystem developers is lacking. Some, including VFS maintainer Christian Brauner, would disagree with that claim. There is, indeed, a fair amount of VFS documentation, including detailed descriptions of the VFS locking rules, which are some of the most complex in the kernel. But the number of things that Rostedt, who is not an inexperienced kernel developer, stumbled over during the course of this work makes it clear that many things remain undocumented. That is perhaps especially true for a developer wanting to implement a virtual filesystem, which tends to be a one-time project entered into by a developer whose focus is on another part of the kernel.
Consider, for example, the subsystem known as "kernfs". It is a framework designed to ease the implementation of virtual filesystems; it is currently used to implement control groups and the resctrl filesystem. It seems like exactly what a developer of a virtual filesystem would need, except for one little problem: it is meticulously undocumented. No attempt has been made to describe its use; as a result, when Rostedt considered it, he concluded: "kernfs doesn't look trivial and I can't find any documentation on how to use it" and passed it by.
Perhaps, had kernfs been more accessible when eventfs was developed, it would have been found suitable to the task and would have helped to prevent the long series of mistakes that plagued eventfs. Perhaps, if the missing documentation were to be provided, the next virtual filesystem project could have an easier time of it.
There is another problem, though, that was nicely spelled out by Torvalds: the VFS layer is oriented toward the needs of "real" filesystems, those that are charged with the task of persistently storing data in a hierarchical directory structure. As a result, it has a lot of performance-driven quirks that are not only unhelpful for virtual filesystems, they also complicate the task of implementing those filesystems. To take it even further, though, the whole filesystem concept is a bit of an awkward fit for virtual filesystems:
And realize that [they] aren't really designed to be filesystems per se - they are literally designed to be something entirely different, and the filesystem interface is then only a secondary thing - it's a window into a strange non-filesystem world where normal filesystem operations don't even exist, even if sometimes there can be some kind of convoluted transformation for them.
That results in pathologies like even simple filesystem operations (stat() on a /proc file, for example) not working properly in virtual filesystems. In a normal filesystem, the lifetime of the files themselves is directly tied to filesystem operations. The objects represented in a virtual filesystem, instead, have unrelated lifetimes of their own. The combination of two separate worlds, Torvalds said, is "why virtual filesystems are generally a complete mess".
So how does one improve on this situation? One approach would be to abandon the idea of a virtual filesystem entirely, saying that the filesystem abstraction is simply not suitable for this kind of kernel ABI. Arguably, that is what the networking subsystem (along with some others) has done by adopting netlink for complex interfaces. Netlink works well for many things, but it is not a universally popular interface. An older variant of this approach, of course, is to simply provide a set of ioctl() calls. Use of ioctl() is somewhat discouraged, though; it tends to produce widely varying interfaces that see little review before being merged into the kernel. Yet another approach is the addition of new system calls, as was done by the VFS layer itself with the listmount() and statmount() system calls that were merged for the 6.8 release.
In the end, though, there is value to a filesystem-oriented interface. It is familiar to users, scriptable, and relatively well defined. If everything is a file, then utilities written to work with files can be brought to bear. That is why virtual filesystems have proliferated over the years; it suggests that there would be value in making it easier for developers to correctly implement virtual filesystems. That, in turn, indicates that putting some effort into APIs like kernfs and, crucially, documenting them could do a lot to make life less difficult for the next developer who takes on a virtual filesystem project.
