Leading items
Welcome to the LWN.net Weekly Edition for June 30, 2022
This edition contains the following feature content:
- System-call interception for unprivileged containers: a Linux Security Summit session on using seccomp() to make truly unprivileged containers possible.
- Whatever happened to SHA-256 support in Git?: SHA-1 is known to be weak, so why is Git so slow to move on to a new hash algorithm?
- Two memory-tiering patch sets: proposals to tackle parts of the tricky tiered-memory problem.
- A "fireside" chat: The Dirk and Linus show comes to Austin.
- NFS: the new millennium: a history of the NFS protocol and implementations since version 4.0.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
System-call interception for unprivileged containers
On the first day of the 2022 Linux Security Summit North America (LSSNA) in Austin, Texas, Stéphane Graber and Christian Brauner gave a presentation on using system-call interception for container security purposes. The idea is to allow unprivileged containers, those without elevated privileges on the host, to still accomplish their tasks, some of which require privileges. A fair amount of work has been done to make this viable, but there is still more to do.
Graber started things off by saying that he works for Canonical on the LXD container manager project, while Brauner works for Microsoft in various areas of Linux security. Graber said that there are two types of containers these days, privileged and unprivileged, "one is bad, one is OK". He noted that privileged containers are "unfortunately what everyone uses" for Docker containers, Kubernetes, and so on.
Unprivileged containers
LXD defaults to using unprivileged containers; user namespaces are "the primary barrier for security" in those containers. Privileged containers, in contrast, have required a constant whack-a-mole game of using Linux Security Modules (LSMs), seccomp() filters, and other mechanisms to try to close holes that allow processes inside the containers to gain privileges on the host. He and others want to move to a world where everyone uses unprivileged containers; "privileged containers should not be a thing", he said.
But there are a number of things that do not work in unprivileged containers. They are effectively running as some random regular user on the host system; "we don't allow random users on our systems to do a lot of things". Using other types of namespaces and adding new ones has allowed unprivileged containers to work around some of these restrictions, but there is a limit to how far that can be pushed. There is not a lot of appetite for adding lots more namespace types to the kernel.
So the LXD project started looking at what could be done with seccomp() filters and, in particular, with system call interception in user space. It can provide a way to allow the container to do things that require privileges, but do so in a controlled way that is mediated by the container manager.
Brauner said that seccomp() conveniently sits on the system-call-entry path well before the system-call-specific code is invoked. There are some system calls where the container should be able to succeed in making the call even though it lacks the required privileges. For example, mknod() should be allowed for certain kinds of device nodes, such as /dev/zero, /dev/null, /dev/console, and so on. These are "pretty boring device nodes", but the kernel's permission model either allows creating any arbitrary device node or no device nodes.
Unprivileged processes (or containers) should not be able to create /dev/kmem or some random block device, for example, as that could lead to a compromise of the host. But there are a few simple device nodes that containers require, which are currently bind-mounted from the host. There is no good reason not to just allow them to be created in the containers directly.
One could imagine some kind of allowlist in the kernel that specified which device nodes do not require privileges to create, Brauner said. That is "kind of hacky", so other solutions were tried. Along the way, he discovered that there already is a limited version of an allowlist; the "whiteouts" used by the Overlay filesystem to mark files that have been deleted in an upper overlay are actually character device nodes with a special device number (0/0). Those can be created without extra privileges. That weakens the argument against an allowlist for mknod() in the kernel, he said, but that route was not pursued.
Something else that was tried was allowing unprivileged processes to create device nodes, but not to be able to open them. That broke pretty much all of the container runtimes, Brauner said. It is a deeply held assumption that if a process can create a device node, it can open it. So it turns out that allowing the creation of device nodes that cannot be opened "is not a great idea".
But all of that was focused on a single system call; there is a need to support other "safe" uses of system calls. So the idea of system-call interception was born at the 2017 Linux Plumbers Conference (LPC), Brauner thinks. A mechanism that can inspect the arguments to the system call could, for example, deny mknod() calls for block devices and for character device numbers that are not on the approved list. Rather than some static policy in the kernel about what to allow or deny, the decision could be delegated to a user-space process.
So seccomp() was extended to support exactly that, he said. A new type of filter was added to get a user-space notification when the system call is made; the container manager can then obtain a file descriptor that it can poll for system-call events. When the manager is notified of a system call, ioctl() commands can be used to retrieve the arguments to the call, which can be used to make a decision. That decision is returned to the kernel by writing to the file descriptor.
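As a rough illustration of that flow, consider the sketch below, which assumes Linux 5.0 or later on x86-64 and omits all error handling; a real filter would also check seccomp_data.arch, and a real container manager would pass the listener descriptor to a separate, privileged supervisor process. It installs a filter that punts mknod() to user space and then answers a single notification:

    #include <linux/filter.h>
    #include <linux/seccomp.h>
    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Filter: punt mknod() to user space, allow everything else. */
    static int install_notify_filter(void)
    {
        struct sock_filter insns[] = {
            BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                     offsetof(struct seccomp_data, nr)),
            BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mknod, 0, 1),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
            BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
            .len = sizeof(insns) / sizeof(insns[0]),
            .filter = insns,
        };

        /* Required before an unprivileged task may install a filter. */
        prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);

        /* Returns the listener fd that the supervisor will poll. */
        return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                       SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
    }

    /* Supervisor side: service one intercepted system call. */
    static void handle_one(int notify_fd)
    {
        struct seccomp_notif req = { 0 };
        struct seccomp_notif_resp resp = { 0 };

        /* Blocks until the filtered task hits the filter. */
        ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_RECV, &req);

        /* req.data.nr and req.data.args[] describe the call; the
           manager would emulate the mknod() here, then report success. */
        resp.id = req.id;           /* must echo the request cookie */
        resp.error = 0;
        resp.val = 0;
        ioctl(notify_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
    }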
A seccomp() filter can only tell the kernel to continue the call, fail the call with a specific error code that gets returned to the caller, or return success. If the container manager thinks the system call should succeed for an unprivileged container, it cannot just tell the kernel to go ahead and perform the call since the calling task does not have the proper privileges. So the container manager has to emulate the system call by performing the action as if the task did have the proper permissions. Once it does so, and makes the result available to the container, it can tell the kernel to return success to the task.
He asked attendees if they could think of a security problem that might arise from this scheme; someone was quick to mention time-of-check-to-time-of-use (TOCTTOU) concerns. Brauner said that mknod() is a "pretty boring system call because it only has integer arguments". Other system calls with pointer arguments, though, allow the container manager to be tricked by a caller that changes the argument at the given address after the manager has checked that it is "safe". seccomp() filters are written in classic BPF, rather than extended BPF (eBPF), which means that they cannot dereference pointers. So, in order to inspect an argument passed by reference, the manager would need to read the data directly from the process's memory (using the address as an offset into /proc/PID/mem). That "works", but it suffers from TOCTTOU races.
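For illustration, a supervisor reading a pathname argument out of the target's memory might use a hypothetical helper like the one below (a sketch only; real code would also use the SECCOMP_IOCTL_NOTIF_ID_VALID ioctl() after the read to guard against the process, and its PID, having been recycled):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: copy a NUL-terminated string argument out of
       the target's memory; "addr" is a pointer-valued entry taken from
       req.data.args[].  Whatever is read can be changed by the target
       immediately afterward: the TOCTTOU race described above. */
    static ssize_t read_target_string(pid_t pid, unsigned long addr,
                                      char *buf, size_t len)
    {
        char path[64];
        ssize_t n;
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        /* The target's virtual address is simply the file offset. */
        n = pread(fd, buf, len - 1, (off_t)addr);
        if (n >= 0)
            buf[n] = '\0';
        close(fd);
        return n;
    }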
Once the seccomp() notify mechanism was added, people immediately started thinking up ways to create a security framework that, for example, looked at the pathname argument for the open() system call to decide whether to allow or deny access to a particular file. It could then tell the kernel to continue the system call if the file name was not problematic. The process being filtered would presumably already have the privileges needed to open the file, but could be denied if the filtering process decided it should not be able to access the file. The process could simply rewrite the argument after the check was done, though, and the kernel will happily open the file.
That limits the usefulness of being able to continue system calls from filters. It can only be done if the ultimate security boundary, the kernel itself, will deny the action anyway, as it would for mknod() from an unprivileged container. That means that the seccomp() notification mechanism cannot be used to implement a security policy for, say, privileged containers. In order to warn people away from doing so, Brauner said that they put a comment in seccomp.h describing the problems.
Generally, seccomp() system-call interception requires a trusted, privileged process on the host to supervise the calls. For example, in the case of nested unprivileged containers, having the container manager in the outer container supervise the calls from the inner one is pretty uninteresting, he said. That is something to keep in mind as uses for this facility are designed.
Target system calls
Graber took over at that point to describe the system calls they have been working on intercepting, which is quite a different list than what they started with back in Los Angeles at LPC. That is not surprising, since even at that time they knew some on the list would be hard or impossible to handle. The current list is mknod(), as already mentioned, setxattr(), bpf(), sched_setscheduler(), mount(), and sysinfo(). Those are all implemented for LXD; other projects have been using what has been done in LXD, and may be working on intercepting other system calls.
Intercepting mknod()/mknodat() allows LXD to run tools like debootstrap in an unprivileged container. That means distribution images can be built in those containers. Another reason that those calls needed to be intercepted is to allow containers to create whiteouts for overlayfs. That allows Docker to unpack its layers into an unprivileged container, for example. Graber said that he considers the interception of mknod() with the restrictions LXD has in place to be "relatively safe". He is not aware of any problems, but it is not enabled in LXD containers by default. It is one that the project thinks most containers can enable, however.
setxattr() provides a way to mark a deleted directory in overlayfs, so it needed to be supported in LXD as well. There is an allowlist of extended attributes (xattrs) that can be set from unprivileged containers. Obviously, only some attributes can be allowed, since setting those in certain namespaces, such as the "security.*" xattrs, "would be extremely bad", Graber said.
Brauner then described the situation for the mount() call. In the mknod() case, he said, there was no need to "play any specific games with the privilege level or security level" in the supervisor/manager. It could simply access the mount namespace of the container and create the device node within it. Things are not so simple with mount().
When performing a mount() on behalf of the container, there are a number of security attributes that need to be handled, such as Linux capabilities, LSM profiles, user and group IDs, various namespaces (e.g. mount, PID, or user), and so on. The emulated call in the manager needs to assume the identity of the requesting process in the container so that no extra privileges come along for the ride when the mount is performed. "It becomes really tricky to get right", he said.
Given that, he asked, "why intercept the mount() system call?" There are cases where the host is providing a filesystem to the container that the container manager can vouch for. Under those limited circumstances, allowing the filesystem to be mounted is useful. You cannot allow arbitrary mounts inside the container, however, due to the possibility of malicious filesystem images.
The container manager can emulate the mount() call, so it can avoid the TOCTTOU races that could occur since most of the arguments are pointers. The mount() system call is also problematic because it is a "terrible multiplexer" that can perform a wide variety of actions beyond just mounting a block-based filesystem: bind-mount, mount a pseudo-filesystem, change mount or superblock attributes, and more. Intercepting the system call is useful, for now, though he has some ideas on a "delegated mounting" feature for the virtual filesystem (VFS) layer that may be a better solution in the future.
Graber said that LXD allows the mount inside the container to automatically have user and group ID remapping applied. It also has a mode where it will intercept the mount and turn it into the equivalent mount using Filesystem in User Space (FUSE). That makes it "pretty safe" because the filesystem is not actually mounted directly through the kernel but is instead being handled by a user-space process inside the container.
Brauner said that he has implemented a proof-of-concept for bpf() interception, which uses the pidfd work that he has done over the last few years. There is a problem with emulating system calls that return file descriptors, such as open() and bpf(), because the file descriptor needs to be shared with the requesting process. The pidfd API allows descriptors to be safely injected into another task. LXD restricts the BPF programs that the containers can load; one type that it allows enables the container to further restrict access to its own devices.
Graber said that the sched_setscheduler() interception is not one that LXD considers to be safe; "I find it dodgy", Brauner said. But, Graber said, Android uses the call a lot, so when running Android in an unprivileged container it can be enabled. That could lead to various kinds of problems, however, so it should be used with care—if at all.
The sysinfo() interception was added recently to further support a feature from LXCFS, which can report things like available memory based on the control-group limits of the container, rather than the system-wide numbers. That works well, but multiple tools use sysinfo() to get values to report, so they still would show the host-wide values. By intercepting the call, the uptime, amount of memory, and so on can be reported correctly inside the container.
Graber then demonstrated various interceptions in an LXD container. As one example, he showed the sysinfo() interception. He started the container with a limit of 256MB of memory and, inside the container, the free command indeed showed that. That is because LXCFS was mounted on /proc/meminfo so that it could intercept reads of that file. But, running a binary that consulted sysinfo() reported the 16GB on his laptop instead. Restarting the container with the interception cleared that little problem right up.
All of the information used by the sysinfo() interception comes from what LXCFS has already gathered, but the fact that it was not reported through the system call led to multiple bug reports, Brauner said. For example, Java looks at the available memory via sysinfo() and will pre-allocate its memory based on that. In addition, Graber said, the free command in Alpine Linux uses (or used) sysinfo(), leading to bug reports regarding the LXD control-group limits.
Future
They closed with some thoughts on future plans. Brauner said that he would like to explore adding at least some limited support for eBPF to seccomp() filters. For a long time, new system calls with pointer arguments were rejected because seccomp() cannot dereference pointers. That policy has since changed, so that multiplexers like io_uring and the new extensible-system-call scheme were not blocked. But that leads to another problem.
The GNU C library (glibc) wanted to switch to using the clone3() system call, but ran afoul of the seccomp() filters installed for many containers. Those did not allow clone3() at all because all of the arguments are behind a pointer so they cannot be inspected. The older clone() system call has a flags argument that is passed directly, thus can be used to decide whether the system call should proceed. So Brauner would like to see some mechanism for inspecting arguments that are behind pointers, and some kind of limited eBPF support would fit the bill. In the past, seccomp() maintainer Kees Cook has generally been opposed to doing so, but Cook was not present at LSSNA this year.
Beyond that, Graber said that some kind of limited support for kernel-module loading might be something on the horizon. That idea scares many people, with good reason, but it would be strictly limited interception of init_module()/finit_module(). It would not allow the container to actually load a module; instead the container would pass in what it wants to load, and if the module passes some checks, the container manager would load the host's version of that module. One of the applications for that is for firewalls in a container that need various network modules. Right now, there is a list of modules that get loaded at container startup time, but it would be nice to have on-demand module loading, he said.
One interesting thing about seccomp() filters is that the interception is done even before the system-call table is consulted, which means that new system calls can be created entirely in user space. The new system call would simply be defined for an unused system-call number, which would get intercepted by the filter to call the new code. That could be used to prototype new system calls. He has not seen anyone actually do so, yet, but it is a possibility.
[I would like to thank LWN subscribers for supporting my travel to Austin for LSSNA.]
Whatever happened to SHA-256 support in Git?
The news has been proclaimed loudly and often: the SHA-1 hash algorithm is terminally broken and should not be used in any situation where security matters. Among other things, this news gave some impetus to the longstanding effort to support a more robust hash algorithm in the Git source-code management system. As time has passed, though, that work seems to have slowed to a stop, leaving some users wondering when, if ever, Git will support a hash algorithm other than SHA-1.

Hash functions are, of course, at the core of how Git works. Every object in its data store — every version of every file, among other things — is hashed, with the resulting value serving as the key under which that object is stored. Commits, too, are represented by a hash of the current state of the tree, the commit message, and the hash(es) of the parent commit(s). The security of the hash function is a key part of the integrity of a repository as a whole. If an attacker could replace a commit with another having the same hash value, they could perhaps inject malicious code into a repository without risking detection. That prospect is worrisome to anybody who depends on the security of code stored in Git repositories — everybody, in other words.
The Git project has long since chosen SHA-256 as the replacement for SHA-1. Git was originally written with SHA-1 deeply wired into the code, but all of that code has since been refactored and can handle multiple hash types, with SHA-256 being the second supported type. It is now possible to create a Git repository using SHA-256 (just use the --object-format=sha256 flag) and most local operations will work just fine. The foundation for support of alternative hash algorithms in Git was part of the 2.29 release in 2020 and appears to be solid.
That 2.29 release, though, is the last one that features alternative-hash work in any significant way; there has been no mention of this work in the project's release notes since a fix showed up in 2.31, released in March 2021. The 2.29 work marked SHA-256 as experimental and warned that "there is no interoperability between SHA-1 and SHA-256 repositories yet". There was some work toward interoperability posted in 2020, but those patches do not appear to have ever been merged into the Git mainline.
In other words, work on supporting the use of a hash algorithm other than SHA-1 in Git appears to have ground to a halt. That recently led Stephen Smith to post a query about its status to the development list. This response from Ævar Arnfjörð Bjarmason is illuminating and, for those looking forward to full SHA-256 support, potentially discouraging:
I wouldn't recommend that anyone use it for anything serious at the moment, as far as I can tell the only users (if any) are currently (some) people work on git itself.
Bjarmason pointed out that there is still no interoperability between SHA-1 and SHA-256 repositories, and that none of the Git hosting providers appear to be supporting SHA-256. That support (or the lack thereof) matters; a repository that cannot be pushed to a Git forge will be essentially useless to many people. There is also the risk (which cannot really be made to go away) that the longer hashes used with SHA-256 may break tools developed outside of the Git project. The overall picture is one of a feature that is not yet ready for real-world use.
That said, it is worth noting that brian m. carlson, who has done the bulk of the hash-transition work so far, disagrees with Bjarmason's assessment. In his view, the only "defensible" reason to use SHA-1 at this point is interoperability with the Git forge providers. Otherwise, he said, SHA-1 is obsolete, and performance with SHA-256 can be "substantially faster". But he agrees that the needed interoperability does not exist, and nobody has said that it is coming anytime soon.
What has happened here looks, to an extent at least, like a story that has played out numerous times over the course of free-software history. A problem has been identified, and a great deal of core foundational work has been done to solve it. That solution appears to be well considered and solidly implemented. In a sense, the job is 90% done. All that is left is the hard work of making the transition to a new hash easy for users — what could be thought of as "the other 90%" of the job.
This sort of interface and compatibility development is hard and developers often do not find it particularly rewarding, so it tends to be neglected by our community. The Git project, one might argue, is especially prone to user-interface challenges, but the problem is wider than that. There are certain sorts of tasks that volunteers are often uninclined to pick up, and that companies may not feel the need to fund.
Given the threat that the SHA-1 hash poses, one might think that there would be a stronger incentive for somebody to support this work. But, as Bjarmason continued, that incentive is not actually all that strong. The project adopted the SHA-1DC variant of SHA-1 for the 2.13 release in 2017, which makes the project more robust against the known SHA-1 collision attacks, so there does not appear to be any sort of imminent threat of this type of attack against Git. Even if creating a collision were feasible for an attacker, Bjarmason pointed out, that is only the first step in the development of a successful attack. Finding a collision of any type is hard; finding one that is still working code, that has the functionality the attacker is after, and that looks reasonable to both humans and compilers is quite a bit harder — if it is possible at all.
So few people are losing sleep over the possibility that a Git repository could be deliberately corrupted by way of an SHA-1 hash collision anytime soon. The combination of a lack of urgency and little apparent interest in doing the work has seemingly brought the SHA-256 transition to a halt. Perhaps that is how the situation will remain until another SHA-1 weakness turns up and brings attention back to the situation. But, as Randall Becker pointed out, there is a cost to this inaction:
Adding my own 0.02, what some of us are facing is resistance to adopting git in our or client organizations because of the presence of SHA-1. There are organizations where SHA-1 is blanket banned across the board - regardless of its use. [...] Getting around this blanket ban is a serious amount of work and I have very recently seen customers move to older much less functional (or useful) VCS platforms just because of SHA-1.
It is a bit of a stretch to imagine that remaining with SHA-1 will threaten Git's dominance in the near future. But it could, perhaps, give a toehold to a competitor that would lead to trouble in the longer term, especially if the security of SHA-1 crumbles further.
Given that, one might think that companies that are dependent on Git would see some value in solving this particular problem. Many companies use Git, but some have based their entire business model around it. The latter companies have benefited greatly from the community's investment in Git, and they have a lot to lose if Git loses its prominence. It would seem to make sense for one or more of these companies to make the relatively small investment needed to push this transition to completion; that would be good for the community — and for their own future as well.
Two memory-tiering patch sets
Once upon a time, computers just had one type of memory, so memory within a given system was interchangeable. The arrival of non-uniform memory access (NUMA) systems complicated the situation significantly; now some memory was faster to access than the rest, and memory-management algorithms had to adapt or performance would suffer. But NUMA was just the start; today's tiered-memory systems, which may include several tiers of memory with different performance characteristics, are adding new challenges. A couple of relevant patch sets currently under review help to illustrate the types of problems that will have to be solved.

The core challenge with NUMA systems is ensuring that memory is allocated on the nodes where it will be used. A process that is running mostly from memory on its local node will perform better than one that is working with a lot of remote memory. So finding the right place for a given page is a one-time task; once that page and its users have found their way to the same NUMA node, the problem is solved and the only remaining concern is to avoid separating them again.
Tiered memory is built on the NUMA concept, but there are some differences. A bank of memory can be represented as a NUMA node that lacks a CPU, so that memory will not be seen as local to any process in the system. As a general rule, memory on these CPU-less nodes is slower than normal system DRAM — it might be a large bank of persistent memory, for example — but that is not necessarily the case, as we will see below.
Since memory on a CPU-less node is not local to any process, there must be some other criterion that regulates the allocation of memory there. The approach that is being taken is to demote pages to such a node from faster DRAM using the kernel's normal reclaim mechanisms; in a situation where a page would otherwise have been evicted or pushed to swap, it can be moved to slower memory instead. That makes use of the slower memory while keeping that page available should it turn out to still be useful. Eventually, if that page sits unused in the slower tier, it can be pushed to an even slower tier or evicted entirely.
Demoting pages to slower tiers cannot be a one-way operation, though, or performance will suffer; some of those pages will end up being accessed frequently and keeping them in slow memory will slow things down. So there needs to be a mechanism for promoting pages back to faster memory. Simply moving a page back to fast memory on the first access after demotion would be one possible approach, but that would also promote infrequently used memory and would likely create a lot of movement of pages between tiers, which would have significant costs of its own; a better solution is called for.
Hot-page selection
The hot-page selection patch set from Huang Ying is an attempt to find that better solution. The approach taken is to try to estimate the frequency of accesses to each slow-tier page, and to promote those that are accessed most often. There is no access counter for pages, though, so some sort of heuristic is required. The specific approach is to occasionally scan through slow-tier memory, setting the individual page protections to PROT_NONE (no access). When this is done, the current time is shoehorned into the associated page structure. An attempt to access the page will generate a fault, at which point the previous permissions can be restored and the faulting process can continue. But the kernel can also compare the current time to the timestamp that was stored previously; if that time is short enough, the conclusion is that the page is accessed frequently and should be promoted.
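The heuristic is easier to see in miniature. What follows is a user-space analogy only, not the kernel's code (the kernel works with hint faults on page tables rather than mprotect() and SIGSEGV): it revokes access to a page, timestamps that "scan", then classifies the page by how quickly the next access faults, using the one-second threshold discussed below:

    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    static char *page;
    static long page_size;
    static struct timespec scanned;    /* when the "scan" revoked access */

    static double since_scan(void)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - scanned.tv_sec) +
               (now.tv_nsec - scanned.tv_nsec) / 1e9;
    }

    /* The "hint fault": restore access, then classify the page by how
       soon after the scan it was touched. */
    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        const char *msg = since_scan() < 1.0 ? "hot: would promote\n"
                                             : "cold: would leave\n";
        write(STDOUT_FILENO, msg, strlen(msg));
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        struct sigaction sa = { .sa_sigaction = on_fault,
                                .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);

        page_size = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* The "scan": revoke access and timestamp the page. */
        mprotect(page, page_size, PROT_NONE);
        clock_gettime(CLOCK_MONOTONIC, &scanned);

        sleep(2);       /* longer than the one-second threshold ... */
        page[0] = 1;    /* ... so this access is classified as cold */
        return 0;
    }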
What is "short enough"? The initial
patch in the series sets the threshold to one second, a value "which
works well in our performance test
". That patch also points out some
shortcomings with this approach; the right threshold will be highly
workload-dependent, and
the promotion mechanism will respond slowly to changes in access patterns.
If a set of pages
that have sat in slow memory for some time suddenly becomes hot, the first
accesses will still appear slow, and those pages will have to go through
another scan cycle before being promoted.
In an attempt to mitigate that last problem, the access-time threshold will not be applied if there is an abundance of free fast memory. If the resources are available, in other words, pages will simply be promoted even if it seems that they are not being used often. When that can't happen, though, then the access-time test still applies, and the one-second value might not work for all workloads. Ying notes that there does not seem to be a way for users to try to configure that value themselves in a reasonable way, so no knob to do that configuration is provided.
Instead, the series adds a knob to limit the number of page promotions performed per second, expressed as a MB/s bandwidth value. Once the rate limit has been hit for a given time period, promotions will stop for a while, preventing excessive page promotions from overwhelming the system and hurting performance in their own right. The last step is to adjust the access-time threshold dynamically so that the number of pages that are eligible for promotion approximately matches the configured promotion limit. Thus, if too many pages are being chosen for promotion, the threshold will be made tighter, focusing the algorithm on the most frequently accessed pages.
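The resulting feedback loop can be expressed in a few lines; this is an illustration of the idea only, with invented names, not code from the patch set:

    /* Tighten the hot threshold when promotion traffic exceeds the
       configured limit; relax it (up to the maximum) when there is
       headroom.  Invented names; not the patch set's code. */
    static unsigned int hot_threshold_ms = 1000;   /* one second */

    static void adjust_threshold(unsigned long promoted_mbps,
                                 unsigned long limit_mbps)
    {
        if (promoted_mbps > limit_mbps && hot_threshold_ms > 10)
            hot_threshold_ms /= 2;    /* too many candidates: be pickier */
        else if (promoted_mbps < limit_mbps / 2 && hot_threshold_ms < 1000)
            hot_threshold_ms *= 2;    /* headroom: accept cooler pages */
    }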
Benchmark results provided with the patch set show some significant performance improvements with the new algorithm. In response to a previous posting of the patch set, though, Johannes Weiner suggested a simpler approach, which is evidently in use at Meta. It uses the current least-recently-used (LRU) mechanism that regulates memory eviction in general, with the end result that pages will be promoted on the second access. Ying answered that the LRU is good for identifying cold pages, but is less effective at identifying hot pages. Weiner did not respond further. At this point, the future of this patch set is not clear, but it does appear to provide a solution that's needed, in one way or another, in the mainline kernel.
Rethinking tier assignment
The assignment of CPU-less NUMA nodes to slower memory tiers has been an effective heuristic in the early days of support for tiered memory. But, as Aneesh Kumar K.V points out in this patch series, the real world is inevitably more complicated than that. A CPU-less node might be populated with slower memory, but it could also hold memory that is as fast as — or faster than — normal system DRAM. Examples would include a virtual node backed by DRAM in a virtual machine, CXL memory behind a fast interconnect, or high-bandwidth memory provided by a specialized device. Treating such memory as being slower will deprive the system of its benefits. The cover letter also points out that a CPU hot-add event might cause a CPU-less node to contain a CPU and be moved to a different tier, even though the characteristics of the memory in that node have not changed.
The proposed solution is to replace the kernel's simplistic tiering setup with something a bit more sophisticated and explicit. Current kernels do not really identify "tiers" as such; instead, they order nodes into an internal "demotion order" that is a function of the reported node distance; this order is not readily visible from user space and cannot be changed. The patch set turns tiers into a proper kernel object and enables the creation of an arbitrary number of user-visible tiers, each identified (and ordered) by an integer ID value. Higher-numbered tiers are expected to contain faster memory.
The default tier has ID 200 (internally named MEMORY_TIER_DRAM), and all nodes with memory start out in that tier. Drivers for devices containing memory are able to request that their memory be placed into a different tier. Two other tiers are built into the patch set for this use: MEMORY_TIER_HBM_GPU (300) for high-bandwidth memory, and MEMORY_TIER_PMEM (100) for persistent memory. If a driver knows that its device has one type of memory or the other, it can place that memory into the appropriate tier.
The patch series also provides a set of files in sysfs (described in the cover letter and this documentation patch) that can be used to examine the current memory-tier configuration. New tiers can be created from user space, and nodes can be moved between tiers using the sysfs interface. The patch set also makes the demotion policy used to move pages to slower tiers more explicit and configurable.
Weiner responded to an earlier version of the patch set by questioning the assignment of all nodes to the same tier, regardless of whether they contain a CPU:
Making tiers explicit is a good idea, but can we keep the current default that CPU-less nodes are of a lower tier than ones with CPU? I'm having a hard time imagining where this wouldn't be true... Or why it shouldn't be those esoteric cases that need the manual tuning.
That behavior remains unchanged in the current version of the patch set, though.
In any case, the default tier-assignment policy is an easy thing to change at any future point. The overall structure of a mechanism to make tiers into explicit objects could be rather harder to change once it's merged and its sysfs files become part of the kernel API. That aspect of the patch set does not appear to be controversial, though; after seven revisions, most of the review comments have been addressed. So, while there may be space for a tweak or two, this work seems to be about ready to be merged.
A "fireside" chat
In something of an Open Source Summit tradition, Linus Torvalds and Dirk Hohndel sit down for a discussion on various topics related to open source and, of course, the Linux kernel. Open Source Summit North America (OSSNA) 2022 in Austin, Texas was no exception, as they reprised their keynote on the first day of the conference. The headline-grabbing part of the chat was Torvalds's declaration that Rust for Linux might get merged as soon as the next merge window, which opens in just a few weeks, but there was plenty more of interest there.
Hohndel introduced himself as the chief open source officer at the Cardano Foundation; he is working to help foster an open-source ecosystem around the foundation's blockchain technology. Torvalds said that these "fireside chats" are held because of his wishes; "I do software", not public speaking, he said, so the format makes it easier for him. He effectively has outsourced figuring out what people are interested in hearing about to Hohndel; with a grin, Torvalds said, "if he asks bad questions, it's not my fault".
Hohndel said that it was "super exciting" to see so many people in the room for the keynote; there were around 1200 in-person attendees for OSSNA. Things have been quite different over the last few years and he asked Torvalds how all of that had affected him and the kernel. "This whole COVID mess affected the kernel almost not at all", Torvalds said; certainly individuals were affected in various negative ways, but from a development standpoint, the kernel project continued apace—or even sped up during the lockdowns. In part, that is because kernel development has always been a distributed project where many participants were already working from home.
Torvalds noted that he has been working from home for 20 years now. Hohndel said that it was interesting in the early days of the pandemic when companies were trying to figure out how to work remotely and rely on email more, which is something that the kernel community figured out long ago. "To us, it didn't feel as new."
Hohndel asked whether Torvalds had noticed any changes in the kernel process from the events of the last few years. Torvalds spent a little bit of time describing the normal nine- to ten-week kernel development cycle that will be familiar to LWN readers; it has been the standard for kernel releases for 15 years at this point. It is a calm, staid process, "which is, I think, exactly what you want".
But that "boring and staid development process" does not lead to the same characteristics in the development itself. There are lots of changes going on in the core kernel, "and I'm really happy to see that". One might think that a 30-year-old project would have gotten boring, but he is actively trying to encourage people to "do the exciting stuff". There are new architectures being added, people are trying out new languages for kernel development, and core parts of the kernel are being improved in fundamental ways. "It is not just at the edge that the kernel is growing," which is one of the things that he personally enjoys the most. "We're not a dead project."
Don't break user space
Hohndel then referred to a recent discussion about the limits of the kernel's "don't break user space" promise, especially with regard to BPF programs. More and more things are being implemented in the kernel using BPF rather than, say, system calls, he said; what is the point where "your API ends and crazy space begins?" Torvalds said that "'crazy space' is hopefully all internal" to the kernel for one thing.
But he does not like talking about this in terms of APIs, because people will look at the documentation for an API and argue that if someone is not following what was written, it is their problem, not the kernel's. "I feel that's a complete cop-out and it's just bad policy. Documentation is worthless compared to reality." He did note that his opinion is biased because he "never writes documentation".
Torvalds's rule has never been that the API cannot be broken, but that the kernel "can't break people's loads, you can't break what people do". If someone takes advantage of a bug in the kernel, "that bug is not a bug, it's a feature". The kernel will maintain that feature forever, unless there are "really pressing concerns", almost all of which are security-related, that require it to change. "We will go to insane lengths to actually keep bug-for-bug compatibility."
Torvalds is not just a developer, he is also a user, and he thinks that the most annoying problem that users experience is "doing a software upgrade and things stop working". He cannot change all of the other software projects out there, which generally have different policies, but he has been "very hard-nosed" with kernel policy to try to ensure that programs continue to run after an upgrade. If that does not happen, "you're supposed to scream at us". The kernel has had great success with that policy, he said, and he wished that everyone in the room would push for this in their projects as well.
Hohndel probed a bit deeper, though. For example, BPF programs are loaded from user space, as are kernel modules, including those that are out of tree; would breaking those cross the line, he asked. Torvalds said that he does not consider breaking kernel modules to be breaking user space. Those who are writing their own modules need to update them if the kernel changes; out-of-tree modules are "heavily discouraged" though they sometimes make sense as a step in the development path. He is not a lawyer, so he does not want to even look at the question of when third-party modules may run afoul of the kernel's GPL license; "I don't know when that gray area turns into black."
But for user-space programs "that are part of user workflow", whether they use BPF or any other kernel feature, it is generally on the kernel developers to ensure that they do not break, Torvalds said. Some people use BPF for low-level tracing, statistics gathering, and debugging, though; when it is used that way, kernel changes may break those programs "and we can't do anything about it". He is talking about users, he said, "not people who are digging around in the kernel".
Rust
Hohndel asked about the status of Rust for the kernel; "I haven't seen any actual patches being merged yet." Torvalds said that the patches are available, but they have not been merged. Part of the issue is that the kernel project has gotten more careful over the years; "we were more freewheeling 30 years ago." But some believe that the kernel has become too risk-averse at this point.
Rust for the kernel has been discussed for years at this point, so "real soon now we will actually have it merged in the kernel—maybe next release". That was met with a round of applause, but Torvalds was quick to caution that "to me it's a trial run". The memory safety offered by Rust means it has real technical benefits. But the kernel tried C++ more than 25 years ago; "we tried it for two weeks and then ... we stopped trying it."
To Torvalds, "Rust is a way to try something new" for the kernel. People have been working hard on it and "I really hope it works out because otherwise they will be bummed". But it will start "very small" and in "very specific parts of the kernel"; "we're not rewriting it all in Rust." Some kernel developers want to do something new and fun; "I think Rust makes a lot of technical sense."
But Rust is a much different language than C, so Hohndel wondered what the challenges would be for kernel maintainers who might start getting code in a language they do not know, or know well. Torvalds does not "see that as a big issue"; there are multiple languages in the kernel today. For example, the kernel makefiles are "makefiles in name only"; they are an "unholy mess of various macros and other helper functions that are really hard to understand". Beyond that, he gets patches with Perl code in them, but does not "even pretend to understand Perl"; in fact, he thinks it is a write-only language. But he generally trusts the maintainers to do the right thing.
It is his long-standing policy to trust maintainers—at least "until they screw up". When that happens, he sometimes responds in an "overly impolite" fashion and he was quite apologetic about that. "It is a personal failing of mine and I mean that very seriously; sometimes I go overboard". Hohndel suggested that perhaps it was "in a loving way", but Torvalds disagreed: "I wish I could say that." He said that he should "preemptively apologize to the Rust people".
He has seen the worries from some developers in the kernel community with regard to not understanding Rust. He thinks it is fine not to understand it; people do not understand the kernel's virtual memory (VM) subsystem, even though it is written in C. "The language is generally not the biggest hindrance to understanding." There will be Rust maintainers just as there are VM maintainers today; Rust is a "small technical change, not a fundamental one".
Hardware security woes
Fixes for various hardware security bugs, especially those related to speculative execution, have been with us for a number of years now, Hohndel said. Some of the "very well-maintained architectures", like x86 and Arm64, have a steady stream of these kinds of problems that need to be addressed in the kernel. There are lots of other architectures in the kernel, however, and he wondered why they did not seem to be affected by that. "Are the others safe or are they just not as well-maintained?" "I think it is a bit of both", Torvalds said.
These kinds of problems go back to Meltdown and Spectre, he said. Those problems made people realize that even if the software is written in a secure manner—which is rare and hard to do—"the hardware sometimes ends up really screwing you over anyway". The kernel is not the only software project affected; browser and virtualization developers have also been affected. "It's very frustrating" when the hardware cannot be trusted and "you have to do a lot of extra work to fix hardware bugs". The good news is that the kinds of vulnerabilities being found are getting more esoteric and are affecting fewer people.
The architectures affected by these kinds of bugs are mostly just x86, Arm, PowerPC, and s390, he said, in part because there are more Linux systems running on those processors. But the kernel supports 15 other architectures "that never see these kinds of security problems". Sometimes that is because they are embedded, in-order processors that do little or no speculation. Other times, there is less scrutiny of the architectures and there are fewer people "digging into what could go wrong" on those systems.
He has gotten used to these hardware security problems; "I used to be very very frustrated about this." After five years, "you kind of grow inured to the pain". Hohndel asked if the hardware vendors had gotten better at working with the kernel community on handling and fixing these kinds of bugs. Torvalds said that they had; much of the early pain was due to something of a culture clash between the kernel community and the hardware vendors.
Normally, kernel security fixes are disclosed within seven days, he said; the patches are posted publicly, though the implications may not be disclosed at that point. The hardware companies want to have security embargoes of, say, 12 months, which makes it difficult. He does not want to have to try to remember which bugs have not been disclosed, so that they should not be discussed, and so on.
The whole process goes against the way things normally work in open source. Another problem is that the people who can fix the bug in question may be people who "you can't talk to". So a small team has to be gathered to fix the bug, without talking to all of the relevant people, without using all of the public testing infrastructure, and so on. The next two weeks after a release of a bug fix of this sort are always spent fixing up the details that slipped through the cracks because the normal process could not be followed.
Moving on to the wider question of security for our systems, Torvalds emphasized that bugs need to be accepted. "Bugs will happen, if they don't happen in hardware, they will happen in software, and if they don't happen in your software, they will happen in somebody else's software." Some of those bugs will be security bugs; the "only way to try to do security right" is to have multiple layers of defense.
The kernel provides a layer of its own, he said, but it has multiple security layers within it as well. There are various hardening efforts going on to try to reduce the scope of bugs when they occur. The compilers are being used to insert various kinds of checks into the kernel, some of which are expensive, and some of which are assisted by various hardware features for things like pointer authentication. But defensive programming at all layers of today's software stacks is going to be needed. "Anybody who thinks you can get 100% security is living in some dream world that is not this reality," Torvalds said.
Fun stuff
"I don't want to make this a security talk, I want to talk about stuff that's fun", Hohndel said, to some laughter in the audience. He asked Torvalds about what he did outside of software development, though Torvalds disclaimed any knowledge of that at first. "Is there life outside of the screen in front of you?"
He did admit to working on a hardware project during the pandemic. The problem, he found, was one of motivation. He really likes "doing social software", though not in the social networking sense of that term. He likes talking with people during the development—over email, not face-to-face, he emphasized. With a laugh, Hohndel noted that there were a lot of caveats in all of that.
Torvalds said that he designed a circuit board with a microcontroller on it, which worked correctly when it came back from being manufactured in China. But at that point, he thought "now what?" It was an interesting learning experience, but it turns out he really did not want to create hardware.
Beyond that, Torvalds likes to scuba dive. He started another open-source project, Subsurface, to log his dives, because the dive computer makers did not consider Linux a big target for their software. Subsurface is now maintained by Hohndel, who thanked him for the plug.
"Linux is my baby", Torvalds said, but he found it amusing that when his oldest daughter went to college (for computer science—"I didn't push her"), he was much better known for Git in the CS department. "I only did Git for six months"; he will take the credit, but in reality Junio Hamano has done an excellent job of maintaining the project. If Git-using attendees ever see Hamano "buy him a beer or something". Torvalds said: "My name comes up much too often when it comes to Git." With that, the keynote session ran out of time; they plan to be back on the stage in Dublin for OSS Europe (OSSEU) in September.
[I would like to thank LWN subscribers for supporting my travel to Austin for OSSNA.]
NFS: the new millennium
The network filesystem (NFS) protocol has been with us for nearly 40 years. While defined initially as a stateless protocol, NFS implementations have always had to manage state, and that need has been increasingly built into the protocol over successive revisions. The early days of NFS were discussed, with a focus on state management, in the first part of this series. This article completes the job with a look at the evolution of NFS since, approximately, the beginning of this millennium.
The early days of NFS were controlled by Sun Microsystems, the originator of the NFS protocol and author of both the specification and implementation. As the new millennium approached, interest in NFS increased and independent implementations appeared. Of particular relevance here are the implementations in the Linux kernel that drew my attention — particularly the server implementation — and the Filer appliance produced and sold by Network Appliance (NetApp).

The community's interest in NFS extended as far as a desire to have more say in the further development of the protocol. I do not know what negotiations happened, but happen they did, and one clear outcome is documented for us in RFC 2339, wherein Sun Microsystems agreed to assign to The Internet Society certain rights concerning the development of version 4 (and beyond) of NFS, providing this development achieved "Proposed Standard" status within 24 months, meaning by early 2000. That particular deadline went whooshing past and was extended. We got a "Proposed Standard" in late 2000 with RFC 3010, which was revised for RFC 3530 in April 2003 and again for RFC 7530 in March 2015.
The IETF working group tasked with this development was mostly driven by Sun and NetApp; it had two co-chairs, one from each company, and most of the authors listed on the RFC are from these companies. My memory of these discussions is that there was quite a long list of perceived needs but no shared vision of a coherent whole. The impending (and changing) deadline drove a desire to get something out, even if it wasn't perfect. Consequently NFSv4 — particularly this first attempt which is now referred to as NFSv4.0 — felt to me like useful pieces that had been glued together, rather than carefully woven into a fabric; the elegance that I could see in NFSv2 was gone.
NFSv4 brought all of the various protocols that we saw before into one single protocol with one single specification. While the "tools" approach can be extremely powerful and is great for building prototypes, there usually comes a time when the strength provided by integration outweighs the agility provided by discrete components — for NFS, version 4 was that time. Support for access-control lists (ACLs), quota-tracking, security negotiation, namespace management, byte-range locking, and status management were all brought together, often in quite different forms than in their original separate incarnations. Of all these many changes, I want to focus on just two areas that have implications for the management of shared state.
The change attribute and delegations for cache consistency
As we have already seen, timestamps are not ideal for tracking when file content has changed on the server. Even if the client knows that timestamps are reported with some high precision, it cannot know how many write requests can be processed in that unit of time. So a timestamp is, at best, a useful hint. The designers of NFSv4 wanted something better, so they introduced the "change" attribute, sometimes called a "changeid". This is a 64-bit number that must be increased whenever the object (such as a file or directory) changes in any way.
This changeid is a mandatory aspect of the protocol so, for several years, the Linux NFS server was noncompliant since no Linux filesystem could provide a conforming changeid. This was fixed in Linux 2.6.31, but only for ext4, with XFS following in v3.11. For filesystems that don't provide an i_version, the Linux NFS server lies and uses the inode's change time instead, which may not provide the same guarantees.
The wcc (weak cache consistency) attribute information that NFSv3 introduced is preserved in NFSv4, but only for directories, though it uses the changeid rather than the change time and, strangely, is not provided for the SETATTR operation. Wcc attributes are not provided for files. It is still possible to get "before" and "after" attributes for WRITE requests, as every NFSv4 request is a COMPOUND procedure comprising a sequence of basic operations, and this can contain the sequence GETATTR, WRITE, GETATTR. However, these are not guaranteed to be performed atomically, so some other client could also perform a WRITE between the two GETATTR calls. If the difference between the "before" and "after" changeids is precisely one, it should be safe to assume no intervening changes, but the protocol specification doesn't make this explicit.
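Had the specification made that guarantee, the client-side check would be trivial, as this purely illustrative sketch (invented names, not the Linux NFS client's code) shows:

    #include <stdint.h>

    /* Compare the changeids from the GETATTR operations surrounding a
       WRITE in a single COMPOUND; invented helper, illustrative only. */
    static int write_was_only_change(uint64_t change_before,
                                     uint64_t change_after)
    {
        /* Exactly one bump would mean no other client intervened, so
           the client's cached view of the file is still accurate. */
        return change_after == change_before + 1;
    }

Instead, NFSv4 provides delegations.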
A delegation (or more specifically a "read delegation") is a promise made by the server to the client that no other client (or local application on the server) can write to the file. The server will proactively recall the delegation before any conflicting open request will be allowed to complete. While a client holds a delegation, it can be sure that all changes made on the server were made by itself, so it doesn't even need to check the changeid to ensure that its cache remains accurate. This provides a strong cache-coherency guarantee.
So, providing that the server offers a read delegation whenever a client opens a file that no one else is writing to, caching is easy. Exactly when the server should do that is not entirely clear; there is a cost in offering delegations since they need to be recalled when the file is opened for write access, and this can delay the open request.
Note that the server can also offer a "write delegation" if no other clients or applications have the file open for either read or write. It is not clear to me how useful this really is. The most obvious theoretical benefits are that writes do not need to be flushed before a file is closed, and that byte-range locks do not need to be notified to the server. Whether these benefits are practical is less obvious. The Linux kernel's NFS server never offers write delegations.
clientids, stateids, seqids, and client state management
As mentioned, NFSv4 integrates byte-range locking and thus needs the server to track the state of all locks held by each client; the server also needs to know when the client restarts. The design of this functionality is all new in NFSv4 (and somewhat improved in NFSv4.1).
The biggest difference in usability is that, rather than a client reporting that it was rebooted (as the STATMON protocol allows), the NFSv4 client needs to regularly report that it hasn't rebooted. If the server hasn't heard from the client for the "lease time" (typically 90 seconds in Linux) it is permitted to assume the client has disappeared, and must not prevent another client from performing an access that would be blocked by the state held by the first client. So clients that are not otherwise active need to at least send an "I'm still here" message (RENEW in NFSv4.0) often enough so that, even with possible network delays, the server will never go 90 seconds (or whatever is configured) without seeing something.
This all means that, if a machine crashes without rebooting, locked files do not remain locked indefinitely. Conversely, it means that, if a router failure or cabling problem causes network traffic to be interrupted for too long (known as a "network partition"), locks can be lost even while the client is still up and, when the network is fixed, the client will not be able to proceed. On Linux, the client application will receive an EIO error for any attempt to access a file descriptor on which it held a lock that has since been lost.
All of this could have been achieved relatively simply. For example, each request could contain a timestamp of when the client booted. The server would remember this against the IP address of the client and, if it ever changed, or if nothing were seen for a period of at least the lease time, the server could discard any state that the client previously owned. Correspondingly, each reply from the server could contain a timestamp so that server reboots could be detected by the client. However, this would have been too simplistic. The designers of NFSv4 had considerable experience with NFSv3 and with the Network Lock Manager to guide them, and they decided that there was sufficient justification for some more complexity.
Clientids and the client identifier
Depending on a client's IP address is not really a good idea. Partly, this is because running multiple clients in user space, or using network address translation (NAT), can result in several clients having the same IP address, and partly because mobile devices can change their IP address. The latter wasn't a big concern during NFSv4.0 development (though 4.1 handles it better), but user-space clients and the problems of NAT certainly were. Different clients could be identified by their port number, but if a client connecting through a NAT gateway lost its connection and had to re-establish it, the new connection could use a different port, thus appearing to be a different client. So NFSv4 requires each client to generate a universally unique client identifier (which can be up to 1KB in size), combine that with an instance identifier (like a boot timestamp), and submit both to the server via the SETCLIENTID request. The server responds with a 64-bit clientid number that can then be used in any request in which the client needs to identify itself.
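The following sketch models that exchange. The structures are greatly simplified (a real SETCLIENTID also carries callback information, for example), but it shows why the identifier must be stable across reboots while the verifier must not be:

    import os
    import time

    class Server:
        def __init__(self):
            self.clients = {}   # client identifier -> [verifier, clientid]
            self.next_id = 1

        def setclientid(self, ident, verifier):
            entry = self.clients.get(ident)
            if entry is None or entry[0] != verifier:
                # Either a brand-new client, or a known client with a new
                # boot verifier; in the latter case a real server would
                # discard all of the old instance's state here.
                entry = [verifier, self.next_id]
                self.next_id += 1
                self.clients[ident] = entry
            return entry[1]     # the 64-bit clientid

    # The identifier must be stable across reboots (Linux derives it from
    # the host name); the verifier changes with every boot.
    ident = os.uname().nodename
    verifier = int(time.time())
    clientid = Server().setclientid(ident, verifier)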
This client identifier is the one recently discussed at the Linux Storage, Filesystem, and Memory-management Summit (LSFMM). By default, Linux uses the host name as the main source of uniqueness. This works well enough on private networks when hosts are well configured. Problems arise, though, in containers that do not configure a unique host name but which do create a new network namespace and, as a result, get an independent NFS client instance. Problems may also arise in situations where clients in different administrative domains (and hence with possible host-name overlap) access a shared server.
stateids and seqid — per-file state
Another shortcoming of the simple approach is that it collects all of the state together without clear differentiation in either space or time.
Differentiation in space means that the state of each file can be managed separately. In particular, if the server hasn't heard from the client for the lease time, it must discard any state for which there is a conflicting request, but it isn't required to discard state which is uncontested. So, when the client regains contact, it might have lost access to some files but not others. This requires that it be possible to identify different elements of state so that the server can tell the client which have been lost and which are still valid. The Linux NFS server wasn't able to realize the full benefits of this until recently, when the "Courteous Server" functionality was merged.
This finer-grained state management is largely realized by the NFSv4 OPEN request. The very existence of OPEN is a departure from the approach of NFSv3 and is only possible because the server can track the state of the client and, in particular, which files it has open. An OPEN request can indicate which of READ or WRITE access is required, or both, and it can ask the server to DENY READ or WRITE access to all other clients. This denial is anathema for POSIX interfaces, but is needed for interoperability with other APIs. The Linux server supports such denial between NFSv4 clients, but doesn't allow local access on the server, or NFSv3 access, to be blocked; in addition, existing local opens do not cause a conflicting NFSv4 OPEN to fail.
The NFSv4 OPEN request also indicates an "open owner", which is formed from an active clientid combined with an arbitrary label. The Linux client generates a different label for each user so, if a single user opens a file multiple times concurrently, the server will just see the file opened once. The OPEN request returns a "stateid" which represents the open file and should be used in any READ/WRITE requests. Each such stateid is distinct and can be invalidated by the server (in exceptional circumstances) without affecting any other.
Subsequent OPEN or OPEN_DOWNGRADE requests can change both the access flags and the DENY flags associated with the file (for the relevant open-owner). Each of these yields a new stateid, though it is not entirely new. Each stateid has two parts: the "seqid", which increments on each change, and an "other" part, which doesn't. This allows the client to unambiguously determine whether a second OPEN request opened the same file as the first — as the "other" parts of the stateids will match. It also makes it clear in what order various changes were performed on the server, so the client can be sure it remains synchronized with the server. Thus the "seqid" gives a time dimension to the states.
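How a client interprets those two parts can be sketched in a few lines; the representation here is, of course, hypothetical:

    from typing import NamedTuple

    class StateID(NamedTuple):
        seqid: int     # increments on every change to this state
        other: bytes   # identifies the state itself; never changes

    def same_state(a: StateID, b: StateID) -> bool:
        # Two OPENs returned state for the same open-owner/file pair
        # exactly when the "other" parts match.
        return a.other == b.other

    def newer(a: StateID, b: StateID) -> bool:
        # For matching state, the larger seqid reflects the later change
        # on the server, letting the client keep its view in sync.
        return a.other == b.other and a.seqid > b.seqid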
This can be particularly relevant when a CLOSE request is sent at much the same time as an OPEN request for the same file. The client may not know it is operating on the same file, due to the presence of hard links, so this cannot be seen as incorrect behavior by the client. If the server performs the CLOSE first, the OPEN will then likely succeed and all will be well. If the server performs the OPEN first it will increment the seqid of the state for that file and, when it sees the CLOSE, it will reject it because the seqid is old. The file will stay open in either case, which is what the client wanted (since it opened the file twice but only closed it once).
There are similar stateids, including seqids, for LOCK and UNLOCK requests, and these have a corresponding "lock owner". The lock owner will correspond to a process for POSIX locks, or an open file descriptor for OFD locks.
NFSv4.1 — a step forward
The NFSv4 working group went to some trouble to allow for future versions and to describe the sorts of things that might change. This was not in vain and, in January 2010, NFSv4.1 was described in RFC 5661 (with an update in RFC 8881 about 10 years later).
V4.1 contains lots of little improvements based on several years of experience with what had been nearly an entirely new protocol. Many people are suspicious of "dot-zero" releases and, with NFSv4, there is some justification for this. NFSv4.0 does work, and works quite well, but 4.1 works better. Most of the little things don't rate a mention here, but the decision to exclude UDP as a supported transport is interesting because it is user-visible. UDP has no congestion control, so NFS doesn't work well over it in general. Of course, UDP can still be used as long as some other protocol that manages congestion, like QUIC, is layered in between. V4.1 also allows the server to tell the client that it is safe to unlink files that are still open so the clumsy renaming-on-remove can be avoided.
Possibly the biggest user-visible change in NFSv4.1 is the addition of "pNFS" — parallel NFS. This appears to be a marketing term, as it is easy to say but only loosely captures the important changes. With NFSv4.1, it becomes possible to offload I/O requests (READ and WRITE) to some other protocol, which may well communicate over a different medium with a different server. This allows a single NFS client to communicate with a cluster filesystem without having to channel all the requests through a single IP address. This is certainly an extra level of parallelism, but it is not fair to say that earlier NFS did not allow any parallelism. Even NFSv2 could have multiple outstanding requests that the server could be handling concurrently.
These offload protocols, and how they integrate, are described in separate RFCs. There is support for a block-access protocol (RFC 5663), for the OSD object-storage protocol (RFC 5664), which might run over iSCSI, for example, and for a "flexible file" approach that uses NFSv3 or later for the data access (RFC 8435). These are of particular interest here only because they introduce a new sort of state that needs to be managed — objects called "layouts".
A layout describes how to access part of a file using some other protocol. Each layout has a stateid which can be allocated and then relinquished, so the server always knows which layouts might still be in use. This is important if the server needs, for example, to migrate a file to a different storage location — it probably shouldn't do that while a client thinks it knows what block location it can use to access that file.
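The life cycle of a layout can be sketched as follows; LAYOUTGET and LAYOUTRETURN are the real operation names from RFC 5661, while the client object and the storage-access helper are hypothetical stand-ins:

    def fetch_from_storage(layout):
        """Hypothetical: issue reads over the layout's data protocol."""
        raise NotImplementedError

    def read_via_pnfs(client, fh, offset, length):
        layout = client.send("LAYOUTGET", fh=fh, offset=offset,
                             length=length, iomode="READ")
        try:
            # The layout says where, and over what protocol, the data
            # lives; the client fetches it directly from storage.
            return fetch_from_storage(layout)
        finally:
            # Returning the layout tells the server that it is again
            # free to move the file's data elsewhere.
            client.send("LAYOUTRETURN", fh=fh, stateid=layout.stateid)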
Sessions and a reliable DRC
From our perspective of managing state, the biggest change in NFSv4.1 is that the protocol finally allows for a completely reliable duplicate request cache. As described in part 1, this is needed for the rare case when a request or reply might have been lost and the client has to resend a request. In versions up to and including NFSv4.0, the server would just make a best-effort guess at which requests and replies might be worth remembering for a while. In NFSv4.1, it can know.
An NFSv4.1 session is a new element of state that is forked off from the global clientid by the CREATE_SESSION request. Given a clientid and a sequence number, a new sessionid is allocated that has a collection of different attributes associated with it, including a maximum number of concurrent requests. The server will allocate this many "slots" in its duplicate request cache, and the client will assign a slot number less than this number to each request. The client promises never to reuse a slot number until it has seen a reply to the previous request with that number, and the server promises to remember the replies to the most recent request in each slot, if the client asked it to.
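A toy model of the slot machinery (not any real implementation) shows how this yields a reliable cache:

    class ServerSession:
        def __init__(self, nslots):
            self.cache = [None] * nslots     # (seq, reply) per slot

        def handle(self, slot, seq, request, cache_this):
            cached = self.cache[slot]
            if cached and cached[0] == seq:
                return cached[1]             # a retransmission: replay it
            reply = f"result of {request}"   # stand-in for real processing
            # Remember the reply only if the client asked for that; the
            # previous entry in this slot can always be discarded, because
            # the client has promised it saw the reply before reusing it.
            self.cache[slot] = (seq, reply if cache_this else None)
            return reply

    class ClientSession:
        def __init__(self, nslots):
            self.free = list(range(nslots))
            self.seqs = [0] * nslots

        def acquire(self):
            slot = self.free.pop()           # a real client would wait
            self.seqs[slot] += 1             # if no slot were free
            return slot, self.seqs[slot]

        def release(self, slot):
            # Only after seeing the reply may the slot be reused.
            self.free.append(slot)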
The client can even ask the server to store the cached replies in stable storage so that they survive a reboot. If the server implements this functionality and agrees to provide it, then the result is as close to perfect exactly-once semantics as it is possible to get.
There is, superficially, an imbalance here. The requests that are most common (READ, WRITE, GETATTR, ACCESS, LOOKUP) are idempotent and do not need to be cached, yet the server must reserve cache space for each slot, thus either wasting cache space or unnecessarily limiting concurrency. This is not a problem in practice, since the protocol allows the client to create multiple sessions with different parameters. It could create one session with a large slot count and a maximum cached size of zero, and use it for all idempotent requests; it could create another with a more modest slot count and a much larger maximum cached size, and use that for requests that mustn't be repeated.
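The split can be expressed through the session's channel attributes; the names below (ca_maxrequests, ca_maxresponsesize_cached) are taken from RFC 5661, though the parameter sets themselves are just an illustration:

    # Parameters a client might pass to two CREATE_SESSION requests.
    idempotent_session = {
        "ca_maxrequests": 64,               # plenty of concurrency...
        "ca_maxresponsesize_cached": 0,     # ...but nothing cached
    }
    careful_session = {
        "ca_maxrequests": 4,                # modest concurrency...
        "ca_maxresponsesize_cached": 4096,  # ...with replies cached
    }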
Directory delegations
A third new sort of state in NFSv4.1 — accompanying layouts and sessions — is directory delegations. In NFSv4.0, it is possible to open files, but all interactions with directories remain much as they were in NFSv3: each operation is discrete and there is no ongoing state. In v4.1, we get something a bit like an OPEN of a directory with GET_DIR_DELEGATION. This request doesn't contain an explicit clientid; instead, it uses the clientid associated with the session that the request is part of. The delegation is essentially a standing request that the client be informed of any changes made to the directory by other clients. Depending on what specifics are negotiated, this might involve the server saying "something has changed, you no longer have the delegation", or it may provide more fine-grained details of what, specifically, has changed. This allows for strong file-name cache coherence and even allows client-side applications to receive notifications of changes. This functionality is not implemented by Linux, either for the server or the client.
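Were it implemented, client code might look something like this sketch; GET_DIR_DELEGATION is the real operation name, but the client API and callback plumbing here are entirely hypothetical:

    class CachedDir:
        """A name cache that stays valid until the server objects."""

        def __init__(self, client, dir_fh):
            self.entries = client.send("READDIR", fh=dir_fh)
            client.send("GET_DIR_DELEGATION", fh=dir_fh,
                        callback=self.on_notification)
            self.valid = True

        def on_notification(self, note):
            # The server may say precisely what changed, or just revoke
            # the delegation; either way this cache must be refreshed.
            self.valid = False

        def lookup(self, name):
            if self.valid:
                return self.entries.get(name)  # no round trip needed
            return None                        # revalidate with the server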
NFSv4.2 — meeting customer needs
The latest version of NFS is v4.2, described in RFC 7862 (November 2016). That document describes the goals of this revision as being "to take common local file system features that have not been available through earlier versions of NFS and to offer them remotely."
In contrast to NFSv4.1, which primarily provided existing functionality in a more efficient or reliable manner, v4.2 provides genuinely new functionality — at least new to NFS. This includes support for the SEEK_DATA and SEEK_HOLE functionality of lseek(), which allows sparse files to be managed efficiently; for posix_fallocate(), to explicitly allocate and deallocate space in a file; for posix_fadvise(), so the client can tell the server what sort of I/O patterns to optimize for; and for reflinks, which allow content to be shared between files without copying. None of these add any new state to the protocol, so they aren't directly in our area of focus for this discussion.
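Two of these, SEEK_DATA and SEEK_HOLE, are directly visible from Python on Linux, so a client-side example needs no hypothetical API at all; this walks the allocated extents of a possibly-sparse file, which an NFSv4.2 client can now do without reading every byte:

    import os

    def data_extents(path):
        """Yield (start, end) byte ranges of a file that contain data."""
        with open(path, "rb") as f:
            fd = f.fileno()
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                try:
                    start = os.lseek(fd, offset, os.SEEK_DATA)
                except OSError:      # ENXIO: no data beyond this offset
                    return
                end = os.lseek(fd, start, os.SEEK_HOLE)
                yield (start, end)
                offset = end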
One new element of functionality that does involve a new form of state is server-side copy. This functionality can support copy_file_range() and can copy between two files on one server or — if the servers cooperate — between two files on different servers. Closely related functionality (using the WRITE_SAME operation) can initialize a file using a given pattern repeated as necessary.
When the client sends a COPY request, the server has the option of either performing the full copy before replying, or scheduling the operation asynchronously and returning immediately. In the latter case, a stateid is returned to the client which represents the ongoing action on the server. The client can use this stateid to query status (for the all-important progress bar) or to cancel the copy. The server can use the stateid to notify the client of completion or failure.
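A sketch of a client driving such a copy might look like this; the operation names COPY, OFFLOAD_STATUS, and OFFLOAD_CANCEL come from RFC 7862, while the client object is, as before, a hypothetical stand-in:

    import time

    def server_side_copy(client, src, dst, length, timeout=300):
        reply = client.send("COPY", src=src, dst=dst, count=length)
        if reply.done:
            return length                # the server copied synchronously

        stateid = reply.stateid          # handle for the ongoing copy
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = client.send("OFFLOAD_STATUS", stateid=stateid)
            if status.complete:
                return status.bytes_copied
            # The all-important progress bar.
            print(f"{status.bytes_copied}/{length} bytes copied")
            time.sleep(1)

        # Taking too long; the same stateid identifies the copy to cancel.
        client.send("OFFLOAD_CANCEL", stateid=stateid)
        raise TimeoutError("server-side copy did not complete")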
The future for NFS
As yet, there are no hints of an NFSv4.3 in the foreseeable future, and it could be that no such version will ever be described. The v4.2 specification differs from its predecessors in that it is not a complete specification; instead, it references the v4.1 specification and adds some extensions. The model for future extension, which is described in RFC 8178 (July 2017), allows further incremental changes to be added without requiring that the minor version number be changed. This has already been put to good use with RFC 8275, which allows the POSIX "umask" to be sent to the server during file creation, and RFC 8276, which adds support for extended attributes.
There are, of course, still ongoing development efforts around NFS. One of the more interesting areas involves describing how the NFS protocol can be usefully transported on something other than TCP or RDMA, which are the main two protocols in use today.
One draft, draft-cel-nfsv4-rpc-tls-pseudoflavors-02, looks at using ONC-RPC (the underlying RPC layer used by NFS) over TLS and, particularly, explores how the authentication provided by TLS can interact with the authentication requirements of NFS. Another, draft-cel-nfsv4-rpc-over-quicv1-00, builds on this to explore how NFS can be used over the QUIC protocol. The "cel" in the names of these drafts refers to Chuck Lever, the current maintainer of the Linux NFS server.
Other drafts and all the RFCs can be found on the web page for the IETF NFSv4 working group.
While NFS, much like Linux, does not seem to be finished yet, it does appear to have come to terms with being a stateful protocol, with all of the state fitting into one coherent model. This is highlighted by the way that a totally new form of state — asynchronous copying on the server — was fit into the model in NFSv4.2 with no fuss. So it is likely that future improvements will focus elsewhere, perhaps following the recent moves toward improved security and support for new filesystem functionality. Who knows, maybe one day it will even make peace with Btrfs.