NFS: the new millennium
The early days of NFS were controlled by Sun Microsystems, the originator of the NFS protocol and author of both the specification and implementation. As the new millennium approached, interest in NFS increased and independent implementations appeared. Of particular relevance here are the implementations in the Linux kernel that drew my attention — particularly the server implementation — and the Filer appliance produced and sold by Network Appliance (NetApp). The community's interest in NFS extended as far as a desire to have more say in the further development of the protocol. I do not know what negotiations happened, but happen they did, and one clear outcome is documented for us in RFC 2339, wherein Sun Microsystems agreed to assign to The Internet Society certain rights concerning the development of version 4 (and beyond) of NFS, providing this development achieved "Proposed Standard" status within 24 months, meaning by early 2000. That particular deadline went wooshing past and was extended. We got a "Proposed Standard" in late 2000 with RFC 3010, which was revised for RFC 3530 in April 2003 and again for RFC 7530 in March 2015.
The IETF working group tasked with this development was mostly driven by Sun and NetApp; it had two co-chairs, one from each company, and most of the authors listed on the RFC are from these companies. My memory of these discussions is that there was quite a long list of perceived needs but no shared vision of a coherent whole. The impending (and changing) deadline drove a desire to get something out, even if it wasn't perfect. Consequently NFSv4 — particularly this first attempt which is now referred to as NFSv4.0 — felt to me like useful pieces that had been glued together, rather than carefully woven into a fabric; the elegance that I could see in NFSv2 was gone.
NFSv4 brought all of the various protocols that we saw before into one single protocol with one single specification. While the "tools" approach can be extremely powerful and is great for building prototypes, there usually comes a time when the strength provided by integration outweighs the agility provided by discrete components — for NFS, version 4 was that time. Support for access-control lists (ACLs), quota-tracking, security negotiation, namespace management, byte-range locking, and status management were all brought together, often in quite different forms than in their original separate incarnations. Of all these many changes, I want to focus on just two areas that have implications for the management of shared state.
The change attribute and delegations for cache consistency
As we have already seen, timestamps are not ideal for tracking when file content has changed on the server. Even if the client knows that timestamps are reported with some high precision, it cannot know how many write requests can be processed in that unit of time. So a timestamp is, at best, a useful hint. The designers of NFSv4 wanted something better, so they introduced the "change" attribute, sometimes called a "changeid". This is a 64-bit number that must be increased whenever the object (such as a file or directory) changes in any way.
This changeid is a mandatory aspect of the protocol so, for several
years, the Linux NFS server was noncompliant since no Linux filesystem
could provide a conforming changeid. This was fixed in Linux
2.6.31, but only for ext4, with XFS following in v3.11. For filesystems
that don't provide an i_version,
the Linux NFS server lies and uses
the inode's change time instead, which may not provide the same
guarantees.
The wcc (weak cache consistency) attribute information that NFSv3 introduced is preserved in NFSv4, but only for directories, though it uses the changeid rather than the change time and, strangely, is not provided for the SETATTR operation. Wcc attributes are not provided for files. It is still possible to get "before" and "after" attributes for WRITE requests, as every NFSv4 request is a COMPOUND procedure comprising a sequence of basic operations, and this can contain the sequence GETATTR, WRITE, GETATTR. However, these are not guaranteed to be performed atomically, so some other client could also perform a WRITE between the two GETATTR calls. If the difference between the "before" and "after" changeids is precisely one, it should be safe to assume no intervening changes, but the protocol specification doesn't make this explicit. Instead, NFSv4 provides delegations.
A delegation (or more specifically a "read delegation") is a promise made by the server to the client that no other client (or local application on the server) can write to the file. The server will proactively recall the delegation before any conflicting open request will be allowed to complete. While a client holds a delegation, it can be sure that all changes made on the server were made by itself, so it doesn't even need to check the changeid to ensure that its cache remains accurate. This provides a strong cache-coherency guarantee.
So, providing that the server offers a read delegation whenever a client opens a file that no one else is writing to, caching is easy. Exactly when the server should do that is not entirely clear; there is a cost in offering delegations since they need to be recalled when the file is opened for write access, and this can delay the open request.
Note that the server can also offer a "write delegation" if no other clients or applications have the file open for either read or write. It is not clear to me how useful this really is. The most obvious theoretical benefits are that writes do not need to be flushed before a file is closed, and that byte-range locks do not need to be notified to the server. Whether these benefits are practical is less obvious. The Linux kernel's NFS server never offers write delegations.
clientids, stateids, seqids, and client state management
As mentioned, NFSv4 integrates byte-range locking and thus needs the server to track the state of all locks held by each client; the server also needs to know when the client restarts. The design of this functionality is all new in NFSv4 (and somewhat improved in NFSv4.1).
The biggest difference in usability is that, rather than a client reporting that it was rebooted (as the STATMON protocol allows), the NFSv4 client needs to regularly report that it hasn't rebooted. If the server hasn't heard from the client for the "lease time" (typically 90 seconds in Linux) it is permitted to assume the client has disappeared, and must not prevent another client from performing an access that would be blocked by the state held by the first client. So clients that are not otherwise active need to at least send an "I'm still here" message (RENEW in NFSv4.0) often enough so that, even with possible network delays, the server will never go 90 seconds (or whatever is configured) without seeing something.
This all means that, if a machine crashes without rebooting, locked files do not remain locked indefinitely. Conversely, it means that, if a router failure or cabling problem causes network traffic to be interrupted for too long (known as a "network partition"), locks can be lost even while the client is still up and, when the network is fixed, the client will not be able to proceed. On Linux, the client application will receive an EIO error for any attempt to access a file descriptor on which it held a lock that has since been lost.
All of this could have been achieved relatively simply. For example, each request could contain a timestamp of when the client booted. The server would remember this against the IP address of the client and, if it ever changed, or if nothing were seen for a period of at least the lease time, the server could discard any state that the client previously owned. Correspondingly, each reply from the server could contain a timestamp so that server reboots could be detected by the client. However, this would have been too simplistic. The designers of NFSv4 had considerable experience with NFSv3 and with the Network Lock Manager to guide them, and they decided that there was sufficient justification for some more complexity.
Clientids and the client identifier
Depending on a client's IP address is not really a good idea. Partly, this is because running multiple clients in user space, or using network address translation (NAT), can result in several clients having the same IP address, and partly because mobile devices can change their IP address. The latter wasn't a big concern during NFSv4.0 development (though 4.1 handles it better), but user-space clients and the problems of NAT certainly were. Different clients could be identified by their port number, but if a client connecting through a NAT gateway lost its connection and had to re-establish it, the new connection could use a different port, thus appearing to be a different client. So NFSv4 requires each client to generate a universally unique client identifier (which can be up to 1KB in size), combine that with an instance identifier (like a boot timestamp), and submit both to the server via the SETCLIENTID request. The server responds with a 64-bit clientid number that can then be used in any request in which the client needs to identify itself.
This client identifier is the one recently discussed at the Linux Storage, Filesystem, and Memory-management Summit (LFSMM). By default, Linux uses the host name as the main source of uniqueness. This works well enough on private networks when hosts are well configured. Problems arise, though, in containers that do not configure a unique host name but which do create a new network namespace and, as a result, get an independent NFS client instance. Problems may also arise in situations where clients in different administrative domains (and hence with possible host-name overlap) access a shared server.
stateids and seqid — per-file state
Another shortcoming with the simple approach is that it collects all of the state together without clear differentiation in either space or time.
Differentiation in space means that the state of each file can be managed separately. In particular, if the server hasn't heard from the client for the lease time, it must discard any state for which there is a conflicting request, but it isn't required to discard state which is uncontested. So, when the client regains contact, it might have lost access to some files but not others. This requires that it be possible to identify different elements of state so that the server can tell the client which have been lost and which are still valid. The Linux NFS server wasn't able to realize the full benefits of this until recently, when the "Courteous Server" functionality was merged.
This finer-grained state management is largely realized by the NFSv4 OPEN request. The very existence of OPEN is a departure from the approach of NFSv3 and is only possible because the server can track the state of the client and, in particular, which files it has open. An OPEN request can indicate which of READ or WRITE access is required, or both, and it can ask the server to DENY READ or WRITE access to all other clients. This denial is anathema for POSIX interfaces, but is needed for interoperability with other APIs. The Linux server supports such denial between NFSv4 clients, but doesn't allow local access on the server, or NFSv3 access, to be blocked; in addition, existing local opens do not cause a conflicting NFSv4 OPEN to fail.
The NFSv4 OPEN request also indicates an "open owner", which is formed from an active clientid combined with an arbitrary label. The Linux client generates a different label for each user so, if a single user opens a file multiple times concurrently, the server will just see the file opened once. The OPEN request returns a "stateid" which represents the open file and should be used in any READ/WRITE requests. Each such stateid is distinct and can be invalidated by the server (in exceptional circumstances) without affecting any other.
Subsequent OPEN or OPEN_DOWNGRADE requests can change both the access flags and the DENY flags associated with the file (for the relevant open-owner). Each of these yields a new stateid, though it is not entirely new. Each stateid has two parts: the "seqid", which increments on each change, and an "other" part, which doesn't. This allows the client to unambiguously determine whether a second OPEN request opened the same file as the first — as the "other" parts of the stateids will match. It also makes it clear in what order various changes were performed on the server, so the client can be sure it remains synchronized with the server. Thus the "seqid" gives a time dimension to the states.
This can be particularly relevant when a CLOSE request is sent at much the same time as an OPEN request for the same file. The client may not know it is operating on the same file, due to the presence of hard links, so this cannot be seen as incorrect behavior by the client. If the server performs the CLOSE first, the OPEN will then likely succeed and all will be well. If the server performs the OPEN first it will increment the seqid of the state for that file and, when it sees the CLOSE, it will reject it because the seqid is old. The file will stay open in either case, which is what the client wanted (since it opened the file twice but only closed it once).
There are similar stateids, including seqids, for LOCK and UNLOCK requests, and these have a corresponding "lock owner". The lock owner will correspond to a process for POSIX locks, or an open file descriptor for OFD locks.
NFSv4.1 — a step forward
The NFSv4 working group went to some trouble to allow for future versions and to describe the sorts of things that might change. This was not in vain and, in January 2010, NFSv4.1 was described in RFC 5661 (with an update in RFC 8881 about 10 years later).
V4.1 contains lots of little improvements based on several years of experience with what had been nearly an entirely new protocol. Many people are suspicious of "dot-zero" releases and, with NFSv4, there is some justification for this. NFSv4.0 does work, and works quite well, but 4.1 works better. Most of the little things don't rate a mention here, but the decision to exclude UDP as a supported transport is interesting because it is user-visible. UDP has no congestion control, so NFS doesn't work well over it in general. Of course, UDP can still be used as long as some other protocol that manages congestion, like QUIC, is layered in between. V4.1 also allows the server to tell the client that it is safe to unlink files that are still open so the clumsy renaming-on-remove can be avoided.
Possibly the biggest user-visible change in NFSv4.1 is the addition of "pNFS" — parallel NFS. This appears to be a marketing term, as it is easy to say but only loosely captures the important changes. With NFSv4.1, it becomes possible to offload I/O requests (READ and WRITE) to some other protocol, which could well communicate over a different medium to a different server. This allows a single NFS client to communicate with a cluster filesystem without having to channel all the requests through a single IP address. This is certainly an extra level of parallelism, but it is not fair to say that earlier NFS did not allow any parallelism. Even NFSv2 could have multiple outstanding requests that the server could be handling concurrently.
These offload protocols, and how they integrate, are described in separate RFCs. There is support for a block-access protocol (RFC 5663) or the OSD object storage protocol (RFC 5664), which might run over iSCSI, for example, or a "flexible file" based approach using NFSv3 or later (RFC 8435) for the data access. This is only of particular interest here because there is a new sort of state that needs to be managed — there are objects called "layouts".
A layout describes how to access part of a file using some other protocol. Each layout has a stateid which can be allocated and then relinquished, so the server always knows which layouts might still be in use. This is important if the server needs, for example, to migrate a file to a different storage location — it probably shouldn't do that while a client thinks it knows what block location it can use to access that file.
Sessions and a reliable DRC
From our perspective of managing state, the biggest change in NFSv4.1 is that the protocol finally allows for a completely reliable duplicate request cache. As described in part 1, this is needed for the rare case when a request or reply might have been lost and the client has to resend a request. In versions up to and including NFSv4.0, the server would just make a best-effort guess at which requests and replies might be worth remembering of a while. In NFSv4.1, it can know.
An NFSv4.1 session is a new element of state that is forked off from the global clientid by the CREATE_SESSION request. Given a clientid and a sequence number, a new sessionid is allocated that has a collection of different attributes associated with it, including a maximum number of concurrent requests. The server will allocate this many "slots" in its duplicate request cache, and the client will assign a slot number less than this number to each request. The client promises never to reuse a slot number until it has seen a reply to the previous request with that number, and the server promises to remember the replies to the most recent request in each slot, if the client asked it to.
The client can even ask the server to store the cached replies in stable storage so that they survive a reboot. If the server implements this functionality and agrees to provide it, then the result is as close to perfect exactly-once semantics as it is possible to get.
There is, superficially, an imbalance here. The requests that are most common (READ, WRITE, GETATTR, ACCESS, LOOKUP) are idempotent and do not need to be cached, yet the server must reserve cache space for each slot, thus either wasting cache space or unnecessarily limiting concurrency. This not a problem in practice, since the protocol allows the client to create multiple sessions with different parameters. It could create one with a large slot count and a maximum cached size of zero, and use this for all idempotent requests. It could also create a session with a more modest slot count and much larger maximum cached size, and use this for requests that mustn't be repeated.
Directory delegations
A third new sort of state in NFSv4.1 — accompanying layouts and sessions — is directory delegations. In NFSv4.0, it is possible to open files, but all interactions with directories remained much as they were in NFSv3, where each operation was discrete and there was no ongoing state. In v4.1, we get something a bit like an OPEN of a directory with GET_DIR_DELEGATION. This request doesn't contain an explicit clientid; instead, it uses the clientid associated with the session that the request is part of. The delegation is essentially a standing request that the client be informed of any changes made to the directory by other clients. Depending on what specifics are negotiated, this might involve the server saying "something has changed, you no longer have the delegation", or it may provide more fine-grained details of what, specifically, has changed. This allows for strong file-name cache coherence and even allows client-side applications to receive notifications of changes. This functionality is not implemented by Linux, either for the server or the client.
NFSv4.2 - meeting customer needs
The latest version of NFS is v4.2, described in RFC 7862 (Nov 2016).
That document describes the goals of this revision being "to take common
local file system features that have not been available
through earlier versions of NFS and to offer them remotely.
"
In contrast to NFSv4.1, which primarily provided existing functionality in a more efficient or reliable manner, v4.2 provides genuinely new functionality — at least new to NFS. These include support for the SEEK_DATA and SEEK_HOLE functionality of lseek(), which allows sparse files to be managed efficiently, support of posix_fallocate() to explicitly allocate and deallocate space in a file, support for posix_fadvise(), so the client can tell the server what sort of I/O patterns to optimize for, and support for reflinks, which allow content to be shared between files without copying. None of these add any new state to the protocol, so they aren't directly in our area of focus for this discussion.
One new element of functionality that does involve a new form of state is server-side copy. This functionality can support copy_file_range() and can copy between two files on one server or — if the servers cooperate — between two files on different servers. Closely related functionality (using the WRITE_SAME operation) can initialize a file using a given pattern repeated as necessary.
When the client sends a COPY request, the server has the option of either performing the full copy before replying, or scheduling the operation asynchronously and returning immediately. In the latter case, a stateid is returned to the client which represents the ongoing action on the server. The client can use this stateid to query status (for the all-important progress bar) or to cancel the copy. The server can use the stateid to notify the client of completion or failure.
The future for NFS
As yet, there are no hints of an NFSv4.3 in the foreseeable future, and it could be that no such version will ever be described. The v4.2 specification differs from its predecessors in that it is not a complete specification, but instead references the v4.1 specification and adds some extensions. The model for future extension, which is itself extended in RFC 8178 (July 2017), allows for further incremental changes to be added without requiring that the minor version number be changed. This has already been put to good use with RFC 8275 which allows the POSIX "umask" to be sent to the server during file creation, and RFC 8276 which adds support for extended attributes.
There are, of course, still ongoing development efforts around NFS. One of the more interesting areas involves describing how the NFS protocol can be usefully transported on something other than TCP or RDMA, which are the main two protocols in use today.
NFS draft-cel-nfsv4-rpc-tls-pseudoflavors-02 looks at using ONC-RPC (the underlying RPC layer used by NFS) over TLS and, particularly, explores how the authentication provided by TLS can interact with the authentication requirements of NFS. Then, draft-cel-nfsv4-rpc-over-quicv1-00 builds on this to explore how NFS can be used over the QUIC protocol; The "cel" in the names of the drafts refer to Chuck Lever, the current maintainer of the Linux NFS server.
Other drafts and all the RFCs can be found on the web page for the IETF NFSv4 working group.
While NFS, much like Linux, does not seem to be finished yet, it does
appear to have come to terms with being a stateful protocol, with all
the state fitting into one coherent model. This is highlighted by the
way that a totally new form of state — asynchronous copying on the
server — was fit into the model in NFSv4.2 with no fuss. So it is
likely the future improvements will focus elsewhere, perhaps following
the recent moves toward improved security and support for new
filesystem functionality. Who knows, maybe one day it will even make
peace with Btrfs.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/NFS |
| GuestArticles | Brown, Neil |
