NFS: the early years
I recently had cause to reflect on the changes to the NFS (Network File System) protocol over the years and found that it was a story worth telling. It would be easy for such a story to become swamped by the details, as there are many of those, but one idea does stand out from the rest. The earliest version of NFS has been described as a "stateless" protocol, a term I still hear used occasionally. Much of the story of NFS follows the growth in the acknowledgment of, and support for, state. This article looks at the evolution of NFS (and its handling of state) during the early part of its life; a second installment will bring the story up to the present.
By "state" I mean any information that is remembered by both the client and the server, and that can change on one side, thus necessitating a change on the other. As we will see, there are many elements of state. One simple example is file content when it is cached on the client, either to eliminate read requests or to combine write requests. The client needs to know when cached data must be flushed or purged so that the client and server remain largely synchronized. Another obvious form of state is file locks, for which the server and client must always agree on what locks the client holds at any time. Each side must be able to discover when the other has crashed so that locks can be discarded or recovered.
NFSv2 — the first version
Presumably there was a "version 1" of NFS developed inside Sun Microsystems, but the first to be publicly available was version 2, which appeared in 1984. The protocol is described in RFC 1094, though this is not seen as an authoritative document; rather, the implementation from Sun defined the protocol. There were other network filesystems being developed around the same time, such as AFS (the Andrew File System), and RFS (Remote File Sharing). One distinctive difference that NFS had, when compared to these, is that it was simple. One might argue that it was too simple, as it could not correctly implement some POSIX semantics. However, this simplicity meant that it could provide good performance for a lot of common workloads.
The early 1980s was the time of the "3M Computer" which suggested a goal for personal workstations of one megabyte of memory, one MIPS of processing power, and one megapixel (monochrome) of display. This seems almost comically underpowered by today's standard, particularly when one considers that a price tag of a mega-penny ($10,000) was thought to be acceptable. But this was the sort of hardware on which NFSv2 had to run — and had to run well — in order to be accepted. History suggests that it was adequate to the task.
Consequences of being "stateless"
The NFSv2 protocol has no explicit support for any state management. There is no concept of "opening" a file, no support for locking, nor any mention of caching in the RFC. There are only simple, self-contained access requests, all of which involve file handles.
The "file handle" is a central unifying feature of NFSv2: it is an opaque, 32-byte identifier for a file that is stable and unique within a given NFS server across all time. NFSv2 allows the client to look up the file handle for a given name in a given directory (identified by some other file handle), to inspect and change attributes (ownership, size, timestamps, etc.) given a file handle, and to read and write blocks of data at a given offset of a given file handle.
As far as possible, the operations chosen for NFSv2 are idempotent, so that, if any request were repeated, it would have the same result on the second or third attempt as it had on the first. This is necessary for true stateless operation over an imperfect network. NFS was originally implemented over UDP, which does not guarantee delivery, so the client had to be prepared to resend a request if it didn't get a reply. The client cannot know if it was the request or the reply that was lost, and a truly stateless server cannot remember if any given request has been seen already so that it can suppress a repeat. Consequently, when the client resends a request, it might repeat an operation that has already been performed, so idempotent operations are best.
Unfortunately, not all filesystem operations under POSIX can be idempotent. A good example is MKDIR, which should make a directory if the given name is not in use, or return an error if the name is already used, even if it is used for a directory. This means that repeating the request can result in an incorrect error result. Standard practice for minimizing this problem is to implement a Duplicate Request Cache (DRC) on the server. This is a record of recent, non-idempotent requests that have been handled, along with the result that was returned. Effectively, this means that both the client (which must naturally track requests that have not yet received a reply) and the server maintain a list of outstanding requests that changes over time. These lists match our definition of "state", so the original NFSv2 was not actually stateless in practice, even if it was according to the specification.
As the server cannot know when the client sees a reply, it cannot know when a request is no longer outstanding, so it must use some heuristics to discard old cache entries. It will inevitably remember many requests that it doesn't need to, and may discard some that will soon be needed. While this is clearly not ideal, experience suggests that it is reasonably effective for normal workloads.
Maintaining this cache requires that the server knows which client each request came from, so it needs some reliable way to identify clients. This is a need that we will see repeated as state management becomes more explicit with the development of the protocol. For the DRC, the client identifier used is derived from the client's IP address and port number. When TCP support was added to NFS, the protocol type needed to be included together with the host address and port number. As TCP provides reliable delivery, it might seem that the DRC is not needed, but this isn't entirely true. It is possible for a TCP connection to "break" if a network problem causes the client and server to be unable to communicate for an extended period of time. NFS is prepared to wait indefinitely, but TCP is not. If TCP does break the connection, the client cannot know the status of any outstanding requests, so it must retransmit them on a new connection, and the server might still see duplicate requests. To make sure this works, NFS clients are careful to reconnect using the same source port as the earlier connection.
A duplicate request cache is not perfect, partly because the heuristic may discard entries before the client has actually received the reply, and partly because it is not preserved across server reboots, so a request might be acted upon both immediately before and after a server crash. In many cases, this is an occasional inconvenience but not a huge problem; will anyone really suffer if "mkdir" occasionally returns EEXIST when it shouldn't? But there is one situation that turned out to be quite problematic and isn't handled by the DRC at all. That is exclusive create.
Before Unix had any concept of file locks (as it didn't in Edition 7 Unix, which became the base for BSD), it was common to use lock files. If exclusive access was required to some file, such as /usr/spool/mail/neilb, the convention was that the application must first create a lock file with a related name, such as /usr/spool/mail/neilb.lock. This must be an "exclusive" creation using the flags O_CREAT|O_EXCL, which would fail if the file already existed. An application that found that it couldn't create the file because some other application had done so already would wait and try again.
Exclusive create is not an idempotent operation — by design — and NFSv2 has no support for it at all. Clients could perform a lookup and, if that reported no existing file, they could then create the file. This two-step sequence is clearly susceptible to races, so it is not reliable. This failing of NFS does not appear to have decreased its popularity, but certainly resulted in a lot of cursing over the years. It also resulted in some innovation; there are other ways to create lock files.
One way is to generate a string that will be unique across all clients — possibly with host name, process ID, and timestamp — and then create a temporary file with this string as both name and content. This file is then (hard) linked to the name for the lock file. If the hard-link succeeds, the lock has been claimed. If it fails because the name already exists, then the application can read the content of that file. If it matches the generated unique string, then the error was due to a retransmit and again the lock has been claimed. Otherwise the application needs to sleep and try again.
Another unfortunate consequence of avoiding state management involves files that are unlinked while they are still open. POSIX is perfectly happy with these unlinked-but-open files and assures that the file will continue to behave normally until it is finally closed, at which point it will cease to exist. An NFS server, since it does not know which files are open on which client, finds it difficult to be so accommodating, so NFS client implementations don't rely on help from the server. Instead, when handling a request to unlink (remove) a file that is open, the client will instead rename the file to something obscure and unique, like .nfs-xyzzy, and will then unlink this name when the file is finally closed. This relieves the server from needing to track the state of the client, but is an occasional inconvenience to the client. If an application opens the only file in some directory, unlinks the file, then tries to remove the directory, that last step will fail as the directory is not empty but contains an obscure .nfs-XX name — unless the client moves the obscure name into a parent or converts the RMDIR into another rename operation. In practice this sequence of operations is so rare that NFS clients don't bother to make it work.
The NFS ecosystem
When I said above that NFSv2 didn't support file locking, that is only half the story — it is accurate but not complete. NFS was, in fact, part of a suite of protocols that could be used together to provide a more complete service. NFS didn't support locks, but there was another protocol that did. The protocols that could be used with NFS include:
-
NLM (the Network Lock Manager). This allows the client to request a byte-range lock on a given file (identified using an NFS file handle), and allows the server to grant it (or not), either immediately or later. Naturally this is an explicitly stateful protocol, as both the client and server must maintain the same list of locks for each client.
-
STATMON (the Status Monitor). When a node — whether client or server — crashes or otherwise reboots, any transient state, such as file locks, is lost, so its peer needs to respond. A server will purge the locks held by that client, while a client will try to reclaim the locks that were lost. The chosen method with NLM is to have each node record a list of peers in stable storage, and to notify them all when it reboots; they can then clean up. This task of recording and then notifying peers was the job of STATMON. Of course, if a client crashed while holding a lock and never rebooted, the server would never know that the lock was no longer held. This could, at times, be inconvenient.
-
MOUNT. When mounting an NFSv2 filesystem, you need to know the file handle for the root of the filesystem, and NFS has no way to provide this. The task is handled instead by the MOUNT protocol. This protocol expects the server to keep track of which clients have mounted which filesystems, so this useful information can be reported. However, as MOUNT doesn't interact with STATMON, clients can reboot and effectively unmount filesystems without telling the server. While implementations do still record the list of active mounts, nobody trusts them.
In later versions, MOUNT also handled security negotiations. A server might require some sort of cryptographic security (such as Kerberos) for accessing some filesystems, and this requirement is communicated to the client using the MOUNT protocol.
-
RQUOTA (remote quotas). NFS can report various attributes of files and of filesystems, but one attribute that is not supported is quotas — possibly because these are attributes of users, not of files. To fill this gap for people who need it to be filled, there exists the RQUOTA protocol.
-
NFSACL (POSIX draft ACLs). Much as we have RQUOTA for quotas, we have NFSACL for access control lists. This allows both examining the ACLs and (unlike RQUOTA) setting them.
Beyond these, there are other protocols that are only loosely connected, such as "Yellow Pages", also known as the Network Information Server (NIS), which helped a collection of machines have consistent username-to-UID mappings; "rpc.ugid", which tried to help out when they didn't; and maybe even NTP which ensures that an NFS client and server had the same idea of the current time. These aren't really part of NFS in any meaningful sense, but are part of the ecosystem that allowed NFS to flourish.
NFSv3 — bigger is better.
NFSv3 came along about ten years later (1995). By this time, workstations were faster (and more colorful) and disk drives were bigger. 32 Bits were no longer enough to represent the number of bytes in a file, blocks in a filesystem, or inodes in a filesystem, and 32 bytes were no longer enough for a file handle, so these sizes were all doubled. NFSv3 also gained the READDIRPLUS operation to receive the names in a directory together with file attributes, so that ls -l could be implemented more efficiently. Note that deciding when to use READDIRPLUS and when to use the simpler READDIR is far from trivial. The Linux NFS client is still, in 2022, receiving improvements to the heuristics.
There were two particular areas of change that relate to state management, one which addressed the exclusive-create problem discussed above, and one which helped with maintaining a cache of data on the client. The first of these extended the CREATE operation.
In NFSv3, a CREATE request can indicate whether the request is UNCHECKED, GUARDED, or EXCLUSIVE. The first of these allows the operation to succeed whether the file already exists or not. The second must fail if the file exists, but it is like MKDIR in that a retransmission may result in an error where there shouldn't be one, so it is not particularly helpful. EXCLUSIVE is more interesting.
The EXCLUSIVE create request is accompanied by eight bytes of unique client identification (our recurring theme) called a "verifier". The RFC (RFC 1813) suggests that "perhaps" this verifier could contain the client's IP address or some other unique data. The Linux NFS client uses four bytes of the jiffies internal timer and four bytes of the requesting process's process ID number. The server is required to store this verifier to stable storage atomically while creating the file. If the server is later asked to create a file which already exists, the stored client identifier must be compared with that in the request and, if they match, the server must report a successful exclusive creation on the assumption that this is a resend of an earlier request.
The Linux NFS server stores this verifier in the mtime and atime fields of the file it creates. The NFSv3 protocol acknowledges this possibility and requires that, once the client receives the reply indicating successful creation, it must issue a SETATTR request to set correct values for any file attributes that the server might have overloaded to store the verifier. This SETATTR step acknowledges to the server the completion of some non-idempotent request — exactly what we thought might have been helpful for the DRC implementation.
Client-side caching and close-to-open cache consistency
The NFSv2 RFC did not describe client-side caching, but that doesn't mean that implementations didn't do any. They had to be careful though. It is only safe to cache data if there is good reason to expect that the data hasn't changed on the server. NFS practice provides two ways for the client to convince itself that cached data is safe to use.
The NFS server can report various attributes of a file, particularly size and last-change time. If neither of these change from previously seen values, it might be reasonable to assume that the file content hasn't changed. NFSv2 allows the change timestamp to be reported to the closest microsecond, but that doesn't guarantee that the server maintains that level of precision. Even twenty years after NFSv2 was first used, there were important Linux filesystems that could only report one-second granularity for time stamps. So, if an NFS client sees a timestamp that is at least one second in the past, and then reads data, it is safe to cache that data until it sees the timestamp change. If it sees a timestamp that is within one second of "now", then it is much less safe to make assumptions.
NFSv3 introduced an FSINFO request that allowed the server to report various limits and preferences, and included a "time_delta", which is the time granularity that can be assumed for change time and other timestamps. This allows client-side caching to be a little more precise.
As noted above, it is considered safe to use cached data for a file until its attributes are seen to change. The client could choose never to look at the file attributes again and, thus, never see a change, but that is not permitted. The way to affirm data safety consists of two rules about when the client must check the attributes.
The first rule is simple: check occasionally. The protocol doesn't specify minimum or maximum timeouts but most implementations allow these to be configured. Linux defaults to a three-second timeout which is extended exponentially as long as nothing appears to be changing, to a maximum of one minute. This means that the client might provide data from cache that is up to 60 seconds old, but no longer. The second rule builds on an assumption that multiple applications never open the same file at the same time, unless they use locking or they are all read-only.
When a client opens a file, it must verify any cached data (by checking timestamps) and discard any that it cannot be confident of. As long as the file remains open, the client can assume that no changes will happen on the server that it doesn't request itself. When it closes the file, the client must flush all changes to the server before the close completes. If each client does this, then any application that opens a file will see all changes made by any other application on any client that closed the file before this open happened, so this model is sometimes referred to as "close-to-open consistency".
When byte-range locking is used, the same basic model applies, but the open operation becomes the moment when the client is granted a lock, and the close is when it releases the lock. After being granted a lock, the client must revalidate or purge any cached data in the range of the lock and, before releasing a lock, it must flush cached changes in this region to the server.
As the above relies on the change time to validate the cache, and as the change time updates whenever any client writes to the file, the logical implication is that, when a client writes to a file, it must purge its own cache since the timestamp has changed. It is quite justified to maintain the cache until the file is closed (or the region is unlocked), but not beyond. This need is particularly visible when byte-range locking is used. One client might lock one region, write to it, and unlock. Another client might lock, write, and unlock a different region, with the write requests happening at exactly the same time. There is no way that either client can tell if another client wrote to the file or not, as the timestamp covers the whole file, not just one range. So they must both purge their whole cache before the next time the file is opened or locked.
At least, there was no way to tell before NFSv3 introduced weak cache consistencies (wcc) attributes. The reply to an NFSv3 WRITE request allows the server to report some attributes — size and time stamps — both before and after the write request, and requires that, if it does report them, then no other write happened between the two sets of attributes. A client can use this information to detect when a change in timestamps was due purely to its own writes, and when they were due to some other client. It can, thus, determine whether it is the only client writing to a file (a fairly common situation) and, when so, preserve its cache even though the timestamp is changing. Wcc attributes are also available in replies to SETATTR and to requests that modify a directory, such as CREATE or REMOVE, so a client can also tell if it is the sole actor in a directory, and manage its cache accordingly.
This is "weak" cache consistency, as it still requires the client to check the timestamps occasionally. Strong cache consistency requires the server to explicitly tell the client that change is imminent, and we don't get that until a later version of the protocol. Despite being weak, it is still a clear step forward in allowing the client to maintain knowledge about the state of the server, and so another nail in the coffin of the fiction of a stateless protocol.
As an aside, the Linux NFS server doesn't provide these wcc attributes for writes to files. To do this, it would need to hold the file lock while collecting attributes and performing the write. Since Linux 2.3.7, the underlying filesystem has been responsible for taking the lock during a write, so nfsd cannot provide the attributes atomically. Linux NFS does provide wcc attributes for changes to directories, though.
NFS — the next generation
These early versions of NFS were developed within Sun Microsystems. The code was made available for other Unix vendors to include in their offerings and, while these vendors were able to tweak the implementation as needed, they were not able to change the protocol; that was controlled by Sun.
As the new millennium approached, interest in NFS increased and independent implementations appeared. This resulted in a wider range of developers with opinions — well-informed opinions — on how NFS could be improved. To satisfy these developers without risking dangerous fragmentation, a process was needed for those opinions to be heard and answered. The nature of this process and the changes that appeared in subsequent versions of the NFS protocol will be the subject of a forthcoming conclusion to this story.
| Index entries for this article | |
|---|---|
| Kernel | Filesystems/NFS |
| GuestArticles | Brown, Neil |
