LWN: Comments on "NFS: the early years" https://lwn.net/Articles/897917/ This is a special feed containing comments posted to the individual LWN article titled "NFS: the early years". en-us Tue, 28 Oct 2025 10:19:12 +0000 Tue, 28 Oct 2025 10:19:12 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net
Remote filesystems: a lost cause? https://lwn.net/Articles/902219/ https://lwn.net/Articles/902219/ ajmacleod <div class="FormattedComment"> That&#x27;s very interesting, thank you for the reply.<br> </div> Fri, 22 Jul 2022 15:48:00 +0000
Remote filesystems: a lost cause? https://lwn.net/Articles/901746/ https://lwn.net/Articles/901746/ donald.buczek <div class="FormattedComment"> We use self-developed tools to broadcast /etc files like passwd, group, exports, or autofs maps to all systems [1]. For shadow, we replaced NIS with a self-developed NSS service, which queries a central server via TLS [2]. <br> <p> [1]: <a href="https://github.molgen.mpg.de/mariux64/mxtools/blob/master/clusterd/clusterd">https://github.molgen.mpg.de/mariux64/mxtools/blob/master...</a><br> <p> [2]: <a href="https://github.molgen.mpg.de/mariux64/mxshadow">https://github.molgen.mpg.de/mariux64/mxshadow</a><br> <p> </div> Mon, 18 Jul 2022 20:15:33 +0000
Remote filesystems: a lost cause? https://lwn.net/Articles/901727/ https://lwn.net/Articles/901727/ ajmacleod <div class="FormattedComment"> Out of interest, what did you replace NIS with? I likewise have found NFS indispensable, but NIS has hung around a lot longer than it should have in some use cases, because for them it just works and is very simple (yes, too simple).<br> </div> Mon, 18 Jul 2022 16:08:25 +0000
Remote filesystems: a lost cause? https://lwn.net/Articles/900603/ https://lwn.net/Articles/900603/ marcH <div class="FormattedComment"> Thanks! So mostly reads, no read/write concurrency, and rarely in the critical path? Then sure, as long as you have the tools to monitor and catch &quot;deviant&quot; users, why not. Not exactly plug-and-play / &quot;consumer&quot;-friendly though, and quite far from the illusion of a local disk.<br> <p> <p> </div> Sun, 10 Jul 2022 07:40:24 +0000
Remote filesystems: a lost cause? https://lwn.net/Articles/900600/ https://lwn.net/Articles/900600/ donald.buczek <div class="FormattedComment"> NFS doesn&#x27;t suck.<br> <p> We have ~400 Linux systems, ranging from workstations to really big compute servers, using a global namespace supported by autofs and NFS for /home and all the places where our scientific data goes (/project). We even have software packages installed and used over NFS (/pkg) because, to make our data processing reproducible, we keep every version of every library or framework, and that would be too much to keep on every system. Only the base Linux system, copies of the current/default versions of some heavily used packages (e.g. python), and some scratch space are on local disks.<br> <p> We have ~500 active users and they usually don&#x27;t complain about the performance or responsiveness of their desktops.<br> <p> We have been doing this for decades and it used to be a pain, but with the progress of NFS and the steps we&#x27;ve taken here (e.g. replacing NIS), everything runs really smoothly these days! If it doesn&#x27;t, some user has acted against the rules (e.g. hammering a fileserver from many distributed cluster jobs, or trying to up- or download a few terabytes without a speed limit) and we have a talk.<br> <p> Oh, and the same namespace (/home, /project) is accessed over CIFS by our Windows and macOS workstations. Plus the files are accessed locally on the fileservers, too. For example, we would typically put daemons and cronjobs which don&#x27;t need much RAM or CPU on the fileserver where their projects are local, to reduce network dependency. Also, users are allowed to log in to the HPC node where their job is executing, to monitor or debug its behavior.<br> <p> And (related to other discussions) we couldn&#x27;t live without symlinks. And all these systems are multi-user; a security boundary between users is a basic requirement. <br> <p> NFS, symlinks and multi-user are not dead. &quot;Nowadays things are done like this and that&quot; might be true in a statistical sense, but it should not be generalized.<br> <p> <p> <p> </div> Sun, 10 Jul 2022 06:04:34 +0000
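A minimal sketch of the kind of autofs-backed global namespace donald.buczek describes above; the map name, keys, servers and options here are invented for illustration, not taken from his site:

```
# /etc/auto.master -- hand every /project lookup to an NFS map (sketch)
/project  /etc/auto.project  --timeout=300

# /etc/auto.project -- one key per project, each resolving to an NFS export
# (hypothetical server and export names)
genomes   -rw,hard,nosuid  fileserver1:/export/project/genomes
imaging   -rw,hard,nosuid  fileserver2:/export/project/imaging
```

With a map like this, /project/genomes is only mounted on first access and unmounted again after the timeout, which is part of what makes a single namespace practical across hundreds of machines.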
Remote filesystems: a lost cause? https://lwn.net/Articles/900583/ https://lwn.net/Articles/900583/ atnot <div class="FormattedComment"> <font class="QuotedText">&gt; Does _any_ network filesystem &quot;suck less&quot;? The entire concept seems like a lost cause.</font><br> <p> I think it gets worse, because all filesystems are secretly network filesystems. It doesn&#x27;t really matter that much whether the two computers speak to each other over Ethernet, SCSI or PCIe; you just notice it less with lower latency. Basing persistent state storage on what amounts to a shared, globally read-writable single address space without any transactions or, really, any well-defined concurrency semantics is, I think, a fundamental dead end. See also the symlink discussion. So much effort is thrown into pretending files work like they did on a PDP-11 at a very deep level, and I think it&#x27;s really something that needs to be moved beyond. Git is actually a pretty good example I had never thought of there, in the way it sort of emulates a traditional filesystem structure on top of a content-addressed key-value blob store.<br> </div> Sat, 09 Jul 2022 22:57:31 +0000
Remote filesystems: a lost cause? https://lwn.net/Articles/900580/ https://lwn.net/Articles/900580/ marcH <div class="FormattedComment"> Indeed, the <a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing">https://en.wikipedia.org/wiki/Fallacies_of_distributed_co...</a> come to mind.<br> <p> Does _any_ network filesystem &quot;suck less&quot;? The entire concept seems like a lost cause. Sure, a samba share is marginally more convenient than rsync or wget stuff.tgz for some occasional recursive _download_, but that&#x27;s still just glorified FTP. Is there any actual and successful use case where multiple clients are actively working on the same, shared tree at the same time?<br> <p> Just for fun, try timing &quot;git status&quot; over a network filesystem and compare with a local filesystem. The thing that relies a lot on &quot;state&quot; is caching, and caching is what has made computers faster and faster. Is high-performance caching compatible with sharing access over the network? It does not look like it.<br> <p> I&#x27;m more and more convinced that the future belongs to application-specific protocols. The poster child is of course git, which is capable of synchronizing gigabytes while transferring kilobytes. Of course there is a price to pay: the synchronization requires explicit user action, and remote resources do not magically appear as if they were local. But that&#x27;s just acknowledging the network fallacies and reality.<br> <p> PS: NUMA and CXL seem similarly... &quot;ambitious&quot;<br> </div> Sat, 09 Jul 2022 21:56:40 +0000
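marcH's "git status" experiment is easy to reproduce. A small sketch; the two repository paths are placeholders for a local checkout and an NFS-mounted copy of the same tree:

```python
# Compare `git status` latency on a local vs. an NFS-mounted checkout.
import subprocess
import time

def time_git_status(repo):
    start = time.monotonic()
    subprocess.run(["git", "-C", repo, "status"], check=True,
                   stdout=subprocess.DEVNULL)
    return time.monotonic() - start

# Hypothetical paths: same repository, one local, one over NFS.
for repo in ["/home/me/src/linux", "/nfs/projects/linux"]:
    print(repo, f"{time_git_status(repo):.2f}s")
```

The gap comes mostly from the thousands of stat() calls git makes; each one that misses the client's attribute cache becomes a network round trip.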
NFS: the early years https://lwn.net/Articles/899024/ https://lwn.net/Articles/899024/ neilbrown <div class="FormattedComment"> The article at that second link contains the gem:<br> <p> <font class="QuotedText">&gt; duplicate request processing can result in incorrect results (affectionately called ‘filesystem corruption’ by those not in a filesystem development group).</font><br> <p> That&#x27;s the sort of witticism that would be right at home here on lwn.net.<br> It goes on to describe a problem which I&#x27;ve heard described before as &quot;Nulls Frequently Substituted&quot;.<br> <p> </div> Sat, 25 Jun 2022 11:17:59 +0000
NFS: the early years https://lwn.net/Articles/898926/ https://lwn.net/Articles/898926/ bfields <div class="FormattedComment"> A lot of the original papers are readable and interesting, by the way. Two off the top of my head:<br> <p> <a href="http://www.cs.siue.edu/~icrk/514_resources/papers/sandberg85-NFS.pdf">http://www.cs.siue.edu/~icrk/514_resources/papers/sandber...</a> also introduces the VFS: &quot;In order to build the NFS into the UNIX 4.2 kernel in a user transparent way, we decided to add a new interface to the kernel which separates generic filesystem operations from specific filesystem implementations.&quot;<br> <p> <a href="https://archive.org/details/1989-conference-proceedings-winter-usenixtech-sd-full/page/n59/mode/2up">https://archive.org/details/1989-conference-proceedings-w...</a> starting p. 53 (there has to be a better link) introduces the DRC; it&#x27;s interesting that the correctness improvements are described as a &quot;beneficial side effect&quot;, with the main purpose being increased bandwidth.<br> </div> Fri, 24 Jun 2022 13:26:51 +0000
NFS: dangerous fragmentation https://lwn.net/Articles/898842/ https://lwn.net/Articles/898842/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; Neil appears to be talking about fragmentation of the protocol and the developer community, whereas you seem to be talking about something technical. </font><br> <p> Indeed. The only technical issue that I know of which involves NFS and fragmentation relates to UDP packets being fragmented into IP packets, and not necessarily being reassembled properly.<br> See &quot;man 5 nfs&quot; and the section titled &quot;Using NFS over UDP on high-speed links&quot;.<br> You can find that man page at <a href="http://man.he.net/man5/nfs">http://man.he.net/man5/nfs</a> if you don&#x27;t want to leave your browser just now.<br> Or maybe <a href="https://www.man7.org/linux/man-pages/man5/nfs.5.html">https://www.man7.org/linux/man-pages/man5/nfs.5.html</a>, but that site isn&#x27;t responding for me just now.<br> </div> Thu, 23 Jun 2022 21:57:37 +0000
NFS: dangerous fragmentation https://lwn.net/Articles/898787/ https://lwn.net/Articles/898787/ giraffedata <blockquote> My colleague Jay only recently told me about the dangerous fragmentation issue alluded to at the end </blockquote> Neil appears to be talking about fragmentation of the protocol and the developer community, whereas you seem to be talking about something technical. Thu, 23 Jun 2022 17:39:22 +0000
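The failure mode that man-page section warns about is easy to quantify. A back-of-envelope sketch in the spirit of nfs(5); the link speed, write size and reassembly window are illustrative assumptions, not measurements:

```python
# Why NFS over UDP gets dangerous on fast links: each UDP datagram gets one
# 16-bit IP ID, shared by all of its fragments. If the ID counter wraps
# while old fragments are still in the reassembly queue, fragments from two
# different writes can be spliced together.
LINK_BPS = 1_000_000_000      # gigabit Ethernet (assumed)
WSIZE = 8192                  # classic NFS transfer size, bytes (assumed)
IP_UDP_OVERHEAD = 28          # rough per-datagram header cost
REASSEMBLY_WINDOW = 30.0      # Linux ipfrag_time default, seconds

datagrams_per_sec = LINK_BPS / 8 / (WSIZE + IP_UDP_OVERHEAD)
wrap_seconds = 65536 / datagrams_per_sec   # 16-bit IP ID space

print(f"{datagrams_per_sec:,.0f} datagrams/s -> ID wraps every "
      f"{wrap_seconds:.1f}s (window: {REASSEMBLY_WINDOW}s)")
# ~15,000 datagrams/s -> the ID wraps in ~4.3 s, well inside the 30 s window.
```

Once mis-matched fragments are joined, only the 16-bit UDP checksum stands between you and silent corruption, and it will miss roughly one bad splice in 65536; hence the standing advice to use TCP on fast links.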
NFS: the early years https://lwn.net/Articles/898679/ https://lwn.net/Articles/898679/ droundy <div class="FormattedComment"> <font class="QuotedText">&gt; If an application opens the only file in some directory, unlinks the file, then tries to remove the directory, that last step will fail as the directory is not empty but contains an obscure .nfs-XX name — unless the client moves the obscure name into a parent or converts the RMDIR into another rename operation. In practice this sequence of operations is so rare that NFS clients don&#x27;t bother to make it work.</font><br> <p> Wow, that is an amazing level of sarcasm! If there is one thing that is memorable about NFS, it&#x27;s the nuisance of a perpetually failing `rm -rf` and then trying to track down the darn process holding a file open.<br> </div> Thu, 23 Jun 2022 03:04:02 +0000
NFS: the early years https://lwn.net/Articles/898658/ https://lwn.net/Articles/898658/ dublin <div class="FormattedComment"> Yep, that&#x27;s it. It *was* a bit unfair/greedy, but in a way that actually made throughput better for everyone, since NFS was a fair portion of the traffic (with X being a large part of the rest; it didn&#x27;t impact that enough to be a real concern. <br> <p> Also, we never had all that many SGIs serving NFS, so I have no idea if this might have fallen over at a larger scale...)<br> <p> BTW, this technique eliminated NFS-related collisions and backoffs, but that doesn&#x27;t impact things nearly as much as people think - CSMA/CD is really pretty efficient: If you assume that *every other packet* (50%, crazy high) collided, you *still* got 97% of Ethernet&#x27;s throughput, even with that many backoffs. Do the math if you don&#x27;t believe me... I think it was mostly fast because it kept the NFS pipeline flowing on both client and server...<br> </div> Wed, 22 Jun 2022 21:48:44 +0000
NFS: the early years https://lwn.net/Articles/898583/ https://lwn.net/Articles/898583/ donaldh <div class="FormattedComment"> Ah yes, my comp-sci lab was similarly hosed. Not helped at all by the diskless Sun SPARC SLCs that had their root fs and swap mounted via NFS.<br> </div> Wed, 22 Jun 2022 10:17:18 +0000
NFS: the early years https://lwn.net/Articles/898577/ https://lwn.net/Articles/898577/ ewen <div class="FormattedComment"> If you transmit frames back-to-back with no gap, then other nodes doing carrier sense would still be hearing your earlier transmission (previous frame) as you start the next one. So ordinary carrier sense plus delay to transmit would cause other nodes not to talk over you, as they’d never observe a quiet period before the (second or subsequent) frame started being sent.<br> <p> Obviously they could still talk over the first frame as normal, as they wouldn’t hear the start of the transmission for a while (hence minimum frame lengths, so the sender can learn overtalk happened).<br> <p> But the result should be either you lose the first frame due to overtalk (and don’t send the rest), or you send all frames before anyone else gets a word in, without interruption/overtalk. On an NFS-heavy, reasonably congested network, it’s probably a net win for everyone over constantly losing one in N frames out of a larger NFS request/reply (and thus endless retries tying up the shared medium). <br> <p> Seems like a very clever hack, if a bit “unfair”/“greedy” with a shared medium when one “fake jumbo frame” isn’t reasonably well separated from the next. <br> <p> Ewen<br> <p> <p> </div> Wed, 22 Jun 2022 08:24:42 +0000
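dublin's "do the math" invitation above is easy to take up. A sketch with textbook 10 Mb/s CSMA/CD numbers; the main simplification is assuming each collision costs one slot time:

```python
# Sanity-checking the claim that even a 50% collision rate leaves Ethernet
# throughput nearly intact. Assumes a full-size frame and that each
# collision burns one slot time before the successful retry.
SLOT = 51.2e-6                 # 10 Mb/s slot time: 512 bit times
FRAME = 1518 * 8 / 10e6        # full-size frame: ~1.21 ms on the wire

collisions_per_frame = 1.0     # "every other packet collided"
efficiency = FRAME / (FRAME + collisions_per_frame * SLOT)
print(f"{efficiency:.1%}")     # ~96%
```

That lands around 96%, close to dublin's 97%; the exact figure moves with frame size and with how much of a slot a collision really burns, but either way collisions on big frames cost far less than intuition suggests.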
NFS: the early years https://lwn.net/Articles/898580/ https://lwn.net/Articles/898580/ geert <div class="FormattedComment"> I ran the second stage of debootstrap on a dev board in JP using nfsroot, connected to my NFS server (in EU) using TAP/TUN over SSH. Took all night to complete, but it did work.<br> Subsequent boot-ups indeed took ca. 30 minutes, so it seems 8 ms vs. 300 ms doesn&#x27;t make much of a difference?<br> </div> Wed, 22 Jun 2022 08:23:50 +0000
NFS: the early years https://lwn.net/Articles/898579/ https://lwn.net/Articles/898579/ geert <div class="FormattedComment"> I guess this made the situation even worse with some of the early PC Ethernet cards, which didn&#x27;t have enough RAM to hold 8 KiB worth of packets in their receive buffer, and/or couldn&#x27;t handle back-to-back packets?<br> </div> Wed, 22 Jun 2022 08:18:20 +0000
NFS: the early years https://lwn.net/Articles/898562/ https://lwn.net/Articles/898562/ jwarnica <div class="FormattedComment"> This would only decrease collisions if everyone else was good at the &quot;carrier sense&quot; part of the paradigm to an unnecessary, out-of-spec degree. Which, of course, is statistically unlikely at scale.<br> <p> So I&#x27;m a tad confused... <br> </div> Tue, 21 Jun 2022 23:52:38 +0000
NFS: the early years https://lwn.net/Articles/898561/ https://lwn.net/Articles/898561/ jwarnica <div class="FormattedComment"> NFS is proof-of-failure #1 for the basic theory of RPC, that is, that you can put a C-level (that is, asm-level) subroutine call across the network without any further thought. <br> <p> This put &quot;the network is the computer&quot; behind schedule by a decade, and triggered people to try to fix it by doubling down on ignoring the network, in the form of CORBA, which contributed another decade of schedule slip. <br> <p> Except as examples of what not to do, good riddance.<br> <p> I&#x27;m sorry for the sysadmins who had to deal with this. I&#x27;m even more sorry for those who let it escape the Valley.<br> </div> Tue, 21 Jun 2022 23:44:06 +0000
NFS: the early years https://lwn.net/Articles/898540/ https://lwn.net/Articles/898540/ dublin <div class="FormattedComment"> Sun&#x27;s TOPS protocol (which ran on Suns, PCs and Macs) was also quite good, but was perceived as a bit expensive and never really got much traction outside 1990-ish Sun shops... IIRC, it came not only with really nice cross-platform file sharing, but even email that could be gatewayed to Internet mail - pretty serious stuff in the days of Novell everywhere. <br> <p> FWIW, Novell&#x27;s NCP is perhaps the best file-sharing protocol architecture I&#x27;ve ever encountered. Its protocol design was so latency-tolerant that I ran it well enough over 56K geosync INMARSAT satellite connections in 1993 to be quite usable - a feat impossible with NFS - and believe me, I tried! (The application was emergency spill response for the oil companies - the entire project brief was more or less two sentences: Within 15 minutes of hitting a spill site anywhere in the world, we must have voice, data, and filesharing connectivity back to Houston. Oh, and your budget is 1/10th of what the satellite equipment vendor&#x27;s experts want, everything must be checkable as luggage, and no skilled IT people will be there - so it just has to work. We nailed it.)<br> </div> Tue, 21 Jun 2022 20:18:48 +0000
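Some rough numbers for why stock NFS struggled on a link like dublin's while a latency-tolerant protocol survived. The link rate matches the comment; the transfer size and retransmit timer are assumed classic defaults, not details from his deployment:

```python
# A serialized 8 KB request/response cycle over a geosync satellite link,
# with the classic NFS/UDP initial retransmit timeout for comparison.
LINK_BPS = 56_000           # from the comment
RTT = 0.55                  # assumed geosync round trip, seconds
BLOCK = 8192                # assumed NFS transfer size, bytes
TIMEO = 0.7                 # assumed classic initial retransmit timeout

serialize = BLOCK * 8 / LINK_BPS        # ~1.17 s just to clock out one block
per_request = serialize + RTT           # ~1.72 s before the reply lands
print(f"effective: {BLOCK / per_request / 1024:.1f} KiB/s "
      f"(link max {LINK_BPS / 8 / 1024:.1f} KiB/s)")
print(f"reply takes {per_request / TIMEO:.1f}x the default timeout "
      f"-> spurious retransmits")
```

A protocol that keeps several requests in flight hides the round trip entirely; one that serializes block-sized request/response pairs, and retransmits on a timer shorter than the real round trip, eats its thin pipe alive.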
NFS: the early years https://lwn.net/Articles/898530/ https://lwn.net/Articles/898530/ dublin <div class="FormattedComment"> I was one of the architects of Chevron&#x27;s IP network in the early 90s. I was puzzled when one of our customer groups requested an SGI box for their NFS server, since I was pretty familiar with the workings of NFS and knew that Sun&#x27;s implementation *should* be superior (I&#x27;d compared them in the past). <br> <p> At first I thought this group was falling for a bunch of SGI sales BS, but they still had the demo box in place on our network, and sure enough, it flat left our best Sun and DEC Ultrix NFS servers in the dust - both of which were notably faster than the IBMs and HPs of the day. But how? <br> <p> This set a group of us protocol performance jocks on a quest to get to the bottom of why, on the same networks, SGI could deliver such dramatically better NFS performance. (Only FDDI was 100 Mbps then: All our Ethernet was 10 Mbps, and though we were playing with the first Kalpana EtherSwitch, that wasn&#x27;t in play here, and we were testing the servers and clients on the same segment.) One thing we noticed after a day of poring over network analyzer data was that performance was a bit burstier than normal, and throughput was great, but that the overall network utilization was actually a bit *less*. SGI was proud of their newfound performance, but it was pretty clear that their SEs had no clue *why* it got so much better all of a sudden. Curiouser and curiouser...<br> <p> We finally noticed that it wasn&#x27;t just bursty - all SGI responses were back-to-back: SGI was cheating - and it turns out, quite elegantly. They were managing to deliver complete NFS blocks all at once, years before jumbo frames were even a thing. (NFSv2 blocks were 8K, vs the 1500-byte MTU of all Ethernet at the time.) What made the SGI such an NFS screamer was that they brilliantly violated the Ethernet standard, with no real significant downside: They simply sent all six frames of an NFS block out one after another, with NO chance for anyone to interrupt - they had implemented stateful semantics in their Ethernet driver for NFS serving! When sending one of the up to six frames in an NFS block, the last byte of one frame was *immediately* followed by the preamble for the next one, with no silence or backoff as would normally be required by the spec. This eliminated the potential for Ethernet collisions between these frames, since all other nodes would always lose and have to back off! Since no other node ever had the chance to interrupt, the effect was that even prior to jumbo frames, the server was delivering an entire NFS block as a train, and the impact on the network from collisions and retransmissions was greatly reduced overall, not to mention that NFS performance to the clients was much faster.<br> <p> All in all, it was one of the cleverest network protocol performance hacks I&#x27;ve ever seen, and certainly worked extremely well in optimizing NFS performance over those old-style Ethernet networks. Only a few years later, we had 100 Mbps Ethernet, jumbo frames, and cut-through switching a la Kalpana was mainstream, so this killer hack was only known to a few, and vanished without a trace. (As far as I know, SGI never fessed up to this, I think because they feared the potential backlash of enterprise customers who couldn&#x27;t stomach the idea of a vendor violating the sacrosanct 802.3 standard, even if doing so was a win-win in this case...)<br> </div> Tue, 21 Jun 2022 20:00:03 +0000
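The wire-level arithmetic behind the trick dublin describes, as a sketch; the fragment sizes are assumed, and the point is the contention windows rather than the microseconds:

```python
# An 8 KiB NFSv2 reply leaves a 10 Mb/s NIC as ~6 UDP/IP fragments. Stock
# Ethernet inserts a 9.6 us interframe gap after every frame; each gap is a
# fresh chance for another station to win the wire, and losing any one
# fragment forces a retransmit of the whole RPC.
PREAMBLE = 8          # bytes before each frame
IFG = 9.6e-6          # 10 Mb/s interframe gap, seconds
MBPS = 10e6

frames = [1518] * 5 + [1006]   # assumed split of block + RPC/UDP/IP headers
wire = sum((PREAMBLE + f) * 8 / MBPS for f in frames)

standard = wire + IFG * len(frames)
back_to_back = wire            # SGI's hack: no gap, nobody else can defer in
print(f"standard: {standard*1e3:.2f} ms, back-to-back: {back_to_back*1e3:.2f} ms")
print(f"contention windows per block: {len(frames) - 1} -> 0")
```

The raw time saved is tiny; the win is that the five inter-fragment gaps, each of which another station could seize and thereby cost the sender an entire 8 KiB retransmission, simply disappear.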
NFS: the early years https://lwn.net/Articles/898520/ https://lwn.net/Articles/898520/ Sesse <div class="FormattedComment"> My workstation used to be diskless, with NFS root. Once, I took it to a site 8 ms away (on gigabit Ethernet all the way, effectively zero congestion). I remember it took more than half an hour to boot…<br> </div> Tue, 21 Jun 2022 16:29:05 +0000
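A back-of-envelope reading of Sesse's half-hour boot; the request count is inferred, not measured, and it assumes the boot is dominated by serial NFS round trips (LOOKUP, GETATTR, READ, ...):

```python
# If boot time is mostly waiting on serial NFS round trips, the request
# count falls straight out of the latency.
RTT = 0.008                       # seconds, from the comment
boot_seconds = 30 * 60
round_trips = boot_seconds / RTT
print(f"{round_trips:,.0f} serial round trips")           # ~225,000

LAN_RTT = 0.0002                  # assumed sub-ms LAN round trip
print(f"{round_trips * LAN_RTT:,.0f} s at LAN latency")   # ~45 s
```

geert's observation above that 8 ms and 300 ms links both produced ~30-minute boots suggests his bottleneck was not purely round-trip count, so treat this as an upper-bound style estimate rather than a model.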
NFS: the early years https://lwn.net/Articles/898505/ https://lwn.net/Articles/898505/ amcrae <div class="FormattedComment"> One of the key aspects of NFS that set it apart from a variety of other contenders for a distributed filesystem at the time was the consistent naming that allowed it to integrate cleanly as part of a normal filesystem hierarchy.<br> Other protocols required special names (one used &quot;...&quot; as a special path to access the remote filesystem).<br> NFS allowed you to mount a remote filesystem anywhere in the hierarchy and let it appear as part of the normal directory structure with no special naming or other special characteristics. This meant that nearly all programs would work with no changes.<br> NFS as a protocol wasn&#x27;t perfect, but it was a heck of a lot better at the time than most other alternatives.<br> I also remember PC-NFS, which was a package that allowed simple-minded IBM PCs running DOS to access NFS. A better option than IPX if you had a mostly Unix environment at the time.<br> I always think of NFS as a great example of something that wasn&#x27;t perfect, but it solved 95% of the most critical part of the problem at the time.<br> <p> </div> Tue, 21 Jun 2022 14:01:06 +0000
NFS: the early years https://lwn.net/Articles/898483/ https://lwn.net/Articles/898483/ ballombe <div class="FormattedComment"> In 1995, stateless meant that any client or server node could reboot without corrupting the others,<br> but also that &#x27;append&#x27; was emulated by truncate + rewrite, which was very inefficient.<br> In practice, the directory cache would fill the server memory about once a week at my site, so a cronjob was set up to reboot it every night. This was a good reminder to go to sleep, since otherwise you would<br> get &quot;NFS server not responding still trying&quot; for about 20 minutes... <br> <p> But still, NFS in 1995 was pretty neat. The fact that we still speak about this 1984 technology today speaks volumes.<br> </div> Tue, 21 Jun 2022 12:43:27 +0000
NFS: the early years https://lwn.net/Articles/898481/ https://lwn.net/Articles/898481/ nix <div class="FormattedComment"> Well, it&#x27;s not a stateless filesystem, is it. It&#x27;s a stateless *protocol* (implementing a filesystem which is of course stateful). This was useful back in the day when you had one fileserver and about ten thousand clients on massively collision-prone university thicknet networks that ran at the speed of a degraded dog[1]; the fileserver could never have managed to keep track of anything nontrivial about all those clients, so it was damn good it didn&#x27;t have to. These days that sort of environment is more or less history...<br> <p> [1] I remember quite late in this era, in 1995, finding that 97% of all packets on my university comp-sci lab&#x27;s 10 Mb/s Ethernet appeared to be retransmissions... that network was *unusable*.<br> </div> Tue, 21 Jun 2022 11:03:21 +0000
NFS: the early years https://lwn.net/Articles/898471/ https://lwn.net/Articles/898471/ pwfxq <div class="FormattedComment"> Talk of NFS and fragmentation brings back bad memories of mixed FDDI/Ethernet networks and having to manually set the MTU everywhere.<br> </div> Tue, 21 Jun 2022 05:54:34 +0000
NFS: the early years https://lwn.net/Articles/898470/ https://lwn.net/Articles/898470/ Cyberax <div class="FormattedComment"> A couple of years ago I implemented a simple NFSv4 client in Go ( <a href="https://github.com/Cyberax/go-nfs-client">https://github.com/Cyberax/go-nfs-client</a> ) for AWS EFS, and I was really surprised by how _easy_ it was to do. In fact, almost all of it was written in just one day, while I was anxiously waiting for the election results.<br> <p> NFS is a really great example of how to create simple and robust distributed systems.<br> <p> </div> Tue, 21 Jun 2022 04:59:32 +0000
NFS: the early years https://lwn.net/Articles/898466/ https://lwn.net/Articles/898466/ lathiat <div class="FormattedComment"> My colleague Jay only recently told me about the dangerous fragmentation issue alluded to at the end, which I found totally fascinating; I&#x27;d not heard of it despite using NFS since before 2007. I look forward to that part of the story.<br> </div> Tue, 21 Jun 2022 02:14:24 +0000
NFS: the early years https://lwn.net/Articles/898457/ https://lwn.net/Articles/898457/ willy <div class="FormattedComment"> In some ways, NFS is an example of worse-is-better. While UDP isn&#x27;t reliable, in practice NFS sucked too much over a WAN, and didn&#x27;t support federated UIDs anyway, so it was used on LANs, which meant a reliable Ethernet physical layer.<br> <p> The very notion of a stateless filesystem is ridiculous. Filesystems exist to store state. How tightly coupled the client &amp; server are, and how much the client and server trust each other, are legitimate areas for discussion.<br> <p> For those who don&#x27;t know, I once wrote an NFS 2/3 server in ARM assembler. It was the nineties ...<br> </div> Mon, 20 Jun 2022 23:27:07 +0000
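Cyberax's point about how approachable the protocol family is shows up at the lowest layer too: every NFS version answers a zero-argument NULL procedure over Sun RPC. A minimal sketch in Python, not the go-nfs-client API; AUTH_NONE only, short-read handling elided, and the hostname is a placeholder:

```python
# Send a Sun RPC NULL call (procedure 0) to an NFS server over TCP and
# check that the server accepts it. RFC 5531 message layout with
# record marking; AUTH_NONE credential and verifier.
import socket
import struct

NFS_PROGRAM = 100003

def rpc_null_ping(host, port=2049, version=4, xid=0x1234ABCD):
    # Call body: xid, CALL(0), rpcvers=2, program, version, proc=0 (NULL),
    # then AUTH_NONE credential and verifier (flavor 0, zero-length body).
    call = struct.pack(">IIIIII", xid, 0, 2, NFS_PROGRAM, version, 0)
    call += struct.pack(">IIII", 0, 0, 0, 0)

    # TCP record marking: high bit = last fragment, low 31 bits = length.
    record = struct.pack(">I", 0x80000000 | len(call)) + call

    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(record)
        (mark,) = struct.unpack(">I", sock.recv(4))  # sketch: assumes 4 bytes
        reply = b""
        while len(reply) < (mark & 0x7FFFFFFF):
            reply += sock.recv(4096)

    rxid, msg_type, reply_stat = struct.unpack(">III", reply[:12])
    # Skip the verifier (flavor, opaque length, body) to reach accept_stat.
    (verf_len,) = struct.unpack(">I", reply[16:20])
    (accept_stat,) = struct.unpack(">I", reply[20 + verf_len:24 + verf_len])
    return rxid == xid and msg_type == 1 and reply_stat == 0 and accept_stat == 0

if __name__ == "__main__":
    print(rpc_null_ping("my-nfs-server"))  # hypothetical hostname
```

Everything in NFS rides on this framing: fixed-size big-endian XDR fields, an XID to match replies to calls, and record marking on TCP, which helps explain how willy could fit a v2/v3 server into ARM assembler at all.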