Posted Sep 20, 2012 15:55 UTC (Thu) by khim (subscriber, #9252)
Because NFS is a completely stateless protocol. This means that the server should handle telldir correctly even if it crashed and was restarted, for example.
This is circular logic: you need telldir(2) to implement telldir(2) and this cannot be done entirely in userspace. Duh. Of course. But why is NFS designed this way in the first place?
Posted Sep 20, 2012 16:24 UTC (Thu) by ballombe (subscriber, #9523)
You never had your $HOME hosted on an original NFS server, did you? The ability to recover from a crash without impacting clients too much was very important when crashes were a daily event.
Posted Sep 20, 2012 22:12 UTC (Thu) by horen (subscriber, #2514)
Posted Sep 20, 2012 18:02 UTC (Thu) by nix (subscriber, #2304)
And that cookie is, of course, an (encoded) return value of telldir().
Worse yet, because NFS is stateless, *any* nonpersistent telldir() cookies are going to fail with NFS: the pathological case of someone who does a telldir() and then a seekdir() much much later in a different call to opendir() is downright *common* if you've got an NFS server looking at that filesystem.
Which means that every NFS-exportable fs (any serious FS, period) needs a way to encode positions in all directories, stably, into a 32-bit number, with vaguely reasonable things happening to those positions even when the directories change. It's a complete pig on a lot of filesystems: they sometimes need whole extra data structures just to make this one system call work. But, as Neil says, something like it does indeed seem to be essential if you want stateless network filesystems of any sort to work. I wish there was a better way, but I wish for a lot of impossible things.
Posted Sep 20, 2012 20:59 UTC (Thu) by khim (subscriber, #9252)
Ignoring here hypothetical protocols which can only return the whole directory in one go, since they would have obviously terrible performance.
I'm not so sure. You'll only have "an obviously terrible performance" if you have thousands (or maybe millions?) of files in one directory. When NFS was initially developed you had horrible performance in such a case no matter what (typically you had O(N²) performance and thousands of files were not acceptable for that reason alone) and later versions are not completely stateless so in the end the whole exercise only created grief for a lot of people without any actual upside.
But it's obviously too late to try to fix it.
Posted Sep 20, 2012 21:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
I.e. use the returned value for the cookie itself.
Posted Sep 20, 2012 21:22 UTC (Thu) by bronson (subscriber, #4806)
Posted Sep 20, 2012 21:34 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Posted Sep 20, 2012 22:00 UTC (Thu) by neilbrown (subscriber, #359)
Nothing wrong with that, except that we already have a much more broadly used interface which you can design your filesystem to.
There was at one stage a proposal before the NFSv4 working group to allow the lookup key for directories to be either an opaque fixed-sized blob, or a directory entry name. The filesystem would somehow indicate what it wanted (Came from Hans Reiser if I remember correctly). Unfortunately it never went anywhere.
Posted Sep 21, 2012 15:57 UTC (Fri) by faramir (subscriber, #2327)
BTW, are there any standards which speak to what one should see in the presence of directory changes? Personally, the most I would hope for in the dynamic directory case is that a program would eventually see every file that existed at opendir() time before readdir() returned no more files.
Posted Sep 21, 2012 18:00 UTC (Fri) by knobunc (subscriber, #4678)
Posted Sep 21, 2012 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523)
Posted Sep 21, 2012 20:15 UTC (Fri) by knobunc (subscriber, #4678)
Posted Sep 21, 2012 20:49 UTC (Fri) by neilbrown (subscriber, #359)
Why would you have an ordering that isn't stable? The obvious answer is that a hash table with chaining is commonly used and does not reliably provide a stable order (as collisions tend to be ordered based on creation order).
My preferred mechanism for directory indexing is a hash table with internal chaining. It easily provides a reliable fixed size telldir cookie, and performance should be quite good until the number of directory entries gets to about half the cookie space.
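A minimal Python sketch of why internal chaining (open addressing) gives a fixed-size, stable cookie: each name lives in one slot of a fixed table, so the slot number itself can serve as the telldir cookie. The class name, the CRC32 hash, the linear probing, and the 16-slot table are all assumptions made for illustration, not a description of any real filesystem.

```python
import zlib


class OpenAddressedDir:
    """Toy directory index: a fixed-size table with linear probing.

    Hypothetical illustration only. The telldir-style cookie is the
    slot number, so existing entries never move when others are added
    or removed.
    """

    def __init__(self, slots=16):
        self.slots = [None] * slots  # each slot holds a name or None

    def _home(self, name):
        # Home slot of a name; CRC32 is an arbitrary choice of hash.
        return zlib.crc32(name.encode()) % len(self.slots)

    def add(self, name):
        # Probe forward from the home slot until a free slot is found.
        start = self._home(name)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            if self.slots[i] is None:
                self.slots[i] = name
                return i
        raise OSError("directory full")

    def remove(self, name):
        # Free the slot; no other entry changes position.
        self.slots[self.slots.index(name)] = None

    def readdir(self, cookie=0, count=2):
        """Return up to `count` names from slot `cookie` on, and the next cookie."""
        names = []
        i = cookie
        while i < len(self.slots) and len(names) < count:
            if self.slots[i] is not None:
                names.append(self.slots[i])
            i += 1
        return names, i
```

A reader can stop after any batch, keep only the integer cookie, and resume later: names that exist for the whole scan come back exactly once, while names created during the scan may or may not appear. Probe sequences get longer as the table fills, which fits the remark about performance holding up until roughly half the cookie space is in use.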
Posted Sep 21, 2012 20:56 UTC (Fri) by neilbrown (subscriber, #359)
> If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir_r() returns an entry for that file is unspecified.
Which presumably implies that if a file is not added or removed after opendir, then readdir will return it precisely once. That is certainly how I understand it.
Posted Sep 21, 2012 23:06 UTC (Fri) by nix (subscriber, #2304)
(e.g. in the scheme I sketched above, if a readdir requester's random-cookie/filename->DIR* entry was expired by the server and the name the readdir requester passed was missing from the directory, readdir would simply start passing back filenames from the start of the directory over again. This does mean that under extreme load, so the cookies kept expiring, and seriously extreme modification rates of a sufficiently huge directory, so that at least one name was deleted while being readdir()ed and while its cookie expired on every pass through the directory, readdir() might never come to an end -- but that's possible under extreme modification rates anyway, even on local filesystems, and this is a pathological case that's unlikely to occur in practice. To be honest, given that bugs in which filenames were persistently omitted if they were in the wrong place in the directory persisted in the BSDs for something like thirty years, it seems that programs' actual requirements of readdir() are rather less extreme than the guarantees!)
Posted Sep 27, 2012 19:09 UTC (Thu) by cras (guest, #7000)
Posted Oct 5, 2012 18:56 UTC (Fri) by foom (subscriber, #14868)
>> 3. the application could be offered an interface for atomic directory
>> reads that requires the application to provide sufficient memory in a
>> single contiguous buffer (making it thread-safe in the same go).
>Actually, you can do this today, if you use the underlying
>sys_getdents64 system call. But the application would have to
>allocate potentially a very large amount of userspace memory.
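For illustration, the "read everything in one go" idea looks roughly like this in Python. This is only a sketch: `snapshot_directory` is a hypothetical helper, the read is not atomic with respect to concurrent changes, and as far as I know Python's `os.scandir` goes through the C library rather than calling getdents64 directly. What it does show is the trade-off: no position ever needs to be saved, at the cost of holding every name in memory.

```python
import os


def snapshot_directory(path):
    """Read every entry name of `path` into memory in a single pass.

    No telldir/seekdir-style position is kept; the caller gets the
    complete list, so memory use grows with the directory size.
    """
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)
```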
Posted Sep 20, 2012 21:54 UTC (Thu) by dlang (✭ supporter ✭, #313)
For simple filesystems you get simple answers. For complex filesystems things get really messy.
Posted Sep 20, 2012 20:38 UTC (Thu) by bfields (subscriber, #19510)
NFS is completely stateless protocol.
NFSv4 isn't stateless, and in practice though NFSv3 itself may have been stateless, it was usually run alongside NLM.
And I don't see why you couldn't in theory add a little more state to the protocol and make seekdir/telldir work. Whether that would actually be a practical solution to anyone's problem, I don't know....
Posted Sep 20, 2012 21:43 UTC (Thu) by neilbrown (subscriber, #359)
Obviously "state" and "stateless" are relative terms which we need to be careful with.
NFSv4 certainly has a lot of state, but for much of this state (not including the files!) the server is allowed to drop the state on a reboot. NFSv4 has a whole sub-protocol for recovering that state which essentially involves the server saying "I forgot everything, tell me what you know" and the clients saying "I had this file locked and this one open etc etc". I.e. the clients also store the state and feed it back to the server (Bruce of course knows all of this).
Were "current directory pointer" to be part of the "state" of an open file (when that file was a directory) ... how would the client reinstate that state when the server lost it? It would need a stable cookie!
I think the NFSv4 protocol does mention the possibility of the server saving some of its state to "stable storage" - there are times (particularly relating to extended partial network partitions) where that is needed and so the cost would be justified. (a bit like /var/lib/nfs/sm).
I suspect that storing directory offsets (after every readdir call!) to stable storage would be less than ideal for performance.
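A toy Python model of that recovery conversation may make the point concrete. It is a sketch under loud assumptions: the `Server` and `Client` classes and their method names are invented for illustration and do not correspond to real NFSv4 operations. The server keeps lock state only in memory, loses it on reboot, and clients replay what they remember during a grace period.

```python
class Server:
    """Hypothetical server whose lock table is volatile."""

    def __init__(self):
        self.locks = {}        # path -> client id; lost on reboot
        self.in_grace = False  # True while clients may reclaim state

    def lock(self, client_id, path):
        if self.in_grace:
            raise RuntimeError("new locks are refused during the grace period")
        if self.locks.get(path, client_id) != client_id:
            raise RuntimeError("already locked by another client")
        self.locks[path] = client_id

    def reboot(self):
        # "I forgot everything, tell me what you know."
        self.locks.clear()
        self.in_grace = True

    def reclaim(self, client_id, path):
        if not self.in_grace:
            raise RuntimeError("reclaim is only valid during the grace period")
        self.locks[path] = client_id

    def end_grace(self):
        self.in_grace = False


class Client:
    """Hypothetical client that remembers its own state for replay."""

    def __init__(self, client_id, server):
        self.client_id = client_id
        self.server = server
        self.held = set()  # the client's own copy of the state

    def lock(self, path):
        self.server.lock(self.client_id, path)
        self.held.add(path)

    def recover(self):
        # "I had this file locked ..."
        for path in sorted(self.held):
            self.server.reclaim(self.client_id, path)
```

A lock can be replayed because its name (the path) still means something after the reboot. A directory position can only be replayed the same way if the value the client holds is itself meaningful to a freshly booted server, which is exactly a stable cookie.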
Posted Sep 21, 2012 15:17 UTC (Fri) by bfields (subscriber, #19510)
1. Directory-modifying operations could be blocked during the grace period, during which clients could reclaim their previous cursors. (Is that enough to help?)
2. The existence of a directory-open might make it practical to keep readdir cursors in stable storage (since now you have to remember a limited amount of state for open directories, as opposed to remembering every cookie you've ever handed out.)
Posted Sep 21, 2012 22:57 UTC (Fri) by nix (subscriber, #2304)
I don't see why the client doesn't just remember 'the last filename we readdir()ed' and hand that back, probably together with a crude, random, non-stable cookie generated by the server in *its* last response in the same opendir() loop. I suppose it would make the readdir request a bit bigger... but in the common case of readdir()s following each other in quick succession it need be no slower: the server could keep a bit of extremely temporary local-only state (basically a readdir() handle and the random cookie we sent back with the last response) for recently received readdir() requests, compare the incoming request with the random cookie and filename, and now we know the next name to use, expiring the pair when we hit the end of the directory. If we expire the cookie/filename pair (which should be rare enough), we just need to opendir() the directory and readdir() through it until we find the filename again.
What am I missing? (Other than the fact that we can't change NFS because it's already there and doesn't work this way.)
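The scheme above can be sketched in Python roughly as follows. Everything here is an assumption for illustration: `ReaddirServer`, `list_directory`, the batch size, and the use of a plain list index in place of a real readdir() handle. The directory is also assumed not to change during the scan. If the remembered name has vanished, this sketch restarts from the beginning of the directory.

```python
import secrets


class ReaddirServer:
    """Hypothetical server keeping short-lived (cookie, name) cursors."""

    def __init__(self, entries, batch=2):
        self.entries = entries  # names in directory order (stand-in for disk)
        self.batch = batch
        self.cursors = {}       # (cookie, last_name) -> index of next entry

    def readdir(self, cookie=None, last_name=None):
        pos = 0
        if last_name is not None:
            # Fast path: the temporary cursor is still cached.
            pos = self.cursors.pop((cookie, last_name), None)
            if pos is None:
                # Cursor expired: rescan the directory for the name.
                try:
                    pos = self.entries.index(last_name) + 1
                except ValueError:
                    pos = 0  # name is gone: start over from the beginning
        names = self.entries[pos:pos + self.batch]
        done = pos + len(names) >= len(self.entries)
        new_cookie = secrets.token_hex(8)  # random, deliberately non-stable
        if not done:
            # Remember where this reader is; dropped at end of directory.
            self.cursors[(new_cookie, names[-1])] = pos + len(names)
        return names, new_cookie, done


def list_directory(server, expire_between_calls=False):
    """Client loop: hand back the last name seen plus the server's cookie."""
    result, cookie, last = [], None, None
    while True:
        names, cookie, done = server.readdir(cookie, last)
        result += names
        if done:
            return result
        last = names[-1]
        if expire_between_calls:
            server.cursors.clear()  # simulate the server forgetting cursors
```

With the cache intact each request is a dictionary lookup; with every cursor expired the client still gets the full listing, only at the cost of a rescan per request.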
Posted Sep 21, 2012 22:40 UTC (Fri) by nix (subscriber, #2304)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds