User: Password:
Subscribe / Log in / New account

Throwing one away

Throwing one away

Posted Sep 20, 2012 18:02 UTC (Thu) by nix (subscriber, #2304)
In reply to: Throwing one away by khim
Parent article: Throwing one away

It's much more fundamental than that. To implement the READDIR request, or anything like it in a stateless protocol (ignoring here hypothetical protocols which can only return the whole directory in one go, since they would have obviously terrible performance), you must be able to ask for 'the next bit of the directory from where I was readdir()ing last' even though the server has no idea where you might be in the directory because the protocol is stateless. NFS does this by handing you a cookie with each READDIR response which corresponds in some way the spec does not specify to the location of that file in a readdir() response, so you can pass it back in the next request and get the next filename.

And that cookie is, of course, an (encoded) return value of telldir().

Worse yet, because NFS is stateless, *any* nonpersistent telldir() cookies are going to fail with NFS: the pathological case of someone who does a telldir() and then a seekdir() much much later in a different call to opendir() is downright *common* if you've got an NFS server looking at that filesystem.

Which means that every NFS-exportable fs (any serious FS, period) needs a way to encode positions in all directories, stably, into a 32-bit number, with vaguely reasonable things happening to those positions even when the directories change. It's a complete pig on a lot of filesystems: they sometimes need whole extra data structures just to make this one system call work. But, as Neil says, something like it does indeed seem to be essential if you want stateless network filesystems of any sort to work. I wish there was a better way, but I wish for a lot of impossible things.

(Log in to post comments)

Throwing one away

Posted Sep 20, 2012 20:59 UTC (Thu) by khim (subscriber, #9252) [Link]

Ignoring here hypothetical protocols which can only return the whole directory in one go, since they would have obviously terrible performance.

I'm not so sure. You'll only have "an obviously terrible performance" if you have thousands (or maybe millions?) of files in one directory. When NFS was initially developed you had horrible performance in such a case no matter what (typically you had O(N²) performance and thousands of files were not acceptable for that reason alone) and later versions are not completely stateless so in the end the whole exercise only created grief for a lot of people without any actual upside.

But it's obviously to late to try to fix it.

Throwing one away

Posted Sep 20, 2012 21:13 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

There is a better way, and it's incredibly simple - just remember the last returned file name. Then send it back to the server.

I.e. use the returned value for the cookie itself.

Throwing one away

Posted Sep 20, 2012 21:22 UTC (Thu) by bronson (subscriber, #4806) [Link]

What if someone renames or deletes the file?

Throwing one away

Posted Sep 20, 2012 21:34 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Then you use the next filename. Amazon S3 does this, and it works just fine.

Throwing one away

Posted Sep 20, 2012 22:00 UTC (Thu) by neilbrown (subscriber, #359) [Link]

While this can certainly work "fine", it still imposes on the filesystem. It requires that the file system stores a lexically-order index of each directory. So you would have to design (part of) your filesystem to match this interface.

Nothing wrong with that, except that we already have a much more broadly used interface which you can design you filesystem to.

There was at one stage a proposal before the NFSv4 working group to allow the lookup key for directories to be either an opaque fixed-sized blob, or a directory entry name. The filesystem would somehow indicate what it wanted (Came from Hans Reiser if I remember correctly). Unfortunately it never went anywhere.

Throwing one away

Posted Sep 21, 2012 15:57 UTC (Fri) by faramir (subscriber, #2327) [Link]

I'm not sure why you need a lexically-ordered index if you use filenames as the location token. It would seem to me that what you need is a consistent ordering rather then a lexical one. I'm assumming here that we are talking about a directory which is static and we just want to make sure that we see every filename at some point.

BTW, are there any standards which speak to what one should see in the presence of directory changes? Personally, the most I would hope for in the dynamic directory case is that a program would eventually see every file that existed at opendir() time before readdir() returned no more files.

Throwing one away

Posted Sep 21, 2012 18:00 UTC (Fri) by knobunc (subscriber, #4678) [Link]

You need a lexically ordered one in case your "last-file" cookie refers to something that is no longer present. Otherwise what do you return?

Throwing one away

Posted Sep 21, 2012 19:27 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

You don't need a lexical order, just some stable ordering.

Throwing one away

Posted Sep 21, 2012 20:15 UTC (Fri) by knobunc (subscriber, #4678) [Link]

Agreed... but given that you need an ordering to allow you to locate where a missing arbitrary string needs to go, it must rely on an ordering property present in any arbitrary string. Which leaves you with little other than a lexical ordering.

Throwing one away

Posted Sep 21, 2012 20:49 UTC (Fri) by neilbrown (subscriber, #359) [Link]

Yes - ordering doesn't need to be strictly lexical, does need to be stable.

Why would you have an ordering that isn't stable? The obvious answer is that a hash table with chaining is commonly used and does not reliably provide a stable order (as collisions tend to be ordered based on creation order).

My preferred mechanism for directory indexing is a hash table with internal chaining. It easily provides a reliable fixed size telldir cookie, and performance should be quite good until the number of directory entries gets to about half the cookie space.

Throwing one away

Posted Sep 21, 2012 20:56 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> are there any standards which speak to what one should see in the presence of directory changes?


> If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir_r() returns an entry for that file is unspecified.

Which presumably implies that if a file is not added or removed after opendir, then readdir will return it precisely once. That is certainly how I understand it.

Throwing one away

Posted Sep 21, 2012 23:06 UTC (Fri) by nix (subscriber, #2304) [Link]

I don't know. POSIX says that readdir() returns "an ordered sequence of all the directory entries in a particular directory", which surely implies that if a file is removed during readdir() you cannot skip other files -- but it never says, nor by my reading implies that you cannot repeat some of them in this situation. This means that if you're occasionally forced to restart a directory because you can't figure out where the requester was, so be it -- though it is probably best if that require multiple things to go wrong.

(e.g. in the scheme I sketched above, if a readdir requester's random-cookie/filename->DIR* entry was expired by the server and the name the readdir requester passed was missing from the directory, readdir would simply start passing back filenames from the start of the directory over again. This does mean that under extreme load, so the cookies kept expiring, and seriously extreme modification rates of a sufficiently huge directory, so that at least one name was deleted while being readdir()ed and while its cookie expired on every pass through the directory, readdir() might never come to an end -- but that's possible under extreme modification rates anyway, even on local filesystems, and this is a pathological case that's unlikely to occur in practice. To be honest, given that bugs in which filenames were persistently omitted if they were in the wrong place in the directory persisted in the BSDs for something like thirty years, it seems that programs' actual requirements of readdir() are rather less extreme than the guarantees!)

Throwing one away

Posted Sep 27, 2012 19:09 UTC (Thu) by cras (guest, #7000) [Link]

rename()s during readdir() can cause the renamed files to be returned twice or never. I'm not entirely sure if other non-renamed files can be returned twice. Deleting files (or at least rename()ing to another directory) can cause some files to be skipped with some filesystems (unfortunately I didn't keep more details). Dovecot's Maildir syncing code has some comments related to this. also.

Throwing one away

Posted Oct 5, 2012 18:56 UTC (Fri) by foom (subscriber, #14868) [Link]

Ooh, according to that lkml thread, it *is* possible to get a consistent state of a directory, on linux. You just need to allocate a buffer big enough for the whole directory to fit in. There doesn't appear to be a way to figure out how big that needs to be ahead-of-time, but at least you could iterate, doubling the buffer each time:

>> 3. the application could be offered an interface for atomic directory
>> reads that requires the application to provide sufficient memory in a
>> single contiguous buffer (making it thread-safe in the same go).
>Actually, you can do this today, if you use the underlying
>sys_getdents64 system call. But the application would have to
>allocate potentially a very large amount of userspace memory.

Throwing one away

Posted Sep 20, 2012 21:54 UTC (Thu) by dlang (subscriber, #313) [Link]

what happens if the filesystem doesn't store the files in a nice simple list, but instead stores them in some sort of tree format, and the tree can be re-ordered significantly between one call and the next.

For simple filesystems you get simple answers. for complex filesystems things get really messy.

Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds