Widening ext4's readdir() cookie
Posted Mar 28, 2013 8:58 UTC (Thu) by
paulj (subscriber, #341)
Parent article:
Widening ext4's readdir() cookie
It's interesting to read of the readdir offset debate, and read the hate for it. Most interesting is the "reluctantly" link in the article, http://lwn.net/Articles/544237/ - where Ted seems to make 2 claims:
- The readdir/telldir offset is a legacy problem (NFS), and if not for that it would not be in the kernel.
- It would not generally be practical for userspace to keep directories in memory while walking them for telldir/readdir.
Aren't these claims potentially contradictory? If it is not practical to keep directories in memory for telldir/readdir, why would it be practical for any other API? How would a dir-walking API work that didn't require userspace to keep a shadow copy (or didn't require the kernel to keep a shadow copy)? Is this perhaps a fundamental problem of a trade-off of where to keep state, rather than anything particular to readdir?
Also strange is the language in the SUSv3 spec which Ted quotes in another email (http://lwn.net/Articles/544257/):
" One of the perceived problems of implementation is that returning
to a given point in a directory is quite difficult to describe
formally, in spite of its intuitive appeal, when systems that use
B-trees, hashing functions, or other similar mechanisms to order
their directories are considered."
Difficult to describe formally? That's odd. Surely it is quite trivial? You define that directory entries have an absolute ordering (lexical, numberical, whatever you want - it doesn't matter). Why are B-trees a problem? B-trees are defined to provide an index over absolutely ordered objects. Given, say, a B-tree and a rank in that order (e.g. the cookie of a previous dirent), there should be no problem to determine what the next dirent in the order is - even if the B-tree has been changed or re-balanced, even if the dirent for the cookie no longer exists! Surely? All ordered indices should have this property (I.e. hash tables do not, but pretty much all commonly used trees would, no?), surely?
Basically, I'm puzzled, because the issues as stated don't fully make sense (either not issues, or potentially not consistent) to me. So I'm wondering what the deeper issues are, that the developers and spec writers havn't explained; or else whether perhaps the developers have gotten so "in" to the details of the problem that they're overlooking some things that'd be clearer if they stepped back? (Or it could just be me having one of my amazingly stupid moments ;) - in which case, my apologies tentatively in advance).
(
Log in to post comments)