By Michael Kerrisk
March 27, 2013
In a separate article, we explained how
an ext4 change to the kernel-user-space ABI in Linux 3.4 broke the
GlusterFS filesystem; here, we look in detail at the change and why it was
needed. The change in question was a
patch by Fan Yong that widened the readdir() "cookies"
produced by ext4 from 32 to 64 bits. Understanding why Fan's patch was
necessary first requires a bit of background on the readdir() API.
The readdir API consists of a number of functions that allow
an application to walk through the entries in a directory list. The opendir()
function opens a directory stream for a specified directory. The readdir()
function returns the contents of a directory stream, one entry at a
time. The telldir()
and
seekdir() functions provide lseek-style functionality: an
application can remember its current position in a directory stream using
telldir(), scan further entries with readdir(), and then
return to the remembered position using seekdir().
It turns out that supporting the readdir API is a source of
considerable pain for filesystem developers. The API was designed in a
simpler age, when directories were essentially linear tables of filenames
plus inode numbers. The first of the widely used Linux filesystems, ext2,
followed that design. In such filesystems, one can meaningfully talk about
an offset within a directory table.
However, in the interests of improving performance and supporting new
features, modern filesystems (such as ext4) have long since adopted more
complex data structures—typically B-trees (PDF)—for
representing directories. The problem with B-tree structures, from the
point of view of implementing the readdir() API, is that the nodes
in a tree can undergo (sometimes drastic) rearrangements as entries are
added to and removed from the tree. This reordering of the tree renders the
concept of a directory "offset" meaningless. The lack of a stable offset
value is obviously a difficulty when implementing telldir() and
seekdir(). However, it is also a problem for the implementation of
readdir(), which must be done in such a way that a loop using
readdir() to scan an entire directory will return a list of all
files in the directory, without duplicates. Consequently,
readdir() must internally also maintain some kind of stable
representation of a position within the directory stream.
Although there is no notion of an offset inside a B-tree, the
implementers of modern filesystems must still support the
readdir API (albeit
reluctantly); indeed, support for the API is a POSIX
requirement. Therefore, it is necessary to find some means of supporting
"directory position" semantics. This is generally done by fudging the
returned offset value, instead returning an internally understood "cookie"
value. The idea is that the kernel computes a hash value that encodes some
notion of the current position in a directory (tree) and returns that value
(the cookie) to user space. A subsequent readdir() or
seekdir() will pass the cookie back to the kernel, at which point
the kernel decodes the cookie to derive a position within the directory.
Encoding the directory position as a cookie works, more or less, but
has some limitations. The cookie has historically been a 31-bit hash
value, because older NFS implementations could handle only 32-bit
cookies. (The hash is 31-bit because the off_t type used to
represent the information is defined as a signed type, and negative offsets
are not allowed.) In earlier times, a 31-bit hash was not too much of a
problem: filesystem limitations meant that directories were usually small, so
the chance that two directory entries would hash to the same value was
small.
However, modern filesystems allow for large directories—so large
that the chance of two files producing the same 31-bit hash is
significant. For example, in a directory with 2000 entries, the chance of a
collision is around 0.1%. In a directory with 32,768 entries (the
historical limit in ext2), the chance is somewhat more than 20%. (For the
math behind these numbers, see the Wikipedia article
on the Birthday Paradox.) Modern filesystems have much higher limits on
the number of files in a directory, with a corresponding increase in the
chance of hash collisions; in a directory with 100,000
entries, the probability is over 90%.
Two files that hash to the same cookie value can lead to problems when
using readdir(), especially on NFS. Suppose that we want to scan
all of the files in a directory. And suppose that two files, say
abc and xyz, hash to the same value, and that the
directory is ordered such that abc is scanned first. When an NFS
client readdir() later reaches the file xyz, it will
receive a cookie that is exactly the same as for abc. Upon passing
that cookie back to the NFS server, the next readdir() will
commence at the file following abc. The NFS client code has some logic
to detect this situation; that logic causes readdir() to give the
(somewhat counter-intuitive) error ELOOP, "Too many levels of
symbolic links".
This error can be fairly easily reproduced on NFS with older
kernels. One simply has to create an ext4 directory containing enough
files, mount that directory over NFS, and run any program that performs a
readdir() loop over the directory on the NFS client. When working
with a local filesystem (no NFS involved), the same problem exists, but in
a different form. One does not encounter it when using readdir(),
because of the way in which that function is implemented on top of the
getdents() system call. Essentially, opendir() opens a
file descriptor that is used by getdents(); the kernel is able to
internally associate a directory position with that file descriptor, so
cookies play no part in the implementation of readdir(). By
contrast, because NFS is stateless, each
readdir() over NFS requires that the NFS server explicitly locate
the directory position corresponding to the cookie sent by the client.
On the other hand, the problem can be observed with a local ext4
filesystem when using telldir(), because that function explicitly
returns the directory "offset" cookie to the caller. If two directory
entries produce the same "offset" cookie when calling telldir(),
then a call to seekdir() after either of the telldir()
calls will go back to the same location. A user-space loop such as the
following easily reveals the problem, encountering a difficulty analogous
to a readdir() loop over NFS:
dirp = opendir("/path/to/ext4/dir");
while ((dirent = readdir(dirp)) != NULL) {
...
seekdir(dirp, telldir(dirp));
...
}
The seekdir(dirp, telldir(dirp)) call is a seeming no-op,
simply resetting the directory position to its current location. However,
where a directory entry hashes to the same value as an earlier
directory entry, the effect of the call will be to reset the directory
position to the earlier entry with the same hash. An infinite loop thus
results. Real programs would of course not use telldir() and
seekdir() in this manner. However, every now and then programs
that use those calls would obtain a surprising result: a seekdir()
would reposition the directory stream to a completely unexpected location.
Thus, the cookie collision problem needed to be fixed for the benefit
of both ext4 and (especially) NFS. The simplest way of reducing the
likelihood of hash collisions is to increase the size of the hash
space. That was the purpose of Fan's patch, which increased the size of the
hash space for the offset cookies produced by ext4 from 31 bits to 63. (A
similar
change has also been merged for ext3.) With a 63-bit hash space, even a
directory containing one million entries would have less than one chance in
four million of producing a hash collision. Of course, a corresponding
change is required in NFS, so that the NFS server is able to deal with the
larger cookie sizes. That change was provided in a
patch by Bernd Schubert.
Reading this article and the GlusterFS article together, one might
wonder why GlusterFS doesn't have the same problems with XFS that it has
with ext4. The answer, as noted by Dave
Chinner, is that XFS uses a rather different scheme to produce
readdir() cookies. That scheme produces cookies that require only
32 bits, and the cookies are produced in such a way as to guarantee that no
two files can generate the same cookie. XFS is able to produce unique
32-bit cookies due to the virtual mapping it overlays onto the directory
index; adding such a mapping to ext4 (which does not otherwise need it)
would be a large job.
(
Log in to post comments)