Time to merge GFS?
[Posted August 10, 2005 by corbet]
Red Hat recently
announced
that Fedora Core 4 was available with the Global Filesystem (GFS).
Like Oracle's OCFS2, GFS allows a tightly-linked cluster to manage
filesystems stored on a shared disk. Now that GFS is actually shipping,
Red Hat would like to see it merged into the mainline kernel. Thus,
recently, David Teigland
posted the patches for
review and asked for feedback. He got some.
One issue has to do with locking. Since the filesystem is kept on shared
storage, the nodes of the cluster must take care to avoid stepping on each
others' toes and corrupting things. The distributed lock manager (DLM)
subsystem is used to that end; whenever a node wishes to access a
particular block on the filesystem, it first obtains a cluster-wide lock on
that block. As long as the filesystem only supports the read()
and write() system calls, this locking works reasonably well. The
filesystem code can obtain the locks it needs, perform the operation, then
return the locks, and all works well.
The problem comes in when the filesystem supports mmap() as well.
Accesses to memory mapped with mmap() does not happen with the
read() and write() system calls; it is, instead, done
with regular memory operations. Locking in this case is handled in
conjunction with the virtual memory subsystem; the permissions on any
particular page are set to be consistent with the level of lock currently
held by the local node. If the node does not have a lock for a specific
block in the filesystem, the page table entry for the corresponding page
will show that page as being absent. If the process which made the mapping
tries to access the page, it will incur a page fault; the filesystems
nopage() method can then set up the mapping, acquiring whatever
locks are required.
Page faults are asynchronous events. In particular, a page fault could
happen while the kernel is busy handling a read() or
write() operation somewhere else in the filesystem. In this case,
the kernel will be acquiring two independent locks in the filesystem, and
in an arbitrary order. It does not take much experience with locking to
learn that, when multiple locks are to be acquired, the order in which they
are taken is critical. Consider a case where there are two locks (call
them "A" and "B") and two processes needing them. Imagine that one process
acquires A, while the other acquires B. Each process then attempts to grab
the remaining lock. At this point, both processes will wait forever; this
situation is called an "ABBA deadlock." Contrary to what some may believe,
the term has nothing to do with 1970's Swedish rock bands.
Avoiding this kind of deadlock requires a fair amount of ugly filesystem
trickery; Zach Brown put it this way:
So clustered file systems in Linux (GFS, Lustre, OCFS2, (GPFS?))
all walk vmas in their file->{read,write} to discover mappings that
belong to their files so that they can preemptively sort and
acquire the locks that will be needed to cover the mappings that
might be established in ->nopage. As you point out, this both
relies on the mappings not changing and gets very exciting when you
mix files and mappings between file systems that are each sorting
and acquiring their own DLM locks.
Sorting this situation out properly will probably require some sort of
support at the VFS layer. In that way, one hopes, a single, working
solution would be found. The alternative seems to be a bunch of brittle
and complicated code in each filesystem which has this problem.
Another glitch encountered by GFS is its support for "context-dependent
path names." These are, in essence, symbolic links with magic properties.
The GFS code, if it encounters "@hostname" as a component in a
symbolic link, will substitute the name of the current host. Similar
substitutions will happen for @mach, @os, @uid,
and others. There is also support for an alternative syntax
("{hostname}"), for whatever reason.
This mechanism exists to allow cluster nodes to establish private areas on
a shared disk. It can also be used, for example, to create
architecture-specific directories full of binaries on a common path. In
the past, administrators have used automounter trickery to a very similar
end. The filesystem hackers, who do not like to see this sort of magic
buried within individual filesystems, suggest that bind mounts should be
used instead. That technique, however, is relatively cumbersome and
error-prone, so there is some interest in finding a way to maintain the
sort of functionality implemented by context-dependent links.
The objections to context-dependent links include the addition of magic to
parts of the filesystem namespace and the fact that they are specific to
one filesystem. Moving the resolution of these links up to the VFS layer
could be a part of the solution, since it would then at least function the
same way for all filesystems. Adding this kind of semantics may always be
a hard sell, however, since it changes the way Linux filesystems are
expected to behave. The old, automounter-based approach may end up being
the recommended technique for those needing this sort of behavior.
(
Log in to post comments)