Time to merge GFS?
One issue has to do with locking. Since the filesystem is kept on shared storage, the nodes of the cluster must take care to avoid stepping on each others' toes and corrupting things. The distributed lock manager (DLM) subsystem is used to that end; whenever a node wishes to access a particular block on the filesystem, it first obtains a cluster-wide lock on that block. As long as the filesystem only supports the read() and write() system calls, this locking works reasonably well. The filesystem code can obtain the locks it needs, perform the operation, then return the locks, and all works well.
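The lock-around-operation pattern can be sketched in userspace terms. This is only a single-node simulation with invented names (`BlockLockManager`, `read_block`) standing in for the real DLM interface, but it shows the discipline: take the block's lock, perform the operation, drop the lock.

```python
import threading

class BlockLockManager:
    """Toy stand-in for a cluster DLM: one lock per filesystem block.
    (Invented for illustration; the real DLM grants cluster-wide locks
    in several modes, not plain mutexes.)"""

    def __init__(self):
        self._table_lock = threading.Lock()
        self._block_locks = {}

    def lock_for(self, block):
        # Lazily create exactly one lock object per block number.
        with self._table_lock:
            return self._block_locks.setdefault(block, threading.Lock())

    def read_block(self, storage, block):
        # read()-style path: lock the block, do the operation, unlock.
        with self.lock_for(block):
            return storage.get(block)

    def write_block(self, storage, block, data):
        # write()-style path: same lock/operate/unlock sequence.
        with self.lock_for(block):
            storage[block] = data

# Usage: every access to block 42 serializes on the same lock object.
shared_storage = {}
mgr = BlockLockManager()
mgr.write_block(shared_storage, 42, b"metadata")
```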
The problem comes in when the filesystem supports mmap() as well. Accesses to memory mapped with mmap() do not go through the read() and write() system calls; they are, instead, ordinary memory operations. Locking in this case is handled in conjunction with the virtual memory subsystem; the permissions on any particular page are set to be consistent with the level of lock currently held by the local node. If the node does not hold a lock for a specific block in the filesystem, the page table entry for the corresponding page will show that page as being absent. If the process which made the mapping tries to access the page, it will incur a page fault; the filesystem's nopage() method can then set up the mapping, acquiring whatever locks are required.
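The fault-driven protocol can be mimicked in miniature. This is a pure simulation: the class and method names are invented, and the real page table entry and DLM lock are stood in for by a flag and a mutex.

```python
import threading

class FaultingPage:
    """Simulates one mmap()ed page on a cluster filesystem: the page
    stays 'absent' until this node holds the corresponding lock."""

    def __init__(self, backing):
        self.backing = backing        # data on "shared storage"
        self.lock = threading.Lock()  # stands in for the cluster lock
        self.present = False          # stands in for the PTE present bit

    def access(self):
        # A memory access: fault if the page is not mapped yet.
        if not self.present:
            self.nopage()
        return self.backing

    def nopage(self):
        # Fault handler: acquire the cluster lock, then map the page.
        self.lock.acquire()
        self.present = True

    def revoke(self):
        # Another node wants the block: unmap the page, drop the lock.
        self.present = False
        self.lock.release()

page = FaultingPage(b"block contents")
```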
Page faults are asynchronous events. In particular, a page fault could happen while the kernel is busy handling a read() or write() operation somewhere else in the filesystem. In this case, the kernel will be acquiring two independent locks in the filesystem, and in an arbitrary order. It does not take much experience with locking to learn that, when multiple locks are to be acquired, the order in which they are taken is critical. Consider a case where there are two locks (call them "A" and "B") and two processes needing them. Imagine that one process acquires A, while the other acquires B. Each process then attempts to grab the remaining lock. At this point, both processes will wait forever; this situation is called an "ABBA deadlock." Contrary to what some may believe, the term has nothing to do with 1970's Swedish rock bands.
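The standard cure for ABBA deadlocks is to impose a single global ordering on lock acquisition, so that no task ever holds B while waiting for A. A minimal illustration follows; the ordering key used here, `id()`, is arbitrary, and a real filesystem would order by something stable such as block number.

```python
import threading

def acquire_in_order(*locks):
    # Always take locks in one global order; the ABBA pattern then
    # cannot arise, because every task contends for "A" first.
    for lk in sorted(locks, key=id):
        lk.acquire()

def release_all(*locks):
    for lk in locks:
        lk.release()

lock_a, lock_b = threading.Lock(), threading.Lock()
finished = []

def worker(name, first, second):
    acquire_in_order(first, second)   # argument order is irrelevant
    finished.append(name)
    release_all(first, second)

# The two threads *request* the locks in opposite orders...
t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start(); t2.start()
t1.join(); t2.join()
# ...yet both complete, because acquisition order is normalized.
```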
Avoiding this kind of deadlock requires a fair amount of ugly filesystem trickery, as Zach Brown has pointed out.
Sorting this situation out properly will probably require some sort of support at the VFS layer. In that way, one hopes, a single, working solution would be found. The alternative seems to be a bunch of brittle and complicated code in each filesystem which has this problem.
Another glitch encountered by GFS is its support for "context-dependent path names." These are, in essence, symbolic links with magic properties. The GFS code, if it encounters "@hostname" as a component in a symbolic link, will substitute the name of the current host. Similar substitutions will happen for @mach, @os, @uid, and others. There is also support for an alternative syntax ("{hostname}"), for whatever reason.
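The substitution itself is simple string rewriting on path components. A rough sketch: the token list follows the article, but the exact expansion rules here are guesses, not the GFS implementation.

```python
import os
import platform

def expand_cdpn(target):
    """Expand GFS-style context-dependent tokens in a symlink target.
    Only whole path components are rewritten, as in the article."""
    substitutions = {
        "@hostname": platform.node(),       # name of the current host
        "@mach": platform.machine(),        # hardware architecture
        "@os": platform.system().lower(),   # operating system name
        "@uid": str(os.getuid()),           # user id of the caller
    }
    components = target.split("/")
    return "/".join(substitutions.get(c, c) for c in components)

# e.g. "/gfs/@hostname/scratch" becomes "/gfs/<this node's name>/scratch"
```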
This mechanism exists to allow cluster nodes to establish private areas on a shared disk. It can also be used, for example, to create architecture-specific directories full of binaries on a common path. In the past, administrators have used automounter trickery to a very similar end. The filesystem hackers, who do not like to see this sort of magic buried within individual filesystems, suggest that bind mounts should be used instead. That technique, however, is relatively cumbersome and error-prone, so there is some interest in finding a way to maintain the sort of functionality implemented by context-dependent links.
The objections to context-dependent links include the addition of magic to parts of the filesystem namespace and the fact that they are specific to one filesystem. Moving the resolution of these links up to the VFS layer could be a part of the solution, since it would then at least function the same way for all filesystems. Adding this kind of semantics may always be a hard sell, however, since it changes the way Linux filesystems are expected to behave. The old, automounter-based approach may end up being the recommended technique for those needing this sort of behavior.
Time to merge GFS?
Posted Aug 11, 2005 10:25 UTC (Thu) by dw (subscriber, #12017) [Link] (2 responses)

@hostname  Boot-time symlink or bind mount
@mach      Boot-time symlink or bind mount
@os        Boot-time symlink or bind mount
@uid       Private namespaces

As pointed out in the article, this symlink content scanning is completely unnecessary bloat, whose functionality can be accomplished through mechanisms already available in the kernel.

I built a cluster a few years back that booted 36 completely diskless nodes off a single shared readonly NFS root. This involved about 5 lines of bindmounting and tmpfs, along with commenting out fsck checks and suchnot. That cluster is still operational, sitting in a room about 10 yards away. :)

The problem has already been solved. Please, no more bloat!

Time to merge GFS?
Posted Aug 11, 2005 11:33 UTC (Thu) by penberg (guest, #30234) [Link]

> The problem has already been solved. Please, no more bloat!

It has already been taken out of GFS2 code by the developers.

Time to merge GFS?
Posted Aug 11, 2005 17:49 UTC (Thu) by iabervon (subscriber, #722) [Link]

It gets a bit tricky if you've got a couple hundred symlinks bin -> @os/bin in different directories. The point with using symlinks for it is that regular users can create these links without suid programs when they want portions of their home directories to behave differently on different hosts.

Of course, this should probably be an aspect of namespaces, such that it applies to everything, but only if you've enabled it.

Time to merge GFS?
Posted Aug 11, 2005 12:28 UTC (Thu) by hpp (subscriber, #4756) [Link]

The Andrew File System (AFS) has a similar mechanism where a pathname component named @sys instead goes to your hardware platform. On my box here at work (we're a big AFS shop),

  .../@sys/bin/mozilla

goes to

  .../ia32.linux.2.4.glibc.2.3/bin/mozilla

and we use this to store binaries for multiple architectures side by side.

This has existed in AFS forever, and is supported in Linux. I'm not sure why this is not a problem in AFS but would be a problem for GFS.

Could this be the location where the magic @sys token gets expanded? In AFS, this is all done on the client side, so the fileserver is not aware of the magic / trickery in the lookup.

context-dependent symlink
Posted Aug 12, 2005 22:05 UTC (Fri) by giraffedata (guest, #1954) [Link]

> The objections to context-dependent links include ... the fact that they are specific to one filesystem [type].

If that's a valid objection, we need to get rid of the "followlink" inode operation. Its existence specifically says that the meaning of symbolic link contents is filesystem-type-dependent. The common path walk code could as easily do a "readlink" inode operation and interpret the contents as a path itself. So someone wanted it that way.

In the continual battle between filesystem type consistency and filesystem type diversity, the VFS interface is what defines the front.

How things come back from the past ...
Posted Aug 14, 2005 18:36 UTC (Sun) by addw (guest, #1771) [Link]

I remember the sequent and pyramid boxes of a decade or two back ... these machines were multi universe ones: processes could see either System V or BSD system calls and file system layout - depending on the universe that the process was in.

One 'feature' was a conditional symbolic link ... what it pointed to depended on the universe setting.

Sounds like a nice idea, but it was vile. I hated it.