Time to merge GFS?
One issue has to do with locking. Since the filesystem is kept on shared storage, the nodes of the cluster must take care to avoid stepping on each others' toes and corrupting things. The distributed lock manager (DLM) subsystem is used to that end; whenever a node wishes to access a particular block on the filesystem, it first obtains a cluster-wide lock on that block. As long as the filesystem only supports the read() and write() system calls, this locking works reasonably well. The filesystem code can obtain the locks it needs, perform the operation, then return the locks, and all works well.
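The lock-around-operation pattern can be sketched in userspace terms. This is only a single-node simulation with invented names (`BlockLockManager`, `read_block`) standing in for the real DLM interface, but it shows the discipline: take the block's lock, perform the operation, drop the lock.

```python
import threading

class BlockLockManager:
    """Toy stand-in for a cluster DLM: one lock per filesystem block.
    (Invented for illustration; the real DLM grants cluster-wide locks
    in several modes, not plain mutexes.)"""

    def __init__(self):
        self._table_lock = threading.Lock()
        self._block_locks = {}

    def lock_for(self, block):
        # Lazily create exactly one lock object per block number.
        with self._table_lock:
            return self._block_locks.setdefault(block, threading.Lock())

    def read_block(self, storage, block):
        # read()-style path: lock the block, do the operation, unlock.
        with self.lock_for(block):
            return storage.get(block)

    def write_block(self, storage, block, data):
        # write()-style path: same lock/operate/unlock sequence.
        with self.lock_for(block):
            storage[block] = data

# Usage: every access to block 42 serializes on the same lock object.
shared_storage = {}
mgr = BlockLockManager()
mgr.write_block(shared_storage, 42, b"metadata")
```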
The problem comes in when the filesystem supports mmap() as well. Accesses to memory mapped with mmap() do not go through the read() and write() system calls; they are, instead, ordinary memory operations. Locking in this case is handled in conjunction with the virtual memory subsystem; the permissions on any particular page are set to be consistent with the level of lock currently held by the local node. If the node does not hold a lock for a specific block in the filesystem, the page table entry for the corresponding page will show that page as being absent. If the process which made the mapping tries to access the page, it will incur a page fault; the filesystem's nopage() method can then set up the mapping, acquiring whatever locks are required.
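The fault-driven protocol can be mimicked in miniature. This is a pure simulation: the class and method names are invented, and the real page table entry and DLM lock are stood in for by a flag and a mutex.

```python
import threading

class FaultingPage:
    """Simulates one mmap()ed page on a cluster filesystem: the page
    stays 'absent' until this node holds the corresponding lock."""

    def __init__(self, backing):
        self.backing = backing        # data on "shared storage"
        self.lock = threading.Lock()  # stands in for the cluster lock
        self.present = False          # stands in for the PTE present bit

    def access(self):
        # A memory access: fault if the page is not mapped yet.
        if not self.present:
            self.nopage()
        return self.backing

    def nopage(self):
        # Fault handler: acquire the cluster lock, then map the page.
        self.lock.acquire()
        self.present = True

    def revoke(self):
        # Another node wants the block: unmap the page, drop the lock.
        self.present = False
        self.lock.release()

page = FaultingPage(b"block contents")
```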
Page faults are asynchronous events. In particular, a page fault could happen while the kernel is busy handling a read() or write() operation somewhere else in the filesystem. In this case, the kernel will be acquiring two independent locks in the filesystem, and in an arbitrary order. It does not take much experience with locking to learn that, when multiple locks are to be acquired, the order in which they are taken is critical. Consider a case where there are two locks (call them "A" and "B") and two processes needing them. Imagine that one process acquires A, while the other acquires B. Each process then attempts to grab the remaining lock. At this point, both processes will wait forever; this situation is called an "ABBA deadlock." Contrary to what some may believe, the term has nothing to do with 1970's Swedish rock bands.
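The standard cure for ABBA deadlocks is to impose a single global ordering on lock acquisition, so that no task ever holds B while waiting for A. A minimal illustration follows; the ordering key used here, `id()`, is arbitrary, and a real filesystem would order by something stable such as block number.

```python
import threading

def acquire_in_order(*locks):
    # Always take locks in one global order; the ABBA pattern then
    # cannot arise, because every task contends for "A" first.
    for lk in sorted(locks, key=id):
        lk.acquire()

def release_all(*locks):
    for lk in locks:
        lk.release()

lock_a, lock_b = threading.Lock(), threading.Lock()
finished = []

def worker(name, first, second):
    acquire_in_order(first, second)   # argument order is irrelevant
    finished.append(name)
    release_all(first, second)

# The two threads *request* the locks in opposite orders...
t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start(); t2.start()
t1.join(); t2.join()
# ...yet both complete, because acquisition order is normalized.
```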
Avoiding this kind of deadlock requires a fair amount of ugly filesystem trickery, as Zach Brown has pointed out.
Sorting this situation out properly will probably require some sort of support at the VFS layer. In that way, one hopes, a single, working solution would be found. The alternative seems to be a bunch of brittle and complicated code in each filesystem which has this problem.
Another glitch encountered by GFS is its support for "context-dependent path names." These are, in essence, symbolic links with magic properties. The GFS code, if it encounters "@hostname" as a component in a symbolic link, will substitute the name of the current host. Similar substitutions will happen for @mach, @os, @uid, and others. There is also support for an alternative syntax ("{hostname}"), for whatever reason.
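The substitution itself is simple string rewriting on path components. A rough sketch: the token list follows the article, but the exact expansion rules here are guesses, not the GFS implementation.

```python
import os
import platform

def expand_cdpn(target):
    """Expand GFS-style context-dependent tokens in a symlink target.
    Only whole path components are rewritten, as in the article."""
    substitutions = {
        "@hostname": platform.node(),       # name of the current host
        "@mach": platform.machine(),        # hardware architecture
        "@os": platform.system().lower(),   # operating system name
        "@uid": str(os.getuid()),           # user id of the caller
    }
    components = target.split("/")
    return "/".join(substitutions.get(c, c) for c in components)

# e.g. "/gfs/@hostname/scratch" becomes "/gfs/<this node's name>/scratch"
```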
This mechanism exists to allow cluster nodes to establish private areas on a shared disk. It can also be used, for example, to create architecture-specific directories full of binaries on a common path. In the past, administrators have used automounter trickery to a very similar end. The filesystem hackers, who do not like to see this sort of magic buried within individual filesystems, suggest that bind mounts should be used instead. That technique, however, is relatively cumbersome and error-prone, so there is some interest in finding a way to maintain the sort of functionality implemented by context-dependent links.
The objections to context-dependent links include the addition of magic to parts of the filesystem namespace and the fact that they are specific to one filesystem. Moving the resolution of these links up to the VFS layer could be a part of the solution, since it would then at least function the same way for all filesystems. Adding this kind of semantics may always be a hard sell, however, since it changes the way Linux filesystems are expected to behave. The old, automounter-based approach may end up being the recommended technique for those needing this sort of behavior.
Time to merge GFS?
Posted Aug 11, 2005 10:25 UTC (Thu) by dw (subscriber, #12017) [Link] (2 responses)

@hostname  Boot-time symlink or bind mount
@mach      Boot-time symlink or bind mount
@os        Boot-time symlink or bind mount
@uid       Private namespaces

As pointed out in the article, this symlink content scanning is completely unnecessary bloat, whose functionality can be accomplished through mechanisms already available in the kernel.

I built a cluster a few years back that booted 36 completely diskless nodes off a single shared readonly NFS root. This involved about 5 lines of bindmounting and tmpfs, along with commenting out fsck checks and suchnot. That cluster is still operational, sitting in a room about 10 yards away. :)

The problem has already been solved. Please, no more bloat!

Time to merge GFS?
Posted Aug 11, 2005 11:33 UTC (Thu) by penberg (guest, #30234) [Link]

> The problem has already been solved. Please, no more bloat!

It has already been taken out of GFS2 code by the developers.

Time to merge GFS?
Posted Aug 11, 2005 17:49 UTC (Thu) by iabervon (subscriber, #722) [Link]

It gets a bit tricky if you've got a couple hundred symlinks bin -> @os/bin in different directories. The point with using symlinks for it is that regular users can create these links without suid programs when they want portions of their home directories to behave differently on different hosts.

Of course, this should probably be an aspect of namespaces, such that it applies to everything, but only if you've enabled it.

Time to merge GFS?
Posted Aug 11, 2005 12:28 UTC (Thu) by hpp (subscriber, #4756) [Link]

The Andrew File System (AFS) has a similar mechanism where a pathname component named @sys instead goes to your hardware platform. On my box here at work (we're a big AFS shop),

  .../@sys/bin/mozilla

goes to

  .../ia32.linux.2.4.glibc.2.3/bin/mozilla

and we use this to store binaries for multiple architectures side by side.

This has existed in AFS forever, and is supported in Linux. I'm not sure why this is not a problem in AFS but would be a problem for GFS.

Could this be the location where the magic @sys token gets expanded? In AFS, this is all done on the client side, so the fileserver is not aware of the magic / trickery in the lookup.

context-dependent symlink
Posted Aug 12, 2005 22:05 UTC (Fri) by giraffedata (guest, #1954) [Link]

> The objections to context-dependent links include ... the fact that they are specific to one filesystem [type].

If that's a valid objection, we need to get rid of the "followlink" inode operation. Its existence specifically says that the meaning of symbolic link contents is filesystem-type-dependent. The common path walk code could as easily do a "readlink" inode operation and interpret the contents as a path itself. So someone wanted it that way.

In the continual battle between filesystem type consistency and filesystem type diversity, the VFS interface is what defines the front.

How things come back from the past ...
Posted Aug 14, 2005 18:36 UTC (Sun) by addw (guest, #1771) [Link]

I remember the sequent and pyramid boxes of a decade or two back ... these machines were multi universe ones: processes could see either System V or BSD system calls and file system layout - depending on the universe that the process was in.

One 'feature' was a conditional symbolic link ... what it pointed to depended on the universe setting.

Sounds like a nice idea, but it was vile. I hated it.