The second version of Oracle's cluster filesystem has been in the works for
some time. There has been a recent increase in cluster-related code proposed
for inclusion in the mainline, so it was not entirely surprising to see an
OCFS2 patch set join the crowd. These patches have found their way directly
into the -mm tree for those wishing to try them out.
As a cluster filesystem, OCFS2 carries rather more baggage than a
single-node filesystem like ext3. It does have, at its core, an on-disk
filesystem implementation which is heavily inspired by ext3. There are
some differences, though: it is an extent-based filesystem, meaning that
files are represented on-disk in large, contiguous chunks. Inode numbers
are 64 bits. OCFS2 does use the Linux JBD layer for journaling, however,
so it does not need to bring along much of its own journaling code.
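To make the extent idea concrete, here is a minimal sketch of how an extent list maps a file-relative block number to a disk block. It does not reflect the actual OCFS2 on-disk format; the Extent fields and the lookup() helper are hypothetical names chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """One contiguous run of blocks (hypothetical layout, not OCFS2's)."""
    logical_start: int   # first file-relative block covered by this extent
    physical_start: int  # first on-disk block of the run
    length: int          # number of contiguous blocks in the run


def lookup(extents, logical_block):
    """Return the disk block for logical_block, or None if it is unmapped.

    The extents list is assumed to be sorted by logical_start.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None  # before the first extent
    extent = extents[i]
    offset = logical_block - extent.logical_start
    if offset >= extent.length:
        return None  # falls in a gap after this extent
    return extent.physical_start + offset


extents = [Extent(0, 1000, 8), Extent(8, 5000, 4)]
```

With this list, twelve file blocks are described by two records rather than twelve individual block pointers, which is the main attraction of the extent-based approach.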
To actually function in a clustered mode, OCFS2 must have information about
the cluster in which it is operating. To that end, it includes a simple
node information layer which holds a description of the systems which make
up the cluster. This data structure is managed from user space via configfs; the user-space tools, in turn, take
the relevant information from a single configuration file
(/etc/ocfs2/cluster.conf). It is not enough to know which nodes
should be part of a cluster, however: these nodes can come and go, and the
filesystem must be able to respond to these events. So OCFS2 also includes
a simple heartbeat implementation for monitoring which nodes are actually
alive. This code works by setting aside a special file; each node must
write a block to that file (with an updated time stamp) every so often. If
a particular block stops changing, its associated node is deemed to have
left the cluster.
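The heartbeat scheme described above can be sketched roughly as follows. This is a simplified model rather than the OCFS2 code: the HeartbeatMonitor class, the dead_after threshold, and the in-memory list standing in for the heartbeat file are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Track which nodes keep updating their block in a shared heartbeat area."""

    def __init__(self, node_count, dead_after):
        # dead_after: number of consecutive scans a block may stay
        # unchanged before its node is considered gone (assumed policy).
        self.dead_after = dead_after
        self.last_seen = [None] * node_count
        self.unchanged = [0] * node_count

    def scan(self, blocks):
        """Read one time stamp per node slot and return the set of live nodes."""
        live = set()
        for node, stamp in enumerate(blocks):
            if stamp != self.last_seen[node]:
                # The block changed since the last scan: the node is writing.
                self.last_seen[node] = stamp
                self.unchanged[node] = 0
            else:
                self.unchanged[node] += 1
            if stamp is not None and self.unchanged[node] < self.dead_after:
                live.add(node)
        return live
```

Each node would periodically write a fresh time stamp into its own slot, while every node runs scan() over all slots; a slot that stops changing for dead_after scans drops out of the live set.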
Another important component is the distributed lock manager. OCFS2
includes a lock manager which, like Red Hat's implementation covered last
week, is called "dlm" and implements a VMS-like interface. Oracle's
implementation is simpler, however (its core locking function only has
eight parameters...), and it lacks many of the fancier lock types and
functions of the Red Hat version. There is
also a virtual filesystem interface ("dlmfs") making locking functionality
available to user space.
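As a rough illustration of what a VMS-style lock manager does, the sketch below grants or refuses lock requests according to a mode compatibility table. It models only three of the classic VMS modes (null, protected read, exclusive); that subset is an assumption made to keep the example short. The LockResource class and its methods are hypothetical and bear no relation to the signature of Oracle's eight-parameter locking function.

```python
from enum import Enum


class Mode(Enum):
    NL = 0  # null: placeholder, conflicts with nothing
    PR = 1  # protected read: shared among readers
    EX = 2  # exclusive: a single holder

# For each requested mode, the set of modes other holders may already have.
COMPATIBLE = {
    Mode.NL: {Mode.NL, Mode.PR, Mode.EX},
    Mode.PR: {Mode.NL, Mode.PR},
    Mode.EX: {Mode.NL},
}


class LockResource:
    """A named lock resource with at most one granted mode per node."""

    def __init__(self, name):
        self.name = name
        self.granted = {}  # node name -> Mode

    def request(self, node, mode):
        """Grant the lock if it is compatible with every other holder."""
        for holder, held in self.granted.items():
            if holder != node and held not in COMPATIBLE[mode]:
                return False
        self.granted[node] = mode
        return True

    def release(self, node):
        self.granted.pop(node, None)
```

A real lock manager would queue a conflicting request and notify the current holders rather than simply returning False, and it would have to coordinate this state across nodes.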
There is a simple, TCP-based messaging system which is used by OCFS2 to
talk between nodes in a cluster.
The remaining code is the filesystem implementation itself. It has all of
the complications that one would expect of a high-performance filesystem
implementation. OCFS2, however, is meant to operate with a disk which is,
itself, shared across the cluster (perhaps via some sort of storage-area
network or multipath scheme). So each node on the cluster manipulates the
filesystem directly, but they must do so in a way which avoids creating
chaos. The lock manager code handles much of this: nodes must take out
locks on on-disk data structures before working with them.
There is more to it than that, however. There is, for example, a separate
"allocation area" set aside for each node in the cluster; when a node needs
to add an extent to a file, it can take it directly from its own allocation
area and avoid contending with the other nodes for a global lock. There
are also certain operations (deleting and renaming files, for example)
which cannot be done by a node in isolation. It would not do for one node
to delete a file and recycle its blocks if that file remains open on
another node. So there is a voting mechanism for operations of this type;
a node wanting to delete a file first requests a vote. If another node
vetoes the operation, the file will remain for the time being. Either way,
all nodes in the cluster can note that the file is being deleted and adjust
their local data structures accordingly.
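The delete vote can be sketched as follows. This is a toy model built only from the description above, not the OCFS2 protocol: the Node class, vote_on_delete(), and try_delete() are hypothetical names, and the messaging between nodes is replaced by direct method calls.

```python
class Node:
    """A cluster member that tracks which inodes it has open."""

    def __init__(self, name):
        self.name = name
        self.open_inodes = set()
        self.pending_delete = set()

    def vote_on_delete(self, inode):
        """Return True to approve the delete, False to veto it."""
        # Either way, note that the file is being deleted so that local
        # data structures can be adjusted.
        self.pending_delete.add(inode)
        return inode not in self.open_inodes


def try_delete(requester, peers, inode):
    """Poll every other node; the delete proceeds only if nobody vetoes."""
    # A list (not a generator) is built so that every peer is polled
    # even after the first veto.
    votes = [peer.vote_on_delete(inode) for peer in peers
             if peer is not requester]
    return all(votes)
```

If a peer still holds the file open, try_delete() returns False and the file remains for the time being, yet every peer has recorded the pending deletion.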
The code base as a whole was clearly written with an eye toward easing the
path into the mainline kernel. It adheres to the kernel's coding standards
and avoids the use of glue layers between the core filesystem code and the
kernel. There are no changes to the kernel's VFS layer.
Oracle's developers also appear to understand the current level of
sensitivity about the merging of cluster support code (node and lock
managers, heartbeat code) into the kernel. So they have kept their
implementation of these functions small and separate from the
filesystem itself. OCFS2 needs a lock manager now, for example, so it
provides one. But, should a different implementation be chosen for merging
at some future point, making the switch should not be too hard.
One assumes that OCFS2 will be merged at some point; adding a filesystem is
not usually controversial if it is implemented properly and does not drag
along intrusive VFS-layer changes. It is only one of many cluster
filesystems, however, so it is unlikely to be alone. The competition in
the cluster area, it seems, is just beginning.