
A look inside the OCFS2 filesystem

September 1, 2010

This article was contributed by Goldwyn Rodrigues

The Oracle Cluster Filesystem (ocfs2) is a filesystem for clustered systems accessing a shared storage device, such as a disk exported over a Storage Area Network (SAN). It enables all nodes in the cluster to see the same files; changes to any file are immediately visible on the other nodes. It is the filesystem's responsibility to ensure that nodes do not corrupt data by writing into each other's files. To guarantee integrity, ocfs2 uses the Linux Distributed Lock Manager (DLM) to serialize events. However, a major goal of a clustered filesystem is to reduce cluster-locking overhead in order to improve overall performance. This article provides an overview of ocfs2 and how it is structured internally.

A brief history

Version 1 of the ocfs filesystem was an early effort by Oracle to create a filesystem for clustered environments. It was a basic filesystem targeted at supporting Oracle database storage, and it lacked most POSIX features due to its limited disk format. Ocfs2 was a development effort to turn this basic filesystem into a general-purpose filesystem. The ocfs2 source code was merged into the mainline kernel in 2.6.16; since then, a lot of features have been added to the filesystem which improve data storage efficiency and access times.

Clustering

Ocfs2 needs a cluster management system to handle cluster operations such as node membership and fencing. All nodes must have the same configuration, so each node knows about the other nodes in the cluster. There are currently two ways to handle cluster management for ocfs2:

  • Ocfs2 Cluster Base (O2CB) - This is the in-kernel implementation of cluster configuration management; it provides only the basic services needed to get a clustered filesystem running. Each node writes to a heartbeat file to let the others know it is alive. More information on running the ocfs2 filesystem using o2cb can be found in the ocfs2 user guide [PDF]; a sample configuration appears after this list. This mode does not have the capability of removing nodes from a live cluster and cannot be used for cluster-wide POSIX locks.

  • Linux High Availability - uses user-space tools, such as heartbeat and pacemaker, to perform cluster management. These packages are complete cluster-management suites which can be used for advanced cluster activities such as failover, STONITH (Shoot The Other Node In The Head - yes, computing terms can be barbaric), and migration of dependent services. This stack is also capable of removing nodes from a live cluster, and it supports cluster-wide POSIX locks, as opposed to node-local locks. More information about cluster management tools can be found at clusterlabs.org and linux-ha.org.
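
As an illustration of the first option, O2CB reads the cluster layout from /etc/ocfs2/cluster.conf on every node. The sketch below shows the general shape of a two-node configuration; the names and addresses are made up, and the user guide remains the authoritative reference for this format:

    cluster:
        node_count = 2
        name = democluster

    node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = node0
        cluster = democluster

    node:
        ip_port = 7777
        ip_address = 192.168.0.2
        number = 1
        name = node1
        cluster = democluster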

Disk format

Ocfs2 separates the way data and metadata are stored on disk. To facilitate this, it has two types of allocation units:
  • Metadata or simply "blocks" - the smallest addressable unit. These blocks contain the metadata of the filesystem, such as inodes, extent blocks, group descriptors, etc. Valid sizes range from 512 bytes to 4KB (in powers of two). Each metadata block contains a signature identifying what the block holds; this signature is cross-checked when a block of that type is read.

  • Data Clusters - data storage for regular files. Valid cluster sizes range from 4KB to 1MB (in powers of two). A larger data cluster reduces the size of filesystem metadata such as allocation bitmaps, making filesystem activities such as data allocation or filesystem checks faster. On the other hand, a large cluster size increases internal fragmentation. A large cluster size is recommended for filesystems storing large files such as virtual machine images, while a small cluster size is recommended for a filesystem holding lots of small files, such as a mail directory. Both sizes are fixed when the filesystem is created, as shown in the example below.
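
As an illustration, mkfs.ocfs2's -b and -C options select the metadata block size and the data cluster size (the device name and label here are placeholders):

    # 4KB metadata blocks, 1MB data clusters - a layout suited to
    # large files such as virtual machine images
    mkfs.ocfs2 -b 4K -C 1M -L vmstore /dev/sdb1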

Inodes

An inode occupies an entire block; the block number (in terms of the filesystem block size) doubles as the inode number. This organization can result in high disk-space usage on a filesystem with a lot of small files, so, to minimize that cost, ocfs2 packs the data of sufficiently small files into the inode block itself. This feature is known as "inline data." Inode numbers are 64 bits wide, which leaves enough room to address inodes on large storage devices.
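
Since the inode number is simply the block number, translating an inode number into a byte offset on the device is a single shift. A minimal sketch (the helper name is hypothetical, not ocfs2's actual code):

    /* Hypothetical illustration: because the inode number doubles as
     * the filesystem block number, locating an inode is one shift. */
    #include <stdint.h>

    static inline uint64_t inode_to_disk_offset(uint64_t ino,
                                                uint32_t block_bits)
    {
            /* block_bits is log2 of the block size: 12 for 4KB blocks */
            return ino << block_bits;
    }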

[Figure: ocfs2 inode layout]

Data in a regular file is maintained in a B-tree of extents, the root of which is the inode. The inode holds a list of extent records which may point either to data extents or to extent blocks (the intermediate nodes of the tree). A special field called l_tree_depth records the depth of the tree; a value of zero indicates that the blocks pointed to by the extent records are data extents. Each extent record contains the offset from the start of the file, in clusters, which determines the path to take when traversing the B-tree to find the block to be accessed.
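
In simplified form, the on-disk extent structures look roughly like this (a sketch abridged from ocfs2's disk-format headers, not the verbatim kernel definitions):

    /* Abridged sketch of ocfs2's extent structures; the __le types
     * are the kernel's little-endian on-disk integer types. */
    #include <linux/types.h>

    struct ocfs2_extent_rec {
            __le32 e_cpos;          /* offset into the file, in clusters */
            __le32 e_clusters;      /* length of the extent, in clusters */
            __le64 e_blkno;         /* data extent, or next-level extent
                                       block, depending on tree depth */
    };

    struct ocfs2_extent_list {
            __le16 l_tree_depth;    /* 0: l_recs[] points at data extents */
            __le16 l_count;         /* slots available in l_recs[] */
            __le16 l_next_free_rec; /* next unused slot */
            struct ocfs2_extent_rec l_recs[];
    };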

The basic unit of locking is the inode. Locks are granted on a special DLM data structure known as a lock resource; for any access to a file, the process must obtain a DLM lock on the corresponding lock resource across the cluster. The DLM offers six lock modes to distinguish between types of operation; of these, ocfs2 uses only three: exclusive, protected read, and null. Each inode maintains three lock resources for different operations:

  • read-write lock resource: serializes writes when multiple nodes perform I/O on the same file at the same time.

  • inode lock resource: is used for operations on the inode's metadata.

  • open lock resource: is used to detect deletes of a file. When a file is opened, this lock resource is acquired in protected-read mode. A node intending to delete the file requests an exclusive lock; if the request succeeds, no other node is using the file and it can safely be deleted. If it fails, the inode is treated as an orphan file (discussed later). A sketch of this logic appears after the list.
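
In outline, the unlink-time check might look like the following sketch; the helper names are hypothetical stand-ins, not ocfs2's actual API:

    #include <stdbool.h>

    struct inode;   /* stand-in for the kernel's inode structure */

    /* Hypothetical helpers wrapping the DLM and directory operations. */
    bool try_open_lock_exclusive(struct inode *inode);
    void delete_from_disk(struct inode *inode);
    void move_to_orphan_dir(struct inode *inode);

    /* Called when the last link to a file is removed. */
    void delete_last_link(struct inode *inode)
    {
            /* Every node holding the file open holds its open lock in
             * protected-read mode, so an exclusive request succeeds
             * only if no other node has the file open. */
            if (try_open_lock_exclusive(inode))
                    delete_from_disk(inode);   /* no remote users */
            else
                    move_to_orphan_dir(inode); /* reaped by a later scan */
    }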

Directory

Directory entries are stored as name-inode pairs in blocks known as directory blocks. Directory blocks are stored and accessed in the same way as regular file data, but they are allocated a full data cluster at a time. Since a directory block is considered a metadata block, the first allocation uses only part of the cluster; as the directory grows, the remaining blocks of the cluster are used until the cluster is full.
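
Each name-inode pair is, roughly, an ext2-style record; in sketch form (abridged, not the verbatim kernel definition):

    #include <linux/types.h>

    /* Abridged sketch of one entry in a directory block. */
    struct ocfs2_dir_entry {
            __le64 inode;           /* inode number of the entry */
            __le16 rec_len;         /* offset to the next entry */
            __u8   name_len;        /* length of the name */
            __u8   file_type;
            char   name[];          /* the name, not NUL-terminated */
    };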

[Figure: ocfs2 directory layout on disk]

A relatively new feature is the indexing of directory entries for faster lookups. Ocfs2 maintains a separate index tree keyed on the hash of the entry name; the hash index points to the directory block in which the directory entry can be found. Once the directory block is read, the entry is located with a linear search.
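
The lookup path, in outline (a sketch; all of the helper names are hypothetical):

    #include <linux/types.h>

    struct inode;               /* stand-ins for kernel structures */
    struct buffer_head;
    struct ocfs2_dir_entry;

    /* Hypothetical helpers for each step of an indexed lookup. */
    __u32 name_hash(const char *name, int len);
    __u64 index_lookup(struct inode *dir, __u32 hash);
    struct buffer_head *read_dir_block(struct inode *dir, __u64 blkno);
    struct ocfs2_dir_entry *scan_block(struct buffer_head *bh,
                                       const char *name, int len);

    struct ocfs2_dir_entry *dir_lookup(struct inode *dir,
                                       const char *name, int len)
    {
            __u32 hash = name_hash(name, len);      /* hash the name   */
            __u64 blkno = index_lookup(dir, hash);  /* index -> block  */
            struct buffer_head *bh = read_dir_block(dir, blkno);

            return scan_block(bh, name, len);       /* linear search   */
    }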

A special directory trailer is placed at the end of each directory block to hold additional data about that block. It keeps track of the free space in the directory block, for faster free-space lookup when inserting directory entries. The trailer also contains the checksum for the directory block, which is used by the metaecc feature (discussed later).

Filesystem Metadata

A special system directory (//) contains all the metadata files for the filesystem; this directory is not accessible from a regular mount, and the // notation is understood only by the debugfs.ocfs2 tool. Files in the system directory, known as system files, differ from regular files both in terms of how they store information and of what they store.
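
For example, the system directory can be listed with debugfs.ocfs2 (the device name is a placeholder):

    # list the contents of the system directory on an ocfs2 volume
    debugfs.ocfs2 -R "ls //" /dev/sdb1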

An example of a system file is the slotmap, which defines the mapping of nodes to slots in the cluster. A node joins the cluster by providing its unique DLM name; the slot map gives it a slot number, and the node inherits all system files associated with that slot. The slot assignment is not persistent across boots, so a node may inherit the system files of another node. All node-associated files are suffixed with the slot number to distinguish the files of different slots.

A global bitmap file in the system directory keeps a record of the allocated blocks on the device. Each node also maintains a "local allocations" system file, which manages chunks of blocks obtained from the global bitmap; maintaining local allocations decreases contention for the global bitmap.

The per-node allocators are divided as follows:

  • inode_alloc: allocates inodes for the local node.

  • extent_alloc: allocates extent blocks for the local node. Extent blocks are the intermediate nodes in the B-tree storage of files.

  • local_alloc: allocates data clusters for regular file data.

[Figure: allocator layout]

Each allocator is associated with an inode and maintains allocations in units known as "block groups." Each block group is preceded by a group descriptor containing details about the group, such as the count of free units and allocation bitmaps. The allocator inode contains a chain of group-descriptor block pointers; when a chain is exhausted, further group descriptors are appended to the existing ones as a linked list. Think of it as an array of linked lists. A new group descriptor is added to the smallest chain, so that the number of hops required to reach any allocation unit stays small.
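
The "array of linked lists" shape shows up directly in the on-disk structures; a sketch (abridged from ocfs2's disk-format headers, not the verbatim definitions):

    #include <linux/types.h>

    /* Abridged sketch of the chain allocator's on-disk records. */
    struct ocfs2_chain_rec {
            __le32 c_free;          /* free units in this chain */
            __le32 c_total;         /* total units in this chain */
            __le64 c_blkno;         /* first group descriptor in chain */
    };

    struct ocfs2_chain_list {
            __le16 cl_count;         /* slots available in cl_recs[] */
            __le16 cl_next_free_rec; /* next unused slot */
            struct ocfs2_chain_rec cl_recs[];  /* the array of chains */
    };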

Things get complicated when allocated data blocks are freed, because those blocks could belong to the allocation map of another node. To resolve this problem, a "truncate log" records the blocks which have been freed locally but not yet returned to the global bitmap. Once the node gets a lock on the global bitmap, the blocks in the local truncate log are freed.

A file is not physically deleted until all processes accessing it have closed it. Filesystems such as ext3 maintain an orphan list of files which have been unlinked but are still in use; ocfs2 maintains such a list as well, to handle orphan inodes. Things are a bit more complex, however, because a node must check that a file to be deleted is not in use anywhere in the cluster. This check is coordinated using the open lock resource, as described above: whenever the last link to a file is removed, the node determines whether another node is using the file by requesting an exclusive lock on that resource. If the file is in use, it is moved to the orphan directory and marked with the OCFS2_ORPHANED_FL flag. The orphan directory is later scanned for files no longer accessed by any node, which are then physically removed from the storage device.

Ocfs2 maintains a journal to deal with unexpected crashes; it uses the kernel's JBD2 layer for journaling. Journal files are maintained per node, covering all I/O performed locally. If a node dies, it is the responsibility of the other nodes in the cluster to replay the dead node's journal before proceeding with any operations.

Additional Features

Ocfs2 has a couple of other distinctive features that it can boast about. They include:

  • Reflinks: a feature supporting snapshotting of individual files using copy-on-write (COW). Currently, a system call interface, to be called reflink() or copyfile(), is being discussed upstream. Until the system call is finalized, users can access this feature via the reflink tool, which uses an ioctl() call to perform the snapshotting.

  • Metaecc: an error-detection and correction feature for metadata, combining a CRC32 checksum with an error-correcting code. The code warns if the computed checksum differs from the stored one and remounts the filesystem read-only to avoid further corruption; it is also capable of correcting single-bit errors on the fly. A special data structure, ocfs2_block_check, is embedded in most metadata structures to hold the check values; a sketch follows this list.
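
The check structure is tiny; in sketch form (abridged from the on-disk format, not the verbatim kernel definition):

    #include <linux/types.h>

    /* Abridged sketch of the metaecc check structure embedded in
     * most ocfs2 metadata blocks. */
    struct ocfs2_block_check {
            __le32 bc_crc32e;       /* CRC32 of the block's contents */
            __le16 bc_ecc;          /* parity bits; enough to correct
                                       a single-bit error */
            __le16 bc_reserved1;
    };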

Ocfs2 developers continue to add features to keep it up to par with other new filesystems. Some features to expect in the near future are delayed allocation, online filesystem checking, and defragmentation. Since one of the main goals of ocfs2 is to support a database, file I/O performance is a priority, making it one of the best-performing filesystems for clustered environments.

[Thanks to Mark Fasheh for reviewing this article.]



A look inside the OCFS2 filesystem

Posted Sep 2, 2010 9:11 UTC (Thu) by Cato (subscriber, #7643)

Good article - however, presumably this project must be considered subject to future restrictions or litigation by Oracle, as with the closing of OpenSolaris and the Java patent lawsuit against Android.

A look inside the OCFS2 filesystem

Posted Sep 3, 2010 16:42 UTC (Fri) by mrjk (subscriber, #48482)

This is GPL'd code released under that license by Oracle. What do you think they could do? They can't assert patent rights due to their own release.

A look inside the OCFS2 filesystem

Posted Sep 3, 2010 21:38 UTC (Fri) by Tomasu (subscriber, #39889)

Tell that to SCO.

A look inside the OCFS2 filesystem

Posted Sep 6, 2010 1:40 UTC (Mon) by njs (guest, #40338)

SCO never tried to assert patent rights to anything.

And what they *did* try to assert just ended up with them laughed out of court and bankrupt; if there was a strategy there, it was a mixture of wild optimism and cynical conmanship. What would Oracle gain from that sort of nonsense?

A look inside the OCFS2 filesystem

Posted Sep 13, 2010 16:15 UTC (Mon) by topher (guest, #2223)

Oracle's recent behavior has served as an excellent reminder that they are a purely profit driven corporation with no ethics or qualms about screwing over competitors, or the community, if they feel it is in their best interest.

However, in this specific case, I don't think there's a concern. Oracle pushed hard to get OCFS2 into the mainline Linux kernel, and have benefited from it being there. OCFS2 is the recommended file system for Oracle RAC installations, and that is unlikely to change in the future.

I wouldn't recommend turning my back on Oracle, but as long as they can gain more from OCFS2 being freely available and unencumbered than they can otherwise, I expect it'll remain that way. And having seen what they charge for Oracle RAC licensing, they're making money on OCFS2, indirectly, quite well.

A look inside the OCFS2 filesystem

Posted Sep 17, 2010 5:11 UTC (Fri) by kevinm (guest, #69913)

I wonder why the special lock handling was needed for deleting - couldn't a node holding a file open just create a link to it in the // system file area? Then you can just have the normal "if you're removing the last link, delete it" behaviour.

A look inside the OCFS2 filesystem

Posted Sep 17, 2010 18:19 UTC (Fri) by sniper (subscriber, #13219)

How do you know that last link? The fs does not maintain a cluster-wide usage count. Too expensive. Also, one has to account for node crashes. What if the node that had the possible last link dies?
