April 22, 2009
This article was contributed by Goldwyn Rodrigues
The three R's of high availability are Redundancy, Redundancy and
Redundancy. However, on a typical setup built with commodity hardware,
adding redundancy beyond a certain point no longer increases the number
of nines in your uptime percentage (e.g. 99.999%).
Consider a simple example: an iSCSI server with the cluster nodes using
a distributed filesystem such as GFS2 or OCFS2. Even with
redundant power supplies and data channels on the iSCSI storage
server, there still exists a single point of failure: the storage.
The Distributed Replicated Block Device (DRBD) patch, developed by Linbit,
introduces duplicated block storage over the network with synchronous data
replication. If one of the storage nodes in the replicated
environment fails, the system still has a second block device to rely on and
can fail over safely. In short, DRBD can be thought of as RAID1 mirroring
over the network, built from a local disk and a disk on a remote node,
but with better integration with cluster software
such as heartbeat, and with efficient resynchronization thanks to the
exchange of dirty bitmaps and data generation identifiers. DRBD currently
works only on two-node clusters, though a hybrid setup can be used to
get around this limit. When both nodes of the cluster are up, writes are
replicated: they are sent to both the local disk and the peer node. Reads,
for efficiency, are served from the local disk.
The level of data coupling depends on which replication protocol is chosen
(a sketch of the completion rules appears after the list):
- Protocol A: A write is considered complete as soon as the
local disk write has completed and the data packet has been placed
in the send queue for the peer. If the node fails, data loss
may occur because the data destined for the remote node's disk may still
be sitting in the send queue. The data on the failover node remains
consistent, but not up to date. This protocol is usually used for
geographically separated nodes.
- Protocol B: Writes on the primary node are considered to be
complete as soon as the local disk write has completed and the
replication packet has reached the peer node. Data loss may occur in
case of simultaneous failure of both participating nodes, because the
in-flight data may not have been committed to disk.
- Protocol C: Writes are considered complete only after both the
local and the remote node's disks have confirmed them. There is no
data loss, making this a popular scheme for clustered nodes, but the
I/O throughput is dependent on the network bandwidth.
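To make the three completion rules concrete, here is a minimal userspace
sketch (not DRBD code; the names and structure are ours) of when a
replicated write may be reported as complete under each protocol:

    #include <stdbool.h>
    #include <stdio.h>

    enum drbd_protocol { PROTO_A, PROTO_B, PROTO_C };

    struct write_state {
        bool local_disk_done;  /* local block device finished the write    */
        bool queued_for_peer;  /* packet placed in the TCP send queue      */
        bool peer_received;    /* peer acknowledged receiving the packet   */
        bool peer_disk_done;   /* peer acknowledged the write hit its disk */
    };

    static bool write_completes(enum drbd_protocol p, const struct write_state *w)
    {
        switch (p) {
        case PROTO_A:  /* local write done, packet queued for the peer   */
            return w->local_disk_done && w->queued_for_peer;
        case PROTO_B:  /* local write done, peer has the packet in memory */
            return w->local_disk_done && w->peer_received;
        case PROTO_C:  /* both disks have confirmed the write             */
            return w->local_disk_done && w->peer_disk_done;
        }
        return false;
    }

    int main(void)
    {
        /* Local write done, packet queued and received, but the peer's
         * disk has not confirmed it yet. */
        struct write_state w = { true, true, true, false };
        printf("A:%d B:%d C:%d\n",
               write_completes(PROTO_A, &w),
               write_completes(PROTO_B, &w),
               write_completes(PROTO_C, &w));
        return 0;
    }

With the state shown in main(), the write counts as complete under
protocols A and B, but not yet under protocol C.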
DRBD classifies the cluster nodes as either "primary" or "secondary."
Primary nodes can initiate modifications or writes; secondary
nodes cannot. This means that a secondary DRBD node does not
provide any access and cannot even be mounted; read-only access is
disallowed for cache coherency reasons. The secondary node is present
mainly to act as the failover device in case of an error, and may be
promoted to primary depending on the configuration. Role assignment
is performed by the cluster management software.
There are different ways in which a node may be
designated as primary:
- Single Primary: The primary designation is given to one cluster
member. Since only one cluster member manipulates the data, this mode is
useful with conventional filesystems such as ext3 or XFS.
- Dual Primary: Both cluster nodes can be primary at the same time and are
allowed to modify the data. This mode is typically used with cluster-aware
filesystems such as OCFS2; a sample configuration is sketched below. The
current DRBD release supports a maximum of two primary nodes in a basic
cluster.
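As an illustration, dual-primary operation is enabled in the net section of
the DRBD configuration; the resource name is invented for the example, and
dual-primary mode is normally used together with protocol C:

    net {
        allow-two-primaries;   # permit both nodes to hold the primary role
    }

With that in place, "drbdadm primary r0" would be run on both nodes
(normally by the cluster manager) to bring them up as primaries.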
Worker Threads
Part of the communication between nodes is handled by dedicated threads,
to avoid deadlocks and complex design issues. The threads used for
communication are:
- drbd_receiver: Handles incoming packets. On
the secondary node, it allocates buffers, receives data blocks, and
issues write requests to the local disk. If it receives a write
barrier, it sleeps until all pending write requests have completed.
- drbd_sender: Sends data blocks in response to read
requests. This is done in a thread other than drbd_receiver
to avoid distributed deadlocks. If a resynchronization
process is running, its packets are also generated by this thread.
- drbd_asender: The acknowledgment sender. Disk drivers signal
request completion through interrupts, but sending data over
the network from an interrupt callback could block the handler.
Instead, the completion handler places the packet on a queue, which is
picked up by this thread and sent over the network; the sketch below
illustrates the pattern.
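The pattern is easy to show with a small userspace sketch (plain pthreads;
none of this is DRBD's actual code): the completion callback only queues an
acknowledgment and wakes a thread, which does the potentially blocking
network send.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ack { unsigned long long sector; struct ack *next; };

    static struct ack *ack_queue;
    static int shutting_down;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* Stand-in for the completion path that runs in interrupt context:
     * it must not block, so it only queues the ACK and wakes the sender. */
    static void write_completed(unsigned long long sector)
    {
        struct ack *a = malloc(sizeof(*a));
        a->sector = sector;
        pthread_mutex_lock(&lock);
        a->next = ack_queue;
        ack_queue = a;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    /* Stand-in for drbd_asender: drains the queue and does the slow send. */
    static void *asender(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!ack_queue && !shutting_down)
                pthread_cond_wait(&cond, &lock);
            struct ack *a = ack_queue;
            int exiting = shutting_down;
            ack_queue = NULL;
            pthread_mutex_unlock(&lock);
            while (a) {
                struct ack *next = a->next;
                printf("send ACK for sector %llu over the network\n",
                       a->sector);
                free(a);
                a = next;
            }
            if (exiting)
                return NULL;
        }
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, asender, NULL);
        for (unsigned long long s = 0; s < 4; s++)
            write_completed(s * 8);   /* pretend four writes completed */
        pthread_mutex_lock(&lock);
        shutting_down = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        pthread_join(t, NULL);
        return 0;
    }

Built with gcc's -pthread option, the sketch prints one "send ACK" line per
completed write, all of them emitted from the sender thread.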
Failures
DRBD requires a small reserve area for metadata, to handle post-failure
operations (such as synchronization) efficiently.
This area can be configured either on a separate device
(external metadata) or within the DRBD block device (internal
metadata). It holds the per-disk metadata, including
the activity log and the dirty bitmap (both described below).
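In the DRBD configuration this choice shows up as the meta-disk setting for
each backing disk; the device name below is purely illustrative:

    meta-disk internal;        # internal: metadata kept with the data
    meta-disk /dev/sdc1[0];    # external: slot 0 on a dedicated device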
Node Failures
If a secondary node dies, it does not affect the system as a whole, because
writes are not initiated by the secondary node. If the failed node is the
primary, data from writes which had not yet completed may be lost. To
address this, DRBD maintains an "activity log,"
a reserved area on the local disk which records
information about write operations that have not yet completed. The
information is kept per extent and maintained as a least recently used
(LRU) list.
Each change to the activity log causes a metadata update (a single-sector
write). The size of the activity log is configured by the user;
it is a tradeoff between minimizing metadata updates and the
resynchronization time after the crash of a primary node.
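A stripped-down sketch of the idea (our own code, not DRBD's data
structures): a small, fixed set of "active" extents is kept in LRU order,
and any change to that set costs a metadata write, which is why the log's
size matters.

    #include <stdio.h>

    #define AL_SLOTS 4   /* configurable in the real thing; tiny here */

    static unsigned int al[AL_SLOTS];  /* active extents, 0 = most recent */
    static int al_used;

    static void al_touch(unsigned int extent)
    {
        int i, found = 0;

        for (i = 0; i < al_used; i++) {
            if (al[i] == extent) {
                found = 1;
                break;
            }
        }
        if (!found) {
            if (al_used == AL_SLOTS) {
                /* A cold extent pushes out the least recently used one;
                 * this is also when that extent's part of the dirty
                 * bitmap can be brought up to date on disk. */
                printf("evict extent %u\n", al[AL_SLOTS - 1]);
                al_used--;
            }
            /* Every change to the set of active extents costs one
             * single-sector metadata write. */
            printf("activate extent %u (metadata write)\n", extent);
            i = al_used++;
        }
        /* Move the touched extent to the front (most recently used). */
        for (; i > 0; i--)
            al[i] = al[i - 1];
        al[0] = extent;
    }

    int main(void)
    {
        /* Extents touched by a stream of writes. */
        unsigned int writes[] = { 1, 2, 3, 1, 4, 5, 6 };
        for (unsigned int i = 0; i < sizeof(writes) / sizeof(writes[0]); i++)
            al_touch(writes[i]);
        return 0;
    }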
DRBD maintains a "dirty bitmap" in case it has to run without a peer node or
without a local disk. It describes the pages which have been dirtied by the
local node. Writes to the on-disk dirty bitmap are minimized by the
activity log: each time an extent is evicted from the activity log, the part
of the bitmap associated with it, which is no longer covered by the activity
log, is written to disk. Should a resynchronization become necessary, the
dirty bitmap is sent over the network to communicate which pages are dirty.
Bitmaps are compressed with run-length encoding before being sent, to reduce
network overhead; since most bitmaps are sparse, this proves quite
effective.
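Run-length encoding a mostly-clean bitmap is straightforward; the sketch
below (our own encoding, not DRBD's wire format) shows why a sparse bitmap
collapses to a handful of (length, value) pairs:

    #include <stdio.h>

    /* Emit (run length, bit value) pairs for a bitmap of nbits blocks. */
    static void rle_encode(const unsigned char *bits, unsigned long nbits)
    {
        unsigned long i = 0;

        while (i < nbits) {
            int val = (bits[i / 8] >> (i % 8)) & 1;
            unsigned long run = 1;

            while (i + run < nbits &&
                   (((bits[(i + run) / 8] >> ((i + run) % 8)) & 1) == val))
                run++;
            printf("%lu x %d\n", run, val);
            i += run;
        }
    }

    int main(void)
    {
        unsigned char bitmap[8] = { 0 };  /* 64 blocks, all clean...     */
        bitmap[3] |= 0x06;                /* ...except blocks 25 and 26  */
        rle_encode(bitmap, 64);           /* prints three pairs, not 64  */
        return 0;
    }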
DRBD synchronizes data once the crashed node comes back up, or in response
to data inconsistencies caused by an interruption in the link.
Synchronization is performed in linear order, by disk offset, matching the
disk layout of the consistent node. The synchronization rate can be limited
with the rate parameter in the DRBD configuration file.
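A minimal resource definition showing the rate setting might look like the
fragment below; the resource name, hostnames, addresses, and device paths
are invented for the example:

    resource r0 {
      protocol C;
      syncer {
        rate 40M;              # limit resynchronization to 40 MB/s
      }
      on alpha {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   192.168.1.1:7788;
        meta-disk internal;
      }
      on bravo {
        device    /dev/drbd0;
        disk      /dev/sda7;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }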
Disk Failures
When a local disk error occurs, the system can deal with it
in one of the following ways, depending on the configuration:
- detach: Detach the node from the backing device and continue in
diskless mode. In this situation, the device on the peer node becomes
the main disk. This is the recommended configuration for high availability.
- pass_on: Pass the error to the upper layers on a primary
node. The disk error is ignored, but logged, when the node
is secondary.
- call-local-io-error: Invokes a script. This mode
can be used to perform a failover to a "healthy" node and
automatically shift the primary designation to another node. A sample
configuration fragment follows the list.
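These three policies correspond to the on-io-error disk option in the
configuration; the handler script path below is purely illustrative:

    disk {
      on-io-error detach;      # or: pass_on, call-local-io-error
    }
    handlers {
      # invoked when call-local-io-error is selected
      local-io-error "/usr/local/sbin/drbd-io-error.sh";
    }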
Data Inconsistency Issues
In the dual-primary case, both nodes may write to the same disk sector,
making the data inconsistent. Writes to different offsets require no
synchronization. To avoid inconsistency, data
packets sent over the network are numbered sequentially to identify the
order of writes. However, there are still some corner-case
inconsistency problems the system can suffer from:
- Simultaneous writes by both nodes: in this situation, one node's write is
discarded. One of the primary nodes is marked with the
"discard-concurrent-writes" flag, which
causes it to discard write requests from the other node when it detects
simultaneous writes. The node with the flag set
sends a "discard ACK" to the other node, informing it that its write has
been discarded. The other node, on receiving the discard ACK, writes the
data from the first node to keep the two disks consistent. A sketch of this
resolution appears after the list.
- Local request while a remote request is in flight: this can happen when
the disk latency exceeds the network latency.
The local node writes to a given block and sends the write operation to the
other node. The remote node then acknowledges the completion of the
request and sends a new write of its own to the same block - all before the
local write has completed. In this case, the local node
keeps the new write request on hold until its local write is
complete.
- Remote request while local request is still pending: this situation
comes about if the network reorders packets, causing a remote write to a
given block to arrive before the acknowledgment of a previous,
locally-generated write. Once again, the receiving node will simply hold
the new data until the ACK is received.
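A small sketch (ours, not DRBD's code) of the first corner case shows how
the discard-concurrent-writes flag breaks the tie when both primaries write
the same sector at once:

    #include <stdbool.h>
    #include <stdio.h>

    struct node {
        const char *name;
        bool discard_concurrent_writes;   /* set on exactly one primary */
    };

    /* Called on a node that receives a peer write while its own write to
     * the same sector is in flight; returns true if the peer data wins. */
    static bool handle_conflicting_peer_write(struct node *self)
    {
        if (self->discard_concurrent_writes) {
            /* Our local data wins; the discard ACK tells the peer to
             * overwrite its copy with ours so both disks stay identical. */
            printf("%s: discard peer write, send discard ACK\n", self->name);
            return false;
        }
        /* The peer carries the flag, so its data wins everywhere. */
        printf("%s: apply peer data over the local write\n", self->name);
        return true;
    }

    int main(void)
    {
        struct node a = { "alpha", true }, b = { "bravo", false };
        handle_conflicting_peer_write(&a);
        handle_conflicting_peer_write(&b);
        return 0;
    }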
Conclusion
DRBD is not the only distributed storage implementation under development.
The Distributed Storage (DST) implementation contributed by Evgeniy Polyakov
and accepted into the staging tree takes a different approach.
DRBD is limited to two-node active clusters, while DST can handle
larger numbers of nodes. DST works on a client-server model, where
the storage is at the server end, whereas DRBD is peer-to-peer based
and designed for high availability rather than for distributing
storage. DST, on the other hand, is designed to aggregate storage,
with storage nodes which can be added as needed. DST has a pluggable
module which accepts different algorithms for mapping the storage
nodes into one cumulative store. One such algorithm could be mirroring,
which would provide the same basic replicated-storage capability as
DRBD.
The DRBD code is maintained in the git repository at
git://git.drbd.org/linux-2.6-drbd.git, under the "drbd" branch. It
incorporates the minor review comments posted on LKML
after Philipp Reisner released the patch set.
For further information, see the several PDF documents mentioned in the DRBD patch posting.