October 7, 2009
This article was contributed by Neil Brown
For many years, Linux has had two separate subsystems for managing
virtual block devices: devices which combine storage
from one or more other devices in various ways to provide
improved performance, flexibility, capacity, or redundancy.
These two are DM (which stands for Device Mapper) and MD (which
might stand for Multiple Devices or Meta Disk and is only by pure
coincidence the reverse of DM).
For nearly as long there have been suggestions that having two
frameworks is a waste and that they should be unified. However, little
visible effort has been made toward this unification, and such
efforts as have been made have not yielded any lasting success.
The most united thing about the two is that they have a common
directory in the Linux kernel source tree (drivers/md); this is
more a confusion than a unification. The two subsystems have both
seen ongoing development side by side, each occasionally gaining
functionality that the other has and so, in some ways, becoming similar.
But similarity is not unity, rather it serves to highlight the lack of
unity as it is no longer function that keeps the two separate,
only form.
Exploring why unification has never happened would be an interesting
historical exercise that would need to touch on the personalities of
the people involved, the drift in functionality between the two
systems which started out with quite different goals, the differing
perceptions of each by various members of the community, and the
technological differences that would need to be resolved.
Not being an historian, your author only feels competent to comment on
that last point, and, as it is the one where a greater understanding is
most likely to aid unification, this article will endeavor to expose
the significant technological issues that keep the two separate. In
particular, we will explore the weaknesses in each infrastructure.
Where a system has strengths, they are likely to be copied, thus creating
more uniformity. Where it has weaknesses, they are likely to be
assiduously avoided by others, thus creating barriers.
Trying to give as complete a picture as possible, we will explore more
than just DM and MD. Loop, NBD, and DRBD provide similar
functionality behind their own single-use infrastructure; exploring
them will ensure that we don't miss any important problems or needs.
The flaws of MD
Being the lead developer of MD for some years, your author feels
honour bound to start by identifying weaknesses in that system.
One of the more ugly aspects of MD is the creation of a new array
device. This is triggered by simply opening a device special file,
typically in /dev. In the old days, when we had a fairly static /dev
directory, this seemed a reasonable approach. It was simply necessary
to create a bunch of entries in /dev (md0, md1, md2, ...) with appropriate
major and minor numbers at the same time that the other static
content of /dev was created. Then, whenever one of those entries was
opened, the internal data structures would spontaneously be created
so that the details of the device could be filled in.
However, with the more modern concept of a dynamic /dev, reflecting the
fact that the set of devices attached to a given system is quite
fluid, this approach doesn't fit very well. udev, which typically manages
/dev, only creates entries for devices that the kernel knows about. So
it will not create any md devices until the kernel believes them to
exist. They won't exist until the device file has been created and
can be opened - a classic catch-22 situation.
mdadm, the main management tool for MD, works around this problem by creating
a temporary device special file just so that it can open it and thereby
create the device. This works well enough, but is, nonetheless, quite
ugly. The internal implementation is particularly ugly and it was
only relatively recently that the races inherent in destroying an MD
device (which could be recreated at any moment by user space) were
closed so that MD devices don't have to exist forever.
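To make the workaround concrete, here is a minimal sketch of the sort of
thing mdadm must do; the temporary path, the choice of minor number, and
the error handling are illustrative assumptions rather than mdadm's
actual code.

    /* Illustrative sketch of the workaround: create a block special
     * file by hand, open it so the kernel instantiates the array,
     * then throw the temporary node away. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *tmp = "/tmp/.md-create";   /* hypothetical temporary node */
        dev_t dev = makedev(9, 0);             /* MD's major number is 9; minor 0 is md0 */
        int fd;

        /* udev will not create this node for us, because the kernel
         * does not yet know the device exists. */
        if (mknod(tmp, S_IFBLK | 0600, dev) < 0) {
            perror("mknod");
            return 1;
        }

        /* Merely opening the node causes the kernel to create the
         * (empty, unconfigured) md0 device. */
        fd = open(tmp, O_RDWR);
        if (fd >= 0)
            close(fd);

        unlink(tmp);    /* the temporary node has served its purpose */
        return 0;
    }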
A closely related problem with MD is that the block device
representing the array appears before the array is configured or has
any data in it. So when udev first creates /dev/md0, an attempt to
open and read from it (to find out whether a filesystem is
stored there, for example) will find no content. It is only after the
component devices have been attached to the array and it has been
fully configured that there is any point in trying to read data from
the array.
This initial state, where the device exists but is empty, is somewhat
like the case of removable-media devices, and can be managed along
the same lines as those: we could treat the array as media that can
spontaneously appear. However, MD is, in other ways, quite unlike
removable media (there is no concept of "eject"), and it would
generally cause less confusion if MD devices appeared fully configured so
they looked more like regular
disk drive devices.
A problem that has only recently been addressed is the
fact that MD manages the metadata for the arrays internally.
The kernel module knows all about the layout of data in the superblock
and updates it as appropriate. This makes it easy to implement,
but not so easy to extend. Due to the lack of any real standard, there
are many vendor-specific metadata layouts that all can be used to
describe the same sort of array. Supporting all of those in the
kernel would bloat it unnecessarily, and supporting
them in user space requires that information about required updates be
reported to user space in a reliable way.
As mentioned, this problem has recently been addressed, so it is now
quite possible to manage vendor-specific metadata from user space.
It is still worth noting, though, as one of the problems that has stood in the
way of earlier attempts at DM/MD integration: DM does not
manage metadata at all, leaving it up to user-space tools.
The final flaw in MD to be exposed here is the use and nature of
the ioctl() commands that are used to configure and manage MD arrays.
The use of ioctl() has been frowned upon in the Linux community for some
years. There are a number of reasons for this. One is that
strace cannot decode newly-defined ioctls, so the use of
ioctl() can make a program's behaviour harder to observe. Another is
that it is a binary interface (typically passing C structures around)
and so, when Linux is configured to support multiple ABIs (e.g. a 32-bit
and a 64-bit version), there is often a need to translate the binary
structure from one ABI to the other (see the nearly 3000 lines in
fs/compat_ioctl.c).
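The problem is easy to see with a small example. The structure below is
invented purely for illustration - it is not a real MD or DM ioctl
argument - but it shows why the same C structure has different layouts
under different ABIs.

    #include <stdio.h>

    /* Invented for illustration only - not a real kernel structure. */
    struct example_ioctl_arg {
        unsigned long size;   /* 4 bytes on a 32-bit ABI, 8 bytes on 64-bit */
        void *buf;            /* likewise 4 versus 8 bytes */
        int flags;            /* 4 bytes on both */
    };

    int main(void)
    {
        /* Typically prints 12 when built for a 32-bit ABI and 24
         * (including padding) for a 64-bit ABI.  A 64-bit kernel
         * receiving the 32-bit layout must translate it, which is the
         * job of the code in fs/compat_ioctl.c. */
        printf("sizeof(struct example_ioctl_arg) = %zu\n",
               sizeof(struct example_ioctl_arg));
        return 0;
    }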
In the case of MD, the ioctl() interface is not very extensible. The
command for configuring an array allows only a "level", a "layout", and
a "chunksize" to be specified. This works well enough for RAID0,
RAID1, and RAID5, but even with RAID10 we needed to encode multiple
values into the "layout" field which, while effective, isn't elegant.
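A rough sketch of what this looks like from user space follows. The
field and constant names are those of mdu_array_info_t and
SET_ARRAY_INFO from <linux/raid/md_u.h>, but the example is heavily
abbreviated - real code such as mdadm fills in more fields and performs
several further steps - so treat it as illustrative only.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>

    /* Sketch: describe an array through MD's fixed-field ioctl. */
    static int configure_raid5(const char *md_dev)
    {
        mdu_array_info_t info;
        int fd, ret;

        fd = open(md_dev, O_RDWR);
        if (fd < 0)
            return -1;

        memset(&info, 0, sizeof(info));
        info.major_version = 0;        /* 0.90 metadata */
        info.minor_version = 90;
        info.level = 5;                /* RAID5 */
        info.layout = 2;               /* left-symmetric parity layout */
        info.chunk_size = 64 * 1024;   /* 64KiB chunks */
        info.raid_disks = 3;

        /* Everything about the array must be squeezed into these few
         * numeric fields - the inflexibility described above. */
        ret = ioctl(fd, SET_ARRAY_INFO, &info);
        close(fd);
        return ret;
    }

    int main(void)
    {
        return configure_raid5("/dev/md0") < 0 ? 1 : 0;
    }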
In the last few years MD has grown a separate configuration interface
via a collection of attribute files exposed in sysfs. This is much
more extensible, and there are a growing number of features of MD
which require the sysfs interface. However, even here there is still
room for improvement.
The MD attribute files are stored in a subdirectory of the block
device directory (e.g. /sys/block/md0/md/). While this seems
natural, it entrenches the above-mentioned problem that the block
device must exist before the array can be configured. If we wanted to
delay creation of the block device until the array is ready to serve
data, we would need to store these attribute files elsewhere in
sysfs.
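As a sketch of the sysfs style of configuration, the fragment below
writes to a few of the attribute files under /sys/block/md0/md/. The
attribute names used here do exist, but a complete array assembly
involves more steps (adding component devices, writing to array_state,
and so on), so this only illustrates the shape of the interface - and
note that the md0 block device must already exist for the directory to
be there at all.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int write_attr(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);
        ssize_t n;

        if (fd < 0)
            return -1;
        n = write(fd, val, strlen(val));
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(void)
    {
        /* Each setting is an independent, self-describing text file,
         * so new attributes can be added without breaking anything. */
        write_attr("/sys/block/md0/md/level", "raid5");
        write_attr("/sys/block/md0/md/chunk_size", "65536");
        write_attr("/sys/block/md0/md/raid_disks", "3");
        return 0;
    }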
The failings of DM
DM has a very different heritage than MD and, while it shares some of
the flaws of MD, it avoids others.
DM devices do not need to exist in /dev before they can be created.
Rather there is a dedicated "character device" which accepts DM
ioctl() commands, including the command to create a new device. Thus,
the catch-22 problem from which MD suffers is not present in DM.
It has
been suggested that MD should take this approach too. However, while
it does solve one problem, it still leaves the problem of using
ioctl(). There doesn't seem to be much point in making a significant change
to a subsystem unless the result avoids all known problems. So, while
waiting for a perfect solution, no such small steps have been taken to
bring MD and DM closer together.
The related issue of a block device existing before it is configured
is still present in DM, though separating the creation of the DM
device from the creation of the block device would be much easier in
DM. This is because, as mentioned, with DM all configuration happens
over the character device whereas with MD, the configuration
happens via the block device itself, so it must exist before it can be
configured.
While DM also uses ioctl() commands, which could be seen as a weakness,
the commands chosen are much more extensible than those used by MD.
The ioctl() command to configure a device essentially involves
passing a
text string to the relevant module within DM, which interprets this
string in any way it likes. So DM is not limited to the fields that
were thought to be relevant when DM was first designed.
Metadata management with DM is very different than with MD. In the
original design, there was never any need for the kernel module to
modify metadata, so metadata management was left entirely in user space
where it belongs. More recently, with RAID1 and RAID5 (which is still
under development), the kernel is required to synchronously update the
metadata to record a device failure. This requires a degree of
interaction between the kernel and user space which has had to be added.
The main problem with the design of DM is the fact that it has two
layers: the table layer and the target layer. This undoubtedly comes
from the original focus of DM, which was logical volume management (LVM), and it fits
that focus quite well. However, it is an unnecessary layering and just
gets in the way of non-LVM applications.
A "target" is a concept internal to DM, which is the abstraction that each
different module presents. So striping, raid1, multipath, etc. each
present a target, and these targets can be combined via a table into a
block device.
A "table" is simply a list of targets each with a size and an offset.
This is analogous to the "linear" module in MD or what is elsewhere
described as concatenation. The targets are essentially joined
end-to-end to form a single larger block device.
This contrasts with MD where each module - raid0, raid1, or multipath, for
example -
presents a block device. This block device can be used as-is, or it
can be combined with others, via a separate array, into a
single larger block device.
To highlight the effect of this layering a little more, suppose we
were to have two different arrays made of a few devices. In one array
we want the data striped across the devices. In the other we
lay the data out filling the first device first and then moving on to
the next device.
With MD, the only difference between these two would be the choice
of "raid0" or "linear" as the module to manage them. With DM, the
first step would involve including all the devices in a single "stripe"
target, and then placing that target as the sole entry in a table.
The second would involve creating a number of "linear" targets,
one for each device, and then combining them into a table with
multiple entries.
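A rough sketch using libdevmapper makes the contrast visible; the device
names and sizes (in 512-byte sectors) are invented for the example, and
error handling is omitted for brevity.

    #include <libdevmapper.h>

    /* Striped layout: a single "striped" target is the whole table. */
    static void create_striped(void)
    {
        struct dm_task *dmt = dm_task_create(DM_DEVICE_CREATE);

        dm_task_set_name(dmt, "example-striped");
        /* args: <#stripes> <chunk (sectors)> <dev1> <off1> <dev2> <off2> */
        dm_task_add_target(dmt, 0, 2097152, "striped",
                           "2 128 /dev/sdb 0 /dev/sdc 0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);
    }

    /* Concatenation: several "linear" targets, one per device, joined
     * end-to-end by the table itself. */
    static void create_concat(void)
    {
        struct dm_task *dmt = dm_task_create(DM_DEVICE_CREATE);

        dm_task_set_name(dmt, "example-concat");
        dm_task_add_target(dmt, 0,       1048576, "linear", "/dev/sdb 0");
        dm_task_add_target(dmt, 1048576, 1048576, "linear", "/dev/sdc 0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);
    }

    int main(void)
    {
        create_striped();
        create_concat();
        return 0;
    }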
Having this internal abstraction of a "target" serves to insulate and
isolate DM from the block device layer, which is the common abstraction
used by other virtual devices. A good example of this separation is
the online reconfiguration functionality that DM provides. The
boundary between the table and the targets allows DM to capture new
requests in the table layer while allowing the target layer to drain
and become idle, and then to potentially replace all the targets with
different targets before releasing the requests that have been held
back.
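The user-space view of that sequence, sketched with libdevmapper below,
is a suspend, a table reload, and a resume; the device name and the
replacement table are invented for the example and error handling is
omitted.

    #include <libdevmapper.h>

    static void run_simple(int type, const char *name)
    {
        struct dm_task *dmt = dm_task_create(type);

        dm_task_set_name(dmt, name);
        dm_task_run(dmt);
        dm_task_destroy(dmt);
    }

    int main(void)
    {
        struct dm_task *dmt;

        /* 1. Suspend: new requests are captured in the table layer
         *    while the existing targets drain and become idle. */
        run_simple(DM_DEVICE_SUSPEND, "example-dev");

        /* 2. Load an entirely different set of targets behind the
         *    same block device. */
        dmt = dm_task_create(DM_DEVICE_RELOAD);
        dm_task_set_name(dmt, "example-dev");
        dm_task_add_target(dmt, 0, 1048576, "linear", "/dev/sdd 0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        /* 3. Resume: the held-back requests are released to the new
         *    targets; the block device itself never went away. */
        run_simple(DM_DEVICE_RESUME, "example-dev");
        return 0;
    }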
Without that internal "target" layer, that functionality would need to
be implemented in the block layer on its boundary with the driver code
(i.e. in generic_make_request() and
bio_endio()). Doing this would be more effort (i.e. DM would not
benefit from insulation) and it would then be more generally useful
(i.e. DM would not be so isolated). Many people have wanted to be
able to convert a normal device into a degraded RAID1 "array" or to enable
multipath support on a device without first unmounting the filesystem
which was mounted directly from one of the paths. If online
reconfiguration were supported at the block layer level, these changes
would become possible.
The difference of DRBD
DRBD, the Distributed Replicated Block Device, is the most complex of
the virtual block devices that do not aim to provide a general
framework. It is not yet included in the mainline kernel, but it may
be merged for 2.6.33.
Its configuration mechanism is similar to that of DM in a number of
ways. There is a single channel which can be used to create and then
manage the block devices. The protocol used over this channel is
designed to be extensible, though the current definitions are very
much focused around the particular needs of DRBD (as would be
expected), so how easy it might be to extend to a different
sort of array is not immediately clear.
Where DM uses ioctl() with string commands over a dedicated
character device, DRBD uses a packed binary protocol over a netlink
connection. This is essentially a socket connection between the
kernel module and a user-space management program which carries binary
encoded messages back and forth. This is probably no better or worse
than ioctl(); it is simply different.
Presumably it was chosen because
there is general bad feeling about ioctl(), but no such bad feeling
about netlink.
Linus, however, doesn't seem
keen on either approach.
DRBD appears to share metadata management between the kernel module
and user space. Metadata which describes a particular DRBD
configuration is created and interpreted by user-space tools and the
information that is needed by the kernel is communicated over the
netlink socket.
DRBD uses other metadata to describe the current replication state of
the system - which blocks are known to be safely replicated and which
are possibly inconsistent between the replicas for some reason. This
metadata (an activity log and a bitmap) is managed by the kernel,
presumably for performance reasons.
This sharing of responsibility makes a lot of sense as it allows the
performance-sensitive portions to remain in the kernel but still leaves a
lot of flexibility to support different metadata formats.
This approach could be improved even more by making the bitmap and activity log
into independent modules that can be used by other virtual devices.
Each of DM, MD, and DRBD has very similar mechanisms for
tracking inconsistencies between component devices; this is
possibly the most obvious area where sharing would be beneficial.
Loop, NBD and the purpose of infrastructure
Partly to emphasize the fact that it isn't necessary to use a
framework to have a virtual block device, loop and NBD (the Network
Block Device) are worth considering.
While loop doesn't appear to aim to provide a framework for a
multiplicity of virtual devices, it nonetheless combines three different
functions into one device. It can make a regular file look like a
block device, it can provide primitive partitioning of a different
block device, and it can provide encryption and decryption so that an
encrypted device can be accessed.
Significantly, these are each functions that were subsequently added
to DM, thus highlighting the isolating effect of the design of DM.
NBD is much simpler in that it has just one function: it provides
a block device for which all I/O requests are forwarded over a network
connection to be serviced on - normally - a different host. It is
possibly most instructive as an example of a virtual block device that
doesn't need any surrounding framework or infrastructure.
Two areas where DM or MD devices make use of an infrastructure, while
Loop and NBD need to fend for themselves, are in the creation of new
devices and the configuration of those devices.
NBD takes a very simple approach of creating a predefined number of
devices at module initialization time and not allowing any more.
Loop is a little more flexible and uses the same mechanism as MD,
largely provided by the block layer, to create loop devices when the
block device special file is opened. It does not allow these to be
deleted until the module is unloaded, usually at system shutdown
time. This architecture suggests that some infrastructure could be helpful for
these drivers, and that the best place for that infrastructure could
well be in the block layer, and thus shared by all devices.
For configuration, both Loop and NBD use a fairly ad hoc
collection of ioctl() commands. As we have already observed, this is
both common and problematic. They could both benefit from a more
standardized and transparent configuration mechanism.
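Loop's interface gives a flavour of this. The sketch below attaches a
backing file to /dev/loop0 with the LOOP_SET_FD ioctl from
<linux/loop.h>; the paths are illustrative, and real tools such as
losetup also set further status fields.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/loop.h>

    int main(void)
    {
        int loop_fd = open("/dev/loop0", O_RDWR);
        int file_fd = open("/tmp/backing.img", O_RDWR);

        if (loop_fd < 0 || file_fd < 0) {
            perror("open");
            return 1;
        }

        /* One driver-specific binary ioctl per configuration step -
         * exactly the sort of ad hoc interface described above. */
        if (ioctl(loop_fd, LOOP_SET_FD, file_fd) < 0) {
            perror("LOOP_SET_FD");
            return 1;
        }

        close(file_fd);
        close(loop_fd);
        return 0;
    }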
It might be appropriate to ask at this point why there is any need for
subsystem infrastructure such as DM and MD. Why not simply follow the
pattern seen in loop, NBD, and DRBD and have a separate block device driver
for each sort of virtual block device?
The most obvious reason is one that doesn't really apply any more. At
the time when MD and DM were being written there was a strong
connection between major device numbers and block device drivers. Each
driver needed a separate major number. Loop is 7, NBD is 43, DRBD is
147, and MD is 9. DM doesn't have a permanently allocated number; it
chooses a spare number when the module is loaded, so it usually gets
254 or 253.
Furthermore, at that time, the number of available major
numbers was limited to 255, and there was a danger of running out.
Allocating one major number for RAID0, one for LINEAR, one for RAID1
and so forth would have looked like a bit of a waste, so getting one
for MD and plugging different personalities into the one driver
might have been a simple matter of numerical economy.
Today, we have many more major numbers available, and we no longer
have a tight binding between major numbers and device drivers - a
driver simply claims whichever device numbers it wants at any time,
when the module is loaded, or when a device is created.
A second reason is the fact that all the MD personalities
envisioned at the time had a lot in common. In particular they each
used a number of component devices to create a larger device. While
creating a midlayer to encapsulate this functionality might be a
mistake,
it is a very tempting step and would seem to make implementation
easier.
Finally, as has been mentioned, having a single module which defines
its own internal interfaces can provide a measure of insulation from
other parts of the kernel. While this was mentioned only in the
context of DM, it is by no means absent from MD. That insulation,
while not necessarily in the best interests of the kernel as a whole,
can make life a lot easier for the individual developer.
None of these reasons really stand up as defensible today, though some
were certainly valid in the past. So it could be that, rather than
seeking unification of MD and DM, we should be seeking their
deprecation. If we can find a simple approach to allow different
implementations of virtual block devices to exist as independent
drivers, but still maintain all the same functionality as they
presently have, that is likely to be the best way forward.
Unification with the device model
This brings us to the Linux device model. While there may be no real
need to unify DM with MD, the devices they create need to fit into the
unifying model for devices which we call the "device model" and which is
exposed most obviously through various directory trees in sysfs.
The device model has a very broad concept of a "device." It is much
more than the traditional Unix block and character devices; it
includes busses, intermediate devices, and just about anything that
is in any way addressable.
In this model it would seem sensible for there to be an "array" device
which is quite separate from the "block" device providing access
to the data in the array. This is not unlike the current situation
where a SCSI bus has a child which is a SCSI target which, in turn, has
a child which is a SCSI LUN (Logical UNit), and that device itself is
still separate from the block device that we tend to think of as a
"SCSI disk".
This separation would allow the array to be created and configured
before the block device can come into being, thus removing any room for
confusion for udev.
The device model already allows for a bus driver to discover
devices on that
bus. In most cases this happens automatically during boot or at
hotplug time. However, it is possible to ask a bus to discover any new
devices, or to look for a particular new device. This last action
could easily be borrowed to manage creation of virtual block devices
on a virtual bus. The automatic scan would not find any devices, but
an explicit request for an explicitly-named device could always
succeed by simply creating that device.
If we then configure the device by filling in attribute files in the
virtual block device, we have a uniform and extensible mechanism for
configuring all virtual block devices that fits with an existing
model.
Again, the device model already allows for binding different drivers
to devices as implemented in the different "bind" files in the
/sys/bus directory tree. Utilizing this idea, once a virtual block
device was "discovered" on the virtual block device bus, an appropriate
driver could be bound to it that would interpret the attributes,
possibly create files for extra attributes, and, ultimately, instantiate
the block device.
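To make the proposal concrete, here is a purely hypothetical sketch of
what that flow might look like from user space. None of the sysfs paths
below exist; the "vblock" bus name, the attribute names, and the driver
name are all invented here to illustrate the discover-configure-bind
sequence.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *val)
    {
        int fd = open(path, O_WRONLY);

        if (fd >= 0) {
            write(fd, val, strlen(val));
            close(fd);
        }
    }

    int main(void)
    {
        /* 1. "Discover" (that is, create) a named device on the
         *    hypothetical virtual block device bus. */
        write_str("/sys/bus/vblock/scan", "myarray");

        /* 2. Configure it through attribute files while it is still
         *    inactive and has no block device. */
        write_str("/sys/bus/vblock/devices/myarray/components",
                  "/dev/sda1 /dev/sdb1");
        write_str("/sys/bus/vblock/devices/myarray/chunk_size", "65536");

        /* 3. Bind a driver; only now would the block device appear. */
        write_str("/sys/bus/vblock/drivers/raid1/bind", "myarray");
        return 0;
    }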
Possibly the most difficult existing feature to represent cleanly in
the device model is the on-line reconfiguration that DM and, more
recently, MD provide. This allows control of an array to be passed
from one driver to another without needing to destroy and recreate the
block device (thus, for example, a filesystem can remain mounted
during the transition). Doing this exchange in a completely general way would
involve detaching a block device from one parent and attaching it to
another. This would be complex for a number of reasons,
one being the backing_dev_info structure, which creates quite
a tight connection between a filesystem and the driver for the
mounted block device.
Another weakness in the device model is that dependencies between
devices are very limited - a device can be dependent on at most one
other device, its parent. This doesn't fit very well with the
observation that an array is dependent on all the components of the
array, and that these components can change from time to time.
Fortunately this weakness
has already been
identified
and, hopefully, will be resolved in a way that also works for virtual block
devices.
So, while there are plenty of issues that this model leaves unresolved,
it does seem that unification with the device model holds the key to
unification between MD and DM, along with any other virtual block devices.
So what is the answer?
Knowing that a problem is hard does not excuse us from solving it.
With the growing interest in managing multiple devices together, as
seen in DRBD and Btrfs, as well as in the increasing convergence in
functionality between DM and MD, now might be the ideal time
to solve the problems and achieve unification.
Reflecting on the various problems and differences discussed above, it
would seem that a very important step would be to define and agree on
some interfaces; two in particular.
The first interface that we need is the device creation and
configuration interface. It needs to provide for all the different
needs of DM, MD, Loop, NBD, DRBD, and probably even Btrfs. It needs to
be sufficiently complete such that the current ioctl() and netlink
interfaces can be implemented entirely through calls into this new
interface. It is almost certain that this interface should be exposed
through sysfs and so needs to integrate well with the device model.
The second interface is the one between the block
layer and individual block drivers. This interface needs to be enhanced to
support all the functionality that a DM target expects of its
interface with a DM table, and in particular it needs to be able to
support hotplugging of the underlying driver while the block device
remains active.
Defining and, very importantly, agreeing on these interfaces will go a
long way towards achieving the long-sought-after unification.