Infrastructure unification in the block layer
For nearly as long there have been suggestions that having two frameworks is a waste and that they should be unified. However, little visible effort has been made toward this unification, and such efforts as there might have been have not yielded any lasting success. The most united thing about the two is that they have a common directory in the Linux kernel source tree (drivers/md); this is more a confusion than a unification. The two subsystems have both seen ongoing development side by side, each occasionally gaining functionality that the other has and so, in some ways, becoming similar. But similarity is not unity; rather, it serves to highlight the lack of unity, as it is no longer function that keeps the two separate, only form.
Exploring why unification has never happened would be an interesting historical exercise that would need to touch on the personalities of the people involved, the drift in functionality between the two systems which started out with quite different goals, the differing perceptions of each by various members of the community, and the technological differences that would need to be resolved. Not being an historian, your author only feels competent to comment on that last point, and, as it is the one where a greater understanding is most likely to aid unification, this article will endeavor to expose the significant technological issues that keep the two separate. In particular, we will explore the weaknesses in each infrastructure. Where a system has strengths, they are likely to be copied, thus creating more uniformity. Where it has weaknesses, they are likely to be assiduously avoided by others, thus creating barriers.
Trying to give as complete a picture as possible, we will explore more than just DM and MD. Loop, NBD, and DRBD provide similar functionality behind their own single-use infrastructure; exploring them will ensure that we don't miss any important problems or needs.
The flaws of MD
Being the lead developer of MD for some years, your author feels honour bound to start by identifying weaknesses in that system.
One of the more ugly aspects of MD is the creation of a new array device. This is triggered by simply opening a device special file, typically in /dev. In the old days, when we had a fairly static /dev directory, this seemed a reasonable approach. It was simply necessary to create a bunch of entries in /dev (md0, md1, md2, ...) with appropriate major and minor numbers at the same time that the other static content of /dev was created. Then, whenever one of those entries was opened, the internal data structures would spontaneously be created so that the details of the device could be filled in.
However, with the more modern concept of a dynamic /dev, reflecting the fact that the set of devices attached to a given system is quite fluid, this doesn't fit very well. udev, which typically manages /dev, only creates entries for devices that the kernel knows about. So it will not create any md devices until the kernel believes them to exist. They won't exist until the device file has been created and can be opened - a classic catch-22 situation.
mdadm, the main management tool for MD, works around this problem by creating a temporary device special file just so that it can open it and thereby create the device. This works well enough, but is, nonetheless, quite ugly. The internal implementation is particularly ugly and it was only relatively recently that the races inherent in destroying an MD device (which could be recreated at any moment by user space) were closed so that MD devices don't have to exist forever.
A closely related problem with MD is that the block device representing the array appears before the array is configured or has any data in it. So when udev first creates /dev/md0, an attempt to open and read from it (to find out if a filesystem is stored there, for example) will find no content. It is only after the component devices have been attached to the array and it has been fully configured that there is any point in trying to read data from the array.
This initial state, where the device exists but is empty, is somewhat like the case of removable-media devices, and can be managed along the same lines as those: we could treat the array as media that can spontaneously appear. However MD is, in other ways, quite unlike removable media (there is no concept of "eject") and it would generally cause less confusion if MD devices appeared fully configured so they looked more like regular disk drive devices.
A problem that has only recently been addressed is the fact that MD manages the metadata for the arrays internally. The kernel module knows all about the layout of data in the superblock and updates it as appropriate. This makes it easy to implement, but not so easy to extend. Due to the lack of any real standard, there are many vendor-specific metadata layouts that all can be used to describe the same sort of array. Supporting all of those in the kernel would unnecessarily bloat the kernel, and supporting them in user space requires information about required updates to be reported to user space in a reliable way.
As mentioned, this problem has recently been addressed, so it is now quite possible to manage vendor-specific metadata from user space. It is still worth noting, though, as one of the problems that has stood in the way of earlier attempts at DM/MD integration: DM does not manage metadata at all, leaving it up to user-space tools.
The final flaw in MD to be exposed here is the use and nature of the ioctl() commands that are used to configure and manage MD arrays. The use of ioctl() has been frowned upon in the Linux community for some years. There are a number of reasons for this. One is that strace cannot decode newly-defined ioctls, so the use of ioctl() can make a program's behaviour harder to observe. Another is that it is a binary interface (typically passing C structures around) and so, when Linux is configured to support multiple ABIs (e.g. a 32-bit and a 64-bit version), there is often a need to translate the binary structure from one ABI to the other (see the nearly 3000 lines in fs/compat_ioctl.c).
In the case of MD, the ioctl() interface is not very extensible. The command for configuring an array allows only a "level", a "layout", and a "chunksize" to be specified. This works well enough for RAID0, RAID1, and RAID5, but even with RAID10 we needed to encode multiple values into the "layout" field which, while effective, isn't elegant.
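To make the "multiple values in one field" point concrete, here is a minimal sketch of how several RAID10 parameters could be packed into a single integer. The bit positions used (low byte for the number of "near" copies, next byte for "far" copies, one flag bit above that) are an assumption chosen for illustration and should be checked against the kernel source before being relied upon; the function names are hypothetical.

```python
# Illustrative packing of several RAID10 parameters into one integer "layout".
# The bit positions are an assumption for this sketch, not a documented ABI.

FAR_SHIFT = 8            # assumed: second byte holds the number of "far" copies
OFFSET_FLAG = 1 << 16    # assumed: single flag bit selecting an "offset" variant


def encode_layout(near_copies, far_copies, offset=False):
    """Pack three logically separate settings into one integer."""
    if not (0 < near_copies < 256 and 0 < far_copies < 256):
        raise ValueError("copy counts must fit in one byte and be non-zero")
    layout = near_copies | (far_copies << FAR_SHIFT)
    if offset:
        layout |= OFFSET_FLAG
    return layout


def decode_layout(layout):
    """Recover (near_copies, far_copies, offset) from the packed value."""
    near_copies = layout & 0xFF
    far_copies = (layout >> FAR_SHIFT) & 0xFF
    return near_copies, far_copies, bool(layout & OFFSET_FLAG)


if __name__ == "__main__":
    packed = encode_layout(near_copies=2, far_copies=1)
    print(hex(packed), decode_layout(packed))
```

The encoding works, but every new parameter has to be squeezed into whatever bits remain, which is the inelegance the text describes.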
In the last few years MD has grown a separate configuration interface via a collection of attribute files exposed in sysfs. This is much more extensible, and there are a growing number of features of MD which require the sysfs interface. However even here there is still room for improvement. The MD attribute files are stored in a subdirectory of the block device directory (e.g. /sys/block/md0/md/). While this seems natural, it entrenches the above-mentioned problem that the block device must exist before the array can be configured. If we wanted to delay creation of the block device until the array is ready to serve data, we would need to store these attribute files elsewhere in sysfs.
The failings of DM
DM has a very different heritage than MD and, while it shares some of the flaws of MD, it avoids others.
DM devices do not need to exist in /dev before they can be created. Rather, there is a dedicated "character device" which accepts DM ioctl() commands, including the command to create a new device. Thus the catch-22 problem from which MD suffers is not present in DM. It has been suggested that MD should take this approach too. However, while it does solve one problem, it still leaves the problem of using ioctl(). There doesn't seem to be a lot of point in making a significant change to a subsystem unless the result avoids all known problems. So, while waiting for a perfect solution, no such small steps have been made to bring MD and DM closer together.
The related issue of a block device existing before it is configured is still present in DM, though separating the creation of the DM device from the creation of the block device would be much easier in DM. This is because, as mentioned, with DM all configuration happens over the character device whereas with MD, the configuration happens via the block device itself, so it must exist before it can be configured.
While DM also uses ioctl() commands, which could be seen as a weakness, the commands chosen are much more extensible than those used by MD. The ioctl() command to configure a device essentially involves passing a text string to the relevant module within DM, and it interprets this string in any way it likes. So DM is not limited to the fields that were thought to be relevant when DM was first designed.
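As a rough illustration of why a text string is easy to extend, consider the table lines that user-space DM tools work with, which take the general shape "start length target-type parameters...". The sketch below splits such a line into its fixed fields and leaves the rest as an opaque string for the target to interpret. The device names in the examples are placeholders, and the parser itself is hypothetical rather than anything DM ships.

```python
from typing import NamedTuple


class TableLine(NamedTuple):
    start: int      # first sector covered by this entry
    length: int     # number of sectors covered
    target: str     # name of the target type, e.g. "linear"
    params: str     # uninterpreted text handed to that target


def parse_table_line(line):
    """Split a table line into the fixed fields plus target-specific text."""
    fields = line.split(None, 3)
    if len(fields) < 3:
        raise ValueError("expected: <start> <length> <target> [params...]")
    params = fields[3].strip() if len(fields) > 3 else ""
    return TableLine(int(fields[0]), int(fields[1]), fields[2], params)


if __name__ == "__main__":
    # Placeholder device names; only the overall shape matters here.
    print(parse_table_line("0 1048576 linear /dev/sdb 0"))
    print(parse_table_line("0 2097152 striped 2 128 /dev/sdb 0 /dev/sdc 0"))
```

Only the first three fields are fixed; anything a new target needs can be added to the parameter text without changing the generic code.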
Metadata management with DM is very different than with MD. In the original design, there was never any need for the kernel module to modify metadata, so metadata management was left entirely in user space where it belongs. More recently, with RAID1 and RAID5 (which is still under development), the kernel is required to synchronously update the metadata to record a device failure. This requires a degree of interaction between the kernel and user space which has had to be added.
The main problem with the design of DM is the fact that it has two layers: the table layer and the target layer. This undoubtedly comes from the original focus of DM, which was logical volume management (LVM), and it fits that focus quite well. However, it is an unnecessary layering and just gets in the way of non-LVM applications.
A "target" is a concept internal to DM, which is the abstraction that each different module presents. So striping, raid1, multipath, etc. each present a target, and these targets can be combined via a table into a block device.
A "table" is simply a list of targets each with a size and an offset. This is analogous to the "linear" module in MD or what is elsewhere described as concatenation. The targets are essentially joined end-to-end to form a single larger block device.
This contrasts with MD where each module - raid0, raid1, or multipath, for example - presents a block device. This block device can be used as-is, or it can be combined with others, via a separate array, into a single larger block device.
To highlight the effect of this layering a little more, suppose we were to have two different arrays made of a few devices. In one array we want the data striped across the devices. In the other we lay the data out filling the first device first and then moving on to the next device. With MD, the only difference between these two would be the choice of "raid0" or "linear" as the module to manage them. With DM, the first step would involve including all the devices in a single "stripe" target, and then placing that target as the sole entry in a table. The second would involve creating a number of "linear" targets, one for each device, and then combining them into a table with multiple entries.
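The two data layouts in this example can be sketched as address-mapping functions. This is a simplified model, not code from either subsystem: the function names are hypothetical, and the striped case assumes equal-sized devices and a fixed chunk size.

```python
def map_linear(sector, device_sizes):
    """Concatenation: fill device 0 completely, then device 1, and so on.

    Returns (device_index, sector_within_device).
    """
    for index, size in enumerate(device_sizes):
        if sector < size:
            return index, sector
        sector -= size
    raise ValueError("sector lies beyond the end of the array")


def map_striped(sector, num_devices, chunk_sectors):
    """Striping: consecutive chunks rotate across the devices.

    Returns (device_index, sector_within_device).
    """
    chunk, within_chunk = divmod(sector, chunk_sectors)
    stripe, device = divmod(chunk, num_devices)
    return device, stripe * chunk_sectors + within_chunk


if __name__ == "__main__":
    print(map_linear(150, [100, 100, 100]))
    print(map_striped(150, num_devices=3, chunk_sectors=64))
```

In MD terms, each function corresponds to one module managing the whole array. In DM terms, the striped mapping lives inside a single target, while the linear mapping is split between per-device targets and the table that joins them.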
Having this internal abstraction of a "target" serves to insulate and isolate DM from the block device layer, which is the common abstraction used by other virtual devices. A good example of this separation is the online reconfiguration functionality that DM provides. The boundary between the table and the targets allows DM to capture new requests in the table layer while allowing the target layer to drain and become idle, and then to potentially replace all the targets with different targets before releasing the requests that have been held back.
Without that internal "target" layer, that functionality would need to be implemented in the block layer on its boundary with the driver code (i.e. in generic_make_request() and bio_endio()). Doing this would be more effort (i.e. DM would not benefit from insulation) and it would then be more generally useful (i.e. DM would not be so isolated). Many people have wanted to be able to convert a normal device into a degraded RAID1 "array" or to enable multipath support on a device without first unmounting the filesystem which was mounted directly from one of the paths. If online reconfiguration were supported at the block layer level, these changes would become possible.
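The suspend, drain, and swap sequence described above can be modelled with a toy class. It is a single-threaded sketch under heavy simplification (requests complete synchronously, so "draining" is trivial), and all names are hypothetical; it is meant only to show where requests are held while the handler underneath is replaced.

```python
class ToyTable:
    """Boundary layer that can hold requests while the handler is replaced."""

    def __init__(self, target):
        self.target = target      # callable that services one request
        self.suspended = False
        self.held = []            # requests captured while suspended
        self.in_flight = 0        # requests passed down but not yet completed

    def submit(self, request):
        if self.suspended:
            self.held.append(request)   # capture instead of passing down
            return None
        self.in_flight += 1
        try:
            return self.target(request)
        finally:
            self.in_flight -= 1

    def suspend(self):
        self.suspended = True

    def swap_target(self, new_target):
        # Only safe once new requests are held and old ones have drained.
        if not self.suspended or self.in_flight:
            raise RuntimeError("must be suspended and idle before swapping")
        self.target = new_target

    def resume(self):
        """Release held requests to whatever target is now in place."""
        self.suspended = False
        held, self.held = self.held, []
        return [self.submit(request) for request in held]


if __name__ == "__main__":
    table = ToyTable(lambda r: ("old", r))
    table.suspend()
    table.submit("write-1")                 # held, not serviced
    table.swap_target(lambda r: ("new", r))
    print(table.resume())
```

If the same hold-and-release logic sat at the boundary between the block layer and its drivers, any block device could in principle have its driver replaced this way.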
The difference of DRBD
DRBD, the Distributed Replicated Block Device, is the most complex of the virtual block devices that do not aim to provide a general framework. It is not yet included in the mainline, but it could yet be merged in 2.6.33.
Its configuration mechanism is similar to that of DM in a number of ways. There is a single channel which can be used to create and then manage the block devices. The protocol used over this channel is designed to be extensible, though the current definitions are very much focused around the particular needs of DRBD (as would be expected), so how easy it might be to extend to a different sort of array is not immediately clear.
Where DM uses ioctl() with string commands over a dedicated character device, DRBD uses a packed binary protocol over a netlink connection. This is essentially a socket connection between the kernel module and a user-space management program which carries binary encoded messages back and forth. This is probably no better or worse than ioctl(); it is simply different. Presumably it was chosen because there is general bad feeling about ioctl(), but no such bad feeling about netlink. Linus, however, doesn't seem keen on either approach.
DRBD appears to share metadata management between the kernel module and user space. Metadata which describes a particular DRBD configuration is created and interpreted by user-space tools and the information that is needed by the kernel is communicated over the netlink socket. DRBD uses other metadata to describe the current replication state of the system - which blocks are known to be safely replicated and which are possibly inconsistent between the replicas for some reason. This metadata (an activity log and a bitmap) is managed by the kernel, presumably for performance reasons.
This sharing of responsibility makes a lot of sense as it allows the performance-sensitive portions to remain in the kernel but still leaves a lot of flexibility to support different metadata formats. This approach could be improved even more by making the bitmap and activity log into independent modules that can be used by other virtual devices. Each of DM, MD, and DRBD has a very similar mechanism for tracking inconsistencies between component devices; this is possibly the most obvious area where sharing would be beneficial.
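To show how small the shareable core of such a tracking mechanism could be, here is a minimal sketch of a region bitmap: one bit per fixed-size region, set before a write and cleared once the replicas are known to agree. It is a generic illustration with hypothetical names and is not modelled on the on-disk format of any of the three subsystems.

```python
class RegionBitmap:
    """One bit per region; a set bit means replicas may be inconsistent."""

    def __init__(self, device_sectors, region_sectors):
        self.device_sectors = device_sectors
        self.region_sectors = region_sectors
        self.num_regions = -(-device_sectors // region_sectors)  # round up
        self.bits = bytearray((self.num_regions + 7) // 8)

    def mark_dirty(self, sector, count):
        """Record that sectors [sector, sector + count) are being written."""
        if count <= 0 or sector < 0 or sector + count > self.device_sectors:
            raise ValueError("write lies outside the device")
        first = sector // self.region_sectors
        last = (sector + count - 1) // self.region_sectors
        for region in range(first, last + 1):
            self.bits[region // 8] |= 1 << (region % 8)

    def mark_clean(self, region):
        """Record that a region is known to match on all replicas."""
        self.bits[region // 8] &= ~(1 << (region % 8)) & 0xFF

    def dirty_regions(self):
        """Regions that would need resynchronising after a failure."""
        return [r for r in range(self.num_regions)
                if self.bits[r // 8] & (1 << (r % 8))]


if __name__ == "__main__":
    bitmap = RegionBitmap(device_sectors=1000, region_sectors=100)
    bitmap.mark_dirty(sector=150, count=100)
    print(bitmap.dirty_regions())
```

Nothing in this structure depends on whether the copies are RAID1 mirrors or network replicas, which is what makes it a plausible candidate for a shared module.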
Loop, NBD and the purpose of infrastructure
Partly to emphasize the fact that it isn't necessary to use a framework to have a virtual block device, loop and NBD (the Network Block Device) are worth considering. While loop doesn't appear to aim to provide a framework for a multiplicity of virtual devices, it nonetheless combines three different functions into one device. It can make a regular file look like a block device, it can provide primitive partitioning of a different block device, and it can provide encryption and decryption so that an encrypted device can be accessed. Significantly, these are each functions that were subsequently added to DM, thus highlighting the isolating effect of the design of DM.
NBD is much simpler in that it has just one function: it provides a block device for which all I/O requests are forwarded over a network connection to be serviced on - normally - a different host. It is possibly most instructive as an example of a virtual block device that doesn't need any surrounding framework or infrastructure.
Two areas where DM or MD devices make use of an infrastructure, while Loop and NBD need to fend for themselves, are in the creation of new devices and the configuration of those devices. NBD takes a very simple approach of creating a predefined number of devices at module initialization time and not allowing any more. Loop is a little more flexible and uses the same mechanism as MD, largely provided by the block layer, to create loop devices when the block device special file is opened. It does not allow these to be deleted until the module is unloaded, usually at system shutdown time. This architecture suggests that some infrastructure could be helpful for these drivers, and that the best place for that infrastructure could well be in the block layer, and thus shared by all devices.
For configuration, both Loop and NBD use a fairly ad hoc collection of ioctl() commands. As we have already observed, this is both common and problematic. They could both benefit from a more standardized and transparent configuration mechanism.
It might be appropriate to ask at this point why there is any need for subsystem infrastructure such as DM and MD. Why not simply follow the pattern seen in loop, NBD and DRBD and have a separate block device driver for each sort of virtual block device? The most obvious reason is one that doesn't really apply any more. At the time when MD and DM were being written there was a strong connection between major device numbers and block device drivers. Each driver needed a separate major number. Loop is 7, NBD is 43, DRBD is 147, and MD is 9. DM doesn't have a permanently allocated number; it chooses a spare number when the module is loaded, so it usually gets 254 or 253.
Furthermore, at that time, the number of available major numbers was limited to 255 and there was danger of running out. Allocating one major number for RAID0, one for LINEAR, one for RAID1 and so forth would have looked like a bit of a waste, so getting one for MD and plugging different personalities into the one driver might have been a simple matter of numerical economy. Today, we have many more major numbers available, and we no longer have a tight binding between major numbers and device drivers - a driver simply claims whichever device numbers it wants at any time, when the module is loaded, or when a device is created.
A second reason is the fact that all the MD personalities envisioned at the time had a lot in common. In particular they each used a number of component devices to create a larger device. While creating a midlayer to encapsulate this functionality might be a mistake, it is a very tempting step and would seem to make implementation easier.
Finally, as has been mentioned, having a single module which defines its own internal interfaces can provide a measure of insulation from other parts of the kernel. While this was mentioned only in the context of DM, it is by no means absent from MD. That insulation, while not necessarily in the best interests of the kernel as a whole, can make life a lot easier for the individual developer.
None of these reasons really stand up as defensible today, though some were certainly valid in the past. So it could be that, rather than seeking unification of MD and DM, we should be seeking their deprecation. If we can find a simple approach to allow different implementations of virtual block devices to exist as independent drivers, but still maintain all the same functionality as they presently have, that is likely to be the best way forward.
Unification with the device model
This brings us to the Linux device model. While there may be no real need to unify DM with MD, the devices they create need to fit into the unifying model for devices which we call the "device model" and which is exposed most obviously through various directory trees in sysfs. The device model has a very broad concept of a "device." It is much more than the traditional Unix block and character devices; it includes busses, intermediate devices, and just about anything that is in any way addressable.
In this model it would seem sensible for there to be an "array" device which is quite separate from the "block" device providing access to the data in the array. This is not unlike the current situation where a SCSI bus has a child which is a SCSI target which, in turn, has a child which is a SCSI LUN (Logical UNit), and that device itself is still separate from the block device that we tend to think of as a "SCSI disk". This separation would allow the array to be created and configured before the block device can come into being, thus removing any room for confusion for udev.
The device model already allows for a bus driver to discover devices on that bus. In most cases this happens automatically during boot or at hotplug time. However, it is possible to ask a bus to discover any new devices, or to look for a particular new device. This last action could easily be borrowed to manage creation of virtual block devices on a virtual bus. The automatic scan would not find any devices, but an explicit request for an explicitly-named device could always succeed by simply creating that device. If we then configure the device by filling in attribute files in the virtual block device, we have a uniform and extensible mechanism for configuring all virtual block devices that fits with an existing model.
Again, the device model already allows for binding different drivers to devices as implemented in the different "bind" files in the /sys/bus directory tree. Utilizing this idea, once a virtual block device was "discovered" on the virtual block device bus, an appropriate driver could be bound to it that would interpret the attributes, possibly create files for extra attributes, and, ultimately, instantiate the block device.
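The proposed flow (explicit discovery, configuration through attributes, then driver binding) can be modelled in a few lines. This is purely a toy model of the idea: it does not touch sysfs, and every class, method, and attribute name in it is hypothetical.

```python
class VirtualDevice:
    """A configurable "array" object that exists before any block device."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}       # stands in for attribute files
        self.driver = None
        self.block_device = None   # created only when a driver is bound


class VirtualBus:
    def __init__(self):
        self.devices = {}
        self.drivers = {}

    def scan(self):
        """Automatic discovery finds nothing on a purely virtual bus."""
        return []

    def probe(self, name):
        """An explicit request for a named device succeeds by creating it."""
        if name not in self.devices:
            self.devices[name] = VirtualDevice(name)
        return self.devices[name]

    def register_driver(self, driver_name, instantiate):
        """instantiate(attributes) returns the block device to expose."""
        self.drivers[driver_name] = instantiate

    def bind(self, device_name, driver_name):
        """Let a driver interpret the attributes and create the block device."""
        device = self.devices[device_name]
        device.driver = driver_name
        device.block_device = self.drivers[driver_name](device.attributes)
        return device.block_device


if __name__ == "__main__":
    bus = VirtualBus()
    array = bus.probe("array0")
    array.attributes["level"] = "raid1"
    bus.register_driver("toy-raid1", lambda attrs: "block:" + attrs["level"])
    print(bus.bind("array0", "toy-raid1"))
```

The point of the ordering is that the block device only comes into being at bind time, after configuration is complete, so nothing can observe an empty device.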
Possibly the most difficult existing feature to represent cleanly in the device model is the on-line reconfiguration that DM and, more recently, MD provide. This allows control of an array to be passed from one driver to another without needing to destroy and recreate the block device (thus, for example, a filesystem can remain mounted during the transition). Doing this exchange in a completely general way would involve detaching a block device from one parent and attaching it to another. This would be complex for a number of reasons, one being the backing_dev_info structure, which creates quite a tight connection between a filesystem and the driver for the mounted block device.
Another weakness in the device model is that dependencies between devices are very limited - a device can be dependent on at most one other device, its parent. This doesn't fit very well with the observation that an array is dependent on all the components of the array, and that these components can change from time to time. Fortunately this weakness has already been identified and, hopefully, will be resolved in a way that also works for virtual block devices.
So, while there are plenty of issues that this model leaves unresolved, it does seem that unification with the device model holds the key to unification between MD and DM, along with any other virtual block devices.
So what is the answer?
Knowing that a problem is hard does not excuse us from solving it. With the growing interest in managing multiple devices together, as seen in DRBD and Btrfs, as well as in the increasing convergence in functionality between DM and MD, now might be the ideal time to solve the problems and achieve unification. Reflecting on the various problems and differences discussed above, it would seem that a very important step would be to define and agree on some interfaces; two in particular.
The first interface that we need is the device creation and configuration interface. It needs to provide for all the different needs of DM, MD, Loop, NBD, DRBD, and probably even Btrfs. It needs to be sufficiently complete such that the current ioctl() and netlink interfaces can be implemented entirely through calls into this new interface. It is almost certain that this interface should be exposed through sysfs and so needs to integrate well with the device model.
The second interface is the one between the block layer and individual block drivers. This interface needs to be enhanced to support all the functionality that a DM target expects of its interface with a DM table, and in particular it needs to be able to support hotplugging of the underlying driver while the block device remains active.
Defining and, very importantly, agreeing on these interfaces will go a long way towards achieving the long sought after unification.
| Index entries for this article | |
|---|---|
| GuestArticles | Brown, Neil |
