
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.32-rc3, released by Linus on October 4. Note that there was no -rc2; that version was skipped to avoid confusion resulting from the fat-fingering of the version number in -rc1.

[O]ne thing that might be worth mention (despite being fairly small) is that there's been some IO latency tweaking in the block layer (CFQ scheduler). I'm hoping that ends up being one of those noticeable things, where people might actually notice better responsiveness. Give it a try.

One other user-visible change: by default, VFAT filesystems are now mounted with shortname=mixed instead of shortname=lower, preventing the downcasing of filenames when filesystems are copied. Also notable is that Btrfs is finally able to handle full-disk situations; we'll have to find something else to give Chris Mason grief about. The short-form changelog is in the announcement, or see the full changelog for all the details.

The current stable kernel is 2.6.31.3, released on October 7. It contains a single fix for a TTY problem that was affecting a number of users.

Previously, 2.6.31.2 was released on October 5. This is a large stable release; from the review announcement:

This release is big. Yeah, really big. There are a number of areas that needed some rework in order to get things back to working order. Like the tty layer. Hopefully everyone can now use their usb to serial devices again without oopsing the kernel. Xen and KVM also have reasonably big fixes, as does the ath5k and iwlwifi drivers. One might say that the patches for the iwlwifi drivers are a bit "bigger" than normal -stable material, but the wifi maintainer wants them, so he can handle the fallout. XHCI (the USB 3.0 controller) also has a big update here, to get it into workable shape to coincide with the release of the USB 3.0 developer kit. Without it, it wouldn't be really useful. And there's a whole raft of other important fixes as well, not to make light of them. A huge system speedup for large boxes is also in here, for those who like running benchmarks.

So expect more than the usual amount of change for a stable update.

2.6.27.36 and 2.6.30.9 were also released on October 5; they contain a somewhat smaller set of fixes. "This is the last release of the 2.6.30-stable series. Everyone should now move to the 2.6.31 kernel tree. If there are any issues preventing people from doing this, please let me know!"

Comments (none posted)

Quotes of the week

Al Viro has managed to get his and his wife's paperwork in order, and has returned to USA. Apparently, there was no record of his having been discharged from the Soviet Army. The authorities also lacked any record of his wife having an address in Russia while an adult. This situation was reportedly resolved as only Al Viro could have resolved it.
-- from the Netconf 2009 minutes

Reports of its demise were greatly exaggerated. But it's going to be a few days before we're back in sync with "current time" - I wanted to get the retrospective episodes on the merge window done, even though it's "old" now because the merge window period is of high value. I've been doing about 3 hours a night over the past few days to get through it. I view it as some kind of oddly exciting psychological masochism.
-- Jon Masters (who says podcasting is boring?)

gad. You said "floppy" and "ioctl" in the same sentence. Where angels fear to tread.
-- Andrew Morton

The _reason_ for the driver exemption was the fact that even a broken driver is better than no driver at all for somebody who just can't get a working system without it, but that argument really goes away when the driver is so specialized that it's not about regular hardware any more.

And the whole "driver exemption" seems to have become a by-word for "I can ignore the merge window for 50% of my code". Which makes me very tired of it if there aren't real advantages to real users. So I'm seriously considering a "the driver has to be mass market and also actually matter to an install" rule for the exemption to be valid.

-- Linus Torvalds

Checkpatch is not very bright, it has no understanding of style beyond playing with pattern regexps. It's a rather dim tool that helps people get work done... or as some would have it a rather dim tool used by even dimmer tools to make noise on kernel list.
-- Alan Cox

Comments (2 posted)

Netconf 2009 minutes available

"Netconf 2009" was an invitation-only summit of networking developers held prior to LinuxCon. The minutes for the two days of discussion have now been posted on the Netconf 2009 page. Topics covered include transmit interrupt mitigation, bridging, multiqueue networking, wireless networking, and more.

Comments (none posted)

The CFQ "low latency" mode

One of the changes slipped into the 2.6.32-rc3 release was the addition of a "low latency" mode for the CFQ I/O scheduler. Normally the scheduler will try to delay many new I/O requests for a short time in the hope that they can be joined with other requests which may come shortly thereafter. This behavior will minimize disk seeks and maximize I/O request size, so it is clearly good for throughput. But the addition of delays can be a problem if the overriding goal is to complete the operation as quickly as possible.

The new mode (initially called "desktop" before being renamed "low_latency") is enabled by default; it can be adjusted by setting the iosched/low_latency attribute associated with each block device in sysfs. When set, some of the delays for "synchronous operations" (reads, generally) no longer happen. The result should be more responsive I/O and, one would hope, happier users.

Note: please see the comments for a description of this change which is more, um, accurate. Your editor blames the Death Flu that his kids brought home.

Comments (10 posted)

The ACPI processor aggregator driver

Patches merged into the mainline carry a number of tags to indicate who wrote them, who reviewed them, etc. A certain commit merged for 2.6.32 contains a relatively unusual tag, though:

    NACKed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

The merging of this patch has drawn some complaints: why should it have made it into the mainline when a core developer clearly has problems with it?

The story goes something like this. ACPI provides a mechanism by which it can ask the system to make processors go idle in emergency situations; these can include power problems or an overheating system. The ACPI folks had originally proposed putting some hacks into the scheduler to implement this functionality. These changes, it seems, were little loved; that was the patch that Peter Zijlstra blocked outright.

So Shaohua Li went back and implemented this functionality as a driver instead. If the ACPI hardware starts sounding the red alert, this driver will create a top-priority realtime thread and bind it to the CPU that is to be idled. That thread, when it "runs," will simply put the CPU into a relatively deep sleep state for a while. When the emergency passes, the thread will go away and normal life resumes. It's a bit of a hack, but it gets the job done, and it is not destructive to system state the way hot-unplugging the CPU would be.

The proper fix would be to enhance the scheduler (the right way) to provide this functionality. But that almost certainly requires the intervention of a real scheduler hacker, and they haven't yet gotten around to solving the problem. So the ACPI "driver" is in the mainline for now. And it may stay that way; Linus said:

In fact, the only reason the scheduler people even know about it is that Len at first tried to do something more invasive, and was shot down. Now it's just a driver, and the scheduler people can _try_ to do it some other way if they really care, but that's _their_ problem. Not the driver.

In the meantime, I personally suspect we probably never want to even try to solve it in the scheduler, because why the hell should we care and add complex logic for something like that? At least not until we end up having the same issue on some other architecture too, and it turns from a hacky ACPI thing into something more.

And that's where things stand. The driver is little loved, but it will also be little used, can be replaced with a better mechanism if the right people care, and, in the meantime, it may solve a real problem for some users.

Comments (2 posted)

Kernel development news

2.6.x-rc0

By Jonathan Corbet
October 7, 2009
The mislabeling of 2.6.32-rc1 in the makefile might have been the cause of some confusion, though the skipping of -rc2 will have avoided the worst of it. But it seems that there is confusion with version numbers at other times, leading to a push for a change that Linus has absolutely no intention of making.

Responding to the 2.6.32-rc3 announcement, Len Brown noted that, as far as the version number is concerned, there is no difference between (say) 2.6.31 and a kernel checked out near the end of the 2.6.32 merge window, despite the fact that those two kernels differ significantly from each other. Len had a simple request:

This could be clarified if you update Makefile on the 1st commit after 2.6.X is frozen to simply be 2.6.Y-merge or 2.6.Y-rc0 or something. Anything but 2.6.X.

Others echoed this request, but Linus made it clear that he was not interested in this idea:

So no. I'm not going to do -rc0. Because doing that is _stupid_. And until you understand _why_ it's stupid, it's pointless talking about it, and when you _do_ understand that it's stupid, you'll agree with me.

So what is the problem with the -rc0 idea? It turns out there are a few, one of which is that the kernel build system already contains a much more flexible mechanism. If the LOCALVERSION_AUTO configuration option is set, the extra version information is generated in a more specific manner. Your editor, who has not been home long enough recently to install a new kernel on his desktop, is currently running a kernel which reports its version as:

    2.6.31-rc5-00002-g3ce001e

It says that the kernel was built from git commit 3ce001e (the leading "g" simply marks it as a git hash); the 00002 indicates that it is two commits past 2.6.31-rc5. This version number identifies the exact kernel being run in a way that a simple makefile tweak could not. Even a truly indicative -rc0 would not say which kernel was being run.

It gets worse than that, though, especially when developers start bisecting kernels to track down bugs. Consider this example: the two post-2.6.31-rc5 commits in your editor's kernel are a pair of BKL-removal patches which fell through the cracks and didn't make the 2.6.32 merge window. Assuming they make it into 2.6.33, the (simplified) git revision history will look something like this:

[Revisions diagram]

A developer trying to use bisection to find a problem in 2.6.33-rc1 might well end up at your editor's commit g3ce001e - as a stopping point, of course; that commit could not possibly be the cause of the problem. Should that developer look at the kernel version number at that point, they will not see 2.6.33-rc0 (even if Linus were to make that change) or even 2.6.32 - the version will be 2.6.31-rc5, the version that particular commit is based on. In the git era, kernel development is not a straight-line affair.

What this implies is that anybody who depends on the kernel version number as found in the Makefile is likely to end up confused. There is, of course, one important exception: that number is meaningful only for the actual release it represents. At any other time, it is an unreliable guide.

That doesn't change the fact that people are getting confused by running a kernel which identifies itself as 2.6.x, but which is really closer to 2.6.x+1. So it seems likely that a couple of things will be done to help. One of those is to enable the LOCALVERSION_AUTO option by default and, possibly, make it difficult to disable. The other is to add some smarts to the build system to check whether the kernel being built differs from the one tagged with the official release number; if so, a simple "+" is appended to the version number. So a kernel checked out in the middle of the 2.6.33 merge window would identify itself as 2.6.32+.

Linus doesn't much like that last option (he sees it as losing a lot of information that the full LOCALVERSION_AUTO option provides), but he "doesn't hate" it either. He actually managed to not hate the idea enough to put together a patch implementing it. It has not been merged as of this writing; there is still some discussion happening about possible changes to the LOCALVERSION_AUTO format. But it seems likely that something along these lines will go in during the 2.6.33 merge window, if not before.

Comments (6 posted)

Concurrency-managed workqueues

By Jonathan Corbet
October 7, 2009
A "thread pool" is a group of threads which can be called on to perform work at some future time. The kernel does not lack for thread pool implementations; indeed, there are more choices than one might like. Options include workqueues, the slow work mechanism, and asynchronous function calls - not to mention various private thread pool implementations found elsewhere in the kernel. It has long been thought that having just one thread pool mechanism would be better, but nobody, so far, has managed to put together a single implementation that everybody likes.

Of the mechanisms listed above, the most commonly used by far is workqueues. A workqueue makes it easy for code to set aside work to be done in process context at a future time, but workqueues are not without their problems. There is a shared workqueue that all can use, but one long-running task can create indefinite delays for others, so few developers take advantage of it. Instead, the kernel has filled with subsystem-specific workqueues, each of which contributes to the surfeit of kernel threads running on contemporary systems. Workqueue threads contend with each other for the CPU, causing more context switches than are really necessary. It's discouragingly easy to create deadlocks with workqueues when one task depends on work done by another. All told, workqueues - despite a couple of major rewrites already - are in need of a bit of a face lift.

Tejun Heo has provided that face lift in the form of his concurrency managed workqueues patch. This 19-part series massively reworks the workqueue code, addressing the shortcomings of the current workqueue subsystem. This effort is clearly aimed at replacing the other thread pool implementations in the kernel too, though that work is left for a later date.

Current workqueues have dedicated threads associated with them - a single thread in some cases, one thread per CPU in others. The new workqueues do away with that; there are no threads dedicated to any specific workqueue. Instead, there is a global pool of threads attached to each CPU in the system. When a work item is enqueued, it will be passed to one of the global threads at the right time (as deemed by the workqueue code). One interesting implication of this change is that tasks submitted to the same workqueue on the same CPU may now execute concurrently - something which does not happen with current workqueues.

One of the key features of the new code is its ability to manage concurrency in general. The simplest approach would be to execute every workqueue task concurrently as soon as it is submitted. Actually doing things that way would yield poor results, though; those tasks would simply contend with each other, causing more context switches, worse cache behavior, and generally worse performance. What's really needed is a way to run exactly one workqueue task at a time (avoiding contention) but to switch immediately to another if that task blocks for any reason (avoiding processor idle time). Doing this job correctly requires that the workqueue manager become a sort of special-purpose scheduler.

As it happens, that's just how Tejun has implemented it. The workqueue patch adds a new scheduler class which behaves very much like the normal fair scheduler class. The workqueue class adds a couple of hooks which call back into the workqueue code whenever a task running under that class transitions between the blocked and runnable states. When the first workqueue task is submitted, a thread running under the workqueue scheduler class is created to execute it. As long as that task continues to run, other tasks will wait. But as soon as the running task blocks on some resource, the scheduler will notify the workqueue code and another thread will be created to run the next task. The workqueue manager will create as many threads as needed (up to a limit) to keep the CPU busy, but it tries to only have one task actually running at any given time.

Also new with Tejun's patch is the concept of "rescuer" threads. In a tightly resource-constrained system, it may become impossible to create new worker threads. But any existing threads may be waiting for the results of other tasks which have not yet been executed. In that situation, everything will stop cold. To deal with this problem, some special "rescuer" threads are kept around. If attempts to create new workers fail for a period of time, the rescuers will be summoned to execute tasks and, hopefully, clear the logjam.

The handling of CPU hotplugging is interesting. If a CPU is being taken offline, the system needs to move all work off that CPU as quickly as possible. To that end, the workqueue manager responds to a hot-unplug notification by creating a special "trustee" manager on a CPU which is sticking around. That trustee takes over responsibility for the workqueue running on the doomed CPU, executing tasks until they are all gone and the workqueue can be shut down. Meanwhile, the CPU can go offline without waiting for the workqueue to drain.

These patches were generally welcomed, but there were some concerns expressed. The biggest complaint related to the special-purpose scheduling class. The hooks were described as (1) not really scheduler-related, and (2) potentially interesting beyond the workqueue code. For example, Linus suggested that this kind of hook could be used to implement the big kernel lock semantics, releasing the lock when a process sleeps and reacquiring it on wakeup. The scheduler class will probably go away in the next version of the patch; what remains to be seen is what will replace it.

One idea which was suggested was to use the preemption notifier hooks which are already in the kernel. These notifiers would have to become mandatory, and some new callbacks would be required. Another possibility would be to give in to the inevitable future in which perf events take over the entire kernel. Event tracepoints are designed to provide callbacks at specific points in the kernel; some already exist for most of the interesting scheduler events. Using them in this context would mostly be a matter of streamlining the perf events mechanism to handle this task efficiently.

Andrew Morton was concerned that the new code would take away the ability for a specific workqueue user to modify its worker tasks - changing their priority, say, or having them run under a different UID. It turns out that, so far, only a couple of workqueues have been modified in this way. The workqueue used by stop_machine() puts its worker threads into the realtime scheduling class, allowing them to monopolize the processors when needed; Tejun simply replaced that workqueue with a set of dedicated kernel threads. The ACPI code had bound a workqueue thread to CPU 0 because some operations corrupt the system if run anywhere else; that case is easily handled with the existing schedule_work_on() function. So it seems that, for now at least, there is no need for non-default worker threads.

One remaining issue is that some subsystems use single-threaded workqueues as a sort of synchronization mechanism; they expect tasks to complete in the same order they were submitted. Global thread pools change that behavior; Tejun has not yet said how he will solve that problem.

It almost certainly will be solved, along with the other concerns. David Howells, the creator of the slow work subsystem, thinks that the new workqueues could be a good replacement. In summary, this change looks likely to be accepted, perhaps as early as 2.6.33. Then we might finally have a single thread pool in the kernel.

Comments (11 posted)

Infrastructure unification in the block layer

October 7, 2009

This article was contributed by Neil Brown

For many years, Linux has had two separate subsystems for managing indirect block devices: virtual storage devices which combine storage from one or more other devices in various ways to provide improved performance, flexibility, capacity, or redundancy. These two are DM (which stands for Device Mapper) and MD (which might stand for Multiple Devices or Meta Disk and is only by pure coincidence the reverse of DM).

For nearly as long there have been suggestions that having two frameworks is a waste and that they should be unified. However, little visible effort has been made toward this unification, and such efforts as there have been have not yielded any lasting success. The most united thing about the two is that they share a common directory in the Linux kernel source tree (drivers/md); this is more a confusion than a unification. The two subsystems have both seen ongoing development side by side, each occasionally gaining functionality that the other has and so, in some ways, becoming similar. But similarity is not unity; rather, it serves to highlight the lack of unity, since it is no longer function that keeps the two separate, only form.

Exploring why unification has never happened would be an interesting historical exercise that would need to touch on the personalities of the people involved, the drift in functionality between the two systems which started out with quite different goals, the differing perceptions of each by various members of the community, and the technological differences that would need to be resolved. Not being an historian, your author only feels competent to comment on that last point, and, as it is the one where a greater understanding is most likely to aid unification, this article will endeavor to expose the significant technological issues that keep the two separate. In particular, we will explore the weaknesses in each infrastructure. Where a system has strengths, they are likely to be copied, thus creating more uniformity. Where it has weaknesses, they are likely to be assiduously avoided by others, thus creating barriers.

Trying to give as complete a picture as possible, we will explore more than just DM and MD. Loop, NBD, and DRBD provide similar functionality behind their own single-use infrastructure; exploring them will ensure that we don't miss any important problems or needs.

The flaws of MD

Being the lead developer of MD for some years, your author feels honour bound to start by identifying weaknesses in that system.

One of the more ugly aspects of MD is the creation of a new array device. This is triggered by simply opening a device special file, typically in /dev. In the old days, when we had a fairly static /dev directory, this seemed a reasonable approach. It was simply necessary to create a bunch of entries in /dev (md0, md1, md2, ...) with appropriate major and minor numbers at the same time that the other static content of /dev was created. Then, whenever one of those entries was opened, the internal data structures would spontaneously be created so that the details of the device could be filled in.

However, with the more modern concept of a dynamic /dev, reflecting the fact that the set of devices attached to a given system is quite fluid, this doesn't fit very well. udev, which typically manages /dev, only creates entries for devices that the kernel knows about. So it will not create any md devices until the kernel believes them to exist. But they won't exist until the device file has been created and can be opened - a classic catch-22 situation.

mdadm, the main management tool for MD, works around this problem by creating a temporary device special file just so that it can open it and thereby create the device. This works well enough, but is, nonetheless, quite ugly. The internal implementation is particularly ugly and it was only relatively recently that the races inherent in destroying an MD device (which could be recreated at any moment by user space) were closed so that MD devices don't have to exist forever.

A closely related problem with MD is that the block device representing the array appears before the array is configured or has any data in it. So when udev first creates /dev/md0, an attempt to open and read from it (to find out whether a filesystem is stored there, for example) will find no content. It is only after the component devices have been attached to the array and it has been fully configured that there is any point in trying to read data from the array.

This initial state, where the device exists but is empty, is somewhat like the case of removable-media devices, and can be managed along the same lines as those: we could treat the array as media that can spontaneously appear. However MD is, in other ways, quite unlike removable media (there is no concept of "eject") and it would generally cause less confusion if MD devices appeared fully configured so they looked more like regular disk drive devices.

A problem that has only recently been addressed is the fact that MD manages the metadata for the arrays internally. The kernel module knows all about the layout of data in the superblock and updates it as appropriate. This makes it easy to implement, but not so easy to extend. Due to the lack of any real standard, there are many vendor-specific metadata layouts that all can be used to describe the same sort of array. Supporting all of those in the kernel would unnecessarily bloat the kernel, and supporting them in user space requires information about required updates to be reported to user space in a reliable way.

As mentioned, this problem has recently been addressed, so it is now quite possible to manage vendor-specific metadata from user space. It is still worth noting, though, as one of the problems that has stood in the way of earlier attempts at DM/MD integration: DM does not manage metadata at all, leaving it up to user-space tools.

The final flaw in MD to be exposed here is the use and nature of the ioctl() commands that are used to configure and manage MD arrays. The use of ioctl() has been frowned upon in the Linux community for some years. There are a number of reasons for this. One is that strace cannot decode newly-defined ioctls, so the use of ioctl() can make a program's behaviour harder to observe. Another is that it is a binary interface (typically passing C structures around) and so, when Linux is configured to support multiple ABIs (e.g. a 32-bit and a 64-bit version), there is often a need to translate the binary structure from one ABI to the other (see the nearly 3000 lines in fs/compat_ioctl.c).

In the case of MD, the ioctl() interface is not very extensible. The command for configuring an array allows only a "level", a "layout", and a "chunksize" to be specified. This works well enough for RAID0, RAID1, and RAID5, but even with RAID10 we needed to encode multiple values into the "layout" field which, while effective, isn't elegant.

In the last few years MD has grown a separate configuration interface via a collection of attribute files exposed in sysfs. This is much more extensible, and there are a growing number of features of MD which require the sysfs interface. However even here there is still room for improvement. The MD attribute files are stored in a subdirectory of the block device directory (e.g. /sys/block/md0/md/). While this seems natural, it entrenches the above-mentioned problem that the block device must exist before the array can be configured. If we wanted to delay creation of the block device until the array is ready to serve data, we would need to store these attribute files elsewhere in sysfs.

The failings of DM

DM has a very different heritage than MD and, while it shares some of the flaws of MD, it avoids others.

DM devices do not need to exist in /dev before they can be created. Rather, there is a dedicated "character device" which accepts DM ioctl() commands, including the command to create a new device. Thus, the catch-22 problem from which MD suffers is not present in DM. It has been suggested that MD should take this approach too. However, while it does solve one problem, it still leaves the problem of using ioctl(). There seems little point in making a significant change to a subsystem unless the result avoids all known problems. So, while waiting for a perfect solution, no such small steps have been made to bring MD and DM closer together.

The related issue of a block device existing before it is configured is still present in DM, though separating the creation of the DM device from the creation of the block device would be much easier there. This is because, as mentioned, with DM all configuration happens over the character device, whereas with MD it happens via the block device itself, which must therefore exist before it can be configured.

While DM also uses ioctl() commands, which could be seen as a weakness, the commands chosen are much more extensible than those used by MD. The ioctl() command to configure a device essentially involves passing a text string to the relevant module within DM, and it interprets this string in any way it likes. So DM is not limited to the fields that were thought to be relevant when DM was first designed.

Metadata management with DM is very different than with MD. In the original design, there was never any need for the kernel module to modify metadata, so metadata management was left entirely in user space where it belongs. More recently, with RAID1 and RAID5 (which is still under development), the kernel is required to synchronously update the metadata to record a device failure. This requires a degree of interaction between the kernel and user space which has had to be added.

The main problem with the design of DM is the fact that it has two layers: the table layer and the target layer. This undoubtedly comes from the original focus of DM, which was logical volume management (LVM), and it fits that focus quite well. However, it is an unnecessary layering and just gets in the way of non-LVM applications.

A "target" is a concept internal to DM, which is the abstraction that each different module presents. So striping, raid1, multipath, etc. each present a target, and these targets can be combined via a table into a block device.

A "table" is simply a list of targets each with a size and an offset. This is analogous to the "linear" module in MD or what is elsewhere described as concatenation. The targets are essentially joined end-to-end to form a single larger block device.

This contrasts with MD where each module - raid0, raid1, or multipath, for example - presents a block device. This block device can be used as-is, or it can be combined with others, via a separate array, into a single larger block device.

To highlight the effect of this layering a little more, suppose we were to have two different arrays made of a few devices. In one array we want the data striped across the devices. In the other we lay the data out filling the first device first and then moving on to the next device. With MD, the only difference between these two would be the choice of "raid0" or "linear" as the module to manage them. With DM, the first step would involve including all the devices in a single "stripe" target, and then placing that target as the sole entry in a table. The second would involve creating a number of "linear" targets, one for each device, and then combining them into a table with multiple entries.

Having this internal abstraction of a "target" serves to insulate and isolate DM from the block device layer, which is the common abstraction used by other virtual devices. A good example of this separation is the online reconfiguration functionality that DM provides. The boundary between the table and the targets allows DM to capture new requests in the table layer while allowing the target layer to drain and become idle, and then to potentially replace all the targets with different targets before releasing the requests that have been held back.

Without that internal "target" layer, that functionality would need to be implemented in the block layer on its boundary with the driver code (i.e. in generic_make_request() and bio_endio()). Doing this would be more effort (i.e. DM would not benefit from insulation) and it would then be more generally useful (i.e. DM would not be so isolated). Many people have wanted to be able to convert a normal device into a degraded RAID1 "array" or to enable multipath support on a device without first unmounting the filesystem which was mounted directly from one of the paths. If online reconfiguration were supported at the block layer level, these changes would become possible.

The difference of DRBD

DRBD, the Distributed Replicated Block Device, is the most complex of the virtual block devices that do not aim to provide a general framework. It is not yet included in the mainline, but it may be merged for 2.6.33.

Its configuration mechanism is similar to that of DM in a number of ways. There is a single channel which can be used to create and then manage the block devices. The protocol used over this channel is designed to be extensible, though the current definitions are very much focused around the particular needs of DRBD (as would be expected), so how easy it might be to extend to a different sort of array is not immediately clear.

Where DM uses ioctl() with string commands over a dedicated character device, DRBD uses a packed binary protocol over a netlink connection. This is essentially a socket connection between the kernel module and a user-space management program which carries binary encoded messages back and forth. This is probably no better or worse than ioctl(); it is simply different. Presumably it was chosen because there is general bad feeling about ioctl(), but no such bad feeling about netlink. Linus, however, doesn't seem keen on either approach.

DRBD appears to share metadata management between the kernel module and user space. Metadata which describes a particular DRBD configuration is created and interpreted by user-space tools and the information that is needed by the kernel is communicated over the netlink socket. DRBD uses other metadata to describe the current replication state of the system - which blocks are known to be safely replicated and which are possibly inconsistent between the replicas for some reason. This metadata (an activity log and a bitmap) is managed by the kernel, presumably for performance reasons.

This sharing of responsibility makes a lot of sense as it allows the performance-sensitive portions to remain in the kernel but still leaves a lot of flexibility to support different metadata formats. This approach could be improved further by making the bitmap and activity log into independent modules that can be used by other virtual devices. Each of DM, MD, and DRBD has very similar mechanisms for tracking inconsistencies between component devices; this is possibly the most obvious area where sharing would be beneficial.
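The shared mechanism in question is essentially a write-intent bitmap. A toy model of one, with invented names and an illustrative region size, might look like:

```python
# Sketch of a write-intent bitmap of the kind MD, DM and DRBD each carry:
# before data is written, the covering region is marked dirty (and, in a
# real driver, flushed to stable storage first); once replication completes
# it can be cleared again. After a crash, only dirty regions need
# resynchronizing. Names and the region size are illustrative, not any
# driver's actual on-disk format.

class WriteIntentBitmap:
    def __init__(self, device_sectors, region_sectors=2048):
        self.region = region_sectors
        nregions = -(-device_sectors // region_sectors)   # ceiling division
        self.bits = [False] * nregions

    def mark_dirty(self, sector, count):
        first = sector // self.region
        last = (sector + count - 1) // self.region
        for r in range(first, last + 1):
            self.bits[r] = True

    def clear(self, sector, count):
        first = sector // self.region
        last = (sector + count - 1) // self.region
        for r in range(first, last + 1):
            self.bits[r] = False

    def regions_to_resync(self):
        """After a crash, only these regions need copying between replicas."""
        return [r for r, dirty in enumerate(self.bits) if dirty]

bm = WriteIntentBitmap(1 << 20)    # a 1Mi-sector device: 512 regions
bm.mark_dirty(5000, 16)            # an in-flight write
print(bm.regions_to_resync())      # [2]
```

Because this logic is independent of how the replicas themselves are arranged, it is a natural candidate for a module shared by all three subsystems.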

Loop, NBD and the purpose of infrastructure

Partly to emphasize the fact that it isn't necessary to use a framework to have a virtual block device, loop and NBD (the Network Block Device) are worth considering. While loop doesn't appear to aim to provide a framework for a multiplicity of virtual devices, it nonetheless combines three different functions into one device. It can make a regular file look like a block device, it can provide primitive partitioning of a different block device, and it can provide encryption and decryption so that an encrypted device can be accessed. Significantly, these are each functions that were subsequently added to DM, thus highlighting the isolating effect of the design of DM.
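Loop's first function, and its offset-based "primitive partitioning," reduce to very little code in sketch form. The sector size, class, and names here are invented for illustration:

```python
import tempfile

# Sketch of the core of a loop-style device: a regular file accessed as if
# it were a block device, with an optional starting offset into the file,
# which is essentially all loop's "primitive partitioning" amounts to.

SECTOR = 512

class FileBackedDevice:
    def __init__(self, path, offset=0):
        self.f = open(path, "r+b")
        self.offset = offset          # skip this many bytes of the file

    def read_sector(self, sector):
        self.f.seek(self.offset + sector * SECTOR)
        return self.f.read(SECTOR)

    def write_sector(self, sector, data):
        assert len(data) == SECTOR
        self.f.seek(self.offset + sector * SECTOR)
        self.f.write(data)
        self.f.flush()

# A 1024-"sector" backing file, with a "partition" starting at sector 64.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"\0" * (1024 * SECTOR))
dev = FileBackedDevice(tf.name, offset=64 * SECTOR)
dev.write_sector(0, b"x" * SECTOR)
```

The encryption function would simply transform the data on its way through read_sector() and write_sector(); it is the kind of stackable transformation that, as noted, DM later reimplemented as a target.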

NBD is much simpler in that it has just one function: it provides a block device for which all I/O requests are forwarded over a network connection to be serviced on - normally - a different host. It is possibly most instructive as an example of a virtual block device that doesn't need any surrounding framework or infrastructure.

Two areas where DM or MD devices make use of an infrastructure, while Loop and NBD need to fend for themselves, are in the creation of new devices and the configuration of those devices. NBD takes a very simple approach of creating a predefined number of devices at module initialization time and not allowing any more. Loop is a little more flexible and uses the same mechanism as MD, largely provided by the block layer, to create loop devices when the block device special file is opened. It does not allow these to be deleted until the module is unloaded, usually at system shutdown time. This architecture suggests that some infrastructure could be helpful for these drivers, and that the best place for that infrastructure could well be in the block layer, and thus shared by all devices.

For configuration, both Loop and NBD use a fairly ad hoc collection of ioctl() commands. As we have already observed, this is both common and problematic. They could both benefit from a more standardized and transparent configuration mechanism.

It might be appropriate to ask at this point why there is any need for subsystem infrastructure such as DM and MD. Why not simply follow the pattern seen in loop, NBD and DRBD and have a separate block device driver for each sort of virtual block device? The most obvious reason is one that doesn't really apply any more. At the time when MD and DM were being written there was a strong connection between major device numbers and block device drivers. Each driver needed a separate major number. Loop is 7, NBD is 43, DRBD is 147, and MD is 9. DM doesn't have a permanently allocated number; it chooses a spare number when the module is loaded, so it usually gets 254 or 253.

Furthermore, at that time, the number of available major numbers was limited to 255 and there was danger of running out. Allocating one major number for RAID0, one for LINEAR, one for RAID1 and so forth would have looked like a bit of a waste, so getting one for MD and plugging different personalities into the one driver might have been a simple matter of numerical economy. Today, we have many more major numbers available, and we no longer have a tight binding between major numbers and device drivers - a driver simply claims whichever device numbers it wants at any time, when the module is loaded, or when a device is created.
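The arithmetic behind this history is easy to illustrate. The following sketch mirrors the kernel's MKDEV() macro from include/linux/kdev_t.h (a 12-bit major and 20-bit minor in the kernel's internal 32-bit dev_t) alongside the old 8/8 split; the helper names are invented:

```python
# Why majors were once scarce: the old 16-bit dev_t packed an 8-bit major
# with an 8-bit minor, so only 256 major numbers could ever exist. The
# kernel's internal dev_t is now 32 bits, split 12/20 (see MKDEV() in
# include/linux/kdev_t.h): 4096 majors, each with over a million minors.

MINORBITS = 20

def old_mkdev(major, minor):
    assert major < (1 << 8) and minor < (1 << 8)
    return (major << 8) | minor

def new_mkdev(major, minor):
    assert major < (1 << 12) and minor < (1 << MINORBITS)
    return (major << MINORBITS) | minor

print(hex(old_mkdev(9, 0)))    # MD's md0 under the old scheme: 0x900
print(hex(new_mkdev(9, 0)))    # and under the new one: 0x900000
```

With thousands of majors available, and drivers free to claim device number ranges dynamically, giving each RAID personality its own driver would no longer waste anything.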

A second reason is the fact that all the MD personalities envisioned at the time had a lot in common. In particular they each used a number of component devices to create a larger device. While creating a midlayer to encapsulate this functionality might be a mistake, it is a very tempting step and would seem to make implementation easier.

Finally, as has been mentioned, having a single module which defines its own internal interfaces can provide a measure of insulation from other parts of the kernel. While this was mentioned only in the context of DM, it is by no means absent from MD. That insulation, while not necessarily in the best interests of the kernel as a whole, can make life a lot easier for the individual developer.

None of these reasons really stand up as defensible today, though some were certainly valid in the past. So it could be that, rather than seeking unification of MD and DM, we should be seeking their deprecation. If we can find a simple approach to allow different implementations of virtual block devices to exist as independent drivers, but still maintain all the same functionality as they presently have, that is likely to be the best way forward.

Unification with the device model

This brings us to the Linux device model. While there may be no real need to unify DM with MD, the devices they create need to fit into the unifying model for devices which we call the "device model" and which is exposed most obviously through various directory trees in sysfs. The device model has a very broad concept of a "device." It is much more than the traditional Unix block and character devices; it includes busses, intermediate devices, and just about anything that is in any way addressable.

In this model it would seem sensible for there to be an "array" device which is quite separate from the "block" device providing access to the data in the array. This is not unlike the current situation where a SCSI bus has a child which is a SCSI target which, in turn, has a child which is a SCSI LUN (Logical UNit), and that device itself is still separate from the block device that we tend to think of as a "SCSI disk". This separation would allow the array to be created and configured before the block device can come into being, thus removing any room for confusion for udev.

The device model already allows for a bus driver to discover devices on that bus. In most cases this happens automatically during boot or at hotplug time. However, it is possible to ask a bus to discover any new devices, or to look for a particular new device. This last action could easily be borrowed to manage creation of virtual block devices on a virtual bus. The automatic scan would not find any devices, but an explicit request for an explicitly-named device could always succeed by simply creating that device. If we then configure the device by filling in attribute files in the virtual block device, we have a uniform and extensible mechanism for configuring all virtual block devices that fits with an existing model.

Again, the device model already allows for binding different drivers to devices as implemented in the different "bind" files in the /sys/bus directory tree. Utilizing this idea, once a virtual block device was "discovered" on the virtual block device bus, an appropriate driver could be bound to it that would interpret the attributes, possibly create files for extra attributes, and, ultimately, instantiate the block device.
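That flow might look like the following purely hypothetical sketch, which uses a temporary directory in place of sysfs; every path, attribute name, and driver name here is invented, solely to illustrate the create, configure, then bind sequence:

```python
import os, tempfile

# Hypothetical sketch of the proposed flow: ask the virtual bus to
# "discover" a named device (which creates it), configure it by writing
# attribute files, then bind a driver to instantiate the block device.
# A temporary directory stands in for /sys/bus; nothing here is a real
# kernel interface.

def create_device(bus, name):
    os.makedirs(os.path.join(bus, "devices", name))

def set_attr(bus, name, attr, value):
    with open(os.path.join(bus, "devices", name, attr), "w") as f:
        f.write(value)

def bind(bus, name, driver):
    # Real sysfs would have the driver core call the driver's probe()
    # here; this sketch just records the binding.
    set_attr(bus, name, "driver", driver)

bus = tempfile.mkdtemp()                                 # stand-in for sysfs
create_device(bus, "vb0")                                # "discover" by name
set_attr(bus, "vb0", "components", "/dev/sda /dev/sdb")  # invented attribute
set_attr(bus, "vb0", "level", "raid1")                   # invented attribute
bind(bus, "vb0", "md-raid1")                             # invented driver name
```

The appeal of this shape is that it adds nothing new to the device model: named discovery, attribute files, and driver binding all exist today for physical buses.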

Possibly the most difficult existing feature to represent cleanly in the device model is the on-line reconfiguration that DM and, more recently, MD provide. This allows control of an array to be passed from one driver to another without needing to destroy and recreate the block device (thus, for example, a filesystem can remain mounted during the transition). Doing this exchange in a completely general way would involve detaching a block device from one parent and attaching it to another. This would be complex for a number of reasons, one being the backing_dev_info structure, which creates quite a tight connection between a filesystem and the driver for the mounted block device.

Another weakness in the device model is that dependencies between devices are very limited - a device can be dependent on at most one other device, its parent. This doesn't fit very well with the observation that an array is dependent on all the components of the array, and that these components can change from time to time. Fortunately this weakness has already been identified and, hopefully, will be resolved in a way that also works for virtual block devices.

So, while there are plenty of issues that this model leaves unresolved, it does seem that unification with the device model holds the key to unification between MD and DM, along with any other virtual block devices.

So what is the answer?

Knowing that a problem is hard does not excuse us from solving it. With the growing interest in managing multiple devices together, as seen in DRBD and Btrfs, as well as in the increasing convergence in functionality between DM and MD, now might be the ideal time to solve the problems and achieve unification. Reflecting on the various problems and differences discussed above, it would seem that a very important step would be to define and agree on some interfaces; two in particular.

The first interface that we need is the device creation and configuration interface. It needs to provide for all the different needs of DM, MD, Loop, NBD, DRBD, and probably even Btrfs. It needs to be complete enough that the current ioctl() and netlink interfaces can be implemented entirely through calls into this new interface. It is almost certain that this interface should be exposed through sysfs and so needs to integrate well with the device model.

The second interface is the one between the block layer and individual block drivers. This interface needs to be enhanced to support all the functionality that a DM target expects of its interface with a DM table, and in particular it needs to be able to support hotplugging of the underlying driver while the block device remains active.

Defining and, very importantly, agreeing on these interfaces will go a long way towards achieving the long-sought-after unification.



Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds