The current development kernel is 2.6.32-rc3, released by Linus on
October 4. Note that there was no -rc2; that version was skipped to
avoid confusion resulting from the fat-fingering of the version number in
the -rc1 release.
[O]ne thing that might be worth mention (despite being fairly
small) is that there's been some IO latency tweaking in the block layer
(CFQ scheduler). I'm hoping that ends up being one of those noticeable
things, where people might actually notice better responsiveness. Give it a
try.
One other user-visible change: by default, VFAT filesystems
are now mounted with shortname=mixed instead of
shortname=lower, preventing the downcasing of filenames when
filesystems are copied. Also notable is that Btrfs is finally able to
handle full-disk situations; we'll have to find something else to give
Chris Mason grief about.
The short-form changelog is in the announcement, or see the
full changelog for all the details.
The current stable kernel is 2.6.31.3, released on October 7. It contains a
single fix for a TTY problem that was affecting a number of users.
Previously, 2.6.31.2 was released on October 5. This is
a large stable release; from the review announcement:
[This release] is big. Yeah, really big. There are a number of areas that needed some
rework in order to get things back to working order. Like the tty layer.
Hopefully everyone can now use their usb to serial devices again without
oopsing the kernel. Xen and KVM also have reasonably big fixes, as does
the ath5k and iwlwifi drivers. One might say that the patches for the
iwlwifi drivers are a bit "bigger" than normal -stable material, but the
wifi maintainer wants them, so he can handle the fallout. XHCI (the USB
3.0 controller) also has a big update here, to get it into workable shape
to coincide with the release of the USB 3.0 developer kit. Without it, it
wouldn't be really useful. And there's a whole raft of other important
fixes as well, not to make light of them. A huge system speedup for large
boxes is also in here, for those who like running benchmarks.
[E]xpect more than the usual amount of change for a stable update.
2.6.27.36 and 2.6.30.9 were also released on
October 5; they contain a somewhat smaller set of fixes. "This
is the last release of the 2.6.30-stable series. Everyone should now move
to the 2.6.31 kernel tree. If there are any issues preventing people from
doing this, please let me know!"
Comments (none posted)
Al Viro has managed to get his and his wife's paperwork in order,
and has returned to the USA. Apparently, there was no record of his
having been discharged from the Soviet Army. The authorities also
lacked any record of his wife having an address in Russia while an
adult. This situation was reportedly resolved as only Al Viro could
have resolved it.
-- from the Netconf 2009 minutes
Reports of its demise were greatly exaggerated. But it's going to
be a few days before we're back in sync with "current time" - I
wanted to get the retrospective episodes on the merge window done,
even though it's "old" now because the merge window period is of
high value. I've been doing about 3 hours a night over the past few
days to get through it. I view it as some kind of oddly exciting
challenge; who says podcasting is easy?
-- Jon Masters
Egad. You said "floppy" and "ioctl" in the same sentence. Where angels
fear to tread.
-- Andrew Morton
The _reason_ for the driver exemption was the fact that even a
broken driver is better than no driver at all for somebody who just
can't get a working system without it, but that argument really
goes away when the driver is so specialized that it's not about
regular hardware any more.
And the whole "driver exemption" seems to have become a by-word for
"I can ignore the merge window for 50% of my code". Which makes me
very tired of it if there aren't real advantages to real users.
So I'm seriously considering a "the driver has to be mass market
and also actually matter to an install" rule for the exemption to apply.
-- Linus Torvalds
Checkpatch is not very bright, it has no understanding of style
beyond playing with pattern regexps. It's a rather dim tool that
helps people get work done... or as some would have it a rather
dim tool used by even dimmer tools to make noise on kernel list.
-- Alan Cox
Comments (2 posted)
"Netconf 2009" was an invitation-only summit of networking developers held
prior to LinuxCon. The minutes for the two days of discussion have now
been posted on the
Netconf 2009 page
. Topics covered include transmit interrupt
mitigation, bridging, multiqueue networking, wireless networking, and more.
Comments (none posted)
One of the changes slipped into the 2.6.32-rc3 release was the addition of
a "low latency" mode for the CFQ I/O scheduler. Normally the scheduler
will try to delay many new I/O requests for a short time in the hope that
they can be joined with other requests which may come shortly thereafter.
This behavior will minimize disk seeks and maximize I/O request size, so it
is clearly good for throughput. But the addition of delays can be
a problem if the overriding goal is to complete the operation as quickly as
possible.
The new mode (initially called "desktop" before being renamed
"low_latency") is enabled by default; it can be adjusted by setting the
iosched/low_latency attribute associated with each block device in
sysfs. When set, some of the delays for "synchronous operations" (reads,
generally) no longer happen. The result should be more responsive I/O and,
one would hope, happier users.
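For the curious, the attribute can be poked at directly through sysfs. A minimal sketch, assuming a device named sda (an assumption; the guard keeps it harmless on systems where CFQ is not the active scheduler):

```shell
# Read CFQ's low_latency attribute for one device (device name "sda" is
# an assumption; the file exists only when CFQ is the active scheduler).
attr=/sys/block/sda/queue/iosched/low_latency
if [ -f "$attr" ]; then
    state=$(cat "$attr")   # 1 = low-latency mode (the new default)
    # To favor raw throughput instead (needs root):
    # echo 0 > "$attr"
else
    state="absent"
fi
echo "low_latency: $state"
```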
Note: please see the comments for a description of this change which is more, um, accurate. Your editor blames the Death Flu that his kids brought home.
Comments (10 posted)
Patches merged into the mainline carry a number of tags to indicate who
wrote them, who reviewed them, etc. A certain patch
merged for 2.6.32 contains a relatively unusual tag, though:
NACKed-by: Peter Zijlstra <email@example.com>
The merging of this patch has drawn some complaints: why should it have
made it into the mainline when a core developer clearly has problems with
it?
The story goes something like this. ACPI provides a mechanism
by which it can ask the system to make processors go idle in emergency
situations; these can include power problems or an overheating system. The
ACPI folks had originally proposed putting some hacks into the scheduler to
implement this functionality. These changes, it seems, were little loved;
that was the patch that Peter Zijlstra blocked outright.
So Shaohua Li went back and implemented this functionality as a driver
instead. If the ACPI hardware starts sounding the red alert, this driver
will create a top-priority realtime thread and bind it to the CPU that is
to be idled. That thread, when it "runs," will simply put the CPU into a
relatively deep sleep state for a while. When the emergency passes, the
thread will go away and normal life resumes. It's a bit of a hack, but it
gets the job done, and it is not destructive to system state the way
hot-unplugging the CPU would be.
The proper fix would be to enhance the scheduler (the right way) to provide
this functionality. But that almost certainly requires the intervention of
a real scheduler hacker, and they haven't yet gotten around to solving the
problem. So the ACPI "driver" is in the mainline for now. And it may stay
that way; Linus said:
In fact, the only reason the scheduler people even know about it is
that Len at first tried to do something more invasive, and was shot
down. Now it's just a driver, and the scheduler people can _try_ to
do it some other way if they really care, but that's _their_
problem. Not the driver.
In the meantime, I personally suspect we probably never want to
even try to solve it in the scheduler, because why the hell should
we care and add complex logic for something like that? At least not
until we end up having the same issue on some other architecture
too, and it turns from a hacky ACPI thing into something more.
And that's where things stand. The driver is little loved, but it will
also be little used, can be replaced with a better mechanism if the right
people care, and, in the meantime, it may solve a real problem for some
users.
Comments (2 posted)
Kernel development news
The mislabeling of 2.6.32-rc1 in the makefile might have been the cause of
some confusion, though the skipping of -rc2 will have avoided the worst of
it. But it seems that there is confusion with version numbers at other
times, leading to a push for a change that Linus has absolutely no
intention of making.
Responding to the 2.6.32-rc3 announcement, Len Brown noted that, as far as the version
number is concerned, there is no difference between (say) 2.6.31 and a
kernel checked out near the end of the 2.6.32 merge window, despite the
fact that those two kernels differ significantly from each other. Len had
a simple request:
This could be clarified if you update Makefile on the 1st commit
after 2.6.X is frozen to simply be 2.6.Y-merge or 2.6.Y-rc0 or
something. Anything but 2.6.X.
Others echoed this request, but Linus made it
clear that he was not interested in this idea:
So no. I'm not going to do -rc0. Because doing that is
_stupid_. And until you understand _why_ it's stupid, it's
pointless talking about it, and when you _do_ understand that it's
stupid, you'll agree with me.
So what is the problem with the -rc0 idea? It turns out there are a few,
one of which being that there is already a much more flexible mechanism
built into the kernel build system. If the LOCALVERSION_AUTO
configuration option is set, the extra version information will be set in a
more specific manner. Your editor, who has not been at home long enough to
install a new kernel on his desktop for a bit, is currently running a
kernel which reports its version as:

    2.6.31-rc5-00002-g3ce001e

That string says that the kernel is the one found at git commit ID g3ce001e;
the 00002 indicates that it is two commits after 2.6.31-rc5. This
version number makes the exact kernel being run clear in a way that
a simple makefile tweak would not. Even if -rc0 were really indicative, it
would not really say which kernel was being run.
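That suffix is derived from git describe output, which the kernel's setlocalversion script reformats (the zero-padding of the commit count is setlocalversion's doing). A throwaway demonstration, with repository contents invented purely for illustration:

```shell
# Build a scratch repository: tag a commit, add two more, and ask git
# to describe HEAD relative to the nearest tag.
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git -c user.name=x -c user.email=x@y commit -q --allow-empty -m base
git tag v2.6.31-rc5
git -c user.name=x -c user.email=x@y commit -q --allow-empty -m one
git -c user.name=x -c user.email=x@y commit -q --allow-empty -m two
desc=$(git describe --tags)
echo "$desc"    # v2.6.31-rc5-2-g<abbreviated commit ID>
```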
It gets worse than that, though, especially when developers start bisecting
kernels to track down bugs. Consider this example: the two post-2.6.31-rc5
commits in your editor's kernel are a pair of
BKL-removal patches which fell through the cracks and didn't make the
2.6.32 merge window. Assuming they make it into 2.6.33, the (simplified)
git revision history will look something like this:

    2.6.31-rc5 --- ... --- 2.6.32 --- ... --- 2.6.33-rc1
         \                                       /
          +---- two BKL-removal commits --------+
                (the second being g3ce001e)
A developer trying to use bisection to find a problem in 2.6.33-rc1 might
well end up at your editor's commit g3ce001e - as a stopping
point, of course; that commit could not possibly be the cause of the
problem. Should that developer look at the kernel version number at that
point, they will not see 2.6.33-rc0 (even if Linus were to make that
change) or even 2.6.32 - the version will be 2.6.31-rc5, the version that
particular commit is based on. In the git era,
kernel development is not a straight-line affair.
What this implies is that anybody who depends on the kernel version number
as found in the Makefile is likely to end up confused. There is, of
course, one important exception: that number is meaningful only for the
actual release it represents. At any other time, it is an unreliable
indicator.
That doesn't change the fact that people are getting confused by running a
kernel which identifies itself as 2.6.x, but which is really closer to
2.6.x+1. So it seems likely that a couple of things will be done to help.
One of those is to make the LOCALVERSION_AUTO option enabled by
default, and, possibly, difficult to disable. The other is to add some
smarts to the build system which tries to check whether the kernel being built
differs from the one which was tagged with the official release number. If
that is the case, a simple "+" is appended to the version number.
So a kernel checked out in the middle of the 2.6.33 merge window would
identify itself as 2.6.32+.
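The proposed check can be sketched in a few lines of shell. This is a simplification under stated assumptions (a hardcoded version string, and logic along the lines of what the build system would do), not the actual patch:

```shell
# If HEAD is not the commit tagged as the release, append "+".
# Outside a git tree both rev-parse calls fail, and the version is
# left untouched.
version=2.6.32
head=$(git rev-parse --verify HEAD 2>/dev/null)
tag=$(git rev-parse --verify "v$version^{commit}" 2>/dev/null)
if [ -n "$head" ] && [ "$head" != "$tag" ]; then
    version="${version}+"
fi
echo "$version"
```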
Linus doesn't much like that last option (he sees it as losing a lot of
information that the full LOCALVERSION_AUTO option provides), but
he "doesn't hate" it either. He actually managed to not hate the idea
enough to put together a patch implementing
it. It has not been merged as of this writing; there is still some
discussion happening about possible changes to the
LOCALVERSION_AUTO format. But it seems likely that something
along these lines will go in during the 2.6.33 merge window, if not before.
Comments (6 posted)
A "thread pool" is a common group of processes which can be called on to
perform work at some future time. The kernel does not lack for thread pool
implementations; indeed, there are more choices than one might like.
Options include workqueues, the slow work mechanism, and asynchronous
function calls, not to mention various private thread pool
implementations found elsewhere
in the kernel. It has long been thought that having just one thread pool
mechanism would be better, but nobody, so far, has managed to put together
a single implementation that everybody likes.
Of the mechanisms listed above, the most commonly used by far is
workqueues. A workqueue makes it easy for code to set aside work to be
done in process context at a future time, but workqueues are not without
their problems. There is a shared workqueue that all can use, but one
long-running task can create indefinite delays for others, so few
developers take advantage of it. Instead, the kernel has filled with
subsystem-specific workqueues, each of which contributes to the surfeit of
kernel threads running on contemporary systems. Workqueue threads contend
with each other for the CPU, causing more context switches than are really
necessary. It's discouragingly easy to create deadlocks with workqueues
when one task depends on work done by another. All told, workqueues -
despite a couple of major rewrites already - are in need of a bit of a
face lift.
Tejun Heo has provided that face lift in the form of his concurrency managed workqueues
patch. This 19-part series massively reworks the workqueue code,
addressing the shortcomings of the current workqueue subsystem. This
effort is clearly aimed at replacing the other thread pool implementations
in the kernel too, though that work is left for a later date.
Current workqueues have dedicated threads associated with them - a single
thread in some cases, one thread per CPU in others. The new workqueues do
away with that; there are no threads dedicated to any specific workqueue.
Instead, there is a global pool of threads attached to each CPU in the
system. When a work item is enqueued, it will be passed to one of the
global threads at the right time (as deemed by the workqueue code). One
interesting implication of this change is that tasks submitted to the same
workqueue on the same CPU may now execute concurrently - something which
does not happen with current workqueues.
One of the key features of the new code is its ability to manage
concurrency in general. In one sense, all workqueue tasks are executed
concurrently after submission. Actually doing things that way would yield
poor results, though; those tasks would simply contend with each other,
causing more context switches, worse cache behavior, and generally worse
performance. What's really needed is a way to run exactly one workqueue
task at a time (avoiding contention) but to switch immediately to another
if that task blocks for any reason (avoiding processor idle time). Doing
this job correctly requires that the workqueue manager become a sort of
scheduler.
As it happens, that's just how Tejun has implemented it. The workqueue
patch adds a new scheduler class which behaves very much like the normal
fair scheduler class. The workqueue class adds a couple of hooks which
call back into the workqueue code whenever a task running under that class
transitions between the blocked and runnable states. When the first
workqueue task is submitted, a thread running under the workqueue scheduler
class is created to execute it. As long as that task continues to run,
other tasks will wait. But as soon as the running task blocks on some
resource, the scheduler will notify the workqueue code and another thread
will be created to run the next task. The workqueue manager will create as
many threads as needed (up to a limit) to keep the CPU busy, but it tries
to only have one task actually running at any given time.
Also new with Tejun's patch is the concept of "rescuer" threads. In a
tightly resource-constrained system, it may become impossible to create new
worker threads. But any existing threads may be waiting for the results of
other tasks which have not yet been executed. In that situation,
everything will stop cold. To deal with this problem, some special
"rescuer" threads are kept around. If attempts to create new workers fail
for a period of time, the rescuers will be summoned to execute tasks and,
hopefully, clear the logjam.
The handling of CPU hotplugging is interesting. If a CPU is being
taken offline, the system needs to move all work off that CPU as quickly as
possible. To that end, the workqueue manager responds to a hot-unplug
notification by creating a special "trustee" manager on a CPU which is
sticking around. That trustee takes over responsibility for the workqueue
running on the doomed CPU, executing tasks until they are all gone and the
workqueue can be shut down. Meanwhile, the CPU can go offline without
waiting for the workqueue to drain.
These patches were generally welcomed, but there were some concerns. The
biggest complaint related to the special-purpose scheduling
class. The hooks were described as (1) not really scheduler-related,
and (2) potentially interesting beyond the workqueue code. For
example, Linus suggested that this kind of hook could be used
to implement the big kernel lock semantics, releasing the lock when a
process sleeps and reacquiring it on wakeup. The scheduler class will
probably go away in the next version of the patch; what remains to be seen
is what will replace it.
One idea which was suggested was to use the preemption notifier hooks which
are already in the kernel. These notifiers would have to become mandatory,
and some new callbacks would be required. Another possibility would be to
give in to
the inevitable future when perf events will take over
the entire kernel. Event tracepoints are designed to provide callbacks at
specific points in the kernel; some already exist for most of the
interesting scheduler events. Using them in this context would mostly be a
matter of streamlining the perf events mechanism to handle this task.
Andrew Morton was concerned that the new
code would take away the ability for a specific workqueue user to modify
its worker tasks - changing their priority, say, or having them run under a
different UID. It turns out that, so far, only a couple of workqueues have
been modified in this way. The workqueue used by stop_machine()
puts its worker threads into the realtime scheduling class, allowing them
to monopolize the processors when needed; Tejun simply replaced that
workqueue with a set of dedicated kernel threads. The ACPI code had bound
a workqueue thread to CPU 0 because some operations corrupt the system
if run anywhere else; that case is easily handled with the existing
schedule_work_on() function. So it seems that, for now at least,
there is no need for non-default worker threads.
One remaining issue is that some subsystems use single-threaded workqueues
as a sort of synchronization mechanism; they expect tasks to complete in
the same order they were submitted. Global thread pools change that
behavior; Tejun has not yet said how he will solve that problem.
It almost certainly will be solved, along with the other concerns. David
Howells, the creator of the slow work subsystem, thinks that the new workqueues could be a good
replacement. In summary, this change looks likely to be accepted, perhaps
as early as 2.6.33. Then we might finally have a single thread pool in the
kernel.
Comments (11 posted)
For many years, Linux has had two separate subsystems for managing
indirect block devices: virtual storage devices which combine storage
from one or more other devices in various ways to provide improved
performance, flexibility, capacity, or redundancy.
These two are DM (which stands for Device Mapper) and MD (which
might stand for Multiple Devices or Meta Disk and is only by pure
coincidence the reverse of DM).
For nearly as long there have been suggestions that having two
frameworks is a waste and that they should be unified. However, little
visible effort has been made toward this unification, and such
efforts as there have been have not yielded any lasting success.
The most united thing about the two is that they have a common
directory in the Linux kernel source tree (drivers/md); this is
more a confusion than a unification. The two subsystems have both
seen ongoing development side by side, each occasionally gaining
functionality that the other has and so, in some ways, becoming similar.
But similarity is not unity; rather, it serves to highlight the lack of
unity, as it is no longer function that keeps the two separate.
Exploring why unification has never happened would be an interesting
historical exercise that would need to touch on the personalities of
the people involved, the drift in functionality between the two
systems which started out with quite different goals, the differing
perceptions of each by various members of the community, and the
technological differences that would need to be resolved.
Not being an historian, your author only feels competent to comment on
that last point, and, as it is the one where a greater understanding is
most likely to aid unification, this article will endeavor to expose
the significant technological issues that keep the two separate. In
particular, we will explore the weaknesses in each infrastructure.
Where a system has strengths, they are likely to be copied, thus creating
more uniformity. Where it has weaknesses, they are likely to be
assiduously avoided by others, thus creating barriers.
Trying to give as complete a picture as possible, we will explore more
than just DM and MD. Loop, NBD, and DRBD provide similar
functionality behind their own single-use infrastructure; exploring
them will ensure that we don't miss any important problems or needs.
The flaws of MD
Being the lead developer of MD for some years, your author feels
honour bound to start by identifying weaknesses in that system.
One of the more ugly aspects of MD is the creation of a new array
device. This is triggered by simply opening a device special file,
typically in /dev. In the old days, when we had a fairly static /dev
directory, this seemed a reasonable approach. It was simply necessary
to create a bunch of entries in /dev (md0, md1, md2, ...) with appropriate
major and minor numbers at the same time that the other static
content of /dev was created. Then, whenever one of those entries was
opened, the internal data structures would spontaneously be created
so that the details of the device could be filled in.
However with the more modern concept of a dynamic /dev, reflecting the
fact that the set of devices attached to a given system is quite
fluid, this doesn't fit very well. udev, which typically manages
/dev, only creates entries for devices that the kernel knows about. So
it will not create any md devices until the kernel believes them to
exist. They won't exist until the device file has been created and
can be opened - a classic catch-22 situation.
mdadm, the main management tool for MD, works around this problem by creating
a temporary device special file just so that it can open it and thereby
create the device. This works well enough, but is, nonetheless, quite
ugly. The internal implementation is particularly ugly and it was
only relatively recently that the races inherent in destroying an MD
device (which could be recreated at any moment by user space) were
closed so that MD devices don't have to exist forever.
A closely related problem with MD is that the block device
representing the array appears before the array is configured or has
any data in it. So when udev first creates /dev/md0, an attempt to
open and read from it, (to find out if a filesystem is
stored there, for example) will find no content. It is only after the
component devices have been attached to the array and it has been
fully configured that there is any point in trying to read data from
the device.
This initial state, where the device exists but is empty, is somewhat
like the case of removable-media devices, and can be managed along
the same lines as those: we could treat the array as media that can
spontaneously appear. However MD is, in other ways, quite unlike
removable media (there is no concept of "eject") and it would
generally cause less confusion if MD devices appeared fully configured so
they looked more like regular
disk drive devices.
A problem that has only recently been addressed is the
fact that MD manages the metadata for the arrays internally.
The kernel module knows all about the layout of data in the superblock
and updates it as appropriate. This makes it easy to implement,
but not so easy to extend. Due to the lack of any real standard, there
are many vendor-specific metadata layouts that all can be used to
describe the same sort of array. Supporting all of those in the
kernel would unnecessarily bloat the kernel, and supporting
them in user space requires information about required updates to be
reported to user space in a reliable way.
As mentioned, this problem has recently been addressed, so it is now
quite possible to manage vendor-specific metadata from user space.
It is still worth noting, though, as one of the problems that has stood in the
way of earlier attempts at DM/MD integration: DM does not
manage metadata at all, leaving it up to user-space tools.
The final flaw in MD to be exposed here is the use and nature of
the ioctl() commands that are used to configure and manage MD arrays.
The use of ioctl() has been frowned upon in the Linux community for some
years. There are a number of reasons for this. One is that
strace cannot decode newly-defined ioctls, so the use of
ioctl() can make a program's behaviour harder to observe. Another is
that it is a binary interface (typically passing C structures around)
and so, when Linux is configured to support multiple ABIs (e.g. a 32-bit
and a 64-bit version), there is often a need to translate the binary
structure from one ABI to the other (see the nearly 3000 lines in
fs/compat_ioctl.c).
In the case of MD, the ioctl() interface is not very extensible. The
command for configuring an array allows only a "level", a "layout", and
a "chunksize" to be specified. This works well enough for RAID0,
RAID1, and RAID5, but even with RAID10 we needed to encode multiple
values into the "layout" field which, while effective, isn't elegant.
In the last few years MD has grown a separate configuration interface
via a collection of attribute files exposed in sysfs. This is much
more extensible, and there are a growing number of features of MD
which require the sysfs interface. However even here there is still
room for improvement.
The MD attribute files are stored in a subdirectory of the block
device directory (e.g. /sys/block/md0/md/). While this seems
natural, it entrenches the above-mentioned problem that the block
device must exist before the array can be configured. If we wanted to
delay creation of the block device until the array is ready to serve
data, we would need to store these attribute files elsewhere in
sysfs.
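Those attribute files can be inspected directly; a guarded sketch (md0 is an assumed array name, and the md/ directory only exists once an array has been assembled):

```shell
# List MD's per-array sysfs attributes, if an array exists (md0 is an
# assumption; the md/ directory is absent without an assembled array).
dir=/sys/block/md0/md
if [ -d "$dir" ]; then
    ls "$dir"    # level, raid_disks, chunk_size, array_state, ...
    found=yes
else
    echo "no MD array named md0 on this system"
    found=no
fi
```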
The failings of DM
DM has a very different heritage than MD and, while it shares some of
the flaws of MD, it avoids others.
DM devices do not need to exist in /dev before they can be created.
Rather there is a dedicated "character device" which accepts DM
ioctl() commands, including the command to create a new device. Thus,
the catch-22 problem from which MD suffers is not present in DM. It has
been suggested that MD should take this approach too. However, while
it does solve one problem, it still leaves the problem of using
ioctl(). There doesn't seem to be much point in making a significant change
to a subsystem unless the result avoids all known problems. So while
waiting for a perfect solution, no such small steps have been made to
bring MD and DM closer together.
The related issue of a block device existing before it is configured
is still present in DM, though separating the creation of the DM
device from the creation of the block device would be much easier in
DM. This is because, as mentioned, with DM all configuration happens
over the character device whereas with MD, the configuration
happens via the block device itself, so it must exist before it can be
configured.
While DM also uses ioctl() commands, which could be seen as a weakness,
the commands chosen are much more extensible than those used by MD.
The ioctl() command to configure a device essentially involves passing a
text string to the relevant module within DM, which interprets this
string in any way it likes. So DM is not limited to the fields that
were thought to be relevant when DM was first designed.
Metadata management with DM is very different than with MD. In the
original design, there was never any need for the kernel module to
modify metadata, so metadata management was left entirely in user space
where it belongs. More recently, with RAID1 and RAID5 (which is still
under development), the kernel is required to synchronously update the
metadata to record a device failure. This requires a degree of
interaction between the kernel and user space which has had to be added.
The main problem with the design of DM is the fact that it has two
layers: the table layer and the target layer. This undoubtedly comes
from the original focus of DM, which was logical volume management (LVM), and it fits
that focus quite well. However, it is an unnecessary layering and just
gets in the way of non-LVM applications.
A "target" is a concept internal to DM, which is the abstraction that each
different module presents. So striping, raid1, multipath, etc. each
present a target, and these targets can be combined via a table into a
block device.
A "table" is simply a list of targets each with a size and an offset.
This is analogous to the "linear" module in MD or what is elsewhere
described as concatenation. The targets are essentially joined
end-to-end to form a single larger block device.
This contrasts with MD where each module - raid0, raid1, or multipath, for
example - presents a block device. This block device can be used as-is, or it
can be combined with others, via a separate array, into a
single larger block device.
To highlight the effect of this layering a little more, suppose we
were to have two different arrays made of a few devices. In one array
we want the data striped across the devices. In the other, we want the
data laid out to fill the first device first, then to move on to
the next device.
With MD, the only difference between these two would be the choice
of "raid0" or "linear" as the module to manage them. With DM, the
first step would involve including all the devices in a single "stripe"
target, and then placing that target as the sole entry in a table.
The second would involve creating a number of "linear" targets,
one for each device, and then combining them into a table with
multiple entries.
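The contrast can be made concrete with dmsetup table syntax. Device names and sizes below are invented, and the create commands (which need root and real devices) are left commented out:

```shell
# Striped: a single "striped" target is the table's only entry.
# Line format: start length striped #devs chunk dev1 off1 dev2 off2
cat > stripe.table <<'EOF'
0 2097152 striped 2 128 /dev/sda 0 /dev/sdb 0
EOF

# Concatenated: one "linear" target per device, joined end to end.
# Line format: start length linear device offset
cat > concat.table <<'EOF'
0 1048576 linear /dev/sda 0
1048576 1048576 linear /dev/sdb 0
EOF

# dmsetup create mystripe stripe.table
# dmsetup create myconcat concat.table
echo "$(wc -l < stripe.table) $(wc -l < concat.table)"
```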
Having this internal abstraction of a "target" serves to insulate and
isolate DM from the block device layer, which is the common abstraction
used by other virtual devices. A good example of this separation is
the online reconfiguration functionality that DM provides. The
boundary between the table and the targets allows DM to capture new
requests in the table layer while allowing the target layer to drain
and become idle, and then to potentially replace all the targets with
different targets before releasing the requests that have been held
back.
Without that internal "target" layer, that functionality would need to
be implemented in the block layer on its boundary with the driver code
(i.e. in generic_make_request() and
bio_endio()). Doing this would be more effort (i.e. DM would not
benefit from insulation) and it would then be more generally useful
(i.e. DM would not be so isolated). Many people have wanted to be
able to convert a normal device into a degraded RAID1 "array" or to enable
multipath support on a device without first unmounting the filesystem
which was mounted directly from one of the paths. If online
reconfiguration were supported at the block layer level, these changes
would become possible.
The difference of DRBD
DRBD, the Distributed Replicated Block Device, is the most complex of
the virtual block devices that do not aim to provide a general
framework. It is not yet included in the mainline, but it may well be
merged for 2.6.33.
Its configuration mechanism is similar to that of DM in a number of
ways. There is a single channel which can be used to create and then
manage the block devices. The protocol used over this channel is
designed to be extensible, though the current definitions are very
much focused around the particular needs of DRBD (as would be
expected), so how easy it might be to extend to a different
sort of array is not immediately clear.
Where DM uses ioctl() with string commands over a dedicated
character device, DRBD uses a packed binary protocol over a netlink
connection. This is essentially a socket connection between the
kernel module and a user-space management program which carries binary
encoded messages back and forth. This is probably no better or worse
than ioctl(); it is simply different.
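As a rough illustration of the difference between string commands and a packed binary protocol, here is a sketch of binary message framing in the style of a netlink payload (this is not DRBD's actual wire format; the header layout is invented):

```python
import struct

# Hypothetical message layout: a fixed header of (command id, device minor,
# payload length) in network byte order, followed by the payload bytes.
HEADER = struct.Struct("!HHI")

def pack_message(command, minor, payload):
    return HEADER.pack(command, minor, len(payload)) + payload

def unpack_message(data):
    command, minor, length = HEADER.unpack_from(data)
    payload = data[HEADER.size:HEADER.size + length]
    return command, minor, payload
```

The trade-off is the usual one: binary framing is compact and unambiguous to parse, but opaque to inspection, whereas string commands (as DM uses) are self-describing but need more careful parsing.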
Presumably it was chosen because
there is general bad feeling about ioctl(), but no such bad feeling
about netlink. Linus, however, doesn't seem
keen on either approach.
DRBD appears to share metadata management between the kernel module
and user space. Metadata which describes a particular DRBD
configuration is created and interpreted by user-space tools and the
information that is needed by the kernel is communicated over the netlink connection.
DRBD uses other metadata to describe the current replication state of
the system - which blocks are known to be safely replicated and which
are possibly inconsistent between the replicas for some reason. This
metadata (an activity log and a bitmap) is managed by the kernel,
presumably for performance reasons.
This sharing of responsibility makes a lot of sense as it allows the
performance-sensitive portions to remain in the kernel but still leaves a
lot of flexibility to support different metadata formats.
This approach could be improved even more by making the bitmap and activity log
into independent modules that can be used by other virtual devices.
Each of DM, MD, and DRBD have very similar mechanisms for
tracking inconsistencies between component devices; this is
possibly the most obvious area where sharing would be beneficial.
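The shared idea behind these mechanisms - MD's write-intent bitmap and DRBD's activity log and bitmap - can be sketched as follows (a simplified model, not any of the actual implementations; the region size and method names are invented):

```python
# Sketch of a dirty-region bitmap: mark a region before writing to it,
# clear it once every replica has the data, and, after a crash, copy only
# the regions still marked dirty instead of resynchronizing the whole device.
class DirtyBitmap:
    def __init__(self, device_sectors, region_sectors):
        self.region_sectors = region_sectors
        n = (device_sectors + region_sectors - 1) // region_sectors
        self.dirty = [False] * n

    def mark_before_write(self, sector):
        self.dirty[sector // self.region_sectors] = True

    def clear_after_replication(self, sector):
        self.dirty[sector // self.region_sectors] = False

    def regions_to_resync(self):
        # Only these regions might differ between the replicas.
        return [i for i, d in enumerate(self.dirty) if d]
```

A module offering exactly this service, usable by DM, MD, and DRBD alike, is the kind of sharing suggested above.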
Loop, NBD and the purpose of infrastructure
Partly to emphasize the fact that it isn't necessary to use a
framework to have a virtual block device, loop and NBD (the Network
Block Device) are worth considering.
While loop doesn't appear to aim to provide a framework for a
multiplicity of virtual devices, it nonetheless combines three different
functions into one device. It can make a regular file look like a
block device, it can provide primitive partitioning of a different
block device, and it can provide encryption and decryption so that an
encrypted device can be accessed.
Significantly, these are each functions that were subsequently added
to DM, thus highlighting the isolating effect of the design of DM.
NBD is much simpler in that it has just one function: it provides
a block device for which all I/O requests are forwarded over a network
connection to be serviced on - normally - a different host. It is
possibly most instructive as an example of a virtual block device that
doesn't need any surrounding framework or infrastructure.
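The essential NBD idea - forward each request over a socket and let the other end service it - can be sketched in a few lines (a toy protocol, not the real NBD wire format; the framing and helper names are invented):

```python
import socket
import struct
import threading

# Toy read-only protocol: each request is a packed (offset, length) pair,
# and the server replies with exactly that many bytes of data.
REQ = struct.Struct("!QI")

def serve(conn, backing):
    # The "server" side: answer read requests from an in-memory backing store.
    while True:
        hdr = conn.recv(REQ.size)
        if not hdr:
            break
        offset, length = REQ.unpack(hdr)
        conn.sendall(backing[offset:offset + length])

def remote_read(conn, offset, length):
    # The "block device" side: forward the request and wait for the data.
    conn.sendall(REQ.pack(offset, length))
    data = b""
    while len(data) < length:
        data += conn.recv(length - len(data))
    return data

client, server = socket.socketpair()
backing = bytes(range(256)) * 16            # a 4096-byte "remote disk"
threading.Thread(target=serve, args=(server, backing), daemon=True).start()
```

Everything else a real NBD needs - device creation, configuration, error handling - is exactly the infrastructure question discussed next.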
Two areas where DM or MD devices make use of an infrastructure, while
Loop and NBD need to fend for themselves, are in the creation of new
devices and the configuration of those devices.
NBD takes a very simple approach of creating a predefined number of
devices at module initialization time and not allowing any more.
Loop is a little more flexible and uses the same mechanism as MD,
largely provided by the block layer, to create loop devices when the
block device special file is opened. It does not allow these to be
deleted until the module is unloaded, usually at system shutdown
time. This architecture suggests that some infrastructure could be helpful for
these drivers, and that the best place for that infrastructure could
well be in the block layer, and thus shared by all devices.
For configuration, both Loop and NBD use a fairly ad hoc
collection of ioctl() commands. As we have already observed, this is
both common and problematic. They could both benefit from a more
standardized and transparent configuration mechanism.
It might be appropriate to ask at this point why there is any need for
subsystem infrastructure such as DM and MD. Why not simply follow the
pattern seen in loop, NBD and DRBD and have a separate block device driver
for each sort of virtual block device?
The most obvious reason is one that doesn't really apply any more. At
the time when MD and DM were being written there was a strong
connection between major device numbers and block device drivers. Each
driver needed a separate major number. Loop is 7, NBD is 43, DRBD is
147, and MD is 9. DM doesn't have a permanently allocated number; it
chooses a spare number when the module is loaded, so it usually gets
254 or 253.
Furthermore, at that time, the number of available major
numbers was limited to 255 and there was danger of running out.
Allocating one major number for RAID0, one for LINEAR, one for RAID1
and so forth would have looked like a bit of a waste, so getting one
for MD and plugging different personalities into the one driver
might have been a simple matter of numerical economy.
Today, we have many more major numbers available, and we no longer
have a tight binding between major numbers and device drivers - a
driver simply claims whichever device numbers it wants at any time,
when the module is loaded, or when a device is created.
A second reason is the fact that all the MD personalities
envisioned at the time had a lot in common. In particular they each
used a number of component devices to create a larger device. While
creating a midlayer to encapsulate this functionality might be a mistake,
it is a very tempting step and would seem to make implementation easier.
Finally, as has been mentioned, having a single module which defines
its own internal interfaces can provide a measure of insulation from
other parts of the kernel. While this was mentioned only in the
context of DM, it is by no means absent from MD. That insulation,
while not necessarily in the best interests of the kernel as a whole,
can make life a lot easier for the individual developer.
None of these reasons really stands up as defensible today, though some
were certainly valid in the past. So it could be that, rather than
seeking unification of MD and DM, we should be seeking their
deprecation. If we can find a simple approach to allow different
implementations of virtual block devices to exist as independent
drivers, but still maintain all the same functionality as they
presently have, that is likely to be the best way forward.
Unification with the device model
This brings us to the Linux device model. While there may be no real
need to unify DM with MD, the devices they create need to fit into the
unifying model for devices which we call the "device model" and which is
exposed most obviously through various directory trees in sysfs.
The device model has a very broad concept of a "device." It is much
more than the traditional Unix block and character devices; it
includes busses, intermediate devices, and just about anything that
is in any way addressable.
In this model it would seem sensible for there to be an "array" device
which is quite separate from the "block" device providing access
to the data in the array. This is not unlike the current situation
where a SCSI bus has a child which is a SCSI target which, in turn, has
a child which is a SCSI LUN (Logical UNit), and that device itself is
still separate from the block device that we tend to think of as a disk.
This separation would allow the array to be created and configured
before the block device can come into being, thus removing any room for
confusion for udev.
The device model already allows for a bus driver to discover
devices on that
bus. In most cases this happens automatically during boot or at
hotplug time. However, it is possible to ask a bus to discover any new
devices, or to look for a particular new device. This last action
could easily be borrowed to manage creation of virtual block devices
on a virtual bus. The automatic scan would not find any devices, but
an explicit request for an explicitly-named device could always
succeed by simply creating that device.
If we then configure the device by filling in attribute files in the
virtual block device, we have a uniform and extensible mechanism for
configuring all virtual block devices that fits with an existing, well-understood model.
Again, the device model already allows for binding different drivers
to devices as implemented in the different "bind" files in the
/sys/bus directory tree. Utilizing this idea, once a virtual block
device was "discovered" on the virtual block device bus, an appropriate
driver could be bound to it that would interpret the attributes,
possibly create files for extra attributes, and, ultimately, instantiate
the block device.
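The proposal sketched in the preceding paragraphs - a virtual bus where explicit requests create devices, attribute files configure them, and a bound driver instantiates the block device - might look something like this in outline (purely hypothetical; no such kernel interface exists, and all the names here are invented):

```python
# Hypothetical model of a "virtual block device bus" in the device model.
class VirtualBlockBus:
    def __init__(self):
        self.devices = {}

    def scan(self):
        # The automatic scan finds nothing: virtual devices exist only on request.
        return []

    def request_device(self, name):
        # An explicit request for an explicitly-named device always succeeds
        # by simply creating that device, initially unconfigured and unbound.
        return self.devices.setdefault(name, {"attrs": {}, "driver": None})

    def set_attr(self, name, key, value):
        # Analogous to writing an attribute file under the device's sysfs directory.
        self.devices[name]["attrs"][key] = value

    def bind(self, name, driver):
        # Analogous to the "bind" files in /sys/bus: the driver interprets the
        # attributes and would then instantiate the actual block device.
        device = self.devices[name]
        driver(device["attrs"])
        device["driver"] = driver.__name__
```

The point of the sketch is the sequencing: the "array" device exists and is fully configured before any driver is bound and before any block device appears, which is exactly the property that removes the room for udev confusion.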
Possibly the most difficult existing feature to represent cleanly in
the device model is the on-line reconfiguration that DM and, more
recently, MD provide. This allows control of an array to be passed
from one driver to another without needing to destroy and recreate the
block device (thus, for example, a filesystem can remain mounted
during the transition). Doing this exchange in a completely general way would
involve detaching a block device from one parent and attaching it to
another. This would be complex for a number of reasons,
one being the backing_dev_info structure, which creates quite
a tight connection between a filesystem and the driver for the
mounted block device.
Another weakness in the device model is that dependencies between
devices are very limited - a device can be dependent on at most one
other device, its parent. This doesn't fit very well with the
observation that an array is dependent on all the components of the
array, and that these components can change from time to time.
Fortunately this weakness has already been noted elsewhere
and, hopefully, will be resolved in a way that also works for virtual block
devices.
So, while there are plenty of issues that this model leaves unresolved,
it does seem that unification with the device model holds the key to
unification between MD and DM, along with any other virtual block devices.
So what is the answer?
Knowing that a problem is hard does not excuse us from solving it.
With the growing interest in managing multiple devices together, as
seen in DRBD and Btrfs, as well as in the increasing convergence in
functionality between DM and MD, now might be the ideal time
to solve the problems and achieve unification.
Reflecting on the various problems and differences discussed above, it
would seem that a very important step would be to define and agree on
some interfaces; two in particular.
The first interface that we need is the device creation and
configuration interface. It needs to provide for all the different
needs of DM, MD, Loop, NBD, DRBD, and probably even Btrfs. It needs to
be sufficiently complete that the current ioctl() and netlink
interfaces can be implemented entirely through calls into this new
interface. It is almost certain that this interface should be exposed
through sysfs and so needs to integrate well with the device model.
The second interface is the one between the block
layer and individual block drivers. This interface needs to be enhanced to
support all the functionality that a DM target expects of its
interface with a DM table, and in particular it needs to be able to
support hotplugging of the underlying driver while the block device remains in use.
Defining and, very importantly, agreeing on these interfaces will go a
long way towards achieving the long sought after unification.
Page editor: Jonathan Corbet