The current 2.6 prepatch remains 2.6.24-rc2. Quite a few patches
have found their way into the mainline git repository since -rc2 was
released; they are mostly fixes but there's also some ongoing CIFS ACL
support work and the removal of a number of obsolete documents. Expect the
-rc3 release sometime in the very near future.
The current -mm tree is 2.6.24-rc2-mm1 - the first -mm
release since 2.6.23-mm1 came out on October 11. Recent changes to
-mm include a number of device mapper updates, a big driver tree update
(which has broken a number of things), a lot of IDE updates, bidirectional
SCSI support, a large set of SLUB fixes and other "mammary manglement"
patches, 64-bit capability support, a number of ext4 enhancements, and the
PCI hotplug development tree.
In the introduction to 2.6.24-rc2-mm1 (the
first -mm tree in some time), Andrew Morton noted that some people want
something even more bleeding-edge. So he has created the -mm of the minute tree
which is updated a few times every day. "I will attempt to ensure
that the patches in there actually apply, but they sure as heck won't all
compile and run.
" The tree is exported as a patch series, so Quilt
is needed to turn it into something which can be compiled. Have fun.
Kernel development news
I claim that we'd have a much higher quality kernel if we had a
single central mailing list instead of these elitist fractured
lists. Every kernel topic would have global visibility, and it
would be trivially easy to get the interest of other people, across
subsystems.
-- Ingo Molnar
If it's not reported on linux-scsi, there's a significant chance of
us missing the bug report. The fact that some people notice bugs
go past on LKML and forward them to linux-scsi is a happy accident
and not necessarily something to rely on.
LKML has 10-20x the traffic of linux-scsi and a much smaller signal
to noise ratio. Having a specialist list where all the experts in
the field hang out actually enhances our ability to fix bugs.
-- James Bottomley
Discussions of kernel quality are not a new phenomenon on linux-kernel. It
is, indeed, a topic which comes up with a certain regularity, more so than
with many other free software projects. The size of the kernel, the rate
at which its code changes, and the wide range of environments in which the
kernel runs all lead to unique challenges; add in the fact that kernel bugs
can lead to catastrophic system failures and you have the material for no
end of debate.
The latest round began when Natalie Protasevich, a Google developer who
spends some time helping Andrew Morton track bugs, posted this list of a few dozen open bugs which
seemed worthy of further attention. Andrew responded with his view of what was happening
with those bug reports; that view was "no response from developers" in most
cases:
So I count around seven reports which people are doing something
with and twenty seven which have been just ignored.
A number of developers came back saying, in essence, that Andrew was
employing an overly heavy hand and that his assertions were not always
correct. Regardless of whether his claims are correct, Andrew has
clearly touched a nerve.
He defended his posting by raising his
often-expressed fear that the quality of the kernel is in decline. This
is, he says, something which requires attention now:
If the kernel _is_ slowly deteriorating then this won't become
readily apparent until it has been happening for a number of years.
By that stage there will be so much work to do to get us back to an
acceptable level that it will take a huge effort. And it will take
a long time after that for the kernel to get its reputation back.
But is the kernel deteriorating? That is a very hard question to answer
for a number of reasons. There is no objective standard by which the
quality of the kernel can be judged. Certain kinds of problems can be
found by automated testing, but, in the kernel space, many bugs can
only be found by running the kernel with specific workloads on specific combinations
of hardware. A rising number of bug reports does not necessarily indicate
decreasing quality when both the number of users and the size of the code
base are increasing.
Along the same lines, as Ingo Molnar pointed
out, a decreasing number of bug reports does not necessarily mean that
quality is improving. It could, instead, indicate that testers are simply
getting frustrated and dropping out of the development process - a
worsening kernel could actually cause the reporting of fewer bugs. So Ingo
says we need to treat our testers better, but we also need to work harder
at actually measuring the quality of the kernel:
I tried to make the point that the only good approach is to remove
our current subjective bias from quality metrics and to at least
realize what a cavalier attitude we still have to QA. The moment we
are able to _measure_ how bad we are, kernel developers will adapt
in a second and will improve those metrics. Lets use more debug
tools, both static and dynamic ones. Lets measure tester base and
we need to measure _lost_ early adopters and the reasons why they
left.
It is generally true that problems which can be measured and quantified
tend to be addressed more quickly and effectively. The classic example is
PowerTop, which makes power management problems obvious. Once developers
could see where the trouble was and, more to the point, could see just how
much their fixes improved the situation, vast numbers of problems went away
over a short period of time. At the moment, the kernel developers can
adopt any of a number of approaches to improving kernel quality, but they
will not have any way of really knowing if that effort is helping the
situation or not. In the absence of objective measurements, developers
trying to improve kernel quality are really just groping in the dark.
As an example, consider the discussion of the "git bisect" feature.
If one is trying to find a regression which happened between 2.6.23 and
2.6.24-rc1, one must conceivably look at several thousand patches to find
the one which caused the problem - a task which most people tend to find
just a little intimidating. Bisection helps the tester perform a binary
search over a range of patches, eliminating half of them in each
compile-and-boot cycle. Using bisect, a regression can be tracked down in
a relatively automatic way with "only" a dozen or so kernel builds and
reboots. At the end of the process, the guilty patch will have been
identified in an unambiguous way.
Bisection works so well that developers will often ask a tester to use it
to track down a problem they are reporting. Some people see this practice
as a way for lazy kernel developers to dump the work of tracking down their
bugs on the users who are bitten by those bugs. Building and testing a
dozen kernels is, they say, too much to ask of a tester. Mark Lord, for
example, asserts that most bugs are relatively
easy to find when a developer actually looks at the code; the whole
bisect process is often unnecessary:
I'm just asking that developers here do more like our Top Penguin
does, and actually look at problems and try to understand them and
suggest fixes to try. And not rely solely on the git-bisect
crutch. It's a good crutch, provided the reporter is a kernel
developer, or has a lot of time on their hands. But we debugged
Linux here for a long time without it.
On the other hand, some developers see bisection as a powerful tool which
has made it easier for testers to actively help the process. David Miller
responds:
Like the internet, this time spent is beneficial because it's
pushing the work out to the end nodes. In fact git bisect is an
awesome example of the end node principle in action for software
development and QA.
For the end-user wanting their bug fixed and the developer it's a
win win situation because the reporter is actually able to do
something proactive which will help get the bug they want fixed.
Returning to the original bug list: another issue which came up was the use of
mailing lists other than linux-kernel. Some of the bugs had not been
addressed because they had never been reported to the mailing list
dedicated to the affected subsystem. Other bugs, marked by Andrew as
having had no response, had, in fact, been discussed (and sometimes fixed)
on subsystem-specific lists. In both situations, the problem is a lack of
communication between subsystem lists and the larger community.
In response, some developers have, once again, called for a reduction in
the use of subsystem-specific lists. We are, they say, all working on a
single kernel, and we are all interested in what happens with that kernel.
Discussing kernel subsystems in isolation is likely to result in a
fragmented view of the whole. Ingo Molnar expresses it this way:
We lose much more by forced isolation of discussion than what we
win by having less traffic! It's _MUCH_ easier to narrow down
information (by filter by threads, by topics, by people, etc.) than
it is to gobble information together from various fractured
sources. We learned it _again and again_ that isolation of kernel
discussions causes bad things.
Moving discussions back onto linux-kernel seems like a very hard sell,
though. Most subsystem-specific lists feature much lower traffic, a
friendlier atmosphere, and more focused conversation. Many subscribers of
such lists are unlikely to feel that moving back to linux-kernel would
improve their lives. So, perhaps, the best that can be hoped for is that
more developers would subscribe to both lists and make a point of ensuring
that relevant information flows in both directions.
David Miller pointed out another reason why
some bug reports don't see a lot of responses: developers have to choose
which bugs to try to address. Problems which affect a lot of users, and
which can be readily reproduced, have a much higher chance of getting
quick developer attention. Bug reports which end up at the bottom of the
prioritized list ("chaff"), instead, tend to languish. The system, says
David, tends to work reasonably well:
Luckily if the report being ignored isn't chaff, it will show up
again (and again and again) and this triggers a reprioritization
because not only is the bug no longer chaff, it also now got a lot
of information tagged to it so it's a double worthwhile investment
to work on the problem.
Given that there are unlikely to ever be enough developers to respond to
every single kernel bug report, the real problem comes down to
prioritization. Andrew Morton has a clear
idea of which reports should be handled first: regressions from previous
releases:
If we're really active in chasing down the regressions then I think
we can be confident that the kernel isn't deteriorating. Probably
it will be improving as we also fix some always-been-there bugs.
Attention to regressions has improved significantly over the last couple of
years or so. They tend to be much more actively tracked, and the list of
known regressions is consulted before kernel releases are made. The real
problem, according to Andrew, is that any regressions which are still there
after a release tend to fall off the list. Better attention to those
problems would help to ensure that the quality of the kernel improved over
time.
One of the great advantages of multiprocessor computers is the fact that
main memory is available to all processors on the system. This ability to
share data gives programmers a great deal of flexibility. One of the first
things those programmers learn (or should learn), however, is that actually
sharing data between processors is to be avoided whenever possible. The
sharing of data - especially data which changes - causes all kinds of bad
cache behavior and greatly reduced performance. The recently concluded What
every programmer should know about memory series covers these problems in
great detail.
Over the years, kernel developers have made increasing use of per-CPU data
in an effort to minimize memory contention and its associated performance
penalties. As a simple example, consider the disk operation statistics
maintained by the block layer. Incrementing a global counter for every
disk operation would cause the associated cache line to bounce continually
between processors; disk operations are frequent enough that the
performance cost would be measurable. So each CPU maintains its own set of
counters locally; it never has to contend with any other CPU to increment
one of those counters. When a total count is needed, all of the per-CPU
counters are added up. Given that the counters are queried far more rarely
than they are modified, storing them in per-CPU form yields a significant
performance benefit.
In current kernels, most of these per-CPU variables are managed with an
array of pointers. So, for example, the kmem_cache structure (as
implemented by the SLUB allocator) contains this field:
struct kmem_cache_cpu *cpu_slab[NR_CPUS];
Note that the array is dimensioned to hold one pointer for every possible
CPU in the system. Most deployed computers have fewer than the maximum
number of processors, though, so there is, in general, no point in
allocating NR_CPUS objects for that array. Instead, only the
entries in the array which correspond to existing processors are populated;
for each of those processors, the requisite object is allocated using
kmalloc() and stored into the array. The end result is an array
that looks something like the diagram on the right. In this case, per-CPU
objects have been allocated for four processors, with the remaining entries
in the array being unallocated.
A quick look at the diagram immediately shows one potential problem with
this scheme: each of these per-CPU arrays is likely to have some wasted
space at the end. NR_CPUS is a configuration-time constant; most
general-purpose kernels (e.g. those shipped by distributors) tend to have
NR_CPUS set high enough to work on most or all systems which might
reasonably be encountered. In short, NR_CPUS is likely to be quite a bit
larger than the number of processors actually present, with the result that
there will be a significant amount of wasted space at the end of each
array.
In fact, Christoph Lameter noticed that there are more problems than that; in
response, he has posted a patch
series for a new per-CPU allocator. The deficiencies addressed by
Christoph's patch (beyond the wasted
space in each per-CPU array) include:
- If one of these per-CPU arrays is embedded within a larger data
structure, it may separate the other variables in that structure,
causing them to occupy more cache lines than they otherwise would.
- Each CPU uses exactly one pointer from that array (most of the time);
that pointer will reside in the processor's data cache while it is
being used. Cache lines hold quite a bit more than one pointer,
though; in this case, the rest of the cache line is almost certain to
hold the pointers for the other CPUs. Thus, scarce cache space is
being wasted on completely useless data.
- Accessing the object requires two pointer lookups - one to get the
object pointer from the array, and one to get to the object itself.
Christoph's solution is quite simple in concept: turn all of those little
per-CPU arrays into one big per-CPU array. With this scheme, each
processor is allocated a dedicated range of memory at system initialization
time. These ranges are all contiguous in the kernel's virtual address
space, so, given a pointer to the per-CPU area for CPU 0, the area for
any other processor is just a pointer addition away.
When a per-CPU object is allocated, each CPU gets a copy obtained from its
own per-CPU area. Crucially, the offset into each CPU's area is the same,
so the address of any CPU's object is trivially calculated from the address
of the first object. So the array of pointers can go away, replaced by a
single pointer to the object in the area reserved for CPU 0. The
resulting organization looks (with the application of sufficient
imagination) something like the diagram to the right. For a given object,
there is only a single pointer; all of the other versions of that object
are found by applying a
constant offset to that pointer.
The interface for the new allocator is relatively straightforward. A new
per-CPU variable is created with:
void *per_cpu_var = CPU_ALLOC(type, gfp_flags);
This call will allocate a set of per-CPU variables of the given
type, using the usual gfp_flags to control how the
allocation is performed. A pointer to a specific CPU's version of the
variable can be had with:
void *CPU_PTR(per_cpu_var, unsigned int cpu);
The THIS_CPU() form, as might be expected, returns a pointer to
the version of the variable allocated for the current CPU. There is a
CPU_FREE() macro for returning a per-CPU object to the system.
Christoph's patch converts all users of the existing per-CPU interface and
ends by removing that API altogether.
There are a number of advantages to this approach. There's one less
pointer operation for each access to a per-CPU variable. The same pointer
is used on all processors, resulting in smaller data structures and better
cache line utilization. Per-CPU variables for a given processor are
grouped together in memory, which, again, should lead to better cache use.
All of the memory wasted in the old pointer arrays has been reclaimed.
Christoph also claims that this mechanism, by making it easier to keep
track of per-CPU memory, makes the support of CPU hotplugging easier.
The amount of discussion inspired by this patch set has been relatively
low. There were complaints about the UPPER CASE NAMES used by the macros.
The biggest complaint, though, has to do with the way the static
per-CPU areas bloat the kernel's data space. On some architectures it
makes the kernel too large to boot, and it's a real cost on all
architectures. Just how this issue will be resolved is not yet clear.
If a solution can be found, the new per-CPU code has a good chance of
getting into the mainline when the 2.6.25 merge window opens.
Ceph is a distributed filesystem that is described as scaling from gigabytes to
petabytes of data with excellent performance and reliability. The project
is LGPL-licensed, with plans to move from a
FUSE-based client to an in-kernel client. This led Sage Weil to post a message to linux-kernel
describing the project and looking for filesystem developers who might be
willing to help. There are quite a few interesting features in Ceph which
might make it a nice addition to Linux.
Weil outlines why he thinks Ceph might be of interest to kernel hackers:
I periodically see frustration on this list with the lack of a scalable GPL
distributed file system with sufficiently robust replication and failure
recovery to run on commodity hardware, and would like to think that--with
a little love--Ceph could fill that gap.
The filesystem is well
described in a paper
from the 2006 USENIX Operating Systems Design and Implementation conference.
The project's homepage has the
expected mailing list, wiki, and source code repository along with a detailed
overview of the feature set.
Ceph is designed to be extremely scalable, from both the storage and
retrieval perspectives. One of its main innovations is splitting up
operations on metadata from those on file data. With Ceph, there are two
kinds of storage nodes, metadata servers (MDSs) and object storage devices
(OSDs), with clients contacting the type appropriate for the kind of
operation they are performing. The MDSs cache the metadata for files and
directories, journaling any changes, and periodically writing the metadata
as a data object
to the OSDs.
Data objects are distributed throughout the available OSDs using a
hash-like function that allows all entities (clients, MDSs, and OSDs) to
calculate the locations of an object. Coupled with an infrequently
changing OSD cluster map, all the participants can figure out where the
data is stored or where to store it.
Both the OSDs and MDSs rebalance themselves to accommodate changing
conditions and usage patterns. The MDS cluster distributes the cached
metadata throughout, possibly replicating metadata of frequently used
subtrees of the filesystem in multiple nodes of the cluster. This is done
to keep the workload evenly balanced throughout the MDS cluster. For
similar reasons, the OSDs automatically migrate data objects onto storage devices that
have been newly added to the OSD cluster; thus distributing the workload
by not allowing new devices to sit idle.
Ceph does N-way replication of its data, spread throughout the cluster.
When an OSD fails, the data is automatically re-replicated throughout the
remaining OSDs. Recovery of the replicas can be parallelized because both
the source and destination are spread over multiple disks. Unlike some other
cluster filesystems, Ceph starts from the assumption that disk failure will
be a regular occurrence. It does not require OSDs to have RAID or other
reliable disk systems, which allows the use of commodity hardware for the
storage nodes.
In his linux-kernel posting, Weil describes the
current status of Ceph:
I would describe the code base
(weighing in at around 40,000 semicolon-lines) as early alpha quality:
there is a healthy amount of debugging work to be done, but the basic
features of the system are complete and can be tested and
benchmarked.
In addition to creating an in-kernel filesystem for
the clients (OSDs and MDSs run as userspace processes), there are several
other features – notably snapshots and security – listed as needing work.
Originally the topic of Weil's Ph.D. thesis,
Ceph is also something that he
hopes to eventually use at a web hosting company he helped start before
entering graduate school:
We spend a lot of money on storage, and the proprietary products out there
are both expensive and largely unsatisfying. I think that any
organization with a significant investment in storage in the data center
should be interested [in Ceph]. There are few viable open source options once you
scale beyond a few terabytes, unless you want to spend all your time
moving data around between servers as volume sizes grow/contract over
time.
backgrounds, Ceph has some financial backing that could help it get to a
polished state more quickly. Weil is looking to hire kernel and filesystem
hackers to get Ceph to a point where it can be used reliably in production
systems. Currently, he is sponsoring the work through his web hosting
company, though an independent foundation or other organization to foster
Ceph is a possibility down the road.
Other filesystems with similar feature sets are available for Linux, but
Ceph takes a fundamentally different approach from most of them. For those
interested in filesystem hacking or just looking for a reliable solution
scalable to multiple petabytes, Ceph is worth a look.
Patches and updates
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet