
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.24-rc2. Quite a few patches have found their way into the mainline git repository since -rc2 was released; they are mostly fixes but there's also some ongoing CIFS ACL support work and the removal of a number of obsolete documents. Expect the -rc3 release sometime in the very near future.

The current -mm tree is 2.6.24-rc2-mm1 - the first -mm release since 2.6.23-mm1 came out on October 11. Recent changes to -mm include a number of device mapper updates, a big driver tree update (which has broken a number of things), a lot of IDE updates, bidirectional SCSI support, a large set of SLUB fixes and other "mammary manglement" patches, 64-bit capability support, a number of ext4 enhancements, and the PCI hotplug development tree.

Comments (2 posted)

The -mm of the minute tree

In the introduction to 2.6.24-rc2-mm1 (the first -mm tree in some time), Andrew Morton noted that some people want something even more bleeding-edge. So he has created the -mm of the minute tree, which is updated a few times every day. "I will attempt to ensure that the patches in there actually apply, but they sure as heck won't all compile and run." The tree is exported as a patch series, so Quilt is needed to turn it into something which can be compiled. Have fun.

Comments (12 posted)

Kernel development news

Quotes of the week

I claim that we'd have a much higher quality kernel if we had a single central mailing list instead of these elitist fractured lists. Every kernel topic would have global visibility, and it would be trivially easy to get the interest of other people, across subsystems.
-- Ingo Molnar

If it's not reported on linux-scsi, there's a significant chance of us missing the bug report. The fact that some people notice bugs go past on LKML and forward them to linux-scsi is a happy accident and not necessarily something to rely on.

LKML has 10-20x the traffic of linux-scsi and a much smaller signal to noise ratio. Having a specialist list where all the experts in the field hang out actually enhances our ability to fix bugs.

-- James Bottomley

Comments (16 posted)

Various topics related to kernel quality

By Jonathan Corbet
November 14, 2007
Discussions of kernel quality are not a new phenomenon on linux-kernel. It is, indeed, a topic which comes up with a certain regularity, more so than with many other free software projects. The size of the kernel, the rate at which its code changes, and the wide range of environments in which the kernel runs all lead to unique challenges; add in the fact that kernel bugs can lead to catastrophic system failures and you have the material for no end of debate.

The latest round began when Natalie Protasevich, a Google developer who spends some time helping Andrew Morton track bugs, posted this list of a few dozen open bugs which seemed worthy of further attention. Andrew responded with his view of what was happening with those bug reports; that view was "no response from developers" in most cases:

So I count around seven reports which people are doing something with and twenty seven which have been just ignored.

A number of developers came back saying, in essence, that Andrew was employing an overly heavy hand and that his assertions were not always correct. Right or wrong, though, Andrew had clearly touched a nerve.

He defended his posting by raising his often-expressed fear that the quality of the kernel is in decline. This is, he says, something which requires attention now:

If the kernel _is_ slowly deteriorating then this won't become readily apparent until it has been happening for a number of years. By that stage there will be so much work to do to get us back to an acceptable level that it will take a huge effort. And it will take a long time after that for the kernel to get its reputation back.

But is the kernel deteriorating? That is a very hard question to answer for a number of reasons. There is no objective standard by which the quality of the kernel can be judged. Certain kinds of problems can be found by automated testing, but, in the kernel space, many bugs can only be found by running the kernel with specific workloads on specific combinations of hardware. A rising number of bug reports does not necessarily indicate decreasing quality when both the number of users and the size of the code base are increasing.

Along the same lines, as Ingo Molnar pointed out, a decreasing number of bug reports does not necessarily mean that quality is improving. It could, instead, indicate that testers are simply getting frustrated and dropping out of the development process - a worsening kernel could actually cause the reporting of fewer bugs. So Ingo says we need to treat our testers better, but we also need to work harder at actually measuring the quality of the kernel:

I tried to make the point that the only good approach is to remove our current subjective bias from quality metrics and to at least realize what a cavalier attitude we still have to QA. The moment we are able to _measure_ how bad we are, kernel developers will adopt in a second and will improve those metrics. Lets use more debug tools, both static and dynamic ones. Lets measure tester base and we need to measure _lost_ early adopters and the reasons why they are lost.

It is generally true that problems which can be measured and quantified tend to be addressed more quickly and effectively. The classic example is PowerTop, which makes power management problems obvious. Once developers could see where the trouble was and, more to the point, could see just how much their fixes improved the situation, vast numbers of problems went away over a short period of time. At the moment, the kernel developers can adopt any of a number of approaches to improving kernel quality, but they will not have any way of really knowing if that effort is helping the situation or not. In the absence of objective measurements, developers trying to improve kernel quality are really just groping in the dark.

As an example, consider the discussion of the "git bisect" feature. If one is trying to find a regression which happened between 2.6.23 and 2.6.24-rc1, one must conceivably look at several thousand patches to find the one which caused the problem - a task which most people tend to find just a little intimidating. Bisection helps the tester perform a binary search over a range of patches, eliminating half of them in each compile-and-boot cycle. Using bisect, a regression can be tracked down in a relatively automatic way with "only" a dozen or so kernel builds and reboots. At the end of the process, the guilty patch will have been identified in an unambiguous way.

Bisection works so well that developers will often ask a tester to use it to track down a problem they are reporting. Some people see this practice as a way for lazy kernel developers to dump the work of tracking down their bugs on the users who are bitten by those bugs. Building and testing a dozen kernels is, they say, too much to ask of a tester. Mark Lord, for example, asserts that most bugs are relatively easy to find when a developer actually looks at the code; the whole bisect process is often unnecessary:

I'm just asking that developers here do more like our Top Penguin does, and actually look at problems and try to understand them and suggest fixes to try. And not rely solely on the git-bisect crutch. It's a good crutch, provided the reporter is a kernel developer, or has a lot of time on their hands. But we debugged Linux here for a long time without it.

On the other hand, some developers see bisection as a powerful tool which has made it easier for testers to actively help the process. David Miller says:

Like the internet, this time spent is beneficial because it's pushing the work out to the end nodes. In fact git bisect is an awesome example of the end node principle in action for software development and QA. For the end-user wanting their bug fixed and the developer it's a win win situation because the reporter is actually able to do something proactive which will help get the bug they want fixed faster.

Returning to the original bug list: another issue which came up was the use of mailing lists other than linux-kernel. Some of the bugs had not been addressed because they had never been reported to the mailing list dedicated to the affected subsystem. Other bugs, marked by Andrew as having had no response, had, in fact, been discussed (and sometimes fixed) on subsystem-specific lists. In both situations, the problem is a lack of communication between subsystem lists and the larger community.

In response, some developers have, once again, called for a reduction in the use of subsystem-specific lists. We are, they say, all working on a single kernel, and we are all interested in what happens with that kernel. Discussing kernel subsystems in isolation is likely to result in a lower-quality kernel. Ingo Molnar expresses it this way:

We lose much more by forced isolation of discussion than what we win by having less traffic! It's _MUCH_ easier to narrow down information (by filter by threads, by topics, by people, etc.) than it is to gobble information together from various fractured sources. We learned it _again and again_ that isolation of kernel discussions causes bad things.

Moving discussions back onto linux-kernel seems like a very hard sell, though. Most subsystem-specific lists feature much lower traffic, a friendlier atmosphere, and more focused conversation. Many subscribers of such lists are unlikely to feel that moving back to linux-kernel would improve their lives. So, perhaps, the best that can be hoped for is that more developers would subscribe to both lists and make a point of ensuring that relevant information flows in both directions.

David Miller pointed out another reason why some bug reports don't see a lot of responses: developers have to choose which bugs to try to address. Problems which affect a lot of users, and which can be readily reproduced, have a much higher chance of getting quick developer attention. Bug reports which end up at the bottom of the prioritized list ("chaff"), instead, tend to languish. The system, says David, tends to work reasonably well:

Luckily if the report being ignored isn't chaff, it will show up again (and again and again) and this triggers a reprioritization because not only is the bug no longer chaff, it also now got a lot of information tagged to it so it's a double worthwhile investment to work on the problem.

Given that there are unlikely to ever be enough developers to respond to every single kernel bug report, the real problem comes down to prioritization. Andrew Morton has a clear idea of which reports should be handled first: regressions from previous releases.

If we're really active in chasing down the regressions then I think we can be confident that the kernel isn't deteriorating. Probably it will be improving as we also fix some always-been-there bugs.

Attention to regressions has improved significantly over the last couple of years or so. They tend to be much more actively tracked, and the list of known regressions is consulted before kernel releases are made. The real problem, according to Andrew, is that any regressions which are still there after a release tend to fall off the list. Better attention to those problems would help to ensure that the quality of the kernel improved over time.

Comments (10 posted)

Better per-CPU variables

By Jonathan Corbet
November 12, 2007
One of the great advantages of multiprocessor computers is the fact that main memory is available to all processors on the system. This ability to share data gives programmers a great deal of flexibility. One of the first things those programmers learn (or should learn), however, is that actually sharing data between processors is to be avoided whenever possible. The sharing of data - especially data which changes - causes all kinds of bad cache behavior and greatly reduced performance. The recently-concluded What every programmer should know about memory series covers these problems in great detail.

Over the years, kernel developers have made increasing use of per-CPU data in an effort to minimize memory contention and its associated performance penalties. As a simple example, consider the disk operation statistics maintained by the block layer. Incrementing a global counter for every disk operation would cause the associated cache line to bounce continually between processors; disk operations are frequent enough that the performance cost would be measurable. So each CPU maintains its own set of counters locally; it never has to contend with any other CPU to increment one of those counters. When a total count is needed, all of the per-CPU counters are added up. Given that the counters are queried far more rarely than they are modified, storing them in per-CPU form yields a significant performance improvement.

In current kernels, most of these per-CPU variables are managed with an array of pointers. So, for example, the kmem_cache structure (as implemented by the SLUB allocator) contains this field:

    struct kmem_cache_cpu *cpu_slab[NR_CPUS];

[Diagram: the per-CPU pointer array]

Note that the array is dimensioned to hold one pointer for every possible CPU in the system. Most deployed computers have fewer than the maximum number of processors, though, so there is, in general, no point in allocating NR_CPUS objects for that array. Instead, only the entries in the array which correspond to existing processors are populated; for each of those processors, the requisite object is allocated using kmalloc() and stored into the array. The end result is an array that looks something like the diagram on the right. In this case, per-CPU objects have been allocated for four processors, with the remaining entries in the array being unallocated.

A quick look at the diagram immediately shows one potential problem with this scheme: each of these per-CPU arrays is likely to have some wasted space at the end. NR_CPUS is a configuration-time constant; most general-purpose kernels (e.g. those shipped by distributors) tend to have NR_CPUS set high enough to work on most or all systems which might reasonably be encountered. In short, NR_CPUS is likely to be quite a bit larger than the number of processors actually present, with the result that there will be a significant amount of wasted space at the end of each per-CPU array.

In fact, Christoph Lameter noticed that there are more problems than that; in response, he has posted a patch series for a new per-CPU allocator. The deficiencies addressed by Christoph's patch (beyond the wasted space in each per-CPU array) include:

  • If one of these per-CPU arrays is embedded within a larger data structure, it may separate the other variables in that structure, causing them to occupy more cache lines than they otherwise would.

  • Each CPU uses exactly one pointer from that array (most of the time); that pointer will reside in the processor's data cache while it is being used. Cache lines hold quite a bit more than one pointer, though; in this case, the rest of the cache line is almost certain to hold the pointers for the other CPUs. Thus, scarce cache space is being wasted on completely useless data.

  • Accessing the object requires two pointer lookups - one to get the object pointer from the array, and one to get to the object itself.

[Diagram: the new per-CPU structure]

Christoph's solution is quite simple in concept: turn all of those little per-CPU arrays into one big per-CPU array. With this scheme, each processor is allocated a dedicated range of memory at system initialization time. These ranges are all contiguous in the kernel's virtual address space, so, given a pointer to the per-CPU area for CPU 0, the area for any other processor is just a pointer addition away.

When a per-CPU object is allocated, each CPU gets a copy obtained from its own per-CPU area. Crucially, the offset into each CPU's area is the same, so the address of any CPU's object is trivially calculated from the address of the first object. So the array of pointers can go away, replaced by a single pointer to the object in the area reserved for CPU 0. The resulting organization looks (with the application of sufficient imagination) something like the diagram to the right. For a given object, there is only a single pointer; all of the other versions of that object are found by applying a constant offset to that pointer.

The interface for the new allocator is relatively straightforward. A new per-CPU variable is created with:

    #include <linux/cpu_alloc.h>

    void *per_cpu_var = CPU_ALLOC(type, gfp_flags);

This call will allocate a set of per-CPU variables of the given type, using the usual gfp_flags to control how the allocation is performed. A pointer to a specific CPU's version of the variable can be had with:

    void *CPU_PTR(per_cpu_var, unsigned int cpu);
    void *THIS_CPU(per_cpu_var);

The THIS_CPU() form, as might be expected, returns a pointer to the version of the variable allocated for the current CPU. There is a CPU_FREE() macro for returning a per-CPU object to the system. Christoph's patch converts all users of the existing per-CPU interface and ends by removing that API altogether.
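
A hypothetical in-kernel use of the interface might look like the sketch below. The macro names come from Christoph's (unmerged) patch as described above; the disk_stats structure and the surrounding functions are invented for illustration, and the casts reflect the void * prototypes given here rather than whatever type magic the real macros may perform.

    #include <linux/cpu_alloc.h>

    struct disk_stats {
	unsigned long reads;
	unsigned long writes;
    };

    static struct disk_stats *stats;	/* points at CPU 0's copy */

    static int stats_init(void)
    {
	stats = CPU_ALLOC(struct disk_stats, GFP_KERNEL);
	return stats ? 0 : -ENOMEM;
    }

    /* Fast path: touch only the current CPU's copy. */
    static void count_read(void)
    {
	struct disk_stats *my_stats = THIS_CPU(stats);
	my_stats->reads++;
    }

    /* Slow path: sum every CPU's copy. */
    static unsigned long total_reads(void)
    {
	unsigned long total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += ((struct disk_stats *)CPU_PTR(stats, cpu))->reads;
	return total;
    }

    static void stats_exit(void)
    {
	CPU_FREE(stats);
    }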

There are a number of advantages to this approach. There's one less pointer operation for each access to a per-CPU variable. The same pointer is used on all processors, resulting in smaller data structures and better cache line utilization. Per-CPU variables for a given processor are grouped together in memory, which, again, should lead to better cache use. All of the memory wasted in the old pointer arrays has been reclaimed. Christoph also claims that this mechanism, by making it easier to keep track of per-CPU memory, makes the support of CPU hotplugging easier.

The amount of discussion inspired by this patch set has been relatively low. There were complaints about the UPPER CASE NAMES used by the macros. The biggest complaint, though, has to do with the way the static per-CPU areas bloat the kernel's data space. On some architectures it makes the kernel too large to boot, and it's a real cost on all architectures. Just how this issue will be resolved is not yet clear. If a solution can be found, the new per-CPU code has a good chance of getting into the mainline when the 2.6.25 merge window opens.

Comments (4 posted)

The Ceph filesystem

By Jake Edge
November 14, 2007

Ceph is a distributed filesystem that is described as scaling from gigabytes to petabytes of data with excellent performance and reliability. The project is LGPL-licensed, with plans to move from a FUSE-based client into the kernel. This led Sage Weil to post a message to linux-kernel describing the project and looking for filesystem developers who might be willing to help. There are quite a few interesting features in Ceph which might make it a nice addition to Linux.

Weil outlines why he thinks Ceph might be of interest to kernel hackers:

I periodically see frustration on this list with the lack of a scalable GPL distributed file system with sufficiently robust replication and failure recovery to run on commodity hardware, and would like to think that--with a little love--Ceph could fill that gap.

The filesystem is well described in a paper from the 2006 USENIX Operating Systems Design and Implementation conference. The project's homepage has the expected mailing list, wiki, and source code repository along with a detailed overview of the feature set.

Ceph is designed to be extremely scalable, from both the storage and retrieval perspectives. One of its main innovations is splitting up operations on metadata from those on file data. With Ceph, there are two kinds of storage nodes, metadata servers (MDSs) and object storage devices (OSDs), with clients contacting the type appropriate for the kind of operation they are performing. The MDSs cache the metadata for files and directories, journaling any changes, and periodically writing the metadata as a data object to the OSDs.

Data objects are distributed throughout the available OSDs using a hash-like function that allows all entities (clients, MDSs, and OSDs) to independently calculate the locations of an object. Coupled with an infrequently-changing map of the OSD cluster, this function lets all the participants figure out where the data is stored - or where it should be stored.

Both the OSDs and MDSs rebalance themselves to accommodate changing conditions and usage patterns. The MDS cluster distributes the cached metadata among its nodes, possibly replicating the metadata of frequently-used subtrees of the filesystem on multiple nodes. This is done to keep the workload evenly balanced throughout the MDS cluster. For similar reasons, the OSDs automatically migrate data objects onto storage devices newly added to the OSD cluster, distributing the workload rather than allowing new devices to sit idle.

Ceph does N-way replication of its data, spread throughout the cluster. When an OSD fails, the data is automatically re-replicated throughout the remaining OSDs. Recovery of the replicas can be parallelized because both the source and destination are spread over multiple disks. Unlike some other cluster filesystems, Ceph starts from the assumption that disk failure will be a regular occurrence. It does not require OSDs to have RAID or other reliable disk systems, which allows the use of commodity hardware for the OSD nodes.

In his linux-kernel posting, Weil describes the current status of Ceph:

I would describe the code base (weighing in at around 40,000 semicolon-lines) as early alpha quality: there is a healthy amount of debugging work to be done, but the basic features of the system are complete and can be tested and benchmarked.

In addition to creating an in-kernel filesystem for the clients (OSDs and MDSs run as userspace processes), there are several other features – notably snapshots and security – listed as needing work.

Originally the topic of Weil's Ph.D. thesis, Ceph is also something that he hopes to eventually use at a web hosting company he helped start before graduate school:

We spend a lot of money on storage, and the proprietary products out there are both expensive and largely unsatisfying. I think that any organization with a significant investment in storage in the data center should be interested [in Ceph]. There are few viable open source options once you scale beyond a few terabytes, unless you want to spend all your time moving data around between servers as volume sizes grow/contract over time.

Unlike other projects, especially those springing from academic backgrounds, Ceph has some financial backing that could help it get to a polished state more quickly. Weil is looking to hire kernel and filesystem hackers to get Ceph to a point where it can be used reliably in production systems. Currently, he is sponsoring the work through his web hosting company, though an independent foundation or other organization to foster Ceph is a possibility down the road.

Other filesystems with similar feature sets are available for Linux, but Ceph takes a fundamentally different approach from most of them. For those interested in filesystem hacking or just looking for a reliable solution scalable to multiple petabytes, Ceph is worth a look.

Comments (9 posted)

Patches and updates

Kernel trees


Build system

Development tools

Device drivers


Filesystems and block I/O

Memory management



Virtualization and containers


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds