The current -mm tree is 2.6.24-rc2-mm1 - the first -mm release since 2.6.23-mm1 came out on October 11. Recent changes to -mm include a number of device mapper updates, a big driver tree update (which has broken a number of things), a lot of IDE updates, bidirectional SCSI support, a large set of SLUB fixes and other "mammary manglement" patches, 64-bit capability support, a number of ext4 enhancements, and the PCI hotplug development tree.

When announcing 2.6.24-rc2-mm1 (the first -mm tree in some time), Andrew Morton noted that some people want something even more bleeding-edge. So he has created the -mm of the moment tree, which is updated a few times every day. "I will attempt to ensure that the patches in there actually apply, but they sure as heck won't all compile and run." The tree is exported as a patch series, so Quilt is needed to turn it into something which can be compiled. Have fun.
Kernel development news
LKML has 10-20x the traffic of linux-scsi and a much smaller signal to noise ratio. Having a specialist list where all the experts in the field hang out actually enhances our ability to fix bugs.
The latest round began when Natalie Protasevich, a Google developer who spends some time helping Andrew Morton track bugs, posted this list of a few dozen open bugs which seemed worthy of further attention. Andrew responded with his view of what was happening with those bug reports; that view was "no response from developers" in most cases:
A number of developers came back saying, in essence, that Andrew was employing an overly heavy hand and that his assertions were not always correct. Regardless of whether his claims are correct, Andrew has clearly touched a nerve.
He defended his posting by raising his often-expressed fear that the quality of the kernel is in decline. This is, he says, something which requires attention now:
But is the kernel deteriorating? That is a very hard question to answer for a number of reasons. There is no objective standard by which the quality of the kernel can be judged. Certain kinds of problems can be found by automated testing, but, in the kernel space, many bugs can only be found by running the kernel with specific workloads on specific combinations of hardware. A rising number of bug reports does not necessarily indicate decreasing quality when both the number of users and the size of the code base are increasing.
Along the same lines, as Ingo Molnar pointed out, a decreasing number of bug reports does not necessarily mean that quality is improving. It could, instead, indicate that testers are simply getting frustrated and dropping out of the development process - a worsening kernel could actually cause the reporting of fewer bugs. So Ingo says we need to treat our testers better, but we also need to work harder at actually measuring the quality of the kernel:
It is generally true that problems which can be measured and quantified tend to be addressed more quickly and effectively. The classic example is PowerTop, which makes power management problems obvious. Once developers could see where the trouble was and, more to the point, could see just how much their fixes improved the situation, vast numbers of problems went away over a short period of time. At the moment, the kernel developers can adopt any of a number of approaches to improving kernel quality, but they will not have any way of really knowing if that effort is helping the situation or not. In the absence of objective measurements, developers trying to improve kernel quality are really just groping in the dark.
As an example, consider the discussion of the "git bisect" feature. If one is trying to find a regression which happened between 2.6.23 and 2.6.24-rc1, one must conceivably look at several thousand patches to find the one which caused the problem - a task which most people tend to find just a little intimidating. Bisection helps the tester perform a binary search over a range of patches, eliminating half of them in each compile-and-boot cycle. Using bisect, a regression can be tracked down in a relatively automatic way with "only" a dozen or so kernel builds and reboots. At the end of the process, the guilty patch will have been identified in an unambiguous way.
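The mechanics of bisection can be seen on a small synthetic repository; the repository and the test.sh script below are invented for illustration (a real kernel bisection would replace test.sh with a build-and-boot cycle, or mark each step by hand with "git bisect good" and "git bisect bad"):

```shell
#!/bin/sh
# Demonstrate bisection on a throwaway repository in which
# "patch 7" introduces a bug; git bisect run finds it automatically.
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email tester@example.com
git config user.name tester

for i in $(seq 1 10); do
    echo "change $i" >> file
    [ "$i" -eq 7 ] && echo broken > bug   # the regression lands here
    git add -A
    git commit -qm "patch $i"
done

# in a real bisection this script would build and boot a kernel;
# here it just checks whether the bug marker file is present
cat > test.sh <<'EOF'
#!/bin/sh
test ! -f bug
EOF
chmod +x test.sh

git bisect start HEAD HEAD~9    # bad = patch 10, good = patch 1
git bisect run ./test.sh        # binary search: about log2(n) steps
culprit=$(git rev-parse refs/bisect/bad)
git log -1 --format=%s "$culprit"   # names the offending commit
```

For ten candidate patches only three or four test cycles are needed; for the several thousand patches in a typical merge window, the count is still only a dozen or so.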
Bisection works so well that developers will often ask a tester to use it to track down a problem they are reporting. Some people see this practice as a way for lazy kernel developers to dump the work of tracking down their bugs on the users who are bitten by those bugs. Building and testing a dozen kernels is, they say, too much to ask of a tester. Mark Lord, for example, asserts that most bugs are relatively easy to find when a developer actually looks at the code; the whole bisect process is often unnecessary:
On the other hand, some developers see bisection as a powerful tool which has made it easier for testers to actively help the process. David Miller says:
Returning to the original bug list: another issue which came up was the use of mailing lists other than linux-kernel. Some of the bugs had not been addressed because they had never been reported to the mailing list dedicated to the affected subsystem. Other bugs, marked by Andrew as having had no response, had, in fact, been discussed (and sometimes fixed) on subsystem-specific lists. In both situations, the problem is a lack of communication between subsystem lists and the larger community.
In response, some developers have, once again, called for a reduction in the use of subsystem-specific lists. We are, they say, all working on a single kernel, and we are all interested in what happens with that kernel. Discussing kernel subsystems in isolation is likely to result in a lower-quality kernel. Ingo Molnar expresses it this way:
Moving discussions back onto linux-kernel seems like a very hard sell, though. Most subsystem-specific lists feature much lower traffic, a friendlier atmosphere, and more focused conversation. Many subscribers of such lists are unlikely to feel that moving back to linux-kernel would improve their lives. So, perhaps, the best that can be hoped for is that more developers would subscribe to both lists and make a point of ensuring that relevant information flows in both directions.
David Miller pointed out another reason why some bug reports don't see a lot of responses: developers have to choose which bugs to try to address. Problems which affect a lot of users, and which can be readily reproduced, have a much higher chance of getting quick developer attention. Bug reports which end up at the bottom of the prioritized list ("chaff"), instead, tend to languish. The system, says David, tends to work reasonably well:
Given that there are unlikely to ever be enough developers to respond to every single kernel bug report, the real problem comes down to prioritization. Andrew Morton has a clear idea of which reports should be handled first: regressions from previous releases.
Attention to regressions has improved significantly over the last couple of years or so. They tend to be much more actively tracked, and the list of known regressions is consulted before kernel releases are made. The real problem, according to Andrew, is that any regressions which are still present after a release tend to fall off the list. Better attention to those problems would help to ensure that the quality of the kernel improves over time.

Over the years, kernel developers have made increasing use of per-CPU data in an effort to minimize memory contention and its associated performance penalties. (The What every programmer should know about memory series covers these problems in great detail.) As a simple example, consider the disk operation statistics maintained by the block layer. Incrementing a global counter for every disk operation would cause the associated cache line to bounce continually between processors; disk operations are frequent enough that the performance cost would be measurable. So each CPU maintains its own set of counters locally; it never has to contend with any other CPU to increment one of those counters. When a total count is needed, all of the per-CPU counters are added up. Given that the counters are queried far more rarely than they are modified, storing them in per-CPU form yields a significant performance improvement.
In current kernels, most of these per-CPU variables are managed with an array of pointers. So, for example, the kmem_cache structure (as implemented by the SLUB allocator) contains this field:
struct kmem_cache_cpu *cpu_slab[NR_CPUS];
Note that the array is dimensioned to hold one pointer for every possible CPU in the system. Most deployed computers have fewer than the maximum number of processors, though, so there is, in general, no point in allocating NR_CPUS objects for that array. Instead, only the entries in the array which correspond to existing processors are populated; for each of those processors, the requisite object is allocated using kmalloc() and stored into the array. The end result is an array that looks something like the diagram on the right. In this case, per-CPU objects have been allocated for four processors, with the remaining entries in the array being unallocated.
A quick look at the diagram immediately shows one potential problem with this scheme: each of these per-CPU arrays is likely to have some wasted space at the end. NR_CPUS is a configuration-time constant; most general-purpose kernels (e.g. those shipped by distributors) tend to have NR_CPUS set high enough to work on most or all systems which might reasonably be encountered. In short, NR_CPUS is likely to be quite a bit larger than the number of processors actually present, with the result that there will be a significant amount of wasted space at the end of each per-CPU array.
In fact, Christoph Lameter noticed that there are more problems than that; in response, he has posted a patch series for a new per-CPU allocator. The deficiencies addressed by Christoph's patch (beyond the wasted space in each per-CPU array) include:
Christoph's solution is quite simple in concept: turn all of those little per-CPU arrays into one big per-CPU array. With this scheme, each processor is allocated a dedicated range of memory at system initialization time. These ranges are all contiguous in the kernel's virtual address space, so, given a pointer to the per-CPU area for CPU 0, the area for any other processor is just a pointer addition away.
When a per-CPU object is allocated, each CPU gets a copy obtained from its own per-CPU area. Crucially, the offset into each CPU's area is the same, so the address of any CPU's object is trivially calculated from the address of the first object. So the array of pointers can go away, replaced by a single pointer to the object in the area reserved for CPU 0. The resulting organization looks (with the application of sufficient imagination) something like the diagram to the right. For a given object, there is only a single pointer; all of the other versions of that object are found by applying a constant offset to that pointer.
The interface for the new allocator is relatively straightforward. A new per-CPU variable is created with:
    #include <linux/cpu_alloc.h>

    void *per_cpu_var = CPU_ALLOC(type, gfp_flags);
This call will allocate a set of per-CPU variables of the given type, using the usual gfp_flags to control how the allocation is performed. A pointer to a specific CPU's version of the variable can be had with:
    void *CPU_PTR(per_cpu_var, unsigned int cpu);
    void *THIS_CPU(per_cpu_var);
The THIS_CPU() form, as might be expected, returns a pointer to the version of the variable allocated for the current CPU. There is a CPU_FREE() macro for returning a per-CPU object to the system. Christoph's patch converts all users of the existing per-CPU interface and ends by removing that API altogether.
There are a number of advantages to this approach. There's one less pointer operation for each access to a per-CPU variable. The same pointer is used on all processors, resulting in smaller data structures and better cache line utilization. Per-CPU variables for a given processor are grouped together in memory, which, again, should lead to better cache use. All of the memory wasted in the old pointer arrays has been reclaimed. Christoph also claims that this mechanism, by making it easier to keep track of per-CPU memory, makes the support of CPU hotplugging easier.
The amount of discussion inspired by this patch set has been relatively low. There were complaints about the UPPER CASE NAMES used by the macros. The biggest complaint, though, has to do with the way the static per-CPU areas bloat the kernel's data space. On some architectures it makes the kernel too large to boot, and it's a real cost on all architectures. Just how this issue will be resolved is not yet clear. If a solution can be found, the new per-CPU code has a good chance of getting into the mainline when the 2.6.25 merge window opens.
Ceph is a distributed filesystem that is described as scaling from gigabytes to petabytes of data with excellent performance and reliability. The project is LGPL-licensed, with plans to move the client from its current FUSE-based implementation into the kernel. This led Sage Weil to post a message to linux-kernel describing the project and looking for filesystem developers who might be willing to help. There are quite a few interesting features in Ceph which might make it a nice addition to Linux.
Weil outlines why he thinks Ceph might be of interest to kernel hackers:
The filesystem is well described in a paper from the 2006 USENIX Operating Systems Design and Implementation conference. The project's homepage has the expected mailing list, wiki, and source code repository along with a detailed overview of the feature set.
Ceph is designed to be extremely scalable, from both the storage and retrieval perspectives. One of its main innovations is splitting up operations on metadata from those on file data. With Ceph, there are two kinds of storage nodes, metadata servers (MDSs) and object storage devices (OSDs), with clients contacting the type appropriate for the kind of operation they are performing. The MDSs cache the metadata for files and directories, journaling any changes, and periodically writing the metadata as a data object to the OSDs.
Data objects are distributed throughout the available OSDs using a hash-like function that allows all entities (clients, MDSs, and OSDs) to independently calculate the locations of an object. Coupled with an infrequently changing OSD cluster map, all the participants can figure out where the data is stored or where to store it.
Both the OSDs and MDSs rebalance themselves to accommodate changing conditions and usage patterns. The MDS cluster distributes the cached metadata throughout its nodes, possibly replicating the metadata of frequently used subtrees of the filesystem on multiple nodes to keep the workload evenly balanced across the cluster. For similar reasons, the OSDs automatically migrate data objects onto storage devices newly added to the OSD cluster, thus distributing the workload rather than letting new devices sit idle.
Ceph does N-way replication of its data, spread throughout the cluster. When an OSD fails, the data is automatically re-replicated throughout the remaining OSDs. Recovery of the replicas can be parallelized because both the source and destination are spread over multiple disks. Unlike some other cluster filesystems, Ceph starts from the assumption that disk failure will be a regular occurrence. It does not require OSDs to have RAID or other reliable disk systems, which allows the use of commodity hardware for the OSD nodes.
In his linux-kernel posting, Weil describes the current status of Ceph:
In addition to creating an in-kernel filesystem for the clients (OSDs and MDSs run as userspace processes), there are several other features – notably snapshots and security – listed as needing work.
Originally the topic of Weil's PhD thesis, Ceph is also something that he hopes eventually to use at a web hosting company he helped start before graduate school:
Unlike other projects, especially those springing from academic backgrounds, Ceph has some financial backing that could help it get to a polished state more quickly. Weil is looking to hire kernel and filesystem hackers to get Ceph to a point where it can be used reliably in production systems. Currently, he is sponsoring the work through his web hosting company, though an independent foundation or other organization to foster Ceph is a possibility down the road.
Other filesystems with similar feature sets are available for Linux, but Ceph takes a fundamentally different approach from most of them. For those interested in filesystem hacking, or just looking for a reliable solution scalable to multiple petabytes, Ceph is worth a look.
Page editor: Jonathan Corbet
Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds