Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.36-rc6, which was released on September 28. "Nothing here strikes me as particularly interesting. I'd like developers to take a look at Rafael's latest regression list (subject line of "2.6.36-rc5-git7: Reported regressions from 2.6.35" on lkml and various other mailing lists), as it's reasonably short. That said, for some reason I don't have that "warm and fuzzy" feeling, possibly because there's still been more commits in these -rc's than I'd really like at this stage (and no, the one extra day isn't enough to account for it)." The short-form changelog is in the announcement, or see the full changelog for all the details.

Stable updates: 2.6.32.23 and 2.6.35.6 were released on September 27. A typo fix that only affected Xen users necessitated the release of 2.6.35.7, which was done live on-stage at LinuxCon Japan on September 29.

Quotes of the week

I'm beginning to think we need to have an entry in the kernel newbie's FAQ warning people that the output of various scripts such as checkpatch and get_maintainer are not authoritative, and are heuristics intended to be supplemented by human intelligence.
-- Ted Ts'o

Kernel development news

Maintaining a stable kernel on an unstable base

By Jonathan Corbet
September 29, 2010
Greg Kroah-Hartman launched his LinuxCon Japan 2010 keynote by stating that the most fun thing about working on Linux is that it is not stable; it is, in fact, the fastest-moving software project in the history of the world. This claim was justified with a number of statistics on development speed, all of which will be quite familiar to LWN readers. In summary, over the last year, the kernel has been absorbing 5.5 changes per hour, every hour, without a break. How, he asked, might one try to build a stable kernel on top of such a rapidly-changing base?

The answer began with a history lesson. Fifteen years ago, the 2.0.0 kernel came out, and things were looking good. We had good performance, SMP support, a shiny new mascot, and more. After four months of stabilization work, the 2.1.0 tree was branched off, and development of the mainline resumed. These were, of course, the days of the traditional even/odd development cycle, which seemed like the right way to do things at the time.

It took 848 days and 141 development releases to reach the 2.2.0 kernel. There was a strong feeling that things should go faster than that, so when, four months later, the 2.3.0 kernel came out, there was a hope that this development cycle would be a little bit shorter. To an extent, we succeeded: it only took 604 days and 58 releases to get to 2.4.0. But people who were watching at the time will remember that 2.4 took a long time to really stabilize; it was a full ten months before Linus felt ready to create the 2.5 branch and go into development mode again.

This time around, the developers intended to do a short development cycle for real. There was a lot of new code which they wanted to get into the hands of users as soon as possible. In fact, the pressure to push features to users was so strong that the distributors were putting considerable resources into backporting 2.5 code into the 2.4 kernels they were shipping. The result was "a mess" at all levels: shipped 2.4 kernels were an unstable mixture of patches, and the developers ended up doing their feature work twice: once for 2.5, and once for the backport. It did not work very well.

As a result, the 2.5 development cycle ran for 1057 days, with 86 releases. It was painful in a number of ways, but the end result - the 2.6 kernel - was significantly better than 2.4. Various things happened over the course of this development cycle; the development community learned a number of lessons about how kernel development should be done. The advent of BitKeeper made distributed development work much better than it did in the past and highlighted the importance of breaking changes down into small, reviewable, debuggable pieces. The kernel community which existed at the 2.6.0 release was wiser and more experienced than what had existed before; we had figured out how to do things better.

This evolution led to the adoption of the "new" development model in the early 2.6 days. The separate development and stable branches were gone, replaced with a single, fast-moving tree with releases about every three months. This system worked well for development; it is still in use several years later. But it made life a bit difficult for distributors and users. Even three months can be a long time to wait for important fixes, and, if those fixes come with a new load of bugs, they may not be entirely welcome. So it became clear that there needed to be a mechanism to distribute fixes (and only fixes) to users more quickly.

The discussion led to Linus's classic email saying that it would not be possible to find somebody who could maintain a stable kernel over any period of time. But, still, he expressed some guidelines by which a suitable "sucker" could try to create such a tree. Within a few minutes, Greg had held up his hand as a potential sucker; Chris Wright followed thereafter. Greg has been doing it ever since; Chris created about 50 stable releases before eventually moving back to "real work" and away from stable kernel work.

The stable tree has been in operation ever since. The model has changed little over that time; once a mainline release happens, it will receive stable updates for at least one development cycle. For most kernels, those updates stop after exactly one cycle. This is an important part of how the stable tree works; it puts an upper bound on the number of trees which must be maintained, and it encourages users to move forward to more current kernels.

Greg presented the rules which apply to submissions to the stable tree: they must fix real bugs, be small and easily verified, etc. The most important rule, though, is the one stating that any patches must appear in the mainline before they can be applied to the stable tree. That rule ensures that important fixes get into both trees and increases assurance that the fixes have been properly reviewed.

Some kernels receive longer stable support than others; one example is 2.6.32. A number of distribution kernel maintainers got together around 2.6.30 to see if they could all settle on a single kernel to maintain for a longer period; they settled on 2.6.32. That kernel has since been incorporated into SLES11 SP1, RHEL6, Debian Squeeze, Ubuntu 10.04 LTS, and Oracle's recently-announced enterprise kernel update. It has received over 2000 fixes to date, with contributions from everybody involved; 2.6.32 is a great example of inter-distribution contribution. It is also, as the result of all those fixes, a high-quality kernel at this point.

Greg pointed out one other interesting thing about 2.6.32: two enterprise distributions (SLES and Oracle's offering) have moved forward to this kernel for an existing distribution. That is a bit of a change in an area where distributors have typically stuck with their original kernel versions over the lifetime of a release. There are significant costs to staying with an ancient kernel, so it would be encouraging if these distributors were to figure out how to move to newer stable kernels without creating problems for their users.

The stable process is generally working well, with maintainers doing an increasingly good job of sending important fixes over. Some maintainers are quite good, with dedicated repository branches for stable patches. Others are...not quite so good; SCSI maintainer James Bottomley was told in a rather un-Japanese manner that he and his developers could be doing better.

People who are interested in upcoming stable releases can participate in the review cycle as well. Two or three days before each release, Greg posts all of the candidate patches to the lists for review. Some people complain about the large number of posts, but he ignores them: the Linux community, he says, does its development in public. There are starting to be more people who are interested in helping with pre-release testing, a development which Greg described as "awesome."

The talk concluded with a demo: Greg packaged up and released 2.6.35.7 (code name "Yokohama") from the stage. It seems that the 2.6.35.6 update - evidently released during Dirk Hohndel's MeeGo talk earlier in the week - contained a typo which made life difficult for Xen users. The fix, possibly the first major kernel release done in front of a crowd, hopefully will not suffer from the same kind of problem.

Organizing kernel messages

By Jonathan Corbet
September 29, 2010
In a previous life, your editor developed Fortran code on a VAX/VMS system. Every message emitted by VMS came decorated with a unique identifier which could be used to look it up in a massive blue binder, yielding a few paragraphs of (hopefully) helpful text on what the message actually meant. Linux has no analogous mechanism, but that is not the result of a lack of attempts. A talk at LinuxCon Japan detailed a new approach to organized kernel messaging which, its authors hope, has a better chance of making it into the mainline.

Andrew Morton recently described the kernel's approach to messaging this way:

The kernel's whole approach to messaging is pretty haphazard and lame and sad. There have been various proposals to improve the usefulness and to rationally categorise things in way which are more useful to operators, but nothing seems to ever get over the line

At LinuxCon Japan, Hisashi Hashimoto described an effort which, he hopes, will get over the line. To that end, he and others have examined previous attempts to bring order to kernel messaging. Undeterred, they have pushed forward with a new project; he then introduced Kazuo Ito, who discussed the actual work.

Attempts to regularize kernel messaging usually involve either attaching an identifier to kernel messages or standardizing the message format in some way. One thing that Ito-san noted at the outset is that any scheme requiring wholesale changes to printk() lines is probably not going to get very far. There are over 75,000 such lines in the kernel, many of them buried within macros; there is no practical way to change them all. Other wrapper functions, such as dev_printk(), complicate the situation further. So any change will have to be effected in a way which works with the existing mass of printk() calls.

A few approaches were considered. One would be to create a set of wrapper macros which would format message identifiers and pass them to printk(); the disadvantage of this method, of course, is that it still requires changing all of the printk() call sites. It's also possible to turn printk() into a macro which would assemble a message identifier from the available file name and line number information; those identifiers, though, would be too volatile for the intended use. So the approach which the developers favored was hooking into printk() itself to add message identifiers to messages as they find their way to the console and the logs.

These message identifiers (also called "message-locating helper tokens") must be assigned in some sort of automatic manner; asking the development community to maintain a list of identifiers and attach them to messages seems like a sure road to disappointment. So one must immediately think of how those identifiers will be generated; the two main concerns are uniqueness and stability. It turns out that Ito-san is not concerned with absolute uniqueness; if, on occasion, two or three kernel messages end up with the same identifier, the administrator should still be able to sort things out without a great deal of pain.

Stability is important, though; if message identifiers change frequently between releases - not to mention between boots - their value will be reduced. For that reason, generating the identifiers at compile time from preprocessor macros like __FILE__ and __LINE__, while easy, is not sufficient: any edit that moves a printk() call to a different line changes its identifier. One could also use the virtual address of the printk() call site, which is guaranteed to be unique, but that could even change from one system boot to the next, depending on things like the order in which modules are loaded. So a different approach needs to be found.
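The instability of location-based identifiers is easy to see in a small model. The Python sketch below is only an illustration: the file name, the source text, and the helper name are all hypothetical. It derives an identifier from the file name and line number of each printk() call, then shows that adding a single unrelated line above the call changes the identifier even though the message itself is untouched.

```python
def location_ids(filename, source):
    """Map the text of each printk() line to a "file:line" identifier."""
    ids = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        if "printk(" in line:
            ids[line.strip()] = f"{filename}:{lineno}"
    return ids


# Hypothetical driver source in one release.
OLD_SOURCE = """\
static int example_probe(void)
{
    printk(KERN_INFO "example: device ready\\n");
    return 0;
}
"""

# The same code one release later, with one unrelated line added above the call.
NEW_SOURCE = OLD_SOURCE.replace("{\n", "{\n    int unused = 0;\n", 1)

CALL = 'printk(KERN_INFO "example: device ready\\n");'
old_id = location_ids("drivers/example.c", OLD_SOURCE)[CALL]
new_id = location_ids("drivers/example.c", NEW_SOURCE)[CALL]

# The message text is identical, but its identifier has moved with the line.
print(old_id, "->", new_id)
```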

What this group has settled on is generating a CRC32 hash of the message format string at run time. There is a certain runtime cost to that which would have been nice to avoid, but it's not that high and, if printk() calls are a bottleneck to system performance, there are other problems. If the system has been configured to output message identifiers, this hash value will be prepended (with a "(%08x):" format) to the message before it is printed. A CRC32 hash is not guaranteed to produce a unique identifier for each message (though it is better than CRC16, which can produce only 65,536 distinct values and is thus guaranteed to have collisions among 75,000 messages), but it will be close enough.
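As a rough illustration of the scheme just described, the following user-space Python sketch hashes a format string and prepends the result using the "(%08x):" format. It is not the patch itself: zlib.crc32() stands in for whatever CRC32 variant the kernel code uses, and the function names and the sample message are made up for this example. The last helper estimates how many pairs of identifiers would collide if hash values were uniformly distributed, which is an assumption rather than a known property of CRC32 over kernel format strings.

```python
import zlib


def message_id(fmt):
    """Hash the format string itself, not the formatted message."""
    # Assumption: zlib's CRC32 is used as a stand-in for the kernel's CRC32.
    return zlib.crc32(fmt.encode("utf-8")) & 0xFFFFFFFF


def tag_message(fmt, *args):
    """Prepend the identifier in the "(%08x):" form, then format the message."""
    return "(%08x):" % message_id(fmt) + (fmt % args)


def expected_colliding_pairs(n_messages, id_bits):
    """Expected number of colliding pairs, assuming uniformly distributed IDs."""
    pairs = n_messages * (n_messages - 1) / 2
    return pairs / 2 ** id_bits


if __name__ == "__main__":
    # Hypothetical message: both lines carry the same identifier because
    # they come from the same format string.
    print(tag_message("eth%d: link is up", 0))
    print(tag_message("eth%d: link is up", 1))

    # Pigeonhole: 75,000 messages cannot fit into 2**16 distinct values.
    print(75_000 > 2 ** 16)
    print(expected_colliding_pairs(75_000, 16))
    print(expected_colliding_pairs(75_000, 32))
```

Under that uniformity assumption, 75,000 format strings yield roughly 0.65 expected colliding pairs with 32-bit identifiers, against roughly 43,000 with 16-bit ones. Because the hash covers only the format string, the identifier does not depend on the arguments or on where the call sits in the source.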

Discussion of the current implementation during the talk revealed that there are some remaining problems. Messages printed with dev_printk() will all end up with the same identifier, which is an undesirable result. The newly-added "%pV" format directive (which indicates the passing of a structure containing a new format string and argument list) also complicates things significantly by adding recursive format string processing. So the implementation will require some work, but there was not a lot of disagreement over the basic approach.

It was only toward the end of the talk that there was some discussion of what the use cases for this feature are. The initial goal is simply to make it easier to find where a message is coming from in the kernel code. The use of macros, helper functions, etc. can make it hard to track down a message with a simple grep operation. But, with a message ID and a supporting database (to be maintained with a user-space tool), developers should be able to go directly to the correct printk() call. Vinod Kutty noted that, in large installations, automatic monitoring systems could use the identifiers to recognize situations requiring some sort of response. There are also long-term goals around creating databases of messages translated to other languages and help information for specific messages.

So there are real motivations for this sort of work. But, as was noted back at the beginning, getting any kind of message identifier patch through the process has always been a losing proposition so far. It is hoped that, this time around, the solution will be sufficiently useful (even to kernel developers) and sufficiently nonintrusive that it might just get over the line. We should find out soon; once the patch has been fixed, it will be posted to the mailing list for comments.

Namespace file descriptors

By Jake Edge
September 29, 2010

Giving different groups of processes their own view of global kernel resources—network environments and filesystem trees for example—is one of the goals of the kernel container developers. These views, or namespaces, are created as part of a clone() with one of the CLONE_NEW* flags and are only visible to the new process and its children. Eric Biederman has proposed a mechanism that would allow other processes, outside of the namespace-creator's descendants, to see and access those namespaces.

When we looked at an earlier version back in March, Biederman had proposed two new system calls, nsfd() and setns(). Since that time, he has eliminated the nsfd() call by adding a new /proc/<pid>/ns directory with files that can be opened to provide a file descriptor for the different kinds of namespaces. That removes the need for a dedicated system call to find and return an fd to a namespace.

Currently, there must be a process running in a namespace to keep it around, but there are use cases where it is rather cumbersome to have a dedicated process for keeping the namespace alive. With the new patches, doing a bind mount of the proc file for a namespace:

    mount --bind /proc/self/ns/net /some/path
for example, will keep the namespace alive until it is unmounted.

The setns() call is unchanged from the earlier proposal:

    int setns(unsigned int nstype, int nsfd);
It will set the namespace of the process to that indicated by the file descriptor nsfd, which should be a reference to an open namespace /proc file. nstype is either zero or the name of the type of namespace the caller is trying to switch to ("net", "ipc", "uts", and "mnt" are implemented); if it is nonzero, the call will fail when the namespace referred to by nsfd is not of that type. The call will also fail unless the caller has the CAP_SYS_ADMIN capability (root privileges, essentially).
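The sketch below shows one way a privileged program might use such a file descriptor from Python. It makes several assumptions that go beyond the proposal described here: it calls the setns() wrapper that glibc later shipped, whose argument order is setns(fd, nstype) rather than the order shown above; it passes 0 as nstype so that any namespace type is accepted; and the helper names are invented for illustration.

```python
import ctypes
import os

# The namespace types the article lists as implemented.
NAMESPACE_KINDS = ("net", "ipc", "uts", "mnt")


def ns_path(pid, kind):
    """Build the /proc path for one of a process's namespaces."""
    if kind not in NAMESPACE_KINDS:
        raise ValueError(f"unsupported namespace type: {kind!r}")
    return f"/proc/{pid}/ns/{kind}"


def enter_namespace(pid, kind):
    """Move the calling thread into the given namespace of process `pid`.

    Needs CAP_SYS_ADMIN and a kernel and C library that provide setns().
    """
    # Assumption: glibc's setns(fd, nstype) wrapper, reached through ctypes.
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(ns_path(pid, kind), os.O_RDONLY)
    try:
        if libc.setns(fd, 0) != 0:  # nstype 0: accept any namespace type
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        # The descriptor is only needed for the duration of the call.
        os.close(fd)
```

A caller with sufficient privileges could then run enter_namespace(1234, "net") to join the network namespace of the (hypothetical) process 1234.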

For this round, Biederman has also added something of a convenience function, in the form of the socketat() system call:

    int socketat(int nsfd, int family, int type, int protocol);
The call parallels socket(), but takes an nsfd parameter for the namespace to create the socket in. As pointed out in the discussion of that patch, socketat() could be implemented using setns():

    setns(0, nsfd);           /* switch to the target namespace */
    sock = socket(...);       /* create the socket there */
    setns(0, original_nsfd);  /* switch back to the original namespace */
Biederman agrees that it could be done in user space, but is concerned about race conditions in an implementation of that kind. In addition, unlike for the other namespace types, he has some specific use cases in mind for network namespaces:

The use case are applications are the handful of networking applications that find that it makes sense to listen to sockets from multiple network namespaces at once. Say a home machine that has a vpn into your office network and the vpn into the office network runs in a different network namespace so you don't have to worry about address conflicts between the two networks, the chance of accidentally bridging between them, and so you can use different dns resolvers for the different networks.
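Returning to the user-space alternative mentioned earlier, the three-call sequence might be wrapped as in the Python sketch below. Everything here is an assumption made for illustration: the function names are hypothetical, the real work is done by glibc's later setns(fd, nstype) wrapper reached through ctypes, and the setns and make_socket parameters exist only so that the sequence can be exercised without privileges. The try/finally block ensures the switch back happens even if socket creation fails, but it does nothing about the window between the two setns() calls, during which anything else running in the same thread (a signal handler, for instance) sees the target namespace. That window is one plausible reading of the race concern, not something spelled out in the discussion.

```python
import ctypes
import os
import socket


def _libc_setns(fd):
    """Switch the calling thread to the namespace behind `fd`."""
    # Assumption: glibc's setns(fd, nstype) wrapper; nstype 0 accepts any type.
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.setns(fd, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def socket_in_namespace(nsfd, original_nsfd, family, type_, proto=0,
                        setns=_libc_setns, make_socket=socket.socket):
    """User-space stand-in for the proposed socketat() call."""
    setns(nsfd)                  # enter the target network namespace
    try:
        # The socket stays bound to the namespace it was created in.
        return make_socket(family, type_, proto)
    finally:
        setns(original_nsfd)     # always switch back, even on failure
```

Here nsfd and original_nsfd would be descriptors opened on files such as /proc/<pid>/ns/net and /proc/self/ns/net respectively, obtained before the call.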

But he also realized that it might be a somewhat controversial addition. Overall, there has been relatively little discussion of the patchset on linux-kernel, and Biederman said that it had received positive reviews on the containers mailing list. He posted the patches so that other kernel developers could review the ABI additions, and there seem to be no complaints with setns() and the /proc filesystem additions.

Changes for the "pid" namespace were not included in these patches as there is some work needed before that namespace can be safely unshared. That work doesn't affect the ABI, though. Once the pid namespace is added in, it seems likely we will see these patches return, perhaps without socketat(), sometime soon. Allowing suitably privileged processes to access others' namespaces will be a useful addition, and one that may not be too far off.

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.36-rc6
Greg KH: Linux 2.6.35.6
Greg KH: Linux 2.6.35.7
Greg KH: Linux 2.6.32.23

Documentation

Michael Kerrisk: man-pages-3.27 is released

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds