Brief items
The current development kernel is 2.6.36-rc6, which was released on September 28. "Nothing here strikes me as particularly interesting. I'd like
developers to take a look at Rafael's latest regression list (subject line
of "2.6.36-rc5-git7: Reported regressions from 2.6.35" on lkml and various
other mailing lists), as it's reasonably short. That said, for some reason
I don't have that "warm and fuzzy" feeling, possibly because there's still
been more commits in these -rc's than I'd really like at this stage (and
no, the one extra day isn't enough to account for it)." The
short-form changelog is in the announcement, or see the
full changelog for all the details.
Stable updates: 2.6.32.23 and 2.6.35.6 were released on
September 27. A typo fix that only affected Xen users necessitated the
release of 2.6.35.7, which
was done live on-stage at LinuxCon Japan on September 29.
I'm beginning to think we need to have an entry in the kernel newbie's
FAQ warning people that the output of various scripts such as
checkpatch and get_maintainer are not authoritative, and are
heuristics intended to be supplemented by human intelligence.
-- Ted Ts'o
Kernel development news
By Jonathan Corbet
September 29, 2010
Greg Kroah-Hartman launched his LinuxCon Japan 2010 keynote by stating that
the most fun thing about working on Linux is that it is not stable; it is,
in fact, the fastest-moving software project in the history of the world.
This claim was justified with a number of statistics on development speed,
all of which will be quite familiar to LWN readers. In summary, over the
last year, the kernel has been absorbing 5.5 changes per hour, every hour,
without a break. How, he asked, might one try to build a stable kernel on
top of such a rapidly-changing base?
The answer began with a history lesson. Fifteen years ago, the 2.0.0
kernel came out, and things were looking good. We had good performance,
SMP support, a shiny new mascot, and more. After four months of
stabilization work, the 2.1.0 tree was branched off, and development of the
mainline resumed. These were, of course, the days of the traditional
even/odd development cycle, which seemed like the right way to do things at
the time.
It took 848 days and 141 development releases to reach the 2.2.0 kernel.
There was a strong feeling that things should go faster than that, so when,
four months later, the 2.3.0 kernel came out, there was a hope that this
development cycle would be a little bit shorter. To an extent, we succeeded: it only took 604
days and 58 releases to get to 2.4.0. But people who were watching at the
time will remember that 2.4 took a long time to really stabilize; it was a
full ten months before Linus felt ready to create the 2.5 branch and go
into development mode again.
This time around, the developers intended to do a short development cycle
for real. There was a lot of new code which they wanted to get into the
hands of users as soon as possible. In fact, the pressure to push features
to users was so strong that the distributors were putting considerable
resources into backporting 2.5 code into the 2.4 kernels they were
shipping. The result was "a mess" at all levels: shipped 2.4 kernels were
an unstable mixture of patches, and the developers ended up doing their
feature work twice: once for 2.5, and once for the backport. It did not
work very well.
As a result, the 2.5 development cycle ran for 1057 days, with 86
releases. It was painful in a number of ways, but the end result - the 2.6
kernel - was significantly better than 2.4. Various things happened over
the course of this development cycle; the development community learned a
number of lessons about how kernel development should be done. The advent
of BitKeeper made distributed development work much better than it did in
the past and highlighted the importance of breaking changes down into
small, reviewable, debuggable pieces. The kernel community which existed
at the 2.6.0 release was wiser and more experienced than what had existed
before; we had figured out how to do things better.
This evolution led to the adoption of the "new" development model in the
early 2.6 days. The separate development and stable branches were gone,
replaced with a single, fast-moving tree with releases about every three
months. This system worked well for development; it is still in use
several years later. But it made life a bit difficult for distributors and
users. Even three months can be a long time to wait for important fixes,
and, if those fixes come with a new load of bugs, they may not be entirely
welcome. So it became clear that there needed to be a mechanism to
distribute fixes (and only fixes) to users more quickly.
The discussion led to Linus's classic email
saying that it would not be possible to find somebody who could maintain a
stable kernel over any period of time. But, still, he expressed some
guidelines by which a suitable "sucker" could try to create such a tree.
Within a few minutes, Greg had held up his hand as a potential sucker;
Chris Wright followed thereafter. Greg has been doing it ever since; Chris
created about 50 stable releases before eventually moving back to "real
work" and away from stable kernel work.
The stable tree has been in operation ever since. The model has changed
little over that time; once a mainline release happens, it will receive
stable updates for at least one development cycle. For most kernels, those
updates stop after exactly one cycle. This is an important part of how the
stable tree works; it puts an upper bound on the number of trees which must
be maintained, and it encourages users to move forward to more current
kernels.
Greg presented the rules which apply to submissions to the stable tree:
they must fix real bugs, be small and easily verified, etc. The most
important rule, though, is the one stating that any patches must appear in
the mainline before they can be applied to the stable tree. That rule
ensures that important fixes get into both trees and increases assurance
that the fixes have been properly reviewed.
Some kernels receive longer stable support than others; one example is
2.6.32. A number of distribution kernel maintainers got together around
2.6.30 to see if they could all settle on a single kernel to maintain for a
longer period; they settled on 2.6.32. That kernel has since been
incorporated into SLES11 SP1, RHEL6, Debian Squeeze, Ubuntu 10.04 LTS, and
Oracle's recently-announced enterprise kernel update. It has received over
2000 fixes to date, with contributions from everybody involved; 2.6.32 is a
great example of inter-distribution contribution. It is also, as the
result of all those fixes, a high-quality kernel at this point.
Greg pointed out one other interesting thing about 2.6.32: two enterprise
distributions (SLES and Oracle's offering) have moved forward to this
kernel for an existing distribution. That is a bit of a change in an area
where distributors have typically stuck with their original kernel versions
over the lifetime of a release. There are significant costs to staying
with an ancient kernel, so it would be encouraging if these distributors
were to figure out how to move to newer stable kernels without creating
problems for their users.
The stable process is generally working well, with maintainers doing an
increasingly good job of sending important fixes over. Some maintainers
are quite good, with dedicated repository branches for stable patches.
Others are...not quite so good; SCSI maintainer James Bottomley was told in
a rather un-Japanese manner that he and his developers could be doing
better.
People who are interested in upcoming stable releases can participate in
the review cycle as well. Two or three days before each release, Greg
posts all of the candidate patches to the lists for review. Some people
complain about the large number of posts, but he ignores them: the Linux
community, he says, does its development in public. There are starting to
be more people who are interested in helping with pre-release testing, a
development which Greg described as "awesome."
The talk concluded with a demo: Greg packaged up and released 2.6.35.7 (code name "Yokohama")
from the stage. It seems that the 2.6.35.6 update - evidently released
during Dirk Hohndel's MeeGo talk earlier in the week - contained a typo
which made life difficult for Xen users. The fix, possibly the first major
kernel release done in front of a crowd, hopefully will not suffer from the
same kind of problem.
By Jonathan Corbet
September 29, 2010
In a previous life, your editor developed Fortran code on a VAX/VMS
system. Every message emitted by VMS came decorated with a unique
identifier which could be used to look it up in a massive blue binder,
yielding a few paragraphs of (hopefully) helpful text on what the message
actually meant. Linux has no analogous mechanism, but that is not the
result of a lack of attempts. A talk at LinuxCon Japan detailed a new
approach to
organized kernel messaging which, its authors hope, has a better chance of
making it into the mainline.
Andrew Morton recently described the kernel's approach
to messaging this way:
"The kernel's whole approach to messaging is pretty haphazard and
lame and sad. There have been various proposals to improve the
usefulness and to rationally categorise things in way which are
more useful to operators, but nothing seems to ever get over the
line"
At LinuxCon Japan, Hisashi Hashimoto described an effort which, he hopes,
will get over the line. To that end, he and others have examined previous
attempts to bring order to kernel messaging, none of which made it into the
mainline. Undeterred, they have pushed forward with a new project; he then
introduced Kazuo Ito, who discussed the actual work.
Attempts to regularize kernel messaging usually involve either attaching an
identifier to kernel messages or standardizing the message format in some
way. One thing that Ito-san noted at the outset is that any scheme
requiring wholesale changes to printk() lines is probably not
going to get very far. There are over 75,000 such lines in the kernel,
many of them buried within macros; there is no practical way to change them all.
Other wrapper functions, such as dev_printk(), complicate the
situation further. So any change will have to be effected in a way which
works with the existing mass of printk() calls.
A few approaches were considered. One would be to create a set of wrapper
macros which would format message identifiers and pass them to
printk(); the disadvantage of this method, of course, is that it
still requires changing all of the printk() call sites. It's also
possible to turn printk() into a macro which would assemble a
message identifier from the available file name and line number
information; those identifiers, though, would be too volatile for the
intended use. So the approach which the developers favored was hooking
into printk() itself to add message identifiers to messages as
they find their way to the console and the logs.
These message identifiers (also called "message-locating helper tokens")
must be assigned in some sort of automatic manner; asking the development
community to maintain a list of identifiers and attach them to messages
seems like a sure road to disappointment. So one must immediately think of
how those identifiers will be generated; the two main concerns are
uniqueness and stability. It turns out that Ito-san is not concerned with
absolute uniqueness; if, on occasion, two or three kernel messages end up
with the same identifier, the administrator should still be able to sort
things out without a great deal of pain.
Stability is important, though; if message identifiers change frequently
between releases - not to mention between boots - their value will be
reduced. For that reason, generating the identifiers at compile time from
preprocessor macros like __FILE__ and __LINE__, while easy, is not
sufficient. One could also
use the virtual address of the printk() call site, which is
guaranteed to be unique, but that could even change from one system boot to
the next, depending on things like the order in which modules are loaded.
So a different approach needs to be found.
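A minimal sketch of the compile-time approach shows the problem. The helper and macro names here are hypothetical, as is the file path used in the note below; nothing is taken from the actual patch.

```c
#include <stdio.h>

/* Hypothetical helper: build an identifier from a source location. */
static int site_id(char *buf, size_t len, const char *file, int line)
{
    return snprintf(buf, len, "(%s:%d)", file, line);
}

/* Hypothetical macro: captures the location where it is expanded. */
#define SITE_ID(buf, len) site_id((buf), (len), __FILE__, __LINE__)
```

Adding a single line above a call site moves every later identifier in that file: a message tagged (drivers/net/foo.c:120) in one release would become (drivers/net/foo.c:121) in the next, even though its text has not changed.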
What this group has settled on is generating a CRC32 hash of the message
format string at run time. There is a certain runtime cost to that which
would have been nice to avoid, but it's not that high and, if
printk() calls are a bottleneck to system performance, there are
other problems. If the system has been configured to output message
identifiers, this hash value will be prepended (with a "(%08x):"
format) to the message before it is printed. A CRC32 hash is not
guaranteed to produce a unique identifier for each message (though it is
better than CRC16, which is guaranteed to have collisions with 75,000
messages), but it will be close enough.
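The idea can be sketched in user space as follows. This is an illustration only: the function names are hypothetical, and the CRC-32 variant (reflected polynomial 0xEDB88320 with an initial value and final XOR of 0xFFFFFFFF) is an assumption, since the talk did not specify which parameters the patch uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bitwise CRC-32 over a NUL-terminated string. The parameters are the
 * common "zlib" ones; the patch under discussion may differ. */
static uint32_t crc32_str(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (; *s; s++) {
        crc ^= (unsigned char)*s;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical helper: write "(%08x):" followed by the format string.
 * The hash covers the format string, not the expanded message, so the
 * identifier does not change with the arguments. */
static int prepend_msg_id(char *buf, size_t len, const char *fmt)
{
    return snprintf(buf, len, "(%08x):%s", (unsigned int)crc32_str(fmt), fmt);
}
```

Because only the format string is hashed, the identifier survives code being moved around and changes only when the message text itself is edited. As a rough estimate, if the hash behaved like a uniform random function, 75,000 messages would yield about n(n-1)/(2 * 2^32), or roughly 0.65, colliding pairs across the whole kernel.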
Discussion of the current implementation during the talk revealed that
there are some remaining problems. Messages printed with
dev_printk() will all end up with the same identifier, which is an
undesirable result. The newly-added "%pV" format directive
(which indicates the passing of a structure containing a new format string
and argument list) also complicates things significantly by adding
recursive format string processing. So the implementation will require
some work, but there was not a lot of disagreement over the basic approach.
It was only toward the end of the talk that there was some discussion of
what the use cases for this feature are. The initial goal is simply to
make it easier to find where a message is coming from in the kernel code.
The use of macros, helper functions, etc. can make it hard to track down a
message with a simple grep operation. But, with a message ID and
a supporting database (to be maintained with a user-space tool), developers
should be able to go directly to the correct printk() call. Vinod
Kutty noted that, in large installations, automatic monitoring systems
could use the identifiers to recognize situations requiring some sort of
response. There are also long-term goals around creating databases of
messages translated to other languages and help information for specific
messages.
So there are real motivations for this sort of work. But, as was noted
back at the beginning, getting any kind of message identifier patch through
the process has always been a losing proposition so far. It is hoped that,
this time around, the solution will be sufficiently useful (even to kernel
developers) and sufficiently nonintrusive that it might just get over the
line. We should find out soon; once the patch has been fixed, it will be
posted to the mailing list for comments.
By Jake Edge
September 29, 2010
Giving different groups of processes their own view of global kernel
resources—network environments and filesystem trees for
example—is one of the goals of the kernel container developers. These
views, or namespaces, are created as part of a clone() with one of the
CLONE_NEW* flags and are only visible to the new process and its children.
Eric Biederman has proposed a mechanism that would allow other processes,
outside of the namespace-creator's descendants, to see and access those
namespaces.
When we looked at an earlier
version back in March, Biederman had proposed two new system calls,
nsfd() and setns(). Since that time, he has eliminated
the nsfd() call by adding a new /proc/<pid>/ns
directory with files that can be opened to provide a file descriptor
for the different kinds of namespaces. That removes the need for a
dedicated system call to find and
return an fd to a namespace.
Currently, there must be a process running in a namespace to keep it around,
but there are use cases where it is rather cumbersome to have a dedicated
process for keeping the namespace alive. With the new patches, doing a
bind mount of the proc
file for a namespace:
mount --bind /proc/self/ns/net /some/path
for example, will keep the namespace alive until it is unmounted.
The setns() call is unchanged from the earlier proposal:
int setns(unsigned int nstype, int nsfd);
It will set the namespace of the process to that indicated by the file
descriptor nsfd, which should be a reference to an open namespace /proc
file. nstype is either zero or the name of the namespace type the caller is
trying to switch to ("net", "ipc", "uts", and "mnt" are implemented), so the
call will fail if the namespace that is referred to by nsfd does not
correspond. The call will also fail unless the caller has the
CAP_SYS_ADMIN capability (root privileges, essentially).
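As a rough sketch of how a caller might use this, the function below opens a namespace file and switches into it. Note the assumptions: it uses the setns(int fd, int nstype) form that was later merged into the mainline and exposed by glibc, which takes its arguments in the opposite order from the proposal described here and expects a CLONE_NEW* flag (or zero) as nstype. The function name is hypothetical.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Join the network namespace referred to by a /proc/<pid>/ns/net path.
 * Uses the merged setns(fd, nstype) argument order, not the one in the
 * proposal. Returns 0 on success, -1 on failure. */
static int join_net_namespace(const char *ns_path)
{
    int fd = open(ns_path, O_RDONLY);
    int ret;

    if (fd < 0)
        return -1;
    ret = setns(fd, CLONE_NEWNET);  /* fails without CAP_SYS_ADMIN */
    close(fd);                      /* the fd is not needed after the switch */
    return ret;
}
```

Passing CLONE_NEWNET rather than zero makes the call fail if the file descriptor refers to some other kind of namespace, which is the type check described above.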
For this round, Biederman has also added something of a convenience
function, in the form of the socketat() system call:
int socketat(int nsfd, int family, int type, int protocol);
The call parallels socket(), but takes an nsfd parameter for the namespace
to create the socket in. As pointed out in the discussion of that patch,
socketat() could be implemented using setns():
setns(0, nsfd);
sock = socket(...);
setns(0, original_nsfd);
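A slightly fuller version of that sequence, with error handling, might look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, and it uses the setns(int fd, int nstype) form that was eventually merged rather than the argument order of the proposal.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for the proposed socketat(): switch to
 * the target network namespace, create the socket, then switch back.
 * orig_nsfd must refer to the caller's original network namespace. */
static int socketat_emulated(int nsfd, int orig_nsfd,
                             int family, int type, int protocol)
{
    int sock;

    if (setns(nsfd, CLONE_NEWNET) < 0)
        return -1;
    sock = socket(family, type, protocol);
    if (setns(orig_nsfd, CLONE_NEWNET) < 0) {
        /* Stuck in the wrong namespace; callers must treat this as fatal. */
        if (sock >= 0)
            close(sock);
        return -1;
    }
    return sock;
}
```

Between the two setns() calls the caller is running in the other namespace, and a failure of the second call leaves it there. That window is one plausible reading of the race concern raised below, though the source does not spell out the details.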
Biederman
agrees that it could be done in user space, but is concerned
about race conditions in an implementation of that kind. In addition,
unlike for the other namespace types, he has some specific use cases in
mind for network namespaces:
"The use case are applications are the handful of networking applications
that find that it makes sense to listen to sockets from multiple network
namespaces at once. Say a home machine that has a vpn into your office
network and the vpn into the office network runs in a different network
namespace so you don't have to worry about address conflicts between
the two networks, the chance of accidentally bridging between them,
and so you can use different dns resolvers for the different networks."
But he also realized that it might be a somewhat controversial addition.
Overall, there has been relatively little discussion of the patchset on
linux-kernel, and Biederman said that it had received positive reviews on
the containers mailing list. He posted the patches so that other kernel
developers could review the ABI additions, and there seem to be no
complaints with setns() and the /proc filesystem additions.
Changes for the "pid" namespace were not included in these patches as there is
some work needed before that namespace can be safely unshared. That work
doesn't affect the ABI, though. Once the pid namespace is added in, it
seems likely we will see these
patches return, perhaps without socketat(), sometime soon.
Allowing suitably privileged processes to access others' namespaces will
be a useful addition, and one that may not be too far off.
Page editor: Jonathan Corbet