The current development kernel is 3.5-rc4, released
on June 24. Linus says: "So while
we still have 200+ commits in this -rc, they really are all pretty tiny and
insignificant. Sure, if the particular issue they fixed hit you (or you are
the developer of those life-changing lines ;), you may disagree with the
"insignificant" part, but to me, this is just how I like the -rc's at this
Stable updates: the 3.0.36 and 3.4.4 stable kernels were released on
MUWUHAhaha, watch as I destroy your attempts to reduce line count
in your diffstat.
— Mel Gorman
Kernel developers tend to look at code from the point of view "does
it work as designed", "is it clean", "is it efficient", "do I
understand it", etc. We often forget to step back and really
consider whether or not it should be merged at all.
— Andrew Morton
Very few people add printk()s as "inform the system logging daemon
about an event". The prevailing mindset is that perfect code does
not need any logging, so what is left are over 50,000 call sites of
bragging and debugging code, occasionally massaged to be somewhat
— Ingo Molnar
The Linux Foundation has a couple of new installments in its "30 Linux
kernel developers in 30 weeks" series: Sarah
("Find a medium-sized project, in a part of the Linux
kernel community that has a responsive mailing list. Don't waste your time
on a bunch of spelling fix patches.") and Thomas
("Quite a few people consider me to be one of the Grumpy
Old Men. That's related to my age and the age-related unwillingness to cope
Jim Gettys has posted a
collecting a lot of thoughts on what's wrong with the
Internet and how the problems can be addressed. "'Fairness' between
applications is also essential. We should reduce/eliminate the current
perverse incentives for applications to abuse the network, as HTTP does
today. We’ve had an arms race conspiracy for the last decade between web
browsers and web sites to minimize latency that is destructive to other
traffic we may care about (such as telephony, teleconferencing and
gaming). Sometimes this is best addressed by fixing protocols to be both
more efficient and more friendly to the network, as HTTP/1.1 pipelining and
now SPDY are intended to do. But the 'web site sharding' problem is
impossible for clients to avoid.
Btrfs hacker Josef Bacik has let it be known that he will be leaving Red
Hat and joining the growing crowd of kernel developers at Fusion IO.
Patches going into the mainline kernel contain a number of tags; a
"Signed-off-by:" from the author is mandatory, but most patches include
tags like "Acked-by:" or "Reported-by:" as well. The officially recognized
tags are documented in the SubmittingPatches file, but some developers have a
certain habit of inventing their own as well. Andrew Morton, while
grumpily trying to discourage such usage, did a bit of digging for
unofficial tags. The result surprised him, leading him to ask
"Geeze, guys. Who knew there were so many Kernel Komedians?
For example, he found a number of variants on "Acked-by:", including:
And so on. One can only feel bad for the developers who felt the need to
add "Repented-by:" or "Fatfingered-by:" and wonder about the story behind
tags like "Antagonized-by:" or "Signed-off-and-morning-tea-spilled-by:".
Andrew may dislike these tags, but others seem to find them amusing and
will give them a "Whatevered-by:" at worst.
Kernel development news
A proposal from Cong Wang
to discuss the various mechanisms to store the kernel's "dying breath"
spawned a rather large thread on the ksummit-2012-discuss mailing list.
While things like pstore were set up specifically
to provide a means to store kernel crash information, that doesn't
necessarily make it easy for users to access and report kernel
crashes. That led to suggestions and discussion of better ways for users
to get the information out of their crashed systems—including using
QR codes to facilitate the process.
Most regular users do not have a serial console set up to record crash
information on a separate machine. So the kernel backtrace that appears
after a crash is just written to the console, which means that much of it
will have scrolled off the screen. Even the data that is there is hard to
extract, with some folks trying to type the information in, which is
tedious, not to mention error-prone. A QR code that encoded the relevant
data could certainly help there.
Konrad Rzeszutek Wilk was the first to broach
the QR code idea, though he said it did not originate with him. It
turns out that H. Peter Anvin and Dirk Hohndel have been "messing
with" the idea, but Will Deacon and Marc Zyngier actually showed
something along those lines at the recent Linaro Connect in Hong Kong.
Deacon was hesitant
to call it a prototype, but said that there was some work done on
encoding a kernel crash backtrace as a QR code. There were two
problems with their approach:
- Even without any error correction, the QR code started to get pretty
large (and unreadable) after more than a few lines of backtrace. This
should be fairly easy to fix by encoding the data in a more sensible
manner rather than just verbatim (especially since a backtrace is
a well-structured log). Maybe you could even gzip the whole thing after
that too (then sell an android app to gunzip it :p)
- Displaying the QR code on a panic could be problematic. We tried using
the ASCII option of libqrencode but we couldn't find any phone that
would read the result. So we need a way to get to the framebuffer once
we've sawn our head off (maybe this is easier with x86 and VGA modes?).
One of the original motivations for kernel
modesetting (KMS) was to get readable oops information to the screen.
Using KMS to display a fairly simple QR code graphic instead should be
workable, rather than creating an ASCII version as Deacon describes.
Matthew Garrett noted
that it should be fairly straightforward at least for hardware that has KMS
KMS already has atomic modeswitch support for showing panics. We'd just
need to ensure that there's an unaccelerated path for dumping contents
directly to the framebuffer. If you don't have KMS then you don't get to
play with modern useful functionality.
There is some disagreement about where the decoding of any QR code should
take place. Garrett believes that existing QR
apps in phones should be used, while others are not convinced they can be
coerced into being flexible enough to deal with the large QR codes that might
result from a kernel backtrace. Garrett has also done some work on the problem
Basic design was as follows: Take the backtrace, compress it, encode in
an alphanumeric QR code including an http:// prefix, submit to
http://kbu.gs/blah automatically when user takes a picture
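The design Garrett describes can be sketched in user space roughly as follows. This is only an illustration, not his code: the function names are hypothetical, zlib and Base32 are assumed stand-ins for "compress it" and "encode", and the prefix is upper-cased because QR alphanumeric mode has no lower-case letters.

```python
import base64
import zlib

# The 45 characters permitted in QR alphanumeric mode (no lower case).
QR_ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")

# Assumed prefix, modeled on the placeholder URL in the quote above.
URL_PREFIX = "HTTP://KBU.GS/"


def backtrace_to_qr_payload(backtrace: str) -> str:
    """Compress a backtrace and wrap it in an alphanumeric-safe URL."""
    compressed = zlib.compress(backtrace.encode("utf-8"), 9)
    # Base32 uses only A-Z and 2-7; its '=' padding is not QR-alphanumeric,
    # so it is stripped here and restored when decoding.
    token = base64.b32encode(compressed).decode("ascii").rstrip("=")
    payload = URL_PREFIX + token
    if not set(payload) <= QR_ALNUM:
        raise ValueError("payload contains non-alphanumeric-mode characters")
    return payload


def qr_payload_to_backtrace(payload: str) -> str:
    """Reverse backtrace_to_qr_payload(), as a receiving server might."""
    token = payload[len(URL_PREFIX):]
    padding = "=" * (-len(token) % 8)
    return zlib.decompress(base64.b32decode(token + padding)).decode("utf-8")
```

The resulting string would still have to be rendered as a QR image by something like libqrencode; that step is outside this sketch.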
Anvin would rather see some kind of web
application that accepts a photo of the QR code and decodes it on the server. For
one thing, having one (working) decoding code base is desirable: "I can tell you just how bad a lot of the QR decoder software running on
smartphones are -- because I have tried them." In addition, though,
a web application would also have the photo itself, so even if it didn't
decode because of picture quality or other reasons, those photos could be
used to improve the quality of the decoder.
But that implies that a user would need to download an app to their phone
or use some web application as suggested
by John Hawley. Garrett was not in favor of either solution, noting that
requiring an app makes it harder for users, while a web application
doesn't really make it any better:
And now your workflow is "Take picture, move to browser, upload, wait to
see if it decodes, back to camera, back to browser", etc. I know we're
expected to be bad at UX here, but come on.
Given that many users already use photos to report crashes—taking a
picture of the screen with the last part of the backtrace—the QR code
mechanism, even if a bit cumbersome, might be able to provide the full
backtrace. But, as Dave Jones suggested,
just having scrollback available on the console after a crash would make
much of the problem disappear:
"What would be a thousand times more useful would be having working scrollback
when we panic, like we had circa 2.2".
Users could then take a photo, scroll back a
ways, take another, and so on. In the thread, there was widespread
agreement that console scrollback would be desirable. But it turns out that the
advent of USB keyboards caused the loss of that feature. Doing USB
handling inside the panic code would be messy,
so bringing that
feature back is difficult. Other ideas were mentioned, like providing
enough of the USB stack to write the crash information to a USB stick as
Anvin suggests, or
to "auto-scroll" the console output after a crash without requiring
keyboard input as proposed
by Paul Gortmaker.
Making it easier for users to report crashes with useful information was
one branch of the discussion, but the folks who work on the embedded side
are looking for more developer-oriented solutions as well. Tony Luck outlined
the pstore back-ends that are currently available to store crash and other
information in various places (ERST, EFI variables, RAM) that are
accessible after a reboot. Wang, Tim
Bird, Jason Wessel, and others are interested in discussing that piece of
While QR codes may seem like something of a gimmick, they can compress a fair
amount of data into a form that can be digested elsewhere. Getting useful
information out of an unresponsive, crashed Linux system is fairly
difficult at this point, so finding better ways to do so would be good.
Should the program committee decide to add this topic, a lively discussion
seems likely. If not, though, enough people are looking
into the idea that something will emerge sooner or later.
The record-oriented logging patch set was
pulled into the mainline during the 3.5 merge window. These changes are
meant to make the processing of kernel messages generated by
printk() and friends more reliable, more informative, and more
easily consumed by automatic systems. But recently it has turned out that
these changes make printk()
less useful for kernel developers.
Now there is some uncertainty as to whether this feature can be repaired in
time, or whether it will be reverted back out of the 3.5 release.
One of the core design features of the new printk() is a change
from byte-streamed output to record-oriented output. Current kernels can
easily corrupt messages on their way to the log; for example, when the log
buffer overflows, the kernel simply wraps around and partially overwrites
older messages. Messages from multiple CPUs can also get confused,
especially if one or more CPUs are using multiple printk() calls
to output a single line of text. The switch to the record-oriented
mechanism eliminates these problems; it also makes it possible to attach
useful structured information to messages. As a whole, it looks like a
solid improvement to the kernel logging subsystem.
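The difference between the two buffer styles can be illustrated with a toy user-space model. It is a deliberate simplification rather than the kernel's data structure: both function names and the size limit are made up for the example, and the wrap-around is modeled simply as keeping the tail of the byte stream.

```python
from collections import deque


def byte_stream_log(messages, size):
    """Byte-oriented buffer: only the last `size` characters survive,
    so the oldest surviving message may be cut in the middle."""
    return "".join(messages)[-size:]


def record_log(messages, size):
    """Record-oriented buffer: whole records are evicted, oldest first,
    until the remaining records fit in `size` characters."""
    records = deque()
    used = 0
    for msg in messages:
        records.append(msg)
        used += len(msg)
        while used > size and records:
            used -= len(records.popleft())
    return list(records)


if __name__ == "__main__":
    msgs = ["usb 1-1: new device\n", "eth0: link up\n", "sda: write cache on\n"]
    print(repr(byte_stream_log(msgs, 40)))  # begins with a fragment of the first message
    print(record_log(msgs, 40))             # contains only complete messages
```

In the record-oriented model a reader never sees half a message; the cost is that a whole record is discarded once the buffer is full.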
There is just one little problem, though: when the kernel outputs a partial
message (by passing a string to printk() that does not end with a
newline), the logging system will buffer the text until the rest of the
message arrives. The good news is that this buffering causes the full line
to be output together once it's complete—if things go well. The situation
when things do not go well was best summarized by Andrew Morton:
If a driver does
printk("testing the frobnozzle ...");
and do_test() hangs up, we really really want the user to know that
there was a frobnozzle testing problem. Please tell me this isn't
Not only is this behavior now broken, but it has also burned at least one
developer who ended up spending a lot of time trying to figure out why the
kernel was hanging. Kernel developers depend heavily on printk(),
so this change has caused a fair amount of concern.
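The failure mode is easy to reproduce in a toy model. The class below is a user-space sketch, not the kernel's printk(): `ToyLog` and `run_selftest` are hypothetical names, and a hang is simulated simply by never issuing the printk() call that would finish the line.

```python
class ToyLog:
    """Toy model of a logger that may hold partial lines back."""

    def __init__(self, buffer_partial_lines):
        self.buffer_partial_lines = buffer_partial_lines
        self.pending = ""   # text held back while waiting for a newline
        self.console = ""   # text that has actually reached the console

    def printk(self, text):
        if not self.buffer_partial_lines:
            self.console += text
            return
        self.pending += text
        # Release complete lines only; keep any trailing fragment pending.
        if "\n" in self.pending:
            complete, _, rest = self.pending.rpartition("\n")
            self.console += complete + "\n"
            self.pending = rest


def run_selftest(log):
    log.printk("testing the frobnozzle ...")
    # Imagine do_test() hanging here: the closing printk() never runs.
    return log.console


if __name__ == "__main__":
    print(repr(run_selftest(ToyLog(buffer_partial_lines=True))))
    print(repr(run_selftest(ToyLog(buffer_partial_lines=False))))
```

With buffering enabled, the console stays empty at the moment of the simulated hang; without it, the partial message is visible, which is the behavior developers had come to rely on.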
Bugs happen, of course; the important thing is to fix them. A number of
possible fixes have been discussed on the list, including:
- Leave printk() as it is, and change specific callers to
output only full lines. Kay Sievers, the author of the
printk() changes, suggested
that approach, saying "We really should not optimize for
cosmetics (full lines work reliably, they are not buffered) of
self-tests, for the price of the reliability and integrity of all
- Adding a printk_flush() function to be called in places where
it is important to see partial lines even if things go wrong before
the newline character is printed. The problem with this approach is
that, like printing full lines only, it requires changing every place
in the code where the problem might hit. Experience says that many of
those places can only be found the hard way.
- Add a global knob by which buffering can be turned on or off; this
knob might be set by either user space or the kernel. This idea was
not particularly popular; it seems unlikely that the knob will be set
for unbuffered output when it really matters.
- Simply revert the printk() changes for 3.5 and try again for
3.6 or later. Ingo Molnar posted a
patch to this effect, seemingly as a way of pressuring Kay
to take the problem more seriously.
As of this writing, most of the discussion centers around this patch from Steven Rostedt which simply
removes the buffering from printk(). For the most part, the
advantages of the new code remain. But it is now possible that a single
line of output created with multiple printk() calls may be split
into multiple lines, with messages from other CPUs mixed in between. It
seems to many to be a reasonable compromise fix.
Except that Kay still doesn't like the splitting
of continuation lines. Andrew Morton is also concerned about where the printk()
code is going, saying "The core printk code is starting to make one
think of things like 'cleansing with fire'." Steven, meanwhile, is
reconsidering the whole thing, saying that,
perhaps, printk() is not the right tool for structured logging and
other approaches should be considered. And Greg Kroah-Hartman has suggested that it might be better just to fix
the call sites rather than further complicating the printk() code.
Linus, however, has argued strongly for the
merging of Steven's patch. His view is that buffering at the logging level
is fine, but text emitted with printk() has to get to the console
immediately. So chances are that some version of Steven's fix will be
applied for the 3.5 release. But it has become clear, again, that
adding structured logging to the kernel while not making life harder for
kernel developers is a difficult problem.
It has often been said that memory management patches can take a long time
to be accepted into the mainline kernel. Because memory management
performance regressions can take years to be discovered, developers
in this area have become highly conservative; making memory management
changes is not a recommended endeavor for those lacking patience. But
there may be an area where progress can be even more glacial, for different
reasons. Security-oriented changes are subject to arbitrary delays because
tighter security can break programs and irritate users.
Consider the classic symbolic link vulnerability, wherein an attacker fools
a privileged program into writing to a file behind an attacker-controlled
symbolic link. Such vulnerabilities can be exploited to overwrite files
that the attacker would not otherwise have access to. One does not have to
dig far into the LWN vulnerability list to
see that the identification and patching of symbolic link vulnerabilities
is an ongoing process. One might think that, if somebody could come up
with a way to eliminate such vulnerabilities altogether, it would be
adopted in a hurry.
As it happens, Kees Cook has a way to deal
with this class of vulnerabilities. It is based on the observations that
symbolic link vulnerabilities almost always involve links placed in
/tmp, and that /tmp has the "sticky" bit set in any
contemporary distribution. Given that:
The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink
and follower match, or when the directory owner matches the
In short, this change would make it so that nobody could create symbolic
links in /tmp and expect a privileged program to follow them.
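As a rough illustration, the first two conditions of that rule can be written as a small predicate. This is a user-space sketch with a hypothetical function name, not the patch itself; the third condition is left out because the quoted sentence is cut off above.

```python
import stat


def may_follow_symlink(dir_mode, link_uid, follower_uid):
    """Model the first two clauses of the symlink restriction.

    dir_mode:     permission bits of the directory containing the link
    link_uid:     owner of the symbolic link
    follower_uid: user on whose behalf the link is being followed
    """
    sticky = bool(dir_mode & stat.S_ISVTX)
    world_writable = bool(dir_mode & stat.S_IWOTH)
    if not (sticky and world_writable):
        return True   # outside a sticky world-writable directory
    if link_uid == follower_uid:
        return True   # the follower is following its own link
    # The directory-owner clause is intentionally not modeled here.
    return False


if __name__ == "__main__":
    TMP_MODE = 0o1777  # typical /tmp permissions
    print(may_follow_symlink(TMP_MODE, link_uid=1000, follower_uid=0))
    print(may_follow_symlink(TMP_MODE, link_uid=1000, follower_uid=1000))
    print(may_follow_symlink(0o755, link_uid=1000, follower_uid=0))
```

The first call models the classic attack, a root-owned process following an unprivileged user's link in /tmp, and is the case the restriction is meant to refuse.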
Lest one think that Kees is taking credit for this concept, he posted a bit
of history for this idea, starting with a 1996 Bugtraq
message from Zygo Blaxell and a
kernel patch by Andrew Tridgell from the same year. This idea, in
other words, has been floating around for at least 16 years, but an
implementation has never found its way into the mainline kernel. Memory management
changes are amazingly fast in comparison.
The reason for the resistance, of course, is that this is a change in
filesystem semantics. There are concerns that it would break POSIX
compliance, though Kees claims that POSIX is silent on this particular
behavior. Also of concern is the possibility of breaking existing
applications. Kees responds that any broken applications would be easily
noticed (while those suffering from symbolic link vulnerabilities are not),
and that no applications relying on existing behavior have ever been
found. There have also been disagreements over how the patch should be
implemented, but those have seemingly mostly been resolved.
So Kees thinks that his current patch set
(a variant of one we have seen before)
should be considered for merging, finally. The patches implement the
symbolic link restrictions, but also add a new rule for hard links: a hard
link to a file can only be created if the user owns the file or has write
access to it. Once again, this change eliminates a class of attacks, but
at a small cost: older versions of the "at" daemon break unless a small
patch is applied. No other problems have been found, Kees says, after 1.5
years of experience with this patch in the Ubuntu kernel.
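The hard link rule as described here amounts to a simple ownership-or-write check. A user-space approximation, with a hypothetical function name and assuming a Unix-like system, might look like this:

```python
import os


def may_create_hardlink(path):
    """Return True if the calling user owns `path` or can write to it.

    This only approximates the rule described in the text; the real
    check would be made by the kernel when link() is called.
    """
    st = os.stat(path)
    return st.st_uid == os.geteuid() or os.access(path, os.W_OK)
```

Note that `os.access()` tests against the real rather than the effective user ID, so this is only a sketch of the policy and not a faithful reimplementation.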
Whether that is enough evidence to get the changes merged this time around
remains to be seen. It has only been 16 years, after all, and one would
not want to be too hasty about such a thing.
Meanwhile, Kees has put together a separate security-oriented patch that
has run into some concerns of its own. On Linux systems, there is a sysctl
knob (suid_dumpable) that controls whether a crashing setuid
process generates a core dump or not. Setting it to a non-zero value
allows core dumps to happen; setting it to two applies certain restrictions
that are intended to make it safe. But, Kees says, that's not the case; it
allows a user to create a file called core in almost any
directory, containing arbitrary text (environment strings, for example).
This capability is not necessarily as harmless as one might think; as the 2006 cron vulnerability shows, some
programs will happily pick out the strings they understand in a file full
of junk, ignoring the rest. Thus, he claims, allowing users to create files in
arbitrary locations is asking for trouble.
His response has been through a number of iterations:
- Version 1 disallowed storing
core dumps from privileged executables into a file. If the
core_pattern knob is set to a pipe, instead, core dumps
happen as before. This was seen as an incompatible ABI change,
though, and one that would cause surprising results.
- Version 2 added a new setting (3) that
would only allow setuid core dumps to a pipe. The previous "safe"
setting (2) was deprecated; attempting to set it would fail with an
EINVAL error. This version ran into trouble as a result of how it
interacted with the sysctl mechanism.
- Version 3 fixed the sysctl
difficulties but was opposed by Andrew Morton, who feared that the
deprecation of the previous mode would break current systems in
surprising ways. He suggested keeping suid_dumpable=2 as a
working mode with a warning.
- Version 4 went back to something
closer to version 1, but with some loud warnings emitted. But
then Eric Biederman asked whether disallowing relative paths would be
a sufficient fix.
- Thus, version 5 (the current version,
as of this writing), just disallows the writing of setuid core dumps
to relative paths. Should core_pattern be set to a relative
path ("core", for example), a warning will be logged instead.
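The check in version 5 can be modeled in a few lines. This is a sketch of the policy as described above, not the patch: the function names are hypothetical, and the only assumptions about `core_pattern` are that a leading `|` denotes a pipe and a leading `/` an absolute path.

```python
def classify_core_pattern(pattern):
    """Classify a core_pattern value as 'pipe', 'absolute' or 'relative'."""
    if pattern.startswith("|"):
        return "pipe"
    if pattern.startswith("/"):
        return "absolute"
    return "relative"


def suid_core_dump_allowed(pattern):
    """Setuid core dumps are refused only for relative patterns."""
    return classify_core_pattern(pattern) != "relative"


if __name__ == "__main__":
    # The handler and directory names below are illustrative only.
    for pattern in ("core", "/var/crash/core.%p", "|/usr/bin/handler %p"):
        print(pattern, suid_core_dump_allowed(pattern))
```

Restricting only relative paths leaves pipe handlers and absolute dump directories working as before, which is what makes this version less disruptive than the earlier ones.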
Thus far, there has not been much in the way of complaints about the fifth
iteration of the patch. So, possibly, it will not be necessary to wait for
years until this particular bit of security tightening gets into the
mainline kernel. Of course, unlike the system's link behavior, the core
dump behavior can be changed now by concerned system administrators—no need
to wait at all.
Page editor: Jonathan Corbet