Kernel development
Brief items
Kernel release status
The current development kernel is 3.5-rc4, released on June 24. Linus says: "So while we still have 200+ commits in this -rc, they really are all pretty tiny and insignificant. Sure, if the particular issue they fixed hit you (or you are the developer of those life-changing lines ;), you may disagree with the "insignificant" part, but to me, this is just how I like the -rc's at this point."
Stable updates: the 3.0.36 and 3.4.4 stable kernels were released on June 22.
Quotes of the week
Developer profiles: Sarah Sharp and Thomas Gleixner (Linux.com)
The Linux Foundation has a couple of new installments in its "30 Linux kernel developers in 30 weeks" series: Sarah Sharp ("Find a medium-sized project, in a part of the Linux kernel community that has a responsive mailing list. Don't waste your time on a bunch of spelling fix patches.") and Thomas Gleixner ("Quite a few people consider me to be one of the Grumpy Old Men. That's related to my age and the age-related unwillingness to cope with crap.").
Gettys: The Internet is Broken, and How to Fix It
Jim Gettys has posted a lengthy article collecting a lot of thoughts on what's wrong with the Internet and how the problems can be addressed. "'Fairness' between applications is also essential. We should reduce/eliminate the current perverse incentives for applications to abuse the network, as HTTP does today. We’ve had an arms race conspiracy for the last decade between web browsers and web sites to minimize latency that is destructive to other traffic we may care about (such as telephony, teleconferencing and gaming). Sometimes this is best addressed by fixing protocols to be both more efficient and more friendly to the network, as HTTP/1.1 pipelining and now SPDY are intended to do. But the 'web site sharding' problem is impossible for clients to avoid."
Fusion IO picks up another developer
Btrfs hacker Josef Bacik has let it be known that he will be leaving Red Hat and joining the growing crowd of kernel developers at Fusion IO.
Attack of the Kernel Komedians
Patches going into the mainline kernel contain a number of tags; a "Signed-off-by:" from the author is mandatory, but most patches include tags like "Acked-by:" or "Reported-by:" as well. The officially recognized tags are documented in the SubmittingPatches file, but some developers have a certain habit of inventing their own as well. Andrew Morton, while grumpily trying to discourage such usage, did a bit of digging for unofficial tags. The result surprised him, leading him to ask "Geeze, guys. Who knew there were so many Kernel Komedians?"
For example, he found a number of variants on "Acked-by:", including:
Cautiously-acked-by:
Delightedly-acked-by:
Embarrassingly-Acked-by:
Emphatically-Acked-by:
Grudgingly-acked-by:
Hella-acked-by:
Sort-Of-Acked-By:
And so on. One can only feel bad for the developers who felt the need to add "Repented-by:" or "Fatfingered-by:" and wonder about the story behind tags like "Antagonized-by:" or "Signed-off-and-morning-tea-spilled-by:". Andrew may dislike these tags, but others seem to find them amusing and will give them a "Whatevered-by:" at worst.
Kernel development news
Displaying QR codes for kernel crashes
A proposal from Cong Wang to discuss the various mechanisms to store the kernel's "dying breath" spawned a rather large thread on the ksummit-2012-discuss mailing list. While things like pstore were set up specifically to provide a means to store kernel crash information, that doesn't necessarily make it easy for users to access and report kernel crashes. That led to suggestions and discussion of better ways for users to get the information out of their crashed systems—including using QR codes to facilitate the process.
Most regular users do not have a serial console set up to record crash information on a separate machine. So the kernel backtrace that appears after a crash is just written to the console, which means that much of it will have scrolled off the screen. Even the data that is there is hard to extract, with some folks trying to type the information in, which is tedious, not to mention error-prone. A QR code that encoded the relevant data could certainly help there.
Konrad Rzeszutek Wilk was the first to broach the QR code idea, though he said it did not originate with him. It turns out that H. Peter Anvin and Dirk Hohndel have been "messing with" the idea, but Will Deacon and Marc Zyngier actually showed something along those lines at the recent Linaro Connect in Hong Kong. Deacon was hesitant to call it a prototype, but said that there was some work done on encoding a kernel crash backtrace as a QR code. There were two problems with their approach:
- Even without any error correction, the QR code started to get pretty large (and unreadable) after more than a few lines of backtrace. This should be fairly easy to fix by encoding the data in a more sensible manner rather than just verbatim (especially since a backtrace is a well-structured log). Maybe you could even gzip the whole thing after that too (then sell an android app to gunzip it :p)
- Displaying the QR code on a panic could be problematic. We tried using the ASCII option of libqrencode but we couldn't find any phone that would read the result. So we need a way to get to the framebuffer once we've sawn our head off (maybe this is easier with x86 and VGA modes?).
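As a rough illustration of the "gzip the whole thing" idea in Deacon's first point, a user-space sketch might deflate the oops text before handing it to a QR encoder; the backtrace string and buffer size below are purely illustrative, and the actual Linaro Connect prototype presumably did things differently:

    /*
     * A stand-alone sketch of the "compress it first" idea: deflate the oops
     * text with zlib before handing it to a QR encoder.  The backtrace string
     * is a made-up stand-in; a real tool would feed the compressed buffer to
     * something like libqrencode rather than just printing sizes.
     */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        const char oops[] =
            "BUG: unable to handle kernel NULL pointer dereference\n"
            "Call Trace:\n"
            " [<ffffffff81234567>] example_function+0x10/0x40\n";
        uLong srclen = sizeof(oops);
        uLongf dstlen = compressBound(srclen);
        unsigned char packed[1024];

        if (dstlen > sizeof(packed) ||
            compress2(packed, &dstlen, (const Bytef *)oops, srclen,
                      Z_BEST_COMPRESSION) != Z_OK) {
            fprintf(stderr, "compression failed\n");
            return 1;
        }
        printf("%lu bytes of oops text -> %lu bytes to encode\n",
               srclen, (unsigned long)dstlen);
        return 0;
    }

Backtrace text is highly repetitive, so it compresses well, which is exactly why Deacon expected structure-aware encoding to shrink the resulting QR code further.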
One of the original motivations for kernel modesetting (KMS) was to get readable oops information to the screen. Using KMS to display a fairly simple QR code graphic instead should be workable, rather than creating an ASCII version as Deacon describes. Matthew Garrett noted that it should be fairly straightforward, at least for hardware that has KMS support.
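The in-kernel path would have to go through KMS at panic time, but the general shape of the display job can be seen from user space: scale a matrix of QR modules up to large black and white squares and write them to the framebuffer. The following is only a hedged sketch; qr_modules[] is a placeholder for the output of a real encoder, and a 32-bits-per-pixel framebuffer is assumed.

    /*
     * User-space sketch only: paint a QR module matrix as large squares on
     * /dev/fb0.  qr_modules[] is a placeholder (all white here); a real tool
     * would fill it from a QR encoder.  Assumes a 32bpp framebuffer.
     */
    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define QR_WIDTH 33    /* modules per side; depends on the QR version */
    static uint8_t qr_modules[QR_WIDTH][QR_WIDTH];  /* 1 = black, 0 = white */

    int main(void)
    {
        struct fb_var_screeninfo var;
        struct fb_fix_screeninfo fix;
        int fd = open("/dev/fb0", O_RDWR);

        if (fd < 0 || ioctl(fd, FBIOGET_VSCREENINFO, &var) < 0 ||
            ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0)
            return 1;

        uint8_t *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED)
            return 1;

        /* Scale each module up and leave a quiet zone around the code. */
        unsigned int scale = var.yres / (QR_WIDTH + 8);
        unsigned int margin = 4 * scale;
        for (unsigned int y = 0; y < QR_WIDTH * scale; y++) {
            for (unsigned int x = 0; x < QR_WIDTH * scale; x++) {
                uint32_t pixel = qr_modules[y / scale][x / scale] ?
                                 0x00000000 : 0x00ffffff;
                memcpy(fb + (y + margin) * fix.line_length +
                            (x + margin) * 4, &pixel, 4);
            }
        }
        munmap(fb, fix.smem_len);
        close(fd);
        return 0;
    }

A panic handler gets no such help from user space, of course; doing the equivalent through KMS after the kernel has "sawn its head off" is the hard part Deacon alluded to.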
There is some disagreement about where the decoding of any QR code should take place. Garrett believes that existing QR apps in phones should be used, while others are not convinced they can be coerced into being flexible enough to deal with the large QR codes that might result from a kernel backtrace. Garrett has also done some work on the problem and described his approach.
Anvin would rather see some kind of web application that accepts a photo of the QR code and decodes it on the server. For one thing, having one (working) decoding code base is desirable: "I can tell you just how bad a lot of the QR decoder software running on smartphones are -- because I have tried them." In addition, though, a web application would also have the photo itself, so even if it didn't decode because of picture quality or other reasons, those photos could be used to improve the quality of the decoder.
But that implies that a user would need to download an app to their phone or use some web application as suggested by John Hawley. Garrett was not in favor of either solution, noting that requiring an app makes it harder for users, while a web application doesn't really make it any better.
Given that many users already use photos to report crashes—taking a picture of the screen with the last part of the backtrace—the QR code mechanism, even if a bit cumbersome, might be able to provide the full backtrace. But, as Dave Jones suggested, just having scrollback available on the console after a crash would make much of the problem disappear: "What would be a thousand times more useful would be having working scrollback when we panic, like we had circa 2.2".
Users could then take a photo, scroll back a ways, take another, and so on. In the thread, there was widespread agreement that console scrollback would be desirable. But it turns out that the advent of USB keyboards caused the loss of that feature. Doing USB handling inside the panic code would be messy, so bringing that feature back is difficult. Other ideas were mentioned, like providing enough of the USB stack to write the crash information to a USB stick as Anvin suggests, or to "auto-scroll" the console output after a crash without requiring keyboard input as proposed by Paul Gortmaker.
Making it easier for users to report crashes with useful information was one branch of the discussion, but the folks who work on the embedded side are looking for more developer-oriented solutions as well. Tony Luck outlined the pstore back-ends that are currently available to store crash and other information in various places (ERST, EFI variables, RAM) that are accessible after a reboot. Wang, Tim Bird, Jason Wessel, and others are interested in discussing that piece of the puzzle.
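To illustrate the "accessible after a reboot" part: once one of those back-ends has captured a crash, its records show up as ordinary files in the pstore filesystem (conventionally mounted at /sys/fs/pstore), so even a trivial program or a bug-reporting tool can pick them up. The file names in the sketch below are examples of the usual naming pattern, not a guaranteed interface.

    /*
     * Hedged sketch: list whatever pstore preserved across the last crash.
     * Records appear as files like "dmesg-ramoops-0" or "dmesg-erst-<id>";
     * the mount point is assumed to be /sys/fs/pstore.
     */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "/sys/fs/pstore";
        DIR *d = opendir(dir);
        struct dirent *de;

        if (!d) {
            perror(dir);
            return 1;
        }
        while ((de = readdir(d)) != NULL) {
            if (de->d_name[0] != '.')
                printf("%s/%s\n", dir, de->d_name);
        }
        closedir(d);
        return 0;
    }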
While QR codes may seem like something of a gimmick, they can compress a fair amount of data into a form that can be digested elsewhere. Getting useful information out of an unresponsive, crashed Linux system is fairly difficult at this point, so finding better ways to do so would be good. Should the program committee decide to add this topic, a lively discussion seems likely. If not, though, enough people are looking into the idea that something will emerge sooner or later.
printk() problems
The record-oriented logging patch set was pulled into the mainline during the 3.5 merge window. These changes are meant to make the processing of kernel messages generated by printk() and friends more reliable, more informative, and more easily consumed by automated systems. But recently it has turned out that these changes make printk() less useful for kernel developers. Now there is some uncertainty as to whether this feature can be repaired in time, or whether it will be reverted out of the 3.5 release.

One of the core design features of the new printk() is a change from byte-stream output to record-oriented output. Current kernels can easily corrupt messages on their way to the log; for example, when the log buffer overflows, the kernel simply wraps around and partially overwrites older messages. Messages from multiple CPUs can also get confused, especially if one or more CPUs are using multiple printk() calls to output a single line of text. The switch to the record-oriented mechanism eliminates these problems; it also makes it possible to attach useful structured information to messages. As a whole, it looks like a solid improvement to the kernel logging subsystem.
There is just one little problem, though: when the kernel outputs a partial message (by passing a string to printk() that does not end with a newline), the logging system will buffer the text until the rest of the message arrives. The good news is that this buffering causes the full line to be output together once it's complete—if things go well. The situation when things do not go well was best summarized by Andrew Morton: if the code does

    printk("testing the frobnozzle ...");
    do_test();
    printk(" OK\n");

and do_test() hangs up, we really really want the user to know that there was a frobnozzle testing problem. Please tell me this isn't broken.
Not only is this behavior now broken, but it has also burned at least one developer who ended up spending a lot of time trying to figure out why the kernel was hanging. Kernel developers depend heavily on printk(), so this change has caused a fair amount of concern.
Bugs happen, of course; the important thing is to fix them. A number of possible fixes have been discussed on the list, including:
- Leave printk() as it is, and change specific callers to output only full lines (a sketch of this approach appears after the list). Kay Sievers, the author of the printk() changes, suggested that approach, saying "We really should not optimize for cosmetics (full lines work reliably, they are not buffered) of self-tests, for the price of the reliability and integrity of all other users."
- Add a printk_flush() function to be called in places where it is important to see partial lines even if things go wrong before the newline character is printed. The problem with this approach is that, like printing full lines only, it requires changing every place in the code where the problem might hit. Experience says that many of those places can only be found the hard way.
- Add a global knob by which buffering can be turned on or off; this knob might be set by either user space or the kernel. This idea was not particularly popular; it seems unlikely that the knob will be set for unbuffered output when it really matters.
- Simply revert the printk() changes for 3.5 and try again for 3.6 or later. Ingo Molnar posted a patch to this effect, seemingly as a way of pressuring Kay to take the problem more seriously.
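To make the first option above concrete, here is a hedged sketch of what such a call-site change looks like; the frobnozzle test is Andrew Morton's hypothetical example, not real kernel code. Because each message now ends with a newline, the first line reaches the console before do_test() runs, even with the new buffering in place.

    /*
     * Sketch of a caller rewritten to emit only complete lines.  do_test()
     * stands in for a self-test that might hang; the module wrapper is just
     * to make the example self-contained.
     */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    static void do_test(void)
    {
        /* imagine a hardware self-test that may never return */
    }

    static int __init frobnozzle_init(void)
    {
        /*
         * Old style: printk("testing the frobnozzle ..."); do_test();
         *            printk(" OK\n");
         * The first fragment would sit in the continuation buffer and be
         * lost if do_test() hung.  Full lines are pushed out immediately:
         */
        printk(KERN_INFO "testing the frobnozzle ...\n");
        do_test();
        printk(KERN_INFO "frobnozzle test OK\n");
        return 0;
    }

    static void __exit frobnozzle_exit(void)
    {
    }

    module_init(frobnozzle_init);
    module_exit(frobnozzle_exit);
    MODULE_LICENSE("GPL");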
As of this writing, most of the discussion centers around this patch from Steven Rostedt which simply removes the buffering from printk(). For the most part, the advantages of the new code remain. But it is now possible that a single line of output created with multiple printk() calls may be split into multiple lines, with messages from other CPUs mixed in between. It seems to many to be a reasonable compromise fix.
Except that Kay still doesn't like the splitting of continuation lines. Andrew Morton is also concerned about where the printk() code is going, saying "The core printk code is starting to make one think of things like 'cleansing with fire'." Steven, meanwhile, is reconsidering the whole thing, saying that, perhaps, printk() is not the right tool for structured logging and other approaches should be considered. And Greg Kroah-Hartman has suggested that it might be better just to fix the call sites rather than further complicating the printk() code.
Linus, however, has argued strongly for the merging of Steven's patch. His view is that buffering at the logging level is fine, but text emitted with printk() has to get to the console immediately. So chances are that some version of Steven's fix will be applied for the 3.5 release. But it has become clear, again, that adding structured logging to the kernel while not making life harder for kernel developers is a difficult problem.
Tightening security: not for the impatient
It has often been said that memory management patches can take a long time to be accepted into the mainline kernel. Because memory management performance regressions can take years to be discovered, developers in this area have become highly conservative; making memory management changes is not a recommended endeavor for those lacking patience. But there may be an area where progress can be even more glacial, for different reasons. Security-oriented changes are subject to arbitrary delays because tighter security can break programs and irritate users.

Consider the classic symbolic link vulnerability, wherein an attacker fools a privileged program into writing to a file behind an attacker-controlled symbolic link. Such vulnerabilities can be exploited to overwrite files that the attacker would not otherwise have access to. One does not have to dig far into the LWN vulnerability list to see that the identification and patching of symbolic link vulnerabilities is an ongoing process. One might think that, if somebody could come up with a way to eliminate such vulnerabilities altogether, it would be adopted in a hurry.
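To see why this class of bug keeps recurring, consider a privileged program that writes a report into /tmp. The following is a hedged user-space sketch of the vulnerable pattern and two safer alternatives; the file names are invented for illustration.

    /*
     * The classic /tmp symlink problem in miniature.  If an attacker has
     * already run "ln -s /etc/passwd /tmp/report", the first open() happily
     * follows the link; O_EXCL (or mkstemp()) refuses to reuse an existing
     * name.  All paths here are purely illustrative.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Vulnerable: writes through whatever /tmp/report points at. */
        int bad = open("/tmp/report", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (bad >= 0)
            close(bad);

        /* Safer: refuse to use a name that already exists, link or not. */
        int good = open("/tmp/report", O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (good < 0)
            perror("open(O_EXCL)");
        else
            close(good);

        /* Safer still: let mkstemp() pick an unpredictable name. */
        char template[] = "/tmp/report-XXXXXX";
        int fd = mkstemp(template);
        if (fd >= 0) {
            printf("writing to %s\n", template);
            close(fd);
        }
        return 0;
    }

A kernel-side restriction attacks the same problem from the other direction, protecting the many programs that will never be fixed individually.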
As it happens, Kees Cook has a way to deal with this class of vulnerabilities. It is based on the observations that symbolic link vulnerabilities almost always involve links placed in /tmp, and that /tmp has the "sticky" bit set in any contemporary distribution. Given that, his patch restricts the following of symbolic links found in sticky, world-writable directories: such a link will only be followed if the process following it owns the link, or if the link and the directory have the same owner.
In short, this change would make it so that nobody could create symbolic links in /tmp and expect a privileged program to follow them. Lest one think that Kees is taking credit for this concept, he posted a bit of history for this idea, starting with a 1996 Bugtraq message from Zygo Blaxell and a kernel patch by Andrew Tridgell from the same year. This idea, in other words, has been floating around for at least 16 years, but an implementation has never found its way into the mainline kernel. Memory management changes are amazingly fast in comparison.
The reason for the resistance, of course, is that this is a change in filesystem semantics. There are concerns that it would break POSIX compliance, though Kees claims that POSIX is silent on this particular behavior. Also of concern is the possibility of breaking existing applications. Kees responds that any broken applications would be easily noticed (while those suffering from symbolic link vulnerabilities are not), and that no applications relying on the existing behavior have ever been found. There have also been disagreements over how the patch should be implemented, but those appear to have been mostly resolved.
So Kees thinks that his current patch set (a variant of one we have seen before) should be considered for merging, finally. The patches implement the symbolic link restrictions, but also add a new rule for hard links: a hard link to a file can only be created if the user owns the file or has write access to it. Once again, this change eliminates a class of attacks, but at a small cost: older versions of the "at" daemon break unless a small patch is applied. No other problems have been found, Kees says, after 1.5 years of experience with this patch in the Ubuntu kernel.
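The effect of the hard-link rule is easy to demonstrate. Under the restriction, the link() call in the hedged sketch below fails with EPERM for an unprivileged user, while a stock kernel allows it; /etc/shadow is just a convenient example of a file the user can neither write nor own, and the call also assumes both paths live on the same filesystem.

    /*
     * Demonstration of the hard-link restriction: linking to a file the
     * caller neither owns nor can write to is refused (EPERM) when the
     * restriction is active.  /etc/shadow is only an example target, and
     * source and destination must be on the same filesystem for link()
     * to work at all.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (link("/etc/shadow", "/tmp/shadow-link") == 0) {
            printf("hard link created: restriction not active\n");
            unlink("/tmp/shadow-link");
        } else {
            perror("link");   /* EPERM with the restriction in place */
        }
        return 0;
    }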
Whether that is enough evidence to get the changes merged this time around remains to be seen. It has only been 16 years, after all, and one would not want to be too hasty about such a thing.
Meanwhile, Kees has put together a separate security-oriented patch that has run into some concerns of its own. On Linux systems, there is a sysctl knob (suid_dumpable) that controls whether a crashing setuid process generates a core dump or not. Setting it to a non-zero value allows core dumps to happen; setting it to two applies certain restrictions that are intended to make it safe. But, Kees says, that's not the case; it allows a user to create a file called core in almost any directory, containing arbitrary text (environment strings, for example). This capability is not necessarily as harmless as one might think; as the 2006 cron vulnerability shows, some programs will happily pick out the strings they understand in a file full of junk while ignoring the rest. Thus, he claims, allowing users to create files in arbitrary locations is asking for trouble.
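For reference, both knobs involved live in procfs and are easy to inspect; the sketch below just prints them. The comments follow the description above (zero disables setuid dumps, non-zero allows them, two is the nominally "safe" mode), and a core_pattern that is neither an absolute path nor a pipe is exactly the "relative path" case Kees is worried about.

    /*
     * Print the two sysctl knobs discussed above; the paths are the standard
     * procfs locations for fs.suid_dumpable and kernel.core_pattern.
     */
    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof(buf), f))
            printf("%-34s %s", path, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        show("/proc/sys/fs/suid_dumpable");      /* 0, 1, or 2 */
        show("/proc/sys/kernel/core_pattern");   /* "core" = relative path */
        return 0;
    }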
His response has been through a number of iterations:
- Version 1 disallowed storing core dumps from privileged executables into a file. If the core_pattern knob is set to a pipe, instead, core dumps happen as before. This was seen as an incompatible ABI change, though, and one that would cause surprising results.
- Version 2 added a new setting (3) that would only allow setuid core dumps to a pipe. The previous "safe" setting (2) was deprecated; attempting to set it would fail with an EINVAL error. This version ran into trouble as a result of how it interacted with the sysctl mechanism.
- Version 3 fixed the sysctl difficulties but was opposed by Andrew Morton, who feared that the deprecation of the previous mode would break current systems in surprising ways. He suggested keeping suid_dumpable=2 as a working mode with a warning.
- Version 4 went back to something closer to version 1, but with some loud warnings emitted. But then Eric Biederman asked whether disallowing relative paths would be a sufficient fix.
- Thus, version 5 (the current version as of this writing) simply disallows the writing of setuid core dumps to relative paths. Should core_pattern be set to a relative path ("core", for example), a warning will be logged instead.
Thus far, there has not been much in the way of complaints about the fifth iteration of the patch. So, possibly, it will not be necessary to wait for years until this particular bit of security tightening gets into the mainline kernel. Of course, unlike the system's link behavior, the core dump behavior can be changed now by concerned system administrators—no need to wait at all.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet