Kernel development
Brief items
Kernel release status
The current development kernel is 3.5-rc4, released on June 24. Linus says: "So while we still have 200+ commits in this -rc, they really are all pretty tiny and insignificant. Sure, if the particular issue they fixed hit you (or you are the developer of those life-changing lines ;), you may disagree with the "insignificant" part, but to me, this is just how I like the -rc's at this point."
Stable updates: the 3.0.36 and 3.4.4 stable kernels were released on June 22.
Quotes of the week
Developer profiles: Sarah Sharp and Thomas Gleixner (Linux.com)
The Linux Foundation has a couple of new installments in its "30 Linux kernel developers in 30 weeks" series: Sarah Sharp ("Find a medium-sized project, in a part of the Linux kernel community that has a responsive mailing list. Don't waste your time on a bunch of spelling fix patches.") and Thomas Gleixner ("Quite a few people consider me to be one of the Grumpy Old Men. That's related to my age and the age-related unwillingness to cope with crap.").
Gettys: The Internet is Broken, and How to Fix It
Jim Gettys has posted a lengthy article collecting a lot of thoughts on what's wrong with the Internet and how the problems can be addressed. "'Fairness' between applications is also essential. We should reduce/eliminate the current perverse incentives for applications to abuse the network, as HTTP does today. We’ve had an arms race conspiracy for the last decade between web browsers and web sites to minimize latency that is destructive to other traffic we may care about (such as telephony, teleconferencing and gaming). Sometimes this is best addressed by fixing protocols to be both more efficient and more friendly to the network, as HTTP/1.1 pipelining and now SPDY are intended to do. But the 'web site sharding' problem is impossible for clients to avoid."
Fusion IO picks up another developer
Btrfs hacker Josef Bacik has let it be known that he will be leaving Red Hat and joining the growing crowd of kernel developers at Fusion IO.
Attack of the Kernel Komedians
Patches going into the mainline kernel contain a number of tags; a "Signed-off-by:" from the author is mandatory, but most patches include tags like "Acked-by:" or "Reported-by:" as well. The officially recognized tags are documented in the SubmittingPatches file, but some developers have a certain habit of inventing their own as well. Andrew Morton, while grumpily trying to discourage such usage, did a bit of digging for unofficial tags. The result surprised him, leading him to ask "Geeze, guys. Who knew there were so many Kernel Komedians?"
For example, he found a number of variants on "Acked-by:", including:
Cautiously-acked-by:
Delightedly-acked-by:
Embarrassingly-Acked-by:
Emphatically-Acked-by:
Grudgingly-acked-by:
Hella-acked-by:
Sort-Of-Acked-By:
And so on. One can only feel bad for the developers who felt the need to add "Repented-by:" or "Fatfingered-by:" and wonder about the story behind tags like "Antagonized-by:" or "Signed-off-and-morning-tea-spilled-by:". Andrew may dislike these tags, but others seem to find them amusing and will give them a "Whatevered-by:" at worst.
Kernel development news
Displaying QR codes for kernel crashes
A proposal from Cong Wang to discuss the various mechanisms to store the kernel's "dying breath" spawned a rather large thread on the ksummit-2012-discuss mailing list. While things like pstore were set up specifically to provide a means to store kernel crash information, that doesn't necessarily make it easy for users to access and report kernel crashes. That led to suggestions and discussion of better ways for users to get the information out of their crashed systems—including using QR codes to facilitate the process.
Most regular users do not have a serial console set up to record crash information on a separate machine. So the kernel backtrace that appears after a crash is just written to the console, which means that much of it will have scrolled off the screen. Even the data that is there is hard to extract, with some folks trying to type the information in, which is tedious, not to mention error-prone. A QR code that encoded the relevant data could certainly help there.
Konrad Rzeszutek Wilk was the first to broach the QR code idea, though he said it did not originate with him. It turns out that H. Peter Anvin and Dirk Hohndel have been "messing with" the idea, but Will Deacon and Marc Zyngier actually showed something along those lines at the recent Linaro Connect in Hong Kong. Deacon was hesitant to call it a prototype, but said that there was some work done on encoding a kernel crash backtrace as a QR code. There were two problems with their approach:
- Even without any error correction, the QR code started to get pretty large (and unreadable) after more than a few lines of backtrace. This should be fairly easy to fix by encoding the data in a more sensible manner rather than just verbatim (especially since a backtrace is a well-structured log). Maybe you could even gzip the whole thing after that too (then sell an android app to gunzip it :p)
- Displaying the QR code on a panic could be problematic. We tried using the ASCII option of libqrencode but we couldn't find any phone that would read the result. So we need a way to get to the framebuffer once we've sawn our head off (maybe this is easier with x86 and VGA modes?).
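As a rough illustration of the "gzip the whole thing" idea in Deacon's first point, a user-space sketch might deflate the oops text before handing it to a QR encoder; the backtrace string and buffer size below are purely illustrative, and the actual Linaro Connect prototype presumably did things differently:

    /*
     * A stand-alone sketch of the "compress it first" idea: deflate the oops
     * text with zlib before handing it to a QR encoder.  The backtrace string
     * is a made-up stand-in; a real tool would feed the compressed buffer to
     * something like libqrencode rather than just printing sizes.
     */
    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        const char oops[] =
            "BUG: unable to handle kernel NULL pointer dereference\n"
            "Call Trace:\n"
            " [<ffffffff81234567>] example_function+0x10/0x40\n";
        uLong srclen = sizeof(oops);
        uLongf dstlen = compressBound(srclen);
        unsigned char packed[1024];

        if (dstlen > sizeof(packed) ||
            compress2(packed, &dstlen, (const Bytef *)oops, srclen,
                      Z_BEST_COMPRESSION) != Z_OK) {
            fprintf(stderr, "compression failed\n");
            return 1;
        }
        printf("%lu bytes of oops text -> %lu bytes to encode\n",
               srclen, (unsigned long)dstlen);
        return 0;
    }

Backtrace text is highly repetitive, so it compresses well, which is exactly why Deacon expected structure-aware encoding to shrink the resulting QR code further.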
One of the original motivations for kernel modesetting (KMS) was to get readable oops information to the screen. Using KMS to display a fairly simple QR code graphic instead should be workable, rather than creating an ASCII version as Deacon describes. Matthew Garrett noted that it should be fairly straightforward, at least for hardware that has KMS support.
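The in-kernel path would have to go through KMS at panic time, but the general shape of the display job can be seen from user space: scale a matrix of QR modules up to large black and white squares and write them to the framebuffer. The following is only a hedged sketch; qr_modules[] is a placeholder for the output of a real encoder, and a 32-bits-per-pixel framebuffer is assumed.

    /*
     * User-space sketch only: paint a QR module matrix as large squares on
     * /dev/fb0.  qr_modules[] is a placeholder (all white here); a real tool
     * would fill it from a QR encoder.  Assumes a 32bpp framebuffer.
     */
    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define QR_WIDTH 33    /* modules per side; depends on the QR version */
    static uint8_t qr_modules[QR_WIDTH][QR_WIDTH];  /* 1 = black, 0 = white */

    int main(void)
    {
        struct fb_var_screeninfo var;
        struct fb_fix_screeninfo fix;
        int fd = open("/dev/fb0", O_RDWR);

        if (fd < 0 || ioctl(fd, FBIOGET_VSCREENINFO, &var) < 0 ||
            ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0)
            return 1;

        uint8_t *fb = mmap(NULL, fix.smem_len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED)
            return 1;

        /* Scale each module up and leave a quiet zone around the code. */
        unsigned int scale = var.yres / (QR_WIDTH + 8);
        unsigned int margin = 4 * scale;
        for (unsigned int y = 0; y < QR_WIDTH * scale; y++) {
            for (unsigned int x = 0; x < QR_WIDTH * scale; x++) {
                uint32_t pixel = qr_modules[y / scale][x / scale] ?
                                 0x00000000 : 0x00ffffff;
                memcpy(fb + (y + margin) * fix.line_length +
                            (x + margin) * 4, &pixel, 4);
            }
        }
        munmap(fb, fix.smem_len);
        close(fd);
        return 0;
    }

A panic handler gets no such help from user space, of course; doing the equivalent through KMS after the kernel has "sawn its head off" is the hard part Deacon alluded to.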
There is some disagreement about where the decoding of any QR code should take place. Garrett believes that existing QR apps in phones should be used, while others are not convinced they can be coerced into being flexible enough to deal with the large QR codes that might result from a kernel backtrace. Garrett has also done some work on the problem and described his approach.
Anvin would rather see some kind of web application that accepts a photo of the QR code and decodes it on the server. For one thing, having one (working) decoding code base is desirable: "I can tell you just how bad a lot of the QR decoder software running on smartphones are -- because I have tried them." In addition, though, a web application would also have the photo itself, so even if it didn't decode because of picture quality or other reasons, those photos could be used to improve the quality of the decoder.
But that implies that a user would need to download an app to their phone or use some web application as suggested by John Hawley. Garrett was not in favor of either solution, noting that requiring an app makes it harder for users, while a web application doesn't really make it any better.
Given that many users already use photos to report crashes—taking a picture of the screen with the last part of the backtrace—the QR code mechanism, even if a bit cumbersome, might be able to provide the full backtrace. But, as Dave Jones suggested, just having scrollback available on the console after a crash would make much of the problem disappear: "What would be a thousand times more useful would be having working scrollback when we panic, like we had circa 2.2".
Users could then take a photo, scroll back a ways, take another, and so on. In the thread, there was widespread agreement that console scrollback would be desirable. But it turns out that the advent of USB keyboards caused the loss of that feature. Doing USB handling inside the panic code would be messy, so bringing that feature back is difficult. Other ideas were mentioned, like providing enough of the USB stack to write the crash information to a USB stick as Anvin suggests, or to "auto-scroll" the console output after a crash without requiring keyboard input as proposed by Paul Gortmaker.
Making it easier for users to report crashes with useful information was one branch of the discussion, but the folks who work on the embedded side are looking for more developer-oriented solutions as well. Tony Luck outlined the pstore back-ends that are currently available to store crash and other information in various places (ERST, EFI variables, RAM) that are accessible after a reboot. Wang, Tim Bird, Jason Wessel, and others are interested in discussing that piece of the puzzle.
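To illustrate the "accessible after a reboot" part: once one of those back-ends has captured a crash, its records show up as ordinary files in the pstore filesystem (conventionally mounted at /sys/fs/pstore), so even a trivial program or a bug-reporting tool can pick them up. The file names in the sketch below are examples of the usual naming pattern, not a guaranteed interface.

    /*
     * Hedged sketch: list whatever pstore preserved across the last crash.
     * Records appear as files like "dmesg-ramoops-0" or "dmesg-erst-<id>";
     * the mount point is assumed to be /sys/fs/pstore.
     */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        const char *dir = "/sys/fs/pstore";
        DIR *d = opendir(dir);
        struct dirent *de;

        if (!d) {
            perror(dir);
            return 1;
        }
        while ((de = readdir(d)) != NULL) {
            if (de->d_name[0] != '.')
                printf("%s/%s\n", dir, de->d_name);
        }
        closedir(d);
        return 0;
    }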
While QR codes may seem like something of a gimmick, they can compress a fair amount of data into a form that can be digested elsewhere. Getting useful information out of an unresponsive, crashed Linux system is fairly difficult at this point, so finding better ways to do so would be good. Should the program committee decide to add this topic, a lively discussion seems likely. If not, though, enough people are looking into the idea that something will emerge sooner or later.
printk() problems
The record-oriented logging patch set was pulled into the mainline during the 3.5 merge window. These changes are meant to make the processing of kernel messages generated by printk() and friends more reliable, more informative, and more easily consumed by automated systems. But recently it has turned out that these changes make printk() less useful for kernel developers. Now there is some uncertainty as to whether this feature can be repaired in time, or whether it will be reverted out of the 3.5 release.

One of the core design features of the new printk() is a change from byte-stream output to record-oriented output. Current kernels can easily corrupt messages on their way to the log; for example, when the log buffer overflows, the kernel simply wraps around and partially overwrites older messages. Messages from multiple CPUs can also get confused, especially if one or more CPUs are using multiple printk() calls to output a single line of text. The switch to the record-oriented mechanism eliminates these problems; it also makes it possible to attach useful structured information to messages. As a whole, it looks like a solid improvement to the kernel logging subsystem.
There is just one little problem, though: when the kernel outputs a partial message (by passing a string to printk() that does not end with a newline), the logging system will buffer the text until the rest of the message arrives. The good news is that this buffering causes the full line to be output together once it's complete—if things go well. The situation when things do not go well was best summarized by Andrew Morton: if the code does

    printk("testing the frobnozzle ...");
    do_test();
    printk(" OK\n");

and do_test() hangs up, we really really want the user to know that there was a frobnozzle testing problem. Please tell me this isn't broken.
Not only is this behavior now broken, but it has also burned at least one developer who ended up spending a lot of time trying to figure out why the kernel was hanging. Kernel developers depend heavily on printk(), so this change has caused a fair amount of concern.
Bugs happen, of course; the important thing is to fix them. A number of possible fixes have been discussed on the list, including:
- Leave printk() as it is, and change specific callers to output only full lines (a sketch of this approach appears after the list). Kay Sievers, the author of the printk() changes, suggested that approach, saying "We really should not optimize for cosmetics (full lines work reliably, they are not buffered) of self-tests, for the price of the reliability and integrity of all other users."
- Add a printk_flush() function to be called in places where it is important to see partial lines even if things go wrong before the newline character is printed. The problem with this approach is that, like printing full lines only, it requires changing every place in the code where the problem might hit. Experience says that many of those places can only be found the hard way.
- Add a global knob by which buffering can be turned on or off; this knob might be set by either user space or the kernel. This idea was not particularly popular; it seems unlikely that the knob will be set for unbuffered output when it really matters.
- Simply revert the printk() changes for 3.5 and try again for 3.6 or later. Ingo Molnar posted a patch to this effect, seemingly as a way of pressuring Kay to take the problem more seriously.
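To make the first option above concrete, here is a hedged sketch of what such a call-site change looks like; the frobnozzle test is Andrew Morton's hypothetical example, not real kernel code. Because each message now ends with a newline, the first line reaches the console before do_test() runs, even with the new buffering in place.

    /*
     * Sketch of a caller rewritten to emit only complete lines.  do_test()
     * stands in for a self-test that might hang; the module wrapper is just
     * to make the example self-contained.
     */
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/module.h>

    static void do_test(void)
    {
        /* imagine a hardware self-test that may never return */
    }

    static int __init frobnozzle_init(void)
    {
        /*
         * Old style: printk("testing the frobnozzle ..."); do_test();
         *            printk(" OK\n");
         * The first fragment would sit in the continuation buffer and be
         * lost if do_test() hung.  Full lines are pushed out immediately:
         */
        printk(KERN_INFO "testing the frobnozzle ...\n");
        do_test();
        printk(KERN_INFO "frobnozzle test OK\n");
        return 0;
    }

    static void __exit frobnozzle_exit(void)
    {
    }

    module_init(frobnozzle_init);
    module_exit(frobnozzle_exit);
    MODULE_LICENSE("GPL");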
As of this writing, most of the discussion centers around this patch from Steven Rostedt which simply removes the buffering from printk(). For the most part, the advantages of the new code remain. But it is now possible that a single line of output created with multiple printk() calls may be split into multiple lines, with messages from other CPUs mixed in between. It seems to many to be a reasonable compromise fix.
Except that Kay still doesn't like the splitting of continuation lines. Andrew Morton is also concerned about where the printk() code is going, saying "The core printk code is starting to make one think of things like 'cleansing with fire'." Steven, meanwhile, is reconsidering the whole thing, saying that, perhaps, printk() is not the right tool for structured logging and other approaches should be considered. And Greg Kroah-Hartman has suggested that it might be better just to fix the call sites rather than further complicating the printk() code.
Linus, however, has argued strongly for the merging of Steven's patch. His view is that buffering at the logging level is fine, but text emitted with printk() has to get to the console immediately. So chances are that some version of Steven's fix will be applied for the 3.5 release. But it has become clear, again, that adding structured logging to the kernel while not making life harder for kernel developers is a difficult problem.
Tightening security: not for the impatient
It has often been said that memory management patches can take a long time to be accepted into the mainline kernel. Because memory management performance regressions can take years to be discovered, developers in this area have become highly conservative; making memory management changes is not a recommended endeavor for those lacking patience. But there may be an area where progress can be even more glacial, for different reasons. Security-oriented changes are subject to arbitrary delays because tighter security can break programs and irritate users.

Consider the classic symbolic link vulnerability, wherein an attacker fools a privileged program into writing to a file behind an attacker-controlled symbolic link. Such vulnerabilities can be exploited to overwrite files that the attacker would not otherwise have access to. One does not have to dig far into the LWN vulnerability list to see that the identification and patching of symbolic link vulnerabilities is an ongoing process. One might think that, if somebody could come up with a way to eliminate such vulnerabilities altogether, it would be adopted in a hurry.
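To see why this class of bug keeps recurring, consider a privileged program that writes a report into /tmp. The following is a hedged user-space sketch of the vulnerable pattern and two safer alternatives; the file names are invented for illustration.

    /*
     * The classic /tmp symlink problem in miniature.  If an attacker has
     * already run "ln -s /etc/passwd /tmp/report", the first open() happily
     * follows the link; O_EXCL (or mkstemp()) refuses to reuse an existing
     * name.  All paths here are purely illustrative.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* Vulnerable: writes through whatever /tmp/report points at. */
        int bad = open("/tmp/report", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (bad >= 0)
            close(bad);

        /* Safer: refuse to use a name that already exists, link or not. */
        int good = open("/tmp/report", O_WRONLY | O_CREAT | O_EXCL, 0600);
        if (good < 0)
            perror("open(O_EXCL)");
        else
            close(good);

        /* Safer still: let mkstemp() pick an unpredictable name. */
        char template[] = "/tmp/report-XXXXXX";
        int fd = mkstemp(template);
        if (fd >= 0) {
            printf("writing to %s\n", template);
            close(fd);
        }
        return 0;
    }

A kernel-side restriction attacks the same problem from the other direction, protecting the many programs that will never be fixed individually.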
As it happens, Kees Cook has a way to deal with this class of vulnerabilities. It is based on the observations that symbolic link vulnerabilities almost always involve links placed in /tmp, and that /tmp has the "sticky" bit set in any contemporary distribution. Given that, his patch restricts the following of symbolic links found in sticky, world-writable directories: such a link will only be followed if the process following it owns the link, or if the link and the directory have the same owner.
In short, this change would make it so that nobody could create symbolic links in /tmp and expect a privileged program to follow them. Lest one think that Kees is taking credit for this concept, he posted a bit of history for this idea, starting with a 1996 Bugtraq message from Zygo Blaxell and a kernel patch by Andrew Tridgell from the same year. This idea, in other words, has been floating around for at least 16 years, but an implementation has never found its way into the mainline kernel. Memory management changes are amazingly fast in comparison.
The reason for the resistance, of course, is that this is a change in filesystem semantics. There are concerns that it would break POSIX compliance, though Kees claims that POSIX is silent on this particular behavior. Also of concern is the possibility of breaking existing applications. Kees responds that any broken applications would be easily noticed (while those suffering from symbolic link vulnerabilities are not), and that no applications relying on the existing behavior have ever been found. There have also been disagreements over how the patch should be implemented, but those appear to have been mostly resolved.
So Kees thinks that his current patch set (a variant of one we have seen before) should be considered for merging, finally. The patches implement the symbolic link restrictions, but also add a new rule for hard links: a hard link to a file can only be created if the user owns the file or has write access to it. Once again, this change eliminates a class of attacks, but at a small cost: older versions of the "at" daemon break unless a small patch is applied. No other problems have been found, Kees says, after 1.5 years of experience with this patch in the Ubuntu kernel.
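The effect of the hard-link rule is easy to demonstrate. Under the restriction, the link() call in the hedged sketch below fails with EPERM for an unprivileged user, while a stock kernel allows it; /etc/shadow is just a convenient example of a file the user can neither write nor own, and the call also assumes both paths live on the same filesystem.

    /*
     * Demonstration of the hard-link restriction: linking to a file the
     * caller neither owns nor can write to is refused (EPERM) when the
     * restriction is active.  /etc/shadow is only an example target, and
     * source and destination must be on the same filesystem for link()
     * to work at all.
     */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        if (link("/etc/shadow", "/tmp/shadow-link") == 0) {
            printf("hard link created: restriction not active\n");
            unlink("/tmp/shadow-link");
        } else {
            perror("link");   /* EPERM with the restriction in place */
        }
        return 0;
    }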
Whether that is enough evidence to get the changes merged this time around remains to be seen. It has only been 16 years, after all, and one would not want to be too hasty about such a thing.
Meanwhile, Kees has put together a separate security-oriented patch that has run into some concerns of its own. On Linux systems, there is a sysctl knob (suid_dumpable) that controls whether a crashing setuid process generates a core dump or not. Setting it to a non-zero value allows core dumps to happen; setting it to two applies certain restrictions that are intended to make it safe. But, Kees says, that's not the case; it allows a user to create a file called core in almost any directory, containing arbitrary text (environment strings, for example). This capability is not necessarily as harmless as one might think; as the 2006 cron vulnerability shows, some programs will happily pick out the strings they understand in a file full of junk while ignoring the rest. Thus, he claims, allowing users to create files in arbitrary locations is asking for trouble.
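For reference, both knobs involved live in procfs and are easy to inspect; the sketch below just prints them. The comments follow the description above (zero disables setuid dumps, non-zero allows them, two is the nominally "safe" mode), and a core_pattern that is neither an absolute path nor a pipe is exactly the "relative path" case Kees is worried about.

    /*
     * Print the two sysctl knobs discussed above; the paths are the standard
     * procfs locations for fs.suid_dumpable and kernel.core_pattern.
     */
    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[256];
        FILE *f = fopen(path, "r");

        if (f && fgets(buf, sizeof(buf), f))
            printf("%-34s %s", path, buf);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        show("/proc/sys/fs/suid_dumpable");      /* 0, 1, or 2 */
        show("/proc/sys/kernel/core_pattern");   /* "core" = relative path */
        return 0;
    }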
His response has been through a number of iterations:
- Version 1 disallowed storing core dumps from privileged executables into a file. If the core_pattern knob is set to a pipe, instead, core dumps happen as before. This was seen as an incompatible ABI change, though, and one that would cause surprising results.
- Version 2 added a new setting (3) that would only allow setuid core dumps to a pipe. The previous "safe" setting (2) was deprecated; attempting to set it would fail with an EINVAL error. This version ran into trouble as a result of how it interacted with the sysctl mechanism.
- Version 3 fixed the sysctl difficulties but was opposed by Andrew Morton, who feared that the deprecation of the previous mode would break current systems in surprising ways. He suggested keeping suid_dumpable=2 as a working mode with a warning.
- Version 4 went back to something closer to version 1, but with some loud warnings emitted. But then Eric Biederman asked whether disallowing relative paths would be a sufficient fix.
- Thus, version 5 (the current version as of this writing) simply disallows the writing of setuid core dumps to relative paths. Should core_pattern be set to a relative path ("core", for example), a warning will be logged instead.
Thus far, there has not been much in the way of complaints about the fifth iteration of the patch. So, possibly, it will not be necessary to wait for years until this particular bit of security tightening gets into the mainline kernel. Of course, unlike the system's link behavior, the core dump behavior can be changed now by concerned system administrators—no need to wait at all.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet