Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.36-rc5, released on September 20. "Nothing really stands out. Except perhaps that Al has obviously been looking at architecture sigreturn paths, and been finding problems, and Dan Rosenberg has been finding places where we copy structures to user space without having fully initialized them yet, leaking kernel stack contents. And we had that annoying x86-64 32-bit compat syscall bug that needed fixing for the _second_ time." See the announcement for the short changelog, or the full changelog for all the details.

Stable updates: several stable updates were released on September 20; each contains a long list of important fixes.

Quotes of the week

I did some Git digging - that ptrace check for /proc/$pid/mem read/write goes all the way back to the beginning of written human history, aka Linux v2.6.12-rc2.

I researched the fragmented history of the stone ages as well, i checked out numerous cave paintings, and while much was lost, i was able to recover this old fragment of a clue in the cave called 'patch-2.3.27', carbon-dated back as far as the previous millenium (!).

-- Ingo Molnar

This series is the core mount and lookup infrastructure from union mounts, split up into small, easily digestible, bikeshed-friendly pieces. All of the (non-documentation, non-whitespace) patches in this series are less than 140 lines long. It's like Twitter for kernel patches.

VFS developers should be able to review each of these patches in 3 minutes or less. If it takes you longer, email me and I'll post a video on YouTube making fun of you.

-- Valerie Aurora

Well, after things were basically working, we started fixing all the real bugs reported by an updated Any lines that obviously should have been longer than 80 chars were joined back to their proper length, and curly braces were moved to their proper locations. Tabs were all changed to 3 spaces each, we couldn't quite decide on 2 vs 4.

After that was done, we forward ported devfs. Proper disc management code in Linux was long overdue.

-- Chris Mason

The return of the Kernel Podcast

Jon Masters has announced that he is once again producing kernel podcasts. "My intention is to get back into doing this regularly again as it helps me keep up with LKML - recently I've been having enough time to do most of the prep but not enough time to write the show and record it! I don't do this for any reason other than to force myself to keep up with LKML."

Oracle's "Unbreakable Enterprise Kernel"

Oracle has sent out a press release proclaiming the virtues of its new kernel. "Based on the combined efforts of Oracle's Linux, Database, Middleware, and Hardware engineering teams, the Unbreakable Enterprise Kernel is: Fast: More than 75 percent performance gain demonstrated in OLTP performance tests over a Red Hat Compatible Kernel; 200 percent speedup of Infiniband messaging; 137 percent faster solid state disk access." Much of the gain may come from using 2.6.32 as the base instead of 2.6.18, but it's hard to tell for sure; source for this kernel does not seem to be available for download as of this writing.

This is an interesting move in that it reintroduces competition at the kernel level - something distributors have not emphasized in recent years - and it demonstrates an intent to move a bit further away from the RHEL base that Oracle uses to build its offering.

Updating the kernel's web links

By Jake Edge
September 22, 2010

Justin P. Mattock has set out a rather large clean-up task for himself: updating all of the web links in the kernel comments. As might be guessed, many of those links have link-rotted over the last ten or more years, so Mattock is trying to update the kernel to point to the proper place—if it can be found. That effort resulted in a monster patch that covered all of the references to "http" that he could find.

Many of the new links pointed off to the Wayback Machine as the only location that Mattock was able to find, but that caused a formatting problem. When adding those, he used links of the form "*/http://oldsite/oldlink", which list all of the different versions of a page that the archive has stored. Putting "*/" into a C-language comment is not a good plan, however, as Matt Turner pointed out, since that sequence ends the comment. The proper solution is to use "%2A", as that is the percent-encoded (URL-escaped) form of "*". But there is a bigger issue with those archive links.
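The "*/" hazard is easy to reproduce: the sequence closes a C comment wherever it appears, so the rest of the URL would be parsed as code. As a minimal sketch of the escaping Turner suggested, the helper below rewrites each "*" as "%2A" before a URL is pasted into a comment. The name escape_star() and its interface are hypothetical, chosen only for this illustration; nothing like it appears in the patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy url into out, replacing every '*' with its
 * percent-encoded form "%2A" so the result cannot contain the comment
 * terminator. Returns 0 on success, or -1 if out is too small.
 */
int escape_star(const char *url, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;

    for (; *url != '\0'; url++) {
        size_t need = (*url == '*') ? 3 : 1;

        /* Keep one byte in reserve for the terminating NUL. */
        if (used + need + 1 > out_size)
            return -1;

        if (*url == '*')
            memcpy(out + used, "%2A", 3);
        else
            out[used] = *url;
        used += need;
    }

    out[used] = '\0';
    return 0;
}
```

Applied to the link form quoted above, the output begins with "%2A/" and is safe to embed in a comment.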

Finn Thain suggested that any of those links could just be left alone and that people should already know about the Wayback Machine, so adding it to the old links is just "bloat". Furthermore, there is a question of which version of the stored page is the one that the original comment referred to. Basically, Thain's point was that web pages which are maintained and updated are likely to be more useful, and that those who want to refer to pages that have dropped off the net should know (or learn) how to go about it.

Eventually Mattock split the patch into two parts, one that updated links to newer locations and the other which added the links for lost sites. He is soliciting more feedback on whether to include the archive links or not.

It is not clear, so far at least, whether these changes will be accepted. It is, in a sense, churn, and likely to lead to more churn down the road as link-rot is an endemic web problem. It is probably frustrating for developers and others to come across broken links in the kernel code, but is it worth the never-ending—hopefully fairly infrequent—stream of update patches? There are undoubtedly copyright, logistical, and other issues, but it would certainly be a lot nicer if these documents could be permanently stored in some location at

Kernel development news

BKL-free in 2.6.37 (maybe)

By Jonathan Corbet
September 20, 2010
The removal of the big kernel lock has been an ongoing, multi-year effort which has been reported on here a few times. The BKL has some strange and unique properties which make its removal from various kernel subsystems trickier than one might think it should be. But, thanks to a great deal of work by Arnd Bergmann, we might just be approaching a point where the 2.6.37 kernel can be built BKL-free for many or most users. There is, however, one significant obstacle which still must be overcome.

Arnd currently has a vast array of patches in the linux-next tree. Many of them are the result of the tedious (but tricky) work of looking at specific subsystems, determining what kind of locking they really need to have, then substituting lock_kernel() calls with something more local. In many cases, the BKL locking can simply be removed, as the code turns out not to need it. A big focus for 2.6.37 has been the removal of the BKL from a number of filesystems - a task which has required digging into some fairly old code. The Amiga FFS, for example, cannot have received much maintenance in recent times, and seems unlikely to have a lot of users.

The most wide-ranging patch for 2.6.37 has to do with the llseek() function, found in struct file_operations. This function allows a filesystem or driver to implement the lseek() system call, changing a file descriptor's position within the file. Unlike most file_operations functions, there is a default implementation for llseek() which simply changes the kernel's idea of the descriptor's position without notifying the underlying code at all. That change, naturally, was done with the BKL held. This implicit default llseek() implementation will have made life easier for a handful of developers, but it makes BKL removal hard: an implementation change could affect any code with a file_operations structure, not just modules which actually implement the llseek() operation.

To make things harder, a great many of these implicit llseek() implementations are not really needed or useful - most device drivers do not implement any concept of a "file position" and pay no attention to whatever the kernel thinks the position might be. In such situations, it is tempting to change the code to an explicit "no seeking allowed" implementation which reflects what is really going on. The problem here is that some user-space application somewhere might be calling lseek() on the device, and they might get upset if those calls started failing with ESPIPE errors. In other words, a successful-but-ignored lseek() call might just be part of the user-space ABI for a specific device. So something more careful has to be done.
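For readers who have not run into ESPIPE, the failure mode is easy to observe from user space, because lseek() on a pipe already behaves this way. The sketch below is an illustration only: the function name lseek_errno_on_pipe() is made up for this example, and a pipe stands in for a device whose driver refuses seeks.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to seek on the read end of a fresh pipe and report what happened:
 * the errno value from a failed lseek(), 0 if the seek unexpectedly
 * succeeded, or -1 if the pipe could not be created.
 */
int lseek_errno_on_pipe(void)
{
    int fds[2];
    int result = 0;

    if (pipe(fds) != 0)
        return -1;

    /* Pipes have no file position, so this call is expected to fail. */
    if (lseek(fds[0], 0, SEEK_SET) == (off_t)-1)
        result = errno;

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

An application that checks the return value of lseek() on a device node would start seeing this same error if the driver switched from the silent default to an explicit "no seeking" implementation.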

The first step was to go through the kernel and add an explicit llseek() operation to every file_operations structure which did not already have one - a patch affecting 343 files. This work was done primarily with a frightening Coccinelle semantic patch (it was included in the patch changelog) which attempts to determine whether the code in question actually uses the file position or not. If the file position is used, default_llseek(), which implements the old default behavior, becomes the explicit default; otherwise noop_llseek(), which succeeds but does nothing, is used. After that work was done, Arnd was able to verify that none of the users of default_llseek() (there are 191 of them) needs the BKL. So the removal of the BKL from llseek() can be made complete.

The patch also changes how llseek() is handled in the core kernel. Starting with 2.6.37, assuming this work is merged (a good bet), any code which fails to provide an llseek() operation will default to no_llseek(), which returns ESPIPE. Any out-of-tree code which depends on the old default will thus not work properly with 2.6.37 until it is updated.
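The difference between the three behaviors can be sketched with a small user-space model. This is not kernel code: the model_ types and functions below are simplified stand-ins invented for illustration, they handle only absolute offsets, and they ignore locking entirely. What the model does show is the dispatch rule described above, where a missing llseek() operation falls through to the "no seeking" case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_file;

/* Stand-in for struct file_operations; only the llseek slot is modeled. */
struct model_file_operations {
    long long (*llseek)(struct model_file *file, long long offset);
};

/* Stand-in for struct file: a position plus its operations table. */
struct model_file {
    long long pos;
    const struct model_file_operations *ops;
};

/* Models default_llseek(): update the kernel's idea of the position. */
long long model_default_llseek(struct model_file *file, long long offset)
{
    file->pos = offset;
    return file->pos;
}

/* Models noop_llseek(): succeed, but leave the position untouched. */
long long model_noop_llseek(struct model_file *file, long long offset)
{
    (void)offset;
    return file->pos;
}

/* Models no_llseek(): seeking is not allowed. */
long long model_no_llseek(struct model_file *file, long long offset)
{
    (void)file;
    (void)offset;
    return -ESPIPE;
}

/* Dispatch as planned for 2.6.37: no llseek() operation means no seeking. */
long long model_llseek(struct model_file *file, long long offset)
{
    assert(file != NULL && file->ops != NULL);
    if (file->ops->llseek == NULL)
        return model_no_llseek(file, offset);
    return file->ops->llseek(file, offset);
}
```

In this model, out-of-tree code that leaves the llseek slot empty gets -ESPIPE from every seek, which is the behavior change such code would need to be updated for.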

Even after all of this work, there are still a lot of lock_kernel() calls in the mainline. Almost all of them, though, are in old, obscure code which is not relevant to a lot of users. In some cases, the remaining BKL-using code might be shifted over to the staging tree and eventually removed entirely if it is not fixed up. In other cases, an effort will be made to eradicate the BKL; it can still be found in occasionally-useful code like the Appletalk and ncpfs implementations. There are also a lot of Video4Linux2 drivers which still use the BKL; how those drivers will be fixed is the subject of an ongoing discussion in the V4L2 community.

The biggest impediment to a BKL-free 2.6.37, though, may well be the POSIX locking code. File locks are represented internally with a file_lock structure; those structures are passed around to a few places and, of course, protected with the BKL. Patches exist to protect those structures with a spinlock within the core kernel. The main sticking point appears to be the NFS lockd daemon, which uses file_lock structures and which, thus, requires the BKL; somebody is said to be working on fixing this code, but no patches have been posted yet. Until lockd has been converted, file locking as a whole requires the BKL. And, since it's a rare kernel that does not have file locking enabled, that will drag the BKL into almost all real-world kernel builds.

Even after that fix is in place, distributor kernels are likely to need the BKL for a bit longer. As long as there is even one module they ship which requires the BKL, the support for it needs to be there, even if most users will not have that module loaded. People who build their own kernels, though, should often be able to put together a configuration which does not need the BKL. If all goes well, 2.6.37 will have a configuration option which makes BKL-free builds possible. That's a huge step forward, even if the BKL still exists in most stock kernels.

The hazards of 32/64-bit compatibility

By Jake Edge
September 22, 2010

A kernel bug that was found—and fixed—in 2007 has recently reared its head again. Unfortunately, the bug was reintroduced in 2008, leaving a rather large pile of kernel versions that are vulnerable to a local privilege escalation on x86_64 systems. Though perhaps difficult to do, it would seem that some kind of regression testing suite for the kernel might be able to detect these kinds of problems before they get released to the world.

There are two semi-related bugs that are both currently floating around, which is causing a bit of confusion. One was originally CVE-2007-4573, and was reintroduced in a cleanup patch in June 2008. The reintroduced vulnerability has been tagged as CVE-2010-3301 (though the CVE entry is simply reserved at the time of this writing). Ben Hawkes found a somewhat similar vulnerability—also exploiting system calls from 32-bit binaries on 64-bit x86 systems—which led him to the discovery of the reintroduction of CVE-2007-4573.

There are numerous pitfalls when trying to handle 32-bit binaries making system calls on 64-bit systems. Linux has a set of functions to handle the differences in arguments and calling conventions between 32 and 64-bit system calls, but it has always been tricky to get right. What we are seeing today are two instances where it wasn't done correctly, and the consequences of that can be dire.

The 2007 problem stemmed from a mismatch between the use of the %eax 32-bit register to store the system call number (which is used as an index into the syscall table) and the use of the %rax 64-bit register (which contains %eax as its low 32 bits) to do the indexing. In the "normal" system call path, %eax was zero-extended before the 32-bit system call number from user space was stored, but there was a second path into that code where the upper 32 bits in %rax were not cleared.

The ptrace() system call can stop a traced process as it enters a system call (using the PTRACE_SYSCALL request type) and also gives the tracer the ability to set that process's register values. An attacker could set the upper 32 bits of %rax to a value of their choosing, make a system call with a seemingly valid index (in %eax) and end up indexing somewhere outside of the syscall table. By arranging to have exploit code at the designated location, the attacker can get the kernel to run that code.

The ptrace() path was fixed by Andi Kleen in September 2007 by ensuring that %eax (and other registers) were zero-extended. But the zero-extension of %eax was removed in Roland McGrath's cleanup patch in June 2008. When Hawkes and Robert Swiecki recently noticed the problem, they had little difficulty in modifying an exploit from 2007 to get a root shell on recent kernels.

CVE-2010-3301 was resolved by a pair of patches. McGrath put the zero-extension of the %eax register back into the ptrace path, while H. Peter Anvin made the validity test of the system call number look at the entire %rax register. Either would be sufficient to close the current hole, but Anvin's patch will prevent any new paths into the system call entry code from running afoul of this problem in the future.
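The mismatch, and the two ways of closing it, can be illustrated in plain C. The real code is x86-64 assembly in the system call entry path; the functions and the table size below are hypothetical stand-ins that model only the arithmetic, with a 64-bit integer playing the role of %rax and its low 32 bits playing the role of %eax.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical table size, chosen only for this illustration. */
#define MODEL_NR_SYSCALLS 4u

/*
 * Flawed validation: only the low 32 bits (%eax) are compared against
 * the table size, although the full 64-bit value is used as the index.
 */
bool model_check_low32(uint64_t rax)
{
    return (uint32_t)rax < MODEL_NR_SYSCALLS;
}

/* Validation in the style of Anvin's fix: test the entire register. */
bool model_check_full64(uint64_t rax)
{
    return rax < MODEL_NR_SYSCALLS;
}

/* Fix in the style of McGrath's patch: clear the upper 32 bits first. */
uint64_t model_zero_extend(uint64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}
```

A value such as 0x0000000100000001 passes the 32-bit test but would index far beyond a four-entry table. The full-width test rejects it, while zero-extension reduces it to 1, which is a valid index.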

The fact that the old exploit was useful implies that someone could have written a test case in 2007 that might have detected the reintroduction of the problem. A suite of such regression tests, run regularly against the mainline, would be quite useful as a way to reduce regressions, both for normal bugs as well as for security holes. Not all kernel bugs will be amenable to that kind of testing, but, for those that are, it seems like an idea worth pursuing.

The other problem that Hawkes found (CVE-2010-3081, also just reserved) is that the compat_alloc_user_space() function did not check to see that the pointer which is being returned is actually a valid user-space pointer. That routine is used to allocate some stack space for massaging 32-bit data into its 64-bit equivalent before making a system call. Hawkes found two places (and believes there are others) where the lack of an access_ok() call in that path could be exploited to allow attackers to write to kernel memory.

One of those was in a video4linux ioctl(), but the more easily exploited spot was in the IP multicast getsockopt() call. It uses a 32-bit unsigned length parameter provided by user space that can be used to confuse compat_alloc_user_space() into returning a pointer into kernel memory. The compat_mc_getsockopt() call then writes user-supplied values using those pointers. That can be fairly easily turned into an exploit as Hawkes noted:

This path allows an attacker to write a chosen value to anywhere within the top 31 bits of the kernel address space. In practice, this seems to be more than enough for exploitation. My proof of concept overwrote the interrupt descriptor table, but it's likely there are other good options too.
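The underlying arithmetic problem can be sketched in user-space C. This is a simplified model rather than the kernel's implementation: the model_ functions, the address limit, and the example addresses are all assumptions made for illustration, and the real routine works on the saved user stack pointer with details that differ by architecture. The point is only that subtracting an attacker-influenced length from a user address can wrap around into the kernel's half of the address space unless the result is range-checked, as an access_ok() call would do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of user space for this model. */
#define MODEL_USER_LIMIT 0x00007ffffffff000ULL

/* Models an access_ok()-style test: is [addr, addr + len) all user space? */
bool model_access_ok(uint64_t addr, uint64_t len)
{
    return len <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - len;
}

/* Unchecked allocation: carve len bytes out below the user stack pointer. */
uint64_t model_alloc_unchecked(uint64_t user_sp, uint64_t len)
{
    return user_sp - len; /* wraps around if len exceeds user_sp */
}

/* Checked allocation: return 0 when the result is not a user-space range. */
uint64_t model_alloc_checked(uint64_t user_sp, uint64_t len)
{
    uint64_t ptr = user_sp - len;

    return model_access_ok(ptr, len) ? ptr : 0;
}
```

With a stack pointer of 0x7ffff000 and a length of 0x80000000, the unchecked version returns 0xfffffffffffff000, an address well outside user space in this model; the checked version returns 0 for the same inputs and a normal user address for a small length.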

Anvin patched compat_alloc_user_space() so that it always does the access_ok() check. That should take care of the two problem spots that Hawkes found as well as any others that are lurking out there. But there have been a whole lot of kernels released with one or both of these bugs, and there have been other bugs associated with 64-bit/32-bit compatibility. It is a part of the kernel that Hawkes calls "a little bit scary":

Not just because it's an increased attack surface versus having purely 32-bit or purely 64-bit modes, but because of the type of input processing that has to be performed by any such compatibility layer. It invariably involves a significant amount of subtle bit wrangling between 32/64-bit values, using primitives that I'd argue most programmers aren't normally exposed to. The possibility of misuse and abuse is very real.

Perhaps 32-bit compatibility for x86_64 kernels would be a good starting point for regression testing. Some enterprise distributions were not affected by CVE-2010-3301 because of the ancient kernels (like RHEL's 2.6.18) they are based on, but CVE-2010-3081 was backported into RHEL 5, which required that kernel to be updated. The interests of distribution vendors would be well-served by better—any—regression testing so a project of that sort would be quite welcome. The vendors may already be running some tests internally, but regression testing is just the kind of project that would benefit from some cross-distribution collaboration.

It should also be noted that a posting to the full-disclosure mailing list claims that the vulnerability in compat_mc_getsockopt() has been known for nearly two-and-a-half years by black (or at least gray) hats. According to the post, it was noticed when the vulnerability was introduced in April 2008. Certainly there are some that are following the commit-stream to try to find these kinds of vulnerabilities; it would be good if the kernel had a team of white hats doing the same.

Broadcom firmware and regulatory compliance

By Jonathan Corbet
September 22, 2010
Broadcom's recently-announced open source wireless networking driver was a major step forward for a company which has, until now, not been forthcoming when it comes to free support for its wireless products. That driver includes the obligatory firmware blob which has been licensed for free distribution by the company; it is now found in the kernel firmware repository. Broadcom has not freed the firmware for its older drivers, though, leading to discussions on the intersection between kernel development and regulatory compliance.

The lack of freely-distributable firmware for older Broadcom products makes life a bit more difficult for users, who must obtain the firmware separately. When the new firmware was made distributable, David Woodhouse asked the company about the older firmware as well, only to be told that it would not be made distributable. As he explained it, Broadcom is afraid that allowing the distribution of that firmware could lead to trouble:

They seem to think that they could be prosecuted even for *enabling* people to use the open source b43 driver, because you have the possibility of hacking that driver not to conform to the regulatory requirements.

The reason why the old firmware is different is simple: the newer firmware, which can only run on newer hardware, has regulatory compliance built into it. The older firmware, instead, depends on the driver in the kernel to ensure that it is not configured to operate in a non-compliant manner.

David is not known for graceful suffering in the presence of (people he sees as) fools. His response was a patch which "credits" Broadcom for enabling the development of the reverse-engineered b43 driver; this "enablement" is said to have come through the provision of binary-only drivers which could be reverse engineered. His goal in writing this patch was described as:

Everything we do in the b43 and b43legacy drivers is enabled by Broadcom's original binary-only drivers. So let's make that 'enablement' by Broadcom's binary drivers clear in our source code -- in the hope that it'll narrow the 'risk gap' that they falsely perceive between open and closed source drivers.

Or failing that, in the hope that it'll give their crack-addled lawyers aneurysms, and they'll hire some saner ones to replace them.

He also expressed a wish that the b43 developers would release more information - obtained from the binary-only drivers - on how to patch those binary drivers to get around various regulatory restrictions. Once again, he feels that this kind of information would help to make it clear that free drivers do not make it any easier to operate the hardware in an illegal manner.

David's position plays well with developers who have no patience for obstacles created by lawyers. There is also a vocal contingent out there which says that Linux has no business telling users how they should use their hardware in any case; if the user wants to configure the hardware in a non-compliant manner, that's the user's problem. In some cases, that user may well have a license which makes it entirely legal to run the hardware outside of the parameters which normally apply to off-the-shelf wireless networking equipment. So regulatory compliance naturally irritates developers who think that the kernel has no business getting in anybody's way in this regard.

Luis Rodriguez, on the other hand, is a strong supporter of regulatory compliance in the Linux kernel; he stepped into the discussion to remind people of the kernel's regulatory statement and to say that there was no real interest in encouraging the violation of spectrum-use regulations with any driver. He added:

The reason why current legislation doesn't seem to make sense is because it does not, but just because a law doesn't make sense it does not enable vendors to ignore it. So the best you can do in the meantime is really be proactive by working on real technical solutions.

We are not dealing with legal issues on Linux, we are dealing with engineering solutions, and trust me, we're light years ahead of other OSes because of this now.

His point is that the kernel's "engineering solution" to the regulatory problem has made it possible for wireless vendors to dip their toes into the open-source water. That, in turn, has helped to move Linux from having poor wireless support to, arguably, having the best support over the course of a few years. It is hard to argue with the success which the wireless developers have had recently; any moves which might endanger that success should be considered carefully, to say the least.

Of course, it would be nicer to do without the proprietary firmware blob altogether. In early 2009, the openfwwf project announced the availability of an open source firmware implementation for Broadcom adapters. Since then, news from that project has been relatively scarce. On September 21, though, Michael Büsch announced the availability of a toolchain for working with the b43 firmware. Using the disassembler and assembler, it is possible to decode the device firmware, make changes, then build a new firmware load. Naturally, one can also build a new firmware implementation from the beginning. With these tools available, we might just get to a point where we can have device firmware without distribution restrictions, and which adds features and flexibility to the device as well.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds