Brief items
released on September 20. "Nothing really stands out. Except perhaps that Al has obviously been looking at architecture sigreturn paths, and been finding problems, and Dan Rosenberg has been finding places where we copy structures to user space without having fully initialized them yet, leaking kernel stack contents. And we had that annoying x86-64 32-bit compat syscall bug that needed fixing for the _second_ time." See the announcement for the short changelog, or the full changelog for all the details.
I researched the fragmented history of the stone ages as well, i checked out numerous cave paintings, and while much was lost, i was able to recover this old fragment of a clue in the cave called 'patch-2.3.27', carbon-dated back as far as the previous millenium (!).
VFS developers should be able to review each of these patches in 3 minutes or less. If it takes you longer, email me and I'll post a video on YouTube making fun of you.
After that was done, we forward ported devfs. Proper disc management code in Linux was long overdue.
This is an interesting move in that it reintroduces competition at the kernel level - something distributors have not emphasized in recent years - and it demonstrates an intent to move a bit further away from the RHEL base that Oracle uses to build its offering.
Justin P. Mattock has set out a rather large clean-up task for himself: updating all of the web links in the kernel comments. As might be guessed, many of those links have link-rotted over the last ten or more years, so Mattock is trying to update the kernel to point to the proper place—if it can be found. That effort resulted in a monster patch that covered all of the references to "http" that he could find.
Many of the new links pointed off to archive.org as the only location that Mattock was able to find, but that caused a formatting problem. When adding those, he used links like:
http://web.archive.org/web/*/http://oldsite/oldlink
which shows all of the different versions of pages that the Wayback Machine has stored. Putting "*/" into a C-language comment is not a good plan, however, as Matt Turner pointed out: that sequence ends the comment. The proper solution is to use "%2A", as that is the URL (percent) encoding of "*". But there is a bigger issue with those archive links.
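As a rough illustration of the escaping involved, the sketch below rewrites a Wayback Machine URL so that it no longer contains a literal "*". The helper name escape_star() and its signature are made up for this example; nothing like it appears in Mattock's patch, which simply changed the text of the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, replacing every '*' with "%2A" so that the result
 * can sit inside a C block comment without closing it early.
 * Returns 0 on success, -1 if dst (dst_size bytes) is too small.
 * Illustrative helper only; not part of any kernel patch.
 */
int escape_star(const char *src, char *dst, size_t dst_size)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        if (*src == '*') {
            /* Need three bytes now plus one for the terminator. */
            if (out + 3 >= dst_size)
                return -1;
            memcpy(dst + out, "%2A", 3);
            out += 3;
        } else {
            /* Need one byte now plus one for the terminator. */
            if (out + 1 >= dst_size)
                return -1;
            dst[out++] = *src;
        }
    }
    if (out >= dst_size)
        return -1;
    dst[out] = '\0';
    return 0;
}
```

Given the example link above, the output contains "/web/%2A/http://oldsite/oldlink", which is safe to paste into a comment.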
Finn Thain suggested that any of those links could just be left alone and that people should already know about archive.org, so adding it to the old links is just "bloat". Furthermore, there is a question of which version of the stored page is the one that the original comment referred to. Basically, Thain's point was that web pages which are maintained and updated are likely to be more useful, and that those who want to refer to pages that have dropped off the net should know (or learn) how to go about it.
Eventually Mattock split the patch into two parts, one that updated links to newer locations and the other which added the archive.org links for lost sites. He is soliciting more feedback on whether to include the archive links or not.
It is not clear, so far at least, whether these changes will be accepted. It is, in a sense, churn, and likely to lead to more churn down the road as link-rot is an endemic web problem. It is probably frustrating for developers and others to come across broken links in the kernel code, but is it worth the never-ending—hopefully fairly infrequent—stream of update patches? There are undoubtedly copyright, logistical, and other issues, but it would certainly be a lot nicer if these documents could be permanently stored in some location at kernel.org.
Kernel development news
Arnd currently has a vast array of patches in the linux-next tree. Many of them are the result of the tedious (but tricky) work of looking at specific subsystems, determining what kind of locking they really need to have, then substituting lock_kernel() calls with something more local. In many cases, the BKL locking can simply be removed, as the code turns out not to need it. A big focus for 2.6.37 has been the removal of the BKL from a number of filesystems - a task which has required digging into some fairly old code. The Amiga FFS, for example, cannot have received much maintenance in recent times, and seems unlikely to have a lot of users.
The most wide-ranging patch for 2.6.37 has to do with the llseek() function, found in struct file_operations. This function allows a filesystem or driver to implement the lseek() system call, changing a file descriptor's position within the file. Unlike most file_operations functions, there is a default implementation for llseek() which simply changes the kernel's idea of the descriptor's position without notifying the underlying code at all. That change, naturally, was done with the BKL held. This implicit default llseek() implementation will have made life easier for a handful of developers, but it makes BKL removal hard: an implementation change could affect any code with a file_operations structure, not just modules which actually implement the llseek() operation.
To make things harder, a great many of these implicit llseek() implementations are not really needed or useful - most device drivers do not implement any concept of a "file position" and pay no attention to whatever the kernel thinks the position might be. In such situations, it is tempting to change the code to an explicit "no seeking allowed" implementation which reflects what is really going on. The problem here is that some user-space application somewhere might be calling lseek() on the device, and they might get upset if those calls started failing with ESPIPE errors. In other words, a successful-but-ignored lseek() call might just be part of the user-space ABI for a specific device. So something more careful has to be done.
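For readers who have not run into ESPIPE before, the user-space sketch below shows what an application sees when it calls lseek() on a descriptor that does not support seeking. It uses a pipe, where POSIX specifies that lseek() fails with ESPIPE; a device driver switched to a "no seeking allowed" implementation would fail in the same way. The function names are invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to seek on the read end of a freshly created pipe and return the
 * errno that lseek() produced: 0 if the call unexpectedly succeeded,
 * -1 if the pipe itself could not be created.
 */
int errno_from_lseek_on_pipe(void)
{
    int fds[2];
    int saved = 0;

    if (pipe(fds) != 0)
        return -1;

    errno = 0;
    if (lseek(fds[0], 0, SEEK_SET) == (off_t)-1)
        saved = errno;

    close(fds[0]);
    close(fds[1]);
    return saved;
}

/* An application that did not expect the failure would trip here. */
void check_pipe_is_unseekable(void)
{
    assert(errno_from_lseek_on_pipe() == ESPIPE);
}
```

An application that treats any lseek() failure as fatal would break if a driver it uses began returning this error, which is the ABI concern described above.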
The first step was to go through the kernel and add an explicit llseek() operation to every file_operations structure which did not already have one - a patch affecting 343 files. This work was done primarily with a frightening Coccinelle semantic patch (it was included in the patch changelog) which attempts to determine whether the code in question actually uses the file position or not. If the file position is used, default_llseek(), which implements the old default behavior, becomes the explicit default; otherwise noop_llseek(), which succeeds but does nothing, is used. After that work was done, Arnd was able to verify that none of the users of default_llseek() (there are 191 of them) needs the BKL. So the removal of the BKL from llseek() can be made complete.
The patch also changes how llseek() is handled in the core kernel. Starting with 2.6.37, assuming this work is merged (a good bet), any code which fails to provide an llseek() operation will default to no_llseek(), which returns ESPIPE. Any out-of-tree code which depends on the old default will thus not work properly with 2.6.37 until it is updated.
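The three behaviors discussed above can be modeled in a few lines of user-space C. This is only a sketch: the sim_ structures and functions are simplified stand-ins invented for the example (the real operations take a struct file, a loff_t offset, and an origin argument), and only absolute positioning is modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's struct file and file_operations. */
struct sim_file;

struct sim_file_operations {
    long (*llseek)(struct sim_file *file, long offset);
};

struct sim_file {
    long pos;
    const struct sim_file_operations *ops;
};

/* Models default_llseek(): just update the recorded position. */
long sim_default_llseek(struct sim_file *file, long offset)
{
    file->pos = offset;
    return file->pos;
}

/* Models noop_llseek(): report success but leave the position alone. */
long sim_noop_llseek(struct sim_file *file, long offset)
{
    (void)offset;
    return file->pos;
}

/* Models no_llseek(): seeking is not allowed. */
long sim_no_llseek(struct sim_file *file, long offset)
{
    (void)file;
    (void)offset;
    return -ESPIPE;
}

/*
 * Models the core dispatch as the article describes it for 2.6.37:
 * a missing llseek() operation now means "no seeking" rather than
 * the old implicit default.
 */
long sim_vfs_llseek(struct sim_file *file, long offset)
{
    long (*fn)(struct sim_file *, long) = sim_no_llseek;

    assert(file != NULL && file->ops != NULL);
    if (file->ops->llseek != NULL)
        fn = file->ops->llseek;
    return fn(file, offset);
}
```

In this model, out-of-tree code that leaves the llseek member unset gets -ESPIPE from sim_vfs_llseek(), while code that names sim_default_llseek() or sim_noop_llseek() explicitly keeps its earlier behavior.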
Even after all of this work, there are still a lot of lock_kernel() calls in the mainline. Almost all of them, though, are in old, obscure code which is not relevant to a lot of users. In some cases, the remaining BKL-using code might be shifted over to the staging tree and eventually removed entirely if it is not fixed up. In other cases, an effort will be made to eradicate the BKL; it can still be found in occasionally-useful code like the Appletalk and ncpfs implementations. There are also a lot of Video4Linux2 drivers which still use the BKL; how those drivers will be fixed is the subject of an ongoing discussion in the V4L2 community.
The biggest impediment to a BKL-free 2.6.37, though, may well be the POSIX locking code. File locks are represented internally with a file_lock structure; those structures are passed around to a few places and, of course, protected with the BKL. Patches exist to protect those structures with a spinlock within the core kernel. The main sticking point appears to be the NFS lockd daemon, which uses file_lock structures and which, thus, requires the BKL; somebody is said to be working on fixing this code, but no patches have been posted yet. Until lockd has been converted, file locking as a whole requires the BKL. And, since it's a rare kernel that does not have file locking enabled, that will drag the BKL into almost all real-world kernel builds.
Even after that fix is in place, distributor kernels are likely to need the BKL for a bit longer. As long as there is even one module they ship which requires the BKL, the support for it needs to be there, even if most users will not have that module loaded. People who build their own kernels, though, should often be able to put together a configuration which does not need the BKL. If all goes well, 2.6.37 will have a configuration option which makes BKL-free builds possible. That's a huge step forward, even if the BKL still exists in most stock kernels.
A kernel bug that was found—and fixed—in 2007 has recently reared its head again. Unfortunately, the bug was reintroduced in 2008, leaving a rather large pile of kernel versions that are vulnerable to a local privilege escalation on x86_64 systems. Though perhaps difficult to do, it would seem that some kind of regression testing suite for the kernel might be able to detect these kinds of problems before they get released to the world.
There are two semi-related bugs that are both currently floating around, which is causing a bit of confusion. One was originally CVE-2007-4573, and was reintroduced in a cleanup patch in June 2008. The reintroduced vulnerability has been tagged as CVE-2010-3301 (though the CVE entry is simply reserved at the time of this writing). Ben Hawkes found a somewhat similar vulnerability—also exploiting system calls from 32-bit binaries on 64-bit x86 systems—which led him to the discovery of the reintroduction of CVE-2007-4573.
There are numerous pitfalls when trying to handle 32-bit binaries making system calls on 64-bit systems. Linux has a set of functions to handle the differences in arguments and calling conventions between 32 and 64-bit system calls, but it has always been tricky to get right. What we are seeing today are two instances where it wasn't done correctly, and the consequences of that can be dire.
The 2007 problem stemmed from a mismatch between the use of the %eax 32-bit register to store the system call number (which is used as an index into the syscall table) and the use of the %rax 64-bit register (which contains %eax as its low 32 bits) to do the indexing. In the "normal" system call path, %eax was zero-extended before the 32-bit system call number from user space was stored, but there was a second path into that code where the upper 32 bits in %rax were not cleared.
The ptrace() system call has the facility to make other system calls (using the PTRACE_SYSCALL request type) and also gives a user the ability to set register values. An attacker could set the upper 32 bits of %rax to a value of their choosing, make a system call with a seemingly valid index (in %eax) and end up indexing somewhere outside of the syscall table. By arranging to have exploit code at the designated location, the attacker can get the kernel to run his code.
The ptrace() path was fixed by Andi Kleen in September 2007 by ensuring that %eax (and other registers) were zero-extended. But the zero-extension of %eax was removed by Roland McGrath's cleanup patch in June 2008. When Hawkes and Robert Swiecki recently noticed the problem, they had little difficulty in modifying an exploit from 2007 to get a root shell on recent kernels.
CVE-2010-3301 was resolved by a pair of patches. McGrath put the zero-extension of the %eax register back into the ptrace path, while H. Peter Anvin made the validity test of the system call number look at the entire %rax register. Either would be sufficient to close the current hole, but Anvin's patch will prevent any new paths into the system call entry code from running afoul of this problem in the future.
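The mismatch, and the two ways of closing it, can be sketched in ordinary C with 64-bit integers standing in for the registers. The table size and all of the sim_ names are hypothetical; the real code is x86 assembly in the system call entry path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_NR_SYSCALLS 300u   /* hypothetical syscall table size */

/* Flawed check: validates only the low 32 bits (the %eax view). */
bool sim_check_low32(uint64_t rax)
{
    return (uint32_t)rax < SIM_NR_SYSCALLS;
}

/* Check in the style of Anvin's fix: validate the whole 64-bit value. */
bool sim_check_full64(uint64_t rax)
{
    return rax < SIM_NR_SYSCALLS;
}

/* Step in the style of McGrath's fix: clear the upper 32 bits first. */
uint64_t sim_zero_extend(uint64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}

/* Small self-check tying the three helpers together. */
void sim_selfcheck(void)
{
    /* Upper half set by the attacker, low half a valid-looking number. */
    uint64_t crafted = ((uint64_t)1 << 32) | 5u;

    assert(sim_check_low32(crafted));      /* flawed check is fooled   */
    assert(!sim_check_full64(crafted));    /* full-width check rejects */
    assert(sim_check_full64(sim_zero_extend(crafted)));
}
```

Since the table is indexed with the full 64-bit value, a crafted value that passes only the low-32-bit check lands far outside the table. A regression test along these lines could only model the logic; testing the actual entry code would need a 32-bit test program running under ptrace() on a 64-bit kernel.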
The fact that the old exploit was useful implies that someone could have written a test case in 2007 that might have detected the reintroduction of the problem. A suite of such regression tests, run regularly against the mainline, would be quite useful as a way to reduce regressions, both for normal bugs as well as for security holes. Not all kernel bugs will be amenable to that kind of testing, but, for those that are, it seems like an idea worth pursuing.
The other problem that Hawkes found (CVE-2010-3081, also just reserved) is that the compat_alloc_user_space() function did not check to see that the pointer which is being returned is actually a valid user-space pointer. That routine is used to allocate some stack space for massaging 32-bit data into its 64-bit equivalent before making a system call. Hawkes found two places (and believes there are others) where the lack of an access_ok() call in that path could be exploited to allow attackers to write to kernel memory.
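A simplified model shows why the missing range check matters. The sketch below treats addresses as plain 64-bit integers, uses a made-up user-space limit, and invents the sim_ names; it is meant to convey the idea of an access_ok()-style check, not to reproduce the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound of user-space addresses in this model. */
#define SIM_USER_LIMIT UINT64_C(0x0000800000000000)

/*
 * Range check in the spirit of access_ok(): the whole
 * [addr, addr + len) range must lie below the user-space limit.
 * Written so that the arithmetic itself cannot overflow.
 */
bool sim_access_ok(uint64_t addr, uint64_t len)
{
    return len <= SIM_USER_LIMIT && addr <= SIM_USER_LIMIT - len;
}

/*
 * Unchecked allocation: carve len bytes below the user stack pointer.
 * If len is larger than user_sp, the subtraction wraps around and the
 * result points high into the address space.
 */
uint64_t sim_alloc_unchecked(uint64_t user_sp, uint64_t len)
{
    return user_sp - len;
}

/* Checked allocation: returns 0 when the result is not a user range. */
uint64_t sim_alloc_checked(uint64_t user_sp, uint64_t len)
{
    uint64_t ptr = user_sp - len;

    if (!sim_access_ok(ptr, len))
        return 0;
    assert(ptr <= SIM_USER_LIMIT);
    return ptr;
}
```

With a stack pointer of 0xffffd000 and a length of 0x100000000, the unchecked version returns 0xffffffffffffd000, which is well above the model's user-space limit; the checked version refuses the request.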
One of the two places Hawkes found was in a video4linux ioctl(), but the more easily exploited spot was in the IP multicast getsockopt() call. It uses a 32-bit unsigned length parameter provided by user space that can be used to confuse compat_alloc_user_space() into returning a pointer into kernel memory. The compat_mc_getsockopt() call then writes user-supplied values using those pointers. That can be fairly easily turned into an exploit as Hawkes noted:
Anvin patched compat_alloc_user_space() so that it always does the access_ok() check. That should take care of the two problem spots that Hawkes found as well as any others that are lurking out there. But there have been a whole lot of kernels released with one or both of these bugs, and there have been other bugs associated with 64-bit/32-bit compatibility. It is a part of the kernel that Hawkes calls "a little bit scary":
Perhaps 32-bit compatibility for x86_64 kernels would be a good starting point for regression testing. Some enterprise distributions were not affected by CVE-2010-3301 because of the ancient kernels (like RHEL's 2.6.18) they are based on, but the code vulnerable to CVE-2010-3081 was backported into RHEL 5, which required that kernel to be updated. The interests of distribution vendors would be well-served by better (or any) regression testing, so a project of that sort would be quite welcome. The vendors may already be running some tests internally, but regression testing is just the kind of project that would benefit from some cross-distribution collaboration.
It should also be noted that a posting to the full-disclosure mailing list claims that the vulnerability in compat_mc_getsockopt() has been known for nearly two-and-a-half years by black (or at least gray) hats. According to the post, it was noticed when the vulnerability was introduced in April 2008. Certainly there are some that are following the commit-stream to try to find these kinds of vulnerabilities; it would be good if the kernel had a team of white hats doing the same.
The lack of freely-distributable firmware for older Broadcom products makes life a bit more difficult for users, who must obtain the firmware separately. When the new firmware was made distributable, David Woodhouse asked the company about the older firmware as well, only to be told that it would not be made distributable. As he explained it, Broadcom is afraid that allowing the distribution of that firmware could lead to trouble:
The reason why the old firmware is different is simple: the newer firmware, which can only run on newer hardware, has regulatory compliance built into it. The older firmware, instead, depends on the driver in the kernel to ensure that it is not configured to operate in a non-compliant manner.
David is not known for graceful suffering in the presence of (people he sees as) fools. His response was a patch which "credits" Broadcom for enabling the development of the reverse-engineered b43 driver; this "enablement" is said to have come through the provision of binary-only drivers which could be reverse engineered. His goal in writing this patch was described as:
Or failing that, in the hope that it'll give their crack-addled lawyers aneurysms, and they'll hire some saner ones to replace them.
He also expressed a wish that the b43 developers would release more information - obtained from the binary-only drivers - on how to patch those binary drivers to get around various regulatory restrictions. Once again, he feels that this kind of information would help to make it clear that free drivers do not make it any easier to operate the hardware in an illegal manner.
David's position plays well with developers who have no patience for obstacles created by lawyers. There is also a vocal contingent out there which says that Linux has no business telling users how they should use their hardware in any case; if the user wants to configure the hardware in a non-compliant manner, that's the user's problem. In some cases, that user may well have a license which makes it entirely legal to run the hardware outside of the parameters which normally apply to off-the-shelf wireless networking equipment. So regulatory compliance naturally irritates developers who think that the kernel has no business getting in anybody's way in this regard.
Luis Rodriguez, on the other hand, is a strong supporter of regulatory compliance in the Linux kernel; he stepped into the discussion to remind people of the kernel's regulatory statement and to say that there was no real interest in encouraging the violation of spectrum-use regulations with any driver. He added:
We are not dealing with legal issues on Linux, we are dealing with engineering solutions, and trust me, we're light years ahead of other OSes because of this now.
His point is that the kernel's "engineering solution" to the regulatory problem has made it possible for wireless vendors to dip their toes into the open-source water. That, in turn, has helped to move Linux from having poor wireless support to, arguably, having the best support over the course of a few years. It is hard to argue with the success which the wireless developers have had recently; any moves which might endanger that success should be considered carefully, to say the least.
Of course, it would be nicer to do without the proprietary firmware blob altogether. In early 2009, the openfwwf project announced the availability of an open source firmware implementation for Broadcom adapters. Since then, news from that project has been relatively scarce. On September 21, though, Michael Büsch announced the availability of a toolchain for working with the b43 firmware. Using the disassembler and assembler, it is possible to decode the device firmware, make changes, then build a new firmware load. Naturally, one can also build a new firmware implementation from the beginning. With these tools available, we might just get to a point where we can have device firmware without distribution restrictions, and which adds features and flexibility to the device as well.
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds