Brief items
The current development kernel is 2.6.36-rc5, released on September 20.
"Nothing really stands out. Except perhaps that Al has obviously been
looking at architecture sigreturn paths, and been finding problems, and Dan
Rosenberg has been finding places where we copy structures to user space
without having fully initialized them yet, leaking kernel stack contents.
And we had that annoying x86-64 32-bit compat syscall bug that needed
fixing for the _second_ time." See the announcement for the short
changelog, or the full changelog for all the details.
Stable updates: the 2.6.27.54, 2.6.32.22 and 2.6.35.5 updates were released on
September 20; they each contain a long list of important fixes.
I did some Git digging - that ptrace check for /proc/$pid/mem
read/write goes all the way back to the beginning of written human
history, aka Linux v2.6.12-rc2.
I researched the fragmented history of the stone ages as well, i
checked out numerous cave paintings, and while much was lost, i was
able to recover this old fragment of a clue in the cave called
'patch-2.3.27', carbon-dated back as far as the previous millenium
(!).
-- Ingo Molnar
This series is the core mount and lookup infrastructure from union
mounts, split up into small, easily digestible, bikeshed-friendly
pieces. All of the (non-documentation, non-whitespace) patches in
this series are less than 140 lines long. It's like Twitter for
kernel patches.
VFS developers should be able to review each of these patches in 3
minutes or less. If it takes you longer, email me and I'll post a
video on YouTube making fun of you.
-- Valerie Aurora
Well, after things were basically working, we started fixing all
the real bugs reported by an updated checkpatch.pl
Any lines that obviously should have been longer than 80 chars were
joined back to their proper length, and curly braces were moved to
their proper locations. Tabs were all changed to 3 spaces each, we
couldn't quite decide on 2 vs 4.
After that was done, we forward ported devfs. Proper disc
management code in Linux was long overdue.
-- Chris Mason
Jon Masters has announced that he is once again producing kernel podcasts.
"My intention is to get back into doing this regularly again as
it helps me keep up with LKML - recently I've been having enough time to
do most of the prep but not enough time to write the show and record it!
I don't do this for any reason other than to force myself to keep up
with LKML."
Oracle has sent out a press release proclaiming the virtues of its new kernel. "Based
on the combined efforts of Oracle's Linux, Database, Middleware, and
Hardware engineering teams, the Unbreakable Enterprise Kernel is: Fast:
More than 75 percent performance gain demonstrated in OLTP performance
tests over a Red Hat Compatible Kernel; 200 percent speedup of Infiniband
messaging; 137 percent faster solid state disk access." Much of the
gain may come from using 2.6.32 as the base instead of 2.6.18, but it's
hard to tell for sure; source for this kernel does not seem to be available
for download as of this writing.
This is an interesting move in that it reintroduces competition at the
kernel level - something distributors have not emphasized in recent years -
and it demonstrates an intent to move a bit further away from the RHEL base
that Oracle uses to build its offering.
By Jake Edge
September 22, 2010
Justin P. Mattock has set out a rather large clean-up task for himself:
updating all
of the web links in the kernel comments. As might be guessed, many of those links
have link-rotted over the last ten or more years, so Mattock is trying to
update the kernel to point to the proper place—if it
can be found. That effort resulted in a monster patch
that covered all of the references to "http" that he could find.
Many of the new links pointed off to archive.org as the only location that
Mattock was able to find, but that caused a formatting problem. When
adding those, he used links like:
http://web.archive.org/web/*/http://oldsite/oldlink
which shows all of the different versions of pages that the Wayback Machine
has stored. Putting "*/" into a C-language comment is not a good plan,
however, since that sequence terminates the comment early, as Matt Turner
pointed out. The proper solution is to use "%2A", which is the URL
percent-encoding of "*". But there is a
bigger issue with those archive links.
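A minimal sketch of the percent-encoding fix follows. The helper name encode_star() is hypothetical and is not part of Mattock's patch; it only illustrates how replacing "*" with "%2A" keeps the "*/" sequence out of a URL that will sit inside a C comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into dst, replacing every '*' with its percent-encoded
 * form "%2A" so the result can never contain the "* /" comment
 * terminator. (Hypothetical helper, for illustration only.)
 *
 * Returns 0 on success, or -1 if dst (dstlen bytes) is too small.
 */
int encode_star(const char *src, char *dst, size_t dstlen)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dstlen == 0)
        return -1;

    for (; *src != '\0'; src++) {
        size_t need = (*src == '*') ? 3 : 1;

        /* Always keep one byte free for the trailing NUL. */
        if (out + need >= dstlen)
            return -1;

        if (*src == '*')
            memcpy(dst + out, "%2A", 3);
        else
            dst[out] = *src;
        out += need;
    }
    dst[out] = '\0';
    return 0;
}
```

Passing the archive.org link shown above through this helper yields http://web.archive.org/web/%2A/http://oldsite/oldlink, which is safe to paste into a comment.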
Finn Thain suggested that any of those
links could just be left alone and that people should already know about
archive.org, so adding it to the old links is just
"bloat". Furthermore, there is a question of which
version of the stored page is the one that the original comment referred
to. Basically, Thain's point was that web pages which are maintained and
updated are likely to be more useful, and that those who want to refer to
pages that have dropped off the net should know (or learn) how to go about it.
Eventually Mattock split the patch into two
parts, one that updated links to newer locations and the other which added
the archive.org links for lost sites. He is soliciting more feedback on
whether to include
the archive links or not.
It is not clear, so far at least, whether these changes will be accepted.
It is, in a sense, churn, and likely to lead to more churn down the road as
link-rot is an endemic web problem. It is probably frustrating for
developers and others to come across broken links in the kernel code, but
is it worth the never-ending—hopefully fairly infrequent—stream
of update patches? There are undoubtedly copyright, logistical, and other
issues, but it
would certainly be a lot nicer if these documents could be permanently
stored in some location at kernel.org.
Kernel development news
By Jonathan Corbet
September 20, 2010
The removal of the big kernel lock has been an ongoing, multi-year effort
which has been reported on here a few times. The BKL has some strange and
unique properties which make its removal from various kernel subsystems
trickier than one might think it should be. But, thanks to a great deal of work
by Arnd Bergmann, we might just be approaching a point where the 2.6.37
kernel can be built BKL-free for many or most users. There is, however,
one significant obstacle which still must be overcome.
Arnd currently has a vast array of patches in the linux-next tree. Many of
them are the result of the tedious (but tricky) work of looking at specific
subsystems, determining what kind of locking they really need to have, then
replacing lock_kernel() calls with something more local. In
many cases, the BKL locking can simply be removed, as the code turns out
not to need it. A big focus for 2.6.37 has been the removal of the BKL
from a number of filesystems - a task which has required digging into some
fairly old code. The Amiga FFS, for example, cannot have received much
maintenance in recent times, and seems unlikely to have a lot of users.
The most wide-ranging patch for 2.6.37 has to do with the llseek()
function, found in struct file_operations. This function allows a
filesystem or driver to implement the lseek() system call,
changing a file descriptor's position within the file. Unlike most
file_operations functions, there is a default implementation for
llseek() which simply changes the kernel's idea of the
descriptor's position without notifying the underlying code at all. That
change, naturally, was done with the BKL held. This implicit default
llseek() implementation will have made life easier for a handful
of developers, but it makes BKL removal hard: an implementation change
could affect any code with a file_operations structure, not
just modules which actually implement the llseek() operation.
To make things harder, a great many of these implicit llseek()
implementations are not really needed or useful - most device drivers do
not implement any concept of a "file position" and pay no attention to
whatever the kernel thinks the position might be. In such situations, it
is tempting to change the code to an explicit "no seeking allowed"
implementation which reflects what is really going on. The problem here is
that some user-space application somewhere might be calling
lseek() on the device, and they might get upset if those calls
started failing with ESPIPE errors. In other words, a
successful-but-ignored lseek() call might just be part of the
user-space ABI
for a specific device. So something more careful has to be done.
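To see what such a failure looks like from user space, the sketch below calls lseek() on a pipe, a file type which already rejects seeking with ESPIPE. The function names are made up for this example; pipe(), lseek(), and close() are the standard POSIX calls. An application that treated this error as fatal would break if a device started returning it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Try to seek on the read end of a freshly created pipe.
 * Returns the errno value set by lseek(), 0 if the seek
 * unexpectedly succeeded, or -1 if no pipe could be created.
 */
int seek_errno_on_pipe(void)
{
    int fds[2];
    int err = 0;

    if (pipe(fds) != 0)
        return -1;

    if (lseek(fds[0], 0, SEEK_SET) == (off_t)-1)
        err = errno;

    close(fds[0]);
    close(fds[1]);
    return err;
}

/* What an application depending on seekable descriptors would notice. */
void check_pipe_seek_fails(void)
{
    assert(seek_errno_on_pipe() == ESPIPE);
}
```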
The first step was to go through the kernel and add an explicit
llseek() operation to every file_operations structure
which did not already have one - a patch affecting 343 files. This work
was done primarily with a frightening Coccinelle semantic patch (it was
included in the patch
changelog) which attempts to determine whether the code in question
actually uses the file position or not. If the file position is used,
default_llseek(), which implements the old default behavior,
becomes the explicit default; otherwise
noop_llseek(), which succeeds but does nothing, is used. After
that work was done, Arnd was able to verify that none of the users of
default_llseek() (there are 191 of them) needs the BKL. So the
removal of the BKL from llseek() can be made complete.
The patch also changes how llseek() is handled in the core
kernel. Starting with 2.6.37, assuming this work is merged (a good bet),
any code which fails to provide an llseek() operation will default
to no_llseek(), which returns ESPIPE. Any out-of-tree
code which depends on the old default will thus not work properly with
2.6.37 until it is updated.
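The three behaviors can be modeled in user space as a rough sketch. Everything with a _model suffix below is a simplified stand-in invented for this example: the real kernel functions take an additional origin argument and operate on struct file, and the dispatch function only approximates what the core kernel is expected to do once the patch is merged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct file_model;

/* Stand-in for struct file_operations, reduced to one member. */
struct fops_model {
    long long (*llseek)(struct file_model *file, long long offset);
};

/* Stand-in for struct file. */
struct file_model {
    long long pos;
    const struct fops_model *ops;
};

/* Old implicit behavior: update the position (absolute seeks only). */
long long default_llseek_model(struct file_model *file, long long offset)
{
    if (offset < 0)
        return -EINVAL;
    file->pos = offset;
    return file->pos;
}

/* Succeeds but does nothing, for code that ignores the position. */
long long noop_llseek_model(struct file_model *file, long long offset)
{
    (void)offset;
    return file->pos;
}

/* Explicit "no seeking allowed". */
long long no_llseek_model(struct file_model *file, long long offset)
{
    (void)file;
    (void)offset;
    return -ESPIPE;
}

/* Planned dispatch: a missing llseek() now means no_llseek(). */
long long vfs_llseek_model(struct file_model *file, long long offset)
{
    assert(file != NULL && file->ops != NULL);
    if (file->ops->llseek == NULL)
        return no_llseek_model(file, offset);
    return file->ops->llseek(file, offset);
}
```

In this model, out-of-tree code that leaves llseek unset gets -ESPIPE back, which is the behavior change described above.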
Even after all of this work, there are still a lot of
lock_kernel() calls in the mainline. Almost all of them, though,
are in old, obscure code which is not relevant to a lot of users. In some
cases, the remaining BKL-using code might be shifted over to the
staging tree and eventually removed entirely if it is not fixed up. In
other cases, an effort will be made to eradicate the BKL; it can still be
found in occasionally-useful code like the AppleTalk and ncpfs
implementations. There are also a lot of Video4Linux2 drivers which still
use the BKL; how those drivers will be fixed is the subject of an ongoing discussion in the V4L2 community.
The biggest impediment to a BKL-free 2.6.37, though, may well be the POSIX
locking code. File locks are represented internally with a
file_lock structure; those structures are passed around to a few
places and, of course, protected with the BKL. Patches exist to protect
those structures with a spinlock within the core kernel. The main sticking
point appears to be the NFS lockd daemon, which uses file_lock
structures and which, thus, requires the BKL; somebody is said to be working on
fixing this code, but no patches have been posted yet. Until lockd has
been converted, file locking as a whole requires the BKL. And, since it's
a rare
kernel that does not have file locking enabled, that will drag the BKL into
almost all real-world kernel builds.
Even after that fix is in place, distributor kernels are likely to need the
BKL for a bit longer. As long as there is even one module they ship which
requires the BKL, the support for it needs to be there, even if most users
will not have that module loaded. People who build their own kernels,
though, should often be able to put together a configuration which does not
need the BKL. If all goes well, 2.6.37 will have a configuration option
which makes BKL-free builds possible. That's a huge step forward, even if
the BKL still exists in most stock kernels.
By Jake Edge
September 22, 2010
A kernel bug that was found—and fixed—in 2007 has recently
reared its head again. Unfortunately, the bug was reintroduced in 2008,
leaving a rather large pile of kernel versions that are vulnerable to a
local privilege escalation on x86_64 systems. Though perhaps difficult to do, it would seem
that some kind of regression testing suite for the kernel might be able to
detect these kinds of problems before they get released to the world.
There are two semi-related bugs that are both currently floating
around, which is causing a bit of confusion. One was originally CVE-2007-4573,
and was reintroduced in a cleanup
patch in June 2008. The reintroduced vulnerability has been tagged as
CVE-2010-3301
(though the CVE entry is simply reserved at the time of this writing). Ben
Hawkes found a somewhat similar
vulnerability—also
exploiting system calls from 32-bit binaries on 64-bit x86
systems—which led him to the discovery of the reintroduction of
CVE-2007-4573.
There are numerous pitfalls
when trying to handle 32-bit binaries making
system calls on 64-bit systems. Linux has a set of
functions to handle the differences in arguments and calling conventions
between 32-bit and 64-bit system calls, but it has always been tricky to get
right. What we are seeing today are two instances where it wasn't done
correctly, and the consequences of that can be dire.
The 2007 problem stemmed from a mismatch between the use of the
%eax 32-bit register to store the system call number (which is
used as an index into the syscall table) and the use of the %rax
64-bit register (which contains %eax as its low 32 bits) to do the
indexing. In the
"normal" system call path, %eax was zero-extended before the
32-bit system call number from user space was stored, but there was a
second path into that code where the upper 32 bits in %rax were
not cleared.
The ptrace() system call can stop a traced process at its next system
call entry or exit (using the PTRACE_SYSCALL request type) and also gives a user the
ability to set register values. An attacker could set the upper 32 bits of
%rax to a value of their choosing, make a system call with a
seemingly valid index (in %eax) and end up indexing somewhere
outside of the syscall table. By arranging to have exploit code at the
designated location, the attacker can get the kernel to run that code.
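The mismatch is easy to reproduce with plain integer arithmetic. The sketch below is a user-space model, not kernel code: MODEL_NR_SYSCALLS is an arbitrary table size chosen for illustration, and the function names are invented. It shows a register value whose low 32 bits pass the range check while the full 64-bit value, used as the index, lies far outside the table.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_SYSCALLS 512u  /* hypothetical table size */

/* Flawed validity test: only the low 32 bits ("%eax") are compared. */
int low_half_looks_valid(uint64_t rax)
{
    return (uint32_t)rax < MODEL_NR_SYSCALLS;
}

/* The table lookup, however, uses the whole register ("%rax"). */
int full_index_in_table(uint64_t rax)
{
    return rax < MODEL_NR_SYSCALLS;
}

/* True when a value passes the check yet indexes outside the table. */
int slips_past_check(uint64_t rax)
{
    return low_half_looks_valid(rax) && !full_index_in_table(rax);
}

/* Build a register value the way a tracer could via ptrace(). */
uint64_t with_upper_bits(uint32_t eax, uint32_t upper)
{
    return ((uint64_t)upper << 32) | eax;
}

/* The sort of check a regression test could run. */
void mismatch_self_check(void)
{
    assert(slips_past_check(with_upper_bits(1, 1)));
    assert(!slips_past_check(1));
}
```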
The ptrace() path was fixed by Andi Kleen in September 2007 by ensuring that %eax (and
other registers) were zero-extended. But zero-extending %eax was
removed in Roland McGrath's cleanup patch in June 2008. When Hawkes and Robert Swiecki recently noticed
the problem, they had little difficulty in modifying an exploit from 2007 to
get a root shell on recent kernels.
CVE-2010-3301 was resolved by a pair of patches. McGrath put the
zero-extension of the %eax register back into the ptrace path,
while H. Peter Anvin made
the validity test of the system call number look at the entire
%rax register. Either would be sufficient to close the current
hole, but Anvin's patch will prevent any new paths into the system call
entry code from running afoul of this problem in the future.
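Both approaches can be sketched with the same kind of user-space model. The names and the table size below are assumptions made for this example, and the real fixes live in the x86 entry code rather than in C. The first function mirrors the zero-extension approach, the second mirrors the full-register comparison.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_SYSCALLS 512u  /* hypothetical table size */

/*
 * Zero-extension: discard the upper 32 bits before the value is
 * checked or used. Returns the table index, or -1 if rejected.
 */
int64_t dispatch_zero_extended(uint64_t rax)
{
    uint64_t nr = (uint32_t)rax;

    return nr < MODEL_NR_SYSCALLS ? (int64_t)nr : -1;
}

/*
 * Full-register check: compare all 64 bits against the table size.
 * Returns the table index, or -1 if rejected.
 */
int64_t dispatch_full_check(uint64_t rax)
{
    return rax < MODEL_NR_SYSCALLS ? (int64_t)rax : -1;
}

/* Neither variant may ever produce an out-of-table index. */
void fixes_self_check(void)
{
    uint64_t poisoned = ((uint64_t)1 << 32) | 1;

    assert(dispatch_zero_extended(poisoned) == 1);
    assert(dispatch_full_check(poisoned) == -1);
}
```

The two variants treat a poisoned register differently (one runs system call 1, the other rejects the call), but in neither case does the index escape the table.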
The fact that the old exploit was useful implies that someone could
have written a test case in 2007 that might have detected the
reintroduction of the problem. A suite of such regression tests, run
regularly against the mainline, would be
quite useful as a way to reduce regressions, both for normal bugs as well
as for security holes.
Not all kernel bugs will be amenable to
that kind of testing, but, for those that are, it seems like an idea worth
pursuing.
The other problem that Hawkes found (CVE-2010-3081,
also just reserved) is that the compat_alloc_user_space() function did
not check to see that the pointer which is being returned is actually a
valid user-space pointer. That routine is used to allocate some stack
space for
massaging 32-bit data into its 64-bit equivalent before making a system
call. Hawkes found two places (and believes there are others) where the
lack of an access_ok()
call in that path could be exploited to allow attackers to write to kernel
memory.
One of those was in a video4linux ioctl(), but the more easily
exploited spot was in the IP multicast getsockopt() call. It uses
a 32-bit unsigned length parameter provided by user space that can be used
to confuse compat_alloc_user_space() into returning a pointer into
kernel memory. The compat_mc_getsockopt() call then writes
user-supplied values using those pointers. That can be fairly easily
turned into an exploit as Hawkes noted:
This path allows an attacker to write a chosen value to anywhere within the
top 31 bits of the kernel address space. In practice, this seems to be more
than enough for exploitation. My proof of concept overwrote the interrupt
descriptor table, but it's likely there are other good options too.
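The underlying arithmetic can be modeled in a few lines. This is a simplified sketch under stated assumptions: the allocation is treated as "user stack pointer minus length", MODEL_USER_LIMIT is a made-up boundary between user and kernel addresses, and range_is_user() only approximates the kind of test access_ok() performs. None of the names below are the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical boundary between user and kernel addresses. */
#define MODEL_USER_LIMIT 0x0000800000000000ULL

/* Unchecked: carve len bytes out below the user stack pointer. */
uint64_t alloc_user_space_unchecked(uint64_t user_sp, uint64_t len)
{
    return user_sp - len;   /* wraps around when len > user_sp */
}

/* Simplified range test in the spirit of access_ok(). */
int range_is_user(uint64_t addr, uint64_t len)
{
    return len <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - len;
}

/* Checked: return 0 when the range is not entirely user memory. */
uint64_t alloc_user_space_checked(uint64_t user_sp, uint64_t len)
{
    uint64_t ptr = user_sp - len;

    return range_is_user(ptr, len) ? ptr : 0;
}

/* A 32-bit stack pointer and an oversized length wrap into high memory. */
void alloc_self_check(void)
{
    uint64_t ptr = alloc_user_space_unchecked(0xffffd000ULL, 0xfffff000ULL);

    assert(ptr == 0xffffffffffffe000ULL);
    assert(!range_is_user(ptr, 0xfffff000ULL));
}
```

With a 32-bit stack pointer and a length larger than it, the subtraction wraps to an address near the top of the 64-bit space, which is where kernel memory lives on x86_64; the checked variant refuses to hand that pointer back.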
Anvin patched
compat_alloc_user_space() so that it always does the
access_ok() check. That should take care of the two problem spots
that Hawkes found as well as any others that are lurking out there. But
there have been a whole lot of kernels released with one or both of these
bugs, and there have been other bugs associated with 64-bit/32-bit
compatibility. It is a part of the kernel that Hawkes calls "a
little bit scary":
Not just because it's an increased attack surface versus having purely
32-bit or purely 64-bit modes, but because of the type of input processing
that has to be performed by any such compatibility layer. It invariably
involves a significant amount of subtle bit wrangling between 32/64-bit
values, using primitives that I'd argue most programmers aren't normally
exposed to. The possibility of misuse and abuse is very real.
Perhaps 32-bit compatibility for x86_64 kernels would be a good starting
point for regression testing. Some enterprise distributions were not
affected by CVE-2010-3301 because of the ancient kernels (like RHEL's
2.6.18) they are based on, but the code vulnerable to CVE-2010-3081 was backported into RHEL 5,
which required that kernel to be updated. The interests of
distribution vendors would be well-served by
better—any—regression testing so a project of that sort would
be quite welcome. The vendors may already be running some tests
internally, but
regression testing is just the kind of project that would benefit from some
cross-distribution collaboration.
It should also be noted that a posting to the
full-disclosure mailing list claims that the vulnerability in
compat_mc_getsockopt() has been known for nearly two-and-a-half
years by black (or at least gray) hats. According to the post, it was
noticed when the vulnerability was introduced in April 2008. Certainly
there are some that are following the commit-stream to try to find these
kinds of vulnerabilities; it would be good if the kernel had a team of
white hats doing the same.
By Jonathan Corbet
September 22, 2010
Broadcom's recently-announced open source wireless networking driver was a
major step forward for a company which has, until now, not been forthcoming
when it comes to free support for its wireless products. That driver
includes the obligatory firmware blob which has been licensed for free
distribution by the company; it is now found in the kernel firmware repository.
Broadcom has not freed the firmware for its older drivers, though, leading
to discussions on the intersection between kernel development and
regulatory compliance.
The lack of freely-distributable firmware for older Broadcom products makes
life a bit more difficult for users, who must obtain the firmware
separately. When the new firmware was made distributable, David Woodhouse
asked the company about the older firmware as well, only to be told that it
would not be made distributable. As he explained it, Broadcom is afraid
that allowing the distribution of that firmware could lead to trouble:
They seem to think that they could be prosecuted even for
*enabling* people to use the open source b43 driver, because you
have the possibility of hacking that driver not to conform to the
regulatory requirements.
The reason why the old firmware is different is simple: the newer firmware,
which can only run on newer hardware, has regulatory compliance built into
it. The older firmware, instead, depends on the driver in the kernel to
ensure that it is not configured to operate in a non-compliant manner.
David is not known for graceful suffering in the presence of (people he
sees as) fools. His response was a patch
which "credits" Broadcom for enabling the development of the
reverse-engineered b43 driver; this "enablement" is said to have come
through the provision of binary-only drivers which could be reverse
engineered. His goal in writing this patch was described as:
Everything we do in the b43 and b43legacy drivers is enabled by
Broadcom's original binary-only drivers.
So let's make that 'enablement' by Broadcom's binary drivers clear
in our source code -- in the hope that it'll narrow the 'risk gap'
that they falsely perceive between open and closed source drivers.
Or failing that, in the hope that it'll give their crack-addled lawyers
aneurysms, and they'll hire some saner ones to replace them.
He also expressed a wish that the b43 developers would release more
information - obtained from the binary-only drivers - on how to patch those
binary drivers to get around various regulatory restrictions. Once again,
he feels that this kind of information would help to make it clear that
free drivers do not make it any easier to operate the hardware in an
illegal manner.
David's position plays well with developers who have no patience for
obstacles created by lawyers. There is also a vocal contingent out there
which says that Linux has no business telling users how they should use
their hardware in any case; if the user wants to configure the hardware in
a non-compliant manner, that's the user's problem. In some cases, that
user may well have a license which makes it entirely legal to run the
hardware outside of the parameters which normally apply to off-the-shelf
wireless networking equipment. So regulatory compliance naturally
irritates developers who think that the kernel has no business getting in
anybody's way in this regard.
Luis Rodriguez, on the other hand, is a strong supporter of regulatory
compliance in the Linux kernel; he stepped into
the discussion to remind people of the kernel's
regulatory statement and to say that there was no real interest in
encouraging the violation of spectrum-use regulations with any driver. He
added:
The reason why current legislation doesn't seem to make sense is
because it does not, but just because a law doesn't make sense it
does not enable vendors to ignore it. So the best you can do in the
meantime is really be proactive by working on real technical
solutions.
We are not dealing with legal issues on Linux, we are dealing with
engineering solutions, and trust me, we're light years ahead of
other OSes because of this now.
His point is that the kernel's "engineering solution" to the regulatory
problem has made it possible for wireless vendors to dip their toes into
the open-source water. That, in turn, has helped to move Linux from having
poor wireless support to, arguably, having the best support over the course
of a few years. It is hard to argue with the success which the wireless
developers have had recently; any moves which might endanger that success
should be considered carefully, to say the least.
Of course, it would be nicer to do without the proprietary firmware blob
altogether. In early 2009, the openfwwf project announced the availability of an
open source firmware implementation for Broadcom adapters. Since then,
news from that project has been relatively scarce. On September 21,
though, Michael Büsch announced the availability of a
toolchain for working with the b43 firmware. Using the disassembler and
assembler, it is possible to decode the device firmware, make changes, then
build a new firmware load. Naturally, one can also build a new firmware
implementation from the beginning. With these tools available, we might
just get to a point where we can have device firmware without distribution
restrictions, and which adds features and flexibility to the device as
well.
Page editor: Jonathan Corbet