Brief items
The current 2.6 development kernel is 2.6.25-rc4,
released on March 4.
Patches still continue to go into the mainline repository at a high rate;
most of them are fixes, but there's also kdump support in the ehea driver,
dynamic tick handling in the RCU code, the temporary re-exporting of
init_mm until outside modules can be fixed, HT1100 SATA support,
Freescale MPC85xx DMA controller support, Seiko Instruments S-35390A RTC
support, and the restoration of GPL-only symbol access for ndiswrapper.
See the short-form changelog for details, or the full changelog for lots
of details.
As of this writing, no post-rc4 patches have been merged into the mainline
repository.
The current -mm tree is 2.6.25-rc3-mm1. Recent changes
to -mm include a big set of IDE changes and the removal of some old wireless
drivers. The ext4 filesystem is disabled in -mm until it catches up with
some API changes.
Kernel development news
/* As Linux tends to come apart under the stress
of time travel, we must be careful */
--
Rusty Russell
This many years into the effort we ought to be slicing and dicing
volumes as second nature, changing configuration on the fly,
transparently expanding, shrinking and migrating filesystems, and
many other things that ZFS and GEOM are already doing and we are
not. It is not so much that device mapper is incapable of such
fancy tricks, but that we have taken a very powerful kernel
subsystem and hobbled it with a nearly unusable application
interface. Think about a jet turbine racecar with a two inch air
intake.
--
Daniel Phillips
By Jonathan Corbet
March 5, 2008
The realtime patchset has one overriding goal: provide deterministic
response times in all situations. To that end, much work has been done to
eliminate places in the kernel which can be the source of excessive
latencies; quite a bit of that work has been merged into the mainline over
the last two years or so. One of the biggest remaining out-of-tree
components is the sleeping spinlock code. Sleeping spinlocks have
advantages and disadvantages. A recently posted set of patches has the
potential to significantly reduce one of the biggest disadvantages of the
realtime spinlock code.
Mainline spinlocks work by repeatedly polling a lock variable until it
becomes available. This busy-waiting code thus "spins" while waiting for a
lock. Spinlocks are quite fast, but they can also be a source of
significant latencies: a processor which is holding a lock can delay others
for indefinite amounts of time. In the mainline kernel, it is also not
possible to preempt a thread which holds a spinlock - another source of
latencies. (See this article
for a more detailed description of the mainline spinlock implementation).
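The busy-waiting idea can be sketched in user space with C11 atomics. This is only an illustration of polling a lock variable, not the kernel's spinlock implementation; the toy_spin_* names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a lock variable; not the kernel's spinlock_t. */
struct toy_spinlock {
    atomic_flag locked;
};

static void toy_spin_init(struct toy_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->locked);
}

/* One polling attempt: returns true if the lock was free and is now ours. */
static bool toy_spin_trylock(struct toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
}

/* Busy-wait ("spin") until the lock variable becomes available. */
static void toy_spin_lock(struct toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ; /* keep polling; the waiting CPU does no useful work here */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The empty loop body in toy_spin_lock() is where the latency comes from: a waiter makes no progress until the holder calls toy_spin_unlock().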
The realtime patch set addresses this problem in a couple of ways. One of
those is to cause threads waiting for a contended lock to sleep rather than
spin. As a result, lock contention cannot create latencies on processors
which are not holding the lock. When spinning is removed, it is also
possible to make code preemptible even when it holds a lock without causing
deadlock problems. That allows a high-priority process to run regardless
of any lower-priority processes which might currently hold locks on the
current CPU. Finally, the realtime patch set has added priority awareness
and priority inheritance to the locking code to ensure that the
highest-priority process is always able to run.
This is all good stuff, but there is one little disadvantage: the extra
overhead imposed by the more complicated locks can reduce system throughput
considerably. This is a cost that the realtime developers have been
willing to pay; it is often necessary to make trade-offs between throughput
and latency. Recently, though, some developers at Novell have come to the
conclusion that the throughput cost of the realtime patch set need not be
as severe as it currently is; the resulting adaptive realtime locks patch
brings the throughput of the realtime kernel to a level much closer to that
found in the mainline - at least, for some workloads.
The core observation encapsulated in this patch set is that hold times for
spinlocks tend to be quite short, especially in the realtime kernel. So
the cost of putting a waiting thread to sleep may well exceed the cost of
simply busy-waiting until the lock becomes free. Adaptive locks therefore behave
more like their mainline counterparts and simply spin until the lock becomes
available. There are some twists, though, which are necessitated by the
realtime system:
- The spinning cannot go on forever, since it may cause unacceptable
latencies elsewhere in the system. So an adaptive lock will only spin
up to a configurable number of times (the default is 10,000) before
giving up and going to sleep.
- Since lock holders are preemptible in the realtime kernel, it is
possible that the thread which currently holds the lock was previously
running on the same CPU as the process trying to acquire the lock. In
that situation, spinning for the lock is clearly a bad thing to do. In
the absence of a loop counter, it would be a hard deadlock situation;
with the counter, it would just be an unnecessary delay. Either way, the
result is undesirable, so, if the lock owner is running on the same
processor, the thread waiting for the lock simply goes to sleep.
- If the lock owner is, instead, itself sleeping while waiting for something,
there is little point in having another thread stay awake in the hope
that the owner will release the lock soon. So, in this case too, a thread
contending for a lock will simply go to sleep rather than spin.
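The three rules above can be summarized as a small decision function. What follows is a sketch of the policy as described here, not code from the patch set; the structure, the function name and the idea of passing in a snapshot of the owner's state are all assumptions made for illustration. Only the default limit of 10,000 comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_wait_action { TOY_WAIT_SPIN, TOY_WAIT_SLEEP };

/* Hypothetical snapshot of what a waiter knows about the lock owner. */
struct toy_owner_info {
    int cpu;        /* CPU the owner last ran on */
    bool running;   /* false if the owner is itself asleep */
};

#define TOY_ADAPTIVE_SPIN_LIMIT 10000UL /* default spin count cited above */

/* Decide whether a contending thread should keep spinning or go to sleep. */
static enum toy_wait_action
toy_adaptive_wait_action(const struct toy_owner_info *owner, int my_cpu,
                         unsigned long spins_so_far, unsigned long limit)
{
    assert(owner != NULL);
    if (spins_so_far >= limit)
        return TOY_WAIT_SLEEP;  /* spinning cannot go on forever */
    if (owner->cpu == my_cpu)
        return TOY_WAIT_SLEEP;  /* owner cannot progress while we spin here */
    if (!owner->running)
        return TOY_WAIT_SLEEP;  /* owner is asleep; release is not imminent */
    return TOY_WAIT_SPIN;       /* short hold time expected: keep polling */
}
```

A waiter would call this on each pass through its loop, incrementing spins_so_far as it goes.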
One other throughput improvement is obtained by changing the lock-stealing
code. Locks in the realtime system are normally fair, in that threads
waiting for a lock will get it in first-come-first-served order. A
higher-priority process will jump the queue, however, and "steal" the lock
from lower-priority processes which have been waiting for longer. The
adaptive locks patch tweaks this algorithm by allowing a running process to
steal a lock from another, equal-priority process which is sleeping. This
change adds some unfairness to the locking code, but it allows the system
to avoid a context switch and keep a running, cache-warm process going.
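As a rough sketch, the stealing rule might be expressed as follows. The names are hypothetical, and the convention that a larger number means a higher priority is an assumption of this example rather than something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the waiter currently first in line for the lock. */
struct toy_waiter {
    int prio;       /* assumption: larger value means higher priority */
    bool sleeping;  /* true if this waiter has gone to sleep */
};

/*
 * May a running thread of priority thief_prio take the lock ahead of
 * the pending waiter? With adaptive == false only strictly higher
 * priority wins; with adaptive == true an equal-priority thief may
 * also win if the pending waiter is asleep.
 */
static bool toy_can_steal(int thief_prio, const struct toy_waiter *pending,
                          bool adaptive)
{
    assert(pending != NULL);
    if (thief_prio > pending->prio)
        return true;
    if (adaptive && thief_prio == pending->prio && pending->sleeping)
        return true;
    return false;
}
```

The second test is the source of the unfairness: a sleeping waiter of equal priority loses its place, but the running thread avoids a context switch.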
Some benchmark results [PDF] have been posted. On the test system, the
dbench benchmark runs at about 1500 MB/s on a stock 2.6.24 system, but
at just under 170 MB/s on a system with the realtime patches applied.
The adaptive lock patch raises that number back to over 700 MB/s -
still far from a mainline system, but much better than before. The
improvement in hackbench results is even better, while the change in the
all-important "build the kernel" benchmark is small (but still positive).
A fundamental patch like this will require quite a bit of review and
testing before it might be accepted. But the initial results suggest that
adaptive locks might be a big win for the realtime patch set.
By Jonathan Corbet
March 3, 2008
Thomas Gleixner has discovered that being the maintainer of a core kernel
infrastructure module can bring some special challenges. Whenever
somebody's kernel oopses in the timer code, for example, Thomas tends to
hear about it. The only problem is that the timer code is almost never
where the bug is. Instead, it's far more likely that some other kernel
subsystem has corrupted an active timer, leaving a bomb that will only
explode later, in the timer code, when that timer is set to expire. At
that point, it can be hard to figure out where the real problem is, as the
culprit will be long gone.
In response, Thomas developed some special-purpose code aimed at finding
the real source of timer-related problems, preferably before it brings down
the kernel. He has now generalized that code and posted it as the object debugging infrastructure
patch, which was subsequently significantly revised. As this
code develops, it has the potential to help find whole classes of
especially difficult bugs before they bring the system down.
There are a few steps involved in adding support for object debugging to a
new subsystem. The first is to create and populate a
debug_obj_descr structure (defined in
<linux/debugobjects.h>):
struct debug_obj_descr {
    const char *name;
    int (*fixup_init) (void *addr, enum debug_obj_state state);
    int (*fixup_activate) (void *addr, enum debug_obj_state state);
    int (*fixup_destroy) (void *addr, enum debug_obj_state state);
    int (*fixup_free) (void *addr, enum debug_obj_state state);
};
The name field is the name of the subsystem; it is used in
debugging output. We will return to the other fields below.
The next step is to call into the object debugging code whenever an action
of interest involves one of the tracked objects. There is a set of
functions used for this purpose:
void debug_object_init (void *addr, struct debug_obj_descr *descr);
void debug_object_activate (void *addr, struct debug_obj_descr *descr);
void debug_object_deactivate(void *addr, struct debug_obj_descr *descr);
void debug_object_destroy (void *addr, struct debug_obj_descr *descr);
void debug_object_free (void *addr, struct debug_obj_descr *descr);
In each case, addr is a pointer to the object being operated on,
and descr is a pointer to the debug_obj_descr structure
mentioned above. The meaning of each call is:
- debug_object_init(): the object is being initialized.
- debug_object_activate(): it is being added to a subsystem list. For
timer debugging, this action happens when add_timer() is
called.
- debug_object_deactivate(): the object is being removed from a subsystem
list.
- debug_object_destroy(): the object is being destroyed and is
no longer referenced within the subsystem. This call is not
used in the version 2 patch set.
- debug_object_free(): the object is being freed.
The debugging code maintains a hashed set of lists for tracking objects;
each object is added to the appropriate list when one of the above calls is
made. As actions are performed on the objects, their state is tracked.
In this way, the debugging code
is able to test for a number of common mistakes, including deactivating an
object which is not active, reinitializing active objects, or adding
objects twice.
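The kind of state tracking described here can be modeled in a few lines of user-space C. This is a simplified sketch: it tracks a single object directly instead of looking it up in hashed lists, and the toy_* names and state values are invented for the example rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states; the patch has its own enum debug_obj_state. */
enum toy_state { TOY_NONE, TOY_INIT, TOY_ACTIVE, TOY_INACTIVE };

struct toy_tracked {
    enum toy_state state;
    unsigned int errors;    /* number of mistakes detected so far */
};

/* Each function returns 0 if the transition is sane, -1 if it is a bug. */
static int toy_track_init(struct toy_tracked *o)
{
    assert(o != NULL);
    if (o->state == TOY_ACTIVE) {   /* reinitializing an active object */
        o->errors++;
        return -1;
    }
    o->state = TOY_INIT;
    return 0;
}

static int toy_track_activate(struct toy_tracked *o)
{
    assert(o != NULL);
    if (o->state == TOY_ACTIVE) {   /* adding the same object twice */
        o->errors++;
        return -1;
    }
    o->state = TOY_ACTIVE;
    return 0;
}

static int toy_track_deactivate(struct toy_tracked *o)
{
    assert(o != NULL);
    if (o->state != TOY_ACTIVE) {   /* deactivating an inactive object */
        o->errors++;
        return -1;
    }
    o->state = TOY_INACTIVE;
    return 0;
}
```

Where this model just bumps a counter, the real code would report the problem at the point where the bad transition is requested.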
When something goes wrong, a backtrace is sent to the system logs. Since
this backtrace identifies where the original error is made, it is likely to
be far more useful than the trace associated with the system crash which
will probably come later. But this infrastructure can also help to make
that crash less likely, in that each subsystem can register a set of "fixup
functions." These, of course, are all the methods in the
debug_obj_descr structure which we glossed over above.
For example, if a call to debug_object_init() is made with an
object which has already been activated, the debugging infrastructure will
respond with a call to the fixup_init() callback, passing in the
object in question and its current state (ODEBUG_STATE_ACTIVE in
this case). The callback should return zero if it is able to,
somehow, repair the damage. Even if things cannot be truly fixed, though,
there is still use for this function; the timer code, for example, will
disable an active timer if the calling code mishandles it. The kernel will
almost certainly not operate as expected, but, at least, it has a smaller
chance of crashing at some random time in the future.
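A toy version of this fixup path might look like the following. It is a user-space sketch under several assumptions: the toy_* types and functions are invented, the object carries its own state for simplicity, and the wrapper's return values (0, 1, -1) are a convention of this example only. The callback itself follows the zero-on-repair convention described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_STATE_NONE, TOY_STATE_INIT, TOY_STATE_ACTIVE };

/* Cut-down, hypothetical analogue of struct debug_obj_descr. */
struct toy_descr {
    const char *name;
    int (*fixup_init)(void *addr, enum toy_state state);
};

struct toy_timer {
    bool pending;           /* true while the timer is queued */
    enum toy_state state;   /* kept in the object here for simplicity */
};

/* Fixup: disable a timer that is being reinitialized while still active. */
static int toy_timer_fixup_init(void *addr, enum toy_state state)
{
    struct toy_timer *t = addr;

    if (state != TOY_STATE_ACTIVE)
        return -1;          /* nothing this callback knows how to repair */
    t->pending = false;     /* take the timer out of service */
    return 0;               /* zero: damage repaired */
}

static const struct toy_descr toy_timer_descr = {
    .name = "toy_timer",
    .fixup_init = toy_timer_fixup_init,
};

/* Returns 0 if nothing was wrong, 1 if a mistake was repaired, -1 if not. */
static int toy_debug_object_init(struct toy_timer *t,
                                 const struct toy_descr *descr)
{
    int result = 0;

    assert(t != NULL && descr != NULL);
    if (t->state == TOY_STATE_ACTIVE) {
        /* The real code would log a backtrace at this point. */
        if (descr->fixup_init == NULL ||
            descr->fixup_init(t, t->state) != 0)
            return -1;
        result = 1;
    }
    t->state = TOY_STATE_INIT;
    return result;
}
```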
Most debugging checks are performed in response to calls from within the
subsystem itself. There is one useful check which cannot be done that way,
though: detecting the freeing of objects which are still under some sort of
subsystem management. To catch that mistake, Thomas's patch inserts a hook
into functions like kfree() and free_hot_cold_page().
Every time an object is freed, the code checks through the appropriate list
to see if it is still seen as being active in some subsystem.
Freeing an object which is still known to a subsystem is almost always a
bug - one which can be hard to track down later on.
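The idea behind the free-time check can be sketched as a scan over the tracked objects for any active one whose address falls inside the region being freed. The flat array and the toy_* names are assumptions of this example; the patch itself uses hashed lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one tracked object. */
struct toy_tracked {
    const void *addr;   /* address of the tracked object */
    bool active;        /* still under subsystem management? */
};

/*
 * Count tracked objects that are still active and whose address lies in
 * [start, start + len). A nonzero result means live objects are about
 * to be freed.
 */
static size_t toy_count_active_in_range(const struct toy_tracked *objs,
                                        size_t n, const void *start,
                                        size_t len)
{
    uintptr_t lo = (uintptr_t)start;
    uintptr_t hi = lo + len;
    size_t hits = 0;

    assert(objs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uintptr_t a = (uintptr_t)objs[i].addr;

        if (objs[i].active && a >= lo && a < hi)
            hits++;
    }
    return hits;
}
```

A search of this kind on every free is where the extra cost of the check comes from.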
The check on freed memory objects is clearly a useful debugging tool. It could also have a
nontrivial overhead, though, since it requires searching a list every time
some memory is freed. So it has its own configuration option and can be
configured out of the kernel, even if the rest of the debugging code is
built in.
At this point, only the timer subsystem is covered by this infrastructure,
but there are plenty of other obvious candidates. Perhaps at the top of the
list would be kobjects, which are famously susceptible to all kinds
of programming mistakes. So expect to see the coverage of this code grow
in the near future.
By Jonathan Corbet
March 4, 2008
Back in February, LWN published a discussion of the vmsplice() exploit which showed how the
failure to check permissions for a read operation led to a buffer overflow
within the kernel. Subsequently, a linux-kernel reader
pointed out that the article
stopped short of a complete explanation: this is not an ordinary buffer
overflow exploit. Travel schedules and such prevented the writing of an
immediate followup, but your editor would still like to tell the full
story. So this article picks up where the last one left off and describes
how the vmsplice() exploit makes use of this buffer overflow to
take over the system.
When vmsplice() is being used to feed data from memory into a
pipe, the function charged with making it all happen is
vmsplice_to_pipe(), found in fs/splice.c. It declares a
couple of arrays of interest:
struct page *pages[PIPE_BUFFERS];
struct partial_page partial[PIPE_BUFFERS];
PIPE_BUFFERS, remember, is 16 on exploitable configurations. Both
of these arrays are passed into get_iovec_page_array(), which, as
described in the previous article, makes a call to
get_user_pages() to fill in the pages array. As a result
of the failure to check whether the calling application is allowed to read
the requested region of memory, get_user_pages() will overflow the
pages array, writing far more than PIPE_BUFFERS pointers
into it. These are, however, pointers to legitimate kernel data
structures; it remains to be seen how this overflow enables the attacker to
take control of the system.
The partial array is also passed into
get_iovec_page_array(); it describes the portion of each page which
should be written into the pipe. To that end, a loop like this is run
immediately after returning from get_user_pages():
for (i = 0; i < error; i++) {
    const int plen = min_t(size_t, len, PAGE_SIZE - off);
    partial[buffers].offset = off;
    partial[buffers].len = plen;
    /* ... */
}
Since full pages are being written in this case, the calculated offset will be zero, and the length
will be PAGE_SIZE (4096). The value of error is the
return value from get_user_pages(); that will be the number of
pages actually mapped: 46, in the case of the exploit. Remember that the
partial array is also dimensioned to hold 16 entries, so this loop
will overflow that array as well.
Both of these arrays are declared, one right after the other, in
vmsplice_to_pipe(). A quick test by your editor suggests that the
partial array will be placed below pages in memory, so,
once partial is overflowed, the loop will start overwriting
pages instead. So the pages array will end up containing
alternating values of zero and 4096 rather than the real struct
page pointers it had before. (It's worth noting that the exploit
still works if the arrays are placed in the opposite order, since the
overflow causes code down the line to think that pages is larger
than it really is).
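The alternating pattern can be reproduced safely in user space by modeling the two arrays as one contiguous block of bytes. The sketch below rests on loudly stated assumptions: a 32-bit system where a struct page pointer is four bytes, a partial_page entry made of two 32-bit fields (offset, then len), and partial sitting directly below pages with no padding. All toy_* names are invented, and the model stops writing at the end of its own frame instead of running on as the real overflow does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PIPE_BUFFERS 16
#define TOY_PAGE_SIZE 4096u

/* Assumed layout of one partial entry: two 32-bit fields, 8 bytes. */
struct toy_partial_page {
    uint32_t offset;
    uint32_t len;
};

/* Stand-in for the stack frame: partial[] immediately below pages[]. */
struct toy_frame {
    struct toy_partial_page partial[TOY_PIPE_BUFFERS];
    uint32_t pages[TOY_PIPE_BUFFERS];   /* 32-bit "struct page *" values */
};

_Static_assert(sizeof(struct toy_frame) == 192, "model assumes no padding");

/*
 * Write nr_entries full-page entries starting at partial[0], the way the
 * loop does, but through a byte pointer and clamped to the frame so that
 * the model itself never writes out of bounds.
 */
static void toy_fill_partial(struct toy_frame *f, unsigned int nr_entries)
{
    unsigned char *raw = (unsigned char *)f;

    assert(f != NULL);
    for (unsigned int i = 0; i < nr_entries; i++) {
        struct toy_partial_page e = { .offset = 0, .len = TOY_PAGE_SIZE };
        size_t pos = (size_t)i * sizeof(e);

        if (pos + sizeof(e) > sizeof(*f))
            break;      /* the real overflow keeps going; the model stops */
        memcpy(raw + pos, &e, sizeof(e));
    }
}
```

With 16 entries the pages array is untouched; with 46, entries 16 through 23 land on pages[0] through pages[15], leaving zero in the even slots and 4096 in the odd ones.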
Once all this has happened, control returns to vmsplice_to_pipe()
- the overflow is not big enough to have overwritten the return address. A
call to splice_to_pipe() is supposed to finish the job, but
something interesting happens there. Toward the beginning of this
function, this test is made:
if (!pipe->readers) {
    send_sig(SIGPIPE, current, 0);
    if (!ret)
        ret = -EPIPE;
    break;
}
Looking back at the exploit
code, we see that it closes the read side of the pipe before calling
vmsplice(). So splice_to_pipe() will quit almost
immediately. On its way out, however, it does this:
while (page_nr < spd_pages)
    page_cache_release(spd->pages[page_nr++]);
The call to get_user_pages() will have locked each of the relevant
pages into memory to allow the kernel to work with them; this is the
cleanup code which goes back and unlocks the pages which will not be used.
But remember that the pointers in the pages array have been
overwritten, and are now either zero or 4096. What would normally happen
here is a kernel oops, since those are not legitimate addresses. The
exploit code has done something tricky, though: using some special
mmap() calls, it has created some anonymous memory at the bottom
of its address space.
Directly dereferencing user-space addresses while running in kernel mode is
frowned upon for a number of reasons; it can blow up in a number of ways.
But, if the address is valid and the relevant page is resident in memory,
direct access to user-space memory will work. So, when the kernel starts
to work with the addresses that it thinks are struct page
pointers, it does not get any sort of fault; instead, it gets the data
placed in that memory by the exploit. Needless to say, that data has been
arranged carefully.
The Linux kernel normally manages each page as an independent object.
There are times, however, when pages are grouped into larger units, called
"compound pages." This generally happens when physically contiguous
allocations larger than one page are needed by the kernel; when this
happens, a compound page is passed back to the caller. These pages are
special in that they must be split back apart when they are released back
into the system, and there may be other cleanup work to do. So
compound pages have an attribute not found on normal pages: a destructor
which is called when the page is freed.
So, if we look at how the exploit sets up its low-memory page
structures, we see:
pages[0]->flags = 1 << PG_compound;
pages[0]->private = (unsigned long) pages[0];
pages[0]->count = 1;
pages[1]->lru.next = (long) kernel_code;
When the kernel looks for a page structure at user-space address
zero, it will find something which looks like a compound page. The
destructor (stored in the lru.next field of the second
page structure) is set to kernel_code(), a function
defined within the exploit itself. Since the count is set to one,
the call to page_cache_release() (which decrements that count)
will conclude that there are no further references and, since the page looks like
a compound page, the destructor will be called. At this point, the exploit
has arbitrary code running in kernel mode, and the show is truly over.
This code just sets the process's uid to zero (giving it root
access), then engages in some assembly-language trickery to return
immediately to user space, shorting out the rest of the cleanup process.
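The mechanism at work, a release path which calls a destructor stored in data it trusts, can be modeled harmlessly in user space. This sketch is deliberately simplified and is not the kernel's struct page: the destructor is kept in the structure itself rather than in the lru.next field of a second page, and all toy_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PG_COMPOUND (1UL << 0)  /* hypothetical flag bit */

struct toy_page {
    unsigned long flags;
    int count;                          /* reference count */
    void (*dtor)(struct toy_page *);    /* compound-page destructor */
};

static int toy_dtor_calls;

/* Stand-in destructor which only records that it ran. */
static void toy_counting_dtor(struct toy_page *p)
{
    (void)p;
    toy_dtor_calls++;
}

/*
 * Drop one reference. When the count reaches zero and the page claims
 * to be compound, call whatever destructor the structure names.
 * Returns 1 if this was the last reference, 0 otherwise.
 */
static int toy_put_page(struct toy_page *p)
{
    assert(p != NULL);
    if (--p->count > 0)
        return 0;
    if ((p->flags & TOY_PG_COMPOUND) && p->dtor != NULL)
        p->dtor(p);     /* whoever filled in the structure chose this */
    return 1;
}
```

Nothing in toy_put_page() can tell a genuine structure from a forged one with the compound flag set, a count of one and a destructor of the forger's choosing.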
There are a couple of interesting implications from all of this. One, clearly,
is that this exploit is not something which was bashed out by a script
kiddie somewhere. It was written by somebody who understands low-level
kernel code quite well and who is able to use that understanding to
escalate an apparent information-disclosure vulnerability into a full code
execution problem. It is, clearly, a mistake to underestimate those who
write exploits, not all of whom immediately make their works known to the
development community. One also should not assume that they have not
already written exploits for other, still unfixed bugs.
Also worth noting is the fact that ordinary buffer overflow protection may
well not have been effective against this vulnerability. The return address on
the stack was not overwritten, and no exploit code was put in data areas.
This episode has caused a renewed interest in technical security measures
in the kernel. These measures are good, but it would be a mistake to think
that they will fix the problem. What is really needed is stronger review
of patches with security in mind; it is not yet clear to your editor that
this review is happening.
Page editor: Jonathan Corbet