Brief items
The current 2.6 prepatch is 2.6.7-rc2,
released on May 29.
Most of the patches this time around are aimed at stabilization
after the big changes in -rc1, but -rc2 also contains an ALSA update, a
whole bunch of new
__user annotations (intended to help find
misuses of user-space pointers - see below), an XFS update, some IPSec
fixes, and some
architecture updates. See
the long-format
changelog for the details.
Linus's BitKeeper repository contains, as of this writing, some stack usage
reduction patches, more __user annotations, some architecture
updates, and a few other fixes.
The current prepatch from Andrew Morton is 2.6.7-rc2-mm1. Recent additions to -mm include
NFS, MD, and DMI updates, the x86 performance counters patch, some
read-copy-update scalability work, and the usual pile of fixes.
The current 2.4 prepatch is 2.4.27-pre4, which was released by Marcelo on May 30. There are
some XFS and JFS updates, a number of 2.6 networking backports (including
TCP Vegas support and receiver-side RTT estimation), some driver updates,
and the usual set of fixes.
Kernel development news
Marking regions of memory as not containing executable code is not a
particularly new technique; some processors have supported it for
years. The processor family that everybody actually uses, however (the x86),
does not have a "no-execute" bit.
At least, it didn't until very recently. AMD added a no-execute (NX)
permission bit to the page table entries
in its 64-bit processors; Intel has recently said it will be
supporting this mode as well. So the hardware will be able to avoid
executing code from certain regions of memory, making various types of
buffer overflow attacks harder. At least, that will be true if the
operating system supports and uses the NX mode.
To that end, Ingo Molnar has posted a patch bringing NX
support to the x86 architecture; his patch is based on previous work done
by Intel and the x86_64 NX support by Andi Kleen. This patch allows
applications to mark areas as being non-executable; such areas, typically,
will include the stack and heap zones. It also applies the NX bit to the
kernel itself; kernel text is marked executable, but kernel data is not.
As a result, the next time a buffer overflow turns up in the kernel, it,
too, will be harder to exploit.
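From the application side, a region ends up non-executable simply by not asking for execute permission on it. The following is a minimal sketch using the standard POSIX mmap() and mprotect() calls; it is not taken from Ingo's patch, and the helper names alloc_nonexec_buffer() and make_executable() are invented for illustration. Whether the missing execute permission is actually enforced depends on the hardware and kernel support described here.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map an anonymous, zero-filled region which is
 * readable and writable but NOT executable.  Returns NULL on failure.
 */
void *alloc_nonexec_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: a program which really does generate code at run
 * time must opt in explicitly.  Returns 0 on success, -1 on failure.
 */
int make_executable(void *p, size_t len)
{
    return mprotect(p, len, PROT_READ | PROT_EXEC);
}
```

A buffer obtained from alloc_nonexec_buffer() can be filled with data as usual; on a system enforcing NX, jumping into it without first calling make_executable() would be expected to fault rather than run the injected bytes.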
The NX bit only works when the processor is running in PAE mode. Most
x86 Linux systems currently do not run in that mode; it is normally only
turned on when large amounts of memory (more than 4GB) are installed. This
mode adds a third level of page tables, and makes the page table entries
themselves larger, so users and distributors normally turn it off if it is
not needed. Most modern x86 processors support the PAE mode, however;
security considerations may lead to it being used more heavily in the
future.
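The dependence on PAE comes down to where the bit lives: the classic 32-bit page table entry has no room for it, while the 64-bit entries used in PAE mode carry the no-execute flag in their topmost bit. A small sketch of the idea follows; the bit positions are those documented for the x86 architecture, but the type and function names are made up for this example and do not come from the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits found in both the 32-bit and the PAE entry formats. */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_WRITABLE (UINT64_C(1) << 1)
#define PTE_USER     (UINT64_C(1) << 2)

/* No-execute flag: bit 63, which only exists in 64-bit wide entries. */
#define PTE_NX       (UINT64_C(1) << 63)

typedef uint64_t pae_pte_t;      /* hypothetical name for a PAE entry */

/* A page can be executed from only if it is present and NX is clear. */
static inline bool pte_may_execute(pae_pte_t pte)
{
    return (pte & PTE_PRESENT) && !(pte & PTE_NX);
}

/* Return a copy of the entry with execution forbidden. */
static inline pae_pte_t pte_mark_noexec(pae_pte_t pte)
{
    return pte | PTE_NX;
}
```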
Linus's main concern about the patch would
appear to be how many old applications it might break. The reply from Arjan van de Ven is that pretty much
everything "just works." The no-execute permission is not applied unless
the code is specially marked in the image file, and gcc apparently does a
good job of not setting that flag when it would break things. If this
experience holds true, NX support could go in fairly quickly, and a
longstanding x86 security weakness will be no more.
For people interested in testing this patch, Arjan has merged it into the
latest Fedora Core test kernels. See the patch
announcement for a pointer. There is also a
"quickstart" document for those who would like to test out NX in their
own kernels.
As the 2.6.0 release approached, some developers worried that the CPU
scheduler would be the downfall of this particular stable series.
Complaints of poor interactive performance were common, NUMA systems were
not supported well, and so on. Over time, most of these problems have been
addressed, largely through massive amounts of interactivity work and the
addition of the domain scheduler. Complaints about the scheduler
have been relatively rare in recent times.
One thing that does still bother some people, however, is the complexity of
the current 2.6 scheduler. The interactivity work, in particular, added a
great deal of very obscure code. The scheduler goes to great lengths to
try to identify interactive tasks and to boost their priority accordingly.
This process involves numerous strange computations involving a number of
magic constants; it is difficult to understand, much less improve.
Con Kolivas, who had his hand in much of the interactivity work, has just
posted a new version of his "staircase
scheduler" patch. This patch aims to greatly simplify the scheduler while
simultaneously improving interactive response; it deletes 498 lines of
code while adding fewer than 200. Much of what is deleted is the "black
magic" interactivity calculations, which are replaced with a relatively
simple, rank-based scheme.
The staircase scheduler implements a single, ranked array of processes for
each CPU. Initially, each process goes into the array at the rank
determined by its base priority; the scheduler can then locate and run the
highest-priority process in the usual way. So far, not much has changed.
In the current scheduler, processes which use up their time slice get moved
over to a separate "expired" array; there they languish until the rest of
the processes in the mix have used up their time (or blocked) as well. The
staircase scheduler does away with the expired array; instead, an expired
process will be put back into the staircase, but at the next lower rank.
It can, thus, continue to run, but at a lower priority. When it exhausts
another time slice, it moves down again. And so on. The following little
table shows how long the process spends at each priority level:
|           | Priority rank                                            |
| Iteration | Base | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | ... |
| 1         | 1    | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |     |
When a process falls off the bottom of the staircase, an interesting thing
happens: it gets moved back up to one level below its previous maximum, and
it gets two time slices at that level. Thereafter, it once again works its
way down the steps to the bottom. The next time, it goes up to two steps
below the maximum, for three time slices. The above table, with three
iterations through the staircase, would look like this:
|           | Priority rank                                            |
| Iteration | Base | -1 | -2 | -3 | -4 | -5 | -6 | -7 | -8 | -9 | ... |
| 1         | 1    | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |     |
| 2         |      | 2  | 1  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |     |
| 3         |      |    | 3  | 1  | 1  | 1  | 1  | 1  | 1  | 1  |     |
Each descent down the staircase thus involves the same number of time
slices, but, each time, more slices are spent at the top priority level for
that iteration.
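The pattern in the table can be written down in a few lines of C. This is only a sketch of the arithmetic as described above, not code from the staircase patch; the function names are invented, and the depth of ten steps is simply the number of columns (Base through -9) shown in the table. What happens once the starting step reaches the bottom of the staircase is not covered by the description, so the sketch just returns zero there.

```c
/* Number of steps modelled; matches the Base..-9 columns in the table. */
#define STAIRCASE_DEPTH 10

/*
 * Time slices a CPU-bound process spends `steps_below_base` ranks below
 * its base priority during the given descent (iterations count from 1).
 */
int staircase_slices(int iteration, int steps_below_base)
{
    int top = iteration - 1;     /* each descent starts one step lower */

    if (iteration < 1 || top >= STAIRCASE_DEPTH)
        return 0;                /* outside what the sketch models */
    if (steps_below_base < top || steps_below_base >= STAIRCASE_DEPTH)
        return 0;                /* step not visited on this descent */

    /* The top step of descent N gets N slices; every other step, one. */
    return steps_below_base == top ? iteration : 1;
}

/* Total slices consumed by one complete descent. */
int staircase_total(int iteration)
{
    int total = 0;

    for (int step = 0; step < STAIRCASE_DEPTH; step++)
        total += staircase_slices(iteration, step);
    return total;
}
```

Summing any row with staircase_total() gives the same figure, which is the property noted above: the extra slices at the top of each descent exactly make up for the steps which are skipped.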
This algorithm helps maintain
the relative priorities. A process at priority n will, after
falling off the staircase, find itself competing with all the processes at
priority n-1, but it will get a longer slice of time relative to those
other processes, which have a lower base priority.
If a process sleeps for a reasonable interval, it gets pushed back up the
staircase. Thus interactive tasks, which normally sleep quite a bit,
should stay near the top of the staircase and be responsive, while CPU hogs
spend much of their time on the lower steps.
The kernel community may not be up for another big scheduler change at this
point in the stable series; many people would like to see 2.6 actually
stabilize and 2.7 begin. This patch appears worthy of consideration,
however, for its simplification of a complex part of the kernel if nothing
else.
In past years, this page has looked at the work done by the "Stanford
checker," which analyzes code in search of various types of programming
errors. The checker has found a lot of problems over the years, with the
result that many bugs have been fixed before they had a chance to
bite users of production kernels.
The only problem with the Stanford checker is that it is not free software;
it is, in fact, completely unavailable to the world as a whole. Rather
than release the code, the checker group went off and formed Coverity to commercialize the checker
software (now called "SWAT" and touted, ominously, as being "patent
pending"). Developers at Coverity still occasionally post reports of
potential bugs found by SWAT, but, for the most part, their attention seems
focused on potential revenue opportunities.
It is hard to complain about this outcome. Before heading on this course,
the Coverity folks uncovered vast numbers of bugs, and all Linux users
benefited from that work. They also demonstrated how valuable static code
testing tools can be. The community, however, was left in the position of
having to actually write its own checker if it wanted one. Fortunately,
this is the sort of thing the community can be good at.
A while back, none other than Linus Torvalds started work on his own tool,
which came to be called "sparse." There has recently been a flurry of new
activity around sparse, so it seems like a good time to take a look.
sparse is normally obtained by cloning the BitKeeper repository at
bk://kernel.bkbits.net/torvalds/sparse. For those who don't use
BK, a checked-out
version is available (as a bunch of SCCS files) on kernel.org. There
is a low-bandwidth sparse mailing
list as well.
Essentially, sparse is a parsing and analysis library for the C language.
One could put a number of different backends onto it; for example, a
code-generation backend would turn it into a simple compiler. For the
purposes of the kernel, however, the backend of interest is the analysis
code which looks for various types of errors. Many of these (numerous kinds
of type mismatches, for example) are also found by the compiler, but other
tests are unique to sparse.
The core test done by sparse is still the check for improper use of
user-space pointers. A quick look through the kernel will turn up liberal
use of a type attribute called __user; for example, the
read() method invoked from system calls is prototyped as:
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
When the kernel is being compiled, __user is defined as the empty
string, so gcc doesn't see it at all. When sparse is being run
instead, __user marks the pointer as (1) being in a separate address
space, and (2) not being legal to dereference. sparse will use those
flags to catch any mixing of user- and kernel-space pointers, and any
attempt to directly dereference user-space pointers.
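The two-faced definition can be illustrated outside of the kernel. The sketch below is modelled on how the kernel headers are understood to define these annotations (sparse predefines __CHECKER__ when it parses a file); it compiles as ordinary C with gcc. The function demo_copy_from_user() is a stand-in written for this example, not the kernel's copy_from_user(), and the two get_len functions are likewise hypothetical.

```c
#include <string.h>

#ifdef __CHECKER__               /* defined when sparse reads the file */
# define __user  __attribute__((noderef, address_space(1)))
# define __force __attribute__((force))
#else                            /* gcc sees nothing at all */
# define __user
# define __force
#endif

/*
 * Stand-in for the kernel's user-copy routine.  Returns the number of
 * bytes NOT copied; this toy version always succeeds.
 */
static unsigned long demo_copy_from_user(void *to, const void __user *from,
                                         unsigned long n)
{
    memcpy(to, (__force const void *)from, n);
    return 0;
}

/* Wrong: a direct dereference of a user pointer; sparse would object. */
int get_len_unchecked(const int __user *ulen)
{
    return *ulen;
}

/* Right: the value is fetched through the copy routine. */
int get_len_checked(const int __user *ulen, int *out)
{
    int tmp;

    if (demo_copy_from_user(&tmp, ulen, sizeof(tmp)) != 0)
        return -1;               /* the kernel would return -EFAULT */
    *out = tmp;
    return 0;
}
```

Under gcc both functions compile silently and behave identically in this toy setting, which is exactly why such mistakes survive in real code; only the checker, seeing the attributes, can tell the two apart.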
These checks have turned up a surprising number of errors. The kernel
normally sets up the virtual address space in such a way that direct
dereferencing of user-space pointers actually works - most of the time.
Using user-space addresses in this way will fail, however, if the user page
is not actually resident in memory at the time. More importantly, perhaps,
this sort of direct dereferencing bypasses the normal access controls;
every such error could, thus, become a security hole.
Catching such mistakes automatically seems like a good idea. It does
require, however, that every variable holding a user-space pointer be
marked with the __user attribute. Since much of the kernel
(including every device driver) deals
with user-space pointers, this is not a trivial job. This job is
proceeding, however; several dozen patches adding __user
annotations (and fixing problems found on the way) have been merged for
2.6.7.
Other checks performed include finding constants which are overly long for
their target type, mistakes in embedded assembly language code, empty
switch statements, assignments in conditionals, and so on. The output from
sparse is rather noisy still, but one assumes that will improve over time.
If you have sparse installed, running it on the kernel is simply a matter
of adding "C=1" to the make command. External modules
can also be checked in this way.
sparse is still clearly far behind the Stanford checker in terms of the
variety of errors it can find. Unlike the checker, however, sparse is free
software. The core parsing infrastructure is in place, so the addition of
new checks should be relatively straightforward. All that's needed is the
application of a bunch of developer time.
A standard feature of most commercial operating systems is a "crash dump"
facility. If something goes wrong in the operating system kernel, the
system saves its entire state to a file and reboots; the contents of that
file can then be examined at leisure to try to figure out what went wrong.
The Linux kernel, however, lacks this capability. There are a few possible
reasons for this omission: the kernel never crashes (not quite true,
unfortunately), kernel developers rarely want crash dumps for their own
work, and there is a certain degree of unhappiness with all of the crash
dump patches currently in circulation. The fact of the matter, however, is
that a number of Linux vendors would like to have a good crash dump system
in place so they can better support their customers.
A recent patch posted by Takao Indoh may
provide that capability. The new "diskdump" system has taken a
simpler approach to crash dumps that, with some fixes, may just get enough
core hacker support to be considered for merging into the (presumably 2.7)
mainline.
Diskdump works by taking absolute control of the system when a panic
occurs. It shuts down all interrupts to keep the processor from getting
distracted; it also freezes all other processors on SMP systems. It then
checksums its own code, comparing against a value computed at
initialization time; if the checksums fail to match, diskdump assumes that
it has been corrupted as a result of whatever went wrong and refuses to
run.
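The self-check amounts to "record a checksum at initialization, recompute it at panic time, and compare." A rough sketch follows. The article does not say which checksum diskdump uses, so the rotate-and-xor sum here is purely an assumption, as are the structure and function names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checksum: rotate left by one bit, then mix in each byte. */
uint32_t region_checksum(const void *start, size_t len)
{
    const unsigned char *p = start;
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = ((sum << 1) | (sum >> 31)) ^ p[i];
    return sum;
}

/* Hypothetical record of the region to protect and its expected sum. */
struct dump_guard {
    const void *start;
    size_t len;
    uint32_t expected;
};

/* Called once at initialization time, while the system is healthy. */
void dump_guard_init(struct dump_guard *g, const void *start, size_t len)
{
    g->start = start;
    g->len = len;
    g->expected = region_checksum(start, len);
}

/* Called at panic time; returns nonzero if the region is unchanged. */
int dump_guard_intact(const struct dump_guard *g)
{
    return region_checksum(g->start, g->len) == g->expected;
}
```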
The next step involves finding a place to store the crash dump. Diskdump
can be set up with multiple dump partitions. For each possibility, it
queries the state of the driver, then reads and verifies the entire crash
dump space. The diskdump authors are (rightly) fearful of overwriting
important data while the system is in an unstable state, so diskdump
requires that every block of the crash dump partition be initialized with a
special pattern. If any blocks fail the test, that destination will not be
used.
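The verification pass is conceptually simple: every block must still hold the marker written when the partition was prepared. The sketch below works on an in-memory copy of the dump area; the block size, the marker string, and the function names are all assumptions made for illustration and do not reflect the actual on-disk format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DUMP_BLOCK_SIZE 512                          /* assumed block size */

static const char dump_pattern[] = "DISKDUMP-EMPTY"; /* hypothetical marker */

/* Fill one block with the repeating marker, as formatting would do. */
void dump_block_format(unsigned char *block)
{
    const size_t plen = sizeof(dump_pattern) - 1;

    for (size_t i = 0; i < DUMP_BLOCK_SIZE; i++)
        block[i] = (unsigned char)dump_pattern[i % plen];
}

/* True only if every block in the area still carries the marker. */
bool dump_area_is_pristine(const unsigned char *area, size_t nblocks)
{
    unsigned char expect[DUMP_BLOCK_SIZE];

    dump_block_format(expect);
    for (size_t b = 0; b < nblocks; b++)
        if (memcmp(area + b * DUMP_BLOCK_SIZE, expect, DUMP_BLOCK_SIZE) != 0)
            return false;        /* something else has written here */
    return true;
}
```

A single unexpected byte anywhere in the area is enough to reject the destination, which errs firmly on the side of not overwriting data.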
When a suitable location has been found, diskdump writes a header with the
system state and panic information, followed by a memory image. At that
point the system can be rebooted; once things are stable again, the
"savecore" utility turns the memory image into a proper core dump and
reinitializes the crash dump partition. All is then in readiness for
debugging and, if need be, the next crash.
Diskdump needs some significant block driver modifications to be able to do
its job. The driver must export a new set of operations:
    struct disk_dump_device_ops {
        int (*sanity_check)(struct disk_dump_device *);
        int (*quiesce)(struct disk_dump_device *);
        int (*shutdown)(struct disk_dump_device *);
        int (*rw_block)(struct disk_dump_partition *, int rw,
                        unsigned long block_nr, void *buf);
    };
The sanity_check() call checks to ensure that the device in
question is ready to accept a crash dump. If that function finds that, for
example, the device is offline or somebody, somewhere is holding a spinlock
for the device, the sanity check will fail and the dump will have to go
somewhere else. A call to quiesce() follows, in case any
preparation is needed. The current implementation (which only works with
some SCSI devices) performs a full SCSI bus reset at this point. The
actual I/O is done via rw_block(), which is expected to transfer one
page per call. This I/O should be done without interrupts (which are,
remember, disabled when the panic happens), so the typical implementation
will work by polling the device. At the end, shutdown() is called
to ensure that all blocks have been flushed to the media.
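Put together, the calling sequence might look something like the user-space mock below. Only the operations structure is taken from the patch as quoted above; the layouts of struct disk_dump_device and struct disk_dump_partition, the DUMP_WRITE constant, the use of the page index as block_nr, the dump_pages() function, and the RAM-backed "driver" are all assumptions made to get a self-contained example.

```c
#include <string.h>

#define DUMP_PAGE_SIZE 4096      /* assumed page size */
#define DUMP_WRITE     1         /* hypothetical value for the rw argument */
#define MOCK_PAGES     4         /* capacity of the pretend device */

struct disk_dump_device;
struct disk_dump_partition;

/* The operations structure, as given in the article. */
struct disk_dump_device_ops {
    int (*sanity_check)(struct disk_dump_device *);
    int (*quiesce)(struct disk_dump_device *);
    int (*shutdown)(struct disk_dump_device *);
    int (*rw_block)(struct disk_dump_partition *, int rw,
                    unsigned long block_nr, void *buf);
};

/* Hypothetical layouts; the real structures are not shown in the text. */
struct disk_dump_device {
    const struct disk_dump_device_ops *ops;
};
struct disk_dump_partition {
    struct disk_dump_device *device;
    unsigned long nr_blocks;
};

/* Write nr_pages pages of mem; 0 on success, negative at the failing step. */
int dump_pages(struct disk_dump_partition *part, void *mem,
               unsigned long nr_pages)
{
    struct disk_dump_device *dev = part->device;
    const struct disk_dump_device_ops *ops = dev->ops;
    char *page = mem;

    if (nr_pages > part->nr_blocks)
        return -1;               /* dump would not fit */
    if (ops->sanity_check(dev) != 0)
        return -2;               /* caller should try another partition */
    if (ops->quiesce(dev) != 0)
        return -3;
    for (unsigned long i = 0; i < nr_pages; i++)   /* one page per call */
        if (ops->rw_block(part, DUMP_WRITE, i, page + i * DUMP_PAGE_SIZE) != 0)
            return -4;
    if (ops->shutdown(dev) != 0)
        return -5;               /* data may not have reached the media */
    return 0;
}

/* RAM-backed mock driver, for illustration only. */
static char mock_media[MOCK_PAGES][DUMP_PAGE_SIZE];
static int mock_flushed;

static int mock_ok(struct disk_dump_device *dev)
{
    (void)dev;
    return 0;
}

static int mock_shutdown(struct disk_dump_device *dev)
{
    (void)dev;
    mock_flushed = 1;
    return 0;
}

static int mock_rw_block(struct disk_dump_partition *part, int rw,
                         unsigned long block_nr, void *buf)
{
    if (rw != DUMP_WRITE || block_nr >= part->nr_blocks ||
        block_nr >= MOCK_PAGES)
        return -1;
    memcpy(mock_media[block_nr], buf, DUMP_PAGE_SIZE);
    return 0;
}

static const struct disk_dump_device_ops mock_ops = {
    .sanity_check = mock_ok,
    .quiesce      = mock_ok,
    .shutdown     = mock_shutdown,
    .rw_block     = mock_rw_block,
};
```

A real driver's rw_block() would, of course, poll the hardware for completion rather than call memcpy(); the point here is only the order of the calls and where a failure sends the dump elsewhere.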
Perhaps the ugliest part of the patch - and the part which some developers
have complained about - is the rerouting of timer and tasklet calls. Since
all interrupts are disabled, the normal timer and software interrupt
mechanisms will not function. Diskdump does not need those capabilities
itself, but a number of disk drivers do. As a result, diskdump must,
somehow, run tasklets and timers expected by the driver, but without
running arbitrary code unrelated to the dump process. To this end,
diskdump sets up its own private timer and tasklet lists which come into
action once the system is locked down and the dump process begins.
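A private timer list of this sort need not be complicated: with interrupts off, the dump loop can simply poll it between I/O operations. The sketch below shows one way that could work. It is not diskdump's implementation; struct dump_timer is a simplified stand-in rather than the kernel's struct timer_list, every name is invented, and clock wraparound is ignored.

```c
#include <stddef.h>

/* Hypothetical, simplified timer; not the kernel's struct timer_list. */
struct dump_timer {
    unsigned long expires;               /* time at which to fire */
    void (*function)(unsigned long);     /* handler to call */
    unsigned long data;                  /* argument for the handler */
    struct dump_timer *next;
};

static struct dump_timer *dump_timers;   /* private list of pending timers */

/* Queue a timer on the private list instead of the system-wide one. */
void dump_add_timer(struct dump_timer *t)
{
    t->next = dump_timers;
    dump_timers = t;
}

/*
 * Called from the polling loop: run every timer which has expired and
 * requeue the rest.  Returns the number of handlers which were run.
 */
int dump_run_timers(unsigned long now)
{
    struct dump_timer *pending = dump_timers;
    int ran = 0;

    dump_timers = NULL;          /* handlers may safely add new timers */
    while (pending) {
        struct dump_timer *t = pending;

        pending = t->next;
        if (t->expires <= now) {
            t->next = NULL;
            t->function(t->data);
            ran++;
        } else {
            dump_add_timer(t);   /* not due yet; keep it pending */
        }
    }
    return ran;
}

/* Tiny handler used to demonstrate the list. */
static unsigned long dump_demo_fired;

static void dump_demo_handler(unsigned long data)
{
    dump_demo_fired += data;
}
```

Since only timers queued by dump-aware drivers ever land on this list, nothing unrelated to the dump gets a chance to run.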
Currently, all this works by modifying the drivers to call diskdump's
functions rather than the core kernel variants. So, for example, instead
of setting up a timer with add_timer(), a driver implementing
dumps would call this little wrapper:
    static inline void diskdump_add_timer(struct timer_list *timer)
    {
        if (crashdump_mode())
            _diskdump_add_timer(timer);
        else
            add_timer(timer);
    }
But that function is only available if crash dumps are configured into the
system, so some preprocessor macros are used to redefine
add_timer() if need be. This solution is not going to make it
into the mainline kernel, however. The preferred approach would appear to
be integrating this functionality directly into the core timer and tasklet
routines; that change will make the driver changes smaller, but at the cost
of intruding into some of the core kernel code.
Page editor: Jonathan Corbet