Brief items
The current 2.6 kernel is 2.6.0. Linus released the second 2.6.1
release candidate on January 6 without an announcement; the
(relatively small) list of changes can be seen in
the long-format changelog. Previously,
2.6.1-rc1 (announcement, changelog) had been released on
December 31. It included quite a few fixes, along with a couple of
internal API changes (see below), the restoration of the old
/proc/pid/maps formatting, the ability to compile with
-Os on embedded systems, message signaled interrupt support
(covered here last August), and extensible firmware interface (EFI)
support.
Linus's BitKeeper tree contains a very small number of fixes added since
2.6.1-rc2 came out.
The latest tree from Andrew Morton is 2.6.1-rc1-mm2. Recent additions of interest
include the laptop mode patch (see below), a mechanism for rate-limiting
printk() messages, a number of architecture updates, and a great
many fixes.
The current 2.4 kernel is 2.4.24, released by Marcelo on January 5.
Unusually, Marcelo deferred the patches in the 2.4.24 prepatches and
released a kernel containing only the mremap() and RTC security
fixes and a couple of other small repairs.
The previous 2.4.24 prepatches have been reissued (with the addition of
some ext2/ext3 filesystem updates, a number of architecture updates, and
various other fixes) as 2.4.25-pre4.
Kernel development news
The mremap() system call allows a user process to make changes to
an existing memory mapping. This call, as exported by the C library, allows
changing the size of a mapped region. The underlying call provided by the kernel,
however, has an extra parameter which can be used to request that the
entire region be moved to a different virtual address. That capability is
rarely used, but it turns out to be the key to a new kernel exploit.
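As a minimal user-space sketch of the resizing interface described above (Linux-specific, and assuming glibc's mremap() declaration, which requires _GNU_SOURCE), the following grows an anonymous mapping and lets the kernel relocate it if necessary. The helper names grow_mapping() and demo_grow() are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE             /* exposes mremap() and MREMAP_MAYMOVE in glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: grow a mapping from old_len to new_len bytes.
 * Returns the (possibly moved) address, or MAP_FAILED on error.
 */
static void *grow_mapping(void *addr, size_t old_len, size_t new_len)
{
    assert(old_len > 0 && new_len >= old_len);  /* never pass a zero length */
    return mremap(addr, old_len, new_len, MREMAP_MAYMOVE);
}

/*
 * Map one page, write to it, grow the mapping to four pages, and report
 * whether the original contents survived. Returns 1 on success, 0 if the
 * data was lost, -1 if a system call failed.
 */
static int demo_grow(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    size_t old_len = (size_t)page;
    size_t new_len = 4 * old_len;

    char *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    strcpy(p, "hello");

    char *q = grow_mapping(p, old_len, new_len);
    if (q == MAP_FAILED) {
        munmap(p, old_len);
        return -1;
    }

    q[new_len - 1] = 'x';                   /* the new tail is writable */
    int ok = (strcmp(q, "hello") == 0);     /* old contents were preserved */
    munmap(q, new_len);
    return ok;
}
```

Because MREMAP_MAYMOVE is set, the returned address may differ from the one passed in, so callers must not keep pointers into the old region.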
The code implementing mremap() makes several checks to ensure that
the calling process is not trying to do anything overly strange. The
kernel developers forgot to check, however, whether the user has asked to
remap a zero-length memory region. In that case, the code does the wrong
thing, and creates a new memory area with a length of zero at the requested
address. Since numerous places in the virtual memory subsystem code assume that
zero-length VM areas do not exist, the creation of such an area is, in
effect, a corruption of the kernel's virtual memory data structures.
The existence of a zero-length virtual memory area is not necessarily a
problem; since it does not actually cover any memory, it cannot be used
directly to access a memory range which should be off-limits to the
process. Where things go wrong is when the kernel makes a pass over a
process's entire virtual address space. For example, the fork()
system call must copy the process's memory space. The code used implements
(in a complicated way) a do loop that assumes each virtual memory
area contains at least one page. As a result, it copies page table
information which does not actually exist.
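The effect of that assumption can be shown with a deliberately simplified user-space model. This is not the kernel's code or its fix; the function names and the MODEL_PAGE_SIZE constant are assumptions made for the illustration. A do loop runs its body once before testing the condition, so an empty range still gets one page "copied".

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed page size for this model only */

/* Walk [start, end) with a do loop: the body always runs at least once. */
static size_t pages_visited_do_loop(unsigned long start, unsigned long end)
{
    size_t visited = 0;
    unsigned long addr = start;

    assert(start <= end);
    do {
        visited++;              /* stands in for "copy this page's entry" */
        addr += MODEL_PAGE_SIZE;
    } while (addr < end);
    return visited;
}

/* Walk [start, end) testing the condition first: empty ranges do nothing. */
static size_t pages_visited_checked(unsigned long start, unsigned long end)
{
    size_t visited = 0;

    assert(start <= end);
    for (unsigned long addr = start; addr < end; addr += MODEL_PAGE_SIZE)
        visited++;
    return visited;
}
```

For a normal two-page range both walks agree; for a zero-length range the do loop still visits one page that the area does not actually contain.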
The situation is complicated by the fact that mremap() is happy to
create this zero-length area just above the end of the virtual address
range allocated to user space--at the beginning of kernel space, in other
words. When fork() tries to copy the page table information for
that area, it can get tangled up in the special large page table entries
used for the kernel. The result is a mess.
What will usually happen (as people who have tried an exploit posted on
Bugtraq have found out) is that the system panics and reboots. It is not
clear to many people who have looked at the problem (including Linus) that this bug can be exploited
for anything other than a denial of service attack. It is worth noting,
however, that the advisory
posted by Paul Starzetz claims:
Proper exploitation of this vulnerability may lead to local
privilege escalation including execution of arbitrary code with
kernel level access. Proof-of-concept exploit code has been created
and successfully tested giving UID 0 shell on vulnerable
systems.... We have identified at least two different attack
vectors for the 2.4 kernel series.
It would not be a good idea to wait and see whether these claims are borne
out or not. Prudent administrators will upgrade to the 2.4.24 kernel, or
apply the update provided by their distributor. (The 2.6.0 kernel is also
vulnerable; the fix can be found in the 2.6.1-rc2 release.)
The kernel developers usually try to keep the internal kernel programming
interface unchanged over the course of a stable kernel series. There are
never any guarantees, however, and things can change at any time.
Experience has shown, in particular, that internal APIs can take a little
while to stabilize after a new stable series begins. The 2.6 kernel looks
like it will follow this pattern; a couple of small changes have already
found their way into the code base.
The first is a simple addition:
int can_request_irq(unsigned int irq, unsigned long flags);
This function will return a non-zero value if an attempt to request the
given interrupt number (possibly shared, as directed by flags)
would succeed. It is intended to be used in situations where multiple
interrupt numbers could be used and the code would like to find an idle
one. There are, of course, no guarantees; a kernel routine could get a
positive result from can_request_irq(), but find that somebody
else had slipped in and allocated that interrupt number immediately
thereafter. As of this writing, can_request_irq() is not exported
to modules and is not supported by all architectures.
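One way the probing pattern might look is sketched below. So that it can run in user space, the sketch supplies its own stand-in for can_request_irq() (with the signature shown above) that simply pretends IRQs 5 and 7 are busy; find_idle_irq() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the kernel's can_request_irq() so the sketch runs in user
 * space. In this model IRQs 5 and 7 are already taken and the flags
 * argument is ignored.
 */
static int can_request_irq(unsigned int irq, unsigned long flags)
{
    (void)flags;
    return irq != 5 && irq != 7;
}

/*
 * Hypothetical helper: return the first candidate that looks free, or -1
 * if none does. The answer is only a hint, since another user may claim
 * the interrupt before the caller gets to it.
 */
static int find_idle_irq(const unsigned int *candidates, size_t n,
                         unsigned long flags)
{
    assert(candidates != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (can_request_irq(candidates[i], flags))
            return (int)candidates[i];
    return -1;
}
```

In real kernel code the result would be followed by a call to request_irq(), with the next candidate tried if that request fails.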
The other change has the potential to create minor trouble for some
external modules. Code which implements virtual memory areas (to allow
device memory to be mapped into user space, for example) usually provides a
nopage() function to handle page faults. The prototype for that
function in 2.4.x and 2.6.0 is:
struct page *(*nopage)(struct vm_area_struct *area,
                       unsigned long address,
                       int unused);
As of 2.6.1, the unused argument is no longer unused, and the
prototype has changed to:
struct page *(*nopage)(struct vm_area_struct *area,
                       unsigned long address,
                       int *type);
The type argument is now used to return the type of the page
fault; VM_FAULT_MINOR would indicate a minor fault - one where the
page was in memory, and all that was needed was a page table fixup. A
return of VM_FAULT_MAJOR would, instead, indicate that the page
had to be fetched from disk. Driver code using nopage() to
implement a device mapping would probably return VM_FAULT_MINOR.
In-tree code checks whether type is NULL before assigning
the fault type; other users would be well advised to do the same.
Making module code compile cleanly will require changing the prototype of
the nopage() function, of course.
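A handler written against the new prototype might be shaped roughly as follows. To keep the sketch runnable outside the kernel, struct page, struct vm_area_struct, and the VM_FAULT_* constants are replaced by user-space stand-ins with placeholder contents and values; example_nopage() and the backing field are hypothetical. A real handler would also look up the page for the faulting address and take a reference on it.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for kernel types and constants (placeholders only). */
struct page { int id; };
struct vm_area_struct { struct page *backing; };
enum { VM_FAULT_MINOR = 1, VM_FAULT_MAJOR = 2 };    /* placeholder values */

/*
 * Sketch of a nopage()-style handler for a device mapping. The page is
 * already in memory, so the fault is reported as minor.
 */
static struct page *example_nopage(struct vm_area_struct *area,
                                   unsigned long address, int *type)
{
    (void)address;              /* a real handler would use the address */
    assert(area != NULL);

    if (type)                   /* callers may pass NULL */
        *type = VM_FAULT_MINOR;
    return area->backing;
}
```

The NULL test on type mirrors what the in-tree code does before assigning the fault type.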
As always, the Driver Porting
Series has been updated to reflect these changes.
It is fairly common for kernel code to create lightweight processes -
kernel threads - which perform a certain task asynchronously. To see these
threads, run ps ax on a 2.6 kernel and note all of the processes in
[square brackets] at the beginning of the listing. The code
which sets up these threads has tended to be reimplemented every time a new
thread is needed, however, and certain tasks (ensuring that the environment
is clean, for example) are not always handled well. The current kernel
also does not easily allow the creator of a kernel thread to control the
behavior of that thread.
Rusty Russell encountered even more trouble as he was doing his "hotplug
CPU" work: when processors can come and go, their associated kernel threads
must be started or stopped at arbitrary times. To make his life easier, he
implemented a new set of kernel thread
primitives which simplify the task greatly.
Using the new mechanism, the first step in creating a kernel thread is to
define a "thread function" which will contain the code to be executed; it
has a prototype like:
int thread_function(void *data);
The function will be called repeatedly (if need be) by the kthread code; it
can perform whatever task it is designated to do, sleeping when necessary.
This function should, however, check its signal status and return if any
signals are pending.
A kernel thread is created with:
struct task_struct *kthread_create(int (*threadfn)(void *data),
                                   void *data,
                                   const char *namefmt, ...);
The data argument will simply be passed to the thread function. A
standard printk()-style formatted string can be used to name the
thread.
The thread will not start running immediately; to get the thread to run,
pass the task_struct pointer returned by kthread_create()
to wake_up_process().
There is also a convenience function which creates and starts the thread:
struct task_struct *kthread_run(int (*threadfn)(void *data),
                                void *data,
                                const char *namefmt, ...);
Once started, the thread will run until it explicitly calls do_exit(), or
until somebody calls kthread_stop():
int kthread_stop(struct task_struct *thread);
kthread_stop() works by sending a signal to the thread. As a
result, the thread function will not be interrupted in the middle of some
important task. But, if the thread function never returns and does not
check for signals, it will never actually stop.
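The cooperative nature of stopping can be illustrated with a user-space analogy built on POSIX threads. This is not the kthread API: struct worker, worker_run(), and worker_stop() are hypothetical names, and an atomic flag stands in for the signal. The worker checks for a stop request only between units of work, so it is never interrupted mid-task.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical user-space worker; none of these names are kernel API. */
struct worker {
    pthread_t tid;
    atomic_int stop_requested;
    atomic_long iterations;
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;

    /* Poll the stop request between units of work, never in the middle. */
    while (!atomic_load(&w->stop_requested)) {
        atomic_fetch_add(&w->iterations, 1);    /* one unit of "work" */
        sched_yield();
    }
    return NULL;
}

/* Create and start the worker; returns 0 on success. */
static int worker_run(struct worker *w)
{
    assert(w != NULL);
    atomic_init(&w->stop_requested, 0);
    atomic_init(&w->iterations, 0);
    return pthread_create(&w->tid, NULL, worker_main, w);
}

/* Ask the worker to stop and wait for it to exit; returns 0 on success. */
static int worker_stop(struct worker *w)
{
    assert(w != NULL);
    atomic_store(&w->stop_requested, 1);
    return pthread_join(w->tid, NULL);
}
```

If worker_main() never tested stop_requested, worker_stop() would block forever in pthread_join(), which is the same failure mode as a thread function that never checks for signals.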
Kernel threads are often created to run on a particular processor. To
achieve this effect, call kthread_bind() after the thread is
created:
void kthread_bind(struct task_struct *thread, int cpu);
Rusty's patch includes a set of changes converting a number of kernel
thread users over to the new infrastructure. There has been a fair amount
of discussion of the kthread patches, which has resulted in some
significant changes. Whether this code will get into the 2.6 kernel
remains to be seen, however.
Greg Kroah-Hartman has, it seems, received a fair amount of email from
devfs users, many of whom are not pleased with the fact that devfs has been
marked "deprecated" in 2.6. Never mind that Greg didn't do that... But Greg
is the primary author of udev, which is intended to replace devfs in the
future. With the intent of cutting down on hate mail, Greg has posted a
lengthy diatribe on why, he thinks, the udev approach is better. It's not
at all clear that his posting will succeed in that goal, but it does make
the current thinking (accepted by most kernel developers, it seems) clearer.
The posting also inspired a lengthy thread on the meaning of Linux device
numbers and how they will be handled in the future. For starters, we now
have Linus's explanation of why he chose to
expand the device number type to 32 bits, rather than the expected 64:
Note that one reason I didn't much like the 64-bit versions is that
not only are they bigger, they also encourage insanity. Ie you'd
find SCSI people who want to try to encode
device/controller/bus/target/lun info into the device number.
We should resist any effort that makes the numbers "mean"
something. They are random cookies. Not "unique identifiers", and
not "addresses".
Linus's talk of "random cookies" set off some alarms from developers who
foresee a world where devices could have different numbers every time the
system boots. Linus's response was unrepentant; he claims that
(1) that world already exists, and (2) attempts to create
relatively stable device numbers just encourage applications to depend on
those numbers not changing, and thus create bugs.
Anybody who has plugged two similar USB devices into the same system has
already experienced one kind of device number instability. The kernel will
assign numbers based on the order in which it discovers the devices; that
order depends on a number of things, including, simply, which device was
plugged in first. There is no way in the general case to provide stable
numbers for this sort of hot-pluggable device. Other devices, such as
iSCSI disks, are even worse. Discovering all of the available devices can
be a challenge by itself; there is no way that this discovery will happen
in a predictable order.
For many kinds of devices, then, variable device numbers are simply a fact
of life. Linus therefore argues that it is better not to
even try to keep numbers stable:
Basically, if you cannot 100% guarantee reproducibility (and nobody
can, not your hashes, not anything else), then the _appearance_ of
reproducibility is literally a mistake. Because it ends up being a
bug waiting to happen - and one that is very very hard to reproduce
on a developer machine.
To bring that point home, Linus has raised an idea that Greg has presented
a few times in the past: making all device numbers random. This change
would quickly flush out any code which made assumptions about device
numbers, whether it be in the kernel or in user space. Of course, random
device number assignment is a feature for a development kernel; Linus acknowledges that, "for simple politeness
reasons," device numbers should be kept as stable as possible in stable
kernel releases.
In any case, the point of all this is not to confuse users about the
organization of their system. But, in a world where device numbers can
offer no real clues about the hardware on a computer, something else needs
to create stable names by which devices can be identified. That, of
course, is the purpose of tools like udev. As a way of showing how
flexible udev can be, Greg posted a brief
script which makes each CD drive available under the name of the disc
currently inside it (as obtained from CDDB).
This scheme is unlikely to become part of any major
distribution in the near future, but it does show how elaborate device
naming can be. For some sorts of devices, a conversation with a remote
server may well be part of the naming process. As naming gets more
complex, it becomes increasingly clear that it simply cannot be done in the
kernel.
That, of course, is one of the main objections to devfs - the naming policy
is implemented entirely in kernel space. The udev approach moves that
policy back out to user space, where it can be easily changed and
extended. The remaining devfs users will want to look at switching over,
but there is no particular hurry; Andrew Morton has made it clear that devfs will continue to be
supported through the lifetime of 2.6 and, possibly, beyond.
Some months ago, Jens Axboe posted a "laptop mode" patch for the 2.4
kernel. That patch had never been ported forward to 2.6, until now. Bart
Samwel has picked up the laptop mode baton and posted several versions of a
2.6 patch; the latest, as of this writing, is
version 6.
The purpose of the patch is to allow laptop users to get the greatest
amount of time out of their batteries by minimizing the time the disk
spends spinning. Any Linux conference attendee who has ever lost the race
for the available power outlets can't help but appreciate this idea.
To keep the disk idle, the patch (along with an associated script) changes
system behavior in the following ways:
- The amount of time the system is willing to wait before writing dirty
pages to disk is expanded to ten minutes. As a result, laptop mode
users risk losing up to ten minutes' worth of work, but that is a risk
many will be willing to take.
- Any ext3 or ReiserFS filesystems will be remounted with a commit
period of ten minutes.
- Background writeback of dirty pages, normally done when the disk is
not busy doing anything else, is disabled.
- When something does force the disk to spin up, the system writes out
all dirty pages regardless of how long they have been in memory. In
this way, the kernel tries to accomplish all the work it can during
the brief time that the disk is spinning.
There is also a separate, optional mode which creates a log
message every time a process forces some disk activity. This feature is
useful for solving those "why is the disk spinning up" mysteries.
An older version of the laptop mode patch is currently in the 2.6.1-rc1-mm2
tree, which suggests that it may yet find its way into a 2.6 kernel.
Thousands of power-starved laptop users will be grateful.
Patches and updates
Kernel trees
- Linus Torvalds: 2.6.1-rc1. (December 31, 2003)
- Andrew Morton: 2.6.0-mm2. (December 29, 2003)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet