Kernel development

Brief items

Kernel release status

The current 2.6 kernel is 2.6.0. Linus released the second 2.6.1 release candidate on January 6 without an announcement; the (relatively small) list of changes can be seen in the long-format changelog. Previously, 2.6.1-rc1 (announcement, changelog) had been released on December 31. It included quite a few fixes, along with a couple of internal API changes (see below), the restoration of the old /proc/pid/maps formatting, the ability to compile with -Os on embedded systems, message signaled interrupt support (covered here last August), and extensible firmware interface (EFI) support.

Linus's BitKeeper tree contains a very small number of fixes added since 2.6.1-rc2 came out.

The latest tree from Andrew Morton is 2.6.1-rc1-mm2. Recent additions of interest include the laptop mode patch (see below), a mechanism for rate-limiting printk() messages, a number of architecture updates, and a great many fixes.

The current 2.4 kernel is 2.4.24, released by Marcelo on January 5. Unusually, Marcelo deferred the patches in the 2.4.24 prepatches and released a kernel containing only the mremap() and RTC security fixes and a couple of other small repairs. The previous 2.4.24 prepatches have been reissued (with the addition of some ext2/ext3 filesystem updates, a number of architecture updates, and various other fixes) as 2.4.25-pre4.

Kernel development news

Subverting mremap()

The mremap() system call allows a user process to make changes to an existing memory mapping. This call, as exported by the C library, allows changing the size of a mapped region. The underlying call provided by the kernel, however, has an extra parameter which can be used to request that the entire region be moved to a different virtual address. That capability is rarely used, but it turns out to be the key to a new kernel exploit.
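As a rough illustration of that extra parameter, the sketch below asks the kernel directly (via syscall()) to move a one-page anonymous mapping onto an address which was reserved beforehand. The function name move_one_page() is made up for this example, the length is deliberately non-zero, and the fallback flag definitions simply restate the kernel ABI values; it is a minimal sketch of the "move" capability, not a recommended way to manage mappings.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* syscall() and MAP_ANONYMOUS in strict modes */
#endif
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Kernel ABI flag values, defined here only if the headers did not. */
#ifndef MREMAP_MAYMOVE
#define MREMAP_MAYMOVE 1
#endif
#ifndef MREMAP_FIXED
#define MREMAP_FIXED 2
#endif

/*
 * Hypothetical helper: move a one-page mapping to a chosen address using
 * the kernel's five-argument mremap(). Returns 0 if the data survived the
 * move and landed at the requested address, -1 otherwise.
 */
int move_one_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (old == MAP_FAILED)
        return -1;

    /* Reserve a destination so the fixed address is known to be ours. */
    char *target = mmap(NULL, len, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (target == MAP_FAILED) {
        munmap(old, len);
        return -1;
    }

    strcpy(old, "moved");

    /* The fifth argument is the new virtual address. */
    long ret = syscall(SYS_mremap, old, len, len,
                       (unsigned long)(MREMAP_MAYMOVE | MREMAP_FIXED), target);
    if (ret == -1) {
        munmap(old, len);
        munmap(target, len);
        return -1;
    }

    char *moved = (char *)ret;
    int ok = (moved == target) && strcmp(moved, "moved") == 0;
    munmap(moved, len);
    return ok ? 0 : -1;
}
```

The mapping previously found at the destination is replaced by the moved one, which is why the example reserves that address itself rather than picking one arbitrarily.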

The code implementing mremap() makes several checks to ensure that the calling process is not trying to do anything overly strange. The kernel developers forgot to check, however, whether the user has asked to remap a zero-length memory region. In that case, the code does the wrong thing, and creates a new memory area with a length of zero at the requested address. Since numerous places in the virtual memory subsystem code assume that zero-length VM areas do not exist, the creation of such an area is, in effect, a corruption of the kernel's virtual memory data structures.

A zero-length virtual memory area is not, by itself, directly usable by an attacker; since it does not actually cover any memory, it cannot be used to access a memory range which should be off-limits to the process. Things go wrong when the kernel makes a pass over a process's entire virtual address space. For example, the fork() system call must copy the process's memory space. The code which does that copying is built (in a complicated way) around a do loop; since the body of a do loop always runs at least once, the code assumes that each virtual memory area contains at least one page. As a result, when it encounters a zero-length area, it copies page table information which does not actually exist.
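The loop-shape problem can be shown with a small user-space model. This is not the kernel's code: struct model_area, MODEL_PAGE_SIZE, and both function names are invented for illustration. It only demonstrates why a do loop visits one page too many when an area is empty.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* illustrative page size */

/* Hypothetical stand-in for a VM area covering [start, end). */
struct model_area {
    unsigned long start;
    unsigned long end;
};

/* Walks the area with a do loop: the body runs before the bounds check. */
size_t pages_visited_do_loop(const struct model_area *a)
{
    size_t visited = 0;
    unsigned long addr = a->start;

    do {
        visited++;              /* "copy" one page table entry */
        addr += MODEL_PAGE_SIZE;
    } while (addr < a->end);
    return visited;
}

/* Walks the area with the bounds check first. */
size_t pages_visited_checked(const struct model_area *a)
{
    size_t visited = 0;

    for (unsigned long addr = a->start; addr < a->end; addr += MODEL_PAGE_SIZE)
        visited++;
    return visited;
}
```

For an area with start equal to end, the first function reports one page visited and the second reports none; for any area of at least one page, the two agree.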

The situation is complicated by the fact that mremap() is happy to create this zero-length area just above the end of the virtual address range allocated to user space--at the beginning of kernel space, in other words. When fork() tries to copy the page table information for that area, it can get tangled up in the special large page table entries used for the kernel. The result is a mess.

What will usually happen (as people who have tried an exploit posted on Bugtraq have found out) is that the system panics and reboots. It is not clear to many people who have looked at the problem (including Linus) that this bug can be exploited for anything other than a denial of service attack. It is worth noting, however, that the advisory posted by Paul Starzetz claims:

Proper exploitation of this vulnerability may lead to local privilege escalation including execution of arbitrary code with kernel level access. Proof-of-concept exploit code has been created and successfully tested giving UID 0 shell on vulnerable systems.... We have identified at least two different attack vectors for the 2.4 kernel series.

It would not be a good idea to wait and see whether these claims are borne out. Prudent administrators will upgrade to the 2.4.24 kernel, or apply the update provided by their distributor. (The 2.6.0 kernel is also vulnerable; the fix can be found in the 2.6.1-rc2 release.)

Two API changes in 2.6

The kernel developers usually try to keep the internal kernel programming interface unchanged over the course of a stable kernel series. There are never any guarantees, however, and things can change at any time. Experience has shown, in particular, that internal APIs can take a little while to stabilize after a new stable series begins. The 2.6 kernel looks like it will follow this pattern; a couple of small changes have already found their way into the code base.

The first is a simple addition:

    int can_request_irq(unsigned int irq, unsigned long flags);

This function will return a non-zero value if an attempt to request the given interrupt number (possibly shared, as directed by flags) would succeed. It is intended to be used in situations where multiple interrupt numbers could be used and the code would like to find an idle one. There are, of course, no guarantees; a kernel routine could get a positive result from can_request_irq(), but find that somebody else had slipped in and allocated the requested interrupt immediately thereafter. As of this writing, can_request_irq() is not exported to modules and is not supported by all architectures.
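A probing loop built on this function might look something like the following sketch. To keep it runnable outside the kernel, can_request_irq() is replaced here by a user-space stand-in with the same prototype (it pretends that IRQs 5 and 7 are taken); pick_idle_irq() is a hypothetical helper, not part of any kernel API.

```c
#include <stddef.h>

/*
 * User-space stand-in for the kernel function, for illustration only.
 * It pretends that IRQs 5 and 7 are already in use.
 */
static int can_request_irq(unsigned int irq, unsigned long flags)
{
    (void)flags;
    return irq != 5 && irq != 7;
}

/*
 * Hypothetical helper: return the first candidate which looks free,
 * or -1 if none does. The answer is only a hint; the later request
 * can still fail and must be checked.
 */
int pick_idle_irq(const unsigned int *candidates, size_t count,
                  unsigned long flags)
{
    for (size_t i = 0; i < count; i++) {
        if (can_request_irq(candidates[i], flags))
            return (int)candidates[i];
    }
    return -1;
}
```

Because of the race described above, real code would still have to check the result of the subsequent request_irq() call and, perhaps, move on to the next candidate on failure.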

The other change has the potential to create minor trouble for some external modules. Code which implements virtual memory areas (to allow device memory to be mapped into user space, for example) usually provides a nopage() function to handle page faults. The prototype for that function in 2.4.x and 2.6.0 is:

    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address,
                           int unused);

As of 2.6.1, the unused argument is no longer unused, and the prototype has changed to:

    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address,
                           int *type);

The type argument is now used to return the type of the page fault. VM_FAULT_MINOR indicates a minor fault: one where the page was in memory, and all that was needed was a page table fixup. VM_FAULT_MAJOR indicates, instead, that the page had to be fetched from disk. Driver code using nopage() to implement a device mapping would probably set the type to VM_FAULT_MINOR. In-tree code checks whether type is NULL before assigning the fault type; other users would be well advised to do the same.

Making module code compile cleanly will require changing the prototype of the nopage() function, of course.
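A handler written against the new prototype could be shaped roughly like the sketch below. The structure definitions and the VM_FAULT_* values are user-space stand-ins so that the example compiles on its own (kernel headers supply the real ones), and sketch_nopage() and SKETCH_PAGE_SHIFT are hypothetical names; the part worth copying is the NULL check before the fault type is stored.

```c
#include <stddef.h>

/* Stand-ins for kernel definitions; the values are placeholders. */
struct page { int placeholder; };
struct vm_area_struct {
    unsigned long vm_start;
    unsigned long vm_end;
    void *vm_private_data;      /* here: an array of struct page */
};
#define VM_FAULT_MINOR 1
#define VM_FAULT_MAJOR 2
#define SKETCH_PAGE_SHIFT 12    /* assumes 4096-byte pages */

/*
 * Hypothetical nopage() method using the 2.6.1 prototype. NULL stands
 * for failure in this model. A real driver would also take a reference
 * on the page before returning it.
 */
struct page *sketch_nopage(struct vm_area_struct *area,
                           unsigned long address,
                           int *type)
{
    struct page *pages = area->vm_private_data;

    if (address < area->vm_start || address >= area->vm_end)
        return NULL;

    /* The pages are already resident, so this is a minor fault. */
    if (type)
        *type = VM_FAULT_MINOR;

    return &pages[(address - area->vm_start) >> SKETCH_PAGE_SHIFT];
}
```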

As always, the Driver Porting Series has been updated to reflect these changes.

Kernel threads made easy

It is fairly common for kernel code to create lightweight processes - kernel threads - which perform a certain task asynchronously. To see these threads, run ps ax on a 2.6 kernel and note all of the processes in [square brackets] at the beginning of the listing. The code which sets up these threads has tended to be reimplemented every time a new thread is needed, however, and certain tasks (ensuring that the environment is clean, for example) are not always handled well. The current kernel also does not easily allow the creator of a kernel thread to control the behavior of that thread.

Rusty Russell encountered even more trouble as he was doing his "hotplug CPU" work: when processors can come and go, their associated kernel threads must be started or stopped at arbitrary times. To make his life easier, he implemented a new set of kernel thread primitives which simplify the task greatly.

Using the new mechanism, the first step in creating a kernel thread is to define a "thread function" which will contain the code to be executed; it has a prototype like:

    int thread_function(void *data);

The function will be called repeatedly (if need be) by the kthread code; it can perform whatever task it is designated to do, sleeping when necessary. This function should, however, check its signal status and return if any signals are pending.

A kernel thread is created with:

    struct task_struct *kthread_create(int (*threadfn)(void *data),
                                       void *data,
                                       const char *namefmt, ...);

The data argument will simply be passed to the thread function. A standard printk()-style formatted string can be used to name the thread. The thread will not start running immediately; to get the thread to run, pass the task_struct pointer returned by kthread_create() to wake_up_process().

There is also a convenience function which creates and starts the thread:

    struct task_struct *kthread_run(int (*threadfn)(void *data),
                                    void *data,
                                    const char *namefmt, ...);

Once started, the thread will run until it explicitly calls do_exit(), or until somebody calls kthread_stop():

    int kthread_stop(struct task_struct *thread);

kthread_stop() works by sending a signal to the thread. As a result, the thread function will not be interrupted in the middle of some important task. But, if the thread function never returns and does not check for signals, it will never actually stop.
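The control flow described above can be modeled in user space without any real threads. Everything in the sketch below is invented for illustration (the model_ names, the stop_pending flag standing in for the pending signal, and the stop_after test hook); it is not Rusty's code, and it says nothing about how the patch implements these steps. It only shows a thread function which is called repeatedly, finishes its current unit of work, and gives up once a stop has been requested.

```c
#include <stdbool.h>

/* Hypothetical per-thread state for the model. */
struct model_thread {
    bool stop_pending;   /* stands in for the signal sent by kthread_stop() */
    int  units_done;     /* units of work completed so far */
    int  stop_after;     /* test hook: when the stop request arrives */
};

/* Models a stop request arriving from some other context. */
void model_kthread_stop(struct model_thread *t)
{
    t->stop_pending = true;
}

/* Thread function in the shape given above: int fn(void *data). */
int model_thread_fn(void *data)
{
    struct model_thread *t = data;

    if (t->stop_pending)
        return 0;                   /* stop requested: do no more work */

    t->units_done++;                /* one uninterrupted unit of work */

    /* Pretend somebody asked us to stop while we were busy. */
    if (t->units_done >= t->stop_after)
        model_kthread_stop(t);
    return 0;
}

/* Models the core loop: call the function until a stop is pending. */
int model_kthread_core(int (*threadfn)(void *data), void *data,
                       const bool *stop_pending)
{
    int ret = 0;

    while (!*stop_pending)
        ret = threadfn(data);
    return ret;
}
```

The stop request takes effect only between calls to the thread function, which is the sense in which the function is never interrupted in the middle of its work. A function which never returns would, in this model as in the real thing, never stop.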

Kernel threads are often created to run on a particular processor. To achieve this effect, call kthread_bind() after the thread is created:

    void kthread_bind(struct task_struct *thread, int cpu);

Rusty's patch includes a set of changes converting a number of kernel thread users over to the new infrastructure. There has been a fair amount of discussion of the kthread patches, which has resulted in some significant changes. Whether this code will get into the 2.6 kernel remains to be seen, however.

The future of device numbers

Greg Kroah-Hartman has, it seems, received a fair amount of email from devfs users, many of whom are not pleased with the fact that devfs has been marked "deprecated" in 2.6. Never mind that Greg didn't do that... But Greg is the primary author of udev, which is intended to replace devfs in the future. With the intent of cutting down on hate mail, Greg has posted a lengthy diatribe on why, he thinks, the udev approach is better. It's not at all clear that his posting will have succeeded in that goal, but it does make the current thinking (accepted by most kernel developers, it seems) clearer.

The posting also inspired a lengthy thread on the meaning of Linux device numbers and how they will be handled in the future. For starters, we now have Linus's explanation of why he chose to expand the device number type to 32 bits, rather than the expected 64:

Note that one reason I didn't much like the 64-bit versions is that not only are they bigger, they also encourage insanity. Ie you'd find SCSI people who want to try to encode device/controller/bus/target/lun info into the device number.

We should resist any effort that makes the numbers "mean" something. They are random cookies. Not "unique identifiers", and not "addresses".

Linus's talk of "random cookies" set off some alarms from developers who foresee a world where devices could have different numbers every time the system boots. Linus's response was unrepentant; he claims that (1) that world already exists, and (2) attempts to create relatively stable device numbers just encourage applications to depend on those numbers not changing, and thus create bugs.

Anybody who has plugged two similar USB devices into the same system has already experienced one kind of device number instability. The kernel will assign numbers based on the order in which it discovers the devices; that order depends on a number of things, including, simply, which device was plugged in first. There is no way in the general case to provide stable numbers for this sort of hot-pluggable device. Other devices, such as iSCSI disks, are even worse. Discovering all of the available devices can be a challenge by itself; there is no way that this discovery will happen in a predictable order.
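A toy model makes the ordering problem concrete. Nothing here reflects how the kernel actually allocates numbers; struct model_registry, both functions, and the "serial" strings are hypothetical. Numbers are simply handed out in discovery order, so the same device ends up with a different number when the order changes, while a lookup by a stable attribute keeps working.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_DEVICES 8

/* Hypothetical registry: slot index doubles as the device number. */
struct model_registry {
    const char *serial[MODEL_MAX_DEVICES];
    size_t count;
};

/* Hand out the next number in discovery order; -1 if the table is full. */
int model_discover(struct model_registry *r, const char *serial)
{
    if (r->count >= MODEL_MAX_DEVICES)
        return -1;
    r->serial[r->count] = serial;
    return (int)r->count++;
}

/* Find a device by a stable attribute rather than by its number. */
int model_number_of(const struct model_registry *r, const char *serial)
{
    for (size_t i = 0; i < r->count; i++) {
        if (strcmp(r->serial[i], serial) == 0)
            return (int)i;
    }
    return -1;
}
```

Code which remembers "device 0" across two runs with different plug-in orders will find the wrong device the second time; code which asks for the device by its attribute each time will not.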

So, for many kinds of devices, variable device numbers are simply a fact of life. Thus, says Linus, it is better not to even try to keep numbers stable.

Basically, if you cannot 100% guarantee reproducibility (and nobody can, not your hashes, not anything else), then the _appearance_ of reproducibility is literally a mistake. Because it ends up being a bug waiting to happen - and one that is very very hard to reproduce on a developer machine.

To bring that point home, Linus has raised an idea that Greg has presented a few times in the past: making all device numbers random. This change would quickly flush out any code which made assumptions about device numbers, whether it be in the kernel or in user space. Of course, random device number assignment is a feature for a development kernel; Linus acknowledges that, "for simple politeness reasons," device numbers should be kept as stable as possible in stable kernel releases.

In any case, the point of all this is not to confuse users about the organization of their system. But, in a world where device numbers can offer no real clues about the hardware on a computer, something else needs to create stable names by which devices can be identified. That, of course, is the purpose of tools like udev. As a way of showing how flexible udev can be, Greg posted a brief script which makes CD drives available by the name of the disk (as obtained from CDDB) currently inside. This scheme is unlikely to become part of any major distribution in the near future, but it does show how elaborate device naming can be. For some sorts of devices, a conversation with a remote server may well be part of the naming process. As naming gets more complex, it becomes increasingly clear that it simply cannot be done in the kernel.

That, of course, is one of the main objections to devfs - the naming policy is implemented entirely in kernel space. The udev approach moves that policy back out to user space, where it can be easily changed and extended. The remaining devfs users will want to look at switching over, but there is no particular hurry; Andrew Morton has made it clear that devfs will continue to be supported through the lifetime of 2.6 and, possibly, beyond.

Laptop mode for 2.6

Some months ago, Jens Axboe posted a "laptop mode" patch for the 2.4 kernel. That patch had never been ported forward to 2.6, until now. Bart Samwel has picked up the laptop mode baton and posted several versions of a 2.6 patch; the latest, as of this writing, is version 6.

The purpose of the patch is to allow laptop users to get the greatest amount of time out of their batteries by minimizing the time the disk spends spinning. Any Linux conference attendee who has ever lost the race for the available power outlets can't help but appreciate this idea. To keep the disk idle, the patch (along with an associated script) changes system behavior in the following ways:

  • The amount of time the system is willing to wait before writing dirty pages to disk is expanded to ten minutes. As a result, laptop mode users risk losing up to ten minutes' worth of work, but that is a risk many will be willing to take.

  • Any ext3 or ReiserFS filesystems will be remounted with a commit period of ten minutes.

  • Background writeback of dirty pages, normally done when the disk is not busy doing anything else, is disabled.

  • When something does force the disk to spin up, the system writes out all dirty pages regardless of how long they have been in memory. In this way, the kernel tries to accomplish all the work it can during the brief time that the disk is spinning.
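The writeback policy in the list above boils down to a simple decision, modeled here in user space. This is a sketch of the described behavior only, not code from the patch; the function name and the constant are invented, with the ten-minute figure taken from the first item.

```c
#include <stdbool.h>

/* Ten minutes, per the list above; the name is hypothetical. */
#define MODEL_MAX_DIRTY_AGE_SECS 600U

/*
 * Decide whether a dirty page should be written now.
 *   dirty_age_secs: how long the page has been dirty
 *   disk_spinning:  whether something has already forced a spin-up
 */
bool model_should_write_page(unsigned int dirty_age_secs, bool disk_spinning)
{
    /* The disk is up anyway: flush everything, however young. */
    if (disk_spinning)
        return true;

    /* Otherwise leave the disk alone until the page is too old. */
    return dirty_age_secs >= MODEL_MAX_DIRTY_AGE_SECS;
}
```

A page dirtied ten seconds ago thus stays in memory while the disk is idle, but goes out immediately if some other access spins the disk up.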

A separate mode can also be enabled which logs a message every time a process forces some disk activity. This feature is useful for solving those "why is the disk spinning up" mysteries. An older version of the laptop mode patch is currently in the 2.6.1-rc1-mm2 tree, which suggests that it may yet find its way into a 2.6 kernel. Thousands of power-starved laptop users will be grateful.

Patches and updates

Kernel trees

  • Linus Torvalds: 2.6.1-rc1. (December 31, 2003)
  • Andrew Morton: 2.6.0-mm2. (December 29, 2003)

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds