
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.16-rc3, released on February 12. As would be expected for this phase of the development cycle, the additions are mostly fixes, but 2.6.16-rc3 also contains a patch to export the system's CPU topology in sysfs, parallel port support for SGI O2 systems, administrator-changeable permissions in configfs, an OCFS2 update, the unshare() system call, and various architecture updates. See the long-format changelog for the details.

The mainline git repository, as of this writing, holds about 100 fixes merged after 2.6.16-rc3.

The current -mm tree is 2.6.16-rc3-mm1. Recent changes to -mm include some memory management system call tweaks (see below), the EXPORT_SYMBOL_GPL_FUTURE() macro (see below), and various fixes.

The current stable 2.6 release was announced on February 10. This one contains a fair number of fixes for crashes and other undesirable behavior.


Kernel development news


EXPORT_SYMBOL_GPL_FUTURE()

The kernel has a couple of macros for making internal symbols available to loadable modules:

    EXPORT_SYMBOL(symbol);
    EXPORT_SYMBOL_GPL(symbol);

The second form is used for kernel symbols which are only available to modules with a GPL-compatible license. The idea behind GPL-only symbols is that they are so deeply internal to the kernel that any module using them can only be a derived product of the kernel. Either that, or it's a relatively new symbol whose creator simply wanted it to be GPL-only.

Greg Kroah-Hartman has recently proposed a new variant:

    EXPORT_SYMBOL_GPL_FUTURE(symbol);

Its purpose would be to mark symbols which will be changed to a GPL-only export at some point in the future. If such a symbol is used by a non-GPL module, the kernel will emit warnings to the effect that the module will break at a future time. With luck, the warnings will help authors of proprietary modules prepare for changes ahead of time.
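
The policy described above can be modeled in user space. The following C sketch is not kernel code: the types, the resolve_symbol() function, and the symbol names used with it are all hypothetical. It assumes only the behavior stated in the text, namely that GPL-only symbols are refused to non-GPL modules while "future" symbols still resolve but come with a warning.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* How a symbol was exported; one value per macro (names are illustrative). */
enum export_kind { EXPORT_PLAIN, EXPORT_GPL, EXPORT_GPL_FUTURE };

/* Outcome of a module asking for a symbol. */
enum resolve_result {
    RESOLVE_OK,        /* symbol handed over silently */
    RESOLVE_WARN,      /* handed over, but a warning would be logged */
    RESOLVE_REFUSED,   /* GPL-only symbol, non-GPL module */
    RESOLVE_NOT_FOUND  /* no such export */
};

struct exported_symbol {
    const char *name;
    enum export_kind kind;
};

/* Look up a symbol by name and apply the license policy from the text. */
enum resolve_result resolve_symbol(const struct exported_symbol *table,
                                   size_t count, const char *name,
                                   int module_is_gpl)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) != 0)
            continue;
        /* GPL-compatible modules may use every kind of export. */
        if (module_is_gpl || table[i].kind == EXPORT_PLAIN)
            return RESOLVE_OK;
        /* Non-GPL module, symbol scheduled to become GPL-only later. */
        if (table[i].kind == EXPORT_GPL_FUTURE)
            return RESOLVE_WARN;
        return RESOLVE_REFUSED;
    }
    return RESOLVE_NOT_FOUND;
}
```

With a table holding one symbol of each kind, a non-GPL module would get RESOLVE_OK, RESOLVE_REFUSED, and RESOLVE_WARN respectively, while a GPL module would get RESOLVE_OK for all three.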

This patch raised a few eyebrows. When GPL-only exports were first added to the kernel, they went in with the understanding that only new symbols would be tagged GPL-only. The current module interface - while always subject to change - was not to have symbols withdrawn arbitrarily. So, if the export status of symbols should not change, what is the use of this patch? Greg has a couple of uses in mind:

  • The read-copy-update symbols are due to turn GPL-only in April of this year. The use of RCU by non-GPL modules has always been legally problematic: RCU is a patented technique which has been licensed by IBM for use in GPL code. Non-GPL modules will (in the absence of other arrangements) lack a license for RCU, and thus should not be using those symbols anyway.

  • Current plans call for making the core USB subsystem GPL-only in early 2008. The argument here is that this subsystem has changed greatly over time, and that it is possible to create full-speed USB drivers in user space.

It is not clear that there will be uses beyond these; resistance to a larger-scale restricting of exported symbols remains strong. So the weapon of choice for those who wish to make life difficult for proprietary modules is likely to remain the combination of API changes and restrictions on new symbols.


Tweaks to madvise() and posix_fadvise()

A couple of Linux-specific additions to the memory-related system call API have recently found their way into the -mm tree. There is a bit of pressure to get them into 2.6.16, though that may be unlikely at this late date. This may be a good time to look at the proposed changes, however, along with the pressures which motivated them.

Prepare yourself, as your editor is about to inflict his primitive drawing skills upon the world again. Consider a situation which, with some imagination, could be described by the accompanying diagram. A process has a particular memory page of interest, pointed to by a page table entry. That process has arranged with a device driver to exchange data through this page; as a result, the driver has a pointer to the associated page structure, possibly obtained with get_user_pages(). At this stage, all is working well.

[Diagram]

But then the process decides to reproduce. The resulting call to fork() has a number of consequences beyond the creation of a child process. That call will attempt to avoid copying the parent process's memory since, for much of the memory range, there is unlikely to ever be a reason to do so. Instead, both parent and child will be set up with page table entries pointing to the same physical page in memory, but that page will now be write protected. As long as neither process attempts to write to the page, the situation can remain as shown in the diagram. Both processes - and the driver - can share the same physical page. If either process calls fork() again, the result will be a third process also sharing that page, and so on. Often, no process will attempt to write to the page for as long as it is in this shared state, and no copy will ever have to be performed.

[Diagram]

Life is not always so easy, however. If the parent process makes a change to the page - writing some new data to be passed through to the driver, for example - the hardware will trap the write attempt. The kernel will respond by allocating a new page, copying the old page's contents there, and pointing the parent process's page table entry to the new, write-enabled page. At that point, the write attempt can go forward, and everybody will be happy.

[Diagram]

Or maybe not. The copy-on-write operation described above will break the parent process's connection with the old page. But there is no way to inform the driver of that change. The result will be the situation shown in the diagram: the driver retains a reference to the page which now belongs exclusively to the child process(es). The parent process and the driver will no longer be able to communicate with each other. Additionally, if the parent had used mlock() to lock the original page into memory, that lock, too, will remain with the original page. The page which the parent had thought was pinned into RAM will become pageable, with potentially bad effects on performance and security.

One could try to address this problem by changing the copy-on-write logic to always maintain the connection between the parent process and its original pages. That would require the COW code to find any other processes with references to the page and assign the copied page to them. Such a change would slow the code and invite interesting race conditions; remember that there could be a large number of other processes with references to the page. So the solution proposed by Michael Tsirkin takes a different approach.

If a process has pages which it has locked into memory or set up to be shared with a device driver, chances are that it never wants its children to have access to that memory in the first place. So Michael's patch adds a couple of new flags to the madvise() system call. A process with special memory can call madvise() with the new MADV_DONTFORK flag; the kernel will respond by setting the VM_DONTCOPY flag in the associated virtual memory area structure; thereafter, any newly-created child process simply will not see that part of the address space. There is also a MADV_DOFORK flag which cancels the effect of MADV_DONTFORK.
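
A minimal user-space sketch of the effect, assuming a Linux system whose kernel and C library headers provide MADV_DONTFORK: the helper name child_can_see_page() is made up for this example. It maps one anonymous page, optionally marks it with MADV_DONTFORK, forks, and reports whether the child could read the page or died with SIGSEGV when touching it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child can read the parent's page, 0 if the child
 * is killed by SIGSEGV when touching it, and -1 on any setup failure.
 */
int child_can_see_page(int dont_fork)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    size_t len = (size_t)page_size;

    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    volatile char *page = mem;
    page[0] = 'x';

    /* Ask the kernel not to copy this range into future children. */
    if (dont_fork && madvise(mem, len, MADV_DONTFORK) != 0) {
        munmap(mem, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: with MADV_DONTFORK this address is simply not mapped. */
        signal(SIGSEGV, SIG_DFL);
        _exit(page[0] == 'x' ? 0 : 1);
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid) {
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = 1;
        else if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            result = 0;
    }
    munmap(mem, len);
    return result;
}
```

On a kernel with this feature, child_can_see_page(0) should report that the child saw the page, and child_can_see_page(1) should report that it faulted. The parent's own mapping is untouched in both cases, so the copy-on-write split described earlier never happens for the marked range.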

Meanwhile, another change found in current -mm came as a result of a complaint about the behavior of the msync() system call, which is used to flush modified parts of a memory-mapped file back to disk. In particular, the complainer, whose real name is unclear, noticed that msync() changed its semantics between 2.4 and 2.6. In the 2.4 kernel, a call to msync(..., MS_ASYNC) would mark the indicated memory range as being dirty and begin the process of writing those pages to disk. In 2.6, by contrast, no I/O is started directly from msync(); the pages simply remain dirty in the page cache until the virtual memory subsystem gets around to flushing them out.

The original complainer asked that the old behavior be restored, but that seems unlikely to happen. For most workloads, the best performance is achieved by letting the kernel decide just when to write each part of the file back to disk. But there was also some recognition that an option to start I/O immediately (without necessarily waiting for it) would be a useful thing in some situations. The answer, as implemented by Andrew Morton, leaves the msync() call alone, however; instead, Andrew has added a couple of new options to the posix_fadvise() system call:

  • LINUX_FADV_ASYNC_WRITE will start write I/O on the given range of pages. If some of those pages are already under I/O, the operation will not be restarted, leaving open the possibility that late changes might not make it to disk.

  • LINUX_FADV_WRITE_WAIT will wait for any I/O currently in progress on the given range of pages, but does not actually start any I/O.

In practice, these calls will often need to be made in combination. An application which needs to assure itself that all modified pages are on disk must perform three calls in order: a wait call (thus ensuring that all pages under I/O are written), then a write call (to start I/O on remaining dirty pages), and finally a second wait call (to allow that I/O to complete). But any application wanting the 2.4 msync() behavior can get it with a single LINUX_FADV_ASYNC_WRITE call.
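
To see why the wait/write/wait ordering matters, here is a toy user-space model in C. It makes no system calls and is not the -mm patch: struct page_state and the model_*() functions are invented for illustration, and they encode only the semantics described above (the write operation skips pages already under I/O; the wait operation starts nothing).

```c
#include <assert.h>
#include <stddef.h>

/* State of one page in the model. */
struct page_state {
    int dirty;    /* holds changes not yet submitted for writing */
    int under_io; /* a write of an earlier version is in flight */
};

/* Models LINUX_FADV_ASYNC_WRITE: start I/O on dirty pages, but leave
 * pages that are already under I/O alone. */
void model_async_write(struct page_state *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty && !pages[i].under_io) {
            pages[i].dirty = 0;
            pages[i].under_io = 1;
        }
    }
}

/* Models LINUX_FADV_WRITE_WAIT: wait for in-flight I/O, start nothing. */
void model_write_wait(struct page_state *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        pages[i].under_io = 0;
}

/* True when no page is dirty and no write is still in flight. */
int model_all_on_disk(const struct page_state *pages, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pages[i].dirty || pages[i].under_io)
            return 0;
    }
    return 1;
}
```

Consider a page that was modified again while an earlier write was still in flight (dirty and under I/O at once). In this model a lone write followed by a wait leaves that page dirty, while the full wait, write, wait sequence leaves every page clean.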

Chances are good that both of these changes could land in the mainline in the 2.6.17 time frame.


Robust futexes - a new approach

One of the many features added during the 2.5 development series was the "futex" - a sort of fast, user-space mutual exclusion primitive. In the non-contended case, futexes can be obtained and released with no kernel involvement at all, making them quite fast. When contention does happen (one process tries to obtain a futex currently owned by another), the kernel is called in to queue any waiting processes and wake them up when the futex becomes available. When queueing is not needed, however, the kernel maintains no knowledge of the futex, keeping its overhead low.
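
As a rough illustration of that fast path, the C sketch below implements a lock around a three-state word (0 = free, 1 = held, 2 = held with possible waiters). This is a commonly described design, not the way glibc implements its mutexes; futex_lock(), futex_unlock(), and the futex_syscalls counter are names invented here. The only kernel interface assumed is the Linux futex system call, reached through syscall() with FUTEX_WAIT and FUTEX_WAKE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Counts trips into the kernel, to make the fast path visible. */
atomic_long futex_syscalls;

static long futex_call(atomic_int *word, int op, int val)
{
    atomic_fetch_add(&futex_syscalls, 1);
    return syscall(SYS_futex, word, op, val, NULL, NULL, 0);
}

void futex_lock(atomic_int *word)
{
    int seen = 0;

    /* Fast path: 0 -> 1 entirely in user space. */
    if (atomic_compare_exchange_strong(word, &seen, 1))
        return;

    /* Contended: mark the word as "has waiters" and sleep in the kernel
     * until an exchange finds the lock free. */
    if (seen != 2)
        seen = atomic_exchange(word, 2);
    while (seen != 0) {
        futex_call(word, FUTEX_WAIT, 2);
        seen = atomic_exchange(word, 2);
    }
}

void futex_unlock(atomic_int *word)
{
    /* Fast path: 1 -> 0 means nobody was waiting. */
    if (atomic_fetch_sub(word, 1) != 1) {
        /* There may be waiters: free the lock and wake one of them. */
        atomic_store(word, 0);
        futex_call(word, FUTEX_WAKE, 1);
    }
}
```

An uncontended futex_lock()/futex_unlock() pair leaves futex_syscalls at zero. The kernel only hears about the lock word once a second thread has to wait.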

There is one problem with keeping the kernel out of the picture, however. If a process comes to an untimely end while holding a futex, there is no way to release that futex and let other processes know about the problem. The SYSV semaphore mechanism - a much more heavyweight facility - has an "undo" mechanism which can be called into play in this sort of situation, but there is no such provision for futexes. As a result, a few different "robust futex" patches have been put together over the past years; LWN looked at one of them in January, 2004. These patches have tended to greatly increase the cost of futexes, however, and none have been accepted into the mainline.

Ingo Molnar, working with Thomas Gleixner and Ulrich Drepper, has tossed aside those years' worth of work and, in a couple of days, produced a new robust futex patch which, he hopes, will find its way into the mainline. The new patch has the advantage of being fast, but, as Ingo notes:

Be warned though - the patchset does things we normally dont do in Linux, so some might find the approach disturbing. Parental advice recommended ;-)

The fundamental problem to solve is that the kernel must, somehow, know about all futexes held by an exiting process in order to release them. A past solution has been the addition of a system call to notify the kernel of lock acquisitions and releases. That approach defeats one of the main features of futexes - their speed. It also adds a record-keeping and resource limiting problem to the kernel, and suffers from some problematic race conditions.

So Ingo's patch takes a different approach. A list of held futexes is maintained for each thread, but that list lives in user space. All the thread has to do is to make a single call to a new system call:

    long set_robust_list(struct robust_list_head *head, size_t size);

That call informs the kernel of the location of a linked list of held futexes in the calling process's address space; there is also a get_robust_list() call for retrieving that information. Typically, this call would be made by glibc, and never seen by the application. Glibc would also take on the task of maintaining the list of futexes.

When a process dies, the kernel looks for a pointer to a user-space futex list. Should that pointer be found, the kernel will carefully walk through it, bearing in mind that, as a user-space data structure, it could be accidentally or maliciously corrupt. For each held futex, the kernel will release the lock and set it to a special value indicating that the previous holder made a less-than-graceful exit. It will then wake a waiting process, if one exists. That process will be able to see that it has obtained the lock under dubious circumstances (user-space functions like pthread_mutex_lock() are able to return that information) and take whatever action it deems to be necessary. The kernel will release a maximum of one million locks; that keeps the kernel from looping forever on a circular list. Given the practical difficulties of making a million-lock application work at all, that limit should not constrain anybody for quite some time.

There is still a race condition here: if a process dies between the time it acquires a lock and when it updates the list, that lock might not be released by the kernel. Getting around that problem involves a bit of poor kernel hacker's journaling. The head of the held futex list contains a single-entry field which can be used to point to a lock which the application is about to acquire. The kernel will check that field on exit, and, if it points to a lock actually held by the application, that lock will be released with the others. So, if glibc sets that field before acquiring a lock (and clears it after the list is updated), all locks held by the application will be covered.
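
The exit-time cleanup can be sketched as a user-space model. Everything in this C example is hypothetical, including the structure layouts, the OWNER_DIED_MARK value, and the release_on_exit() function; the patch's real data structures are not shown in this article. The model assumes a lock word that holds the owner's thread ID, a NULL-terminated list, and the one-million-entry limit mentioned above. It omits the wakeup of waiters and the careful user-space memory access the kernel would need.

```c
#include <assert.h>
#include <stddef.h>

#define OWNER_DIED_MARK 0x40000000 /* hypothetical "owner died" value */
#define WALK_LIMIT 1000000L        /* at most one million list entries */

struct held_lock {
    struct held_lock *next;
    int word; /* 0 = free, otherwise the owner's thread ID */
};

struct lock_list_head {
    struct held_lock *first;   /* list of locks the thread holds */
    struct held_lock *pending; /* lock about to be acquired, or NULL */
};

/* Release one lock, but only if the exiting thread really owns it. */
static int release_if_owned(struct held_lock *lock, int tid)
{
    if (lock->word != tid)
        return 0;
    lock->word = OWNER_DIED_MARK;
    return 1;
}

/* Returns the number of locks released on behalf of thread `tid`. */
int release_on_exit(struct lock_list_head *head, int tid)
{
    int released = 0;
    long visited = 0;

    /* The single-entry "journal" covers the acquire/list-update race. */
    if (head->pending != NULL)
        released += release_if_owned(head->pending, tid);

    /* The bounded walk terminates even if the list is circular. */
    for (struct held_lock *l = head->first;
         l != NULL && visited < WALK_LIMIT; l = l->next, visited++)
        released += release_if_owned(l, tid);

    return released;
}
```

A lock named only in the pending field is released when its word shows the exiting thread as owner, and left alone otherwise. A list that loops back on itself is abandoned after WALK_LIMIT steps rather than walked forever.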

The discussion on this patch was just beginning when this article was written. There is some concern about having the kernel walking through user-space data structures; the chances of trouble and security problems are certainly higher when that is going on. Other issues may yet come up as well. But, since this is clearly not a 2.6.16 feature in any case, there will be time to talk about them.


Patches and updates

Development tools

  • Junio C Hamano: GIT 1.2.0. (February 13, 2006)

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds