The current 2.6 prepatch is 2.6.16-rc3, released
on February 12.
As would be expected for this phase of the development cycle, the additions
are mostly fixes, but 2.6.16-rc3 also contains a patch to export the
system's CPU topology in sysfs, parallel port support for SGI O2 systems,
administrator-changeable permissions in configfs, an OCFS2 update, the unshare() system call,
and various architecture updates. See the changelog
for the details.
The mainline git repository, as of this writing, holds about 100 fixes
merged after 2.6.16-rc3.
The current -mm tree is 2.6.16-rc3-mm1. Recent changes
to -mm include some memory management system call tweaks (see below), the
EXPORT_SYMBOL_GPL_FUTURE() macro (see below), and various fixes.
The current stable 2.6 release is 2.6.15.4, announced on February 10.
This one contains a fair number of fixes for crashes and other undesirable
behavior.
Kernel development news
The kernel has a couple of macros for making internal symbols available to
loadable modules: EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL().
The second form is used for kernel symbols which are only available to
modules with a GPL-compatible license. The idea behind GPL-only symbols is
that they are so deeply internal to the kernel that any module using them
can only be a derived product of the kernel. Either that, or it's a
relatively new symbol whose creator simply wanted it to be GPL-only.
Greg Kroah-Hartman has recently proposed a new variant,
EXPORT_SYMBOL_GPL_FUTURE(). Its purpose would be to mark symbols
which will be changed to a GPL-only export at some point in the future. If
such a symbol is used by a non-GPL module, the kernel will emit warnings to
the effect that the module will break at a future time. With luck, the
warnings will help authors of proprietary modules prepare for changes ahead
of time.
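The three export classes can be pictured with a small user-space model. This is only a sketch of the policy as described above, not the kernel's module loader; the symbol names, the result codes, and the resolve_symbol() helper are all made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Export classes, one per macro discussed in the text. */
enum export_class { EXPORT_PLAIN, EXPORT_GPL, EXPORT_GPL_FUTURE };

/* Hypothetical outcomes of resolving a symbol for a module. */
enum resolve_result {
    RESOLVE_OK,
    RESOLVE_OK_WITH_WARNING,
    RESOLVE_DENIED,
    RESOLVE_UNKNOWN
};

struct exported_symbol {
    const char *name;
    enum export_class cls;
};

/* Placeholder table; these are not real kernel symbols. */
static const struct exported_symbol symtab[] = {
    { "example_plain_symbol",  EXPORT_PLAIN },
    { "example_gpl_symbol",    EXPORT_GPL },
    { "example_future_symbol", EXPORT_GPL_FUTURE },
};

/* Decide whether a module may use the named symbol. */
enum resolve_result resolve_symbol(const char *name, int module_is_gpl)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
        if (strcmp(symtab[i].name, name) != 0)
            continue;
        if (module_is_gpl)
            return RESOLVE_OK;          /* GPL modules see everything */
        switch (symtab[i].cls) {
        case EXPORT_GPL:
            return RESOLVE_DENIED;      /* GPL-only today */
        case EXPORT_GPL_FUTURE:
            /* Still usable, but the module is warned of the change. */
            fprintf(stderr, "warning: %s will become GPL-only\n", name);
            return RESOLVE_OK_WITH_WARNING;
        default:
            return RESOLVE_OK;
        }
    }
    return RESOLVE_UNKNOWN;
}
```

In this model a non-GPL module asking for example_future_symbol still gets it, along with a warning on stderr; only the GPL-only symbol is refused outright.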
This patch raised a few eyebrows. When GPL-only exports were first added
to the kernel, they went in with the understanding that only new symbols
would be tagged GPL-only. The current module interface - while always
subject to change - was not to have symbols withdrawn arbitrarily. So, if
the export status of symbols should not change, what is the use of this
patch? Greg has a couple of uses in mind:
- The read-copy-update symbols are due to turn GPL-only in April of this
year. The use of RCU by non-GPL modules has always been legally
problematic: RCU is a patented technique which has been licensed by
IBM for use in GPL code. Non-GPL modules will (in the absence of
other arrangements) lack a license for RCU, and thus should not be
using those symbols anyway.
- Current plans call for making the core USB subsystem GPL-only in early
2008. The argument here is that this subsystem has changed greatly
over time, and that it is possible to create full-speed USB drivers in
user space.
It is not clear that there will be uses beyond these; resistance to a
larger-scale restricting of exported symbols remains strong. So the weapon
of choice for those who wish to make life difficult for proprietary modules
is likely to remain the combination of API changes and restrictions on new
symbols.
A couple of Linux-specific additions to the memory-related system call API
have recently found their way into the -mm tree. There is a bit of
pressure to get them into 2.6.16, though that may be unlikely at this late
date. This may be a good time to look at the proposed changes, however,
along with the pressures which motivated them.
Prepare yourself, as your editor is about to inflict his primitive drawing
skills upon the world again. Consider a situation which, with some
imagination, could be described by the diagram to the right. A process has
a particular memory page of interest, pointed to by a page table entry.
That process has arranged with a device driver to exchange data through
this page; as a result, the driver has a pointer to the associated
page structure, possibly obtained with get_user_pages().
At this stage, all is working well.
But then the process decides to reproduce. The resulting call to fork() has
a number of consequences beyond the creation of a child process. That
call will attempt to avoid copying the parent process's
memory since, for much of the memory range, there is unlikely to ever be a
reason to do so. Instead, both parent and child will be set up with page table
entries pointing to the same physical page in memory, but that page will
now be write protected. As long as neither process attempts to write to
the page, the situation can remain as shown in the diagram to the left.
Both processes - and the driver - can share the same physical page. If
either process calls fork() again, the result will be a third
process also sharing that page, and so on. Often, no process will attempt
to write to the page for as long as it is in this shared state, and no copy
will ever have to be performed.
Life is not always so easy, however. If the parent process makes a change
to the page - writing some new data to be passed through to the driver, for
example - the hardware will trap the write attempt. The kernel will
respond by allocating a new page, copying the old page's contents there,
and pointing the parent process's page table entry to the new,
write-enabled page. At that point, the write attempt can go forward, and
everybody will be happy.
Or maybe not. The copy-on-write operation described above will break the
parent process's connection with the old page. But there is no way to
inform the driver of that change. The result will be the situation shown
on the right: the driver retains a reference to the page which now belongs
exclusively to the child process(es). The parent process and the driver
will no longer be able to communicate with each other. Additionally, if
the parent had used mlock() to lock the original page into memory,
that lock, too, will remain with the original page. The page which the
parent had thought was pinned into RAM will become pageable, with
potentially bad effects on performance and security.
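The copy-on-write split at the heart of this problem can be observed from user space. The following is a minimal sketch, assuming a Linux system with anonymous private mappings; the function name is invented for the example, and no device driver is involved. The parent writes to a page after fork(), and the child reports what it still sees.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the value the child reads from the shared-until-written page
 * after the parent has modified it, or -1 on error.
 */
int child_view_after_parent_write(void)
{
    long len = sysconf(_SC_PAGESIZE);
    int fds[2];
    int status = 0;
    int parent_value;
    int *page;
    pid_t pid;

    assert(len > 0);
    page = mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    if (pipe(fds) != 0) {
        munmap(page, (size_t)len);
        return -1;
    }
    *page = 1;                      /* value present at fork() time */

    pid = fork();
    if (pid < 0) {
        munmap(page, (size_t)len);
        return -1;
    }
    if (pid == 0) {
        char go;
        close(fds[1]);
        /* Wait until the parent has finished its write. */
        if (read(fds[0], &go, 1) != 1)
            _exit(255);
        _exit(*page);               /* report what the child sees */
    }

    close(fds[0]);
    *page = 2;                      /* write fault: parent gets a new copy */
    if (write(fds[1], "x", 1) != 1)
        status = -1;                /* child will see EOF and exit 255 */
    close(fds[1]);

    if (waitpid(pid, &status, 0) != pid)
        status = -1;
    parent_value = *page;
    munmap(page, (size_t)len);
    if (status == -1 || !WIFEXITED(status) || parent_value != 2)
        return -1;
    return WEXITSTATUS(status);
}
```

The child continues to see the old value because it is the parent which was moved to a fresh page; anything else holding a reference to the original page, such as the driver in the scenario above, stays with the child's view.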
One could try to address this problem by changing the copy-on-write logic
to always maintain the connection between the parent process and its
original pages. That would require the COW code to find any other
processes with references to the page and assign the copied
page to them. Such a change would slow the code and invite interesting race
conditions; remember that there could be a large number of other
processes with references to the page. So the solution proposed by Michael Tsirkin takes a
different approach.
If a process has pages which it has locked into memory or set up to be
shared with a device driver, chances are that it never wants its children
to have access to that memory in the first place. So Michael's patch adds
a couple of new flags to the madvise() system call. A process
with special memory can call madvise() with the new
MADV_DONTFORK flag; the kernel will respond by setting the
VM_DONTCOPY flag in the associated virtual memory area structure;
thereafter, any newly-created child process simply will not see that part
of the address space. There is also a MADV_DOFORK flag which
cancels the effect of MADV_DONTFORK.
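From user space, the new flags might be exercised along the following lines. This is a sketch which assumes a kernel and C library providing MADV_DONTFORK and MADV_DOFORK; the helper function and its mode enum are invented for the example. The child simply tries to read the region, and the parent reports whether it survived.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum fork_mode { FORK_DEFAULT, FORK_DONTFORK, FORK_DONTFORK_THEN_DOFORK };

/*
 * Returns 1 if a child can read the region, 0 if the child is killed
 * by SIGSEGV when touching it, and -1 on any other outcome.
 */
int child_can_read_region(enum fork_mode mode)
{
    long len = sysconf(_SC_PAGESIZE);
    int rc = 0;
    int status = 0;
    char *buf;
    pid_t pid, waited;

    assert(len > 0);
    buf = mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 42;

    /* Ask the kernel not to give this range to children... */
    if (mode != FORK_DEFAULT)
        rc = madvise(buf, (size_t)len, MADV_DONTFORK);
    /* ...and, optionally, change our mind again. */
    if (rc == 0 && mode == FORK_DONTFORK_THEN_DOFORK)
        rc = madvise(buf, (size_t)len, MADV_DOFORK);

    pid = (rc == 0) ? fork() : -1;
    if (pid < 0) {
        munmap(buf, (size_t)len);
        return -1;
    }
    if (pid == 0) {
        volatile char *p = buf;     /* force a real memory access */
        signal(SIGSEGV, SIG_DFL);
        _exit(p[0] == 42 ? 0 : 1);
    }

    waited = waitpid(pid, &status, 0);
    munmap(buf, (size_t)len);
    if (waited != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
        return 0;                   /* region was absent in the child */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 1;                   /* region was inherited as usual */
    return -1;
}
```

With FORK_DONTFORK the child has no mapping at that address at all, so the parent's pages can never be split away from it by a child's existence.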
Meanwhile, another change found in current -mm came as a result of a complaint about the behavior of the
msync() system call, which is used to flush modified parts of a
memory-mapped file back to disk. In particular, the
complainer, whose real name is unclear, just noticed that msync()
changed its semantics between 2.4 and 2.6. In the 2.4 kernel, a call to
msync(..., MS_ASYNC) would mark the indicated memory range as
being dirty and begin the process of writing those pages to disk. In 2.6,
by contrast, no I/O is started directly from msync(); instead, the
pages will remain dirty in the page cache until the virtual memory
subsystem gets around to flushing them out.
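For reference, the calls in question look something like the following from an application's point of view. This is a minimal sketch using a temporary file under /tmp (an assumption about the environment); the function name is made up. What the kernel does in response to MS_ASYNC is as described above and cannot be observed directly from this code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if the mapped data could be flushed and read back. */
int msync_demo(void)
{
    char path[] = "/tmp/msync-demo-XXXXXX";
    const char msg[] = "dirty data";
    char check[sizeof(msg)];
    long len = sysconf(_SC_PAGESIZE);
    char *map;
    int ret = -1;
    int fd;

    assert(len > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                   /* file vanishes when fd is closed */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, (size_t)len, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, sizeof(msg));  /* dirty the first page */

    /* Asynchronous request: returns without waiting for any I/O. */
    if (msync(map, (size_t)len, MS_ASYNC) == 0 &&
        /* Synchronous request: returns once the data is written. */
        msync(map, (size_t)len, MS_SYNC) == 0 &&
        pread(fd, check, sizeof(check), 0) == (ssize_t)sizeof(check) &&
        memcmp(check, msg, sizeof(msg)) == 0)
        ret = 0;

    munmap(map, (size_t)len);
    close(fd);
    return ret;
}
```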
The original complainer asked that the old behavior be restored, but that
seems unlikely to happen. For most workloads, the best performance is
achieved by letting the kernel decide just when to write each part of the
file back to disk. But there was also some recognition that an option to
start I/O immediately (without necessarily waiting for it) would be a
useful thing in some situations. The answer, as implemented by Andrew Morton, leaves the
msync() call alone, however; instead, Andrew has added a couple of
new options to the posix_fadvise() system call:
- LINUX_FADV_ASYNC_WRITE will start write I/O on the given
range of pages. If some of those pages are already under I/O, the
operation will not be restarted, leaving open the possibility that
late changes might not make it to disk.
- LINUX_FADV_WRITE_WAIT will wait for any I/O currently in
progress on the given range of pages, but does not actually start any
new I/O.
In practice, these calls will often need to be made in combination. An
application which needs to assure itself that all modified pages are on
disk must first perform a wait call (thus ensuring that all pages under I/O
are written), then a write call (to start I/O on remaining dirty pages), and finally a
second wait call (to allow that I/O to complete). But any application
wanting the 2.4 msync() behavior can get it with a single
LINUX_FADV_ASYNC_WRITE call.
Chances are good that both of these changes could land in the mainline in
the 2.6.17 time frame.
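The reason for the wait, write, wait sequence with the two posix_fadvise() options can be seen in a toy model of page state. The sketch below runs entirely in user space and makes no system calls; the structure, the function names, and the assumption that a page modified while under I/O stays dirty when that I/O completes are all illustrative rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page state: has unwritten changes, and/or has I/O in flight. */
struct model_page {
    bool dirty;
    bool under_io;
};

/* Models LINUX_FADV_ASYNC_WRITE: start I/O, never restart it. */
void model_async_write(struct model_page *pages, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (pages[i].dirty && !pages[i].under_io) {
            pages[i].under_io = true;
            pages[i].dirty = false; /* in-flight I/O carries the data */
        }
    }
}

/* Models LINUX_FADV_WRITE_WAIT: wait for I/O, start nothing new. */
void model_write_wait(struct model_page *pages, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        pages[i].under_io = false;
}

/* True when no page has unwritten changes or pending I/O. */
bool model_all_clean(const struct model_page *pages, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (pages[i].dirty || pages[i].under_io)
            return false;
    return true;
}
```

Starting from a page which was modified again while already under I/O, a write followed by a wait leaves that page dirty in this model, since the write call declines to restart the operation. Putting a wait call first lets the subsequent write pick the page up.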
One of the many features added during the 2.5 development series was the
"futex" - a sort of fast, user-space mutual exclusion primitive. In the
non-contended case, futexes can be obtained and released with no kernel
involvement at all, making them quite fast. When contention does happen
(one process tries to obtain a futex currently owned by another), the
kernel is called in to queue any waiting processes and wake them up when
the futex becomes available. When queueing is not needed, however, the
kernel maintains no knowledge of the futex, keeping its overhead low.
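A simple lock built on this idea might look like the sketch below. It is an assumption-laden illustration rather than the glibc implementation: it uses the three-state scheme (0 for unlocked, 1 for locked, 2 for locked with possible waiters) found in common descriptions of futex-based mutexes, and the function names are invented. The kernel is entered only on the contended paths.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the futex system call (no glibc wrapper used). */
static long sys_futex(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_lock(atomic_int *f)
{
    int c = 0;

    assert(f != NULL);
    /* Fast path: 0 -> 1 with a single atomic operation, no syscall. */
    if (atomic_compare_exchange_strong(f, &c, 1))
        return;

    /* Contended: mark the lock as having waiters, then sleep. */
    if (c != 2)
        c = atomic_exchange(f, 2);
    while (c != 0) {
        /* Sleeps only if the lock word is still 2. */
        sys_futex(f, FUTEX_WAIT, 2);
        c = atomic_exchange(f, 2);
    }
}

void futex_unlock(atomic_int *f)
{
    assert(f != NULL);
    /* Fast path: 1 -> 0 means nobody was waiting, no syscall. */
    if (atomic_fetch_sub(f, 1) != 1) {
        /* There may be waiters: release and wake one of them. */
        atomic_store(f, 0);
        sys_futex(f, FUTEX_WAKE, 1);
    }
}
```

Note that nothing here tells the kernel who holds the lock; if the owner dies between futex_lock() and futex_unlock(), the lock word simply stays nonzero forever.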
There is one problem with keeping the kernel out of the picture, however.
If a process comes to an untimely end while holding a futex, there is no
way to release that futex and let other processes know about the problem.
The SYSV semaphore mechanism - a much more heavyweight facility - has an
"undo" mechanism which can be called into play in this sort of situation,
but there is no such provision for futexes. As a result, a few different
"robust futex" patches have been put together over the past years; LWN looked at one of them in January,
2004. These patches have tended to greatly increase the cost of futexes,
however, and none have been accepted into the mainline.
Ingo Molnar, working with Thomas Gleixner and Ulrich Drepper, has tossed
aside those years' worth of work and, in a couple of days, produced a new robust futex patch which,
he hopes, will find its way into the mainline. The new patch has the
advantage of being fast, but, as Ingo notes:
Be warned though - the patchset does things we normally dont do in
Linux, so some might find the approach disturbing. Parental advice
recommended ;-)
The fundamental problem to solve is that the kernel must, somehow, know
about all futexes held by an exiting process in order to release them. A
past solution has been the addition of a system call to notify the kernel
of lock acquisitions and releases. That approach defeats one of the main
features of futexes - their speed. It also adds a record-keeping and
resource limiting problem to the kernel, and suffers from some problematic
race conditions.
So Ingo's patch takes a different approach. A list of held futexes is
maintained for each thread, but that list lives in user space. All the
thread has to do is to make a single call to a new system call:
long set_robust_list(struct robust_list_head *head, size_t size);
That call informs the kernel of the location of a linked list of held
futexes in the calling process's address space; there is also a
get_robust_list() call for retrieving that information.
Typically, this call would be made by glibc, and never seen by the
application. Glibc would also take on the task of maintaining the list of
held futexes.
When a process dies, the kernel looks for a pointer to a user-space futex
list. Should that pointer be found, the kernel will carefully walk through
it, bearing in mind that, as a user-space data structure, it could be
accidentally or maliciously corrupt. For each held futex, the kernel will
release the lock and set it to a special value indicating that the previous
holder made a less-than-graceful exit. It will then wake a waiting
process, if one exists. That process will be able to see that it has
obtained the lock under dubious circumstances (user-space functions like
pthread_mutex_lock() are able to return that information) and take
whatever action it deems to be necessary. The kernel will release a
maximum of one million locks; that keeps the kernel from looping forever on
a circular list. Given the practical difficulties of making a million-lock
application work at all, that limit should not constrain anybody for quite
some time.
There is still a race condition here: if a process dies between the time it
acquires a lock and when it updates the list, that lock might not be
released by the kernel. Getting around that problem involves a bit of poor
kernel hacker's journaling. The head of the held futex list contains a
single-entry field which can be used to point to a lock which the
application is about to acquire. The kernel will check that field on exit,
and, if it points to a lock actually held by the application, that lock
will be released with the others. So, if glibc sets that field before
acquiring a lock (and clears it after the list is updated), all locks held
by the application will be covered.
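The exit-time processing described above can be modeled in a few lines of user-space C. This sketch is not the patch's code: the structures, the flag values, and the function names are invented for illustration, the lock word is assumed to hold the owner's thread ID, and the wakeup of waiting processes is left out. The limit is applied to list entries visited, which is one plausible reading of the one-million-lock cap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_OWNER_DIED 0x40000000u  /* illustrative "owner died" value */
#define MODEL_TID_MASK   0x3fffffffu  /* illustrative owner-ID bits */
#define MODEL_WALK_LIMIT 1000000L     /* cap on list entries visited */

struct model_lock {
    struct model_lock *next;  /* link in the per-thread held list */
    uint32_t word;            /* 0 when free, else the owner's ID */
};

struct model_list_head {
    struct model_lock *first;    /* locks known to be held */
    struct model_lock *pending;  /* lock about to be acquired, if any */
};

/* Release the lock only if the dying thread really owns it. */
static int release_if_owned(struct model_lock *l, uint32_t tid)
{
    if ((l->word & MODEL_TID_MASK) != tid)
        return 0;
    l->word = MODEL_OWNER_DIED;  /* next owner can detect the death */
    return 1;
}

/* Returns the number of locks released on behalf of thread tid. */
int model_exit_cleanup(struct model_list_head *head, uint32_t tid)
{
    struct model_lock *l;
    long visited = 0;
    int released = 0;

    assert(head != NULL && tid != 0);
    /* The pending entry covers a death between acquire and list update. */
    if (head->pending != NULL)
        released += release_if_owned(head->pending, tid);
    /* The visit cap keeps a circular list from looping forever. */
    for (l = head->first; l != NULL && visited < MODEL_WALK_LIMIT;
         l = l->next) {
        released += release_if_owned(l, tid);
        visited++;
    }
    return released;
}
```

Because each lock is checked for ownership before being released, a pending entry which also appears on the list is handled only once, and a list entry pointing at somebody else's lock is left alone.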
The discussion on this patch was just beginning when this article was
written. There is some concern about having the kernel walking through
user-space data structures; the chances of trouble and security problems
are certainly higher when that is going on. Other issues may yet come up
as well. But, since this is clearly not a 2.6.16 feature in any case,
there will be time to talk about them.
Patches and updates
Core kernel code
- Junio C Hamano: GIT 1.2.0.
(February 13, 2006)
Filesystems and block I/O
Page editor: Jonathan Corbet