Brief items
The current development kernel is 2.5.26, which was
announced by Linus on July 16. Changes
include some ACPI updates, the "direct to BIO for O_DIRECT" patch (see
last week's LWN Kernel Page), a number of NTFS
updates, some USB changes, some IDE fixes, an ARM update, and a great many
other changes. The
long-format changelog is
also available.
The latest prepatch from Dave Jones (as of this writing) is 2.5.25-dj2, released on July 12. The most
significant feature of this release, perhaps, is that Dave has included the
2.4 IDE "foreport" (also discussed last
week).
The latest 2.5 status summary from Guillaume
Boissiere came out on July 17.
The current stable kernel is still 2.4.18. The second 2.4.19
release candidate showed up on the kernel.org sites on July 17, but
Marcelo has not posted any sort of changelog or announcement.
The current prepatch from Alan Cox is 2.4.19-rc1-ac7. Alan has been merging a lot of
code for the IBM "Summit" architecture, PA-RISC, and more.
Comments (1 posted)
Kernel development news
One of the biggest challenges in kernel programming is managing
concurrency. If multiple threads try to access the same resources at the
same time, the result can be chaos. Users tend to have a dislike for
chaos, so kernel programmers work hard to avoid uncontrolled access to
shared data.
In the Linux kernel, this sort of mutual exclusion is usually done with
spinlocks. By obtaining a spinlock, a process running in kernel mode can
ensure that it is the only one working with the data structures protected
by that lock. A variant on spinlocks, called the "reader writer lock,"
allows numerous threads to access a data structure as long as they do not
modify it, but provides exclusive access to processes which make changes.
Spinlocks work well in most situations, but they are not free. Taking out
a lock takes time, especially on SMP machines, where the cache line
containing the lock must be moved between processors. The overhead of a
heavily used lock can be significant. So kernel hackers interested in
scalability have long kept an eye out for alternatives; one such
alternative is a technique called "read-copy-update," or RCU. A "new and shiny" RCU patch was posted by Dipankar
Sarma recently (the code credits Paul McKenney, Andrea Arcangeli, Rusty
Russell, Andi Kleen, and "etc."). So it seems like a good time to give RCU
a look.
RCU works by requiring shared kernel data structures to be accessed via
pointers. Code needing read-only access to a given data structure (a
network routing table entry, say) follows a pointer and is able to work
with the data with no locking at all (OK, almost none, see below). The
reader case is, thus, handled in a fast and efficient manner. When the
code needs to make a change to the data, however, life gets rather more
complicated; the sequence of steps required is, roughly:
- The writer allocates a new data structure and makes a copy of the
structure to be changed.
- The new structure is then modified to reflect the new state of the
world.
- The writer saves the pointer to the old version of the data, and sets
the global pointer to the new structure. Kernel threads that access
the data after this change will see the new version; any thread that
came along before will still be working with the old copy.
- The writer asks for a callback when the kernel knows that no code
has any reference to the old version of the data.
- When the callback happens, the old data can be freed, and the RCU
cycle is complete.
This technique is clearly optimized for situations where the data is read
frequently and modified rarely. For frequently-changed data, the overhead
of the RCU cycle would likely exceed that of simply using spinlocks. The
"frequent reads/infrequent writes" mode of operation is quite common in the
kernel, though, so there are many places where this technique could be
employed. For example, Rusty Russell's hotplug CPU patch uses RCU, on the
assumption that processors do not actually come and go very often.
All of the above, however, has glossed over one interesting detail: how,
exactly, does the kernel know when it is safe to release an old data
structure? The RCU patch handles this with a basic assumption: kernel code
will not retain pointers to RCU-protected data after it sleeps or returns
to user space. Thus, it is sufficient to wait until every processor in the
system has been seen to be running in user space or to be idle. The RCU
patch describes such a processor as being "quiescent." Each CPU in the
system has a quiescent counter, which is incremented by the scheduler
whenever a quiescent state is observed.
To call the RCU writer callbacks at the right time, the RCU code maintains
a list of pending RCU completions on each processor. A tasklet runs
periodically on any processor with pending RCU callbacks; it polls the
quiescent counter for all CPUs and waits until every one of them has
changed. Once that has happened, it is safe to free any old RCU data, so
the list of callbacks is processed. If, by that time, a new list of
pending callbacks has accumulated, the whole thing starts over again.
All of this works until you throw in one other little detail: the
preemptive kernel. If a process is preempted while running in kernel
space, it could retain a pointer to old RCU data even though the CPU
appears to be quiescent. The RCU patch provides two different ways of
dealing with this problem. One is that code accessing RCU data for reading
can bracket that access with calls to rcu_read_lock() and
rcu_read_unlock(), which simply disable preemption in the critical
section. Spinlocks, of course, do the same thing.
Alternatively, code can read the RCU data in an unprotected mode as
always. In this case, the RCU callback code gets even a little more
complicated; it must now wait until every process which had been preempted
in kernel mode either exits or is rescheduled normally. This waiting is
not quite as bad as it might seem; it is handled with a couple of atomic
counters. It does, however, introduce an indeterminate delay
between when the new data appears and the old can be freed. If the memory
areas involved are large, quite a bit of kernel memory could be tied up
waiting for RCU callbacks; disabling preemption is a safer way to go in
most cases.
RCU thus involves some complexity, but it holds out a promise of better
performance for certain kinds of data access patterns. Will it get into
the 2.5 kernel? There is one little problem, being that Linus doesn't like
the RCU approach. From a message posted last
October:
RCU obviously has major subtle issues, ranging from memory ordering
to "what is quiescent", ie the scheduling points. And "subtlety"
directly translates into bugs and hard-to-debug seldom-seen
problems that end up being "unexplainable".
In short, RCU seems to be a case of "hey, that's cool", but it's a
solution in search of a problem so severe that it is worth it.
There are no indications that Linus believes such a problem has yet come
up. Work continues on RCU patches (and other patches that use it),
however, so the story is not yet finished. (For information in numbing
detail about RCU - but without the preemptive kernel changes - see this
page on the LSE site).
Comments (3 posted)
As was discussed
last week, one problem with
an increasingly fine-grained kernel is that it becomes difficult to know
which locks, out of thousands, must be held at any given point. Some
functions include documentation on their locking requirements (and
sometimes it's even current), but many others don't. And there is no way
for the code to actually enforce those requirements.
That may be about to change, however. Jesse Barnes, in discussion with
Daniel Phillips and others, has posted a
patch which addresses both problems. A function which expects to be
called with a given lock held simply includes a line like:
MUST_HOLD_SPIN(&some_lock);
In kernels compiled for production use, this macro expands to nothing and
serves as documentation only - anybody looking at the code sees immediately
that some_lock must be held before calling the function. The
CONFIG_DEBUG_SPINLOCK compilation option gives the macro some
teeth, however: if the given lock is not actually held at that point the
kernel panics immediately. The end result is that erroneous calls are
likely to get fixed in a hurry.
Dave Jones jumped in with a suggestion for
tracking down a related (and common) problem: code which sleeps while
holding a spinlock. Sleeping while holding a lock is against the rules,
since it can cause other processors to spin for a very long time. But it
is easy, while programming the kernel, to call a function which, three
functions later, goes to sleep. Once again, one could try to document the
"can sleep" status of every function and expect programmers to follow that
documentation. But, says Dave, why not just add a line like:
FUNCTION_SLEEPS();
to any function that can sleep? If the macro is called while a lock is
held, a bug exists. A quick kernel panic will allow the kernel hackers to
track down the offending call in a hurry.
Neither of these changes has found its way into a mainline kernel yet. If
they do, though, they could well help in the early detection of a number of
programming errors.
Comments (5 posted)
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
- Andre Hedrick: IDE/ATAPI in 2.5. A new 'foreport' of 2.4 IDE to 2.5.25.
(July 12, 2002)
Filesystems and block I/O
- Andrew Morton: readahead optimisations. "<span>This patch teaches readahead to detect the situation where no IO is
actually being performed as a result of its actions.</span>"
(July 17, 2002)
Janitorial
Memory management
- Robert Love: strict VM overcommit. "<span>We introduce new overcommit policies
that attempt to never succeed an allocation that can not be fulfilled by
the backing store and consequently never OOM. This is achieved through
strict accounting of the committed address space and a policy to
allow/refuse allocations based on that accounting.</span>" A reworking of Alan Cox's patch.
(July 12, 2002)
- Andrew Morton: minimal rmap. "<span>The code adds a pointer to struct page, consumes additional storage for
the pte chains and adds computational expense to the page reclaim code
(I measured it at 3% additional load during streaming I/O). The
benefits which we get back for all this are, I must say, theoretical
and unproven.</span>"
(July 17, 2002)
Networking
Architecture-specific
- Jeff Dike: UML 2.5.26. "<span>After a long, relaxing period of ignoring 2.5, I decided that it was about [time]
to catch up.</span>"
(July 17, 2002)
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>