Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.8-rc1, which was released by Linus on July 11. The list of patches is huge; it includes the TEA and XTEA crypto algorithms, a bunch of USB work, snapshot and mirror support in the device mapper, vast amounts of "sparse" annotations and associated fixes, some virtual memory tweaks, an AGP update, an NTFS update, some read-copy-update improvements, x86 no-execute support, netlink support for SELinux, a serial ATA update, 64-bit SuperH support, fixes for locking problems found by the Stanford checker, reworked symbolic link lookups, and much more. See Linus's announcement for the brief listing of patches, or the long-format changelog for the details.
Linus's BitKeeper repository contains a small number of patches, including some network driver updates, more sparse annotations, and various fixes.
The current prepatch from Andrew Morton is 2.6.8-rc1-mm1; Andrew notes, however, that "This kernel runs like a dessicated slug if you have more than 2G of memory due to a 32-bit overflow." Recent additions to -mm include some latency fixes (see below), a set of gcc 3.5 fixes, a big user-mode Linux update, and various fixes.
The current 2.4 prepatch is 2.4.27-rc3; Marcelo has released no patches since July 3.
Kernel development news
The 2004 Kernel Summit
Content on this page will be somewhat thin next week, as your editor will be in Ottawa for the 2004 Kernel Summit and the Ottawa Linux Symposium. The Kernel Summit will be happening Monday and Tuesday, July 19 and 20. The agenda has now been posted for those who are curious. The topics to be discussed will not be surprising to most readers: virtual memory management, NUMA, power management, clustered storage, networking, block I/O, security, and more. There will also be a pair of sessions on kernel support for desktop users, featuring a cameo appearance by Keith Packard.
As usual, LWN will be carrying reports from the event; stay tuned.
Your editor is also giving a talk in the very first OLS slot, 10:00, Wednesday, where he will engage in some wild speculation on where the 2.7 development series might go, assuming it actually starts sometime soon.
NULL v. zero
Back in June, this page looked at the sparse utility, which is being used to search out various kinds of errors in the kernel code base. Recently, large numbers of patches have gone in to address one particular sparse complaint: using an integer 0 to represent a null pointer value. These patches (example) have struck some developers as useless code churn, leading to complaints like:
Linus responds that programmers who interchange NULL and zero are confused about the types they are using and are putting that confusion into the kernel. In his desire to enable the compiler (and other compile-time checkers) to find errors, he wants to separate the integer and pointer types as completely as possible. NULL is a pointer, while 0 can never be.
char * p = 0;   /* IS WRONG! DAMMIT! */
int i = NULL;   /* THIS IS WRONG TOO! */
and anybody who writes code like the above either needs to get out of the kernel, or needs to get transported to the 21st century.
One might conclude from this statement that Linus is pretty well convinced that the current course of action is correct. He also states that, without exception, changing zero to NULL has resulted in better, more readable code. So use of NULL seems to have become part of the official kernel coding style, even if the CodingStyle document is still silent on the matter.
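The distinction can be illustrated with a small userspace sketch. The helper names find_first_space() and count_spaces() are invented for this example and do not come from the kernel; the point is only that "no pointer" is spelled NULL while "no items" is spelled 0.

```c
#include <assert.h>
#include <stddef.h>	/* NULL */

/*
 * Hypothetical helper: returns a pointer to the first space in s,
 * or NULL (a pointer value, not the integer 0) when there is none.
 */
static const char *find_first_space(const char *s)
{
	assert(s != NULL);
	for (; *s != '\0'; s++)
		if (*s == ' ')
			return s;
	return NULL;
}

/*
 * Hypothetical helper: the result is an integer count, so "nothing
 * found" is the integer 0. Pointers are compared against NULL.
 */
static int count_spaces(const char *s)
{
	const char *p;
	int n = 0;

	for (p = find_first_space(s); p != NULL; p = find_first_space(p + 1))
		n++;
	return n;
}
```

Written this way, a checker such as sparse (or a human reader) can tell at a glance which expressions are pointers and which are integers.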
Addressing latency problems in 2.6
The 2.6 kernel is becoming increasingly stable, and the user base is, correspondingly, becoming happier. There is, however, one remaining group of disgruntled users out there: multimedia users and developers who depend on very quick response times from the kernel. Whether you are capturing a video stream, playing a movie, or burning a disc, you need the system to respond very quickly when the hardware involved needs attention. Failure to respond in time leads to buffer overruns or underruns; those, in turn, lead to video degradation, audio skips, writable media which is suitable only for use as drink coasters or grade-school art projects, and flames on various mailing lists.
The traffic has been growing in recent times, as it has become clear that some in the multimedia community feel discriminated against:
The result of this discussion has been a renewed interest among the kernel developers in fixing this particular problem. It is pretty universally believed that the latency issue should be close to resolved, and that it is just a matter of fixing a few remaining trouble spots.
One approach that has been taken is the voluntary preemption patch put together by Ingo Molnar and Arjan van de Ven. This patch tries to reduce latency by adding more scheduling points - essentially the approach that was taken back in the 2.4 days. Some things were done a little differently, however.
The 2.6 kernel contains a hundred or so calls to might_sleep(). This function is a debugging aid; it is a way of marking functions which can sleep. If might_sleep() finds itself being called in a situation where sleeping is not allowed (while a spinlock is held, for example) it complains loudly and, hopefully, the problem gets fixed. Ingo and Arjan noted that any place which calls might_sleep() is, by definition, a good place to perform scheduling. So the voluntary preemption patch adds a cond_resched() call to might_sleep(), allowing a higher-priority process to be scheduled, should such a process exist. This tweak yields over 100 scheduling points without having to actually go into the code in that many places.
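The idea can be modeled in a few lines of userspace C. This is a sketch only, not the patch itself: every name below (model_might_sleep(), model_cond_resched(), and the state variables) is a stand-in invented for the example, and the decision to skip the rescheduling check when the debug test fails is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* All names here are stand-ins for this sketch, not kernel symbols. */
static int atomic_depth;	/* > 0 models "a spinlock is held" */
static bool resched_pending;	/* models a waiting higher-priority task */
static int complaints;		/* how often the debug check fired */
static int voluntary_switches;	/* how often the CPU was yielded */

/* Yield only if somebody more important is waiting. */
static void model_cond_resched(void)
{
	if (resched_pending) {
		resched_pending = false;
		voluntary_switches++;
	}
}

/* The debugging check, with a scheduling point bolted on. */
static void model_might_sleep(void)
{
	assert(atomic_depth >= 0);
	if (atomic_depth > 0) {
		/* The original purpose: complain about a bad caller. */
		complaints++;
		fprintf(stderr, "sleeping function called from atomic context\n");
		return;
	}
	/* The addition: every legitimate caller becomes a scheduling point. */
	model_cond_resched();
}
```

The model also makes the objection discussed below visible: once the scheduling point lives inside the debug check, that check can no longer be configured out of production kernels.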
While they were at it, Ingo and Arjan also added a few scheduling points in places that needed them, and also split up code in a couple of places which were holding locks for too long.
This patch was not welcomed by everybody. In the mainline kernel, the might_sleep() call can be configured out entirely for production kernels; it is a pure debugging aid. The voluntary preemption patch turns it into a scheduler function and makes its presence required in production kernels. Some developers would rather see explicit rescheduling calls added in the places where they make sense.
The strongest objection, however, would appear to be that the 2.6 kernel already implements involuntary preemption via the preemptable kernel option. Any place which calls might_sleep() is already, by definition, preemptable, so the voluntary preemption patch adds nothing which the kernel can't already do. Says Andrew Morton:
So why are some developers pursuing the voluntary preemption patch? At this time, very few distributors are shipping 2.6 kernels with kernel preemption turned on, mostly out of fear of creating stability problems. Kernel preemption is, itself, reasonably well debugged at this point, but it has, over the last year or so, shaken out a fair number of bugs in other parts of the kernel. Few such bugs have been found recently, but the distributors continue to take a conservative approach. Users often find bugs in surprising places, and bugs related to preemption can be incredibly difficult to reproduce and track down. The voluntary preemption patch is a way of getting some of the benefits of kernel preemption without turning on a configuration option that the distributors find scary.
Andrew has often stated his wish to have the mainline kernel meet the needs of the distributors, so he may eventually merge the patch:
Meanwhile, the effort to find the real latency issues is going forward. William Lee Irwin and Con Kolivas have put together a patch which tries to track down high-latency parts of the kernel. It works by making a note of when kernel code disables preemption (usually by taking a spinlock) and when preemption is turned back on again. If preemption is disabled for too long, a message is printed stating where the problem is to be found.
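The bookkeeping involved can be sketched as follows. This is a userspace model under stated assumptions, not the preempt-timing patch: the structure, the function names, and the 1000-microsecond threshold are all invented for illustration, and timestamps are passed in by the caller so that the example stays deterministic.

```c
#include <assert.h>
#include <stdio.h>

#define LATENCY_LIMIT_US 1000UL	/* assumed threshold for this sketch */

/* Hypothetical per-CPU state; not a kernel structure. */
struct preempt_timer {
	int depth;		/* nesting of preempt-disabled sections */
	unsigned long start_us;	/* when the outermost section began */
	const char *where;	/* who began it */
	int violations;		/* how many overlong sections were seen */
};

/* Note the time and place when preemption is first turned off. */
static void model_preempt_disable(struct preempt_timer *t,
				  unsigned long now_us, const char *where)
{
	if (t->depth++ == 0) {
		t->start_us = now_us;
		t->where = where;
	}
}

/* When preemption comes back on, complain if it was off for too long. */
static void model_preempt_enable(struct preempt_timer *t,
				 unsigned long now_us)
{
	assert(t->depth > 0);
	if (--t->depth == 0 && now_us - t->start_us > LATENCY_LIMIT_US) {
		t->violations++;
		fprintf(stderr, "preemption off for %lu us, disabled at %s\n",
			now_us - t->start_us, t->where);
	}
}
```

Only the outermost disable/enable pair is timed in this model, since nested sections cannot be preempted until the outermost one ends anyway.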
ALSA users who are experiencing latency problems, and who would like to help track them down, should also be aware of the xrun_debug knob. It is described in sound/alsa/ProcFile.txt in the Documentation directory. Turning this option on causes a message and a kernel stack trace to be printed whenever an audio device suffers from a buffer overrun or underrun. This information can often be used to find the source of latency issues in short order.
Thanks to the preempt-timing patch and xrun_debug, a few suspects have been turned up already. Console scrolling turns out to be one of them. ReiserFS has also come up a few times as being a source of high latency, to the point that its use in latency-critical situations is being discouraged. Ext3 has been shown to be the source of a few problems as well; the -mm tree currently contains a set of patches aimed at fixing the worst of those. Another problem can be driver ioctl() methods, which run with the big kernel lock held. This process is just beginning, however.
Yet another approach can be found in this patch by Joe Korty. Software interrupts have been fingered as a potential source of latency problems; they take priority over regular kernel code, and have no real, hard limit on how long they can run. Joe's patch pushes all software interrupt handling into the ksoftirqd daemon, giving the scheduler a say on when they run. In this way, high-priority user processes will see lower latencies - at the expense of higher latency for the handling of software interrupts.
Tracking down and fixing the remaining latency problems may take a little while. But enough attention is now being focused on the problem that its resolution seems pretty well assured. The complete solution, however, requires enabling kernel preemption, meaning that, for the time being, 2.6 users in search of low latency will have to build and install their own kernels.
RCU-safe reference counting
The "kref" mechanism is a simple structure for implementing reference-counted objects in the kernel; it was covered here last March. At the core of a kref is an atomic_t counter which contains the number of outstanding references. When that counter goes to zero, the object is no longer used and can be freed.
The kref functions are simple. Obtaining a reference is done with a call to kref_get():
struct kref *kref_get(struct kref *kref)
{
	WARN_ON(!atomic_read(&kref->refcount));
	atomic_inc(&kref->refcount);
	return kref;
}
Releasing that reference is accomplished with kref_put():
void kref_put(struct kref *kref)
{
	if (atomic_dec_and_test(&kref->refcount)) {
		kref->release(kref);
	}
}
The use of atomic types makes these functions safe in multiprocessor or preemptive environments; the reference count will always be correct. Except, of course, when things go wrong. Consider the following order of operations performed by two kernel threads; they could be running on separate processors, or on a preemptive, uniprocessor system:
| Thread 1 | Thread 2 |
|---|---|
| /* In kref_get() */ WARN_ON(!atomic_read(&kref->refcount)); | |
| | kref_put(&kref); |
| atomic_inc(&kref->refcount); return kref; | |
The first thread will be left thinking it holds a reference to an object which, in fact, has been deleted. As a general rule, good things cannot be expected to result from this situation. The kref code deals with this possibility by fiat: simultaneous calls to kref_get() and kref_put() on the same object are not allowed. In practice, this restriction usually requires that these operations be called under the protection of a lock somewhere.
Developers interested in high-end scalability, however, often try to use lock-free algorithms. Locks can easily become a performance bottleneck as the number of threads increases, so, if they can be eliminated, the kernel will scale better. That is the motivation behind the use of techniques like seqlocks and read-copy-update (RCU). The locking requirement associated with the kref type makes that type difficult to use with these techniques.
Ravikiran G Thirumalai recently posted a patch entitled "Refcounting of objects part of a lockfree collection" which implements a new reference-counting type (called refcount_t) for dealing with objects managed using no-lock techniques. The explanation goes to great lengths to describe reference counting issues when working with RCU, but, in the end, all the patch is really doing, via a long path, is making a type which is like the kref, but which is not subject to the race described above.
kref_get(), as currently written, checks the reference count first; if that count is zero, the object has already been freed. The current implementation merely complains when this happens; one could argue that stronger action is called for. The real problem, though, is that this test and the subsequent incrementing of the reference count are not, together, atomic - other actions can come between the two. Ravikiran's patch addresses this issue by coding his _get() function differently:
static inline int refcount_get_rcu(refcount_t *rc)
{
	int c, old;
	c = atomic_read(&rc->count);
	while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c)
		c = old;
	return c;
}
The core of this function is the call to cmpxchg(), which is an inline assembly function giving access to the processor's cmpxchg instruction. The function prototype looks like:
int cmpxchg(int *location, int old, int new);
(The actual definition is a little more complex, depending on the real type of location.) The purpose of this function is to (1) compare the contents of *location with old, (2) if and only if the two are the same, assign new to *location, and (3) return the value which *location held before the operation. If cmpxchg() returns old, the operation succeeded; otherwise the value pointed to by location is unchanged. The key point is that all of these operations are performed in an atomic manner.
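For readers who want to experiment outside the kernel, these semantics can be approximated with C11 atomics. The function below is a userspace stand-in written for this example; its name, model_cmpxchg(), is invented, and it is not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Userspace stand-in for cmpxchg(): atomically replace *location with
 * new_value if and only if it currently holds old. Either way, return
 * the value *location held before the call.
 */
static int model_cmpxchg(atomic_int *location, int old, int new_value)
{
	int expected = old;

	assert(location != NULL);
	/*
	 * On success, *location becomes new_value and expected is left
	 * alone (so it still equals the prior contents). On failure,
	 * *location is untouched and expected is overwritten with the
	 * current contents.
	 */
	atomic_compare_exchange_strong(location, &expected, new_value);
	return expected;
}
```

A caller detects success by comparing the return value with the old value it passed in, which is exactly the test made in the loop in refcount_get_rcu().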
cmpxchg() is, in other words, a form of test-and-set instruction. It is used here to increment the reference count in an atomic manner while being absolutely sure that nobody else can possibly have seen that count reach zero. When references are obtained in this way, the race described above cannot happen.
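The same "take a reference unless the count is already zero" pattern can be written in portable userspace C. What follows is a sketch using C11 atomics, not Ravikiran's code; struct model_ref and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted object header. */
struct model_ref {
	atomic_int count;
};

/*
 * Take a reference only if the count is nonzero. Returns true when a
 * reference was obtained, false when the object is already dead.
 */
static bool model_ref_get_unless_zero(struct model_ref *r)
{
	int c;

	assert(r != NULL);
	c = atomic_load(&r->count);
	while (c != 0) {
		/*
		 * Succeeds only if the count is still c; on failure c is
		 * reloaded with the current count and the loop retries.
		 */
		if (atomic_compare_exchange_weak(&r->count, &c, c + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool model_ref_put(struct model_ref *r)
{
	assert(r != NULL);
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

Because the zero test and the increment happen in a single atomic step, a count which has reached zero can never be raised again by a late caller.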
There is still a pitfall, however. If the reference-counted object were to be freed and reused before another thread tried to obtain a reference, that thread might see a random "reference count" and think it succeeded. Preventing that turn of events is where RCU comes in. The actual object is freed by way of an RCU callback, which cannot happen until every processor has scheduled. If any thread can see a pointer to the object, said object will continue to exist, though its reference count may be zero. After a complete quiescence cycle, no threads can see such a pointer, and the object can be safely deleted.
One other potential problem is that not all architectures offer a cmpxchg instruction. On such systems Ravikiran uses a rather more elaborate and unsightly scheme involving a hashed array of spinlocks; see the patch if morbid curiosity gets the better of you.
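The general shape of such a fallback might look like the sketch below. This is a guess at the technique under stated assumptions, not a transcription of the patch: the table size, the hash function, the spinlock built from C11 atomics, and every name here are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_LOCKS 16	/* assumed table size; must be a power of two */

/* Zero-initialized static atomics start out in the "unlocked" state. */
static atomic_bool lock_table[NR_LOCKS];

/* Pick a lock by hashing the address of the counter it protects. */
static atomic_bool *lock_for(const void *addr)
{
	uintptr_t h = (uintptr_t)addr >> 4;

	return &lock_table[h & (NR_LOCKS - 1)];
}

static void model_spin_lock(atomic_bool *l)
{
	while (atomic_exchange_explicit(l, true, memory_order_acquire))
		;	/* spin until the previous holder releases */
}

static void model_spin_unlock(atomic_bool *l)
{
	atomic_store_explicit(l, false, memory_order_release);
}

/* Test-and-increment made atomic by the hashed lock, not by cmpxchg. */
static bool model_get_unless_zero_locked(int *count)
{
	atomic_bool *l;
	bool got = false;

	assert(count != NULL);
	l = lock_for(count);
	model_spin_lock(l);
	if (*count != 0) {
		(*count)++;
		got = true;
	}
	model_spin_unlock(l);
	return got;
}

/* Every other update of the counter must take the same lock. */
static bool model_put_locked(int *count)
{
	atomic_bool *l;
	bool last;

	assert(count != NULL);
	l = lock_for(count);
	model_spin_lock(l);
	last = (--(*count) == 0);
	model_spin_unlock(l);
	return last;
}
```

Hashing spreads unrelated counters across different locks, so contention stays low without adding a lock to every object; the cost is that two unrelated counters will occasionally share a lock.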
This effort seems worthwhile; when this technique is used for looking up file descriptors, tiobench performance improvements of 13% to 21% are claimed. There were objections, however, to the creation of a new reference counting API which is very similar to the kref API. As a result, the patch is likely to be rewritten to use krefs, extending that API as need be to supply the required semantics.
Page editor: Jonathan Corbet
