Brief items
The current 2.6 prepatch is 2.6.8-rc1, which was
released by Linus on July 11. The list of
patches is huge; it includes the TEA and XTEA crypto algorithms, a bunch of
USB work, snapshot and mirror support in the device mapper, vast amounts of
"sparse" annotations and associated fixes, some virtual memory tweaks, an
AGP update, an NTFS update, some read-copy-update improvements, x86
no-execute support, netlink support for SELinux, a serial ATA update,
64-bit SuperH support, fixes for locking problems found by the Stanford
checker, reworked symbolic link lookups, and much more. See Linus's
announcement for the brief listing of patches, or
the long-format changelog for the details.
Linus's BitKeeper repository contains a small number of patches, including
some network driver updates, more sparse annotations, and various
fixes.
The current prepatch from Andrew Morton is 2.6.8-rc1-mm1; Andrew notes, however, that "This kernel runs
like a dessicated slug if you have more than 2G of memory due to a 32-bit
overflow." Recent additions to -mm include some latency fixes (see
below), a set of gcc 3.5 fixes, a big user-mode Linux update, and
various fixes.
The current 2.4 prepatch is 2.4.27-rc3; Marcelo has released no
patches since July 3.
Kernel development news
Content on this page will be somewhat thin next week, as your editor will
be in Ottawa for the 2004 Kernel Summit and the Ottawa Linux Symposium.
The Kernel Summit will be happening Monday and Tuesday, July 19 and 20.
The agenda has now been posted for those who are curious. The topics to be
discussed will not be
surprising to most readers: virtual memory management, NUMA, power
management, clustered storage, networking, block I/O, security, and more.
There will also be a pair of sessions on kernel support for desktop users,
featuring a cameo appearance by Keith Packard.
As usual, LWN will be carrying reports from the event; stay tuned.
Your editor is also giving a talk in the very first OLS slot, 10:00,
Wednesday, where he will engage in some wild speculation on where the 2.7
development series might go, assuming it actually starts sometime soon.
Back in June, this page looked at the sparse utility, which is being used
to search out various kinds of errors in the kernel code base. Recently,
large numbers of patches have gone in to address one particular sparse
complaint: using an integer 0 to represent a null pointer value. These
patches (example) have struck some developers as useless code churn,
leading to complaints like:
If you want people to conform people to a certain CodingStyle
please document officially in the kernel, sparse isn't distributed
with the kernel and the sparse police is silently changing the
kernel all over the place with sometimes questionable benefit. Only
the __user warnings had really found the bugs, but the rest I've
seen changes perfectly legal code.
Linus responds that programmers who
interchange NULL and zero are confused about the types they are
using and are putting that confusion into the kernel. In his desire to
enable the compiler (and other compile-time checkers) to find errors, he
wants to separate the integer and pointer types as completely as possible.
NULL is a pointer, while 0 can never be.
In other words:
char * p = 0; /* IS WRONG! DAMMIT! */
int i = NULL; /* THIS IS WRONG TOO! */
and anybody who writes code like the above either needs to get out of the
kernel, or needs to get transported to the 21st century.
One might conclude from this statement that Linus is pretty well convinced
that the current course of action is correct. He also states that, without exception, changing zero
to NULL has resulted in better, more readable code. So use of
NULL seems to have become part of the official kernel coding
style, even if the CodingStyle document is
still silent on the matter.
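One case where the distinction has practical consequences, offered here as a user-space illustration and not as anything taken from the kernel discussion: in a variadic call the compiler cannot convert an integer 0 into a pointer, so a terminating sentinel must be written as a real pointer. The helper count_strings() and its caller demo_count() are hypothetical names invented for this sketch.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the string arguments that precede a
 * null-pointer sentinel.  Each variadic argument is fetched as a pointer,
 * so the sentinel must really be a pointer; a bare 0 would be passed as
 * an int, which need not have the same size or representation.
 */
static int count_strings(const char *first, ...)
{
    va_list ap;
    int n = 0;
    const char *s;

    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, char *))
        n++;
    va_end(ap);
    return n;
}

/* Correct usage: the sentinel is spelled as a pointer. */
static int demo_count(void)
{
    return count_strings("a", "b", "c", (char *)NULL);
}
```

On a machine where pointers are wider than int, ending the argument list with a plain 0 instead could leave the loop reading garbage; keeping the two types apart makes that mistake easier to spot.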
The 2.6 kernel is becoming increasingly stable, and the user base is,
correspondingly, becoming happier. There is, however, one remaining group of
disgruntled users out there: multimedia users and developers who depend on
very quick response times from the kernel. Whether you are capturing a
video stream, playing a movie, or burning a disc, you need the system to
respond very quickly when the hardware involved needs attention. Failure
to respond in time leads to buffer overruns or underruns; those, in turn,
lead to video degradation, audio skips, writable media which is suitable
only for use as drink coasters or grade-school art projects, and flames on
various mailing lists.
The traffic has been growing in recent times, as it has become clear that
some in the multimedia community feel
discriminated against:
"We" (the audio developer community) did not participate because it
was made clear that our needs were not going to be considered. We
were told that the preemption patch was sufficient to provide "low
latency", and that rescheduling points dotted all over the place
was bad engineering (probably true). With this as the pre-rendered
verdict, there's not a lot of point in dedicating time to tracking
a situation that clearly is not going to work.
The result of this discussion has been a renewed interest among the kernel
developers in fixing this particular problem. It is pretty universally
believed that the latency issue should be close to resolved, and that it is
just a matter of fixing a few remaining trouble spots.
One approach that has been taken is the voluntary preemption patch put
together by Ingo Molnar and Arjan van de Ven. This patch tries to reduce
latency by adding more scheduling points - essentially the approach that
was taken back in the 2.4 days. Some things were done a little
differently, however.
The 2.6 kernel contains a hundred or so calls to might_sleep().
This function is a debugging aid; it is a way of marking functions which
can sleep. If might_sleep() finds itself being called in a
situation where sleeping is not allowed (while a spinlock is held, for
example) it complains loudly and, hopefully, the problem gets fixed. Ingo
and Arjan noted that any place which calls might_sleep() is, by
definition, a good place to perform scheduling. So the voluntary
preemption patch adds a cond_resched() call to might_sleep(),
allowing a higher-priority process to be scheduled, should such a process
exist. This tweak yields over 100 scheduling points without having to
actually go into the code in that many places.
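The idea can be modeled in a few lines of user-space C. This is only a sketch of the concept, not the patch itself: the variables need_resched, atomic_depth, reschedules, and warnings, and every function with a _model suffix, are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* All names below model kernel concepts in user space; none is kernel code. */
static bool need_resched;   /* set when a higher-priority task is runnable */
static int  atomic_depth;   /* > 0 while a (pretend) spinlock is held */
static int  reschedules;    /* how many times the CPU was given up */
static int  warnings;       /* how many illegal-context reports were made */

/* Stand-in for the scheduler: note the switch and clear the request. */
static void schedule_model(void)
{
    need_resched = false;
    reschedules++;
}

/* Yield only if somebody more important is waiting. */
static void cond_resched_model(void)
{
    if (need_resched)
        schedule_model();
}

/*
 * The debugging check, extended with a scheduling point.  A call made in
 * atomic context is reported and no reschedule is attempted.
 */
static void might_sleep_model(const char *where)
{
    if (atomic_depth > 0) {
        warnings++;
        fprintf(stderr, "might_sleep() in atomic context at %s\n", where);
        return;
    }
    cond_resched_model();
}
```

Every existing caller of the check becomes a place where a waiting task can run, which is how the patch gains its scheduling points without touching each call site.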
While they were at it, Ingo and Arjan also added a few scheduling points in
places that needed them, and also split up code in a couple of places which
were holding locks for too long.
This patch was not welcomed by everybody. In the mainline kernel, the
might_sleep() call can be configured out entirely for production
kernels; it is a pure debugging aid. The voluntary preemption patch turns
it into a scheduler function and makes its presence required in production
kernels. Some developers would rather see explicit rescheduling calls
added in the places where they make sense.
The strongest objection, however, would appear to be that the 2.6 kernel already
implements involuntary preemption via the preemptable kernel
option. Any place which calls might_sleep() is already, by
definition, preemptable, so the voluntary preemption patch adds nothing
which the kernel can't already do. Says Andrew Morton:
And please let me repeat: preemption is the way in which we wish to
provide low-latency. At this time, patches which sprinkle
cond_resched() all over the place are unwelcome. After 2.7 forks
we can look at it again.
So why are some developers pursuing the voluntary preemption patch? At
this time, very few distributors are shipping 2.6 kernels with kernel
preemption turned on, mostly out of fear of creating stability problems.
Kernel preemption is, itself, reasonably well debugged at this point, but
it has, over the last year or so, shaken out a fair number of bugs in other
parts of the kernel. Few such bugs have been found recently, but the
distributors continue to take a conservative approach. Users often find
bugs in surprising places, and bugs related to preemption can be incredibly
difficult to reproduce and track down. The voluntary preemption patch is a
way of getting some of the benefits of kernel preemption without turning on
a configuration option that the distributors find scary.
Andrew has often stated his wish to have the mainline kernel meet the needs
of the distributors, so he may eventually merge
the patch:
Oh I can buy the make-the-bugs-less-probable practical argument,
but sheesh. If you insist on going this way we can stick the patch
in after 2.7 has forked. I spose. The patch will actually slow
the rate of improvement of the kernel :(
Meanwhile, the effort to find the real latency issues is going forward.
William Lee Irwin and Con Kolivas have put together a patch which tries to track down high-latency
parts of the kernel. It works by making a note of when kernel code
disables preemption (usually by taking a spinlock) and when preemption is
turned back on again. If preemption is disabled for too long, a message is
printed stating where the problem is to be found.
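The bookkeeping involved might look roughly like the following user-space model. It is a sketch under stated assumptions and is not taken from the patch: the 1000-microsecond limit, the explicit timestamp arguments, and all of the names are invented here so that the logic can be followed in isolation.

```c
#include <stdio.h>

/* User-space model; the names and the threshold are illustrative assumptions. */
#define LATENCY_LIMIT_US 1000UL

static int preempt_count_model;          /* nesting depth of "disable" calls */
static unsigned long disabled_since_us;  /* time of the outermost disable */
static const char *disabled_at;          /* call site of the outermost disable */
static int reports;                      /* number of over-limit sections seen */

/* Remember when and where preemption was first turned off. */
static void preempt_disable_model(unsigned long now_us, const char *where)
{
    if (preempt_count_model++ == 0) {
        disabled_since_us = now_us;
        disabled_at = where;
    }
}

/* On the outermost enable, complain if the section ran for too long. */
static void preempt_enable_model(unsigned long now_us, const char *where)
{
    if (--preempt_count_model == 0) {
        unsigned long held = now_us - disabled_since_us;

        if (held > LATENCY_LIMIT_US) {
            reports++;
            printf("preemption off for %lu us: %s -> %s\n",
                   held, disabled_at, where);
        }
    }
}
```

Only the outermost disable/enable pair is timed, since nested sections cannot re-enable preemption on their own.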
ALSA users who are experiencing latency problems, and who would like to
help track them down, should also be aware of the xrun_debug
knob. It is described in sound/alsa/ProcFile.txt in the
Documentation directory. Turning this option on causes a message
and a kernel stack trace to be printed whenever an audio device suffers from a buffer
overrun or underrun. This information can often be used to find the source
of latency issues in short order.
Thanks to the preempt-timing patch and xrun_debug, a few suspects
have been turned up already. Console scrolling turns
out to be one of them. ReiserFS has also come up a few times as being a
source of high latency, to the point that its use in latency-critical
situations is being discouraged. Ext3 has been shown to be the source
of a few problems as well; the -mm tree currently contains a set of
patches aimed at fixing the worst of those. Another problem can be driver
ioctl() methods, which run with the big kernel lock held. The search for
latency sources is just beginning, however.
Yet another approach can be found in this
patch by Joe Korty. Software interrupts have been fingered as a
potential source of latency problems; they take priority over regular
kernel code, and have no real, hard limit on how long they can run. Joe's
patch pushes all software interrupt handling into the ksoftirqd
daemon, giving the scheduler a say on when they run. In this way,
high-priority user processes will see lower latencies - at the expense of
higher latency for the handling of software interrupts.
Tracking down and fixing the remaining latency problems may take a little
while. But enough attention is now being focused on the problem that its
resolution seems pretty well assured. The complete solution, however,
requires enabling kernel preemption, meaning that, for the time being,
2.6 users in search of low latency will have to build and install their own
kernels.
The "kref" mechanism is a simple structure for implementing
reference-counted objects in the kernel; it was covered here last March.
At the core of a kref is an atomic_t counter which contains the number of
outstanding references. When that counter goes to zero, the object is no
longer used and can be freed.
The kref functions are simple. Obtaining a reference is done with
a call to kref_get():
struct kref *kref_get(struct kref *kref)
{
    WARN_ON(!atomic_read(&kref->refcount));
    atomic_inc(&kref->refcount);
    return kref;
}
Releasing that reference is accomplished with kref_put():
void kref_put(struct kref *kref)
{
    if (atomic_dec_and_test(&kref->refcount)) {
        kref->release(kref);
    }
}
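As a rough illustration of how an object might embed such a counter, here is a user-space sketch built on C11 atomics instead of the kernel's atomic_t. Every name in it (kref_model, session, session_release, and so on) is made up for the example; only the get/put logic is meant to mirror the functions above.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in for the kernel structure; not the real definition. */
struct kref_model {
    atomic_int refcount;
    void (*release)(struct kref_model *kref);
};

static void kref_model_init(struct kref_model *kref,
                            void (*release)(struct kref_model *))
{
    atomic_init(&kref->refcount, 1);   /* the creator holds the first reference */
    kref->release = release;
}

static struct kref_model *kref_model_get(struct kref_model *kref)
{
    atomic_fetch_add(&kref->refcount, 1);
    return kref;
}

static void kref_model_put(struct kref_model *kref)
{
    /* fetch_sub returns the previous value: 1 means the last reference went away */
    if (atomic_fetch_sub(&kref->refcount, 1) == 1)
        kref->release(kref);
}

/* Hypothetical object embedding the counter. */
struct session {
    int id;
    struct kref_model ref;
};

static int sessions_freed;

static void session_release(struct kref_model *kref)
{
    /* Recover the containing object from the embedded member. */
    struct session *s =
        (struct session *)((char *)kref - offsetof(struct session, ref));

    free(s);
    sessions_freed++;
}

static struct session *session_create(int id)
{
    struct session *s = malloc(sizeof(*s));

    if (s == NULL)
        return NULL;
    s->id = id;
    kref_model_init(&s->ref, session_release);
    return s;
}
```

A caller creates the object with one reference, takes extra references before handing the pointer to others, and drops each one with a put; the release callback runs exactly once, when the final reference is dropped.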
The use of atomic types makes these functions safe in multiprocessor or
preemptive environments; the reference count will always be correct.
Except, of course, when things go wrong. Consider the following order of
operations performed by two kernel threads; they could be running on
separate processors, or on a preemptive, uniprocessor system:
| Thread 1                                | Thread 2         |
|-----------------------------------------|------------------|
| /* In kref_get() */                     |                  |
| WARN_ON(!atomic_read(&kref->refcount)); |                  |
|                                         | kref_put(&kref); |
| atomic_inc(&kref->refcount);            |                  |
| return kref;                            |                  |
If the kref_put() call in the second thread drops the last reference, the
first thread will be left thinking it holds a reference to an object
which, in fact, has been deleted. As a general rule, good things cannot be
expected to result from this situation. The kref code deals with
this possibility by fiat: simultaneous calls to
kref_get() and kref_put() on the same object are not
allowed. In practice, this restriction usually requires that these
operations be called under the protection of a lock somewhere.
Developers interested in high-end scalability, however, often try to use
lock-free algorithms. Locks can easily become a performance bottleneck as
the number of threads increases, so, if they can be eliminated, the kernel
will scale better. That is the motivation behind the use of techniques
like seqlocks and read-copy-update (RCU). The locking requirement
associated with the kref type makes that type difficult to use with these
techniques.
Ravikiran G Thirumalai recently posted a patch entitled "Refcounting of objects part of a lockfree
collection" which implements a new reference-counting type (called
refcount_t) for dealing with objects managed using no-lock
techniques. The explanation goes to great lengths to describe reference
counting issues when working with RCU, but, in the end, all the patch is
really doing, via a long path, is making a type which is like the
kref, but which is not subject to the race described above.
kref_get(), as currently written, checks the reference count
first; if that count is zero, the object has already been freed. The
current implementation merely complains when this happens; one could argue
that stronger action is called for. The real problem, though, is that this
test and the subsequent incrementing of the reference count are not,
together, atomic - other actions can come between the two. Ravikiran's
patch addresses this issue by coding his _get() function
differently:
static inline int refcount_get_rcu(refcount_t *rc)
{
    int c, old;
    c = atomic_read(&rc->count);
    while ( c && (old = cmpxchg(&rc->count.counter, c, c+1)) != c)
        c = old;
    return c;
}
The core of this function is the call to cmpxchg(), which is an
inline assembly function giving access to the processor's cmpxchg
instruction. The function prototype looks like:
int cmpxchg(int *location, int old, int new);
(The actual definition is a little more complex, depending on the real type
of location). The purpose of this function is to (1) compare
the contents of *location with old, (2) if and only
if the two are the same, assign new to *location, and
(3) return the value which *location held before the operation. If
cmpxchg() returns old, the operation succeeded; otherwise the value
pointed to by location is unchanged. The key point is that all of these
operations are performed in an atomic manner.
cmpxchg() is, in other words, a form of test-and-set instruction.
It is used here to increment the reference count in an atomic manner while
being absolutely sure that nobody else can possibly have seen that count
reach zero. When references are obtained in this way, the race described
above cannot happen.
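The same logic can be tried out in user space with C11 atomics. The sketch below is an approximation under stated assumptions, not the patch's code: cmpxchg_model() and refcount_get_model() are invented names, and atomic_compare_exchange_strong() stands in for the kernel's cmpxchg().

```c
#include <stdatomic.h>

/*
 * Model of cmpxchg(): returns the value *location held before the call.
 * atomic_compare_exchange_strong() writes that value into 'expected' on
 * failure and leaves it equal to 'old' on success, so returning
 * 'expected' covers both cases.
 */
static int cmpxchg_model(atomic_int *location, int old, int new_val)
{
    int expected = old;

    atomic_compare_exchange_strong(location, &expected, new_val);
    return expected;
}

/*
 * Take a reference only if the count is still nonzero.  Returns the
 * count seen just before the increment, or 0 if no reference was taken.
 */
static int refcount_get_model(atomic_int *count)
{
    int c = atomic_load(count);
    int old;

    /* Retry with the freshly observed value until the swap succeeds or
     * the count is found to be zero. */
    while (c && (old = cmpxchg_model(count, c, c + 1)) != c)
        c = old;
    return c;
}
```

A zero return tells the caller that the object was already on its way out and must not be used; any other return means the count was raised from exactly that value.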
There is still a pitfall, however. If the reference-counted object were to
be freed and reused before another thread tried to obtain a reference, that
thread might see a random "reference count" and think it succeeded.
Preventing that turn of events is where RCU comes in. The actual object is
freed by way of an RCU callback, which cannot happen until every processor
has scheduled. If any thread can see a pointer to the object, said object
will continue to exist, though its reference count may be zero. After a
complete quiescence cycle, no threads can see such a pointer, and the
object can be safely deleted.
One other potential problem is that not all architectures offer a
cmpxchg instruction. On such systems Ravikiran uses a rather more
elaborate and unsightly scheme involving a hashed array of spinlocks; see
the patch if morbid curiosity gets the better of you.
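For readers who would rather not dig through the patch, the general shape of such a fallback can be sketched as follows. This is a guess at the technique, written for user space with C11 atomics, and not a copy of Ravikiran's code; the table size, the hash, and every name are assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_LOCKS 16   /* arbitrary power of two chosen for this sketch */

/* Static storage is zero-initialized, so every lock starts out free. */
static atomic_int lock_table[NR_LOCKS];

/* Pick a lock from the address of the word being updated. */
static atomic_int *lock_for(const void *addr)
{
    /* Drop low bits that are identical for all aligned ints, then mask. */
    uintptr_t h = (uintptr_t)addr >> 2;

    return &lock_table[h & (NR_LOCKS - 1)];
}

static void spin_lock_model(atomic_int *l)
{
    while (atomic_exchange_explicit(l, 1, memory_order_acquire))
        ;   /* spin until the previous holder stores 0 */
}

static void spin_unlock_model(atomic_int *l)
{
    atomic_store_explicit(l, 0, memory_order_release);
}

/* Same contract as cmpxchg(): returns the value found in *location. */
static int cmpxchg_emulated(int *location, int old, int new_val)
{
    atomic_int *l = lock_for(location);
    int seen;

    spin_lock_model(l);
    seen = *location;
    if (seen == old)
        *location = new_val;
    spin_unlock_model(l);
    return seen;
}
```

The emulation is only safe if every update of the word goes through the same lock table; hashing over several locks keeps unrelated counters from contending on a single one.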
This effort seems worthwhile; when this technique is used for looking up file descriptors,
tiobench performance improvements of 13% to 21% are claimed.
There were objections, however, to the creation of a new
reference counting API which is very similar to the kref API. As
a result, the patch is likely to be rewritten to use krefs,
extending that API as need be to supply the required semantics.
Page editor: Jonathan Corbet