The current 2.6 prepatch remains 2.6.21-rc3. About 250 patches have
found their way into the mainline repository since -rc3 was released; -rc4
will likely come out shortly after LWN is published this week.
There are two current -mm trees, differing only in their inclusion of one
patch set: 2.6.21-rc3-mm1
(which includes the RSDL scheduler) and 2.6.21-rc3-mm2 (which does not).
The current stable 2.6 kernel is 2.6.20.3, released on March 13 with a
couple dozen fixes. 2.6.20.2
was released on March 9 with a full 100 patches.
For older kernels: 2.6.16.44 was released on
March 8. 2.6.16.45-rc1
is out with a number of fixes, including a couple of security patches.
Kernel development news
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) \
+	+ sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
+		 typeof(&arr[0]))]))*0)
-- Rusty Russell
Rusty, that's a work of art.
However, I would suggest that you never show it to anybody ever
again. I'm sure that in fifty years, it will be worth much more. So
please keep it tightly under wraps, to keep people from gouging
their eyes out^W^W^W^W^W^W^W make a killing in the art market.
-- Linus Torvalds
The long story of the kevent subsystem has appeared on this page a number
of times. Kevents are designed to give applications a single system call
which they can use to wait for any events of interest: I/O, timers,
signals, and more. While quite a bit of work has been done on this code,
its path into the kernel has been long. A number of developers are still
unconvinced that the interface is needed, and, if it is, that the proposed
kevent API (which would have to be maintained forever) is the right one.
Now there is a competing approach which may prove easier for the community
to accept.
Davide Libenzi is the creator of the epoll_wait() system call; it
is a version of poll() which is intended to be scalable to large
numbers of file descriptors. This API seems to be well regarded for what
it does, but it is limited to waiting on file descriptors. Many of the
things that kevents address are not associated with files, and so cannot be
handled through the epoll interface.
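For those who have not worked with it, the epoll pattern looks something
like the following minimal sketch (error handling omitted; sock_fd stands
in for any descriptor of interest):

    #include <sys/epoll.h>

    void event_loop(int sock_fd)
    {
        struct epoll_event ev, events[16];
        int epfd = epoll_create(16);       /* the size argument is only a hint */

        ev.events = EPOLLIN;               /* wake up when sock_fd is readable */
        ev.data.fd = sock_fd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, events, 16, -1);  /* block until events arrive */
            for (int i = 0; i < n; i++) {
                /* handle I/O on events[i].data.fd */
            }
        }
    }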
Kevents fix that shortcoming with the creation of a new subsystem and
user-space API. Davide has now shown up with a different strategy: make a
way for applications to request delivery of events via a file descriptor.
Consider, for example, the case of signals. Signals tend to be tricky for
applications to handle; they are asynchronous events which are delivered to
a special signal handler function, but that function is seriously limited
in what it can do. In response, application developers have resorted to
tricks like writing a byte to an internal pipe so that the signal can be
handled in the main event loop.
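The venerable "self-pipe trick" reduces to something like this sketch
(error handling omitted); the handler confines itself to write(), which is
async-signal-safe, and the main event loop sees the signal as ordinary
data on the pipe:

    #include <signal.h>
    #include <unistd.h>

    static int sig_pipe[2];                /* created with pipe() at startup */

    static void handler(int signum)
    {
        unsigned char b = (unsigned char)signum;
        write(sig_pipe[1], &b, 1);         /* safe to call in signal context */
        /* the event loop polls sig_pipe[0] and handles the signal there */
    }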
Davide has proposed a new system
call named signalfd() which can help developers avoid much of
the hassle of working with signals:
int signalfd(int ufd, const sigset_t *mask, size_t masksize);
If ufd is -1, this call will create (and return) a new
file descriptor. The signals described in mask will be caught and
returned to the process by way of that file descriptor. It is pollable,
allowing signals to be handled in an event loop based on select(),
poll(), or epoll_wait(). When signals are available, they can be read
from the descriptor as data; the signalfd_siginfo structure
returned by read() has the signal number and all of the related
information that comes with it.
If ufd is set to an existing signal file descriptor, the
signalfd() call will change that descriptor's mask to the new one. It is worth
noting that reading from this file descriptor competes with normal signal
delivery for queued signals; there is no way to predict whether the signal
will be delivered in the usual way or will be read from the file
descriptor. This situation can be avoided by using sigprocmask()
to block normal delivery of the signal(s) of interest.
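Put together, usage might look like the sketch below. This reflects the
interface as proposed - there is no glibc wrapper, so a real program would
have to go through syscall() - and error handling is omitted:

    sigset_t mask;
    struct signalfd_siginfo info;

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigprocmask(SIG_BLOCK, &mask, NULL);   /* block normal delivery first */

    int sfd = signalfd(-1, &mask, sizeof(mask));   /* -1 creates a new fd */

    /* add sfd to a select()/poll()/epoll set; when it becomes readable: */
    read(sfd, &info, sizeof(info));        /* the signal number and details */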
There is a similar interface for timers:
int timerfd(int ufd, int clockid, int timertype,
const struct timespec *when);
Once again, ufd is -1 to create a new file descriptor, or
an existing timer file descriptor which is to be modified. The
clockid parameter describes which clock is wanted:
CLOCK_MONOTONIC or CLOCK_REALTIME. The type of timer is
described by timertype: TFD_TIMER_REL for a time relative
to the current time, TFD_TIMER_ABS for an absolute time, or
TFD_TIMER_SEQ for a repeating timer at a given interval. The
when structure contains the requested expiration time.
Once again, this file descriptor can be polled. Reading from it yields an
integer value saying how many times the timer has fired since the last time
it was read.
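As a sketch, a one-second repeating timer under this proposal might be set
up as follows (again using the patch's prototype; the unsigned long read
back from the descriptor is an assumption, and error handling is omitted):

    struct timespec when = { .tv_sec = 1, .tv_nsec = 0 };
    unsigned long ticks;

    /* a timer firing every second on the monotonic clock */
    int tfd = timerfd(-1, CLOCK_MONOTONIC, TFD_TIMER_SEQ, &when);

    /* poll tfd alongside other descriptors; when it becomes readable: */
    read(tfd, &ticks, sizeof(ticks));      /* expirations since the last read */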
Evgeniy Polyakov, the author of the kevent patches, has not been sitting
still while these patches have gone around. His proposal is called
eventfs; it is a special
filesystem which offers the ability to bind events to file descriptors.
The first version of the patch only handles signals, via a system call
named (yes) signalfd():
int signalfd(int signal, int flags);
This call creates a new file descriptor for the given signal (a
separate file descriptor is required for each signal in this scheme). In
the current code, if flags is nonzero, the signal will only be
delivered through eventfs and will never go into the signal queue. The
file descriptor is pollable, but there is no way to read any information
from it. So any associated signal information is lost; multiple deliveries
of the same signal between polls will also be lost.
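A sketch of the equivalent setup under eventfs thus needs one descriptor
per signal (the nonzero flags value shown here, requesting eventfs-only
delivery, is illustrative):

    int intfd  = signalfd(SIGINT, 1);      /* one descriptor per signal */
    int termfd = signalfd(SIGTERM, 1);

    /* poll intfd and termfd with everything else; the wakeup itself is
       the only information available - there is no siginfo to read */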
One assumes that Evgeniy's patches could be improved over time, but
Davide's version seems to be ahead in terms of features, coverage, and
community review. Davide has also avoided the need to create a new
filesystem to back the whole thing up. So if bets were being taken on
which approach might make it into the kernel, Davide would seem to be in
the lead at the moment.
There are certainly things to be said for this approach. It brings Linux
toward a single interface for event delivery without the need for a new,
complex API. It also reinforces the role of the file descriptor as a
fundamental object for interaction with the kernel. On the other hand,
the poll interfaces do not provide a way for
applications to receive events without the need to call into the kernel - a
feature which has been requested by some interested parties. There are
also event types (asynchronous I/O completion, for example) which are not
yet covered. So, if things do go this way, it would not be
surprising to see patches trying to fill in those gaps in the near future.
As flame wars go, this one was somewhat more technical and inscrutable than
most. It was, however, still a flame war. The core issue was this: is the
addition of the paravirt_ops
layer, now beginning to be used to support running Linux under hypervisors,
a good thing or a long-term maintenance disaster for the Linux kernel?
It all started with a patch added to the -mm tree; it seems that some work
on the new clockevents code
broke the VMI virtualization layer. So the developers at VMware put
together a fix, but that fix did not sit well with the core clockevents
developers. In their view, it took much of the older time-related code,
which they had worked so hard to get rid of, and shoved it back under the
VMI layer. Thomas Gleixner did not like it at all:
This is ugly as hell. NO_HZ enables the dyntick functions in
idle(), irq_enter() and irq_exit() so the clockevents code is
actually invoked. I have not looked close enough why this does
work at all.
I have the feeling that "working fine" means something like "does not
The right solution, according to Thomas, is for all of the people who are
working on hypervisors and Linux to get together and come up with a single
timer interface based on clockevents. This should not be all that hard of
a job, in his opinion. The VMI hackers may well be willing to do that over
time, but they don't see that as something which can be done in the near
future. Their current code works, and, besides, they are on the verge of a
product release and would rather not thrash things up at this time.
"On the verge of a product release" is not an excuse which flies far on
linux-kernel. This is doubly true in this case, where some of the people
involved feel that the VMI developers should have seen clockevents coming
and developed for that interface over the last year. They see the current
VMI timer code as being the beginning of a long-term maintenance problem.
Ingo Molnar widened the discussion to the
problems he sees with paravirt_ops in general. The posting is long, but
the core point seems to be this: every hypervisor connection implemented
with paravirt_ops becomes an ABI that the kernel must then maintain
forever. The paravirt_ops interface itself is supposed to insulate the
kernel from changes, and that API can change. But each hypervisor
interface done through paravirt_ops must continue to work into the future,
meaning that certain sorts of fundamental design changes cannot be made.
Maintaining compatibility with several hypervisors will be hard, and Ingo
sees bad things happening when something inevitably breaks:
And it doesn't matter whether we think that it was VMWare who messed
up. Users/customers _will_ blame us: "v2.6.25 regresses, it wont
run under ESX v1.12 anymore". Distro will yield and will undo
whatever change breaks backwards compatibility with older
hypervisors. (most likely it will be undone upstream already)
Backwards compatibility acts as a very heavy barrier against
certain types of paravirt_ops design changes.
There have not been a whole lot of others supporting this point of view,
though. The current abuses are seen as things which can be fixed, people
seem to be sanguine about the ability to maintain compatibility in the
paravirt_ops interface code, and, most likely, many people simply tune out
of virtualization discussions. Linus suggests that Ingo point out specific problems
(and fix them if he desires) rather than complaining about general
problems. Ingo's response is that hypervisor interfaces should be treated
like system calls, and added with the same degree of care and deliberation.
In the end, it is not clear that anything will change. There is a high
level of interest in getting hypervisor support into the kernel, and that
process is unlikely to stop. So expect to see some more serious squabbles
about what is done in hypervisor interfaces in the future. If we are
lucky, that process, while noisy, will result in the evolution of the
paravirt_ops code toward something which proves to be maintainable over the
long term.
In last week's episode, the
Rotating Staircase Deadline Scheduler (RSDL) had appeared out of the blue
and was busily impressing testers left and right. One person even called
for it to go straight into 2.6.21. In reality, the replacement of
something as fundamental as the CPU scheduler was never going to be an
entirely smooth operation. So it's not all that surprising that the RSDL
has run into an obstacle or two.
The biggest snag would appear to be this
workload reported by Mike Galbraith. Mike is trying to run some CPU
hogs (MP3 encoding, in particular) in the background while watching some
interactive eye candy. It's a load that works with the current scheduler,
but it becomes sluggish when running under RSDL. There have been a couple of
other reports of a visible interactive slowdown when serious computation is
going on - though others have reported better results.
There is little surprise in the appearance of behavioral regressions
for certain workloads. Few people would have expected RSDL to be perfect
within a week of its first posting. The real difficulty, instead, is that
RSDL creator Con Kolivas has reacted in a somewhat defensive manner, refusing to see the behavior as a regression:
Your expectations of what you should be able to do are simply
skewed. Find what cpu balance you loved in the old one (and I
believe it wasn't that much more cpu in favour of X if I recall
correctly) and simply change the nice setting on your lame encoder
- since you're already setting one anyway.
We simply cannot continue arguing that we should dish out
unfairness in any manner any more. It will always come back and
bite us where we don't want it. We are getting good interactive
response with a fair scheduler yet you seem intent on overloading
it to find fault with it.
Con's position is that the scheduler should strive to provide fairness and
low latency; any further expectations about interactive response should
then be addressed by playing with nice levels. The interactivity estimator
built into the current scheduler is just too difficult to work with; the
kernel should not be in that particular business. The problem is that
this approach conflicts with how Linux users have come to expect things to
work.
As soon as one looks at improving RSDL for these situations, one gets into
the same old discussions on improving interactive response in general.
Linus pointed out that RSDL's way of
scheduling is not quite as fair as it could be, since it does not always
account for work in the right place:
And the problem is that a lot of clients actually end up doing
*more* in the X server than they do themselves directly. Doing
things like showing a line of text on the screen is a lot more
expensive than just keeping track of that line of text, so you end
up with the X server easily being marked as getting "too much" CPU
time, and the clients as being starved for CPU time. And then you
get bad interactive behaviour.
There are a couple of ways of handling problems like this. One is to just
favor the X server, either by somehow marking it as the core of interactive
behavior or by simply raising its priority. Con has been in favor of the
latter approach; to that end, he has posted a
separate patch which is aimed at improving latencies for all processes,
even when they are not all running at the same priority levels. There have
not been any follow-up results reported as of this writing.
This difficulty may well not keep RSDL out of the mainline kernel. The
advantages inherent in dumping the interactivity heuristics are large, and
RSDL does seem to improve life for a number of users. Noticeable
performance regressions for some workloads are a problem, though; nobody
wants to field a bunch of "2.6.x turned my response to crap" messages from
unhappy users. So expect some iterations on this project yet - and,
perhaps, an additional kernel cycle or two before it can be merged.