Kernel development

Release status

Kernel release status

The current 2.6 prepatch remains 2.6.21-rc3. About 250 patches have found their way into the mainline repository since -rc3 was released; -rc4 will likely come out shortly after LWN is published this week.

There are two current -mm trees, differing only in their inclusion of one patch set: 2.6.21-rc3-mm1 (which includes the RSDL scheduler) and 2.6.21-rc3-mm2 (which does not).

The current stable 2.6 kernel is 2.6.20.3, released on March 13 with a couple dozen fixes. 2.6.20.2 was released on March 9 with a full 100 patches.

For older kernels: 2.6.16.43 was released on March 8. 2.6.16.44-rc1 is out with a number of fixes, including a couple of security patches.

Kernel development news

Quotes of the week

-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])		      \
+	+ sizeof(typeof(int[1 - 2*!!__builtin_types_compatible_p(typeof(arr), \
+		 typeof(&arr[0]))]))*0)

-- Rusty Russell

Rusty, that's a work of art.

However, I would suggest that you never show it to anybody ever again. I'm sure that in fifty years, it will be worth much more. So please keep it tightly under wraps, to keep people from gouging their eyes out^W^W^W^W^W^W^W make a killing in the art market.

-- Linus Torvalds

Kernel events without kevents

The long story of the kevent subsystem has appeared on this page a number of times. Kevents are designed to give applications a single system call which they can use to wait for any events of interest: I/O, timers, signals, and more. While quite a bit of work has been done on this code, its path into the kernel has been long. A number of developers are still unconvinced that the interface is needed, and, if it is, that the proposed kevent API (which would have to be maintained forever) is the right one. Now there is a competing approach which may prove easier for the community to accept.

Davide Libenzi is the creator of the epoll_wait() system call; it is a version of poll() which is intended to be scalable to large numbers of file descriptors. This API seems to be well regarded for what it does, but it is limited to waiting on file descriptors. Many of the things that kevents address are not associated with files, and so cannot be handled through the epoll interface.

Kevents fix that shortcoming with the creation of a new subsystem and user-space API. Davide has now shown up with a different strategy: make a way for applications to request delivery of events via a file descriptor. Consider, for example, the case of signals. Signals tend to be tricky for applications to handle; they are asynchronous events which are delivered to a special signal handler function, but that function is seriously limited in what it can do. In response, application developers have resorted to tricks like writing a byte to an internal pipe so that the signal can be handled in the main event loop.
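The pipe trick mentioned above can be sketched roughly as follows. This is a minimal illustration rather than code from any particular application; the names wake_pipe, on_signal(), self_pipe_init(), and self_pipe_wait() are invented for the example. The handler does nothing beyond an async-signal-safe write(), and the event loop learns of the signal when the read end of the pipe becomes readable:

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical names throughout; [0] is the read end, [1] the write end. */
static int wake_pipe[2] = { -1, -1 };

/* Signal handler: only async-signal-safe calls are allowed here, so it
 * just pushes the signal number into the pipe and returns. */
static void on_signal(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t n = write(wake_pipe[1], &byte, 1);

    (void)n;                     /* nothing useful can be done on failure */
    errno = saved_errno;
}

/* Create the pipe and route signo into it. Returns 0, or -1 on error. */
static int self_pipe_init(int signo)
{
    struct sigaction sa = { 0 };

    if (pipe(wake_pipe) < 0)
        return -1;
    /* Non-blocking, so a full pipe can never stall the handler. */
    for (int i = 0; i < 2; i++) {
        int fl = fcntl(wake_pipe[i], F_GETFL);

        if (fl < 0 || fcntl(wake_pipe[i], F_SETFL, fl | O_NONBLOCK) < 0)
            return -1;
    }
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* One pass of an event loop: wait up to timeout_ms for a signal byte.
 * Returns the signal number, 0 on timeout, or -1 on error. */
static int self_pipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = wake_pipe[0], .events = POLLIN };
    unsigned char byte;
    int rc;

    assert(wake_pipe[0] >= 0);   /* self_pipe_init() must have succeeded */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);  /* the signal itself interrupts poll() */
    if (rc <= 0)
        return rc;
    return read(wake_pipe[0], &byte, 1) == 1 ? byte : -1;
}
```

A program would call self_pipe_init() once at startup and then add wake_pipe[0] to the descriptor set it already polls; self_pipe_wait() stands in for that loop here. Note that only the signal number survives the trip through the pipe.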

Davide has proposed a new system call named signalfd() which can help developers avoid much of the hassle of working with signals:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

If ufd is -1, this call will create (and return) a new file descriptor. The signals described in mask will be caught and returned to the process by way of that file descriptor. It is pollable, allowing signals to be handled in an event loop based on select(), poll() or epoll_wait(). When signals are available, they can be read from the descriptor as data; the signalfd_siginfo structure returned by read() has the signal number and all of the related information that comes with it.

If ufd is set to an existing signal file descriptor, the signalfd() call will instead replace that descriptor's signal mask with the new one. It is worth noting that reading from this file descriptor competes with normal signal delivery for queued signals; there is no way to predict whether a given signal will be delivered in the usual way or will be read from the file descriptor. This situation can be avoided by using sigprocmask() to block normal delivery of the signal(s) of interest.
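As a rough sketch of how this might look in an application, the code below blocks a signal and then waits for it with poll(). It rests on an assumption that should be stated loudly: it uses the signalfd() wrapper found in current glibc (declared in <sys/signalfd.h> and taking a flags argument), whose signature differs from the three-argument prototype in the patch described above. The helper names open_signal_fd() and wait_signal() are invented for the example.

```c
#define _GNU_SOURCE              /* expose Linux-specific declarations */
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block signo for normal delivery and return a descriptor that reports
 * it instead. Returns -1 on error. (Hypothetical helper.) */
static int open_signal_fd(int signo)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, signo);
    /* Without this, normal delivery would compete with read(). */
    if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
        return -1;
    /* glibc's form, not the patch's: -1 asks for a new descriptor. */
    return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Wait up to timeout_ms for a signal on sfd. Returns the signal number,
 * 0 on timeout, or -1 on error. (Hypothetical helper.) */
static int wait_signal(int sfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = sfd, .events = POLLIN };
    struct signalfd_siginfo info;
    int rc;

    assert(sfd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;
    /* Each read() returns one structure per queued signal. */
    if (read(sfd, &info, sizeof info) != (ssize_t)sizeof info)
        return -1;
    return (int)info.ssi_signo;
}
```

The descriptor returned by open_signal_fd() can sit in the same poll() set as sockets and pipes, so no signal handler function is needed at all.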

There is a similar interface for timer events:

    int timerfd(int ufd, int clockid, int timertype, 
                const struct timespec *when);

Once again, ufd is either -1 (to create a new file descriptor) or an existing timer file descriptor which is to be modified. The clockid parameter describes which clock is wanted: CLOCK_MONOTONIC or CLOCK_REALTIME. The type of timer is described by timertype: TFD_TIMER_REL for a time relative to the current time, TFD_TIMER_ABS for an absolute time, or TFD_TIMER_SEQ for a repeating timer at a given interval. The when structure contains the requested expiration time.

Once again, this file descriptor can be polled. Reading from it yields an integer value saying how many times the timer has fired since the last time it was read.
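The read-a-count behavior can be illustrated with a short sketch. As an explicit assumption, it uses the timerfd_create() and timerfd_settime() calls provided by current glibc rather than the single timerfd() call in the patch described above, so the argument layout is different; only the idea of reading an expiration count from a descriptor is the same. The helper names open_interval_timer() and read_expirations() are invented for the example.

```c
#define _GNU_SOURCE              /* expose Linux-specific declarations */
#include <assert.h>
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Create a repeating timer that fires every interval_ms milliseconds.
 * Returns the timer descriptor, or -1 on error. (Hypothetical helper.) */
static int open_interval_timer(long interval_ms)
{
    struct itimerspec its;
    int tfd;

    assert(interval_ms > 0);
    its.it_interval.tv_sec = interval_ms / 1000;
    its.it_interval.tv_nsec = (interval_ms % 1000) * 1000000L;
    its.it_value = its.it_interval;  /* first expiry after one interval */

    tfd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
    if (tfd < 0)
        return -1;
    if (timerfd_settime(tfd, 0, &its, NULL) < 0) {  /* 0: relative time */
        close(tfd);
        return -1;
    }
    return tfd;
}

/* Block until the timer has fired at least once, then return how many
 * times it fired since the previous read, or -1 on error. */
static long long read_expirations(int tfd)
{
    uint64_t count;

    if (read(tfd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return (long long)count;
}
```

An application that falls behind can use the returned count to tell how many intervals it missed, which a bare wakeup would not reveal.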

Evgeniy Polyakov, the author of the kevent patches, has not been sitting still while these patches have gone around. His proposal is called eventfs; it is a special filesystem which offers the ability to bind events to file descriptors. The first version of the patch only handles signals, via a system call named (yes) signalfd():

    int signalfd(int signal, int flags);

This call creates a new file descriptor for the given signal (a separate file descriptor is required for each signal in this scheme). In the current code, if flags is nonzero, the signal will only be delivered through eventfs and will never go into the signal queue. The file descriptor is pollable, but there is no way to read any information from it. So any associated signal information is lost; multiple deliveries of the same signal between polls will also be lost.

One assumes that Evgeniy's patches could be improved over time, but Davide's version seems to be ahead in terms of features, coverage, and community review. Davide has also avoided the need to create a new filesystem to back the whole thing up. So if bets were being taken on which approach might make it into the kernel, Davide would seem to be in the lead at the moment.

There are certainly things to be said for this approach. It brings Linux toward a single interface for event delivery without the need for a new, complex API. It also reinforces the role of the file descriptor as a fundamental object for interaction with the kernel. On the other hand, the poll interfaces do not provide a way for applications to receive events without the need to call into the kernel - a feature which has been requested by some interested parties. There are also event types (asynchronous I/O completion, for example) which are not yet covered. So, if things do go this way, it would not be surprising to see patches trying to fill in those gaps in the near future.

paravirt_ops considered harmful?

As flame wars go, this one was somewhat more technical and inscrutable than most. It was, however, still a flame war. The core issue was this: is the addition of the paravirt_ops layer, now beginning to be used to support running Linux under hypervisors, a good thing or a long-term maintenance disaster for the Linux kernel?

It all started with a patch added to the -mm tree; it seems that some work on the new clockevents code broke the VMI virtualization layer. So the developers at VMware put together a fix, but that fix did not sit well with the core clockevents developers. In their view, it took much of the older time-related code, which they had worked so hard to get rid of, and shoved it back under the VMI layer. Thomas Gleixner did not like this solution:

This is ugly as hell. NO_HZ enables the dyntick functions in idle(), irq_enter() and irq_exit() so the clockevents code is actually invoked. I have not looked close enough why this does work at all. I have the feeling that "working fine" means something like "does not explode".

The right solution, according to Thomas, is for all of the people who are working on hypervisors and Linux to get together and come up with a single timer interface based on clockevents. This should not be all that hard of a job, in his opinion. The VMI hackers may well be willing to do that over time, but they don't see that as something which can be done in the near future. Their current code works, and, besides, they are on the verge of a product release and would rather not thrash things up at this time.

"On the verge of a product release" is not an excuse which flies far on linux-kernel. This is doubly true in this case, where some of the people involved feel that the VMI developers should have seen clockevents coming and developed for that interface over the last year. They see the current VMI timer code as being the beginning of a long-term maintenance nightmare.

Ingo Molnar widened the discussion to the problems he sees with paravirt_ops in general. The posting is long, but the core point seems to be this: every hypervisor connection implemented with paravirt_ops becomes an ABI that the kernel must then maintain forever. The paravirt_ops interface itself is supposed to insulate the kernel from changes, and that API can change. But each hypervisor interface done through paravirt_ops must continue to work into the future, meaning that certain sorts of fundamental design changes cannot be made. Maintaining compatibility with several hypervisors will be hard, and Ingo sees bad things when one inevitably breaks:

And it doesn't matter whether we think that it was VMWare who messed up. Users/customers _will_ blame us: "v2.6.25 regresses, it wont run under ESX v1.12 anymore". Distro will yield and will undo whatever change breaks backwards compatibility with older hypervisors. (most likely it will be undone upstream already) Backwards compatibility acts as a very heavy barrier against certain types of paravirt_ops design changes.

There have not been a whole lot of others supporting this point of view, though. The current abuses are seen as things which can be fixed, people seem to be sanguine about the ability to maintain compatibility in the paravirt_ops interface code, and, most likely, many people simply tune out of virtualization discussions. Linus suggests that Ingo point out specific problems (and fix them if he desires) rather than complaining about general problems. Ingo's response is that hypervisor interfaces should be treated like system calls, and added with the same degree of care and deliberation.

In the end, it is not clear that anything will change. There is a high level of interest in getting hypervisor support into the kernel, and that process is unlikely to stop. So expect to see some more serious squabbles about what is done in hypervisor interfaces in the future. If we are lucky, that process, while noisy, will result in the evolution of the paravirt_ops code toward something which proves to be maintainable over the long term.

RSDL hits a snag

In last week's episode, the Rotating Staircase Deadline Scheduler (RSDL) had appeared out of the blue and was busily impressing testers left and right. One person even called for it to go straight into 2.6.21. In reality, the replacement of something as fundamental as the CPU scheduler was never going to be an entirely smooth operation. So it's not all that surprising that the RSDL has run into an obstacle or two.

The biggest snag would appear to be this workload reported by Mike Galbraith. Mike is trying to run some CPU hogs (MP3 encoding, in particular) in the background while watching some interactive eye candy. It's a load that works with the current scheduler, but it becomes sluggish when running under RSDL. There have been a couple of other reports of a visible interactive slowdown when serious computation is going on - though others have reported better results.

There is little surprise in the appearance of behavioral regressions for certain workloads. Few people would have expected RSDL to be perfect within a week of its first posting. The real difficulty, instead, is that RSDL creator Con Kolivas has reacted in a somewhat defensive manner, refusing to see the behavior as a regression:

Your expectations of what you should be able to do are simply skewed. Find what cpu balance you loved in the old one (and I believe it wasn't that much more cpu in favour of X if I recall correctly) and simply change the nice setting on your lame encoder - since you're already setting one anyway.

We simply cannot continue arguing that we should dish out unfairness in any manner any more. It will always come back and bite us where we don't want it. We are getting good interactive response with a fair scheduler yet you seem intent on overloading it to find fault with it.

Con's position is that the scheduler should strive to provide fairness and low latency; any further expectations about interactive response should then be addressed by playing with nice levels. The interactivity estimator built into the current scheduler is just too difficult to work with; the kernel should not be in that particular business. The problem is that this approach conflicts with how Linux users have come to expect things to work.

As soon as one looks at improving RSDL for these situations, one gets into the same old discussions on improving interactive response in general. Linus pointed out that RSDL's way of scheduling is not quite as fair as it could be, since it does not always account for work in the right place:

And the problem is that a lot of clients actually end up doing *more* in the X server than they do themselves directly. Doing things like showing a line of text on the screen is a lot more expensive than just keeping track of that line of text, so you end up with the X server easily being marked as getting "too much" CPU time, and the clients as being starved for CPU time. And then you get bad interactive behaviour.

There are a couple of ways of handling problems like this. One is to just favor the X server, either by somehow marking it as the core of interactive behavior or by simply raising its priority. Con has been in favor of the latter approach; to that end, he has posted a separate patch which is aimed at improving latencies for all processes, even when they are not all running at the same priority levels. There have not been any follow-up results reported as of this writing.

This difficulty may well not keep RSDL out of the mainline kernel. The advantages inherent in dumping the interactivity heuristics are large, and RSDL does seem to improve life for a number of users. Noticeable performance regressions for some workloads are a problem, though; nobody wants to field a bunch of "2.6.x turned my response to crap" messages from unhappy users. So expect some iterations on this project yet - and, perhaps, an additional kernel cycle or two before it can be merged.

Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds