Kernel development

Release status

Kernel release status

The current 2.6 prepatch remains 2.6.14-rc4. The final 2.6.14 kernel was supposed to be out by now, but, as of this writing, it has not been released. Once the swiotlb problem (see below) has been worked out, 2.6.14 should follow shortly.

The current -mm tree is 2.6.14-rc4-mm1. Recent changes to -mm include a fair number of VM scalability patches, the nested class devices patch set (see below), a big x86-64 update, the removal of the PageReserved() flag, the swap prefetching patches, some kernel keyring enhancements, the error detection and correction patch set, a RAID update, and lots of fixes.

Kernel development news

Quote of the week

I'm with Roman on this one - the old "show me the code" trick which people use to quash other people's objections is rather poor form - we should simply address the objections as raised.
-- Andrew Morton

Some new VM documentation

For those wanting to know more about how the 2.6 virtual memory subsystem works: Rik van Riel has put together a detailed article on how page faults are handled on the i386 architecture. This document is apparently the first of many, all of which should show up on the Linux MM Internals page.

What's holding up 2.6.14: two difficult bugs

Linus was set on releasing the 2.6.14 kernel on October 17, when a little issue came up. Serge Belyshev discovered that it is easy to cause the system to stop opening files for user-space applications. He posted a program which, in essence, does the following:

    while (1) {
        int fd = open("/dev/null", O_RDONLY);
        close(fd);
    }

After some 50,000 iterations, the open fails with a "too many open files in system" message. This behavior can be problematic in more realistic situations; it evidently can cause highly-parallel kernel builds to fail, and it also exposes the system to local denial of service attacks. So it is worth tracking down.
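
For readers who want to try this on their own systems, a bounded variant of the loop might look like the sketch below. The helper name count_failed_opens() and its parameters are illustrative choices, not taken from Serge's program, which simply loops forever. On a kernel without the bug, every open of /dev/null should succeed.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open and immediately close `path` `iterations` times, returning the
 * number of open() calls that failed. (Hypothetical helper; the posted
 * test program had no iteration limit.)
 */
long count_failed_opens(const char *path, long iterations)
{
    long failures = 0;

    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            /* On an affected kernel this would eventually be ENFILE. */
            failures++;
            continue;
        }
        close(fd);
    }
    return failures;
}
```

A nonzero return from something like count_failed_opens("/dev/null", 100000) would suggest that closed files are still being counted against the system-wide limit.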

The kernel places a limit on the number of files which are allowed to be open simultaneously. That limit is not normally expected to include files which have been closed, however. The problem, as it turns out, is a virtual filesystem scalability patch which was merged in September. That patch eliminates some locking around file structures in the kernel, and, to that end, defers certain tasks (such as file cleanup) to the read-copy-update mechanism. For this particular case, file structures corresponding to closed files are building up in the RCU callback list, and RCU is not getting around to freeing them in time.

Initially, it was thought that the culprit was another patch which put a limit on the processing of the RCU callback lists. Those lists can get quite long, and lengthy callback processing was causing latency problems elsewhere in the kernel. So a "batch size" of ten was imposed; after ten callbacks have been processed, the RCU subsystem defers the rest until later. It seemed that this limit was causing the freeing of file structures to languish. Raising the batch limit to 10,000 seemed to improve the situation, so Linus merged a patch to that effect.
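
The effect of a batch limit can be seen in a toy model; what follows is a user-space sketch, not kernel code, and the structure and function names are invented for illustration. Each pass queues some callbacks and then runs at most one batch, so a small limit lets the backlog grow whenever callbacks arrive faster than the limit allows them to be run.

```c
#include <stddef.h>

/* Toy callback queue: only the number of pending callbacks is tracked. */
struct cb_queue {
    size_t pending;
};

/*
 * Run at most `batch_limit` callbacks in one pass and return how many ran.
 * Anything left over waits for a later pass.
 */
size_t process_batch(struct cb_queue *q, size_t batch_limit)
{
    size_t n = q->pending < batch_limit ? q->pending : batch_limit;

    q->pending -= n;
    return n;
}

/*
 * Simulate `passes` rounds in which `queued_per_pass` callbacks arrive and
 * a single batch is then processed. Returns the backlog left at the end.
 */
size_t backlog_after(size_t passes, size_t queued_per_pass, size_t batch_limit)
{
    struct cb_queue q = { 0 };

    for (size_t i = 0; i < passes; i++) {
        q.pending += queued_per_pass;
        process_batch(&q, batch_limit);
    }
    return q.pending;
}
```

With 50 callbacks arriving per pass, a limit of ten leaves a backlog which grows by 40 entries each pass, while a limit of 10,000 drains the queue every time. This model assumes that every queued callback is already eligible to run, which is not true of real RCU.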

But, in fact, the higher batch limit did not really solve the problem. RCU callbacks cannot be called immediately after being queued. They must, instead, wait until every processor on the system has scheduled at least once. This "quiescence" requirement is the kernel's way of ensuring that no references to the freed structure remain; it's a key part of how RCU works. If a process chews through file structures quickly enough, they will accumulate while the kernel waits for the grace period to run out, and no changes to the batch limits will help. The only way to be able to process those callbacks - and free the associated structures - is to force every processor to schedule.

A couple of patches have been posted in an attempt to deal with this problem. One of them simply changes the way file structures are accounted for - they are removed from the count of open files when the RCU callback is queued, rather than when it is executed. This patch stops programs from running into the maximum open file limit, but does nothing to stop the growth of the RCU callback queues. So the patch which got merged, instead, is one from Eric Dumazet which keeps track of the length of the callback list. Should the list get to be too long (where "too long" is wired at 10,000 entries), a reschedule is forced so that the callbacks can be processed. This patch appears to have dealt with the problem well enough to allow 2.6.14 to come out, though more refinement may be required afterward.
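
The interaction between deferred freeing and the open file limit can be modeled in a few lines of user-space C. This is only a sketch of the idea, not the merged patch: all names are invented, the limit of 50,000 files is an illustrative value, and a forced reschedule is assumed to end the grace period at once, which is a considerable simplification of real RCU.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES   50000   /* system-wide open file limit (illustrative) */
#define QUEUE_LIMIT 10000   /* "too long" threshold for the callback list */

struct sim {
    size_t nr_files;        /* file structures counted against the limit */
    size_t rcu_queued;      /* closed files still awaiting a grace period */
    bool   force_resched;   /* apply the queue-length check? */
};

/* A grace period has elapsed: every queued callback may now run. */
static void grace_period_end(struct sim *s)
{
    s->nr_files -= s->rcu_queued;
    s->rcu_queued = 0;
}

/* Model open() followed by close(); returns false if the open would fail. */
static bool open_close(struct sim *s)
{
    if (s->nr_files >= MAX_FILES)
        return false;
    s->nr_files++;          /* open: a file structure is allocated */
    s->rcu_queued++;        /* close: the free is deferred to RCU */
    if (s->force_resched && s->rcu_queued > QUEUE_LIMIT)
        grace_period_end(s);    /* forced reschedule lets callbacks run */
    return true;
}

/*
 * Count how many of `attempts` open/close pairs succeed when no grace
 * period ever ends on its own.
 */
size_t successful_opens(size_t attempts, bool force_resched)
{
    struct sim s = { 0, 0, force_resched };
    size_t ok = 0;

    for (size_t i = 0; i < attempts; i++)
        if (open_close(&s))
            ok++;
    return ok;
}
```

In this model, successful_opens() stalls at the file limit when the check is disabled, and never reaches it when the queue length is capped.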

Unfortunately for those who are waiting for 2.6.14, another problem turned up. Some 64-bit architectures which lack I/O memory management units must be very careful in setting up DMA areas. A number of devices can only reliably deal with 32-bit DMA addresses, so DMA areas must be allocated in the lower part of memory. To that end, the x86-64 and ia64 architectures use a mechanism called the "software I/O translation buffer", or swiotlb. It is simply a large chunk of low memory, allocated at boot time, which is used as a bounce buffer for DMA operations involving 64-bit-challenged devices.

It was noted that the 2.6.14-rc4 kernel can allocate the swiotlb area in high memory, which defeats the entire purpose. This revelation led to a long discussion of how swiotlb memory should be allocated. It turns out that there is no easy way of finding the low memory on the system. Once upon a time, that memory would belong to CPU 0, but on some contemporary NUMA systems, the low memory might be elsewhere. So the real solution appears to be to iterate through all CPUs on the system, try to allocate from each of them, and test to see if the resulting memory is within the DMAable range. If not, the memory is freed and the next processor is tried. A couple of patches implementing this approach are circulating; none has been merged as of this writing.
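
The try-each-processor approach might be sketched as follows. This is a user-space illustration under stated assumptions rather than any of the circulating patches: the structure and function names are hypothetical, each entry stands for the memory local to one processor (or NUMA node), and "DMAable" is taken to mean that the whole buffer sits below the 4GB boundary.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA32_LIMIT 0x100000000ULL  /* first address beyond 32-bit reach */

/* Hypothetical description of the memory local to one processor. */
struct local_mem {
    uint64_t base;  /* physical address an allocation here would return */
};

/* True if the whole buffer [addr, addr + size) lies below 4GB. */
int dma32_reachable(uint64_t addr, uint64_t size)
{
    return size <= DMA32_LIMIT && addr <= DMA32_LIMIT - size;
}

/*
 * Try each processor's memory in turn and keep the first allocation which
 * is DMAable. Returns the index of that entry, or -1 if none is suitable.
 */
int pick_swiotlb_source(const struct local_mem *mem, size_t count,
                        uint64_t size)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t addr = mem[i].base;    /* stand-in for a real allocation */

        if (dma32_reachable(addr, size))
            return (int)i;
        /* A real implementation would free the unsuitable memory here. */
    }
    return -1;
}
```

The range check subtracts the size from the limit rather than adding it to the address, so that the comparison cannot overflow for buffers near the top of the address space.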

Nested class devices and the future of the device model

Two weeks ago, this page looked at nested classes in sysfs as a way of representing the input subsystem device hierarchy to user space. This week, Greg Kroah-Hartman posted a set of patches with the latest version of class_device nesting; the selling feature this time around was that the patches "actually work." With this patch set, it is possible to create a hierarchy under /sys/class which represents the known input devices on the system and their relationship to the actual system hardware. Greg also notes that this patch set makes possible the long-anticipated move of /sys/block into the class hierarchy.

So all would seem to be well in sysfs land. But Greg finished his announcement with the following:

Oh, one final thing. I really don't think that input should be a class. It looks like a "bus" and acts like a "bus" (you have different devices that have different drivers bind to them, and you want to load those drivers with the hotplug mechanism.)

This note opened the floodgates to a wider discussion; it seems that a number of people are not entirely happy with the /sys/class hierarchy. Udev hacker Kay Sievers complained:

The nesting classes implement a fraction of a device hierarchy in /sys/class. It moves arbitrary relation information into the class directory, where nothing else than device classification belongs. What is the rationale behind sticking device trees into class?

What seems to have happened here is that a number of devices, mostly of the virtual variety, have found their home in the class hierarchy rather than with the other devices. As a result, the class tree has grown more complicated, and it has moved away from its original purpose, which was to be a way of grouping devices which share the same interface and function. So Kay (among others) has proposed that much of what is currently in the class tree be moved over to /sys/devices with the rest of the device information. The idea is that user space does not really care about the distinction between "real" and "virtual" devices, and the kernel interface should not either.

Greg, who holds a big vote on device model issues, has responded thusly:

Ok, I've spent a while thinking about this proposal and originally I thought it was the same thing we had heard years ago. But I was wrong, moving the class stuff into the device tree is the right thing to do, as long as we keep them as new "things" in the tree...

So it would seem that big changes are in store for the Linux device model. This code has grown and evolved considerably since its introduction in 2.5; it may be time for a big rework. Actually changing things without causing major pain for users could be a bit of a challenge, however. It will have to be approached carefully.

The plan under consideration for now is to simply try to solve the input subsystem problem for 2.6.15. That most likely involves the nested class_device patches, perhaps with some changes to avoid breaking things in user space (and udev in particular). Things look more ambitious in the longer term:

Then, we move the class stuff into real devices. There was always a lot of duplication with the class and device code, and this shows that there is a commonality there. At the same time, I'll work on making the attribute stuff easier and possibly merge the kobject and device structures together a bit (possibly I said, I don't know quite how much yet...)

The end result is that there is likely to be some significant churn in the device model code in the coming months. There will almost certainly be consequences for the driver API, and for user space as well. If it all works out, however, we should end up with a device model which is easier to understand and work with in both kernel and user space.

On the merging of ktimers

LWN looked at the ktimers patch about one month ago. Work continues on the new kernel timer mechanism; the latest version of the patch includes a new "clockevents" abstraction intended to make high-resolution timer support easier to implement in an architecture-independent way. The patch appears to be coming together well, and there has been little in the way of criticism.

...with the exception of one observer, who has kept up a steady stream of complaints about the new mechanism. His objections include the name (he would rather see "process timers" than "ktimers"), the use of high-resolution time within the kernel, and various "unnecessary complexities." The discussion has been mostly unfruitful, to the point that the normally even-keeled Ingo Molnar tried to end it with a "shut up and show me the code" challenge. That led Andrew Morton to state that "show me the code" is no longer an acceptable arguing point for kernel discussions, and that the objections should be addressed regardless.

Getting a handle on the objections has proved hard; it is not clear that the person in question (Roman Zippel) truly understands the patches. One bit of the discussion is worth a look, however. It has been repeatedly pointed out that the existing kernel timer mechanism is optimized for timeouts which rarely actually expire, while ktimers are expected to expire. Roman claimed:

Whether the timer event is delivered or not is completely unimportant, as at some point the event has to be removed anyway, so that optimizing a timer for (non)delivery is complete nonsense.

This claim led to a required-reading response from Ingo on the history of the kernel timer mechanism and why optimizing for delivery (or the lack thereof) is not nonsense. That particular branch of the discussion, at least, should not need to go much further.

Andrew Morton has, in the past, stated that he would be highly reluctant to merge new code over the objections of a developer. The need to address all objections can be highly frustrating to kernel hackers, especially when new complaints seem to keep turning up as the old ones are resolved. The result of this process, when it works well, can be a stronger kernel. But it can also be the delaying of useful code which few people have problems with. It is starting to look like that may be the outcome in the ktimers case; the code will almost certainly be merged in the end, perhaps with almost no changes resulting from the current discussion.

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds