Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.16-rc5; no new -rc releases have been made over the last week. A slow trickle of patches continues to find its way into the mainline git repository as bugs are tracked down and fixed.

The current -mm release is 2.6.16-rc5-mm3. Recent changes to -mm include a patch to allow NFS mounts from a common server to share superblocks, CPU hotplug support for the x86-64 architecture, a continuation of the /proc rework, and some device mapper work.

The current stable 2.6 kernel was released on March 5, following shortly after the previous stable update. The two updates carry a few dozen patches, a number of which address security-related issues.

Kernel development news

Quote of the week

Users of Suspend2 can rest assured that I will not allow the patches to suffer bitrot. I will be continuing to use them myself, and will therefore have the best of incentives to keep them up-to-date.

Now for the downside: I won't, however, be making any sort of concerted effort at getting them merged into the vanilla kernel after my move, and am not inclined to make a big effort beforehand.

-- Nigel Cunningham

Double kfree() errors

Less than 24 hours after Coverity announced the availability of a new set of machine-detected potential kernel bugs, Dave Jones started posting fixes. Judging from these fixes, a number of the problems detected this time around are double-free errors - passing the same pointer to kfree() twice. Freeing memory twice is a sure way to corrupt core kernel data structures, leading to trouble in unpredictable places far from where the real bug is to be found. Avoiding this kind of error would make life easier for everybody involved.

To that end, Dave tossed out a simple idea: have kfree() poison pointers so that a second call can be detected immediately. His first proposal looked like this:

    #define kfree(foo) \
            __kfree(foo); \
            foo = KFREE_POISON;

This code was not meant to be incorporated as-is; for starters, it probably needs a pair of braces. But there were a couple of other problems which popped up. One of them is that, since passing a NULL pointer to kfree() is legal, passing it twice is also legal. But this code would break that case. Whether that would be a problem for real code is unclear. Al Viro pointed out a more serious issue: the pointer passed to kfree() is not always an lvalue which can be assigned to. So simply redefining kfree() in this way would lead to compilation errors.

The end result is that a transparent, in-place replacement for kfree() may be hard to implement. An alternative might be the creation of a safe_kfree() variant, combined with some serious pressure to use that variant. Then, perhaps, double-free errors could be caught when they happen.
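
What such a variant might look like can be sketched in user space, with free() standing in for kfree(). This is only an illustration under stated assumptions: safe_free() and FREE_POISON are hypothetical names, not anything proposed in the discussion. Taking the address of the pointer makes the lvalue requirement explicit, and the NULL case remains legal:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical poison value, chosen for illustration only; any address
 * that can never be a valid allocation would do. */
#define FREE_POISON ((void *)(uintptr_t)0x6b6b6b6bUL)

/*
 * User-space stand-in for a safe_kfree()-style helper.
 * The caller passes the address of its pointer, so it must have an lvalue.
 * Returns 0 on success (including the legal NULL case) and -1 if the
 * pointer was already freed through this helper.
 */
int safe_free(void **pp)
{
    assert(pp != NULL);
    if (*pp == FREE_POISON)
        return -1;          /* second free of the same pointer */
    if (*pp == NULL)
        return 0;           /* freeing NULL stays legal, any number of times */
    free(*pp);
    *pp = FREE_POISON;      /* mark the pointer as already freed */
    return 0;
}
```

A caller holding only a temporary value (the result of a function call, say) has nothing to pass here, which is the same obstacle Al Viro pointed out for the macro; the variant merely makes it visible at the call site.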

Or, instead, one could use the double-free checking already built into the kernel. The slab allocator, which is (among other things) the engine behind kmalloc() and kfree(), has options for poisoning (writing special values to) all memory which it handles. One value (0x5a in every byte) marks uninitialized memory, while another (0x6b) is written into memory when it is freed. The resulting patterns jump out nicely in oops listings, often making the cause of the problem immediately obvious. But the use-after-free value can also enable the detection of double-free errors - assuming that the memory is not reallocated between kfree() calls.
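
The idea can be illustrated with a toy user-space pool. This is a sketch only, and not how the slab allocator implements its checks; toy_alloc(), toy_free(), and the pool dimensions are made up for the example. Only the two poison bytes come from the description above:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_UNINIT 0x5a   /* written when an object is handed out */
#define POISON_FREE   0x6b   /* written when an object is freed */

#define OBJ_SIZE 32
#define NUM_OBJS 4

/* Toy fixed-size pool. Memory is never returned to the system, so
 * inspecting a freed object is safe in this example. */
static unsigned char pool[NUM_OBJS][OBJ_SIZE];
static int in_use[NUM_OBJS];

/* Hand out the first unused object, filled with the "uninitialized" poison. */
void *toy_alloc(void)
{
    for (int i = 0; i < NUM_OBJS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            memset(pool[i], POISON_UNINIT, OBJ_SIZE);
            return pool[i];
        }
    }
    return NULL;
}

/* Return 1 if every byte of the object equals v. */
static int all_bytes_are(const unsigned char *p, unsigned char v)
{
    for (size_t i = 0; i < OBJ_SIZE; i++)
        if (p[i] != v)
            return 0;
    return 1;
}

/* Free an object; returns -1 if it still carries the "freed" poison. */
int toy_free(void *obj)
{
    if (obj == NULL)
        return 0;
    if (all_bytes_are(obj, POISON_FREE))
        return -1;                       /* looks like a double free */
    for (int i = 0; i < NUM_OBJS; i++)
        if (obj == (void *)pool[i])
            in_use[i] = 0;
    memset(obj, POISON_FREE, OBJ_SIZE);
    return 0;
}
```

Freeing the same object twice in a row is caught, since the second call finds the 0x6b pattern intact. If the object is handed out again in between, the pattern is overwritten with 0x5a and the stale free goes unnoticed, which is the limitation noted above.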

The problem, it seems, is that not a whole lot of developers are running with slab poisoning enabled. As a result, they are working without a valuable debugging tool and allowing certain kinds of bugs to persist in the code base. So a part of the solution to the problem may well be a stronger effort to get developers to turn the slab poisoning option on. Beyond that, any sort of checking added to kfree() (or a variant) should be harder to disable than the existing debugging options.

RCU and open file accounting

David Miller has been making great progress in his port of the Linux kernel to Sun's new "Niagara" (SPARC) CPU architecture. He has run into one little problem, however:

I just wanted to report that I am hitting the "VFS: file-max limit xxx reached" problem quite easily on my 32-cpu Niagara machine with 16GB of ram with current 2.6.x GIT. It seems far too easy to get a box into this state due to SLAB fragmentation and RCU. And once you get a machine into this state it is totally unusable.

Our test case is usually a "make -j8192" kernel build along with a parallel bootstrap of gcc. That puts about 256 processes on each cpu's runqueue, I doubt ksoftirqd can run much at all.

The file limit problem was last discussed here in October, when it delayed the release of the 2.6.14 kernel. A fix merged at that time made the problem harder to trigger, but, as David's experience shows, the problem has not been solved altogether. One might argue that a relatively small number of users run the sort of workload that David is playing with. But the point remains: with current kernels, including the upcoming 2.6.16 release, it is possible for a suitably-written program to run the open file count to its maximum, thus denying any sort of service to other users. This seems like a problem which one might want to fix.

One piece of the puzzle here is the way that the open file count is managed. Currently, that count is decremented in the slab destructor set up for file structures. This method works, but it can cause the decrement to be delayed by an arbitrary amount of time, with the result that the open file count overstates the number of files which are actually held open by processes in the system. Moving that operation out of the slab destructor can help to keep the count more in sync with reality.

The core of the problem, however, is the use of the read-copy-update (RCU) mechanism for management of file structures. When a file is closed, the task of freeing the structure is queued in RCU. Using RCU lets the kernel ensure that the structure is not freed while references to it remain, but without the sort of locking overhead that comes with other techniques. As a result, performance is measurably improved on SMP systems.

When there is a lot of opening and closing of files going on (such as, say, when a wild-eyed developer starts an 8192-process kernel build), the length of the RCU callback queue can get quite long. By the time that the RCU code decides that the system has quiesced and it is safe to invoke the RCU callbacks, the queue might have thousands of entries. Working through the entire callback queue led to latency problems elsewhere in the system, so 2.6.14 included a patch which put an upper limit on the number of callbacks which would be processed in any single iteration.

The limit helped with the latency problem. But, if the generation of RCU callbacks continues at a high rate, the length of the callback queue can only grow. Every entry in the queue represents memory which could be returned to the system, but which has not yet been made available. So, as the queue grows, memory gets fragmented and the system heads towards the dreaded out-of-memory state.
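
The arithmetic behind that growth is simple enough to model. The following toy function is an assumption-laden sketch, not kernel code: it pretends that a fixed number of callbacks arrive per "tick" and that one batch is processed per tick, and the name queue_len_after() is invented for the example.

```c
#include <assert.h>

/*
 * Toy model of a callback queue with a per-pass batch limit.
 * Each tick, `arrivals` callbacks are queued and at most `batch_limit`
 * of them are processed. Returns the queue length after `ticks` ticks.
 */
long queue_len_after(long arrivals, long batch_limit, long ticks)
{
    long queued = 0;

    assert(arrivals >= 0 && batch_limit >= 0 && ticks >= 0);
    for (long t = 0; t < ticks; t++) {
        queued += arrivals;
        /* Process what is there, but never more than the batch limit. */
        long done = queued < batch_limit ? queued : batch_limit;
        queued -= done;
    }
    return queued;
}
```

With a batch limit of 10, five arrivals per tick leave the queue empty, while 50 arrivals per tick leave a backlog of 40 entries per tick: 40,000 deferred frees after a thousand ticks. The numbers are arbitrary; the point is that the backlog is unbounded whenever the arrival rate exceeds the limit.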

An attempt at a solution can be found in this patch by Dipankar Sarma, which has been sitting in the -mm tree for a while. Dipankar's patch puts a configurable upper limit on the number of RCU callbacks which will be processed in any single batch; that allows system administrators to tune the batch size to their particular needs. On a server which is dealing with a large number of file requests, and on which latency is not a crucial issue, the batch size can be set to a large number.

The patch also adds a high-water limit. If the length of the RCU callback queue ever exceeds that limit, the RCU code will (1) set the batch limit to infinity (or the integer representation thereof) and (2) send out an inter-processor interrupt forcing every CPU on the system to schedule. The combination of these actions will cause the system to work through the entire RCU queue at the soonest possible time. Once the queue length goes below a low-water limit, the old batch limit will be restored.
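
The watermark logic can be sketched as a small user-space state machine. The structure and function names below are hypothetical and the values used with them are arbitrary; this is a model of the behavior described here, not the code from the patch. The inter-processor interrupt is represented by nothing more than a counter:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

struct cb_queue {
    long len;           /* callbacks currently waiting */
    long batch_limit;   /* current per-pass processing limit */
    long normal_limit;  /* limit restored once the backlog has drained */
    long high_water;    /* above this, remove the batch limit */
    long low_water;     /* below this, restore the normal limit */
    int  kicks;         /* times the "force every CPU to schedule" step ran */
};

/* Queue n new callbacks, entering emergency mode past the high-water mark. */
void cbq_enqueue(struct cb_queue *q, long n)
{
    assert(q != NULL);
    q->len += n;
    if (q->len > q->high_water && q->batch_limit != LONG_MAX) {
        q->batch_limit = LONG_MAX;  /* "infinity", in integer form */
        q->kicks++;                 /* stands in for the IPI */
    }
}

/* Run one processing pass; returns the number of callbacks handled. */
long cbq_process(struct cb_queue *q)
{
    long done;

    assert(q != NULL);
    done = q->len < q->batch_limit ? q->len : q->batch_limit;
    q->len -= done;
    if (q->len < q->low_water)
        q->batch_limit = q->normal_limit;   /* back to normal operation */
    return done;
}
```

Under normal load each pass handles at most normal_limit callbacks. Once the queue crosses the high-water mark, a single pass is allowed to drain everything, after which the ordinary limit comes back into force.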

It is, in other words, a somewhat unsubtle approach; the system is given a kick in the rear and told to go clean up its mess. But, it seems, that is exactly what the system needs at such a time. The cleanup task can only be deferred for so long; the work eventually needs to be done regardless. David has reported that the patches fix the problem on his Niagara system, and suggests that they should be merged into 2.6.16. It is a fairly significant patch to merge at this late point in the cycle, but there seems to be a reasonably high level of confidence in its stability. So, chances are that it will be included as a preferable alternative to shipping 2.6.16 with a known problem.

Some upcoming sysfs enhancements

A glance at Greg Kroah-Hartman's state of the driver core and sysfs message shows that a number of changes are queued up for future kernel cycles. A couple of those add new features to sysfs, and seem worth a mention.

Attribute files in sysfs serve as a channel for sharing information between the kernel and user space. As more of the information interface moves to sysfs, an increasing number of user-space programs will be making use of sysfs attributes. Often, these programs will want to respond when the value of a sysfs attribute changes. In current kernels, however, there is no easy way for an application to know when an attribute has changed; the only option is to repeatedly re-read the file and check for new values.

The current -mm kernels include a patch by Neil Brown which makes it possible to create pollable attributes. With such attributes, user space need only open the attribute of interest and pass it to poll() with the POLLERR and POLLPRI events selected. When poll() returns, the file can be reopened and reread to obtain the new value.
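
From user space, the sequence might look something like the following. It is a minimal sketch using only standard POSIX calls; the helper name wait_for_attr_change() is invented here, and the exact semantics (whether the attribute must be read before polling, for example) may differ in whatever form the patch is finally merged.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for the attribute at `path` to signal a change,
 * then reopen it and read the new value into buf (NUL-terminated).
 * Returns the number of bytes read, 0 on timeout, or -1 on error.
 */
ssize_t wait_for_attr_change(const char *path, int timeout_ms,
                             char *buf, size_t len)
{
    struct pollfd pfd;
    ssize_t n;
    int ready;
    int fd;

    assert(path != NULL && buf != NULL);
    if (len == 0)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Select the two events described for pollable attributes. */
    pfd.fd = fd;
    pfd.events = POLLERR | POLLPRI;
    pfd.revents = 0;
    ready = poll(&pfd, 1, timeout_ms);
    close(fd);
    if (ready <= 0)
        return ready;       /* 0 means timeout, -1 means poll() failed */

    /* Reopen and reread to pick up the new value. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len - 1);
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

Passing a negative timeout makes poll() block indefinitely, which is what would happen, usefully or otherwise, on an attribute that never calls for a wakeup.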

Internally, the patch adds a wait queue head to every kobject on the system; that queue is inserted into a poll table in response to a poll() call. The sysfs code has no way of knowing, however, when the value of any given sysfs attribute has changed, so the subsystem implementing a pollable attribute must make explicit calls to:

    void sysfs_notify(struct kobject *kobj, char *dir, char *attr);

Here, kobj and attr describe the attribute whose value has been changed. The dir argument need only be supplied when the given kobject has a special subdirectory (and the attribute is in that directory). This call will cause any polling process to wake up and see that a new value is available.
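
The calling pattern on the kernel side is "update the value, then notify." The sketch below shows that pattern as a user-space mock so that it can be compiled anywhere: struct kobject and sysfs_notify() here are stand-ins which merely record the call, and demo_dev, demo_set_state(), and the "state" attribute are hypothetical. Only the sysfs_notify() prototype is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in so the sketch builds outside the kernel; not the real struct. */
struct kobject { const char *name; };

static int notify_calls;        /* how many notifications were sent */
static char last_attr[32];      /* name of the last attribute notified */

/* Mock with the prototype quoted above. The real function wakes up any
 * process polling the attribute; this one only records the call. */
void sysfs_notify(struct kobject *kobj, char *dir, char *attr)
{
    (void)kobj;
    (void)dir;
    notify_calls++;
    strncpy(last_attr, attr, sizeof(last_attr) - 1);
    last_attr[sizeof(last_attr) - 1] = '\0';
}

/* Hypothetical device exporting one pollable attribute named "state". */
struct demo_dev {
    struct kobject kobj;
    int state;
};

/* Change the exported value and tell pollers about it. */
void demo_set_state(struct demo_dev *dev, int new_state)
{
    static char attr_name[] = "state";

    assert(dev != NULL);
    if (dev->state == new_state)
        return;                 /* nothing changed, nobody to wake */
    dev->state = new_state;
    /* The attribute lives directly under the kobject, so dir is NULL. */
    sysfs_notify(&dev->kobj, NULL, attr_name);
}
```

The important detail is the ordering: the new value is stored before the notification goes out, so a process woken by poll() will find it when it rereads the file.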

With the current code, there is no way to mark attributes which can be polled. Any process which calls poll() on an attribute which does not support polling will end up waiting rather longer than the developer intended.

While sysfs attributes are normally low-bandwidth items - holding generally a single value - the relayfs subsystem (added in 2.6.14) is meant to be a high-bandwidth pipe from the kernel to user space. Relayfs is often used for debugging tasks, such as relaying large amounts of kernel trace data for later analysis. User space gets at that data stream by opening a channel file created in the special-purpose relayfs filesystem.

As it turns out, relayfs contains a fairly nice internal abstraction for its file operations, making it possible to create entries for relay channels in other filesystems. Paul Mundt recently put together a patch taking advantage of this feature to allow kernel code to create relayfs channels in sysfs. The reaction to this capability was positive; indeed, it was seen as a better interface to the relay code than relayfs itself. So Paul's patches have grown into a full reworking of the relay interface, with the separate relayfs filesystem going away entirely.

Most of the interfaces remain unchanged; in particular, almost the entire kernel API (as described in the documentation file) remains as it was before. But now there is a pair of new functions:

    int sysfs_create_relay_file(struct kobject *kobj, 
                                struct relay_attribute *attr);
    void sysfs_remove_relay_file(struct kobject *kobj, 
                                 struct relay_attribute *attr);

A simple call to sysfs_create_relay_file() will add a relay channel attribute to the given kobject. The relay_attribute structure must be filled in with information about the actual channel. On the user-space side, the only change is that the application must look in a different place to find the relay channel. All of the supported operations (mmap() in particular) work as before.

Barring last-minute objections, both of these patches seem likely to be merged for 2.6.17.

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds