User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.23-rc3, released by Linus on August 12. "Either people really are calming down, and figuring out that we're in the stabilization phase, or it's just that it's the middle of August, and most everybody at least in Europe are off on vacation." The changes are mostly limited to fixes; see the long-format changelog for the details.

As of this writing, a few dozen post-rc3 fixes have been merged into the mainline repository.

The current -mm tree is 2.6.23-rc2-mm2. Recent changes to -mm include a new e1000 network driver, a bunch of IDE updates, and support for NUMA nodes with no memory.

The current stable 2.6 kernel is, released on August 15. It contains several fixes, one of which is security-related., containing a rather larger set of fixes, was released on August 9.

For older kernels: Willy Tarreau has announced his intention to put together "a few more" 2.6.20 stable updates. The first of those is due almost any time. was released on August 15. It contains some build fixes and one security patch.

Comments (3 posted)

Kernel development news

Smarter write throttling

By Jonathan Corbet
August 14, 2007
Whenever a process performs a normal, buffered write() to a file, it ends up creating one or more dirty pages in memory. Those pages must eventually be written to disk. Until the data moves to persistent storage, the pages of memory it occupies cannot be used for any other purpose, even if the original writing process, as is often the case, no longer needs them. It is important to prevent dirty pages from filling too much of the system's memory; should the dirty pages take over, the system will find itself under severe memory pressure, and may not even have enough memory to perform the necessary writes and free more pages. Avoiding this situation is not entirely easy, though.

As a general rule, software can create dirty pages more quickly than storage devices can absorb them. So various mechanisms must be put in place to keep the number of dirty pages at a manageable level. One of those mechanisms is a simple form of write throttling. Whenever a process dirties some pages, the kernel checks to see if the total number of dirty pages in the system has gotten too high. If so, the offending process is forced to do some community service by writing pages to disk for a while. Throttling things in this way has two useful effects: dirty pages get written to disk (and thus cleaned), and the process stops making more dirty pages for a little while.

This mechanism is not perfect, however. The process which gets snared by the global dirty pages threshold may not be the one which actually dirtied most of those pages; in this case, the innocent process gets put to work while the real culprit continues making messes. If the bulk of the dirty pages must all be written to a single device, it might not be beneficial to throttle processes working with files on other disks - the result could be that traffic for one disk essentially starves the others which could, otherwise, be performing useful work. Overall, the use of a single global threshold can lead to significant starvation of both processes and devices.

It can get worse than that, even. Consider what happens when block devices are stacked - a simple LVM or MD device built on top of one or more physical drives, for example. A lot of I/O through the LVM level could create large numbers of dirty pages destined for the physical device. Should things hit the dirty thresholds at the LVM level, however, the process could block before the physical drive starts writeback. In the worst case, the end result here is a hard deadlock of the system - and that is not generally the sort of reliability that users expect of their systems.

Peter Zijlstra has been working on a solution in the form of the per-device write throttling patch set. The core idea is quite simple: rather than use a single, global dirty threshold, each backing device gets its own threshold. Whenever pages are dirtied, the number of dirty pages which are destined for the same device is examined, and the process is throttled if its specific device has too many dirty pages outstanding. No single device, then, is allowed to be the destination for too large a proportion of the dirty pages.

Determining what "too large" is can be a bit of a challenge, though. One could just divide the global limit equally among all block devices on the system, but the end result would be far from optimal. Some devices may have a great deal of activity on them at any given time, while others are idle. One device might be a local, high-speed disk, while another is NFS-mounted over a GPRS link. In either case, one can easily argue that the system will perform better if the faster, more heavily-used devices get a larger share of memory than slow, idle devices.

To make things work that way, Peter has created a "floating proportions" library. In an efficient, mostly per-CPU manner, this library can track events by source and answer questions about what percentage of the total is coming from each source. In the writeback throttling patch, this library is used to count the number of page writeback completions coming from each device. So devices which are able to complete writeback more quickly will get a larger portion of the dirty-page quota. Devices which are generally more active will also have a higher threshold.

The patch as described so far still does not solve the problem of one user filling memory with dirty pages to the exclusion of others - especially if users are contending for the bandwidth of a single device. There is another part of the patch, however, which tries to address this issue. A different set of proportion counters is used to track how many pages are being dirtied by each task. When a page is dirtied and the system goes to calculate the dirty threshold for the associated device, that threshold is reduced proportionately to the task's contribution to the pile of dirty pages. So a process which is producing large numbers of dirty pages will be throttled sooner than other processes which are more restrained.

This patch is in its eighth revision, and there has not been a lot of criticism this time around. Linus's response was:

Ok, the patches certainly look pretty enough, and you fixed the only thing I complained about last time (naming), so as far as I'm concerned it's now just a matter of whether it *works* or not. I guess being in -mm will help somewhat, but it would be good to have people with several disks etc actively test this out.

The number of reports so far has been small, but some testers have said that this patch makes their systems work better. It was recently removed from -mm "due to crashiness," though, so there are some nagging issues to be taken care of yet. In the longer term, the chances of it getting in could be said to be fairly good - but, with memory management patches like this, one never knows for sure.

Comments (11 posted)

timerfd() and system call review

By Jonathan Corbet
August 14, 2007
One of the fundamental principles of Linux kernel development is that user-space interfaces are set in stone. Once an API has been made available to user space, it must, for all practical purposes, be supported (without breaking applications) indefinitely. There have been times when this rule has been broken, but, even in the areas known for trouble (sysfs, for example), the number of times that the user-space API has been broken has remained relatively small.

Now consider the timerfd() system call, which was added to the 2.6.22 kernel. The purpose of this call is to allow an application to obtain a file descriptor to use with timer events, eliminating the need to use signals. The system call prototype, as found in 2.6.22, is:

    long timerfd(int fd, int clockid, int flags, struct itimerspec *utimer);

If fd is -1, a new timer file descriptor will be created and returned to the application. Otherwise, a timer will be set using the given clockid for the time specified in utimer. The TFD_TIMER_ABSTIME flag can be set to indicate that an absolute timer expiration is needed; otherwise the specified time is relative to the current time. The flags argument can also be used to request a repeating timer.

There is another aspect to the timerfd() API, though: a read on the timer file descriptor will return an integer value saying how many times the timer has fired since the previous read. If no timer expirations have happened, the read() call will block. In the 2.6.22 kernel, the returned value was 32 bits (on all architectures). It has since been decided that a 64-bit value would have been more appropriate, and a patch making that change has been merged for 2.6.23. The stable update also contained the API change.

That is not the full story, though. Michael Kerrisk, while writing manual pages for the new system call, encountered a couple of other shortcomings with the interface. In particular, it is not possible to ask the system for the amount of time remaining on a timer. Other timer-related system calls allow for this sort of query, either as a separate operation or when changing a timer. Michael thought that the timerfd() system call should work similarly to those which came before.

Michael has now posted a patch fixing up the timerfd() interface. With this patch, the system call would now look like this:

	long timerfd(int fd, int clockid, int flags, struct itimerspec *utimer,
                     struct itimerspec *outmr);

The new outmr pointer must be NULL when the file descriptor is first being created. In any other context, it will be used to return the amount of time remaining at any timerfd() call. So user space can query a timer non-destructively by calling timerfd() with a NULL value for utimer. If both timer pointers are non-NULL, the timer will be set to utimer, with its previous value being returned in outmr.

This is, of course, an entirely incompatible change to an API which has already been exported to user space; any code which is using timerfd() now will break if it is merged. By the rules, such a change should not be merged, but it appears that there is a good chance that the rules will be bent this time around. One can argue that, in a real sense, the API has not yet been made available to user space: there has been no glibc release which supports timerfd(). The number of applications using this system call must be quite low - if, in fact, there are any at all. So a change at this point, especially if it can get into 2.6.23, will improve the interface without actually causing any user-space pain.

Fixing timerfd() might still be possible. But there is no denying that we would be better off if we could eliminate this kind of API problem before it gets into a stable kernel release and possibly has to be supported for many years. Therein lies the real problem: system calls (and other user-space API features) are being added to the kernel at a high rate, but review of these changes tends to lag behind. Given the difficulty of fixing user-space API mistakes, it would seem that the review standards for API additions should be especially high. Causing that to happen will not be easy, though; reviewer attention is a scarce resource throughout the free software community.

An idea which has been raised in the past is to explicitly mark new user-space interfaces as being in a volatile "beta" state. For as long as the API remains in that state, the kernel developers are free to change it. Applications would, during this period, rely in the API at their peril. This idea has been rejected in the past, though; it is seen as a way of avoid proper thought ahead of merging a new API into the kernel. Assuming that view still holds, another way will have to be found.

One part of the solution might well be seen in how the timerfd() problems came to light. Michael has demonstrated something your editor has also encountered a number of times: one of the best ways to find shortcomings in an API is to attempt to document it comprehensively. If the kernel community were to resolve that it would not merge user-space API features in the absence of complete documentation, it might just provide the necessary incentive to get that last review pass done.

This idea seems likely to come up at next month's kernel summit (for which a preliminary agenda has just been posted). How it will be received is anybody's guess; writing documentation appears to be a task so challenging that even kernel hackers fear to try it. This challenge may be worth taking up, though, if the reward is few long-lasting user-space API problems in the future.

Comments (38 posted)

Kernel markers

By Jonathan Corbet
August 15, 2007
LWN's recent look at SystemTap noted that the patch set currently lacks a set of static probe points like that provided with DTrace. There are a few reasons for this difference. For example, the rate of change of the kernel code base would make the maintenance of a large set of probe points difficult, especially given that developers working on many parts of the code might not be particularly aware of - or concerned about - those points. But there is also the simple fact that the Linux kernel has no built-in mechanism for the creation of static probe points in the first place.

There is, naturally, a patch which makes the creation of probe points possible; it is called Linux kernel markers. This patch has been under development for some years. Its path into the mainline has been relatively rough, but there are signs that the worst of the roadblocks have been overcome. So perhaps a quick look at this patch is called for.

With kernel markers, the placement of a probe point is easy:

    #include <linux/marker.h>

    trace_mark(name, format_string, ...);

The name is a unique identifier which is used to access the probe; the documentation recommends a subsystem_event format, describing the subsystem in which the probe is found and the event which is being traced. For example: in a part of the patch which instruments the block subsystem, a probe placed in elv_insert(), which inserts a request into its proper location in the queue, is named blk_request_insert. The format string describes the remaining arguments, each of which will be some variable of interest at the time the trace point is hit.

Code which wants to hook into a trace point must call:

    int marker_probe_register(const char *name, const char *format,
			      marker_probe_func *probe, void *pdata);

Here, name is the name of the trace point, format is the format string describing the expected parameters from the trace point (it must match the format string provided when the trace point was established), probe() is the function to call when the trace point is hit, and pdata is a private data value to pass to probe(). The probe() function will have this prototype:

    void (*probe)(const struct __mark_marker *mdata, void *pdata,
                  const char *format, ...);

The mdata structure includes the name of the trace point, if need be, along with a formatted version of the arguments. The arguments themselves are passed after the format string.

Registration of a marker does not, yet, set up the probe() function to be called. First, the marker must be armed with:

    int marker_arm(const char *name);

Once the marker has been armed, probe() will be called every time execution arrives at the given trace point.

When probe points are no longer of interest, they can be shut down with:

    int marker_disarm(const char *name);
    void marker_probe_unregister(const char *name);

Calls to marker_arm() will nest - if a given marker has been armed three times, then three marker_disarm() calls will be required to turn it off again.

Internally, there are a lot of details to the management of markers. The code at the actual trace point, in the end, looks much like one would expect:

    if (marker_is_armed) {

In reality, it is not quite so simple. Getting marker support into the kernel requires that the runtime impact of kernel markers be as close to zero as possible, especially when the marker is not armed. A common use case for markers is to investigate performance problems on systems running in production, so they have to be present in production kernels without causing performance problems themselves. Adding a test-and-jump operation to a kernel hot path will always be a hard sell; the cache effects of referencing a set of global marker state variables could also be significant.

To get around this problem, the marker code comes with a separate patch called immediate values. In the architecture-independent implementation, an immediate value just looks like any other shared variable. The purpose of immediate values, though, is to provide variables with the assumption that they will be frequently read but infrequently changed, and that the read operations must have the lowest impact possible. So, in an architecture-specific implementation (which only exists for i386 at the moment), changing an immediate value actually patches any code which reads the value. To say that the details of doing this sort of patching safely are ugly would be to understate the point. But Mathieu Desnoyers has dealt with those details, and nobody else need look at the resulting code.

Through the use of immediate values, the code inserted by trace_mark() can query the setting of a trace point without generating a memory reference at all; instead, that setting is stored directly in the inserted code. So there will be no potential for an expensive cache miss at the probe point. The patch also provides an immediate_if() construct which is intended to allow jumps to be patched directly into the code, eliminating the test altogether, but that functionality has not yet been implemented. Even without this feature, immediate values allow the creation of trace points whose runtime impact is very nearly zero, eliminating the most common objection to their existence.

If and when this code is merged, the way will be clear for the creation of a set of well-defined trace points for utilities like SystemTap and LTTng. That, in turn, could make the internal operations of the kernel more visible to system administrators and others who are not necessarily well versed in how the kernel works. This sort of tracing ability has been on many users' wish lists for some time; they might just be, finally, getting close to having that wish fulfilled.

Comments (3 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management


Virtualization and containers

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds