Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.24-rc5, released by Linus on December 10. He says:

Things _have_ slowed down, although I'd obviously be lying if I said we've got all the regressions handled and under control. They are being worked on, and the list is shrinking, but at a guess, we're definitely not going to have a final 2.6.24 out before xmas unless santa puts some more elves to work on those regressions.

The list of fixes is still fairly long; there is also a significant FireWire stack update. The short-form changelog is included in Linus's announcement; see the long-format changelog for all the details.

A handful of patches have found their way into the mainline git repository since the -rc5 release.

Comments (none posted)

Kernel development news

Quotes of the week

while i dont want to jump to conclusions without looking at some profiles, i think the SLUB performance regression is indicative of the following fallacy: "SLAB can be done significantly simpler while keeping the same performance".

I couldnt point to any particular aspect of SLAB that i could characterise as "needless bloat".

-- Ingo Molnar

I suppose if the NSA had 20,000 2Ghz processors in the basement cranking for 10 years, then 50% of the time *after* they did a black bag job to crack the random pool state, they could get the last 80 bits generated from /dev/random, but it just seems that if you are assuming the power to grab the pool plus add_ptr, there would be much more useful things you could --- like for example having the black bag job trojaning the software to grab the private key directly.

-- Ted Ts'o

Nothing is beyond my skills. My mad k0der skillz are unbeatable.

-- Linus Torvalds

Comments (12 posted)

Simpler syslets

By Jonathan Corbet
December 10, 2007

Syslets are a proposed mechanism which would allow any system call to be invoked in an asynchronous manner; this technique promises a more comprehensive and simpler asynchronous I/O mechanism and much more - once all of the pesky little details can be worked out. A while back, Zach Brown let it be known that he had taken over the ongoing development of the syslets patch set; things have been relatively quiet since then. But Zach has just returned with a new syslets patch which shows where this idea is going.

This version of the patch removes much of the functionality seen in previous postings. The ability to load simple programs into the kernel for asynchronous execution is now gone, as is the "threadlet" mechanism for asynchronous execution of user-space functions. Instead, syslets have gone back to their roots: a mechanism for running a single system call without blocking.

As had been foreshadowed in other discussions, syslets now use the indirect() system call mechanism. An application wanting to perform an asynchronous system call fills in a syslet_args structure describing how the asynchronous execution is to be handled; the application then calls indirect() to make it happen. If the system call can run without blocking, indirect() simply returns with the final status. If blocking is required, the kernel will (as with previous versions of this patch) return to user space in a separate process while the original process waits for things to complete. Upon completion, the final status is stored in user-space memory and the application is notified in an interesting way.

The syslet_args structure looks like this:

    struct syslet_args {
	u64 completion_ring_ptr;
	u64 caller_data;
	struct syslet_frame frame;
    };

The completion_ring_pointer field contains a pointer to a circular buffer stored in user space. The head of the buffer is defined this way:

    struct syslet_ring {
	u32 kernel_head;
	u32 user_tail;
	u32 elements;
	u32 wait_group;
	struct syslet_completion comp[0];
    };

Here, kernel_head is the index of the next completion ring entry to be filled in by the kernel, and user_tail is the next entry to be consumed by the application. If the two are equal, the ring is empty. The elements field says how many entries can be stored in the ring; it must be a power of two. The kernel uses wait_group as a way of locating a wait queue internally when the application waits on syslet completion; your editor suspects that this part of the API may not survive into the final version.

Finally, the completion status values themselves live in the array of syslet_completion structures, which look like this:

    struct syslet_completion {
	u64 status;
	u64 caller_data;
    };

When a syslet completes, the final return code is stored in status, while the caller_data field is set with the value provided in the field by the same name in the syslet_args structure when the syslet was first started.

There is one field of syslet_args which has not been discussed yet: frame. The definition of this structure is architecture-dependent; for the x86 architecture it is:

    struct syslet_frame {
	u64 ip;
	u64 sp;
    };

These values are used when the syslet completes. After the kernel stores the completion status in the ring buffer, it will call the function whose address is stored in ip, using the stack pointer found in sp. This call serves as a sort of instant, asynchronous notification to the application that the syslet is done. It's worth noting that this call is performed in the original process - the one in which the syslet was executed - rather than in the new process used to return to user space when the syslet blocked. This function also has nothing to return to, so, after doing its job, it should simply exit.

So, to review, here is how a user-space application will use syslets to call a system call asynchronously:

The completion ring is established and initialized in user space.
A stack is allocated for the notification function, and the syslet_args structure is filled in with the relevant information.
A call is made to indirect() to get the syslet going.
If the system call of interest is able to complete without blocking, the return value is passed directly back to user space from indirect() and the call is complete.
Otherwise, once the system call blocks, execution switches to a new process which returns to user space. An ESYSLETPENDING error is returned in this case.
Once the system call completes, the kernel stores the return value in the completion ring and calls the notification function in the original process.

Should the application wish to stop and wait for any outstanding syslets to complete, it can make use of a new system call:

    int syslet_ring_wait(struct syslet_ring *ring, unsigned long user_idx);

Here, ring is the pointer to the completion ring, and user_idx is the value of the user_tail index as seen by the process. Providing the tail as an argument to syslet_ring_wait() prevents problems with race conditions which might come about if a syslet completes after the application has decided to wait. This call will return once there is at least one completion in the ring.

The real purpose of this set of patches is to try to nail down the user-space API for syslets; it is clear that there is still some work to be done. For example, there is no way, currently, for an application to use indirect() to simultaneously launch a syslet and (as was the original purpose for indirect()) provide additional arguments to the target system call. In fact, the means for determining which of the two is being done looks dangerously brittle. As Zach has already noted, the calling convention needs to be changed to make the syslet functionality and the addition of arguments orthogonal.

There are a number of other questions which need to be answered - Zach has supplied a few of them with the patch. Interaction with ptrace() is unclear, resource management issues abound, and so on. Zach is clearly looking for feedback on these issues:

I'm particularly interested in hearing from people who are trying to use syslets in their applications. This will involve awkward wrappers instead of glibc calls for now, and your machine may explode, but hopefully the chance to influence the design of syslets would make it worth the effort.

So, the message is clear: anybody who is interested in how this interface will look would be well advised to pay attention to it now.

Comments (10 posted)

Writeout throttling

By Jonathan Corbet
December 11, 2007

The avoidance of writeout deadlocks is a topic which occasionally pops up on the mailing lists. Most Linux systems are able to handle the writeout of dirty pages to disk without a great deal of trouble. Every now and then, however, the system can get itself into a state where it is is out of memory and it must write some pages to disk before any more memory can be allocated. If the act of writing those pages, itself, requires memory allocations, the system can deadlock. Systems with complicated block I/O setups - those using the device mapper, network-based storage, user-space filesystems, etc. - are the most susceptible to this problem.

There has been a steady stream of patches aimed at solving this problem; the write throttling patch discussed here last August is one of them. The problem is inherently hard to solve, though; it looks like it may be with us for a long time. Or maybe not, if Daniel Phillips's new and rather aggressively promoted writeout throttling patch lives up to its hype.

Daniel's patch is quite simple at its core. His approach for eliminating writeout-related deadlocks comes down to this:

Establish a memory reserve from which (only) code performing writeout can allocate pages. In fact, this reserve already exists, in that some memory is reserved for the use of processes marked with the PF_MEMALLOC flag.
Place an upper limit on the amount of memory which can be used for writeout to each device at any given time.

The patch does not try to directly track the amount of memory which will be used by each writeout request; instead, it tasks block-level drivers with accounting for the number of "units" which will be used. To that end, it adds an atomic_t variable (called available) and a function pointer (metric()) to each request queue. When an outgoing request finds its way to __generic_make_request(), it is passed to metric() to get an estimate of the amount of resource which will be required to handle that request. If the estimated resource requirement exceeds the value of available, the process will simply block until a request completes and available is incremented to a sufficiently high level.

The metric() function is to be supplied by the highest-level block driver responsible for the request queue. If that block driver is, itself, responsible for getting the data to the physical media, estimating the resource requirements will be relatively easy. The deadlock problems, however, tend to come up when I/O requests have to go through multiple layers of drivers; imagine a RAID array built on top of network-based storage devices, for example. In that case the top level will have to get resource requirement estimates from the lower levels, a problem which has not been addressed in this patch set.

Andrew Morton suggested an alternative approach wherein the actual memory use by each block device would be tracked. A few hooks into the page allocation code would give a reasonable estimate of how much memory is dedicated to outstanding I/O requests at any given time; these hooks could also be used to make a guess at how much memory each new request can be expected to need. Then, the block layer could use that guess and the current usage to ensure that the device does not exceed its maximum allowable memory usage. Daniel eventually rejected this approach, saying that looking at current memory use is risky. It may well be that a given device is committed to serving I/O requests which will, before they are done, require quite a bit more memory than has been allocated so far. In that case, memory usage could eventually exceed the cap in a big way. It's better, says Daniel, to do a conservative accounting at the beginning.

The patch does not address the memory reserve issue at all; instead, it relies on the current PF_MEMALLOC mechanism. It was necessary, says Daniel, to give the PF_MEMALLOC "privilege" to some system processes which assist in the writeout process, but nothing more than that was needed. He also claims that, for best results, much of the current code aimed at preventing writeout deadlocks needs to be removed from the kernel. He concludes:

Let me close with perhaps the most relevant remarks: the attached code has been in heavy testing and in production for months now. Thus there is nothing theoretical when I say it works, and the patch speaks for itself in terms of obvious correctness. What I hope to add to this in the not too distant future is the news that we have removed hundreds of lines of existing kernel code, maintaining stability and improving performance.

Since then, a couple of reviewers have pointed out problems in the code, dimming its aura of obvious correctness slightly. But nobody has found serious fault with the core idea. Determining its true effectiveness and making it work for a larger selection of storage configurations will take some time and effort. But, if the idea pans out, it could herald the end of a perennial and unpleasant problem for the Linux kernel.

Comments (none posted)

New bugs and old bugs

By Jonathan Corbet
December 12, 2007

As the 2.6.24 release slowly gets closer, the desire to shrink the list of known regressions grows. As can be seen from the current list (as of just before 2.6.24-rc5), there is still some work yet to be done. That list is long enough that, as Linus pointed out in the -rc5 announcement, the traditional holiday release may not happen this year.

One of those regressions is a failure of a certain model of DVD drive to work with the 2.6.24-rc kernels; this drive works fine with 2.6.23. A look at the corresponding bugzilla entry shows that quite a bit of effort has been expended (by both developers and testers) toward tracking this one down, but, as of this writing, its exact cause remains unknown. So there is not (again, as of this writing) a well-defined fix for the problem.

What is known is which patch broke the device. Tejun Heo describes it this way: "It's introduced by setting ATAPI transfer chunk size to actual transfer size which is the right thing to do generally." The current development code (destined for 2.6.25) works just fine with this device, but that would be far too big a patch to put into the 2.6.24 kernel at this stage in the cycle. So Tejun (along with others) continues to look for a simpler fix. He also has a backup plan:

If we fail to find out the solution in time, we always have the alternative of backing out the ATAPI transfer chunk size update. This will break some other cases which were fixed by the change but those won't be regressions at least and we can add transfer chunk size update with other changes to 2.6.25.

This plan drew an immediate complaint from Alan Cox, who notes that backing out this fix will break quite a few devices which had finally been made to work while fixing only one which is known to have problems with the new code. This change, he says, "...is nonsensical and not in the general good". Alan would rather take the hit of breaking one device for the benefit of making a larger number of others work properly for the first time. If need be, the failing drive could be handled via a special blacklist in 2.6.24.

That idea, however, was firmly shot down by Linus:

"The one off regression" is likely the tip of an iceberg. If something regresses for one person, for that one person who tested and noticed and made a bug-report, there's probably a thousand people who haven't even tested the development kernel, or who had problems and just went back to the previous version.

In contrast, reverting something will be guaranteed to not have those kinds of issues, since the only people who could notice are people for who it never worked in the first place. There's no "silent mass of people" that can be affected.

In recent years, as the complexity of the kernel (and concerns about its quality) have grown, the development community has taken an increasingly hard line against regressions. As Linus points out above, regressions cause visible problems for people whose systems were once working; that is a clear way to lose testers and (eventually) users. On the other hand, something which has never worked, and which still does not work, does not make life worse for Linux users. For this reason, the avoidance of regressions has become one of the highest development priorities.

There is another, related reason: the aforementioned kernel quality concerns. One can easily ask whether the quality of the kernel is improving or not, but truly answering that question is not an easy thing to do. A better kernel may, by attracting additional users, actually result in more bug reports; similarly, a buggier kernel may drive testers away, with the result that the number of reported bugs goes down. One cannot simply look at the lists of known problems and come to a reasonably defensible conclusion as to whether a given kernel is better than another or not.

What one can do, however, is ensure that everything which works now continues to work in future versions. If working things do not break, then, on the assumption that other problems are occasionally being fixed, it is reasonable to conclude that the kernel is getting better. If regressions are allowed, instead, then one never really knows. Regressions thus are the closest thing we have to an objective measurement of the quality of a given kernel release, and fixing regressions is an unambiguous way of improving that quality. So it's no wonder that the higher priority placed on improving kernel quality has led to a stronger focus on regressions.

Anybody who has watched Alan Cox's work knows that he cares deeply about the quality of the kernel. But he thinks that the anti-regression policy is being taken a little too far this time around:

To blindly argue regressions are critical is sometimes (as in this case) to argue that "this freeway is no longer compatible with a horse and cart" means the freeway should be turned back into a dirt road.

It may yet be that a proper fix for this problem will be found for 2.6.24, at which point the larger change can go through. Failing that, though, it appears that the horses and carts will win the day for now. Those needing the full freeway will have to wait until the horse-compatible version becomes available in 2.6.25.

(Update: it appears that the problem has now been fixed.)

Comments (2 posted)

Patches and updates

Kernel trees

Linus Torvalds Lnux 2.6.24-rc5 ?

Architecture-specific

Balbir Singh Fake NUMA emulation for PowerPC ?

Paul Mackerras [RFC][POWERPC] Provide a way to protect 4k subpages when using 64k pages ?

Core kernel code

Huang, Ying kexec based hibernation -v7 ?

Zach Brown syslets v7: back to basics ?

Development tools

Mathieu Desnoyers LTTng instrumentation (arch independent) ?

Mathieu Desnoyers LTTng architecture dependent instrumentation ?

Masami Hiramatsu kprobes x86 code unification and boosters ?

Device drivers

Poonam_Aggrwal-b10812 UCC based TDM driver for MPC83xx platforms. ?

Dmitry Baryshkov Tosa keyboard support ?

Anas Nashif Intel Management Engine Interface ?

Stephen Rothwell Introduce driver_create/remove_dir ?

Documentation

Michael Kerrisk man-pages-2.70 is released ?

Filesystems and block I/O

David Howells Permit filesystem local caching [try #2] ?

Jose R. Santos [PATCH] Flex_BG ialloc awareness V2. ?

Erez Zadok 00/42 Unionfs and related patches review ?

Evgeniy Polyakov DST: Distributed storage. ?

Kiyoshi Ueda [PATCH 00/30] blk_end_request: full I/O completion handler (take 4) ?

Memory management

Daniel Phillips [PATCH] A clean approach to writeout throttling ?

Mel Gorman Use two zonelists per node instead of multiple zonelists v11r2 ?

Networking

Renzo Davoli New Address Family: Inter Process Networking (IPN) ?

Michael Blizek Connection Oriented Routing Protocol ?

Virtualization and containers

Eric W. Biederman Core pid namespace enhancements ?

Miscellaneous

Avi Kivity Proposed new directory layout for kvm and virtualization ?

Page editor: Jonathan Corbet
Next page: Distributions>>