Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.24-rc5, released by Linus on December 10. He says:
The list of fixes is still fairly long; there is also a significant FireWire stack update. The short-form changelog is included in Linus's announcement; see the long-format changelog for all the details.
A handful of patches have found their way into the mainline git repository since the -rc5 release.
Kernel development news
Quotes of the week
I couldnt point to any particular aspect of SLAB that i could characterise as "needless bloat".
Simpler syslets
Syslets are a proposed mechanism which would allow any system call to be invoked in an asynchronous manner; this technique promises a more comprehensive and simpler asynchronous I/O mechanism and much more - once all of the pesky little details can be worked out. A while back, Zach Brown let it be known that he had taken over the ongoing development of the syslets patch set; things have been relatively quiet since then. But Zach has just returned with a new syslets patch which shows where this idea is going.
This version of the patch removes much of the functionality seen in previous postings. The ability to load simple programs into the kernel for asynchronous execution is now gone, as is the "threadlet" mechanism for asynchronous execution of user-space functions. Instead, syslets have gone back to their roots: a mechanism for running a single system call without blocking.
As had been foreshadowed in other discussions, syslets now use the indirect() system call mechanism. An application wanting to perform an asynchronous system call fills in a syslet_args structure describing how the asynchronous execution is to be handled; the application then calls indirect() to make it happen. If the system call can run without blocking, indirect() simply returns with the final status. If blocking is required, the kernel will (as with previous versions of this patch) return to user space in a separate process while the original process waits for things to complete. Upon completion, the final status is stored in user-space memory and the application is notified in an interesting way.
The syslet_args structure looks like this:
struct syslet_args {
    u64 completion_ring_ptr;
    u64 caller_data;
    struct syslet_frame frame;
};
The completion_ring_ptr field contains a pointer to a circular buffer stored in user space. The head of the buffer is defined this way:
struct syslet_ring {
    u32 kernel_head;
    u32 user_tail;
    u32 elements;
    u32 wait_group;
    struct syslet_completion comp[0];
};
Here, kernel_head is the index of the next completion ring entry to be filled in by the kernel, and user_tail is the next entry to be consumed by the application. If the two are equal, the ring is empty. The elements field says how many entries can be stored in the ring; it must be a power of two. The kernel uses wait_group as a way of locating a wait queue internally when the application waits on syslet completion; your editor suspects that this part of the API may not survive into the final version.
Finally, the completion status values themselves live in the array of syslet_completion structures, which look like this:
struct syslet_completion {
    u64 status;
    u64 caller_data;
};
When a syslet completes, the final return code is stored in status, while the caller_data field is set with the value provided in the field by the same name in the syslet_args structure when the syslet was first started.
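To make the ring semantics concrete, here is a minimal user-space sketch of how an application might consume completions. It is not taken from Zach's patch: the helpers ring_alloc() and ring_pop() are hypothetical, the structures use standard C fixed-width types in place of the kernel's u32/u64, and the sketch assumes that both indices are kept wrapped within the range of valid entries (the article does not say whether they wrap or run free). A real consumer sharing this memory with the kernel would also need memory barriers, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space mirrors of the structures described above. */
struct syslet_completion {
    uint64_t status;
    uint64_t caller_data;
};

struct syslet_ring {
    uint32_t kernel_head;   /* next entry the kernel will fill in */
    uint32_t user_tail;     /* next entry the application will consume */
    uint32_t elements;      /* ring capacity; must be a power of two */
    uint32_t wait_group;
    struct syslet_completion comp[];   /* C99 flexible array member */
};

/* Hypothetical helper: allocate a zeroed ring holding 'elements' entries. */
static struct syslet_ring *ring_alloc(uint32_t elements)
{
    struct syslet_ring *ring;

    /* The ring size must be a nonzero power of two. */
    assert(elements != 0 && (elements & (elements - 1)) == 0);

    ring = calloc(1, sizeof(*ring) + elements * sizeof(ring->comp[0]));
    if (ring)
        ring->elements = elements;
    return ring;
}

/*
 * Hypothetical helper: copy the oldest completion into *out.
 * Returns 1 if an entry was consumed, 0 if the ring is empty.
 * Assumes the indices are stored already wrapped to [0, elements).
 */
static int ring_pop(struct syslet_ring *ring, struct syslet_completion *out)
{
    uint32_t mask = ring->elements - 1;

    if (ring->kernel_head == ring->user_tail)
        return 0;                       /* equal indices: ring is empty */

    *out = ring->comp[ring->user_tail & mask];
    ring->user_tail = (ring->user_tail + 1) & mask;
    return 1;
}
```

The power-of-two requirement is what allows the wrap-around to be done with a simple mask rather than a division.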
There is one field of syslet_args which has not been discussed yet: frame. The definition of this structure is architecture-dependent; for the x86 architecture it is:
struct syslet_frame {
    u64 ip;
    u64 sp;
};
These values are used when the syslet completes. After the kernel stores the completion status in the ring buffer, it will call the function whose address is stored in ip, using the stack pointer found in sp. This call serves as a sort of instant, asynchronous notification to the application that the syslet is done. It's worth noting that this call is performed in the original process - the one in which the syslet was executed - rather than in the new process used to return to user space when the syslet blocked. This function also has nothing to return to, so, after doing its job, it should simply exit.
So, to review, here is how a user-space application will use syslets to call a system call asynchronously:
- The completion ring is established and initialized in user space.
- A stack is allocated for the notification function, and the syslet_args structure is filled in with the relevant information.
- A call is made to indirect() to get the syslet going.
- If the system call of interest is able to complete without blocking, the return value is passed directly back to user space from indirect() and the call is complete.
- Otherwise, once the system call blocks, execution switches to a new process which returns to user space. An ESYSLETPENDING error is returned in this case.
- Once the system call completes, the kernel stores the return value in the completion ring and calls the notification function in the original process.
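The setup steps at the top of this list might look something like the following sketch. It stops short of calling indirect(), since that interface is still in flux. The names syslet_args_prepare(), on_syslet_done(), and NOTIFY_STACK_SIZE are hypothetical, as is the stack size; the structures mirror the x86 definitions above using standard C types. How the notification function should "exit" is also an assumption here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct syslet_frame {           /* x86 layout, as shown above */
    uint64_t ip;
    uint64_t sp;
};

struct syslet_args {
    uint64_t completion_ring_ptr;
    uint64_t caller_data;
    struct syslet_frame frame;
};

#define NOTIFY_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/*
 * Hypothetical notification function. A real one would drain the
 * completion ring here. It has nothing to return to, so it must not
 * return; calling exit() is this sketch's guess at "simply exit".
 */
static void on_syslet_done(void)
{
    exit(EXIT_SUCCESS);
}

/*
 * Hypothetical helper: allocate a notification stack and fill in *args.
 * Returns 0 on success, -1 if the stack could not be allocated.
 * The caller owns *stack_out and frees it once the syslet is done.
 */
static int syslet_args_prepare(struct syslet_args *args, void *ring,
                               uint64_t cookie, void **stack_out)
{
    char *stack;

    assert(args != NULL && stack_out != NULL);

    stack = malloc(NOTIFY_STACK_SIZE);
    if (!stack)
        return -1;

    memset(args, 0, sizeof(*args));
    args->completion_ring_ptr = (uint64_t)(uintptr_t)ring;
    args->caller_data = cookie;     /* echoed back in the completion entry */
    args->frame.ip = (uint64_t)(uintptr_t)on_syslet_done;
    /* x86 stacks grow downward, so sp starts at the high end. */
    args->frame.sp = (uint64_t)(uintptr_t)(stack + NOTIFY_STACK_SIZE);

    *stack_out = stack;
    return 0;
}
```

With the structure filled in, the application would pass it to indirect() along with a description of the system call to be run.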
Should the application wish to stop and wait for any outstanding syslets to complete, it can make use of a new system call:
int syslet_ring_wait(struct syslet_ring *ring, unsigned long user_idx);
Here, ring is the pointer to the completion ring, and user_idx is the value of the user_tail index as seen by the process. Providing the tail as an argument to syslet_ring_wait() prevents problems with race conditions which might come about if a syslet completes after the application has decided to wait. This call will return once there is at least one completion in the ring.
The real purpose of this set of patches is to try to nail down the user-space API for syslets; it is clear that there is still some work to be done. For example, there is no way, currently, for an application to use indirect() to simultaneously launch a syslet and (as was the original purpose for indirect()) provide additional arguments to the target system call. In fact, the means for determining which of the two is being done looks dangerously brittle. As Zach has already noted, the calling convention needs to be changed to make the syslet functionality and the addition of arguments orthogonal.
There are a number of other questions which need to be answered - Zach has supplied a few of them with the patch. Interaction with ptrace() is unclear, resource management issues abound, and so on. Zach is clearly looking for feedback on these issues:
So, the message is clear: anybody who is interested in how this interface will look would be well advised to pay attention to it now.
Writeout throttling
The avoidance of writeout deadlocks is a topic which occasionally pops up on the mailing lists. Most Linux systems are able to handle the writeout of dirty pages to disk without a great deal of trouble. Every now and then, however, the system can get itself into a state where it is out of memory and it must write some pages to disk before any more memory can be allocated. If the act of writing those pages, itself, requires memory allocations, the system can deadlock. Systems with complicated block I/O setups - those using the device mapper, network-based storage, user-space filesystems, etc. - are the most susceptible to this problem.
There has been a steady stream of patches aimed at solving this problem; the write throttling patch discussed here last August is one of them. The problem is inherently hard to solve, though; it looks like it may be with us for a long time. Or maybe not, if Daniel Phillips's new and rather aggressively promoted writeout throttling patch lives up to its hype.
Daniel's patch is quite simple at its core. His approach for eliminating writeout-related deadlocks comes down to this:
- Establish a memory reserve from which (only) code performing writeout can allocate pages. In fact, this reserve already exists, in that some memory is reserved for the use of processes marked with the PF_MEMALLOC flag.
- Place an upper limit on the amount of memory which can be used for writeout to each device at any given time.
The patch does not try to directly track the amount of memory which will be used by each writeout request; instead, it tasks block-level drivers with accounting for the number of "units" which will be used. To that end, it adds an atomic_t variable (called available) and a function pointer (metric()) to each request queue. When an outgoing request finds its way to __generic_make_request(), it is passed to metric() to get an estimate of the amount of resource which will be required to handle that request. If the estimated resource requirement exceeds the value of available, the process will simply block until a request completes and available is incremented to a sufficiently high level.
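The accounting scheme can be modeled in user space with C11 atomics. What follows is a sketch of the idea only, not Daniel's code: struct throttled_queue, struct request, sectors_metric(), queue_try_reserve(), and queue_release() are all hypothetical names, and where the kernel would put the submitting process to sleep, this version simply reports failure and leaves the waiting to the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an I/O request. */
struct request {
    unsigned int sectors;
};

struct throttled_queue {
    atomic_int available;                       /* resource units left */
    int (*metric)(const struct request *req);   /* units a request needs */
};

/* Example metric: charge one unit per sector. */
static int sectors_metric(const struct request *req)
{
    return (int)req->sectors;
}

/*
 * Try to reserve the units estimated for req. On success the amount
 * is stored in *reserved and true is returned. If the estimate exceeds
 * what is available, nothing is taken and false is returned; the
 * kernel would block the caller at this point instead.
 */
static bool queue_try_reserve(struct throttled_queue *q,
                              const struct request *req, int *reserved)
{
    int need = q->metric(req);
    int avail = atomic_load(&q->available);

    while (avail >= need) {
        /* On failure, avail is reloaded and the check is repeated. */
        if (atomic_compare_exchange_weak(&q->available, &avail,
                                         avail - need)) {
            *reserved = need;
            return true;
        }
    }
    return false;
}

/* Called at request completion to hand the units back. */
static void queue_release(struct throttled_queue *q, int units)
{
    assert(units >= 0);
    atomic_fetch_add(&q->available, units);
}
```

Remembering the reserved amount with the request matters: the completion path must return exactly what was charged, even if the metric would give a different answer by then.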
The metric() function is to be supplied by the highest-level block driver responsible for the request queue. If that block driver is, itself, responsible for getting the data to the physical media, estimating the resource requirements will be relatively easy. The deadlock problems, however, tend to come up when I/O requests have to go through multiple layers of drivers; imagine a RAID array built on top of network-based storage devices, for example. In that case the top level will have to get resource requirement estimates from the lower levels, a problem which has not been addressed in this patch set.
Andrew Morton suggested an alternative approach wherein the actual memory use by each block device would be tracked. A few hooks into the page allocation code would give a reasonable estimate of how much memory is dedicated to outstanding I/O requests at any given time; these hooks could also be used to make a guess at how much memory each new request can be expected to need. Then, the block layer could use that guess and the current usage to ensure that the device does not exceed its maximum allowable memory usage. Daniel eventually rejected this approach, saying that looking at current memory use is risky. It may well be that a given device is committed to serving I/O requests which will, before they are done, require quite a bit more memory than has been allocated so far. In that case, memory usage could eventually exceed the cap in a big way. It's better, says Daniel, to do a conservative accounting at the beginning.
The patch does not address the memory reserve issue at all; instead, it relies on the current PF_MEMALLOC mechanism. It was necessary, says Daniel, to give the PF_MEMALLOC "privilege" to some system processes which assist in the writeout process, but nothing more than that was needed. He also claims that, for best results, much of the current code aimed at preventing writeout deadlocks needs to be removed from the kernel. He concludes:
Since then, a couple of reviewers have pointed out problems in the code, dimming its aura of obvious correctness slightly. But nobody has found serious fault with the core idea. Determining its true effectiveness and making it work for a larger selection of storage configurations will take some time and effort. But, if the idea pans out, it could herald the end of a perennial and unpleasant problem for the Linux kernel.
New bugs and old bugs
As the 2.6.24 release slowly gets closer, the desire to shrink the list of known regressions grows. As can be seen from the current list (as of just before 2.6.24-rc5), there is still some work yet to be done. That list is long enough that, as Linus pointed out in the -rc5 announcement, the traditional holiday release may not happen this year.
One of those regressions is a failure of a certain model of DVD drive to work with the 2.6.24-rc kernels; this drive works fine with 2.6.23. A look at the corresponding bugzilla entry shows that quite a bit of effort has been expended (by both developers and testers) toward tracking this one down, but, as of this writing, its exact cause remains unknown. So there is not (again, as of this writing) a well-defined fix for the problem.
What is known is which patch broke the device. Tejun Heo describes it this way: "It's introduced by setting ATAPI transfer chunk size to actual transfer size which is the right thing to do generally." The current development code (destined for 2.6.25) works just fine with this device, but that would be far too big a patch to put into the 2.6.24 kernel at this stage in the cycle. So Tejun (along with others) continues to look for a simpler fix. He also has a backup plan:
This plan drew an immediate complaint from Alan Cox, who notes that backing out this fix will break quite a few devices which had finally been made to work while fixing only one which is known to have problems with the new code. This change, he says, "...is nonsensical and not in the general good". Alan would rather take the hit of breaking one device for the benefit of making a larger number of others work properly for the first time. If need be, the failing drive could be handled via a special blacklist in 2.6.24.
That idea, however, was firmly shot down by Linus:
In contrast, reverting something will be guaranteed to not have those kinds of issues, since the only people who could notice are people for who it never worked in the first place. There's no "silent mass of people" that can be affected.
In recent years, as the complexity of the kernel (and concerns about its quality) have grown, the development community has taken an increasingly hard line against regressions. As Linus points out above, regressions cause visible problems for people whose systems were once working; that is a clear way to lose testers and (eventually) users. On the other hand, something which has never worked, and which still does not work, does not make life worse for Linux users. For this reason, the avoidance of regressions has become one of the highest development priorities.
There is another, related reason: the aforementioned kernel quality concerns. One can easily ask whether the quality of the kernel is improving or not, but truly answering that question is not an easy thing to do. A better kernel may, by attracting additional users, actually result in more bug reports; similarly, a buggier kernel may drive testers away, with the result that the number of reported bugs goes down. One cannot simply look at the lists of known problems and come to a reasonably defensible conclusion as to whether a given kernel is better than another or not.
What one can do, however, is ensure that everything which works now continues to work in future versions. If working things do not break, then, on the assumption that other problems are occasionally being fixed, it is reasonable to conclude that the kernel is getting better. If regressions are allowed, instead, then one never really knows. Regressions thus are the closest thing we have to an objective measurement of the quality of a given kernel release, and fixing regressions is an unambiguous way of improving that quality. So it's no wonder that the higher priority placed on improving kernel quality has led to a stronger focus on regressions.
Anybody who has watched Alan Cox's work knows that he cares deeply about the quality of the kernel. But he thinks that the anti-regression policy is being taken a little too far this time around:
It may yet be that a proper fix for this problem will be found for 2.6.24, at which point the larger change can go through. Failing that, though, it appears that the horses and carts will win the day for now. Those needing the full freeway will have to wait until the horse-compatible version becomes available in 2.6.25.
(Update: it appears that the problem has now been fixed.)
Page editor: Jonathan Corbet
