Kernel development

Brief items

Kernel release status

The current stable 2.6 release is, released on February 6. It contains a single, one-line fix for a remotely-exploitable denial of service vulnerability in the ICMP code.

The release is under review as of this writing. It is a rather larger patch with almost two dozen important fixes.

The current 2.6 prepatch is 2.6.16-rc2, released by Linus on February 2. In addition to the expected big pile of fixes, this prepatch adds another set of semaphore-to-mutex conversions, a USB driver for ET61X151 and ET61X251 camera controllers, a big Video4Linux update, the direct migration patches, some slab allocator tweaks for NUMA machines, several new system calls (openat() and friends, pselect(), ppoll()), a big ACPI update, and the EDAC error detection/correction code. The long-format changelog has lots of details.

The mainline git repository contains almost 500 post-rc2 patches as of this writing. They are dominated by fixes, but there is also a patch to export the system's CPU topology in sysfs, parallel port support for SGI O2 systems, administrator-changeable permissions in configfs, an OCFS2 update, the unshare() system call, and various architecture updates.

The current -mm tree is 2.6.16-rc2-mm1. Recent changes to -mm include a rework of the mempool code, a new version of the core timekeeping and NTP rework patches, better scheduler support for multicore systems, a feature for forcing kernel allocations to be spread across NUMA nodes, and an LED driver subsystem.

Kernel development news

Quotes of the week

We've got bin-only kernel modules, much of which are clearly immoral, they are clearly hurting us and still we do things to keep them going - e.g. the refusal to remove 8K stacks from the .config. We are increasingly getting into a situation where loopholes are found and utilized to give back as little as possible, upsetting the balance.

so i believe _something_ should be done to tip the balance, because the negative effects are already hurting us. I'd support the move to the GPLv3 only as a tool to move the balance back into a fairer situation, not as some new moral mechanism. The GPLv3 might be overboard for that, but still the situation does exist undeniably.

-- Ingo Molnar

After seven years and hundreds of issues, I've decided to take a break from writing Kernel Traffic for awhile. I'd like to thank all the people who helped out, providing me with hosting space, hardware to work on, suggestions and bug reports, and money. And I'd especially like to thank Linus and the rest of the kernel developers for so powerfully changing the world for the better.

-- Zack Brown

Asynchronous I/O and vectored operations

The file_operations structure contains pointers to the basic I/O operations exported by filesystems and char device drivers. This structure currently contains three different methods for performing a read operation:

    ssize_t (*read) (struct file *filp, char __user *buffer, size_t size, 
                     loff_t *pos);
    ssize_t (*readv) (struct file *filp, const struct iovec *iov, 
                      unsigned long niov, loff_t *pos);
    ssize_t (*aio_read) (struct kiocb *iocb, char __user *buffer, 
                         size_t size, loff_t pos);

Normal read operations end up with a call to the read() method, which reads a single segment from the source into the supplied buffer. The readv() method implements the system call by the same name; it will read one segment and scatter it into several user buffers, each of which is described by an iovec structure. Finally, aio_read() is invoked in response to asynchronous I/O requests; it reads a single segment into the supplied buffer, possibly returning before the operation is complete. There is a similar set of three methods for write operations.

Back in November, Zach Brown posted a vectored AIO patch intended to provide a combination of the vectored (readv()/writev()) operations and asynchronous I/O. To that end, it defined a couple of new AIO operations for user space, and added two more file_operations methods: aio_readv() and aio_writev(). There was some resistance to the idea of creating yet another pair of operations, and a feeling that there was a better way. The result, after work by Christoph Hellwig and Badari Pulavarty, is a new vectored AIO patch with a much simpler interface - at the cost of a significant API change.

The observation was made that a number of subsystems use vectored I/O operations internally in all cases, even in the case of a "scalar" read() or write() call. For example, the read() function in the current mainline pipe driver is:

    static ssize_t
    pipe_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos)
    {
        /* Wrap the single user buffer in a one-entry iovec array. */
        struct iovec iov = { .iov_base = buf, .iov_len = count };
        return pipe_readv(filp, &iov, 1, ppos);
    }

Here, the read() method is essentially superfluous; it is provided simply because the API requires it. So, it was asked, rather than adding more vectored I/O operations, why not just "vectorize" the standard API? The resulting patch set brings about that change in a couple of steps.
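The same wrapping pattern can be seen from user space with the POSIX readv() call. What follows is only a minimal sketch, not kernel code: scalar_read_via_readv() and pipe_roundtrip() are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: a "scalar" read expressed as a one-segment readv(). */
static ssize_t scalar_read_via_readv(int fd, void *buf, size_t count)
{
    struct iovec iov = { .iov_base = buf, .iov_len = count };

    return readv(fd, &iov, 1);
}

/*
 * Hypothetical demo helper: push a message through a pipe and read it
 * back with the wrapper above. Returns the byte count, or -1 on error.
 */
static ssize_t pipe_roundtrip(const char *msg, char *out, size_t outlen)
{
    int fds[2];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], msg, strlen(msg)) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    n = scalar_read_via_readv(fds[0], out, outlen);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The caller cannot tell whether the data arrived through a scalar or a vectored path, which is the observation behind the proposal.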

The first of those is to change the prototypes for the asynchronous I/O methods to:

    ssize_t (*aio_read) (struct kiocb *iocb, const struct iovec *iov, 
             unsigned long niov, loff_t pos);
    ssize_t (*aio_write) (struct kiocb *iocb, const struct iovec *iov,  
             unsigned long niov, loff_t pos);

Thus, the single buffer has been replaced with an array of iovec structures, each describing one segment of the I/O operation. For the current single-buffer AIO read and write commands, the new code creates a single-entry iovec array and passes it to the new methods. (It's worth noting that, as the code is currently written, that iovec array is no longer valid after aio_read() or aio_write() returns; any operation which remains outstanding when those functions finish will need its own copy of the array.)
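The copying requirement might look something like the following user-space sketch. It assumes nothing about how the patch itself handles the problem: struct pending_io and its two helpers are hypothetical names, and only the segment descriptors are copied, not the data buffers they point to.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical record for an operation that outlives the submitting call. */
struct pending_io {
    struct iovec *iov;      /* private copy of the caller's array */
    unsigned long niov;
};

/* Copy the iovec array so it stays valid after the caller returns. */
static int pending_io_init(struct pending_io *p, const struct iovec *iov,
                           unsigned long niov)
{
    if (niov == 0)
        return -1;
    p->iov = malloc(niov * sizeof(*iov));
    if (p->iov == NULL)
        return -1;
    memcpy(p->iov, iov, niov * sizeof(*iov));
    p->niov = niov;
    return 0;
}

/* Release the private copy once the operation has completed. */
static void pending_io_destroy(struct pending_io *p)
{
    free(p->iov);
    p->iov = NULL;
    p->niov = 0;
}
```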

The prototypes of a couple of VFS helper functions (generic_file_aio_read() and generic_file_aio_write()) have been changed in a similar manner. These changes ripple through every driver and filesystem providing AIO methods, making the patch reasonably large. A second patch then adds two new AIO operations (IOCB_CMD_PREADV and IOCB_CMD_PWRITEV) to the user-space interface, making vectored asynchronous I/O available to applications.

The patch set then goes one step further by eliminating the readv() and writev() methods altogether. With this patch in place, any filesystem or driver which wishes to provide vectored I/O operations must do so via aio_read() and aio_write() instead. Note that this change does not imply that asynchronous operations themselves must be supported - it is entirely permissible (if suboptimal) for aio_read() and aio_write() to operate synchronously at all times. But this patch does make it necessary for modules wishing to provide vectored operations to, at a minimum, provide the file_operations methods for asynchronous I/O. If the AIO methods are not available for a given device or filesystem, a call to readv() or writev() will be emulated through multiple calls to read() or write(), as usual.
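The emulation mentioned at the end of that paragraph can be modeled in user space as a loop over the segments. This is a rough sketch under stated assumptions rather than the kernel's actual fallback code; emulated_readv() is a hypothetical name, and the error handling shown is only one plausible choice.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical emulation: service a vectored read one segment at a time. */
static ssize_t emulated_readv(int fd, const struct iovec *iov,
                              unsigned long niov)
{
    ssize_t total = 0;
    unsigned long i;

    for (i = 0; i < niov; i++) {
        ssize_t n = read(fd, iov[i].iov_base, iov[i].iov_len);

        if (n < 0)
            return total ? total : n;   /* report progress made so far */
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;                      /* short read: stop here */
    }
    return total;
}
```

One call per segment is simple but costs a trip through the read path for each iovec entry, which is why the text calls a native vectored implementation preferable.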

Finally, with this patch in place, it is possible for a driver or filesystem to omit the read() and write() methods altogether if the asynchronous versions are provided. If, for example, only aio_read() is provided, all read() and readv() system calls will be handled by the aio_read() method. If, someday, all code implements the AIO methods, the regular read() and write() methods could be removed altogether. That would result in an interface which contained only one method for all read operations (and one more for writes). This change would also realize the vision expressed at the 2003 Kernel Summit that all I/O paths inside the kernel would, in the end, be made asynchronous.
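The fallback described above can be illustrated with a simplified user-space model. Everything here is an assumption made for illustration: struct model_fops is a stand-in with far fewer arguments than the real file_operations methods, and the choice to prefer read() when both methods exist is a guess, not something the patch description states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Simplified, hypothetical stand-in for struct file_operations. */
struct model_fops {
    ssize_t (*read)(void *src, char *buf, size_t size);
    ssize_t (*aio_read)(void *src, const struct iovec *iov,
                        unsigned long niov);
};

/* Handle a scalar read, falling back to aio_read() if read() is absent. */
static ssize_t model_do_read(const struct model_fops *fops, void *src,
                             char *buf, size_t size)
{
    if (fops->read)
        return fops->read(src, buf, size);
    if (fops->aio_read) {
        struct iovec iov = { .iov_base = buf, .iov_len = size };

        return fops->aio_read(src, &iov, 1);
    }
    return -1;      /* no read method of any kind */
}

/* Example vectored method: copies a NUL-terminated string into segments. */
static ssize_t string_aio_read(void *src, const struct iovec *iov,
                               unsigned long niov)
{
    const char *s = src;
    size_t left = strlen(s);
    ssize_t total = 0;
    unsigned long i;

    for (i = 0; i < niov && left > 0; i++) {
        size_t n = iov[i].iov_len < left ? iov[i].iov_len : left;

        memcpy(iov[i].iov_base, s, n);
        s += n;
        left -= n;
        total += (ssize_t)n;
    }
    return total;
}
```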

There has been little discussion of the current patch set, so it is hard to predict what may ultimately become of it. Given that it simplifies a core kernel API while simultaneously making it more powerful, however, chances are that some version of this patch will find its way into the kernel eventually.

(For more information on the AIO interface, see this Driver Porting Series article or chapter 15 of LDD3).

Software suspend - again

Last week's Kernel Page looked at one small piece of the software suspend debate. Meanwhile, the wider discussion has flared up yet again, and looks unlikely to slow down. Developers of the in-kernel suspend-to-disk code are working on moving parts of it to user space and generally tweaking the existing structure. Nigel Cunningham and other supporters of the Suspend2 patches, instead, still hope to see that work merged, eventually replacing much of the existing implementation. The discussion does not appear to be nearing any sort of resolution.

One thing has become clear, though: Pavel Machek has a firm grip on the current in-tree swsusp code, and that puts Suspend2 at a significant disadvantage. Pavel has taken a strong position against many aspects of the Suspend2 code, and seems determined that it will never be merged. One gets the sense, sometimes, that he just wishes Nigel and his code would go away. Nigel is somewhat more persistent than that, however.

At one point, the two suggested that Linus and Andrew should make a decision between the two implementations and settle the debate. Andrew, however, does not want to do that:

You're unlikely to hear anything dispositive from either of us on this... What we hope and expect is that you'll come up with an agreed path in accordance with general kernel coding and development principles. Linus and I don't want to have to make tiebreak decisions - if we have to do that, the system has failed.

So much for the easy solution. Since then, the relevant parties have been talking, but without a whole lot of apparent progress.

Perhaps the more interesting part of Andrew's note, however, was this:

If you want my cheerfully uninformed opinion, we should toss both of them out and implement suspend3, which is based on the kexec/kdump infrastructure. There's so much duplication of intent here that it's not funny.

kexec(), remember, is a relatively new system call used to boot from one kernel directly into another without going through the whole BIOS startup ritual. The kdump code uses kexec() to perform safe crash dumps. When the kernel panics, it uses kexec() to boot into a small, special-purpose kernel which has been lurking in a reserved part of memory for just this occasion. The new kernel restricts itself to the reserved memory, so the entire memory image of the old, crashed kernel remains intact. That image can then be written to disk in a relatively safe manner.

It is true that suspend-to-disk can be thought of as a sort of kernel dump; the only difference is this little desire to be able to restart the kernel from the dump image at a future time. Using kdump for suspend-to-disk has some obvious appeal. A great deal of effort now goes into freezing most processes on the system - but not the ones needed to complete the suspend process. The suspend code also must be very careful about what kernel state it changes as it goes about its work. Simply jumping into a separate dump kernel has the potential to make many of those problems go away. It might almost be like the Good Old Days, when BIOS-based suspend code simply worked most of the time.

A kdump-based suspend would not be without its costs. In particular, some people might balk at reserving a substantial chunk of memory for the suspend kernel. And, of course, the entire idea remains vaporware for now.

Andrew's suggestion generated little discussion on the mailing list. But, just maybe, it will have ignited a gleam in some hacker's eye. A simpler, more robust suspend mechanism based on kdump which appeared out of left field might just solve this problem - and put the whole tiresome debate in the past - for good.

PID virtualization: a wealth of choices

A set of patches for the management of virtual process IDs within containers was discussed here a few weeks ago. That patch set drew some interest, but a fair amount of concern as well. It is a large set of changes reaching all over the kernel; it seemed to many that there should be a better way. Since then, two candidates for the "better way" have been posted, and the situation seems less clear than ever. This sort of virtualization is clearly of interest to a number of projects, but there is little consensus on how it should be done.

One of the new entrants is the OpenVZ PID virtualization code, posted by Kirill Korotaev but originally developed by Alexey Kuznetsov. These patches introduce a container called a VPS (virtual private server); each VPS can virtualize a number of aspects of the host system, including process IDs. Each process has both a real and a virtual PID; all PIDs of the virtual variety are identified by having a specific bit set. In the simple case, the virtual-PID bit is the only difference between the real and virtual IDs, but more complex mappings are possible as well.

There is the usual set of functions to convert between real and virtual PIDs (and group, process group, and thread IDs as well). All code which deals with user space must work with virtual PIDs, but internal code uses real PIDs, so a certain amount of awareness is called for. Since there is a specific bit used to mark virtual PIDs, the code is at least able to catch situations where the wrong type of PID is used. There is also a change to the internal fork() implementation allowing a process to be created with a specific virtual PID; this feature can be used to launch a new container with its top-level process having PID 1.
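The simple case of the marker-bit scheme can be sketched as follows. The bit position and all of the names (VPID_MARK, pid_is_virtual(), and the two conversion helpers) are hypothetical; the values and function names used by the OpenVZ patches are not given here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker bit; the real patches may use a different one. */
#define VPID_MARK (1 << 30)

/* A PID is "virtual" if and only if the marker bit is set. */
static bool pid_is_virtual(int pid)
{
    return (pid & VPID_MARK) != 0;
}

/* Simple mapping: the marker bit is the only difference. */
static int real_to_virtual(int pid)
{
    assert(!pid_is_virtual(pid));   /* catch a PID of the wrong type */
    return pid | VPID_MARK;
}

static int virtual_to_real(int vpid)
{
    assert(pid_is_virtual(vpid));   /* catch a PID of the wrong type */
    return vpid & ~VPID_MARK;
}
```

The assertions stand in for the kind of sanity check the marker bit makes possible: handing a real PID to code expecting a virtual one is detected immediately rather than silently addressing the wrong process.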

The other implementation is this "process ID namespace" patch set from Eric Biederman. It does away with the concept of virtual PIDs in favor of a different view of the problem. For starters, every process gets a "wait ID" - the process ID by which its parent knows it. In most cases, the "wait ID" will be the same as the PID, but, in cases where a process is the leader of a virtualized group, the two will be different.

Then Eric adds process ID spaces. A process ID space (pspace) is simply a range of independent PIDs, associated with a tree of processes. By default, the entire system shares one process space, but, by way of a clone() flag, a new process can be created in its own space. Process IDs are unique within any one pspace, but may be duplicated in other spaces. So the kernel, when it must identify a process unambiguously using a PID, must now use a (pspace, PID) tuple. Functions which deal in PIDs - kill_pg() or find_task_by_pid(), for example - get a new pspace parameter.
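Lookup by (pspace, PID) tuple can be modeled with a small user-space sketch. The structures and the model_find_task() function are hypothetical and use a linear search for brevity; they say nothing about how the patch set actually stores or finds tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a PID space and a task. */
struct model_pspace {
    int id;
};

struct model_task {
    const struct model_pspace *pspace;
    int pid;
    const char *comm;
};

/* Find a task by (pspace, PID); the PID alone is no longer unique. */
static const struct model_task *
model_find_task(const struct model_task *tasks, size_t ntasks,
                const struct model_pspace *pspace, int pid)
{
    size_t i;

    for (i = 0; i < ntasks; i++)
        if (tasks[i].pspace == pspace && tasks[i].pid == pid)
            return &tasks[i];
    return NULL;
}
```

With two spaces each containing a PID 1, the same numeric PID resolves to different tasks depending on which pspace is passed in.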

This approach has the advantage that there is no distinction between real and virtual PIDs - all PIDs are interpreted relative to a PID space. There is no real possibility of confusing real and virtual PIDs, or interpreting PIDs relative to the wrong pspace. So it should be a relatively safe addition to the kernel. On the other hand, Eric's patches don't even try to address the larger virtualization problem; anybody wanting to implement complete containers will still have to do that work separately. Of course, as has been seen, a few projects have already done that work; it's just a matter of seeing which implementation, if any, gets into the mainline.

On that question, it is far too early to say what might happen. Linus has indicated that he likes the container concept from the OpenVZ patches, but that does not necessarily extend to the PID virtualization part of it. Eric has tried to focus the discussion with a summary of the relevant issues and questions which must be resolved going forward. But there is a certain amount of disagreement, and a few projects which have each invested significant time into their particular approaches. It may be a while before the dust settles on this one.

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds