Kernel development
Brief items
Kernel release status
The current 2.6 release is 2.6.4-rc1, which was announced by Linus on February 27. This large patch contains support for Intel's "ia32e" architecture, a new syscalls.h include file with prototypes for the various sys_* functions, various network driver fixes, a UTF-8 tty mode, dynamic PTY allocation (allowing up to a million PTY devices), sysfs support for SCSI tapes and bluetooth devices, the "large number of groups" patch (covered in the October 2 Kernel Page), the generic kernel thread code (January 7 Kernel Page), an HFS filesystem rewrite, and a massive number of other fixes. See the long-format changelog for the details.
Linus's BitKeeper tree contains a number of parallel port fixes, various architecture updates, the reversion of a patch which had removed threads from /proc (and broke gdb), an XFS update, a FireWire update (including one which notes that IEEE1394 support is no longer experimental), and numerous fixes.
The current kernel tree from Andrew Morton is 2.6.4-rc1-mm2. Recent additions to the -mm tree include more scheduler tweaks, some big NFS updates, the POSIX message queues patch, a 4K stack option for the x86 architecture, some VM optimizations, the removal of some old network device API functions (see below), and numerous other fixes and updates.
The current 2.4 kernel is 2.4.25. Marcelo has released no 2.4.26 prepatches since 2.4.26-pre1 on February 25.
Kernel development news
A retry-based AIO infrastructure
The asynchronous I/O infrastructure was added in 2.5 as a way to allow processes to initiate I/O operations without having to wait for their completion. The underlying mechanism is documented in this Driver Porting Series article. The actual implementation of asynchronous I/O in the kernel has been somewhat spotty, however. It works for some devices (which have specifically implemented that support) and for direct file I/O. Other sorts of potentially interesting uses, such as with regular buffered file I/O, have remained unimplemented.
Part of the problem is that buffered file I/O integrates deeply with the page cache and virtual memory subsystem. It is not all that easy to graft asynchronous I/O operations into those complex bodies of code. So the kernel developers have, for the most part, simply punted on cases like that.
Suparna Bhattacharya, however, has not given up so easily. For over a year now, she has been working on a set of patches which bring the asynchronous mode to the buffered I/O realm. A new set of patches has recently been posted which trims down the buffered AIO changes to the bare minimum. So this seems like a good time to take a look at what is involved in making asynchronous buffered I/O work.
The architecture implemented by these patches is based on retries. When an asynchronous file operation is requested, the code gets things started and goes as far as it can until something would block; at that point it makes a note and returns to the caller. Later, when the roadblock has been taken care of, the operation is retried until the next blocking point is hit. Eventually, all the work gets done and user space can be notified that the requested operation is complete. The initial work is done in the context of the process which first requested the operation; the retries are handled out of a workqueue.
For things to work in this mode, kernel code in the buffered I/O path must be taught not to block when it is working on an asynchronous request. The first step in this direction is the concept of an asynchronous wait queue entry. Wait queue entries are generally used, unsurprisingly, for waiting; they include a pointer to the process which is to be awakened when the wait is complete. With the AIO retry patch, a wait queue entry which has a NULL process pointer is taken to mean that actually waiting is not desired. When this type of wait queue entry is encountered, functions like prepare_to_wait() will not put the process into a sleeping state (though they do add the wait queue entry to the associated wait queue), and some functions will return the new error code -EIOCBRETRY rather than actually sleeping.
The next step is to add a new io_wait entry to the task structure. When AIO retries are being performed, that entry is pointed to an asynchronous wait queue entry associated with the specific AIO request. This task structure field is, for all practical purposes, being used in a hackish manner to pass the wait queue entry into functions deep inside the virtual memory subsystem. It might have been clearer to pass it explicitly as a parameter, but that would require changing large numbers of internal interfaces to support a rarely-used functionality. The io_wait solution is arguably less clean, but it also makes for a far less invasive patch. It does mean, however, that work can only proceed on a single AIO request at a time.
Finally, a few low-level functions have been patched to note the existence of a special wait queue entry in the io_wait field and to use it instead of the local entry that would normally have been used. In particular, page cache functions like wait_on_page_locked() and wait_on_page_writeback() have been modified in this way. These functions are normally used to wait until file I/O has been completed on a page; they are the point where buffered I/O often blocks. When AIO is being performed, instead, they will return the -EIOCBRETRY error code immediately.
The AIO code also takes advantage of the fact that wait queue entries, in 2.6, contain a pointer to the function to be called to wake up the waiting process. With an asynchronous request, there may be no such process; instead, the kernel needs to attempt the next retry. So the AIO code sets up its own wakeup function which does not actually wake any processes, but which does restart the relevant I/O request.
Once that structure is in place, all that's left is a bit of housekeeping code to keep track of the status of the request between retries. This work is done entirely within the AIO layer; as each piece of the request is satisfied, the request itself as seen by the filesystem layer is modified to take that into account. When the operation is retried to transfer the next chunk of data, it looks like a new request with the already-done portion removed.
Add in a few other hacks (telling the readahead code about the entire AIO request, for example, and an AIO implementation for pipes) and the patch set is complete. It does not attempt to fix every spot which might block (that would be a large task), but it should take care of the most important ones.
The end of init_etherdev() and friends
The last few 2.6 kernel releases have seen a lot of patches removing calls to a set of network driver support functions, including init_etherdev(), init_netdev(), and dev_alloc(). With the integration of networking and sysfs, static net_device structures have become impossible to use in a safe way; these structures must now be allocated dynamically and properly reference counted. See this Driver Porting Series article for details on the currently supported interface.
As of 2.6.3, there are no users of those functions in the mainline kernel tree. There are, however, certain to be out-of-tree drivers which still use them. Those drivers will need to be fixed soon; the 2.6.3-mm4 kernel tree added a patch which removes those functions forevermore. Once that patch works its way into the mainline kernel, any driver relying upon init_etherdev() and friends will cease to work until it is fixed. Don't say you haven't been warned.
pramfs - a new filesystem
Steve Longerbeam (of MontaVista) has sent out an announcement for a new filesystem called "pramfs." He would like to see pramfs merged into the mainline kernel in the near future; let it not be said that embedded Linux companies do not contribute to the kernel.
Pramfs (the "protected and persistent RAM special filesystem") is a specialized filesystem; it is intended for use in embedded systems which provide a bank of non-volatile memory for user data storage. Think, for example, of a phone book housed within a mobile telephone. Such memory tends to be fast, but it is not normally part of the system's regular core memory. It also tends to be important; cell phone users will not tolerate a phone which scrambles their phone numbers.
To meet the special needs presented by non-volatile RAM filesystems, pramfs does a number of things differently than normal filesystems. Since block positioning has no performance impact in RAM, pramfs does not worry about it. Since pramfs filesystems are expected to live in fast memory, there is generally no performance benefit to caching pages in main memory. So pramfs, interestingly, forces all file I/O to be direct; essentially, it forces the O_DIRECT flag on all file opens. In that way, pramfs gets the benefits of shorting out the page cache without having to change applications to use O_DIRECT explicitly.
Pramfs also goes out of its way to avoid corruption of the filesystem. If the underlying non-volatile RAM is represented in the system's page tables, it is marked read-only to keep a stray write from trashing things. When an explicit write to the filesystem is performed, the page permissions are changed only for the time required to perform the I/O. Pramfs disallows writes from the page cache; one practical result of that prohibition is that shared mappings of pramfs-hosted files are not possible.
See the pramfs web site for more information.
Time to thrash the 2.6 VM?
Those who have been watching kernel development for a little while will remember the fun that came with the 2.4.10 release, when Linus replaced the virtual memory subsystem with a new implementation by Andrea Arcangeli. The 2.4 kernel did end up with a stable VM some releases thereafter, but many developers were upset that such a major change would be merged that far into a stable series. Especially since many of those developers were not convinced that the previous VM was not fixable.
The 2.4 changes are long past, but the memories are fresh enough that when Andrea put forward a set of VM changes which, while they are for 2.4, are said to be applicable to 2.6 as well, people took notice. Andrea's goals this time are a little more focused; he is concerned with the performance of systems with at least 32GB of installed memory and hundreds of processes with shared mappings of large files. This, of course, is the sort of description that might fit a high-end database server.
Andrea has found three problems which make those massive servers fail to function well. The first has to do with how 2.4 performs swapout; it works by scanning each process's virtual address space, and unmapping pages that it would like to make free. When a page's mapping count reaches zero, it gets kicked out of main memory. The problem is that this algorithm performs poorly in situations where many processes have the same, large file mapped. The VM will start by unmapping the entire file for the first process, then another, and so on. Only when it has passed through all of the processes mapping the file can it actually move pages out of main memory. Meanwhile, all of those processes are incurring minor page faults and remapping the pages. With enough memory and processes, the VM subsystem is almost never able to actually free anything.
This is the problem that the reverse-mapping VM (rmap) was added to 2.5 to solve. By working directly with physical pages and following pointers to the page tables which map them, the VM subsystem can quickly free pages for other use. Andrea is critical of rmap, however; with his scenario of 32GB of memory and hundreds of processes, the rmap infrastructure grows to a point where the system collapses. Instead, for his patches, he has implemented a variant of the object-based reverse mapping scheme. Object-based reverse mapping works by following the links from the object (a shared file, say) which backs up the shared memory; in this way it is able to dispense with the rmap structures in many situations. There are some concerns about pathological performance issues with the object-based approach, but those problems do not seem to arise in real-world use.
The second problem is a simple bug in the swapout code. When shared memory is unmapped and set up for swap, the actual I/O to write it out to the swap file is not started right away. By the time the system gets around to actually performing I/O, there is a huge pile of pages waiting to be shoved out, and an I/O storm results. Even then, the way the kernel tracks this memory means that it takes a long time to notice that it is free even after it has been written to swap. This problem is fixed by taking frequent breaks to actually shove dirty memory out to disk.
Andrea's final problem came about when he tried to copy a large file while all those database processes were running. It turns out that the system was swapping out the shared database memory (which was dirty and in use) rather than the data from the file just copied (which is clean). Tweaking the memory freeing code to make it prefer clean cache pages over dirty pages straightened this problem out, at the cost of a certain amount of unfairness.
With these patches, Andrea claims, the 2.4 kernel can run heavy loads on large systems which will immediately lock up a 2.6 system. So he is going to start looking toward 2.6, with an eye toward beefing it up for this sort of load. Andrew Morton has indicated that he might accept some of this work - but not yet:
I plan to merge the 4g split immediately after 2.7 forks. I wouldn't be averse to objrmap for file-backed mappings either - I agree that the search problems which were demonstrated are unlikely to bite in real life.
The "4g split" is Ingo Molnar's 4GB user-space patch which makes more low memory available to the kernel, but at a performance cost. Before Andrew merges any other patches, however, he wants to see a convincing demonstration of why the current VM patches are not enough for large loads. The 2.6 "stable" kernel may well see some significant virtual memory work, but, with luck, it will not be subjected to a 2.4.10-like abrupt switch.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Miscellaneous
Page editor: Jonathan Corbet
