Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.22-rc1, released by Linus on May 12. Features added to 2.6.22 since last week's Kernel Page include the eventfd system calls, a new IEEE 1394 ("FireWire") stack "designed for robustness and simplicity," drivers for KingSun DS-620 USB-IrDA dongles, Native Instruments USB audio devices, and WM8753 audio codecs (as found in the OpenMoko phone), a large set of fixes to the "libertas" wireless driver, and support for a number of new ARM processors. See the short-form changelog for the details, or the long-form changelog for vast amounts of detail. As of this writing, about 100 changesets (almost all fixes) have been added to the mainline repository since 2.6.22-rc1.
The current -mm tree is 2.6.22-rc1-mm1. Recent changes to -mm include a number of filesystem writeback fixes, a CRC7 implementation, an improved version of the swap prefetch code, and an early startup development tree for the i386 architecture.
Kernel development news
2.6 and the user-space ABI
The 2.6.22-rc1 kernel is out, and the reports of regressions are beginning to trickle in. A couple of those involve user-space binary interface changes: one in the video4linux2 interface and one in the i2c code (which affects hardware monitoring utilities). The V4L2 regression involves a change made to a structure passed to and from user space; chances are good that it will be reverted before the final 2.6.22 kernel comes out. For the time being, the i2c problem is "fixed" by upgrading to version 2.10.3 of the lm_sensors package. Linus isn't happy about the forced lm_sensors upgrade; he has asked for a way to avoid that requirement. In response, i2c maintainer Jean Delvare raised some misgivings about the stable ABI policy:
Those comments notwithstanding, Linux has managed to maintain user-space ABI compatibility quite nicely for many years. There are certainly exceptions, but they are few enough and far enough between that each one stands out. But, as Christoph Hellwig points out, the situation is not perfect:
Now compare that to sysfs..
The user-space ABI now goes well beyond system calls. The huge sysfs interface (4800 files on your editor's desktop) is a big piece of user space's view of the system, and it is a piece which is difficult to avoid breaking. Directories in sysfs correspond directly to data structures within the kernel; changes to those structures will often have consequences in sysfs. So kernel developers may think that they are operating far away from the user-space interface, but end up breaking it anyway. Netlink, /proc, and ioctl() also make up part of the ABI, and they, too, can be easy to break. The V4L2 regression is the result of an attempt to extend one ioctl() call that ended up breaking others.
The new development model can also make it harder to maintain compatibility. Four or five major releases per year, each with a full load of new major features, adds up to a lot of code changes. There is also no clear point where whatever changes do prove to be unavoidable can be made without surprising users. If the kernel developers were to disappear for a year or two and return with a 3.0 release, nobody would be surprised if it required a small amount of adaptation on the user-space side. But a 2.6.22 release - which contains needed fixes and new drivers along with new features - is not expected to break working systems.
Arguments for returning to the older development model are hard to find, though. Despite occasional glitches, things are generally working far better than they did before 2.6 came out. The pace of development is unlikely to slow. So the problem of occasional ABI regressions is likely to remain with us. As is often the case, the best way to avoid such problems - after a high degree of attention by the developers - is extensive testing. User-space ABI changes caught during the development cycle will almost certainly not survive into the final release, but it is hard to fix problems that nobody knows about. As is also often the case, automating this testing is hard; nobody can put together all of the hardware and software combinations that the kernel will face. So the worthy cause of maintaining a stable user-space interface is likely to require a fair amount of human attention for the foreseeable future.
LogFS
Rotating magnetic storage technology has served us well for a long time. It offers high capacities (for an ever-increasing value of "high"), relatively fast and relatively uniform access times, and relatively good reliability. It is generally accepted that rotating disks will be part of our systems for some time yet. For smaller sizes, however, disks are increasingly being pushed aside by solid state flash memory - and "smaller" is an ever-increasing value as well. Flash is more compact, requires less power, and offers truly random access, so it's not surprising to see it being deployed in more situations. Flash is not without its drawbacks. Its relatively high cost limits its applications and it brings its own set of quirks which must be understood and addressed by filesystem developers. Even so, some special-purpose laptops rely on flash for their persistent storage needs now, and there are rumors of more flash-based systems in the near future.
The most significant of the "quirks" mentioned above are:
- Flash storage cannot simply be overwritten like magnetic storage; instead, a flash block must be explicitly erased and rewritten in two separate steps. The size of the "erase blocks" may not match the block size as understood by the operating system; often, the erase blocks are relatively large.
- There are limits to the number of times a block of flash memory can be erased and rewritten before it loses the ability to reliably store data. That limit is generally around 100,000 cycles.
These hardware features have some interesting implications. What, for example, happens when the operating system decides to rewrite a single block within a larger flash erase block? A naive implementation would read the entire erase block, perform the erase operation, then write the data back with the new portion included. Should the system go down in the middle of this operation, however, all of the data within the erase block may be lost forever. If the operating system ignores the block lifetime issues, it is likely to cycle some erase blocks much more frequently than others, significantly shortening the overall life of the device. When one is dealing with a low-duty-cycle device, such as a USB thumb drive, it's possible to get away with ignoring the limitations that flash has. When a flash drive is the primary storage device, though, a smarter approach is called for.
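The crash window in that naive read-erase-write sequence is easy to demonstrate with a small user-space simulation (the names and the tiny erase-block size here are invented for illustration; real erase blocks are far larger):

```c
#include <assert.h>
#include <string.h>

#define EB_PAGES 4   /* pages per erase block; real devices use many more */

static char flash[EB_PAGES];   /* one simulated erase block */

/* The naive read-erase-write described above.  'crash' simulates a
 * power loss after the erase but before the write-back: every page in
 * the erase block is lost, not just the one being updated. */
static void naive_rewrite(int page, char newval, int crash)
{
    char ram[EB_PAGES];

    memcpy(ram, flash, EB_PAGES);   /* 1: copy the erase block to RAM */
    ram[page] = newval;             /* 2: update the one changed page  */
    memset(flash, 0, EB_PAGES);     /* 3: erase the whole block        */
    if (crash)
        return;                     /* power loss: old data is gone    */
    memcpy(flash, ram, EB_PAGES);   /* 4: write everything back        */
}
```

A crash between steps 3 and 4 destroys the unrelated pages that merely shared the erase block - which is exactly why log-structured approaches write the new copy elsewhere before retiring the old one.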
Being smarter is usually a matter of using a filesystem which was explicitly designed to work well with flash hardware. These filesystems can dispense with the great care that other filesystems must take in how blocks are laid out - there are no seek time or rotational latency issues with flash drives. On the other hand, flash-aware filesystems must be written with erase cycles in mind; they must not risk losing data during these cycles and they should endeavor to spread these cycles across the drive to maximize its lifetime.
The end result is that filesystems designed for flash devices take the log-structured approach. The device is treated like a sort of circular buffer, with new data always being written to the end. This approach makes for fast write operations, but the read side can be a more complex story. One approach taken is to attach some metadata to each erase block describing which file that block belongs to and its version number. When an erase block is to be rewritten, a new copy is made at the end with a higher version number; reading the file is simply a matter of finding the erase block with the highest version number.
Finding that block requires scanning the disk - something which, most likely, one does not want to do for every read operation. The in-kernel JFFS2 filesystem solves this problem by performing a scan when the filesystem is mounted. It builds an in-memory data structure which speeds subsequent accesses considerably. There is a cost, though: the initial scan can make mounting slow, and the in-memory tree can take a considerable amount of space. Given that flash filesystems are often used on small, embedded systems - where both boot time and memory are at a premium - these costs are significant.
Jörn Engel thinks he has a better way in the form of the LogFS filesystem, currently proposed for inclusion into the mainline. The core idea behind LogFS is that, rather than building the filesystem tree at mount time, the filesystem code should store the tree on the device itself, much like traditional filesystems do. Putting the tree on the flash device reduces mount times (Jörn says that an OLPC system goes from 3.3 seconds under JFFS2 to 60ms under LogFS) and should reduce the runtime memory requirements considerably.
The on-flash tree looks much like the structure used by ext2. There are some differences in how it is managed, however. The log structure of the filesystem implies that blocks cannot be rewritten in place; any time a block is changed it must be moved and written to a new location. If there are pointers to the moved block (think about the usual indirect blocks used to store the layout of larger files), the blocks containing the pointers must also be changed, and thus moved. That, in turn, will require changes at the next level up in the tree. Thus changes at the bottom of the tree will propagate upward all the way to the root. This is the "wandering tree" algorithm. One of the advantages is that the old filesystem structure remains valid until the root is rewritten - a crash could cause the loss of the last operation, but it will leave previous data and the structure of the filesystem intact.
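The key property of the wandering tree - that the old root remains a valid snapshot until the new one lands - can be modeled with a toy one-level copy-on-write tree in user space (all names are invented; this is not LogFS's on-disk format):

```c
#include <assert.h>
#include <string.h>

#define FANOUT   4
#define LOG_SIZE 64

static char log_data[LOG_SIZE][16];     /* appended data blocks */
static int  log_ind[LOG_SIZE][FANOUT];  /* appended indirect blocks */
static int  n_data, n_ind;

/* All writes go to the end of the log, never in place. */
static int append_data(const char *s)
{
    strcpy(log_data[n_data], s);
    return n_data++;
}

static int append_ind(const int ptrs[FANOUT])
{
    memcpy(log_ind[n_ind], ptrs, FANOUT * sizeof(int));
    return n_ind++;
}

/* Rewrite slot 'slot' under 'root': write a new data block, then a new
 * indirect block pointing at it.  Returns the NEW root; the old root
 * is untouched and still describes a consistent, older tree. */
static int wander_update(int root, int slot, const char *s)
{
    int ptrs[FANOUT];

    memcpy(ptrs, log_ind[root], sizeof ptrs);
    ptrs[slot] = append_data(s);
    return append_ind(ptrs);
}

static const char *wander_read(int root, int slot)
{
    return log_data[log_ind[root][slot]];
}
```

With more levels, each update would ripple one new block per level up to the root, exactly as the text describes; a crash before the new root is committed simply leaves the old root - and the old, consistent tree - in effect.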
Actually managing the entire directory tree as a wandering tree would be expensive; beyond that, files with multiple hard links break the tree structure and make wandering trees much harder to implement. So the actual tree implemented by LogFS just has two levels. There is an "inode file" containing the inode structures for every file and directory existing within the filesystem; each inode then points to the associated blocks holding the file's data. Directory entries contain a simple integer index giving the inode offset within the inode file. So changes to an inode only require writing the inode itself and the inode file; the rest of the directory structure need not be touched.
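The indirection through the inode file can be sketched as follows (struct and function names are hypothetical, chosen only to mirror the description above):

```c
#include <assert.h>
#include <string.h>

/* Two-level scheme: directory entries hold only an integer offset into
 * a single "inode file"; each inode then points at the file's data. */
struct toy_inode  { int first_block; };
struct toy_dirent { char name[16]; int ino; };  /* ino = offset in inode file */

static struct toy_inode inode_file[32];

/* Name lookup: find the directory entry, then index the inode file.
 * Rewriting a file's data only touches its inode and the inode file;
 * the directory entry - and everything above it - is left alone. */
static struct toy_inode *toy_lookup(const struct toy_dirent *dir, int n,
                                    const char *name)
{
    for (int i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return &inode_file[dir[i].ino];
    return 0;
}
```

Because two directory entries can carry the same inode offset, hard links fall out naturally - which is the problem a full wandering directory tree would have struggled with.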
To tie it all together, LogFS sets aside a group of blocks as the "anchor area," where versioned pointers to the root inode can be found. Mounting the filesystem requires scanning this anchor area to find the current version of the root inode, at which point the rest of the filesystem becomes accessible. This mechanism allows the root to be found in constant time without the need to scan the entire device.
LogFS has been through a couple of rounds of review, with significant changes each time. Barring significant problems, it should be getting close to ready; perhaps it will be merged in time for 2.6.23.
(See also: Jörn's LogFS paper, from which much of the above was cribbed).
Scatterlist chaining
High-performance I/O generally involves the use of direct memory access (DMA) operations. With DMA, the I/O device transfers data directly to and from main memory without the intervention of the CPU. In the simplest form of DMA, the controller is handed a pointer to a region of memory, given the length, and told to do its thing. The processor can then forget about the operation until the device signals that the work is done. This simple view has a drawback, however, in that it assumes that the data to be transferred is stored contiguously in memory. When the I/O buffer is in kernel space, the kernel can often arrange for it to be physically contiguous - though that gets harder as the size of the buffers gets larger. If the buffer is in user space, it is guaranteed to be scattered around physical memory. So it would be nice if DMA operations could work with buffers which are split into a number of distinct pieces.
In fact, with any reasonably capable peripheral device, buffers can be split this way. The term for operations on such buffers is "scatter/gather I/O"; scatter/gather has been well supported under Linux for some time. The DMA chapter of Linux Device Drivers covers scatter/gather in a fair amount of detail. In short, a driver starts by filling in an array of scatterlist structures, which (on the i386 architecture) look like:
    struct scatterlist {
	struct page *page;
	unsigned int offset;
	dma_addr_t dma_address;
	unsigned int length;
    };
For each segment, the page pointer tells where the segment is to be found in memory, offset tells where the data begins within the page, and length is the length of the segment. Once the list has been filled in, the driver calls:
int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction direction);
This operation, at a minimum, fills in the dma_address field of each scatterlist entry with an address which can be given to the peripheral. It might do more, though: physically contiguous pages may be coalesced into a single scatterlist entry, or the system's I/O memory management unit might be programmed to make parts (or all) of the list virtually contiguous from the device's point of view. All of this - including the exact form of struct scatterlist - is architecture dependent, but the scatter/gather interface is set up so that drivers need not worry about architecture details.
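The coalescing behavior mentioned above can be illustrated in ordinary user-space C. This is a sketch of the idea only - struct seg and coalesce() are invented names, not the kernel API - showing how physically adjacent segments merge into fewer, larger descriptors:

```c
#include <assert.h>
#include <stddef.h>

/* A simplified DMA segment: a physical address and a length. */
struct seg {
    unsigned long addr;
    unsigned long len;
};

/* Merge physically contiguous segments in place; returns the new
 * number of segments.  This models one optimization dma_map_sg() is
 * permitted to perform on an architecture's behalf. */
static size_t coalesce(struct seg *sg, size_t n)
{
    size_t out = 0;

    for (size_t i = 1; i < n; i++) {
        if (sg[out].addr + sg[out].len == sg[i].addr)
            sg[out].len += sg[i].len;   /* contiguous: extend current */
        else
            sg[++out] = sg[i];          /* gap: start a new segment */
    }
    return n ? out + 1 : 0;
}
```

A device handed the coalesced list has fewer descriptors to walk, which is one reason drivers must use the dma_address and length produced by the mapping rather than their original entries.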
Recently, a particular constraint in the scatter/gather interface has turned up. For various reasons, scatterlists must fit within a single page; that restriction puts an upper limit on the number of segments which may be represented. On the i386 architecture, with high memory enabled, struct scatterlist requires 20 bytes, which limits a scatterlist to 204 entries. If each scatterlist entry points to a full page, the maximum size for a DMA transfer will be about 800KB. On an x86-64 system, the situation is worse: the structure uses 40 bytes, cutting the maximum length in half.
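The arithmetic behind those limits is simple enough to check directly, assuming the usual 4096-byte page:

```c
#include <assert.h>

/* How many scatterlist entries fit in one 4096-byte page, given the
 * per-entry size on a particular architecture. */
static unsigned int max_sg_entries(unsigned int entry_size)
{
    return 4096 / entry_size;
}
```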
There are situations where larger I/O operations are desirable. The block I/O subsystem is one of them, but there are certainly others: high-resolution video capture devices are an example. The limitation on scatterlist length is one of the factors motivating developers who are working on large block size support. By increasing the effective page size, they are able to increase the maximum I/O operation size.
Increasing the page size is not the only feasible approach, though; another is simply to make scatterlists longer. Multi-page contiguous scatterlists are not really in the cards, but chaining single-page scatterlists can be done. Jens Axboe has been working on that approach; his scatterlist chaining patch is on its sixth revision as of this writing.
Chaining is done by overloading the page pointer in the last scatterlist entry in a page. The least significant bit is set to indicate that the entry is, in fact, a chain link rather than another segment to transfer. The change is almost transparent to drivers. In current kernels, the code which iterates through a scatterlist usually looks something like this:
    struct scatterlist *sg = &the_scatterlist[0];

    for (i = 0; i < nentries; i++) {
	program_io_operation(sg);
	sg++;
    }
When chaining is being used, simply incrementing through the array no longer works. Jens has therefore added a simple sg_next() macro to follow the chain links when necessary; the sg++ line above turns into something like:
sg = sg_next(sg);
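The pointer-tagging trick can be modeled in ordinary user-space C. This is a sketch of the idea, not the kernel's actual struct scatterlist or sg_next() implementation: a set low bit in the page pointer marks an entry as a link to the next list page rather than a real segment.

```c
#include <assert.h>
#include <stdint.h>

struct sgl {
    void *page;          /* segment page, or a tagged link to another array */
    unsigned int length;
};

/* Is this entry a chain link?  (Real page pointers are always even.) */
#define SG_IS_CHAIN(sg) (((uintptr_t)(sg)->page) & 1)
/* Strip the tag bit to recover the pointer to the next array. */
#define SG_CHAIN(sg)    ((struct sgl *)(((uintptr_t)(sg)->page) & ~(uintptr_t)1))

/* Advance to the next segment, transparently crossing array boundaries. */
static struct sgl *sgl_next(struct sgl *sg)
{
    sg++;
    if (SG_IS_CHAIN(sg))
        sg = SG_CHAIN(sg);
    return sg;
}
```

A driver-style loop then advances with sgl_next() instead of sg++ and never notices where one page of entries ends and the next begins - which is why the conversion of existing drivers is so mechanical.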
Since a driver change is required, chained scatterlists should not be used unless one knows for sure that the driver is prepared for them. The patch from Jens fixes up a number of drivers, especially in the block subsystem. Even so, the maximum I/O size must be raised explicitly by the administrator (via a sysfs file) before chaining will be turned on. Once it's enabled, however, multi-megabyte I/O operations become possible - with no intrusive memory management changes required.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
