Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.10-rc1, which came out on October 22.

Patches currently sitting in Linus's BitKeeper repository include fixes for the ELF loader security problems, kprobes support for the x86-64 architecture, a frame buffer device update, a set of user-mode Linux patches, an NTFS update, version 2.0 of the USB gadget serial driver, some kernel build tweaks (the preferred name for kernel makefiles is now Kbuild), the ext3 block reservation and online resizing patches, sysfs backing store, locking behavior annotations for the "sparse" utility, a reworking of spin lock initialization, the un-exporting of add_timer_on(), sys_lseek(), and a number of other kernel functions, an x86 signal delivery optimization, an IDE update, I/O space write barrier support, a frame buffer driver update, more scheduler tweaks, some big kernel lock preemption patches, a large number of architecture updates, and lots of fixes.

The current tree from Andrew Morton is 2.6.10-rc1-mm4. The biggest recent change in -mm, perhaps, is the inclusion of the four-level page table patch in 2.6.10-rc1-mm3 and subsequent fixes in -mm4; Andrew has stated that he expects to merge four-level page tables in the near future. Other changes include support for the FRV architecture, some scheduler tweaks, the un-exporting of cdev_get() and cdev_put(), a number of architecture updates, and the usual pile of fixes.

The current 2.4 prepatch is 2.4.28-rc2, released by Marcelo on November 7. It contains some networking updates and a patch for a (difficult to exploit) security problem; if nothing new turns up, it will become the official 2.4.28 release.

Kernel development news

Into the ABISS

Version 2 of the Active Block I/O Scheduling System (ABISS) was released on November 9. At first glance, ABISS looks like yet another I/O scheduler for a kernel which already has a few of them - and that it is. But there is more to ABISS which makes it worth a look.

The goal behind ABISS is to enable applications to request (and receive) a guaranteed I/O rate to a specific file. It implements a sort of isochronous stream capability for the Linux block layer. The target applications might be multimedia recording and playback programs, or, perhaps, some sort of data acquisition system. Any application which needs assurance that it can transfer data to or from the filesystem at a given rate could benefit from ABISS.

For now, guaranteed data rates are only supported for read access, and only for a few filesystems. The core of the read side of ABISS is the "playout buffer." It is, for all practical purposes, a circular buffer in kernel space which is filled at the requested I/O rate. As long as the application does not exceed its requested rate for long periods of time, the data it requests should always be located in the buffer, and thus immediately available. The playout buffer is integrated with the page cache, so accessing the file via mmap() will also work - though, in that case, the application must inform ABISS of its progress through the file so that playout buffer pages can be released when no longer needed.
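The playout buffer idea can be modeled in a few lines of user-space Python. This is only a toy sketch, not ABISS code: the class name, the "tick" time unit, and the page-count cap are all assumptions made for illustration. It shows a buffer that is refilled at a fixed rate, never grows past a limit, and only stalls the reader when the reader outruns the requested rate.

```python
from collections import deque


class PlayoutBuffer:
    """Toy model of a rate-filled read-ahead buffer (hypothetical, not ABISS code)."""

    def __init__(self, rate_pages_per_tick, max_pages):
        self.rate = rate_pages_per_tick   # requested fill rate, in pages per tick
        self.max_pages = max_pages        # cap so the buffer cannot grow too large
        self.pages = deque()              # pages already read ahead, oldest first
        self.next_page = 0                # next file page to fetch from disk

    def tick(self):
        """Advance one time unit: read ahead at the requested rate, up to the cap."""
        room = self.max_pages - len(self.pages)
        for _ in range(min(self.rate, room)):
            self.pages.append(self.next_page)
            self.next_page += 1

    def read(self, count):
        """Consume up to `count` pages; returns (pages, stalled)."""
        available = min(count, len(self.pages))
        got = [self.pages.popleft() for _ in range(available)]
        # A short read means the application outran its requested rate.
        return got, len(got) < count


buf = PlayoutBuffer(rate_pages_per_tick=2, max_pages=8)
buf.tick()
buf.tick()
pages, stalled = buf.read(2)   # within the requested rate: no stall
```

An application which stays within its rate always finds its pages waiting; one which asks for more than has been read ahead gets a short read, which is the situation the guarantee does not cover.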

Setting up this buffer requires a few steps. The application uses an ioctl() call to request a guaranteed read rate; that request is then passed back to a user-space daemon for approval. The daemon is supposed to keep track of all such requests and ensure that the system actually has enough resources to implement another fixed-rate stream. Any policy decisions on which processes are allowed to request guaranteed-rate behavior - and the rates they can ask for - are also made in the user-space daemon.
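The daemon's bookkeeping might look something like the following sketch. The names, the use of kilobytes per second, and the simple "sum of rates must fit within a fixed capacity" policy are assumptions for illustration; the actual ABISS daemon may account for resources quite differently.

```python
class AdmissionDaemon:
    """Toy admission control for fixed-rate streams (hypothetical policy)."""

    def __init__(self, capacity_kbps, per_stream_limit_kbps):
        self.capacity = capacity_kbps              # total rate the system can sustain
        self.per_stream_limit = per_stream_limit_kbps
        self.granted = {}                          # stream id -> granted rate

    def request(self, stream_id, rate_kbps):
        """Return True and record the stream if the rate can be guaranteed."""
        if rate_kbps <= 0 or rate_kbps > self.per_stream_limit:
            return False                           # policy: rate not allowed
        # Ignore any existing grant for this stream; the new request replaces it.
        in_use = sum(r for s, r in self.granted.items() if s != stream_id)
        if in_use + rate_kbps > self.capacity:
            return False                           # not enough resources left
        self.granted[stream_id] = rate_kbps
        return True

    def release(self, stream_id):
        """Forget a stream, freeing its share of the capacity."""
        self.granted.pop(stream_id, None)


daemon = AdmissionDaemon(capacity_kbps=1000, per_stream_limit_kbps=600)
ok = daemon.request("player", 500)      # accepted
too_much = daemon.request("recorder", 600)   # refused: would exceed capacity
```

Keeping this logic in user space means that the policy can be changed without touching the kernel.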

If the daemon approves the request, the kernel builds an in-memory map of the location of the file's data blocks. This map is used when filling the playout buffer; its real purpose is to do the file location lookup work ahead of time and minimize unexpected I/O while the file is being processed. The operational phase consists of filling the playout buffer at the given rate while not allowing it to get too large. The idea is conceptually simple, though the actual implementation involves a number of somewhat tricky details.

ABISS differs from other I/O schedulers in that it does not just fit neatly into the block layer. Each filesystem must have ABISS support explicitly added to it. In particular, ABISS must be able to intercept ioctl() calls and build the location map. When the filesystem-level code decides to look for a specific block within the file, the ABISS code, which may already have that location in its map, needs a chance to short out the usual lookup code. Finally, ABISS must be notified when a file is truncated, since it needs to adjust the location map to match the new size. Since filesystem-level changes are needed, ABISS does not support all filesystems in the Linux kernel; version 2 only works with FAT, VFAT, and ext3.
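As a rough illustration of the location map and the three filesystem interactions described above, consider this sketch. Everything here is hypothetical: `slow_lookup` stands in for whatever block-mapping routine a filesystem normally uses, and the real map is a kernel data structure rather than a Python list.

```python
class LocationMap:
    """Toy precomputed file-block -> disk-block map (hypothetical, not ABISS code)."""

    def __init__(self, slow_lookup, num_blocks):
        self._slow_lookup = slow_lookup
        # Do all of the lookup work up front, before streaming starts.
        self._map = [slow_lookup(n) for n in range(num_blocks)]

    def lookup(self, file_block):
        """Answer from the map when possible, bypassing the usual lookup."""
        if 0 <= file_block < len(self._map):
            return self._map[file_block]
        return self._slow_lookup(file_block)   # fall back to the normal path

    def truncate(self, new_num_blocks):
        """Drop entries past the new end of file."""
        del self._map[max(new_num_blocks, 0):]


def slow_lookup(n):
    # Stand-in for a filesystem's block mapping; the offset is made up.
    return 1000 + n


location_map = LocationMap(slow_lookup, num_blocks=4)
disk_block = location_map.lookup(2)    # served from the map
location_map.truncate(2)               # file shrank; map follows
```

The point of the exercise is that, once the map exists, a lookup during streaming never has to trigger extra I/O to read filesystem metadata.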

Underneath it all is a real I/O scheduler. The primary feature there is the implementation of I/O request priorities. Requests to fill the playout buffer go in at a high priority, and will be executed before most others. The ABISS I/O scheduler also implements a set of "best effort" priorities which can be used when guaranteed I/O rates are not required.
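Priority-based dispatch of this sort can be sketched with a heap. The priority names and the strict ordering below are assumptions; the real scheduler lets playout requests go before "most" others, which suggests that its rules are more nuanced than this.

```python
import heapq
from itertools import count

# Hypothetical priority levels; lower numbers are served first.
GUARANTEED = 0
BEST_EFFORT_HIGH = 1
BEST_EFFORT_LOW = 2


class PriorityRequestQueue:
    """Toy strict-priority request queue, FIFO within each priority."""

    def __init__(self):
        self._heap = []
        self._seq = count()   # tie-breaker preserving submission order

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dispatch(self):
        """Return the next request to issue, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = PriorityRequestQueue()
queue.submit(BEST_EFFORT_LOW, "log write")
queue.submit(GUARANTEED, "playout fill 1")
queue.submit(BEST_EFFORT_HIGH, "database read")
queue.submit(GUARANTEED, "playout fill 2")
first = queue.dispatch()   # a playout fill goes first
```

A strict scheme like this one can starve low-priority requests indefinitely, which is one reason a production scheduler would need something more careful.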

More information can be found on the ABISS project page.

Stackable security modules

The Linux security module framework allows the flexible loading of security modules into the kernel. These modules are allowed to hook into a large number of kernel functions and, if they deem it appropriate, block an attempted user-space operation. As a way of helping security modules, many core kernel structures include a void * "security" pointer which may be used to attach security-related information. These structures include those representing inodes, files, open sockets, processes, and more.

One shortcoming of the security module mechanism - according to some developers, at least - is that it makes life hard for people who are trying to load more than one module. There is some rudimentary support for stacking modules; essentially, any modules which request stacked loading are simply passed to the "primary" module. The primary module can refuse to accept the stacked module at all (in which case the load fails), or it can, in its own way, arrange to call the stacked module's hooks when it sees fit. So stacking a module requires that the author of the first-loaded module explicitly thought about and coded support for that mode of operation. Since that support must be added to a large number of security hooks, most security module authors conclude that they have better things to do with their time.

There is also the little matter of that void * security pointer in all those structures. If modules are to be stacked, they must come up with some way of sharing that single pointer without creating chaos.

Serge Hallyn has been trying to address the stacking problem for some time; his latest attempt was recently posted to linux-kernel with a request for comments. He certainly got a few of those.

The patch supports stacking security modules by separating them from each other to the greatest extent possible. The existing security hooks are all pointed at a set of "stacker" hooks; each one calls the associated hook provided by each stacked module, and returns a failure code if any of the modules decides to block the operation. The various void * pointers are each replaced by a static array, dimensioned to the maximum configured number of security modules (four by default). Each loaded module is given an index into the array, and is expected to work with its entry only. Thus, all security modules must be changed to work properly in the stacking mode.
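In outline, the scheme works something like the sketch below. This is a user-space model, not the patch itself: the class names and the two example modules are invented, only one hook (inode_permission) is shown, and the refusal to register more modules than there are array slots is an addition made by this sketch.

```python
import errno

MAX_MODULES = 4           # the default maximum mentioned above
MAY_WRITE, MAY_READ = 2, 4


class Inode:
    def __init__(self):
        # One slot per module, replacing the single "security" pointer.
        self.security = [None] * MAX_MODULES


class Stacker:
    """Toy stacker: fans each hook out to every registered module."""

    def __init__(self):
        self.modules = []

    def register(self, module):
        if len(self.modules) >= MAX_MODULES:
            raise RuntimeError("no free security slot")
        module.index = len(self.modules)   # the module's private array index
        self.modules.append(module)
        return module.index

    def inode_permission(self, inode, mask):
        result = 0
        for module in self.modules:        # every module gets called
            err = module.inode_permission(inode, mask)
            if err and not result:
                result = err               # remember the first failure
        return result


class AuditOnly:                           # hypothetical module
    def inode_permission(self, inode, mask):
        inode.security[self.index] = "seen"
        return 0


class DenyWrites:                          # hypothetical module
    def inode_permission(self, inode, mask):
        return -errno.EACCES if mask & MAY_WRITE else 0


stacker = Stacker()
stacker.register(AuditOnly())
stacker.register(DenyWrites())
verdict = stacker.inode_permission(Inode(), MAY_WRITE)   # blocked by DenyWrites
```

Note that neither module knows the other exists; each sees only its own slot and its own verdict.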

The code itself has drawn a few complaints; not everybody is convinced by how the locking works, for example. Adding static arrays to heavily-used kernel data structures (such as files and inodes) will significantly increase kernel memory usage. Your editor, in his reading of the patch, can find no code which prevents loading more than the configured maximum number of modules and corrupting all of those structures. And so on.

The real issue of contention, however, is whether security module stacking makes any sense in the first place. Stacked modules operate without any awareness of each other, but could interact to produce surprising results. In the security world, surprising results tend not to be welcome. The right approach, as expressed by James Morris (and others), is to load SELinux and let it handle the loading of other security policies. SELinux was designed to do this, and it should be able to handle module interactions in a more predictable way. Whether other developers are willing to accept SELinux as the One True Base Security Module remains to be seen; it seems more likely than getting blind security module stacking into the kernel, however.

Partitioned loopback devices

The expanded device number type in the 2.6 kernel makes it possible, at the lowest level, to support vast numbers of partitions on every block device in the system. Unfortunately, the Linux block drivers have not caught up with this change. SCSI, in particular, is still limited to 15 partitions per device. There are a few reasons for this lag, but the largest is simple compatibility: there is no easy way to incorporate support for more partitions without breaking the existing device numbering scheme. The block layer assumes that partitions have consecutive minor numbers, so supporting more partitions means increasing the portion of the minor number which is dedicated to the partition number. But changing the interpretation of minor numbers in this way would break existing systems, and that is something the kernel developers are reluctant to do.
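The arithmetic behind the compatibility problem is easy to see in a small sketch. It assumes the conventional layout in which the low bits of the minor number hold the partition number (with zero meaning the whole disk) and the remaining bits select the disk; the function names are made up for this example.

```python
def split_minor(minor, partition_bits):
    """Return (disk_index, partition) under a given bit split."""
    mask = (1 << partition_bits) - 1
    return minor >> partition_bits, minor & mask


def make_minor(disk_index, partition, partition_bits):
    """Build a minor number; partition 0 stands for the whole disk."""
    if not 0 <= partition < (1 << partition_bits):
        raise ValueError("partition number does not fit")
    return (disk_index << partition_bits) | partition


# Four partition bits: 16 minors per disk, so 15 usable partitions.
max_partitions = (1 << 4) - 1

# Under that split, minor 19 is the third partition of the second disk.
old_meaning = split_minor(19, 4)   # (1, 3)

# Widen the field to five bits and the same number means something else.
new_meaning = split_minor(19, 5)   # (0, 19)
```

Any existing device node carrying minor number 19 would thus silently start referring to a different partition on a different disk, which is exactly the sort of breakage the developers want to avoid.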

Carl-Daniel Hailfinger has recently posted an interesting solution to the partition limit: partitioned loopback devices. A loopback device is a kernel-implemented virtual block device which is backed by something real - usually a disk partition or a file on a disk somewhere. Common uses for loopback devices include mounting regular files as filesystems and creating encrypted filesystems (though the device mapper is the preferred means for the latter application in 2.6). Loopback devices do not support partitions in their own right; they simply provide block-level access to the backing store as a single partition.

Carl-Daniel noticed, however, that adding partition support to loopback devices would be a relatively straightforward thing to do. In 2.6, partition handling is (finally) part of the block layer; all that is really required to support partitions in the loopback driver is to tell the block layer that those partitions exist. So, with a small patch, each loopback device can have up to 127 partitions. The bulk of the patch, in fact, is there to ensure continued compatibility for users of non-partitioned loopback devices.

This capability is interesting because it is a simple matter of one losetup command to create a loopback interface to a real disk drive. Thus, by using loopback devices in this mode, system administrators can get around the partition limits enforced by the real hardware drivers and divide their disks into lots of tiny little pieces. There is some small overhead associated with using the loopback device, but, for users in need of more partitions, it may well be a price worth paying.

Patches and updates

Kernel trees

  • Andrew Morton: 2.6.10-rc1-mm3. Includes 4-level page tables. (November 5, 2004)

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds