Brief items
The current 2.6 prepatch remains 2.6.10-rc1, which came out on
October 22.
Patches currently sitting in Linus's BitKeeper repository include fixes for
the ELF loader security problems, kprobes
support for the x86-64 architecture, a frame buffer device update, a set of
user-mode Linux patches, an NTFS update, version 2.0 of the USB gadget
serial driver, some kernel build tweaks (the preferred name for kernel
makefiles is now Kbuild),
the ext3 block reservation and online resizing patches, sysfs backing store, locking behavior
annotations for the "sparse" utility, a reworking of spin lock
initialization, the un-exporting of a number of kernel functions (including
add_timer_on() and sys_lseek()), an x86 signal
delivery optimization, an IDE update, I/O space
write barrier support, a frame buffer driver update, more scheduler
tweaks, some big kernel lock preemption patches, a large number of
architecture updates, and lots of fixes.
The current tree from Andrew Morton is 2.6.10-rc1-mm4. The biggest recent change in
-mm, perhaps, is the inclusion of the four-level page table patch in 2.6.10-rc1-mm3 and subsequent fixes in -mm4;
Andrew has stated that he expects to merge four-level page tables in the
near future.
Other changes include support for the FRV architecture, some scheduler
tweaks, the un-exporting of cdev_get() and cdev_put(), a
number of architecture updates, and the usual pile of fixes.
The current 2.4 prepatch is 2.4.28-rc2, released by Marcelo on November 7. It
contains some networking updates and a patch for a (difficult to exploit)
security problem; if nothing new turns up, it will become the official
2.4.28 release.
Kernel development news
Version 2 of the Active Block I/O
Scheduling System (ABISS) was released on November 9. At first
glance, ABISS looks like yet another I/O scheduler for a kernel which
already has a few of them - and that it is. But there is more to ABISS
which makes it worth a look.
The goal behind ABISS is to enable applications to request (and receive) a
guaranteed I/O rate to a specific file. In effect, it implements a sort of
isochronous stream capability for the Linux block layer. The target
applications might be multimedia recording and playback programs, or,
perhaps, some sort of data acquisition system. Any application which needs
assurance that it can transfer data to or from the filesystem at a given
rate could benefit from ABISS.
For now, guaranteed data rates are only supported for read access, and only
for a few filesystems. The core of the read side of ABISS is the "playout
buffer." It is, for all practical purposes, a circular buffer in kernel
space which is filled at the requested I/O rate. As long as the
application does not exceed its requested rate for long periods of time,
the data it requests should always be located in the buffer, and thus
immediately available. The playout buffer is integrated with the page
cache, so accessing the file via mmap() will also work - though,
in that case, the application must inform ABISS of its progress through the
file so that playout buffer pages can be released when no longer needed.
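The playout buffer's behavior can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the PlayoutBuffer class, its tick-based notion of time, and the run() helper are all hypothetical and are not taken from the ABISS code. It models a bounded buffer refilled at a fixed rate; reads are served immediately as long as the reader stays within that rate.

```python
class PlayoutBuffer:
    """Toy model of a fixed-rate playout buffer (hypothetical, not ABISS code)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # bytes added per tick (the guaranteed rate)
        self.capacity = capacity  # upper bound on buffered bytes
        self.level = 0            # bytes currently buffered

    def tick(self):
        """Refill at the guaranteed rate, never exceeding the capacity."""
        self.level = min(self.capacity, self.level + self.rate)

    def read(self, nbytes):
        """Return True if the request can be served without waiting."""
        if nbytes > self.level:
            return False  # the reader has outrun the guaranteed rate
        self.level -= nbytes
        return True


def run(reads, rate=100, capacity=400):
    """Apply one refill tick before each read; return hit/miss per read."""
    buf = PlayoutBuffer(rate, capacity)
    results = []
    for nbytes in reads:
        buf.tick()
        results.append(buf.read(nbytes))
    return results
```

A reader that briefly exceeds its rate can still be satisfied from data buffered earlier, which is why the text speaks of exceeding the rate "for long periods of time" as the real problem.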
Setting up this buffer requires a few steps. The application uses an
ioctl() call to request a guaranteed read rate; that request is then
passed back to a user-space daemon for approval. The daemon is supposed to
keep track of all such requests and ensure that the system actually has
enough resources to implement another fixed-rate stream. Any policy
decisions on which processes are allowed to request guaranteed-rate
behavior - and the rates they can ask for - are also made in the user-space
daemon.
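The daemon's role can be sketched as a simple admission-control check. Everything here is an assumption made for illustration: the AdmissionDaemon class, the single total-bandwidth budget, and the per-stream limit are hypothetical policy choices, not a description of how the actual ABISS daemon decides.

```python
class AdmissionDaemon:
    """Hypothetical user-space policy: admit streams while bandwidth remains."""

    def __init__(self, total_bandwidth, per_stream_limit):
        self.total_bandwidth = total_bandwidth    # assumed system-wide budget
        self.per_stream_limit = per_stream_limit  # assumed policy cap per stream
        self.granted = {}                         # stream id -> reserved rate

    def reserved(self):
        """Total rate currently promised to admitted streams."""
        return sum(self.granted.values())

    def request(self, stream_id, rate):
        """Approve the request only if policy and remaining capacity allow it."""
        if stream_id in self.granted:
            return False
        if not 0 < rate <= self.per_stream_limit:
            return False
        if self.reserved() + rate > self.total_bandwidth:
            return False
        self.granted[stream_id] = rate
        return True

    def release(self, stream_id):
        """Return a stream's reservation to the pool."""
        self.granted.pop(stream_id, None)
```

Keeping this logic in user space means the policy can change without touching the kernel, which appears to be the point of the design described above.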
If the daemon approves the request, the kernel builds an in-memory map of
the location of the file's data blocks. This map is used when filling the
playout buffer; its real purpose is to do the file location lookup work
ahead of time and minimize unexpected I/O while the file is being
processed.
The operational phase consists of filling the playout buffer at the given
rate while not allowing it to get too large. The idea is conceptually
simple, though the actual implementation involves a number of somewhat
tricky details.
ABISS differs from other I/O schedulers in that it does not just fit neatly
into the block layer. Each filesystem must have ABISS support explicitly
added to it. In particular, ABISS must be able to intercept
ioctl() calls and build the location map. When the
filesystem-level code decides to look for a specific block within the file,
the ABISS code, which may already have that location in its map, needs a
chance to short-circuit the usual lookup code. Finally, ABISS must be notified
when a file is truncated, since it needs to adjust the location map to
match the new size. Since filesystem-level changes are needed, ABISS does
not support all filesystems in the Linux kernel; version 2 only works
with FAT, VFAT, and ext3.
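The three filesystem-level requirements (building the map, answering lookups from it, and tracking truncation) can be sketched together. This is a hypothetical model: the LocationMap class and the slow_lookup callback stand in for whatever the real code does, and a flat list of block numbers is assumed purely for simplicity.

```python
class LocationMap:
    """Hypothetical cache of file-block -> disk-block translations."""

    def __init__(self, slow_lookup, nblocks):
        # slow_lookup stands in for the filesystem's own block lookup.
        self.slow_lookup = slow_lookup
        # Do all of the expensive lookups ahead of time.
        self.blocks = [slow_lookup(i) for i in range(nblocks)]

    def lookup(self, file_block):
        """Answer from the map when possible, else fall back to the filesystem."""
        if 0 <= file_block < len(self.blocks):
            return self.blocks[file_block]
        return self.slow_lookup(file_block)

    def truncate(self, nblocks):
        """Drop entries beyond the new end of the file."""
        del self.blocks[max(0, nblocks):]
```

Once the map is built, a lookup inside the mapped range never calls back into the filesystem, which is the "unexpected I/O" the map is meant to avoid.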
Underneath it all is a real I/O scheduler. The primary feature
there is the implementation of I/O request priorities. Requests to fill
the playout buffer go in at a high priority, and will be executed before
most others. The ABISS I/O scheduler also implements a set of "best
effort" priorities which can be used when guaranteed I/O rates are not
required.
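Priority-based dispatch of this kind can be modeled with a heap. The sketch below is an assumption-laden illustration, not the ABISS scheduler: the PriorityDispatcher class is hypothetical, lower numbers are assumed to mean higher priority, and playout-buffer refills are assumed to sit at the very top level.

```python
import heapq
import itertools

GUARANTEED = 0  # assumed: playout-buffer refills use the top priority


class PriorityDispatcher:
    """Hypothetical request queue: lowest priority number is served first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order within a priority

    def submit(self, request, priority):
        """Queue a request at the given priority level."""
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dispatch(self):
        """Return the next request to execute, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The sequence counter keeps requests of equal priority in arrival order, so best-effort requests at the same level are not reordered among themselves.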
More information can be found on the ABISS project page.
The Linux security module framework allows the flexible loading of security
modules into the kernel. These modules are allowed to hook into a large
number of kernel functions and, if they deem it appropriate, block an
attempted user-space operation. As a way of helping security modules, many
core kernel structures include a void * "security" pointer
which may be used to attach security-related information. These structures
include those representing inodes, files, open sockets, processes, and
more.
One shortcoming of the security module mechanism - according to some
developers, at least - is that it makes life hard for people who are trying
to load more than one module. There is some rudimentary support for
stacking modules; essentially, any modules which request stacked loading
are simply passed to the "primary" module. The primary module can
refuse to accept the stacked module at all (in which case the load fails),
or it can, in its own way, arrange to call the stacked module's hooks when
it sees fit. So stacking a module requires that the author of the
first-loaded module explicitly thought about and coded support for that
mode of operation. Since that support must be added to a large number of
security hooks, most security module authors conclude that they have better
things to do with their time.
There is also the little matter of that void * security
pointer in all those structures. If modules are to be stacked, they must
come up with some way of sharing that single pointer without creating
chaos.
Serge Hallyn has been trying to address the stacking problem for some time;
his latest attempt was recently posted to
linux-kernel with a request for comments. He certainly got a few of those.
The patch supports stacking security modules by separating them from each
other to the greatest extent possible. The existing security hooks are all
redirected to a set of "stacker" hooks; each one calls the associated hook
provided by each stacked module, and returns a failure code if any of the
modules decides to block the operation. The various void *
pointers are each replaced by a static array, dimensioned to the maximum
configured number of security modules (four by default). Each loaded
module is given an index into the array, and is expected to work with its
entry only. Thus,
all security modules must be changed to work properly in the stacking
mode.
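The stacking scheme can be modeled in a few lines of user-space code. This is a sketch, not the patch: the Stacker, Inode, Auditor, and ReadOnly classes are hypothetical, the permission bit values are illustrative, and stopping at the first failing module is an assumption (the text says only that a failure code is returned if any module blocks the operation). The bounds check in register() is likewise an addition of this sketch.

```python
import errno

MAX_MODULES = 4   # default limit described in the text
MAY_WRITE = 2     # illustrative permission bits
MAY_READ = 4


class Inode:
    """Stand-in for a kernel object carrying one security slot per module."""

    def __init__(self):
        self.security = [None] * MAX_MODULES


class Stacker:
    """Hypothetical stacker: fans each hook out to every registered module."""

    def __init__(self):
        self.modules = []

    def register(self, module):
        """Give the module its slot index; refuse if the array is full."""
        if len(self.modules) >= MAX_MODULES:
            raise RuntimeError("no free security slot")
        self.modules.append(module)
        return len(self.modules) - 1

    def inode_permission(self, inode, mask):
        """Call each module's hook; the first nonzero result wins (assumed)."""
        for index, module in enumerate(self.modules):
            err = module.inode_permission(inode, mask, index)
            if err:
                return err
        return 0


class Auditor:
    """Example module: records the last mask seen, in its own slot only."""

    def inode_permission(self, inode, mask, index):
        inode.security[index] = mask
        return 0


class ReadOnly:
    """Example module: blocks any write access."""

    def inode_permission(self, inode, mask, index):
        return -errno.EACCES if mask & MAY_WRITE else 0
```

Neither example module knows the other exists; each sees only its own index, which is the isolation the patch is aiming for.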
The code itself has drawn a few complaints; not everybody is convinced by
how the locking works, for example. Adding static arrays to
heavily-used kernel data structures (such as files and inodes) will
significantly increase kernel memory usage. Your editor, in his reading of
the patch, can find no code which prevents loading more than the configured
maximum number of modules and corrupting all of those structures. And so
on.
The real issue of contention, however, is whether security module stacking
makes any sense in the first place. Stacked modules operate without any
awareness of each other, but could interact to produce surprising results.
In the security world, surprising results tend not to be welcome. The
right approach, as expressed by James
Morris (and others), is to load SELinux and let it handle the loading
of other security policies. SELinux was designed to do this, and it should
be able to handle module interactions in a more predictable way. Whether
other developers are willing to accept SELinux as the One True Base
Security Module remains to be seen; it seems more likely than getting blind
security module stacking into the kernel, however.
The expanded device number type in the 2.6 kernel makes it possible, at the
lowest level, to support vast numbers of partitions on every block device
in the system. Unfortunately, the Linux block drivers have not caught up
with this change. SCSI, in particular, is still limited to 15 partitions
per device. There are a few reasons for this lag, but the largest is
simple compatibility: there is no easy way to incorporate support for more
partitions without breaking the existing device numbering scheme. The
block layer assumes that partitions have consecutive minor numbers, so
supporting more partitions means increasing the portion of the minor number
which is dedicated to the partition number. But changing the
interpretation of minor numbers in this way would break existing systems,
and that is something the kernel developers are reluctant to do.
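The arithmetic behind the compatibility problem can be made concrete. The sketch below assumes that the partition number occupies the low-order bits of the minor number and the disk index the bits above it, which is consistent with the SCSI limit of 15 partitions (four bits, with partition 0 standing for the whole disk). The function names are hypothetical.

```python
def max_partitions(partition_bits):
    """Partitions per disk when partition 0 denotes the whole device."""
    return (1 << partition_bits) - 1


def minor_for(disk_index, partition, partition_bits):
    """Minor number under the assumed layout: disk index above partition bits."""
    if not 0 <= partition <= max_partitions(partition_bits):
        raise ValueError("partition number does not fit in the minor number")
    return (disk_index << partition_bits) | partition
```

With four partition bits, the first partition on the second disk gets minor 17; widening the field to seven bits would move the same partition to minor 129, so every existing device node past the first disk would point at the wrong place.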
Carl-Daniel Hailfinger has recently posted an
interesting solution to the partition limit: partitioned loopback
devices. A loopback device is a kernel-implemented virtual block device
which is backed by something real - usually a disk partition or a file
on a disk somewhere. Common uses for loopback devices include mounting
regular files as filesystems or the creation of encrypted filesystems
(though the device mapper is the preferred means for the latter application
in 2.6). Loopback devices do not support partitions in their own right;
they simply provide block-level access to the backing store as a single
partition.
Carl-Daniel noticed, however, that adding partition support to loopback
devices would be a relatively straightforward thing to do. In 2.6,
partition handling is (finally) part of the block layer; all that is really
required to support partitions in the loopback driver is to tell the block
layer that those partitions exist. So, with a small patch, each loopback
device can have up to 127 partitions. The bulk of the patch, in fact, is
there to ensure continued compatibility for users of non-partitioned
loopback devices.
This capability is interesting because it is a simple matter of one
losetup command to create a loopback interface to a real disk
drive. Thus, by using loopback devices in this mode, system administrators can
get around the partition limits enforced by the real hardware drivers and
divide their disks into lots of tiny little pieces. There is some small
overhead associated with using the loopback device, but, for users
in need of more partitions, it may well be a price worth
paying.
Patches and updates
Kernel trees
- Andrew Morton: 2.6.10-rc1-mm3. Includes 4-level page tables.
(November 5, 2004)
Page editor: Jonathan Corbet