Brief items
The current 2.6 kernel is 2.6.8.1. Linus announced the availability of the
2.6.8 allegedly stable kernel on August 13.
Unfortunately, it turned out to be a true "Friday the 13th" release with a
fatal bug in the NFS code, so 2.6.8.1 was rushed out to fix it. This
is the first time that the kernel has used a four-entry version number.
Changes since -rc4 include the "Khazad" crypto algorithm, some added
permissions checking on raw SCSI commands from user space (see below), and
the removal of the fcntl() file operations method. For those just tuning in,
changes from 2.6.7 include snapshot and mirror support in the
device mapper, unbelievable numbers of "sparse" annotations, a bunch of
read-copy-update performance improvements, 64-bit SuperH support, some
security fixes, a reworked symbolic link lookup mechanism (which will
eventually enable raising the maximum link depth), and lots of fixes. The
long-format changelog has the details; the 2.6.8.1 changelog is also out
there for the curious.
No patches have been added to Linus's BitKeeper repository since the
2.6.8.1 release.
The current prepatch from Andrew Morton is 2.6.8.1-mm1. Recent changes to -mm include kprobes
("Generally we prefer to not merge infrastructure into the kernel
unless it has in-kernel users. kprobes is exceptional, in that its
applications are all custom-written to solve a particular
problem."), the removal of the single-array scheduler patch, a waitid()
system call implementation, and lots of fixes.
The current 2.4 prepatch is 2.4.28-pre1, which was released by Marcelo on August 15.
Additions include a big serial ATA update, the Khazad crypto algorithm,
some networking updates, and a handful of fixes.
Kernel development news
Some kernel interfaces last longer than others. The fcntl() method is one
of the others. It was added to the file_operations structure in 2.6.6 with
the purpose of giving low-level filesystems and device drivers an
opportunity to look at the command being executed from an fcntl() system
call and, possibly, do something different. The immediate motivation was
allowing the NFS code to disallow the combination of the O_APPEND and
O_DIRECT flags, since those two modes cannot work together in that
filesystem. Since then, the CIFS filesystem also has made use of it to
better handle the F_NOTIFY command by getting directory notifications from
the remote server.
In 2.6.8, that operation is gone again. The thinking is that the
file_operations structure did not really need another
general-purpose, multiplexed operation like fcntl(). So the
method was replaced with two new, carefully-focused methods. The first is:
int (*check_flags)(int flags);
This operation, if present, will be called in response to an
fcntl(F_SETFL,...) system call. It can look at the flags passed
in from user space and ensure that they make sense for the device or
filesystem in question.
The other new operation is:
int (*dir_notify)(struct file *filp, unsigned long arg);
This is the new method used by CIFS to handle F_NOTIFY
requests. All other fcntl() operations are handled in the core
VFS code, as usual.
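As a rough illustration of how the two focused methods differ from a multiplexed fcntl() hook, here is a minimal user-space sketch. Only the two method prototypes come from the text above; file_operations_sketch, example_check_flags(), example_fops, and setfl_sketch() are hypothetical names invented for this example, and the real NFS check may well look different.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

/* Fallback only: O_DIRECT is hidden by some libc headers unless _GNU_SOURCE
 * is defined. 040000 is the x86 value; it is an assumption for this sketch. */
#ifndef O_DIRECT
#define O_DIRECT 040000
#endif

struct file;    /* opaque stand-in for the kernel's struct file */

/* Trimmed mock of file_operations holding just the two new methods. */
struct file_operations_sketch {
    int (*check_flags)(int flags);
    int (*dir_notify)(struct file *filp, unsigned long arg);
};

/* Hypothetical NFS-style check: refuse O_APPEND combined with O_DIRECT. */
static int example_check_flags(int flags)
{
    if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
        return -EINVAL;
    return 0;
}

static const struct file_operations_sketch example_fops = {
    .check_flags = example_check_flags,
    .dir_notify  = NULL,    /* this toy filesystem ignores F_NOTIFY */
};

/* What a core F_SETFL path might do: consult the hook only if present. */
static int setfl_sketch(const struct file_operations_sketch *fops, int flags)
{
    if (fops->check_flags) {
        int err = fops->check_flags(flags);
        if (err)
            return err;
    }
    return 0;   /* real code would go on to update the file's flags */
}
```

The point of the narrower interface is visible in setfl_sketch(): the filesystem only gets a veto over the flags, while everything else about F_SETFL stays in common code.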
The patch as merged by Linus fixed the NFS and CIFS code to use the new
methods. Unfortunately, nobody tested the NFS changes before the patch was
merged, and this change went in just before the final 2.6.8 release came
out. The result was an NFS implementation which crashed the kernel, and
the need for a quick 2.6.8.1 release.
By far the loudest chorus of complaints about the 2.6.8.1 kernel comes from
users who have found that they can no longer burn CDs. In most cases, the
problem can be worked around by running the recording program from a root
shell (setuid is not sufficient), but
that is an unsatisfying alternative for many. Why, ask inquiring minds,
did CD recording have to break with the new kernel?
It's all a matter of trying to get the permissions right. Burning a CD
requires sending a number of special-purpose SCSI commands to the drive, so
the operation is performed outside of the regular I/O paths. But once you
can send arbitrary commands, you can do more than write CDs. In pushing
for changes, Alan Cox put it this way:
With the current code I can destroy all your hard disks given read
access to the drive. With checks on writable I can destroy all your
hard disks/cdroms as appropriate with write access. Destroy here
means "dead, defunct, pushing up the daisies, go order a new one
kind of dead".
Seeing this outcome as undesirable, Linus threw in a patch shortly before releasing 2.6.8. This
patch creates an array of known SCSI commands, associating each with "safe
for read" and "safe for write" flags. Those flags are tested when a
process attempts to execute the given command. If the device has been
opened for read access, the set of allowed commands is relatively small:
read, request sense, play CD, etc. A process with write access can execute
more commands, but not the whole set. Any command not explicitly flagged
as safe for the given open mode is restricted to processes with the
CAP_SYS_RAWIO capability - root, for all practical purposes.
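The table-driven check described above might be sketched as follows. This is not the merged patch: the opcode values are the standard SCSI ones, but which commands carry which flag, the names cmd_flags and command_allowed(), and the boolean has_rawio parameter (standing in for a CAP_SYS_RAWIO test) are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SAFE_FOR_READ   0x01
#define SAFE_FOR_WRITE  0x02

/* Standard SCSI opcodes used in this sketch. */
#define TEST_UNIT_READY 0x00
#define REQUEST_SENSE   0x03
#define READ_10         0x28
#define WRITE_10        0x2a
#define PLAY_AUDIO_10   0x45
#define BLANK           0xa1

/* One flag byte per opcode; entries not listed default to 0 (not safe).
 * The choice of entries here is illustrative, not the kernel's list. */
static const uint8_t cmd_flags[256] = {
    [TEST_UNIT_READY] = SAFE_FOR_READ,
    [REQUEST_SENSE]   = SAFE_FOR_READ,
    [READ_10]         = SAFE_FOR_READ,
    [PLAY_AUDIO_10]   = SAFE_FOR_READ,
    [WRITE_10]        = SAFE_FOR_WRITE,
};

/* Decide whether a process may send the given opcode to the device. */
static bool command_allowed(uint8_t opcode, bool opened_for_write,
                            bool has_rawio)
{
    uint8_t flags = cmd_flags[opcode];

    if (flags & SAFE_FOR_READ)
        return true;                    /* fine for any open mode */
    if ((flags & SAFE_FOR_WRITE) && opened_for_write)
        return true;                    /* needs write access */
    return has_rawio;                   /* everything else: privileged only */
}
```

With a table like this one, BLANK is absent from the list, so an unprivileged process is refused even when it has write access to the drive.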
This patch broke CD burning in multiple ways. Users of growisofs were
burned (so to speak) because that utility opens the device for read access.
That should never have worked, but did until now; fixing that problem will
require a patch to the application. Beyond that, however, is the simple
fact that numerous SCSI commands needed for CD burning were omitted from
the "safe for write" list. These vary from locking the door to "send OPC",
"blank", and many others. Enabling CD writing from an unprivileged process
with write access to the drive will require adding several commands to the
list.
Unfortunately, expanding the list in that manner can bring back the
original problem. Many commands which are safe to execute in one context
can destroy data, firmware, or hardware in other contexts. And it can be
very hard for the kernel to tell the difference between the two. There has
been talk of expanding the checking framework to better understand the
target device's operating modes and, perhaps, giving high- or low-level
drivers a say in the decision. Down that road lies complexity, however,
and it would be hard to reach a point where the developers could declare
victory and call the problem solved. It may well be that, despite other
faults in his reasoning on CD recording, Jörg Schilling got
it right when he suggested that the most secure mode of operation is to
simply restrict device access and run the CD recording application in a
setuid mode.
Power management remains one of the unfinished jobs from the 2.5
development series. Many of the pieces are in place, including the whole
device model infrastructure, but the kernel still lacks a comprehensive,
working power management subsystem. There are signs that things are
starting to happen, but it seems that the developers still lack a clear
idea of how they want to go forward.
Back on August 9, Patrick Mochel posted a
patch aimed at improving the power management subsystem. It brought
significant changes to the device model, including:
- Two power management methods would be added to the class subsystem.
Until this point, classes had not been part of the power management
code at all; they are, instead, a way of exporting device information
in a functional organization. The rationale behind putting power
management functions in classes was that the higher-level code would
better understand how to "quiesce" a device in preparation for a
power state change.
- Three new power management methods would be added to the device model
representation of a bus (struct bus_type). These were
pm_save() (save state prior to a state transition),
pm_restore() (restore state afterward), and
pm_power() (perform an actual state change). These methods
would replace the current suspend() and resume() bus
methods, and the equivalent methods associated with struct
device_driver. The idea is to move all power management tasks
firmly into the bus-level code, and to let that code pass things on to
low-level drivers as appropriate.
- Each device would get two new arrays. One of these
(pm_supports) lists all of the power management states
supported by the device, in that particular device's (usually
bus-specific) terms. The second array (pm_system) is a
simple mapping from the power states understood system-wide into the
equivalent device states. These states are described by the new
pm_state structure, and sysfs interfaces exist to query the
supported states and to transition between them.
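To make the proposed layout a little more concrete, here is a user-space sketch of how the bus methods and the two per-device arrays could fit together. Only the names pm_save, pm_restore, pm_power, pm_supports, pm_system, and pm_state come from the description above; the argument lists, the fields of struct pm_state, the SYS_* constants, and every toy_* and *_sketch name are guesses invented for this example.

```c
#include <stddef.h>

struct device_sketch;

/* One device-level power state; both fields are invented for this sketch. */
struct pm_state {
    const char *name;
    int depth;              /* 0 = fully on, larger = deeper sleep */
};

/* The three proposed bus methods; the argument lists are guesses. */
struct bus_type_sketch {
    int (*pm_save)(struct device_sketch *dev, const struct pm_state *to);
    int (*pm_restore)(struct device_sketch *dev);
    int (*pm_power)(struct device_sketch *dev, const struct pm_state *to);
};

/* Hypothetical system-wide states, used to index pm_system. */
enum { SYS_ON, SYS_STANDBY, SYS_SUSPEND, SYS_NR_STATES };

struct device_sketch {
    const struct bus_type_sketch *bus;
    const struct pm_state *const *pm_supports;        /* NULL-terminated */
    const struct pm_state *pm_system[SYS_NR_STATES];  /* system -> device */
    const struct pm_state *current_state;
    int state_saved;
};

/* Core-side helper: map the system state, save, then change power. */
static int system_transition(struct device_sketch *dev, int sys_state)
{
    const struct pm_state *to;
    int err;

    if (sys_state < 0 || sys_state >= SYS_NR_STATES)
        return -1;
    to = dev->pm_system[sys_state];
    if (!to)
        return -1;          /* device has no mapping for this state */
    err = dev->bus->pm_save(dev, to);
    if (err)
        return err;
    return dev->bus->pm_power(dev, to);
}

/* A toy bus with PCI-like D0 and D3 states. */
static const struct pm_state toy_d0 = { "D0", 0 };
static const struct pm_state toy_d3 = { "D3", 3 };
static const struct pm_state *const toy_supports[] = { &toy_d0, &toy_d3, NULL };

static int toy_save(struct device_sketch *dev, const struct pm_state *to)
{
    (void)to;
    dev->state_saved = 1;
    return 0;
}

static int toy_restore(struct device_sketch *dev)
{
    dev->state_saved = 0;
    return 0;
}

static int toy_power(struct device_sketch *dev, const struct pm_state *to)
{
    dev->current_state = to;
    return 0;
}

static const struct bus_type_sketch toy_bus = { toy_save, toy_restore, toy_power };

static void toy_device_init(struct device_sketch *dev)
{
    dev->bus = &toy_bus;
    dev->pm_supports = toy_supports;
    dev->pm_system[SYS_ON] = &toy_d0;
    dev->pm_system[SYS_STANDBY] = NULL;     /* unmapped in this toy */
    dev->pm_system[SYS_SUSPEND] = &toy_d3;
    dev->current_state = &toy_d0;
    dev->state_saved = 0;
}
```

In this arrangement the core never needs to know what "D3" means; it looks up the device's own equivalent of the system state and hands that to the bus code.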
The resulting discussion called for a lot of changes to this patch; among
other things, the idea of using the class layer to quiesce devices was
controversial. An updated version of the patch has not been posted,
however.
Pavel Machek, meanwhile, has been trying to address a much smaller piece of
the problem: confusion over what the power management states really mean.
The power management code itself uses a set of states roughly related to
those defined in the ACPI specification, but other parts of the system (PCI
drivers, for example) have a different set of states. The current power
management methods take a u32 state value, and it is far from
clear what kind of state is intended.
Pavel's patch tries to address this problem
by creating a new enum type called system_state. The
bus- and driver-level power management methods are modified to accept a
parameter of this type, so that it is clear that (1) the power
management core's state values are being used, and (2) the parameter
describes the state to which the entire system is changing. It clears up a
core ambiguity without otherwise changing how things work.
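The ambiguity is perhaps easiest to see in code. In the sketch below, only the type name system_state and the u32 parameter come from the text; the enumerator names, the toy_dev_state type, and toy_target_state() are hypothetical, since the article does not list the real values.

```c
#include <stdint.h>

typedef uint32_t u32;

/* Hypothetical enumerators for the system-wide states. */
enum system_state {
    SKETCH_SYSTEM_ON,
    SKETCH_SYSTEM_STANDBY,
    SKETCH_SYSTEM_SUSPEND_TO_RAM,
    SKETCH_SYSTEM_SUSPEND_TO_DISK,
};

/* Hypothetical PCI-like device states. */
enum toy_dev_state { TOY_D0, TOY_D1, TOY_D2, TOY_D3 };

/*
 * Old style, for comparison:
 *
 *     int toy_suspend(u32 state);
 *
 * Given state == 3, a reader cannot tell whether the caller meant a
 * system-wide state or the device state D3.
 */

/* New style: the parameter type says "system state", and the driver
 * does the translation to its own terms explicitly. */
static enum toy_dev_state toy_target_state(enum system_state state)
{
    switch (state) {
    case SKETCH_SYSTEM_ON:
        return TOY_D0;
    case SKETCH_SYSTEM_STANDBY:
        return TOY_D1;
    default:
        return TOY_D3;      /* both suspend variants power the device down */
    }
}
```

C will still convert silently between an enum and an integer, so the gain here is mostly in documentation and in what checking tools can flag, not in compiler enforcement.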
Even this change is controversial, however. The largest concern is that
drivers are expected, eventually, to need more information than just the
target system state. So, it is suggested, the type of the
parameter should be a structure pointer rather than a simple scalar value.
But nobody has really figured out what should go into the structure yet.
Getting it right the first time matters in this case. It is generally
accepted that fixing power management will require a driver API change, and
that, potentially, all drivers in the kernel (and out of tree as well) will
have to be changed at once. Developers are resigned to this change - but
they would really rather only do it once. So, says Patrick, it's better to wait:
Why be hasty? We need to do it right and do it once. If that means
a couple of more weeks and several more emails, than so be it.
Otherwise, we'll be stuck with a sub-par solution for who knows how
long.
What this means is that the discussion is likely to continue for a while -
and that an upgraded power management system will not be ready until
2.6.10, at best. Linux users, who have waited a long time for better power
management, can probably manage to be patient for a little while yet.
Efforts to track down and eliminate sources of latency in the 2.6 kernel
continue. It seems, however, that most of the low-hanging fruit has been
found; with the current iteration of the voluntary preemption patch, the
remaining problems are rare and relatively hard to track down. So Ingo
Molnar built himself a new tool to help with those harder cases.
Ingo's problem with the previous preempt timing patch was that, while it
showed where a lengthy latency took place, it yielded little information
about what was happening during the high-latency event. So he adapted the
profiling mechanism to bring a little light to the situation. With the
latency tracing option turned on, a little tracing function gets called as part of
every kernel function call. This trace code stores the time of the call
into a large (4000 entries), per-CPU array. If the kernel avoids
scheduling for too long, that array of function call information gets copied into a
static array which is made available via /proc/latency.
Ingo included some example output with his patch:
preemption latency trace v1.0
-----------------------------
latency: 121 us, entries: 1032 (1032)
process: default.hotplug/1470, uid: 0
nice: -10, policy: 0, rt_priority: 0
=======>
0.000ms (+0.000ms): page_address (kmap_high)
0.000ms (+0.000ms): page_slot (page_address)
0.000ms (+0.000ms): flush_all_zero_pkmaps (kmap_high)
0.000ms (+0.000ms): set_page_address (flush_all_zero_pkmaps)
[...]
0.118ms (+0.000ms): page_slot (set_page_address)
0.118ms (+0.000ms): check_preempt_timing (sub_preempt_count)
The output shows the function call, and, in parentheses, the caller of each
function. In this case, the output identifies
flush_all_zero_pkmaps() as the real villain.
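The recording scheme behind that output can be pictured with a small user-space sketch. The 4000-entry size comes from the description above; everything else is an assumption: the structure layouts, the function names, the single cpu0_trace buffer standing in for per-CPU data, and the rule that a trace is kept only when it beats the worst latency seen so far (the text says only "too long").

```c
#include <stddef.h>
#include <string.h>

#define TRACE_ENTRIES 4000

struct trace_entry {
    unsigned long long time_us;     /* when the call happened */
    const char *fn;                 /* function being entered */
    const char *caller;             /* who called it */
};

struct cpu_trace {
    struct trace_entry entries[TRACE_ENTRIES];
    size_t count;
    unsigned long long start_us;    /* start of the non-preemptible section */
};

static struct cpu_trace cpu0_trace;     /* would be one per CPU */
static struct cpu_trace max_trace;      /* snapshot of the worst case */
static unsigned long long max_latency_us;

/* Called when a non-preemptible section begins. */
static void trace_begin(struct cpu_trace *t, unsigned long long now_us)
{
    t->count = 0;
    t->start_us = now_us;
}

/* Called on every function entry; extra entries are dropped when full. */
static void trace_call(struct cpu_trace *t, unsigned long long now_us,
                       const char *fn, const char *caller)
{
    if (t->count < TRACE_ENTRIES) {
        t->entries[t->count].time_us = now_us;
        t->entries[t->count].fn = fn;
        t->entries[t->count].caller = caller;
        t->count++;
    }
}

/* Called when scheduling becomes possible again; keep the worst trace. */
static void trace_end(struct cpu_trace *t, unsigned long long now_us)
{
    unsigned long long latency = now_us - t->start_us;

    if (latency > max_latency_us) {
        max_latency_us = latency;
        memcpy(&max_trace, t, sizeof(*t));
    }
}
```

Copying the whole buffer only on a new worst case keeps the common path cheap: an ordinary function call costs one array store, and the expensive copy happens only for the events worth reading about.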
Other changes to this patch include making hardware and software interrupts
(which have been redirected into kernel threads) preemptible by default ("I
reviewed a number of softirq users and it appears to be safe"), and,
of course, the breaking up of more code which holds locks for too long.
Page editor: Jonathan Corbet