
Kernel development

Kernel release status

The current 2.6 kernel is 2.6.8.1. Linus announced the availability of the 2.6.8 allegedly stable kernel on August 13. Unfortunately, it turned out to be a true "Friday the 13th" release with a fatal bug in the NFS code, so 2.6.8.1 was rushed out to fix it. This is the first time that the kernel has used a four-entry version number. Changes since -rc4 include the "Khazad" crypto algorithm, some added permissions checking on raw SCSI commands from user space (see below), and the removal of the fcntl() file operations method. For those just tuning in, changes from 2.6.7 include snapshot and mirror support in the device mapper, unbelievable numbers of "sparse" annotations, a bunch of read-copy-update performance improvements, 64-bit SuperH support, some security fixes, a reworked symbolic link lookup mechanism (which will eventually enable raising the maximum link depth), and lots of fixes. The long-format changelog has the details; the 2.6.8.1 changelog is also out there for the curious.

No patches have been added to Linus's BitKeeper repository since the 2.6.8.1 release.

The current prepatch from Andrew Morton is 2.6.8.1-mm1. Recent changes to -mm include kprobes ("Generally we prefer to not merge infrastructure into the kernel unless it has in-kernel users. kprobes is exceptional, in that its applications are all custom-written to solve a particular problem."), the removal of the single-array scheduler patch, a waitid() system call implementation, and lots of fixes.

The current 2.4 prepatch is 2.4.28-pre1, which was released by Marcelo on August 15. Additions include a big serial ATA update, the Khazad crypto algorithm, some networking updates, and a handful of fixes.


Kernel development news

The end of the fcntl() method

Some kernel interfaces last longer than others. The fcntl() method is one of the others. It was added to the file_operations structure in 2.6.6 to give low-level filesystems and device drivers an opportunity to look at the command being executed from an fcntl() system call and, possibly, do something different. The immediate motivation was allowing the NFS code to disallow the combination of the O_APPEND and O_DIRECT flags, since those two modes cannot work together in that filesystem. Since then, the CIFS filesystem has also made use of it to better handle the F_NOTIFY command by getting directory notifications from the remote server.

In 2.6.8, that operation is gone again. The thinking is that the file_operations structure did not really need another general-purpose, multiplexed operation like fcntl(). So the method was replaced with two new, carefully-focused methods. The first is:

    int (*check_flags)(int flags);

This operation, if present, will be called in response to an fcntl(F_SETFL,...) system call. It can look at the flags passed in from user space and ensure that they make sense for the device or filesystem in question.
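As a rough illustration of how such a hook might be used, here is a user-space sketch of an NFS-style check that refuses the O_APPEND/O_DIRECT combination. Everything prefixed with sketch_ is hypothetical (including the simplified structure and the model of the F_SETFL path); it is not the code that was merged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* asks glibc to expose O_DIRECT */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>

#ifndef O_DIRECT
#define O_DIRECT 040000         /* fallback for illustration (x86 Linux value) */
#endif

/* Simplified stand-in for file_operations holding only the new hook. */
struct sketch_file_operations {
    int (*check_flags)(int flags);
};

/* Hypothetical NFS-style check: O_APPEND and O_DIRECT cannot be combined. */
static int sketch_nfs_check_flags(int flags)
{
    if ((flags & (O_APPEND | O_DIRECT)) == (O_APPEND | O_DIRECT))
        return -EINVAL;
    return 0;
}

/* Hypothetical model of the F_SETFL path: consult the hook, if present. */
static int sketch_setfl(const struct sketch_file_operations *fops, int flags)
{
    if (fops != NULL && fops->check_flags != NULL) {
        int err = fops->check_flags(flags);
        if (err)
            return err;
    }
    return 0;   /* real code would go on to update the file's flags */
}

static const struct sketch_file_operations sketch_nfs_fops = {
    .check_flags = sketch_nfs_check_flags,
};
```

In this sketch, a filesystem that leaves check_flags unset accepts any flag combination, which matches the "if present" wording above.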

The other new operation is:

    int (*dir_notify)(struct file *filp, unsigned long arg);

This is the new method used by CIFS to handle F_NOTIFY requests. All other fcntl() operations are handled in the core VFS code, as usual.

The patch as merged by Linus fixed the NFS and CIFS code to use the new methods. Unfortunately, nobody tested the NFS changes before the patch was merged, and this change went in just before the final 2.6.8 release came out. The result was an NFS implementation which crashed the kernel, and the need for a quick 2.6.8.1 release.


2.6.8 and CD recording

By far the loudest chorus of complaints about the 2.6.8.1 kernel comes from users who have found that they can no longer burn CDs. In most cases, the problem can be worked around by running the recording program from a root shell (setuid is not sufficient), but that is an unsatisfying alternative for many. Why, ask inquiring minds, did CD recording have to break with the new kernel?

It's all a matter of trying to get the permissions right. Burning a CD requires sending a number of special-purpose SCSI commands to the drive, so the operation is performed outside of the regular I/O paths. But once you can send arbitrary commands, you can do more than write CDs. In pushing for changes, Alan Cox put it this way:

With the current code I can destroy all your hard disks given read access to the drive. With checks on writable I can destroy all your hard disks/cdroms as appropriate with write access. Destroy here means "dead, defunct, pushing up the daisies, go order a new one kind of dead".

Seeing this outcome as undesirable, Linus threw in a patch shortly before releasing 2.6.8. This patch creates an array of known SCSI commands, associating each with "safe for read" and "safe for write" flags. Those flags are tested when a process attempts to execute the given command. If the device has been opened for read access, the set of allowed commands is relatively small: read, request sense, play CD, etc. A process with write access can execute more commands, but not the whole set. Any command not explicitly flagged as safe for the given open mode is restricted to processes with the CAP_SYS_RAWIO capability - root, for all practical purposes.
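The filtering logic described here can be sketched in a few lines of user-space C. The table contents, the names, and the function signature below are illustrative assumptions, not the list from the actual patch; only the opcode values are standard SCSI command codes.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SAFE_FOR_READ  0x01
#define SKETCH_SAFE_FOR_WRITE 0x02

/* Indexed by SCSI opcode; entries not listed stay 0 (privileged only). */
static const uint8_t sketch_cmd_flags[256] = {
    [0x00] = SKETCH_SAFE_FOR_READ,   /* TEST UNIT READY */
    [0x03] = SKETCH_SAFE_FOR_READ,   /* REQUEST SENSE */
    [0x28] = SKETCH_SAFE_FOR_READ,   /* READ(10) */
    [0x45] = SKETCH_SAFE_FOR_READ,   /* PLAY AUDIO(10) */
    [0x2a] = SKETCH_SAFE_FOR_WRITE,  /* WRITE(10) */
};

/*
 * Decide whether a command may be sent. has_rawio stands in for a
 * CAP_SYS_RAWIO check on the calling process.
 */
static bool sketch_command_allowed(uint8_t opcode, bool opened_for_write,
                                   bool has_rawio)
{
    uint8_t flags = sketch_cmd_flags[opcode];

    if (flags & SKETCH_SAFE_FOR_READ)
        return true;                     /* fine for any open mode */
    if ((flags & SKETCH_SAFE_FOR_WRITE) && opened_for_write)
        return true;                     /* needs write access */
    return has_rawio;                    /* everything else: privileged */
}
```

With a table like this one, a command such as BLANK (opcode 0xa1) is refused for an unprivileged process even when the device is open for writing, simply because nobody listed it.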

This patch broke CD burning in multiple ways. Users of growisofs were burned (so to speak) because that utility opens the device for read access. That should never have worked, but did until now; fixing that problem will require a patch to the application. Beyond that, however, is the simple fact that numerous SCSI commands needed for CD burning were omitted from the "safe for write" list. These range from locking the door to "send OPC", "blank", and many others. Enabling CD writing from an unprivileged process with write access to the drive will require adding several commands to the list.

Unfortunately, expanding the list in that manner can bring back the original problem. Many commands which are safe to execute in one context can destroy data, firmware, or hardware in other contexts. And it can be very hard for the kernel to tell the difference between the two. There has been talk of expanding the checking framework to better understand the target device's operating modes and, perhaps, giving high- or low-level drivers a say in the decision. Down that road lies complexity, however, and it would be hard to reach a point where the developers could declare victory and call the problem solved. It may well be that, despite other faults in his reasoning on CD recording, Jörg Schilling got it right when he suggested that the most secure mode of operation is to simply restrict device access and run the CD recording application in a setuid mode.


Power management: looking for a direction

Power management remains one of the unfinished jobs from the 2.5 development series. Many of the pieces are in place, including the whole device model infrastructure, but the kernel still lacks a comprehensive, working power management subsystem. There are signs that things are starting to happen, but it seems that the developers still lack a clear idea of how they want to go forward.

Back on August 9, Patrick Mochel posted a patch aimed at improving the power management subsystem. It brought significant changes to the device model, including:

  • Two power management methods would be added to the class subsystem. Until this point, classes had not been part of the power management code at all; they are, instead, a way of exporting device information in a functional organization. The rationale behind putting power management functions in classes was that the higher-level code would better understand how to "quiesce" a device in preparation for a power state change.

  • Three new power management methods would be added to the device model representation of a bus (struct bus_type). These were pm_save() (save state prior to a state transition), pm_restore() (restore state afterward), and pm_power() (perform an actual state change). These methods would replace the current suspend() and resume() bus methods, and the equivalent methods associated with struct device_driver. The idea is to move all power management tasks firmly into the bus-level code, and to let that code pass things on to low-level drivers as appropriate.

  • Each device would get two new arrays. One of these (pm_supports) lists all of the power management states supported by the device, in that particular device's (usually bus-specific) terms. The second array (pm_system) is a simple mapping from the power states understood system-wide into the equivalent device states. These states are described by the new pm_state structure, and sysfs interfaces exist to query the supported states and to transition between them.
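To make the shape of this proposal a little more concrete, here is a user-space sketch of how the pieces listed above might fit together. Only the names pm_save, pm_restore, pm_power, pm_supports, pm_system, and pm_state come from the patch description; the method signatures, the fields of pm_state, the system state names, and everything prefixed with sketch_ are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical layout: the patch description does not list the fields. */
struct pm_state {
    const char *name;
};

/* Hypothetical system-wide states, for illustration only. */
enum sketch_system_state {
    SKETCH_SYS_ON,
    SKETCH_SYS_STANDBY,
    SKETCH_SYS_SUSPEND,
    SKETCH_SYS_COUNT
};

struct sketch_device {
    const struct pm_state *pm_supports;   /* states in the device's terms */
    size_t pm_supports_len;
    const struct pm_state *pm_system[SKETCH_SYS_COUNT]; /* system -> device */
    const struct pm_state *cur_state;     /* bookkeeping for this sketch */
    int state_saved;
};

/* The three proposed bus methods; these signatures are assumptions. */
struct sketch_bus_type {
    int (*pm_save)(struct sketch_device *dev, const struct pm_state *next);
    int (*pm_restore)(struct sketch_device *dev, const struct pm_state *prev);
    int (*pm_power)(struct sketch_device *dev, const struct pm_state *next);
};

/* Map a system-wide state to the device's own state and apply it. */
static int sketch_enter(const struct sketch_bus_type *bus,
                        struct sketch_device *dev,
                        enum sketch_system_state sys)
{
    const struct pm_state *next = dev->pm_system[sys];
    const struct pm_state *prev = dev->cur_state;
    int err;

    if (next == NULL)
        return -1;                        /* no mapping for this state */
    if (sys == SKETCH_SYS_ON) {           /* waking: power up, then restore */
        err = bus->pm_power(dev, next);
        return err ? err : bus->pm_restore(dev, prev);
    }
    err = bus->pm_save(dev, next);        /* sleeping: save, then power down */
    return err ? err : bus->pm_power(dev, next);
}

/* Toy bus callbacks that only record what happened. */
static int sketch_save(struct sketch_device *dev, const struct pm_state *next)
{
    (void)next;
    dev->state_saved = 1;
    return 0;
}

static int sketch_restore(struct sketch_device *dev, const struct pm_state *prev)
{
    (void)prev;
    dev->state_saved = 0;
    return 0;
}

static int sketch_power(struct sketch_device *dev, const struct pm_state *next)
{
    dev->cur_state = next;
    return 0;
}

/* A PCI-flavored example device: D0/D1/D3 as its own state names. */
static const struct pm_state sketch_pci_states[] = { { "D0" }, { "D1" }, { "D3" } };
static const struct sketch_bus_type sketch_bus = {
    sketch_save, sketch_restore, sketch_power
};
static struct sketch_device sketch_dev = {
    .pm_supports = sketch_pci_states,
    .pm_supports_len = 3,
    .pm_system = { &sketch_pci_states[0], &sketch_pci_states[1],
                   &sketch_pci_states[2] },
    .cur_state = &sketch_pci_states[0],
};
```

The point of the per-device pm_system table in this sketch is that the core code never needs to know what "D3" means; it asks only for the device state that corresponds to the system-wide target.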

The resulting discussion implied a lot of changes to this patch; among other things, the idea of using the class layer to quiesce devices was controversial. An updated version of the patch has not been posted, however.

Pavel Machek, meanwhile, has been trying to address a much smaller piece of the problem: confusion over what the power management states really mean. The power management code itself uses a set of states roughly related to those defined in the ACPI specification, but other parts of the system (PCI drivers, for example) have a different set of states. The current power management methods take a u32 state value, and it is far from clear what kind of state is intended.

Pavel's patch tries to address this problem by creating a new enum type called system_state. The bus- and driver-level power management methods are modified to accept a parameter of this type, so that it is clear that (1) the power management core's state values are being used, and (2) the parameter describes the state to which the entire system is changing. It clears up a core ambiguity without otherwise changing how things work.
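The difference is easiest to see in the method prototypes. The sketch below is an assumption-laden illustration: the enum members, the typedef names, and the PCI-style translation function are all hypothetical, and only the type name system_state comes from the patch.

```c
/* Hypothetical members: only the type name is taken from the patch. */
enum system_state {
    SKETCH_SYSTEM_ON,
    SKETCH_SYSTEM_STANDBY,
    SKETCH_SYSTEM_SUSPEND
};

struct sketch_device;   /* opaque for the purposes of this sketch */

/* Before: is "3" a system-wide state or a bus-specific one like PCI D3? */
typedef int (*sketch_suspend_old_fn)(struct sketch_device *dev,
                                     unsigned int state);

/* After: the prototype says the value is a system-wide target state. */
typedef int (*sketch_suspend_new_fn)(struct sketch_device *dev,
                                     enum system_state state);

/* A hypothetical PCI-style driver picks its own D-state from that target. */
static int sketch_pci_power_level(enum system_state state)
{
    switch (state) {
    case SKETCH_SYSTEM_ON:
        return 0;       /* D0 */
    case SKETCH_SYSTEM_STANDBY:
        return 1;       /* D1 */
    case SKETCH_SYSTEM_SUSPEND:
        return 3;       /* D3 */
    }
    return -1;          /* unknown system state */
}
```

C will still convert plain integers to an enum type without complaint, so the gain here is mostly in what the prototype documents rather than in compiler enforcement.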

Even this change is controversial, however. The largest concern is that drivers are eventually expected to need more information than just the target system state. So, it is suggested, the type of the parameter should be a structure pointer rather than a simple scalar value. But nobody has really figured out what should go into the structure yet.

Getting it right the first time matters in this case. It is generally accepted that fixing power management will require a driver API change, and that, potentially, all drivers in the kernel (and out of tree as well) will have to be changed at once. Developers are resigned to this change - but they would really rather only do it once. So, says Patrick, it's better to wait:

Why be hasty? We need to do it right and do it once. If that means a couple of more weeks and several more emails, than so be it. Otherwise, we'll be stuck with a sub-par solution for who knows how long.

What this means is that the discussion is likely to continue for a while - and that an upgraded power management system will not be ready until 2.6.10, at best. Linux users, who have waited a long time for better power management, can probably manage to be patient for a little while yet.


Update from the latency front

Efforts to track down and eliminate sources of latency in the 2.6 kernel continue. It seems, however, that most of the low-hanging fruit has been found; with the current iteration of the voluntary preemption patch, the remaining problems are rare and relatively hard to track down. So Ingo Molnar built himself a new tool to help with those harder cases.

Ingo's problem with the previous preempt timing patch was that, while it showed where a lengthy latency took place, it yielded little information about what was happening during the high-latency event. So he adapted the profiling mechanism to bring a little light to the situation. With the latency tracing option turned on, a little tracing function gets called as part of every kernel function call. This trace code stores the time of the call into a large (4000-entry) per-CPU array. If the kernel avoids scheduling for too long, that array of function call information gets copied into a static array which is made available via /proc/latency.
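The record-then-snapshot idea can be modeled in user space as follows. This is a sketch under several assumptions: all names are hypothetical, a single array stands in for the per-CPU arrays, timestamps are passed in by the caller, and calls beyond the 4000th are simply dropped (the behavior of the real patch when the array fills is not described here).

```c
#include <stddef.h>

#define SKETCH_TRACE_ENTRIES 4000

struct sketch_trace_entry {
    unsigned long time_us;   /* when the call happened */
    const char *fn;          /* function being called */
    const char *caller;      /* who called it */
};

struct sketch_trace {
    struct sketch_trace_entry entries[SKETCH_TRACE_ENTRIES];
    size_t count;
};

static struct sketch_trace sketch_cpu_trace;    /* one CPU's live array */
static struct sketch_trace sketch_saved_trace;  /* copy exported to readers */

/* Start of a non-preemptible section: begin a fresh trace. */
static void sketch_trace_begin(void)
{
    sketch_cpu_trace.count = 0;
}

/* Called on every function entry; extra calls are dropped when full. */
static void sketch_trace_call(unsigned long now_us, const char *fn,
                              const char *caller)
{
    struct sketch_trace *t = &sketch_cpu_trace;

    if (t->count < SKETCH_TRACE_ENTRIES) {
        t->entries[t->count].time_us = now_us;
        t->entries[t->count].fn = fn;
        t->entries[t->count].caller = caller;
        t->count++;
    }
}

/*
 * End of the section: if it ran longer than threshold_us, keep a copy
 * of the whole trace. Returns 1 if a snapshot was taken, 0 otherwise.
 */
static int sketch_trace_end(unsigned long threshold_us)
{
    const struct sketch_trace *t = &sketch_cpu_trace;
    unsigned long latency;

    if (t->count == 0)
        return 0;
    latency = t->entries[t->count - 1].time_us - t->entries[0].time_us;
    if (latency <= threshold_us)
        return 0;
    sketch_saved_trace = *t;   /* structure copy of the full array */
    return 1;
}
```

Copying the whole array only when the threshold is exceeded keeps the common path to a single store per call, at the price of a larger copy on the rare slow path.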

Ingo included some example output with his patch:

  preemption latency trace v1.0
  -----------------------------
   latency: 121 us, entries: 1032 (1032)
   process: default.hotplug/1470, uid: 0
   nice: -10, policy: 0, rt_priority: 0
  =======>
   0.000ms (+0.000ms): page_address (kmap_high)
   0.000ms (+0.000ms): page_slot (page_address)
   0.000ms (+0.000ms): flush_all_zero_pkmaps (kmap_high)
   0.000ms (+0.000ms): set_page_address (flush_all_zero_pkmaps)
  [...]
   0.118ms (+0.000ms): page_slot (set_page_address)
   0.118ms (+0.000ms): check_preempt_timing (sub_preempt_count)

The output shows the function call, and, in parentheses, the caller of each function. In this case, the output identifies flush_all_zero_pkmaps() as the real villain.

Other changes to this patch include making hardware and software interrupts (which have been redirected into kernel threads) preemptible by default ("I reviewed a number of softirq users and it appears to be safe"), and, of course, the breaking up of more code which holds locks for too long.



Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds