LWN.net

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.13-rc5, which was released by Linus on August 1. This prepatch contains a great many fixes and the reversion of a couple of troublesome patches (see below). The long-format changelog has the details.

2.6.13-rc4 was announced on July 28. This prepatch is large, containing a vast number of fixes. There's also a SCSI update, an ALSA update, an NTFS update, a reworking of the shutdown/reboot code, and more. See the long-format changelog for the details.

Linus's git repository contains a very small number of fixes added since -rc5.

The current -mm tree is 2.6.13-rc4-mm1. Recent changes to -mm include some cleanups to the i386 code (in particular moving inline assembly code into wrapper functions), some scheduler tweaks, the page fault scalability patches, and the dropping of the CKRM patches.

Kernel development news

A PCMCIA subsystem change

Russell King recently sent out a heads-up regarding a PCMCIA subsystem change which will affect some users. In 2.6.13, if a PCMCIA driver is linked directly into the kernel, its devices will be recognized and bound at boot time. That means that no hotplug events will be generated for those devices. Since many systems use the hotplug subsystem to do things like configuring network interfaces, this change could lead to broken systems.

There are also concerns about the naming of disk devices; the presence or absence of a PCMCIA device could cause the names of other disks on the system to change from one boot to the next. Dominik Brodowski has posted a patch which causes PCMCIA IDE devices to be initialized late in the boot process in an attempt to minimize this problem; he also notes that udev is the right way to deal with device naming issues.

Meanwhile, most users will not be affected because most distributors build their PCMCIA drivers as modules. Devices managed by those drivers will be configured after the system is bootstrapped, and will generate hotplug events as usual.

How fast should HZ be?

There has been a debate slowly simmering on linux-kernel over an issue which, to most Linux users, will be invisible. Still, it points at the sorts of tradeoffs which must be made when configuring a system, and thus merits a look.

One of the features which will be included in the 2.6.13 kernel is the ability to configure the frequency of the timer interrupt at kernel build time - at least, on the i386 architecture. This capability, by itself, is not controversial, but the new default value for HZ (250) is. Some developers think it is too low, while others (fewer) think it is too high. It does not appear that there is a single "right" value for this variable.

HZ is the frequency with which the system's timer hardware is programmed to interrupt the kernel. Much of the kernel's internal housekeeping, including process accounting, scheduler time slice accounting, and internal time management, is done in the timer interrupt handler. Thus, the frequency of the timer interrupt affects a number of things; in particular, it puts an upper bound on the resolution of timers used with the kernel. If HZ is 1000 (the i386 default for 2.6 kernels through 2.6.12), then timers will have a best-case resolution of 1ms. If, instead, HZ is 100 (the 2.4 and prior default), that resolution is 10ms.
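The relationship between HZ and timer resolution is simple arithmetic, and a short user-space sketch may make it concrete. The function names here (tick_period_us() and round_up_to_tick_us()) are invented for illustration and are not kernel interfaces; the rounding rule is a simplified model which assumes that a timer can only fire on a tick boundary.

```c
#include <assert.h>

/* Length of one timer tick, in microseconds, for a given HZ value.
 * Plain arithmetic; this is not a kernel interface. */
static long tick_period_us(long hz)
{
    assert(hz > 0);
    return 1000000L / hz;
}

/* Simplified model: round a requested delay (in microseconds) up to a
 * whole number of ticks, on the assumption that a tick-driven timer
 * can only expire on a tick boundary. */
static long round_up_to_tick_us(long delay_us, long hz)
{
    long tick = tick_period_us(hz);
    long ticks = (delay_us + tick - 1) / tick;

    return ticks * tick;
}
```

With these definitions, tick_period_us(1000) is 1000 (1ms), tick_period_us(250) is 4000 (4ms), and tick_period_us(100) is 10000 (10ms). In this model a 1ms delay requested with HZ set to 250 comes back as 4000 microseconds, which is the sort of stretching that applications needing short delays would object to.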

The 250Hz default in 2.6.13 gives a maximum timer resolution of 4ms, which is said to be insufficient for many multimedia-oriented applications (and others which need higher-resolution timers). Such applications, in that environment, will be forced to use busy-waiting to achieve delays which are below the best resolution offered by the system, with the usual effect on CPU utilization. It is not the way the developers of these applications want to go.
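For readers who have not seen it, busy-waiting looks something like the following user-space sketch. It assumes a POSIX system providing clock_gettime() with CLOCK_MONOTONIC; the helper names now_ns() and busy_wait_ns() are made up for this example and do not come from any of the applications under discussion.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() in strict modes */
#endif
#include <assert.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Spin until at least delay_ns nanoseconds have passed, and return the
 * time actually spent. The processor is kept fully busy for the whole
 * delay, which is the cost the article refers to. */
static long long busy_wait_ns(long long delay_ns)
{
    long long start, elapsed;

    assert(delay_ns >= 0);
    start = now_ns();
    do {
        elapsed = now_ns() - start;
    } while (elapsed < delay_ns);
    return elapsed;
}
```

A call like busy_wait_ns(500000) delays for half a millisecond, well under a 4ms tick, but it does so by never giving up the processor; nothing else can run on that CPU during the loop unless the scheduler preempts the process.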

The arguments in favor of reducing HZ center around efficiency. A slower timer interrupt is said to require less power, since the processor (if relatively idle) will wake up less often. Thus, a lower value of HZ is supposed to be better for laptop users. The timer interrupt handler also requires CPU time (and a context switch, and cache space) every time it runs; running that handler less often will clearly reduce its overhead.

Part of the problem has been that nobody had quantified the savings which can be expected from a slower timer interrupt. That changed when Marc Ballarin posted some results from tests he had run. His initial test, involving an idle system, showed that power consumption varied from 7.59W with a 100Hz timer frequency to 8.15W at 1000Hz. A subsequent test with KDE running showed smaller savings, especially when artsd was running.
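The size of the idle-system difference can be worked out from the two figures quoted above. The sketch below does only that arithmetic; the function names are invented for the example, and the KDE test is not modeled because no numbers for it are given here.

```c
#include <assert.h>

/* Absolute savings, in watts, of the lower-power configuration. */
static double power_savings_watts(double high_w, double low_w)
{
    return high_w - low_w;
}

/* The same savings as a percentage of the higher figure. */
static double power_savings_percent(double high_w, double low_w)
{
    assert(high_w > 0.0);
    return 100.0 * (high_w - low_w) / high_w;
}
```

Feeding in 8.15 and 7.59 gives a difference of 0.56W, or a little under 7% of the 1000Hz figure, for an otherwise idle system.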

These results have given ammunition to both sides. Advocates of a low HZ value see the potential for a half-watt savings as worthwhile. Those who want HZ to be high see, instead, a change which makes the system less effective for them while yielding minimal advantages in real-world use.

If there is a consensus on this issue, it would appear to be that the real solution is the dynamic tick patch. By causing timer interrupts to happen only when there is actually something to be done, the kernel can simultaneously support higher-resolution timers and reduce the actual incidence of timer interrupts. No commitments have been made, but there seems to be a widely-held opinion that the dynamic tick patch will be merged once it has been sufficiently cleaned up; some architectures (e.g. ARM) already have it. To that end, Con Kolivas has posted a reworked version of that patch for review.
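The appeal of the dynamic tick idea can be seen in a toy model which counts interrupts under each scheme. This is not derived from the dynamic tick patch itself; it assumes an idealized system where, with dynamic ticks, the only interrupts taken are those needed for pending timer expirations, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Interrupts taken by a fixed periodic tick over duration_ms. */
static long periodic_interrupts(long hz, long duration_ms)
{
    assert(hz > 0 && duration_ms >= 0);
    return hz * duration_ms / 1000;
}

/* Interrupts taken by an idealized dynamic tick: one for each distinct
 * expiry time falling within the window. expiries_ms must be sorted in
 * ascending order; timers expiring at the same time share an interrupt. */
static long dynamic_interrupts(const long *expiries_ms, size_t n,
                               long duration_ms)
{
    long count = 0;
    size_t i;

    assert(expiries_ms || n == 0);
    for (i = 0; i < n; i++) {
        if (expiries_ms[i] > duration_ms)
            break;
        if (i > 0 && expiries_ms[i] == expiries_ms[i - 1])
            continue;
        count++;
    }
    return count;
}
```

Over one second, a 1000Hz periodic tick takes 1000 interrupts regardless of load. In the model, a mostly idle system with timers due at 3ms, 3ms, 250ms, and 999ms takes three, while each of those timers still expires at its requested millisecond.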

If this patch is to be merged soon, it has been asked, why make a change to HZ in the meantime? No answers to that question have been posted. It is true that the lower value of HZ has been in the mainline for some time (and in -mm for even longer) and the number of real complaints has been small. In the absence of problems noted by a wider group of testers, the default value of 250 for HZ seems likely to persist into the final 2.6.13 release. It remains to be seen, however, what value the distributors will pick for the kernels they ship.

A new path to the refrigerator

One of the trickier parts of the software suspend subsystem is the "refrigerator," the code which puts all processes on hold so that the system can be suspended in a quiet state. Last week, this page looked at some issues which come up in choosing which processes to freeze and when to freeze them. Another area of work, however, is the mechanism by which the freezing actually happens.

The in-kernel software suspend code puts processes on hold with the following steps:

  • The process's flags word (stored in the flags field of the task_struct structure) gets the PF_FREEZE bit set.

  • A signal is delivered to the process, causing it to execute briefly.

  • Eventually the process notices the PF_FREEZE flag and calls refrigerator(). That call replaces PF_FREEZE with PF_FROZEN and puts the process into an unrunnable state (TASK_UNINTERRUPTIBLE).

This mechanism does work, but it has a couple of problems. The PF_* flags require some support in the scheduler, which would be nice to avoid. The real issue, though, is that accessing another process's flags requires locking to avoid race conditions. Adding that sort of locking to the software suspend code, however, is hard to do without risking deadlocks. So the suspend code simply sets the PF_FREEZE flag without locking and hopes for the best; this is one of the reasons why software suspend has never really been supported on SMP systems.
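The three steps of the flag-based mechanism can be modeled in user space as shown below. This is a sketch only: the bit values given to PF_FREEZE and PF_FROZEN are placeholders rather than the kernel's definitions, struct model_task stands in for task_struct, and every model_* name is invented for the example.

```c
#include <assert.h>

/* Placeholder bit values; the kernel's own definitions may differ. */
#define PF_FREEZE 0x1UL
#define PF_FROZEN 0x2UL

enum model_state { MODEL_RUNNING, MODEL_UNINTERRUPTIBLE };

/* Stand-in for the few task_struct fields which matter here. */
struct model_task {
    unsigned long flags;
    enum model_state state;
    int signal_pending;
};

/* Steps one and two: set the flag, then poke the process with a signal.
 * Note the absence of locking around the flags update, mirroring the
 * behavior described in the article. */
static void model_request_freeze(struct model_task *t)
{
    assert(t);
    t->flags |= PF_FREEZE;
    t->signal_pending = 1;
}

/* Step three: trade PF_FREEZE for PF_FROZEN and stop being runnable. */
static void model_refrigerator(struct model_task *t)
{
    assert(t);
    t->flags &= ~PF_FREEZE;
    t->flags |= PF_FROZEN;
    t->state = MODEL_UNINTERRUPTIBLE;
}

/* What the process does when the signal causes it to run briefly. */
static void model_task_runs(struct model_task *t)
{
    assert(t);
    t->signal_pending = 0;
    if (t->flags & PF_FREEZE)
        model_refrigerator(t);
}
```

The unlocked read-modify-write in model_request_freeze() is where the trouble lies: if the target process were changing its own flags on another processor at the same moment, one of the two updates could be lost.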

Christoph Lameter has posted a set of patches aimed at fixing these issues. With his patch, the PF_FREEZE and PF_FROZEN flags go away. Instead, struct task_struct gets a new field called todo. This field is a notifier_block pointer; whenever any part of the kernel wants a particular process to run a function in its own context, the kernel can put a notifier request onto todo. At various places in the kernel, the todo list is checked, and any notifier requests which have been put there are executed.

With this mechanism, there is no need for any special process flags. The suspend code simply adds a todo item for each process asking it to freeze itself. It is still necessary to deliver a signal to each process to force it to run in the kernel; otherwise, processes waiting on I/O (or which never call out of user space) would not execute the notifier. The actual "frozen" state is implemented with a completion in Christoph's patch, meaning that unfreezing everybody is a simple matter of a call to complete_all().
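A rough user-space model of the todo idea follows. It is not taken from Christoph's patch: struct model_todo is a hypothetical stand-in for the notifier request, the list handling is the simplest thing that could work, and a plain integer replaces the completion which the patch is said to use for the frozen state.

```c
#include <assert.h>
#include <stddef.h>

struct model_task;

/* Hypothetical stand-in for a notifier request queued on a task. */
struct model_todo {
    void (*func)(struct model_task *task, void *data);
    void *data;
    struct model_todo *next;
};

/* Stand-in for task_struct with its new todo field. */
struct model_task {
    struct model_todo *todo;
    int frozen;
};

/* Any part of the (model) kernel may ask a task to run a function. */
static void model_todo_add(struct model_task *t, struct model_todo *item)
{
    assert(t && item);
    item->next = t->todo;
    t->todo = item;
}

/* Run by the task itself at a check point, so that every request
 * executes in that task's own context. Returns the number run. */
static int model_todo_run(struct model_task *t)
{
    int ran = 0;

    assert(t);
    while (t->todo) {
        struct model_todo *item = t->todo;

        t->todo = item->next;
        item->func(t, item->data);
        ran++;
    }
    return ran;
}

/* The request queued by the suspend code. The patch described above
 * would block on a completion here; the model only records the fact. */
static void model_freeze_self(struct model_task *t, void *data)
{
    (void)data;
    t->frozen = 1;
}
```

In this model the suspend code fills in a struct model_todo pointing at model_freeze_self() and hands it to model_todo_add() for each task; nothing happens until the task itself reaches model_todo_run(), which is why a signal is still needed to push each process into the kernel.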

Christoph thinks that the todo mechanism may be useful beyond software suspend. A number of places in the kernel have to make changes which are best run in the context of a specific process; the code to make those changes happen can, at times, be a little ugly. The todo list is a straightforward way of running code directly in the context of interest, potentially simplifying the kernel in a few places. The patch has not made it into -mm as of this writing, but there does not appear to be any great obstacle to its inclusion there.

ACPI, device interrupts, and suspend states

The 2.6.13-rc5 prepatch brought with it the reversal of a couple of ACPI-related patches. A look at what happened is rewarding in that it shows how hard it can be to get some things right, and how the kernel development model tries to address these issues.

Earlier 2.6.13 prepatches included a change to the core ACPI system. Whenever the system (or a part of it) is being suspended, the modified ACPI code would break the link which routed device interrupts into the processor. This change is part of a new set of rules which expects every device to release its interrupt line on suspend, and to reacquire it on resume. There are a few reasons for wanting to do things this way:

  • In theory, at least, a device could be resumed to find that its interrupt number has changed. People who reconfigure their hardware while the system is suspended (as opposed to being truly shut down) might be seen as actively looking for trouble, but it still might be nice to make things work for them when possible.

  • The interrupt handler for a suspended device should not normally be called, but that can happen in the case of shared interrupts. Any interrupt handler which tries to access a suspended device is likely to run into problems; having every suspend() method release the device's interrupt line can help to avoid this situation.

  • On resume, interrupts for a device whose driver has not yet been resumed may be seen as spurious and shut down. If that interrupt line is shared, however, other devices could be affected. This problem can be avoided by having ACPI shut down the interrupt altogether until individual drivers restore it, but that depends on drivers explicitly reallocating their interrupt lines.

The problem with the ACPI change is that it breaks a large number of drivers and, as a result, breaks suspend on systems where it used to work. The power management hackers seem to see this situation as an unfortunate but necessary step toward getting suspend working reliably on a much broader range of hardware. Having individual drivers release and reacquire their interrupts is also seen as necessary to support runtime power management, meaning the suspending of individual devices in a running system to save power. The ACPI change, it is said, fixes more systems than it breaks, and is thus worthwhile.

Linus disagreed and reverted the patch, saying:

The thing is, we're better off making very very slow progress that is _steady_, than having people who _used_ to have things work for them suddenly break.

So I believe that if we fix two machines and break one machine, we've actually regressed. It doesn't matter that we fixed more than we broke: we _still_ regressed. Because it means that people can't trust the progress we make!

The right solution, according to Linus, is to go ahead and add the free_irq() and request_irq() calls to individual drivers when it makes sense to do so, and when it does not break things for individual users. Meanwhile, however, the ACPI subsystem should still restore the interrupt state on resume so that unmodified drivers do not break. There are some remaining issues with how that is done: it may involve running the ACPI AML interpreter with interrupts disabled, which leads to a number of interesting situations. Benjamin Herrenschmidt also pointed out that it could lead to situations where drivers may not be able to receive interrupts during the resume process.
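The pattern being asked of drivers, releasing the interrupt line in suspend() and reacquiring it in resume(), can be illustrated with a user-space mock of a shared interrupt line. Everything here is hypothetical: model_request_irq() and model_free_irq() merely echo the names of the kernel's request_irq() and free_irq() and do not reproduce their signatures or semantics, and the device structure is invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_HANDLERS 4

typedef void (*model_handler_t)(void *dev_id);

/* A shared interrupt line with a small table of registered handlers. */
struct model_irq_line {
    model_handler_t handler[MODEL_MAX_HANDLERS];
    void *dev_id[MODEL_MAX_HANDLERS];
};

/* Register a handler; returns 0 on success, -1 if the table is full. */
static int model_request_irq(struct model_irq_line *line,
                             model_handler_t h, void *dev_id)
{
    int i;

    assert(line && h);
    for (i = 0; i < MODEL_MAX_HANDLERS; i++) {
        if (!line->handler[i]) {
            line->handler[i] = h;
            line->dev_id[i] = dev_id;
            return 0;
        }
    }
    return -1;
}

/* Remove whatever handler was registered with this dev_id. */
static void model_free_irq(struct model_irq_line *line, void *dev_id)
{
    int i;

    for (i = 0; i < MODEL_MAX_HANDLERS; i++) {
        if (line->handler[i] && line->dev_id[i] == dev_id) {
            line->handler[i] = NULL;
            line->dev_id[i] = NULL;
        }
    }
}

/* An interrupt on a shared line calls every registered handler. */
static void model_raise_irq(struct model_irq_line *line)
{
    int i;

    for (i = 0; i < MODEL_MAX_HANDLERS; i++)
        if (line->handler[i])
            line->handler[i](line->dev_id[i]);
}

struct model_device {
    struct model_irq_line *line;
    int powered;
    int handled;       /* interrupts serviced while powered */
    int bad_accesses;  /* handler calls made while suspended */
};

static void model_dev_handler(void *dev_id)
{
    struct model_device *dev = dev_id;

    if (!dev->powered) {
        dev->bad_accesses++; /* touching suspended hardware */
        return;
    }
    dev->handled++;
}

/* suspend(): give up the interrupt line before powering down. */
static void model_dev_suspend(struct model_device *dev)
{
    model_free_irq(dev->line, dev);
    dev->powered = 0;
}

/* resume(): power up first, then ask for the interrupt line again. */
static int model_dev_resume(struct model_device *dev)
{
    dev->powered = 1;
    return model_request_irq(dev->line, model_dev_handler, dev);
}
```

With two devices sharing the line, suspending one and then raising an interrupt calls only the other device's handler, so the suspended device is never touched. A driver which skipped the model_free_irq() step would see its bad_accesses count go up instead, which is the shared-interrupt hazard from the list above.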

Eventually, one assumes, these details will be worked out. In the meantime, it will be interesting to see if the "revert any patch that breaks somebody's machine" policy holds. If it leads to a more stable experience for Linux users, it seems like it would be a good thing.

Patches and updates

Filesystems and block I/O

  • David Teigland: GFS. (August 2, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds