Brief items
The current 2.6 prepatch is 2.6.13-rc5, which was
released by Linus on
August 1. This prepatch contains a great many fixes and the reversion
of a couple of troublesome patches (see below).
The long-format changelog has
the details.
2.6.13-rc4 was announced on July 28.
This prepatch is large, containing a vast number of fixes. There's also a SCSI
update, an ALSA update, an NTFS update, a reworking of the shutdown/reboot
code, and more. See the long-format
changelog for the details.
Linus's git repository contains a very small number of fixes added since
-rc5.
The current -mm tree is 2.6.13-rc4-mm1. Recent changes
to -mm include some cleanups to the i386 code (in particular moving inline
assembly code into wrapper functions), some scheduler tweaks, the page fault scalability patches, and
the dropping of the CKRM patches.
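As a purely hypothetical illustration of that kind of cleanup (this is not one of the actual -mm patches), inline assembly that might otherwise be open-coded at each call site can be hidden behind a small wrapper function:

    /* Hypothetical example of the kind of cleanup described above, not an
     * actual -mm patch.  The rdtsc instruction is used only because it can
     * run in user space; the point is that callers never see the assembly. */
    #include <stdio.h>

    static inline unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;

        /* Read the processor's time stamp counter (i386/x86_64 only). */
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        printf("TSC: %llu\n", read_tsc());
        return 0;
    }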
Kernel development news
Russell King recently sent out
a heads-up
regarding a PCMCIA subsystem change which will affect some users. In
2.6.13, if a PCMCIA driver is linked directly into the kernel, its devices
will be recognized and bound at boot time. That means that no hotplug
events will be generated for those devices. Since many systems use the
hotplug subsystem to do things like configuring network interfaces, this
change could lead to broken systems.
There are also concerns about the naming of disk devices; the presence or
absence of a PCMCIA device could cause the names of other disks on the
system to change from one boot to the next. Dominik Brodowski has posted
a patch which causes PCMCIA IDE devices to
be initialized late in the boot process in an attempt to minimize this
problem; he also notes that udev is the right way to deal with
device naming issues.
Meanwhile, most users will not be affected because most distributors build
their PCMCIA drivers as modules. Devices managed by those drivers will be
configured after the system is bootstrapped, and will generate hotplug
events as usual.
There has been a debate slowly simmering on linux-kernel over an issue
which, to most Linux users, will be invisible. Still, it points at the
sorts of tradeoffs which must be made when configuring a system, and thus
merits a look.
One of the features which will be included in the 2.6.13 kernel is the
ability to configure the frequency of the timer interrupt at kernel build
time - at least, on the i386 architecture. This capability, by itself, is
not controversial, but the new default value for HZ (250) is. Some
developers think it is too low, while others (fewer) think it is too high.
It does not appear that there is a single "right" value for this variable.
HZ is the frequency with which the system's timer hardware is programmed to
interrupt the kernel. Much of the kernel's internal housekeeping,
including process accounting, scheduler time slice accounting, and internal
time management, is done in the timer interrupt handler. Thus, the
frequency of the timer interrupt affects a number of things; in particular,
it puts an upper bound on the resolution of timers used with the kernel.
If HZ is 1000 (the i386 default for 2.6 kernels through 2.6.12), then
timers will have a best-case resolution of 1ms. If, instead, HZ is 100
(the 2.4 and prior default), that resolution is 10ms.
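The arithmetic behind those numbers is simple; the stand-alone program below (an illustration, not kernel code) shows how a requested delay can only be rounded up to whole ticks of 1000/HZ milliseconds each:

    #include <stdio.h>

    /* Round a delay in milliseconds up to whole timer ticks, as the kernel
     * must (sleeping for less time than requested would be wrong). */
    static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
    {
        return (ms * hz + 999) / 1000;
    }

    int main(void)
    {
        unsigned long hz[] = { 100, 250, 1000 };
        int i;

        for (i = 0; i < 3; i++)
            printf("HZ=%4lu: tick length %g ms, a 1 ms delay becomes %g ms\n",
                   hz[i], 1000.0 / hz[i],
                   ms_to_ticks(1, hz[i]) * (1000.0 / hz[i]));
        return 0;
    }

For the values under discussion, it prints a 10ms, 4ms, and 1ms floor for HZ values of 100, 250, and 1000, respectively.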
The 250Hz default in 2.6.13 gives a best-case timer resolution of 4ms, which
is said to be insufficient for many multimedia-oriented applications (and
others which need higher-resolution timers). Such applications, in that
environment, will be forced to busy-wait to achieve delays below the best
resolution offered by the system, with the usual effect on CPU utilization.
That is not a direction the developers of those applications want to take.
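A hedged user-space sketch of that busy-waiting follows (the helper name is made up): since nanosleep() would round a short request up to the next tick, the code polls the clock instead, keeping the CPU busy for the whole delay:

    #include <time.h>

    /* Busy-wait for roughly 'ns' nanoseconds by polling the clock; the CPU
     * stays fully occupied for the whole delay. */
    static void busy_wait_ns(long long ns)
    {
        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while ((long long)(now.tv_sec - start.tv_sec) * 1000000000LL
                 + (now.tv_nsec - start.tv_nsec) < ns);
    }

    int main(void)
    {
        busy_wait_ns(1000000LL);    /* a 1ms delay, well below a 4ms tick */
        return 0;
    }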
The arguments in favor of reducing HZ center around efficiency. A slower
timer interrupt is said to require less power, since the processor (if
relatively idle) will wake up less often. Thus, a lower value of HZ is
supposed to be better for laptop users. The timer interrupt handler also
requires CPU time (and a context switch, and cache space) every time it
runs; running that handler less often will clearly reduce its overhead.
Part of the problem, however, was that nobody had quantified the savings
which could be expected from a slower timer interrupt. That changed when
Marc Ballarin posted some results from tests he had run. His initial test,
on an idle system, showed that power consumption varied from 7.59W at a
100Hz timer frequency to 8.15W at 1000Hz. A subsequent test with KDE
running showed a smaller savings, especially when artsd was running.
These results have given ammunition to both sides. Advocates of a low HZ
value see the potential for a half-watt savings as worthwhile. Those who
want HZ to be high see, instead, a change which makes the system less
effective for them while yielding minimal advantages in real-world use.
If there is a consensus on this issue, it would appear to be that the real
solution is the dynamic tick
patch. By causing timer interrupts to happen only when there is
actually something to be done, the kernel can simultaneously support
higher-resolution timers and reduce the actual incidence of timer
interrupts. No commitments have been made, but there seems to be a
widely-held opinion that the dynamic tick patch will be merged once it has
been sufficiently cleaned up; some architectures (e.g. ARM) already have
it. To that end, Con Kolivas has posted a reworked version of that patch
for review.
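In rough outline, the dynamic tick idea looks like the sketch below; both helper functions named here are hypothetical stand-ins rather than the actual patch's interface:

    /* Conceptual sketch of the dynamic tick idea; both helpers named here
     * are hypothetical stand-ins, not the actual patch's interface. */
    static void enter_dyntick_idle(void)
    {
        /* When does the earliest pending kernel timer actually expire? */
        unsigned long next = next_pending_timer_expiry();
        unsigned long delta = next - jiffies;

        /* If nothing is due for a while, skip the intervening ticks and
         * program the timer hardware for the real deadline instead. */
        if (delta > 1)
            program_timer_hardware(delta);

        /* The CPU can now sleep until that event (or some other
         * interrupt); on wakeup, jiffies is advanced by the number of
         * ticks that were skipped. */
    }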
If this patch is to be merged soon, it has been asked, why make a change to
HZ in the meantime? No answers to that question have been posted. It is
true that the lower value of HZ has been in the mainline for some time (and
in -mm for even longer) and the number of real complaints has been small.
In the absence of problems noted by a wider group of testers, the default
value of 250 for HZ seems likely to persist into the final 2.6.13 release.
It remains to be seen, however, what value the distributors will pick for
the kernels they ship.
One of the trickier parts of the software suspend subsystem is the
"refrigerator," the code which puts all processes on hold so that the
system can be suspended in a quiet state.
Last week, this page looked at
some issues which come up in choosing which processes to freeze and when to
freeze them. Another area of work, however, is the mechanism by which the
freezing actually happens.
The in-kernel software suspend code puts processes on hold with the
following steps:
- The process's flags field (stored in the
task_struct structure) gets the PF_FREEZE bit set.
- A signal is delivered to the process, causing it to execute briefly.
- Eventually the process notices the PF_FREEZE flag and calls
refrigerator(). That call replaces PF_FREEZE with
PF_FROZEN and puts the process into an unrunnable state
(TASK_UNINTERRUPTIBLE); a simplified sketch of this final step appears below.
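The sketch below is a simplified rendition of that final step; the real refrigerator(), found in kernel/power/process.c, also clears the fake signal and has a slightly different prototype:

    void refrigerator(void)
    {
        long save = current->state;

        /* Acknowledge the request and mark this process as frozen. */
        current->flags &= ~PF_FREEZE;
        current->flags |= PF_FROZEN;

        /* Sleep until the resume code clears PF_FROZEN and wakes us up. */
        while (current->flags & PF_FROZEN) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            schedule();
        }
        current->state = save;
    }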
This mechanism does work, but it has a couple of problems. The
PF_* flags require some support in the scheduler, which would be
nice to avoid. The real issue, though, is that accessing another process's
flags requires locking to avoid race conditions. Adding that sort of
locking to the software suspend code, however, is hard to do without
risking deadlocks. So the suspend code simply sets the PF_FREEZE
flag without locking and hopes for the best; this is one of the reasons why
software suspend has never really been supported on SMP systems.
Christoph Lameter has posted a set of
patches aimed at fixing these issues. With his patch, the
PF_FREEZE and PF_FROZEN flags go away. Instead,
struct task_struct gets a new field called todo.
This field is a notifier_block pointer; whenever any part of the
kernel wants a particular process to run a function in its own context, the
kernel can put a notifier request onto todo. At various places in
the kernel, the todo list is checked, and any notifier requests
which have been put there are executed.
With this mechanism, there is no need for any special process flags. The
suspend code simply adds a todo item for each process asking it to
freeze itself. It is still necessary to deliver a signal to each process
to force it to run in the kernel; otherwise, processes waiting on I/O (or
which never call into the kernel at all) would not execute the notifier. The
actual "frozen" state is implemented with a completion in
Christoph's patch, meaning that unfreezing everybody is a simple matter of
a call to complete_all().
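The sketch below is a guess at how the pieces might fit together; the todo field, the global completion, and the registration step are illustrative assumptions, not Christoph's actual code:

    /* Hypothetical sketch only: the names and structure here are guesses
     * made for illustration, not the actual patch. */
    static DECLARE_COMPLETION(thaw_event);

    static int freeze_notifier(struct notifier_block *nb,
                               unsigned long action, void *data)
    {
        /* Runs in the target process's own context once that process
         * checks its todo list; block here until resume time. */
        wait_for_completion(&thaw_event);
        return NOTIFY_DONE;
    }

    static struct notifier_block freeze_block = {
        .notifier_call = freeze_notifier,
    };

    /*
     * Freezing a process p (locking and the wakeup signal omitted):
     *
     *     notifier_chain_register(&p->todo, &freeze_block);
     *
     * Thawing every frozen process then reduces to a single call:
     *
     *     complete_all(&thaw_event);
     */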
Christoph thinks that the todo mechanism may be useful beyond
software suspend. A number of places in the kernel have to make changes
which are best run in the context of a specific process; the code to make
those changes happen can, at times, be a little ugly. The todo
list is a straightforward way of running code directly in the context of
interest, potentially simplifying the kernel in a few places. The patch
has not made it into -mm as of this writing, but there does not appear to
be any great obstacle to its inclusion there.
The 2.6.13-rc5 prepatch brought with it the reversion of a couple of
ACPI-related patches. A look at what happened is rewarding in that it
shows how hard it can be to get some things right, and how the kernel
development model tries to address these issues.
Earlier 2.6.13 prepatches included a change to the core ACPI system:
whenever the system (or a part of it) was suspended, the modified ACPI
code would break the link which routes device interrupts into the
processor. This change is part of a new set of rules which expects every
driver to release its device's interrupt line on suspend, and to reacquire
it on resume. There are a few reasons for wanting to do things this way:
- In theory, at least, a device could be resumed to find that its
interrupt number has changed. People who reconfigure their hardware
while the system is suspended (as opposed to being truly shut down)
might be seen as actively looking for trouble, but it would still be
nice to make things work for them when possible.
- The interrupt handler for a suspended device should not normally be
called, but that can happen when the interrupt line is shared. Any
interrupt handler which tries to access a suspended device is likely
to run into problems; having every suspend() method release
the device's interrupt line can help to avoid this situation (a sketch
of the check a handler would otherwise need appears after this list).
- On resume, interrupts for a device whose driver has not yet been
resumed may be seen as spurious and shut down. If that interrupt line
is shared, however, other devices could be affected. This problem can
be avoided by having ACPI shut down the interrupt altogether until
individual drivers restore it, but that depends on drivers explicitly
reallocating their interrupt lines.
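To make the shared-interrupt hazard concrete, here is a hedged sketch of the defensive check a driver would otherwise need; the "foo" device structure and its suspended flag are hypothetical, while the handler prototype and return codes are the standard 2.6 ones. Releasing the interrupt line in suspend() makes this check unnecessary.

    static irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
        struct foo_device *foo = dev_id;

        /* The line is shared: if our hardware is suspended, this
         * interrupt cannot be ours and the registers must not be
         * touched. */
        if (foo->suspended)
            return IRQ_NONE;

        /* ... normal interrupt handling ... */
        return IRQ_HANDLED;
    }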
The problem with the ACPI change is that it breaks a large number of
drivers, and, as a result, it breaks suspend on systems where it used to
work. The power management hackers seem to see this situation as
an unfortunate but necessary step toward getting suspend working reliably
on a much broader range of hardware. Having individual drivers release and
reacquire their interrupts is also seen as necessary to support runtime
power management - suspending of individual devices in a running system to
save power. The ACPI change, it is said, fixes more systems than it
breaks, and is thus worthwhile.
Linus disagreed and reverted the patch,
saying:
The thing is, we're better off making very very slow progress that
is _steady_, than having people who _used_ to have things work for them
suddenly break.
So I believe that if we fix two machines and break one machine,
we've actually regressed. It doesn't matter that we fixed more than
we broke: we _still_ regressed. Because it means that people can't
trust the progress we make!
The right solution, according to Linus, is to go ahead and add the
free_irq() and request_irq() calls to individual drivers
when it makes sense to do so, and when it does not break things for
individual users. Meanwhile, however, the ACPI subsystem should still
restore the interrupt state on resume so that unmodified drivers do not
break. There are some remaining issues with how that is done: it may
involve running the ACPI AML interpreter with interrupts disabled, which
leads to a number of interesting situations. Benjamin Herrenschmidt also
pointed out that drivers may then be unable to receive interrupts during
the resume process.
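In driver terms, the approach Linus describes might look like the sketch below for a hypothetical PCI "foo" device; free_irq(), request_irq(), and the pci_driver suspend/resume prototypes are real interfaces, while the foo-specific names (including the foo_interrupt() handler sketched earlier) are made up:

    static int foo_suspend(struct pci_dev *pdev, pm_message_t state)
    {
        struct foo_device *foo = pci_get_drvdata(pdev);

        /* Give up the interrupt line; the handler can no longer be
         * called, even if the line is shared. */
        free_irq(pdev->irq, foo);
        /* ... save device state, power the hardware down ... */
        return 0;
    }

    static int foo_resume(struct pci_dev *pdev)
    {
        struct foo_device *foo = pci_get_drvdata(pdev);

        /* ... power the hardware back up, restore device state ... */

        /* Reacquire the (possibly changed) interrupt line. */
        return request_irq(pdev->irq, foo_interrupt, SA_SHIRQ, "foo", foo);
    }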
Eventually, one assumes, these details will be worked out. In the
meantime, it will be interesting to see if the "revert any patch that breaks
somebody's machine" policy holds. If it leads to a more stable experience
for Linux users, it seems like it would be a good thing.
Patches and updates
Filesystems and block I/O
- David Teigland: GFS (August 2, 2005)