The current 2.6 prepatch is 2.6.22-rc7, released
by Linus on
July 1. "It's hopefully (almost certainly) the last -rc before
the final 2.6.22 release, and we should be in pretty good shape. The flow
of patches has really slowed down and the regression list has shrunk a
lot." This is the last chance to test 2.6.22 and find bugs before
they slip into the final release.
As of this writing, about 60 patches have been merged into the mainline git
repository after -rc7. They are mostly fixes, but there is also the
removal of a large set of private ioctl() functions from the
libertas (OLPC) wireless driver.
The current -mm tree is 2.6.22-rc6-mm1. Anybody wanting
to build and test this tree should certainly read Andrew's notes at the top
of the announcement. Recent changes to -mm include kgdb support for
several architectures, tickless support for the x86_64 architecture, the
ability to force-enable the HPET timer even when the BIOS leaves it
disabled, an updated file POSIX capabilities patch, and Intel IOMMU
support.
Kernel development news
SELinux orders a beer object
AppArmor order a /beer
Hilary says "You are both under 21 you can't"
SELinux orders a shandy object
AppArmor orders a /shandy
SELinux is refused because the shandy mixer opened a beer object and
shandy inherited beer typing
AppArmor gets drunk because /shandy and /beer are clearly different
-- Alan Cox
Len Brown can only be a glutton for punishment; he is, after all, the
maintainer of the Linux ACPI subsystem. That is a difficult position to be in:
ACPI involves getting into the BIOS layer, an area of system software which
is not always known for careful, high-quality work. Supporting ACPI is a
complex task which, among other things, requires the embedding of a
specialized interpreter within the kernel, a hard sell at best. Even with
that background in mind, one must wonder just how much masochism is
required to lead one to deliver three separate talks at the 2007 Ottawa
Linux Symposium. That is just what Len did, however; the end result was a
good view into several aspects of the power management problem.
Getting more from tickless
The first talk (on the tickless kernel) was supposed to be given by Suresh
Siddha, who was unable to attend the event. The dynamic tick patches have been
covered here before. Suresh/Len's talk was not really about how these
patches work, but, instead, about the work which remains to be done to take
full advantage of the tickless design. It seems that the work which has
been done so far is just the beginning.
The problem is that, on a system used by Suresh and company, the average
processor sleep time was still less than 1ms even after the dynamic tick
code was enabled. Given that one of the driving reasons for dynamic tick
was to let the processor sleep for long periods of time - thus saving power
- this is a disappointing result. It turns out that there is a lot which
can be done to improve the situation, though.
Step number one is to address a kernel-space problem: there are a lot of
active kernel timers which tend to spread out over time. As a result,
the kernel wakes up much more often than it would if the timers were
coordinated to expire at the same time whenever possible. As it
happens, many kernel timers do not need great precision; a timer which
fires some number of milliseconds later than scheduled is not a problem.
So, if the kernel can defer some timers to fire at the same time as
others, it can reduce the number of wakeups. The deferrable timers patch does
exactly that; the round_jiffies() function added in 2.6.19 can
also help the kernel line up events. Adding this code brought the average
sleep time up to 20ms, with the system handling 90 interrupts per second.
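The effect of lining up imprecise timers can be illustrated with a small user-space model. This is only a sketch: the timer values are made up, and rounding every expiry up to the next whole second is a simplification chosen for illustration, not the rounding rule the kernel's round_jiffies() actually applies.

```python
import math


def round_up_to_second(expiry_ms: int) -> int:
    """Round a timer expiry (in ms) up to the next whole-second boundary."""
    return math.ceil(expiry_ms / 1000) * 1000


def count_wakeups(expiries_ms) -> int:
    """Each distinct expiry time costs the processor one wakeup."""
    return len(set(expiries_ms))


# Hypothetical timers which do not need precise expiry times.
timers = [130, 410, 560, 870, 1020, 1340, 1650, 1990]
coalesced = [round_up_to_second(t) for t in timers]

print(count_wakeups(timers))      # every timer wakes the CPU separately
print(count_wakeups(coalesced))   # timers now share two expiry times
```

In this model the eight scattered timers collapse onto two expiry times, so the processor can stay asleep between them.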
Next is the problem of hardware timers. On the i386 architecture, the
preferred timer is the local APIC (LAPIC) timer, which is built into the
processor and very fast to program. Unfortunately, putting the processor
into a deep sleep also puts the LAPIC timer to sleep, a situation Len
compared to unplugging one's alarm clock before going to bed. In either
case, oversleeping can be the unwanted result. The programmable interval
timer (PIT) remains awake and is easily used, but it has a maximum event
time of 27ms. If one wants the processor to sleep for longer than that,
another solution must be found. That solution is the high-precision event
timer (HPET), which has a maximum interval of at least three seconds.
Getting access to the HPET can be hard, though; good BIOS support is spotty
at best and the HPET is often disabled. If it can be forced on, however,
the system can go to an average sleep period of about 56ms, handling 32
interrupts per second.
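The 27ms figure can be roughly reproduced with some arithmetic. The PIT is clocked at 1,193,182Hz; the sketch below additionally assumes that the kernel programs at most a 15-bit count (0x7FFF) into it, which is an assumption made here because it matches the number given in the talk.

```python
PIT_HZ = 1_193_182          # input clock of the PC interval timer, in Hz
ASSUMED_MAX_COUNT = 0x7FFF  # assumption: largest count the kernel programs


def max_interval_ms(clock_hz: int, max_count: int) -> float:
    """Longest single timer event, in milliseconds."""
    return max_count / clock_hz * 1000.0


def min_wakeups_per_second(interval_ms: float) -> float:
    """Floor on the wakeup rate when no sleep can outlast one event."""
    return 1000.0 / interval_ms


pit_ms = max_interval_ms(PIT_HZ, ASSUMED_MAX_COUNT)
print(f"PIT maximum interval: {pit_ms:.1f} ms")
print(f"Minimum wakeups/sec:  {min_wakeups_per_second(pit_ms):.1f}")
```

Under these assumptions, a system relying on the PIT can never sleep for longer than about 27ms at a stretch, which is why a timer with a longer maximum interval is needed.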
Even better is to get the HPET out of the "legacy mode" currently used by
Linux. This mode is simple to use, but it requires the rebroadcasting of
timer interrupts on multiprocessor systems. But the HPET can work with
per-CPU channels, eliminating this problem. The result: average sleep time
grows to 74ms.
At this point, the problem moves to user space. Since the release of powertop, there has been a lot of
progress in this area; user-space applications which needlessly cause
frequent wakeups stand out immediately and can be fixed. But, as Len noted,
"user space still sucks."
One gets the sense that Len is a little tired of people complaining
about ACPI in Linux. His response was a talk on "ten ACPI myths" - though
the list had grown to twelve by then.
#1: There is no benefit to enabling ACPI. Len's answer to this had
two parts, the first of which is that, increasingly, there is no
alternative. The older APM interface is deprecated, and, in particular,
Microsoft's Vista has removed APM support altogether. So, soon, there will
be no hardware support for APM at all; it is a dead standard. The MPS
standard (used for discovering processors) is also old and dying. Like it
or not, ACPI is needed to be able to make use of one's hardware.
On the positive side, using ACPI gives better access to hardware features
like software-enabled power, sleep, and lid buttons. Smart battery status
information becomes available, as well as the potential for reduced power
consumption and better battery life. True hotplug and (especially) docking
support also become possible with ACPI.
#2: Suspend-to-disk problems are ACPI's fault. In fact, ACPI is a
very small part of the suspend-to-disk process - everything else is in
other parts of the kernel code. If you have suspend-to-disk problems,
suggests Len, "complain to Pavel [Machek], not me."
#3: If the extra buttons don't work, it's ACPI's fault. The issue
here is that support for "hotkeys" is not actually a part of the ACPI
specification. All of those extra buttons found on laptops are
vendor-specific added features. The reverse-engineered drivers currently
found in the kernel are a "heroic effort," but they should not be
necessary. Vendors should be supplying drivers for their hardware.
#4: Boot problems with ACPI enabled are ACPI's fault. Len allows
that this one might just be true some of the time. But disabling ACPI at
boot-time also disables other hardware features - the IO-APIC in
particular. So any problems associated with those other parts of the
system will be masked by turning off ACPI. It looks like ACPI was the
actual problem, but the truth is more complicated.
#5: ACPI issues are due to sub-standard platform BIOS. It turns out
that there are three general sources of ACPI incompatibilities. Just one
of them is the BIOS violating the ACPI specification; incompatibilities
which don't break Windows will often slip through the testing process. The
firmware developer kit produced by Intel can help in this regard. Another
source of problems is
differing interpretations of the specification, which is a long and complex
document. The Linux ACPI developers have been working to help clarify the
specification when this sort of problem arises. Finally, there can also
simply be bugs in the Linux-specific code.
#6: The Linux community cannot help to improve the ACPI
specification. In fact, the ACPI team has been submitting
improvements, mostly in the form of "specification clarifications." Many
of those have been incorporated and shipped with specification updates.
#7: The ACPI code changes a lot but is not getting better. Intel
has put together a test suite with over 2000 tests; ACPI changes must now
pass that suite before being merged. The number of new bug reports has
been dropping - though, perhaps, more slowly than one might like.
#8: ACPI is slow and bad for high-performance CPU governors. The
ACPI interpreter is not used in performance-critical paths, and, thus,
cannot be slowing things down. ACPI's role is in the setup and
#9: The speedstep-centrino governor is faster than acpi-cpufreq.
The acpi-cpufreq governor has seen considerable improvements, and is now able
to access MSRs in a fast and (more importantly) supportable way. So its
performance is where it should be, and the speedstep-centrino governor is
scheduled for removal.
#10: More CPU idle power states is better. This may be true for any
given processor, but you cannot compare processors on the basis of how many
idle states they provide. All that really matters is how much power you
save when you use those states.
#11: Processor throttling will save energy. The problem here is a
confusion of "power" and "energy." A throttled processor may draw less
power, but it has to run longer to accomplish the same work. So throttling
the processor (while maintaining the same voltage) may have the effect of
increasing energy use rather than reducing it. The better approach is
almost always to run at the fastest clock frequency afforded by the current
voltage level and get the work done quickly; Len characterized this as the
"race to idle."
There are second-order effects to consider; in particular, batteries will
last longer if they are discharged over longer periods of time. A
throttled processor may also run cooler, allowing fans to be turned off.
Throttling may be necessary for temperature regulation. But, from an
energy-savings perspective, these are truly second-order effects.
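The "race to idle" argument comes down to energy being power multiplied by time. The sketch below works through one case; all of the wattages and durations are hypothetical numbers chosen for illustration, and the model assumes that throttling at a constant voltage scales only the dynamic part of the active power.

```python
def energy_joules(duty: float, window_s: float = 4.0, work_s: float = 1.0,
                  static_w: float = 4.0, dynamic_w: float = 16.0,
                  idle_w: float = 1.0) -> float:
    """Energy used over a fixed window to finish a fixed amount of work.

    duty is the throttling level: 1.0 is full speed, 0.5 is half speed.
    Only the dynamic power scales with duty; the static power is paid
    for as long as the processor is awake.
    """
    busy_s = work_s / duty                     # throttling stretches the job
    active_w = static_w + dynamic_w * duty     # lower power while busy
    return active_w * busy_s + idle_w * (window_s - busy_s)


print(energy_joules(1.0))   # run flat out, then idle for the rest
print(energy_joules(0.5))   # throttled: lower power, but awake twice as long
```

With these numbers the throttled run draws less power while busy but consumes more energy overall, because the static power is paid for twice as long.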
#12: I can't contribute to improving ACPI in Linux. Like any other
project, Linux ACPI would love to have more developers. And, failing that,
one can always test kernels and report bugs. There is, in reality, plenty
of opportunity for improving the ACPI code.
Len's final talk moved away from power consumption toward its effects - the
generation of heat, in particular. The creation of excess heat is not a
welcome behavior in any device, but it becomes especially undesirable in
handheld devices. Devices which make the user's hand sweat are less fun to
use; those which get too hot to hold comfortably can be entirely unusable.
So temperature management is important. But the nature of these devices
can make thermal regulation tricky: there's no room for fans in a
Linux-powered cellular phone, and the dissipation of heat can be hard
The ACPI 3.0 specification includes a complicated thermal model. The
device is divided up into zones, and each component has its thermal
contribution to each zone characterized. Implementing this specification
is a complex and difficult task - enough so that the Linux ACPI developers
have no intention of doing it. They will, instead, focus on something
simpler.
That something is the ACPI 2.0 thermal model. It includes thermal zones,
each of which comes with temperature sensors and a set of trip points.
The "critical shutdown" trip point is set somewhere just short of where the
device begins to melt; should things get that warm, the device just needs
to turn itself off as quickly as possible. Various other trip points
will be encountered first; they should bring about increasingly strong
measures for controlling temperature. These can include turning on fans
(if they exist), throttling devices, or suspending the system to disk.
ACPI 2.0 includes an embedded controller which monitors the system's
temperature sensors and sends events to the CPU when something interesting
happens.
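A thermal zone with escalating trip points can be modeled in a few lines. The temperatures and the action names below are hypothetical; they are meant only to show how increasingly strong measures kick in as a zone heats up, not to reflect any real device's ACPI tables.

```python
# Hypothetical trip points for one thermal zone, ordered by temperature.
TRIP_POINTS = [
    (60.0, "turn on fan"),
    (75.0, "throttle devices"),
    (85.0, "suspend to disk"),
    (95.0, "critical shutdown"),
]


def actions_for(temp_c: float) -> list:
    """All measures whose trip point has been reached at this temperature."""
    return [action for threshold, action in TRIP_POINTS if temp_c >= threshold]


def strongest_action(temp_c: float):
    """The most drastic measure currently called for, or None if cool."""
    actions = actions_for(temp_c)
    return actions[-1] if actions else None


for temp in (50.0, 78.0, 99.0):
    print(temp, strongest_action(temp))
```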
The in-progress thermal regulation code just uses the existing critical
shutdown mechanism built into ACPI. There is also support for some of the
passive trip points which bring about CPU throttling. For the
non-processor thermal zones, though, the best thing to do is to let user
space figure out how to respond, so that's what the ACPI code will do.
There will be a netlink interface through which temperature events can be
sent, and a set of sysfs directories for reading sensor values. The sysfs
tree will also include control files which can be used by a user-space
daemon to throttle specific devices in response to temperature events.
In the end, the kernel is really just a conduit, conducting events and
control settings between the components of the device and user space.
There were some questions on whether there will be a standardized set of
sysfs knobs for every device; the answer appears to be "no." Each device
is different, with its own control parameters; it is hard to create any
sort of standard which can describe them all. Beyond that, the target
environment is embedded devices, each of which is unique. It is expected
that each device will have its own special-purpose management daemon
designed especially for it, so there is no real benefit in trying to make
The impression one gets from all these talks is that quite a bit is
happening in the power management area - a part of Linux which, for some
time, has been seen as falling short of what it really needs to be. The
increasing use of Linux in embedded systems can only help in this regard;
there are a number of vendors who have a strong interest in improved
support for intelligent use of power. Given time and continued work, power
management may soon be one of those past problems which is no longer an
issue.
Ingo Molnar's completely fair scheduler
(CFS) patch continues to develop; the current version, as of
this writing, is v18. One
aspect of CFS behavior is seen as a serious shortcoming by many potential
users, however: it only implements fairness between individual processes.
If 50 processes are trying to run at any given time, CFS will carefully
ensure that each gets 2% of the CPU. It could be, however, that one of
those processes is the X server belonging to Alice, while the other 49 are
part of a massive kernel build launched by Karl the opportunistic kernel
hacker, who logged
in over the net to take advantage of some free CPU time. Assuming that
allowing Karl on the system is considered fair at all, it is reasonable to
say that his 49 compiler processes should, as a group, share the processor
with Alice's X server. In other words, X should get 50% of the CPU (if it
needs it) while all of Karl's processes share the other 50%.
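The difference between the two notions of fairness is simple arithmetic. The sketch below computes the CPU share each process would receive under per-process fairness and under per-user fairness; the function names are invented for this illustration and exact fractions are used to keep the numbers clean.

```python
from collections import Counter
from fractions import Fraction


def per_process_shares(owners):
    """Flat fairness: every runnable process gets an equal slice."""
    return [Fraction(1, len(owners)) for _ in owners]


def per_group_shares(owners):
    """Group fairness: users split the CPU, then each user's processes
    split that user's slice equally."""
    counts = Counter(owners)
    group_slice = Fraction(1, len(counts))
    return [group_slice / counts[owner] for owner in owners]


# One X server for Alice, 49 compiler processes for Karl.
owners = ["alice"] + ["karl"] * 49

print(per_process_shares(owners)[0])   # Alice's share with flat fairness
print(per_group_shares(owners)[0])     # Alice's share with group fairness
print(per_group_shares(owners)[1])     # one of Karl's compilers
```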
This type of scheduling is called "group scheduling"; Linux has never
really supported it with any scheduler. It would be nice if a "completely
fair scheduler" to be merged in the future had the potential to be
completely fair in this regard too. Thanks to work by Srivatsa Vaddagiri
and others, things may well happen in just that way.
The first part of Srivatsa's work was merged into v17 of the CFS patch. It
creates the concept of a "scheduling entity" - something to be scheduled,
which might not be a process. This work takes the per-process scheduling
information and packages it up within a sched_entity structure.
In this form, it is essentially a cleanup - it encapsulates the relevant
information (a useful thing to do in its own right) without actually
changing how the CFS scheduler works.
Group scheduling is implemented in a separate set of patches which
are not yet part of the CFS code. These patches turn a scheduling entity into
a hierarchical structure. There can now be scheduling entities which are
not directly associated with processes; instead, they represent a specific
group of processes. Each scheduling entity of this type has its own run
queue within it. All scheduling entities also now have a parent
pointer and a pointer to the run queue into which they should be scheduled.
By default, processes are at the top of the hierarchy, and each is
scheduled independently. A process can be moved underneath another
scheduling entity, though, essentially removing it from the primary run
queue. When that process becomes runnable, it is put on the run queue
associated with its parent scheduling entity.
When the scheduler goes to pick the next task to run, it looks at all of
the top-level scheduling entities and takes the one which is considered
most deserving of the CPU. If that entity is not a process (it's a
higher-level scheduling entity), then the scheduler looks at the run queue
contained within that entity and starts over again. Things continue down
the hierarchy until an actual process is found, at which point it is run.
As the process runs, its runtime statistics are collected as usual, but
they are also propagated up the hierarchy so that its CPU usage is properly
reflected at each level.
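The walk down the hierarchy and the propagation of runtime back up can be sketched as a toy model. This is not the patch's code: "most deserving" is approximated here as "least accumulated runtime," and the Entity class with its fields is invented for the illustration.

```python
class Entity:
    """A schedulable thing: a task, or a group holding its own run queue."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.runtime = 0
        self.queue = []            # stays empty for plain tasks
        if parent is not None:
            parent.queue.append(self)


def pick_next(top_level):
    """Take the most deserving entity, descending until a task is found."""
    entity = min(top_level, key=lambda e: e.runtime)
    while entity.queue:
        entity = min(entity.queue, key=lambda e: e.runtime)
    return entity


def charge(task, ticks):
    """Account CPU time to the task and to every group above it."""
    entity = task
    while entity is not None:
        entity.runtime += ticks
        entity = entity.parent


alice, karl = Entity("alice"), Entity("karl")
xserver = Entity("X", parent=alice)
builds = [Entity(f"cc{i}", parent=karl) for i in range(49)]

for _ in range(100):
    charge(pick_next([alice, karl]), 1)

print(alice.runtime, karl.runtime)           # CPU time per group
print(max(task.runtime for task in builds))  # busiest compiler process
```

In this model the two groups end up with equal CPU time, while Karl's share is spread thinly across his 49 processes.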
So now the system administrator can create one scheduling entity for Alice,
and another for Karl. All of Alice's processes are placed under her
representative scheduling entity; a similar thing happens to all of the
processes in Karl's big kernel build. The CFS scheduler will enforce
fairness between Alice and Karl; once it decides who deserves the CPU, it
will drop down a level and perform fair scheduling of that user's
processes.
The creation of the process hierarchy need not be done on a per-user basis;
processes can be organized in any way that the administrator sees fit. The
grouping could be coarser; for example, on a university machine, all
students could be placed in one group and faculty in another. Or the
hierarchy could be based on the type of process: there could be scheduling
entities representing system daemons, interactive tools, monster cranker
CPU hogs, etc. There is nothing in the patch which limits the ways in
which processes can be grouped.
One remaining question might be: how does the system administrator actually
cause this grouping to happen? The answer is in the second part of the
group scheduling patch, which integrates scheduling entities with the process container mechanism.
The administrator mounts a container filesystem with the cpuctl
option; scheduling groups can then be created as directories within that
filesystem. Processes can be moved into (and out of) groups using the
usual container interface. So any particular policy can be implemented
through the creation of a simple, user-space daemon which responds to
process creation events by placing newly-created processes in the right
group.
In its current form, the container code only supports a single level of group
hierarchy, so a two-level scheme (divide users into administrators,
employees, and guests, then enforce fairness between users in each group,
for example) cannot be implemented. This appears to be a "didn't get
around to it yet" sort of limitation, though, rather than something which
is inherent in the code.
With this feature in place, CFS will become more interesting to a number of
potential users. Those users may have to wait a little longer, though.
The 2.6.23 merge window will be opening soon, but it seems unlikely that
this work will be considered ready for inclusion at that time. Maybe
2.6.24 will be a good release for people wanting a shiny, new, group-aware
scheduler.
The proposed fallocate() system call, which exists to allow an
application to preallocate blocks for a file, was covered here
back in March.
Since then there has been quite a bit of discussion, but there is still no
system call in the mainline - and it's not clear that
there will be in 2.6.23 either. There is a new version of the patch
in circulation, so it seems like a good time
to catch up with what is going on.
Back in March, the proposed interface was:
long fallocate(int fd, int mode, loff_t offset, loff_t len);
It turns out that this specific arrangement of parameters is hard to
support on some architectures - the S/390 architecture in particular.
Various alternatives were proposed, but getting something that everybody
liked proved difficult. In the end, the above prototype is still being
used. The S/390 architecture code will have to do some extra bit shuffling
to be able to implement this call, but that apparently is the best way to
go.
That does not mean that the interface discussions are done, though. The
current version of the patch now has four possibilities for mode:
- FA_ALLOCATE will allocate the requested space at the
given offset. If this call makes the file longer, the
reported size of the file will be increased accordingly, making the
allocated blocks part of the file immediately.
- FA_RESV_SPACE preallocates blocks, but does not change the
size of the file. So the newly allocated blocks, if past the end of
the file, will not appear to be present until the application writes
to them (or increases the size of the file in some other way).
- FA_DEALLOCATE returns previously-allocated blocks to the
system. The size of the file will be changed if the deallocated
blocks are at the end.
- FA_UNRESV_SPACE returns the blocks to the system, but does
not change the size of the file.
As an example of how the last two operations differ, consider what happens
if an application uses fallocate() to remove the last block from a
file. If that block was removed with FA_DEALLOCATE, a subsequent
attempt to read that block will return no data - the offset where that
block was is now past the end of the file. If, instead, the block is
removed with FA_UNRESV_SPACE, an attempt to read it will return a
block full of zeros.
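The four proposed modes can be compared with a toy model which tracks only a file's reported size and its set of allocated blocks. The ModelFile class is invented for this illustration and the block size is arbitrary; it follows the descriptions above and does not call any real system call. Since the model stores no data, every readable byte comes back as zero.

```python
BLOCK = 4096  # arbitrary block size for the model


class ModelFile:
    def __init__(self):
        self.size = 0          # size as reported to applications
        self.blocks = set()    # indexes of allocated blocks

    def fallocate(self, mode, offset, length):
        end = offset + length
        span = set(range(offset // BLOCK, (end - 1) // BLOCK + 1))
        if mode in ("FA_ALLOCATE", "FA_RESV_SPACE"):
            self.blocks |= span
            if mode == "FA_ALLOCATE":
                self.size = max(self.size, end)      # file may grow
        elif mode in ("FA_DEALLOCATE", "FA_UNRESV_SPACE"):
            self.blocks -= span
            if mode == "FA_DEALLOCATE" and end >= self.size:
                self.size = min(self.size, offset)   # tail removed
        else:
            raise ValueError(mode)

    def read(self, offset, length):
        """Nothing is returned past end of file; holes read as zeros."""
        available = max(0, min(length, self.size - offset))
        return bytes(available)


f, g = ModelFile(), ModelFile()
for model in (f, g):
    model.fallocate("FA_ALLOCATE", 0, 2 * BLOCK)

f.fallocate("FA_DEALLOCATE", BLOCK, BLOCK)
g.fallocate("FA_UNRESV_SPACE", BLOCK, BLOCK)

print(f.size, len(f.read(BLOCK, BLOCK)))   # file shrank; read returns no data
print(g.size, len(g.read(BLOCK, BLOCK)))   # size unchanged; a block of zeros
```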
It turns out that there are some differing opinions on how this interface
should work. A trivial change which has been requested is that the
FA_ prefix be changed to FALLOC_ - this change is likely
to be made. But it seems there are a number of other flags that people would
like to see:
- FALLOC_ZERO_SPACE would write zeros to the requested
range - even if that range is already allocated to the file. This
feature would be useful because some filesystems can quickly
mark the affected range as being uninitialized rather than actually
writing zeros to all of those blocks.
- FALLOC_MKSWAP would allocate the space, mark it initialized,
but not actually zero out the blocks. The newly-allocated blocks
would thus still contain whatever data the previous user left there.
This operation, which would clearly have to be privileged, is intended
to make it possible to create a swap file in a very quick way. It
would require very little in the way of in-kernel memory allocations
to implement, making it a useful way to add an emergency swap file to
a system which has gone into an out-of-memory condition.
- FALLOC_FL_ERR_FREE would be an additional flag which would
affect error handling; in particular, it would control behavior when
the filesystem runs out of space part way through an allocation
request. If this flag is set, the blocks which were successfully
preallocated would be freed; otherwise they would be left in place.
There is some opposition to this flag; it may be left out in favor of
an official "all or nothing" policy for preallocations.
- FALLOC_FL_NO_MTIME and FALLOC_FL_NO_CTIME would
prevent the filesystem from updating the modification
times associated with the file.
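The error-handling question behind FALLOC_FL_ERR_FREE can be made concrete with a small model. The preallocate() function below is hypothetical; it only mirrors the behavior described above for a filesystem which runs out of space part way through a request.

```python
def preallocate(free_blocks: int, wanted: int, err_free: bool):
    """Model one preallocation request.

    Returns (blocks_kept_by_file, free_blocks_left, succeeded).
    """
    granted = min(wanted, free_blocks)
    if granted == wanted:
        return granted, free_blocks - granted, True
    if err_free:
        # Flag set: give back what was allocated before the failure.
        return 0, free_blocks, False
    # Flag clear: the partial allocation is left in place.
    return granted, free_blocks - granted, False


print(preallocate(10, 8, err_free=True))    # enough space: request succeeds
print(preallocate(5, 8, err_free=True))     # fails, nothing kept
print(preallocate(5, 8, err_free=False))    # fails, partial allocation kept
```

An official "all or nothing" policy would correspond to always behaving as if the flag were set.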
All told, it's a significant number of new features - enough that some
people are starting to wonder if fallocate() is the right approach
after all. Christoph Hellwig, in particular, has started to complain; he
suggests adding something small which would be able to implement
posix_fallocate() and no more. Block deletion, he says, is a
different function and should be done with a different system call, and the
other features need more thought (and aggressive weeding). So it's unclear
where this patch set will go and whether it will be considered ready for
Page editor: Jonathan Corbet