Brief items
The current stable 2.6 kernel is 2.6.16.13, released on May 2. This
release contains a single patch for a denial of service problem in the SCTP
code. 2.6.16.12 had been released the day before with a couple dozen
important fixes.
The current 2.6 prepatch is 2.6.17-rc3, released by Linus on
April 26, several milliseconds after the LWN Weekly Edition was
published. As expected, the changes were mostly fixes, but this prepatch
also adds support for
version 1.2 trusted platform modules, multiple page size support for the
PA-RISC architecture, and the new vmsplice() system call (see
below). See the long-format changelog
for the details.
The current -mm tree is 2.6.17-rc3-mm1. Recent changes
to -mm include some red-black tree optimizations, a new set of page
migration patches, some RAID (MD) improvements, the likely() macro
profiler (see below), the long-delayed removal of devfs, and some memory
hotplug work.
For 2.4 users, 2.4.33-pre3 is out; it was announced by Marcelo on
May 1. It contains a small number of fixes, a number of which are
security-related.
Kernel development news
Last January, Van Jacobson
presented his network channel
concept at the 2006 linux.conf.au gathering. Channels, by
concentrating network processing in ways which are most friendly to SMP
systems, look like a promising way to improve high-speed networking
performance. There was a fair amount of excitement about the idea.
Unfortunately, Mr. Jacobson appears to have since become busy with other
projects, so no
contributions of actual code have resulted from his work. So not much has
happened on this front in the last few months - or so it seemed.
David Miller recently let slip that he was
working on his own channel implementation. It was not something he
expected to see functioning anytime soon, however:
[D]on't expect major progress and don't expect anything beyond a
simple channel to softint packet processing on receive any time
soon.
Going all the way to the socket is a large endeavor and will
require a lot of restructuring to do it right, so expect this to
take on the order of months.
It turns out, however, that David was not the only person working on this
idea; Kelly Daly and Rusty Russell have also put together a rudimentary channel implementation; in
response to David's note, they posted their code for review. Since this
version is more advanced, it has been the center of most of the discussion.
The Daly/Russell patch creates a data structure called struct
channel_ring. It consists of 256 pages of memory, mapped contiguously
into the receiving process's address space - though the pages will not be
contiguous in kernel space. As Van Jacobson described, the variables used
by the producer side are located at the beginning of the ring, while
variables used by the consumer are at the end; this separation helps to
ensure that the cache lines representing those variables do not bounce
between processors. These variables include the circular buffer indexes indicating
which buffer each side will use next. There are also flags allowing
the consumer to request a wakeup when buffers are added to the ring.
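The layout described above can be illustrated with a small user-space sketch. This is not the Daly/Russell code: the names (sketch_ring, ring_put(), ring_get()), the 64-byte cache-line size, and the use of plain integers in place of packet buffers are all assumptions made for illustration, and the memory barriers a real multiprocessor ring would need are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size */
#define RING_SLOTS 256  /* illustrative slot count, not the patch's geometry */

/* Variables written by the producer live at the start of the ring. */
struct ring_producer {
    uint32_t head;           /* next slot the producer will fill */
    uint32_t wake_consumer;  /* hypothetical "please wake me" flag */
} __attribute__((aligned(CACHE_LINE)));

/* Variables written by the consumer live at the end, on their own line. */
struct ring_consumer {
    uint32_t tail;           /* next slot the consumer will read */
} __attribute__((aligned(CACHE_LINE)));

struct sketch_ring {
    struct ring_producer prod;
    uint32_t slots[RING_SLOTS];  /* stand-in for the packet buffers */
    struct ring_consumer cons;
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int ring_put(struct sketch_ring *r, uint32_t value)
{
    assert(r != NULL);
    uint32_t next = (r->prod.head + 1) % RING_SLOTS;
    if (next == r->cons.tail)
        return 0;                /* one slot stays empty to mark "full" */
    r->slots[r->prod.head] = value;
    r->prod.head = next;
    return 1;
}

/* Consumer side: returns 1 and stores the value, or 0 if the ring is empty. */
static int ring_get(struct sketch_ring *r, uint32_t *value)
{
    assert(r != NULL && value != NULL);
    if (r->cons.tail == r->prod.head)
        return 0;
    *value = r->slots[r->cons.tail];
    r->cons.tail = (r->cons.tail + 1) % RING_SLOTS;
    return 1;
}
```

Because each side only ever writes its own index, the cache line holding head is dirtied by the producer alone and the line holding tail by the consumer alone, which is the separation the text describes.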
User-space starts by creating a socket with
the new PF_VJCHAN protocol type, then using mmap() to map
the ring buffer. Thereafter, it can use buffers as they become available
(using poll() or select(), if need be, to wait for more
data). When a buffer is no longer needed, incrementing the appropriate
index will free it up for new data.
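A rough sketch of the consumer side follows. Since PF_VJCHAN exists only in the posted patch, the socket() and mmap() steps are omitted; the rx_ring structure, its field names, and the drain() helper are hypothetical stand-ins for whatever the mapped ring actually looks like. The point is only that advancing the consumer's index is what returns a buffer to the producer.

```c
#include <assert.h>
#include <stdint.h>

#define NBUF 8  /* illustrative number of buffers (a power of two) */

/* Hypothetical application view of the mapped ring. */
struct rx_ring {
    volatile uint32_t produced;  /* advanced by the kernel side */
    uint32_t consumed;           /* advanced by the application */
    uint32_t len[NBUF];          /* bytes of data in each buffer */
};

/*
 * Handle every buffer that is currently available and return the number
 * of bytes seen. An application would call poll() or select() on the
 * socket when this returns 0 and it wants to wait for more data.
 */
static uint64_t drain(struct rx_ring *ring)
{
    uint64_t bytes = 0;

    /* The producer must never get more than NBUF buffers ahead. */
    assert(ring->produced - ring->consumed <= NBUF);

    while (ring->consumed != ring->produced) {
        bytes += ring->len[ring->consumed % NBUF];
        ring->consumed++;        /* this buffer is now free for new data */
    }
    return bytes;
}
```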
The driver-side interface is, so far, quite simple. A buffer can be
allocated from a given ring with a call to vj_get_buffer(); once
the data has been placed there by the network interface,
vj_netif_rx() sends that buffer up into the protocol code. The
tricky part is getting each packet into the correct buffer in the first
place. Copying packets inside the kernel would defeat the purpose of this
whole exercise; it is important that the network interface choose the
correct buffer before DMAing the packet data into memory. As it happens,
contemporary network cards can be smart enough to make that decision, if
programmed properly by the driver.
There are vast numbers of issues to be worked out still. David Miller takes exception to the preallocated buffers,
seeing them as inflexible and hard to change; he would rather see a
pointer-oriented data structure. But it is hard to see how that might work
while still avoiding the overhead of mapping buffers into user space with
every packet.
A more difficult issue, perhaps, is netfilter. The zero-copy approach can
be quite fast, but it also naturally shorts out the packet filtering done
by the netfilter code. It has been suggested that, for established
connections, that is an acceptable tradeoff. But Rusty has pointed out that people do use filtering on
established connections, for packet counting if nothing else. As he put
it: "Basically I don't think we can 'relax' our firewall
implementation and retain trust." So some other sort of solution
will have to be found here.
Another open issue has to do with whether the channel should go all the way
through to user space or not. Van Jacobson's linux.conf.au presentation
included discussion of a user-space TCP implementation, taking the
end-to-end principle to its logical conclusion. The reasoning behind this
move is that, since the data will be processed by the application, putting
the protocol code in the same place will be the fastest, most
cache-friendly way to do it. But moving protocol code to user space also
means duplicating much of the networking stack and adding to the complexity
of the system as a whole. Leaving the protocol code in the kernel
simplifies the situation, and, it is believed, can be made to yield almost
all of the same performance benefits. In particular, protocol processing
can happen on the same processor as the destination application (a fair
amount of it is done that way now), and zero-copy networking will still be
possible.
It has also been pointed out that, since most of the system calls involved
with network data reception (read() or recv(), for
example) already imply copying the data, that copy might as well be done in
kernel space. But implicit in that statement is another conclusion: if
channels are to be used to their fullest potential for high-performance
networking, a new set of user-space interfaces will have to be developed.
The venerable socket interface was never designed for a channel-oriented
environment. How such an interface might look is not entirely clear; it
could be based on the current asynchronous I/O API, on kevents, or on something
completely new.
In summary, the networking developers are working on some major changes to
how networking will be done in Linux, and there are a lot of issues which
are not yet understood. The developers are groping around for ideas. So
the channel implementations which are being posted now are unlikely to
resemble the code which will, someday, be merged into the mainline; they
are, instead, exercises intended mainly to obtain a better understanding of
the real nature of the problem. But they are still a promising start to
what looks to be an interesting development effort.
April 28, 2006
This article was contributed by Patrick Mochel.
On 11 April 2006, 42 attendees from 17 different companies (and 3
universities) arrived in Santa Clara, California for the 2006
Linux Power Management Summit. The Summit was
organized by your author, in conjunction with the Consumer
Electronics Linux Forum (CELF), which held its Embedded Linux Conference
the same week, and with the OSDL Desktop Linux Working Group. Along
with CELF, summit sponsors included Intel, Nokia, Google, AMD, FreeScale,
and Texas Instruments. The attendees represented over a dozen open
source projects, from the low-level embedded (DPM/PowerOp) to the
high-level (freedesktop.org) to the broadest (Fedora, SUSE, and Ubuntu
distributions). With such a diverse crowd of people, if nothing else,
it promised to be an interesting week of discussions.
The Summit spanned 3 days, starting with a welcome reception on
Tuesday evening, 11 April and going until mid-day on Friday, 14
April. Wednesday and Thursday were filled with hour-long sessions led
by an individual from a project or a company. The sessions were
designed to foster discussion, though the format was left entirely up
to the presenter. Most had a backing presentation of talking points,
and each one succeeded in keeping the discussions flowing.
Wednesday's presentations were centered around various Open Source Power
Management projects.
First Pavel Machek talked about
Linux Suspend [PDF] (Suspend-to-Disk and
Suspend-to-RAM), giving an overview of its history, its
implementation, and the issues that continue to inhibit the suspend
operations from
"just working" in the way that people want them to. He spoke about
uSwsusp, which moves the suspend functionality to userspace, allowing
for less in-kernel complexity and an easier implementation of the
user-friendly features found in Nigel Cunningham's Suspend2 patches;
and he spoke about the main problem with getting Suspend-to-RAM to
work: video drivers.
Len
Brown next talked about ACPI [PDF], and what that meant to power
management. Len gave an overview of the generic ACPI components (the
tables, the ASL compiler, the AML interpreter, and the ACPICA
(Component Architecture)), and the Linux implementation (code
organization, ACPI device drivers, acpid). He then dove into ACPI
power states, and specifically how it represented and implemented CPU
C States (idle states that vary in latency to return) and P States
(performance states that vary in CPU speed).
Len's session provided a good lead-in to
Dominik
Brodowski's session about cpufreq [PDF], which does dynamic CPU
frequency scaling based on policy and intelligence about measuring and
predicting the load. Dominik described the architecture of the
subsystem, how decisions were made, and how they were effected via the CPU
drivers. He then spoke about the desire to extend cpufreq beyond just
frequency scaling (and include voltages and clocks), beyond single
CPUs (to be smarter about managing multiple cores and threads), and
beyond CPUs in general (to include policy and drivers for other
devices with similar functionality).
Todd
Poynor and Matthew Locke's session about DPM and PowerOp [PDF]
followed, providing a perspective on the same topic from the other
end of the tunnel. DPM (Dynamic Power Management) is infrastructure to
manage the "Operating Points" of a system, which are states
consisting of pre-defined tuples of voltages and clocks (and therefore
frequencies). To coordinate and set the voltages and clocks (which
usually must be done for several devices in unison), DPM uses a
low-level interface called PowerOp. DPM is practically ubiquitous in
embedded Linux implementations, though it lives in an out-of-kernel
patch.
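To make the "operating point" idea concrete, here is a minimal sketch of a table of voltage/clock tuples and a selection routine. It is not DPM or PowerOp code; the structure, the table values, and pick_op() are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One operating point: a pre-defined tuple of voltage and clocks. */
struct operating_point {
    const char *name;
    unsigned int millivolts;
    unsigned int cpu_khz;
    unsigned int bus_khz;
};

/* Hypothetical table, ordered from lowest to highest power. */
static const struct operating_point op_table[] = {
    { "idle",    900, 100000,  50000 },
    { "normal", 1100, 300000, 100000 },
    { "turbo",  1300, 600000, 133000 },
};

/*
 * Pick the lowest-power point whose CPU clock meets the demand,
 * falling back to the highest point if none does.
 */
static const struct operating_point *pick_op(unsigned int needed_khz)
{
    size_t n = sizeof(op_table) / sizeof(op_table[0]);

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (op_table[i].cpu_khz >= needed_khz)
            return &op_table[i];
    return &op_table[n - 1];
}
```

Switching to the chosen point means changing the voltage and every clock in the tuple together, which is the coordination job the text assigns to the low-level PowerOp layer.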
The next hour was split among Holger Macht, who
talked about
SUSE power management; Dave Jones, who spoke about
Fedora power management; and a guest speaker who spoke about Ubuntu
power management. SUSE provides an application called
powersave that provides a
command-line interface (which can then be wrapped by a GUI) for managing suspend states, CPU
PM, and some device states (recently added). The Fedora and Ubuntu power management
concerns have both centered around getting suspend/resume to work
reliably for their users. Both Fedora and Ubuntu seem to use
gnome-power-manager as the primary interface for managing power; this tool doesn't
expose as many knobs and levers (literally and figuratively) as the powersave
family of utilities does.
All of the distributions now support power management (especially
suspend/resume) on quite a large list of laptop models.
To finish off the day,
Jim Gettys and
Mark Foster from the One Laptop Per Child project spoke
about the design and challenges of the $100 Laptop [PDF], especially
around power management. Specifically, they are looking for very
efficient hardware and software solutions so that charging the battery
requires minimal energy and so that the battery lasts an exceptionally
long time (by today's standards).
Mark presented a
proposal [PDF] of a mechanism for achieving a
resume-from-RAM in < 300ms.
Sampsa
Fabritius from Nokia started the Thursday sessions [PDF] off
with a presentation of the power management framework used on the
Maemo platform (which is used in the Nokia 770). Maemo is based on
GNOME, but it uses a custom power configuration and management scheme,
rather than one based on Utopia/HAL/DBUS. At a lower level, they have
also written a "clocks" framework for articulating and controlling
clock domains (of which the OMAP platform has many). Based on the
previous day's discussions, Sampsa presented the question of whether
or not it was possible (and prudent) to define a common solution of
power configuration and management (or common set of solutions), since
many platforms and interfaces are trying to accomplish similar things,
sometimes with a set of similar components.
A set of people from the Texas Instruments OMAP division
-- Eric Thomas, Shiv Ramamurthi, and Richard Woodruff -- spoke about the
OMAP platform, its goals, and the challenges faced with leveraging its
power management potential. OMAP has a rich set of power management
techniques, and unlike most desktop platforms, it exposes all of the
low-level components (clocks, clock domains, power domains, and
voltage domains) to the kernel and requires it to coordinate the
scaling of each. This is currently done with a modified version of
DPM, along with a custom set of scripts and control framework to set
and manage the operating points of the system.
Quinn Jensen from FreeScale used the next hour to speak about
the MX31 platform [PDF], an ARM 11-based system-on-a-chip that is similar in nature to
the OMAP. It has many power management features centered around
dynamic voltage and frequency scaling (aka DVFS). Not surprisingly,
they are also using a custom version of DPM and associated control
infrastructure to control the hardware. Like the others, they are
running into limitations of the framework, since it only deals with
the lowest-level components and doesn't provide a rich(er) policy
framework (like cpufreq does).
Mark
Gross, representing CELF, presented a summary of the CELF
power management requirements [PDF], as expressed by the CELF member
companies. The most important items seemed to be the refinement and
inclusion of a dynamic tick/tick-less idle solution (which underscored
the use of such solutions by previous presenters), and a mainstream
solution for DVFS (a la DPM) that provided a robust policy management
(a la cpufreq). Much of the discussion that followed was about the
details of a common interface for these solutions.
Jacob
Shin from AMD presented next about the low-level details of
AMD CPU PM [PDF], specifically how PowerNow works on multi-core K8
processors, and the changes that were necessary to the CPU hotplug and
cpufreq bodies of code to support it.
Thursday ended with a birthday celebration for Adam Belay, then
an open discussion about the topics covered so far and the issues
that were on people's minds.
Friday began with another open discussion about the overall
architecture and framework needed for power management on any
system. After several diagrams, doodlings and lists went up on the
wall on gigantic Post-It Notes, the group broke into three smaller
groups to talk about the three primary layers of power management and
how they might be able to share functionality or features between
different platforms and solutions.
- Low level hardware configuration and control. This
discussion was centered around how to describe different levels of
"on-ness" and "off-ness" to high levels in a manner that made the
most sense (to both the device drivers and the consumers of such an
interface).
- The kernel-user space interface. This discussion was
based on the assumption that the gap between DPM's low-level management
framework and cpufreq's policy framework can and should be bridged in
some manner. From there, this group discussed how to design a common
interface (via sysfs) which could be used by a user-space policy mechanism
to control CPU operating points.
- The framework that must exist in
user space to provide good power management. There are a number of
existing solutions for monitoring various types of hardware,
monitoring and predicting system load, handling PM-related events, and
managing policy. But they are all disjoint, overlapping only
occasionally, and most do not do as good a job as anyone would
like.
It was a long three days, filled with many discussions about system
control and management throughout the software stack, and the many
interdependencies and special cases that exist on many platforms that
Linux supports. Such is the nature of power management. The
introductions to new topics and people, as well as the brainstorming
about better and more common solutions, were top-notch, and bode well
for the future of efficiency in Linux.
However, in the meantime, we still have a lot of work to do in the
fixing category. Besides the fact that the primary embedded solution
(DPM) and its variants don't exist in a mainstream kernel, there
is also this quote to consider about what we're working with today. As
Andrew Morton expressed it (via email):
My main concern is stability of the existing stuff, rather than any
need for new features. Firstly machines which won't boot, especially
ones which _newly_ won't boot. Secondly machines which won't
suspend/resume properly, especially ones which used to do this. Huge
number of ACPI bug reports, and rather a lot of cpufreq ones too.
My second concern would be with overall stability and maturity and
simplicity of the existing kernel APIs - it seems that lots of driver
developers get it wrong in subtle ways. (Why am I still staring at
those "pm_register is deprecated" warnings??)
Fortunately, we now have a lot more people familiar with the types of
Power Management problems, and many more upcoming events to discuss
the progress as we move forward.
[Author's Note: This article was written with the
help of the extensive notes taken by Jeffery Osier-Mixon, a technical
writer from PalmSource whom we borrowed for the Summit. Thanks,
Jefro.]
A number of issues have been discussed in recent times that, while too
short for a full article, are nonetheless worthy of mention. Here are a few
of them.
Development process
The 2.6.17-rc2-mm1 release
included, along with the usual huge pile of patches, a complaint from
Andrew Morton:
It took six hours work to get this release building and linking in
just a basic fashion on eight-odd architectures. It's getting out
of control....
Could patch submitters _please_ be a lot more careful about getting
the Kconfig correct, testing various Kconfig combinations (yes
sometimes people will want to disable your lovely new feature) and
just generally think about these things a bit harder? It isn't
rocket science.
Andrew, it seems, is getting too many submissions which lack basic
testing. Occasionally things simply don't compile. More often, patches
create problems when their particular configuration options are disabled,
or for architectures not tested by the original developer. Andrew ends up
fixing those problems, and that takes a fair amount of his time. The bigger issue is elsewhere, however:
My main reason for the big whine is that this defect rate indicates
that people just aren't being sufficiently careful in their work.
If so many silly trivial things are slipping through, then what
does this tell us about the big things, ie: runtime bugs?
There has been some discussion of how the situation could be improved.
Ideas include better automated kernel build farms which would allow any
developer to get wider build testing and a
checklist to be gone over before patches are sent for review. But what
is really needed is for developers to simply take a little more care in the
preparation of their patches.
CKRM rebranded
The CKRM resource management patches have been received unenthusiastically
by the development community in the past. To many, CKRM looks like a large
body of complex code, with hooks distributed throughout the kernel,
providing functionality which is of interest to relatively few users. So
the CKRM proposals have not gotten very far, and the development team has
been quiet recently.
What the developers have been doing, however, is reworking the CKRM patches
in an attempt to make them more palatable. The result is now known as Resource Groups, and it is, once
again, being pushed for inclusion into the kernel. The Resource Group code
has been put on a diet, with many features removed and others shoved out to
user space. Duplicated code has been taken out, and a major effort has
been made to use kernel library primitives wherever possible.
Andrew Morton had a reasonably positive
reaction to the new code submission, saying "...the overall code
quality is probably the best I've seen for an initial submission of this
magnitude." He was more worried
about a proposed memory controller, however, which looks to duplicate much
of the memory management subsystem. There have been few
comments from elsewhere in the community.
Not so unlikely after all
The kernel provides a couple of macros, called likely() and
unlikely(), which are intended to provide hints to the compiler
regarding which way a test in an if statement might go. The
compiler can use that hint to arrange the generated code (and, on some
architectures, to emit branch hints) so that the processor's branch
prediction and speculative execution favor the expected path at run time. These macros are used
fairly heavily throughout the kernel to reflect what the programmer
thinks will happen.
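As a minimal sketch, macros of this kind can be built on GCC's __builtin_expect(), which is also how the kernel defines them. The sum_bytes() function is a hypothetical example of a caller that treats a NULL pointer as the rare case.

```c
#include <assert.h>
#include <stddef.h>

/* The !! turns any non-zero value into exactly 1 before the comparison. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sum a buffer, returning -1 for a NULL pointer. */
static long sum_bytes(const unsigned char *buf, size_t len)
{
    long total = 0;

    if (unlikely(buf == NULL))   /* hint: callers rarely pass NULL */
        return -1;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The hint never changes what the code computes; a wrong hint only costs speed, which is why such mistakes can go unnoticed for a long time.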
A well-known fact of life is that programmers can have a very hard time
guessing which parts of their code will actually consume the most processor
time. It turns out that they aren't always very good at choosing the
likely branches in their code either. To drive this point home, Daniel
Walker has put together a
patch which does a run-time profile of likely() and
unlikely() declarations. With the resulting output, it is
possible to see which of those declarations are, in reality, incorrect and
slowing down the kernel.
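The idea behind such a profile can be sketched in user space as follows. This is not Daniel Walker's patch: hint_stats, hint_record(), profiled_unlikely(), and release() are hypothetical names, and a real profiler would also record the file and line of each annotation automatically.

```c
#include <assert.h>
#include <stddef.h>

/* Per-annotation counters: how often the hint was right or wrong. */
struct hint_stats {
    const char *where;
    unsigned long hit, miss;
};

/* Record whether the condition matched the hint, then pass it through. */
static int hint_record(struct hint_stats *s, int cond, int expected)
{
    assert(s != NULL);
    if (cond == expected)
        s->hit++;
    else
        s->miss++;
    return cond;
}

/* An unlikely() variant that counts outcomes as a side effect. */
#define profiled_unlikely(stats, x) \
    __builtin_expect(hint_record((stats), !!(x), 0), 0)

static struct hint_stats null_check = { "release(): ptr == NULL", 0, 0 };

/* Hypothetical function whose author assumed NULL would be rare. */
static void release(void *ptr)
{
    if (profiled_unlikely(&null_check, ptr == NULL))
        return;
    /* ... real cleanup work would go here ... */
}
```

After a representative workload, an annotation whose miss count rivals or exceeds its hit count is a candidate for removal or reversal.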
Using this output, Hua Zhong and others have been writing patches
to fix the worst offenders; some of them have already found their way into
the mainline. In at least one case, the results have made it clear to the
developers that things are not working as they were expected to, and other
fixes are in the works.
One unlikely() which remains unfixed, however, is in
kfree(). Passing a NULL pointer to kfree() is
entirely legal, and there has been a long series of janitorial patches
removing tests which checked pointers for NULL before freeing
them. kfree() itself is coded with a hint that a NULL
pointer is unlikely, but it turns out that, in real life, over half of the
calls to kfree() pass NULL pointers. There is
resistance to changing the hint, however; the preference seems to be to fix
the (assumed) small number of high-bandwidth callers which are at the root
of the problem.
vmsplice()
Last week, your editor astutely caught the last-minute merging of the vmsplice() system call
into 2.6.17-rc3. Rather less astutely, however, your editor missed the
fact that the prototype for vmsplice() had changed since it was
posted on the linux-kernel mailing list. The current prototype for
vmsplice() is:
long vmsplice(int fd, const struct iovec *iov,
unsigned long nr_segs, unsigned int flags);
The use of the iovec structure allows vmsplice() to be
used for scatter/gather operations.
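A minimal user-space sketch of the gather case follows, assuming a Linux system whose C library provides a vmsplice() wrapper (glibc declares it with ssize_t and size_t in place of the long and unsigned long shown above). The gather_into_pipe() helper and its buffers are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Feed two separate buffers into a pipe with a single vmsplice() call,
 * then read the combined data back. Returns the byte count read, or -1.
 */
static ssize_t gather_into_pipe(char *out, size_t outlen)
{
    static char part1[] = "hello, ";
    static char part2[] = "pipe";
    struct iovec iov[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };
    int fds[2];
    ssize_t n;

    assert(out != NULL);
    if (pipe(fds) < 0)
        return -1;

    n = vmsplice(fds[1], iov, 2, 0);   /* fd must refer to a pipe */
    if (n >= 0)
        n = read(fds[0], out, outlen);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The two segments arrive in the pipe in order, so a reader sees them as one contiguous stream.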
Since then, vmsplice() has picked up a new flag:
SPLICE_F_GIFT. If that flag is set, the calling process is
offering the pages to the kernel as a "gift." If conditions allow, the
kernel can simply remove the pages from the process's address space and
dump them into, for example, the page cache. With this flag, an application
can generate data in memory, then send it on to its destination without
copying in the kernel.
Page editor: Jonathan Corbet