Brief items
There is no development kernel as the 2.6.35 merge window is open.
As of this writing, the flow of patches into the mainline has been
relatively slow; see below for details.
2.6.34 was released on
May 16. As usual, there's a lot of new stuff in this release, though it is
slightly less feature-packed than some. Highlights include asynchronous power management, a
lot of tracing enhancements, LogFS, and the Ceph distributed filesystem. As
always, lots of details can be found in the excellent KernelNewbies summary.
There have been no stable updates over the last week.
We do not trust BIOS tables, because BIOS writers are invariably
totally incompetent crack-addicted monkeys. If they weren't, they
wouldn't be BIOS writers. QED.
--
Linus Torvalds
If we continue to hug
Every pig we can
Our love will grow
As large as this land.
So I promise to help
Every pig in defeat
For if it weren't for pigs
There would be no bacon to eat.
--
Daniela Torvalds
Right, because Firmware writers are from the rugged unresponsive
uplands of planet
ignore-user-complaints-and-eat-them-for-breakfast-if-they-file-bugs
and Software writers are from the emollient responsive groves of
planet harmony. Obviously what would work for one wouldn't work
for the other.
As a software writer, I fully buy into that world view. The
trouble is that when I go to dinner with hardware people, they seem
to be awfully nice chaps ... almost exactly like me, in fact ...
--
James Bottomley
If my phone is able to avoid losing almost all of its standby time
without me having to care about whether my bouncing cow game was
written by a complete fool or not, that means that my phone is
*better* than one where I have to care. Would the world be better
if said fool could be sent to reeducation camps before being
allowed to write any more software? Probably, but sadly that
doesn't seem to be something we can implement through code.
--
Matthew Garrett
A number of kernel developers held a minisummit dedicated to the collection
and reporting of hardware errors at the Linux Foundation's Collaboration
Summit. Mauro Carvalho Chehab has put together and posted a set of minutes
from that meeting. It's worth noting that
some developers who were not present are not in full agreement with
everything found in those minutes; follow the discussion
thread for more information.
By Jonathan Corbet
May 18, 2010
Red-black trees (rbtrees) are
a highly-optimized data structure used in a number of places in the kernel.
With an rbtree, a kernel programmer can quickly locate data structures
corresponding to a specific value; all that is needed is to store data
structures with the value of interest as the key. Some fuzzier sorts of
matches can be hard to do with rbtrees; consider, for example, the case of
finding the lowest-valued node which overlaps with a given range of values.
Venkatesh Pallipadi recently encountered this problem while trying to
improve the functioning of the page attribute table (PAT) support for the
x86 architecture. Rather than give up on rbtrees, he chose to enhance that
data structure to meet a wider range of needs.
Venkatesh's patch (which was one of the first things merged for 2.6.35)
implements the concept of "augmented rbtrees." Such a tree works very much
like an ordinary rbtree, with the exception that it keeps additional
information in each node. That information, almost certainly, is a
function of any child nodes in the tree - the maximum key value among all
children, for example. Since users of rbtrees must write their own search
functions anyway, they can easily take advantage of this extra information
to optimize searches.
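As a rough illustration of how a search function might use such per-node data, here is a minimal user-space sketch. It is not the kernel's rbtree code or the PAT implementation: the tree is a plain, unbalanced binary search tree, intervals are treated as closed ranges, and struct interval_node, interval_insert(), and lowest_overlap() are hypothetical names invented for this example. Each node records the largest interval end found anywhere in its subtree, which lets the search skip subtrees that cannot possibly overlap the requested range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node: a closed interval [start, end], keyed by start. */
struct interval_node {
	unsigned long start, end;
	unsigned long subtree_max_end;	/* augmented: largest 'end' in this subtree */
	struct interval_node *left, *right;
};

/* Recompute one node's augmented value from its own end and its children. */
static void update_max_end(struct interval_node *n)
{
	unsigned long max = n->end;

	if (n->left && n->left->subtree_max_end > max)
		max = n->left->subtree_max_end;
	if (n->right && n->right->subtree_max_end > max)
		max = n->right->subtree_max_end;
	n->subtree_max_end = max;
}

/*
 * Unbalanced insert keyed by start. Recursion is acceptable in this
 * user-space sketch; it stands in for the rebalancing insert that a
 * real red-black tree would perform.
 */
static struct interval_node *interval_insert(struct interval_node *root,
					     struct interval_node *n)
{
	assert(n != NULL && n->start <= n->end);
	if (!root) {
		n->left = n->right = NULL;
		n->subtree_max_end = n->end;
		return n;
	}
	if (n->start < root->start)
		root->left = interval_insert(root->left, n);
	else
		root->right = interval_insert(root->right, n);
	update_max_end(root);
	return root;
}

/* Return the node with the lowest start that overlaps [lo, hi], or NULL. */
static struct interval_node *lowest_overlap(struct interval_node *node,
					    unsigned long lo, unsigned long hi)
{
	while (node) {
		if (node->left && node->left->subtree_max_end >= lo)
			node = node->left;	/* any match there has a lower start */
		else if (node->start <= hi && node->end >= lo)
			return node;		/* nothing lower can overlap */
		else if (node->start <= hi)
			node = node->right;	/* only higher starts remain */
		else
			return NULL;		/* everything left starts past hi */
	}
	return NULL;
}
```

Without the subtree_max_end field, the search would have no way to know whether the left subtree is worth visiting, and could end up examining every node with a start below the range.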
Users of augmented rbtrees must define an augment_cb() callback
with this prototype:
    void (*augment_cb)(struct rb_node *node);
When the tree is initialized, the callback should be stored in its root
node:
    struct rb_root my_root = RB_AUGMENT_ROOT(my_augment_cb);
Thereafter, the augment_cb() callback will be invoked whenever the
value of one (or both) of a node's children might have changed. The
callback can then update the node's additional information to match the new
tree topology. The callback will be invoked from insert and delete
operations - anything which might change the tree - so rbtree users should
ensure that nodes are in a consistent state before inserting them.
Callbacks are not called recursively up the tree. So if a change to
a node's augmented value might ripple upward, the augment_cb()
callback must work its way up the tree and make the requisite updates.
Note that a recursive call on the parent node is probably not a good idea
unless the tree is known to be extremely shallow.
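The upward walk described here might look something like the following sketch. It uses a hypothetical struct mock_node with an explicit parent pointer rather than the kernel's struct rb_node, and mock_link() and mock_augment_cb() are names invented for this example; a real callback would first recover its containing structure from the rb_node it is handed. The walk is a loop rather than a recursive call, so stack usage stays constant however deep the tree is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a structure embedding an rbtree node. */
struct mock_node {
	struct mock_node *parent, *left, *right;
	unsigned long end;		/* this node's own value */
	unsigned long subtree_max_end;	/* augmented: largest 'end' below here */
};

/* Attach child under parent on the chosen side (test scaffolding only). */
static void mock_link(struct mock_node *parent, struct mock_node *child,
		      int on_left)
{
	assert(parent != NULL && child != NULL);
	child->parent = parent;
	if (on_left)
		parent->left = child;
	else
		parent->right = child;
}

/* Derive a node's augmented value from itself and its direct children. */
static unsigned long compute_max_end(const struct mock_node *n)
{
	unsigned long max = n->end;

	if (n->left && n->left->subtree_max_end > max)
		max = n->left->subtree_max_end;
	if (n->right && n->right->subtree_max_end > max)
		max = n->right->subtree_max_end;
	return max;
}

/*
 * Augment callback sketch: refresh the given node, then each ancestor
 * in turn, so that a change low in the tree ripples up to the root.
 */
static void mock_augment_cb(struct mock_node *node)
{
	while (node) {
		node->subtree_max_end = compute_max_end(node);
		node = node->parent;
	}
}
```

A variant could stop climbing as soon as a node's value comes out unchanged; this sketch always walks to the root to keep the logic obvious.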
As of this writing, the PAT code is the only in-tree user of this
functionality, but others seem likely to appear now that this feature is
globally available.
By Jonathan Corbet
May 19, 2010
Longtime LWN readers will know that the kernel does not have just one
internal memory allocator. Instead, we have the longstanding "slab"
allocator (perennially due to be removed someday), the SLUB allocator
(intended for better scalability, but it hasn't been able to beat slab on
every test), and the SLOB allocator (a space-efficient allocator for
embedded use). There is also the SLQB allocator waiting in the
wings, but it has been waiting long enough that one may wonder whether it
will ever emerge.
All told, one might assume that we have enough allocators. Then again,
there are quite a few letters still available in the SL*B namespace, so why
not make another one?
Thus Christoph Lameter, author of SLUB, has come forward with the SLEB allocator, which is meant
to be a mixture of the best of slab and SLUB. Unlike SLUB, SLEB retains
the object queues used by slab, but it also adds a bitmap for object
management. Also unlike SLUB, there is no storage of metadata in
the objects themselves. That is a performance enhancement: if a cache-cold
object is allocated or freed, SLEB will not bring it into the cache.
This code is very new; it apparently has not yet been tested outside of a
KVM virtual machine. The long benchmarking process that might lead to
merging and, possibly, displacing one of the other allocators has not yet
begun. But the code is there, and that's a start.
Kernel development news
By Jonathan Corbet
May 19, 2010
It's that time again: a new kernel development cycle has started and the
merge window is currently open for new code. As of this writing, some
1100 non-merge changes have been incorporated into the mainline kernel.
The most significant user-visible changes include:
- The performance monitoring subsystem supports the Intel "precise event
based sampling" (PEBS) mode, in which the hardware directly records
event information into a dedicated memory region. The perf subsystem
can also now obtain performance
information from old Pentium 4 CPUs.
- The "perf kvm" tool, which allows the monitoring of virtualized guests
from the host, has been merged.
- The dynamic probe code has better support for a number of basic
integer types.
- The "fair sleepers," "sync wakeups," and "affine wakeup" scheduler
feature flags have
been removed. It seems that, at this point, the scheduler developers
don't believe that things will work properly without those features,
so they are always enabled.
- The SuperH architecture now has hotplug CPU support.
- New drivers:
  - Processors and boards: HP iPAQ rx1950 devices, Acer N35
    systems, Samsung S3C2416-based systems, Marvell GuruPlug
    reference boards, Voipac PXA270 single-board computers, Aeronix
    Zipit Z2 systems, Cavium Networks CNS3xxx processors, Cavium
    Networks CNS3420 MPCore boards, taskit PortuxG20 and Stamp9G20
    boards, ARM SPEAr3XX- and
    SPEAr6XX-based systems, Versatile Express CA9x4 processors, and ARM
    Ltd Versatile Express boards.
  - Miscellaneous: DaVinci DM365-based realtime clock devices.
Changes visible to kernel developers include:
- The "cpu_stop" (formerly cpuhog) mechanism has been
merged. A cpu_stop allows kernel code to monopolize one or more
processors for brief periods.
- Augmented rbtrees are now in the
mainline kernel.
- The INIT_RCU_HEAD() macro is going away; it was never really
needed for RCU functionality, and RCU debugging is moving to the object debugging
infrastructure.
As can be seen above, the 2.6.35 merge window has gotten off to a bit of a
slow start. By the old schedule, the window would remain open through the
end of the month; there has been speculation that Linus will close it
rather sooner than that this time around, though, to inconvenience
maintainers who wait too long to get their pull requests in. One way or
another, there should certainly be more changes to report on next week.
By Jake Edge
May 19, 2010
The time stamp counter (TSC) provided by x86 processors is a
high-resolution counter that can be read with a single instruction
(RDTSC), which makes
it a tempting target for applications that need fine-grained timestamps.
Unfortunately, it is also rather unreliable, so the kernel jumps
through hoops to decide whether to use it and to try to detect when it goes
awry. An effort to export the kernel's knowledge about the reliability of
the TSC has met strong resistance for a number of reasons, but
the biggest is that the kernel developers don't think that applications
should be accessing the counter directly.
Dan Magenheimer and Venkatesh Pallipadi proposed adding a /sys/devices/tsc
directory with several entries corresponding to the kernel's internal TSC
information, including the tsc_unstable flag, which governs
whether the kernel uses the counter as a stable time source. Andi Kleen questioned the idea:
Is this really a good idea? It will encourage the applications
to use RDTSC directly, but there are all kinds of constraints on
that. Even the kernel has a hard time with them, how likely
is it that applications will get all that right?
That is exactly what the patch is meant to do, Magenheimer said, because applications have no reliable
way to determine whether the standard system calls will be "fast" or
"slow":
The problem is from an app point-of-view there is no vsyscall.
There are two syscalls: gettimeofday and clock_gettime. Sometimes,
if it gets lucky, they turn out to be very fast and sometimes
it doesn't get lucky and they are VERY slow (resulting in a performance
hit of 10% or more), depending on a number of factors completely
out of the control of the app and even undetectable to the app.
Note also that even vsyscall with TSC as the clocksource will
still be significantly slower than rdtsc, especially in the
common case where a timestamp is directly stored and the
delta between two timestamps is later evaluated; in the
vsyscall case, each timestamp is a function call and a convert
to nsec but in the TSC case, each timestamp is a single
instruction.
Depending on the hardware, gettimeofday() and
clock_gettime() may be implemented as vsyscalls (virtual
system calls) rather than standard
system calls, which eliminates the transition from user space to the kernel.
Vsyscalls are code stored in a special memory region in user space
(the vdso region) that may access kernel-maintained data, like clock ticks.
With vsyscalls, the calls are (relatively) fast, but on hardware (or
virtual machines) that requires kernel-space operations to get to a
reliable counter, a vsyscall cannot be used, so the calls are slower.
For applications that "need to obtain timestamp data
tens or hundreds of thousands of times per second", the difference
is significant.
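An application can at least measure which case it has landed in. The following is a minimal sketch, assuming a POSIX system that provides clock_gettime() with CLOCK_MONOTONIC; the helper names timespec_to_ns() and avg_clock_gettime_cost_ns() are invented for this example. It averages the cost of a large number of calls; the resulting figure depends entirely on the hardware, kernel, and clock source in use, so no typical value is given here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Flatten a timespec into a single nanosecond count. */
static int64_t timespec_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Average cost, in nanoseconds, of one clock_gettime(CLOCK_MONOTONIC)
 * call, measured over 'calls' back-to-back invocations.
 * Returns -1 if the clock cannot be read.
 */
static int64_t avg_clock_gettime_cost_ns(long calls)
{
	struct timespec begin, end, scratch;

	assert(calls > 0);
	if (clock_gettime(CLOCK_MONOTONIC, &begin) != 0)
		return -1;
	for (long i = 0; i < calls; i++) {
		if (clock_gettime(CLOCK_MONOTONIC, &scratch) != 0)
			return -1;
	}
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return -1;
	return (timespec_to_ns(&end) - timespec_to_ns(&begin)) / calls;
}
```

A program could call avg_clock_gettime_cost_ns(100000) at startup and scale back its timestamping if the result is larger than it can tolerate. This tells the application only how the call performs right now, not why, and not whether the answer will change later.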
But Magenheimer believes that if the kernel finds the TSC stable enough
for its own timekeeping purposes, it is also usable by applications. Arjan
van de Ven and Thomas Gleixner are quick to correct that misunderstanding.
Van de Ven notes that the stability of the
TSC can change under certain circumstances and there would be no way to
notify the applications. His advice: "friends don't let friends use
rdtsc in application code".
Gleixner goes into some detail about how
the TSC can get out of whack: system management mode interrupts (SMIs)
can fiddle with the TSC to hide their presence, multiple cores can
have different values because of boot offsets and/or hotplugging, and
multiple sockets can introduce differences due to separate clocks or drift
in the clock signals due to temperature. There is, in short, nothing
reliable about the TSC: "the stupid hardware is
not reliable whether it has some 'I claim to be reliable tag' on it or
not". Gleixner did offer a possible alternative, though:
[...] but as long as we do not have some really
reliable hardware I'm going to NACK any exposure of the gory details
to user space simply because I have to deal with the fallout of this.
What we can talk about is a vget_tsc_raw() interface along with a
vconvert_tsc_delta() interface, where vget_tsc_raw() returns you an
nasty error code for everything which is not usable.
Currently, there are unnamed "enterprise applications" that, because of
the uncertainty in the performance of gettimeofday() and friends,
attempt to figure out whether they can use the TSC, and do so if they
think it will work. Magenheimer suggests that perhaps that information could
be made available:
But the kernel doesn't expose a "gettimeofday
performance sucks" flag either. If it did (or in the case of
the patch, if tsc_reliable is zero) the application could at least
choose to turn off the 10000-100000 timestamps/second and log
a message saying "you are running on old hardware so you get
fewer features".
Magenheimer also wonders if the kernel developers are suffering from "hot
stove" syndrome, in that they have been burned in the past and are reluctant to
even consider changes. But Gleixner and van de Ven both point out that
there is no hardware that can make the guarantees that Magenheimer wants.
And Gleixner has the burn marks to prove it:
I'm unfortunately forced to deal with the 500+
different variants of borked timers and that makes me very reluctant
to believe anything what chip/board/bios vendors promise. It's not the
one time hot stove experience, it's the constant exposure to the never
ending supply of hot stoves, which makes me nervous.
While the discussion had various interesting analogies including hanging
ropes/knives and condoms versus abstention, it did not (yet) find a car
analogy. It did, however, seem to find some common ground that information
about whether the clock calls are implemented as vsyscalls or system calls
should be exported. That is unlikely to satisfy those who have been
"using vsyscalls for a while and still have a performance headache", whom
Magenheimer quotes, but there is nothing stopping
applications from reading the TSC directly. Those applications just have
to be prepared to handle any strange TSC behavior they encounter.
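What "being prepared" might mean in practice is sketched below. This is purely illustrative and not taken from the discussion: struct counter_delta and checked_counter_delta() are hypothetical names, and the two sanity checks (the counter must not run backward, and the interval must not exceed a caller-supplied bound) are assumptions about what an application could reasonably test. The function takes two raw counter readings as plain integers, so it says nothing about how those readings were obtained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of comparing two raw counter readings. */
struct counter_delta {
	bool valid;		/* false if the readings look untrustworthy */
	uint64_t cycles;	/* elapsed count; meaningful only if valid */
};

/*
 * Compute later - earlier, rejecting readings that went backward (as can
 * happen when the two reads ran on cores with different counter offsets)
 * or that jumped further than the caller considers plausible.
 */
static struct counter_delta checked_counter_delta(uint64_t earlier,
						  uint64_t later,
						  uint64_t max_plausible)
{
	struct counter_delta d = { false, 0 };

	assert(max_plausible > 0);
	if (later < earlier)
		return d;			/* counter appears to run backward */
	if (later - earlier > max_plausible)
		return d;			/* implausibly large jump */
	d.valid = true;
	d.cycles = later - earlier;
	return d;
}
```

On an invalid result, the application would discard the sample or fall back to a system-provided clock. Checks like these catch only gross misbehavior; a counter that drifts slightly or changes rate would pass unnoticed, which is part of the kernel developers' objection.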
Ingo Molnar tries to clarify the reasons
that the kernel can't export the reliability information: "The point is for the kernel to not be complicit in
practices that are technically not reliable.
[...]
So the kernel wont 'signal' that something is safe to
use if it is not safe to use."
But he also sees some reason to hope:
You could win the argument by coming up with a patch
that changes gettimeofday to make use of the TSC in a
reliable manner.
I really mean it - and it might be possible - but we
have not found it yet.
Peter Zijlstra has another solution to the problem. He would like to see
the kernel move to eventually disable RDTSC from user space. By
emulating the instruction and logging all uses of it (and the related
RDTSCP), user-space programs that use it could be identified and changed:
Once we get most of userspace running fine, we can switch it to
generating faults.
Of course closed source stuff will have to deal with it themselves, but
who cares about that anyway ;-)
Exporting the information about whether gettimeofday() is "slow"
or not seems like a reasonable starting
point. No patches to do that have emerged yet, but it is a fairly
straightforward thing to do. Eventually, something like Gleixner's
vget_tsc_raw() may also come about, though it won't satisfy those
who are unhappy with the current vsyscall performance. Those applications
will just have to read the TSC themselves and deal with whatever the
hardware throws at them.
By Jonathan Corbet
May 18, 2010
When LWN last looked at suspend
blockers in April, it appeared that this functionality was on a path to
be merged into the mainline sometime soon. It may still be on that
path, but an extended discussion has muddied the picture somewhat. It is a
relatively small and obscure bit of code, but the fate of suspend blockers
may have significant implications for how the kernel community deals with
external projects in the future.
Suspend blockers, remember, are tied to the "opportunistic suspend" mode
used by the Android system. In this mode, the kernel is placed into a sort
of controlled narcolepsy; it will fall asleep (suspend the system) just
about anytime that somebody is not actively prodding it. A suspend blocker
is a form of prod which can be used to keep the system awake while some
sort of important processing is going on. As long as there are suspend
blockers outstanding, the system will not suspend.
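The rule in that last sentence can be modeled in a few lines. The sketch below is a toy user-space model, not the Android suspend blocker API: struct mock_system, struct mock_blocker, and the mock_*() functions are all hypothetical names, and the use of a simple count of active blockers is an assumption made for illustration. It shows only the decision itself: suspend is permitted exactly when no blocker is active.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the system-wide state: how many blockers are active. */
struct mock_system {
	unsigned int active_blockers;
};

/* Toy model of one blocker, such as a driver might hold. */
struct mock_blocker {
	bool active;
};

/* Mark a blocker active; calling this twice in a row has no extra effect. */
static void mock_block_suspend(struct mock_system *sys, struct mock_blocker *b)
{
	if (!b->active) {
		b->active = true;
		sys->active_blockers++;
	}
}

/* Release a blocker; releasing an inactive blocker is a no-op. */
static void mock_unblock_suspend(struct mock_system *sys,
				 struct mock_blocker *b)
{
	if (b->active) {
		assert(sys->active_blockers > 0);
		b->active = false;
		sys->active_blockers--;
	}
}

/* Opportunistic suspend may proceed only when nothing is holding it off. */
static bool mock_may_suspend(const struct mock_system *sys)
{
	return sys->active_blockers == 0;
}
```

Note that running processes do not appear anywhere in mock_may_suspend(); in this model, as in the mode described above, computation that is not covered by a blocker does not keep the system awake.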
There are two aspects to this approach which sit well with the Android
developers. One is that they are able to get better power performance
(longer battery life) by suspending the entire system whenever nothing is
going on. Using normal runtime power management does not give them the
same results. The other key point is that opportunistic suspend can happen
even when processes are running in user space. In the absence of a suspend
blocker, any computation underway is not considered to be important enough
to keep the system awake. This behavior is a form of defense against
poorly-written applications which might, otherwise, drain a system's
battery in a short period of time.
Suspend blockers have few enthusiastic supporters. Opportunistic suspend
seems like a bit of a hack, and the need to put suspend blocker calls into
drivers looks invasive. Even if this feature is configured out of most
kernels, it looks like it could be a maintenance burden going forward.
Even so, most of the developers involved - including almost everybody
involved with Linux power management - have concluded that nobody has any
better ideas. So, as Matthew Garrett put it:
Look, I don't want to sound like I have a one-track mind or
anything, but all of these arguments would be significantly more
compelling if someone would actually provide a concrete
implementation proposal that deals with the set of use-cases that
Google's implementation does and which doesn't make anyone cry.
Otherwise the immeasurably most likely outcome is that this code
gets merged and we get to live with it.
Requests for alternatives have been posted a number of times in this
discussion, but actual proposals have been rather rare. A number of the
suspend blocker opponents seem more interested in changing the use case -
mandating that all Android applications be well written, for example. The
problem is that users will blame the device (rather than that new dog
whistle application) if its battery fails to last long enough. So the
Android developers must choose between somehow forcing good behavior on all
application developers (perhaps losing the "open to all" feature that is at
the core of the Android way of doing things) or creating a system robust
enough to function with non-ideal applications installed.
The Android developers have taken the latter approach. In the process,
they have made suspend blockers a key part of their platform. Many of the
drivers which have been developed for Android have suspend blocker support
built into them; they cannot be merged in their current form if the suspend
blocker API is not available for them. So the current alternatives are to
keep those drivers out, or to hack out the suspend blocker usage before
merging them. In the former case, we have more out-of-tree code; in the
latter case, we have in-tree drivers which are not actually used, tested,
or maintained by anybody. Neither alternative looks good.
Merging suspend blockers would make it easier to get much of the rest of
this code in; as Android developer Brian Swetland said:
With wakelock support in the kernel, I'm able to maintain drivers
that (provided they meet the normal style, correctness, etc
requirements) that both can be submitted to mainline (yay!) and can
ship on production hardware as-is (yay!). Porting other linux
based environments to hardware like G1, N1, etc becomes that much
easier too, which hopefully makes various folks happy.
This helps get us ever closer to being able to build a
production-ready kernel for various android devices "out of the
box" from the mainline tree and gets me ever closer to not being in
the business of maintaining a bunch of SoC-specific
android-something-2.6.# trees, which seriously is not a business I
particularly want to be in.
("Wakelocks" is the old name for suspend blockers.)
Google and Android have taken a lot of grief for their failure to work with
upstream and get their code into the mainline kernel. There can be no
doubt that their code could have been handled better; had the Android
developers worked with the kernel community before shipping this
functionality in millions of handsets, perhaps much of this trouble could
have been avoided. But that history cannot be rewritten now; not even the
secret git "plumbing" commands can make that happen. But we can try to
improve the situation going forward.
The suspend blocker effort looks like a real attempt to do better. The
code has not just been posted; it has been through rather more than the
usual number of revisions as its developers have put considerable time into
trying to address comments which have been made. A failure to merge it
would be demoralizing at best. If the development community refuses this
attempt to bring Android and the mainline closer, it risks creating an
impression of bad faith. If we do not accept their code, we really
should not complain about them maintaining it outside of the mainline.
As of this writing, the 2.6.35 merge window is open. What will happen with
suspend blockers is anybody's guess. The power management developers are
in favor of merging it, but some others have made a fair amount of noise
and Linus has not made his feelings known. So it is hard to say whether
this long story is about to come to a close or not.
Page editor: Jonathan Corbet