The current 2.6 development kernel remains 2.6.30-rc3; no new
prepatches have been released over the last week. Changes continue to flow
into the mainline repository; they are almost all fixes, but there was also
the restoration of Tux as the kernel mascot.
The current stable 2.6 kernel is 2.6.29.2, released on April 27 with
about 100 patches. "There are a lot of fixes in this release
touching all over the tree. At least a few have possible security impact
(e.g. af_rose, agp, capability fs_mask, splice/ocfs2). As usual, you're
encouraged to upgrade."
Comments (none posted)
Kernel development news
Hey, it was only meant to be a single release. Now they can all
die as far as I'm concerned.
-- Linus Torvalds, on the commit that returns Tux to the kernel
In days of old in 2.6.29, netfilter did locketh using a
lock of the reader kind when doing its table business, and do
a writer when with pen in hand like a overworked accountant
did replace the tables. This sucketh and caused the single
lock to fly back and forth like a poor errant boy.
But then netfilter was blessed with RCU and the performance
was divine, but alas there were those that suffered for
trying to replace their many rules one at a time.
So now RCU must be vanquished from the scene, and better
chastity belts be placed upon this valuable asset most dear.
The locks that were but one are now replaced by one per suitor.
The repair was made after much discussion involving
Eric the wise, and Linus the foul. With flowers springing
up amid the thorns some peace has finally prevailed and
all is soothed.
-- Stephen Hemminger
First of all, I happily admit that wrt locking I'm a barbarian, and
proud of it. I.e. simpler locking scheme beats theoretical
improvement, unless we have really good evidence that there's a
real-world problem. All things equal, complexity loses. All
things not quite equal - ditto.
-- Al Viro
Comments (3 posted)
The changes that are being made for a faster-booting Linux have generally
been welcomed, but when they lead to an apparent regression, complaints
will be heard. That situation arose recently when Jeff Garzik reported a
regression that caused one of his systems to no longer boot. Because of
some changes made to the way USB initializes, the system no longer
recognized his disks in time to mount the root filesystem from them. As it
turns out, the problem is not limited to disks, nor is it new; it is a
longstanding race condition that previously was being "won" by most
hardware, but that same hardware is often losing the race now.
Garzik had bisected the problem to a particular commit
made back in September of 2008. Instead of sleeping for 100ms as part of
the initialization of each USB root hub, the new code uses the delayed work
mechanism to schedule the next initialization step 100ms in the future.
For kernels which had the USB code built-in, this would allow the boot
thread to do other work, rather than block waiting for these delays. It
had a rather positive impact on boot speed: Arjan van de Ven was very
pleased to see that patch author Alan Stern's change shaved 700ms off
his computer's boot time. Since his total boot time is on the order
of two seconds, the improvement is considerable.
From Garzik's perspective, the problem is that this system booted
successfully with every kernel version until 2.6.28. The immediate
suggestion was to use the rootdelay kernel boot option which will
delay the boot process for the given number of seconds before trying to
mount the root filesystem. That did not sit very well with Garzik, and he
asked: "When did regressions become
an acceptable tradeoff for speed?"
As it turns out, Garzik had just been "lucky" before; he could have run
into this problem on earlier kernels with different hardware, as Greg
Kroah-Hartman points out: "What
happens when you buy a new box with more USB host controllers and a
faster processor? Same problem." The underlying issue is specific
to USB, as the old initialization waited 100ms per USB bus (i.e. root hub)
synchronously, so a system with five hubs would effectively wait 500ms for
the first to initialize and enumerate the devices attached. The new code
does those same initializations in parallel.
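The difference can be simulated in a few lines of Python. This is a conceptual sketch only: the 100ms figure and the five-hub example come from the discussion above, while the function names and use of threads (standing in for the kernel's delayed work mechanism) are invented for illustration.

```python
import threading
import time

HUB_INIT_DELAY = 0.1  # the 100ms settle time per root hub

def init_hub(hub_ready):
    time.sleep(HUB_INIT_DELAY)   # stand-in for the hardware settling
    hub_ready.set()

def serial_init(n_hubs):
    # Old behavior: each root hub blocks the boot thread for 100ms,
    # so the delays accumulate.
    start = time.monotonic()
    for _ in range(n_hubs):
        ev = threading.Event()
        init_hub(ev)
    return time.monotonic() - start

def parallel_init(n_hubs):
    # New behavior: every hub's delay is scheduled at once, so the
    # waits overlap instead of adding up.
    start = time.monotonic()
    events = [threading.Event() for _ in range(n_hubs)]
    threads = [threading.Thread(target=init_hub, args=(ev,))
               for ev in events]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start
```

With five hubs, the serial version takes roughly 500ms while the parallel version finishes in a little over 100ms, matching the arithmetic above.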
While it is relatively uncommon to have USB root filesystems, it is far
from unheard of. Embedded systems are a fairly likely candidate, due to
cost and form factor issues, as Alan Cox explained. Multiple distributions also have
support for running completely from a USB device, typically a USB flash
drive.
But, as Garzik and others point out, users that upgrade their kernels (or
distributions who do so), but don't add in a rootdelay option,
risk having systems that cannot boot. USB is fundamentally different than
other buses, however, because there is no way to know when the enumeration
of devices on a particular hub has been completed. Mark Lord questioned the explanation, noting:
"SATA drives also take variable amounts of time to 'show up' at
boot." But as Arjan van de Ven explained, there is a significant difference:
the difference is that with sata you know when you are done and have all
possible drives. Not so much with USB. So with SATA we can, and do,
wait for the scan to complete at the right point in the boot.
It turns out that the same problem in a slightly different guise shows up
for embedded devices that use USB consoles. David
VomLehn has been working on a patch to wait
for USB consoles to become available. Because embedded devices often have
USB consoles, but only for development and debugging, a long delay waiting
for a console that is unlikely to show up in the majority of cases is
undesirable. But, because it is impossible to know that all USB devices
have reported in, some kind of delay is inevitable. VomLehn's mechanism
would delay up until a timeout specified in the kernel boot parameters,
but, unlike rootdelay, would wake up early as soon as a console
device was detected.
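The mechanism amounts to a bounded wait with an early wakeup; that idea can be sketched as follows. This is a hypothetical simulation, not VomLehn's actual patch, and the names are invented.

```python
import threading
import time

console_registered = threading.Event()

def wait_for_usb_console(timeout):
    # Wait at most `timeout` seconds, but return immediately once a
    # console device registers -- unlike rootdelay, which always
    # sleeps for the full interval.
    return console_registered.wait(timeout)

# Simulate a console that enumerates 50ms into boot.
threading.Timer(0.05, console_registered.set).start()

start = time.monotonic()
found = wait_for_usb_console(timeout=5.0)
elapsed = time.monotonic() - start
```

When the console shows up early, the wait ends early; when no console ever appears, the boot pays only the configured timeout.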
As VomLehn notes, the problem goes even
further than that, affecting USB network devices needed at boot time as
well. Discussion on
various versions of his patch also pointed out that similar problems exist
for other buses. As boot parallelization gets better—and more
pervasive—more of these kinds
of problems are going to be discovered. A more general solution for devices
required at boot time needs to be
found, as van de Ven describes:
For other pieces it's hard. Non-enumeratable busses just suck;
at some point all you can do is just wait (which we have already
available today for anyone to do). I realize people don't want to
just wait 4 seconds (the people who first objected to boot time
improvements then suddenly care about boot time ;-)...
For root fs there's some options, and I have patches to basically retry
on fail. (The patches have a bug and I don't have time to solve it this
week, so I'm not submitting them)
For other devices it is hard. Realistically we need hotplug to work
well enough so that when a device shows up, we can just hook it up
then.
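Van de Ven's retry-on-fail idea for the root filesystem amounts to a bounded retry loop. A sketch of the concept follows; his actual patches are not shown here, and these names and parameters are invented.

```python
import time

def mount_root_with_retry(try_mount, timeout=10.0, interval=0.5):
    # Instead of attempting the root mount once and giving up, keep
    # retrying until it succeeds or the deadline passes. A device that
    # enumerates late costs only the time it actually needs, rather
    # than a fixed worst-case rootdelay.
    deadline = time.monotonic() + timeout
    while True:
        if try_mount():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A device that appears on the third scan, for example, delays the boot by two retry intervals rather than a full fixed timeout.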
So far, the problems have just been identified and discussed. Workarounds
like rootdelay have been mentioned, but that only "solves" one facet
of the problem. Distributions are, or will be, shipping 2.6.29 kernels in
their upcoming releases; one hopes they have already dealt with the issue,
or there may be a number of rather puzzled users with systems that don't
boot. It would seem important to address the problems, at least for USB
storage, as part of 2.6.31.
Comments (14 posted)
As has been well publicized by now, the Linux kernel lacks the sort of
tracing features which can be found in certain other Unix-like kernels.
That gap is not the result of a want of trying. In the past,
developers trying to put tracing infrastructure into the kernel
have often run into a number of challenges, including opposition from their
colleagues who do not see the
value of that infrastructure and resent its perceived
overhead. More recently, it would seem that the increased
interest in tracing has helped developers to overcome some of those
objections; an ongoing
discussion shows, though, that concerns about tracing are still alive and
have the potential to derail the addition of tracing facilities to the
mainline.
Sun's DTrace is famously a dynamic tracing facility, meaning that it can be
used to insert tracepoints at (almost) any location in the kernel. But the
Solaris kernel also comes with an extensive and
well-documented set of static tracepoints which can be activated by
name. These tracepoints have been placed at carefully-considered locations
which facilitate investigations into what the kernel is actually doing.
Many real-world DTrace scripts need only the static tracepoints and do no
dynamic tracepoint insertion at all.
There is clear value in these static tracepoints. They represent the
wisdom of the developers who (presumably) are the most familiar with each
kernel subsystem. System administrators can use them to extract a great
deal of useful information without having to know the code in question.
Properly-placed static tracepoints bring a significant amount of transparency to
the kernel. As tracing capabilities in Linux improve, developers naturally
want to provide a similar set of static tracepoints. The fact that static
tracing is reasonably well supported (via Ftrace) in mainline kernels -
with more extensive support available via SystemTap and LTTng - also
encourages the creation of static tracepoints. As a result, there have
been recent patches adding tracepoints to workqueues and some core memory management
functions, among others.
Digression: static tracepoints
As an aside, it's worth looking at the form these tracepoints take; the
design of Linux tracepoints gives a perspective on the problems they were
intended to solve. As an example, consider the following
tracepoint for the memory management code, which reports on
page allocations. The declaration of the tracepoint looks like this:
    TRACE_EVENT(mm_page_allocation,

        TP_PROTO(unsigned long pfn, unsigned long free),

        TP_ARGS(pfn, free),

        TP_STRUCT__entry(
            __field(unsigned long, pfn)
            __field(unsigned long, free)
        ),

        TP_fast_assign(
            __entry->pfn = pfn;
            __entry->free = free;
        ),

        TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)
    );
That seems like a lot of boilerplate for what is, in a sense, a switchable
printk() call. But, naturally, there is a reason for each piece.
The TRACE_EVENT() macro declares a tracepoint - this one is called
mm_page_allocation - but does not yet instantiate it in the code.
The tracepoint has arguments which are passed
to it at its actual instantiation (which we'll get to below); they are
declared fully in the TP_PROTO() macro and named in the
TP_ARGS() macro. Essentially, TP_PROTO() provides a
function prototype for the tracepoint, while TP_ARGS() looks like
a call to that tracepoint.
These values are enough to let the programmer
place a tracepoint in the code with a line like:

    trace_mm_page_allocation(pfn, free);
This tracepoint is really just a known point in the code which can have, at run
time, one or more function pointers stored into it by in-kernel tracing
utilities like SystemTap or
Ftrace. When the tracepoint is enabled, any functions stored there will be called with
the given arguments. In this case, enabling the tracepoint will result in
calls whenever a page is allocated; those calls will receive the page frame
number of the allocated page and the number of free pages remaining as
arguments.
As can be seen in the declaration above, there's more to the tracepoint
than those arguments; the rest of
the information in the tracepoint declaration is used by the Ftrace
subsystem. Ftrace has a couple of
seemingly conflicting goals; it wants to be able to quickly enable
human-readable output from a tracepoint with no external tools, but the
Ftrace developers also want to be able to export trace data from the kernel
quickly, without the overhead of encoding it first. And that's where the
remaining arguments to TRACE_EVENT() come in.
When properly defined (the magic exists in a bunch of header files under
kernel/trace), TP_STRUCT__entry() adds extra fields to
the structure which represents the tracepoint; those fields should be
capable of holding the binary parameters associated with the tracepoint.
The TP_fast_assign() macro provides the code needed to copy the
relevant data into that structure. That data can, with some changes merged
for 2.6.30, be exported directly to user space in binary format. But, if
the user just wants to see formatted information, the TP_printk() macro
gives the format string and arguments needed to make that happen.
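The two export paths can be illustrated in miniature. This is a Python sketch; the real Ftrace record layout is more involved, and the packing of the two fields as native unsigned longs here is an assumption.

```python
import struct

# Binary path: TP_fast_assign stores raw values into a record; two
# native unsigned longs stand in for the __field() declarations above.
record = struct.Struct("@LL")

def export_binary(pfn, free):
    # Cheap: no string formatting happens at trace time.
    return record.pack(pfn, free)

def export_text(pfn, free):
    # Human-readable path, analogous to the TP_printk() format string.
    return "pfn=%x zone_free=%d" % (pfn, free)
```

The binary record can be shipped to user space as-is and decoded later, while the formatted string is what a user sees when reading the trace through debugfs.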
The end result is that defining a tracepoint takes a small amount of work,
but using it thereafter is relatively easy. With Ftrace, it's a simple
matter of accessing a couple of debugfs files. But other tools, including
LTTng and SystemTap, are also able to make use of these tracepoints.
Given all the talk about tracing in recent years, there is clearly demand
for this sort of facility in the kernel. So one might think that adding
tracepoints would be uncontroversial. But, naturally enough, it's not that
simple.
The first objection that usually arises has to do with the performance
impact of tracepoints, which are often placed in the most
performance-critical code paths in the kernel. That is, after all, where
the real action happens. So adding an unconditional function call to
implement a tracepoint is out of the question; even putting an if
test around it is problematic. After literally years of work, the
developers came up with a scheme
involving run-time code patching that
reduces the performance cost of an inactive tracepoint to, for all
practical purposes, zero. Even the most performance-conscious developers
have stopped fretting about this particular issue. But, of course, there
are other concerns.
A tracepoint exists to make specific kernel information available to user
space. So, in some real sense, it becomes part of the kernel ABI. As an
ABI feature, a tracepoint becomes set in stone once it's shipped in a
stable kernel. There is no universal agreement on the immutability of
kernel tracepoints, but the simple fact is that, once these tracepoints
become established and prove their usefulness, changing them will cause
user-space tracing tools to break. That means that, even if tracepoints
are not seen as a stable ABI the way system calls are, there will still be
considerable resistance to changing them.
Keeping tracepoints stable when the code around them changes will be a
challenge. A substantial subset of the developer community will probably
never use those tracepoints, so they will tend to be unaware of them and
will not notice when they break. But even a developer who is trying to
keep tracepoints stable is going to run into trouble when the code evolves
to the point that the original tracepoint no longer makes sense. One can
imagine all kinds of cruft being added so that a set of tracepoints gives
the illusion of a very different set of decisions than is being made in a
future kernel; one can also imagine the hostile reception any such code
would receive.
The maintenance burden associated with tracepoints is the reason behind
Andrew Morton's opposition to their addition. With regard to the workqueue
tracepoints, Andrew said:
If someone wants to get down and optimise our use of workqueues
then good for them, but that exercise doesn't require the permanent
addition of large amounts of code to the kernel.
The same amount of additional code and additional churn could be
added to probably tens of core kernel subsystems, but what _point_
is there to all this? Who is using it, what problems are they
solving?
We keep on adding all these fancy debug gizmos to the core kernel
which look like they will be used by one person, once. If that!
Needless to say, the tracing developers see the code as being more widely
useful than that. Frederic Weisbecker gave a
detailed description of the sort of debugging which can be done with
the workqueue tracepoints. Ingo Molnar's response appears to be an attempt to hold up
the addition of other types of kernel instrumentation until the tracepoint
issue is resolved. Andrew remains unconvinced, though; it seems he
would rather see much of this work done with dynamic tracing tools
instead.
As of this writing, that's where things stand. If these tracepoints do not
get into the mainline, it is hard to see developers going out and creating
others in the future. So Linux could end up without a set of well-defined
static tracepoints for a long time yet - though it would not be surprising
to see the enterprise Linux vendors adding some to their own kernels. Perhaps
that is the outcome that the development community as a whole wants, but
it's not clear that this
feeling is universal at this time. If, instead, Linux is going to end up
with a reasonable set of tracepoints, the development community will need
to come to some sort of consensus on which kinds of tracing instrumentation
are acceptable.
Comments (25 posted)
Back in November, LWN examined
the KSM (kernel shared memory) patch. KSM was written to address a
problem encountered on systems running virtualized guests: such systems can
have large numbers of pages holding identical data, but the kernel has no
way to let guests share those pages. The KSM code scans through memory to
find pairs of pages holding identical data; when such pairs are found, they
are merged into a single page mapped into both locations. The pages are
marked copy-on-write, so the kernel will automatically separate them again
should one process modify its data.
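The merge-and-split behavior can be modeled in a few lines. This is a toy model only: real KSM operates on page tables and hardware write protection, and every name below is invented for illustration.

```python
class MergedPage:
    # One read-only page shared by several mappings.
    def __init__(self, data):
        self.data = bytes(data)

class Mapping:
    # A process's view of one page. Writes trigger a copy-on-write
    # break, giving the writer a private copy again.
    def __init__(self, shared):
        self.shared = shared
        self.private = None

    def read(self):
        return self.private if self.private is not None else self.shared.data

    def write(self, data):
        self.private = bytes(data)   # COW: detach from the shared page

def merge(a, b):
    # If two mappings hold identical data, point both at one shared
    # read-only page, as KSM does when it finds a duplicate pair.
    if a.read() == b.read():
        shared = MergedPage(a.read())
        a.shared = b.shared = shared
        a.private = b.private = None
        return True
    return False
```

After a merge, both mappings read the same page; a write by either one silently separates them again, so neither process can observe the sharing.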
There were some concerns about the intended purpose of this patch, but it
was soon made clear
that KSM can help the system to save significant amounts of memory. But
KSM was not merged as a result of two other problems. One of them,
discussed mostly behind closed doors, appears to be concerns about the use
of SHA1 hashes to compare pages. If an attacker could create hash
collisions, he might just be able to inject his own data (or code) into
processes belonging to other users and/or virtual machines. The other
problem had to do with a different kind of attacker: VMWare holds a patent to an
algorithm which looks quite similar to the method used by the early KSM
patches. There is evidence that this patent could be overturned on prior
art, but that is still a battle that nobody wants to fight.
KSM disappeared from view for a while after those issues came to light,
but, more recently, new versions of the KSM patches have been posted for review. A quick look
at the code makes it clear that both of these concerns have been addressed
- and, in fact, that the KSM developers were able to kill both birds with
the same stone. It's all a matter of doing away with the hashing of pages.
VMWare's U.S. Patent 6,789,156 is not exactly light reading; it has a full 74 claims. Most
of the independent claims have one thing in common, though: they include the
calculation of a hash value to find identical pages in the system. If
the KSM code were to avoid hashing pages, those claims of the patent would
clearly not read against it. And, as described above, the use of hashing
also created some security worries. So it must have seemed clear to the
KSM developers (and any lawyers they may have talked to) that the hash
needed to go.
The current KSM patches have replaced the hash table with two separate red-black trees. Pages tracked
by KSM are initially stored in the "unstable tree"; the term "unstable"
means that KSM considers their contents to be volatile. Placement in the
tree is determined by a simple memcmp() of the page's contents;
essentially, the page is treated as containing a huge number and sorted
accordingly. The unstable tree is suitable for finding pages with
duplicate contents; a relatively quick traversal of the tree will turn up
the only candidates.
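The ordering trick can be shown with ordinary byte strings. In this sketch a sorted list stands in for the red-black tree, and Python's bytes comparison gives the same lexicographic order that memcmp() would; the class and method names are invented.

```python
import bisect

class UnstableTree:
    # Pages ordered by their raw contents, as KSM's unstable tree
    # orders them with memcmp(); each page is treated as one huge
    # number and sorted accordingly.
    def __init__(self):
        self.pages = []

    def insert_or_find(self, page):
        # Look for a page with identical contents; if none exists,
        # insert this one and report that no duplicate was found.
        i = bisect.bisect_left(self.pages, page)
        if i < len(self.pages) and self.pages[i] == page:
            return self.pages[i]      # merge candidate found
        self.pages.insert(i, page)
        return None
```

A lookup walks down the ordering comparing contents at each step, so finding a duplicate and finding the insertion point are essentially the same operation, which is also why rebuilding the tree each cycle is cheap.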
It's worth noting that KSM does not place every page it scans in the
unstable tree. If the contents of a page change over the course of one
memory scanning cycle, the page will not really be a good candidate for
sharing anyway. So pages which are seen to change are not represented in
the unstable tree. The unstable tree is also dumped and rebuilt from the
beginning after each scan cycle. That deals with the problem of pages
which, as a result of modifications, find themselves in the wrong location
in the tree. The nature of red-black trees means that search and insertion
operations are almost the same thing, so there is little real cost to
rebuilding the unstable tree from the beginning every time.
The other pages which are not found in the unstable tree are those which
have actually been merged with duplicates. Since shared pages are marked
read-only, KSM knows that their contents cannot change. Those pages are
put into a separate "stable tree." The stable tree is also a red-black
tree, but, since pages cannot become misplaced there, it need not be
rebuilt regularly. Once a page goes into the stable tree, it stays there
until all users have either modified or unmapped it.
The resulting system clearly works. Dropping the hash may impose a cost in
the form of slightly higher CPU and memory use; there have been no
benchmarks posted which would show the difference. But there is no cost on
systems which do not use KSM at all, and, in any case, avoiding
the potential problems associated with using hash tables to identify pages
with identical contents will be worth the cost. At this point, comments on
the KSM code are mostly concerned with relatively small details. It could
well be that this code will be ready for inclusion into the 2.6.31 kernel.
(Postscript: above, your editor noted that "most" of the independent claims
in the VMWare patent required the use of a hashing mechanism. There are,
in fact, a few claims which lack that requirement, but they replace it with
one or two others. Some claims cover the use of copy-on-write pages, but
they all explicitly say that this technique is used on pages with a
"relatively high probability of impending modification." But there is
little point in sharing such pages at all; KSM, by leaving them out of the
unstable tree, ignores those pages entirely. The remaining claims describe
partitioning memory into "classes," which is not done in KSM.)
Comments (21 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
- Gregory Haskins: irqfd.
(April 24, 2009)
Benchmarks and bugs
Page editor: Jonathan Corbet