The current development kernel is 3.6-rc1, announced on August 2. "As usual, even the shortlog is too big
to usefully post, but there's the usual breakdown: about two thirds of
the changes are drivers (with the CSR driver from the staging tree
being a big chunk of the noise - christ, that thing is big and wordy
even after some of the crapectomy).
Of the non-driver portion, a bit over a third is arch (arm, x86, tile,
mips, powerpc, m68k), and the rest is a fairly even split among fs,
include file noise, networking, and just 'rest'." See the summary
below for what was merged after last week's update.
Stable updates: The 3.2.25 and 3.2.26 kernels were released on August 3 and
August 5 respectively. The 3.2.27, 3.4.8, 3.0.40,
and 3.5.1 stable reviews are underway as of
this writing; those kernels can be expected on or after August 9.
Comments (none posted)
Trust me: every problem in computer science may be solved by an
indirection, but those indirections are *expensive*. Pointer
chasing is just about the most expensive thing you can do on modern
CPUs.
— Linus Torvalds
When the GNU OS concept started the idea that everyone would have a
Unix capable system on their desk was pretty hard to imagine. The
choice of a Mach based microkernel was both in keeping with a lot
of the research of the time and also had a social element. The
vision was a machine where any user could for example implement
their own personal file system without interfering with other
users. Viewed in the modern PC world that sounds loopy but on a
shared multi-user computer it was an important aspect of software
freedom.
Sticking to Mach and being hostile to Linux wasn't very smart and a
lot of developers have not forgiven the FSF for that, which is one
reason they find the "GNU/Linux" label deeply insulting.
The other screw up was that they turned down the use of UZI, which
would have given them a working if basic v7 Unix equivalent OS
years before Linux was released. Had they done that Linux would
never have happened and probably the great Windows battle would
have been much more fascinating.
— History lessons from Alan Cox
Comments (2 posted)
Linus closed the 3.6 merge window on August
2, a couple of days earlier
than would have normally been expected. There were evidently two reasons
for that: a desire to send a message to those who turn in their pull
requests on the last day of the merge window, and his upcoming vacation.
In the end, he only pulled a little over 300 changes since the previous
merge window summary, with the result that 8,587 changes were pulled in
the 3.6 merge window as a whole.
Those 300+ changes included the following:
- The block I/O bandwidth controller has been reworked so that each
control group has its own request list, rather than working from a
single, global list. This increases the memory footprint of block I/O
control groups, but makes them function in a manner much closer to the
original intention when lots of requests are in flight.
- A set of restrictions on the creation of
hard and soft links has been added in an attempt to improve
security; they should eliminate a lot of temporary file
vulnerabilities.
- The device mapper dm-raid module now supports RAID10 (a combination of
striping and mirroring).
- The list of new hardware support in 3.6 now includes OMAP DMA engines.
- The filesystem freeze functionality has been reimplemented to be more
robust; in-tree filesystems have been updated to use the new mechanism.
The process of stabilizing all of those changes now begins; if the usual
patterns hold, the final 3.6 kernel can be expected sometime in the second
half of September.
Comments (3 posted)
Kernel development news
It is not uncommon for software projects — free or otherwise — to include a
set of tests intended to detect regressions before they create problems for
users. The kernel lacks such a set of tests. There are some good reasons
for this; most kernel problems tend to be associated with a specific device
or controller and nobody has anything close to a complete set of relevant
hardware. So the kernel depends heavily on early testers to find
problems. The development process is also, in the form of the stable
trees, designed to collect fixes for problems found after a release and to
get them to users quickly.
Still, there are places where more formalized regression testing could be
helpful. Your editor has, over the years, heard a large number of
presentations given by large "enterprise" users of Linux. Many of them
expressed the same complaint: they upgrade to a new kernel (often skipping
several intermediate versions) and find that the performance of their
workloads drops considerably. Somewhere over the course of a year or so of
kernel development, something got slower and nobody noticed. Finding
performance regressions can be hard; they often only show up in workloads
that do not exist except behind several layers of obsessive corporate
firewalls. But the fact that there is relatively little testing for such
regressions going on cannot help.
Recently, Mel Gorman ran an extensive set of benchmarks on a set of
machines and posted the results. He found some interesting things that
tell us about the types of performance problems that future kernel users
may encounter. His results include a set of scheduler tests,
consisting of the "starve," "hackbench," "pipetest," and "lmbench"
benchmarks. On an Intel Core i7-based system, the results were generally
quite good; he noted a regression in 3.0 that was subsequently fixed, and a
regression in 3.4 that still exists, but, for the most part, the kernel has
held up well (and even improved) for this particular set of benchmarks. At
least, until one looks at
the results for other processors. On a Pentium 4 system, various
regressions came in late in the 2.6.x days, and things got a bit worse
again through 3.3. On an AMD Phenom II system, numerous regressions
have shown up in various 3.x kernels, with the result that performance as a
whole is worse than it was back in 2.6.32.
Mel has a hypothesis for why things may be happening this way: core kernel
developers tend to have access to the newest, fanciest processors and are
using those systems for their testing. So the code naturally ends up being
optimized for those processors, at the expense of the older systems.
Arguably that is exactly what should be happening; kernel developers are
working on code to run on tomorrow's systems, so that's where their focus
should be. But users may not get flashy new hardware quite so quickly;
they would undoubtedly appreciate it if their existing systems did not get
slower with newer kernels.
He ran the sysbench tool on
three different filesystems: ext3, ext4, and xfs. All of them showed some regressions over
time, with the 3.1 and 3.2 kernels showing especially bad swapping
performance. Thereafter, things started to improve, with the
developers' focus on fixing writeback problems almost certainly being a
part of that solution. But ext3 is still showing a lot of regressions,
while ext4 and xfs have gotten a lot better. The ext3 filesystem is
supposed to be in maintenance mode, so it's not surprising that it isn't
advancing much. But there are a lot of deployed ext3 systems out there;
until their owners feel confident in switching to ext4, it would be good if
ext3 performance did not get worse over time.
Another test is designed to determine how
well the kernel does at satisfying high-order allocation requests (being
requests for multiple, physically-contiguous pages). The result here is
that the kernel did OK and was steadily getting better—until the 3.4
release. Mel says:
This correlates with the removal of lumpy reclaim which compaction
indirectly depended upon. This strongly indicates that enough
memory is not being reclaimed for compaction to make forward
progress or compaction is being disabled routinely due to failed
attempts at compaction.
On the other hand, the test does well on idle systems, so the
anti-fragmentation logic seems to be working as intended.
Quite a few other test results have been posted as well; many of them show
regressions creeping into the kernel in the last two years or so of
development. In a sense, that is a discouraging result; nobody wants to
see the performance of the system getting worse over time. On the other
hand, identifying a problem is the first step toward fixing it; with
specific metrics showing the regressions and when they first showed up,
developers should be able to jump in and start fixing things. Then,
perhaps, by the time those large users move to newer kernels, these
particular problems will have been dealt with.
That is an optimistic view, though, that is somewhat belied by the minimal
response to most of Mel's results on the mailing lists. One gets the sense
that most developers are not paying a lot of attention to these results,
but perhaps that is a wrong impression. Possibly developers are far too
busy tracking down the causes of the regressions to be chattering on the
mailing lists. If so, the results should become apparent in future
kernel releases.
Developers can also run these tests themselves; Mel has released the whole
set under the name MMTests. If this test
suite continues to advance, and if developers actually use it, the kernel
should, with any luck at all, see fewer core performance regressions in the
future. That should make users of all systems, large or small, happier.
Comments (40 posted)
A data structure implementation that is more or less replicated in 50 or
more places in the kernel seems like some nice low-hanging fruit to pick.
That is just what Sasha Levin is trying to do with his generic hash table patch set. It implements a
simple fixed-size hash table and starts the process of changing various
existing hash table implementations to use this new infrastructure.
The interface to Levin's hash table is fairly straightforward. The API is
defined in linux/hashtable.h and one declares a hash table as follows:
DEFINE_HASHTABLE(name, bits);
This creates a table with the given name and a power-of-2 size based on
bits. The table is implemented using buckets containing a kernel
struct hlist_head type. It implements a chaining hash, where hash
collisions are simply added to the head of the hlist.
One then calls:
hash_init(name, bits);
to initialize the buckets.
Once that's done, a structure containing a struct hlist_node
can be constructed to hold the data to be inserted, which is done with:
hash_add(name, bits, node, key);
where node is a pointer to the hlist_node and key is the key that is hashed
into the table. There are also two mechanisms to iterate over the table.
The first iterates through the entire hash table, returning the entries in
each bucket:
hash_for_each(name, bits, bkt, node, obj, member)
The second returns only the entries that correspond to the key's hash
bucket:
hash_for_each_possible(name, obj, bits, node, member, key)
In each case, obj is the type of the underlying data, node
is a struct hlist_node pointer to use as a loop cursor, and member
is the name of the struct hlist_node member in the stored data type.
In addition, hash_for_each() takes an additional
integer loop cursor, bkt, to track the current bucket. Beyond that,
one can remove an entry from the table with:
hash_del(node);
Levin has also converted six different hash table uses in the kernel as
examples in the patch set. While the code savings aren't huge (a net loss
of 16 lines), they could be reasonably significant after converting the 50+
different fixed-size hash tables that Levin found in the kernel. There is also the
obvious advantage of restricting all of the hash table implementation bugs
to one place.
There has been a fair amount of discussion of the patches over the three
revisions that Levin has posted so far. Much of it concerned
implementation details, but there was another more global concern as
well. Eric W. Biederman was not convinced
that replacing the existing simple hash tables was desirable:
For a trivial hash table I don't know if the abstraction is worth it.
For a hash table that starts off small and grows as big as you need it
the [incentive] to use a hash table abstraction seems a lot stronger.
But, Linus Torvalds disagreed. He
mentioned that he had been "playing around" with a directory
cache (dcache) patch that uses a fixed-size hash table as an L1 cache for
directory entries that provided a noticeable performance boost. If a
lookup in that
first hash table fails, the code then falls back to the existing
dynamically sized hash table. The reason that the code hasn't been
committed yet is because
"filling of the
small L1 hash is racy for me right now" and he has not yet found a
lockless and race-free way to do so. So:
[...] what I really wanted to bring up was the fact that static hash
tables of a fixed size are really quite noticeably faster. So I would
say that Sasha's patch to make *that* case easy actually sounds nice,
rather than making some more complicated case that is fundamentally
slower and more complicated.
Torvalds posted his patch after a request
from Josh Triplett. The
race condition is "almost entirely theoretical", he said, so
it could be used to generate some preliminary performance numbers. Beyond
just using the small fixed-sized table, Torvalds's patch also circumvents
any chaining; if the hash bucket doesn't contain the entry, the second
cache is consulted. By avoiding "pointer
chasing", the L1 dcache "really improved performance".
Torvalds's dcache work is, of course, something of an aside in terms of Levin's patches, but
several kernel developers seemed favorably inclined toward consolidating
the various kernel hash table implementations.
Biederman was unimpressed with the
conversion of the UID cache in the user namespace code and Nacked it. On
the other hand, Mathieu Desnoyers had only minor comments on the
conversion of the tracepoint hash table, and Eric Dumazet had mostly
stylistic comments on the conversion of the 9p
error table. There are several other
maintainers who have not yet weighed in, but so far most of the reaction
has been positive. Levin is trying to attract more reviews by converting a
few subsystems, as he notes in the patch.
It is still a fair amount of work to convert the other 40+ implementations,
but the conversion seems fairly straightforward. But, Biederman's reaction to
the conversion of the namespace code is something to note: "I don't
have the time for a new improved better hash table that makes
the code buggier." Levin will need to prove that his implementation
works well, and that the conversions don't introduce regressions, before there
is any chance that we will see it in the mainline. There is no reason that all
hash tables need to be converted before that happens—though it might
make it more likely to go in.
Comments (21 posted)
Here is another in our series of articles with questions posed to a kernel
developer. If you have questions
about technical or procedural things involving Linux kernel
development, ask them in the comment section, or email them directly to
the author. This time, we look at UEFI booting, real-time kernels, driver
configuration, and building kernels.
I’d like to follow a mailing list on UEFI-booting-related topics but don’t
seem to find any specific subsystem in the MAINTAINERS file, would you please
share some pointers?
Because of the wide range of topics involved in UEFI booting, there is
no "one specific" mailing list where you can track just the UEFI issues.
I recommend filtering the fast-moving linux-kernel mailing list, as
most of the topics that kernel developers discuss cross that list.
As the kernel isn't directly involved in UEFI, there is no one specific
"maintainer" of this area at the moment. That being said, there are
lots of different people working on this task right now.
From the kernel side itself, there has been some wonderful work from
Matt Fleming and other Intel developers, in making it so that the
kernel can be built as an image that is bootable from EFI directly. There were some recent
patches that went into the 3.6-rc1 kernel that have made it
easier for bootloaders to load the kernel in EFI mode. See the patch
for the details about how this is done, but note that some bootloader
work is also needed to take advantage of this.
From the "secure boot" UEFI mode side, James Bottomley, chair of the
Technical Advisory Board of the Linux Foundation (and kernel SCSI
subsystem maintainer), has been working through a lot of the "how do you
get a distribution to boot in secure mode" effort and documenting it all for
all distributions to use. He's published his results,
with code; I also recommend reading his previous blog posts about this topic for
more information about the subject and how it pertains to Linux.
As for distribution-specific work, both Canonical and Red Hat have been
working with the UEFI Forum to help make Linux work properly on
UEFI-enabled machines. I recommend asking those companies about how they
plan to handle this issue, on their respective mailing lists, if you are
interested in finding out what they are planning to do. Other distributions
are aware of the issue, but as of this point in time, I do not believe
they are working with the UEFI Forum.
I am evaluating Linux for use as an operating system in a real-time embedded
application; however, I find it hard to find recent data with respect
to the real-time performance of Linux.
Do you have, or know of someone who has, information on the real-time
performance of the Linux kernel, preferably under various load
conditions?
I get this type of question a lot, in various forms. The
very simple answer is: "No, there is no data, you should evaluate it
yourself on your hardware platform, with your system loads, to determine
if it meets your requirements." And in reality, that's what you should
be doing in the first place even if there were "numbers" published
anywhere. Don't trust a vendor, or a project, to know exactly how you
are going to be using the operating system. Only you know best, so only
you know how to determine if it solves your problem or not.
So, go forth, download the code, run it, and see if it works. It's
really that simple.
Note, if it doesn't work for you, let the developers know about it. If
they don't know about any problems, then they can't fix them.
What is the best way to get configuration data into a driver? (This is
paraphrased from many different questions all asking almost the same
thing.)
In the past (i.e. 10+ years ago), lots of developers used module
parameters in order to pass configuration options into a driver to
control a device. That started to break down very quickly when multiple
devices of the same type were in the same system, as there isn't a
simple way to use module parameters for this.
When the sysfs filesystem was created, lots of developers started using
it to help configure devices, as the individual devices controlled by a
single driver are much easier to see and write values to. This works
today, for simple sets of configuration options (such as calibrating an
input device). But, for more complex types of configurations, the best
thing to use is configfs (kernel
documentation, LWN article), which was written specifically for this task.
It handles ways to tie configurations to sysfs devices easily, and
handles notifying drivers when things have been changed by the user. At
this point in time, I strongly recommend using that interface for any
reasonably complex configuration task that a driver or subsystem might
need.
What is a good, fast and reliable way to compile a custom kernel
for a system? In the past, people have used lspci,
lsusb, and others, combined with the old autokernelconf
tool, but that can be difficult, is there a better way?
As Linus pointed out a few weeks ago,
configuring a kernel is getting more and
more complex, with different options being needed by different distributions.
The simplest way I have found to get a custom kernel up and running on a
machine is to take a distribution-built kernel that you know works, and then
use the "make localmodconfig" build option.
To use this option, first boot the distribution kernel, and plug in any
devices that you expect to use on the system, which will load the kernel
drivers for them. Then go into your kernel source directory, and run "make localmodconfig". That option will dig through your system and find the
kernel configuration for the running kernel (which is
usually at /proc/config.gz, but can sometimes be located in the boot
partition, depending on the distribution). Then, the script will remove
all options for kernel modules that are not currently loaded, stripping
down the number of drivers that will be built significantly. The
resulting configuration file will be written to the .config file, and
then you can build the kernel and install it as normal. The time to
build this stripped-down kernel should be very short, compared to the
full configuration that the distribution provides.
Comments (10 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jake Edge