The current development kernel is 2.6.32-rc4, released on October 11. It
has lots of small fixes and a pair of new SCSI drivers. The short-form changelog
is in the announcement, or see the full changelog for all the details.
2.6.32-rc5 is expected on October 15, immediately prior to Linus's
travel to Tokyo for the 2009 Kernel Summit.
The current stable kernel is 2.6.31.4, released (along with 2.6.27.37) on October 12.
These updates contain another set of important fixes for these kernels; see this summary by Andy Whitcroft
for a bit more information on the changes in 2.6.31.4.
Comments (none posted)
That driver is _not_ "just a driver". It's something
more. Something dank and smelly, that has grown in dark and...
-- Linus Torvalds
Again, you're living in that dream world. Wake up, sheeple.
BIOS writers write crap, because it's a crap job. It's that
simple. Yes, they're probably drunk or drugged up, but they need it
to deal with the hand they have been dealt....
So stop blaming the BIOS. We _know_ firmware is crap - there is no
point in blaming it. The response to "firmware bug" should be "oh,
of course - and our code was too fragile, since it didn't take that
into account".
And stop saying these problems would magically go away with
open-source firmware. That just shows that you don't understand the
realities of the situation. Even an open-source bios would end up
having buggy tables, and even with an opensource bios, users
generally wouldn't upgrade it.
-- Linus Torvalds
Any time people do ad-hoc locking with "clever" schemes, it's
almost invariably buggy. So the rule is: just don't do that.
-- Linus Torvalds
Comments (21 posted)
Static analysis tools can bring great value to the development process;
they often find bugs which escape review and which, potentially, can live
in the code base for years. Linux has benefited from bug reports from
Coverity's tools, but those tools are proprietary. Unfortunately, free
static analysis tools have always lagged the proprietary alternatives.
That won't change overnight, but there is a new contender on the block in
the form of Stanse; the 1.0 version
was recently announced on the
kernel mailing list. Specific problems that Stanse can test for include
locking errors, memory leaks, failure to check for memory allocation
failures, non-atomic operations in atomic context, and some reference
counting errors. A list of
kernel bugs found by Stanse has been posted.
Clearly, it would be nice to extend Stanse with more tests. Many kernel
developers may balk at doing that, though; Stanse is a Java application,
and checker rules must be written in XML. That limits rule additions to
those who are both familiar with kernel code and able to work in a Java/XML
environment.
Comments (9 posted)
The 2009 Kernel Summit will be held October 19 and 20 in Tokyo, Japan,
immediately prior to the Japan
Linux Symposium. This will be the first time that the Summit has been
held in Asia. If nothing else, the sight of that many kernel hackers
running loose in Akihabara should be amusing.
The agenda for the event has been posted; as usual, it gives an insight
into the kinds of problems which are seen to be pressing at this time.
Following the tradition of the last few years, the Summit is spending a
relatively small amount of time on specific technical issues; that kind of
problem is usually best solved on the mailing lists and with code. What
face-to-face meetings are often best for, instead, is process-oriented
discussion.
The agenda this time contains a panel consisting of (unnamed, thus far) end
users from both the embedded and enterprise communities. Enterprise
representatives have been fairly common participants at these meetings, but
the presence of the embedded user community is new. With any luck, this
panel will encourage the trend whereby embedded systems vendors are
participating more in the development process. On the second day, instead,
the Summit will hear from a user not normally associated with embedded
systems: there will be a session on Google's use of Linux and problems
which have been encountered.
Another process-oriented session is the perennial "regressions and kernel
quality" topic. A separate session looks at performance regressions in
particular; it's likely to follow up on a similar discussion held during
the kernel developers' panel
at LinuxCon. There are also sessions on how linux-next and the staging tree
work, and an open session on improving the development process.
On the technical side, the summit begins with summary reports from a number
of recently-held mini-summits. Perf events and tracing occupy a
significant chunk of time; some of that will be dedicated to a
demonstration of what can be done with perf, ftrace, and timechart. There
will be discussions on expanding the use of the device tree abstraction to
other architectures, improving generic architecture support, and the
merging of the remaining realtime preemption patches. The "hacking hour,"
introduced last year, is back; there has been a suggestion that the topic
this year could be big kernel lock elimination.
As usual, LWN editor Jonathan Corbet will be there to report on the
discussion. Reports will be posted as soon as they are available; stay tuned.
Comments (1 posted)
As a general rule, all new features are supposed to be added to the kernel
during the two-week merge window. There is an exception of sorts, though,
for new device drivers. A well-written driver should not be able to cause
regressions anywhere else in the kernel, and there is often value in
getting it to users as quickly as possible. So drivers will often make it
into the mainline when other large changes are barred.
As the story of the recent SCSI fixes pull
request shows, though, there are limits. This request included a pair
of new drivers for high-end SCSI storage systems. Linus got grumpy for a
few reasons: he would like to see subsystem maintainers try harder to get
drivers in during the merge window, he thinks that the "driver exception"
is mainly useful for consumer-level devices, and the driver in question
here is no small bit of code - it's a 50,000-line monster. In the end,
the driver was merged for 2.6.32-rc4, but Linus made it clear that he would
rather see this kind of code during the merge window.
The conversation drifted into whether the driver should have gone into the
staging tree instead; those who looked at it did not describe it as the
best code they had seen that day. SCSI maintainer James Bottomley sees the staging tree mainly as the place
where user-space ABI issues are cleaned up. Mere code quality issues, he
believes, are better handled directly in the SCSI tree. Others disagree;
in the end, it will come down to what specific subsystem maintainers want
to do. If the maintainer takes a new driver directly into the subsystem
tree, nobody else can force it into staging instead.
The discussion brought out another potential use for the staging tree - as
a last resting place for old drivers on their way out of the
kernel. Staging maintainer Greg Kroah-Hartman noted:
It seems that I'm the only one that has the ability to drop drivers
out of the kernel tree, which is a funny situation :)
In thinking about this a lot more, I don't really mind it. If
people want to push stuff out of "real" places in the kernel, into
drivers/staging/ and give the original authors and maintainers
notice about what is going on, _and_ provide a TODO file for what
needs to happen to get the code back into the main portion of the
kernel tree, then I'll be happy to help out with this and manage it.
The idea remains hypothetical, though, until somebody actually uses the
staging tree in this manner. It is hard to imagine a demotion to staging
that would not be resisted by somebody; the first attempt to do so may well
be interesting to watch.
Comments (3 posted)
One of the longstanding quirks of BSD-inspired networking is that network
interfaces are a strange sort of device. They live in their own namespace,
do not appear in /dev, and generally fail to live up to the
"everything is a file" idea that drives much of the POSIX API. That said,
the Unix way of networking has functioned well for nearly 30 years. It is
likely that few people were expecting a serious patch which tries to change
that.
This patch from "Narendra K"
at Dell does exactly that, though, and in surprising ways. With this patch
in place, every network interface gets an associated char device. It is a
singularly useless device: any attempt to open it just returns
ENOSYS. The only real reason for this device's existence, it
turns out, is to generate events for udev which, in turn, can generate
alternative names for the interface.
Why this change? System vendors and administrators are getting tired of
their network interfaces changing name at each boot. Non-deterministic
interface naming is the result of a few factors, including weird BIOS
behavior and the way current kernels enumerate devices via a parallel
hot-plug approach. When interfaces change names, configuration scripts can
get confused; the end result is rarely a working network. Some of this
confusion can be avoided by carefully configuring interfaces based on their
MAC address, but that, too, can fail in the face of the "swap out the entire
server" approach to fast failure recovery.
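In current practice, stable names of this sort are assigned with udev rules keyed on the MAC address; a minimal sketch follows (the rules-file name is the conventional one, but the MAC address and interface name are made-up examples):

```
# /etc/udev/rules.d/70-persistent-net.rules (example; MAC address is made up)
# Rename the interface with this hardware address to "lan0" at hotplug time.
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:1e:c9:aa:bb:cc", NAME="lan0"
```

This is exactly the scheme that breaks when the whole server - and, with it, the MAC address - is swapped out.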
Vendors have tried to work around some of these difficulties at the
hardware level. Dell machines are designed to enumerate network interfaces
in the same order as often as possible. HP blade servers can configure
interface MAC addresses at power-on time. But there are limits to how many
hardware hacks the vendors are willing to add to deal with this problem.
This message from Matt Domsch is
recommended reading for anybody wanting a full understanding of the problem.
Thus the patch, which allows udev to create pseudo-names for each interface
based on criteria like the PCI slot number, chassis label, or anything else
that seems to make sense. The patch is tied to the libnetdevname
library, which maps these pseudo-names into the real interface name,
which can then be used with the socket system calls.
The patch has gotten a bit of a rough reception; it looks to some like a
hack for problems which can be solved in other ways. The discussion has
made it clear that a real problem exists, though, so some sort of solution
will likely be applied in the end. Udev seems like the place for this
solution to happen - that is how naming has been handled for every other
device in the system, after all. So expect something to get in eventually,
though the current patch may evolve somewhat before it is considered ready for
merging.
Comments (21 posted)
Kernel development news
Once upon a time, Linux was limited to less than 1GB of physical memory on
32-bit systems. This limit was imposed by two technical decisions:
processes run with the same page tables in both kernel and user mode, and
all physical memory had to be directly addressable by the kernel. Not
changing page tables at every transition between kernel and user space is a
significant performance win, but it forces the two modes to share the same
4GB address space. The directly-addressable requirement meant that total
physical memory could not exceed the amount of virtual memory address space
assigned to the kernel. Indeed, not even the full kernel space was
available, due to the need to leave some space for I/O memory and so on.
The normal split is 3GB for user space and 1GB for kernel space; that
limited systems to a bit less than 1GB of installed memory.
The way this problem was fixed was to create the concept of "high memory":
memory which is not directly addressable by the kernel. Most of the time,
the kernel does not need to directly manipulate much of the memory on the
system; almost all user-space pages, for example, are usually only accessed in user
mode. But, occasionally, the kernel must be able to
reach into any page in the system. Zeroing new pages is one example;
reading system call arguments from a user-space page is another. Since
high-memory pages cannot live permanently in the kernel's virtual address
space, the kernel needs a mechanism by which it can temporarily create a
kernel-space address for specific high-memory pages.
That mechanism is called kmap(); it takes a pointer to a
struct page and returns a kernel-space virtual address for the
page. When the kernel is done with the page, it must use kunmap()
to unmap the page and make the address available for other mappings.
kmap() works, but it can be slow; it requires translation
lookaside buffer flushes and, potentially, cross-CPU interrupts for every
mapping. Linus recently commented on the
costs of high memory:
HIGHMEM accesses really are very slow. You don't see that in user
space, but I really have seen 25% performance differences between
non-highmem builds and CONFIG_HIGHMEM4G enabled for things that
try to put a lot of data in highmem (and the 64G one is even more
expensive). And that was just with 2GB of RAM.
All that costly work is done to keep the kernel-space mapping
consistent across all processors in the system, even though many of these
mappings are used only briefly, and only on a single CPU.
To improve performance, the kernel developers introduced a special version:
void *kmap_atomic(struct page *page, enum km_type idx);
[Sidebar: atomic kmap slots]
This function differs from kmap()
in some important ways. It only
creates a mapping on the current CPU, so there is no need to bother other
processors with it. It also creates the mapping using one of a very small
set of kernel-space addresses. The caller must specify which address to
use by way of the idx
argument; these addresses are specified by a
set of "slot" constants. For example, KM_USER0 and KM_USER1
are set aside for code called directly from user context
- system call implementations, generally. KM_PTE0
is used for
page table operations, KM_SOFTIRQ0
is used in software interrupt
mode, etc. There are about twenty of these slots defined in current
kernels; see the list at the right for the 2.6.32 slots.
The use of fixed slots requires that the code using these mappings be
atomic - hence the name kmap_atomic(). If code holding an atomic
kmap could be preempted, the thread which takes its place could use the
same slots, with unfortunate results. The per-CPU nature of atomic
mappings means that any cross-CPU migration would be disastrous.
It's worth noting that there is no
other protection against multiple use of specific slots; if two functions
in a given call chain disagree about the use of KM_USER0, bad
things are going to happen. In practice, this problem does not seem to
actually bite people, though.
This API has seen little change for years, but Peter Zijlstra has recently decided
that it could use a face lift. The result is a patch series changing this
fundamental interface and fixing the resulting compilation problems throughout
the tree.
The change is conceptually simple: the slots disappear, and the range of
addresses is managed as a stack instead. After all, users of
kmap_atomic() don't really care about which address they get; they
just want an address that nobody else is using. The new API does force
map and unmap operations to nest properly, but the atomic nature of these
mappings means that usage generally fits that pattern anyway.
There seems to be little question of this change being merged; Linus welcomed it, saying "I think this is how
we should have done it originally." There were some quibbles about
the naming in the first version of the patch (kmap_atomic() had
become kmap_atomic_push()), but that was easily fixed for the
second version.
It is also interesting to look at how this patch series was reworked. The
first version was a single patch which did all of the changes at once. In
response to reviewers, Peter broke the second version down into four steps:
- Make sure that all atomic kmaps are created and destroyed in a
strictly nested manner. There were a few places in the code where
that did not happen; fixing it was usually just a matter of reordering
a couple of kunmap_atomic() calls.
- Switch to the stack-based mode without changing the
kmap_atomic() prototype. So, after this patch,
kmap_atomic() simply ignores the idx argument.
- The kmap_atomic() prototype loses the idx argument;
this is, by far, the largest patch of the series.
- Various final details are fixed up.
Doing things this way will make it a lot easier to debug any strange
problems which result from the changes. The most significant change in
terms of how the kernel works is step 2, so that's the patch which is
most likely to create problems. But this organization makes that patch
relatively small, so tracking down any residual bugs should be relatively
easy. Meanwhile, the really huge patch (part 3) should not really
change the binary kernel at all, so the chances of it being problem-free
are quite high.
All that remains is getting this change merged. It's too late for 2.6.32,
but putting it into linux-next is likely to create large numbers of
patch conflicts. That is a common problem with wide-ranging patches like
this, though; developers have gotten better over the years at maintaining
them against a rapidly-changing kernel.
Comments (13 posted)
One in a series of columns in which questions are asked of a kernel
developer and he tries to answer them. If you have unanswered questions
relating to technical or procedural things around Linux kernel
development, ask them in the comment section, or email them directly.
How do I open an effective communication channel with a kernel developer
to get my issues fixed?
Despite the size of most kernel subsystem maintainers' inboxes, this is a
question that comes up a lot in conversations with users, so it is good
to get it out there.
The easiest way to communicate with a kernel developer about a problem
is to write an email and send it to the subsystem list that handles the
area in which you are having problems, and to copy the developers as
well to make sure that they see the message.
Ah, but how do you figure out what subsystem or mailing list to use?
Luckily the kernel contains a list of the mailing lists and the
developers responsible for the different kernel subsystems. The file
MAINTAINERS, in the Linux kernel source tree, lists all of the
different subsystems, each maintainer's name and email address, and
the mailing list that is the best place to bring things up. If there
is no mailing list specified, then use the default linux-kernel mailing
list.
If you can narrow the problem down to a specific file, the script
scripts/get_maintainer.pl in the kernel source tree can automatically
find the people who most recently changed it, along with the relevant
maintainer and mailing lists.
For example, suppose you have a problem with the ftdi_sio driver, which
is located in drivers/usb/serial/ftdi_sio.c. By
running the get_maintainer.pl script with the -f
option, you would get the following:
$ scripts/get_maintainer.pl -f drivers/usb/serial/ftdi_sio.c
Greg Kroah-Hartman <firstname.lastname@example.org>
Alan Cox <email@example.com>
Make sure you always send a copy to a development mailing list; do not just
email kernel developers privately, as their email load is quite high. By
emailing the mailing list, you offer up the ability for anyone to help
you out with your question - taking advantage of the large development
community - and you avoid overloading the individual maintainers any more
than they are already overloaded.
What happens if I get no response from my email?
Be persistent. If you do not hear back within a week, send a friendly
"did you miss this email?" type response.
In the BSD world, there is a "security officer." Why is there no
"security officer" for the Linux kernel?
It is true that there is no one person responsible for security for the Linux
kernel; instead, a group of developers has taken on this role. The
email address security@kernel.org goes directly to this group of
developers, who will quickly respond to any reported problems.
Instructions on how to contact this list, and the rules around which
they operate concerning disclosure and amount of time before publicly
fixing the problem, can be found in the Linux kernel file
Documentation/SecurityBugs. If anyone has any questions
about these rules, feel free to contact the security team for
clarification.
Do you look at the code of the BSDs in order to find new ideas and
concepts, or do you ignore them completely?
This is a personal decision about where to find ideas to implement
in Linux. As far as I am concerned, I have not looked at the BSDs in
many, many years, as I have been busy with lots of Linux-only things
(driver model, USB, Linux Driver Project, etc.). But other kernel
developers do work with the BSD developers on coming up with solutions
to different problems, or to get proper hardware support for various
types of devices.
Back in the early days of USB support in Linux, I did work with a number
of the BSD USB kernel developers to share how specific devices operated
so that drivers could be written for both operating systems, and
overall, the developers are quite friendly toward each other, as we are
working toward solving the same types of problems, but usually in
different ways.
Comments (2 posted)
Much of the realtime scheduling work in Linux has been based around getting
the best behavior out of the POSIX realtime scheduling classes. Techniques
like priority inheritance, for example, exist to ensure that the
highest-priority task really can run within a bounded period of time. In
much of the rest of the world, though, priorities and POSIX realtime are no
longer seen as the best way to solve the problem. Instead, the realtime
community likes to talk about "deadlines" and deadline-oriented
scheduling. In this article, we'll look at a deadline scheduler which was
recently posted for review, along with related discussion at the recent Real
Time Linux Workshop in Dresden.
Priority-based realtime scheduling has the advantage of being fully
deterministic - the highest-priority task always runs. But priority-based
scheduling is subject to some unpleasant failure modes (priority inversion
and starvation, for example), does not really isolate tasks running on the
same system, and is often not the best way to describe the problem. Most
tasks are more readily described in terms of an amount of work which must
be accomplished within a specific time period; the desire to work in those
terms has led to a lot of research in deadline-based scheduling in recent
years.
A deadline system does away with static priorities. Instead, each running
task provides a set of three scheduling parameters:
- A deadline - when the work must be completed.
- An execution period - how often the work must be performed.
- The worst-case execution time (WCET) - the maximum amount of CPU
time which will be required to get the work done.
Deadline-scheduled tasks usually recur on a regular basis - thus the period
parameter - but sporadic work can also be handled with this model.
There are some advantages to this model. The "bandwidth" requirement of a
process - what percentage of a CPU it needs - is easily calculated, so the
scheduler knows at the outset whether the system is oversubscribed or not.
The scheduler can (and should) refuse to accept tasks which would require
more bandwidth than the system has available.
By refusing excess work, the scheduler
will always be able to provide the requisite CPU time to every process
within the specified deadline. That kind of promise makes realtime
application design much easier.
Linux currently has no deadline scheduler. There is, however, an implementation posted for
review by Dario Faggioli and others; Dario also presented this
scheduler in Dresden. This implementation uses the "earliest deadline first"
(EDF) algorithm, which is based on a simple concept: the process with the earliest
deadline will be the first to run. Essentially, EDF attempts to ensure that
every process begins executing by its deadline, not that it actually
gets all of its work done by then. Since EDF runs work as early as
possible, most tasks should complete well ahead of their declared
deadlines.
This scheduler is implemented with the creation of a new scheduling class
called SCHED_EDF. It does away with the distinction between the
"deadline" and "period" parameters, using a single time period for both.
The patch places this class between the existing
realtime classes (SCHED_FIFO and SCHED_RR) and the normal
interactive scheduling class (SCHED_FAIR). The idea behind this
placement was to avoid breaking the "highest priority always runs" promise
provided by the POSIX realtime classes. Peter Zijlstra, though, thinks that deadline scheduling should run at
the highest priority; otherwise it cannot ensure that the deadlines will be
met. That placement could be seen as violating POSIX requirements; to
that, Peter responds, "In short, sod POSIX."
Peter would also like to name the scheduler SCHED_DEADLINE, for
the simple reason that EDF is not the only deadline algorithm out there.
In the future, it may be desirable to switch to a different algorithm
without forcing applications to change which scheduling class they
request. At the moment, the other contender would appear to be "least
laxity first" scheduling, which picks the task with the smallest amount of
"cushion" time between its remaining compute time and its deadline. Least
laxity first tries to ensure that each process can complete its computing
by the deadline. It tends to suffer from much higher context-switching
rates than EDF, though, and nobody is pushing such a scheduler for Linux at
this point.
One nice feature of deadline schedulers is that no process should be able
to prevent another from completing its work before its deadline passes. The
real world is messier
than that, as we will see below, but, even in the absence of deeper
problems, the scheduler can only make that guarantee if every process
actually stops running within its declared WCET. The EDF scheduler solves
that problem in an unsubtle way: when a process exceeds its bandwidth, it
is simply pushed out of the CPU until its next deadline period begins.
This approach is simple to implement and ensures that deadlines will be
met, but it can be hard on a process which must do a bit of extra computing
once in a while.
In the SCHED_EDF patch, processes indicate the end of their
processing period by calling sched_yield(). This modification to
the semantics of that system call makes some developers uneasy, though; it
is likely that the final patch will do something different. There may be a
new "I'm done for now" system call added for this purpose.
Peter also gave a talk in Dresden; his was mostly about why Linux does not
have a deadline
scheduler yet. The "what happens when a process exceeds its WCET" problem
was one of the reasons he gave. Calculating the
worst-case execution time is exceedingly difficult for any sort of
non-trivial program. As Peter puts it, researchers have spent their entire
lives trying to solve it. There are people working on automatically
deriving WCET from the source, but they are far from being able to do this
with real-world systems. So, for now, specification of the WCET comes down
to empirical observations and guesswork.
Another serious problem with EDF is that it works much better on
single-processor systems than on SMP systems. True EDF on a multiprocessor
system requires the maintenance of a global run queue, with all of the
scalability problems that entails. One solution is to partition SMP
systems, so that each CPU becomes an essentially independent scheduling domain;
the SCHED_EDF patch works this way. Partitioned systems have their own
problems, of course; the assignment of tasks to CPUs can be a pain, and it
is hard (or impossible) to get full utilization if tasks cannot move
between CPUs.
Another problem with partitioning is that some scheduling problems simply
cannot be solved without occasional process migration. Imagine a two-CPU
system running three processes, each of which needs 60% of a single CPU's
time. The system clearly has the resources to run those three processes,
but not if it is unable to move processes between CPUs. So a partitioned
EDF scheduler needs to be able to migrate processes occasionally; the
SCHED_EDF developers have migration logic in the works, but it has
not yet been posted.
Yet another serious problem, according to Peter, is priority inversion. The
priority inheritance techniques used to solve priority inversion are tied
to priorities; it is not clear how to apply them to deadline schedulers.
But the problem is real: imagine a process acquiring an important lock,
then being preempted or forced out because it has exceeded its WCET. That
could then block the execution of otherwise runnable processes with urgent
deadlines.
There are a few ways to approach this issue. Simplest, perhaps, is
deadline inheritance: lock owners inherit the earliest deadline in the
system for as long as they hold the lock. More sophisticated is bandwidth
inheritance; in this case, a lock owner which has exhausted its WCET will
receive a "donation" of time from the process(es) blocked on that lock. A
variant of that technique is proxy execution: blocked processes are left on the run
queue, but, when they "run," the lock owner runs in their place. Proxy
execution gets tricky in SMP environments when multiple processes are
blocked on the same lock; the result could be multiple CPUs trying to
proxy-execute the same process. The solution to that problem appears to be
to migrate blocked processes to the owner's CPU.
Proxy execution also runs into difficulties when the lock-owning process is
blocked for I/O. In that case, it cannot run as a proxy for the original
blocked task, which must then be taken off the run queue. That, in turn,
forces the creation of a "wait list" of processes which must be returned to
a runnable state when a different process (the lock owner) becomes
runnable. Needless to say, all this logic adds complexity and increases
overhead.
The final problem, according to Peter, is POSIX, but it's an easy one to
solve. Since POSIX is silent on the topic of deadline schedulers, we can
do anything we want and life is good. He repeated that
SCHED_DEADLINE will probably be placed above SCHED_FIFO
in priority. There will be a new system call -
sched_setscheduler_ex() - to enable processes to request the
deadline scheduler and set the parameters accordingly; the
SCHED_EDF patch already implements that call. So many of the
pieces for deadline scheduling for Linux are in place, but a number of the
details are yet to be resolved.
The bottom line is that deadline schedulers in the real world are a
non-trivial problem - something that is true of real-world scheduling in
general. These problems should be solvable, though, and Linux should be
able to support a deadline scheduler at some point in the future. That
scheduler will probably make its first appearance in the realtime tree,
naturally, but it could eventually find users well beyond the realtime
community. Deadline schedulers are a fairly natural fit for periodic tasks
like the management of streaming media, which could
profitably make use of deadline scheduling to help eliminate jitter and
dropped-data problems. But that remains a little while in the future;
first, the code must be made ready for widespread use. And that, as we all
know, is a process which recognizes few deadlines.
Comments (40 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet