Brief items
The current 2.6 development kernel is 2.6.28-rc8,
released on December 10. 2.6.28-rc8 contains
another fairly long list of fixes, including some for fairly important
regressions.
Linus also notes (in the 2.6.28-rc8 announcement) that he's trying to
figure out whether to release 2.6.28 before or after the holidays. He asks
for suggestions, of which, one assumes, he will get plenty.
The current stable 2.6 kernel is 2.6.27.8, released on December 5.
This is a large update with fixes for a wide variety of problems.
Kernel development news
For some reason, the act of typing in some kerneldoc makes people's
brains turn off. Perhaps it's because "oh, I am supposed to type
some documentation here" instead of "gee, I think this code is
unclear, let's clarify that".
--
Andrew Morton
I personally don't like kerneldoc at all, because the truth is that
people will work on fixing bugs and other useful things before
keeping kerneldoc up to date.
And that's the basic fact which cannot be denied.
I wish it could work, but it doesn't across the board. So unless
we have dedicated monkeys scouring over every single patch that
goes into the tree and doing the necessary kerneldoc updates,
kerneldoc will be chronically wrong somewhere.
That leads to confusion and lost developer time. Because if the
kerneldoc bits are wrong, it's worthless.
--
David Miller
I expect better: You never see me hard with time word making
sentence coherent stuff. Ever.
--
Rusty Russell
As usual: You shall never rely on the source code comments, they
will only mislead you.
--
Manfred Spraul
The Kernel Miniconf at linux.conf.au next January is looking for a speaker
or two to fill out the schedule. "Presentations do not have to be limited
to a slide deck. If you have an idea for a 50-minute session that follows
a non-traditional format, it will be considered."
By Jonathan Corbet
December 9, 2008
It has been just over four years, now, since
the realtime discussion got
serious and the realtime preemption patch set got its start. During
that time, your editor has heard many predictions for when the bulk of the
realtime work would be merged; generally, the guess has been "within about
a year." While a lot of realtime work
has been merged, some of the
core components of the realtime tree remain outside of the mainline.
Beyond that, the realtime developers have been relatively quiet over the
last year - at least on the realtime front. Having taken on some little
side tasks - unifying the x86 architecture and maintaining it going
forward, for example
- some of those developers have been just a little bit distracted recently.
The realtime patch set has not gone away, though. If nothing else, the
fact that a number of distributors are shipping this code is enough to
ensure continued interest in its development. So your editor noted with
interest the recent announcement
of a new -rt tree with an updated set of realtime patches. This tree
will be of interest for anybody wanting to look at the realtime work in the
context of the 2.6.28 kernel or beyond.
One of the core technologies in the realtime tree is a change to how
spinlocks work. Spinlocks in the mainline will busy-wait until the
required lock becomes available; they thus occupy the processor to no
useful end when acquiring a contended lock. Holding a spinlock will also
prevent a thread from being preempted. This behavior is generally best for
system throughput; it also makes it easier to write correct code. But
anything which prevents a CPU from immediately servicing the
highest-priority process runs counter to the chief design goal of a
realtime operating system: providing deterministic response times in all
situations. So, for the realtime patches, classic spinlocks had to go.
The solution was to turn most spinlocks into a form of mutex with priority
inheritance. A process which attempts to acquire a contended "spinlock"
will no longer spin; instead, it goes to sleep and waits for the lock to
become free, making the processor available to another thread. Code which
holds one of these non-spinlocks is no longer immune to preemption; a
higher-priority thread can always push it out of the way. By changing
spinlocks in this way, the realtime hackers were able to eliminate one of
the largest sources of latency in the mainline kernel.
Much of that work found its way into the mainline some time ago in the form
of the mutex API, but spinlocks themselves have not been changed in the
mainline.
To minimize the pain of maintaining the realtime patches, the
developers simply redefined the spinlock_t type to be the new
mutex type instead. Except that, as it turns out, some spinlocks in
low-level parts of the kernel really do need to be spinlocks still. So
those were switched to a new raw_spinlock_t type - but without
changing the various spin_lock() calls. Instead, some truly
frightening macro trickery was introduced to cause the spinlock API to do
the right thing when passed either of two entirely different mutual
exclusion primitives. This bit of macro magic was always going to be an
impediment to mainline inclusion, so the realtime developers never really
expected to merge the lock code in that form.
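The macros themselves are not reproduced here. As a rough illustration of the general idea (one call name that resolves to a different implementation depending on the static type of its argument), here is a user-space sketch built on C11 generic selection. It is not the mechanism the realtime tree used; the names demo_raw_spinlock_t, demo_rt_mutex_t, and demo_spin_lock() are all hypothetical, and the "locks" only record which path was taken.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two lock types; not kernel definitions. */
typedef struct { int locked; int spins; } demo_raw_spinlock_t;
typedef struct { int locked; int sleeps; } demo_rt_mutex_t;

/* "True spinlock" path: a real implementation would busy-wait here. */
static void demo_raw_lock(demo_raw_spinlock_t *l)
{
    l->spins++;
    l->locked = 1;
}

/* "Sleeping lock" path: a real implementation would block the caller. */
static void demo_rt_lock(demo_rt_mutex_t *l)
{
    l->sleeps++;
    l->locked = 1;
}

/*
 * One spelling for all callers; the compiler selects the implementation
 * from the static type of the pointer argument.
 */
#define demo_spin_lock(l) _Generic((l),            \
        demo_raw_spinlock_t *: demo_raw_lock,      \
        demo_rt_mutex_t *:     demo_rt_lock)(l)

/* Returns 1 if each call was routed to the matching implementation. */
static int demo_dispatch_ok(void)
{
    demo_raw_spinlock_t raw = {0, 0};
    demo_rt_mutex_t rt = {0, 0};

    demo_spin_lock(&raw);   /* resolves to demo_raw_lock() */
    demo_spin_lock(&rt);    /* resolves to demo_rt_lock() */
    return raw.spins == 1 && rt.sleeps == 1 && raw.locked && rt.locked;
}
```

The call sites look identical, which is the attraction for an out-of-tree patch; the cost is that a reader can no longer tell from the call which kind of lock is being taken.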
The new realtime tree now shows how the realtime developers think this work
might get into the mainline. It involves a more explicit separation of the
two types of "spinlocks" - and a lot of code churn. In the realtime tree,
most locks of type spinlock_t are changed to a new lock_t
type. There is a new set of operations for this type:
    #include <linux/lock.h>

    lock_t lock;

    acquire_lock(&lock);
    release_lock(&lock);
For a normal, non-realtime kernel build, lock_t will be the same
as spinlock_t, and things will work as they always have. On
realtime kernels, instead, lock_t will be a mutex type. The other
variants of the spinlock API will be represented in the new API (there is
an acquire_lock_irqsave(), for example), but none of them will
actually disable interrupts in a realtime kernel. Meanwhile,
spinlock_t will remain a true spinlock type.
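The build-time switch described above can be modeled in user space. The sketch below borrows the lock_t, acquire_lock(), and release_lock() names from the realtime tree, but everything else is an assumption made for illustration: DEMO_REALTIME is a hypothetical stand-in for the kernel configuration option, DEMO_LOCK_INIT and bump_counter() are invented, and the realtime branch uses a plain POSIX mutex without the priority inheritance the real patches provide.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * DEMO_REALTIME is a hypothetical build switch standing in for the
 * kernel's realtime configuration option.
 */
#ifdef DEMO_REALTIME
#include <pthread.h>
/* Realtime build: lock_t is a sleeping lock (priority inheritance omitted). */
typedef pthread_mutex_t lock_t;
#define DEMO_LOCK_INIT PTHREAD_MUTEX_INITIALIZER
static void acquire_lock(lock_t *l) { pthread_mutex_lock(l); }
static void release_lock(lock_t *l) { pthread_mutex_unlock(l); }
#else
/* Normal build: lock_t busy-waits, as a spinlock always has. */
typedef atomic_flag lock_t;
#define DEMO_LOCK_INIT ATOMIC_FLAG_INIT
static void acquire_lock(lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ; /* spin until the holder clears the flag */
}
static void release_lock(lock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}
#endif

static lock_t counter_lock = DEMO_LOCK_INIT;
static int counter;

/* The call site is identical in both builds. */
static int bump_counter(void)
{
    int value;

    acquire_lock(&counter_lock);
    value = ++counter;
    release_lock(&counter_lock);
    return value;
}
```

Building with -DDEMO_REALTIME swaps the lock implementation without touching bump_counter(); that property is what the lock_t conversion is meant to provide, at the price of renaming the type and operations at every call site.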
This change gets rid of the tricky macros, but at the cost of changing the
declarations of and operations on almost all spinlocks in the kernel. That
is a lot of code changes: a quick grep turns up over 20,000
spin_lock*() calls in the upcoming 2.6.28 kernel. That will make
for some pain if and when this change is merged. But in the meantime, it
can only make for a lot of pain for the people who have to maintain
this patch out of tree. To make their lives a little easier, the realtime
developers have created a couple of scripts to do the bulk of the work.
First, all spinlocks in a pristine kernel are converted to lock_t,
then the few locks which truly must be spinlocks are switched back. This
work is kept in a separate branch which is regenerated when needed; in this
way, the realtime developers avoid the need to do nasty merges to keep up
with current kernels.
Your editor has heard talk of another locking change which does not, yet,
appear in this tree. One problem with the realtime patch set is that it
requires distributors to create yet another kernel build - something they
hate doing - if they want to
support realtime operation.
In an effort to make life easier for distributors, the
realtime developers are working on a scheme whereby a kernel would
determine at run time whether it should be running in a realtime mode. If
so, spinlocks will be changed to sleeping locks by patching the kernel
binary as it boots. Kernels built this way will be able to run efficiently
in either mode.
The branches of the realtime tree provide a quick guide to the other parts
of the realtime work which remain outside of the mainline. The threaded interrupt handler code
is one example; that change could be proposed (again) for merging in the
near future. The priority
workqueue mechanism sits in another branch, as do patches aimed at Java
support, filesystem changes, memory management changes, and more. Then,
there's a branch for stuff which will never be merged; for example, there
is this
patch which gives Java programs direct access to physical memory - not
something which strikes most kernel developers as a good idea. All told,
there is a great deal of work sitting in the realtime patch set; this work
is finally being organized into a proper git tree.
The "upstream first" policy says that vendors should merge their code
upstream before shipping it to customers. The 2.6.x development model is
built on the idea that no change is too fundamental to be accepted into a
regular, 3-month development cycle. The realtime patches would appear to be
an exception to both rules. It has taken over four years to get to a point
where some of the fundamental realtime technologies are close to ready for the mainline,
but distributors have been shipping that code for at least three of those years.
It has, in other words, been one of the biggest forks of the Linux kernel,
ever. The plan has always been to join this fork back with the mainline,
though; perhaps, finally, that goal is getting closer. With luck, it will
happen within about a year.
By Jake Edge
December 10, 2008
The Linux boot process, at least as provided by distributions,
depends on help from user space, with
drivers being loaded as required from the initial filesystem (initramfs/initrd).
Loading drivers requires using tools built into the initramfs, and
if those tools break, the kernel won't boot. But when a working kernel
configuration and initramfs are used with a new kernel, the result
is expected to be a kernel that successfully boots. When that doesn't
happen, bugs are filed regarding kernel regressions but, as a recent
example shows, the actual problem may be elsewhere.
The original report was made in late
October, but no progress was made until Evgeniy Polyakov saw it again in early December. The symptom
was a kernel that hangs after printing:
    request_module: runaway loop modprobe char-major-5-1
four times on the console. Since nothing in the user space (initramfs)
or kernel configuration had changed, it seemed to clearly point to
something in the
kernel itself.
It turns out that the "runaway loop" message is meant to indicate that the
request_module() function has been invoked recursively. So in an
effort to load the driver for the character device with major/minor numbers
5/1—which corresponds to
/dev/console—request_module() was invoked again.
The code in kernel/kmod.c:
    if (atomic_read(&kmod_concurrent) > max_modprobes) {
        /* We may be blaming an innocent here, but unlikely */
        if (kmod_loop_msg++ < 5)
            printk(KERN_ERR
                   "request_module: runaway loop modprobe %s\n",
                   module_name);
        atomic_dec(&kmod_concurrent);
        return -ENOMEM;
    }
only allows that message to be printed five times, but the invoker should
recognize the ENOMEM return and handle it appropriately.
The root cause was that something in the kernel was trying to access
/dev/console before that device was registered in the kernel.
This led the kernel to try to load a module to handle
/dev/console, which will fail. Because of the failure, something
in the user-space
modprobe then tries to access /dev/console, presumably to
output an error message, which repeats the kernel module loading process.
And so on.
After that recurses enough to exceed the max_modprobes limit,
request_module() will produce the runaway loop message and return
ENOMEM, which should put a stop to the whole process.
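The intended behavior of that recursion limit can be modeled in a few lines of user-space C. This is only a sketch under stated assumptions: it assumes the concurrency counter is incremented on entry (the excerpt above does not show that step), it uses a fixed DEMO_MAX_MODPROBES where the kernel computes its own limit, and demo_run_helper() is a hypothetical stand-in for a modprobe which, on failure, touches the console and so triggers another module request. All demo_* names are invented.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define DEMO_MAX_MODPROBES 3    /* illustrative; the kernel computes its limit */

static atomic_int demo_concurrent;  /* helpers currently "in flight" */
static int demo_loop_msgs;          /* warnings that would have been printed */
static int demo_helper_runs;        /* times the user-space helper was started */

static int demo_request_module(const char *name);

/* Stand-in for modprobe: it fails, touches the console, and recurses. */
static int demo_run_helper(const char *name)
{
    demo_helper_runs++;
    return demo_request_module(name);
}

static int demo_request_module(const char *name)
{
    int ret;

    atomic_fetch_add(&demo_concurrent, 1);
    if (atomic_load(&demo_concurrent) > DEMO_MAX_MODPROBES) {
        if (demo_loop_msgs < 5)
            demo_loop_msgs++;   /* the kernel would printk() here */
        atomic_fetch_sub(&demo_concurrent, 1);
        return -ENOMEM;         /* break the chain */
    }
    ret = demo_run_helper(name);
    atomic_fetch_sub(&demo_concurrent, 1);
    return ret;
}
```

In this model the helper simply propagates the error, so a single call to demo_request_module() unwinds after three helper runs and one warning. A helper that ignored the error and retried would start the chain over again each time, which is one way a hang could persist even though the detector fires.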
In an acrimonious thread—and kernel bug
report—Alan Cox, Kay Sievers, and Polyakov tried to
determine where the problem came from and what to do about it. It didn't
help matters that they were using initramfs images from different
distributions, so they saw different behavior. Polyakov and Sievers were
using a Debian user space while Cox was using Fedora. Something in the
Debian version was continuing to try to open /dev/console even after getting
ENOMEM. This leads to an infinite loop, and thus a kernel hang.
Sievers eventually tracked it to the kernel cryptographic API:
It is caused by:
"modprobe cryptomgr" called from swapper[1]
This modprobe process does try to log an error, accesses /dev/console,
which is not initialized in the kernel at that time, and the kernel
module loader tries the load a module to support dev_t 5:1, which
again runs modprobe, and ...
Setting CONFIG_CRYPTO_MANAGER=y makes it disappear.
It turns out that the crypto layer attempts to load the cryptomgr module as
part of its algorithm testing infrastructure. If cryptomgr fails to load,
the algorithm registration code can continue without it. It is optional,
but modprobe wants to put out a message when it fails to load it,
which leads to the runaway loop. As Herbert Xu points out, though, the problem is not
crypto-specific at all:
In any case the loop itself does not involve any crypto components
so I don't think making changes in the crypto layer is going to
make this go away forever as anyone calling request_module early
enough will get into this loop.
It is this potential pitfall that Sievers and Polyakov would like to see
removed. In
general, user-space programs are not required to be concerned with the
availability of /dev/console—except when they are run from
early kernel initialization. But Cox points out that user-space helpers must
concern themselves
with avoiding loops because there are multiple possible ways to cause that
to happen.
As an example, he notes that if UNIX-domain sockets (AF_UNIX) are in a
module and syslog() is called before the module is loaded, a
similar loop will occur.
In an effort to "step back" from the arguments that were going back and
forth, Ted Ts'o offers his analysis of the
problem along with a suggested course of action:
There is a dispute about whether it is looping forever, or whether it
should be getting caught by kernel/kmod.c's modprobe recursion
detector. Alan has checked the recursion detector and reports that it
works just fine; Evgeniy and Kay are claiming that it in fact loops
forever, and the recursion detector is not working.
[...] So I would think the best thing to do is to figure out what Debian's
initrd is doing that is evading the recursion detection. Fixing that
is going to make things much more robust.
Clearly the recursion detector is working to some extent, or the runaway
loop messages would not be seen, but on Debian, at least, that detection
doesn't stop the problem. Ts'o's theory is that something outside of
the directly invoked helper is actually the culprit: "I'm guessing why
it isn't working given Debian's initrd setup is that whatever is
ultimately opening /dev/console isn't being called until after the
helper script has exited." That seems worth tracking down as Ts'o
points out in a later message:
It would be good to make sure we understand what
the root causes for while the modprobe recursion detector is
apparently not triggering, since it could be that Debian's initrd
might cause some other uncaught recursion loop if we don't drive this
problem determination to root cause.
The exact cause of the problem and why Debian and Fedora behave differently
is still not known. Digging into Debian's initrd to figure that out, as
Ts'o suggests, is clearly the right starting point. That answer will
likely lead to sensible fixes, either in user space or the
kernel—possibly both.
Bickering about where and how to fix the problem before it is fully
understood seems counter-productive at best.
By Jonathan Corbet
December 9, 2008
Low-level optimization of performance-critical code can be a challenging
task. At this point, one assumes, the potential for algorithmic
improvements in the targeted code has been realized; what is left is trying
to locate and address problems
like cache misses, mis-predicted branches, and so on. Such problems can be
impossible to find by just looking at the code; one needs support from the
hardware. The good news is that contemporary hardware provides that
support; most processors can collect a wide range of performance data for
analysis. The bad news is that, despite the fact that processors have been
able to collect that data for many years, there has never been support for
this kind of performance monitoring in the mainline kernel. That situation
may be about to change, but, first, the development community will have to
make a choice between a venerable out-of-tree implementation and an
unexpected competitor.
The "perfmon" patch set has been under development for some years, but, for
a number of reasons, it has never found its way into the mainline kernel.
The most recent version of the patch was posted for review by
Stéphane Eranian in late
November. The perfmon patches show the signs of all those years of
development work and
usage experience; they offer a wide set of features and extensive user-space
support. The full perfmon patch adds twelve system calls to the kernel;
the posted version, though, trims that count back to five in the hope that
a narrower interface will have a better chance of getting into the
mainline. The additional system calls, one assumes, will be proposed for
inclusion sometime after the perfmon core is merged.
The reduced interface is described in the
patch set; briefly, an application hooks into the performance
monitoring subsystem with a call to:
    int pfm_create(int flags, pfarg_sinfo_t *regs);
This system call returns a file descriptor to identify the performance
monitoring session. The regs parameter is used to return a list
of performance monitoring registers available on the current system;
flags is currently unused.
Specific performance counter registers can be manipulated with:
    int pfm_write(int fd, int flags, int type, void *d, size_t sz);
    int pfm_read(int fd, int flags, int type, void *d, size_t sz);
These system calls can be used to write values into registers (thus
programming the performance monitoring hardware) and to read counter and
configuration information from those registers.
Actually doing some performance monitoring requires a couple more calls:
    int pfm_attach(int fd, int flags, int target);
    int pfm_set_state(int fd, int flags, int state);
A call to pfm_attach() specifies which process is to be monitored;
pfm_set_state() then turns monitoring on and off.
There are a couple of distinctive aspects to the perfmon interface. One is
that it knows almost nothing about the specific performance monitoring
registers; that information, instead, is expected to live in user space.
As a result, the bare perfmon system call interface is probably not
something that most monitoring applications would use; instead, those
system calls are hidden behind a user-space library which knows how to
program different types of processors for the desired results. Beyond
that, perfmon uses the ptrace() mechanism to stop the monitored
process while performance counters are being queried; as a result, the
monitoring process must have the right to trace the target process.
On December 4, Thomas Gleixner and Ingo Molnar posted a surprise announcement of a new
performance counter subsystem. The announcement states:
We are aware of the perfmon3 patchset that has been submitted to
lkml recently. Our patchset tries to achieve a similar end result,
with a fundamentally different (and we believe, superior :-)
design.
This is not the first time that these developers have shown up with an
out-of-the-blue reimplementation of somebody else's subsystem; other
examples include the CFS scheduler, high-resolution timers, dynamic tick,
and realtime preemption. Most of the time, the new code quickly supplants
the older version - an occurrence which is not always pleasing to the
original developers - but the situation does not seem quite as
straightforward this time.
The proposed interface is much simpler,
adding a single system call:
    int perf_counter_open(u32 hw_event_type, u32 hw_event_period,
                          u32 record_type, pid_t pid, int cpu);
This call will return a file descriptor corresponding to a single hardware
counter. A call to read() will then return the current value of
the counter. The hw_event_period can be used to block reads until
the counter overflows the given value, allowing, for example, events to be
queried in batches of 1000. The pid parameter can be used to
target a specific process, and cpu can restrict monitoring to a
specific processor.
There are a few advantages claimed for the new implementation. The
simplicity of the system call interface is one of those; it is possible to
write a very simple application to perform monitoring tasks, with no
additional libraries required. The second version of the patch
includes a simple "kerneltop" utility which can display a
constantly-updated profile of anything the performance counting hardware
can monitor. Another advantage is the avoidance of ptrace(); this
reduces the amount of privilege needed by the monitoring process and avoids
perturbing the monitored process by stopping and restarting it. The
management of counters is said to be more flexible, with facilities for
sharing counters between processes and reserving them for administrative
access. The low-level hardware interface is said to be simpler as well.
Those claimed advantages notwithstanding, a
number of complaints have been raised with regard to the new performance
monitoring code. Two of those seem to be at the top of the list: the
single counter per file descriptor API, and programming the hardware
performance monitoring unit inside the kernel. On the API side, the
biggest concern is that putting each counter behind its own file descriptor
makes it very hard to correlate two or more counters. Reading two counters
requires two independent read() system calls; as is always the
case, just about anything could happen between those two calls. So it's
hard to tell how two different counter values relate to each other. But
that sort of correlation is exactly what developers doing performance
optimization want to do. Paul Mackerras says:
Your API has as its central abstraction the "counter". I am saying
that that is the wrong abstraction. The abstraction really needs
to be a set of counters that are all active over precisely the same
interval, so that their values can be meaningfully compared and
related to each other.
In response, Ingo argues that the loss of
precision caused by independent read() calls is small - much
smaller than the muddying of the results caused by stopping the target
process so that all of the counters can be read at the same time. That
argument does not appear to have convinced the detractors, though.
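The source of the skew under discussion can be shown with a toy model. Nothing here corresponds to either proposed interface: struct demo_pmu is a made-up "hardware" in which every tick costs three cycles and retires two instructions, and the gap parameter stands in for whatever the monitored workload manages to do between two independent read() calls.

```c
#include <assert.h>
#include <stdint.h>

/* Toy "hardware": each tick costs 3 cycles and retires 2 instructions. */
struct demo_pmu {
    uint64_t cycles;
    uint64_t instructions;
};

struct demo_sample {
    uint64_t cycles;
    uint64_t instructions;
};

/* Advance the simulated workload by the given number of ticks. */
static void demo_tick(struct demo_pmu *p, unsigned int ticks)
{
    p->cycles += 3ull * ticks;
    p->instructions += 2ull * ticks;
}

/* Both counters captured at the same instant. */
static struct demo_sample demo_read_group(const struct demo_pmu *p)
{
    struct demo_sample s = { p->cycles, p->instructions };
    return s;
}

/* Two independent reads; the workload runs for `gap` ticks in between. */
static struct demo_sample demo_read_separately(struct demo_pmu *p,
                                               unsigned int gap)
{
    struct demo_sample s;

    s.cycles = p->cycles;               /* first read */
    demo_tick(p, gap);                  /* the target keeps running */
    s.instructions = p->instructions;   /* second read, slightly later */
    return s;
}
```

After 1000 ticks, a grouped read yields 3000 cycles and 2000 instructions, for the true ratio of 1.5. Reading the two values separately with a ten-tick gap yields 3000 and 2020, or roughly 1.485. Whether an error of that size matters, and how it compares to the disturbance caused by stopping the target, is the point on which the two camps disagree.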
The other complaint is that moving the counter programming task into the
kernel requires that the kernel know about the complexities of every
possible performance monitoring unit it may encounter. This hardware sits
at the core of the most performance-critical CPU subsystems, so its design
parameters value non-interference above features or a straightforward
programming interface. So programming it can be a complex business,
involving sizeable tables describing how various operations interact with
each other. The perfmon code keeps those tables in a user-space library,
but the alternative implementation won't allow that. Quoting Paul again:
Now, the tables in perfmon's user-land libpfm that describe the
mapping from abstract events to event-selector values and the
constraints on what events can be counted together come to nearly
29,000 lines of code just for the IBM 64-bit powerpc processors.
Your API condemns us to adding all that bloat to the kernel, plus
the code to use those tables.
Paul (and others) argue that this information - which can add up to
hundreds of kilobytes - is better kept in user
space.
There also seems to be a bit of concern over the fact that Stéphane had clearly never heard about this work before it was
posted for review. It must, indeed, be a shock to work on a subsystem for
years, then find a proposed replacement sitting in one's mailbox. As David
Miller put it:
And also, another part of the backlash is that the poor perfmon3
person was completely blindsided by this new stuff. Which to be
honest was pretty unfair. He might have had great ideas about the
requirements (even if you don't give a crap about his approach to
achieving those requirements) and thus could have helped avoid the
past few days of churn.
So, at this point, what will happen with performance monitoring is unclear
at best. Perhaps, though, this discussion will have the effect of raising
the profile of performance monitoring, which has been without proper kernel
support for many years. The merging of either solution - or, perhaps, a
combination of both - seems like it has to be an improvement over having no
support at all.
Page editor: Jake Edge