Brief items
The current development kernel remains 2.6.32-rc8. As of this
writing, just over 200 changes have been merged since 2.6.32-rc8, including
some significant feature enhancements to the FS-Cache and slow work
subsystems. Linus has not told the world whether he thinks that's enough
change to justify an -rc9 release or not; stay tuned.
GPUs have gotten more and more complex every 6 months for about 8
years now. A current radeonhd 4000 series bears little resemblence
to the radeon r100 that was out then. The newer GPUs require a full
complier to be written for an instruction set more complex than x86
in some places. The newer GPUs get more and more varied
modesetting combos that all require supporting.
Now I'd would guess (educated slightly) that the amount of code
required to write a full driver stack for a modern GPU has probably
gone up 40-50x what used to be required, whereas the number of open
source community developers has probably doubled since 2001. Also
newer GPU designs have forced us to redesign the Linux GPU
architecture, this had to happen in parallel with all the other
stuff, again with similiar number of developers. So yes it sucks
but it should point out why there is no reason why 3D should really
be working on all cards.
--
Dave Airlie
The best way to make everything "just work" is to eliminate it.
--
Jon Smirl
I agree that having only one of SLAB/SLUB/SLQB would be nice, but
it's going to take a lot of heavy lifting in the form of hacking
and benchmarking to have confidence that there's a clear
performance winner. Given the multiple dimensions of performance
(scalability/throughput/latency for starters), I don't even think
there's good a priori reason to believe that a clear winner CAN
exist. SLUB may always have better latency, and SLQB may always
have better throughput. If you're NYSE, you might have different
performance priorities than if you're Google or CERN or Sony that
amount to millions of dollars. Repeatedly saying "but we should
have only one allocator" isn't going to change that.
--
Matt Mackall
By Jonathan Corbet
December 2, 2009
Good developers carefully write their code to handle error conditions which
may arise. This code frequently suffers from one problem, though: test
coverage is hard. Many of the anticipated errors never come about, so the
error-handling code never gets exercised. So when things go wrong for
real, recovery does not work as expected. For a few years, the Linux
kernel has had a
fault injection
framework designed to help in the debugging of some types of
error-handling code. By forcing specific things (memory allocations in
particular) to go wrong, the fault injection framework can help developers
ensure that errors are really handled as expected.
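The idea can be illustrated in user space with a minimal sketch. This is not the kernel's fault injection interface; fi_arm(), fi_malloc(), and make_pair() are hypothetical names invented for this example. The wrapper forces the Nth allocation to fail so that a cleanup path which would otherwise almost never run can be exercised on demand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space fault injector: fail the Nth allocation. */
static unsigned long fi_calls;     /* allocations attempted since arming */
static unsigned long fi_fail_nth;  /* 0 means never inject a failure */

static void fi_arm(unsigned long nth)
{
    fi_calls = 0;
    fi_fail_nth = nth;
}

static void *fi_malloc(size_t size)
{
    if (fi_fail_nth && ++fi_calls == fi_fail_nth)
        return NULL;               /* injected failure */
    return malloc(size);
}

/* Code under test: must release *a if the second allocation fails. */
static int make_pair(char **a, char **b)
{
    assert(a && b);
    *a = fi_malloc(16);
    if (!*a)
        return -1;
    *b = fi_malloc(16);
    if (!*b) {
        free(*a);                  /* the rarely exercised error path */
        *a = NULL;
        return -1;
    }
    return 0;
}
```

Calling fi_arm(2) before make_pair() drives execution through the free() branch; fi_arm(0) restores normal behavior.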
Sripathi Kodi recently posted a
patch adding certain types of futex failures to the fault injection
framework. Ingo Molnar responded with a
potentially surprising request:
Instead of this unacceptably ugly and special-purpose debugfs
interface, please extend perf events to allow event injection. Some
other places in the kernel (which deal with rare events) want/need
this capability too.
This "unacceptably ugly" interface has existed as part of the fault
injection framework since 2006, so it is a little surprising to hear, now,
that it cannot be used. Ingo is firm about this point, though, and appears
unwilling to back down.
Extending perf events for fault injection might be the right long-term
solution. But this situation highlights a trap for developers which
certainly acts to make participation in the development process harder. In
his travels, your editor has heard complaints from developers who set out
to accomplish a specific task, only to be told that they must undertake a
much larger cleanup to get their code merged. The topic also came up at
the 2009 kernel summit; there, the consensus seemed to be that this kind of
request can quickly become unreasonable.
In this case, Sripathi has not been asked to fix the remainder of the fault
injection framework code. But adding new functionality to the perf
events subsystem still likely goes rather beyond the scope of the original
project. Sripathi has not responded to this request, so it's not clear
whether we'll see a futex fault injection mechanism reworked to fit the new
requirements, or whether this code will just fade away.
Kernel development news
By Jake Edge
December 2, 2009
When last we looked in on
utrace, back in March, it was being proposed for inclusion into 2.6.30.
There were various objections at that time, but the biggest was the lack of
a "real" in-kernel user for utrace. It was suggested that providing a real
user along with utrace itself would smooth its path into the mainline.
Now utrace has returned in the form of a
set of patches from Oleg Nesterov (based on Roland McGrath's work), along
with a
rewrite of the ptrace()
system call using the utrace interface. With the 2.6.33 merge window
opening soon, the hope is that utrace will, finally, make its way into the
mainline.
Utrace provides a means to control user-space threads, which could be
used for debugging, tracing, and other tasks like User-Mode Linux.
SystemTap is one of the biggest current utrace users, as Red Hat and Fedora
kernels have had utrace support for several years. Utrace came
from a recognition that ptrace() was too limited—and
messy—for many of the things folks wanted to use it for. In
particular, only allowing one active tracing process for a given thread, as
ptrace() requires, was too limiting for various envisioned tracing
and control scenarios. Utrace allows multiple tracing "engines" to attach
to a thread, list which events they are interested in, and receive
callbacks when those events occur.
The interface provided by utrace has not changed enormously since our first
look in March 2007. Engines,
which are typically implemented as loadable kernel modules, will attach to
a given
thread by using utrace_attach_task() or
utrace_attach_pid() depending on whether they have a struct
task_struct or struct pid available. In either case, a
struct utrace_engine pointer is returned, which is used to
identify the engine in additional calls.
The struct utrace_engine looks like:
    struct utrace_engine {
        const struct utrace_engine_ops *ops;
        void *data;
        unsigned long flags;
    };
with flags containing an event mask and data used for
engine-specific private data. The most interesting part is the
ops field which points to a set of ten different callback
functions. These functions make up the heart of the tracing engine
functionality.
The function pointers in struct utrace_engine_ops are described
in linux/utrace.h. All of the kerneldoc comments are pulled from
the source code files into the
DocBook
documentation that comes with the patchset. The callbacks are made as
the traced thread encounters various events. These include signals being
delivered, clone() or exec() being called, other system
calls as they are entered or exited, thread exit or death, and more. In
each case, the callbacks are made for each interested engine in the order
in which the engines were attached.
An engine uses the utrace_set_events() (or
utrace_set_events_pid()) call to indicate which of the events it
is interested in:
    int utrace_set_events(struct task_struct *target,
                          struct utrace_engine *engine,
                          unsigned long events);
The UTRACE_EVENT() macro is used to turn on the appropriate bits in the
events mask. There must be a callback defined in the engine->ops table
for any events which are enabled.
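To make the attach, set-events, and callback flow concrete, here is a small user-space model. It is only a sketch: every toy_ name is invented for illustration, the real definitions live in linux/utrace.h, and the real callbacks take more arguments than shown. In particular, rejecting an event mask which has no matching callback is a choice made by this model, not a claim about how utrace_set_events() behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Toy event numbers and mask macro, standing in for UTRACE_EVENT(). */
enum toy_event { TOY_EV_EXEC, TOY_EV_CLONE, TOY_EV_EXIT, TOY_EV_MAX };
#define TOY_EVENT(ev) (1UL << (ev))

struct toy_engine;
typedef void (*toy_cb)(struct toy_engine *engine, enum toy_event ev);

struct toy_engine {
    toy_cb ops[TOY_EV_MAX];   /* one callback slot per event */
    void *data;               /* engine-private data */
    unsigned long flags;      /* mask of enabled events */
};

#define TOY_MAX_ENGINES 4
struct toy_thread {
    struct toy_engine *engines[TOY_MAX_ENGINES];  /* in attach order */
    size_t nr;
};

static int toy_attach(struct toy_thread *t, struct toy_engine *e)
{
    if (t->nr == TOY_MAX_ENGINES)
        return -1;
    t->engines[t->nr++] = e;
    return 0;
}

/* This model refuses to enable an event which has no callback. */
static int toy_set_events(struct toy_engine *e, unsigned long events)
{
    for (int ev = 0; ev < TOY_EV_MAX; ev++) {
        if ((events & TOY_EVENT(ev)) && !e->ops[ev])
            return -1;
    }
    e->flags = events;
    return 0;
}

/* Report an event to each interested engine, in attach order. */
static int toy_report(struct toy_thread *t, enum toy_event ev)
{
    int called = 0;

    assert(ev < TOY_EV_MAX);
    for (size_t i = 0; i < t->nr; i++) {
        struct toy_engine *e = t->engines[i];

        if (e->flags & TOY_EVENT(ev)) {
            e->ops[ev](e, ev);
            called++;
        }
    }
    return called;
}

/* Sample callback: count events in the engine's private data. */
static void toy_count(struct toy_engine *engine, enum toy_event ev)
{
    (void)ev;
    ++*(int *)engine->data;
}
```

Two engines attached to the same toy_thread each see only the events they asked for, which is the capability that a single ptrace() tracer cannot offer.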
Once a callback has been invoked, the engine uses utrace_control()
(or utrace_control_pid()) to tell the traced thread to do
something:
    int utrace_control(struct task_struct *target,
                       struct utrace_engine *engine,
                       enum utrace_resume_action action);
The action parameter governs what is supposed to happen. Those
actions include things like single-stepping, block-stepping, resuming
execution, detaching from the thread, and so on.
In the only real complaint about the patchset seen so far, Christoph
Hellwig is unhappy that the
ptrace() reimplementation is not supplanting the
current ptrace() code: "One thing I really hate about this
is that it introduces two ptrace implementations by adding the new one
without removing the old one." In the patches, the inclusion of
utrace is governed by the CONFIG_UTRACE flag. Since it isn't
optional to have a ptrace() system call, that meant the current
code needed to stay.
What Hellwig suggests, though, is adding utrace support to the two major
architectures that don't have it (arm and mips), then removing the current
ptrace(). He clearly believes it is too late to get utrace into
2.6.33, which would allow time to get utrace support into those—and
hopefully
other, minor architectures—before utrace is merged. "If the
remaining minor architectures don't manage to get their homework done
they're left without ptrace," he said.
That didn't sit well with various other kernel hackers. Pavel Machek
said: "I don't think introducing
regressions to force people to rewrite code is a good way to go".
In addition, Ingo Molnar seems to have warmed up to utrace's inclusion
since it was last proposed. Molnar had many complaints
about utrace last time, but is much more positive this time. He doesn't think adding more architecture support is the
way to go:
Regarding porting it to even more architectures - that's pretty much the
worst idea possible. It increases maintenance and testing overhead by
exploding the test matrix, while giving little to [the] end result. Plus the
worst effect of it is that it becomes even more intrusive and even
harder (and riskier) to merge.
Unlike last time, where most of the complaints were not aimed at the code
itself, but more at its timing and lack of an in-kernel user, this time
there is some code review taking place. Peter Zijlstra has a fairly
detailed review of both the code and
documentation, for example. There is a clear sense that utrace is clearing
hurdles that may have held it up in the past.
One of the outcomes from the tracing meetings at the Collaboration Summit
in April was to come up with an in-kernel user, and ptrace()
seemed like a good candidate. Other ideas were mentioned in those
meetings, including adding a gdb "stub" in the kernel to allow
debugging of user-space programs. A patch to expose a
/proc/PID/gdb interface that implements gdb's remote serial
protocol was proposed by Srikar Dronamraju.
That patch is running into more serious difficulty than utrace seems to
be. Because kgdb already exposes the remote serial interface for
gdb, but for the kernel instead, Zijlstra and Molnar think that the two
should be combined. It seems unlikely to get merged until that is resolved.
Some parts of the utrace patchset have spent time in the -mm tree, and
utrace has been shipped with every Fedora kernel since FC6. But the
utrace-ptrace piece has not done any time in either -mm or -next, which may
make it harder to get it in the mainline for 2.6.33. Since utrace is
optional, though, there are relatively few risks. McGrath is willing to
consider removing the current
ptrace() implementation, but it's clear that he and
Nesterov—maintainers of the current ptrace()—would
prefer to get utrace into the mainline now:
We don't want to rush anyone, like uninterested arch maintainers. We don't
want to break anyone who doesn't care about our work (we do test for ptrace
regressions but of course new code will always have new bugs so some
instances of instability in the transition to a new ptrace implementation
have to be expected no matter how hard we try). We just don't want to keep
working out of tree.
Presumably, we will know within the next few weeks whether utrace makes its
way into 2.6.33. But, if that doesn't happen, it would
seem that one more kernel development cycle is all that it should take.
By Jonathan Corbet
December 1, 2009
Reader-writer spinlocks and interrupt-enabled interrupt handlers both have
a long history in the Linux kernel. But both may be nearing the end of
their story. This article looks at the push for the removal of a pair of
legacy techniques for mutual exclusion in the kernel.
Reader-writer spinlocks (rwlocks) behave like ordinary spinlocks, but with
some significant exceptions. Any number of readers can hold the lock at
any given time; this allows multiple processors to access a shared data
structure if none of them are making changes to it. Reader locks are also
naturally nestable; a single processor can acquire a given read lock more
than once if need be. Writers, instead, require exclusive access; before a
write lock can be granted, all read locks must be released, and only one
write lock can be held at any given time.
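These rules can be captured in a few lines of user-space C. The sketch below is not the kernel's rwlock implementation; the toy_ names and the single-counter encoding are assumptions made for illustration, and only non-blocking "trylock" operations are shown so that the behavior is easy to follow.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy rwlock: count > 0 is the number of readers, -1 means write-locked. */
struct toy_rwlock {
    atomic_int count;
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
    int c = atomic_load(&l->count);

    /* No writer holds the lock: try to add one more reader. */
    while (c >= 0) {
        if (atomic_compare_exchange_weak(&l->count, &c, c + 1))
            return true;
    }
    return false;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
    int old = atomic_fetch_sub(&l->count, 1);

    assert(old > 0);   /* must have been read-locked */
    (void)old;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
    int expected = 0;  /* only an idle lock can be write-locked */

    return atomic_compare_exchange_strong(&l->count, &expected, -1);
}

static void toy_write_unlock(struct toy_rwlock *l)
{
    atomic_store(&l->count, 0);
}
```

A second toy_read_trylock() succeeds while the first read lock is still held, which models both concurrent and nested readers; toy_write_trylock() fails until every reader has dropped the lock.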
Rwlocks in Linux are inherently unfair in that readers can stall writers
for an arbitrary period of time. New read locks are allowed even if a
writer is waiting, so a steady stream of readers can block a writer
indefinitely. In practice this problem rarely surfaces, but Nick Piggin
has reported a case where the right
user-space workload can cause an indefinite system livelock. This is a
performance problem for specific users, but it is also a potential denial
of service attack vector on many systems. In response, Nick started
pondering the challenge of implementing fairer rwlocks which do not
create performance regressions.
That is not an easy task. The obvious solution - blocking new readers when a
writer gets in line - will not work for the most important rwlock
(tasklist_lock) because that lock can be acquired by interrupt
handlers. If a processor already holding a read lock on
tasklist_lock is interrupted, and the interrupt handler, too,
needs that lock, forcing the handler to wait will deadlock the processor.
So workable solutions require allowing nested reader locks to be acquired
even when writers are waiting, or disabling interrupts when
tasklist_lock is held. Neither solution is entirely pleasing.
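The deadlock scenario can be written down as a tiny decision model. This is a sketch under stated assumptions: the toy_ names are hypothetical, and a nonzero readers count is taken to mean that the interrupted processor itself holds one of the read locks.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_policy { TOY_UNFAIR, TOY_FAIR };

struct toy_lock_state {
    int readers;          /* read locks currently held */
    bool writer_waiting;  /* a writer is queued behind the readers */
};

/* Would a new read-lock request be granted right now? */
static bool toy_read_granted(const struct toy_lock_state *s,
                             enum toy_policy p)
{
    assert(s->readers >= 0);
    if (p == TOY_FAIR && s->writer_waiting)
        return false;     /* fair: new readers queue behind the writer */
    return true;          /* unfair: readers may always join */
}

/*
 * An interrupt handler wants the read lock on a processor which
 * already holds it. If the request is refused, the handler spins
 * forever: the writer is waiting for the outer reader, which cannot
 * run again until the handler returns.
 */
static bool toy_irq_deadlocks(const struct toy_lock_state *s,
                              enum toy_policy p)
{
    return s->readers > 0 && !toy_read_granted(s, p);
}
```

With one reader and a waiting writer, toy_irq_deadlocks() is true only under TOY_FAIR, which is why a fair tasklist_lock would need either a nesting exception or disabled interrupts.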
Beyond that, there has been a general sentiment toward the removal of
rwlocks for some years. The locking primitives themselves are
significantly slower than plain spinlocks, so any performance gain from
allowing multiple readers must be large enough to make up for that extra
cost. In many cases, that gain does not appear to actually exist. So,
over time, kernel developers have been changing rwlocks to normal spinlocks
or replacing them with read-copy-update mechanisms. Even so, a few hundred
rwlocks remain in the kernel. Perhaps it would be better to focus on
removing them instead of putting a lot of work into making them more fair.
Almost all of those rwlocks could be turned into spinlocks tomorrow and
nobody would ever notice. But tasklist_lock is a bit of a thorny
problem; it is acquired in many places in the core kernel and it's not
always clear what this lock is protecting. This lock is also taken in a
number of critical kernel fast paths, so any change has to be done
carefully to avoid performance regressions. For these reasons,
kernel developers have generally avoided messing with
tasklist_lock.
Even so, it would appear that, over time, a number of the structures
protected by tasklist_lock have been shifted to other protection
mechanisms. This lock has also been changed in the realtime preemption
tree, though that code has not yet made it to the mainline. Seeing all
this, Thomas Gleixner decided to try to get rid
of this lock, saying "If nobody beats me I'm going to let sed
loose on the kernel, lift the task_struct rcu free code from -rt and figure
out what explodes." As of this writing, the results of this
exercise have not been posted. But Thomas is still active on the mailing
list, so one concludes that any explosions experienced cannot have been
fatal.
If tasklist_lock can be converted successfully to an ordinary
spinlock, the conversion of the remaining rwlocks is likely to happen
quickly. Shortly after that, rwlocks may go away altogether, simplifying
the set of mutual exclusion primitives in Linux considerably.
IRQF_DISABLED
Meanwhile, a different sort of exclusion happens with interrupt
handlers. In the early days of Linux, these handlers were divided into
"fast" and "slow" varieties. Fast handlers could be run with other
interrupts disabled, but slow handlers needed to have other interrupts
enabled. Otherwise, a slow handler (perhaps doing a significant amount of
work in the handler itself) could block the processing of more important
interrupts, impacting the performance of the system.
Over the years, this distinction has slowly faded away, for a number of
reasons. The increase in processor speeds means that even an interrupt
handler which does a fair amount of work can be "fast." Hardware has
gotten smarter, minimizing the amount of work which absolutely must be done
immediately on receipt of the interrupt. The kernel has gained improved
mechanisms (threaded interrupt handlers, tasklets, and workqueues) for
deferred processing. And the quality of drivers has generally improved.
So driver authors generally do not really even need to think about whether
their handlers run with interrupts enabled or not.
Those authors still need to make that choice when setting up interrupt
handlers, though. Unless the handler is established with the
IRQF_DISABLED flag set, it will be run with interrupts enabled.
For added fun, handlers for shared interrupts (perhaps the majority on most
systems) can never be assured of running with interrupts disabled; other
handlers running on the same interrupt line might enable them at any time.
So many handlers will be running with interrupts enabled, even though that
is not needed.
The solution, it would seem, would be to eliminate the
IRQF_DISABLED flag and just run all handlers with interrupts
disabled. In almost all cases, everything will work just fine. There are
just a few situations where interrupt handling still takes too long, or
where one interrupt handler depends on interrupts for another device being
delivered at any time. Those handlers could be identified and dealt with.
"Dealt with" in this case could take a few forms. One would be to equip
the driver with a better-written interrupt handler which does not have this
problem. Another, related approach would be to move the driver to a
threaded handler which, naturally, will run with interrupts enabled. Or,
finally, the handler could be set up with a new flag
(IRQF_NEEDS_IRQS_ENABLED, perhaps) which would cause it to run
with interrupts turned on in the old way.
It's not clear when all this might happen, but it could be that, in the
near future, all hard interrupt handlers are expected to run - quickly -
with interrupts disabled. Few people will even notice, aside from some
maintainers of out-of-tree drivers who will need to remove
IRQF_DISABLED from their code. But the kernel as a whole should
be faster for it.
By Jonathan Corbet
December 2, 2009
One of the stated goals of the staging tree is to bring widely-used drivers
into the mainline kernel tree. This effort has been quite successful; the
number of out-of-tree drivers has dropped considerably over the last year
or so. There is one high-profile holdout, though: the
Linux Infrared Remote Control (LIRC)
subsystem. LIRC is used to obtain input events from remote control devices
and feed them through to applications; Linux-based digital video recorder
systems are heavy LIRC users, but there are others as well. Back in
October, Jarod Wilson
posted a
new version of LIRC for consideration. One month later, the kernel
developers have started talking about it; what they lack in punctuality has
been more than made up for in volume.
One might think that merging this longstanding, heavily-used project into
the mainline would not require a great deal of discussion. The problem is
that LIRC brings with it a new ABI. Since user-space interfaces must be
supported indefinitely, they tend to come under a higher degree of scrutiny
than other parts of the code. LIRC has never had to freeze its ABI during
its many years of out-of-tree existence, a freedom which has made life easier for its developers. But LIRC
in mainline would not have this freedom, so any incompatible ABI changes
need to be made prior to merging. And, as it happens, some developers
would like to see significant changes.
One would think that an IR receiver would be a simple device; all it must
do is report button press and release events, much like a keyboard. Often,
it seems, the simplest devices are the most complex to deal with. Some
receivers have decoders built into them, allowing them to pass scan codes
to the driver, which can then map them onto key codes to pass to
applications. But others are simple indeed - they simply report the timing
and length of pulses received from the remote. In this case, the driver
must filter out glitches and perform protocol processing to get to the
point where it can generate scan codes. For extra fun, there are a number
of protocols in use, and some manufacturers have wisely decided that life would be
much more interesting if they were to make their own versions of the
protocols which differ from everybody else's. So the protocol processing
can be painful and unpleasant.
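As an illustration of what "protocol processing" means here, the following sketch decodes one frame of the widely used NEC protocol from raw timings. It is not taken from LIRC or from any kernel driver; the toy_ names, the input format (alternating pulse and space durations in microseconds, pulse first), and the 25% tolerance are all assumptions, and real drivers must also merge glitches and handle repeat frames, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Nominal NEC timings in microseconds. */
#define NEC_LEADER_PULSE 9000
#define NEC_LEADER_SPACE 4500
#define NEC_BIT_PULSE     562
#define NEC_ZERO_SPACE    562
#define NEC_ONE_SPACE    1687

/* True if 'us' is within 25% of 'nominal' (an arbitrary tolerance). */
static bool toy_near(unsigned us, unsigned nominal)
{
    return us >= nominal - nominal / 4 && us <= nominal + nominal / 4;
}

/*
 * Decode alternating pulse/space durations into a 32-bit scan code.
 * Returns false if the timings do not look like a classic NEC frame.
 */
static bool toy_nec_decode(const unsigned *t, size_t n, uint32_t *scancode)
{
    uint32_t bits = 0;

    assert(t && scancode);
    if (n < 2 + 2 * 32)            /* leader pair plus 32 bit pairs */
        return false;
    if (!toy_near(t[0], NEC_LEADER_PULSE) ||
        !toy_near(t[1], NEC_LEADER_SPACE))
        return false;

    for (int i = 0; i < 32; i++) {
        unsigned pulse = t[2 + 2 * i];
        unsigned space = t[3 + 2 * i];

        if (!toy_near(pulse, NEC_BIT_PULSE))
            return false;
        if (toy_near(space, NEC_ONE_SPACE))
            bits |= (uint32_t)1 << i;   /* bits arrive LSB first */
        else if (!toy_near(space, NEC_ZERO_SPACE))
            return false;
    }

    /* Byte 1 must be the inverse of byte 0, and byte 3 of byte 2. */
    if (((bits ^ (bits >> 8)) & 0x00ff00ffu) != 0x00ff00ffu)
        return false;

    *scancode = bits;
    return true;
}
```

On success, the low byte of the scan code is the device address and bits 16 to 23 are the command. Each vendor-specific protocol variant needs its own version of this kind of code, which is where the pain comes from.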
LIRC handles this mess by having drivers report "raw" pulse-length information via
a special device; a user-space daemon then handles the task of turning that
information into something that usefully describes a button-press event.
In many cases, the low-level driver runs in user space and does not involve
the kernel at all. Distribution of these events is also handled by the
LIRC daemon, which can direct specific events to different applications,
run programs in response to events, and so on in a flexible, scriptable
manner.
LIRC works, and some developers would like to see it merged into the
mainline more-or-less as it stands now. Others, though, dislike
the special-purpose "raw" interface used by LIRC. As Jon Smirl put it:
[W]e used to have device specific user space interfaces for
mouse and keyboard. These caused all sort of problems. A lot of
work went into unifying them under evdev. It will be years until
the old, messed up interfaces can be totally removed.
I'm not in favor of repeating the problems with a device specific
user space interface for IR. I believe all new input devices should
implement the evdev framework.
In other words, these developers want remote control devices to look like
any other input device and generate input events through the same
interface. Jon has posted a proposed IR input
driver for discussion; it is actually a rework of code first posted one
year ago. This code moves all processing into the kernel and provides a
flexible mechanism for dealing with multiple remote controls.
As it happens, a number of remote control receivers already work this way,
even in the absence of Jon's patch. LIRC is not
the sole repository of IR receiver drivers; a fair number of them also live
in the mainline kernel already, in the Video4Linux2 subsystem. TV cards
often come with a bundled remote control and receiver, so it makes sense to
write a driver for the receiver as part of the larger V4L2 driver. These
drivers do not use the LIRC interface; instead, they generate input events
directly. See the
Conexant CX2388x IR driver for an example of what this sort of driver
looks like.
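The core of such a driver is a table which turns scan codes into key codes. The sketch below shows only that lookup step, in user space; it is not the CX2388x driver's code. The toy_ names and the scan code values are made up for illustration, while the key code numbers are copied from the KEY_* definitions in linux/input.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key code values matching KEY_* in linux/input.h. */
#define TOY_KEY_RESERVED     0
#define TOY_KEY_VOLUMEDOWN 114
#define TOY_KEY_VOLUMEUP   115
#define TOY_KEY_POWER      116

struct toy_keymap_entry {
    uint32_t scancode;  /* what the receiver reports (made-up values) */
    unsigned keycode;   /* what applications see */
};

static const struct toy_keymap_entry toy_keymap[] = {
    { 0x0c, TOY_KEY_POWER },
    { 0x10, TOY_KEY_VOLUMEUP },
    { 0x11, TOY_KEY_VOLUMEDOWN },
};

/* Linear search is fine for the few dozen buttons on a remote. */
static unsigned toy_lookup(const struct toy_keymap_entry *map, size_t n,
                           uint32_t scancode)
{
    assert(map || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].scancode == scancode)
            return map[i].keycode;
    }
    return TOY_KEY_RESERVED;  /* unknown button */
}
```

A real driver would hand the resulting key code to the input layer, and a different remote would simply mean loading a different table.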
The discussion covered various approaches to IR receivers without coming to
any real resolution. Jon Smirl's attempt to
clarify the goals for in-kernel IR support may have brought some focus,
but little in the way of solid conclusions. Even so, there are some points
of near consensus; these include:
- There needs to be some sort of API based on the input subsystem, where
applications can obtain processed, high-level keycodes for button
presses. The goal is to have remote-using applications "just work"
whenever possible.
- There probably needs to be a separate interface where special-purpose
applications can get raw timing data from the receiver - at least, for
receivers without built-in decoders which can provide this information. This
interface can be used to reverse-engineer the sequences sent by new
remote control units and to deal with pathologically-bad hardware.
There is talk of funneling raw data through the input layer as well,
but it's not clear that doing so buys anything; it may be that just
adopting the existing LIRC interface for raw data is as good an
approach as any.
With regard to the keycode interface, there is still a lot of
disagreement over where the keycodes should come from. Some developers
want all of the IR drivers to be in the kernel, while others are happy with
using the LIRC daemon (or something like it) to generate keycodes and push
them back into the kernel from user space. In-kernel drivers have the potential to work
with no daemon process and they can use the current module loading
mechanism. Kernel-based drivers will also have lower response latency than
a user-space daemon, saving precious milliseconds for desperate users who
want to change channels and evade that "too much information"
pharmaceuticals commercial.
On the other hand, in-kernel drivers are kernel code, with the higher level
of risk that always implies. Filtering of input sequences and protocol
processing can be a significant amount of work that some would rather see
done in user space. It may never be possible to support the more
problematic hardware in the kernel. Then, there are the truly wild ideas,
such as wiring an IR receiver to a sound card's microphone input -
something people actually do, evidently.
The fact that some IR protocols may be
patent-encumbered also needs to be kept in mind.
Another detail worth bearing in mind: a number of IR receivers are also
capable of transmitting information. A solution based solely on the input
layer will not be able to handle the output case.
There is one final, simple point: the LIRC drivers have seen many years of
development, and they work. If LIRC is merged directly, the kernel will
benefit from that work and the associated lessons learned. If LIRC is
dropped in favor of fully in-kernel drivers, chances are good that some of
those lessons will have to be learned anew. If the kernel were to go with
a non-LIRC approach to IR drivers, it would probably, eventually,
reach a point where it had a more capable and flexible system with wider
device support than is available now. But, between here and there would be
a period - perhaps a
long period - where in-kernel IR support was not as good as LIRC.
Still, that might just be how things go in the end. The kernel development
community, always concerned about what it will have to maintain five or ten
years in the future, tends not to be in a hurry to merge something now just
because it is seen to work. So, while it is yet possible that LIRC could
be merged in something close to its current form, it's also possible that
it could lurk on the sidelines while something significantly different is
created for the mainline.
Page editor: Jonathan Corbet