Brief items
The 2.6.33 merge window is still open, so there is no published
development kernel as of this writing. The 2.6.33-rc1 release, closing the
merge window, can be expected almost any time now.
Stable kernel updates: 2.6.32.1 and 2.6.31.8 were released on
December 14. Both contain a long list of fixes, with many of them
applied to the ext4 filesystem.
No mum just the creator of Linux making my life hard on a Friday.
I'm sure Dad can find articles about it.
-- Dave Airlie
Damn, this is complicated crap. The analogous task in real life
would be keeping a band of howler monkeys, each in their own tree,
singing in unison while the lead vocalist jumps from tree to tree,
and meanwhile, an unseen conductor keeps changing the tempo the
piece is played at. Thankfully, there are no key changes, however,
occasionally new trees sprout up at random and live ones fall over.
-- Zachary Amsden (thanks to Markus Armbruster)
Overdesigning is a SIN. It's the archetypal example of what I call
"bad taste". I get really upset when a subsystem maintainer starts
overdesigning things.
-- Linus Torvalds
Or maybe he's talking about ye olde readlocke, used widely for OS
research throughout the middle ages. You still find that spelling
in some really old CS literature.
-- Linus Torvalds
By Jonathan Corbet
December 15, 2009
Thomas Gleixner has set himself the task of getting rid of the messy rwlock
called tasklist_lock; in many cases, the solution is to use read-copy-update
(RCU) instead. In the process, he found some problems with how some code uses
RCU. They merit a quick look, since these problems may occur elsewhere,
and may reflect an outdated understanding of how RCU works.
The core idea behind RCU is to delay the freeing of obsoleted,
globally-visible data until it is known that no users of that data exist.
Traditionally, this has been accomplished by (1) requiring that all
uses of RCU-protected data be in atomic code, and (2) not freeing any
old data until every CPU in the system has scheduled at least once after
that data was replaced by an updated copy. Since atomic code cannot
schedule, this set of rules is sufficient to know that no references to the
old data exist.
Needless to say, code working with RCU-protected data must have preemption
disabled - otherwise the processor could schedule while a reference to that
data still exists. So the rcu_read_lock() primitive has
traditionally disabled preemption. Based on the code Thomas found, that
seems to have led to the conclusion that disabling preemption is
sufficient for code using RCU.
The problem is that newer forms
of RCU use a more sophisticated batching mechanism to track references
to RCU-protected data. This change was necessary to make RCU scale better,
especially in situations (realtime, for example) where disabling preemption
is undesirable. When using hierarchical (or "tree") RCU, code which simply
disables preemption before accessing RCU-protected data will have ugly race
conditions. So it's important to always use rcu_read_lock() when
working with such data. Unfortunately, this is a hard rule to enforce in
an automated way, so programmers will simply have to remember it.
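The deferred-free idea can be illustrated with a deliberately naive userspace model. This is only a sketch: it is single-threaded, it tracks readers with an explicit counter (a cost that real RCU exists to avoid), and every toy_* name is invented for the illustration. What it tries to show is that the updater can only know when an old copy is safe to free if readers announce themselves through the read-side primitive.

```c
#include <stdlib.h>

struct toy_item {
    int value;
    struct toy_item *next_retired;
};

struct toy_item *toy_live;     /* globally visible, "RCU-protected" pointer */
struct toy_item *toy_retired;  /* replaced copies that may still be in use */
int toy_readers;               /* explicit count; real RCU avoids this */
int toy_freed;                 /* number of old copies reclaimed so far */

/* Free retired copies, but only when no reader is inside a read section. */
void toy_reclaim(void)
{
    while (toy_readers == 0 && toy_retired) {
        struct toy_item *old = toy_retired;

        toy_retired = old->next_retired;
        free(old);
        toy_freed++;
    }
}

/* Read side: announce the reader, then hand out the current copy. */
struct toy_item *toy_read_lock(void)
{
    toy_readers++;
    return toy_live;
}

void toy_read_unlock(void)
{
    toy_readers--;
    toy_reclaim();
}

/* Update side: publish a new copy and defer freeing the old one. */
int toy_update(int value)
{
    struct toy_item *item = malloc(sizeof(*item));

    if (!item)
        return -1;
    item->value = value;
    item->next_retired = NULL;

    if (toy_live) {
        toy_live->next_retired = toy_retired;
        toy_retired = toy_live;
    }
    toy_live = item;
    toy_reclaim();
    return 0;
}
```

A reader which dereferenced toy_live without calling toy_read_lock() would be invisible to toy_reclaim() and could be left holding freed memory. That is, loosely, the position of kernel code which disables preemption instead of calling rcu_read_lock() under tree RCU.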
By Jonathan Corbet
December 16, 2009
Salman Qazi hypothesizes a situation many of us have certainly found
ourselves in:
Imagine being in a tent in Death Valley with a laptop. You are
bored, and you want to watch a movie. However, you also want to do
your best to make the battery last and watch as much of the movie
as possible.
The proposed solution, as it happens, also works for another situation.
Imagine you are Google, and you
want to get the most out of each data center. One way to do that is to
populate the site with more machines than the incoming power is able to
handle, then moderate the power consumption of individual machines to keep
the total below the limit.
In particular, the code that Google has works by forcing the processor to
go idle for a given percentage of the time, where that percentage is set
dynamically depending on the load on the machine and on the data center as
a whole. If need be, a special-purpose realtime task will take over and
idle the processor for the required time to keep the total computing time
below the limit. There are some interesting heuristics for trying to force
the idle cycles onto low-priority processes and for determining whose time
slices the idle cycles are charged to.
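Since the code has not been posted, the following is purely a guess at the arithmetic involved, not a description of Google's implementation. It assumes a fixed accounting period and an idle percentage handed down from above; both function names are hypothetical. The sketch computes how much idle time a period must contain and how much of it would have to be forced once naturally occurring idle time is taken into account.

```c
#include <stdint.h>

/* Idle time (in ns) that one period must contain to honor idle_pct. */
uint64_t required_idle_ns(uint64_t period_ns, unsigned int idle_pct)
{
    if (idle_pct > 100)
        idle_pct = 100;
    return period_ns * idle_pct / 100;
}

/*
 * Idle time that would have to be injected, given the idle time the
 * processor accumulated on its own during the period.
 */
uint64_t forced_idle_ns(uint64_t period_ns, unsigned int idle_pct,
                        uint64_t natural_idle_ns)
{
    uint64_t need = required_idle_ns(period_ns, idle_pct);

    return natural_idle_ns >= need ? 0 : need - natural_idle_ns;
}
```

With a 100ms period and a 30% idle target, a processor which was naturally idle for 10ms would need another 20ms of forced idle time; one which was already idle for 40ms would need none.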
This work sounds quite similar to the ACPI processor aggregator driver
which was merged for 2.6.32 over scheduler maintainer Peter Zijlstra's
objections. Peter has not yet spoken up on this patch, but, from the
description, it sounds like it is closer to what he was requesting for this
kind of functionality. It is hard to tell for sure, though; the actual
code has not yet been posted. Hopefully that will follow soon, and this
change can be evaluated for real.
By Jonathan Corbet
December 16, 2009
Nice new tracing tools notwithstanding, kernel developers still tend to
reach for printk() when trying to figure out problems. But one
need not work on kernel code for very long before running into an
unpleasant fact: the most interesting stuff is often printed immediately
before a crash, but, for many kinds of problems, the death of the system
can prevent the output of those crucial lines. It's no fun to stare at a
hung system, knowing that the information needed to find the problem is
probably trapped in a buffer somewhere in that system's memory.
2.6.33 will contain a new mechanism designed to help get that last bit of
information out of a dying system's clutches. The developer need only set
up a new "kmsg dumper" along these lines:
    #include <linux/kmsg_dump.h>

    struct kmsg_dumper {
        void (*dump)(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
                     const char *s1, unsigned long l1,
                     const char *s2, unsigned long l2);
        struct list_head list;
        int registered;
    };
The dump() function will be called in the event of a crash; the
arguments s1 and s2 will point to the data in the kernel's output buffer,
with l1 and l2 giving the length of each piece. Two pointers are needed
due to the circular nature of this buffer; s1 will point to the older set
of messages.
Registering and unregistering this function is a matter of calling:
    int kmsg_dump_register(struct kmsg_dumper *dumper);
    int kmsg_dump_unregister(struct kmsg_dumper *dumper);
In the 2.6.33 kernel, the "mtdoops" module has been reworked to use this
new mechanism to save crash data to a flash device.
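What a dump() callback has to do with its two segments can be sketched in ordinary userspace C. The function below is not kernel code and its name is made up for the illustration; it simply shows the older segment (s1) being copied ahead of the newer one (s2) to produce a single linear log, truncating if the destination is too small.

```c
#include <stddef.h>
#include <string.h>

/*
 * Userspace model only: copy two log segments into one linear buffer,
 * oldest data (s1) first.  Returns the number of bytes copied, not
 * counting the terminating NUL.
 */
size_t linearize_log(const char *s1, size_t l1,
                     const char *s2, size_t l2,
                     char *out, size_t out_size)
{
    size_t n1, n2;

    if (out_size == 0)
        return 0;

    /* Always leave room for the terminating NUL. */
    n1 = l1 < out_size - 1 ? l1 : out_size - 1;
    memcpy(out, s1, n1);

    n2 = l2 < out_size - 1 - n1 ? l2 : out_size - 1 - n1;
    memcpy(out + n1, s2, n2);

    out[n1 + n2] = '\0';
    return n1 + n2;
}
```

If a ring buffer holds "worldhello " with the oldest byte at offset 5, then s1 covers the six bytes starting there and s2 covers the first five bytes; the linearized result reads "hello world".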
By Jonathan Corbet
December 16, 2009
Per-CPU variables are a performance-improving technology. They allow
processors to work with data without having to worry about locking or cache
contention. One would want these operations to be well optimized, but, as
it turns out, they can be improved; Tejun Heo and Christoph Lameter have
done just that for 2.6.33. In the process, they have changed the way
developers work with these variables.
There is a set of new operations:
    this_cpu_read(scalar);
    this_cpu_write(scalar, value);
    this_cpu_add(scalar, value);
    this_cpu_sub(scalar, value);
    this_cpu_inc(scalar);
    this_cpu_dec(scalar);
    this_cpu_and(scalar, value);
    this_cpu_or(scalar, value);
    this_cpu_xor(scalar, value);
In each case, scalar is either a per-CPU variable obtained with a
new allocator or a static per-CPU variable as obtained from
per_cpu_var(). All of them are atomic, in that the operation will
not be interrupted part-way through on the current processor. It is not
necessary to bracket these operations with get_cpu() and put_cpu() calls.
See, for example, the VM
statistics conversion for an example of how operations on per-CPU
variables change under the new scheme.
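The pattern behind per-CPU statistics can be modeled in a few lines of userspace C. This is a sketch of the idea and not of the kernel API: the model_* names and MODEL_NR_CPUS are invented, and the CPU number is passed explicitly here, whereas the this_cpu_*() operations select the current processor's copy implicitly.

```c
#define MODEL_NR_CPUS 4

struct model_counter {
    long slot[MODEL_NR_CPUS];   /* one private slot per CPU */
};

/* Loose model of this_cpu_add(): touch only the slot owned by cpu. */
void model_cpu_add(struct model_counter *c, int cpu, long value)
{
    c->slot[cpu] += value;
}

/* A reader wanting the global value folds all of the slots together. */
long model_counter_sum(const struct model_counter *c)
{
    long sum = 0;
    int cpu;

    for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        sum += c->slot[cpu];
    return sum;
}
```

Since each processor only ever writes to its own slot, the update path needs no lock and no cache line bounces between processors; the cost is moved to the (presumably rarer) readers, which must visit every slot.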
Kernel development news
By Jonathan Corbet
December 16, 2009
Since last week's summary, there have been over 4200 patches merged
for the 2.6.33 development cycle. That makes a total of 8152 patches
for this merge window, as of this writing.
User-visible changes include:
- If there are any remaining reiserfs users out there: that filesystem
has seen a major rework of its internal locking to eliminate use of
the big kernel lock.
- The Super-H architecture has gained perf events support for a number
of system types.
- The exofs filesystem (for object storage devices) now has multi-device
mirror support.
- There is a new "discard" mount option for ext4 filesystems,
controlling whether ext4 issues TRIM commands for newly-freed space.
It defaults to off due to fears about how well this feature will
really work once hardware begins to support it.
- It is now possible to configure a kernel without ext2 or ext3 support,
but still mount filesystems with those formats using the ext4 code.
- The Nouveau reverse-engineered NVIDIA driver has been merged, but
without the accompanying firmware; see this article for more
information.
- The "ramzswap" device, formerly known as compcache,
has been merged into the staging tree.
- There is now support for the "BATMAN" mesh network protocol in the
staging tree.
- The "perf" tool now has a "diff" mode which will calculate the change
in performance between two different runs and generate a report.
- The semantics for the O_SYNC and O_DSYNC open-time
flags have been rationalized, as described in this article.
- The MD layer now supports barrier requests for all RAID types. The
device mapper, too, has improved barrier support.
- The snapshot merge
target for the device mapper has been merged.
- An extensive set of tracepoints has been added to the XFS filesystem,
allowing fine-grained visibility into most aspects of its operation.
- Memory pages shared with the kernel shared memory (KSM)
mechanism are now swappable.
- New hardware support:
- Block devices: The VMware paravirtualized SCSI HBA device,
LSI 3ware SAS/SATA-RAID controllers,
PMC-Sierra SPC 8001 SAS/SATA based host adapters,
Apple PowerMac/PowerBook internal 'MacIO' IDE controllers,
Blackfin Secure Digital host controllers,
TI DAVINCI multimedia card interfaces, and
BCM Reference Board NAND flash controllers.
- Miscellaneous: Dynapro serial touchscreens,
Altera University Program PS/2 ports,
Samsung S3C2410 touchscreens,
National Semiconductor LM73 temperature sensors,
Nuvoton NUC900 series SPI controllers,
SuperH MSIOF SPI controllers,
OMAP SPI 100K master controllers,
ST-Ericsson AB4500 Mixed Signal Power management chips,
Freescale MC13783 realtime clocks,
Freescale MC13783 touchscreen devices,
SHARP LQ035Q1DH02 TFT displays, and
TI BQ32000 I2C realtime clocks.
- Networking: RealTek RTL8192U Wireless LAN NICs,
Agere Systems HERMES II Wireless PC Cards (Model 0110), and
Analog Devices Blackfin on-chip CAN controllers.
- Sound: AD525x digital potentiometers and
Texas Instruments DAC7512 digital-to-analog converters.
- Systems and processors: Neuros OSD 2.0 devices,
Nintendo GameCubes,
Freescale P1020RDB processors,
Freescale p4080ds reference boards,
Arcom/Eurotech ZEUS single-board computers,
ATNGW100 mkII Network Gateway boards, and
Acvilon BF561 boards.
- USB: Xilinx USB host controllers and
OMAP34xx USBHOST 3 port EHCI controllers.
- Video4Linux: OmniVision OV2610, OV3610, and OV96xx sensors,
Sharp RJ54N1CB0C sensors,
E3C EC168 DVB-T USB2.0 receivers,
E3C EC100 DVB-T demodulators,
Maxim MAX2165 silicon tuners,
Aptina MT9T112 cameras, and
DiBcom DiB0090 tuners.
Changes visible to kernel developers include:
- The scsi_debug module can now emulate "thin provisioning" devices.
- The detect() callback in struct i2c_driver has lost
the unused kind parameter. Also, struct
i2c_client_address_data is no more; address lists are represented
with simple unsigned short arrays instead.
- The spinlock renaming patch has been applied. Developers working near
low-level code will see the new arch_spinlock_t type being used with
non-sleeping (even in the realtime tree) locks.
- Video4Linux2 has a new subdevice API, called media-bus, intended to help
in the negotiation of image formats between the sensor and the controller.
- There is a new mechanism for grabbing and saving kernel messages on a
system crash; see this article for more information.
- The per-CPU variable allocator has been replaced, and there is a new
set of operations for working with these variables; see this article for
a brief introduction.
This merge window should close in the very near future, so the 2.6.33
kernel is, at this point, close to being feature-complete. Any final
additions will be noted in next week's edition.
By Jonathan Corbet
December 16, 2009
Your editor suspects that, were somebody to poll the community of Linux
users, very few would state that they dislike the idea of having their
systems suspend and resume more quickly. Rafael Wysocki has been working
toward this goal for some time; his asynchronous suspend/resume patches
were covered here back in August. This code has not encountered any real
turbulence for a while, so one might well assume that Rafael's 2.6.33 pull
request containing asynchronous suspend/resume would not be controversial.
Such assumptions, however, fail to take into account the "last-minute
Linus" effect.
The simple fact of the matter is that, like anybody else, Linus cannot
possibly follow all of the projects under way at any given time; that makes
it entirely possible for work on a specific project to proceed to a
conclusion without ever drawing his attention. That will inevitably come
to an end, though, when somebody
sends a pull request asking that the work be merged into the mainline. It
seems clear that some requests are scrutinized more closely than others,
but some are looked at closely indeed. The power management request, as it
turns out, was one of those.
Linus didn't like what he saw, to say the
least. The code struck him as overly complex and possibly unsafe; he
refused to pull it. In particular, he thought that far too much work went into
trying to map out the device tree topology and all of the dependencies
between devices. In the past, attempts to make things asynchronous based
on just the apparent topology have run into trouble; why should it be
different this time?
Having said that, Linus then went on to outline an alternative solution
based mainly on the device tree. In so doing, he wanted to make it
possible for most drivers to ignore the concept of asynchronous suspend and
resume entirely. For much of the hardware on the system, the time required for
either operation is so short that there is really little point in trying to
do it in parallel. If a device can be suspended in a few milliseconds, one
might as well just do it serially and avoid the complexity.
For the rest, Linus very much wanted the decision on whether to do things
asynchronously to be made at the driver level. But the power management
core still needs to know enough about asynchronous operation to wait until
it is done; one cannot suspend a controller until all devices connected to
it have, themselves, completed suspending. After some revisions, Linus's plan came down to something like this:
- A reader/writer semaphore (rwsem) is associated with each node in the
device tree. These semaphores allow an unlimited number of concurrent
reader locks, but only one writer lock can exist at any given time, and
writers must first wait for any readers to finish. At the beginning
of the suspend process, no locks are taken.
- The suspend process is initiated on all children of a given node. If
suspend is done synchronously, it happens right away and no further
action is required.
- Should the driver decide to suspend its device asynchronously, it
starts a thread to do that work. It also takes a read lock on the
parent's rwsem.
- When an asynchronous suspend for a specific device completes, the read
lock is released.
- The parent node acquires a write lock on its own rwsem before
suspending the device. If any child nodes are suspending
asynchronously, the write lock will block as a result of the
outstanding read locks. Only when all read locks are released -
meaning that all children are suspended - can the parent acquire its
write lock and suspend.
For resume, the parent takes its write lock first, and all children take
read locks on their parent before resuming the hardware. That will ensure
that each parent device completes resuming before any of its child devices
begin the process.
This scheme has the benefit of simplicity. Getting it implemented took a
few rounds of discussion, though, with Linus repeatedly asking developers
to retain that simplicity and not try to make up new locking schemes.
Things still changed along the way; as of this writing, the current
suspend/resume patch set does not use Linus's plan as originally
written. Among other things, Rafael, who did implement an rwsem-based
solution, ran into problems with lockdep that Linus agreed were serious.
What has been implemented instead is a variant on that scheme based on
completions. Every device node gets a completion structure, initially set
to the "not complete" state. Additionally, any driver which implements
asynchronous suspend/resume needs to call
device_enable_async_suspend() to inform the power management core
of that fact. It's now up to that core to create threads for asynchronous
suspend/resume operations, and to invoke driver callbacks from those
threads. Before suspending a specific device node, the power core will
wait for completions for any child devices which have been marked for
asynchronous callbacks. Once again, that ensures that all children have
been suspended before the parent node is suspended.
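The ordering guarantee provided by the completions can be sketched with a small, single-threaded userspace model. Nothing here is taken from the actual patch set: the model_* names are invented, a boolean stands in for each completion, and "waiting" on a child whose asynchronous work has not yet run is modeled by simply running that work on the spot.

```c
#include <stdbool.h>

struct model_dev {
    const char *name;
    int parent;   /* index of the parent device, or -1 for the root */
    bool async;   /* driver opted in to asynchronous suspend */
    bool done;    /* stands in for the per-device completion */
};

/* Suspend device i after "waiting" for each child's completion. */
void model_suspend_one(struct model_dev *devs, int n, int i,
                       int *order, int *count)
{
    int j;

    if (devs[i].done)
        return;

    /* In this model, waiting on a pending child runs its work now. */
    for (j = 0; j < n; j++)
        if (devs[j].parent == i)
            model_suspend_one(devs, n, j, order, count);

    order[(*count)++] = i;   /* the suspend callback would run here */
    devs[i].done = true;     /* completion signaled; parent may proceed */
}

/* Walk the list children-first; async devices are deferred. */
void model_suspend_all(struct model_dev *devs, int n,
                       int *order, int *count)
{
    int i;

    *count = 0;
    for (i = n - 1; i >= 0; i--)
        if (!devs[i].async)
            model_suspend_one(devs, n, i, order, count);

    /* Finish any deferred work that nobody needed to wait for. */
    for (i = n - 1; i >= 0; i--)
        model_suspend_one(devs, n, i, order, count);
}
```

Given a bridge with a synchronous USB host controller and an asynchronous disk beneath it, plus an asynchronous USB stick below the host controller, the model suspends the stick, then the host controller, then the disk, and the bridge last. Whatever the mix of synchronous and asynchronous drivers, no device is suspended ahead of its children.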
Linus doesn't like the completion-based approach, but has indicated that he
will be willing to take it. As of this writing, that has not yet happened,
though.
Seen in one light, this episode highlights the sort of disregard for
developer time which is occasionally seen in the kernel development
process. It is not that uncommon for code which has seen a lot of work to
end up being discarded or massively reworked. This model can seem quite
wasteful, and there can be no doubt that it can be highly frustrating for
the developers involved. But it is also a fundamental part of how quality
control for the kernel works. The suspend/resume code was clearly improved
by this last-minute redesign. One might say that it would have been better
done some months ago, but what matters most for Linux users is that it
happens at all.
By Jonathan Corbet
December 15, 2009
The merge window is normally a bit of a hectic time for subsystem
maintainers. They have two weeks in which to pull together a well-formed
tree containing all of the changes destined for the next kernel development
cycle. Occasionally, though, last-minute snags can make the merge window
even more busy than usual. The unexpected merging of the
Nouveau driver is the
result of one such snag - but it is a story with a happy ending for all.
Dave Airlie probably thought he had enough on his plate when he generated
the DRM pull request for 2.6.33. This tree
contained 203 commits touching 122 different files, and adding over 9,000
lines of code. One of the key features aimed at the kernel is the new
"page flipping ioctl()," helpfully described in the commit message
as "The ioctl takes an fb ID and a ctrc ID and flips the crtc to the
given fb at the next vblank." In English, it means that a specific
video output can be quickly switched from one region of video memory to
another, allowing for clean video changes without the "tearing" that
results from display of a video buffer which is being changed.
Other changes for DRM this time around include support for Intel's
"Ironlake" GPU and "Pineview" Atom processor, and a great deal of work
supporting kernel mode setting on Radeon GPUs. Radeon, it seems,
only lacks good power management support at this point; it will likely lose
its "staging" designation before the end of this development cycle.
Linus was not impressed by any of that, though. Instead, he had one concern: the fact that the Nouveau driver
- a reverse-engineered driver for NVIDIA chipsets - was not a part of the
pull request. Nouveau had been discussed at the 2009 Kernel Summit, and it was
generally agreed that this code should find its way into the mainline as
soon as possible. 2.6.33 is the first merge window since the summit, and
Linus clearly had expected some action on that front. When he didn't get
it, he made his disappointment known.
One might wonder what the problem with Nouveau was. The world is full of
out-of-tree Linux drivers; recent efforts have reduced their number
considerably, but they still exist and Linus does not normally complain
about them. Certainly Nouveau has a higher profile than most other
out-of-tree drivers; it is the only hope for a free driver for a large
percentage of available machines. But the real problem is that Fedora (at
least) has been shipping this driver without doing enough (in Linus's
opinion) to get it upstream. In Linus's words:
I'm pissed off at distribution people. For years now, distributions
have talked about "upstream first", because of the disaster and
fragmentation that was Linux-2.4. And most of them do it, and have
been fairly good about it.
But not only is Fedora not following the rules, I know that Fedora
people are actively making excuses about not following the rules. I
know Red Hat actually employs (full-time or part-time I have no
idea) some Nouveau developer, and by that point Red Hat should also
man up and admit that they need to make "merge upstream" be a
priority for them.
A number of reasons for the non-merging of Nouveau have been given, ranging
from "not ready yet" and "unstable user-space API" to "we haven't found the
time yet." The real blocker in recent times, though, has been the binary
blob loaded into some NVIDIA GPUs by the driver. This chunk of code, known
as the "voodoo" or "ctxprogs," was obtained by watching the proprietary
drivers in action. Since nobody in the Nouveau project wrote this code,
nobody has been willing to sign off on it; it's not at all clear that it
can be legally distributed. Linus has not been
impressed by this reason either, but the fact remains: developers take
the Signed-off-by: line seriously and are not willing to attach it
to something which might be legally questionable.
The obvious answer, one which has been applied in other situations, is to
pull the firmware out of the driver and load it into the kernel at run
time. And that is exactly what happened with Nouveau: Ben Skeggs put in an
intensive effort to remove ctxprogs and use the firmware loading API to get
it when the driver loads. Dave then put together the "DRM Nouveau pony tree" and
requested that it be pulled for 2.6.33. Linus, of course, did exactly
that.
Potential users will still have to get the "ctxprogs" from elsewhere. For
whatever reason, pointers to "elsewhere" are hard to find, but your editor
happens to know that the firmware can be found in the
Nouveau git tree. Simply grabbing the right version and placing it in
the local firmware directory should be sufficient.
All of this marks significant progress for Nouveau, but a dependence on
firmware of dubious origin is likely to inhibit the adoption of this driver
in the long term. So it was good to learn (via an LWN comment posting) that the
contents of the ctxprogs blob are not quite as obscure as many of us had
thought:
[W]e know a lot about ctxprogs these days, including their purpose
[context switching], what they do [save/restore PGRAPH state], and
most of their opcodes. There are still some unknowns that prevent
us from writing new ctxprogs from scratch right now, but we're
working on that and it *will* be resolved in the proper way. Which
is throwing out nvidia's progs and writing our own prog generator.
It seems that things are moving quickly on this front too; on
December 15, Ben announced the
availability of a replacement firmware for NVIDIA GeForce 6/7
hardware. This is a first posting for this code; doubtless testers will
encounter some problems. But it sounds very much like the hardest problems
have been overcome, at least for this particular variant of the hardware.
With luck, NVIDIA's firmware will not be needed for much longer. In the
longer term, it might even turn out to be possible to program interesting
functions into the hardware, extending its capabilities in surprising ways.
Once upon a time, Linux users had to be very careful about which hardware
they bought. Over the years, most of those problems have gone away; it is
now easy to find systems which are completely supported by free software.
One of the biggest exceptions has been in the area of graphics. Vendors
like Intel and ATI/AMD have made the decision that their hardware should be
supported with free drivers (most of the time) and have invested resources
to make that happen. NVIDIA has been rather less cooperative, and support for its
hardware has suffered accordingly. It would appear that the driver problem
is getting close to a solution, but we should never forget the effort which
was required to get to this point. NVIDIA would be far more worthy of our
future commercial support if it had not made that effort necessary.
Page editor: Jonathan Corbet