Kernel development
Brief items
Kernel release status
The 3.2 merge window is still open, so there is no development kernel prepatch as of this writing. See the separate article below for a summary of what has been merged into the mainline thus far.
Stable updates: no stable updates have been released in the last week. The 2.6.32.47 and 2.6.33.20 updates are in the review process as of this writing; they can be expected on or after November 4. Both are significant updates with over 100 fixes. 2.6.33.20 is expected to be the last update for 2.6.33 (for real this time); realtime users are encouraged to move to 3.0, which will be supported as a long-term release.
Quote of the week
/*
 * More smoking hash instead of calculating it, damn see these
 * numbers float.. I bet that a pink elephant stepped on my memory.
 */
/*
 * Can't be having no nameless bastards around this place!
 */
/*
 * What it says above ^^^^^, I suggest you read it.
 */
/*
 * Plain impossible, we just registered it and checked it weren't no
 * NULL like.. I bet this mushroom I ate was good!
 */
/*
 * I took it apart and put it back together again, except now I have
 * these 'spare' parts.. where shall I put them.
 */
/*
 * We had N bottles of beer on the wall, we drank one, but now
 * there's not N-1 bottles of beer left on the wall...
 */
(Partially) graduating IIO
The industrial I/O (IIO) subsystem has lived in the staging tree for some time. It provides a framework for drivers that deal with all kinds of sensors that measure quantities like voltages, temperatures, acceleration, ambient light, and more. There has been some disagreement over the years about how sensors of this type should fit into the kernel; IIO, it is hoped, will provide the answer.
The core IIO code sat out of tree for a long time; the state of the code, it is said, reflected that fact. There has been a determined effort to improve things in the staging tree, with some measurable results. There is now a set of core IIO patches that, according to maintainer Jonathan Cameron, is ready to move out of staging and into the mainline proper.
IIO sensors vary a lot, from simple, low-bandwidth sensors to complex, high-bandwidth devices. The initial IIO move is aimed at the first set. For this kind of sensor, the user-space interface is expected to live entirely in sysfs, under /sys/bus/iio/devices. Each device entry will have a number of attributes; some, like name and sampling_frequency, will be present for all sensors. Others will depend on what the sensor actually measures; the proposed ABI attempts to standardize the names of those attributes wherever possible.
The plan is to get this core interface into the mainline, then to start moving the simpler (and cleaner) drivers after it. Support for more complex devices will come later. As of this writing, this code has not been pulled for 3.2, but that could yet happen. Meanwhile, vast numbers of IIO changes have gone into the staging tree for 3.2; there is clearly a lot of interest in getting this subsystem into shape.
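
For the simple sensors described above, a user-space client would need little more than the ability to read small text files. What follows is a minimal sketch of that idea, not code from the IIO patches: read_sysfs_attr() is a hypothetical helper name, and the only attribute names taken from the proposed ABI are name and sampling_frequency.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Read one sysfs-style attribute (a small text file) into buf.
 * Returns 0 on success, -1 on failure; a trailing newline is stripped.
 * read_sysfs_attr() is a hypothetical helper, not part of any IIO library.
 */
int read_sysfs_attr(const char *dev_dir, const char *attr,
                    char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    assert(dev_dir != NULL && attr != NULL && buf != NULL && len > 0);

    n = snprintf(path, sizeof(path), "%s/%s", dev_dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;              /* the path would not fit */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* attribute absent or unreadable */

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;              /* empty attribute or read error */
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would pass the directory of one device entry under /sys/bus/iio/devices as dev_dir and an attribute such as "name" or "sampling_frequency" as attr. The name of the per-device directory is left to the caller because it is not specified in the text above.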
Kernel development news
3.2 merge window, part 1
Linus released the 3.1 kernel and opened the 3.2 merge window on October 24 while sitting in the 2011 Kernel Summit. As of this writing, nearly 8200 non-merge changesets have been pulled into the mainline. That is a large number of changes, and a number of significant trees have yet to be pulled. This looks like it will be the busiest development cycle in some time, perhaps the biggest ever.
The most significant user-visible changes merged for 3.2 include:
- The TCP stack now supports proportional
rate reduction, an algorithm which allows for faster recovery
after transient network problems.
- Support for persistent alias names for
disk devices has been added to the block layer.
- The TOMOYO security module can now implement restrictions on
environment variable names and socket operations.
- The extended verification module
subsystem, which uses the trusted platform module to protect a system
against offline modifications to files, has been merged.
- The CFS bandwidth controller, allowing
an administrator to set maximum CPU usage for groups of processes, has
been merged. See the documentation
file for information on how to use this feature.
- RAID 5 support has been added for
object-storage devices. This is the third RAID 5 implementation
in the kernel, with another (for btrfs) due to arrive in the near future.
- The s390 architecture has gained kernel crash dump support.
- The cross-memory attach facility, meant
to provide for fast interprocess messaging, is now in the mainline.
The form of the system calls has changed since this patch was last
covered here, though:
    ssize_t process_vm_readv(pid_t pid, const struct iovec *lvec,
                             unsigned long liovcnt, const struct iovec *rvec,
                             unsigned long riovcnt, unsigned long flags);
    ssize_t process_vm_writev(pid_t pid, const struct iovec *lvec,
                              unsigned long liovcnt, const struct iovec *rvec,
                              unsigned long riovcnt, unsigned long flags);
See the man page for more information.
- The mremap() system call now works properly with transparent
huge pages, reducing the number of page-split operations.
- The x86 architecture has gained an SSSE3-optimized implementation of
the SHA1 hash algorithm. From the changelog: "With this algorithm I was
able to increase the throughput of a single IPsec link from 344 Mbit/s
to 464 Mbit/s on a Core 2 Quad CPU using the SSSE3 variant -- a speedup
of +34.8%." Optimized implementations of Blowfish and Twofish have been
added as well.
- There is a new user-space configuration interface for the crypto
layer. Unfortunately, the implementers do not yet appear to have
gotten around to writing any documentation for this interface.
- Support for the "Hexagon" DSP-based architecture has been merged; see
this article for more information.
- DVFS ("dynamic voltage and frequency scaling") is a new mechanism for
controlling devices that can operate at multiple voltage and frequency
values, trading off between power consumption and performance as
required. It is analogous to the cpufreq governor mechanism used for
the CPU.
- New device drivers:
- Processors and systems:
Analog Devices EVAL-ADAU1373 boards,
RSI Embedded Webserver boards,
Calao USB-A9G20 boards,
Samsung EXYNOS4212 SoC processors,
Samsung SMDK4412 boards,
Picochip picoXcell-based boards,
OMICRON electronics DEVIXP, MICCPT, and MIC256 boards,
DENX M28EVK boards,
Vision Engraving Systems EP9307 systems, and
Calxeda Highbank-based boards.
- Block:
Realtek RTS5139 USB card readers and
Marvell Universal Message Interface devices.
- Graphics:
SMSC UFX6000/7000 USB framebuffer devices,
Aeroflex Gaisler framebuffer devices, and
Samsung SoC EXYNOS series graphic units.
- Input: BMA150/SMB380 acceleration sensors,
Wiimote accelerometer and IR devices, and
Bachmann electronic TSC-10, 25 or 40 serial touchscreens.
- Media: Wolfson Micro WM5100 low-power audio subsystems,
Analog Devices ADAU1373 audio codecs,
Au1000/Au1500/Au1100 audio controllers,
ITE IT913x demodulators and tuners,
Aptina MT9P031 and MT9T001 sensors,
NXP TDA10071 DVB-S/S2 demodulators,
Conexant CX24118A tuners,
TOPRO USB cameras,
Pinnacle PCTV HDTV Pro USB devices, and
TT Connect S2-3600 cards.
- Miscellaneous: Incite Technology USB-DUXsigma data
acquisition boards,
Qualcomm PM8xxx-based vibrator devices,
Analog Devices AD7280A lithium ion battery monitoring devices,
Analog Devices AD7190, AD7192 and AD7195 analog to digital
convertors,
Wiimote rumble and force-feedback devices,
TI PICO DLP mini-projector devices,
Samsung EXYNOS4 thermal management units,
Linear Technologies LTC2978 hardware monitoring systems,
ePAPR hypervisor byte channels,
Renesas TPU LED controllers, and
GPIO-controlled regulators.
- Networking: Marvell PCIE 8766 wireless adapters.
- USB: Marvell PXA168 on-chip EHCI HCD controllers and
DesignWare USB3 DRD controllers.
- Note also that the ath6kl wireless driver, the brcm80211 wireless driver, the tm6000 V4L2 driver, the Altera FPGA firmware download module, and the core hyper-v driver have graduated from the staging directory into the mainline.
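
The cross-memory attach calls listed above take two iovec arrays: one describing local buffers, one describing address ranges in the target process. A minimal sketch of the simplest case, with one buffer on each side, might look like the following. The wrapper name copy_from_process() is hypothetical, and the sketch assumes a C library that exposes process_vm_readv() through <sys/uio.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes starting at remote_addr in process pid into dst,
 * using one local and one remote iovec.
 * Returns the number of bytes copied, or -1 with errno set.
 * copy_from_process() is a hypothetical wrapper name.
 */
ssize_t copy_from_process(pid_t pid, void *dst,
                          const void *remote_addr, size_t len)
{
    struct iovec local = { .iov_base = dst, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };

    assert(dst != NULL && remote_addr != NULL);

    /* The flags argument is currently unused and is passed as zero. */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

The call is subject to permission checks on the target process, and some sandboxed environments forbid it outright, so a caller should be prepared for a return value of -1 with errno set to EPERM or ENOSYS.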
Changes visible to kernel developers include:
- The network drivers directory (drivers/net) has been
massively rearranged with most drivers moved into media-specific
or protocol-specific subdirectories.
- The new "pin control subsystem" allows developers on embedded systems
to configure the many multi-purpose pins found on contemporary
system-on-chip processors. See Documentation/pinctrl.txt for the
details. Drivers for the U300 and CSR SiRFprimaII pinmuxes have been
added as well.
- The new module_platform_driver() macro can eliminate a bunch
of boilerplate code for simple platform drivers.
- The power management quality-of-service API has grown a new capability for the management of per-device QOS constraints; it is intended to be used with the new DVFS subsystem. See Documentation/power/pm_qos_interface.txt for details on this API.
The merge window can be expected to run through approximately November 7. The latter part of the merge window will be summarized here in the November 10 Weekly Edition.
A btrfs update at LinuxCon Europe
In October, the btrfs user community expressed concerns about the still missing-in-action filesystem checker and repair tool. At that time, btrfs creator Chris Mason said that he hoped to demonstrate a working checker during his LinuxCon Europe session. Your editor was there as part of a standing-room-only crowd ready to see the show; we did indeed get a demonstration, but it may not have been quite what some attendees expected.
Chris started by talking about btrfs and its goals in general; those have been well covered here and need not be repeated now. He reiterated Oracle's plan to use btrfs as the core filesystem for its RHEL-derivative Linux distribution; needless to say, supporting that role requires a rock-solid implementation. So a lot of work has been going into extensive testing of the filesystem and fixing bugs.
The 3.2 kernel release will see the results of that work; it will contain
lots of fixes. There will also be significant improvements to the logging
code. It turns out that a lot of data was being logged more than once,
greatly increasing the amount of I/O required; that has now been fixed.
I/O traffic for the log, it seems, has been cut to about 25% of its
previous level.
For 3.3, the main improvement seems to be the use of larger blocks for nodes in the filesystem B-tree. Larger blocks can hold more data, of course, and, in particular, more metadata. That means that metadata that was previously scattered in the filesystem can be kept together with the relevant inode. That, in turn, leads to significant performance improvements for many filesystem operations.
Another near-term feature, due to arrive "right after fsck", is the merging
of Dave Woodhouse's RAID5 and RAID6 implementations. That work was
initially posted in 2009; Chris apologized for taking so long to get it
merged. How this feature will actually be used still needs some thought;
RAID5 or 6 is quite good for data, but it can be problematic for metadata,
which tends to not fill anything close to a full RAID stripe and, thus, can
lead to low I/O performance. Happily, btrfs has been designed from the
beginning to keep data and metadata separate; that means that a filesystem
can be set up so that data is protected with full RAID while metadata is
managed using simple mirroring.
Talk of protecting metadata leads naturally to the problem of recovering a
filesystem when its metadata has been corrupted. That is what a filesystem
checker program is for; btrfs, thus far, has been increasingly famous for
its lack of a proper checker (and, more importantly, a proper filesystem
repair tool). As of the LinuxCon talk, btrfs still does not have a real
repair tool, but some progress has been made in that direction and a couple
of other mechanisms have been provided.
The copy-on-write nature of btrfs implies that there will be numerous old
copies of the filesystem metadata on the storage device at any given time.
Any change, after all, will create a new copy, leaving the previous version
in place until the block is reused.
Chris observed that filesystem corruptions rarely affect that older
metadata, so it makes sense to use it as a primary resource in the recovery
of a corrupted disk. But, first, one needs to be able to find that
older metadata.
To that end, btrfs maintains an array containing the block locations of
many older versions of the filesystem root. The root block, he said, is
more important than the superblock when it comes to recovering data. The
root is replaced often as metadata changes percolate up to the top of the
directory hierarchy, so the "old root blocks" array contains pointers to
what is, in effect, a set of snapshots of the very recent state of the
filesystem. Clearly, this will be a valuable resource should something go
badly wrong.
One way of using that array is simply to mount the filesystem using an
older version of the root. Chris demonstrated this feature by poking holes
in a test filesystem, then mounting an older root to get back to where
things had been before. For simple, quickly-detected problems, older root
blocks should be a path toward a quick solution.
It is not too hard to imagine situations where this approach will not work,
though. If a metadata block in a rarely-changed subtree is, say, zeroed by
a hardware malfunction, it could go undetected for some time. By the time
the user realizes that something is wrong, there may be no older hierarchy
containing the information needed to put things back together. So other
solutions will be necessary.
Obviously, one of those solutions will be the full filesystem checker and
repair tool. That tool is still not ready, though. Getting a repair tool
right is a hard problem; without a lot of care, a well-intentioned attempt
to repair a filesystem can easily make it worse. Data that may have been
recoverable before the repair attempt may no longer be so afterward. Even
if a proper btrfsck were available today, it would probably be some years
before it reflected enough experience to inspire confidence in users who
are concerned about their data.
So it seems that something else is required. That "something else" turns
out to be a data recovery tool written by Josef Bacik. This tool has a
simple (to explain) job: dig through a corrupted filesystem in read-only
mode and extract as much of the data as possible. Since it makes no
changes, it cannot make things worse; it seems like a worthwhile tool to
have around even if a full repair tool existed.
That tool, along with all the requisite filesystem support, is expected to
be available in the 3.2 kernel time frame. Meanwhile, there is a new btrfs-progs repository that will include
the recovery tool in the near future. All told, it may not be quite the
btrfsck that some users were hoping for, but it should be enough to make
those users feel a bit more confident about entrusting their data to a new
filesystem. Judging from the size of the crowd at Chris's talk, there are
a lot of people interested in doing exactly that.
[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Europe.]
Frontswap gets broadsided
"Frontswap" is the second half of Dan Magenheimer's transcendent memory concept; the first half
("cleancache") was merged for the 3.0 kernel. Given that the job was
halfway done, one might be forgiven for thinking that getting frontswap
merged would not be a big challenge, despite the fact that, like many
memory-management patches, transcendent memory has had a long and somewhat
rocky path into the mainline. Dan must have known better, though, as
evidenced by his decision to copy your editor on the frontswap pull request, nicely providing a
front-row seat to the 100+ messages that followed. Some version of this
patch set may well make it into the mainline eventually, but it now seems
quite unlikely to happen in the 3.2 cycle.
The core idea behind frontswap is to provide a less expensive alternative
to pushing a page out to the swap device. That alternative could be one of
a number of possibilities: storing the page (possibly compressed) in a
memory pool shared between
virtual machines, writing it to an SSD-based intermediate device, or adding a
reference to a stored page with duplicate contents, for example. Frontswap
is not required to accept a page handed to it, but, if it does accept that
page, it must be able to reproduce it on demand in the future. The primary
use case appears to be balancing memory use between Xen-based virtual
machines, but others can be imagined.
If one were to look at the initial response to the post, it would appear
that there was a groundswell of support for these patches; several messages
came in calling for their inclusion. Those messages, however, came from other
people at Oracle (Dan's employer) or other large companies, and
their authors are not normally known for their participation in
conversations about memory management code, so they may have had something
other than the intended effect. It looked a bit like an organized pressure
campaign. When the core kernel developers started to respond, the tone of
the conversation changed considerably.
There were a number of complaints raised. The frontswap patches were not
going through the -mm tree, and they did not carry acks from any of the
recognized memory management developers, so some people started to suspect
that Dan was trying to circumvent the normal processes. There is also a
fair amount of doubt about the utility of the patches and the way they
operate; Christoph Hellwig, for example, described frontswap as "a bunch of
really ugly hooks over core code, without a clear definition of how they
work or a killer use case." Various core memory management
developers, their attention drawn by the pull request, found a number of
things not to like.
One other complaint raised was the lack of any sort of associated
benchmarks. Frontswap is, in the end, a sort of performance-enhancing
patch; such changes are normally expected to be accompanied by test results
showing that performance is indeed enhanced for the target workloads.
Equally important is showing that performance is not hurt on other
workloads - always a big concern when making changes to memory management
behavior. For this kind of change, it is important to show that there is
no impact on systems where the new facility is not used at all; Dan has not
yet done that.
Chances are good that satisfying benchmark results can be produced
eventually, and that the technical objections that have been raised can be
fixed. Even then, though, frontswap is unlikely to get an immediate green
light for merging into the mainline, for a few reasons. One is that life
is never easy for those making core memory management changes; experience
has shown that it is far too easy to make mistakes that only show up many
months later when somebody tries their important workload on a new kernel.
Dan has complained about the "hazing" he
has gone through, but he has had an easier time than some others.
That said, Dan's life has not been improved by the association of his work
with Xen which, while being free software that is now mostly in the
mainline, is still looked upon dimly by many developers. His interaction
style also sometimes does not help. Finally, Dan has, by virtue of unfortunate
timing that is not at all his fault, run into another problem that was best
explained by Andrew Morton:
Dan had not explicitly volunteered for that role, but, then, few people (or
dummies) ever do. But, at this point, the process will have to play out on
those terms. Barring some sort of surprising executive decision by Linus,
this particular discussion is unlikely to come to a resolution before the
close of the 3.2 merge window.
Another idea discussed at the kernel summit was that code that is in active
use, and that is shipped by distributors over the long haul, should
probably find its way into the mainline eventually even if it is not
entirely pleasing. Transcendent memory has been in the openSUSE kernel
since 2009, and has been shipped by Oracle for some time as well. Clearly,
some people see value in this work. Given time, patience, and a willingness to
address technical issues, that should be sufficient to get this capability
into the mainline eventually.
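
The frontswap contract described in this article (a backend may refuse any page offered to it, but must be able to return every page it did accept) can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: the toy_* names, the four-slot capacity, and the 4096-byte page size are all invented for illustration and are not taken from the frontswap patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_SLOTS     4   /* deliberately tiny so that refusals are easy to see */

struct toy_slot {
    bool used;
    unsigned long offset;               /* swap offset, used as the key */
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_store {
    struct toy_slot slots[TOY_SLOTS];   /* must be zeroed before first use */
};

/* Return the slot holding the page at 'offset', or NULL if there is none. */
static struct toy_slot *toy_find(struct toy_store *s, unsigned long offset)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (s->slots[i].used && s->slots[i].offset == offset)
            return &s->slots[i];
    return NULL;
}

/*
 * Offer a page to the store.  Returns true if the page was accepted;
 * false means the caller must fall back to the real swap device.
 */
bool toy_put_page(struct toy_store *s, unsigned long offset, const void *page)
{
    struct toy_slot *slot;

    assert(s != NULL && page != NULL);

    slot = toy_find(s, offset);         /* reuse the slot on a repeated put */
    for (size_t i = 0; slot == NULL && i < TOY_SLOTS; i++)
        if (!s->slots[i].used)
            slot = &s->slots[i];
    if (slot == NULL)
        return false;                   /* store is full: refuse the page */

    slot->used = true;
    slot->offset = offset;
    memcpy(slot->data, page, TOY_PAGE_SIZE);
    return true;
}

/* Copy a previously accepted page back out; false if it was never stored. */
bool toy_get_page(struct toy_store *s, unsigned long offset, void *page)
{
    struct toy_slot *slot;

    assert(s != NULL && page != NULL);

    slot = toy_find(s, offset);
    if (slot == NULL)
        return false;
    memcpy(page, slot->data, TOY_PAGE_SIZE);
    return true;
}

/* Drop a stored page, for example when its swap entry is freed. */
void toy_invalidate_page(struct toy_store *s, unsigned long offset)
{
    struct toy_slot *slot = toy_find(s, offset);

    if (slot != NULL)
        slot->used = false;
}
```

In this model a refusal is always safe because the caller still has the page and can write it to swap as usual. An accepted page is the store's responsibility until it is invalidated, which is why a real backend has to be careful about what it agrees to hold.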
Page editor: Jonathan Corbet