Kernel development [LWN.net]

Kernel release status

The 3.2 merge window is still open, so there is no development kernel prepatch as of this writing. See the separate article below for a summary of what has been merged into the mainline thus far.

Stable updates: no stable updates have been released in the last week. The 2.6.32.47 and 2.6.33.20 updates are in the review process as of this writing; they can be expected on or after November 4. Both are significant updates with over 100 fixes. 2.6.33.20 is expected to be the last update for 2.6.33 (for real this time); realtime users are encouraged to move to 3.0, which will be supported as a long-term release.


Quote of the week

/*
 * More smoking hash instead of calculating it, damn see these
 * numbers float.. I bet that a pink elephant stepped on my memory.
 */

/*
 * Can't be having no nameless bastards around this place!
 */

/*
 * What it says above ^^^^^, I suggest you read it.
 */

/*
 * Plain impossible, we just registered it and checked it weren't no
 * NULL like.. I bet this mushroom I ate was good!
 */

/*
 * I took it apart and put it back together again, except now I have
 * these 'spare' parts.. where shall I put them.
 */
 
/*
 * We had N bottles of beer on the wall, we drank one, but now
 * there's not N-1 bottles of beer left on the wall...
 */

-- Somebody told Peter Zijlstra to add some comments


(Partially) graduating IIO

By Jonathan Corbet
November 2, 2011

The industrial I/O (IIO) subsystem has lived in the staging tree for some time. It provides a framework for drivers that deal with all kinds of sensors that measure quantities like voltages, temperatures, acceleration, ambient light, and more. There has been some disagreement over the years about how sensors of this type should fit into the kernel; IIO, it is hoped, will provide the answer.

The core IIO code sat out of tree for a long time; the state of the code, it is said, reflected that fact. There has been a determined effort to improve things in the staging tree, with some measurable results. A set of core IIO patches is, according to maintainer Jonathan Cameron, now ready to move out of staging and into the mainline proper.

IIO sensors vary a lot, from simple, low-bandwidth sensors to complex, high-bandwidth devices. The initial IIO move is aimed at the first set. For this kind of sensor, the user-space interface is expected to live entirely in sysfs, under /sys/bus/iio/devices. Each device entry will have a number of attributes; some, like name and sampling_frequency, will be present for all sensors. Others will depend on what the sensor actually measures; the proposed ABI attempts to standardize the names of those attributes wherever possible.
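As a rough illustration of what a sysfs-only interface means for user space, the sketch below reads a single attribute file for one device. The helper name iio_read_attr() and the device directory name used in the usage note are assumptions made for this example; only the /sys/bus/iio/devices location and the name and sampling_frequency attributes come from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Location of the IIO device entries, as named in the article. */
#define IIO_SYSFS_BASE "/sys/bus/iio/devices"

/*
 * Hypothetical helper: read attribute `attr` of device directory `device`
 * found under `base` into `buf`, stripping the trailing newline.
 * Returns 0 on success, -1 on any error.
 */
static int iio_read_attr(const char *base, const char *device,
                         const char *attr, char *buf, size_t len)
{
    char path[512];
    FILE *f;
    int n;

    assert(base && device && attr && buf);
    if (len == 0)
        return -1;

    /* Build <base>/<device>/<attr>, refusing paths that do not fit. */
    n = snprintf(path, sizeof(path), "%s/%s/%s", base, device, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* sysfs attributes are newline-terminated text; drop the newline. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller might use iio_read_attr(IIO_SYSFS_BASE, "device0", "name", buf, sizeof(buf)), where "device0" stands in for whatever directory name the kernel actually assigns to the sensor.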

The plan is to get this core interface into the mainline, then to start moving the simpler (and cleaner) drivers after it. Support for more complex devices will come later. As of this writing, this code has not been pulled for 3.2, but that could yet happen. Meanwhile, vast numbers of IIO changes have gone into the staging tree for 3.2; there is clearly a lot of interest in getting this subsystem into shape.


3.2 merge window, part 1

By Jonathan Corbet
November 2, 2011

Linus released the 3.1 kernel and opened the 3.2 merge window on October 24 while sitting in the 2011 Kernel Summit. As of this writing, nearly 8200 non-merge changesets have been pulled into the mainline. That is a large number of changes, and a number of significant trees have yet to be pulled. This looks like it will be the busiest development cycle in some time, perhaps the biggest ever.

The most significant user-visible changes merged for 3.2 include:

The TCP stack now supports proportional rate reduction, an algorithm which allows for faster recovery after transient network problems.
Support for persistent alias names for disk devices has been added to the block layer.
The TOMOYO security module can now implement restrictions on environment variable names and socket operations.
The extended verification module subsystem, which uses the trusted platform module to protect a system against offline modifications to files, has been merged.
The CFS bandwidth controller, allowing an administrator to set maximum CPU usage for groups of processes, has been merged. See the documentation file for information on how to use this feature.
RAID 5 support has been added for object-storage devices. This is the third RAID 5 implementation in the kernel, with another (for btrfs) due to arrive in the near future.
The s390 architecture has gained kernel crash dump support.
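For the CFS bandwidth controller mentioned in the list above, the limit is expressed as a quota of CPU time per enforcement period. The sketch below shows that arithmetic and a small file writer. The control-file names cpu.cfs_quota_us and cpu.cfs_period_us are taken from the kernel's bandwidth-control documentation rather than from this article, and the helper names and the cgroup path in the usage note are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Convert "at most `cpus` CPUs' worth of run time" into a quota in
 * microseconds for a given period: quota = cpus * period, rounded
 * to the nearest microsecond.
 */
static long cfs_quota_us(double cpus, long period_us)
{
    assert(cpus > 0.0 && period_us > 0);
    return (long)(cpus * (double)period_us + 0.5);
}

/*
 * Hypothetical helper: write one decimal value to a control file.
 * Returns 0 on success, -1 on error.
 */
static int write_control_file(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", value) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Limiting a group to one and a half CPUs with a 100ms period would then amount to writing cfs_quota_us(1.5, 100000), that is 150000, to that group's cpu.cfs_quota_us file; the location of the file depends on where the cpu cgroup hierarchy is mounted on a given system.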

The cross-memory attach facility, meant to provide for fast interprocess messaging, is now in the mainline. The form of the system calls has changed since this patch was last covered here, though:

    ssize_t process_vm_readv(pid_t pid, const struct iovec *lvec,
                             unsigned long liovcnt, const struct iovec *rvec,
                             unsigned long riovcnt, unsigned long flags);

    ssize_t process_vm_writev(pid_t pid, const struct iovec *lvec,
                              unsigned long liovcnt, const struct iovec *rvec,
                              unsigned long riovcnt, unsigned long flags);

See the man page for more information.
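A minimal sketch of a single-segment read using the prototype above might look like the following. The helper name read_remote() is an assumption made for this example, and the call is issued through syscall() so that the code does not depend on the C library providing a wrapper; where a wrapper exists, calling process_vm_readv() directly should be equivalent.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes from address `remote_addr` in
 * process `pid` into `local_buf` in the calling process.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t read_remote(pid_t pid, void *local_buf,
                           void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    assert(local_buf != NULL && remote_addr != NULL);

    /* One local segment, one remote segment, no flags set. */
    return syscall(SYS_process_vm_readv, pid, &local, 1UL,
                   &remote, 1UL, 0UL);
}
```

The remote address must be meaningful in the target process's address space, and the caller needs sufficient privilege over that process. A failed call returns -1, so callers should be prepared to fall back to another transport.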

The mremap() system call now works properly with transparent huge pages, reducing the number of page-split operations.
The x86 architecture has gained an SSSE3-optimized implementation of the SHA1 hash algorithm. From the changelog: "With this algorithm I was able to increase the throughput of a single IPsec link from 344 Mbit/s to 464 Mbit/s on a Core 2 Quad CPU using the SSSE3 variant -- a speedup of +34.8%." Optimized implementations of Blowfish and Twofish have been added as well.
There is a new user-space configuration interface for the crypto layer. Unfortunately, the implementers do not yet appear to have gotten around to writing any documentation for this interface.
Support for the "Hexagon" DSP-based architecture has been merged; see this article for more information.
DVFS ("dynamic voltage and frequency scaling") is a new mechanism for controlling devices that can operate at multiple voltage and frequency values, trading off between power consumption and performance as required. It is analogous to the cpufreq governor mechanism used for the CPU.
New device drivers:
- Processors and systems: Analog Devices EVAL-ADAU1373 boards, RSI Embedded Webserver boards, Calao USB-A9G20 boards, Samsung EXYNOS4212 SoC processors, Samsung SMDK4412 boards, Picochip picoXcell-based boards, OMICRON electronics DEVIXP, MICCPT, and MIC256 boards, DENX M28EVK boards, Vision Engraving Systems EP9307 systems, and Calxeda Highbank-based boards.
- Block: Realtek RTS5139 USB card readers and Marvell Universal Message Interface devices.
- Graphics: SMSC UFX6000/7000 USB framebuffer devices, Aeroflex Gaisler framebuffer devices, and Samsung SoC EXYNOS series graphic units.
- Input: BMA150/SMB380 acceleration sensors, Wiimote accelerometer and IR devices, and Bachmann electronic TSC-10, 25 or 40 serial touchscreens.
- Media: Wolfson Micro WM5100 low-power audio subsystems, Analog Devices ADAU1373 audio codecs, Au1000/Au1500/Au1100 audio controllers, ITE IT913x demodulators and tuners, Aptina MT9P031 and MT9T001 sensors, NXP TDA10071 DVB-S/S2 demodulators, Conexant CX24118A tuners, TOPRO USB cameras, Pinnacle PCTV HDTV Pro USB devices, and TT Connect S2-3600 cards.
- Miscellaneous: Incite Technology USB-DUXsigma data acquisition boards, Qualcomm PM8xxx-based vibrator devices, Analog Devices AD7280A lithium ion battery monitoring devices, Analog Devices AD7190, AD7192 and AD7195 analog to digital converters, Wiimote rumble and force-feedback devices, TI PICO DLP mini-projector devices, Samsung EXYNOS4 thermal management units, Linear Technologies LTC2978 hardware monitoring systems, ePAPR hypervisor byte channels, Renesas TPU LED controllers, and GPIO-controlled regulators.
- Networking: Marvell PCIE 8766 wireless adapters.
- USB: Marvell PXA168 on-chip EHCI HCD controllers and DesignWare USB3 DRD controllers.
- Note also that the ath6kl wireless driver, the brcm80211 wireless driver, the tm6000 V4L2 driver, the Altera FPGA firmware download module, and the core hyper-v driver have graduated from the staging directory into the mainline.

Changes visible to kernel developers include:

The network drivers directory (drivers/net) has been massively rearranged with most drivers moved into media-specific or protocol-specific subdirectories.
The new "pin control subsystem" allows developers on embedded systems to configure the many multi-purpose pins found on contemporary system-on-chip processors. See Documentation/pinctrl.txt for the details. Drivers for the U300 and CSR SiRFprimaII pinmuxes have been added as well.
The new module_platform_driver() macro can eliminate a bunch of boilerplate code for simple platform drivers.
The power management quality-of-service API has grown a new capability for the management of per-device QOS constraints; it is intended to be used with the new DVFS subsystem. See Documentation/power/pm_qos_interface.txt for details on this API.
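To show the kind of boilerplate that module_platform_driver() is meant to remove, here is a user-space mock of the pattern. Everything in it is a stand-in: the struct, the register and unregister functions, and the macro module_platform_driver_sketch() are simplified assumptions written for this example, not the kernel's definitions, and the hookup into module load and unload that the real macro performs is left out.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins, NOT the kernel's definitions. */
struct platform_driver {
    const char *name;
    int registered;
};

static int platform_driver_register(struct platform_driver *drv)
{
    assert(drv != NULL);
    drv->registered = 1;
    return 0;
}

static void platform_driver_unregister(struct platform_driver *drv)
{
    assert(drv != NULL);
    drv->registered = 0;
}

/*
 * Sketch of the idea: a single macro invocation generates the init and
 * exit functions that a simple driver would otherwise spell out by hand.
 * The real macro also arranges for these to run at module load and
 * unload; that part is omitted from this mock.
 */
#define module_platform_driver_sketch(drv)                  \
    static int drv##_init(void)                             \
    {                                                       \
        return platform_driver_register(&(drv));            \
    }                                                       \
    static void drv##_exit(void)                            \
    {                                                       \
        platform_driver_unregister(&(drv));                 \
    }

/* A hypothetical driver that needs nothing beyond register/unregister. */
static struct platform_driver demo_driver = { .name = "demo", .registered = 0 };

module_platform_driver_sketch(demo_driver)
```

Calling demo_driver_init() and demo_driver_exit() in this mock registers and unregisters the driver. Drivers that must do additional work at load time would presumably still need hand-written init functions.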

The merge window can be expected to run through approximately November 7. The latter part of the merge window will be summarized here in the November 10 Weekly Edition.


A btrfs update at LinuxCon Europe

By Jonathan Corbet
November 2, 2011

In October, the btrfs user community expressed concerns about the still missing-in-action filesystem checker and repair tool. At that time, btrfs creator Chris Mason said that he hoped to demonstrate a working checker during his LinuxCon Europe session. Your editor was there as part of a standing-room-only crowd ready to see the show; we did indeed get a demonstration, but it may not have been quite what some attendees expected.

Chris started by talking about btrfs and its goals in general; those have been well covered here and need not be repeated now. He reiterated Oracle's plan to use btrfs as the core filesystem for its RHEL-derivative Linux distribution; needless to say, supporting that role requires a rock-solid implementation. So a lot of work has been going into extensive testing of the filesystem and fixing bugs.

The 3.2 kernel release will see the results of that work; it will contain lots of fixes. There will also be significant improvements to the logging code. It turns out that a lot of data was being logged more than once, greatly increasing the amount of I/O required; that has now been fixed. I/O traffic for the log, it seems, has been cut to about 25% of its previous level.

For 3.3, the main improvement seems to be the use of larger blocks for nodes in the filesystem B-tree. Larger blocks can hold more data, of course, and, in particular, more metadata. That means that metadata that was previously scattered in the filesystem can be kept together with the relevant inode. That, in turn, leads to significant performance improvements for many filesystem operations.

Another near-term feature, due to arrive "right after fsck", is the merging of Dave Woodhouse's RAID5 and RAID6 implementations. That work was initially posted in 2009; Chris apologized for taking so long to get it merged. How this feature will actually be used still needs some thought; RAID5 or 6 is quite good for data, but it can be problematic for metadata, which tends to not fill anything close to a full RAID stripe and, thus, can lead to low I/O performance. Happily, btrfs has been designed from the beginning to keep data and metadata separate; that means that things can be set up where data is protected with full RAID while metadata is managed using simple mirroring.

Talk of protecting metadata leads naturally to the problem of recovering a filesystem when its metadata has been corrupted. That is what a filesystem checker program is for; btrfs, thus far, has been increasingly famous for its lack of a proper checker (and, more importantly, a proper filesystem repair tool). As of the LinuxCon talk, btrfs still does not have a real repair tool, but some progress has been made in that direction and a couple of other mechanisms have been provided.

The copy-on-write nature of btrfs implies that there will be numerous old copies of the filesystem metadata on the storage device at any given time. Any change, after all, will create a new copy, leaving the previous version in place until the block is reused. Chris observed that filesystem corruptions rarely affect that older metadata, so it makes sense to use it as a primary resource in the recovery of a corrupted disk. But, first, one needs to be able to find that older metadata.

To that end, btrfs maintains an array containing the block locations of many older versions of the filesystem root. The root block, he said, is more important than the superblock when it comes to recovering data. The root is replaced often as metadata changes percolate up to the top of the directory hierarchy, so the "old root blocks" array contains pointers to what is, in effect, a set of snapshots of the very recent state of the filesystem. Clearly, this will be a valuable resource should something go badly wrong.

One way of using that array is simply to mount the filesystem using an older version of the root. Chris demonstrated this feature by poking holes in a test filesystem, then mounting an older root to get back to where things had been before. For simple, quickly-detected problems, older root blocks should be a path toward a quick solution.
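The "old root blocks" idea can be pictured as a small ring of recent roots from which recovery picks the newest one that still checks out. The sketch below is only a model of that idea: the structure layout, the slot count, and all function names are assumptions for illustration and do not describe the btrfs on-disk format or its code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OLD_ROOT_SLOTS 4        /* hypothetical size, chosen for the example */

struct old_root {
    uint64_t generation;        /* commit that wrote this root; 0 = empty slot */
    uint64_t block;             /* location of the root block on the device */
};

struct root_history {
    struct old_root slots[OLD_ROOT_SLOTS];
    size_t next;                /* slot the next commit will overwrite */
};

/* Record a newly committed root, overwriting the oldest entry. */
static void record_root(struct root_history *h, uint64_t generation,
                        uint64_t block)
{
    assert(h != NULL && generation != 0);
    h->slots[h->next].generation = generation;
    h->slots[h->next].block = block;
    h->next = (h->next + 1) % OLD_ROOT_SLOTS;
}

/*
 * Return the newest recorded root whose block passes the caller-supplied
 * integrity check, or NULL if no recorded root is usable.
 */
static const struct old_root *
newest_usable_root(const struct root_history *h,
                   int (*is_intact)(uint64_t block))
{
    const struct old_root *best = NULL;
    size_t i;

    assert(h != NULL && is_intact != NULL);
    for (i = 0; i < OLD_ROOT_SLOTS; i++) {
        const struct old_root *r = &h->slots[i];

        if (r->generation == 0 || !is_intact(r->block))
            continue;
        if (best == NULL || r->generation > best->generation)
            best = r;
    }
    return best;
}

/* Example check for demonstration: pretend that block 300 is damaged. */
static int demo_is_intact(uint64_t block)
{
    return block != 300;
}
```

Because the ring is small and the root is replaced often, damage that goes unnoticed for a long time will have aged out of every slot, which matches the limitation the article goes on to describe.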

It is not too hard to imagine situations where this approach will not work, though. If a metadata block in a rarely-changed subtree is, say, zeroed by a hardware malfunction, it could go undetected for some time. By the time the user realizes that something is wrong, there may be no older hierarchy containing the information needed to put things back together. So other solutions will be necessary.

Obviously, one of those solutions will be the full filesystem checker and repair tool. That tool is still not ready, though. Getting a repair tool right is a hard problem; without a lot of care, a well-intentioned attempt to repair a filesystem can easily make it worse. Data that may have been recoverable before the repair attempt may no longer be so afterward. Even if a proper btrfsck were available today, it would probably be some years before it reflected enough experience to inspire confidence in users who are concerned about their data.

So it seems that something else is required. That "something else" turns out to be a data recovery tool written by Josef Bacik. This tool has a simple (to explain) job: dig through a corrupted filesystem in read-only mode and extract as much of the data as possible. Since it makes no changes, it cannot make things worse; it seems like a worthwhile tool to have around even if a full repair tool existed.

That tool, along with all the requisite filesystem support, is expected to be available in the 3.2 kernel time frame. Meanwhile, there is a new btrfs-progs repository that will include the recovery tool in the near future. All told, it may not be quite the btrfsck that some users were hoping for, but it should be enough to make those users feel a bit more confident about entrusting their data to a new filesystem. Judging from the size of the crowd at Chris's talk, there are a lot of people interested in doing exactly that.

[Your editor would like to thank the Linux Foundation for funding his travel to LinuxCon Europe.]


Frontswap gets broadsided

By Jonathan Corbet
November 2, 2011

"Frontswap" is the second half of Dan Magenheimer's transcendent memory concept; the first half ("cleancache") was merged for the 3.0 kernel. Given that the job was halfway done, one might be forgiven for thinking that getting frontswap merged would not be a big challenge, despite the fact that, like many memory-management patches, transcendent memory has had a long and somewhat rocky path into the mainline. Dan must have known better, though, as evidenced by his decision to copy your editor on the frontswap pull request, nicely providing a front-row seat to the 100+ messages that followed. Some version of this patch set may well make it into the mainline eventually, but it now seems quite unlikely to happen in the 3.2 cycle.

The core idea behind frontswap is to provide a less expensive alternative to pushing a page out to the swap device. That alternative could be one of a number of possibilities: storing the page (possibly compressed) in a memory pool shared between virtual machines, writing it to an SSD-based intermediate device, or adding a reference to a stored page with duplicate contents, for example. Frontswap is not required to accept a page handed to it, but, if it does accept that page, it must be able to reproduce it on demand in the future. The primary use case appears to be balancing memory use between Xen-based virtual machines, but others can be imagined.
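The contract described above (a backend may refuse a page, but must return any page it accepted) can be modeled with a toy in-memory pool. The function names, the pool size, and the use of a swap offset as the key are assumptions for this sketch; it is not the interface proposed in the frontswap patches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096
#define POOL_PAGES 2            /* deliberately tiny so rejection is easy to see */

struct pool_slot {
    int used;
    unsigned long offset;       /* key identifying the swapped page */
    unsigned char data[PAGE_SIZE_SKETCH];
};

static struct pool_slot pool[POOL_PAGES];

/*
 * Offer a page to the backend. Returns 0 if the page was accepted and
 * -1 if it was rejected, in which case the caller would fall back to
 * the real swap device.
 */
static int sketch_put_page(unsigned long offset, const void *page)
{
    struct pool_slot *slot = NULL;
    size_t i;

    assert(page != NULL);
    for (i = 0; i < POOL_PAGES; i++) {
        if (pool[i].used && pool[i].offset == offset) {
            slot = &pool[i];    /* replace an existing copy */
            break;
        }
        if (!pool[i].used && slot == NULL)
            slot = &pool[i];    /* remember the first free slot */
    }
    if (slot == NULL)
        return -1;              /* pool full: reject the page */

    slot->used = 1;
    slot->offset = offset;
    memcpy(slot->data, page, PAGE_SIZE_SKETCH);
    return 0;
}

/* Fetch a previously accepted page; returns 0 on success, -1 if absent. */
static int sketch_get_page(unsigned long offset, void *page)
{
    size_t i;

    assert(page != NULL);
    for (i = 0; i < POOL_PAGES; i++) {
        if (pool[i].used && pool[i].offset == offset) {
            memcpy(page, pool[i].data, PAGE_SIZE_SKETCH);
            return 0;
        }
    }
    return -1;
}

/* Drop a stored page once the caller no longer needs it. */
static void sketch_flush_page(unsigned long offset)
{
    size_t i;

    for (i = 0; i < POOL_PAGES; i++)
        if (pool[i].used && pool[i].offset == offset)
            pool[i].used = 0;
}
```

A real backend would replace the fixed array with something like a compressed pool or hypervisor-managed memory, but the accept-or-reject decision at put time and the guarantee at get time would stay the same.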

If one were to look at the initial response to the post, it would appear that there was a groundswell of support for these patches; several messages came in calling for their inclusion. Those messages, however, came from other people at Oracle (Dan's employer) or other large companies, and their authors are not normally known for their participation in conversations about memory management code, so they may have had something other than the intended effect. It looked a bit like an organized pressure campaign. When the core kernel developers started to respond, the tone of the conversation changed considerably.

There were a number of complaints raised. The frontswap patches were not going through the -mm tree, and they did not carry acks from any of the recognized memory management developers, so some people started to suspect that Dan was trying to circumvent the normal processes. There is also a fair amount of doubt about the utility of the patches and the way they operate; Christoph Hellwig, for example, described frontswap as "a bunch of really ugly hooks over core code, without a clear definition of how they work or a killer use case." Various core memory management developers, their attention drawn by the pull request, found a number of things not to like.

One other complaint raised was the lack of any sort of associated benchmarks. Frontswap is, in the end, a sort of performance-enhancing patch; such changes are normally expected to be accompanied by test results showing that performance is indeed enhanced for the target workloads. Equally important is showing that performance is not hurt on other workloads - always a big concern when making changes to memory management behavior. For this kind of change, it is important to show that there is no impact on systems where the new facility is not used at all; Dan has not yet done that.

Chances are good that satisfying benchmark results can be produced eventually, and that the technical objections that have been raised can be fixed. Even then, though, frontswap is unlikely to get an immediate green light for merging into the mainline, for a few reasons. One is that life is never easy for those making core memory management changes; experience has shown that it is far too easy to make mistakes that only show up many months later when somebody tries their important workload on a new kernel. Dan has complained about the "hazing" he has gone through, but he has had an easier time than some others.

That said, Dan's life has not been improved by the association of his work with Xen which, while being free software that is now mostly in the mainline, is still looked upon dimly by many developers. His interaction style also sometimes does not help. Finally, Dan has, by virtue of unfortunate timing that is not at all his fault, run into another problem that was best explained by Andrew Morton:

At kernel summit there was discussion and overall agreement that we've been paying insufficient attention to the big-picture "should we include this feature at all" issues. We resolved to look more intensely and critically at new features with a view to deciding whether their usefulness justified their maintenance burden. It seems that you're our crash-test dummy ;)

Dan had not explicitly volunteered for that role, but, then, few people (or dummies) ever do. But, at this point, the process will have to play out on those terms. Barring some sort of surprising executive decision by Linus, this particular discussion is unlikely to come to a resolution before the close of the 3.2 merge window.

Another idea discussed at the kernel summit was that code that is in active use, and that is shipped by distributors over the long haul, should probably find its way into the mainline eventually even if it is not entirely pleasing. Transcendent memory has been in the openSUSE kernel since 2009, and has been shipped by Oracle for some time as well. Clearly, some people see value in this work. Given time, patience, and a willingness to address technical issues, that should be sufficient to get this capability into the mainline eventually.


Patches and updates

Deepak Saxena: Linaro Kernel October 2011 Release

Marek Szyprowski: ARM: DMA-mapping framework redesign

Mahesh J Salgaonkar: [RFC PATCH v3 00/10] fadump: Firmware-assisted dump support for Powerpc.

Omar Ramirez Luna: OMAP: iommu: hwmod support and runtime PM

Peter De Schrijver: Add support for tegra30 and cardhu

David Brown: msm: msm8660 and msm8960 clock support

Stanislav Kinsbursky: [PATCH v7 0/7] SUNRPC: make rpcbind clients allocated and destroyed dynamically

Stanislav Kinsbursky: SUNRPC: initial part of making pipefs work in net ns

Deepthi Dharwar: [PATCH v9 0/4] cpuidle: Global registration of idle states with per-cpu statistics

Glauber Costa: per-cgroup /proc/stat

Tejun Heo: cgroup: stable threadgroup during attach & subsys methods consolidation

Steven Rostedt: [GIT PULL] ktest: lots of nice new features

Dmitry Monakhov: xfstests: Bunch of new stress tests -v3

Barry Song: dmaengine: add CSR SiRFprimaII DMAC driver

Rajendra Nayak: Device tree support for regulators

Rob Herring: net: add calxeda xgmac ethernet driver

Manjunath Hadli: RFC for Media Controller capture driver for DM365

Sylwester Nawrocki: Staging: Abilis Systems AS102 driver

Konrad Rzeszutek Wilk: TTM DMA pool v2.2 or [GIT PULL] (stable/ttm.dma_pool.v2.3) for 3.3

Ricardo Ribalda Delgado: Add spi support for CMA3000 driver

Lin Ming: ata port runtime power management support

Pavel Shilovsky: SMB2 protocol support for CIFS kernel module

Aditya Kali: Metadata Replication for Ext4

NeilBrown: hot-replace support for RAID4/5/6

NeilBrown: hot-replace support for RAID1 and RAID10

Zhu Yanhai: A readahead complete notify approach to implement buffer aio

Dan Magenheimer: mm: frontswap (for 3.2 window)

Jiri Pirko: net: introduce ethernet teaming device

Neil Horman: Introduce FCLONE_SCRATCH skbs to reduce stack memory useage and napi jitter

Kees Cook: security: Yama LSM

Anup Patel: (offtopic) Xvisor: eXtensible Versatile hypervISOR
