Kernel development

Brief items

Kernel release status

The 3.4 kernel is out, released on May 20. Significant features in this release include the Yama security module, support for the x32 ABI, asymmetric multiprocessing support, the dm-verity device mapper target, and more. For details, see the always-excellent KernelNewbies 3.4 page.

The 3.5 merge window is open as of this writing; see the separate article below for a summary of interesting changes merged so far.

Stable updates: 3.0.32, 3.2.18, and 3.3.7 were released on May 21, and 2.6.34.12 followed on May 22.


Quotes of the week

Enough data has come in to satisfy me that with all the improvements in Linux over the last year, and with BQL, codel and fq_codel, that we've won a major battle in the war against bufferbloat.
Dave Täht

	else if(!strcmp(str, "pony")) {
		tsc_clocksource_reliable = 1;
		sched_clock_stable = 1;
		tsc_perfect_smp_synchronization = 1;
	}
Steven Rostedt

	else if (!strcmp(str, "real")
	     panic("Can't handle real TSCs!\n");
Thomas Gleixner

Prediction: instead of Oracle coming out and admitting they were morons about their idiotic suit against Android, they'll come out posturing and talk about how they'll be vindicated, and pay lawyers to take it to the next level of idiocy.

Sometimes I really wish I wasn't always right. It's a curse, I tell you.

Linus Torvalds


Kernel development news

The 3.5 merge window opens

By Jonathan Corbet
May 23, 2012
Shortly after the release of the 3.4 kernel, Linus started the entire process all over again for the 3.5 development cycle; over 2,500 changesets were pulled into the mainline on the first day, and 4,600 have been merged as of this writing. It looks like it will be an interesting cycle with a lot of new stuff coming in and the removal of a bunch of old cruft. As of this writing, user-visible changes pulled for 3.5 include:

  • The TCP connection repair interface, useful for the implementation of checkpoint/restart functionality, has been merged.

  • The networking stack has gained support for RFC 5827 early retransmit, a mechanism aimed at speeding recovery from packet loss.

  • The CoDel queue management algorithm, which, hopefully, will be an important component in the solution to the bufferbloat problem, has been merged.

  • The seccomp filters mechanism has been merged; it allows processes to reduce the set of available system calls through the use of a mechanism based on the Berkeley packet filter. See Documentation/prctl/seccomp_filter.txt for details.

  • The Yama security module has gained two additional, increasingly restrictive modes for controlling access to the PTRACE_ATTACH functionality.

  • The logging reliability patch set has been merged.

  • The NUMA scheduler has been rewritten with the result that it will make different, hopefully better scheduling decisions. Also, as has been threatened for some months, the power-aware scheduling code has been removed in the hope that somebody will replace it with something that actually works.

  • A lot of code has been removed in this development cycle, including the ixp2000 Ethernet driver, support for the sun4c SPARC CPU, the ip_queue netfilter module (superseded by nfnetlink_queue), all support for token ring networking, drivers for all MCA-based network cards, support for the Econet protocol, support for ARMv3 processors, support for Intel IXP2xxx (XScale) processors, support for ST-Ericsson U5500 development boards, the Motorola 68360 serial port driver, and the workqueue tracer.

  • New drivers include:

    • Processors and systems: Blackfin BF609-based boards, and Renesas Armadillo-800 EVA and KZM-A9-GT boards.

    • Miscellaneous: TI TPS65090 power regulators, TI Palmas series power management chips, RICOH RC5T583 power regulators, Freescale MXS, IMX6Q, IMX53 and IMX51 IOMUX controllers, ST Microelectronics SPEAr3xx pin controllers, Renesas Emma Mobile SoC GPIO controllers and integrated serial ports, Intersil ISL29028 concurrent light and proximity sensors, TAOS TSL/TMD2x71 and TSL/TMD2x72 light and proximity sensors, Analog Devices AD8366 variable gain amplifiers, and Atmel AT91 analog to digital converters.

    • Network: WIZnet W5100 and W5300 adapters, Marvell Avastar 88W8797 wireless chipsets, Emulex One Connect InfiniBand-over-Ethernet controllers, and GCT Semiconductor GDM72xx WiMAX controllers.

    • USB: Marvell PXA USB OTG controllers, Broadcom BCMA and SSB host controllers, NXP ISP1301 USB transceivers, and NXP LPC32XX USB peripheral controllers. Also added is a "configurable composite gadget" driver that allows user-space configuration of enabled functions.

    • Staging graduations: the industrial I/O (IIO) core has moved into drivers/iio; VME drivers are now in drivers/vme, and the Intel management engine interface (MEI) driver is now in drivers/misc.
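The seccomp filter mechanism listed above can be sketched from user space. The following is a minimal illustration, not taken from the merged documentation: it installs a BPF program that makes one system call fail with EPERM while allowing all others. The helper names (install_deny_getppid_filter(), seccomp_demo(), SKETCH_ARCH) and the choice of getppid() as the denied call are assumptions made for the example; the prctl() operations and the seccomp_data layout are the real interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Architecture the filter expects; extend this list for other targets. */
#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#elif defined(__i386__)
#define SKETCH_ARCH AUDIT_ARCH_I386
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Install a filter that makes getppid() fail with EPERM and allows
 * every other system call. Returns 0 on success, -1 on failure. */
int install_deny_getppid_filter(void)
{
    struct sock_filter insns[] = {
        /* Load the architecture; kill the task if it is unexpected. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Load the system call number and compare it with getppid. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    assert(sizeof(insns) / sizeof(insns[0]) <= BPF_MAXINSNS);

    /* Unprivileged tasks must give up privilege escalation first. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}

/* Fork a child, install the filter there, and report what the child saw:
 * 0 = getppid() was denied with EPERM, 1 = it was not denied,
 * 2 = the filter could not be installed or the child died. */
int seccomp_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return 2;
    if (pid == 0) {
        if (install_deny_getppid_filter() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}
```

The filter is installed in a child process because it cannot be removed once it is in place; a caller would check that seccomp_demo() returns 0 on a kernel with seccomp filter support. The architecture check comes first because system call numbers only have meaning relative to a particular ABI.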

Changes visible to kernel developers include:

  • The many variants of the NLA_PUT() macro used with netlink have been removed. Code should use one of the nla_put() versions instead and make its error handling explicit.

  • The mac80211 layer has gained support for MBSS mesh synchronization.

  • There is new core support for the writing of near-field communication (NFC) drivers using the HCI specification; see Documentation/nfc/nfc-hci.txt for details.

  • The "regmap" subsystem, which centralizes the handling of banks of device registers, now has support for registers in I/O memory.

  • The pin control subsystem now has full device tree support.

  • The Android "switch" class has been brought into the mainline and extended into a general "external connector" framework.

  • The "ramoops" mechanism has been reworked to use the pstore interface.
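The reasoning behind the NLA_PUT() removal noted above is that the macros hid a jump to a failure label inside what looked like an ordinary call. A user-space mock can show the difference; everything here (struct mock_msg, mock_put(), MOCK_PUT(), and the two fill functions) is a hypothetical stand-in and not the kernel's netlink API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a netlink message buffer (not struct sk_buff). */
struct mock_msg {
    unsigned char data[16];
    size_t used;
};

/* Mock of an nla_put()-style helper: append len bytes, or fail with
 * -EMSGSIZE when the remaining space is too small. */
int mock_put(struct mock_msg *msg, const void *src, size_t len)
{
    assert(msg != NULL && src != NULL);
    if (len > sizeof(msg->data) - msg->used)
        return -EMSGSIZE;
    memcpy(msg->data + msg->used, src, len);
    msg->used += len;
    return 0;
}

/* Old style: the macro hides a jump to a label the caller must provide. */
#define MOCK_PUT(msg, src, len)                     \
    do {                                            \
        if (mock_put((msg), (src), (len)) < 0)      \
            goto mock_put_failure;                  \
    } while (0)

int fill_old_style(struct mock_msg *msg, const void *a, size_t alen,
                   const void *b, size_t blen)
{
    MOCK_PUT(msg, a, alen);   /* may leave the function via goto */
    MOCK_PUT(msg, b, blen);
    return 0;

mock_put_failure:
    return -EMSGSIZE;
}

/* New style: every failure path is visible at the call site. */
int fill_new_style(struct mock_msg *msg, const void *a, size_t alen,
                   const void *b, size_t blen)
{
    if (mock_put(msg, a, alen) < 0 ||
        mock_put(msg, b, blen) < 0)
        return -EMSGSIZE;
    return 0;
}
```

Both functions behave identically; the difference is only in what a reader can see. In the second form, the test and the resulting control flow are spelled out where the call is made.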

If the usual schedule holds, this merge window can be expected to close around June 4. Watch this space next week for coverage of the next sets of patches to be pulled into the mainline for the 3.5 development cycle.


Preparing for nonvolatile RAM

By Jonathan Corbet
May 23, 2012
Once upon a time, your editor had a job that involved working with a Data General Nova system. The Nova had an interesting characteristic: since it contained true core memory, the contents of that memory would persist across a reboot—or a power-down. So the end-of-day shutdown procedure was a simple matter of turning the machine off; when it was turned on the next morning, it would simply continue where it was before. There were no complaints about system boot time with that machine. The replacement of core memory with silicon-based RAM brought a lot of nice advantages, but the nonvolatile nature was lost on the way. But it appears that nonvolatile memory may be about to make a comeback, bringing some interesting development problems with it.

Matthew Wilcox raised the issue, noting that nonvolatile memory (NVM) is coming, that it promises bandwidth and latency numbers similar to those offered by dynamic RAM, and that, being cheaper than DRAM, it is likely to be offered in larger sizes than DRAM is. He later disclaimed any resemblance between this description and any future products to be offered by his employer; it is, he says, simply where the industry is going. Given that, it would be a good idea for the kernel community to be ready for this technology when it arrives.

One part of being ready is figuring out how to deal with nonvolatile memory within the kernel. The suggested approach was to use a filesystem:

We think the appropriate way to present directly addressable NVM to in-kernel users is through a filesystem. Different technologies may want to use different filesystems, or maybe some forms of directly addressable NVM will want to use the same filesystem as each other.

A filesystem approach would allow the association of names with regions of NVM space; an API was then proposed to allow the kernel to perform tasks like mapping regions of NVM into the kernel's address space.

One question that came up quickly was: won't the use of the filesystem model slow things down? There is a lot of overhead in the block layer, which was not designed to deal with "storage" that operates at full memory bandwidth. Matthew was never thinking of bringing in the full block layer, though; instead, he said: "I'm hoping that filesystem developers will indicate enthusiasm for moving to new APIs." Such enthusiasm was in short supply in this discussion; that is probably more indicative of a lack of thought about the problem than any sort of active opposition (which was also in short supply).

James Bottomley, though, questioned the filesystem idea, suggesting that NVM should be treated like memory. He said that the way to access NVM might be through the kernel's normal memory APIs, with nonvolatility just being another attribute of interest. One could imagine calling kmalloc() with a new GFP_NONVOLATILE flag, for example. The only problem with this approach is that it is not enough to request an arbitrary nonvolatile region; callers will usually want a specific NVM region that, presumably, contains data from a previous use. So the memory API would have to be extended with some sort of namespace giving reliable access to persistent data. To many, that namespace looks like a filesystem; James suggested using 32-bit keys like the SYSV shared memory mechanism does, but admirers of SYSV IPC tend to be scarce indeed on linux-kernel.
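The point that callers need a specific, named region rather than an arbitrary allocation can be illustrated with a user-space analogy. In this sketch a regular file stands in for a region of NVM and its path stands in for the name; region_map() and region_unmap() are hypothetical helpers, and nothing here represents the in-kernel API proposed in the discussion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the region called `path`, creating it with `size` bytes if it does
 * not exist yet. Returns NULL on failure. */
void *region_map(const char *path, size_t size)
{
    assert(path != NULL && size > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Flush and drop the mapping; the name and its contents remain. */
int region_unmap(void *p, size_t size)
{
    assert(p != NULL);
    if (msync(p, size, MS_SYNC) != 0)
        return -1;
    return munmap(p, size);
}
```

A caller that maps a name, writes to the region, unmaps it, and later maps the same name again finds its data where it left it. That lookup by name is the property a bare allocation flag such as the imagined GFP_NONVOLATILE cannot provide on its own.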

So, while there are a lot of details to be worked out, some sort of name-based kernel API seems certain to come about. Then there will be a mechanism, either through the memory-related or filesystem-related system calls, to make NVM available to user space. But that leads to another, perhaps harder question: what, then, do we do with all that fast, nonvolatile memory?

Some of it, certainly, could be used for caching; technologies like bcache could make good use of it. The page cache could go there; Matthew suggested that the inode cache might be another possibility. Both could speed booting considerably, though it would be necessary to somehow validate the cache contents against filesystems that could have changed while the system was down. Boaz Harrosh suggested that filesystems could store their journals in that space, speeding journal access and reducing journal I/O load on the underlying storage devices. He also mentioned checkpointing the system to NVM, allowing for quick recovery should the system go down unexpectedly. Vyacheslav Dubeyko had some wilder ideas about how NVM could eliminate system bootstrap entirely and make the concept of filesystems obsolete; instead, everything would just live in a persistent object environment.

Clearly, many of these ideas are going to take some time to come to fruition. Nonvolatile memory changes things in fundamental ways; Linux may have to scramble to keep up, but, then, that is a high-quality problem to have. It will be most interesting to watch how this plays out over the coming years.


Removing four bytes from the kernel ABI

By Jake Edge
May 23, 2012

Four bytes may not seem like a lot of space—typically it isn't—but when that space is wasted millions of times, it starts to add up. In addition, if the extra space has become part of the kernel ABI (intentionally or not), it will be difficult to remove it. That particular problem came up again in a recent linux-kernel discussion regarding the trace event header.

Just over a year ago, we looked at the unused lock_depth field in event headers. Frederic Weisbecker had added the field temporarily to assist in removal of the big kernel lock (BKL), and once the BKL was gone, Steven Rostedt removed those four now-useless bytes from the header. Unfortunately, in the interim, PowerTOP had started accessing events in the perf ring buffer, so removing lock_depth broke PowerTOP. That field wasn't actually used by PowerTOP, but the tool expected the header to have a particular size, which changed after Rostedt removed the wasted space.

That led to a reversion of the removal, which means that every event recorded by ftrace or perf has added overhead. The event format is fully self-describing, however, so there is no need for utilities like PowerTOP to grub around in the binary data making assumptions about what the format is. It was, however, easier to read the data directly rather than parse the format description, which is why PowerTOP did so. Rostedt has created a library to parse trace events using the format data that the kernel provides to avoid that situation in the future. That library was picked up by the recently released PowerTOP 2.0, so Rostedt posted an RFC asking when the lock_depth field—renamed to padding as part of the revert—could be removed.
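The difference between assuming a header size and consulting the format description can be sketched as follows. The find_field() helper and the two sample descriptions are illustrative assumptions: the text is modeled on the per-event format descriptions the kernel exports, the "state" field is invented, and this is not the interface of Rostedt's library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative descriptions of one event, with and without the padding. */
const char *const format_with_padding =
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;\n"
    "\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\tfield:int common_padding;\toffset:8;\tsize:4;\tsigned:1;\n"
    "\n"
    "\tfield:int state;\toffset:12;\tsize:4;\tsigned:1;\n";

const char *const format_without_padding =
    "\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;\n"
    "\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;\n"
    "\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;\n"
    "\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;\n"
    "\n"
    "\tfield:int state;\toffset:8;\tsize:4;\tsigned:1;\n";

/* Look up `name` in a format description and report its offset and size.
 * Handles plain scalar fields only. Returns 0 on success, -1 otherwise. */
int find_field(const char *format, const char *name, int *offset, int *size)
{
    assert(format && name && offset && size);

    const char *line = format;
    size_t name_len = strlen(name);

    while ((line = strstr(line, "field:")) != NULL) {
        const char *semi = strchr(line, ';');
        if (semi == NULL)
            return -1;
        /* The field name is the last word before the first ';'. */
        if ((size_t)(semi - line) > name_len &&
            strncmp(semi - name_len, name, name_len) == 0 &&
            *(semi - name_len - 1) == ' ') {
            if (sscanf(semi, "; offset:%d; size:%d;", offset, size) == 2)
                return 0;
            return -1;
        }
        line = semi;
    }
    return -1;
}
```

A tool written this way finds "state" at offset 12 in the first description and at offset 8 in the second without any change to its code, whereas a tool that hard-codes a 12-byte header reads the wrong bytes as soon as the padding goes away.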

Linus Torvalds was not particularly concerned about the wasted space, but did want to understand which distributions were picking up the new PowerTOP. It turns out that the version in Fedora 14 (which Torvalds said he still uses sometimes) is old enough that it doesn't use perf events at all, so it is unaffected. More recent Fedoras (16, 17) are using PowerTOP 1.98, which won't work with kernels built without the padding.

The assumption in the thread is that distributions will be picking up PowerTOP 2.0 for releases coming later in the year, but that still leaves users who build their own kernels on existing distributions in a bit of a bind if the padding is removed. Existing distributions also have various lifespans, and some will not be picking up the latest PowerTOP at all. Rostedt asked how long the kernel needed to support older distributions. PowerTOP, it seems, is in a different category from other applications because it is a developer-oriented tool. So Torvalds was willing to see the kernel change even if some distributions get left behind:

But breaking something like a F14-15 timeframe distro or something staid like a SLES (or "Debian Stale" or whatever they call that thing that only takes crazy-old binaries)? It's fine. We don't want to *rush* into it, but no, if those distros are basically not updating, we can't care about them forever for something like powertop.

Things that break *normal* applications are different. There the rule really must be "never".

Arjan van de Ven concurred, pointing to 3.6 as a potential time frame to remove the padding, noting that those who haven't updated their distribution to get the newer PowerTOP are unlikely to be updating their kernel either. Rostedt said he will queue the patch up for 3.6 or 3.7.

While the four bytes seem unimportant to both Torvalds and Ingo Molnar, Rostedt pointed out that it is a frequent problem for tracing users. Beyond that, though, he disagrees with Molnar's contention that the wasted space is merely a "cosmetic detail":

4 bytes is not cosmetic for a 32 byte event. That's 1/8th overhead. If we could get rid of 4 bytes from struct page, would we do that? It's only just 4 bytes for [every] 4096 bytes. Just a 1/1024 overhead. Of course perf events are much bigger than 32 bytes. It's one of the biggest complaints I hear about perf, the size of its events. We should be trying hard to fix that.

For memory-constrained situations, for example on embedded devices or for users trying to squeeze every process they can onto their systems, reducing the overhead of events can make a difference. By capturing more events in the same amount of memory, there is a better chance of finding the problem that tracing was enabled for. When the issue came up a year ago, David Sharp of Google noted that the size of events was a big problem for the search giant. Others undoubtedly face similar challenges.
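The effect on capture capacity is simple arithmetic, sketched here under the assumption of fixed-size events and a buffer with no per-page bookkeeping (real ring buffers have some). The function name events_that_fit() is made up for the illustration, and the 32-byte event size is the one used in Rostedt's example.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fixed-size events that fit in a buffer, ignoring any
 * per-page or per-buffer bookkeeping (a simplifying assumption). */
size_t events_that_fit(size_t buffer_bytes, size_t event_bytes)
{
    assert(event_bytes > 0);
    return buffer_bytes / event_bytes;
}
```

With a 1MiB buffer, events_that_fit(1048576, 32) gives 32768 events, while dropping the four bytes gives events_that_fit(1048576, 28), or 37449 events. That is roughly 14% more history in the same memory.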

While the format of the perf ring buffer data may soon be a solved problem—though it's possible, if unlikely, that other tools are manually pulling data from the ring buffer—tracepoints as a whole are still an unresolved ABI issue. Right now, much of the work is in adding new tracepoints, but some day one or more of those may need to come out or be modified. If tools are dependent on specific tracepoints providing the exact same information in just the right place in the code, changing those will be a real problem. And it will be one that is difficult for a library to paper over.


Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds