Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc7, released on November 25. "A week ago, I had even considered skipping -rc7 entirely as things had been so calm, but decided that there was little reason to hurry the release. And oh, how sadly right I was." This could prove to be the last one, though, if the next week of testing goes well.

Stable updates: 3.0.53, 3.4.20 and 3.6.8 were released on November 26.

Quotes of the week

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I should apply, and which I should revert, if any.

— Linus Torvalds's new decision-making process

And yes, that is the thing about "fairness" -- there are a great many definitions, many of the most useful of which appear to many to be patently unfair.

— Paul McKenney

This isn't the message that's gone over, and even for device drivers everyone seems to be taking the whole device tree thing as a move to pull all data out of the kernel. In some cases there are some real practical advantages to doing this but a lot of the people making these changes seem to view having things in DT as a goal in itself.

— Mark Brown

My spinning head fell on the floor and is now drilling its way to China.

— Andrew Morton

Kernel development news

Uninitialized blocks and unexpected flags

By Jonathan Corbet
November 28, 2012

One often-heard complaint in the early BitKeeper era was that, by letting code reach the mainline without going via a mailing list, BitKeeper made it easy for maintainers to slip surprise changes in underneath the review radar. Those worries have mostly proved unfounded; when surprises have happened, the response from the community has usually helped to ensure that there would be no repeats. But some developers are charging that the 3.7 kernel contains exactly this type of stealth change and are demanding that it be reverted.

Background

The fallocate() system call is meant to be a way for an application to request the efficient allocation of blocks for a file. Use of fallocate() allows a process to verify that the required disk space is available, helps the filesystem to allocate all of the space in a single, contiguous group, and avoids the overhead that block-by-block allocation would incur. In the absence of an fallocate() implementation (each filesystem must implement it independently), the C library's posix_fallocate() wrapper will emulate it by simply writing zeroes to the requested block range; that gets the space allocated, but is less efficient than one would like. The implementation of fallocate() within filesystems tries to be more efficient than that; one way to do so is to avoid the process of writing zeroes to the newly-allocated blocks.
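As a rough illustration of the preallocation pattern described above, here is a minimal Python sketch. The helper name preallocate() is made up for this example; os.posix_fallocate() is the standard-library wrapper around the C library call, and the fallback branch only imitates the "write zeroes" emulation for a newly created file. It says nothing about how any particular filesystem implements the request.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> int:
    """Reserve `size` bytes for a new file and return the resulting size.

    `preallocate` is a hypothetical helper written for this example.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if hasattr(os, "posix_fallocate"):
            # Ask the C library (and, through it, the filesystem) to
            # allocate the whole range in a single request.
            os.posix_fallocate(fd, 0, size)
        else:
            # Fallback in the spirit of the emulation: write zeroes over
            # the range so that the blocks get allocated the slow way.
            chunk = b"\0" * 65536
            remaining = size
            while remaining > 0:
                remaining -= os.write(fd, chunk[:remaining])
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "prealloc.dat")
        print(preallocate(target, 1 << 20))
```

An application would typically make this call once, before writing, so that a later write cannot fail for lack of space.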

Leaving stale data in allocated blocks has obvious security implications: a hostile application could read those blocks in the hopes of finding confidential documents, passwords, or the missing Fedora 18 Beta release announcement. To avoid this exposure, filesystems like ext4 will mark unwritten blocks as being uninitialized; any attempt to read those blocks will be intercepted and just return zeroes. In the normal case, the application will write data to those blocks before ever trying to read them; writing obviously initializes the blocks without the need to write zeroes first. This implementation seems like it should be about optimal.

Except that, seemingly, ext4 marks uninitialized blocks at the extent (group of contiguous blocks) level. So, if an application writes to one uninitialized block, the containing extent must be split and the newly-written block(s) added to the previous extent, if possible. That turns out to be more expensive than some users would like. So a shortcut was attempted.
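The cost of that splitting can be seen in a toy model. The sketch below is not ext4's data structure or algorithm; Extent, write_block(), and read_block() are hypothetical names, and the model assumes a sorted list of non-overlapping extents that each carry a single "initialized" bit. It shows only the behavior described above: reads of uninitialized blocks return zeroes, and a one-block write can turn one extent into as many as three.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    start: int         # first block number covered by the extent
    length: int        # number of contiguous blocks
    initialized: bool  # False means reads must return zeroes


def write_block(extents: List[Extent], block: int) -> List[Extent]:
    """Return a new extent list after `block` has been written."""
    result: List[Extent] = []
    for ext in extents:
        end = ext.start + ext.length
        if ext.initialized or not (ext.start <= block < end):
            result.append(Extent(ext.start, ext.length, ext.initialized))
            continue
        # Split the uninitialized extent around the written block.
        if block > ext.start:
            result.append(Extent(ext.start, block - ext.start, False))
        result.append(Extent(block, 1, True))
        if block + 1 < end:
            result.append(Extent(block + 1, end - block - 1, False))
    # Merge neighbours that are adjacent and share the same state.
    merged: List[Extent] = []
    for ext in result:
        prev = merged[-1] if merged else None
        if (prev is not None and prev.initialized == ext.initialized
                and prev.start + prev.length == ext.start):
            prev.length += ext.length
        else:
            merged.append(ext)
    return merged


def read_block(extents: List[Extent], block: int, on_disk: bytes) -> bytes:
    """Return what a reader sees: zeroes unless the block is initialized."""
    for ext in extents:
        if ext.start <= block < ext.start + ext.length:
            return on_disk if ext.initialized else bytes(len(on_disk))
    raise ValueError("block is not allocated")


if __name__ == "__main__":
    layout = [Extent(0, 4, True), Extent(4, 8, False)]
    print(write_block(layout, 4))  # written block joins the previous extent
    print(write_block(layout, 8))  # a write in the middle forces a split
```

In this model a write at the start of the uninitialized range is absorbed by the preceding extent, while a write in the middle leaves four extents where there were two.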

That shortcut first appeared in April, 2012, in the form of a new fallocate() flag called FALLOC_FL_NO_HIDE_STALE. If fallocate() was called with that flag, the newly-allocated blocks would be marked as being initialized even though the old data remained untouched. That obviously brings the old security issues back; to mitigate the problem, the patch added a mount option making the new functionality available only to members of a specific group. That was deemed to be enough, especially for settings where access to the machine as a whole is tightly controlled.

At least, the authors and supporters of the patch deemed the group check to be enough. The patch was roundly criticized by other filesystem developers; the prevailing opinion appeared to be that it was trying to open up a huge security hole in order to avoid fixing an ext4 performance problem. After that discussion, the patch went away and wasn't heard from again.

A surprise flag

At least, it was not heard from until recently, when some filesystem developers were surprised to discover this commit by Ted Ts'o which found its way into the mainline (via the ext4 tree) during the 3.7 merge window. The patch is small and simple; it simply defines the FALLOC_FL_NO_HIDE_STALE flag, but adds no code to actually implement it. The changelog reads:

As discussed at the Plumber's Conference, reserve the bit 0x04 in fallocate() to prevent collisions with a commonly used out-of-tree patch which implements the no-hide-stale feature.

Filesystem developer Dave Chinner, at least, does not recall this discussion. His response was to post a patch reverting the change, saying:

The lack of formal review and discussion for a syscall API change is grounds for reverting patch, especially given the controversial nature of the feature and the previous discussions and NAKs. The way the change was pushed into mainline borders on an abuse of the trust we place in maintainers and hence as a matter of principle this change should be reverted.

It is true that this particular change is a bit abnormal. It changes the core filesystem code but came by way of a filesystem-specific tree with no acks from any other developers. The patch does not appear to have been posted to any relevant mailing list, violating the rule that all patches should go through public review before being pushed toward the mainline. The addition of a flag with no in-kernel users is also contrary to usual kernel practice. It is, in summary, the sort of change that less well-established kernel developers would never get away with making. So it is hard to fault other filesystem developers for being surprised and unhappy.

On the other hand, the change just adds a flag definition; it obviously cannot cause problems for existing code. And there does appear to be a real user community for this feature. Ted justified his action this way:

It doesn't change the interface or break anything; it just reserves a bit so that out-of-tree patches don't collide with future allocations. There are significant usages of this bit within Google and Tao Bao. It is true that there has been significant pushback about adding this functionality on linux-fsdevel; I find it personally frustrating that in effect, if enough people scream, they can veto an optional feature that might only be implemented by a single file system.

This explanation does not appear to have satisfied anybody, though. So we have an impasse of sorts; some developers want a flag to control a functionality they need, while others see it as a security problem and the result of an abuse of the kernel's trust system.
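The collision concern behind the reservation can be sketched in a few lines of Python. The values 0x01 and 0x02 are those of the two fallocate() flags defined in the kernel's falloc.h header at the time, and 0x04 is the bit named in the changelog; lowest_free_bit() is a hypothetical helper, and the assumption that a future flag would simply take the lowest unused bit is a convention, not a rule.

```python
from typing import Dict

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_NO_HIDE_STALE = 0x04  # the bit reserved by the disputed commit


def lowest_free_bit(flags: Dict[str, int]) -> int:
    """Return the lowest single-bit value not used by any flag in `flags`."""
    used = 0
    for value in flags.values():
        used |= value
    bit = 0x01
    while used & bit:
        bit <<= 1
    return bit


without_reservation = {
    "FALLOC_FL_KEEP_SIZE": FALLOC_FL_KEEP_SIZE,
    "FALLOC_FL_PUNCH_HOLE": FALLOC_FL_PUNCH_HOLE,
}
with_reservation = dict(
    without_reservation, FALLOC_FL_NO_HIDE_STALE=FALLOC_FL_NO_HIDE_STALE
)

if __name__ == "__main__":
    # Without the reservation, the next flag would land on the bit that
    # the out-of-tree patch already uses.
    print(hex(lowest_free_bit(without_reservation)))  # 0x4
    print(hex(lowest_free_bit(with_reservation)))     # 0x8
```

The same arithmetic explains the objection: once the bit is defined in a released kernel header, it is effectively part of the system call's interface whether or not any in-tree code implements it.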

Alan Cox suggested that it would be possible to, instead, reserve a set of filesystem-private flags that could be used for any purpose by any filesystem. Dave pointed out, however, that a flag bit that behaved differently from one filesystem to the next is a recipe for trouble. His suggestion, instead, is that this functionality should be implemented via the ioctl() interface, which is where filesystem-specific options usually hide. The ioctl() approach seems like it should be workable, but no patches to that effect have been posted thus far.

As of this writing, Linus has not accepted the revert, so the FALLOC_FL_NO_HIDE_STALE flag can still be found in the 3.7 kernel. He has also remained silent in the discussion. He will have to make a decision one way or the other, though, before the final 3.7 release is made. Once that flag is made available in a stable mainline release, it will be much harder to get rid of, so, if that flag is going to come out, it needs to happen soon.

The return of loadable security modules?

By Jake Edge
November 28, 2012

The idea behind the Linux Security Module (LSM) interface was initially discussed as part of the "NSA Linux" session at the first Kernel Summit back in 2001. The intent was to avoid wiring a particular security solution into the kernel; instead, multiple approaches to security could be built on top of a common kernel API. Originally, as the name implies, the solutions were built as loadable kernel modules, but eventually the "M" in LSM became just a historical artifact as the API was no longer exported to modules (essentially requiring security "modules" to be statically linked into the kernel). But it's possible that may all change again with a recent patch to bring back loadable LSMs.

Some history

A bit of history is probably in order. The LSM API came about specifically because Linus Torvalds didn't want to have to choose between a number of competing access control mechanisms for the kernel. Instead, LSM would provide a way for any of those mechanisms to hook into the kernel and deny access to various kinds of resources (files, devices, tasks, inodes, etc.) based on the security model being implemented. Initially, the LSMs would be implemented as kernel modules that could be loaded at runtime and, in some cases, unloaded.

The LSM interface was released as part of the 2.5 development kernel series in 2002, and was part of the first 2.6 release in December 2003. For several years after that, there was only one in-tree user of the interface: SELinux. That led to a 2005 suggestion to remove the LSM API entirely, effectively just calling SELinux directly. That would turn SELinux into the "one true security solution" for Linux. In 2006, James Morris proposed a patch to move LSM to the "feature removal" list, scheduled for the 2.6.18 kernel, which was roughly two months out at that point.

But, along came Smack, which implemented a "simplified" Mandatory Access Control (MAC) scheme for the kernel. It also used the LSM interface, so, to a certain extent, the decision on whether to merge it hinged on the future of LSM. In October 2007, Torvalds clearly stated his intention to keep LSM in the kernel, thus paving the way for Smack to be merged.

At more or less the same time Smack was merged, another change to LSM was made. First discussed in mid-2007, Torvalds merged a patch for the 2.6.24 kernel that switched LSM to a static interface so that security "modules" needed to be built into the kernel. One could still choose which security module to use with kernel command-line parameters, but dynamic security module loading would no longer be allowed.

There were a number of reasons behind the switch. For one thing, unloading modules was always messy (or impossible), partly because keeping a coherent security state through that process is difficult. In addition, the LSM API is very intrusive, allowing modules to hook nearly any kernel operation, which can be (and was) abused. While the LSM symbols were exported as GPL-only, that didn't stop some proprietary modules from abusing the interface. There were also free software modules that used the interface for non-security purposes (e.g. the realtime "security" module). Those kinds of problems could also be used as arguments against having the LSM API at all, but since Torvalds had already put his foot down on that particular question, removing the ability to load LSMs was seen as a reasonable alternative.

At the time that Torvalds merged the patch that made that switch, he asked for "real world" users of loadable security modules to step forward. There were a few examples of out-of-tree LSMs that were loadable (and, possibly, unloadable), but none that actually seemed to require that ability. The main users of the feature were LSM developers, who might routinely load and unload their LSM during development.

The next few years saw the merging of Smack (2.6.25), TOMOYO (2.6.30), and AppArmor (2.6.36). The latter had been long out of tree; its existence was part of the reason that the LSM interface came about in the first place. There have also been periodic attempts to get smaller, single-purpose security changes into the kernel over the years, but those were always pointed to the LSM interface. There is a problem with that particular suggestion, though, as only one LSM can be active at a time. Most distributions already have their one security module "slot" filled up. Red Hat and Fedora use SELinux, Ubuntu uses AppArmor, while SUSE and openSUSE have both AppArmor and SELinux available. Adding a specialized LSM for additional security protections is generally not possible without removing or disabling the distribution-supplied security solution.

Proposed LSM changes

That "one LSM at a time" problem has led to persistent (if intermittent) calls for ways to stack or chain LSMs. Smack developer Casey Schaufler is the most recent to propose a stacking solution. His patch set has been steadily reviewed on the LSM mailing list since it was first posted in September; it is now up to version 8. That particular version came with an interesting caveat:

I have not tried to reintroduce LSMs as loadable modules, in spite of the vigor with which it has been requested. I see that as work for another day, and a [separate] battle to fight.

Those requests came from the developer of the TOMOYO LSM, Tetsuo Handa. In earlier discussion of Schaufler's stacking patch, Handa advocated a return to allowing loadable LSMs. In fact, he went further than that, proposing a set of patches that would restore the ability to load LSMs as well as converting TOMOYO to use that feature.

Handa lists three reasons for making the change. To start with, any distribution that wants to allow its users to experiment with different LSMs must build all of those LSMs statically into the kernel. That will not only increase the size of the kernel, it will also increase the time it takes to load and boot the kernel. Most of that space (and time) would be completely wasted even for the users who are experimenting. All of that makes it less likely that distributions will actually build kernels that way.

Beyond that, though, many distributions have their preferred LSM, so they don't build extra LSMs into their kernels. That leaves users to build their own kernels, which is generally unacceptable, particularly in enterprise settings. But even if there are other LSMs built into the kernel, it takes a reboot to enable them. Handa notes that he uses a loadable kernel module that implements TOMOYO (called AKARI) to diagnose problems in enterprise systems. In order to access the LSM symbols (which are no longer exported), AKARI must do some kind of runtime address resolution, perhaps using /proc/kallsyms or System.map. But, AKARI is something he can load into running systems when needed—unlike regular LSMs.
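How AKARI actually resolves its addresses is not described here, but the general idea of looking up a symbol in a /proc/kallsyms or System.map listing is simple enough to sketch. Both files use lines of the form "address type name", with an optional module name in brackets in /proc/kallsyms. The sample text and the symbol names in it are invented for this example, and parse_symbols() is a hypothetical helper; a real module would have to do the equivalent work inside the kernel.

```python
from typing import Dict

# Invented sample in the "address type name [module]" format.
SAMPLE_KALLSYMS = """\
ffffffff81000000 T _text
ffffffff811a2b30 T hypothetical_lsm_hook
ffffffff81c3d000 D hypothetical_security_ops
ffffffffa0001000 t helper_fn\t[some_module]
"""


def parse_symbols(text: str) -> Dict[str, int]:
    """Map each symbol name to its address.

    If a name appears more than once, the last occurrence wins; a real
    lookup would need to handle duplicate static symbols more carefully.
    """
    table: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        address, _kind, name = fields[0], fields[1], fields[2]
        table[name] = int(address, 16)
    return table


if __name__ == "__main__":
    symbols = parse_symbols(SAMPLE_KALLSYMS)
    print(hex(symbols["hypothetical_security_ops"]))
```

Note that on many systems unprivileged readers of /proc/kallsyms see zeroed addresses, so this sort of lookup is only useful with sufficient privilege.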

One could argue that Handa's use of an LSM for system troubleshooting is a misuse of the interface, but the fact remains that changing LSMs currently requires a reboot. That problem potentially becomes more acute if LSM stacking is merged. One must decide pre-boot which LSMs to enable (and in what order they are consulted). Whatever else can be said, disallowing LSM loading reduces flexibility.
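Why the order of consultation matters can be shown with a small model of stacked modules. This is not Schaufler's patch set or the kernel's hook interface; check_access(), the two policy functions, and the rule that access is granted only if every module allows it are all assumptions made for the sake of the example.

```python
from typing import Callable, List, Tuple

# A hook receives (subject, object, action) and returns True to allow.
Hook = Callable[[str, str, str], bool]


def label_policy(subject: str, obj: str, action: str) -> bool:
    """Hypothetical module: only 'admin' may write anything."""
    return action != "write" or subject == "admin"


def path_policy(subject: str, obj: str, action: str) -> bool:
    """Hypothetical module: nothing under /secret may be touched."""
    return not obj.startswith("/secret/")


def check_access(stack: List[Tuple[str, Hook]], subject: str, obj: str,
                 action: str) -> Tuple[bool, str]:
    """Consult each module in order; the first denial ends the walk.

    Returns (allowed, name of the denying module or "").
    """
    for name, hook in stack:
        if not hook(subject, obj, action):
            return False, name
    return True, ""


if __name__ == "__main__":
    stack = [("label", label_policy), ("path", path_policy)]
    print(check_access(stack, "guest", "/secret/key", "write"))
    print(check_access(list(reversed(stack)), "guest", "/secret/key", "write"))
```

In this model the final decision is the same for either ordering, but the module that reports the denial (and therefore the one that would log it) differs, which is one reason the stacking order has to be fixed somewhere.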

Handa's third reason is a bit more philosophical: "LSM is not the tool for thought control." Essentially, he argues, disallowing LSM loading just makes dealing with LSMs harder for both users and developers. It also means that the more "minor" LSMs (e.g. TOMOYO and Smack) get less exposure because fewer users can actually try them.

While there have been no comments on Handa's patches as yet, there have been expressions of support for loadable LSMs by some. Schaufler, for example, does not necessarily seem opposed. Kees Cook agreed with the need for loadable LSMs, though he was concerned that combining it with the LSM stacking patches would potentially block the progress for stacking. Morris, who authored the original patch to block loadable LSMs, has not yet spoken up one way or the other.

Taking away the ability to load LSMs did not really change the picture for the kinds of abuses that were brought up at the time the change was made. Kernel modules can still abuse the interface, though it may take a bit more work. If binary modules were willing to ignore the GPL-only export of the LSM interface, they are probably willing to ferret out the addresses they need instead. Open source modules can do much the same. At the time of the switch to a static interface back in 2007, Torvalds seemed very open to reverting it if there were real users—perhaps he can still be convinced.

Statistics from the 3.7 development cycle

By Jonathan Corbet
November 28, 2012

The 3.7-rc7 prepatch came out on November 25; it may well be the last prepatch for the 3.7 development cycle. 3.7 was one of the more active cycles in recent history, with nearly 12,000 non-merge changesets incorporated by the time of this writing. It's time for our traditional look at what was done during this cycle and where all that work came from.

The 3.7 merge window was especially busy this time around. Here are some counts for recent kernels:

Kernel	-rc1	Total
3.0	7,333	9,153
3.1	7,202	8,693
3.2	10,214	11,881
3.3	8,899	10,550
3.4	9,249	10,899
3.5	9,534	10,957
3.6	8,587	10,247
3.7	10,409	11,815

The 3.7 development cycle, thus, saw the most active merge window in the 3.x era; it is, in fact, the most active merge window ever. Even allowing for the fact that 3.7 will add a few more changesets before final release, the 2.6.25 kernel, at 12,243 changesets total, will probably still hold the record for the most active development cycle ever, but the 2.6.25 merge window only saw 9,450 changesets merged. One could conclude from these numbers that we are getting better at getting our changes in during the merge window — and at having fewer things to fix thereafter.
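The arithmetic behind that conclusion is easy to reproduce. The sketch below uses only the counts given in this article (the 3.7 total is the count as of this writing, so its share will drop slightly by the final release); merge_window_share() is a helper named for this example.

```python
# (changesets merged by -rc1, total changesets) as given in the article.
COUNTS = {
    "2.6.25": (9450, 12243),
    "3.0": (7333, 9153),
    "3.1": (7202, 8693),
    "3.2": (10214, 11881),
    "3.3": (8899, 10550),
    "3.4": (9249, 10899),
    "3.5": (9534, 10957),
    "3.6": (8587, 10247),
    "3.7": (10409, 11815),  # total as of this writing
}


def merge_window_share(rc1: int, total: int) -> float:
    """Percentage of a cycle's changesets that arrived by -rc1."""
    return 100.0 * rc1 / total


if __name__ == "__main__":
    for kernel, (rc1, total) in COUNTS.items():
        print(f"{kernel:>7}  {merge_window_share(rc1, total):5.1f}%")
```

By this measure roughly 77% of the 2.6.25 changesets arrived during the merge window, against roughly 88% for 3.7 so far.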

Nearly 395,000 lines of code were removed from the kernel this time around. That must be balanced against the 719,000 lines that were added, though; the kernel grew by almost 324,000 lines as a result.

1,271 developers contributed to the 3.7 kernel — a relatively high number, but not out of line with previous development cycles. The lists of the most active developers do see some changes this time around, though:

Most active 3.7 developers

By changesets

Developer	Changesets	Percent
H Hartley Sweeten	417	3.5%
Antti Palosaari	216	1.8%
Al Viro	167	1.4%
Wei Yongjun	145	1.2%
Sachin Kamat	138	1.2%
Mark Brown	136	1.2%
Eric W. Biederman	130	1.1%
Daniel Vetter	122	1.0%
David Howells	119	1.0%
Hans Verkuil	119	1.0%
Greg Kroah-Hartman	116	1.0%
Arnd Bergmann	112	0.9%
Peter Senna Tschudin	104	0.9%
Ben Skeggs	97	0.8%
Peter Ujfalusi	96	0.8%
Ian Abbott	96	0.8%
Devendra Naga	90	0.8%
David S. Miller	84	0.7%
Takashi Iwai	83	0.7%
Johannes Berg	78	0.7%

By changed lines

Developer	Lines changed	Percent
David Howells	65206	7.6%
Ben Skeggs	50282	5.8%
David Daney	46825	5.4%
Arnd Bergmann	17505	2.0%
Sebastian Andrzej Siewior	16088	1.9%
Daniel Cotey	14157	1.6%
H Hartley Sweeten	13566	1.6%
Catalin Marinas	13519	1.6%
Antti Palosaari	12336	1.4%
Bill Pemberton	10935	1.3%
Dan Magenheimer	10509	1.2%
Ezequiel Garcia	10211	1.2%
David S. Miller	9258	1.1%
Hans Verkuil	8686	1.0%
Will Deacon	8404	1.0%
Shawn Guo	7464	0.9%
Alois Schlögl	7301	0.8%
Roland Stigge	6987	0.8%
Greg Kroah-Hartman	6920	0.8%
Laurent Pinchart	6107	0.7%

In a repeat of his 3.6 performance, H. Hartley Sweeten hit the top of the by-changesets list with a vast number of patches preparing the comedi drivers for graduation from the staging tree (removing over 5000 lines of code in the process). Antti Palosaari did a lot of work on drivers in the Video4Linux2 subsystem. Al Viro continues to refactor and clean up the VFS and core kernel areas with some excursions into most architecture subtrees. Wei Yongjun and Sachin Kamat both did a lot of cleanup work all over the driver tree.

David Howells ended up at the top of the "lines changed" column mostly by virtue of the user-space API header file thrashup, but he also contributed code for module signing and more. Ben Skeggs merged a major reworking of the nouveau driver, David Daney improved support for MIPS OCTEON processors, Arnd Bergmann's many patches were dominated by the removal of the unused mach-bcmring architecture code, and Sebastian Andrzej Siewior did a lot of work on the USB gadget driver subsystem.

Worth noting in passing: Fengguang Wu is credited with 63 bug reports during this cycle, almost 11% of the total. The others with at least ten reports are Dan Carpenter (21), Randy Dunlap (16), Stephen Rothwell (15), Paul McKenney (11), and Alex Lyakas (10). Every one of those reports resulted in a bug that was fixed before this kernel was released in stable form.

An even 200 employers (that we know about) contributed during the 3.7 cycle. The most active of these were:

Most active 3.7 employers

By changesets

Employer	Changesets	Percent
(None)	1435	12.1%
Red Hat	1159	9.8%
(Unknown)	843	7.1%
Intel	800	6.8%
Texas Instruments	597	5.1%
IBM	516	4.4%
Linaro	509	4.3%
Vision Engraving Systems	417	3.5%
SUSE	356	3.0%
Google	245	2.1%
Samsung	198	1.7%
Freescale	181	1.5%
Oracle	177	1.5%
Wolfson Microelectronics	148	1.3%
AMD	144	1.2%
Trend Micro	144	1.2%
Cisco	138	1.2%
Linux Foundation	132	1.1%
Arista Networks	130	1.1%
NVIDIA	123	1.0%

By lines changed

Employer	Lines changed	Percent
Red Hat	157023	18.2%
(None)	80191	9.3%
(Unknown)	71992	8.3%
Cavium	46757	5.4%
IBM	39227	4.5%
Intel	33381	3.9%
Linaro	28900	3.4%
Texas Instruments	28493	3.3%
ARM	24913	2.9%
Oracle	24095	2.8%
NVIDIA	19167	2.2%
linutronix	17211	2.0%
Vision Engraving Systems	14844	1.7%
Samsung	14519	1.7%
Microtrol S.R.L.	12800	1.5%
PHILOSYS Software	10311	1.2%
SUSE	10226	1.2%
Marvell	10067	1.2%
Cisco	9828	1.1%
Pengutronix	9793	1.1%

There are few surprises here. Texas Instruments has reached a new high in its contribution volume, a trend which, unfortunately, may not continue after the recent changes play out there. AMD, too, seems unlikely to remain on this list in the future. Meanwhile Red Hat maintains its place at the top of the list, where it has been since we first started generating these statistics.

And that is where things stand as the 3.7 kernel approaches its final release. Things appear to be running smoothly, with most development cycles taking less than 70 days to complete (if there is no 3.7-rc8, this cycle will run about 64 days). Stay tuned for the about-to-begin 3.8 cycle, with a release to be expected in early February, 2013.

Patches and updates

Kernel trees

Linus Torvalds: Linux 3.7-rc7
Greg KH: Linux 3.6.8
Greg KH: Linux 3.4.20
Steven Rostedt: 3.4.20-rt31
Greg KH: Linux 3.0.53
Steven Rostedt: 3.0.53-rt77

Architecture-specific

Anton Vorontsov: ARM: KDB FIQ debugger
Davide Ciminaghi: enable support for AMBA drivers under x86
Jacob Pan: pm: Intel powerclamp driver
Yinghai Lu: x86, boot, 64bit: Add support for loading ramdisk and bzImage above 4G

Core kernel code

Fabio Baltieri: cpufreq: handle SW coordinated CPUs
Aristeu Rozanski: devcg: introduce proper hierarchy support

Development tools

Anton Vorontsov: KDB: Kiosk (reduced capabilities) mode
Jan Kiszka: Add gdb python scripts as kernel debugging helpers

Device drivers

Adrian Hunter: Add SDHCI ACPI driver
Thierry Reding: iio: adc: Add Texas Instruments ADC081C021/027 support
Thierry Reding: rtc: Add NXP PCF8523 support
Benjamin Tissoires: Support of dual pen/multitouch and new default for win 7 certified devices
Terje Bergstrom: Support for Tegra 2D hardware
Laurent Pinchart: Common Display Framework
Prabhakar Lad: Media Controller capture driver for DM365

Documentation

Vince Weaver: perf: proposed perf_event_open() manpage

Filesystems and block I/O

Dave Kleikamp: loop: Issue O_DIRECT aio using bio_vec
Kent Overstreet: AIO performance improvements/cleanups

Janitorial

H. Peter Anvin: RFC: Remove 386 support

Memory management

Mel Gorman: Automatic NUMA Balancing V5
Ingo Molnar: Latest numa/core release, v17
Ming Lei: solve deadlock caused by memory allocation with I/O
Wen Congyang: memory-hotplug: hot-remove physical memory
Jeff Moyer: make I/O path allocations more numa-friendly
Dave Chinner: Numa aware LRU lists and shrinkers
Anton Vorontsov: Add mempressure cgroup

Networking

Jesper Dangaard Brouer: [RFC net-next PATCH V1 0/9] net: fragmentation performance scalability on NUMA/SMP systems

Virtualization and containers

George Zhang: VMCI for Linux upstreaming
George Zhang: VSOCK for Linux upstreaming
Stanislav Kinsbursky: nfsd: more NFSv4 state containerization
Miklos Szeredi: autofs4: allow autofs to work outside the initial PID namespace
Jason Wang: Multiqueue virtio-net

Page editor: Jonathan Corbet