
Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc7, released on November 25. "A week ago, I had even considered skipping -rc7 entirely as things had been so calm, but decided that there was little reason to hurry the release. And oh, how sadly right I was." This could prove to be the last one, though, if the next week of testing goes well.

Stable updates: 3.0.53, 3.4.20 and 3.6.8 were released on November 26.


Quotes of the week

Ok, guys. Cage fight!

The rules are simple: two men enter, one man leaves.

And the one who comes out gets to explain to me which patch(es) I should apply, and which I should revert, if any.

Linus Torvalds's new decision-making process

And yes, that is the thing about "fairness" -- there are a great many definitions, many of the most useful of which appear to many to be patently unfair.
Paul McKenney

This isn't the message that's gone over, and even for device drivers everyone seems to be taking the whole device tree thing as a move to pull all data out of the kernel. In some cases there are some real practical advantages to doing this but a lot of the people making these changes seem to view having things in DT as a goal in itself.
Mark Brown

My spinning head fell on the floor and is now drilling its way to China.
Andrew Morton


Kernel development news

Uninitialized blocks and unexpected flags

By Jonathan Corbet
November 28, 2012
One often-heard complaint in the early BitKeeper era was that, by letting code reach the mainline without going via a mailing list, BitKeeper made it easy for maintainers to slip surprise changes in underneath the review radar. Those worries have mostly proved unfounded; when surprises have happened, the response from the community has usually helped to ensure that there would be no repeats. But some developers are charging that the 3.7 kernel contains exactly this type of stealth change and are demanding that it be reverted.

Background

The fallocate() system call is meant to be a way for an application to request the efficient allocation of blocks for a file. Use of fallocate() allows a process to verify that the required disk space is available, helps the filesystem to allocate all of the space in a single, contiguous group, and avoids the overhead that block-by-block allocation would incur. In the absence of an fallocate() implementation (each filesystem must implement it independently), the C library will emulate it by simply writing zeroes to the requested block range; that gets the space allocated, but is less efficient than one would like. The implementation of fallocate() within filesystems tries to be more efficient than that; one way to do so is to avoid the process of writing zeroes to the newly-allocated blocks.
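
As a rough illustration of the preallocation idea, here is a minimal Python sketch. It uses os.posix_fallocate(), the C library's wrapper, rather than the raw fallocate() system call, and falls back to writing zeroes in the spirit of the emulation described above. The helper names preallocate() and demo() and the scratch file name are assumptions made for this example only.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` and return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            # Ask the C library (and, through it, the filesystem) for the space.
            os.posix_fallocate(fd, 0, size)
        except (AttributeError, OSError):
            # Fallback in the spirit of the emulation described above:
            # write zeroes over the range so the blocks get allocated.
            os.lseek(fd, 0, os.SEEK_SET)
            remaining = size
            chunk = b"\0" * 65536
            while remaining > 0:
                written = os.write(fd, chunk[:remaining])
                remaining -= written
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo():
    """Preallocate 1 MiB in a scratch file and read the first 4 KiB back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prealloc.dat")
        size = preallocate(path, 1 << 20)
        with open(path, "rb") as f:
            head = f.read(4096)
        return size, head
```

Calling demo() should report a 1 MiB file whose contents read back as zeroes, whichever of the two paths was taken; what differs is how much work the filesystem had to do to get there.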

Leaving stale data in allocated blocks has obvious security implications: a hostile application could read those blocks in the hopes of finding confidential documents, passwords, or the missing Fedora 18 Beta release announcement. To avoid this exposure, filesystems like ext4 will mark unwritten blocks as being uninitialized; any attempt to read those blocks will be intercepted and just return zeroes. In the normal case, the application will write data to those blocks before ever trying to read them; writing obviously initializes the blocks without the need to write zeroes first. This implementation seems like it should be about optimal.

Except that, seemingly, ext4 marks uninitialized blocks at the extent (group of contiguous blocks) level. So, if an application writes to one uninitialized block, the containing extent must be split and the newly-written block(s) added to the previous extent, if possible. That turns out to be more expensive than some users would like. So a shortcut was attempted.

That shortcut first appeared in April, 2012, in the form of a new fallocate() flag called FALLOC_FL_NO_HIDE_STALE. If fallocate() was called with that flag, the newly-allocated blocks would be marked as being initialized even though the old data remained untouched. That obviously brings the old security issues back; to mitigate the problem, the patch added a mount option making the new functionality available only to members of a specific group. That was deemed to be enough, especially for settings where access to the machine as a whole is tightly controlled.

At least, the authors and supporters of the patch deemed the group check to be enough. The patch was roundly criticized by other filesystem developers; the prevailing opinion appeared to be that it was trying to open up a huge security hole in order to avoid fixing an ext4 performance problem. After that discussion, the patch went away and wasn't heard from again.
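
The cost that the shortcut was meant to avoid can be seen in a toy model of extent bookkeeping. The sketch below is not ext4's code; the Extent class, allocate(), write_block(), and the no_hide_stale parameter are all hypothetical names chosen for illustration. It only shows why a write into an uninitialized extent forces a split (and possibly a merge), while an extent marked initialized from the start needs no such work.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start: int          # first block number covered by the extent
    length: int         # number of contiguous blocks
    initialized: bool   # False means reads must return zeroes


def allocate(start, length, no_hide_stale=False):
    """Model preallocation: one extent, uninitialized unless the shortcut is used."""
    return [Extent(start, length, initialized=no_hide_stale)]


def write_block(extents, block):
    """Mark `block` as written, splitting and merging extents as needed."""
    pieces = []
    for ext in extents:
        end = ext.start + ext.length
        if ext.initialized or not (ext.start <= block < end):
            pieces.append(ext)
            continue
        # Split the uninitialized extent into up to three parts.
        if block > ext.start:
            pieces.append(Extent(ext.start, block - ext.start, False))
        pieces.append(Extent(block, 1, True))
        if block + 1 < end:
            pieces.append(Extent(block + 1, end - block - 1, False))
    # Merge neighbours that are contiguous and in the same state.
    merged = []
    for ext in pieces:
        prev = merged[-1] if merged else None
        if (prev and prev.initialized == ext.initialized
                and prev.start + prev.length == ext.start):
            prev.length += ext.length
        else:
            merged.append(Extent(ext.start, ext.length, ext.initialized))
    return merged
```

In this model, writing block 3 of an eight-block uninitialized extent leaves three extents behind, and a following write to block 4 grows the middle one. With no_hide_stale=True the list never changes, which is the performance gain; the price is that the "initialized" blocks still hold whatever data was there before.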

A surprise flag

At least, it was not heard from until recently, when some filesystem developers were surprised to discover this commit by Ted Ts'o which found its way into the mainline (via the ext4 tree) during the 3.7 merge window. The patch is small and simple; it simply defines the FALLOC_FL_NO_HIDE_STALE flag, but adds no code to actually implement it. The changelog reads:

As discussed at the Plumber's Conference, reserve the bit 0x04 in fallocate() to prevent collisions with a commonly used out-of-tree patch which implements the no-hide-stale feature.

Filesystem developer Dave Chinner, at least, does not recall this discussion. His response was to post a patch reverting the change, saying:

The lack of formal review and discussion for a syscall API change is grounds for reverting patch, especially given the controversial nature of the feature and the previous discussions and NAKs. The way the change was pushed into mainline borders on an abuse of the trust we place in maintainers and hence as a matter of principle this change should be reverted.

It is true that this particular change is a bit abnormal. It changes the core filesystem code but came by way of a filesystem-specific tree with no acks from any other developers. The patch does not appear to have been posted to any relevant mailing list, violating the rule that all patches should go through public review before being pushed toward the mainline. The addition of a flag with no in-kernel users is also contrary to usual kernel practice. It is, in summary, the sort of change that less well-established kernel developers would never get away with making. So it is hard to fault other filesystem developers for being surprised and unhappy.

On the other hand, the change just adds a flag definition; it obviously cannot cause problems for existing code. And there does appear to be a real user community for this feature. Ted justified his action this way:

It doesn't change the interface or break anything; it just reserves a bit so that out-of-tree patches don't collide with future allocations. There are significant usages of this bit within Google and Tao Bao. It is true that there has been significant pushback about adding this functionality on linux-fsdevel; I find it personally frustrating that in effect, if enough people scream, they can veto an optional feature that might only be implemented by a single file system.

This explanation does not appear to have satisfied anybody, though. So we have an impasse of sorts; some developers want a flag to control a functionality they need, while others see it as a security problem and the result of an abuse of the kernel's trust system.
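
The bit-reservation argument can be made a little more concrete with a small sketch. The first two flag values below are those found in <linux/falloc.h>, and 0x04 is the bit named in the changelog; everything else (the IMPLEMENTED and RESERVED masks, check_mode(), next_free_bit()) is a hypothetical model for illustration, not the kernel's own validation code.

```python
import errno

FALLOC_FL_KEEP_SIZE = 0x01       # value from <linux/falloc.h>
FALLOC_FL_PUNCH_HOLE = 0x02      # value from <linux/falloc.h>
FALLOC_FL_NO_HIDE_STALE = 0x04   # the bit reserved by the commit discussed above

# Flags this toy "mainline" actually implements; the reserved bit is not among them.
IMPLEMENTED = FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE
RESERVED = FALLOC_FL_NO_HIDE_STALE


def check_mode(mode):
    """Return 0 if every bit in `mode` is implemented, else -EOPNOTSUPP."""
    if mode & ~IMPLEMENTED:
        return -errno.EOPNOTSUPP
    return 0


def next_free_bit(taken):
    """Return the lowest flag bit not present in the `taken` mask."""
    bit = 1
    while taken & bit:
        bit <<= 1
    return bit
```

Without the reservation, next_free_bit(IMPLEMENTED) hands out 0x04 to the next new flag, which would collide with the out-of-tree patch; with it, next_free_bit(IMPLEMENTED | RESERVED) moves on to 0x08. In both cases a caller passing 0x04 to this model still gets an error, which is the sense in which the change "just reserves a bit."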

Alan Cox suggested that it would be possible to, instead, reserve a set of filesystem-private flags that could be used for any purpose by any filesystem. Dave pointed out, however, that a flag bit that behaved differently from one filesystem to the next is a recipe for trouble. His suggestion, instead, is that this functionality should be implemented via the ioctl() interface, which is where filesystem-specific options usually hide. The ioctl() approach seems like it should be workable, but no patches to that effect have been posted thus far.

As of this writing, Linus has not accepted the revert, so the FALLOC_FL_NO_HIDE_STALE flag can still be found in the 3.7 kernel. He has also remained silent in the discussion. He will have to make a decision one way or the other, though, before the final 3.7 release is made. Once that flag is made available in a stable mainline release, it will be much harder to get rid of, so, if that flag is going to come out, it needs to happen soon.


The return of loadable security modules?

By Jake Edge
November 28, 2012

The idea behind the Linux Security Module (LSM) interface was initially discussed as part of the "NSA Linux" session at the first Kernel Summit back in 2001. The intent was to avoid wiring a particular security solution into the kernel; instead, multiple approaches to security could be built on top of a common kernel API. Originally, as the name implies, the solutions were built as loadable kernel modules, but eventually the "M" in LSM became just a historical artifact as the API was no longer exported to modules (essentially requiring security "modules" to be statically linked into the kernel). But it's possible that may all change again with a recent patch to bring back loadable LSMs.

Some history

A bit of history is probably in order. The LSM API came about specifically because Linus Torvalds didn't want to have to choose between a number of competing access control mechanisms for the kernel. Instead, LSM would provide a way for any of those mechanisms to hook into the kernel and deny access to various kinds of resources (files, devices, tasks, inodes, etc.) based on the security model being implemented. Initially, the LSMs would be implemented as kernel modules that could be loaded at runtime and, in some cases, unloaded.

The LSM interface was released as part of the 2.5 development kernel series in 2002, and was part of the first 2.6 release in December 2003. For several years after that, there was only one in-tree user of the interface: SELinux. That led to a 2005 suggestion to remove the LSM API entirely, effectively just calling SELinux directly. That would turn SELinux into the "one true security solution" for Linux. In 2006, James Morris proposed a patch to move LSM to the "feature removal" list, scheduled for the 2.6.18 kernel, which was roughly two months out at that point.

But, along came Smack, which implemented a "simplified" Mandatory Access Control (MAC) scheme for the kernel. It also used the LSM interface, so, to a certain extent, the decision on whether to merge it hinged on the future of LSM. In October 2007, Torvalds clearly stated his intention to keep LSM in the kernel, thus paving the way for Smack to be merged.

At more or less the same time Smack was merged, another change to LSM was made. First discussed in mid-2007, Torvalds merged a patch for the 2.6.24 kernel that switched LSM to a static interface so that security "modules" needed to be built into the kernel. One could still choose which security module to use with kernel command-line parameters, but dynamic security module loading would no longer be allowed.

There were a number of reasons behind the switch. For one thing, unloading modules was always messy (or impossible), partly because keeping a coherent security state through that process is difficult. In addition, the LSM API is very intrusive, allowing modules to hook nearly any kernel operation, which can be (and was) abused. While the LSM symbols were exported as GPL-only, that didn't stop some proprietary modules from abusing the interface. There were also free software modules that used the interface for non-security purposes (e.g. the realtime "security" module). Those kinds of problems could also be used as arguments against having the LSM API at all, but since Torvalds had already put his foot down on that particular question, removing the ability to load LSMs was seen as a reasonable alternative.

At the time that Torvalds merged the patch that made that switch, he asked for "real world" users of loadable security modules to step forward. There were a few examples of out-of-tree LSMs that were loadable (and, possibly, unloadable), but none that actually seemed to require that ability. The main users of the feature were LSM developers, who might routinely load and unload their LSM during development.

The next few years saw the merging of Smack (2.6.25), TOMOYO (2.6.30), and AppArmor (2.6.36). The latter had been long out of tree; its existence was part of the reason that the LSM interface came about in the first place. There have also been periodic attempts to get smaller, single-purpose security changes into the kernel over the years, but those were always pointed to the LSM interface. There is a problem with that particular suggestion, though, as only one LSM can be active at a time. Most distributions already have their one security module "slot" filled up. Red Hat and Fedora use SELinux, Ubuntu uses AppArmor, while SUSE and openSUSE have both AppArmor and SELinux available. Adding a specialized LSM for additional security protections is generally not possible without removing or disabling the distribution-supplied security solution.

Proposed LSM changes

That "one LSM at a time" problem has led to persistent (if intermittent) calls for ways to stack or chain LSMs. Smack developer Casey Schaufler is the most recent to propose a stacking solution. His patch set has been steadily reviewed on the LSM mailing list since it was first posted in September; it is now up to version 8. That particular version came with an interesting caveat:

I have not tried to reintroduce LSMs as loadable modules, in spite of the vigor with which it has been requested. I see that as work for another day, and a [separate] battle to fight.

Those requests came from the developer of the TOMOYO LSM, Tetsuo Handa. In earlier discussion of Schaufler's stacking patch, Handa advocated a return to allowing loadable LSMs. In fact, he went further than that, proposing a set of patches that would restore the ability to load LSMs as well as converting TOMOYO to use that feature.

Handa lists three reasons for making the change. To start with, any distribution that wants to allow its users to experiment with different LSMs must build all of those LSMs statically into the kernel. That will not only increase the size of the kernel, it will also increase the time it takes to load and boot the kernel. Most of that space (and time) would be completely wasted even for the users who are experimenting. All of that makes it less likely that distributions will actually build kernels that way.

Beyond that, though, many distributions have their preferred LSM, so they don't build extra LSMs into their kernels. That leaves users to build their own kernels, which is generally unacceptable, particularly in enterprise settings. But even if there are other LSMs built into the kernel, it takes a reboot to enable them. Handa notes that he uses a loadable kernel module that implements TOMOYO (called AKARI) to diagnose problems in enterprise systems. In order to access the LSM symbols (which are no longer exported), AKARI must do some kind of runtime address resolution, perhaps using /proc/kallsyms or System.map. But, AKARI is something he can load into running systems when needed—unlike regular LSMs.
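
For readers unfamiliar with that kind of address resolution, the general technique looks something like the sketch below, which parses text in the /proc/kallsyms format (address, type, name, optional module) into a lookup table. This is not AKARI's code; parse_kallsyms() is a hypothetical helper, and the sample addresses and symbol names are made up for the example.

```python
def parse_kallsyms(text):
    """Map symbol name -> address from text in /proc/kallsyms format."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        address, _symbol_type, name = fields[0], fields[1], fields[2]
        # Keep the first occurrence; later duplicates (e.g. from modules) are ignored.
        table.setdefault(name, int(address, 16))
    return table


# Made-up sample data; the addresses and symbol names are illustrative only.
SAMPLE = """\
ffffffff81000000 T example_text_start
ffffffff811a2b30 T example_hook_register
ffffffff81c0d040 d example_ops_table
ffffffffa0001000 t helper_fn\t[example_module]
"""
```

A module doing this for real would look up the symbols it needs and call or patch through the resulting addresses, which is exactly the sort of thing that the removal of the exports was meant to discourage.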

One could argue that Handa's use of an LSM for system troubleshooting is a misuse of the interface, but the fact remains that changing LSMs currently requires a reboot. That problem potentially becomes more acute if LSM stacking is merged. One must decide pre-boot which LSMs to enable (and in what order they are consulted). Whatever else can be said, disallowing LSM loading reduces flexibility.
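
A toy model may help to show what "stacking" and "order of consultation" mean here. It is only a sketch under simple assumptions (every module sees every hook, and the first denial wins); it is not taken from Schaufler's patch set, and the Module and Stack classes and the hook names in the usage note are hypothetical.

```python
class Module:
    """A toy security module: a name plus a set of (hook, subject) pairs it denies."""

    def __init__(self, name, denied):
        self.name = name
        self.denied = set(denied)

    def check(self, hook, subject):
        # Return 0 to allow, a negative value to deny (mirroring kernel convention).
        return -1 if (hook, subject) in self.denied else 0


class Stack:
    """Consult registered modules in order; the first denial wins."""

    def __init__(self):
        self.modules = []

    def register(self, module):
        self.modules.append(module)

    def call(self, hook, subject):
        for module in self.modules:
            rc = module.check(hook, subject)
            if rc != 0:
                return rc, module.name
        return 0, None
```

In this model, register() can be called at any time, which is the flexibility that loadable LSMs would provide. A static interface corresponds to fixing the contents and order of the modules list before the first call() is ever made.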

Handa's third reason is a bit more philosophical: "LSM is not the tool for thought control." Essentially, he argues, disallowing LSM loading just makes dealing with LSMs harder for both users and developers. It also means that the more "minor" LSMs (e.g. TOMOYO and Smack) get less exposure because fewer users can actually try them.

While there have been no comments on Handa's patches as yet, there have been expressions of support for loadable LSMs by some. Schaufler, for example, does not seem opposed necessarily. Kees Cook agreed with the need for loadable LSMs, though he was concerned that combining it with the LSM stacking patches would potentially block the progress for stacking. Morris, who authored the original patch to block loadable LSMs, has not yet spoken up one way or the other.

Taking away the ability to load LSMs did not really change the picture for the kinds of abuses that were brought up at the time the change was made. Kernel modules can still abuse the interface, though it may take a bit more work. If binary modules were willing to ignore the GPL-only export of the LSM interface, they are probably willing to ferret out the addresses they need instead. Open source modules can do much the same. At the time of the switch to a static interface back in 2007, Torvalds seemed very open to reverting it if there were real users—perhaps he can still be convinced.


Statistics from the 3.7 development cycle

By Jonathan Corbet
November 28, 2012
The 3.7-rc7 prepatch came out on November 25; it may well be the last prepatch for the 3.7 development cycle. 3.7 was one of the more active cycles in recent history, with nearly 12,000 non-merge changesets incorporated by the time of this writing. It's time for our traditional look at what was done during this cycle and where all that work came from.

The 3.7 merge window was especially busy this time around. Here are some counts for recent kernels:

Kernel    -rc1      Total
3.0       7,333     9,153
3.1       7,202     8,693
3.2      10,214    11,881
3.3       8,899    10,550
3.4       9,249    10,899
3.5       9,534    10,957
3.6       8,587    10,247
3.7      10,409    11,815

The 3.7 development cycle, thus, saw the most active merge window in the 3.x era; it is, in fact, the most active merge window ever. The record for the most active development cycle overall will probably still belong to the 2.6.25 kernel, at 12,243 changesets total, even allowing for the fact that 3.7 will add a few more changesets before its final release. The 2.6.25 merge window, though, only saw 9,450 changesets merged. One could conclude from these numbers that we are getting better at getting our changes in during the merge window, and at having fewer things to fix thereafter.
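
That conclusion can be checked with a little arithmetic on the figures quoted above. The sketch below computes the share of each cycle's changesets that arrived during the merge window; the CYCLES table simply restates the article's numbers (with the 3.7 total being the count as of this writing, not the final one), and the function names are chosen for this example.

```python
# (changesets at -rc1, total changesets), copied from the figures quoted above.
CYCLES = {
    "2.6.25": (9450, 12243),
    "3.0": (7333, 9153),
    "3.1": (7202, 8693),
    "3.2": (10214, 11881),
    "3.3": (8899, 10550),
    "3.4": (9249, 10899),
    "3.5": (9534, 10957),
    "3.6": (8587, 10247),
    "3.7": (10409, 11815),  # total as of this writing, not the final count
}


def merge_window_share(rc1, total):
    """Fraction of a cycle's changesets merged by the time of -rc1."""
    return rc1 / total


def shares():
    """Return {kernel: merge-window share} for every cycle in CYCLES."""
    return {kernel: merge_window_share(*counts) for kernel, counts in CYCLES.items()}
```

By this measure roughly 88% of 3.7's changesets so far came in during the merge window, against about 77% for 2.6.25. The 3.7 figure will drop slightly as the last fixes are merged.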

Nearly 395,000 lines of code were removed from the kernel this time around. That must be balanced against the 719,000 lines that were added, though; the kernel grew by almost 324,000 lines as a result.

1,271 developers contributed to the 3.7 kernel — a relatively high number, but not out of line with previous development cycles. The lists of the most active developers do see some changes this time around, though:

Most active 3.7 developers

By changesets
Developer                    Changesets   Percent
H Hartley Sweeten                   417      3.5%
Antti Palosaari                     216      1.8%
Al Viro                             167      1.4%
Wei Yongjun                         145      1.2%
Sachin Kamat                        138      1.2%
Mark Brown                          136      1.2%
Eric W. Biederman                   130      1.1%
Daniel Vetter                       122      1.0%
David Howells                       119      1.0%
Hans Verkuil                        119      1.0%
Greg Kroah-Hartman                  116      1.0%
Arnd Bergmann                       112      0.9%
Peter Senna Tschudin                104      0.9%
Ben Skeggs                           97      0.8%
Peter Ujfalusi                       96      0.8%
Ian Abbott                           96      0.8%
Devendra Naga                        90      0.8%
David S. Miller                      84      0.7%
Takashi Iwai                         83      0.7%
Johannes Berg                        78      0.7%

By changed lines
Developer                         Lines   Percent
David Howells                     65206      7.6%
Ben Skeggs                        50282      5.8%
David Daney                       46825      5.4%
Arnd Bergmann                     17505      2.0%
Sebastian Andrzej Siewior         16088      1.9%
Daniel Cotey                      14157      1.6%
H Hartley Sweeten                 13566      1.6%
Catalin Marinas                   13519      1.6%
Antti Palosaari                   12336      1.4%
Bill Pemberton                    10935      1.3%
Dan Magenheimer                   10509      1.2%
Ezequiel Garcia                   10211      1.2%
David S. Miller                    9258      1.1%
Hans Verkuil                       8686      1.0%
Will Deacon                        8404      1.0%
Shawn Guo                          7464      0.9%
Alois Schlögl                      7301      0.8%
Roland Stigge                      6987      0.8%
Greg Kroah-Hartman                 6920      0.8%
Laurent Pinchart                   6107      0.7%

In a repeat of his 3.6 performance, H. Hartley Sweeten hit the top of the by-changesets list with a vast number of patches preparing the comedi drivers for graduation from the staging tree (removing over 5000 lines of code in the process). Antti Palosaari did a lot of work on drivers in the Video4Linux2 subsystem. Al Viro continues to refactor and clean up the VFS and core kernel areas with some excursions into most architecture subtrees. Wei Yongjun and Sachin Kamat both did a lot of cleanup work all over the driver tree.

David Howells ended up at the top of the "lines changed" column mostly by virtue of the user-space API header file thrashup, but he also contributed code for module signing and more. Ben Skeggs merged a major reworking of the nouveau driver, David Daney improved support for MIPS OCTEON processors, Arnd Bergmann's many patches were dominated by the removal of the unused mach-bcmring architecture code, and Sebastian Andrzej Siewior did a lot of work on the USB gadget driver subsystem.

Worth noting in passing: Fengguang Wu is credited with 63 bug reports during this cycle, almost 11% of the total. The others with at least ten reports are Dan Carpenter (21), Randy Dunlap (16), Stephen Rothwell (15), Paul McKenney (11), and Alex Lyakas (10). Every one of those reports resulted in a bug that was fixed before this kernel was released in stable form.

An even 200 employers (that we know about) contributed during the 3.7 cycle. The most active of these were:

Most active 3.7 employers

By changesets
Employer                     Changesets   Percent
(None)                             1435     12.1%
Red Hat                            1159      9.8%
(Unknown)                           843      7.1%
Intel                               800      6.8%
Texas Instruments                   597      5.1%
IBM                                 516      4.4%
Linaro                              509      4.3%
Vision Engraving Systems            417      3.5%
SUSE                                356      3.0%
Google                              245      2.1%
Samsung                             198      1.7%
Freescale                           181      1.5%
Oracle                              177      1.5%
Wolfson Microelectronics            148      1.3%
AMD                                 144      1.2%
Trend Micro                         144      1.2%
Cisco                               138      1.2%
Linux Foundation                    132      1.1%
Arista Networks                     130      1.1%
NVIDIA                              123      1.0%

By lines changed
Employer                          Lines   Percent
Red Hat                          157023     18.2%
(None)                            80191      9.3%
(Unknown)                         71992      8.3%
Cavium                            46757      5.4%
IBM                               39227      4.5%
Intel                             33381      3.9%
Linaro                            28900      3.4%
Texas Instruments                 28493      3.3%
ARM                               24913      2.9%
Oracle                            24095      2.8%
NVIDIA                            19167      2.2%
linutronix                        17211      2.0%
Vision Engraving Systems          14844      1.7%
Samsung                           14519      1.7%
Microtrol S.R.L.                  12800      1.5%
PHILOSYS Software                 10311      1.2%
SUSE                              10226      1.2%
Marvell                           10067      1.2%
Cisco                              9828      1.1%
Pengutronix                        9793      1.1%

There are few surprises here. Texas Instruments has reached a new high in its contribution volume, a trend which, unfortunately, may not continue after the recent changes play out there. AMD, too, seems unlikely to remain on this list in the future. Meanwhile Red Hat maintains its place at the top of the list, where it has been since we first started generating these statistics.

And that is where things stand as the 3.7 kernel approaches its final release. Things appear to be running smoothly, with most development cycles taking less than 70 days to complete (if there is no 3.7-rc8, this cycle will run about 64 days). Stay tuned for the about-to-begin 3.8 cycle, with a release to be expected in early February, 2013.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.7-rc7
Greg KH Linux 3.6.8
Greg KH Linux 3.4.20
Steven Rostedt 3.4.20-rt31
Greg KH Linux 3.0.53
Steven Rostedt 3.0.53-rt77

Janitorial

H. Peter Anvin RFC: Remove 386 support

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds