Brief items
The current development kernel is 3.7-rc7, released on November 25. "A week
ago, I had even considered skipping -rc7 entirely as things had been so
calm, but decided that there was little reason to hurry the release. And
oh, how sadly right I was." This could prove to be the last prepatch,
though, if the next week of testing goes well.
Stable updates: 3.0.53,
3.4.20 and 3.6.8 were released on November 26.
Ok, guys. Cage fight!
The rules are simple: two men enter, one man leaves.
And the one who comes out gets to explain to me which patch(es) I
should apply, and which I should revert, if any.
—
Linus Torvalds's new decision-making process
And yes, that is the thing about "fairness" -- there are a great
many definitions, many of the most useful of which appear to many
to be patently unfair.
—
Paul McKenney
This isn't the message that's gone over, and even for device
drivers everyone seems to be taking the whole device tree thing as
a move to pull all data out of the kernel. In some cases there are
some real practical advantages to doing this but a lot of the
people making these changes seem to view having things in DT as a
goal in itself.
—
Mark Brown
My spinning head fell on the floor and is now drilling its way to
China.
—
Andrew Morton
Kernel development news
By Jonathan Corbet
November 28, 2012
One often-heard complaint in the early BitKeeper era was that, by letting code
reach the mainline without going via a mailing list, BitKeeper made it easy for
maintainers to slip surprise changes in underneath the review radar. Those
worries have mostly proved unfounded; when surprises have happened, the
response from the community has usually helped to ensure that there would
be no repeats. But some developers are charging that the 3.7 kernel
contains exactly this type of stealth change and are demanding that it be
reverted.
Background
The fallocate() system call is meant to be a way for an
application to request the efficient allocation of blocks for a file. Use
of fallocate() allows a process to verify that the required disk
space is available, helps the filesystem to allocate all of the space
in a single, contiguous group, and avoids the overhead that block-by-block
allocation
would incur. In the absence of an fallocate() implementation
(each filesystem must implement it independently), the
C library will emulate it by
simply writing zeroes to the requested block range; that gets the space
allocated, but is less efficient than one would like. The implementation
of fallocate() within filesystems tries to be more efficient than
that; one way to do so is to avoid the process of writing zeroes to the
newly-allocated blocks.
Leaving stale data in allocated blocks has obvious security implications: a
hostile application could read those blocks in the hopes of finding
confidential documents, passwords, or the missing Fedora 18 Beta release
announcement. To avoid this exposure, filesystems like ext4 will mark
unwritten blocks as being uninitialized; any attempt to read those blocks
will be intercepted and just return zeroes. In the normal case, the
application will write data to those blocks before ever trying to read
them; writing obviously initializes the blocks without the need to write
zeroes first. This implementation seems like it should be about optimal.
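The behavior described above can be observed from user space. The following is a minimal sketch in Python, assuming a Linux system; it uses os.posix_fallocate(), which goes through the C library and so may be satisfied either by the filesystem's own implementation or by the zero-writing fallback. The helper name preallocate() and the file name are made up for illustration.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path`; return (file size, first 16 bytes)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask for the space up front. The C library may emulate this by
        # writing zeroes if the filesystem has no native support.
        os.posix_fallocate(fd, 0, size)
        allocated = os.fstat(fd).st_size
        # Nothing has been written yet, so this read must not expose
        # whatever the underlying blocks held before.
        data = os.pread(fd, 16, 0)
    finally:
        os.close(fd)
    return allocated, data


with tempfile.TemporaryDirectory() as tmp:
    allocated, data = preallocate(os.path.join(tmp, "prealloc.dat"), 1 << 20)
    print(allocated, data == bytes(16))
```

Whichever path the request takes, the application sees a file of the requested size whose unwritten contents read back as zeroes; the difference lies only in how much work was done to get there.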
Except that, seemingly, ext4 marks uninitialized blocks at the extent
(group of contiguous blocks) level. So, if an application writes to one
uninitialized block, the containing extent must be split and the
newly-written block(s) added to the previous extent, if possible. That
turns out to be more expensive than some users would like. So a shortcut
was attempted.
That shortcut first appeared in April, 2012, in the form of a new fallocate() flag called
FALLOC_FL_NO_HIDE_STALE. If fallocate() was called with
that flag, the newly-allocated blocks would be marked as being initialized
even though the old data remained untouched. That obviously brings the old
security
issues back; to mitigate the problem, the patch added a mount option making
the new
functionality available only to members of a specific group. That was
deemed to be enough, especially for settings where access to the machine as
a whole is tightly controlled.
At least, the authors and supporters of the patch deemed the group check to
be enough. The patch was roundly criticized by other filesystem
developers; the prevailing opinion appeared to be that it was trying to
open up a huge security hole in order to avoid fixing an ext4 performance
problem. After that discussion, the patch went away and wasn't heard from
again.
A surprise flag
At least, it was not heard from until recently, when some filesystem
developers were surprised to discover this
commit by Ted Ts'o which found its way into the mainline (via the ext4 tree)
during the 3.7 merge window. The patch is small and simple; it simply
defines the FALLOC_FL_NO_HIDE_STALE flag, but adds no code to
actually implement it. The changelog reads:
As discussed at the Plumber's Conference, reserve the bit 0x04 in
fallocate() to prevent collisions with a commonly used out-of-tree
patch which implements the no-hide-stale feature.
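The collision that the changelog worries about can be illustrated with a small sketch. The three flag values below are those found in the kernel's falloc.h header at the time; the names FALLOC_FL_FUTURE and OUT_OF_TREE_NO_HIDE_STALE, and the scenario built around them, are hypothetical and exist only to show what an unreserved bit could lead to.

```python
# Flag values as defined in the kernel's falloc.h during the 3.7 cycle.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_NO_HIDE_STALE = 0x04  # reserved only; nothing in-tree implements it


def find_collisions(flags):
    """Return pairs of flag names (in sorted order) whose bits overlap."""
    names = sorted(flags)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if flags[a] & flags[b]
    ]


mainline = {
    "FALLOC_FL_KEEP_SIZE": FALLOC_FL_KEEP_SIZE,
    "FALLOC_FL_PUNCH_HOLE": FALLOC_FL_PUNCH_HOLE,
    "FALLOC_FL_NO_HIDE_STALE": FALLOC_FL_NO_HIDE_STALE,
}

# Hypothetical: without the reservation, a later mainline flag (invented
# name) could be assigned 0x04 while out-of-tree kernels already use it.
without_reservation = {
    "FALLOC_FL_KEEP_SIZE": FALLOC_FL_KEEP_SIZE,
    "FALLOC_FL_PUNCH_HOLE": FALLOC_FL_PUNCH_HOLE,
    "FALLOC_FL_FUTURE": 0x04,
    "OUT_OF_TREE_NO_HIDE_STALE": 0x04,
}

print(find_collisions(mainline))
print(find_collisions(without_reservation))
```

In the second case, an application built against one kernel's meaning of bit 0x04 would silently get the other meaning when run elsewhere.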
Filesystem developer Dave Chinner, at least, does not recall this
discussion. His response was to post a patch
reverting the change, saying:
The lack of formal review and discussion for a syscall API change
is grounds for reverting patch, especially given the controversial
nature of the feature and the previous discussions and NAKs. The
way the change was pushed into mainline borders on an abuse of the
trust we place in maintainers and hence as a matter of principle
this change should be reverted.
It is true that this particular change is a bit abnormal. It changes
the core filesystem code but came by way of a filesystem-specific tree with
no acks from any other developers. The patch does not appear to have been
posted to any relevant mailing list, violating the rule that all patches
should go through public review before being pushed toward the mainline.
The addition of a flag with no in-kernel users is also contrary to usual
kernel practice. It is, in summary, the sort of change that less
well-established kernel developers would never get away with making. So it
is hard to fault other filesystem developers for being surprised and
unhappy.
On the other hand, the change just adds a flag definition; it obviously
cannot cause problems for existing code. And there does appear to be a
real user community for this feature. Ted justified his action this way:
It doesn't change the interface or break anything; it just reserves
a bit so that out-of-tree patches don't collide with future
allocations. There are significant usages of this bit within
Google and Tao Bao. It is true that there has been significant
pushback about adding this functionality on linux-fsdevel; I find
it personally frustrating that in effect, if enough people scream,
they can veto an optional feature that might only be implemented by
a single file system.
This explanation does not appear to have satisfied anybody, though. So we
have an impasse of sorts; some developers want a flag to control
functionality they need, while others see it as a security problem and the
result of an abuse of the kernel's trust system.
Alan Cox suggested that it would be
possible to, instead, reserve a set of filesystem-private flags that could
be used for any purpose by any filesystem. Dave pointed out, however, that a flag
bit that behaves differently from one filesystem to the next is a recipe
for trouble. His suggestion, instead, is that this functionality should be
implemented via the ioctl() interface, which is where
filesystem-specific options usually hide. The ioctl() approach
seems like it should be workable, but no patches to that effect have been
posted thus far.
As of this writing, Linus has not accepted the revert, so the
FALLOC_FL_NO_HIDE_STALE flag can still be found in the 3.7
kernel. He has also remained silent in the discussion. He will have to
make a decision one way or the other, though, before the final 3.7 release
is made. Once that flag is made available in a stable mainline release, it
will be much harder to get rid of, so, if that flag is going to come out,
it needs to happen soon.
By Jake Edge
November 28, 2012
The idea behind the Linux Security Module (LSM) interface was initially
discussed as part of the "NSA Linux" session at the first
Kernel Summit back in 2001. The intent was to avoid wiring a particular
security solution into the kernel; instead, multiple approaches to security
could be built on top of
a common kernel API. Originally, as the name implies, the solutions
were built as loadable kernel modules, but eventually the "M" in LSM became
just a historical artifact as the API was no longer exported
to modules (essentially requiring security "modules" to be statically
linked into the kernel). But it is possible that all of that may change again
with a recent patch to bring back loadable LSMs.
Some history
A bit of history is probably in order. The LSM API came about specifically
because Linus Torvalds didn't want to have to choose between a number of
competing access control mechanisms for the kernel. Instead, LSM would
provide a way for any of those mechanisms to hook into the kernel and deny
access to various kinds of resources (files, devices,
tasks, inodes, etc.) based on the security model being implemented.
Initially, the LSMs would be implemented as kernel
modules that could be loaded at runtime and, in some cases, unloaded.
The LSM interface was released as part of the 2.5 development kernel series
in 2002, and was part of the first 2.6 release in December 2003. For
several years after that, there was only one in-tree user of the interface:
SELinux. That led to a 2005 suggestion to remove
the LSM API entirely, effectively just calling SELinux directly. That
would turn SELinux
into the "one true security solution" for Linux. In 2006, James Morris proposed a patch to
move LSM to the "feature removal" list, scheduled for the 2.6.18 kernel,
which was roughly two months out at that point.
But along came Smack, which implemented a
"simplified" Mandatory Access Control (MAC) scheme for the kernel. It also
used the LSM interface, so, to a certain extent, the decision on whether to merge it hinged on the
future of LSM. In October 2007, Torvalds clearly stated his intention to keep LSM in
the kernel, thus paving the way for Smack to be merged.
At more or less the same time Smack was merged, another change to LSM was
made. First discussed in mid-2007, Torvalds
merged a patch for the 2.6.24 kernel that switched LSM to a static interface so that
security "modules" needed to be built into the kernel. One could still
choose which security module to use with kernel command-line parameters,
but dynamic security module loading would no longer be allowed.
There were a number of reasons behind the switch. For one thing, unloading
modules was always messy (or impossible), partly because keeping a coherent
security
state through that process is difficult. In addition, the LSM API is very
intrusive, allowing modules to hook nearly any kernel operation, which can
be (and was) abused. While the LSM symbols were exported as GPL-only, that
didn't stop some proprietary modules from abusing the interface. There
were also free
software modules that used the interface for non-security purposes (e.g. the realtime "security" module). Those kinds
of problems could also be used as arguments against having the LSM API at
all, but since Torvalds had already put his foot down on that particular
question, removing the ability to load LSMs was seen as a reasonable
alternative.
At the time that Torvalds merged the patch that made that switch, he asked
for "real world" users
of loadable security modules to step forward. There were a few examples of
out-of-tree LSMs that were loadable (and, possibly, unloadable), but none
that actually seemed to require that ability. The main users of the
feature were LSM developers, who might routinely load and unload their LSM
during development.
The next few years saw the merging of Smack (2.6.25), TOMOYO (2.6.30), and
AppArmor (2.6.36). The latter had long been out of tree; its existence was
part of the reason that the LSM interface came about in the first place. There have also been
periodic attempts to get smaller, single-purpose security changes into the
kernel over the years, but those were always pointed to the LSM interface.
There is a problem with that particular suggestion, though, as only one LSM
can be active at a time.
Most
distributions already have their one security module "slot" filled up. Red
Hat and Fedora use SELinux, Ubuntu uses AppArmor, while SUSE and openSUSE
have
both AppArmor and SELinux available. Adding a specialized LSM for
additional security
protections is generally not possible without removing or disabling the
distribution-supplied security solution.
Proposed LSM changes
That "one LSM at a time" problem has led to persistent (if intermittent)
calls for
ways to stack or chain LSMs. Smack developer Casey Schaufler is the most
recent to propose a stacking solution. His
patch set has been steadily reviewed on the LSM mailing list since it was
first posted in September; it is now up to version 8. That particular version came with an
interesting caveat:
I have not tried to
reintroduce LSMs as loadable modules, in spite of the
vigor with which it has been requested. I see that as
work for another day, and a [separate] battle to fight.
Those requests came from the developer of the TOMOYO LSM, Tetsuo Handa.
In earlier discussion of Schaufler's stacking patch, Handa advocated a return to allowing loadable
LSMs. In fact, he went further than that, proposing a set of patches that would restore the ability
to load LSMs as well as converting TOMOYO to use that feature.
Handa lists three reasons for making the change. To start with, any
distribution that wants to allow its users to experiment with different
LSMs must build all of those LSMs statically into the kernel. That will
not only increase the size of the kernel, it will also increase the time it
takes to load and boot the kernel. Most of that space (and time) would be
completely
wasted even for the users who are experimenting. All of that makes it less
likely that
distributions will actually build kernels that way.
Beyond that, though, many distributions have their preferred LSM, so they
don't build extra LSMs into their kernels. That leaves users to build
their own kernels, which is generally unacceptable, particularly in
enterprise settings. But even if there are other LSMs built into the
kernel, it takes a
reboot to enable them. Handa notes that he uses a loadable kernel module that
implements TOMOYO (called AKARI) to diagnose problems in enterprise
systems. In order to access the LSM symbols (which are no longer
exported), AKARI must do some kind of runtime address resolution,
perhaps using /proc/kallsyms or System.map. But AKARI
is something he can load into running systems when needed, unlike
regular LSMs.
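The sort of lookup that such address resolution involves can be sketched as follows. This is not AKARI's code, which is not shown here; it is a minimal Python illustration of parsing the "address type name [module]" lines that /proc/kallsyms and System.map contain. The sample addresses and the module name are invented, and the symbol names are used purely as examples.

```python
def parse_kallsyms(text):
    """Map symbol names to addresses from kallsyms-style text."""
    symbols = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        address, _symbol_type, name = fields[:3]
        # Keep the first occurrence if a name shows up more than once.
        symbols.setdefault(name, int(address, 16))
    return symbols


# Invented sample data in the same layout as /proc/kallsyms.
SAMPLE = """\
ffffffff81234560 T security_file_permission
ffffffff81a00000 D security_ops
ffffffffa0001000 t helper_fn\t[example_mod]
"""

table = parse_kallsyms(SAMPLE)
print(hex(table["security_ops"]))
```

On a running system the same function could be fed the contents of /proc/kallsyms; a kernel module has to do the equivalent work from inside the kernel, which is where the "bit more work" comes in.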
One could argue that Handa's use of an LSM for system troubleshooting is a
misuse of the interface, but the fact remains that changing LSMs currently
requires a reboot. That problem potentially becomes more acute if LSM
stacking is merged. One must decide pre-boot which LSMs to enable (and in
what order they are consulted). Whatever else can be said, disallowing LSM
loading reduces flexibility.
Handa's third reason is a bit more philosophical: "LSM is not the
tool for thought control." Essentially, he argues, disallowing LSM
loading just makes dealing with LSMs harder for both users and developers. It
also means
that the more "minor" LSMs (e.g. TOMOYO and Smack) get less exposure
because fewer users can actually try them.
While there have been no comments on Handa's patches as yet, there have
been expressions of support for loadable LSMs by some. Schaufler, for
example, does not necessarily seem opposed. Kees Cook
agreed with the need for loadable LSMs,
though he was concerned that combining it with the LSM stacking patches
could block progress on stacking. Morris, who authored
the original patch to block loadable LSMs, has not yet spoken up one way or
the other.
Taking away the ability to load LSMs did not really change the picture for
the kinds of abuses that were brought up at the time the change was made.
Kernel
modules can still abuse the interface, though it may take a bit more work.
If binary modules were willing to ignore the GPL-only export of the LSM
interface, they are probably willing to ferret out the addresses they
need instead. Open source modules can do much the same. At the time of the
switch to a static interface back in 2007, Torvalds seemed very open to
reverting it if there were real
users—perhaps he can still be convinced.
By Jonathan Corbet
November 28, 2012
The
3.7-rc7 prepatch came out on
November 25; it may well be the last prepatch for the 3.7 development
cycle. 3.7 was one of the more active cycles in recent history, with
nearly 12,000 non-merge changesets incorporated by the time of this
writing. It's time for our traditional look at what was done during this
cycle and where all that work came from.
The 3.7 merge window was especially busy. Here are some
counts for recent kernels:
| Kernel | -rc1 | Total |
| 3.0 | 7,333 | 9,153 |
| 3.1 | 7,202 | 8,693 |
| 3.2 | 10,214 | 11,881 |
| 3.3 | 8,899 | 10,550 |
| 3.4 | 9,249 | 10,899 |
| 3.5 | 9,534 | 10,957 |
| 3.6 | 8,587 | 10,247 |
| 3.7 | 10,409 | 11,815 |
The 3.7 development cycle, thus, saw the most active merge window in the
3.x era; it is, in fact, the most active merge window ever.
Even allowing for the fact that 3.7 will add a few more changesets before
final release, the 2.6.25
kernel, at 12,243 changesets total, will probably still hold the record for
the most active development cycle ever, but the 2.6.25
merge window only saw 9,450 changesets merged. One could conclude from
these numbers
that we are getting better at getting our changes in during the merge
window — and at having fewer things to fix thereafter.
Nearly 395,000 lines of code were removed from the kernel this time
around. That must be balanced against the 719,000 lines that were added,
though; the kernel grew by almost 324,000 lines as a result.
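The arithmetic behind these observations is easy to check. The sketch below, in Python, uses only the figures quoted above (the 3.7 total is the count at the time of writing, not a final number) and computes the share of each cycle's changesets that arrived by -rc1, along with the net line growth. The function name merge_window_share() is made up for this example.

```python
# (changesets at -rc1, total changesets), as quoted in the text above.
CHANGESETS = {
    "2.6.25": (9450, 12243),
    "3.0": (7333, 9153),
    "3.1": (7202, 8693),
    "3.2": (10214, 11881),
    "3.3": (8899, 10550),
    "3.4": (9249, 10899),
    "3.5": (9534, 10957),
    "3.6": (8587, 10247),
    "3.7": (10409, 11815),  # total as of this writing, not final
}


def merge_window_share(rc1, total):
    """Percentage of a cycle's changesets merged by -rc1."""
    return 100.0 * rc1 / total


for kernel, (rc1, total) in CHANGESETS.items():
    print(f"{kernel:>6}: {merge_window_share(rc1, total):.1f}% by -rc1")

busiest = max(CHANGESETS, key=lambda k: CHANGESETS[k][0])
print("busiest merge window:", busiest)

# Approximate line counts quoted above.
net_growth = 719_000 - 395_000
print("net growth:", net_growth)
```

By this measure roughly 77% of the 2.6.25 changesets were in place by -rc1, against about 88% for 3.7 so far; the 3.7 share will drop slightly as the last fixes are merged.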
1,271 developers contributed to the 3.7 kernel — a relatively high number,
but not out of line with previous development cycles. The lists of the
most active developers do see some changes this time around, though:
Most active 3.7 developers, by changesets
| Developer | Changesets | Share |
| H Hartley Sweeten | 417 | 3.5% |
| Antti Palosaari | 216 | 1.8% |
| Al Viro | 167 | 1.4% |
| Wei Yongjun | 145 | 1.2% |
| Sachin Kamat | 138 | 1.2% |
| Mark Brown | 136 | 1.2% |
| Eric W. Biederman | 130 | 1.1% |
| Daniel Vetter | 122 | 1.0% |
| David Howells | 119 | 1.0% |
| Hans Verkuil | 119 | 1.0% |
| Greg Kroah-Hartman | 116 | 1.0% |
| Arnd Bergmann | 112 | 0.9% |
| Peter Senna Tschudin | 104 | 0.9% |
| Ben Skeggs | 97 | 0.8% |
| Peter Ujfalusi | 96 | 0.8% |
| Ian Abbott | 96 | 0.8% |
| Devendra Naga | 90 | 0.8% |
| David S. Miller | 84 | 0.7% |
| Takashi Iwai | 83 | 0.7% |
| Johannes Berg | 78 | 0.7% |
Most active 3.7 developers, by changed lines
| Developer | Lines changed | Share |
| David Howells | 65206 | 7.6% |
| Ben Skeggs | 50282 | 5.8% |
| David Daney | 46825 | 5.4% |
| Arnd Bergmann | 17505 | 2.0% |
| Sebastian Andrzej Siewior | 16088 | 1.9% |
| Daniel Cotey | 14157 | 1.6% |
| H Hartley Sweeten | 13566 | 1.6% |
| Catalin Marinas | 13519 | 1.6% |
| Antti Palosaari | 12336 | 1.4% |
| Bill Pemberton | 10935 | 1.3% |
| Dan Magenheimer | 10509 | 1.2% |
| Ezequiel Garcia | 10211 | 1.2% |
| David S. Miller | 9258 | 1.1% |
| Hans Verkuil | 8686 | 1.0% |
| Will Deacon | 8404 | 1.0% |
| Shawn Guo | 7464 | 0.9% |
| Alois Schlögl | 7301 | 0.8% |
| Roland Stigge | 6987 | 0.8% |
| Greg Kroah-Hartman | 6920 | 0.8% |
| Laurent Pinchart | 6107 | 0.7% |
In a repeat of his 3.6 performance, H. Hartley Sweeten hit the top of the
by-changesets list with a vast number of patches preparing the comedi
drivers for graduation from the staging tree (removing over 5000 lines of
code in the process). Antti Palosaari did a lot of
work on drivers in the Video4Linux2 subsystem. Al Viro continues to
refactor and clean up the VFS and core kernel areas with some excursions
into most architecture subtrees. Wei Yongjun and Sachin Kamat both did a
lot of cleanup work all over the driver tree.
David Howells ended up at the top of the "lines changed" column mostly by
virtue of the user-space API header file
thrashup, but he also contributed code for module signing and more.
Ben Skeggs merged a major reworking of the nouveau driver, David Daney
improved support for MIPS OCTEON processors, Arnd Bergmann's many patches
were dominated by the removal of the unused mach-bcmring architecture code,
and Sebastian Andrzej Siewior did a lot of work on the USB gadget driver
subsystem.
Worth noting in passing: Fengguang Wu is credited with 63 bug reports
during this cycle, almost 11% of the total. The others with at least ten
reports are Dan Carpenter (21), Randy Dunlap (16), Stephen Rothwell (15),
Paul McKenney (11), and Alex Lyakas (10). Every one of those reports
resulted in a bug that was fixed before this kernel was released in stable
form.
An even 200 employers (that we know about) contributed during the 3.7
cycle. The most active of these were:
Most active 3.7 employers, by changesets
| Employer | Changesets | Share |
| (None) | 1435 | 12.1% |
| Red Hat | 1159 | 9.8% |
| (Unknown) | 843 | 7.1% |
| Intel | 800 | 6.8% |
| Texas Instruments | 597 | 5.1% |
| IBM | 516 | 4.4% |
| Linaro | 509 | 4.3% |
| Vision Engraving Systems | 417 | 3.5% |
| SUSE | 356 | 3.0% |
| Google | 245 | 2.1% |
| Samsung | 198 | 1.7% |
| Freescale | 181 | 1.5% |
| Oracle | 177 | 1.5% |
| Wolfson Microelectronics | 148 | 1.3% |
| AMD | 144 | 1.2% |
| Trend Micro | 144 | 1.2% |
| Cisco | 138 | 1.2% |
| Linux Foundation | 132 | 1.1% |
| Arista Networks | 130 | 1.1% |
| NVIDIA | 123 | 1.0% |
Most active 3.7 employers, by lines changed
| Employer | Lines changed | Share |
| Red Hat | 157023 | 18.2% |
| (None) | 80191 | 9.3% |
| (Unknown) | 71992 | 8.3% |
| Cavium | 46757 | 5.4% |
| IBM | 39227 | 4.5% |
| Intel | 33381 | 3.9% |
| Linaro | 28900 | 3.4% |
| Texas Instruments | 28493 | 3.3% |
| ARM | 24913 | 2.9% |
| Oracle | 24095 | 2.8% |
| NVIDIA | 19167 | 2.2% |
| linutronix | 17211 | 2.0% |
| Vision Engraving Systems | 14844 | 1.7% |
| Samsung | 14519 | 1.7% |
| Microtrol S.R.L. | 12800 | 1.5% |
| PHILOSYS Software | 10311 | 1.2% |
| SUSE | 10226 | 1.2% |
| Marvell | 10067 | 1.2% |
| Cisco | 9828 | 1.1% |
| Pengutronix | 9793 | 1.1% |
There are few surprises here. Texas Instruments has reached a new high in
its contribution volume, a trend which, unfortunately, may not continue
after the recent
changes play out there. AMD, too, seems
unlikely to remain on this list
in the future. Meanwhile Red Hat maintains its place at the top of the
list, where it has been since we first started generating these statistics.
And that is where things stand as the 3.7 kernel approaches its final
release. Things appear to be running smoothly, with most development
cycles taking less than 70 days to complete (if there is no 3.7-rc8, this
cycle will run about 64 days). Stay tuned for the about-to-begin 3.8
cycle, with a release to be expected in early February, 2013.
Page editor: Jonathan Corbet