Brief items
The current development kernel is 3.2-rc5, released on December 9. "It's
been a bit over a week, and I'm sad to report that -rc5 is bigger (at least
in number of commits - most of the commits are pretty small, so it's
possible that the *diff* ends up being smaller, but I didn't check) than
both -rc2 and -rc4 were. So much for 'calming down'." The 355 changes
merged since -rc4 do indeed make -rc5 bigger than -rc2 (280) and -rc4 (207),
but smaller than -rc3 (412). All told, there have been 1,254 changes since
-rc1, which, at a bit over 10% of the total for this development cycle, is
actually relatively small.
Stable updates: the 2.6.32.50, 3.0.13, and 3.1.5 stable updates were released
on December 9. All three contain the usual long list of important fixes.
I've started using Gnome 3.2 on my main desktop 4 weeks ago, and
while I came into it with prejudice and expected a rough ride,
everything is surprisingly nice so far.
It's in fact the best (read: most usable, most intuitive) Linux
desktop I've ever used for kernel development and maintenance
work-flows. It gets out my way, tries to be there when I need it
and takes usage ergonomy and UI consistency as seriously as Apple
and Google does. Kudos.
-- Ingo Molnar
Your next patch series better come with perfectly spelled changelog
entries that actually describe what the patches do, numbered
properly (none of this 30/30 crap after a 00/29 series), not break
the build (is that asking too much?) apply with no fuzz, and to
help it all out, a home made holiday bread of your choosing, as
long as it's fresh, and does not contain dried fruit bits (soaked
in liquor can't hurt either.)
-- Greg Kroah-Hartman
Vincent Bernat has posted a lengthy description of how the IPv4 routing cache
works and how to tune it for best results. "Once an entry has been added to the route
cache, there are several ways to remove it. Most entries are removed by the
garbage collector which will scan the route cache and remove invalid and
older entries. It will be triggered when the route cache is full or at
regular interval, once a certain threshold has been met." (Thanks
to Paul Wise).
Kernel development news
By Jake Edge
December 14, 2011
Symbolic link race conditions have existed for decades. They have been well
understood for all of that time, and developers have been given clear
guidelines on how to avoid them. Yet they still persist, with new
vulnerabilities discovered regularly. There is also a known way to avoid
most of the problems by changing the kernel—something that has been
done for many years in grsecurity and Openwall—but it has never made
its way into the mainline. While kernel hackers are understandably
unenthusiastic about working around buggy user-space programs in the
kernel, this particular problem is severe enough that it probably makes
sense to do so. It would seem that we are seeing some movement on that
front.
The basic problem is a time-of-check-to-time-of-use (TOCTTOU) flaw. Buggy
applications will look for the existence and/or characteristics of
temporary files before opening them. An attacker can exploit the flaw by
changing the file (often by making a symlink) in between the check and the
open(). If the program with the flaw has elevated privileges
(e.g. setuid), and the attacker replaces the file with a symlink to a
system file, serious problems can result.
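The check-then-open pattern can be illustrated with a small sketch. It is not taken from any of the programs or patches discussed here; the function names (racy_open, safer_open, demo) are hypothetical, and the "attack" is simplified to a dangling symlink planted before the check, which already slips past an existence test. The safer variant assumes the usual POSIX behavior that O_CREAT combined with O_EXCL refuses to open through an existing symlink.

```python
import os
import tempfile


def racy_open(path):
    """Flawed pattern: check for the file, then open it in a separate step."""
    if os.path.exists(path):      # time of check (False for a dangling symlink)
        raise FileExistsError(path)
    return open(path, "w")        # time of use: follows whatever is there now


def safer_open(path):
    """Ask the kernel to check and create in a single atomic step."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, 0o600)   # fails if anything, even a symlink, exists
    return os.fdopen(fd, "w")


def demo():
    """Plant a symlink where the 'temporary file' should go and try both opens."""
    with tempfile.TemporaryDirectory() as d:
        victim = os.path.join(d, "victim")     # stands in for a system file
        scratch = os.path.join(d, "scratch")   # name the program wants to use
        os.symlink(victim, scratch)            # attacker's dangling symlink

        racy_open(scratch).close()
        racy_wrote_victim = os.path.exists(victim)
        os.remove(victim)

        try:
            safer_open(scratch).close()
            safer_refused = False
        except FileExistsError:
            safer_refused = True
        return racy_wrote_victim, safer_refused
```

In a real exploit the symlink would be created in the window between the check and the open(); the sketch only shows why separating the two steps leaves a gap for an attacker.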
The bug generally happens in shared, world-writable directories that have the
"sticky" bit set (like /tmp).
The sticky bit on a directory is set to prevent users from deleting other
users' files. So, the fix restricts the ability to follow symlinks in
sticky directories. In particular,
a symlink is only followed if it is owned by the follower or if
the directory and symlink have the same owner. That solves much of the
symlink race problem without breaking any known applications.
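The rule described above can be written down as a small decision function. This is only a sketch of the policy as the article states it, not the code in Cook's patch; the Inode type and the may_follow_symlink() name are hypothetical, and the world-writable test is an assumption drawn from the description of where the bug generally happens.

```python
import stat
from dataclasses import dataclass


@dataclass
class Inode:
    """Just the two attributes the rule looks at (hypothetical model)."""
    uid: int
    mode: int


def may_follow_symlink(follower_uid, directory, symlink):
    """Return True if the follower may follow a symlink found in the directory."""
    sticky = bool(directory.mode & stat.S_ISVTX)
    world_writable = bool(directory.mode & stat.S_IWOTH)
    if not (sticky and world_writable):
        return True                       # the restriction does not apply here
    if symlink.uid == follower_uid:
        return True                       # followers may follow their own links
    return directory.uid == symlink.uid   # the directory owner's links are trusted


# /tmp-like directory owned by root (uid 0), symlink planted by uid 1000.
tmp_dir = Inode(uid=0, mode=stat.S_IFDIR | 0o1777)
planted = Inode(uid=1000, mode=stat.S_IFLNK | 0o777)
```

With these values, a root-owned setuid program (uid 0) would be refused when it tries to follow the planted link, while uid 1000 could still follow its own link.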
We looked at patches to restrict the
following of symlinks in sticky directories in June 2010. Since that time,
there has been a two-pronged approach, championed by Kees Cook, to try to get the code into the
mainline. The first is the Yama LSM, which is meant to collect up
extensions to the Linux discretionary access control (DAC) model. But it
runs afoul of the usual problem for specialized LSMs: the inability to stack LSMs.
Cook and others would clearly prefer to see the symlink changes go into the
core VFS
code, rather than via an LSM, but there has been a push by some to keep it
out of the core. There was discussion of Yama and its
symlink protections at the Linux Security
Summit LSM roundtable, where the plan to
push Yama as a DAC enhancement LSM was hatched. That may well be a way
forward, but Cook has also posted a patch
set that would put the symlink restrictions into fs/namei.c.
The latter patch attracted some interesting comments that would seem to
indicate that Ingo Molnar and Linus Torvalds, at least, see value in
closing the hole. None of the VFS developers have weighed in on this
iteration, but
Cook notes that the patch reflects feedback from Al Viro, which could be
seen as a sign
that he's not completely opposed. Molnar was particularly unhappy
that the hole still exists:
Ugh - and people continue to get exploited from a preventable,
fixable and already fixed VFS design flaw.
Molnar also had some questions about the implementation, including whether
the PROTECTED_STICKY_SYMLINKS kernel configuration parameter
should default to 'yes', but was overall very interested in seeing the
patch move forward. Torvalds had a somewhat different take ("Ugh. I really
dislike the implementation."), but he suggested a different mechanism that
would try to solve the underlying problem by using the permission bits on the
symlink.
His argument is that Cook's approach is not very "polite"
because it is hidden away, so it turns into:
some kind of hacky run-time random behavior
depending on some invisible config option that people aren't even
aware of.
As Cook points out, though, Torvalds's
approach has its own set of "weird hidden behaviors".
Torvalds admittedly had not thought his proposal through completely, but it
does show an interest in seeing the problem solved. From Cook's
perspective, the changes he is proposing simply change the behavior of
sticky directories with respect to symlinks, whereas Torvalds's would have
wider-ranging effects on symlink creation. Either might do the job, but
Cook's solution does have an advantage in that the proposed changes have
been shaken out for years in grsecurity and Openwall, as well as in Ubuntu
more recently.
Given that several high-profile kernel hackers seem to be in favor of
fixing the problem—Ted Ts'o was also favorably disposed to a fix back
in 2010—the winds may have shifted in favor of the core VFS
approach. If Viro and the other VFS developers aren't completely unhappy with
it, we could see it in 3.4 or so.
If that were to happen, there is another related patch that would
presumably also be pushed for mainline inclusion: hard link restrictions.
That, like the symlink change, currently lives in Yama, though the case can
be made that it should also be done in the core VFS code. That patch would
disallow the creation of hard links to files that are inaccessible (neither
readable nor writable) to the user
making the link. It also disallows hard links to setuid and setgid files.
That would close some further holes in the symlink race
vulnerability, as well as fix some other application vulnerabilities.
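As a rough sketch of the hard link rule as it is described here (and not of the Yama code itself), the check might look like the following. The may_create_hardlink() name is hypothetical, and the caller is assumed to have already worked out whether the linking user can read or write the target.

```python
import stat


def may_create_hardlink(target_mode, can_read, can_write):
    """Return True if a hard link to the target may be created."""
    if target_mode & (stat.S_ISUID | stat.S_ISGID):
        return False                  # never link to setuid or setgid files
    return can_read or can_write      # refuse only wholly inaccessible files
```

Under this rule an unprivileged user could not, for example, keep a vulnerable setuid binary alive by linking it into a private directory before the administrator replaces it.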
Should both the symlink and hard link restrictions make their way into the
VFS core, that would only leave the ptrace() restrictions in
Yama. Those restrictions let administrators prevent a process from
using ptrace() on anything other than its descendants (unless it
has the CAP_SYS_PTRACE capability). Currently, any process can trace any
other running under the same UID, so a compromise in one running program
could lead to disclosing credentials and other sensitive information from
another running program. There may also be other DAC
enhancements that Cook or others are interested in adding to Yama in the
future.
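The descendants-only ptrace() rule amounts to walking up the process tree from the would-be tracee. The sketch below models that with a plain dictionary mapping each PID to its parent; the function names and the dictionary are hypothetical illustrations, not Yama's data structures.

```python
def is_descendant(parent_of, ancestor, pid):
    """Walk up from pid; return True if ancestor is found on the way."""
    seen = set()
    while pid in parent_of and pid not in seen:
        seen.add(pid)                 # guard against a malformed, cyclic table
        pid = parent_of[pid]
        if pid == ancestor:
            return True
    return False


def may_ptrace(parent_of, tracer, tracee, has_cap_sys_ptrace=False):
    """Apply the restriction: descendants only, unless the tracer is privileged."""
    return has_cap_sys_ptrace or is_descendant(parent_of, tracer, tracee)


# init (1) started a shell (100) and a browser (300); the shell started gdb's target (200).
processes = {1: 0, 100: 1, 200: 100, 300: 1}
```

Here the shell (100) may trace its child (200), but not the unrelated browser (300), even if both run under the same UID.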
One way or another, the problem is severe enough that there should, at
least, be a way for distributors or administrators to thwart these common
vulnerabilities. Whether the fix lives in VFS or an LSM, providing an
option to turn off a whole class of application flaws—which can often
lead to system compromise—seems worth doing. Hopefully we are seeing
movement in that direction.
By Jonathan Corbet
December 14, 2011
The story of tracing in the Linux kernel sometimes seems to resemble a bad
multi-season TV soap opera. We have no end of strong characters, plot
twists, independent story lines, recurring themes, and conflicting agendas.
The cast changes slowly over time, but things never seem to come to any
sort of satisfying conclusion. Those watching the show might be forgiven
for thinking that one of those story lines might be about to be wrapped up
when the
LTTng tracing system was
pulled into the staging tree for the 3.3 merge
window. But they should have known that they were just being set up for
another sad twist in the plot.
LTTng descends from some of the earliest dynamic tracing work done for
Linux. Its distinguishing characteristics include integrated kernel- and
user-space tracing, performance sufficient to deal with high-bandwidth
event streams, and a well-developed set of capture and analysis tools.
LTTng has always been maintained out of the mainline kernel tree, but it is
packaged by a number of distributors and has a base of dedicated users, some
of whom have been happy to fund ongoing LTTng development work.
Had LTTng been merged years ago, the story might have been much simpler, but,
for a number of reasons (including the simple fact that, for years, any
sort of tracing capability was hard to sell to the kernel development
community) that did not happen. So we have ended up with a number of
projects in this area, including SystemTap (which also remains
out-of-tree), and the in-tree ftrace and perf subsystems. Naturally, none
of these solutions has proved entirely satisfactory so, while there has
been a fair amount of pressure to consolidate the various tracing projects,
that has tended not to happen.
That is not to say that there has been no progress at all. Some agreement
has been reached on the format of tracepoints themselves; much of the work
in that area was done by primary LTTng maintainer Mathieu Desnoyers. As a
result, the number of tracepoints in the kernel has been growing rapidly,
making kernel operations more visible to users in a number of ways. A lot
of talk about merging more infrastructure has been heard over the years -
said talk was often audible from a great distance at various conferences - but
progress has been minimal. It seems to be easy for developers in this area
to get bogged down in the details of ring buffers, event formats, and so on
at the expense of producing an actual, usable solution.
To Mathieu, merging into the staging tree must have looked like an
attractive way to push things forward. The relaxed rules for that tree
would allow the code into the mainline where its visibility would increase,
any remaining issues could be fixed up, and more users could be found. It
all seemed to be working - some cleanup patches from new developers were
posted - until Mathieu tried to add exports
for some core kernel symbols so LTTng could access them. That attracted
the attention of the core kernel developers who, to put it gently, were not
impressed with what they saw.
In the end, Ingo Molnar vetoed the whole patch
series and asked Greg Kroah-Hartman to remove the LTTng code from
staging. Greg complied with that request, with the result that LTTng is,
once again, no closer to merging into the mainline than it was before.
This particular story line, it seems, has at least one more season to run
yet.
What is it about LTTng that makes it unsuitable for merging into the
mainline? It starts with a lot of duplicated infrastructure. Inevitably,
LTTng brings in its own ring buffer to communicate events to user space,
despite the fact that the two ring buffers used by perf and ftrace are
already seen as being too many. There is a new instrumentation mechanism
for system calls - something that the kernel already has. And, of course,
there is a new user-space ABI to control all of this - again an unwelcome
addition when there is strong pressure from some directions to merge the
existing in-kernel tracing ABIs.
Duplicated infrastructure always tends to be hard to merge into the
mainline; duplicated user-space ABIs, which must be supported forever, are
even more so. It is thus not surprising that there is pushback against
these patches, even without considering the highly contentious nature of
the discussion around tracing work in general. Ingo claims to be receptive
to merging the parts of LTTng that are better than what the kernel has now
- after it has been unified with the existing infrastructure, of course -
but, he says, Mathieu has been more interested in maintaining LTTng as a
separate "brand" and has been unwilling to merge things in this way.
Mathieu's response has not done much to
address those concerns. Duplicate infrastructure, he said, is fine as long
as there is no agreement on how that infrastructure should work. Thus, he
said, it is better to get his ring buffer into the mainline and to try to
work out the differences there. He takes a similar approach to the ABI:
Interfaces to user-space: very much like filesystems, these ABIs
don't need to be shared across projects that have different
use-cases. Having multiple tracer ABIs, if self-contained, should
not hurt anybody and just increase the rate of innovation.
The points that are missed here are that (1) filesystems do, in fact, share
the same ABI, and (2) there is indeed a cost to multiple ABIs for
tracing. Those ABIs have to be maintained indefinitely and they fragment
the efforts of tool developers who find themselves forced to choose one or
the other. Unless he can produce a convincing proof that the existing
kernel interfaces cannot possibly be extended to meet LTTng's needs,
Mathieu will almost certainly not succeed in getting a new tracing ABI into
the mainline.
Two notable conclusions were reached at the 2011 Kernel Summit. One was that
maintainers should say "no" more often and accept fewer new features into
the mainline; that would argue that Ingo and others are right to block the
addition of LTTng in its current form. But the other conclusion was that
code that has been shipped for years and that has real users should be
strongly considered for merging even if it has known technical
shortcomings. That, of course, would argue for merging LTTng, which
certainly meets those conditions. Given the players involved, that
conflict seems almost certain to be resolved with LTTng remaining an active
project out of
the mainline. Tune in next year for another episode of "As the tracing
world turns."
By Jonathan Corbet
December 14, 2011
The kernel development process prides itself on being driven exclusively by
technical concerns. Ideally, all decisions with regard to the merging of
code would be based on whether that code makes technical sense or not;
decisions based on "political" concerns are seen as being rather less
ideal. But, as a recent discussion shows, even a seemingly "political"
decision can have technical reasoning behind it.
In June 2011, Honza Petrous posted a patch
to the linux-media list containing an implementation of a virtual DVB
(digital video broadcasting)
device driver. DVB drivers normally talk to devices that tune in and
capture video streams - television tuners, in other words. Honza's
"vtunerc" driver, by contrast, drives no physical hardware at all. Instead, it
serves as a sort of loopback device. One side looks like a normal DVB
device; it handles all the usual DVB system calls. The other
side, which shows up as /dev/vtunercN, passes a processed
form of those DVB
system calls back to user space. The intended use is for a user-space
process to receive those operations and pass them to a remote peer
elsewhere on the network; that peer would then perform the operations on a
real DVB device. Using this mechanism, DVB devices could be hosted on a
network in a manner that is entirely transparent to DVB applications.
Honza has posted a
diagram showing how the pieces fit together.
Virtual devices of this type are not unprecedented in the Linux (and Unix)
tradition; the venerable virtual terminal devices work in much the same way. This
type of mechanism is also sometimes used to make devices available within
virtualized guest systems. But this patch was not accepted into the DVB
subsystem for a number of reasons, one of which was that it would
facilitate the creation of proprietary user-space drivers for DVB devices.
That was the reason Honza picked up on when he went to the linux-kernel list in an attempt to
gain support in November, saying that, while he didn't discount the
possibility of "bad guys" abusing the interface to create closed-source
drivers, he was not convinced that it justified the
"aggressive NACK" the code received.
As the subsequent discussion made clear, some developers do, indeed, believe
that the potential for abuse in this way is sufficient reason to keep an
interface out of the mainline kernel. That is the same reasoning that has,
for example,
blocked the merging of graphics drivers that have proprietary user-space
components. But it also turns out that there is rather more than that to
this particular decision. Reasons for keeping vtunerc out include:
- The same ABI that enables proprietary drivers also exposes a fair
amount of internal information about the DVB layer. That ABI would
have to remain unchanged even as DVB evolves, leading to maintenance
burdens in the future.
- There appears to be little advantage to routing all that video data
through the kernel and immediately back to user space; it would make
more sense for DVB applications to use a network video protocol
directly and avoid the cost of routing data through the kernel.
- DVB applications tend to work with tight timing constraints. Adding a
network connection into the mix will create latencies that may well
confuse these applications. Working across a network requires a
different approach than talking to a device directly; operations that
may be done synchronously on a local device may need to happen
asynchronously with a remote device. By hiding the network link,
vtunerc makes it impossible for applications to drive the device
appropriately.
- If the creation of this type of loopback device absolutely cannot be
avoided, it can be done with the CUSE
(character devices in user space)
interface instead of adding a new ABI.
In the discussion, it seems that much of the motivation for vtunerc comes
from the fact that it would require no changes to applications at all,
while a user-space approach might require such changes. In fact, there
appears to be a political problem at that
level: the maintainer of the Video Disk Recorder (VDR) tool is
evidently uninterested in adding real network client support. Needless to
say, adding an interface to the kernel to get around an uncooperative
application maintainer is not an idea that gains a lot of sympathy on the
kernel side.
It is easy to see politics in decisions that do not go one's way. As the
old saying goes: just because you're paranoid doesn't mean that they aren't
out to get you; in some cases non-technical agendas almost certainly play a
part. But it may also be that the proposed code simply is not acceptable
in its current form and needs work. Going back to the mailing lists and
crying "politics" is an almost certain way to turn it into a political
issue, though, and the result is unlikely to be a desirable one.
Page editor: Jonathan Corbet