Kernel development
Brief items
Kernel release status
The current development kernel is 3.2-rc5, released on December 9. "It's been a bit over a week, and I'm sad to report that -rc5 is bigger (at least in number of commits - most of the commits are pretty small, so it's possible that the *diff* ends up being smaller, but I didn't check) than both -rc2 and -rc4 were. So much for 'calming down'." 355 changes have been merged since -rc4, indeed bigger than -rc2 (280) and -rc4 (207) but smaller than -rc3 (412). All told, there have been 1,254 changes since -rc1, which, at a bit over 10% of the total, is actually relatively small.
Stable updates: the 2.6.32.50, 3.0.13, and 3.1.5 stable updates were released on December 9. All three contain the usual long list of important fixes.
Quotes of the week
It's in fact the best (read: most usable, most intuitive) Linux desktop I've ever used for kernel development and maintenance work-flows. It gets out my way, tries to be there when I need it and takes usage ergonomy and UI consistency as seriously as Apple and Google does. Kudos.
Bernat: Tuning the Linux IPv4 route cache
Vincent Bernat has posted a lengthy description of how the IPv4 routing cache works and how to tune it for best results. "Once an entry has been added to the route cache, there are several ways to remove it. Most entries are removed by the garbage collector which will scan the route cache and remove invalid and older entries. It will be triggered when the route cache is full or at regular interval, once a certain threshold has been met." (Thanks to Paul Wise).
Kernel development news
Fixing the symlink race problem
The problems with symbolic link race conditions have existed for decades, been well understood in that time, and developers have been given clear guidelines on how to avoid them. Yet they still persist, with new vulnerabilities discovered regularly. There is also a known way to avoid most of the problems by changing the kernel—something that has been done for many years in grsecurity and Openwall—but it has never made its way into the mainline. While kernel hackers are understandably unenthusiastic about working around buggy user-space programs in the kernel, this particular problem is severe enough that it probably makes sense to do so. It would seem that we are seeing some movement on that front.
The basic problem is a time-of-check-to-time-of-use (TOCTTOU) flaw. Buggy applications will look for the existence and/or characteristics of temporary files before opening them. An attacker can exploit the flaw by changing the file (often by making a symlink) in between the check and the open(). If the program with the flaw has elevated privileges (e.g. setuid), and the attacker replaces the file with a symlink to a system file, serious problems can result.
The bug generally happens in shared, world-writable directories that have the "sticky" bit set (like /tmp). The sticky bit on a directory is set to prevent users from deleting other users' files. So, the fix restricts the ability to follow symlinks in sticky directories. In particular, a symlink is only followed if it is owned by the follower or if the directory and symlink have the same owner. That solves much of the symlink race problem without breaking any known applications.
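The policy described above can be expressed as a small predicate. The sketch below is a user-space restatement of the rule, not the kernel's implementation (which lives in fs/namei.c and works with inode data); the struct and function names are invented for illustration.

```c
/* Restating the sticky-directory symlink rule: in a sticky,
 * world-writable directory, follow a symlink only if the follower
 * owns the link, or the link and the directory have the same owner. */
#include <stdbool.h>
#include <sys/types.h>

struct obj {
    uid_t owner;    /* stand-in for an inode's i_uid */
};

bool may_follow_sticky_symlink(uid_t follower_uid,
                               const struct obj *dir,
                               const struct obj *link)
{
    if (link->owner == follower_uid)
        return true;    /* the follower created the link itself */
    if (link->owner == dir->owner)
        return true;    /* e.g. root's links in root-owned /tmp */
    return false;       /* someone else's link: refuse to follow */
}
```

An attacker's symlink in /tmp (owned by the attacker, in a root-owned directory) fails both tests, so a victim process never follows it.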
We looked at patches to restrict the following of symlinks in sticky directories in June 2010. Since that time, there has been a two-pronged approach, championed by Kees Cook, to try to get the code into the mainline. The first is the Yama LSM, which is meant to collect up extensions to the Linux discretionary access control (DAC) model. But it runs afoul of the usual problem for specialized LSMs: the inability to stack LSMs.
Cook and others would clearly prefer to see the symlink changes go into the core VFS code, rather than via an LSM, but there has been a push by some to keep it out of the core. There was discussion of Yama and its symlink protections at the Linux Security Summit LSM roundtable, where the plan to push Yama as a DAC enhancement LSM was hatched. That may well be a way forward, but Cook has also posted a patch set that would put the symlink restrictions into fs/namei.c.
The latter patch attracted some interesting comments that would seem to indicate that Ingo Molnar and Linus Torvalds, at least, see value in closing the hole. None of the VFS developers have weighed in on this iteration, but Cook notes that the patch reflects feedback from Al Viro, which could be seen as a sign that he is not completely opposed. Molnar was particularly unhappy that the hole still exists.
Molnar also had some questions about the implementation, including whether the PROTECTED_STICKY_SYMLINKS kernel configuration parameter should default to 'yes', but was overall very interested in seeing the patch move forward. Torvalds had a somewhat different take ("Ugh. I really dislike the implementation.") and suggested a different mechanism to solve the underlying problem by using the permission bits on the symlink itself. His argument is that Cook's approach is not very "polite" because it is hidden away. As Cook points out, though, Torvalds's approach has its own set of "weird hidden behaviors".
Torvalds admittedly had not thought his proposal through completely, but it does show an interest in seeing the problem solved. From Cook's perspective, the changes he is proposing simply change the behavior of sticky directories with respect to symlinks, whereas Torvalds's would have wider-ranging effects on symlink creation. Either might do the job, but Cook's solution does have an advantage: the proposed changes have been shaken out for years in grsecurity and Openwall and, more recently, in Ubuntu.
Given that several high-profile kernel hackers seem to be in favor of fixing the problem—Ted Ts'o was also favorably disposed to a fix back in 2010—the winds may have shifted in favor of the core VFS approach. If Viro and the other VFS developers aren't completely unhappy with it, we could see it in 3.4 or so.
If that were to happen, there is another related patch that would presumably also be pushed for mainline inclusion: hard link restrictions. That, like the symlink change, currently lives in Yama, though the case can be made that it should also be done in the core VFS code. That patch would disallow the creation of hard links to files that are inaccessible (neither readable nor writable) to the user making the link. It also disallows hard links to setuid and setgid files. That would close some further holes in the symlink race vulnerability, as well as fix some other application vulnerabilities.
Should both the symlink and hard link restrictions make their way into the VFS core, that would only leave the ptrace() restrictions in Yama. Those restrictions allow administrators to disallow a process from using ptrace() on anything other than its descendants (unless it has the CAP_SYS_PTRACE capability). Currently, any process can trace any other running under the same UID, so a compromise in one running program could lead to disclosing credentials and other sensitive information from another running program. There may also be other DAC enhancements that Cook or others are interested in adding to Yama in the future.
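The descendant check at the heart of the ptrace() restriction amounts to walking the process tree upward from the tracee. The sketch below uses an invented, simplified task structure to illustrate the rule; the real Yama code walks the kernel's task hierarchy under the proper locking.

```c
/* Yama-style ptrace policy: a tracer may attach only to its own
 * descendants, unless it holds CAP_SYS_PTRACE. We walk parent
 * pointers from the tracee looking for the tracer. */
#include <stdbool.h>
#include <stddef.h>

struct task {
    int pid;
    struct task *parent;    /* NULL for init */
};

bool may_ptrace(const struct task *tracer, const struct task *tracee,
                bool has_cap_sys_ptrace)
{
    if (has_cap_sys_ptrace)
        return true;        /* privileged: e.g. a system-wide debugger */
    for (const struct task *t = tracee; t != NULL; t = t->parent)
        if (t == tracer)
            return true;    /* tracee descends from the tracer */
    return false;
}
```

A debugger launching its target still works (the target is a child), while an attach to an unrelated same-UID process is refused.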
One way or another, the problem is severe enough that there should, at least, be a way for distributors or administrators to thwart these common vulnerabilities. Whether the fix lives in VFS or an LSM, providing an option to turn off a whole class of application flaws—which can often lead to system compromise—seems worth doing. Hopefully we are seeing movement in that direction.
LTTng rejection, next generation
The story of tracing in the Linux kernel sometimes seems to resemble a bad multi-season TV soap opera. We have no end of strong characters, plot twists, independent story lines, recurring themes, and conflicting agendas. The cast changes slowly over time, but things never seem to come to any sort of satisfying conclusion. Those watching the show could be forgiven for thinking that one of those story lines was about to be wrapped up when the LTTng tracing system was pulled into the staging tree for the 3.3 merge window. But they should have known that they were just being set up for another sad twist in the plot.
LTTng descends from some of the earliest dynamic tracing work done for Linux. Its distinguishing characteristics include integrated kernel- and user-space tracing, performance sufficient to deal with high-bandwidth event streams, and a well-developed set of capture and analysis tools. LTTng has always been maintained out of the mainline kernel tree, but it is packaged by a number of distributors and has a base of dedicated users, some of whom have been happy to fund ongoing LTTng development work.
Had LTTng been merged years ago, the story might have been much simpler, but, for a number of reasons (including the simple fact that, for years, any sort of tracing capability was a hard sell to the kernel development community), that did not happen. So we have ended up with a number of projects in this area, including SystemTap (which also remains out of tree) and the in-tree ftrace and perf subsystems. Naturally, none of these solutions has proved entirely satisfactory, so, while there has been a fair amount of pressure to consolidate the various tracing projects, that has tended not to happen.
That is not to say that there has been no progress at all. Some agreement has been reached on the format of tracepoints themselves; much of the work in that area was done by primary LTTng maintainer Mathieu Desnoyers. As a result, the number of tracepoints in the kernel has been growing rapidly, making kernel operations more visible to users in a number of ways. A lot of talk about merging more infrastructure has been heard over the years - said talk was often audible from a great distance at various conferences - but progress has been minimal. It seems to be easy for developers in this area to get bogged down on the details of ring buffers, event formats, and so on at the expense of producing an actual, usable solution.
To Mathieu, merging into the staging tree must have looked like an attractive way to push things forward. The relaxed rules for that tree would allow the code into the mainline where its visibility would increase, any remaining issues could be fixed up, and more users could be found. It all seemed to be working - some cleanup patches from new developers were posted - until Mathieu tried to add exports for some core kernel symbols so LTTng could access them. That attracted the attention of the core kernel developers who, to put it gently, were not impressed with what they saw.
In the end, Ingo Molnar vetoed the whole patch series and asked Greg Kroah-Hartman to remove the LTTng code from staging. Greg complied with that request, with the result that LTTng is, once again, no closer to mainline inclusion than it was before. This particular story line, it seems, has at least one more season to run.
What is it about LTTng that makes it unsuitable for merging into the mainline? It starts with a lot of duplicated infrastructure. Inevitably, LTTng brings in its own ring buffer to communicate events to user space, despite the fact that the two ring buffers used by perf and ftrace are already seen as being too many. There is a new instrumentation mechanism for system calls - something that the kernel already has. And, of course, there is a new user-space ABI to control all of this - again an unwelcome addition when there is strong pressure from some directions to merge the existing in-kernel tracing ABIs.
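To make the duplication concrete: a tracing ring buffer's defining behavior is that the producer never blocks, overwriting the oldest events if the consumer falls behind. The sketch below is not LTTng's (or perf's, or ftrace's) code; it is a deliberately minimal single-producer illustration of what yet another such buffer duplicates.

```c
/* Minimal overwriting ring buffer: writers never block; when the
 * reader falls more than RB_SLOTS behind, the oldest events are
 * silently dropped by advancing the tail. */
#define RB_SLOTS 8

struct ring {
    unsigned long head;          /* next slot to write */
    unsigned long tail;          /* next slot to read  */
    int           slot[RB_SLOTS];
};

void rb_write(struct ring *rb, int event)
{
    rb->slot[rb->head % RB_SLOTS] = event;
    rb->head++;
    if (rb->head - rb->tail > RB_SLOTS)
        rb->tail = rb->head - RB_SLOTS;   /* drop the oldest event */
}

/* Returns 1 and stores an event, or 0 if the buffer is empty. */
int rb_read(struct ring *rb, int *event)
{
    if (rb->tail == rb->head)
        return 0;
    *event = rb->slot[rb->tail % RB_SLOTS];
    rb->tail++;
    return 1;
}
```

The hard parts the in-kernel implementations actually argue over (per-CPU sub-buffers, lockless concurrency, splice-based export to user space) are exactly what this sketch omits, which is why three competing versions are a real maintenance cost.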
Duplicated infrastructure always tends to be hard to merge into the mainline; duplicated user-space ABIs, which must be supported forever, are even more so. It is thus not surprising that there is pushback against these patches, even without considering the highly contentious nature of the discussion around tracing work in general. Ingo claims to be receptive to merging the parts of LTTng that are better than what the kernel has now - after it has been unified with the existing infrastructure, of course - but, he says, Mathieu has been more interested in maintaining LTTng as a separate "brand" and has been unwilling to merge things in this way.
Mathieu's response has not done much to address those concerns. Duplicate infrastructure, he said, is fine as long as there is no agreement on how that infrastructure should work; it is better to get his ring buffer into the mainline and try to work out the differences there. He made a similar argument for the ABI, drawing an analogy to the kernel's many filesystems.
The points that analogy misses are that (1) filesystems do, in fact, share the same ABI, and (2) there is indeed a cost to multiple ABIs for tracing. Those ABIs have to be maintained indefinitely, and they fragment the efforts of tool developers, who find themselves forced to choose one or the other. Unless he can produce a convincing proof that the existing kernel interfaces cannot possibly be extended to meet LTTng's needs, Mathieu will almost certainly not succeed in getting a new tracing ABI into the mainline.
Two notable conclusions were reached at the 2011 Kernel Summit. One was that maintainers should say "no" more often and accept fewer new features into the mainline; that would argue that Ingo and others are right to block the addition of LTTng in its current form. But the other conclusion was that code that has been shipped for years and that has real users should be strongly considered for merging even if it has known technical shortcomings. That, of course, would argue for merging LTTng, which certainly meets those conditions. Given the players involved, that conflict seems almost certain to be resolved with LTTng remaining an active project out of the mainline. Tune in next year for another episode of "As the tracing world turns."
Vtunerc and software acceptance politics
The kernel development process prides itself on being driven exclusively by technical concerns. Ideally, all decisions with regard to the merging of code would be based on whether that code makes technical sense or not; decisions based on "political" concerns are seen as being rather less ideal. But, as a recent discussion shows, even a seemingly "political" decision can have technical reasoning behind it.
In June 2011, Honza Petrous posted a patch to the linux-media list containing an implementation of a virtual DVB (digital video broadcast) device driver. DVB drivers normally talk to devices that tune in and capture video streams - television tuners, in other words. Honza's "vtunerc" driver drives no physical hardware at all; instead, it serves as a sort of loopback device. One side looks like a normal DVB device; it handles all the usual DVB system calls. The other side, which shows up as /dev/vtunercN, passes a processed form of those DVB system calls back to user space. The intended use is for a user-space process to receive those operations and pass them to a remote peer elsewhere on the network; that peer would then perform the operations on a real DVB device. Using this mechanism, DVB devices could be hosted on a network in a manner that is entirely transparent to DVB applications. Honza has posted a diagram showing how the pieces fit together.
Virtual devices of this type are not unprecedented in the Linux (and Unix) tradition; the venerable pseudo-terminal devices work in much the same way. This type of mechanism is also sometimes used to make devices available within virtualized guest systems. But this patch was not accepted into the DVB subsystem for a number of reasons, one of which was that it would facilitate the creation of proprietary user-space drivers for DVB devices. That was the reason Honza picked up on when he went to the linux-kernel list in November in an attempt to gain support, saying that, while he didn't discount the possibility of "bad guys" abusing the interface to create closed-source drivers, he was not convinced that it justified the "aggressive NACK" the code received.
As the subsequent discussion made clear, some developers do, indeed, believe that the potential for abuse in this way is sufficient reason to keep an interface out of the mainline kernel. That is the same reasoning that has, for example, blocked the merging of graphics drivers that have proprietary user-space components. But it also turns out that there is rather more than that to this particular decision. Reasons for keeping vtunerc out include:
- The same ABI that enables proprietary drivers also exposes a fair
amount of internal information about the DVB layer. That ABI would
have to remain unchanged even as DVB evolves, leading to maintenance
burdens in the future.
- There appears to be little advantage to routing all that video data
through the kernel and immediately back to user space; it would make
more sense for DVB applications to use a network video protocol
directly and avoid the cost of routing data through the kernel.
- DVB applications tend to work with tight timing constraints. Adding a
network connection into the mix will create latencies that may well
confuse these applications. Working across a network requires a
different approach than talking to a device directly; operations that
may be done synchronously on a local device may need to happen
asynchronously with a remote device. By hiding the network link,
vtunerc makes it impossible for applications to drive the device
appropriately.
- If the creation of this type of loopback device absolutely cannot be avoided, it can be done with the CUSE (char drivers in user space) interface instead of adding a new ABI.
From the discussion, it seems that much of the motivation for vtunerc comes from the fact that it would require no changes to applications at all, while a user-space approach might require such changes. In fact, it seems that there is a political problem at that level: the maintainer of the Video Disk Recorder (VDR) tool is evidently uninterested in adding real network client support. Needless to say, adding an interface to the kernel to get around an uncooperative application maintainer is not an idea that gains a lot of sympathy on the kernel side.
It is easy to see politics in decisions that do not go one's way. As the old saying goes: just because you're paranoid doesn't mean that they aren't out to get you; in some cases non-technical agendas almost certainly play a part. But it may also be that the proposed code simply is not acceptable in its current form and needs work. Going back to the mailing lists and crying "politics" is an almost certain way to turn it into a political issue, though, and with an almost certainly undesirable result.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
