Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc10, released on October 17. "There really hasn't been all that much going on - the smallish MIPS updates are still the bulk of this -rc, and the rest is pretty much small driver fixes. Oh, and some last-minute fs (btrfs and xfs) fixes in there too." Expect the final 3.1 release sometime in the near future.

Stable updates: the 3.0.7 update was released on October 17 with a moderately-sized set of important fixes.

Quotes of the week

We do seem to be approaching some sort of agreement ... well I am at least, I cannot speak for others :-)
-- Neil Brown

What I'm saying is that its much better to attack the primary source of evil in a manner that is unforgiving instead of trying to avoid the worst excesses and cause non-obvious breakage.

For a while people were promoting the idea that its good to be lenient in what you accept as input and strict in what you send out. I think people are starting to realize that was a horrid mistake since now they're getting utter crap and people don't even know what right is anymore.

-- Peter Zijlstra

I agree that the thing that needs doing probably involves web developers and threats of implied violence, but I suspect that web developers are being created faster than we can reeducate them.
-- Matthew Garrett

2011 Kernel Summit draft agenda posted

The 2011 Kernel Summit will be held on October 24 and 25 in Prague. The draft agenda has been posted. Needless to say, LWN will be reporting from the event; stay tuned.

Answers to some common account questions

A new FAQ file has been posted in response to questions that have been raised about the process of getting accounts back. It's worth a read for those who have not yet reestablished their access. "At this time, we are only providing access to developers who previously hosted git repositories on, and whose repositories have shown activity after February, 2011. At a later time we will be able to consider creating accounts for developers with inactive trees or who have not had a account in the past."

Another kernel RAID5 implementation

By Jonathan Corbet
October 18, 2011
There are many things that the kernel lacks, but a RAID implementation is not on that list. Both the MD and DM subsystems currently have full RAID support, while the Btrfs filesystem has lower-level RAID support. RAID5/6 support for Btrfs has been posted a couple of times, but has not yet made it into the mainline. So, one might well be justified in wondering if yet another RAID5 implementation is needed in the kernel.

There will be one if Boaz Harrosh has his way; his RAID5 support patch has been posted to a few filesystem-related kernel development lists. Boaz's patch is aimed at adding RAID5 support to the "objects raid engine" code in the exofs filesystem, which provides a POSIX filesystem on top of object-storage devices. It also implements RAID5 for the pNFS object-storage backend.

According to Boaz, this work constitutes a nice, general-purpose RAID library that could be used in other settings; in particular, he says, Btrfs could make use of it. What would be even nicer, of course, is if some of the existing in-kernel RAID implementations could also move to this library - or if exofs could use one of those implementations. This version of RAID5 support may well be cleaner and more general than the others, but it may well take a stronger argument than that to get a new RAID subsystem merged at this point.

Kernel development news

Yet another opportunity for opportunistic suspend

By Jonathan Corbet
October 18, 2011
Your editor was innocently looking at some papers on his desk the other day when his computer abruptly decided to suspend itself. Rawhide is fun in that way; combined with GNOME's delight in forgetting user settings, it can produce no end of surprises to brighten one's working experience. The ability to suspend a desktop system to RAM is actually quite a nice feature, but your editor prefers to have a say in when it happens. Thankfully, GNOME still (so far) allows automatic suspend to be turned off. But it is clear that the suspend-to-RAM functionality is seeing increased use in a number of contexts; it is not just for laptops and Android anymore. Your editor's desktop is not the only place where stakeholders want some control over when the system sleeps and when it needs to stay running.

Indeed, control over automatic suspension of the system is at the core of the debate over Android's opportunistic suspend mechanism. As usage of suspend-to-RAM increases, so does interest in creating a proper mechanism for determining when a suspend can happen. A new patch set from Rafael Wysocki has restarted this discussion and led to, possibly, a surprising conclusion.

Rafael started with the conclusion that "whatever the kernel has to offer in this area is either too complicated to use in practice or inadequate for other reasons." He then went on to propose a new mechanism that, he hoped, would simplify things. It came in two parts:

  • A new sysfs knob, /sys/power/sleep_mode, which provided overall control of the suspend-to-RAM and hibernation functionality. If a suitably-privileged process writes "disabled" to this file, no attempt to suspend or hibernate the system will succeed. It is a sort of high-power wakelock that ensures the system will keep running while important work is being done.

  • Applications wanting to keep the system awake would open a new device, /dev/sleepctl, and execute an ioctl() to that effect. After this call, attempts to suspend the system would block until the application explicitly drops its lock or until a 500ms (by default) timeout period expires. The "stay awake" operation would also be done by the system at resume time to give processes time to perform whatever tasks need to be done.
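The semantics of the two knobs can be modeled in a few lines of user-space code. What follows is only a toy sketch: the class and method names are invented for illustration, "enabled" is an assumed name for the normal mode (the description only mentions "disabled"), and the timeout is assumed to be measured from the suspend attempt. It is not the interface from the patch, which was never merged in this form.

```python
import threading
import time


class SleepControl:
    """Toy user-space model of the two proposed knobs; not kernel code."""

    def __init__(self, timeout=0.5):
        self.mode = "enabled"      # assumed name for the normal state
        self.timeout = timeout     # seconds; the patch's default was 500ms
        self._holders = 0          # outstanding "stay awake" requests
        self._cond = threading.Condition()

    def stay_awake(self):
        """Stand-in for the ioctl() on /dev/sleepctl."""
        with self._cond:
            self._holders += 1

    def relax(self):
        """Drop a previously taken "stay awake" request."""
        with self._cond:
            self._holders -= 1
            self._cond.notify_all()

    def suspend(self):
        """Return False if suspend is refused, True once it may proceed."""
        if self.mode == "disabled":
            return False
        with self._cond:
            # Block until every holder lets go or the timeout expires.
            self._cond.wait_for(lambda: self._holders == 0, self.timeout)
        return True


ctl = SleepControl(timeout=0.05)
ctl.stay_awake()                   # an application asks to stay awake
start = time.monotonic()
proceeded = ctl.suspend()          # blocks for roughly the timeout
print(proceeded, round(time.monotonic() - start, 2))
```

In this model a forgotten "stay awake" request can only delay suspend by the timeout, while the "disabled" mode refuses it outright.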

It is probably safe to say that these patches will not be merged in anything resembling this form. Leading the opposition was Neil Brown, who asserted that the job could be done in user space, and, indeed, should be done that way. According to Neil:

The only sane way to handle suspend is for any (suitably privileged) process to be able to request that suspend doesn't happen, and then for one process to initiate suspend when no-one is blocking it.

Communication with that process, Neil claimed, should be no harder than using Rafael's simplified interface to communicate with the kernel. After a fair amount of discussion, Neil came up with a proposal for how he thinks things should actually work. As one would expect from the above quote, it centers around a single daemon with the responsibility for suspending and resuming the system. A decision to suspend the system is never made by the kernel, and, if everybody is following the rules, by no other user-space process.

The daemon has a pair of modes; it starts in the "on demand" mode where the system will only be suspended after an explicit request to do so. That request could come from the user closing the lid or pressing a button sequence; in this case, the system should suspend in short order regardless of what is happening, and it should not resume without an explicit user action. Suspend can also be requested by a suitably-privileged application; in this case the operation is only carried out if nothing is blocking it, and the system can be automatically resumed at some future time. This mode was also referred to as the "legacy" mode; it needs to be supported but it is not how things are expected to run most of the time.

Other processes in the system can affect suspend behavior by talking to the daemon. One of the things a sufficiently-privileged process can do is to ask the daemon to go into "immediate" mode; in that mode, the system will suspend anytime there is no known reason to stay awake. The immediate mode, thus, closely mirrors the opportunistic suspend mechanism used on Android systems. When the daemon is in immediate mode, it no longer makes sense for any process in the system to ask the system to suspend - the daemon is already prepared to suspend whenever the opportunity arises. So the rest of the interface is concerned with when the system should be awake.

Any process with an interest in suspend and resume events can open a socket to the daemon and request notification anytime a suspend is being contemplated. That process should respond to such notifications with a message saying that it is ready for the suspend to happen; it can, optionally, add a request that the system stay awake for the time being if there is work that must be done. If no processes block the suspend, the system will go to sleep; another message will be sent to all processes once the system resumes.
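The decision logic described so far is small enough to sketch. The model below is an assumption-laden illustration, not Neil's code: the class names, the method names, and the choice to represent a client's reply as a boolean ("please stay awake") are all invented here, and the actual suspend is replaced by a counter.

```python
class Client:
    """A process that asked to be told about suspend and resume events."""

    def __init__(self, name, busy=False):
        self.name = name
        self.busy = busy
        self.events = []

    def suspend_pending(self):
        # Reply to the daemon; True means "ready, but please stay awake".
        self.events.append("suspend-pending")
        return self.busy

    def resumed(self):
        self.events.append("resumed")


class SuspendDaemon:
    def __init__(self):
        self.mode = "on-demand"    # the daemon starts in the legacy mode
        self.clients = []
        self.suspend_count = 0

    def register(self, client):
        self.clients.append(client)

    def request_suspend(self, user_initiated=False):
        """Explicit request: lid close (forced) or a privileged application."""
        if user_initiated:
            self._suspend()
            return True
        return self._try_suspend()

    def idle(self):
        """Called when nothing is known to be happening."""
        if self.mode != "immediate":
            return False
        return self._try_suspend()

    def _try_suspend(self):
        # Ask every client before deciding, so that all of them are notified.
        replies = [client.suspend_pending() for client in self.clients]
        if any(replies):
            return False
        self._suspend()
        return True

    def _suspend(self):
        self.suspend_count += 1    # stand-in for actually suspending
        for client in self.clients:
            client.resumed()


daemon = SuspendDaemon()
backup = Client("backup", busy=True)
daemon.register(backup)
daemon.mode = "immediate"
print(daemon.idle())               # False: the backup client blocks suspend
backup.busy = False
print(daemon.idle())               # True: nothing blocks it any more
```

A user-initiated request bypasses the query entirely in this sketch, matching the description that a lid close should suspend the system regardless of what is happening.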

There is an interesting variant on this mechanism whereby processes can register one or more file descriptors with the daemon. In this case, the daemon will only query the associated processes before suspending if one or more of the given file descriptors is reported as readable by poll(). A readable file descriptor thus functions in a manner similar to a driver-acquired wakelock in the Android system. If a device wakes the system and provides input for a user-space process to read, the daemon will see that the file descriptor is readable and avoid suspending the system until that input has been consumed and acted upon. Meanwhile, processes that clearly have no need to block suspend will not need to wake up and respond to a notification every time a suspend is contemplated.
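The file-descriptor variant maps directly onto poll(). As a minimal sketch, assuming the daemon keeps a table mapping each client to the descriptors it registered (the table layout and the function names are invented here), the daemon could work out which clients to query like this:

```python
import os
import select


def readable_fds(fds):
    """Return the subset of fds that poll() reports as readable right now."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN)
    # A timeout of zero makes this a non-blocking check.
    return {fd for fd, events in poller.poll(0) if events & select.POLLIN}


def clients_to_query(registrations):
    """registrations maps a client name to the fds it registered."""
    watched = {fd for fds in registrations.values() for fd in fds}
    ready = readable_fds(watched)
    return sorted(name for name, fds in registrations.items()
                  if ready & set(fds))


read_end, write_end = os.pipe()
registrations = {"input-handler": [read_end]}
print(clients_to_query(registrations))   # [] : nothing pending, no need to ask
os.write(write_end, b"key press")
print(clients_to_query(registrations))   # ['input-handler'] : input is waiting
os.read(read_end, 64)
print(clients_to_query(registrations))   # [] : input consumed
```

A pipe stands in for a device here; the point is only that clients with nothing to read are never woken up to answer a suspend query.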

The daemon also allows processes to request that the system be awake at some future time. A tool like cron can use this feature to, say, wake the system late at night to run a backup.

At a first glance, this approach looks like it should be able to handle the opportunistic suspend problem without the need to add more mechanism to the kernel. But it must be remembered that this is a problem that has defeated a number of initially reasonable-looking solutions. Whether this proposal will fare better - and whether the various desktop and mobile environments will adopt it - remains to be seen.

Timer slack for slacker developers

By Jonathan Corbet
October 17, 2011
The "timer slack controller" is a proposed mechanism that would allow a session management program to adjust the timer tolerances of a group of processes with a single knob. It seems like a relatively obscure and harmless feature, but it has been the focus of an intense debate on the kernel mailing lists. The core question has been seen before: what measures should the kernel take, if any, to keep poorly-written applications from hurting performance?

Timers allow a process to request a wakeup at some future time; timer slack gives the kernel some leeway in its implementation of those timers. If the kernel can delay specific timers by a bounded amount, it can often expire multiple timers at once, minimizing the number of wakeups and, thus, reducing the system's power consumption. Some processes need more precise timing than others; for this reason, the kernel allows a process to specify its maximum timer slack with the prctl() system call. There is, currently, no mechanism to allow one process to adjust another process's timer slack value; it is generally assumed that any given process knows best when it comes to its own timing requirements.
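The effect of slack on wakeup counts can be shown with a simplified model. The sketch below assumes that every timer shares the same slack and may fire anywhere between its deadline and its deadline plus that slack; it then picks the fewest wakeup times that serve every timer. This is an illustration of the idea, not the kernel's actual timer code, and the function name is invented.

```python
def coalesce(deadlines, slack):
    """Return wakeup times so that each timer fires within its slack window."""
    wakeups = []
    for deadline in sorted(deadlines):
        # A wakeup already scheduled at or after this deadline is late
        # enough, and (because deadlines are sorted and slack is uniform)
        # never later than deadline + slack, so it can serve this timer too.
        if wakeups and wakeups[-1] >= deadline:
            continue
        # Otherwise delay this timer as far as allowed to catch later ones.
        wakeups.append(deadline + slack)
    return wakeups


timers_ms = [0, 3, 4, 10, 11, 25]
print(coalesce(timers_ms, slack=0))   # [0, 3, 4, 10, 11, 25]
print(coalesce(timers_ms, slack=5))   # [5, 15, 30]
```

With no slack, six timers cost six wakeups; with five milliseconds of slack, the same timers are served by three.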

The timer slack controller allows a suitably privileged process to set the timer slack value for every process contained within a control group. The patch has been circulating for some time without generating a great deal of interest; it recently resurfaced in response to the "plumber's wish list for Linux" which requested such a feature. The reasoning behind the request was explained by Lennart Poettering:

Consider you have one or more desktop user sessions logged in, each one in a timer slack cgroup. Now, userspace already tracks when sessions become idle (i.e. currently desktop userspace then starts a screensaver, or turns off the screen, or similar), and we'd like to increase the timer slack for the session cgroups individually as the individual session becomes idle, and decrease it again if the session stops being idle.

It is, in other words, a power-saving mechanism. When the session manager determines that nothing special is going on, it can massively increase the slack on any timers operated by desktop applications, effectively decreasing the number of wakeups. Applications need not be aware of whether the user is currently at the keyboard or not; they will simply slow down during the boring times.
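The session manager's side of this can be sketched as a small model. The assumptions are loud ones: the group-wide value is treated as a floor under whatever slack each process chose for itself, the idle value of two seconds is arbitrary, and all names are invented for illustration rather than taken from the patch.

```python
ACTIVE_FLOOR_NS = 0                # no extra slack while the user is present
IDLE_FLOOR_NS = 2_000_000_000      # arbitrary: two seconds when idle


class SessionGroup:
    """Model of one desktop session's timer slack control group."""

    def __init__(self, own_slack_ns):
        # pid -> slack the process asked for itself (e.g. via prctl())
        self.own_slack_ns = dict(own_slack_ns)
        self.floor_ns = ACTIVE_FLOOR_NS

    def set_idle(self, idle):
        """The single knob turned by the session manager."""
        self.floor_ns = IDLE_FLOOR_NS if idle else ACTIVE_FLOOR_NS

    def effective_slack_ns(self, pid):
        # Assumed semantics: the group value can only loosen a timer.
        return max(self.own_slack_ns[pid], self.floor_ns)


session = SessionGroup({101: 50_000, 102: 5_000_000_000})
session.set_idle(True)
print(session.effective_slack_ns(101))   # 2000000000
session.set_idle(False)
print(session.effective_slack_ns(101))   # 50000
```

Neither process is consulted or even informed in this model, which is exactly the property that both the proposal's supporters and its critics focus on.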

There is some stiff opposition to merging this controller. Naturally, the fact that the timer slack controller uses control groups is part of the problem; some kernel developers have still not made their peace with control groups. Until that situation resolves itself - if it ever does - features based on control groups are going to have a bumpy ride on their way into the mainline.

Beyond the general control group issue, though, two complaints have been heard about this approach to power management. One is that applications running on the desktop may have timing requirements that are not dependent on whether the user is actually there or not. One could imagine a data acquisition application that does not have stringent response requirements, but which will still lose data if its timers suddenly gain multiple seconds of slack. Lennart's response is that such applications should be using the realtime scheduler classes, but that answer is unlikely to please anybody. There is likely to be no shortage of applications that have never needed to bother with realtime scheduling but which still will not work well with arbitrary delays. Imposing such delays could lead to any number of strange bugs.

The big complaint, though, as expressed by Peter Zijlstra and others, is that this feature makes it easier for developers to get away with writing low-quality applications. If the pressure to remove badly-written code is removed, it is said, that code will never get fixed. Peter suggests that, rather than papering over poor behavior in the kernel, it would be better to simply kill applications that waste power. He was especially strident about applications that continue to draw when their windows are not visible; such problems should be fixed, he said, before adding workarounds to the kernel.

The massive improvements in power behavior that resulted from the release and use of PowerTop are often pointed to as an example of how things should be done. This situation is a little different, though. The wakeup reductions inspired by PowerTop were low-hanging fruit - processes waking up multiple times per second for no useful purpose. The timer slack controller is aimed at a different problem: wakeups which are useful when somebody is paying attention, but which are not useful otherwise. That is a trickier problem.

Determining when the user is paying attention is not always straightforward, though there are some obvious signs. If the screen has been turned off because the input devices are idle, the user probably does not care. Other cases - non-visible tabs in web browsers, for example - have been cited as well, but the situation is not so obvious there. As Matthew Garrett put it: buried tabs still need timer events "because people expect gmail to provide them with status updates even if it's not the foreground tab." Fixing the problem in applications would require figuring out when nothing is going on, finding a way to communicate it to applications, then fixing large numbers of them (some of which are proprietary) to respond to those events.

It is not surprising that developers facing that kind of challenge might choose to improve the situation with a simple kernel patch instead. It is, certainly, a relatively easy path toward better battery life. But the patch does raise a fundamental policy question that has never been answered in any definitive way. Does mitigating the effects of (what is seen as) application developer sloppiness encourage the distribution of low-quality code and worsen the system in the long run? Or, instead, does the "tough love" approach deter developers and impoverish our application environment without actually fixing the underlying problems?

An answer to that question is unlikely to come in the near future. What that probably means is that the current fuss will be enough to keep the timer slack controller from getting in through the 3.2 merge window. It also seems unlikely to go away, though; we are likely to see this topic return to the mailing lists in the future.

Limiting system calls via control groups?

By Jake Edge
October 19, 2011

Limiting the system calls available to processes is a fairly hot topic in the kernel security community these days. There have been several different proposals and the topic was discussed at some length at the recent Linux Security Summit but, so far, no solution has made its way into the mainline. Łukasz Sowa recently posted an RFC for a different mechanism to restrict syscalls, which may have advantages over other approaches. It also has a potential disadvantage as it uses a feature that is unpopular with some kernel hackers: control groups.

Conceptually, Sowa's idea is pretty straightforward. An administrator could place a process or processes into a control group and then restrict which syscalls those processes (and their children) could make. The current interface uses system call numbers that are written to the syscalls.allow and syscalls.deny cgroup control files. Any system call can be denied, but only those available to the parent cgroup can be enabled that way. Any process that makes a denied system call would get an ENOSYS error return.
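The allow/deny rules can be modeled in user space to make the hierarchy behavior concrete. The sketch below is not Sowa's implementation; the class and method names are invented, the deny() and allow() methods stand in for writes to syscalls.deny and syscalls.allow, and the system call numbers used in the example are the x86-64 ones.

```python
import errno


class SyscallCgroup:
    """Toy model of a control group with a per-group syscall deny list."""

    def __init__(self, parent=None):
        self.parent = parent
        self.denied = set()

    def deny(self, nr):
        """Stand-in for writing a number to syscalls.deny."""
        self.denied.add(nr)

    def allow(self, nr):
        """Stand-in for writing a number to syscalls.allow."""
        # A group cannot enable what its parent does not have.
        if self.parent is not None and not self.parent.permits(nr):
            return False
        self.denied.discard(nr)
        return True

    def permits(self, nr):
        # A denial anywhere on the path up to the root wins.
        group = self
        while group is not None:
            if nr in group.denied:
                return False
            group = group.parent
        return True

    def syscall(self, nr, handler):
        if not self.permits(nr):
            return -errno.ENOSYS   # seen by the process as an ENOSYS failure
        return handler()


NR_WRITE, NR_PTRACE = 1, 101       # x86-64 numbers; they differ elsewhere
root = SyscallCgroup()
sandbox = SyscallCgroup(parent=root)
root.deny(NR_PTRACE)
print(sandbox.allow(NR_PTRACE))                     # False
print(sandbox.syscall(NR_WRITE, lambda: 5))         # 5
print(sandbox.syscall(NR_PTRACE, lambda: 0) == -errno.ENOSYS)   # True
```

The hard-coded numbers at the bottom hint at the portability problem: the same policy would need different numbers on another architecture.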

Using system call numbers seems somewhat painful (and those numbers are not the same across architectures), but may be unavoidable. There are other, bigger problems, though, starting with performance. Sowa reports 5% more system time used by processes in the root cgroup, which is a hefty penalty to pay. His patch hooks into the assembly language syscall fastpath, which is probably not going to fly. It is also architecture-specific and only implemented for x86 currently. Paul Menage points out that hooking into the ptrace() path may avoid those problems:

Can't you hook into the ptrace callpath? That's already implemented on every architecture. Set the thread bit that triggers diverting to syscall_trace_enter() only when any of the thread's syscalls are denied, and then you don't have to work in assembly.

Menage also mentions some other technical issues with the patch, but he is skeptical overall of the need for it. "I'd guess that most vulnerabilities in a system can be exploited just using system calls that almost all applications need in order to get regular work done (open, write, exec ,mmap, etc) which limits the utility of only being able to turn them off by syscall number." Because the approach only allows a binary on or off choice for the system calls, he doesn't necessarily think that it has the right level of granularity. The granularity argument echoes the one made by Ingo Molnar on a 2009 proposal to extend seccomp by adding a bitmask of allowed system calls.

But there have been a number of projects that have expressed interest in having a more flexible seccomp-like feature in the kernel, starting with the Chromium browser team who have proposed several ways to do so. Seccomp provides a way to restrict processes to a few syscalls (read(), write(), exit(), and sigreturn()), but that is too inflexible for many projects. But Molnar has been very vocal in opposition to approaches that only allow binary decisions about system call usage, and he prefers a mechanism that filters system calls using Ftrace-style conditionals. That approach, however, is not popular with some of the other tracing and instrumentation developers.

It is a quandary. There are a number of projects (e.g. QEMU, vsftpd, LXC) interested in such a feature, but no implementation (so far) has passed muster. Sowa's cgroup-based solution may well be yet another. Certainly the current performance for processes that are not in a cgroup (i.e. are in the root cgroup) is not going to be popular—an understatement—but even if Menage's suggestion (or some other mechanism) leads to a solution with little or no performance impact, there may be complaints because of the unpopularity of cgroups.

There may be hope on the horizon in the form of a proposed discussion about expanding seccomp (or providing a means to disable certain syscalls) at the upcoming Kernel Summit, though it does not seem to have made it onto the agenda. Certainly many of the participants in the mailing list discussions will be present. Control groups are on the agenda, though, so there will be some discussion of that rather contentious topic. Look for LWN's coverage of the summit on next week's Kernel page.

Page editor: Jonathan Corbet
Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds