The current development kernel is 3.1-rc10, released on October 17. "There
really hasn't been all that much going on - the smallish MIPS updates are
still the bulk of this -rc, and the rest is pretty much small driver
fixes. Oh, and some last-minute fs (btrfs and xfs) fixes in there
too." Expect the final 3.1 release sometime in the near future.
Stable updates: the 3.0.7 update was
released on October 17 with a moderately-sized set of important fixes.
We do seem to be approaching some sort of agreement ... well I am
at least, I cannot speak for others :-)
-- Neil Brown
What I'm saying is that its much better to attack the primary
source of evil in a manner that is unforgiving instead of trying to
avoid the worst excesses and cause non-obvious breakage.
For a while people were promoting the idea that its good to be
lenient in what you accept as input and strict in what you send
out. I think people are starting to realize that was a horrid
mistake since now they're getting utter crap and people don't even
know what right is anymore.
-- Peter Zijlstra
I agree that the thing that needs doing probably involves web
developers and threats of implied violence, but I suspect that web
developers are being created faster than we can reeducate them.
-- Matthew Garrett
The 2011 Kernel Summit will be held on October 24 and 25 in Prague. The draft
agenda has been posted. Needless to say, LWN will be reporting from
the event; stay tuned.
A new FAQ file has been posted in response to questions that have been raised
about the process of getting back onto kernel.org. It's worth a read for
those who have not yet reestablished their access. "At this time, we are only providing access to developers who previously
hosted git repositories on kernel.org, and whose repositories have shown
activity after February, 2011. At a later time we will be able to
consider creating accounts for developers with inactive trees or who
have not had a kernel.org account in the past."
There are many things that the kernel lacks, but RAID implementations are not on
that list. Both the MD and DM subsystems currently have full RAID support,
while the Btrfs filesystem has lower-level RAID support. RAID5/6 support for Btrfs
has been posted a couple of times, but has not yet made it into the mainline. So, one might
well be justified in wondering if yet another RAID5 implementation is
needed in the kernel.
There will be one if Boaz Harrosh has his way; his RAID5 support patch has been posted to a few
filesystem-related kernel development lists. Boaz's patch is aimed at
adding RAID5 support to the "objects raid engine" code in the exofs
filesystem, which provides a POSIX filesystem on top of object-storage
devices. It also implements RAID5 for the pNFS object-storage backend.
According to Boaz, this work constitutes a nice, general-purpose RAID
library that could be used in other settings; in particular, he says, Btrfs
could make use of it. What would be even nicer, of course, is if some of
the existing in-kernel RAID implementations could also move to this library -
or if exofs could use one of those implementations. This version of RAID5
support may well be cleaner and more general than the others, but it will
probably take a stronger argument than that to get a new RAID subsystem merged
at this point.
Kernel development news
Your editor was innocently looking at some papers on his desk the other day
when his computer abruptly decided to suspend itself. Rawhide is fun in
that way; combined with GNOME's delight in forgetting user settings, it can
produce no end of surprises to brighten one's working experience. The
ability to suspend a desktop system to RAM is actually quite a nice
feature, but your editor prefers to have a say in when it happens.
Thankfully, GNOME still (so far) allows automatic suspend to be turned
off. But it is clear that the suspend-to-RAM functionality is seeing
increased use in a number of contexts; it is not just for laptops and
Android anymore. Your editor's desktop is not the only place where
stakeholders want some control over when the system sleeps and when it
needs to stay running.
Indeed, control over automatic suspension of the system is at the core of the
debate over Android's opportunistic suspend mechanism. As usage of
suspend-to-RAM increases, so does interest in creating a proper mechanism
for determining when a suspend can happen. A
new patch set from Rafael
Wysocki has restarted this discussion and led to, possibly, a surprising
Rafael started with the conclusion that "whatever the kernel has to
offer in this area is either too complicated to use in practice or
inadequate for other reasons." He then went on to propose a new
mechanism that, he hoped, would simplify things. It came in two parts:
- A new sysfs knob, /sys/power/sleep_mode, which provided
overall control of the suspend-to-RAM and hibernation functionality.
If a suitably-privileged process writes "disabled" to this
file, no attempt to suspend or hibernate the system will succeed. It
is a sort of high-power wakelock that ensures the system will keep
running while important work is being done.
- Applications wanting to keep the system awake would open a new device,
/dev/sleepctl, and execute an ioctl() to that
effect. After this call, attempts to suspend the system would block
until the application explicitly drops its lock or until a 500ms (by
default) timeout period expires. The "stay awake" operation would
also be done by the system at resume time to give processes time to
perform whatever tasks need to be done.
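The semantics of the two pieces described above can be modeled in a few lines of user-space code. What follows is only a toy simulation, not the patch itself: the class and method names are invented for illustration, the "enabled" mode string is an assumption (the description only names "disabled"), and a suspend attempt here reports "blocked" rather than actually waiting for the lock to be dropped.

```python
import time


class SleepControl:
    """Toy model of the proposed sleep_mode knob and stay-awake locks."""

    def __init__(self, timeout=0.5, clock=time.monotonic):
        self.sleep_mode = "enabled"  # assumed name for the non-disabled state
        self.timeout = timeout       # 500ms default, as in the description
        self.clock = clock           # injectable so the model can be tested
        self._locks = {}             # holder -> time at which its lock expires

    def stay_awake(self, holder):
        """Stand-in for the ioctl() on the proposed /dev/sleepctl device."""
        self._locks[holder] = self.clock() + self.timeout

    def release(self, holder):
        """The holder explicitly drops its lock."""
        self._locks.pop(holder, None)

    def try_suspend(self):
        """Return 'refused', 'blocked', or 'suspended'."""
        if self.sleep_mode == "disabled":
            return "refused"
        now = self.clock()
        # Locks that have outlived their timeout no longer count.
        self._locks = {h: t for h, t in self._locks.items() if t > now}
        return "blocked" if self._locks else "suspended"
```

The injectable clock is there only so that the timeout behavior can be exercised without sleeping; a lock taken at time zero stops blocking suspend once the clock passes 0.5 seconds.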
It is probably safe to say that these patches will not be merged in
anything resembling this form. Leading the opposition was Neil Brown, who
asserted that the job could be done in user
space, and, indeed, should be done that way. According to Neil:
The only sane way to handle suspend is for any (suitably
privileged) process to be able to request that suspend doesn't
happen, and then for one process to initiate suspend when no-one is
Communication with that process, Neil claimed, should be no harder than
using Rafael's simplified interface to communicate with the kernel.
After a fair amount of discussion, Neil came up with a proposal for how he thinks things should
actually work. As one would expect from the above quote, it centers around
a single daemon with the responsibility for suspending and resuming the
system. A decision to suspend the system is never made by the kernel, and,
if everybody is following the rules, by no other user-space process.
The daemon has a pair of modes; it starts in the "on demand" mode where the
system will only be suspended after an explicit request to do so. That
request could come from the user closing the lid or pressing a button
sequence; in this case, the system should suspend in short order regardless
of what is happening, and it should not resume without an explicit user
action. Suspend can also be requested by a suitably-privileged
application; in this case the operation is only carried out if nothing is
blocking it, and the system can be automatically resumed at some future
time. This mode was also referred to as the "legacy" mode; it needs to be
supported but it is not how things are expected to run most of the time.
Other processes in the system can affect suspend behavior by talking to the
daemon. One of the things a sufficiently-privileged process can do is to
ask the daemon to go into "immediate" mode; in that mode, the system will
suspend anytime there is no known reason to stay awake. The immediate
mode, thus, closely mirrors the opportunistic suspend mechanism used on
Android systems. When the daemon is in immediate mode, it no longer makes
sense for any process in the system to ask the system to suspend - the
daemon is already prepared to suspend whenever the opportunity arises. So
the rest of the interface is concerned with when the system should be
Any process with an interest in suspend and resume events can open a socket
to the daemon and request notification anytime a suspend is being
contemplated. That process should respond to such notifications with a message saying that
it is ready for the suspend to happen; it can, optionally, add a request
that the system stay awake for the time being if there is work that must be
done. If no processes block the suspend, the system will go to sleep;
another message will be sent to all processes once the system resumes.
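The notification handshake can be sketched as follows. This is a simplified model under stated assumptions: direct method calls stand in for the socket messages, the reply strings "ready" and "stay-awake" are invented for illustration, and nothing here actually suspends anything.

```python
class SuspendDaemon:
    """Toy model of the suspend daemon's notification protocol."""

    def __init__(self):
        self.clients = []
        self.events = []

    def register(self, client):
        """A client connects and asks to be told about suspend attempts."""
        self.clients.append(client)

    def contemplate_suspend(self):
        # Notify every registered client and collect its reply.
        replies = [client.suspend_contemplated() for client in self.clients]
        if any(reply == "stay-awake" for reply in replies):
            self.events.append("blocked")
            return False
        self.events.append("suspend")  # a real daemon would sleep the system here
        self.events.append("resume")
        # Everybody is told once the system is running again.
        for client in self.clients:
            client.resumed()
        return True


class Client:
    """A process interested in suspend and resume events."""

    def __init__(self, busy=False):
        self.busy = busy
        self.resume_count = 0

    def suspend_contemplated(self):
        return "stay-awake" if self.busy else "ready"

    def resumed(self):
        self.resume_count += 1
```

A single busy client is enough to keep the system awake; once it reports that it is ready, the next attempt goes through and every client hears about the resume.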
There is an interesting variant on this mechanism whereby processes can
register one or more file descriptors with the daemon. In this case, the
daemon will only query the associated processes before suspending if one or
more of the given file descriptors is reported as readable by
poll(). A readable file descriptor thus functions in a manner
similar to a driver-acquired wakelock in the Android system. If a device
wakes the system and provides input for a user-space process to read, the
daemon will see that the file descriptor is readable and avoid suspending
the system until that input has been consumed and acted upon. Meanwhile,
processes that clearly have no need to block suspend will not need to wake
up and respond to a notification every time a suspend is contemplated.
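The file-descriptor variant maps naturally onto poll(). The sketch below uses Python's select.poll (available on Linux and most Unix systems) to decide which clients would need to be asked before a suspend; the function name and the shape of the registration table are assumptions made for the example, and each descriptor is assumed to belong to exactly one client.

```python
import select


def clients_to_query(registrations):
    """Return the names of clients with at least one readable descriptor.

    registrations maps a client name to the list of file descriptors
    that client has registered with the daemon.
    """
    poller = select.poll()
    owner = {}
    for name, fds in registrations.items():
        for fd in fds:
            poller.register(fd, select.POLLIN)
            owner[fd] = name
    # A timeout of zero makes this a non-blocking check.
    ready = poller.poll(0)
    return sorted({owner[fd] for fd, events in ready if events & select.POLLIN})
```

With a pipe standing in for a device, a client is only consulted while unread input is pending; once the data has been consumed, the daemon is free to suspend without waking that process at all.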
The daemon also allows processes to request that the system be awake at
some future time. A tool like cron can use this feature to, say,
wake the system late at night to run a backup.
At a first glance, this approach looks like it should be able to handle the
opportunistic suspend problem without the need to add more mechanism to the
kernel. But it must be remembered that this is a problem that has defeated
a number of initially reasonable-looking solutions. Whether this proposal
will fare better - and whether the various desktop and mobile environments
will adopt it - remains to be seen.
The "timer slack controller" is a proposed mechanism that would allow a
session management program to adjust the timer tolerances of a group of
processes with a single knob. It seems like a relatively obscure and
harmless feature, but it has been the focus of an intense debate on the
kernel mailing lists. The core question has been seen before: what
measures should the kernel take, if any, to keep poorly-written
applications from hurting performance?
Timers allow a process to request a wakeup at some future time; timer slack
gives the kernel some leeway in its implementation of those timers. If the
kernel can delay specific timers by a bounded amount, it can often expire
multiple timers at once, minimizing the number of wakeups and, thus,
reducing the system's power consumption. Some processes need more
precise timing than others; for this reason, the kernel allows a process to
specify its maximum timer slack with the prctl() system call.
There is, currently, no mechanism to allow one process to adjust another
process's timer slack value; it is generally assumed that any given
process knows best when it comes to its own timing requirements.
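The effect of slack on wakeup counts is easy to see with a small model. The function below is an illustration only, not the kernel's algorithm: it assumes every timer shares the same slack and may fire anywhere between its deadline and its deadline plus that slack, then greedily picks the fewest wakeup times that cover all timers.

```python
def coalesced_wakeups(deadlines, slack):
    """Return the wakeup times needed to service every timer.

    Each timer may expire anywhere in [deadline, deadline + slack].
    """
    wakeups = []
    for deadline in sorted(deadlines):
        if wakeups and deadline <= wakeups[-1]:
            # The most recent wakeup already falls inside this timer's window.
            continue
        # Delay as long as allowed so that later timers can share the wakeup.
        wakeups.append(deadline + slack)
    return wakeups
```

Five timers at 0, 10, 20, 100, and 105 time units need five wakeups with no slack, but only two (at 25 and 125) if each can be delayed by up to 25 units.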
The timer slack controller allows a
suitably privileged process to set the timer slack value for every process
contained within a control group. The patch has been circulating for some
time without generating a great deal of interest; it recently resurfaced in
response to the "plumber's wish list for
Linux" which requested such a feature. The reasoning behind the
request was explained by Lennart Poettering:
Consider you have one or more desktop user sessions logged in, each
one in a timer slack cgroup. Now, userspace already tracks when
sessions become idle (i.e. currently desktop userspace then starts
a screensaver, or turns off the screen, or similar), and we'd like
to increase the timer slack for the session cgroups individually as
the individual session becomes idle, and decrease it again if the
session stops being idle.
It is, in other words, a power-saving mechanism. When the session manager
determines that nothing special is going on, it can massively increase the
slack on any timers operated by desktop applications, effectively
decreasing the number of wakeups. Applications need not be aware of
whether the user is currently at the keyboard or not; they will simply slow
down during the boring times.
There is some stiff opposition to merging this controller.
Naturally, the fact that the timer slack controller uses control groups is
part of the problem; some kernel developers have still not made their
peace with control groups. Until that situation resolves itself - if it
ever does - features based on control groups are going to have a bumpy ride
on their way into the mainline.
Beyond the general control group issue, though, two complaints have been
heard about this approach to power management.
One is that applications running on the desktop may have timing
requirements that are not dependent on whether the user is actually there
or not. One could imagine a data acquisition application that does not
have stringent response requirements, but which will still lose data if its
timers suddenly gain multiple seconds of slack. Lennart's response is that such applications should be
using the realtime scheduler classes, but that answer is unlikely to please
anybody. There is likely to be no shortage of applications that have never
needed to bother with realtime scheduling but which still will not work
well with arbitrary delays. Imposing such delays could lead to any number
of strange bugs.
The big complaint, though, as expressed by
Peter Zijlstra and others, is that this feature makes it easier for
developers to get away
with writing low-quality applications. If the pressure to remove
badly-written code is removed, it is said, that code will never get fixed.
Peter suggests that, rather than papering over poor behavior in the kernel,
it would be better to simply kill applications that waste power. He was
especially strident about applications that continue to draw when their
windows are not visible; such problems should be fixed, he said, before
adding workarounds to the kernel.
The massive improvements in power behavior that resulted from the release
and use of PowerTop are often pointed to as an example of how things should
be done. This situation is a little different, though. The wakeup
reductions inspired by PowerTop were low-hanging fruit - processes waking
up multiple times per second for no useful purpose. The timer slack
controller is aimed at a different problem: wakeups which are useful
when somebody is paying attention, but which are not useful otherwise.
That is a trickier problem.
Determining when the user is paying attention is not always
straightforward, though there are some obvious signs. If the screen has been
turned off because the input devices are idle, the user probably does not
care. Other cases - non-visible tabs in web browsers, for example - have
been cited as well, but the situation is not so obvious there. As Matthew
Garrett put it: buried tabs still need
timer events "because people expect gmail to provide them with status
updates even if it's not the foreground tab." Fixing the problem in
applications would require figuring out when nothing is going on, finding a
way to communicate it to applications, then fixing large numbers of them
(some of which are proprietary) to respond to those events.
It is not surprising that developers facing that kind of challenge might
choose to improve the situation with a simple kernel patch instead. It is,
certainly, a relatively easy path toward better battery life. But the
patch does raise a fundamental policy question that has never been answered
in any definitive way. Does mitigating the effects of (what is seen as)
application developer sloppiness encourage the distribution of low-quality
code and worsen the system in the long run? Or, instead, does the "tough
love" approach deter developers and impoverish our application environment
without actually fixing the underlying problems?
An answer to that question is unlikely to come in the near future. What
that probably means is that the current fuss will be enough to keep the
timer slack controller from getting in through the 3.2 merge window. It
also seems unlikely to go away, though; we are likely to see this topic
return to the mailing lists in the future.
Limiting the system calls available to processes is a fairly hot topic in the
kernel security community these days. There have been several different
proposals and the topic was discussed at
some length at the recent Linux Security Summit but, so far, no solution
has made its way into the mainline. Łukasz Sowa recently posted an RFC for a different mechanism to
restrict syscalls, which may have advantages over other approaches. It
also has a potential disadvantage as it uses a feature that is unpopular
with some kernel hackers: control groups.
Conceptually, Sowa's idea is pretty straightforward. An administrator
could place a process or
processes into a control group and then restrict which syscalls those
processes (and their children) could make. The current interface uses
system call numbers that are written to the syscalls.allow and
syscalls.deny cgroup control files. Any system calls can be
denied, but only those available to a parent cgroup could be enabled that
way. Any process that makes a denied system call would get an
ENOSYS error return.
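The hierarchy rule is the subtle part of that interface, so a small model may help. This is a user-space simulation under stated assumptions, not the patch: the class and method names are invented, and the system call numbers in the usage note are the x86-64 values, used purely for illustration.

```python
import errno


class SyscallCgroup:
    """Toy model of a cgroup with syscalls.allow and syscalls.deny files."""

    def __init__(self, parent=None):
        self.parent = parent
        self.denied = set()

    def deny(self, nr):
        """Model a write of a system call number to syscalls.deny."""
        self.denied.add(nr)

    def allow(self, nr):
        """Model a write to syscalls.allow; the parent must permit the call."""
        if self.parent is not None and not self.parent.permits(nr):
            raise PermissionError(f"syscall {nr} is not available to the parent")
        self.denied.discard(nr)

    def permits(self, nr):
        """A call is usable only if no group up to the root denies it."""
        if nr in self.denied:
            return False
        return self.parent is None or self.parent.permits(nr)

    def syscall(self, nr):
        """Return 0 for a permitted call, -ENOSYS for a denied one."""
        return 0 if self.permits(nr) else -errno.ENOSYS
```

If the root group denies ptrace() (number 101 on x86-64), a child group cannot turn it back on, while the child remains free to deny and later re-allow write() (number 1) for itself.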
Using system call numbers seems somewhat painful (and those numbers are not
the same across architectures), but may be unavoidable. There are
other, bigger problems, though, starting with performance. Sowa reports 5% more
system time used by processes in the root cgroup, which is a hefty penalty
to pay. His patch hooks into the assembly language syscall fastpath, which
is probably not going to fly. It is also architecture-specific and only
implemented for x86 currently. Paul Menage points out that hooking into the
ptrace() path may avoid those problems:
Can't you hook into the ptrace callpath? That's already implemented on
every architecture. Set the thread bit that triggers diverting to
syscall_trace_enter() only when any of the thread's syscalls are
denied, and then you don't have to work in assembly.
Menage also mentions some other technical issues with the patch, but he is
skeptical overall of the need for it. "I'd guess
that most vulnerabilities in a system can be exploited just using
system calls that almost all applications need in order to get regular
work done (open, write, exec ,mmap, etc) which limits the utility of
only being able to turn them off by syscall number." Because the
approach only allows a binary on or off choice for the system calls, he
doesn't necessarily think that it has the right level of granularity.
The granularity argument echoes the one made by Ingo
Molnar on a 2009 proposal to extend
seccomp by adding a bitmask of allowed system calls.
But there have been a number of projects that have expressed interest in
having a more flexible seccomp-like feature in the kernel, starting with
the Chromium browser team who have proposed
several ways to do so. Seccomp
provides a way to restrict processes to a few syscalls
(read(), write(), exit(), and
sigreturn()), but that is too inflexible for many projects. But
Molnar has been very vocal in opposition to approaches that only allow
binary decisions about system call usage, and he prefers a mechanism that
filters system calls using Ftrace-style
conditionals. That approach, however, is not
popular with some of the other tracing and instrumentation developers.
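The difference between the two styles of filtering can be shown with a short sketch. Nothing here corresponds to a real kernel interface: system calls are represented by name, arguments by a dictionary, and ordinary Python predicates stand in for Ftrace-style conditional expressions.

```python
# The four calls permitted by the existing seccomp mode.
SECCOMP_STRICT = {"read", "write", "exit", "sigreturn"}


def binary_filter(allowed):
    """A bitmask-style filter: each system call is simply on or off."""
    def check(name, args):
        return name in allowed
    return check


def conditional_filter(rules):
    """A filter that can also examine arguments; unlisted calls are denied."""
    def check(name, args):
        predicate = rules.get(name)
        return predicate is not None and predicate(args)
    return check
```

A binary filter built from SECCOMP_STRICT must allow every write() or none of them. A conditional filter with a rule such as `{"write": lambda args: args["fd"] in (1, 2)}` can allow writes to standard output and standard error while refusing all others, which is the kind of granularity that the on/off approaches cannot express.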
It is a quandary. There are a number of projects (e.g. QEMU, vsftpd, LXC)
interested in such a
feature, but no implementation (so far) has passed muster. Sowa's
cgroup-based solution may well be yet another.
Certainly the current performance for processes that are not in a cgroup
(i.e. are in the root cgroup) is not going to be popular—an
understatement—but even if Menage's suggestion (or some other
mechanism) leads to a solution
with little or no performance impact, there may be complaints because of
the unpopularity of cgroups.
There may be hope on the horizon in the form of a proposed discussion about
expanding seccomp (or providing a means to disable certain syscalls) at the
upcoming Kernel Summit, though
it does not seem to have made it onto the agenda. Certainly many of
the participants in the mailing list discussions will be present.
Control groups are on the agenda, though, so there will be some discussion
of that rather contentious topic. Look for LWN's coverage of the summit on
next week's Kernel page.
Page editor: Jonathan Corbet