Kernel development
Brief items
Kernel release status
The current development kernel is 3.18-rc2, released on October 26. "I had hoped that the rc1 release would mean that a few stragglers would quickly surface, and then the rest of the rc would be more normal. But no, I had straggling merge-window pull requests come in all week, and rc2 is bigger than I'd like." Perhaps the most significant of those requests was for the overlayfs union filesystem, which has finally been merged after years of trying.
Stable updates: no stable updates have been released in the last week. As of this writing, the 3.17.2, 3.16.7, 3.14.23, and 3.10.59 updates are all in the review process; they can be expected on or after October 30.
Kernel development news
Toward better CPU idle-time predictions
The CPU idle ("cpuidle") code has one of those tasks that would be best handled with absolute knowledge of the future: knowing how long the processor will be idle so that the most appropriate sleep state can be chosen. Since that knowledge is hard to come by, the cpuidle code must get by with heuristics. At the 2014 Linux Plumbers Conference (LPC), Daniel Lezcano talked about a scheme he has to improve those heuristics and, in the process, bring about better integration between the scheduler and the cpuidle subsystem.
The current cpuidle code suffers from a number of shortcomings. It is not actually tied to the scheduler, so it has no idea of what the scheduler plans to do next. Even the most sophisticated governors tend to get things wrong, leading to the wrong sleep states being chosen. By focusing on a few relatively predictable parameters, Daniel hopes to come up with an approach to cpuidle that is both simpler and more accurate.
The "menu" cpuidle governor used on many systems looks at the recent past
and tries to come up with a good guess as to how long the system will sleep
the next time it goes idle. But actual system behavior depends on a wide
variety of different events. Some, like timer expirations, are entirely
predictable. Others, such as block I/O operations, are reasonably
predictable, especially if one is watching carefully. Others, including
things like keyboard events, are not predictable at all. Including these
latter events in the calculation, Daniel said, leads to bad predictions and
erratic performance from the cpuidle governor.
Daniel's cpuidle patch set addresses the most predictable wakeup events by having the scheduler pass the next timer expiration time into the cpuidle code. The scheduler also passes in the current latency requirement; cpuidle can use that information to avoid putting the processor into an overly deep sleep. The most unpredictable events, instead, are simply ignored in this version of the cpuidle code.
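To make the role of those two inputs concrete, here is a minimal sketch of how a governor might combine the next timer expiration with a latency requirement when picking a sleep state. It is not taken from Daniel's patches: the state table, its numbers, and the names IdleState and pick_state() are hypothetical, with the two per-state fields loosely modeled on the exit-latency and target-residency values that cpuidle drivers describe for each state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum sleep length for the state to pay off


# Hypothetical state table, ordered from shallowest to deepest.
STATES = [
    IdleState("C1", 2, 2),
    IdleState("C3", 50, 150),
    IdleState("C6", 200, 800),
]


def pick_state(states, next_timer_us, latency_limit_us):
    """Return the deepest state that satisfies both constraints."""
    chosen = states[0]  # assumption: the shallowest state is always allowed
    for state in states:
        if state.exit_latency_us > latency_limit_us:
            continue  # waking up would take longer than is tolerated
        if state.target_residency_us > next_timer_us:
            continue  # the timer fires before the state pays for itself
        chosen = state
    return chosen
```

With this table, a timer 1000µs away and a generous latency limit select the deepest state, while tightening the latency limit to 100µs falls back to the middle one.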
That leaves the moderately predictable events, primarily block I/O. Daniel's patch set starts by maintaining a simple running average of per-task I/O completion times. All tasks waiting for block I/O on a given CPU are put into a red-black tree; the closest expected completion time is easily obtained from that tree and used to predict the next wakeup time. But the running average is a bit too simple, being overly affected by the occasional operation that takes much longer (or much less time) than expected. So something a bit more complex is called for.
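The weakness of the running average is easy to see in a small sketch. The code below is an illustration only, not the patch set's code: the class and function names are invented, the average is assumed to be a plain arithmetic mean, and a binary heap stands in for the kernel's red-black tree.

```python
import heapq


class TaskIoStats:
    """Running average of one task's I/O completion times, in microseconds."""

    def __init__(self):
        self.count = 0
        self.average_us = 0.0

    def record(self, duration_us):
        # Incremental form of the arithmetic mean.
        self.count += 1
        self.average_us += (duration_us - self.average_us) / self.count


def next_expected_wakeup(waiting):
    """waiting: iterable of (start_time_us, TaskIoStats) for blocked tasks.

    Returns the earliest expected completion time, or None if nothing waits.
    """
    heap = [start + stats.average_us for start, stats in waiting]
    heapq.heapify(heap)  # stand-in for the red-black tree described above
    return heap[0] if heap else None
```

Four operations of 200µs followed by a single 5000µs outlier move the average to 1160µs, far from what the task usually sees; that sensitivity is what motivates the bucket scheme.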
Daniel's response is to divide I/O completion times into buckets; after some investigation, he settled on 200µs as the optimal bucket granularity. Each bucket contains a counter of "hits," being the number of times that an operation has fallen into that bucket's duration. The buckets tracking I/O completion times are stored in a linked list. When a process first starts up, the data structure might look something like this:
The "hits" counter is incremented in the appropriate bucket for each I/O operation completion. After a small number of operations, the data structure might look like this:
Something interesting happens every time one bucket gains five hits though: it is moved to the beginning of the list. So, if the next operation completes in just over 200µs, the data structure will look like this:
The idea is that the buckets that see the most activity will be found toward the beginning of the list. When it comes time to predict when the next operation will complete, Daniel's code iterates through the list, computing a score for each bucket. Essentially, that score is the number of hits in the bucket divided by 1+2p, where p is the bucket's position in the list, counting from zero. The first bucket's hits are thus divided by one and the second bucket's by three, so the bucket in the second position must have three times as many hits to get a higher score than the bucket in the first position.
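The bucket list and its scoring can be sketched in a few lines. This is a reading of the description above rather than Daniel's code, and it makes several assumptions that the talk summary does not settle: new buckets are appended to the end of the list, a bucket moves to the front each time its hit count reaches a multiple of five, and the prediction is the lower edge of the winning bucket. The names IoBuckets, record(), and predict_us() are invented for the example.

```python
BUCKET_US = 200    # bucket width reported in the talk
PROMOTE_EVERY = 5  # assumption: promote at every multiple of five hits


class IoBuckets:
    def __init__(self):
        # Each entry is [bucket_index, hits]; list order is the "position".
        self.buckets = []

    def record(self, duration_us):
        index = duration_us // BUCKET_US
        for entry in self.buckets:
            if entry[0] == index:
                break
        else:
            entry = [index, 0]
            self.buckets.append(entry)  # assumption: new buckets go last
        entry[1] += 1
        if entry[1] % PROMOTE_EVERY == 0:
            # Move the bucket to the beginning of the list.
            self.buckets.remove(entry)
            self.buckets.insert(0, entry)

    def predict_us(self):
        """Return the lower edge of the best-scoring bucket, or None."""
        best_score, best_index = -1.0, None
        for position, (index, hits) in enumerate(self.buckets):
            score = hits / (1 + 2 * position)
            if score > best_score:
                best_score, best_index = score, index
        if best_index is None:
            return None
        return best_index * BUCKET_US
```

For example, after three completions under 200µs and four between 200µs and 400µs, the first bucket still wins (3/1 beats 4/3). One more completion just over 200µs gives the second bucket its fifth hit, moves it to the front, and changes the prediction to 200µs.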
The idea is to try to guess the most likely completion time based on both long-term and recent history. According to Daniel, it works pretty well, yielding far better results than the existing menu governor. Even so, this design did not survive the discussion in the LPC microconference, so any version of this patch set that gets into the mainline kernel is likely to look somewhat different.
The issue was the use of a per-task data structure for I/O completion time tracking. The advantage of this approach is that, when a task moves between CPUs, the tracking information will move with it, so the scheduler on the new CPU has an immediate idea of how long that task's I/O operations will take. But I/O completion times are not really a task-specific parameter; they are, instead, tied to the underlying device. And, as it happens, the kernel is already tracking device performance in the block layer. That information reflects current loads and should be quite accurate; using it might make the entire bucket data structure unnecessary.
So a future version of this patch set will probably be recast along those lines, using the backing store information already maintained by the kernel. But there are other challenges looming for this code. As Peter Zijlstra pointed out, the block layer is increasingly trying to maintain locality in I/O requests, ensuring that the CPU handling an I/O completion is the one that initiated the operation. Better locality makes sense, Peter said, but it also can conflict with the scheduler's attempts at distributing load across a system. It is harder to guess at wakeup times if it's not clear which CPU will wake up to deal with a specific I/O completion. The multiqueue block layer code is going to make this problem worse; some work will have to be done to reconcile these differing approaches to performance.
But, even if it is not a complete solution to the problem, Daniel's patch achieves the goals of better sleep-time predictions and better integration of the cpuidle code with the scheduler. That should be sufficient to get some version of this code into the mainline — someday.
[Your editor would like to thank the Linux Foundation for supporting his travel to this event].
In a bind with binder
The Android microconference at the 2014 Linux Plumbers Conference started off with an assessment of how the Android project was doing with regard to integration of its kernel changes with the mainline. The general feeling was that things are getting better; with 3.14, Android carries "only" 346 patches on top of the mainline kernel. But subsequent events have shown that the long-lasting tension between Android and mainline kernel developers has not entirely dissipated yet; the much-maligned Android "binder" mechanism is now one of the central points of that tension.
Many of the Android-specific components that are in the mainline kernel still live in the staging tree. Early in the session, Greg Kroah-Hartman got up to discuss whether any of those components should be moved out of staging, either out of the kernel tree entirely or into the mainline proper. As an example of the first type, the Android "logger" module, which is no longer used as of the Lollipop release, may simply go away entirely. The story with binder is different, though.
Android's binder has a long history, having first shown up as part of the BeOS system. It is a mechanism for remote procedure calls and remote object management that is used heavily within the Android system. For an overview of how binder relates to other interprocess communication mechanisms, see this 2011 article by Neil Brown. Almost nobody seems to like binder; it is seen as abusing various low-level kernel interfaces, has known security problems when used outside of the tightly controlled Android setting, and more. But Android needs it, so it persists.
In the Plumbers session, Greg noted that binder had been in the staging
tree for years. Since it is in active use, binder is not going away
anytime soon. It is "horrible" and "broken by design," but it is an ABI
that we need to support, Greg said, so we might as well move it out of the
staging tree. No objections to the idea were raised in the session;
everybody seemed happy with the idea of getting binder out of staging.
Greg wasted no time before posting a patch to move binder out of staging; it was on the lists before the Android microconference had concluded its business. Whether Greg expected the wider discussion to go as smoothly as the microconference is not clear; what is clear is that's not what he got.
John Stultz, who has done a lot of work toward the mainlining of Android-specific features, expressed a few concerns. The first of those had to do with maintenance: who was going to be the maintainer of this code, and had the Android developers agreed with that decision? Greg's response to this question was to note that the binder code had not changed in any significant way for a long time; there is not, he feels, a lot of maintenance that is needed. To the extent that binder needs a maintainer, Greg has volunteered to do it.
Another concern of John's had to do with efforts to replace binder with
something better tied into the kernel; work is progressing on writing a
binder-compatible library over the (still out-of-tree) kdbus interface. Moving binder into the
mainline, he thought, might reduce the incentive to get that work done.
Greg's answer here was that any such work is entirely new code; it doesn't
mitigate the need to maintain the existing code "forever," adding: "So as
there really is nothing left to do on it, it deserves to be in the main
part of the kernel source tree."
Finally, John worried that moving binder to the mainline might encourage others
to make use of it; this was a concern that Alan Cox shared. Greg's response here is that there is
never a way to control how others will use the software we ship. But, if
anybody outside of Android were to use binder, he said, "you deserve
all of the pain and suffering and rooted machines you will get."
Arnd Bergmann raised a number of issues mostly relating to security. Evidently Android does not use the full API exported by binder; he would like to see an audit of how the API is used so that the unneeded parts can be removed, reducing the attack surface of the whole thing. Binder also leaks kernel-space pointers into user space and has no awareness of namespaces, so it can also leak information between containers in undesirable ways. These points have not been addressed in the discussion so far.
Finally, Christoph Hellwig attempted to block the move outright, saying:
Greg disagreed about the claim that no work was being done toward a new interface; he also repeated that, no matter how that work goes, the existing interface needs to be supported indefinitely. Christoph was not satisfied by the answer, though; here he represents a group of kernel developers who feel that the Android developers are still not really trying to work with the mainline kernel. From this point of view, merging substandard Android code just allows Google to offload the pain of maintaining it and encourages more of the same behavior in the future. About the only tool that the kernel developers have to address this problem, as they see it, is their ability to refuse to accept inadequate code.
At that point, the discussion wound down. Greg has not said whether he still plans to move binder regardless of the points that have been raised. In the end, it could be said to make little difference; binder will be shipped with the kernel tree whether it is moved or not. But the decision to move binder or not could send a message on how the kernel development community feels about its relationship with the Android team.
[Your editor would like to thank the Linux Foundation for supporting his travel to the Linux Plumbers Conference].
A desktop kernel wishlist
GNOME developer Bastien Nocera recently shared a "wishlist" with the kernel mailing list, outlining a number of features that GNOME and other desktop-environment projects would like to see added to or enhanced in the kernel. In the resulting discussion, some of the wishlist items were subsequently crossed off, but many of the others sparked real discussion that could, in time, develop into mainline kernel features.
Nocera prefaced his list by saying that GNOME has had productive
discussions with kernel developers in the past, and that the current
list consisted of "items that kernel developers might not
realise we'd like to rely on, or don't know that we'd make use of if
merged." Most of these items fall into one of two
categories: power management or filesystem features, although there
are a handful of others in the mix as well.
Less power
Under power management, Nocera listed native support for hybrid suspend
(known as Intel Rapid Start on certain hardware), connected
standby support, "a hibernation implementation that doesn't
use the swap space", and several smaller items (such as
establishing uniform semantics for screen-backlight settings and
better documentation for managing USB power).
Hybrid suspend is a firmware-level feature intended to minimize the
time required to restore a system from the suspended state. It works by
initially suspending the system to RAM, but then writing the memory
contents to disk (directly from the firmware) after a given time period and
hibernating. Resume will be fast before the timeout expires, but the
system will preserve its state regardless of how long it is suspended.
Matthew Garrett wrote a patch implementing support for the feature in
mid-2013, which Nocera subsequently pushed to merge upstream; that push
ultimately stalled. One of
the critiques was that Intel's hybrid-suspend feature would eventually
be made obsolete by connected standby, a newer idle-state power mode
that allows certain background processes to continue. Garrett has
also worked on an approach to connected standby.
John Stultz asked for clarification
about one other power-management request: exporting the cause of a
wake event. Nocera elaborated, saying
that the goal was to be able to determine whether the machine was
awakened by a user event or by a hardware event (such as the realtime
clock alarm), in order to respond accordingly to different scenarios.
The use case he cited was for user-space code to try to determine whether
it was a good time to run a previously scheduled backup: if the user
woke the machine by opening the laptop lid, presumably it would be a
bad time to start a lengthy backup process. If the wake was
automatic, the backup should proceed.
Stultz, however, argued that reporting the cause of the wakeup
would not truly satisfy the use case, in part because any number
of wake events could take place (and, being asynchronous, could arrive
in an unhelpful order) in between the kernel waking up and user-space
code being able to run. But more importantly, as Zygo Blaxnell put it,
which event was most recent is far less
important than (for example) whether or not the user is actually using
the machine, a fact that could be determined through other means,
such as keyboard activity.
Alan Cox, on the other hand, commented that, in the long term, most of
the assumptions that go with current thinking about suspend, resume,
and hibernation states may go away anyhow:
- On/off is an extreme action rarely taken (feature parity with 1970s
VAXen ;-) )
- The "blob with a lid" model of construction is no longer useful. Even a
keyboarded device is quite likely have a removable keyboard.
Filesystem issues
The second large category of feature requests concerned filesystem
features and the VFS layer. First, Nocera reported that inotify is
not meeting the needs of desktop utilities in a number of ways.
Watching large directory structures consumes too many resources,
file-creation notification requires watching an entire directory, and
monitoring directories' file-renaming and removal events is expensive
on large directories. These limitations impact the performance of
filesystem indexers, backup tools, and programs that manage file
"libraries" (e.g., music and video managers).
Sergey Davidoff of
elementary OS elaborated on the subject,
saying that desktop application developers are keen to move to the
file-library concept (as used in music and video managers) in
other application types as well. Presenting the user with the
filesystem hierarchy, he said, is far less useful than intelligently
tracking the relevant files and allowing the user to search and
interact with them based on their metadata. fanotify, as both noted, lacks the
proper level of detail, such as reporting rename and move events.
Nocera also asked for a way to propagate timestamp changes up a
directory chain. That is, if a file located in /foo/bar/
changes, there would be a way to detect the change not only on the
file itself, but also on /foo/bar and on /foo
itself. He added, though, that simply updating the change time on the
containing directories would clearly be the wrong solution, since it
would break many programs.
In short, he said, user-space programs would benefit from a better
file-change notification system, ideally one that would
consolidate events and monitor a directory structure without
re-crawling it periodically. The combination of improving fanotify
and adding user-space glue might work, he said, as would adding the
changelog features currently available in Btrfs and XFS to other
filesystems.
Pavel Machek asked whether
adding a (hypothetical) recursive version of the mtime timestamp would
be a possible solution. Davidoff was skeptical that it would
work for monitoring online changes, but said that monitoring Btrfs's
changelog "more or less as it happens" would probably
suffice. In particular, monitoring Btrfs changelogs on the fly could
at least ensure that a fixed-size buffer would get the job done, as
opposed to the unbounded memory that fanotify would require for the
same task.
Miscellany
Nocera's final cluster of wishlist items is a bit of a grab bag.
It includes a better user-space API for the industrial input/output
(IIO) subsystem used for various sensors, a user-space helper for the out-of-memory (OOM)
killer, a system call to poll whether any processes out of a set of
processes have exited, and a variant of epoll_wait() that accepts
an absolute time rather than a timeout.
As was the case with the other categories, some of these items
sparked quick responses. Patrik Lundquist suggested that the desired
epoll_wait() functionality could be achieved with timerfd. To
that, Nocera quoted the original source of the request, Ryan Lortie,
who said that making a separate call to
set up a timerfd on every instance of entering the kernel to sleep is
cumbersome, and that "epoll in general suffers from being _way_
too chatty about the syscalls that you have to do." Andy
Lutomirski added that he had
implemented procfs
polling several years ago, and would be willing to resume work on
it if it were useful. procfs polling would allow a user to open a set
of /proc directories corresponding to a chosen set of
processes, then poll the directories to see if any exit.
As to the other suggestions, there has thus far been little
reaction either way. Certainly a number of the wishlist items boil
down to implementing friendlier user-space APIs. Nocera commented on
both the IIO and wake-event reporting issues that the present-day
interface of examining raw sysfs files is far from sufficient.
On the whole, though, the kernel community has clearly been receptive to these
needs of desktop-environment projects. On the wishlist wiki page,
Nocera likened the exercise to the "plumbers' wishlists" submitted to kernel developers by
Kay Sievers, Lennart Poettering, and Harald Hoyer. The plumbers'
wishlist approach, of course, was successful enough that the plumbers
in question have since repeated it. That bodes well for Nocera's
desktop wishlist.
It is fairly common to hear about new kernel work that is driven by
the needs of either high-end data center users or embedded-system
builders, perhaps simply because companies in those lines of
business tend to hire kernel developers. Thus, it is always good to
see that the kernel community is equally responsive to the needs of
user-space developers working in other areas, when those developers
take time to reach out with their concerns.
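The wishlist item asking for an epoll_wait() variant that accepts an absolute time is easier to appreciate next to what user space has to do today. The following is only a sketch of a user-space emulation, not a proposed kernel interface: the helper name epoll_wait_until() is invented, and the deadline is assumed to be expressed on the clock read by Python's time.monotonic(). It uses the standard select.epoll object, which is available on Linux only.

```python
import select
import time


def epoll_wait_until(ep, deadline, maxevents=-1):
    """Wait on epoll object `ep` until time.monotonic() reaches `deadline`.

    Returns the list of (fd, event_mask) pairs, or [] once the deadline
    has passed with nothing ready.
    """
    while True:
        # The absolute deadline must be converted to a relative timeout
        # before every call; this is the step a kernel-side variant
        # taking an absolute time would make unnecessary.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return ep.poll(0, maxevents)  # deadline passed: just check once
        events = ep.poll(remaining, maxevents)
        if events:
            return events
        # Timed out slightly early or woke with nothing ready: recompute.
```

The conversion is inherently racy, since time passes between reading the clock and entering the kernel, which is one reason an absolute-time interface, or a timerfd registered with the epoll set, is attractive.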
Page editor: Jonathan Corbet
