
Kernel development

Brief items

Kernel release status

The current development kernel is 3.18-rc2, released on October 26. "I had hoped that the rc1 release would mean that a few stragglers would quickly surface, and then the rest of the rc would be more normal. But no, I had straggling merge-window pull requests come in all week, and rc2 is bigger than I'd like." Perhaps the most significant of those requests was for the overlayfs union filesystem, which has finally been merged after years of trying.

Stable updates: no stable updates have been released in the last week. As of this writing, the 3.17.2, 3.16.7, 3.14.23, and 3.10.59 updates are all in the review process; they can be expected on or after October 30.


Kernel development news

Toward better CPU idle-time predictions

By Jonathan Corbet
October 29, 2014

Linux Plumbers Conference
The CPU idle ("cpuidle") code has one of those tasks that would be best handled with absolute knowledge of the future: knowing how long the processor will be idle so that the most appropriate sleep state can be chosen. Since that knowledge is hard to come by, the cpuidle code must get by with heuristics. At the 2014 Linux Plumbers Conference (LPC), Daniel Lezcano talked about a scheme he has to improve those heuristics and, in the process, bring about better integration between the scheduler and the cpuidle subsystem.

The current cpuidle code suffers from a number of shortcomings. It is not actually tied to the scheduler, so it has no idea of what the scheduler plans to do next. Even the most sophisticated governors tend to get things wrong, leading to the wrong sleep states being chosen. By focusing on a few relatively predictable parameters, Daniel hopes to come up with an approach to cpuidle that is both simpler and more accurate.

The "menu" cpuidle governor used on many systems looks at the recent past and tries to come up with a good guess as to how long the system will sleep the next time it goes idle. But actual system behavior depends on a wide variety of different events. Some, like timer expirations, are entirely predictable. Others, such as block I/O operations, are reasonably predictable, especially if one is watching carefully. Others, including things like keyboard events, are not predictable at all. Including these latter events in the calculation, Daniel said, leads to bad predictions and erratic performance from the cpuidle governor.

Daniel's cpuidle patch set addresses the most predictable wakeup events by having the scheduler pass the next timer expiration time into the cpuidle code. The scheduler also passes in the current latency requirement; cpuidle can use that information to avoid putting the processor into an overly deep sleep. The most unpredictable events, instead, are simply ignored in this version of the cpuidle code.
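How those two inputs might constrain the choice of sleep state can be sketched in a few lines of Python. This is only an illustration, not Daniel's code: the state table and its numbers are invented, and the use of a per-state exit latency and target residency (the minimum sleep time for which a state pays off) is an assumption about how the inputs would be applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum sleep for the state to pay off


# Hypothetical table, ordered from shallowest to deepest state.
STATES = [
    IdleState("C1", 2, 10),
    IdleState("C3", 50, 300),
    IdleState("C6", 200, 1500),
]


def pick_state(next_timer_us, latency_limit_us, states=STATES):
    """Return the deepest state that satisfies both constraints.

    Falls back to the shallowest state when nothing else fits.
    """
    chosen = states[0]
    for state in states:
        if state.exit_latency_us > latency_limit_us:
            break  # waking up would take longer than allowed
        if state.target_residency_us > next_timer_us:
            break  # the timer fires before the state pays off
        chosen = state
    return chosen
```

With a timer 5000µs away, a latency limit of 100µs rules out the deepest state in this table and the sketch settles on "C3"; a timer only 200µs away leaves "C1" as the only sensible choice.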

That leaves the moderately predictable events, primarily block I/O. Daniel's patch set starts by maintaining a simple running average of per-task I/O completion times. All tasks waiting for block I/O on a given CPU are put into a red-black tree; the closest expected completion time is easily obtained from that tree and used to predict the next wakeup time. But the running average is a bit too simple, being overly affected by the occasional operation that takes much longer (or much less time) than expected. So something a bit more complex is called for.
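A rough model of that first approach shows both the idea and its weakness. The class and method names here are hypothetical, the average is a plain cumulative mean (the talk did not say exactly which form of running average was used), and a binary heap from Python's standard library stands in for the kernel's red-black tree.

```python
import heapq


class TaskIOStats:
    """Running average of one task's block I/O completion times (µs)."""

    def __init__(self):
        self.count = 0
        self.average_us = 0.0

    def record(self, duration_us):
        # Incremental form of the arithmetic mean.
        self.count += 1
        self.average_us += (duration_us - self.average_us) / self.count


class CpuWaiters:
    """Tasks blocked on I/O on one CPU, ordered by expected completion."""

    def __init__(self):
        self._heap = []  # heap used here in place of a red-black tree

    def add(self, task_name, start_us, stats):
        expected = start_us + stats.average_us
        heapq.heappush(self._heap, (expected, task_name))

    def next_wakeup_us(self):
        """Earliest expected completion time, or None if nobody waits."""
        return self._heap[0][0] if self._heap else None
```

Four completions of 200µs followed by a single 5000µs outlier pull the average up to 1160µs, which is the kind of distortion that motivated a more elaborate scheme.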

Daniel's response is to divide I/O completion times into buckets; after some investigation, he settled on 200µs as the optimal bucket granularity. Each bucket contains a counter of "hits," being the number of times that an operation has fallen into that bucket's duration. The buckets tracking I/O completion times are stored in a linked list. When a process first starts up, the data structure might look something like this:

[cpuidle bucket data structure]

The "hits" counter is incremented in the appropriate bucket for each I/O operation completion. After a small number of operations, the data structure might look like this:

[cpuidle bucket data structure]

Something interesting happens every time one bucket gains five hits though: it is moved to the beginning of the list. So, if the next operation completes in just over 200µs, the data structure will look like this:

[cpuidle bucket data structure]

The idea is that the buckets that see the most activity will be found toward the beginning of the list. When it comes time to predict when the next operation will complete, Daniel's code iterates through the list, computing a score for each bucket. Essentially, that score is the number of hits in the bucket divided by 1 + 2p, where p is the bucket's position in the list, counting from zero. So the bucket in the second position must have more than three times as many hits to get a higher score than the bucket in the first position.
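The bucket scheme is compact enough to model in Python. The sketch below follows the description from the talk, with several guesses filled in and marked as such: a Python list stands in for the linked list, new buckets are appended at the end, "gains five hits" is read as "reaches a multiple of five", and the prediction is reported as the time range of the winning bucket. All names are hypothetical.

```python
BUCKET_US = 200      # bucket granularity mentioned in the talk
PROMOTE_EVERY = 5    # hits between moves to the front of the list


class Bucket:
    def __init__(self, index):
        self.index = index  # covers [index * 200, (index + 1) * 200) µs
        self.hits = 0


class IOPredictor:
    def __init__(self):
        self.buckets = []   # Python list standing in for the linked list

    def record(self, duration_us):
        """Account for one completed I/O operation."""
        index = int(duration_us // BUCKET_US)
        for bucket in self.buckets:
            if bucket.index == index:
                break
        else:
            bucket = Bucket(index)
            self.buckets.append(bucket)  # assumption: new buckets go last
        bucket.hits += 1
        if bucket.hits % PROMOTE_EVERY == 0:
            # Assumption: promote on every fifth hit.
            self.buckets.remove(bucket)
            self.buckets.insert(0, bucket)

    def predict(self):
        """Return the (start_us, end_us) range of the best-scoring bucket."""
        best, best_score = None, -1.0
        for position, bucket in enumerate(self.buckets):
            score = bucket.hits / (1 + 2 * position)
            if score > best_score:
                best, best_score = bucket, score
        if best is None:
            return None
        return (best.index * BUCKET_US, (best.index + 1) * BUCKET_US)
```

After three operations under 200µs and four just over it, the first bucket still wins (3 against 4/3). A fifth operation in the second bucket moves it to the front of the list, and the prediction switches to the 200-400µs range.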

The idea is to try to guess the most likely completion time based on both long-term and recent history. According to Daniel, it works pretty well, yielding far better results than the existing menu governor. Even so, this design did not survive the discussion in the LPC microconference, so any version of this patch set that gets into the mainline kernel is likely to look somewhat different.

The issue was the use of a per-task data structure for I/O completion time tracking. The advantage of this approach is that, when a task moves between CPUs, the tracking information will move with it, so the scheduler on the new CPU has an immediate idea of how long that task's I/O operations will take. But I/O completion times are not really a task-specific parameter; they are, instead, tied to the underlying device. And, as it happens, the kernel is already tracking device performance in the block layer. That information reflects current loads and should be quite accurate; using it might make the entire bucket data structure unnecessary.

So a future version of this patch set will probably be recast along those lines, using the backing store information already maintained by the kernel. But there are other challenges looming for this code. As Peter Zijlstra pointed out, the block layer is increasingly trying to maintain locality in I/O requests, ensuring that the CPU handling an I/O completion is the one that initiated the operation. Better locality makes sense, Peter said, but it also can conflict with the scheduler's attempts at distributing load across a system. It is harder to guess at wakeup times if it's not clear which CPU will wake up to deal with a specific I/O completion. The multiqueue block layer code is going to make this problem worse; some work will have to be done to reconcile these differing approaches to performance.

But, even if it is not a complete solution to the problem, Daniel's patch achieves the goals of better sleep-time predictions and better integration of the cpuidle code with the scheduler. That should be sufficient to get some version of this code into the mainline — someday.

[Your editor would like to thank the Linux Foundation for supporting his travel to this event].


In a bind with binder

By Jonathan Corbet
October 29, 2014

Linux Plumbers Conference
The Android microconference at the 2014 Linux Plumbers Conference started off with an assessment of how the Android project was doing with regard to integration of its kernel changes with the mainline. The general feeling was that things are getting better; with 3.14, Android carries "only" 346 patches on top of the mainline kernel. But subsequent events have shown that the long-lasting tension between Android and mainline kernel developers has not entirely dissipated yet; the much-maligned Android "binder" mechanism is now one of the central points of that tension.

Many of the Android-specific components that are in the mainline kernel still live in the staging tree. Early in the session, Greg Kroah-Hartman got up to discuss whether any of those components should be moved out of staging — either out of the kernel tree entirely, or into the mainline proper. As an example of the first type, the Android "logger" module, which is no longer used as of the Lollipop release, may simply go away entirely. The story with binder is different, though.

Android's binder has a long history, having first shown up as part of the BeOS system. It is a mechanism for remote procedure calls and remote object management that is used heavily within the Android system. For an overview of how binder relates to other interprocess communication mechanisms, see this 2011 article by Neil Brown. Almost nobody seems to like binder; it is seen as abusing various low-level kernel interfaces, has known security problems when used outside of the tightly controlled Android setting, and more. But Android needs it, so it persists.

In the Plumbers session, Greg noted that binder had been in the staging tree for years. Since it is in active use, binder is not going away anytime soon. It is "horrible" and "broken by design," but it is an ABI that we need to support, Greg said, so we might as well move it out of the staging tree. No objections to the idea were raised in the session; everybody seemed happy with the idea of getting binder out of staging.

Greg wasted no time before posting a patch to move binder out of staging; it was on the lists before the Android microconference had concluded its business. Whether Greg expected the wider discussion to go as smoothly as the microconference is not clear; what is clear is that it did not.

John Stultz, who has done a lot of work toward the mainlining of Android-specific features, expressed a few concerns. The first of those had to do with maintenance: who was going to be the maintainer of this code, and had the Android developers agreed with that decision? Greg's response to this question was to note that the binder code had not changed in any significant way for a long time; there is not, he feels, a lot of maintenance that is needed. To the extent that binder needs a maintainer, Greg has volunteered to do it.

Another concern of John's had to do with efforts to replace binder with something better tied into the kernel; work is progressing on writing a binder-compatible library over the (still out-of-tree) kdbus interface. Moving binder into the mainline, he thought, might reduce the incentive to get that work done. Greg's answer here was that any such work is entirely new code; it doesn't mitigate the need to maintain the existing code "forever." "So as there really is nothing left to do on it, it deserves to be in the main part of the kernel source tree."

Finally, John worried that moving binder to the mainline might encourage others to make use of it; this was a concern that Alan Cox shared. Greg's response here is that there is never a way to control how others will use the software we ship. But, if anybody outside of Android were to use binder, he said, "you deserve all of the pain and suffering and rooted machines you will get."

Arnd Bergmann raised a number of issues mostly relating to security. Evidently Android does not use the full API exported by binder; he would like to see an audit of how the API is used so that the unneeded parts can be removed, reducing the attack surface of the whole thing. Binder also leaks kernel-space pointers into user space and has no awareness of namespaces, so it can also leak information between containers in undesirable ways. These points have not been addressed in the discussion so far.

Finally, Christoph Hellwig attempted to block the move outright, saying:

NAK. It's complete rubbish and does things to the FD code that it really shouldn't. Android needs to completely redo the interface, and there's been absolutely no work towards that.

Greg disagreed about the claim that no work was being done toward a new interface; he also repeated that, no matter how that work goes, the existing interface needs to be supported indefinitely. Christoph was not satisfied by the answer, though; here he represents a group of kernel developers who feel that the Android developers are still not really trying to work with the mainline kernel. From this point of view, merging substandard Android code just allows Google to offload the pain of maintaining it and encourages more of the same behavior in the future. About the only tool that the kernel developers have to address this problem, as they see it, is their ability to refuse to accept inadequate code.

At that point, the discussion wound down. Greg has not said whether he still plans to move binder regardless of the points that have been raised. In the end, it could be said to make little difference; binder will be shipped with the kernel tree whether it is moved or not. But the decision to move binder or not could send a message on how the kernel development community feels about its relationship with the Android team.

[Your editor would like to thank the Linux Foundation for supporting his travel to the Linux Plumbers Conference].


A desktop kernel wishlist

By Nathan Willis
October 29, 2014

GNOME developer Bastien Nocera recently shared a "wishlist" with the kernel mailing list, outlining a number of features that GNOME and other desktop-environment projects would like to see added to or enhanced in the kernel. In the resulting discussion, some of the wishlist items were subsequently crossed off, but many of the others sparked real discussion that could, in time, develop into mainline kernel features.

Nocera prefaced his list by saying that GNOME has had productive discussions with kernel developers in the past, and that the current list consisted of "items that kernel developers might not realise we'd like to rely on, or don't know that we'd make use of if merged." Most of these items fall into one of two categories: power management or filesystem features, although there are a handful of others in the mix as well.

Less power

Under power management, Nocera listed native support for hybrid suspend (known as Intel Rapid Start on certain hardware), connected standby support, "a hibernation implementation that doesn't use the swap space", and several smaller items (such as establishing uniform semantics for screen-backlight settings and better documentation for managing USB power).

Hybrid suspend is a firmware-level feature intended to minimize the time required to restore a system from the suspended state. It works by initially suspending the system to RAM, but then writing the memory contents to disk (directly from the firmware) after a given time period and hibernating. Resume will be fast before the timeout expires, but the system will preserve its state regardless of how long it is suspended. Matthew Garrett wrote a patch implementing support for the feature in mid-2013, which Nocera subsequently made a push to merge upstream; that push ultimately stalled. One of the critiques was that Intel's hybrid-suspend feature would eventually be made obsolete by connected standby, a newer idle-state power mode that allows certain background processes to continue. Garrett has also worked on an approach to connected standby.

John Stultz asked for clarification about one other power-management request: exporting the cause of a wake event. Nocera elaborated, saying that the goal was to be able to determine whether the machine was awakened by a user event or by a hardware event (such as the realtime clock alarm), in order to respond accordingly to different scenarios. The use case he cited was for user-space code to try and determine if it was a good time to run a previously scheduled backup: if the user woke the machine by opening the laptop lid, presumably it would be a bad time to start a lengthy backup process. If the wake was automatic, the backup should proceed.

Stultz, however, argued that reporting the cause of the wakeup would not truly satisfy the use case—in part because any number of wake events could take place (and, being asynchronous, could arrive in an unhelpful order) in between the kernel waking up and user-space code being able to run. But more importantly, as Zygo Blaxnell put it, which event was most recent is far less important than (for example) whether or not the user is actually using the machine—a fact that could be determined through other means, such as keyboard activity. Alan Cox, on the other hand, commented that, in the long term, most of the assumptions that go with current thinking about suspend, resume, and hibernation states may go away anyhow:

- There may be no such thing as suspend or resume, just make your code very well behaved on wakeup events, and closing unneeded devices/resources whenever it can.

- On/off is an extreme action rarely taken (feature parity with 1970s VAXen ;-) )

- The "blob with a lid" model of construction is no longer useful. Even a keyboarded device is quite likely have a removable keyboard.

Filesystem issues

The second large category of feature requests concerned filesystem features and the VFS layer. First, Nocera reported that inotify is not meeting the needs of desktop utilities in a number of ways. Performance on large directory structures consumes too many resources, file-creation notification requires watching an entire directory, and monitoring directories' file-renaming and removal events is expensive on large directories. These limitations impact the performance of filesystem indexers, backup tools, and programs that manage file "libraries" (e.g., music and video managers).

Sergey Davidoff of elementary OS elaborated on the subject, saying that desktop application developers are keen to move to the file-library concept (as used in music and video managers) in other application types as well. Presenting the user with the filesystem hierarchy, he said, is far less useful than intelligently tracking the relevant files and allowing the user to search and interact with them based on their metadata. fanotify, as both noted, lacks the proper level of detail, such as reporting rename and move events.

Nocera also asked for a way to propagate timestamp changes up a directory chain. That is, if a file located in /foo/bar/ changes, there would be a way to detect the change not only on the file itself, but also on /foo/bar and on /foo itself. He added, though, that simply updating the change time on the containing directories would clearly be the wrong solution, since it would break many programs.

In short, he said, user-space programs would benefit from a better file-change notification system—ideally one that would consolidate events and monitor a directory structure without re-crawling it periodically. The combination of improving fanotify and adding user-space glue might work, he said, as would adding the changelog features currently available in Btrfs and XFS to other filesystems.

Pavel Machek asked whether or not adding a (hypothetical) recursive version of the mtime timestamp would be a possible solution. Davidoff replied with skepticism that it would work for monitoring online changes, but that monitoring Btrfs's changelog "more or less as it happens" would probably suffice. In particular, monitoring Btrfs changelogs on the fly could at least ensure that a fixed-size buffer would get the job done, as opposed to the unbounded memory that fanotify would require for the same task.
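What a recursive timestamp would mean can be shown with a small user-space model. Nothing like this exists in the kernel; the class below is purely hypothetical, uses a simple counter in place of real time, and keeps its stamps separate from the ordinary change time, since Nocera noted that touching that value on containing directories would break programs.

```python
from pathlib import PurePosixPath


class ChangeTracker:
    """Toy model of a 'recursive mtime' kept apart from the real one."""

    def __init__(self):
        self.clock = 0             # logical clock standing in for time
        self.recursive_stamp = {}  # directory -> latest change below it

    def file_changed(self, path):
        """Stamp every ancestor directory of the changed file."""
        self.clock += 1
        for parent in PurePosixPath(path).parents:
            self.recursive_stamp[str(parent)] = self.clock
        return self.clock

    def changed_since(self, directory, stamp):
        """True if anything below 'directory' changed after 'stamp'."""
        return self.recursive_stamp.get(directory, 0) > stamp
```

A change to /foo/bar/a.txt stamps /foo/bar, /foo, and /; a later change under /foo/baz advances the stamp on /foo while leaving /foo/bar alone, so an indexer could skip the unchanged subtree without crawling it.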

Miscellany

Nocera's final cluster of wishlist items is a bit of a grab bag. It includes a better user-space API for the industrial input/output (IIO) subsystem used for various sensors, a user-space helper for the out-of-memory (OOM) killer, a system call to poll whether any processes out of a set of processes have exited, and a variant of epoll_wait() that accepts an absolute time rather than a timeout.

As was the case with the other categories, some of these items sparked quick responses. Patrik Lundquist suggested that the desired epoll_wait() functionality could be achieved with timerfd(). To that, Nocera quoted the original source of the request, Ryan Lortie, who said that making a separate call to set up timerfd() on every instance of entering the kernel to sleep is cumbersome, and that "epoll in general suffers from being _way_ too chatty about the syscalls that you have to do." Andy Lutomirski added that he had implemented procfs polling several years ago, and would be willing to resume work on it if it were useful. procfs polling would allow a user to open a set of /proc directories corresponding to a chosen set of processes, then poll the directories to see if any exit.
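The difference between a timeout and an absolute time is easy to see from user space. The helper below is a hypothetical approximation written with Python's standard selectors module rather than the requested system call: the caller fixes a deadline once, and the relative timeout is recomputed on every wait.

```python
import selectors
import socket
import time


def select_until(selector, deadline):
    """Wait for events until an absolute time.monotonic() deadline.

    The relative timeout is recomputed on every call, so a loop that
    keeps being woken early never pushes the deadline back.
    """
    remaining = deadline - time.monotonic()
    return selector.select(timeout=max(0.0, remaining))


if __name__ == "__main__":
    reader, writer = socket.socketpair()
    selector = selectors.DefaultSelector()
    selector.register(reader, selectors.EVENT_READ)

    deadline = time.monotonic() + 0.05  # fixed once for the whole loop
    writer.send(b"x")
    while time.monotonic() < deadline:
        for key, _ in select_until(selector, deadline):
            key.fileobj.recv(16)        # drain so the next wait can block
    print("deadline reached")

    selector.close()
    reader.close()
    writer.close()
```

Every pass through the loop still pays for a clock read and a subtraction in user space, which is the sort of bookkeeping that an absolute-time variant of epoll_wait() would move into the kernel.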

As to the other suggestions, there has thus far been little reaction either way. Certainly a number of the wishlist items boil down to implementing friendlier user-space APIs. Nocera commented on both the IIO and wake-event reporting issues that the present-day interface of examining raw sysfs files is far from sufficient.

On the whole, though, the kernel community has clearly been receptive to these needs of desktop-environment projects. On the wishlist wiki page, Nocera likened the exercise to the "plumbers' wishlists" submitted to kernel developers by Kay Sievers, Lennart Poettering, and Harald Hoyer. The plumbers' wishlist approach, of course, was successful enough that the plumbers in question have since repeated it. That bodes well for Nocera's desktop wishlist.

It is fairly common to hear about new kernel work that is driven by the needs of either high-end data center users or embedded-system builders, perhaps simply because companies in those lines of business tend to hire kernel developers. Thus, it is always good to see that the kernel community is equally responsive to the needs of user-space developers working in other areas, when those developers take time to reach out with their concerns.


Patches and updates

Kernel trees

Linus Torvalds: Linux 3.18-rc2
Jiri Slaby: Linux 3.12.31


Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds