
Kernel development

Brief items

Kernel release status

The current development kernel is 3.10-rc4, released on June 2. Linus says: "Anyway, rc4 is smaller than rc3 (yay!). But it could certainly be smaller still (boo!). There's the usual gaggle of driver fixes (drm, pinctrl, scsi target, fbdev, xen), but also filesystems (cifs, xfs, with small fixes to reiserfs and nfs)."

Stable updates: 3.2.46 was released on May 31.


Quotes of the week

Our review process is certainly not perfect when you have to wait for stuff to break in linux-next before you get people to notice the problems.
Arnd Bergmann

I have recently learned, from a very reliable source, that ARM management seriously dislikes the Lima driver project. To put it nicely, they see no advantage in an open source driver for the Mali, and believe that the Lima driver is already revealing way too much of the internals of the Mali hardware. Plus, their stance is that if they really wanted an open source driver, they could simply open up their own codebase, and be done.

Really?

Luc Verhaegen


Kernel development news

The multiqueue block layer

By Jonathan Corbet
June 5, 2013
The kernel's block layer is charged with managing I/O to the system's block ("disk drive") devices. It was designed in an era when a high-performance drive could handle hundreds of I/O operations per second (IOPs); the fact that it tends to fall down with modern devices, capable of handling possibly millions of IOPs, is thus not entirely surprising. It has been known for years that significant changes would need to be made to enable Linux to perform well on fast solid-state devices. The shape of those changes is becoming clearer as the multiqueue block layer patch set, primarily the work of Jens Axboe and Shaohua Li, gets closer to being ready for mainline merging.

The basic structure of the block layer has not changed a whole lot since it was described for 2.6.10 in Linux Device Drivers. It offers two ways for a block driver to hook into the system, one of which is the "request" interface. When run in this mode, the block layer maintains a simple request queue; new I/O requests are submitted to the tail of the queue and the driver receives requests from the head. While requests sit in the queue, the block layer can operate on them in a number of ways: they can be reordered to minimize seek operations, adjacent requests can be coalesced into larger operations, and policies for fairness and bandwidth limits can be applied, for example.
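The request queue described above is, at its heart, a list that the block layer can reorder and merge before the driver sees it. A toy userspace model of the coalescing step may make the idea concrete; all of the names here (`submit_request()`, `fetch_request()`) are invented for illustration, and the real kernel structures (`struct request`, `struct request_queue`) are far richer:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of a request-mode queue: a singly linked list where new
 * requests are added at the tail and the driver consumes from the
 * head.  Names and structures are invented for illustration. */
struct request {
    unsigned long start;   /* first sector */
    unsigned long nsect;   /* number of sectors */
    struct request *next;
};

struct request_queue {
    struct request *head, *tail;
};

/* Add a request at the tail, merging it into the previous tail when
 * the two are adjacent on the device, as the block layer's elevator
 * code does. */
static void submit_request(struct request_queue *q,
                           unsigned long start, unsigned long nsect)
{
    if (q->tail && q->tail->start + q->tail->nsect == start) {
        q->tail->nsect += nsect;        /* back-merge */
        return;
    }
    struct request *rq = malloc(sizeof(*rq));
    rq->start = start;
    rq->nsect = nsect;
    rq->next = NULL;
    if (q->tail)
        q->tail->next = rq;
    else
        q->head = rq;
    q->tail = rq;
}

/* The driver receives requests from the head of the queue. */
static struct request *fetch_request(struct request_queue *q)
{
    struct request *rq = q->head;
    if (rq) {
        q->head = rq->next;
        if (!q->head)
            q->tail = NULL;
    }
    return rq;
}
```

In this model, two adjacent eight-sector writes reach the driver as a single sixteen-sector request, reducing the per-operation overhead as described above.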

This request queue turns out to be one of the biggest bottlenecks in the entire system. It is protected by a single lock which, on a large system, will bounce frequently between the processors. It is a linked list, a notably cache-unfriendly data structure especially when modifications must be made — as they frequently are in the block layer. As a result, anybody who is trying to develop a driver for high-performance storage devices wants to do away with this request queue and replace it with something better.

The second block driver mode — the "make request" interface — allows a driver to do exactly that. It hooks the driver into a much higher part of the stack, shorting out the request queue and handing I/O requests directly to the driver. This interface was not originally intended for high-performance drivers; instead, it is there for stacked drivers (the MD RAID implementation, for example) that need to process requests before passing them on to the real, underlying device. Using it in other situations incurs a substantial cost: all of the other queue processing done by the block layer is lost and must be reimplemented in the driver.

The multiqueue block layer work tries to fix this problem by adding a third mode for drivers to use. In this mode, the request queue is split into a number of separate queues:

  • Submission queues are set up on a per-CPU or per-node basis. Each CPU submits I/O operations into its own queue, with no interaction with the other CPUs. Contention for the submission queue lock is thus eliminated (when per-CPU queues are used) or greatly reduced (for per-node queues).

  • One or more hardware dispatch queues simply buffer I/O requests for the driver.
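The flow between the two queue levels can be sketched as a toy userspace model: each CPU feeds its own submission queue with no shared lock, and a dispatch step moves requests into the hardware queue up to a driver-specified depth. Everything here (`submit_on_cpu()`, `dispatch()`, the structure names) is invented for illustration and is not the in-kernel API:

```c
#include <assert.h>

#define NR_CPUS   4
#define QUEUE_LEN 64

/* Toy model of the multiqueue split: one submission queue per CPU,
 * drained into a hardware dispatch queue with a driver-imposed depth
 * limit.  All names are invented for illustration. */
struct sw_queue {
    int tags[QUEUE_LEN];
    int count;
};

struct hw_queue {
    int tags[QUEUE_LEN * NR_CPUS];
    int count;
    int depth;          /* maximum outstanding requests, set by the driver */
};

static struct sw_queue submit_queues[NR_CPUS];

/* Each CPU submits into its own queue, so no lock is shared with the
 * other CPUs. */
static void submit_on_cpu(int cpu, int tag)
{
    struct sw_queue *sq = &submit_queues[cpu];
    sq->tags[sq->count++] = tag;
}

/* Move requests into the hardware queue, stopping at its depth limit;
 * returns the number of requests moved. */
static int dispatch(struct hw_queue *hq)
{
    int moved = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        struct sw_queue *sq = &submit_queues[cpu];
        while (sq->count > 0 && hq->count < hq->depth) {
            hq->tags[hq->count++] = sq->tags[--sq->count];
            moved++;
        }
    }
    return moved;
}
```

When the hardware queue fills, remaining requests simply wait in their submission queues until the device completes some work, which is where fairness or bandwidth-cap policies could be applied.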

While requests are in the submission queue, they can be operated on by the block layer in the usual manner. Reordering of requests for locality offers little or no benefit on solid-state devices; indeed, spreading requests out across the device might help with the parallel processing of requests. So reordering will not be done, but coalescing requests will reduce the total number of I/O operations, improving performance somewhat. Since the submission queues are per-CPU, there is no way to coalesce requests submitted to different queues. With no empirical evidence whatsoever, your editor would guess that adjacent requests are most likely to come from the same process and, thus, will automatically find their way into the same submission queue, so the lack of cross-CPU coalescing is probably not a big problem.

The block layer will move requests from the submission queues into the hardware queues up to the maximum number specified by the driver. Most current devices will have a single hardware queue, but high-end devices already support multiple queues to increase parallelism. On such a device, the entire submission and completion path should be able to run on the same CPU as the process generating the I/O, maximizing cache locality (and, thus, performance). If desired, fairness or bandwidth-cap policies can be applied as requests move to the hardware queues, but there will be an associated performance cost. Given the speed of high-end devices, it may not be worthwhile to try to ensure fairness between users; everybody should be able to get all the I/O bandwidth they can use.

This structure makes the writing of a high-performance block driver (relatively) simple. The driver provides a queue_rq() function to handle incoming requests and calls back to the block layer when requests complete. Those wanting to look at an example of how such a driver would work can see null_blk.c in the new-queue branch of Jens's block repository:
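The shape of that driver contract can be modeled in a few lines of userspace C: the driver supplies a `queue_rq()`-style function through an operations structure and signals completion back to the block layer, and a null_blk-style driver simply completes every request without doing any I/O. The structures and names below are invented for illustration, not the actual in-kernel interface:

```c
#include <assert.h>

/* Toy model of the multiqueue driver contract.  The real interface
 * (see null_blk.c in Jens Axboe's tree) is richer; these names are
 * invented for illustration. */
struct mq_request {
    int tag;
    int done;
};

/* "Block layer" completion hook: real code would do tag and statistics
 * bookkeeping here. */
static void blk_complete(struct mq_request *rq)
{
    rq->done = 1;
}

struct mq_ops {
    int (*queue_rq)(struct mq_request *rq);
};

/* A null_blk-style driver: accept every request and complete it
 * immediately, performing no actual I/O. */
static int null_queue_rq(struct mq_request *rq)
{
    blk_complete(rq);
    return 0;
}

static const struct mq_ops null_ops = { .queue_rq = null_queue_rq };

/* The "block layer" hands a request to the driver. */
static int run_request(const struct mq_ops *ops, struct mq_request *rq)
{
    return ops->queue_rq(rq);
}
```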

    git://git.kernel.dk/linux-block.git

In the current patch set, the multiqueue mode is offered in addition to the existing two modes, so current drivers will continue to work without change. According to this paper on the multiqueue block layer design [PDF], the hope is that drivers will migrate over to the multiqueue API, allowing the eventual removal of the request-based mode.

This patch set has been significantly reworked in the last month or so; it has gone from a relatively messy series into something rather cleaner. Merging into the mainline would thus appear to be on the agenda for the near future. Since use of this API is optional, existing drivers should continue to work and this merge could conceivably happen as early as 3.11. But, given that the patch set has not yet been publicly posted to any mailing list and does not appear in linux-next, 3.12 seems like a more likely target. Either way, Linux seems likely to have a much better block layer by the end of the year or so.


Toward reliable user-space OOM handling

By Jonathan Corbet
June 5, 2013
A visit from the kernel's out-of-memory (OOM) killer is usually about as welcome as a surprise encounter with the tax collector. The OOM killer is called in when the system runs out of memory and cannot progress without killing off one or more processes; it is the embodiment of a frequently-changing set of heuristics describing which processes can be killed for maximum memory-freeing effect and minimal damage to the system as a whole. One would not think that this would be a job that is amenable to handling in user space, but there are some users who try to do exactly that, with some success. That said, user-space OOM handling is not as safe as some users would like, but there is not much consensus on how to make it more robust.

User-space OOM handling

The heaviest user of user-space OOM handling, perhaps, is Google. Due to the company's desire to get the most out of its hardware, Google's internal users tend to be packed tightly into their servers. Memory control groups (memcgs) are used to keep those users from stepping on each other's toes. Like the system as a whole, a memcg can go into the OOM condition, and the kernel responds in the same way: the OOM killer wakes up and starts killing processes in the affected group. But, since an OOM situation in a memcg does not threaten the stability of the system as a whole, the kernel allows a bit of flexibility in how those situations are handled. Memcg-level OOM killing can be disabled altogether, and there is a mechanism by which a user-space process can request notification when a memcg hits the OOM wall.
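That notification mechanism delivers events through an eventfd: the handler registers its eventfd along with the group's memory.oom_control file by writing both descriptors to cgroup.event_control, after which the kernel bumps the eventfd counter on each OOM event. The registration step needs a mounted memory controller and is omitted here, but the eventfd mechanics themselves can be shown in a small, hedged sketch (the function names are invented, and `fake_oom_event()` stands in for the kernel):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Sketch of the eventfd side of memcg OOM notification.  A handler
 * would block in read() on its eventfd; the kernel increments the
 * counter once per OOM event, and the read returns the accumulated
 * count.  fake_oom_event() stands in for the kernel here. */
static int oom_wait(int efd, uint64_t *events)
{
    return read(efd, events, sizeof(*events)) == sizeof(*events) ? 0 : -1;
}

static int fake_oom_event(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof(one)) == sizeof(one) ? 0 : -1;
}
```

In the real setup, the handler would write "&lt;eventfd&gt; &lt;oom_control fd&gt;" into the group's cgroup.event_control file before waiting; everything after that point behaves as above.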

Said notification mechanism is designed around the needs of a global, presumably privileged process that manages a bunch of memcgs on the system; that process can respond by raising memory limits, moving processes to different groups, or doing some targeted process killing of its own. But Google's use case turns out to be a little different: each internal Google user is given the ability (and responsibility) to handle OOM conditions within that user's groups. This approach can work, but there are a couple of traps that make it less reliable than some might like.

One of those is that, since users are doing their own OOM handling, the OOM handler process itself will be running within the affected memcg and will be subject to the same memory allocation constraints. So if the handler needs to allocate memory while responding to an OOM problem, it will block and be put on the list of processes waiting for the OOM situation to be resolved; this is, essentially, a deadlocking of the entire memcg. One can try to avoid this problem by locking pages into memory and such, but, in the end, it is quite hard to write a user-space program that is guaranteed not to cause memory allocations in the kernel. Simply reading a /proc file to get a handle on the situation can be enough to bring things to a halt.

The other problem is that the process whose allocation puts the memcg into an OOM condition in the first place may be running fairly deeply within the kernel and may hold any number of locks when it is made to wait. The mmap_sem semaphore seems to be especially problematic, since it is often held in situations where memory is being allocated — page fault handling, for example. If the OOM handler process needs to do anything that might acquire any of the same locks, it will block waiting for exactly the wrong process, once again creating a deadlock.

The end result is that user-space OOM killing is not 100% reliable and, arguably, can never be. As far as Google is concerned, somewhat unreliable OOM handling is acceptable, but deadlocks when OOM killing goes bad are not. So, back in 2011, David Rientjes posted a patch establishing a user-configurable OOM killer delay. With that delay set, the (kernel) OOM killer will wait for the specified time for an OOM situation to be resolved by the user-space handler before it steps in and starts killing off processes. This mechanism gives the user-space handler a window within which it can try to work things out; should it deadlock or otherwise fail to get the job done in time, the kernel will take over.

David's patch was not merged at that time; the general sentiment seemed to be that it was just a workaround for user-space bugs that would be better fixed at the source. At the time, David said that Google would carry the patch internally if need be, but that he thought others would want the same functionality as the use of memcgs increased. More than two years later, he is trying again, but the response is not necessarily any friendlier this time around.

Alternatives to delays

Some developers responded that running the OOM handler within the control group it manages is a case of "don't do that," but, once David explained that users are doing their own OOM handling, they seemed to back down a bit on that one. There still seems to be a bit of a sentiment that the OOM handler should be locked into memory and should avoid performing memory allocations. In particular, OOM time seems a bit late to be reading /proc files to get a picture of which processes are running in the system. The alternative, though, is to trace process creation in each memcg, which has performance issues of its own.

Some constructive thoughts came from Johannes Weiner, who had a couple of ideas for improving the current situation. One of those was a patch intended to solve the problem of processes waiting for OOM resolution while holding an arbitrary set of locks. This patch makes two changes, the first of which comes into play when a problematic allocation is the direct result of a system call. In this case, the allocating process will not be placed in the OOM wait queue at all; instead, the system call will simply fail with an ENOMEM error. This solves most of the problem, but at a cost: system calls that might once have worked will start returning an error code that applications might not be expecting. That could cause strange behavior, and, given that the OOM situation is rare, such behavior could be hard to uncover with testing.
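If system calls can start failing with ENOMEM under memcg pressure, applications have to treat that error as transient rather than fatal. A hypothetical retry wrapper sketches the defensive pattern; the operation is abstracted as a function pointer purely for illustration, and `flaky_op()` is a stand-in that fails twice before succeeding:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical wrapper: retry an operation that may transiently fail
 * with ENOMEM while a user-space OOM handler resolves the situation.
 * Any other error is treated as a real failure. */
static int retry_on_enomem(int (*op)(void *), void *arg, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (op(arg) == 0)
            return 0;
        if (errno != ENOMEM)
            return -1;          /* some other, real failure */
        /* a real program would wait for the OOM handler here */
    }
    errno = ENOMEM;
    return -1;
}

/* Stand-in operation: fails with ENOMEM twice, then succeeds. */
static int attempts;
static int flaky_op(void *arg)
{
    (void)arg;
    if (++attempts < 3) {
        errno = ENOMEM;
        return -1;
    }
    return 0;
}
```

The hard part, of course, is that existing applications contain no such loops, which is exactly the behavioral risk described above.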

The other part of the patch changes the page fault path. In this case, just failing with ENOMEM is not really an option; that would result in the death of the faulting process. So the page fault code is changed to make a note of the fact that it hit an OOM situation and return; once the call stack has been unwound and any locks are released, it will wait for resolution of the OOM problem. With these changes in place, most (or all) of the lock-related deadlock problems should hopefully go away.

That doesn't solve the other problem, though: if the OOM handler itself tries to allocate memory, it will be put on the waiting list with everybody else and the memcg will still deadlock. To address this issue, Johannes suggested that the user-space OOM handler could more formally declare its role to the kernel. Then, when a process runs into an OOM problem, the kernel can check if it's the OOM handler process; in that case, the kernel OOM handler would be invoked immediately to deal with the situation. The end result in this case would be the same as with the timeout, but it would happen immediately, with no need to wait.

Michal Hocko favored Johannes's changes, but had an additional suggestion: implement a global watchdog process. This process would receive OOM notifications at the same time the user's handler does; it would then start a timer and wait for the OOM situation to be resolved. If time runs out, the watchdog would kill the user's handler and re-enable kernel-provided OOM handling in the affected memcg. In his view, the problem can be handled in user space, so that's where the solution should be.

With some combination of these changes, it is possible that the problems with user-space OOM-handler deadlocks will be solved. In that case, perhaps, Google's delay mechanism will no longer be needed. Of course, that will not be the end of the OOM-handling discussion; as far as your editor can tell, that particular debate is endless.


Power-aware scheduling meets a line in the sand

By Jonathan Corbet
June 5, 2013
As mobile and embedded processors get more complex — and more numerous — the interest in improving the power efficiency of the scheduler has increased. While a number of power-related scheduler patches exist, none seem all that close to merging into the mainline. Getting something upstream always looked like a daunting task; scheduler changes are hard to make in general, these changes come from a constituency that the scheduler maintainers are not used to serving, and the existence of competing patches muddies the water somewhat. But now it seems that the complexity of the situation has increased again, to the point that the merging of any power-efficiency patches may have gotten even harder.

The current discussion started at the end of May, when Morten Rasmussen posted some performance measurements comparing a few of the existing patch sets. The idea was clearly to push the discussion forward so that a decision could be made regarding which of those patches to push into the mainline. The numbers were useful, showing how the patch sets differ over a small set of workloads, but the apparent final result is unlikely to be pleasing to any of the developers involved: it is entirely possible that none of those patch sets will be merged in anything close to their current form, after Ingo Molnar posted a strongly-worded "line in the sand" message on how power-aware scheduling should be designed.

Ingo's complaint is not really about the current patches; instead, he is unhappy with how CPU power management is implemented in the kernel now. Responsibility for CPU power management is currently divided among three independent components:

  • The scheduler itself clearly has a role in the system's power usage characteristics. Features like deferrable timers and suppressing the timer tick when idle have been added to the scheduler over the years in an attempt to improve the power efficiency of the system.

  • The CPU frequency ("cpufreq") subsystem regulates the clock frequency of the processors in response to each processor's measured idle time. If the processor is idle much of the time, the frequency (and, thus, power consumption) can be lowered; an always-busy processor, instead, should run at a higher frequency if possible. Most systems probably use the on-demand cpufreq governor, but others exist. The big.LITTLE switcher operates at this level by disguising the difference between "big" and "little" processors to look like a wide range of frequency options.

  • The cpuidle subsystem is charged with managing processor sleep states. One might be tempted to regard sleeping as just another frequency option (0Hz, to be exact), but sleep is rather more complicated than that. Contemporary processors have a wide range of sleep states, each of which differs in the amount of power consumed, the damage inflicted upon CPU caches, and the time required to enter and leave that state.
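The flavor of the cpuidle decision can be captured in a small model: pick the deepest sleep state whose minimum worthwhile residency fits within the predicted idle period, a simplified version of the tradeoff the menu governor makes. The state table below is invented for illustration; real tables come from the processor and vary widely:

```c
#include <assert.h>

/* Toy model of a cpuidle-style decision: choose the deepest sleep
 * state whose target residency fits the predicted idle period.  The
 * numbers here are invented for illustration. */
struct sleep_state {
    int power_mw;            /* power drawn in this state */
    int target_residency_us; /* minimum idle time to be worthwhile */
};

static const struct sleep_state states[] = {
    { 1000,    0 },   /* shallow: clock gating only */
    {  500,   50 },
    {  100,  500 },
    {   10, 5000 },   /* deep sleep: caches lost, slow exit */
};

#define NR_STATES (int)(sizeof(states) / sizeof(states[0]))

/* Return the index of the deepest usable state for the predicted idle
 * duration (in microseconds). */
static int pick_state(int predicted_idle_us)
{
    int best = 0;
    for (int i = 1; i < NR_STATES; i++)
        if (states[i].target_residency_us <= predicted_idle_us)
            best = i;
    return best;
}
```

The real governor's job is harder than this sketch suggests, since the predicted idle duration itself must be guessed from past behavior, and a wrong guess can cost both power and latency.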

Ingo's point is that splitting the responsibility for power management decisions among three components leads to a situation where no clear policy can be implemented:

Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly. Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'.

He would like to see a new design wherein the responsibility for all of these aspects of CPU operation has been moved into the scheduler itself. That, he claims, is where the necessary knowledge about the current workload and CPU topology lives, so that is where the decisions should be made. Any power-related patches, he asserts, must move the system in that direction:

This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles.

Needless to say, none of the current patch sets include a fundamental redesign of the scheduler, cpuidle, and cpufreq subsystems. So, for all practical purposes, all of those patches have just been rejected in their current form — probably not the result the developers of those patches were hoping for.

Morten responded with a discussion of the kinds of issues that an integrated power-aware scheduler would have to deal with. It starts with basic challenges like defining scheduling policies for power-efficient operation and defining a mechanism by which a specific policy can be chosen and implemented. There would be a need to represent the system's power topology within the scheduler; that topology might not match the cache hierarchy represented by the existing scheduling domains data structure. Thermal management, which often involves reducing CPU frequencies or powering down processors entirely, would have to be factored in. And so on. In summary, Morten said:

This is not a complete list. My point is that moving all policy to the scheduler will significantly increase the complexity of the scheduler. It is my impression that the general opinion is that the scheduler is already too complicated. Correct me if I'm wrong.

In his view, the existing patch sets are part of an incremental solution to the problem and a step toward the overall goal. Whether Ingo will see things the same way is, as of this writing, unclear. His words were quite firm, but lines in the sand are also relatively easy to relocate. If he holds fast to his expressed position, though, the addition of power-aware scheduling could be delayed indefinitely.

It is not unheard of for subsystem maintainers to insist on improvements to existing code as a precondition to merging a new feature. At past kernel summits, such requirements have been seen as being unfair, but they sometimes persist anyway. In this case, Ingo's message, on its face, demands a redesign of one of the most complex core kernel subsystems before (more) power awareness can be added. That is a significant raising of the bar for developers who were already struggling to get their code looked at and merged. A successful redesign on that scale is unlikely to happen unless the current scheduler maintainers put a fair amount of their own time into the requested redesign.

The cynical among us could certainly see this position as an easy way to simply make the power-aware scheduling work go away. That is certainly an incorrect interpretation, though. The more straightforward explanation — that the scheduler maintainers want to see the code get better and more maintainable over time — is far more likely. What has to happen now is the identification of a path toward that better scheduler that allows for power management improvements in the short term. The alternative is to see the power-aware scheduler code relegated to vendor and distributor trees, which seems like a suboptimal outcome.


Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds