Brief items
The current development kernel is 3.10-rc4,
released on June 2. Linus says:
"
Anyway, rc4 is smaller than rc3 (yay!). But it could certainly be
smaller still (boo!). There's the usual gaggle of driver fixes (drm,
pinctrl, scsi target, fbdev, xen), but also filesystems (cifs, xfs, with
small fixes to reiserfs and nfs)."
Stable updates: 3.2.46 was released
on May 31.
Our review process is certainly not perfect when you have to wait
for stuff to break in linux-next before you get people to notice
the problems.
— Arnd Bergmann
I have recently learned, from a very reliable source, that ARM
management seriously dislikes the Lima driver project. To put it
nicely, they see no advantage in an open source driver for the
Mali, and believe that the Lima driver is already revealing way too
much of the internals of the Mali hardware. Plus, their stance is
that if they really wanted an open source driver, they could simply
open up their own codebase, and be done.
Really?
— Luc Verhaegen
Kernel development news
The multiqueue block layer
By Jonathan Corbet
June 5, 2013
The kernel's block layer is charged with managing I/O to the system's block
("disk drive") devices. It was designed in an era when a high-performance
drive could handle hundreds of I/O operations per second (IOPs); the fact
that it tends to fall down with modern devices, capable of handling
possibly millions of IOPs, is thus not entirely surprising. It has been
known for years that significant changes would need to be made to enable
Linux to perform well on fast solid-state devices. The shape of those
changes is becoming clearer as the multiqueue block layer patch set,
primarily the work of Jens Axboe and Shaohua Li, gets closer to being ready
for mainline merging.
The basic structure of the block layer has not changed a whole lot since it
was described for 2.6.10 in Linux Device
Drivers. It offers two ways for a block driver to hook into the
system, one of which is the "request" interface. When run in this mode,
the block layer maintains a simple request queue; new I/O requests are
submitted to the tail of the queue and the driver receives requests from
the head. While requests sit in the queue, the block layer can operate on
them in a number of ways: they can be reordered to minimize seek
operations, adjacent requests can be coalesced into larger operations, and
policies for fairness and bandwidth limits can be applied, for example.
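For illustration, a minimal request-mode driver might look like the
following sketch, using the 3.10-era API; the my_* names are hypothetical
and all device-specific work is elided:

    #include <linux/blkdev.h>
    #include <linux/init.h>

    static DEFINE_SPINLOCK(my_queue_lock);

    /* Called by the block layer, with the queue lock held, whenever
     * there are requests to process. */
    static void my_request_fn(struct request_queue *q)
    {
        struct request *req;

        /* Take requests from the head of the queue, one at a time. */
        while ((req = blk_fetch_request(q)) != NULL) {
            /* ... program the hardware to transfer the data ... */
            __blk_end_request_all(req, 0);    /* 0 == success */
        }
    }

    static int __init my_init(void)
    {
        struct request_queue *q;

        /* The block layer queues, sorts, and merges requests before
         * handing them to my_request_fn(). */
        q = blk_init_queue(my_request_fn, &my_queue_lock);
        if (!q)
            return -ENOMEM;
        /* ... allocate a gendisk, attach q to it, add_disk() ... */
        return 0;
    }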
This request queue turns out to be one of the biggest bottlenecks in the
entire system. It is protected by a single lock which, on a large system,
will bounce frequently between the processors. It is a linked list, a
notably cache-unfriendly data structure especially when modifications must
be made —
as they frequently are in the block layer. As a result, anybody who is
trying to develop a driver for high-performance storage devices wants to do
away with this request queue and replace it with something better.
The second block driver mode — the "make request" interface — allows a
driver to do exactly that. It hooks the driver into a much higher part
of the stack, shorting out the request queue and handing I/O requests
directly to the driver. This interface was not originally intended for
high-performance drivers; instead, it is there for stacked drivers (the MD
RAID implementation, for example) that need to process requests before
passing them on to the real, underlying device. Using it in other
situations incurs a substantial cost: all of the other queue processing
done by the block layer is lost and must be reimplemented in the driver.
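A make-request driver instead sees each I/O request, in the form of a bio
structure, as soon as it is submitted. A minimal sketch against the
3.10-era interface (again with hypothetical my_* names) looks like:

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    /* Called from the top of the block layer for every bio submitted;
     * there is no request queue, and no sorting or merging. */
    static void my_make_request(struct request_queue *q, struct bio *bio)
    {
        /* A stacked driver would remap the bio and resubmit it with
         * generic_make_request(); a real driver would start the I/O.
         * Here the bio is simply completed. */
        bio_endio(bio, 0);        /* 0 == success */
    }

    static int __init my_init(void)
    {
        struct request_queue *q = blk_alloc_queue(GFP_KERNEL);

        if (!q)
            return -ENOMEM;
        blk_queue_make_request(q, my_make_request);
        /* ... attach the queue to a gendisk and add_disk() ... */
        return 0;
    }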
The multiqueue block layer work tries to fix this problem by adding a third
mode for drivers to use. In this mode, the request queue is split into a
number of separate queues:
- Submission queues are set up on a per-CPU or per-node basis. Each CPU
submits I/O operations into its own queue, with no interaction with the
other CPUs. Contention for the submission queue lock is thus
eliminated (when per-CPU queues are used) or greatly reduced (for
per-node queues).
- One or more hardware dispatch queues simply buffer I/O requests for
the driver.
While requests are in the submission queue, they can be operated on by the
block layer in the usual manner. Reordering of requests for locality
offers little or no benefit on solid-state devices; indeed, spreading
requests out across the device
might help with the parallel processing of requests. So reordering will
not be done, but coalescing requests will reduce the total number of I/O
operations, improving performance somewhat. Since the submission queues
are per-CPU, there is no way to coalesce requests submitted to different
queues. With no empirical evidence whatsoever, your editor would guess
that adjacent requests are most likely to come from the same process and,
thus, will automatically find their way into the same submission queue, so
the lack of cross-CPU coalescing is probably not a big problem.
The block layer will move requests from the submission queues into the
hardware queues up to the maximum number specified by the driver. Most
current devices will have a single hardware queue, but high-end devices
already support multiple queues to increase parallelism. On such a device,
the entire submission and completion path should be able to run on the same
CPU as the process generating the I/O, maximizing cache locality (and,
thus, performance). If desired, fairness or bandwidth-cap policies can be
applied as requests move to the hardware queues, but there will be an
associated performance cost. Given the speed of high-end devices, it may
not be worthwhile to try to ensure fairness between users; everybody should
be able to get all the I/O bandwidth they can use.
This structure makes the writing of a high-performance block driver
(relatively) simple. The driver provides a queue_rq() function to
handle incoming requests and calls back to the block layer when requests
complete. Those wanting to look at an example of how such a driver would
work can see null_blk.c in the
new-queue branch of Jens's block repository:
git://git.kernel.dk/linux-block.git
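The patch set was still evolving as of this writing, so the following is
only a sketch, modeled on the interface as it later settled; structure and
constant names such as blk_mq_ops, blk_mq_reg, and BLK_MQ_RQ_QUEUE_OK come
from that later form and may not match the new-queue branch exactly:

    static int my_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
    {
        /* Start the I/O described by rq; completion can be signaled
         * later from interrupt context. Here it is done inline. */
        blk_mq_end_io(rq, 0);
        return BLK_MQ_RQ_QUEUE_OK;
    }

    static struct blk_mq_ops my_mq_ops = {
        .queue_rq    = my_queue_rq,      /* handle one request */
        .map_queue   = blk_mq_map_queue, /* default CPU->queue mapping */
    };

    static struct blk_mq_reg my_mq_reg = {
        .ops          = &my_mq_ops,
        .nr_hw_queues = 1,    /* hardware dispatch queues */
        .queue_depth  = 64,
    };

    /* q = blk_mq_init_queue(&my_mq_reg, driver_data); */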
In the current patch set, the multiqueue mode is offered in addition to the
existing two modes, so current drivers will continue to work without
change. According to this
paper on the multiqueue block layer design [PDF], the hope is that drivers will
migrate over to the multiqueue API, allowing the eventual removal of the
request-based mode.
This patch set has been significantly reworked in the last month or so; it
has gone from a relatively messy series into something rather
cleaner.
Merging into the mainline would thus appear to be on the agenda for the
near future. Since use of this API is optional, existing drivers should
continue to work and this merge could conceivably happen as early as 3.11.
But, given that the patch set has not yet been publicly posted to any
mailing list and does not appear in linux-next, 3.12 seems like a more
likely target. Either way, Linux seems likely to have a much better block
layer by the end of the year or so.
User-space out-of-memory handling
By Jonathan Corbet
June 5, 2013
A visit from the kernel's out-of-memory (OOM) killer is usually about as
welcome as a surprise encounter with the tax collector. The OOM killer is
called in when the system runs out of memory and cannot progress without
killing off one or more processes; it is the embodiment of a
frequently-changing set of heuristics describing which processes can be killed for
maximum memory-freeing effect and minimal damage to the system as a whole.
One would not think that this would be a job that is amenable to handling
in user space, but there are users who try to do exactly that, with
some success. That said, user-space OOM handling is not as safe as those
users would like, and there is not much consensus on how to make it more robust.
User-space OOM handling
The heaviest user of user-space OOM handling, perhaps, is Google. Due to
the company's desire to get the most out of its hardware, Google's internal
users tend to be packed
tightly into their servers. Memory control groups (memcgs) are used to
keep those users from stepping on each others' toes. Like the system as a
whole, a memcg can go into the OOM condition, and the kernel responds in
the same way: the OOM killer wakes up and starts killing processes in the
affected group. But, since an OOM situation in a memcg does not threaten
the stability of the system as a whole, the kernel allows a bit of
flexibility in how those situations are handled. Memcg-level OOM killing
can be disabled altogether, and there is a mechanism by which a user-space
process can request notification when a memcg hits the OOM wall.
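That notification mechanism is built on eventfd(): the handler opens the
group's memory.oom_control file, creates an eventfd, and registers the
pair by writing to cgroup.event_control. A minimal sketch, assuming a
memcg mounted at /sys/fs/cgroup/memory/mygroup:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int oomfd = open("/sys/fs/cgroup/memory/mygroup/memory.oom_control",
                         O_RDONLY);
        int ctlfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
                         O_WRONLY);
        int efd = eventfd(0, 0);
        char buf[32];
        uint64_t count;

        /* Registration format: "<eventfd> <oom_control fd>" */
        snprintf(buf, sizeof(buf), "%d %d", efd, oomfd);
        write(ctlfd, buf, strlen(buf));

        /* Blocks until the group runs into its memory limit. */
        read(efd, &count, sizeof(count));
        printf("OOM event in mygroup\n");
        return 0;
    }

Writing 1 to memory.oom_control disables the kernel's OOM killer for the
group, leaving the handler in charge.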
Said notification mechanism is designed around the needs of a global, presumably
privileged process that manages a bunch of memcgs on the system; that
process can respond by raising memory limits, moving processes to different
groups, or doing some targeted process killing of its own. But Google's
use case turns out to be a little different: each internal Google user is
given the ability
(and responsibility) to handle OOM conditions within that user's groups.
This approach can work, but there are a couple of traps that make it less
reliable than some might like.
One of those is that, since users are doing their own OOM handling, the OOM
handler process itself will be running within the affected memcg and will
be subject
to the same memory allocation constraints. So if the handler needs to
allocate memory while responding to an OOM problem, it will block and be
put on the
list of processes waiting for the OOM situation to be resolved; this is,
essentially, a deadlocking of the entire memcg. One can try to avoid this
problem by locking pages into memory and such, but, in the end, it is quite
hard to write a user-space program that is guaranteed not to cause memory
allocations in the kernel. Simply reading a /proc file to get a
handle on the situation can be enough to bring things to a halt.
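The usual defensive measures look something like the sketch below: fault
in and pin a preallocated arena, then touch nothing else while handling an
event. As noted above, though, this cannot prevent allocations the kernel
makes on the process's behalf.

    #include <stdlib.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (1 << 20)    /* all memory the handler will use */

    int main(void)
    {
        char *arena = malloc(ARENA_SIZE);

        /* Fault every page in now, while memory is still available. */
        for (size_t i = 0; i < ARENA_SIZE; i += 4096)
            arena[i] = 0;

        /* Pin current and future mappings so nothing is paged out. */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        /* ... wait for OOM events, using only the arena ... */
        return 0;
    }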
The other problem is that the process whose allocation puts the memcg into
an OOM condition in the first place may be running fairly deeply within the
kernel and may hold any number of locks when it is made to wait. The
mmap_sem semaphore seems to be especially problematic, since it
is often held in situations where memory is being allocated — page fault
handling, for example. If the OOM handler process needs to do anything
that might acquire any of the same locks, it will block waiting for exactly the
wrong process, once again creating a deadlock.
The end result is that user-space OOM killing is not 100% reliable and,
arguably, can never be. As far as Google is concerned, somewhat unreliable OOM
handling is acceptable, but deadlocks when OOM killing goes bad are not.
So, back in 2011, David Rientjes posted a
patch establishing a user-configurable OOM killer delay. With that
delay set, the (kernel) OOM killer will wait for the specified time for an OOM
situation to be resolved by the user-space handler before it steps in and
starts killing off processes. This
mechanism gives the user-space handler a window within which it can try to
work things out; should it deadlock or otherwise fail to get the job done
in time, the kernel will take over.
David's patch was not merged at that time; the general sentiment seemed to
be that it was just a workaround for user-space bugs that would be better
fixed at the source. At the time, David said that Google would carry the patch
internally if need be, but that he thought others would want the same
functionality as the use of memcgs increased. More than two years later,
he is trying again, but the response is not
necessarily any friendlier this time around.
Alternatives to delays
Some developers responded that running the OOM handler within the control
group it manages is a case of "don't do that," but, once David explained
that users are doing their own OOM handling, they seemed to back down a bit
on that one. There does still seem to be a bit of a sentiment that
the OOM handler should be locked into memory and should avoid performing
memory allocations. In particular, OOM time seems a bit late to be
reading /proc files to get a picture of which processes are
running in the system. The alternative, though, is to trace process
creation in each memcg, which has performance issues of its own.
Some constructive thoughts came from Johannes Weiner, who had a couple of
ideas for improving the current situation. One of those was a patch intended to solve the problem of
processes waiting for OOM resolution while holding an arbitrary set of
locks. This patch makes two changes, the first of which comes into play
when a problematic allocation is the direct result of a system call. In
this case, the allocating process will not be placed in the OOM wait queue
at all; instead, the system call will simply fail with an ENOMEM error.
This solves most of the problem, but at a cost: system calls that might
once have worked will start returning an error code that applications might
not be expecting. That could cause strange behavior, and, given that the
OOM situation is rare, such behavior could be hard to uncover with testing.
The other part of the patch changes the page fault path. In this case,
just failing with ENOMEM is not really an option; that would result in the
death of the faulting process. So the page fault code is changed to
make a note of the fact that it hit an OOM situation and return; once the
call stack has been unwound and any locks are released, it will wait for
resolution of the OOM problem. With these changes in place, most (or all)
of the lock-related deadlock problems should hopefully go away.
That doesn't solve the other problem, though: if the OOM handler itself
tries to allocate memory, it will be put on the waiting list with everybody else
and the memcg will still deadlock. To address this issue, Johannes suggested that the user-space OOM handler
could more formally declare its role to the kernel. Then, when a process
runs into an OOM problem, the kernel can check if it's the OOM handler
process; in that case, the kernel OOM handler would be invoked immediately
to deal with the situation. The end result in this case would be the same
as with the timeout, but it would happen immediately, with no need to wait.
Michal Hocko favors Johannes's changes, but had an additional suggestion: implement a global
watchdog process. This process would receive OOM notifications at the same
time the user's handler does; it would then start a timer and wait for the OOM
situation to be resolved. If time runs out, the watchdog would kill the
user's handler and re-enable kernel-provided OOM handling in the affected
memcg. In
his view, the problem can be handled in user space, so that's where the
solution should be.
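A rough sketch of such a watchdog, reusing the eventfd registration shown
earlier; the grace period is a hypothetical value, while the under_oom
flag it checks is part of the real memory.oom_control file's output:

    #include <fcntl.h>
    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define GRACE_MS 10000    /* how long the user's handler gets */

    static int under_oom(const char *oom_path)
    {
        char buf[128];
        int fd = open(oom_path, O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf) - 1);

        close(fd);
        buf[n > 0 ? n : 0] = '\0';
        return strstr(buf, "under_oom 1") != NULL;
    }

    /* efd: eventfd registered for the memcg's OOM notifications.
     * handler: pid of the user's own OOM-handler process. */
    static void watchdog(int efd, pid_t handler, const char *oom_path)
    {
        uint64_t count;

        for (;;) {
            read(efd, &count, sizeof(count));    /* an OOM event */
            usleep(GRACE_MS * 1000);

            if (under_oom(oom_path)) {
                /* Handler is presumably stuck: kill it and hand
                 * the group back to the kernel's OOM killer. */
                int fd = open(oom_path, O_WRONLY);

                kill(handler, SIGKILL);
                write(fd, "0", 1);    /* re-enable kernel OOM killing */
                close(fd);
            }
        }
    }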
With some combination of these changes, it is possible that the problems
with user-space OOM-handler deadlocks will be solved. In that case,
perhaps, Google's delay mechanism will no longer be needed. Of course,
that will not be the end of the OOM-handling discussion; as far as your
editor can tell, that particular debate is endless.
Power-aware scheduling meets a line in the sand
By Jonathan Corbet
June 5, 2013
As mobile and embedded processors get more complex — and more numerous —
the interest in improving the power efficiency of the scheduler has
increased. While
a number of power-related
scheduler patches exist, none seem all that close to merging into the
mainline. Getting something upstream always looked like a daunting task;
scheduler changes are hard to make in general, these changes come from a
constituency that the scheduler maintainers are not used to serving, and
the existence of competing patches muddies the water somewhat. But now it
seems that the complexity of the situation has increased again, to the
point that the merging of any power-efficiency patches may have gotten even
harder.
The current discussion started at the end of May, when Morten Rasmussen
posted some performance measurements
comparing a few of the existing patch sets. The idea was clearly to push
the discussion forward so that a decision could be made regarding which of
those patches to push into the mainline. The numbers were useful, showing
how the patch sets differ over a small set of workloads, but the apparent
final result is unlikely to be pleasing to any of the developers involved:
it is entirely possible that none of those patch sets will be merged in
anything close to their current form, after Ingo Molnar posted a strongly-worded "line in the sand" message
on how power-aware scheduling should be designed.
Ingo's complaint is not really about the current patches; instead, he is
unhappy with how CPU power management is implemented in the kernel now.
Responsibility for CPU power management is currently divided among three
independent components:
- The scheduler itself clearly has a role in the system's power usage
characteristics. Features like deferrable timers and suppressing the timer tick when idle have
been added to the scheduler over the years in an attempt to improve
the power efficiency of the system.
- The CPU frequency ("cpufreq") subsystem regulates the clock frequency
of the processors in response to each processor's measured idle time.
If the processor is idle much of the time, the frequency (and, thus,
power consumption) can be lowered; an always-busy processor, instead,
should run at a higher frequency if possible. Most systems probably
use the on-demand cpufreq governor, but others exist; the sysfs
interface for examining and changing the governor is sketched after
this list. The big.LITTLE
switcher operates at this level by disguising the difference
between "big" and "little" processors to look like a wide range of
frequency options.
- The cpuidle subsystem is charged with
managing processor sleep states. One might be tempted to regard
sleeping as just another frequency option (0Hz, to be exact), but
sleep is rather more complicated than that. Contemporary processors
have a wide range of sleep states, each of which differs in the amount
of power consumed, the damage inflicted upon CPU caches, and the time
required to enter and leave that state.
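For reference, the governor mentioned in the cpufreq item above is
selected per-CPU through sysfs; a small sketch of reading and changing it
(writing requires root):

    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";
        char gov[64];
        FILE *f = fopen(path, "r");

        if (f && fgets(gov, sizeof(gov), f))
            printf("cpu0 governor: %s", gov);    /* e.g. "ondemand" */
        if (f)
            fclose(f);

        f = fopen(path, "w");    /* switch governors (root only) */
        if (f) {
            fputs("ondemand", f);
            fclose(f);
        }
        return 0;
    }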
Ingo's point is that splitting the responsibility for power management
decisions among three components leads to a situation where no clear policy
can be implemented:
Today the power saving landscape is fragmented and sad: we just
randomly interface scheduler task packing changes with some idle
policy (and cpufreq policy), which might or might not combine
correctly. Even when the numbers improve, it's an entirely random,
essentially unmaintainable property: because there's no clear split
(possible) between 'scheduler policy' and 'idle policy'.
He would like to see a new design wherein the responsibility for all of
these aspects of CPU operation has been moved into the scheduler itself.
That, he claims, is where the necessary knowledge about the current
workload and CPU topology lives, so that is where the decisions should be
made. Any power-related patches, he asserts, must move the system in that
direction:
This is a "line in the sand", a 'must have' design property for any
scheduler power saving patches to be acceptable - and I'm NAK-ing
incomplete approaches that don't solve the root design cause of our
power saving troubles.
Needless to say, none of the current patch sets include a fundamental
redesign of the scheduler, cpuidle, and cpufreq subsystems. So, for all
practical purposes, all of
those patches have just been rejected in their current form — probably not
the result the developers of those patches were hoping for.
Morten responded with a discussion of the
kinds of issues that an integrated power-aware scheduler would have to deal
with. It starts with basic challenges like defining scheduling policies
for power-efficient operation and defining a mechanism by which a specific
policy can be chosen and implemented. There would be a need to represent
the system's power topology within the scheduler; that topology might not
match the cache hierarchy represented by the existing scheduling domains data structure. Thermal
management, which often involves reducing CPU frequencies or powering down
processors entirely, would have to be factored in. And so on. In summary,
Morten said:
This is not a complete list. My point is that moving all policy to
the scheduler will significantly increase the complexity of the
scheduler. It is my impression that the general opinion is that
the scheduler is already too complicated. Correct me if I'm wrong.
In his view, the existing patch sets are part of an incremental solution to
the problem and a step toward the overall goal.
Whether Ingo will see things the same way is, as of this writing, unclear.
His words were quite firm, but lines in the sand are also relatively easy
to relocate. If he holds fast to his expressed position, though, the
addition of power-aware scheduling could be delayed indefinitely.
It is not unheard of for subsystem maintainers to insist on improvements to
existing code as a precondition to merging a new feature. At past kernel
summits, such requirements have been seen as unfair, but they
sometimes persist anyway. In this case, Ingo's message, on its face,
demands a redesign of one of the most complex core kernel
subsystems before (more) power awareness can be added. That is a
significant raising of the bar for developers who were already struggling to
get their code looked at and merged. A successful redesign on that scale
is unlikely to happen unless the current scheduler maintainers put a fair
amount of their own time into it.
The cynical among us could certainly see this position as an easy way to
simply make the power-aware scheduling work go away. That is almost
certainly an incorrect interpretation, though. The more straightforward
explanation — that the scheduler maintainers want to see the code get
better and more maintainable over time — is far more likely. What has to
happen now is the identification of a path toward that better scheduler
that allows for power management improvements in the short term. The
alternative is to see the power-aware scheduler code relegated to vendor
and distributor trees, which seems like a suboptimal outcome.
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.8.13-rt10 (June 3, 2013)
Page editor: Jonathan Corbet