Brief items
The current development kernel remains 2.6.35-rc5; no new -rc
releases have been made over the last week. The stream of fixes into the
mainline continues, though; there has also been a renaming of the logical
memory block (LMB) code to "memblock." The -rc6 release can probably be
expected shortly after LWN's publication time.
Stable updates: there have been no stable updates over the last
week.
With Rusty's patch for the modules bug, and reverting Greg
Vandal-Hartman's "driver core: remove CONFIG_SYSFS_DEPRECATED" and
deleting the BUG_ON from generic_delete_inode(), I have a login
prompt! Admittedly I don't have any networking any more, but that
seems a minor quibble.
--
Andrew Morton
By Jonathan Corbet
July 21, 2010
A few years back, it seemed that incompatible sysfs changes created broken
systems on a regular basis. Since then, though, things have gotten better,
with no reports of broken systems or forced udev upgrades for a while.
That improvement is the result of a deliberate effort on the part of the
sysfs hackers to stabilize things and to establish best practices for the
use of sysfs-exported information. As some linux-next testers are
currently finding out, though, the legacy of older sysfs problems has not
entirely faded away yet.
The CONFIG_SYSFS_DEPRECATED configuration option exists as one way
of mitigating the effects of a major sysfs change. In the early days of
sysfs, devices tended to pop up in strange places, including, especially,
under /sys/class. In order to bring more consistency to the
filesystem, the layout was reorganized to move more device information into
/sys/devices, create the /sys/block directory, and more.
Needless to say, any such change would be fatal for systems which expected
the old layout, so the configuration option was added to restore that old
layout when needed.
In 2010, nobody has shipped a distribution which relies on the old layout
for some time. So Greg Kroah-Hartman has posted a patch to remove the configuration option and
the significant amount of code needed to support it; that patch has also
gone into linux-next. Greg notes: "This is no longer needed by any
userspace tools, so it's safe to remove."
Except that maybe it's not safe to remove. Andrew Morton quickly reported that his Fedora Core 6 box would
not boot without this option. Andrew is well known for running archaic
distributions just for the purpose of finding this kind of compatibility
issue; one might argue that there probably are not that many other FC6
boxes in use, and even fewer which will be wanting to run 2.6.35 kernels.
But, as Dave Airlie noted, RHEL5 boxes will
also fail to boot, and there are rather more of those in operation.
Dave's advice was blunt: "Live with your mistakes guys, don't try and
bury them." He knows as well as anybody what the cost of living
with mistakes is: the graphics ABIs include a few of their own. Mistakes
will happen, but, when they become part of the user-space ABI, they can be
difficult to get away from. That is why ABI additions tend to come under
high levels of scrutiny: once somebody depends on them, they must be
supported indefinitely.
Kernel development news
By Jonathan Corbet
July 20, 2010
"Writeback" is the process of writing the contents of dirty memory pages
back to their backing store, where that backing store is normally a file or
swap area. Proper handling of writeback is crucial for both system
performance and data integrity. If writeback falls too far behind the
dirtying of pages, it could leave the system with severe memory pressure
problems. Having lots of dirty data in memory also increases the amount of
data which may be lost in the event of a system crash. Overly enthusiastic
writeback, on the other hand, can lead to excessive I/O bandwidth usage,
and poorly-planned writeback can greatly reduce I/O performance with
excessive disk seeks. Like many memory-management tasks, getting writeback
right is a tricky exercise involving compromises and heuristics.
Back in April, LWN looked at a
specific writeback problem: quite a bit of writeback activity was
happening in direct reclaim. Normally, memory pages are reclaimed (made
available for new uses, with data written back, if necessary) in the
background; when all goes well, there will always be a list of free pages
available when memory is needed. If, however, a memory allocation request
cannot be satisfied
from the free list, the kernel may try to reclaim pages directly in the
context of the process performing the allocation. Diverting an allocation
request into this kind of cleanup activity is called "direct reclaim."
Direct reclaim normally
works, and it is a good way to throttle memory-hungry processes, but it
also suffers from a couple of significant problems. One of those is stack
overflows; direct reclaim can happen from almost anywhere in the kernel, so
it may be that the kernel stack is already mostly used before the reclaim process even
starts. But if reclaim involves writing file pages back, it can be just
the beginning of a long call chain in its own right, leading to the
overflow of the kernel stack. Beyond that, direct reclaim, which reclaims
pages wherever it can find them, tends to create
seek-intensive I/O, hurting the whole system's I/O performance.
Both of these problems have been seen on production systems. In response,
a number of filesystems have been changed so that they simply ignore
writeback requests which come from the direct reclaim code. That makes the
problem go away, but it is a kind of papering-over that pleases nobody; it
also arguably increases the risk that the system could go into the dreaded
out-of-memory state.
Mel Gorman has been working on the reclaim problem, on and off, for a few
months now. His latest patch
set will, with luck, improve the situation. The actual changes made
are relatively small, but they apparently tweak things in the right
direction.
The key to solving a problem is understanding it. So, perhaps, it's not
surprising that the bulk of the changes do not actually affect writeback;
they are, instead, tracing instrumentation and tools which provide
information on what the reclaim code is actually doing. The new
tracepoints provide visibility into the nature of the problem and,
importantly, how much each specific change helps.
The core change is deep within the direct reclaim loop. If direct reclaim
stumbles across a page which is dirty, it now must think a bit harder about
what to do with it. If the dirty page is an anonymous (process data) page,
writeback happens as before. The reasoning here seems to be that the
writeback path for these pages (which will be going to a swap area) will be
simpler than it is for file-backed pages; there are also fewer
opportunities for anonymous pages to be written back via other paths. As a
result, anonymous writeback might still create seek problems - but only if
the swap area shares a spindle with other, high-bandwidth data.
For dirty, file-backed pages, the situation is a little different; direct
reclaim will no longer try to write back those pages directly. Instead, it
creates a list of the dirty pages it encounters, then hands them over to
the appropriate background process for the real writeback work. In some
cases (such as when lumpy
reclaim is trying to free specific larger chunks of memory), the direct
reclaim code will wait in the hope that the identified pages will soon
become free. The rest of the time, it simply moves on, trying to find free
pages elsewhere.
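The policy described above can be modeled in a few lines of user-space C. This is only a sketch of the decision as the article describes it, not code from Mel's patch set; the names toy_page, reclaim_action, direct_reclaim_decide(), and collect_deferred() are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified page descriptor; not the kernel's struct page. */
struct toy_page {
    bool dirty;
    bool anonymous;   /* true: process data backed by swap; false: file-backed */
};

enum reclaim_action {
    RECLAIM_NOW,        /* clean page: no writeback is needed */
    WRITEBACK_DIRECT,   /* dirty anonymous page: written to swap as before */
    DEFER_TO_FLUSHER    /* dirty file page: left for the background threads */
};

/* The per-page decision, as described in the text. */
static enum reclaim_action direct_reclaim_decide(const struct toy_page *page)
{
    if (!page->dirty)
        return RECLAIM_NOW;
    if (page->anonymous)
        return WRITEBACK_DIRECT;
    return DEFER_TO_FLUSHER;
}

/*
 * Scan a batch of pages and build the list of dirty file pages that
 * would be handed over for background writeback. The caller provides
 * a "deferred" array with room for n entries. Returns the list length.
 */
static size_t collect_deferred(const struct toy_page *pages, size_t n,
                               const struct toy_page **deferred)
{
    size_t count = 0;

    assert(pages != NULL && deferred != NULL);
    for (size_t i = 0; i < n; i++) {
        if (direct_reclaim_decide(&pages[i]) == DEFER_TO_FLUSHER)
            deferred[count++] = &pages[i];
    }
    return count;
}
```

In this model, a caller that gets a non-zero count back from collect_deferred() would pass the list to the background threads and then either wait or move on to other pages.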
Handing the reclaim work over to the threads which exist for that task has
a couple of benefits. It is, in effect, a simple way of switching to
another kernel stack - one which is known to be mostly empty - before
heading into the writeback paths. Switching stacks directly in the direct
reclaim code had been discussed, but it was decided that the mechanism the
kernel already has for switching stacks (context switches) was probably the
right thing to use in this situation. Keeping the writeback work in kswapd
and the per-BDI writeback threads should also help performance, since those
threads try to order operations to minimize head seeks.
When this problem was discussed in April, Andrew Morton pointed out that,
over time, the amount of memory written back in direct reclaim has grown
significantly, with an adverse effect on system performance. He wanted to
see thought put into why that change has happened rather than trying to
mitigate its effects. The final patch in Mel's series looks like an
attempt to address this concern. It changes the direct reclaim code so
that, if that code starts encountering dirty pages, it pokes the writeback
threads and tells them to start cleaning pages more aggressively. The idea
here is to keep the normal reclaim mechanisms running at a fast-enough pace
that direct reclaim is not needed so often.
This
tweak seems to have a significant effect on some benchmarks; Mel says:
Apparently, background flush must have been doing a better job
getting [pages] cleaned in time and the direct reclaim stalls are
harmful overall. Waking background threads for dirty pages made a
very large difference to the number of pages written back. With all
patches applied, just 759 filesystem pages were written back in
comparison to 581811 in the vanilla kernel and overall the number
of pages scanned was reduced.
Anybody who likes digging through benchmark results is advised to look at
Mel's patch posting - he appears to have run just about every test that he
could to quantify the effects of this patch series. This kind of extensive
benchmarking makes sense for deep memory management changes, since even
small changes can have surprising results on specific workloads. At this
point, it seems that the changes have the desired effect and most of the
concerns expressed with previous versions have been addressed. The
writeback changes, perhaps, are getting ready for production use.
By Jonathan Corbet
July 20, 2010
The Linux scheduler, in both the mainline and realtime versions, provides a
couple of scheduling classes for realtime tasks. These classes
implement the classic POSIX priority-based semantics, wherein the
highest-priority runnable task is guaranteed to have access to the CPU.
While this scheduler works as advertised, priority-based scheduling has a
number of problems and has not been the focus of realtime research for some
time. Cool schedulers in this century are based on deadlines instead.
Linux does not yet have a deadline scheduler, though
there is one in the works. A
recent discussion on implementing the full deadline model has shown, once
again, just how complex it can be to get deadline scheduling right in the
real world.
Deadline scheduling does away with priorities, replacing them with a
three-parameter tuple: a worst-case execution time (or budget), a deadline,
and a period. In essence, a process tells the scheduler that it will
require up to a certain amount of CPU time (the budget) by the given
deadline, and that the deadline optionally repeats with the given period.
So, for example, a video-processing application might request 1ms of CPU
time to process the next incoming frame, expected in 10ms, with a 33ms
period thereafter for subsequent frames. Deadline scheduling is appealing
because it allows the specification of a process's requirements in a
natural way which is not affected by any other processes running in the
system. There is also great value, though, in using the deadline
parameters to guarantee that a process will be able to meet its deadline,
and to reject any process which might cause a failure to keep those
guarantees.
The SCHED_DEADLINE scheduler being developed by Dario Faggioli
appears to be on track for an eventual mainline merger, though nobody, yet,
has been so foolish as to set a deadline for that particular task. This
scheduler works, but, thus far, it takes a bit of a shortcut: in
SCHED_DEADLINE, the deadline and the period are assumed to be the
same. This simplification makes the "admission test" - the decision as to
whether to accept a new SCHED_DEADLINE task - relatively easy.
Each process gets a "bandwidth" parameter, being the ratio of the CPU
budget to the deadline/period value. As long as the sum of the bandwidth
values for all processes on a given CPU does not exceed 1.0, the scheduler
can guarantee that the deadlines will be met.
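The bandwidth-based admission test can be sketched as follows. This is a user-space illustration of the arithmetic only, assuming (as SCHED_DEADLINE currently does) that deadline and period are the same; the structure and function names are hypothetical and are not taken from Dario's patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task description; times are in microseconds. */
struct dl_task {
    unsigned long budget_us;   /* worst-case execution time */
    unsigned long period_us;   /* period, assumed equal to the deadline */
};

/* Sum of budget/period over all tasks already admitted to one CPU. */
static double total_bandwidth(const struct dl_task *tasks, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].period_us > 0);
        sum += (double)tasks[i].budget_us / (double)tasks[i].period_us;
    }
    return sum;
}

/* Accept the candidate only if the CPU would not be oversubscribed. */
static bool admit(const struct dl_task *tasks, size_t n,
                  const struct dl_task *candidate)
{
    double bw;

    assert(candidate->period_us > 0);
    bw = (double)candidate->budget_us / (double)candidate->period_us;
    return total_bandwidth(tasks, n) + bw <= 1.0;
}
```

Kernel code generally avoids floating-point arithmetic, so a real in-kernel version of this test would be expected to use fixed-point ratios instead; doubles are used here only to keep the sketch short.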
As Dario recently brought up on
linux-kernel, there are users who would like to be able to specify separate
deadline and period values. Adjusting the scheduler to implement those
semantics is not particularly hard. Coming up with an admission test which
ensures that deadlines can still be met is rather harder, though. Once the
period and the deadline are separated from each other, it becomes possible
for processes to miss their deadlines even if the total bandwidth of the
CPU has not been oversubscribed.
To see how this might come about, consider an
example posted by Dario. Imagine a process which needs 4ms of CPU time
with a period of 10ms and a deadline of 5ms. A timeline of how that
process might be scheduled would show it running from 0 to 4ms of each
10ms period, with the deadline falling at 5ms.
Here, the scheduler is able to run the process within its deadline; indeed,
there is 1ms of time to spare. Now, though, if a second process comes in
with the same set of requirements, the results will not be as good: P2
cannot start until P1 finishes at 4ms, so it does not complete until 8ms.
Despite the fact that 20% of the available CPU time remains unused, process
P2 will miss its deadline by 3ms. In a hard realtime situation, that tardiness
could prove fatal. Clearly, the scheduler should reject P2 in this
situation. The problem is that detecting this kind of problem is not easy,
especially if the scheduler is (as seems reasonable) expected to leave some
CPU time for applications and not use it all performing complex admission
calculations. For this reason, admission decision algorithms are currently
an area of considerable research interest in the academic realtime
community. See this paper by
Alejandro Masrur et al. [PDF] or this one by Marko
Bertogna [PDF] for examples of how involved it can get.
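The two-process example above can be reproduced with a toy earliest-deadline-first simulation. The sketch below is a user-space model with 1ms time steps, written for this example only; it assumes that both processes become runnable at time zero and that ties between equal deadlines go to the first process. None of the names come from the SCHED_DEADLINE patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SIM_TASKS 8

struct sim_task {
    int budget_ms;     /* CPU time needed in this period */
    int deadline_ms;   /* relative deadline */
    int finish_ms;     /* set by the simulation; -1 if never finished */
};

/*
 * Simulate horizon_ms milliseconds on one CPU in 1ms steps. All tasks
 * are released at time 0; at each step the unfinished task with the
 * earliest deadline runs (ties go to the lower index). Returns the
 * number of milliseconds during which the CPU was busy.
 */
static int simulate_edf(struct sim_task *tasks, size_t n, int horizon_ms)
{
    int remaining[MAX_SIM_TASKS];
    int busy = 0;

    assert(n <= MAX_SIM_TASKS);
    for (size_t i = 0; i < n; i++) {
        remaining[i] = tasks[i].budget_ms;
        tasks[i].finish_ms = -1;
    }

    for (int t = 0; t < horizon_ms; t++) {
        int pick = -1;

        for (size_t i = 0; i < n; i++) {
            if (remaining[i] > 0 &&
                (pick < 0 || tasks[i].deadline_ms < tasks[pick].deadline_ms))
                pick = (int)i;
        }
        if (pick < 0)
            continue;   /* nothing runnable: the CPU idles */
        busy++;
        if (--remaining[pick] == 0)
            tasks[pick].finish_ms = t + 1;
    }
    return busy;
}

/* Positive result: the task finished that many ms after its deadline. */
static int lateness_ms(const struct sim_task *task)
{
    assert(task->finish_ms >= 0);
    return task->finish_ms - task->deadline_ms;
}
```

Run over one 10ms period with two tasks that each need 4ms by a 5ms deadline, this model keeps the CPU busy for 8ms (leaving 20% idle), finishes the first task 1ms early, and finishes the second 3ms late, matching the numbers in the text.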
There are a couple of ways to simplify the problem. One of those would be
to change the bandwidth calculation to be the ratio of the CPU budget to
the relative deadline time (rather than to the period). For the example
processes shown above, each has a bandwidth of 0.8 using this formula; the
scheduler, on seeing that the second process would bump the total to 1.6,
could then decide to reject it. Using this calculation, the scheduler can,
once again, guarantee that deadlines will be met, but at a cost: this
formula will cause the rejection of processes that, in reality, could be
scheduled without trouble. It is an overly pessimistic heuristic which will
prevent full utilization of the available resources.
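The deadline-based variant of the test is a small change to the arithmetic. Again, this is a hypothetical user-space sketch rather than anything from the patch; the names are invented, and the 1.0 limit is simply carried over from the period-based test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task description with separate deadline and period. */
struct dens_task {
    unsigned long budget_us;
    unsigned long deadline_us;   /* relative deadline */
    unsigned long period_us;
};

/* Bandwidth computed against the deadline rather than the period. */
static double deadline_bandwidth(const struct dens_task *t)
{
    assert(t->deadline_us > 0);
    return (double)t->budget_us / (double)t->deadline_us;
}

/* Reject the candidate if the deadline-based total would exceed 1.0. */
static bool admit_by_deadline(const struct dens_task *tasks, size_t n,
                              const struct dens_task *candidate)
{
    double sum = deadline_bandwidth(candidate);

    for (size_t i = 0; i < n; i++)
        sum += deadline_bandwidth(&tasks[i]);
    return sum <= 1.0;
}
```

With one 4ms/5ms/10ms task already admitted, a second identical task brings the total to 1.6 and is rejected, as it should be. The pessimism shows up with a candidate needing 3ms by a 10ms deadline in a 10ms period: the total is 1.1, so it is rejected, even though running the first task from 0 to 4ms and the second from 4 to 7ms would meet both deadlines.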
An alternative, proposed by Dario, would be to stick with the period-based
bandwidth values for admission decisions and to take the risk that some
deadlines might not be met. In this case, user-space developers would be
responsible for ensuring that the full set of tasks on the system can be
scheduled. They might take some comfort in the fact that, since the
overall bandwidth of the CPU would still
not be oversubscribed, the amount by which a deadline could be missed would
be deterministically bounded.
That idea did not survive its encounter
with Peter Zijlstra, who thinks it ruins everything that a deadline
scheduler is supposed to provide:
The whole reason SCHED_FIFO and friends suck so much is that they
don't provide any kind of isolation, and thus, as an
Operating-System abstraction they're an utter failure. If you take
out admission control you end up with a similar situation.
Peter's suggestion, instead, was to split deadline scheduling logically
into two different schedulers. The hard realtime scheduler would retain
the highest priority, and would require that the deadline and period values
be the same. If, at some future time, a suitable admission controller is
developed then that requirement could be relaxed as long as deadlines could
still be guaranteed.
Below the hard realtime scheduler would be a soft realtime scheduler which
would have access to (most of) the CPU bandwidth left unused by the hard
scheduler. That scheduler could accept processes using period-based
bandwidth values with the explicit understanding that deadlines might be
missed by bounded amounts. Soft realtime is entirely good enough for a
great many applications, so there is no real reason not to provide it as
long as hard realtime is not adversely affected.
So that is probably how things will go, though the real shape of the
solution will not be seen until Dario posts the next version of the
SCHED_DEADLINE patch. Even after this problem is solved, though, deadline
scheduling has a number of other challenges to overcome, with good
multi-core performance being near the top of the list. So, while Linux
will almost certainly have a deadline scheduler at some point, it's still
hard to say just when that might be.
(Readers who are interested in the intersection of academic and practical
realtime work might want to peruse the recently-released proceedings
[PDF] from the OSPERT 2010 conference, held in Brussels in early July.)
By Jonathan Corbet
July 21, 2010
Allocation of physically-contiguous memory buffers is required by many
device drivers, but it cannot always be reliably done on long-running Linux
systems. That leads to all kinds of unsatisfying workarounds in driver
code. The
contiguous memory
allocator patches recently posted by Michal Nazarewicz are an attempt
to solve this problem in a consistent way for all drivers.
A few years ago, when your editor was writing the camera driver for the
original OLPC XO system, a problem turned up. The video acquisition
hardware on the system was capable of copying video frames into memory via
DMA operations, but only to physically contiguous buffers. There was, in
other words, no scatter/gather DMA capability built into this (cheap) DMA
engine. A choice was thus forced: either allocate memory for video
acquisition at boot time, or attempt to allocate it on the fly when the
camera is actually used. The former choice is reliable, but it has the
disadvantage of leaving a significant chunk of memory idle (on a
memory-constrained system) whenever the camera is not in use - most of the
time on most systems. The latter choice does not waste memory, but is
unreliable - large, contiguous allocations are increasingly hard to do as
memory gets fragmented. In the OLPC case, the decision was to sacrifice
the memory to ensure that the camera would always work.
This particular problem has been faced many times by many developers over the years; each
driver author has tended to go with whatever ad hoc solution seems
to make sense at the time. For some years, the "bigphysarea" patch was
available to help, but that patch was never put into the mainline and has
not seen any maintenance for some time. So the problem remains
unsolved in any sort of general sense.
The contiguous memory allocation (CMA) patches are an attempt to put
together a flexible solution which can be used in all drivers. The basic
technique will be familiar: CMA grabs a chunk of contiguous physical memory
at boot time (when it's plentiful), then doles it out to drivers in
response to allocation requests. Where it differs is mainly in an
elaborate mechanism for defining the memory region(s) to reserve and the
policies for handing them out.
A system using CMA will always need to have at least one boot-time
parameter describing the memory region(s) to use and the policy for allocating
from those regions. The syntax used is rather complex, to the point that a large portion
of the patch is made up of parsing code; see the included Documentation/cma.txt file for the full
details. A simple example of a CMA command-line option would be something
like:
    cma=c=10M cma_map=camera=c
This defines a 10MB region (called "c") and states that allocation requests
from the camera device should be satisfied from this region.
Multiple regions can be defined, each with its own size, alignment
constraints, and allocation algorithm, and memory regions can be split into
different "kinds" as well. The "kinds" feature might be used to separate
large and small allocations, or to put different buffers into different DMA
zones or NUMA nodes. The more complex command lines are reminiscent of
regular expressions, but with less readability. The purpose behind this
complexity is to enable a great deal of flexibility in how memory is
handled without the need to change the drivers which are working with that
memory. Whether that flexibility is worth the cost is not (to your editor,
at least) entirely clear.
A driver can actually allocate a memory chunk with:
    #include <linux/cma.h>

    unsigned long cma_alloc(const struct device *dev, const char *kind,
                            unsigned long size, unsigned long alignment);
If all goes well, the return value will be the physical address of the
allocated memory region.
For reasons which are not entirely clear, buffers allocated with CMA have a
reference count associated with them. So two functions are provided to
manipulate that count:
    int cma_get(unsigned long addr);
    int cma_put(unsigned long addr);
Since reference counting is used, there is no cma_free() function;
instead, the memory chunk is passed to cma_put() and freed
internally when the reference count goes to zero.
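The get/put lifecycle can be illustrated with a small user-space model. This is not the CMA implementation: the toy_ names are invented, and the details shown here (a freshly allocated chunk starting with a count of one, zero returned on success and -1 on a chunk that has already been freed) are assumptions made for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an allocated chunk. */
struct toy_chunk {
    unsigned long addr;      /* would be a physical address in CMA */
    unsigned int refcount;
    bool live;               /* false once the chunk has been freed */
};

/* Assumption: allocation hands back a chunk holding one reference. */
static void toy_alloc(struct toy_chunk *c, unsigned long addr)
{
    c->addr = addr;
    c->refcount = 1;
    c->live = true;
}

/* Take an additional reference; fails on a freed chunk. */
static int toy_get(struct toy_chunk *c)
{
    if (!c->live)
        return -1;
    c->refcount++;
    return 0;
}

/* Drop a reference; the chunk is freed when the count reaches zero. */
static int toy_put(struct toy_chunk *c)
{
    if (!c->live)
        return -1;
    assert(c->refcount > 0);
    if (--c->refcount == 0)
        c->live = false;
    return 0;
}
```

In this model, a driver that shares a buffer with another user takes a reference with toy_get(), and the memory only goes away after every holder has called toy_put().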
CMA comes with a best-fit allocator, but it is designed to work with
multiple internal allocators. So, should there be a need to use a
different allocation algorithm, it is a straightforward matter to add
it to the system. Naturally enough, the command-line syntax offers a way
to specify which allocator should be used for each region.
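For readers unfamiliar with the term, a best-fit allocator picks the smallest free range which can satisfy a request. The following is a generic sketch of that selection step, not the allocator shipped with the CMA patches; the free_range structure and best_fit() function are hypothetical, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one free range within a region. */
struct free_range {
    unsigned long start;
    unsigned long size;
};

/* Round addr up to a multiple of align (align must be a power of two). */
static unsigned long align_up(unsigned long addr, unsigned long align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Return the index of the smallest free range that can hold "size"
 * bytes starting at an aligned address, or -1 if none can.
 */
static int best_fit(const struct free_range *ranges, size_t n,
                    unsigned long size, unsigned long align)
{
    int best = -1;

    assert(align != 0 && (align & (align - 1)) == 0);
    for (size_t i = 0; i < n; i++) {
        unsigned long start = align_up(ranges[i].start, align);
        unsigned long end = ranges[i].start + ranges[i].size;

        if (start > end || end - start < size)
            continue;   /* too small once alignment is applied */
        if (best < 0 || ranges[i].size < ranges[best].size)
            best = (int)i;
    }
    return best;
}
```

Choosing the tightest fit tends to leave the larger free ranges intact for later large requests, which is the usual argument for this policy over first-fit.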
In summary: CMA offers a solution to a problem which driver authors have
been dealing with for some years. Your editor suspects, though, that it
will require some changes before a mainline merge can be contemplated. The
complexity of the solution is probably more than is really called for in
this situation, and the whole thing might benefit from some integration
with the DMA mapping infrastructure. But, someday, it would be nice to
incorporate a solution to the large-buffer problem that all drivers can use.
Page editor: Jonathan Corbet