Brief items
The current development kernel is 2.6.36-rc4, released on September 12.
"Nothing in particular stands out, although there's been more noise
in GPU development than I'd like at this point (both Radeon and i915). But
that should hopefully all be just stabilization. There's also been some
PCIe/firmware interaction changes, that should fix way more issues than it
breaks." The short-form changelog is in the announcement, or see the
full changelog for all the details.
Stable updates: the 2.6.34.7 update was released on
September 13. "It fixes a single bug that a number of users have
reported in that their USB devices no longer work properly. Sometimes it
causes lost keystrokes, and other times X refuses to boot as it can not
communicate properly with some tablet devices."
People who get angry at an unexpected cc need to get a clue. Or
get slapped.
-- Andrew Morton
Nevertheless, everyone I know that has reviewed the newly released
[Broadcom] driver code is being treated for eye cancer. I wouldn't
expect to see it in F-14.
-- John Linville
In the meantime, people are quite happily shipping the 'offending'
b43 driver in all parts of the world without hearing *anything*
from the authorities. And yet the Broadcom lawyers still seem to
cling to their fantasy that a hackable Open Source driver somehow
puts them at more risk than a just-as-hackable closed-source
driver.
Fixing bugs and making other improvements in the closed source
driver is much harder than it is in the open driver, of course --
but if all you want to do is remove restrictions on available
channels and tweak things like TX power, that's actually fairly
easy with the binary drivers. That's why I say 'just as hackable'.
-- David Woodhouse
Broadcom - long seen as the last big proprietary holdout in the area of
wireless networking - has announced the availability of a fully open driver
for its current 802.11n chipsets. "The driver,
while still a work in progress, is released as full source and uses the
native mac80211 stack. It supports multiple current chips (BCM4313,
BCM43224, BCM43225) as well as providing a framework for supporting
additional chips in the future, including mac80211-aware embedded
chips." It's going into the staging tree initially. (Thanks to
Luis Rodriguez).
In the middle of a technical discussion, Linus Torvalds let slip that he
has just become a citizen of the United States. He can't test patches
right away, it seems, because he has to go off and register to vote.
By Jake Edge
September 15, 2010
One of the outcomes from this year's Linux Storage and Filesystem Summit
was a plan to create a combined tree to help ease the process of
integrating changes to various storage subsystems. At the summit, James
Bottomley "volunteered" himself
to put the tree together, and that came to fruition with his announcement of the tree on
September 10. Paralleling the discussion at the summit, there is still the
lingering belief that more than just an automatically generated tree may be
needed.
The tree currently collects patches from several subsystem trees (scsi,
libata, and block), along with patches from the dm quilt repository. It is
automatically pulled and built nightly, much like linux-next. It
will also be rebased daily against the mainline, which will make it somewhat
harder for kernel hackers to use (also like linux-next). Because
of that, Dave Chinner didn't really see the storage-tree as being all that
useful: "I really don't see a tree like this getting
wide use - if I enjoyed the pain of rebasing against throw-away
merge trees every day, then I'd already be using linux-next."
Bottomley acknowledged that complaint,
noting that using linux-next had been suggested at the summit, but pointed
out that the storage-tree is a much smaller target than linux-next: "The diffs to mainline in the current storage tree are still
under a megabyte." Bottomley also noted that the summit
participants were a bit skeptical that a tree without a "storage
maintainer" to oversee it (a la Dave Miller's networking tree) would
solve the problem, which was one of Chinner's concerns as well.
But there are political considerations too. "Unlike net, storage has never had a single
maintainer, so it's a bit more political than just doing that by
fiat", Bottomley said. Chinner was of the opinion that the summit
was the obvious place to decide on appointing a storage
maintainer, even if all of the current maintainers of the storage
subsystems were not present. But it's clear that those who were present
wanted to move slowly, as Bottomley described:
This sort of thing doesn't get decided by fiat. If you can't get all of
the relevant parties to agree, you have to demonstrate the need. So
doing a rollup tree to test how much of the problem is solvable that way
seems like a reasonable first step.
The tree is available at
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/storage-tree.
The nightly diffs from the mainline and log of the pull script are available
as well. It is likely to take a bit of time to see if the storage-tree
solves the problem with integration of cross-storage-subsystem changes, but
it does
provide a good starting point.
Kernel development news
By Jonathan Corbet
September 14, 2010
The level of interactive response provided by the kernel's CPU scheduler is
the subject of endless discussion and tweaking. It is one of those
problems which, seemingly, can never be fully solved to everybody's
satisfaction. Some recent discussions on the topic have shown, though,
that low-hanging fruit can remain after all these years; it's just a matter
of drawing attention to the right place.
The CFS scheduler divides time into periods, during which each process is
expected to run once. The length of the period should thus determine the
maximum amount of time that any given process can expect to have to wait to
be able to run - the maximum latency. That length, by default, is 6ms. If
there are two processes running, those 6ms will be divided into two 3ms
time slices, one for each process.
This assumes that both processes are completely CPU-bound, have the same
priority, and that nothing else perturbs the situation, naturally. If a
third ideal CPU-bound process shows up, that same period is divided into
smaller pieces: three slices of 2ms each.
This process of dividing the scheduler period cannot continue forever,
though. Every context switch has its cost in terms of operating system
overhead and cache behavior; switching too often will have a measurable
effect on the total throughput of the system. The current scheduler, by
default, draws the line at 2ms; if the average time slice threatens to go
below that length, the period will be extended instead. So if one more
cranker process shows up, the period will grow to 8ms so that each of the
four processes still gets a 2ms slice.
In other words, once the load gets high enough, the kernel will start to
sacrifice latency in order to keep throughput up. In situations where the
load is quite high (kernel builds with a lot of parallel processes are
often mentioned), latencies can reach a point where users start to get
truly irritable.
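The period calculation described above can be sketched in a few lines of C. This is only a simplified model built from the numbers in the text (a 6ms period and a 2ms minimum slice, with equal-priority CPU-bound processes); the macro and function names are made up for illustration and are not the kernel's own.

```c
/* Simplified model of the period logic described in the text.
 * All names here are hypothetical, not kernel identifiers. */
#define SCHED_LATENCY_US   6000UL  /* default period: 6ms */
#define MIN_GRANULARITY_US 2000UL  /* smallest allowed slice: 2ms */

/* Length of the scheduling period, in microseconds, for nr_running
 * equal-priority, CPU-bound processes. */
unsigned long sched_period_us(unsigned long nr_running)
{
    /* If slices would drop below the minimum, stretch the period. */
    if (nr_running * MIN_GRANULARITY_US > SCHED_LATENCY_US)
        return nr_running * MIN_GRANULARITY_US;
    return SCHED_LATENCY_US;
}

/* Time slice, in microseconds, that each of those processes receives. */
unsigned long time_slice_us(unsigned long nr_running)
{
    if (nr_running == 0)
        return 0;
    return sched_period_us(nr_running) / nr_running;
}
```

With these values, two processes get 3000µs each, three get 2000µs each, and a fourth stretches the period to 8000µs; the period (and thus the worst-case wait) grows linearly from there.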
Mathieu Desnoyers
decided he could improve the situation with this patch, which attempted to
shrink the minimum possible time slice until there were more than eight
running processes; in this way, he hoped to improve latencies on more
heavily-loaded systems.
Mathieu's patch included some test results showing that the maximum
latencies had been cut roughly in half. Even so, Peter Zijlstra dismissed the patch, saying "Not at all
charmed, this look like random changes without conceptual
integrity." That, in turn, earned a
mild rebuke from Linus, who felt that the kernel's latency performance
was not as good as it could be. After that, the discussion went on for a
while, leading to the interesting conclusion that everybody was partly
right.
Mathieu's patch was based on a slightly flawed understanding of how the
scheduler period was calculated, so it didn't do quite what he was
expecting. Rejecting the patch was, thus, the correct thing for the
scheduler maintainers to do. The patch did improve latencies,
though. It turns out that the
change that actually mattered was reducing the length of the minimum time
slice from 2ms to 750µs. That allows the scheduler to keep the same
period with up to eight processes, and reduces the expansion of the period
thereafter. The result is better latency measurements and, it seems, a
nicer interactive feel. A patch making just the minimum time slice change was
fast-tracked into the mainline and will be present in 2.6.36-rc5.
Interestingly, despite the concerns that a shorter time slice would affect
throughput, there has not been a whole lot of throughput benchmarking done
on this patch so far.
Things don't stop there, though. One of Mathieu's tests uses the
SIGEV_THREAD flag to timer_create(), causing the creation
of a new thread for each event. That new thread, it seems, takes a long
time to find its way into the CPU. The culprit here seems to be in the
code which tries to balance CPU access between a newly forked process and
its parent - a place which has often proved problematic in the past. Mike
Galbraith pointed out that the
START_DEBIT scheduler feature - which serves to defer a new task's
first execution into the next period - has an unpleasant effect on
latency. Turning that feature off improves things considerably, but with
costs felt elsewhere in the system; in particular, it allows fork-heavy
loads to unfavorably impact other processes.
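Mathieu's test program is not reproduced here, but the mechanism it exercises is standard POSIX. The following is a minimal sketch, not his benchmark: it arms a one-shot timer with SIGEV_THREAD and reports how long after the requested expiry the newly created thread actually got to run. The function and variable names are hypothetical; on older C libraries it may need to be linked with -lrt and -lpthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sem_t fired;               /* posted by the callback thread */
static struct timespec fired_at;  /* when the callback actually ran */

/* Runs in a new thread created by the C library when the timer expires. */
static void on_expiry(union sigval sv)
{
    (void)sv;
    clock_gettime(CLOCK_MONOTONIC, &fired_at);
    sem_post(&fired);
}

/* Arm a one-shot timer for delay_ms (must be > 0) and return how many
 * nanoseconds past the requested expiry the callback thread ran.
 * Returns -1 on error. */
long long timer_thread_latency_ns(long delay_ms)
{
    struct sigevent sev;
    struct itimerspec its;
    struct timespec start;
    timer_t timer;
    long long elapsed;

    if (delay_ms <= 0 || sem_init(&fired, 0, 0) != 0)
        return -1;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD;        /* new thread per event */
    sev.sigev_notify_function = on_expiry;
    if (timer_create(CLOCK_MONOTONIC, &sev, &timer) != 0)
        return -1;

    memset(&its, 0, sizeof(its));           /* it_interval = 0: one shot */
    its.it_value.tv_sec = delay_ms / 1000;
    its.it_value.tv_nsec = (delay_ms % 1000) * 1000000L;

    clock_gettime(CLOCK_MONOTONIC, &start);
    if (timer_settime(timer, 0, &its, NULL) != 0) {
        timer_delete(timer);
        return -1;
    }

    while (sem_wait(&fired) != 0)
        ;                                   /* retry if interrupted */
    timer_delete(timer);
    sem_destroy(&fired);

    elapsed = (fired_at.tv_sec - start.tv_sec) * 1000000000LL
            + (fired_at.tv_nsec - start.tv_nsec);
    return elapsed - delay_ms * 1000000LL;
}
```

Calling timer_thread_latency_ns(5) repeatedly on an idle system and again during a heavily parallel build would give a rough picture of the new-thread latency under discussion.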
Mathieu posted a patch adding a new feature
called START_NICE; if it is enabled, both processes returning from
a fork() will have their priority reduced for one scheduler
period. With that penalty, both processes can be allowed to run in the
current period; their effect on the rest of the system will be reduced.
The associated benchmark numbers show a significant improvement from this
change.
Meanwhile, Peter went away for a bit and came back with a
rather more complex patch demonstrating a different approach. With
this patch, new tasks are still put at the end of the queue to ensure that
they don't deprive existing processes of their current time slices. But,
if the new DEADLINE feature is turned on, each new task also gets
a deadline set to one scheduler period in the future. Should that deadline
pass without that process being scheduled, it will be run immediately.
That should put a cap on the maximum latency that new threads will see.
This patch is large and complex, though, and Peter warns that his testing
stopped once the code compiled. So this one is not something to expect for
2.6.36; if it survives benchmarking, though, we might see it become ready
for the next development cycle.
By Jonathan Corbet
September 15, 2010
Writeback is the process of writing dirty memory pages (i.e. those which
have been modified by applications) back to persistent storage, saving the
data and potentially freeing the pages for other use. System performance
is heavily dependent on getting writeback right; poorly-done writeback can
lead to poor I/O rates and extreme memory pressure. Over the last year, it
has become increasingly clear that the Linux kernel is not doing writeback
as well as it should; several developers have been putting time into
improving the situation. The
dynamic dirty throttling limits
patch from Wu Fengguang demonstrates a new, relatively complex approach
to making writeback better.
One of the key concepts behind writeback handling is that processes which
are contributing the most to the problem should be the ones to suffer the most for it. In
the kernel, this suffering is managed through a call to
balance_dirty_pages(), which is meant to throttle a process's
memory-dirtying behavior until the situation improves. That throttling is
done in a straightforward way: the process is given a shovel and told to
start digging. In other words, a process which has been tossed into
balance_dirty_pages() is put to work finding dirty pages and
arranging to have them written to disk. Once a certain number of pages
have been cleaned, the process is allowed to get back to the vital task of
creating more dirty pages.
There are some problems with cleaning pages in this way, many of which have
been covered elsewhere. But one of the key ones is that it tends to
produce seeky I/O traffic. When writeback is handled normally in the
background, the kernel does its best to clean substantial numbers of pages
of the same file at the same time. Since filesystems work hard to lay out
file blocks contiguously whenever possible, writing all of a file's pages
together should cause a relatively small number of head seeks, improving
I/O bandwidth. As soon as balance_dirty_pages() gets into the
act, though, the block layer is suddenly confronted with writeback from
multiple sources; that can only lead to a seekier I/O pattern and reduced
bandwidth. So, when the system is under memory pressure and very much
needs optimal performance from its block devices, it goes into a mode which
makes that performance worse.
Fengguang's 17-part patch makes a number of changes, starting with removing
any direct writeback work from balance_dirty_pages(). Instead,
the offending process simply goes to sleep for a while, secure in the
knowledge that writeback is being handled by other parts of the system.
That should lead to better I/O performance, but also to more predictable
and controllable pauses for memory-intensive applications.
Much of the rest of the patch series is aimed at improving that pause
calculation. It adds a new mechanism for estimating the actual bandwidth
of each backing device - something the kernel does not have a good handle
on, currently. Using that information, combined with the number of pages
that the kernel would like to see written out before allowing a dirtying
process to continue, a reasonable pause duration can be calculated. That
pause is not allowed to exceed 200ms.
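As a rough illustration of that calculation, the pause is simply the time the device should need to write the requested pages, capped at 200ms. The sketch below assumes bandwidth is expressed in pages per second; the names are invented for this example and the patch's real arithmetic is likely more involved.

```c
#define MAX_PAUSE_MS 200UL  /* upper bound on the pause, from the text */

/* pages_to_clean:   pages the kernel wants written before the
 *                   dirtying process may continue
 * bw_pages_per_sec: estimated bandwidth of the backing device
 *                   (the unit is an assumption of this sketch)
 * Returns the pause in milliseconds. */
unsigned long dirty_pause_ms(unsigned long pages_to_clean,
                             unsigned long bw_pages_per_sec)
{
    unsigned long long pause;

    /* No usable estimate yet: fall back to the longest pause. */
    if (bw_pages_per_sec == 0)
        return MAX_PAUSE_MS;

    pause = (unsigned long long)pages_to_clean * 1000ULL / bw_pages_per_sec;
    return pause > MAX_PAUSE_MS ? MAX_PAUSE_MS : (unsigned long)pause;
}
```

For example, waiting for 256 pages on a device estimated at 25,600 pages per second yields a 10ms pause.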
The patch set tries to be smarter than that, though. 200ms is a long time
to pause a process which is trying to get some work done. On the other
hand, without a bit of care, it is also possible to pause processes for a
very short period of time, which is bad for throughput. For this patch
set, it was decided that optimal pauses would be between 10ms and 100ms.
This range is achieved by maintaining a separate
"nr_dirtied_pause" limit for every process; if the number of
dirtied pages for that process is below the limit, it is not forced to
pause. Any time that balance_dirty_pages() calculates a pause
time of less than 10ms, the limit is raised; if the pause turns out to be
over 100ms, instead, the limit is cut in half. The desired result is a
pause within the selected range which tends quickly toward the 10ms end
when memory pressure drops.
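The feedback loop on the per-process limit might look something like the following. The text says only that the limit is "raised" for short pauses; doubling it is an assumption made here for symmetry with the halving, and the function name is hypothetical.

```c
#define PAUSE_MIN_MS 10UL   /* shorter pauses hurt throughput */
#define PAUSE_MAX_MS 100UL  /* longer pauses hurt responsiveness */

/* Given a process's current nr_dirtied_pause limit (in pages) and the
 * pause that was just calculated for it, return the adjusted limit. */
unsigned long adjust_nr_dirtied_pause(unsigned long limit,
                                      unsigned long pause_ms)
{
    if (pause_ms < PAUSE_MIN_MS)
        return limit * 2;   /* ASSUMPTION: "raised" modeled as doubling */
    if (pause_ms > PAUSE_MAX_MS)
        return limit > 1 ? limit / 2 : 1;   /* cut in half, never zero */
    return limit;           /* already within the target range */
}
```

A larger limit lets the process dirty more pages before it is next forced to pause, which lengthens that pause; a smaller limit has the opposite effect.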
Another change made by this patch series is to try to come up with a global
estimate of the memory pressure on the system. When normal memory scanning
encounters dirty pages, the pressure estimate is increased. If, instead,
the kswapd process on the most memory-stressed node in the system
goes idle, then the estimate is decreased. This estimate is then used to
adjust the throttling limits applied to processes; when the system is under
heavy memory pressure, memory-dirtying processes will be put on hold sooner
than they otherwise would be.
There is one other important change made in this patch set. Filesystem
developers have been complaining for a while that the core memory
management code tells them to write back too little memory at a time. On a
fast device, overly small writeback requests will fail to keep the device
busy, resulting in suboptimal performance. So some filesystems (xfs and
ext4) actually ignore the amount of requested writeback; they will write
back many more pages than they were asked to do. That can improve
performance, but it is not without its problems; in particular, sending
massive write operations to slow devices can stall the system for
unacceptably long times.
Once this patch set is in place, there's a better way to calculate the best
writeback size. The system now knows what kind of bandwidth it can expect
from each device; using that information, it can size its requests to keep
the device busy for one second at a time. Throttling limits are also based
on this one-second number; if there are not enough dirty pages in the
system for one second of I/O activity, the backing device is probably not
being used to its full capacity and the number of dirty pages should be
allowed to increase. In summary: the bandwidth estimation allows the
kernel to scale dirty limits and I/O sizes to make the best use of all of
the devices in the system, regardless of any specific device's performance
characteristics.
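The one-second rule is easy to express in code. This sketch assumes 4096-byte pages and a bandwidth estimate in bytes per second; both the units and the names are illustrative choices, not taken from the patch set.

```c
#define PAGE_SIZE_BYTES 4096ULL  /* assumed page size */

/* Number of pages to request in one writeback pass: enough to keep
 * the device busy for one second at its estimated bandwidth. */
unsigned long writeback_chunk_pages(unsigned long long bw_bytes_per_sec)
{
    return (unsigned long)(bw_bytes_per_sec / PAGE_SIZE_BYTES);
}

/* Nonzero if there is less than one second of dirty data queued for
 * the device, meaning the dirty limit could be allowed to grow. */
int may_raise_dirty_limit(unsigned long dirty_pages,
                          unsigned long long bw_bytes_per_sec)
{
    return dirty_pages < writeback_chunk_pages(bw_bytes_per_sec);
}
```

A device estimated at 100MB/s would thus be handed 25,600 pages at a time, while a USB stick managing 5MB/s would get only 1,280.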
Getting this code into the mainline could take a while, though. It is a
complicated set of changes to core code which is already complex; as such,
it will be hard for others to review. There have been some concerns raised
about the specifics of some of the heuristics. A large amount of
performance testing will also be required to get this kind of change
merged. So we may have to wait for a while yet, but better writeback
should be coming eventually.
By Jonathan Corbet
September 15, 2010
As the number of cores in systems increases, the need for fast
communications between processes running on those cores will also
increase. This week has seen the posting of a couple of patches aimed at
making interprocess messaging faster on Linux systems; both have the
potential to significantly improve system performance.
The first of these patches is motivated by a desire to make MPI faster.
Intra-node communications in MPI are currently handled with shared memory,
but that is still not fast enough for some users. Rather than copy
messages through a shared segment, they would rather deliver messages
directly into another process's address space. To this end, Christopher
Yeoh has posted a patch implementing what he calls cross memory attach.
This patch implements a pair of new system calls:
    int copy_from_process(pid_t pid, unsigned long addr, unsigned long len,
                          char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr, unsigned long len,
                        char *buffer, int flags);
A call to copy_from_process() will attempt to copy len
bytes, starting at addr in the address space of the process
identified by pid into the given buffer. The current
implementation does not use the flags argument. As would be
expected, copy_to_process() writes data into the target process's
address space. Either both processes must have the same ownership or the copying
process must have the CAP_SYS_PTRACE capability; otherwise the copy will
not be allowed.
The patch includes benchmark numbers showing significant improvement with a
variety of different tests. The reaction to the concept was positive,
though some problems with the specific patch have been pointed out. Ingo
Molnar suggested that an iovec-based
interface (like readv() and writev()) might be
preferable; he also suggested naming the new system calls
sys_process_vm_read() and sys_process_vm_write().
Nobody has expressed opposition to the idea, so we might just see these
system calls in a future kernel.
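For readers unfamiliar with the iovec style that Ingo suggested, here is what it looks like with the long-established readv() and writev() calls. This sketch demonstrates only the calling convention, using a pipe within a single process; it does not use the proposed cross-process system calls, and the function name is made up for the example.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather two separate buffers into a pipe with a single writev(), then
 * scatter the data back into two caller-supplied buffers with a single
 * readv(). head_out must hold 4 bytes and body_out 7 bytes.
 * Returns the number of bytes read, or -1 on error. */
ssize_t iovec_round_trip(char *head_out, char *body_out)
{
    char head[] = "hdr:";       /* 4 bytes sent */
    char body[] = "payload";    /* 7 bytes sent */
    struct iovec out[2] = {
        { .iov_base = head, .iov_len = 4 },
        { .iov_base = body, .iov_len = 7 },
    };
    struct iovec in[2] = {
        { .iov_base = head_out, .iov_len = 4 },
        { .iov_base = body_out, .iov_len = 7 },
    };
    int fds[2];
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;

    /* One system call moves both source buffers. */
    if (writev(fds[1], out, 2) != 11) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* One system call fills both destination buffers in order. */
    n = readv(fds[0], in, 2);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

The attraction for a message-passing library is that several non-contiguous regions can be transferred in one call rather than one call per region.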
Many of us do not run MPI on our systems, but the use of D-Bus is rather
more common. D-Bus was not designed for performance in quite the same way
as MPI, so its single-system operation is somewhat slower. There is a
central daemon which routes all messages, so a message going from one
process to another must pass through the kernel twice; it is also necessary
to wake the D-Bus daemon in the middle. That's not ideal from a
performance standpoint.
Alban Crequy has written
about an alternative: performing D-Bus processing in the kernel. To that
end, the "kdbus" kernel module introduces a new AF_DBUS socket
type. These sockets behave much like the AF_UNIX variety, but the
kernel listens in on the message traffic to learn about the names
associated with every process on the "bus"; once it has that information
recorded, it is able to deliver much of the D-Bus message traffic without
involving the daemon (which still exists to handle things the kernel
doesn't know what to do with).
When the daemon can be bypassed, a message can be delivered with only
one pass through the kernel and only one copy. Once again, significant
performance improvements have been measured, even though larger messages
must still be routed through the daemon. People have occasionally
complained about the performance of D-Bus for years, so there may be real
value in improving the system in this way.
It may be some time, though, before this code lands on our desktops. There
is a
git tree available with the patches, but they have never been cleaned
up and posted to the lists for review. The patch set is not small, so
chances are good that there will be a lot of things to fix before it can be
considered for mainline inclusion. The D-Bus daemon, it seems, will be
busy for a little while yet.
Page editor: Jonathan Corbet