Brief items

The current development kernel is 2.6.36-rc4, released on September 12. "Nothing in particular stands out, although there's been more noise in GPU development than I'd like at this point (both Radeon and i915). But that should hopefully all be just stabilization. There's also been some PCIe/firmware interaction changes, that should fix way more issues than it breaks." The short-form changelog is in the announcement, or see the full changelog for all the details.
Stable updates: a stable update was released on September 13. "It fixes a single bug that a number of users have reported in that their USB devices no longer work properly. Sometimes it causes lost keystrokes, and other times X refuses to boot as it can not communicate properly with some tablet devices."
Fixing bugs and making other improvements in the closed source driver is much harder than it is in the open driver, of course -- but if all you want to do is remove restrictions on available channels and tweak things like TX power, that's actually fairly easy with the binary drivers. That's why I say 'just as hackable'.
One of the outcomes from this year's Linux Storage and Filesystem Summit was a plan to create a combined tree to help ease the process of integrating changes to various storage subsystems. At the summit, James Bottomley "volunteered" himself to put the tree together, and that came to fruition with his announcement of the tree on September 10. Paralleling the discussion at the summit, there is still the lingering belief that more than just an automatically generated tree may be needed.
The tree currently collects patches from several subsystem trees, scsi, libata, and block, along with patches from the dm quilt repository. It is being automatically pulled and built nightly, much like linux-next. It will also be rebased daily against the mainline, which will make it somewhat harder for kernel hackers to use—also like linux-next. Because of that, Dave Chinner didn't really see the storage-tree as being all that useful: "I really don't see a tree like this getting wide use - if I enjoyed the pain of rebasing against throw-away merge trees every day, then I'd already be using linux-next."
Bottomley acknowledged that complaint, noting that using linux-next had been suggested at the summit, but pointed out that the storage-tree is a much smaller target than linux-next: "The diffs to mainline in the current storage tree are still under a megabyte." Bottomley also noted that the summit participants were skeptical that a tree without a "storage maintainer" to oversee it (à la Dave Miller's networking tree) would solve the problem, which was one of Chinner's concerns as well.
But there are political considerations too. "Unlike net, storage has never had a single maintainer, so it's a bit more political than just doing that by fiat", Bottomley said. Chinner was of the opinion that the summit was the obvious place to have made a decision to appoint a storage maintainer, even if all of the current maintainers of the storage subsystems were not present. But it's clear that those who were present wanted to move slowly, as Bottomley described.
The tree is available at git://git.kernel.org/pub/scm/linux/kernel/git/jejb/storage-tree. The nightly diffs from the mainline and log of the pull script are available as well. It is likely to take a bit of time to see if the storage-tree solves the problem with integration of cross-storage-subsystem changes, but it does provide a good starting point.
Kernel development news
The CFS scheduler divides time into periods, during which each process is expected to run once. The length of the period should thus determine the maximum amount of time that any given process can expect to have to wait to be able to run - the maximum latency. That length, by default, is 6ms. If there are two processes running, those 6ms will be split evenly, with each process getting a 3ms time slice.
This assumes that both processes are completely CPU-bound, have the same priority, and that nothing else perturbs the situation, naturally. If a third ideal CPU-bound process shows up, that same period is divided into three 2ms slices.
This process of dividing the scheduler period cannot continue forever, though. Every context switch has its cost in terms of operating system overhead and cache behavior; switching too often will have a measurable effect on the total throughput of the system. The current scheduler, by default, draws the line at 2ms; if the average time slice threatens to go below that length, the period will be extended instead. So if one more cranker process shows up, the result will be a period stretched to 8ms, with each of the four processes getting a 2ms slice.
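The arithmetic described above can be sketched as a simple function. This is a simplification of the logic in the CFS code, not the kernel's actual implementation; the names and structure here are illustrative:

```c
#define SCHED_LATENCY_NS 6000000UL	/* the 6ms default period */

/*
 * If every runnable task can get at least min_granularity_ns within
 * the default period, use the default period; otherwise stretch the
 * period so that each task still receives the minimum slice.
 */
unsigned long sched_period(unsigned int nr_running,
			   unsigned long min_granularity_ns)
{
	unsigned int nr_latency = SCHED_LATENCY_NS / min_granularity_ns;

	if (nr_running <= nr_latency)
		return SCHED_LATENCY_NS;
	return (unsigned long)nr_running * min_granularity_ns;
}
```

With the old 2ms minimum, a fourth CPU-bound task stretches the period to 8ms; with a 750µs minimum, up to eight tasks still fit within the 6ms period.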
In other words, once the load gets high enough, the kernel will start to sacrifice latency in order to keep throughput up. In situations where the load is quite high (kernel builds with a lot of parallel processes are often mentioned), latencies can reach a point where users start to get truly irritable. Mathieu Desnoyers decided he could improve the situation with this patch, which attempted to shrink the minimum possible time slice until there were more than eight running processes; in this way, he hoped to improve latencies on more heavily-loaded systems.
Mathieu's patch included some test results showing that the maximum latencies had been cut roughly in half. Even so, Peter Zijlstra dismissed the patch, saying "Not at all charmed, this look like random changes without conceptual integrity." That, in turn, earned a mild rebuke from Linus, who felt that the kernel's latency performance was not as good as it could be. After that, the discussion went on for a while, leading to the interesting conclusion that everybody was partly right.
Mathieu's patch was based on a slightly flawed understanding of how the scheduler period was calculated, so it didn't do quite what he was expecting. Rejecting the patch was, thus, the correct thing for the scheduler maintainers to do. The patch did improve latencies, though. It turns out that the change that actually mattered was reducing the length of the minimum time slice from 2ms to 750µs. That allows the scheduler to keep the same period with up to eight processes, and reduces the expansion of the period thereafter. The result is better latency measurements and, it seems, a nicer interactive feel. A patch making just the minimum time slice change was fast-tracked into the mainline and will be present in 2.6.36-rc5. Interestingly, despite the concerns that a shorter time slice would affect throughput, there has not been a whole lot of throughput benchmarking done on this patch so far.
Things don't stop there, though. One of Mathieu's tests uses the SIGEV_THREAD flag to timer_create(), causing the creation of a new thread for each event. That new thread, it seems, takes a long time to find its way into the CPU. The culprit here seems to be in the code which tries to balance CPU access between a newly forked process and its parent - a place which has often proved problematic in the past. Mike Galbraith pointed out that the START_DEBIT scheduler feature - which serves to defer a new task's first execution into the next period - has an unpleasant effect on latency. Turning that feature off improves things considerably, but with costs felt elsewhere in the system; in particular, it allows fork-heavy loads to unfavorably impact other processes.
Mathieu posted a patch adding a new feature called START_NICE; if it is enabled, both processes returning from a fork() will have their priority reduced for one scheduler period. With that penalty, both processes can be allowed to run in the current period; their effect on the rest of the system will be reduced. The associated benchmark numbers show a significant improvement from this change.
Meanwhile, Peter went away for a bit and came back with a rather more complex patch demonstrating a different approach. With this patch, new tasks are still put at the end of the queue to ensure that they don't deprive existing processes of their current time slices. But, if the new DEADLINE feature is turned on, each new task also gets a deadline set to one scheduler period in the future. Should that deadline pass without that process being scheduled, it will be run immediately. That should put a cap on the maximum latency that new threads will see.
This patch is large and complex, though, and Peter warns that his testing stopped once the code compiled. So this one is not something to expect for 2.6.36; if it survives benchmarking, though, we might see it become ready for the next development cycle.

The dynamic dirty throttling limits patch from Wu Fengguang demonstrates a new, relatively complex approach to making writeback better.
One of the key concepts behind writeback handling is that processes which are contributing the most to the problem should be the ones to suffer the most for it. In the kernel, this suffering is managed through a call to balance_dirty_pages(), which is meant to throttle a process's memory-dirtying behavior until the situation improves. That throttling is done in a straightforward way: the process is given a shovel and told to start digging. In other words, a process which has been tossed into balance_dirty_pages() is put to work finding dirty pages and arranging to have them written to disk. Once a certain number of pages have been cleaned, the process is allowed to get back to the vital task of creating more dirty pages.
There are some problems with cleaning pages in this way, many of which have been covered elsewhere. But one of the key ones is that it tends to produce seeky I/O traffic. When writeback is handled normally in the background, the kernel does its best to clean substantial numbers of pages of the same file at the same time. Since filesystems work hard to lay out file blocks contiguously whenever possible, writing all of a file's pages together should cause a relatively small number of head seeks, improving I/O bandwidth. As soon as balance_dirty_pages() gets into the act, though, the block layer is suddenly confronted with writeback from multiple sources; that can only lead to a seekier I/O pattern and reduced bandwidth. So, when the system is under memory pressure and very much needs optimal performance from its block devices, it goes into a mode which makes that performance worse.
Fengguang's 17-part patch makes a number of changes, starting with removing any direct writeback work from balance_dirty_pages(). Instead, the offending process simply goes to sleep for a while, secure in the knowledge that writeback is being handled by other parts of the system. That should lead to better I/O performance, but also to more predictable and controllable pauses for memory-intensive applications.
Much of the rest of the patch series is aimed at improving that pause calculation. It adds a new mechanism for estimating the actual bandwidth of each backing device - something the kernel does not have a good handle on, currently. Using that information, combined with the number of pages that the kernel would like to see written out before allowing a dirtying process to continue, a reasonable pause duration can be calculated. That pause is not allowed to exceed 200ms.
The patch set tries to be smarter than that, though. 200ms is a long time to pause a process which is trying to get some work done. On the other hand, without a bit of care, it is also possible to pause processes for a very short period of time, which is bad for throughput. For this patch set, it was decided that optimal pauses would be between 10ms and 100ms. This range is achieved by maintaining a separate "nr_dirtied_pause" limit for every process; if the number of dirtied pages for that process is below the limit, it is not forced to pause. Any time that balance_dirty_pages() calculates a pause time of less than 10ms, the limit is raised; if the pause turns out to be over 100ms, instead, the limit is cut in half. The desired result is a pause within the selected range which tends quickly toward the 10ms end when memory pressure drops.
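In outline, the heuristic described above might look something like the following sketch. This is not the patch's actual code; in particular, the growth factor used when raising the limit is an assumption, and the real arithmetic is more involved:

```c
/*
 * Pause time (in ms) needed to write "pages" dirty pages at an
 * estimated bandwidth of bw_pages_per_sec, capped at the 200ms
 * maximum.
 */
unsigned long pause_ms(unsigned long pages, unsigned long bw_pages_per_sec)
{
	unsigned long ms = pages * 1000 / bw_pages_per_sec;

	return ms > 200 ? 200 : ms;
}

/*
 * Per-process nr_dirtied_pause adjustment: raise the limit when
 * pauses come in under 10ms, halve it when they exceed 100ms.
 * The 50% growth step is an assumption for illustration.
 */
unsigned long adjust_pause_limit(unsigned long limit,
				 unsigned long last_pause_ms)
{
	if (last_pause_ms < 10)
		return limit + limit / 2;
	if (last_pause_ms > 100)
		return limit / 2;
	return limit;
}
```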
Another change made by this patch series is to try to come up with a global estimate of the memory pressure on the system. When normal memory scanning encounters dirty pages, the pressure estimate is increased. If, instead, the kswapd process on the most memory-stressed node in the system goes idle, then the estimate is decreased. This estimate is then used to adjust the throttling limits applied to processes; when the system is under heavy memory pressure, memory-dirtying processes will be put on hold sooner than they otherwise would be.
There is one other important change made in this patch set. Filesystem developers have been complaining for a while that the core memory management code tells them to write back too little memory at a time. On a fast device, overly small writeback requests will fail to keep the device busy, resulting in suboptimal performance. So some filesystems (xfs and ext4) actually ignore the amount of requested writeback; they will write back many more pages than they were asked to do. That can improve performance, but it is not without its problems; in particular, sending massive write operations to slow devices can stall the system for unacceptably long times.
Once this patch set is in place, there's a better way to calculate the best writeback size. The system now knows what kind of bandwidth it can expect from each device; using that information, it can size its requests to keep the device busy for one second at a time. Throttling limits are also based on this one-second number; if there are not enough dirty pages in the system for one second of I/O activity, the backing device is probably not being used to its full capacity and the number of dirty pages should be allowed to increase. In summary: the bandwidth estimation allows the kernel to scale dirty limits and I/O sizes to make the best use of all of the devices in the system, regardless of any specific device's performance characteristics.
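The one-second sizing rule reduces to simple arithmetic; in sketch form (names and the 4KB page size are illustrative, not taken from the patch):

```c
#define PAGE_SIZE_BYTES 4096UL

/*
 * One second's worth of writeback, in pages, for a device whose
 * estimated bandwidth is bw_bytes_per_sec.
 */
unsigned long writeback_chunk_pages(unsigned long bw_bytes_per_sec)
{
	return bw_bytes_per_sec / PAGE_SIZE_BYTES;
}

/*
 * If the dirty-page pool cannot sustain one second of I/O at the
 * system's total estimated bandwidth, the dirty limit should be
 * allowed to grow.
 */
int dirty_limit_too_low(unsigned long nr_dirty,
			unsigned long total_bw_bytes_per_sec)
{
	return nr_dirty < writeback_chunk_pages(total_bw_bytes_per_sec);
}
```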
Getting this code into the mainline could take a while, though. It is a complicated set of changes to core code which is already complex; as such, it will be hard for others to review. There have been some concerns raised about the specifics of some of the heuristics. A large amount of performance testing will also be required to get this kind of change merged. So we may have to wait for a while yet, but better writeback should be coming eventually.
One recently posted patch is motivated by a desire to make MPI faster. Intra-node communications in MPI are currently handled with shared memory, but that is still not fast enough for some users. Rather than copy messages through a shared segment, they would rather deliver messages directly into another process's address space. To this end, Christopher Yeoh has posted a patch implementing what he calls cross memory attach.
This patch implements a pair of new system calls:
    int copy_from_process(pid_t pid, unsigned long addr,
                          unsigned long len, char *buffer, int flags);
    int copy_to_process(pid_t pid, unsigned long addr,
                        unsigned long len, char *buffer, int flags);
A call to copy_from_process() will attempt to copy len bytes, starting at addr in the address space of the process identified by pid into the given buffer. The current implementation does not use the flags argument. As would be expected, copy_to_process() writes data into the target process's address space. Either both processes must have the same ownership or the copying process must have the CAP_SYS_PTRACE capability; otherwise the copy will not be allowed.
The patch includes benchmark numbers showing significant improvement with a variety of different tests. The reaction to the concept was positive, though some problems with the specific patch have been pointed out. Ingo Molnar suggested that an iovec-based interface (like readv() and writev()) might be preferable; he also suggested naming the new system calls sys_process_vm_read() and sys_process_vm_write(). Nobody has expressed opposition to the idea, so we might just see these system calls in a future kernel.
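To illustrate the proposed calling convention, here is a userspace stand-in. The real system call would copy across address spaces; this mock ignores pid and flags and copies from a local address, purely to show the signature, and its return-value convention (bytes copied) is an assumption:

```c
#include <string.h>
#include <sys/types.h>

/*
 * Userspace stand-in for the proposed copy_from_process(). The real
 * system call would read from the address space of the process named
 * by pid, subject to the ownership/CAP_SYS_PTRACE checks described
 * above; here the source address is treated as local for illustration.
 */
int copy_from_process(pid_t pid, unsigned long addr,
		      unsigned long len, char *buffer, int flags)
{
	(void)pid;
	(void)flags;
	memcpy(buffer, (const void *)addr, len);
	return (int)len;	/* bytes copied: an assumed convention */
}
```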
Many of us do not run MPI on our systems, but the use of D-Bus is rather more common. D-Bus was not designed for performance in quite the same way as MPI, so its single-system operation is somewhat slower. There is a central daemon which routes all messages, so a message going from one process to another must pass through the kernel twice; it is also necessary to wake the D-Bus daemon in the middle. That's not ideal from a performance standpoint.
Alban Crequy has written about an alternative: performing D-Bus processing in the kernel. To that end, the "kdbus" kernel module introduces a new AF_DBUS socket type. These sockets behave much like the AF_UNIX variety, but the kernel listens in on the message traffic to learn about the names associated with every process on the "bus"; once it has that information recorded, it is able to deliver much of the D-Bus message traffic without involving the daemon (which still exists to handle things the kernel doesn't know what to do with).
When the daemon can be shorted out, a message can be delivered with only one pass through the kernel and only one copy. Once again, significant performance improvements have been measured, even though larger messages must still be routed through the daemon. People have occasionally complained about the performance of D-Bus for years, so there may be real value in improving the system in this way.
It may be some time, though, before this code lands on our desktops. There is a git tree available with the patches, but they have never been cleaned up and posted to the lists for review. The patch set is not small, so chances are good that there will be a lot of things to fix before it can be considered for mainline inclusion. The D-Bus daemon, it seems, will be busy for a little while yet.
Page editor: Jonathan Corbet
Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds