Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.32-rc7, released on November 12. "Most of the commits are the kinds I like at this stage: one-liners and few-liners, but I have to admit that there's some bigger-than-I-would-have-liked patches to the Radeon KMS driver." The short-form changelog is in the announcement, or see the full changelog for all the details.
The 2.6.32-rc7 regression list shows a total of 41 unresolved regressions - a high number for this stage in the development cycle. So the final 2.6.32 release may still be a couple of weeks away.
Quotes of the week
Some approaches to parallelism avoidance
What do you do if you have a group of processes, but only want one of them to run at any given time? This kind of workload is not that uncommon; it appears in user-space threading applications, asynchronous I/O applications, and in applications which have background processing tasks. Stijn Devriendt has such a problem; he recently proposed a solution in the form of a new system call:
int sched_wait_block(pid_t pid, struct timespec *uts);
This call would put the process to sleep until the process indicated by pid blocked, at which point the calling process would go back onto the run queue. It would thus allow a sort of "only run me when process pid is sleeping" semantic.
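The intended usage might look something like this - a hypothetical sketch only, since the system call was never merged; the syscall number below is invented for illustration:

```c
/* Hypothetical use of the proposed sched_wait_block().  This was
 * only ever a proposal, so the syscall number here is invented and
 * this will not run on a stock kernel. */
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define __NR_sched_wait_block 999   /* invented; never allocated */

int main(void)
{
    pid_t worker = 1234;  /* PID of the process we defer to */
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };

    /* Sleep until 'worker' blocks (or the timeout expires), then
     * go back onto the run queue: "only run me when worker is
     * sleeping". */
    syscall(__NR_sched_wait_block, worker, &timeout);
    return 0;
}
```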
Ingo Molnar responded with a suggestion for a very different approach; to him, this problem is another nail for the "perf events" hammer. An interested process could sign up for "parallelism" events, then receive notifications when specific processes sleep or become runnable. He sees some real benefits from such a capability.
Linus, though, had a very different suggestion: rather than create this whole framework, just add a relatively stupid "only run one of this group of threads at a time" mode to the scheduler. This mode, which could be specified with a new clone() flag, seems like it could solve most of the problems in this area without adding a new set of complicated interfaces.
As of this writing, only sched_wait_block() has an actual patch associated with it, and nobody has committed to writing any others. So the eventual outcome - if any - from this conversation is unclear at best, but it's an interesting exploration of approaches in any case.
eclone()
Developers working to implement a checkpoint/restart capability for Linux want the ability to create a new process with a specific process ID. In the absence of that feature, restarted processes will suddenly find themselves with different PIDs, which can only lead to confusion. To implement explicit PID selection, the checkpoint/restart developers have proposed various extensions to the clone() system call with names like clone_with_pids() and clone_extended(). No version has yet been merged, and the proposed API continues to evolve.

The latest proposal is called eclone(); it looks like this:
int eclone(u32 flags_low, struct clone_args *args, int args_size, pid_t *pids);
The flags_low argument corresponds to the flags argument to the existing clone() call, which is running out of space for new flags. The pids argument is an optional list of PIDs to apply to the new child process, one for each namespace in which the process appears. Everything else goes into args:
struct clone_args {
    u64 clone_flags_high;
    u64 child_stack_base;
    u64 child_stack_size;
    u64 parent_tid_ptr;
    u64 child_tid_ptr;
    u32 nr_pids;
    u32 reserved0;
    u64 reserved1;
};
A number of these fields (child_stack_base, child_stack_size, parent_tid_ptr, child_tid_ptr) correspond to existing clone() arguments. clone_flags_high allows the addition of more flags; no new flags are defined in the eclone() proposal, though. The length of the pids array is given by nr_pids, and the reserved fields are there for future expansion.
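Putting the pieces together, a call might be assembled roughly as follows. This is a sketch based on the proposed API only; the syscall is unmerged, so eclone() stands in for whatever wrapper would eventually exist, and the stack variables and requested PID are illustrative:

```c
/* Sketch of an eclone() invocation following the proposal; 'stack'
 * and STACK_SIZE are assumed to be set up elsewhere, and the PID
 * value is an arbitrary example. */
struct clone_args args = {
    .child_stack_base = (u64)(unsigned long)stack,
    .child_stack_size = STACK_SIZE,
    .nr_pids          = 1,
};
pid_t pids[] = { 977 };   /* desired PID for the child */

int ret = eclone(SIGCHLD, &args, sizeof(args), pids);
```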
Comments on the new proposal have been scarce; it may be that the development community has gotten a little tired of seeing these patches over and over. The silence could also mean that there are no objections to this proposal. One big obstacle could remain to the merging of this system call, though: it is there to support the checkpoint/restart facility, which is definitely not ready for merging into the mainline. Getting checkpoint/restart to a completed and maintainable state is likely to take some time; until then, there may be reluctance to add a new system call which does not, yet, have any real-world users.
Van de Ven: Some PowerTOP updates
Arjan van de Ven reports on new PowerTOP features on his blog. The new features live in the PowerTOP git repository and require small kernel patches that will likely end up in 2.6.33. The features look at audio and SATA power management as well as "who is spinning up my disk": "Using the perf kernel infrastructure, the git version of PowerTOP now has included the equivalent of the blockdump feature, and will report disk-waking application both in the regular interactive view as well as in the diagnostic 'dump' mode."
Kernel development news
High-order GFP_ATOMIC allocation trouble
On its face, memory management would appear to be a straightforward task. When memory gets tight, the VM code need only evict the pages which will be unused for the longest time, making that memory available for shorter-term use. The hard part, of course, is identifying those pages. In the absence of perfect predictions of future memory use, the VM subsystem must rely upon a set of heuristics to make a set of (hopefully) reasonable choices. The design of heuristics which can handle most workloads is tricky, and even subtle code changes can lead to big changes in system behavior.

Since the beginning of the 2.6.31 development cycle, some users have been complaining about an increase in kernel memory allocation failures, leading to log messages, failed applications, and the occasional unwelcome appearance of the out-of-memory killer. Various bugs have been filed (see #14141 and #14265, for example) and a fair amount of head-scratching has gone on. But few developers really know where to start when looking at this kind of problem, and, of those who do, some have been content to write off the problem as being caused by higher-order allocations. So progress has been slow.
High-order (multi-page) allocations are a perennial problem on Linux systems; as memory fragments, it gets increasingly hard to find groups of physically-contiguous pages to satisfy higher-order allocation requests. Whenever possible, kernel code is written to avoid high-order allocations, but there are times when that is difficult. Many of the recently-reported problems seemingly have to do with certain not-top-of-the-line wireless network adapters which require contiguous memory chunks to operate. Fixing the problem is important - users of cheap network interfaces want to run Linux too - but there are also reports of single-page allocation failures.
Fortunately, Mel Gorman is not afraid to wander into that part of the kernel; he has been putting some serious time into reproducing the problem and trying to understand what has gone wrong since 2.6.30. Mel has posted a five-part patch series which tries to make allocation failures less likely again. Looking at what Mel has done provides a good lesson on just how subtle this kind of programming can be.
When looking at this code, it's worth bearing in mind that the kernel has two fundamental mechanisms for recovering memory when it is needed for new allocations. Direct reclaim is active memory cleaning done at allocation time; when an allocation falls short, the process trying to allocate the memory will go off and try to free some memory elsewhere in the system. Direct reclaim has the advantages of immediacy - reclaim work happens right away when memory pressure hits - and of dumping the work into processes which are allocating memory, but there are limits to how long any one process can spend reclaiming memory without introducing unacceptable latencies. So more extensive cleaning is pushed off to the kswapd kernel thread, which is dedicated to that task.
Current mainline kernels do not wake up kswapd from the direct reclaim code if the direct reclaim operation fails to get the job done. But if memory is that tight, kswapd should be running, especially if high-order allocations are needed. So the first patch in Mel's series is a simple one-liner which causes kswapd to be woken on direct reclaim failure and, perhaps, to work harder on recovering higher-order chunks as well. That change brings behavior back to something closer to what older kernels did.
Patch #2 is a simple tweak which keeps realtime interrupt handlers from driving the memory allocation code too hard. Again, this is a reversion to behavior seen back in the 2.6.30 days.
The third patch is a bit more subtle. Direct reclaim will, if it is successful, result in the creation of I/O operations to write dirty pages to their backing store. There are limits to the number of block I/O operations which can be outstanding, though; once that limit is hit the underlying device is said to be "congested" and the task performing reclaim is forced to wait until things clear out a bit. This "congestion wait" keeps the system from filling up with pending I/O operations and serves to throttle processes performing memory allocations.
As it happens, there are actually two "wait for congestion" queues - one each for synchronous and asynchronous requests. "Synchronous" requests are those for which a process is actively waiting - read requests, usually - while asynchronous requests are those which do not have active waiters. In current kernels, direct reclaim waits on the asynchronous queue, while older kernels used the synchronous queue instead. Moving back to the synchronous queue makes a number of problems go away, but Mel sees that fix as being workload-specific. Instead, he has changed the direct reclaim code to make it wait for congestion to clear on both queues.
Why does this help? It seems to be a matter of letting kswapd get its job done. Kswapd, too, must wait when queues become congested; if direct reclaimers are frequently filling the I/O queues, kswapd will stall more often. It turns out that better results are had if kswapd is allowed to run for longer periods of time. Making direct reclaimers wait until both queues have cleared allows kswapd to get some real work done once it gets going. That is good for the creation of high-order chunks and the performance of the system in general.
Patch #4 also relates to kswapd's duty cycle. Kswapd will stop working and go to sleep once it decides that it has done enough; one definition of "enough" is when the amount of free memory reaches an upper watermark value. But if kswapd is running, chances are good that there is unmet demand for memory in the system; in that situation, the amount of free memory may not stay above the high watermark for very long. Mel's patch has kswapd start with a catnap rather than a real sleep; after 0.1 sec., kswapd wakes back up and reassesses the situation. If the amount of free memory has fallen below the high watermark in that time, kswapd goes back to work; otherwise it goes to sleep for real. In this way, kswapd will continue to work to free memory if the system is consuming it quickly.
The final patch touches on another aspect of waiting for congestion. When block devices become congested, kswapd waits for things to clear. But, Mel notes, that may not be the right thing to do in all situations.
In the original version of the patch, kswapd would become increasingly resistant to waiting for congestion as the situation got worse. Motohiro Kosaki suggested an alternative approach, though, wherein kswapd simply refuses to wait as long as the high watermark is not reached, and Mel adopted it.
Mel's patch posting includes a fair amount of information on how he has tested it and what the results are. With the patch set applied, allocation failures are fewer, and system throughput improves as well. The sad truth about memory management patches, though, is that a change which improves one workload may worsen another. So these changes really need some widespread testing, especially since there is some interest in getting them into 2.6.32.
Receive packet steering
Contemporary networking hardware can move a lot of packets, to the point that the host computer can have a hard time keeping up. In recent years, CPU speeds have stopped increasing, but the number of CPU cores is growing. The implication is clear: if the networking stack is to be able to keep up with the hardware, smarter processing (such as generic receive offload) will not be enough; the system must also be able to distribute the work across multiple processors. Tom Herbert's receive packet steering (RPS) patch aims to help make that happen.

From the operating system's point of view, distributing the work of outgoing data across CPUs is relatively straightforward. The processes generating data will naturally spread out across the system, so the networking stack does not need to think much about it, especially now that multiple transmit queues are supported. Incoming data is harder to distribute, though, because it is coming from a single source. Some network interfaces can help with the distribution of incoming packets; they have multiple receive queues and multiple interrupt lines. Others, though, are equipped with a single queue, meaning that the driver for that hardware must deal with all incoming packets in a single, serialized stream. Parallelizing such a stream requires some intelligence on the part of the host operating system.
Tom's patch provides that intelligence by hooking into the receive path - netif_rx() and netif_receive_skb() - right when the driver passes a packet into the networking subsystem. At that point, it creates a hash from the relevant protocol data (IP addresses and port numbers, in particular) and uses it to pick a CPU; the packet is then enqueued for the target CPU's attention. By default, any CPU on the system is fair game for network processing, but the list of target CPUs for any given interface can be configured explicitly by the administrator if need be.
The code is relatively simple, but it succeeds in distributing the load of receive processing across the system. The use of the hash is important: it ensures that packets for the same stream of data end up on the same processor, increasing cache locality (and, thus, performance). This scheme is also nice in that it requires no driver changes at all, so it can be deployed quickly and with minimal disruption.
There is one place where drivers can help, though. The calculation of the hash requires accessing data from the packet header. That access will necessarily involve one or more cache misses on the CPU running the steering code - that data was just put there by the network interface and thus cannot be in any CPU's cache. Once the packet has been passed over to the CPU which will be doing the real work, that cache miss overhead is likely to be incurred again. Unnecessary cache misses are the bane of high-speed network processing; quite a bit of work has been done to eliminate them wherever possible. Adding a new cache miss for every packet in the steering code would be counterproductive.
It turns out that a number of network interfaces can, themselves, calculate a hash value for incoming packets. That processing comes for free, and it could eliminate the need to calculate that hash (and suffer the overhead of accessing the data) on the dispatching processor. To take advantage of this capability, the RPS patch adds a new rxhash field to the sk_buff (SKB) structure. Drivers which are able to obtain hash values from the hardware can place them in the SKB; the network stack will then skip the calculation of its own hash value. That should keep the packet's data out of the dispatching CPU's cache entirely, speeding processing.
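In a driver for hardware with that capability, the cooperation might look roughly like the following sketch; the NIC-specific structure and field names are invented for illustration, with only the rxhash field coming from the patch:

```c
/* Hypothetical driver receive path: my_rx_desc and its hash fields
 * are invented stand-ins for whatever a real NIC provides. */
static void my_nic_rx_packet(struct sk_buff *skb,
                             struct my_rx_desc *desc)
{
    if (desc->hash_valid)
        skb->rxhash = desc->hw_hash;  /* stack skips its own hash
                                       * calculation entirely */
    netif_receive_skb(skb);
}
```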
How well does this work? The patch included some benchmark results using the netperf tool. An 8-core server with a tg3-based network interface went from 90,000 transactions per second to 285,000; an e1000-based adapter on the same system went from 90,000 to 292,000. Similar results are obtained for nForce and bnx2x chipsets on 16-core servers. It would appear that this patch does succeed in making networking processing faster on multi-core systems.
The patch, incidentally, comes from Google, which has a bit of experience with network processing. It has, evidently, been running on Google's production servers for a while. So the RPS patch is, hopefully, an early component of what will be a broad stream of contributions from Google as that company tries to work more closely with the mainline. It seems like a good start.
SamyGO: replacing television firmware
While it is quite common for consumer electronics—TVs, DVRs, and the like—to be running Linux these days, it is less common to see projects geared towards replacing and upgrading the Linux firmware in that class of devices. But that is exactly what the SamyGO project is doing for Samsung televisions. By using the source provided by Samsung, along with quite a bit of ingenuity, SamyGO allows users not only to telnet into their television—an amusing concept—but also to enable functionality beyond that which ships with the device.
The SamyGO wiki lists several modifications that can be made to the TV firmware. One of the main modifications seems to be enabling NFS or SMB/CIFS support so that media files from servers on the network can be played. The TVs already support getting media from the local network using Digital Living Network Alliance (DLNA) protocols, but there are restrictions on the audio and video formats and some playback functionality (pause, forward, rewind) depending on the DLNA server. By using NFS or CIFS, all of the formats and features available for USB-based playback are also available across the network.
Obviously, these are fairly high-end TVs, with both Ethernet connectivity and USB ports. The devices "supported" by SamyGO are LCD models in the LE-32-55Bxxx series and LED models from the UE-xx-B70xx series. The USB ports are available for viewing/playing additional media or for games. Using the "Games" menu with programs stored on a USB stick is one of the ways to run programs on the TV.
The USB ports are also used for a Samsung-branded WiFi "dongle" that owners can buy to avoid the wiring hassle of Ethernet. But, Linux supports far more wireless devices than just the Samsung devices, so SamyGO developers are working to enable others as well. In fact, the Ralink rt73 and rt2870 drivers have been modified in the kernel source supplied by Samsung to remove many additional device IDs, so that only the Samsung devices will work. There are now drivers available without that restriction.
The early efforts have been to get telnet working so that the TV filesystem could be explored. This is done by patching the firmware binaries provided by Samsung and then using the TV's firmware upgrade mechanism to install them on the device. The aptly named "Warning : Read Me First or Brick Your TV!" message in the SamyGO forum outlines the dangers of upgrading the firmware. For those who just want to try this all out, without upgrading any firmware, a safer method is also described, which masquerades as a game on a USB stick to enable telnet.
The kernel is 2.6.18-based with the addition of Samsung's Robust FAT File System (RFS), which is a filesystem for NAND flash devices. As the name would indicate, it is also FAT compatible. It is not in the mainline, however, nor have the SamyGO developers gotten it working for desktop distributions. For that reason, they have resorted to binary patching of the firmware.
Samsung has also released RFS source, along with a Linux porting guide that should be helpful in those efforts. Once RFS can be built for recent kernels, or a utility to create RFS images is made, developers will be able to build their own firmware images for these TVs. [ Update: see the comments below, there is no source RFS release. ]
The kernel source is available, but the project has not yet released any kernels built from it. The Ralink drivers were rebuilt after modifying the device IDs, though, so they can be inserted into the system. The kernel itself has been patched, adding OMAP architecture and sound support among other things, but there has been no mention of binary drivers on the forum, so it should be possible to build the released kernel—or something more recent.
So far, Samsung doesn't seem to have reacted to the project, either positively or negatively. Some concern has been expressed in the forum that working around the WiFi restrictions might raise the company's ire. But one would guess that the number of folks willing to risk bricking an expensive TV in order to use a cheaper WiFi dongle is relatively small—likely to go unnoticed by Samsung.
In the meantime, if the SamyGO hackers add other functionality that might be interesting to customers—there has been talk of web browsers for example—Samsung might just adopt it themselves. Either way, the code is out there for those who might want to give it a try.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet