
KS2009: Mini-summit readouts

By Jonathan Corbet
October 19, 2009

The size and scope of the kernel make it hard to discuss subsystem-specific issues in the larger kernel summit setting. So it has become increasingly common practice to hold mini-summit events dedicated to specific corners of the kernel. Brief reports from the mini-summits are presented at the main kernel summit; there were four of these in Tokyo.

Realtime

Thomas Gleixner reported from the realtime mini-summit held recently in Dresden. Topics mentioned here included scalability, the big kernel lock, and the lock-naming problem. Realtime Linux tends to run into scalability problems earlier than "regular" Linux systems; it's a sort of canary in the coal mine in that regard. One of the big problems at the moment is dcache_lock. BKL removal has become a high priority for the realtime developers; it is seen as a prerequisite for getting much of the remaining realtime code merged into the mainline. The spinlock naming issue was mentioned, but discussion was deferred to the dedicated session on Tuesday.

See LWN's coverage of the realtime mini-summit for more information on what was discussed there.

Networking

The networking developers held a summit in Portland, just prior to LinuxCon. One of the topics discussed there was the recvmmsg() system call, being implemented by Arnaldo Carvalho de Melo. recvmmsg() can be a big performance win in situations where several datagrams can be expected to arrive close together; it allows an application to collect all of them with a single system call.
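The interface, as it was eventually merged (for 2.6.33, after this summit), takes an array of message headers and fills in as many as data is available for. Here is a minimal sketch of a userspace caller, assuming a Linux system with recvmmsg() support; the receive_batch() helper and the loopback setup around it are illustrative:

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Queue several UDP datagrams on a loopback socket, then drain them
 * with a single recvmmsg() call.  Returns the number of datagrams
 * received by that one call. */
int receive_batch(void)
{
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    bind(rx, (struct sockaddr *)&addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(rx, (struct sockaddr *)&addr, &len);

    /* Three datagrams queued, each with its own sendto() call. */
    for (int i = 0; i < 3; i++)
        sendto(tx, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr));

    /* One mmsghdr (with its own buffer) per datagram we might receive. */
    struct mmsghdr msgs[8];
    struct iovec iovs[8];
    char bufs[8][64];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < 8; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = sizeof(bufs[i]);
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* A single system call collects everything that is queued. */
    int n = recvmmsg(rx, msgs, 8, MSG_DONTWAIT, NULL);

    close(rx);
    close(tx);
    return n;
}
```

With MSG_DONTWAIT the call never blocks waiting to fill the whole array; it simply returns however many datagrams were queued, which is where the per-syscall overhead saving comes from.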

Another area of development is transmit interrupt mitigation. The NAPI interface has helped the networking stack reduce receive interrupt overhead for years now, but transmit interrupts, too, have a performance cost. Making interrupt mitigation for outgoing packets work can be challenging, especially in the presence of broken hardware which implements the feature badly. The networking stack really needs to know when the network adapter has finished with a specific packet so that the associated resources can be freed. If the hardware doesn't make that information available properly, things can go downhill in a hurry.

It turns out that transmit interrupt mitigation is also important in virtualization situations. Receipt notifications between guests can be surprisingly expensive; mitigation makes virtualized networking run much more efficiently.

Another area of development is the use of control groups for traffic shaping. Eventually it will be possible to control bandwidth use and port access by way of the control group mechanism.
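One mechanism already moving in this direction is the net_cls controller (merged around 2.6.29) together with the tc "cgroup" classifier. A rough sketch of how group-based shaping looks from user space follows; the mount point, group name, device, and rate are all illustrative, and the commands require root:

```shell
# Mount the net_cls control group controller and create a group.
mount -t cgroup -o net_cls none /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/limited

# Tag the group's traffic with tc class 10:1 (major 0x0010, minor 0x0001).
echo 0x00100001 > /sys/fs/cgroup/net_cls/limited/net_cls.classid
echo $$ > /sys/fs/cgroup/net_cls/limited/tasks

# Shape that class: everything sent by tasks in "limited" gets 1Mbit/s.
tc qdisc add dev eth0 root handle 10: htb
tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup
```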

The multiqueue work is still under development; for some situations (forwarding, for example) it's nearly optimal now. For others, such as traffic going to a local destination, there is still work to be done. Ted Ts'o asked if there had been any thought of hooking into the scheduler to help ensure that incoming data is steered to the right CPU. The problem is that this decision is made by the network card. There are cards which can be programmed to direct specific packets to the correct CPU; support for this feature in the networking stack is improving.

Power management

Len Brown presented the current state of power management and the results of the mini-summit held before the Linux Symposium in July. The goal of the power management developers at this point is to have power management enabled by default on all systems, with no impact on performance. There has been progress toward that goal. The addition of a debugging framework and the new, simplified driver API has helped a lot in that regard. Also helpful are the quality-of-service features and some improvements in Intel hardware.

It was wryly noted that it is now possible to suspend (and presumably resume) an S390 system.

Power management awareness at the driver level is growing; the mac80211 layer has recently gained suspend/resume support.

Unlike many subsystem maintainers, Len maintains statistics for reported and fixed bugs in the ACPI subsystem - and he shows them to others. The rate of reported bugs remains mostly constant over time, but the number of open bugs has been falling. In other words, the ACPI developers seem to be fixing bugs faster than they are introducing them.

ACPI 4.0 support is in the mainline now; it was actually shipped before the official specification came out. Linux is, in fact, the first operating system to ship an ACPI 4.0 implementation.

For 2.6.32, new features include power meter support and the much-discussed "force processor idle" patch. 2.6.33 will see a rework of the error-reporting code and "IPMI op-region" support. What's not done is support for the D3hot state; nobody really knows how to use that yet. There will eventually be support for the MSCT (maximum system characteristic table) feature, which tells the operating system how large the system can get - how much memory can be installed, for example. There are a number of other ACPI features - memory bandwidth monitoring, wake device, thermal extensions, etc. - which will be implemented if and when somebody makes hardware which actually has those features.

Work is being done to cherry-pick the most interesting features from TuxOnIce and get them upstream. The dream is to eventually see the TuxOnIce developer(s) working in the mainline rather than off in their own external tree. Also being worked on is suspend-to-RAM performance; in particular, resuming devices asynchronously looks like a way to get back to a working system more quickly.

Finally, a lot of work is being done on p-state and c-state tuning. There are some manufacturers who want to implement this functionality in the BIOS; Len and company would like to demonstrate that this work is better done at the operating system level.

I/O controllers

The I/O controller community has long suffered from an embarrassment of riches; there are several implementations out there. In fact, according to Fernando Luis Vásquez Cao, who presented the results of the I/O Controller mini-summit held just before the kernel summit, there are at least five controllers out there. It would be nice to get down to just one, which could then be merged into the mainline.

I/O controller users want a number of features, including the ability to control I/O traffic on either a proportional weight or a maximum bandwidth basis. That tends to drive the differences between the different implementations. A controller done at the I/O scheduler level makes proportional weight control relatively easy to implement; it also is relatively good at achieving full disk utilization. A controller above the virtual storage level, instead, is better for maximum bandwidth control; it can also control I/O to devices which bypass the I/O scheduler and can provide more topology-aware control.

Beyond that, full control of buffered writes also requires cooperation from the virtual memory code. That cooperation makes it possible to control the ratio of dirty pages, to perform page tracking, and to make the writeback code aware of the I/O controller infrastructure.

So which approach won out at the mini-summit? The conclusion which was reached was to implement a single controller which operates at all of the above-mentioned levels. The various I/O controller developers will work together to create this new tool which will do both BIO-based (high-level) and request-based (I/O scheduler level) control. There will be a single management infrastructure for both levels.

The current plan is to build the control group awareness into the CFQ I/O scheduler first. The developers would like to see this work merged for 2.6.33, but Fernando thinks that is ambitious; 2.6.34 seems more likely. Once that is done, work will begin on I/O control at the BIO and VM levels.
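From user space, the planned CFQ-level control would be driven through the control group filesystem. The following is a hypothetical sketch, assuming an interface along the lines of the proposed blkio controller; the mount point, group names, weights, and file names are illustrative, and the commands require root:

```shell
# Mount the I/O controller and create two groups with different weights.
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
mkdir /sys/fs/cgroup/blkio/batch /sys/fs/cgroup/blkio/interactive

# Proportional-weight control: "interactive" gets five times the disk
# time of "batch" when both are contending for the device.
echo 100 > /sys/fs/cgroup/blkio/batch/blkio.weight
echo 500 > /sys/fs/cgroup/blkio/interactive/blkio.weight

# Move the current shell (and its future children) into a group.
echo $$ > /sys/fs/cgroup/blkio/interactive/tasks
```

Note that proportional weights only take effect under contention; an uncontended group can still use the whole device, which is how this approach keeps disk utilization high.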

A question that came up at the end of this presentation: a similar agreement had been reached at the Storage and Filesystems workshop in April, so what's different this time? It does seem that there is a higher level of determination to actually carry the plan through this time. Jens Axboe noted that he had "brought a big club" to the discussion this time around.

The other concern had to do with other I/O schedulers - what about people who don't use CFQ? The long-term plan, it seems, is to reduce the number of I/O schedulers in the system. Eventually, it is hoped, only the CFQ and noop schedulers will remain. So there is not much point in hooking I/O controllers into the other I/O schedulers.



Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds