The size and scope of the kernel makes it hard to discuss
subsystem-specific issues in the larger kernel summit setting. So it has
become an increasingly common practice to hold mini-summit events dedicated
to specific corners of the kernel. At the main kernel summit, brief
reports from the mini-summits are presented; there were four of these
reports this year.
Thomas Gleixner reported from the realtime mini-summit held recently in
Dresden. Specific issues mentioned here included scalability issues, the
big kernel lock, and the lock naming problem. Realtime Linux tends to run
into specific scalability problems earlier than "regular" Linux systems;
it's a sort of canary in the coal mine in that regard. One of the big
problems at the moment is dcache_lock. BKL-removal has become a
high priority for the realtime developers; it is seen as a prerequisite for
getting much of the remaining realtime code merged into the mainline. The
spinlock naming issue was mentioned, but discussion was deferred to the
dedicated session on Tuesday.
See LWN's coverage of the
realtime mini-summit for more information on what was discussed there.
The networking developers held a summit in Portland, just prior to
LinuxCon. One of the topics discussed there was the recvmmsg()
system call, being implemented by Arnaldo Carvalho de Melo; it lets an
application receive multiple datagrams with a single call.
recvmmsg() can be a big performance win in situations where several
datagrams are expected to arrive close together.
Another area of development is transmit interrupt mitigation. The NAPI
interface has helped the networking stack reduce receive interrupt overhead
for years now, but transmit interrupts, too, have a performance cost.
Making interrupt mitigation for outgoing packets work can be challenging,
especially in the presence of broken hardware which implements the feature
badly. The networking stack really needs to know when the network adapter
has finished with a specific packet so that the associated resources can be
freed. If the hardware doesn't make that information available properly,
things can go downhill in a hurry.
It turns out that transmit interrupt mitigation is also important in
virtualization situations. Receipt notification between guests can be
surprisingly expensive; mitigation makes virtualized networking run much
faster.
Another area of development is the use of control groups for traffic
shaping. Eventually it will be possible to control bandwidth use and port
access by way of the control group mechanism.
The multiqueue work is still
under development; for some situations (forwarding, for example) it's
nearly optimal now. For others, such as for traffic going to a local
destination, there is still work to be done. Ted Ts'o asked if there had
been any thought of hooking into the scheduler to help ensure that incoming
data is steered to the right CPU. The problem with that is that the
network card makes that decision. There are cards which can be programmed
to direct specific packets to the correct CPU; support for this feature in
the networking stack is improving.
Len Brown presented the current state of power management and the results
of the mini-summit held before the Linux Symposium in July. The goal of
the power management developers at this point is to have power management
enabled by default on all systems, with no impact on performance. There
has been progress toward that goal. The addition of a debugging framework
and the new, simplified driver API has helped a lot in that regard. Also
helpful are the quality-of-service features and some improvements in Intel
hardware.
It was wryly noted that it is now possible to suspend (and presumably
resume) an S390 system.
Power management awareness at the hardware layer is growing; the mac80211
layer has recently gained suspend/resume awareness.
Unlike many subsystem maintainers, Len maintains statistics for reported
and fixed bugs in the ACPI subsystem - and he shows them to others. The
rate of reported bugs remains mostly constant over time, but the number of
open bugs has been falling. In other words, the ACPI developers are
fixing bugs faster than new ones are being reported.
ACPI 4.0 support is in the mainline now; it was actually shipped before the
official specification came out. It is, in fact, the first ACPI 4.0
implementation shipped in any operating system.
For 2.6.32, new features include power meter support and the much-discussed
"force processor idle"
patch. 2.6.33 will see a rework of the error reporting code and "IPMI
op-region" support. What's not done is support for the D3Hot state;
nobody really knows how to use that yet. There will eventually be support
for the MSCT feature, which tells the operating system how large the system
can get - how much memory can be installed, for example. There are a
number of other ACPI features - memory bandwidth monitoring, wake device,
thermal extensions, etc. - which will be implemented if and when somebody
makes hardware which actually has those features.
Work is being done to cherry-pick the most interesting features from
TuxOnIce and get them upstream. The dream is to eventually see the
TuxOnIce developer(s) working in the mainline rather than off in their own
tree.
Also being worked on is suspend-to-RAM performance; in particular,
resuming devices asynchronously looks like a way to get back to a working
system more quickly.
Finally, a lot of work is being done on p-state and c-state tuning. There
are some manufacturers who want to implement this functionality in
the BIOS; Len and company would like to demonstrate that this work is
better done at the operating system level.
The I/O controller community has long suffered from an embarrassment of riches;
there are several competing implementations. In fact, according to
Fernando Luis Vásquez Cao, who presented the results of the I/O
Controller mini-summit held just before the kernel summit, there are at
least five controllers out there. It would be nice to get down to just
one, which could then be merged into the mainline.
I/O controller users want a number of features, including the ability to
control I/O traffic on either a proportional weight or a maximum bandwidth
basis. That tends to drive the differences between the different
implementations. A controller done at the I/O scheduler level makes proportional
weight control relatively easy to implement; it also is relatively good at
full disk utilization. A controller above the virtual storage level,
instead, is better for maximum bandwidth control; it can also control I/O
to devices which bypass the I/O scheduler and can provide more
flexibility.
Beyond that, full control of buffered writes also requires cooperation from
the virtual memory code. That cooperation makes it possible to control the
dirty pages, to perform page tracking, and to make the writeback code aware
of the I/O controller infrastructure.
So which approach won out at the mini-summit? The conclusion which was
reached was to implement a single controller which operates at all of the
above-mentioned levels. The
various I/O controller developers will work together to create this new
tool which will do both BIO-based (high-level) and request-based (I/O
scheduler level) control. There will be a single management infrastructure
for both levels.
The current plan is to build the control group awareness into the CFQ I/O
scheduler first. The developers would like to see this work merged for
2.6.33, but Fernando thinks that is ambitious; 2.6.34 seems more likely.
Once that is done, work will begin on I/O control at the BIO and VM
levels.
The question that came up at the end of this presentation was: a similar
agreement had been reached at the Storage and Filesystems workshop in
April, so what is different this time? There does seem to be a higher
level of determination to actually carry the plan through now. Jens
Axboe noted that he had "brought a big club" to the discussion.
The other concern had to do with other I/O schedulers - what about people
who don't use CFQ? The long-term plan, it seems, is to reduce the number
of I/O schedulers in the system. Eventually, it is hoped, only the CFQ and
noop schedulers will remain. So there is not much point in hooking I/O
controllers into the other I/O schedulers.