Kernel development
Brief items
Kernel release status
The 2.6.35 merge window is still open, so there is no development kernel release as of this writing. See the separate article below for a summary of the merge window to date.
Stable updates: the 2.6.27.47, 2.6.32.14, and 2.6.33.5 stable updates were released on May 26. Note that there is likely to be only one more 2.6.33 update before support for that kernel stops.
Quotes of the week
Slab reduction
LWN took a brief look at the new SLEB memory allocator last week. Since then, Christoph Lameter has posted a second revision of the patch, but subsequent discussion suggests that SLEB is not likely to find its way into the mainline.
The main detractor at the outset was Nick Piggin, author of the unmerged SLQB allocator. Nick saw SLEB as a step forward from SLUB, which he suggests should never have been merged:
What Nick would like to see at this point is not another in-kernel slab allocator (not even SLQB), but, instead, an effort to improve slab itself, which, he says, is already pretty close to optimal. And, regardless, Linus has made it clear that he's not interested in merging more allocators:
Nick's plan is to start by cleaning up the slab allocator to make the code more approachable than it is now. Then, any performance problems can be carefully fixed up, with an emphasis on not causing performance regressions elsewhere. Over time, he says, we should get farther with a single allocator that is used (and tested) by everybody than by ripping out code with a long development history and replacing it with something new and untried.
Suspend blocker suspense
As of last week's article on the Android suspend blocker mechanism, the conversation appeared to be slowing down. Such blessings, it seems, are never permanent; many electrons have been perturbed to continue this discussion since then. The end result is that the late entrance into the discussion by people with names like Alan Cox, Thomas Gleixner, and Peter Zijlstra has made the merging of this feature less likely.
Alan's dissent was arguably the most coherent and constructive of those posted thus far. He thinks that the problem being addressed by suspend blockers (misbehaving applications) is real, but that the solution is wrong. He suggests, instead, the addition of a setpidle() system call which indicates the extent to which a process can prevent the system from going into an idle state. If the process is running an untrusted application, the system would be able to go idle (or suspend) even if that process is runnable at the time. More trusted processes (the ones which would be able to use suspend blockers in the Android scheme) would have a higher priority and would be able to run at any time.
The other piece of the solution, according to Alan, is to put pressure on the authors of badly-written applications. Thomas agrees:
A number of developers have expressed the fear that trying to mitigate the impact of badly-written applications in the kernel will only serve to take the pressure off developers, leading to more bad applications over time.
Meanwhile, Rafael Wysocki has sent a pull request for suspend blockers to Linus, saying:
As of this writing, Linus has not said what he intends to do. Given the way the conversation has gone, though, it would not be surprising to see the merge window end with no suspend blockers in the mainline. Merging a user-visible feature like suspend blockers is a move which cannot be undone after the 2.6.35 release; when there is this much disagreement, letting another development cycle go by may seem like the prudent thing to do.
Kernel development news
2.6.35 merge window part 2
Much code - some 6300 non-merge changesets - has gone into the mainline kernel since last week's article. Listing all of the changes would be an impossible task, even for the KernelNewbies folks who seem to get close, but an overview can be given. The most interesting user-visible changes are:
- The receive packet steering and receive flow steering mechanisms have
been added to the networking subsystem.
- The memory compaction patch set has been merged. This should lead to
less memory fragmentation and higher success rates for large allocations.
- A loophole which would, in some circumstances, allow a security module
to be registered at runtime has been closed; security modules must be
present at boot time.
- The network traffic control subsystem is now namespace-aware.
- The "Communication CPU to Application CPU Interface" (CAIF) protocol,
used to speak with ST-Ericsson modems, is now supported in the network
stack. Also supported is version 3 of the Layer Two Tunneling
Protocol (L2TP - RFC 3931).
- "FunctionFS" is a mechanism by which user-space USB drivers can create
USB gadgets with composite functionality. The "f_uvc" driver builds
on this feature to create a video capture device driven by data from
user space.
- There is now support for the "restricted access regions" mechanism
built into Intel "Moorestown" processors. RAR can be used to block
devices (including processors) out of specific regions of memory.
- The cpuidle "menu" governor now features idle pattern detection
which tries to be smarter about sleep-state selection based on recent
system history.
- The slab allocator has gained memory hotplug support.
- Extended attributes are now supported in the squashfs filesystem.
- The size of the in-kernel buffer used to hold data for a pipe can be
queried with the new F_GETPIPE_SZ fcntl() command,
and changed with F_SETPIPE_SZ. As of this writing, the units
used by this command are page-sized buffers, but that will almost
certainly change to bytes in the near future.
- Lots of new drivers:
- Input: Analog Devices AD714x capacitance touch sensors,
Hampshire serial touchscreens,
TI TCA6416 keypads,
Cando dual touch panels,
Minibox PicoLCD devices,
Prodikeys PC-MIDI Keyboard devices, and
Roccat Kone mice.
- Network: Atheros HTC based wireless cards,
USB-connected Agere Orinoco wireless cards, and
SBE wanPMC-C[421]E1T1 WAN adapters.
The ath5k driver has also gained adaptive noise immunity support, a
feature which is said to nearly double throughput in noisy
situations.
- Sound: USB Audio Class v2.0 compliant devices,
Wolfson Micro WM1133-EV1 on i.MX31ADS systems,
Wolfson Micro WM9090 amplifiers,
TI TWL6040 codecs,
Texas Instruments SDP4430 audio devices,
Zipit Z2 WM8750 audio devices, and
AudioScience ASI sound cards.
- Storage: devices with the SmartMedia/xd flash translation layer,
Ricoh R5C852 xD card readers,
MPC5121 built-in NAND flash controllers,
Denali NAND controller on Intel Moorestown systems,
Samsung OneNAND controllers, and
QLogic PCIe QLE InfiniBand host channel adapters.
- Systems and processors: much of the support code for the
ARM "MSM" architecture, as found in the G1/ADP1 handset, has been
merged. Additionally: OMAP3 SBC STALKER boards and
PowerPC 476 processors.
- Video4Linux: Trident TV Master tm5600/tm6000 chips,
SuperH VOU video output devices,
AK8813/AK8814 video encoders and
DataTranslation DT3155 frame grabbers.
- Miscellaneous: RDMA and iWARP on Chelsio T4 1GbE and 10GbE
adapters,
viafb-based i2c and GPIO devices,
SuperH IrDA devices,
Altera UARTs,
GSM MUX line discipline support (in the TTY layer),
Niagara2 stream processing units,
ACX565AKM (Nokia N900) panels,
Analog Devices ADIS16255 low power gyroscopes,
Analog Devices ADIS16300 and ADIS16400/5 inertial sensors,
Analog Devices ADIS16240 programmable impact sensors,
Analog Devices ADIS16260/5 digital gyroscope sensors,
Analog Devices ADIS16209 dual-axis digital inclinometer and
accelerometer devices,
Timberdale FPGA DMA engines,
ST-Ericsson DMA40 DMA controllers,
Texas Instruments ADS7871 A/D converters,
Freescale IMX2 hardware watchdogs,
Freescale MPC512x PSC SPI controllers, and
Cirrus EP93xx SPI controllers.
Additionally, "g_hid" is a USB gadget driver implementing the human interface device class specification, g_webcam is a gadget-side USB video device, and g_ffs allows the creation of USB composite functions.
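The pipe-buffer fcntl() commands mentioned in the list above can be tried from user space with something like the following. This is a minimal sketch, not a definitive usage guide: the helper names pipe_buffer_size() and pipe_buffer_resize() are invented for illustration, and the code assumes a kernel where sizes are expressed in bytes (the change anticipated above) rather than in pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* asks glibc to expose F_GETPIPE_SZ / F_SETPIPE_SZ */
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Fallback for older C library headers; these are the Linux command numbers. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#define F_GETPIPE_SZ 1032
#endif

/* Hypothetical helper: size of the kernel buffer behind a pipe, or -1 on error. */
long pipe_buffer_size(int fd)
{
    return fcntl(fd, F_GETPIPE_SZ);
}

/*
 * Hypothetical helper: request a new buffer size, then report the size that
 * is actually in effect (the kernel may round the request up). Returns -1
 * if the request is refused.
 */
long pipe_buffer_resize(int fd, int wanted)
{
    assert(wanted > 0);
    if (fcntl(fd, F_SETPIPE_SZ, wanted) < 0)
        return -1;
    return fcntl(fd, F_GETPIPE_SZ);
}
```

Either end of a pipe created with pipe() can be passed to these helpers. The value returned after a resize should be treated as the authoritative size, since the kernel is free to grant more than was asked for.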
Changes visible to kernel developers include:
- There is a new variant of request_irq():
int request_any_context_irq(unsigned int irq, irq_handler_t handler, unsigned long flags, const char *name, void *dev_id);
This function connects the interrupt handler in the usual way. The difference is that it looks at how the interrupt line itself was set up (by architecture-specific code) and decides whether to establish a traditional hard interrupt handler or a threaded handler. The return value on success is either IRQC_IS_HARDIRQ or IRQC_IS_NESTED.
- Also related to request_irq(): the IRQF_DISABLED
flag now does nothing; it will be removed entirely in 2.6.36. All
interrupt handlers are now called with interrupts disabled; see this article for details on
the change.
- The timer slack
mechanism has been merged. Timer slack (which applies to old-style
timers, not hrtimers) allows the system to defer timer expiration by a
bounded amount in order to get timers to expire at the same time, thus
minimizing system wakeups.
- The new function ktime_to_ms() converts a kernel time value
to milliseconds.
- A number of unused security module hooks (task_setuid,
sb_post_remount, sb_post_pivotroot,
sb_umount_close, acct, inode_delete,
sb_umount_busy, task_setgid,
task_setgroups, sb_post_addmount,
cred_commit, and key_session_to_parent) have been
removed.
- The power management quality of service (pm_qos) system API has been
significantly changed; see this article for details.
- "Tagged directory support" has been added to sysfs. This feature
allows namespace-specific tags to be added to sysfs directories; these
tags then control which version of a directory (if any) is visible
within any given namespace.
- kref_set() has been removed after a determination that none
of its three users were correct.
- The read(), write() and mmap() methods in
struct bin_attribute have gained a new struct file
pointer argument.
- The "kdb" low-level kernel debugger has been joined with kgdb and
merged into the mainline.
- The checkpatch script will now complain if kernel configuration
options are added with fewer than four lines of help text.
- There is a bunch of new infrastructure in the Video4Linux2 subsystem, including a framework for supporting memory-to-memory devices (video processing engines), a mechanism for reporting asynchronous events to user space, and a new core subsystem for infrared controllers.
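The idea behind the timer slack mechanism in the list above can be illustrated in user space. The following is a rough sketch only, loosely modeled on the rounding idea rather than copied from the kernel's timer code; slack_round() is a hypothetical name. It takes the latest expiry the slack allows and rounds it down to a coarser boundary, so that timers with nearby deadlines tend to land on the same tick.

```c
#include <assert.h>

/*
 * Hypothetical helper: pick an expiry somewhere in [expires, expires + slack]
 * by clearing the low-order bits of the latest allowed value. Timers whose
 * windows overlap tend to be rounded to the same boundary, which means one
 * wakeup instead of several.
 */
unsigned long slack_round(unsigned long expires, unsigned long slack)
{
    unsigned long limit = expires + slack;
    unsigned long diff = expires ^ limit;
    unsigned long low_bits = 0;

    assert(limit >= expires);   /* this sketch does not handle wraparound */

    if (diff == 0)
        return expires;         /* no slack, nothing to round */

    /* Build a mask covering every bit below the highest differing bit. */
    while (diff >>= 1)
        low_bits = (low_bits << 1) | 1;

    /* The result keeps that highest bit set, so it stays above expires. */
    return limit & ~low_bits;
}
```

With a slack of 10 ticks, expiries of 1001 and 1003 both round to 1008, so the two timers fire together. The deferral never exceeds the slack because the result is never larger than expires + slack.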
In a normal merge window, changes would continue to flow into the mainline until the end of the month. Linus has made it clear that he no longer guarantees "normal" merge windows, though, so the window could close sooner than that. Regardless, tune in next week for a summary of the final changes to be merged for 2.6.35.
2.6.35 Video4Linux2 enhancements
The 2.6.35 development cycle has been a busy one for the Video4Linux2 subsystem, with quite a bit of new infrastructure and some new drivers merged. This article will provide an overview of some of the new capabilities in V4L2.
Memory-to-memory devices
Video hardware often includes subsystems which are capable of processing video streams in various ways. The VIA chipset that your editor has recently been working with, for example, has a "high quality video" (HQV) engine which can be used to change video formats, rotate frames, convert between color spaces, perform deinterlacing, and more. It is not uncommon for video drivers to make use of an engine like this to make a wider range of formats and options available to applications. When used in this mode, the processing engine is hidden from user space; it looks like a part of the video input or output device.
But there can be value in making the video processing engine available as a device in its own right; applications will then be able to use it to accelerate operations on video data. The kernel has made other data-transformation engines available through various interfaces, the "dmaengine" API in particular. Simple DMA engines can move data around and possibly perform a transformation - exclusive-or with a value, for example. More complex engines can perform cryptographic transformations, and, indeed, are used for this purpose within the kernel's cryptographic code.
The dmaengine API has not been used for video processing engines, though. Your editor has not been told the reasons for that decision, but there are a couple of obvious guesses, starting with the fact that video engines might, in fact, not do DMA. For example, the VIA HQV engine requires that the relevant data be present in framebuffer memory. Perhaps more telling, though, is the complexity of the transformations which might be performed. Video data streams have an appalling number of formats and parameters; it takes a fairly complex API to allow applications to describe the sort of stream they want to deal with. Such an API could certainly be created for a new video processing facility, but that API already exists in the form of the V4L2 specification. It makes far more sense to reuse that API than to try to create it anew. Reusing the API happens naturally if the video processing engine looks like a V4L2 device in its own right.
So the new memory-to-memory (m2m) infrastructure is designed to enable the creation of V4L2 devices which move video frames from one memory buffer to another, performing some sort of transformation on the way. Frames are fed to the device as if it were an ordinary video output device, with all of the appropriate configuration done to describe the format of those frames. The video input side is, instead, configured with the format the application would like to have. The driver takes frames written to the device by applications, runs them through the processing engine, then makes them available for reading as if it were an ordinary video capture device.
The m2m API will only be discussed in the most superficial way here; see <media/v4l2-mem2mem.h> for more details. Drivers start by defining a set of callbacks:
struct v4l2_m2m_ops {
void (*device_run)(void *priv);
int (*job_ready)(void *priv);
void (*job_abort)(void *priv);
};
The device_run() callback will be called when there is work for the engine to do; job_abort(), instead, is called when the engine must be stopped as quickly as possible. The optional job_ready() function should return a nonzero value if the driver could currently start a job without sleeping.
The callbacks are registered with:
struct v4l2_m2m_dev *v4l2_m2m_init(struct v4l2_m2m_ops *m2m_ops);
When the device is opened, the driver should make a call to:
struct v4l2_m2m_ctx *v4l2_m2m_ctx_init(void *priv,
struct v4l2_m2m_dev *m2m_dev,
void (*vq_init)(void *priv, struct videobuf_queue *,
enum v4l2_buf_type));
The priv value will be passed through to the above-described callbacks. Two calls to vq_init() will be made, one each for the input and output buffer queues; vq_init() should, in turn, make a call to the appropriate videobuf initialization function.
There is a whole set of helper functions meant to be used to implement many of the V4L2 operations: v4l2_m2m_reqbufs(), v4l2_m2m_qbuf(), v4l2_m2m_streamon(), v4l2_m2m_mmap(), etc. They handle most of the heavy lifting of implementing a memory-to-memory driver, calling the device_run() callback when there is work to do and buffers are available. As the driver fills buffers with processed frames and passes them back to the videobuf system, they will, in turn, be handed back to the application.
Once again, most of the details have been glossed over, but that's the core of what this API does. As of this writing, there are no drivers for real hardware using this API in the mainline, though some have been posted for review. A sample user can be seen in drivers/media/video/mem2mem_testdev.c.
Events
The V4L2 API is dominated by the description of video streams and the actual movement of frames. There are other things of interest, though, which may happen in an asynchronous manner. To support the passing of asynchronous events to user space, a new set of events operations has been added. The initial use of this code is to allow applications to request notification for vertical sync events or the loss of the video stream.
At the user-space API level, there are a few additions to the seemingly endless list of V4L2 ioctl() commands: VIDIOC_SUBSCRIBE_EVENT, VIDIOC_UNSUBSCRIBE_EVENT, and VIDIOC_DQEVENT. Once an application has subscribed to one or more events, it can learn about them in the usual ways, including signals and poll(); a VIDIOC_DQEVENT call allows the application to see what actually happened.
Within the kernel, the event API starts with a new mechanism for the management of file handles associated with a device. Each new open file must be set up with:
#include <media/v4l2-fh.h>
int v4l2_fh_init(struct v4l2_fh *fh, struct video_device *vdev);
void v4l2_fh_add(struct v4l2_fh *fh);
That sets up the connections which allow V4L2 drivers to associate things (like events) with specific open files.
A driver which supports events should start with a call to:
#include <media/v4l2-event.h>
int v4l2_event_alloc(struct v4l2_fh *fh, unsigned int n);
This call will try to ensure that storage is available for at least n events on this file handle. Actual events are signaled with:
struct v4l2_event {
__u32 type;
union {
struct v4l2_event_vsync vsync;
__u8 data[64];
} u;
/* ... */
};
void v4l2_event_queue(struct video_device *vdev,
const struct v4l2_event *ev);
In addition, there is the usual set of helper functions (v4l2_event_dequeue(), v4l2_event_subscribe(), ...) meant to be called from the driver's V4L2 callbacks.
Currently, events are only supported by the IVTV driver, so that is the place to look for a usage example.
The infrared core
Back in December, LWN looked at the state of infrared receiver drivers in the kernel - or, in the case of the long out-of-tree LIRC subsystem, out of the kernel. That discussion has long since died down, but the hacking did not stop. The result is that, with 2.6.35, the V4L2 subsystem has a new framework for IR receivers. There is support for a number of controllers at this point. The IR core also includes an interface where drivers (or user space) can feed simple "mark" or "space" timing information which is then decoded in software; that should make it possible to hook a number of user-space LIRC drivers into the system.
Documentation on the new IR core is sparse; basic kernel API information can be found in include/media/ir-core.h and ir-common.h.
In conclusion
It has been a busy merge window for one of the most active subsystems in the kernel. Over the last few years, the V4L2 subsystem has built up an impressive amount of infrastructure and has reached the point where it supports almost all of the available hardware. That said, there is still quite a bit of work in progress; traffic on the mailing list is concerned with multi-plane video buffer support, a new control framework, and more. So future merge windows are likely to be busy in this part of the tree as well.
One ring buffer to rule them all?
One of the outcomes from the Collaboration Summit in April was a plan to create a tracing ring buffer implementation that would work for both Ftrace and LTTng. There was also hope that perhaps the separate ring buffer added for perf could be folded in as well, so that the vision of a single ring buffer implementation in the kernel—from the 2008 Kernel Summit and Linux Plumbers Conference—could come to fruition. To that end, Steven Rostedt posted an RFC for a unified ring buffer, but before that conversation could really get going, it was diverted. It seems that Ingo Molnar thinks there are much bigger issues to resolve in the world of Linux tracing.
A ring buffer is a circular data structure that is often used to hold data that is produced and consumed by separate processes without synchronization. For tracing, the data is produced by the kernel outside of any specific process's context, but the consumer is in user space. The kernel needs to hand out pages from the head of the ring buffer to user space for consumption, while ensuring that it doesn't overwrite that data as it writes to the tail of the buffer.
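The basic structure can be sketched in a few lines. The example below is a deliberately simple single-producer, single-consumer ring using C11 atomics; it bears no relation to the Ftrace, perf, or LTTng code, and all of the names are invented for illustration. Unlike a tracing buffer, which may be configured to overwrite old data, this sketch simply refuses new entries when it is full.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8            /* must be a power of two */

struct ring {
    int slots[RING_SLOTS];
    atomic_size_t head;         /* next slot to read; only the consumer advances it */
    atomic_size_t tail;         /* next slot to write; only the producer advances it */
};

/* Producer side: returns false rather than overwrite unread data. */
bool ring_put(struct ring *r, int value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SLOTS)
        return false;           /* full */

    r->slots[tail & (RING_SLOTS - 1)] = value;
    /* Publish the slot only after its contents have been written. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if there is nothing to read. */
bool ring_get(struct ring *r, int *value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    assert(value != NULL);
    if (head == tail)
        return false;           /* empty */

    *value = r->slots[head & (RING_SLOTS - 1)];
    /* Release the slot back to the producer once it has been copied out. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed; the acquire/release pairs are what keep the producer from reusing a slot before the consumer is done with it. The tracing ring buffers discussed here must additionally cope with multiple CPUs, NMI context, variable-sized events, and handing whole pages to user space, which is where much of their complexity comes from.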
In his RFC, Rostedt recounts the history of tracing ring buffers, noting that the Ftrace ring buffer did not become lockless until after perf came along. Since perf needed to be able to record events in non-maskable interrupt (NMI) contexts, it couldn't use the Ftrace ring buffer; instead, it used one of its own, written by Peter Zijlstra. Eventually, Rostedt changed the Ftrace ring buffer to be lockless, but at that point, it was too late for perf. In addition, the perf ring buffer allows user space to mmap() its pages, while the Ftrace ring buffer was geared to in-kernel users and splice().
So the kernel already has two tracing ring buffers, but there is also an out-of-tree ring buffer to consider, which is the one used by LTTng. That ring buffer has seen a lot of use in production Linux shops as well as being integrated into various embedded Linux distributions. In addition, much as was done with RCU, LTTng project lead Mathieu Desnoyers did a formal proof of the correctness of that ring buffer algorithm.
At the Collaboration Summit, there was a belief that the LTTng ring buffer could be extended to support all of the Ftrace use cases. It would seem that since then, Desnoyers has come up with ways to support the perf ring buffer use cases as well. Both Rostedt and Desnoyers would like to nail down all of the requirements for (at least) tracing ring buffers and put together a single implementation that works for all of them. Desnoyers has put together a git tree that includes a bare bones ring buffer as a starting point.
But Andi Kleen is not convinced of the need for a unified ring buffer, at least one that encompasses other non-tracing uses. The Ftrace ring buffer is very complex and "too clever"; plenty of other subsystems use kfifo:
He goes on to ask "If perf's current ring buffer works for it why not keep using it?". Rostedt more or less agrees with the complexity argument, but notes that there tends to be a misconception when people first look at the problem:
Desnoyers also points out that tracing has some requirements that other ring buffer users may not have:
There are advantages to sharing a ring buffer implementation among the different tracing solutions beyond just fulfilling Linus Torvalds's mandate from the 2008 Kernel Summit. High-performance ring buffer implementations tend to be more complex than standard code according to Desnoyers, "so it's good if we can focus our review efforts on a single ring buffer". In addition, if there is a common implementation, any performance tweaks will help all of its users.
There is another underlying reason for a single ring buffer, though. Molnar would like to see Ftrace phased out, with its functionality moved into perf. Rostedt is not necessarily opposed, but in order to do that, there needs to be some ring buffer implementation that supports both. The question is: how to get there?
Rather than directly commenting on the idea of a unified ring buffer itself, Molnar posted a request for discussion on "Future tracing/instrumentation directions". In it, he makes the case for moving Ftrace functionality to perf and suggests that Rostedt and Desnoyers help Zijlstra with "performance and simplification work" of the perf ring buffer:
If we really want to create a new ring buffer abstraction i'd suggest we start with Peter's, it has a quite sane design and stayed simple and flexible [...]
Molnar believes that the performance of the current ring buffers "sucks, in a big way. I've recently benchmarked it and it takes hundreds of instructions to trace a single event". He also thinks that the current "ftrace/perf duality" is hurting developers and users. One of the main things he would like to eliminate is the debugfs interface for Ftrace, but that will take some time:
There's also the detail that in some cases we want to print events in the kernel in a human readable way: for example EDAC/MCE and other critical events, trace-on-oops, etc. This too can be solved.
Thomas Gleixner and Ted Ts'o both spoke up in favor of the kernel events and tracepoints being discoverable from user space. Currently, that is well-supported by Ftrace using its debugfs interface, and both would like to see that continue. Gleixner wants simple tracing tools for embedded devices—eventually made a part of BusyBox—that don't have to change when new tracepoints or events are added. Ts'o on the other hand wants to be able to have bash-completion scripts that can figure out tracepoint names. Molnar agreed that it is important to maintain that ability going forward.
There is some debate about how badly the Ftrace ring buffer performs. Molnar is quite critical of its performance, but Rostedt disputes some of those findings. In the end, there doesn't seem to be much disagreement that a better-performing ring buffer could be created; there is just a question of how it should be approached.
Rostedt does not think that starting with the perf ring buffer is the right approach: "The current ring buffer in perf is very coupled with the perf design." Molnar, though, is leery of bringing yet another ring buffer implementation into the picture:
It doesnt mean we should disrupt _two_ implementations and put in a third one, unless there are strong technical reasons for doing so.
Those strong technical reasons may be found in the performance numbers for the various implementations. If Rostedt and Desnoyers can produce a ring buffer that works for Ftrace and perf, while performing better than either existing implementation, it seems likely that it would find a clear path into the kernel. As the discussion has trailed off, one gets the sense that the participants are now benchmarking and tweaking their implementations to try to achieve that.
The ring buffer implementation is at the heart of any Linux tracing solution; its capabilities and performance will largely dictate how intrusive tracing is on the rest of the system, which in turn impacts how useful the tracing output is. The fact that several smart developers have yet to come up with a super-low-impact solution speaks volumes about the difficulty of the problem. With all of the work that is currently going on, though, it seems likely that one way or another a high-performance ring buffer—with lower overall complexity—will come about.
Another interesting outcome from this discussion, short though it may have been, is that we are likely to see Ftrace fade away over time. The functionality won't disappear; it, along with much of the Ftrace code, would be moved into perf. But Ftrace itself, which really started the (relatively) recent mainline tracing push, might well be gone sometime in the next few kernel development cycles.
Page editor: Jonathan Corbet
