Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.33-rc7, released on February 6. "I have to admit that I wish we had way fewer regressions listed by this time... But we've certainly fixed a few things, and it's been a week, so here's -rc7. I wish I could say that it's the last -rc, but I strongly doubt that, and we'll almost certainly have at least one more." See the full changelog for the details.
Stable updates: 2.6.32.8 was released on February 9. "Sorry for the delay in releasing it, but there were a few crashes that people had reported, combined with verifying that a security problem really was fixed and backported properly, along with travel to and from FOSDEM, all [of] which caused delays." 2.6.27.45 remains the latest stable update for 2.6.27.
Quotes of the week
LSM is essentially a trashcan and just about everything icky gets swept over there. That's fine, as long as one doesn't care whether their code makes sense and just wants to keep it away from unfriendly eyes.
Kernel development news
Who wrote 2.6.33
The release of the 2.6.33-rc7 prepatch indicates that this development cycle is headed toward a close, even if Linus thinks that a -rc8 will be necessary. As has become traditional, LWN has taken a look at some statistics related to this cycle and where the code came from.
As of this writing, 10,500 non-merge commits have found their way into 2.6.33 - fairly normal by recent standards. These changes added almost 900,000 lines while deleting almost 520,000 others; as a result, the kernel grew by a mere 380,000 lines this time around. According to the most recent regression list, 97 regressions have been reported in 2.6.33, of which 20 remain unresolved.
Some 1,152 developers contributed code to 2.6.33. The most active of those were:
Most active 2.6.33 developers

By changesets
    Ben Hutchings                  145    1.4%
    Frederic Weisbecker            145    1.4%
    Arnaldo Carvalho de Melo       138    1.3%
    Luis R. Rodriguez              130    1.2%
    Masami Hiramatsu               128    1.2%
    Bartlomiej Zolnierkiewicz      124    1.2%
    Eric Dumazet                   108    1.0%
    Alan Cox                       105    1.0%
    Manu Abraham                   102    1.0%
    Thomas Gleixner                101    1.0%
    Eric W. Biederman               97    0.9%
    Roel Kluin                      91    0.9%
    Alexander Duyck                 88    0.8%
    Paul Mundt                      87    0.8%
    Johannes Berg                   80    0.8%
    Wey-Yi Guy                      77    0.7%
    Alex Deucher                    76    0.7%
    Jean Delvare                    73    0.7%
    Al Viro                         72    0.7%

By changed lines
    Bartlomiej Zolnierkiewicz   206468   18.1%
    Henk de Groot                50355    4.4%
    Jerry Chuang                 49627    4.3%
    Ben Skeggs                   37555    3.3%
    Philipp Reisner              23182    2.0%
    Eilon Greenstein             23123    2.0%
    Tomi Valkeinen               22508    2.0%
    Mike Frysinger               13116    1.1%
    Ben Hutchings                12680    1.1%
    Jakob Bornecrantz            11613    1.0%
    Wu Zhangjin                  11325    1.0%
    Greg Kroah-Hartman           10468    0.9%
    Rajendra Nayak                9978    0.9%
    Manu Abraham                  9625    0.8%
    jack wang                     9171    0.8%
    Masami Hiramatsu              8973    0.8%
    Alan Cox                      7672    0.7%
    David VomLehn                 7331    0.6%
    Arnaldo Carvalho de Melo      7217    0.6%
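The percentage column for changesets appears to be each developer's count divided by the total number of non-merge commits. The following is a minimal sketch of that arithmetic, assuming the rounded total of 10,500 quoted above; the function name is made up for illustration, and the exact total used to produce the published numbers is not given here.

```c
#include <assert.h>

/*
 * Share of `count` in `total`, in tenths of a percent, rounded to the
 * nearest tenth. A return value of 14 means 1.4%.
 * (Hypothetical helper, not part of any statistics tool.)
 */
long share_tenths_of_percent(long count, long total)
{
	assert(total > 0 && count >= 0);
	return (count * 1000 + total / 2) / total;
}
```

With these assumptions, 145 changesets out of 10,500 yields 14, matching the 1.4% shown for the top two entries.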
While some of the usual names appear at the top of this list, there are some newcomers as well. Ben Hutchings did a lot of work with network drivers, including the addition of the SolarFlare SFC9000 driver (which has several co-authors). Frederic Weisbecker has been active in a number of areas, adding the hardware breakpoints code, removing the big kernel lock from the reiserfs filesystem, and working with tracing and the perf tool. Arnaldo Carvalho de Melo's work is almost all with the perf events subsystem and the perf tool in particular. Luis Rodriguez continues to work all over the wireless driver subsystem, and with the Atheros drivers in particular, and Masami Hiramatsu's largest contribution is the dynamic probing work.
In the "lines changed" column, Bartlomiej Zolnierkiewicz continues to work on fixing up some wireless drivers in the staging tree, deleting a lot of code in the process; he also continues his IDE driver work. Henk de Groot added the Agere driver for HERMES II chipsets, Jerry Chuang added the Realtek rtl8192u driver, and Ben Skeggs added much of the Nouveau driver.
Contributions to 2.6.33 came from 182 employers that your editor was able to identify. The most active of those are:
Most active 2.6.33 employers

By changesets
    (None)                                1535   14.6%
    Red Hat                               1223   11.6%
    Intel                                 1011    9.6%
    (Unknown)                              868    8.3%
    IBM                                    500    4.8%
    Novell                                 390    3.7%
    Nokia                                  319    3.0%
    (Consultant)                           316    3.0%
    Fujitsu                                204    1.9%
    Texas Instruments                      199    1.9%
    Atheros Communications                 169    1.6%
    (Academia)                             166    1.6%
    AMD                                    165    1.6%
    Oracle                                 136    1.3%
    Analog Devices                         130    1.2%
    Renesas Technology                     126    1.2%
    Pengutronix                            125    1.2%
    HP                                     124    1.2%
    Solarflare Communications              123    1.2%

By lines changed
    (None)                              304895   26.7%
    (Unknown)                           109716    9.6%
    Red Hat                              92991    8.1%
    Broadcom                             54272    4.8%
    Realtek                              49951    4.4%
    Intel                                46302    4.1%
    Nokia                                37505    3.3%
    Novell                               27235    2.4%
    IBM                                  26783    2.3%
    (Consultant)                         25845    2.3%
    Texas Instruments                    24232    2.1%
    LINBIT                               23247    2.0%
    Analog Devices                       19677    1.7%
    VMWare                               16045    1.4%
    Samsung                              15707    1.4%
    Solarflare Communications            15054    1.3%
    JiangSu Lemote Corp.                 11439    1.0%
    AMD                                   9218    0.8%
    Universal Scientific Industrial Co.   9194    0.8%
As usual, Red Hat maintains its position at the top of the list, but others are gaining; we may yet see a day when Red Hat is just one of several major contributors. Some readers may be surprised to see Broadcom near the top of the list, given that this company's reputation for contribution is not the best. The truth of the matter is that Broadcom has several developers contributing to various drivers in the networking and SCSI subsystems; it's only in the wireless realm that the trouble starts.
For the fun of it, your editor typed the "changeset percent" numbers for the last ten releases into a spreadsheet and got this plot:
The percentages are surprisingly stable over the course of almost three years. The most obviously identifiable trends, perhaps, are the steady increases in the contributions from Intel and Nokia.
All told, the process continues to function smoothly. The occasional complaint about certain companies not fully participating in the process notwithstanding, the picture is one of hundreds of companies cooperating to a high degree to create the Linux kernel despite their fierce competition elsewhere. The significant percentage of code coming from developers working on their own time shows that Linux is not just a corporate phenomenon, though. We have built a development community which is able to incorporate the interests and work of an astonishingly wide variety of people into a single kernel.
As always, thanks are due to Greg Kroah-Hartman, who has done a great deal of work to reduce the size of the "(Unknown)" entries in the tables above.
Scripting support for perf
The perf tool for performance analysis is adding functionality quickly. Since being added to the mainline in 2.6.31, primarily as a means to access various CPU performance counters, it has expanded its scope. Support for treating kernel tracepoint events like performance counter events came into the kernel at around the same time. More recently, though, Tom Zanussi has added support for using perl and python scripts with the perf tool, making it even easier to do sophisticated processing of perf events.
The perl support is already in the mainline, but Zanussi added a python scripting engine more recently. Interpreters for both perl and python can be embedded into the perf executable, which allows processing the raw perf trace data stream in either of those languages.
The perl scripting can be used from the 2.6.33-rc series, but the python support is only available by applying Zanussi's patches to the tip tree. Building perf in the tools/perf directory, which requires development versions of various libraries and tools (glibc, elfutils, libdwarf, perl, python, etc.), then gives access to the new functionality.
Multiple different example scripts are provided with perf, which can be listed from perf itself:
    # perf trace -l
    List of available trace scripts:
      syscall-counts [comm]            system-wide syscall counts
      syscall-counts-by-pid [comm]     system-wide syscall counts, by pid
      failed-syscalls-by-pid [comm]    system-wide failed syscalls, by pid
      workqueue-stats                  workqueue stats (ins/exe/create/destroy)
      check-perf-trace                 useless but exhaustive test script
      failed-syscalls [comm]           system-wide failed syscalls
      wakeup-latency                   system-wide min/max/avg wakeup latency
      rw-by-file <comm>                r/w activity for a program, by file
      rw-by-pid                        system-wide r/w activity
This list is a mix of perl and python scripts that live in the tools/perf/scripts/{perl,python} directories and get installed in the proper location (/root/libexec by default) after a make install.
The scripts themselves are largely generated by the perf trace command. Zanussi's documentation for perf-trace-perl and perf-trace-python explains the process of using perf trace to create the skeleton scripts, which can then be edited to add the required functionality. Adding two helper shell scripts (for recording and reporting) to the appropriate directory will add new scripts to the list produced by perf trace -l, as shown above.
The installed scripts can then be used as follows:
    # perf trace record failed-syscalls
    ^C[ perf record: Woken up 11 times to write data ]
    [ perf record: Captured and wrote 1.939 MB perf.data (~84709 samples) ]
This captures the perf data into the appropriately named perf.data file, which can then be processed by:
    # perf trace report failed-syscalls
    perf trace started with Perl script \
        /root/libexec/perf-core/scripts/perl/failed-syscalls.pl

    failed syscalls, by comm:

    comm                    # errors
    --------------------  ----------
    firefox                     1721
    claws-mail                   149
    konsole                       99
    X                             77
    emacs                         56
    [...]

    failed syscalls, by syscall:

    syscall                           # errors
    ------------------------------  ----------
    sys_read                              2042
    sys_futex                              130
    sys_mmap_pgoff                          71
    sys_access                              33
    sys_stat64                               5
    sys_inotify_add_watch                    4
    [...]

    # perf trace report failed-syscalls-by-pid
    perf trace started with Python script \
        /root/libexec/perf-core/scripts/python/failed-syscalls-by-pid

    syscall errors:

    comm [pid]                           count
    ------------------------------  ----------
    firefox [10144]
      syscall: sys_read
        err = -11                         1589
      syscall: sys_inotify_add_watch
        err = -2                             4
    firefox [10147]
      syscall: sys_futex
        err = -110                           7
    [...]
This simple example shows using the failed-syscalls script to gather the data, then processing it with the corresponding perl script as well as a compatible python script (failed-syscalls-by-pid) that slices the same data somewhat differently. The first report shows a count of each system call that failed during the few seconds while the trace was active. It shows the number of errors by process, as well as by system call.
The second report combines the two and shows each process along with which system calls failed for it, and how many times. There are also corresponding scripts that count all system calls, not just those that failed, and report on them similarly. Wakeup latency, file read/write activity, and workqueue statistics are the focus of some of the other provided scripts.
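The reports above boil down to keyed counters that are bumped once per failed system call. The real scripts are written in perl and python; what follows is only a C model of that tallying logic, with invented names (struct tally, on_syscall_exit()) and a fixed-size table chosen for brevity.

```c
#include <assert.h>
#include <string.h>

#define MAX_KEYS 64
#define KEY_LEN  32

/* A tiny fixed-size table of named counters. Zero-initialize before use. */
struct tally {
	char key[MAX_KEYS][KEY_LEN];
	unsigned long count[MAX_KEYS];
	int used;
};

/*
 * Add one to the counter for `key`, creating it on first use.
 * Keys are compared on their first KEY_LEN - 1 characters only.
 * Returns 0 on success, -1 if the table is full.
 */
int tally_add(struct tally *t, const char *key)
{
	int i;

	assert(t->used >= 0 && t->used <= MAX_KEYS);
	for (i = 0; i < t->used; i++) {
		if (strncmp(t->key[i], key, KEY_LEN - 1) == 0) {
			t->count[i]++;
			return 0;
		}
	}
	if (t->used == MAX_KEYS)
		return -1;
	strncpy(t->key[t->used], key, KEY_LEN - 1);
	t->key[t->used][KEY_LEN - 1] = '\0';
	t->count[t->used] = 1;
	t->used++;
	return 0;
}

/* Return the current count for `key`, or 0 if it was never seen. */
unsigned long tally_get(const struct tally *t, const char *key)
{
	int i;

	for (i = 0; i < t->used; i++)
		if (strncmp(t->key[i], key, KEY_LEN - 1) == 0)
			return t->count[i];
	return 0;
}

/*
 * Model of a per-event handler: only failures (negative return values)
 * are counted, once by process name and once by system call name.
 */
void on_syscall_exit(struct tally *by_comm, struct tally *by_syscall,
		     const char *comm, const char *syscall, long ret)
{
	if (ret >= 0)
		return;
	tally_add(by_comm, comm);
	tally_add(by_syscall, syscall);
}
```

A by-pid report would use the same approach with a key built from the process name, the pid, the system call and the error value.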
These scripting features will make it that much easier for kernel hackers—or possibly those who aren't—to access the perf functionality. The state of tracing and instrumentation in the kernel has been quick to develop over the last few development cycles. It doesn't look to be slowing down anytime soon.
USB autosuspend
Introduction
Linux has supported system suspend to RAM and disk for several years now. This valuable feature has a major drawback, however: a system cannot be used while it is suspended. Reducing the power a system consumes while in active use is an even nicer feature. It is called "runtime power management." This can be done by clocking down or switching off components. The current kernel supports this mainly in the form of CPU frequency management and USB autosuspend.
The core kernel needs drivers to help it in order to do runtime power management; some support beyond what drivers need to do to support system suspension is necessary. Drivers need to tell the rest of the kernel when a device may be suspended without unduly impacting performance. Furthermore, drivers need to be able to suspend and resume a device in a live system without the process freezer protecting them from races. A driver for an ordinary character device need not worry about suspend() and resume() racing against open(), read(), write() or ioctl(). This is no longer true if a driver uses runtime power management, but techniques to avoid such races will be shown later.
USB was the first subsystem in the kernel to introduce runtime power management in the form of the USB autosuspend feature; its success has led to the generic framework just being merged.
USB 2.0 devices are rather simple in terms of power management. They know just two modes with respect to power management: active or suspended. They also retain all their internal state when suspended. This makes the job of drivers easy in the ideal case. The driver ceases IO to the device and suspends the device when it is no longer needed and reverses the process when it is needed again.
Testing USB autosuspend on laptops with the average set of built-in USB devices whose drivers all supported autosuspend, I found power savings on the order of 1W. The six laptops I tested drew about 15W of power on average, so USB autosuspend can reduce power consumption by about 7%.
That said, USB autosuspend is not just for laptops. All those single watts saved in a company's desktops will add up to serious power savings. Even the blades in a data center profit a bit as the root hubs are suspended, too.
API
The API for implementing USB autosuspend is based on drivers telling the core USB subsystem whenever a reason for not suspending a device arises or ceases to exist. The subsystem counts the reasons why a device must not be autosuspended; the core USB subsystem may then suspend a device whose counters have reached zero. "Counters" is not a typo: a USB device may consist of a multitude of interfaces, each of which may have its own driver.
The counters are manipulated with "get" and "put" functions which wake or suspend devices according to the state of the counters. They are provided in synchronous and asynchronous versions.
- usb_autopm_get_interface(struct usb_interface *);
    Increment the counter and guarantee the device has been resumed (may sleep).
- usb_autopm_put_interface(struct usb_interface *);
    Decrement the counter (may sleep).
- usb_autopm_get_interface_async(struct usb_interface *);
    Increment the counter, which will wake the device at a later time (safe in atomic contexts).
- usb_autopm_put_interface_async(struct usb_interface *);
    Decrement the counter (safe in atomic contexts).
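The counting scheme can be illustrated with a small userspace model. This is a sketch only: struct model_device and the model_*() functions are invented for illustration and are not the kernel's data structures; they merely mimic the rule that a device may be autosuspended once every interface's counter has reached zero.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INTERFACES 4

/* Userspace stand-in for a USB device with per-interface usage counters. */
struct model_device {
	int pm_usage[MAX_INTERFACES];
	int num_interfaces;
	bool suspended;
};

/* Model of a synchronous "get": take a reference and wake the device. */
void model_get(struct model_device *dev, int intf)
{
	assert(intf >= 0 && intf < dev->num_interfaces);
	dev->pm_usage[intf]++;
	dev->suspended = false;	/* resume if necessary */
}

/* Model of "put": drop a reference; the device is not suspended here. */
void model_put(struct model_device *dev, int intf)
{
	assert(intf >= 0 && intf < dev->num_interfaces);
	assert(dev->pm_usage[intf] > 0);	/* gets and puts must balance */
	dev->pm_usage[intf]--;
}

/* The device may be autosuspended only when all counters are zero. */
bool model_may_autosuspend(const struct model_device *dev)
{
	int i;

	for (i = 0; i < dev->num_interfaces; i++)
		if (dev->pm_usage[i] > 0)
			return false;
	return true;
}
```

In this model a single interface holding a reference is enough to keep the whole device awake, which is why every driver bound to a device has to cooperate.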
The asynchronous versions were recently fixed in commit ccf5b801 for the 2.6.32 release; earlier kernels were buggy. Those stuck with an older kernel for some reason cannot use these functions.
For these manipulations of the counters to have any effect, a driver must tell the USB subsystem that it supports USB autosuspend. It does so by setting a flag in its usb_driver structure. For example, the kaweth driver includes this initialization:
    static struct usb_driver kaweth_driver = {
        /* ... */
        .supports_autosuspend = 1,
    };
The core USB subsystem guarantees drivers that for all its calls to methods of struct usb_driver, except for, of course, resume() and reset_resume(), the device in question has been resumed and won't be suspended while the call is in progress.
Sysfs
Two sysfs attributes are exported pertaining to USB autosuspend for each device.
- /sys/$DEVICE/power/level
    "on" keeps the device active (autosuspend is disabled); "auto" allows the kernel to autosuspend it.
- /sys/$DEVICE/power/autosuspend
    The delay, in seconds, between the counters reaching zero and autosuspend.
The delay mentioned in this list serves a double function. Firstly, some devices have a large energy consumption when resuming; disks, for example, have to spin up. Suspending them for a very short time saves no energy. The delay is a heuristic to avoid such situations. Secondly, some devices need time to process data even after the host has finished talking to them. So do not set this delay to zero unless you know what you are doing.
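From user space, both attributes are ordinary files that can be written with standard I/O. The sketch below assumes the caller passes the device's power directory (the /sys/$DEVICE/power directory described above); the helper names write_attr(), read_attr() and enable_autosuspend() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write `value` plus a newline to the file `dir`/`name`.
 * Returns 0 on success, -1 on any failure.
 */
int write_attr(const char *dir, const char *name, const char *value)
{
	char path[512];
	FILE *f;
	int n, ok;

	assert(dir && name && value);
	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%s\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Read the first line of `dir`/`name` into `buf`, without the newline. */
int read_attr(const char *dir, const char *name, char *buf, size_t len)
{
	char path[512];
	FILE *f;
	int n;

	assert(dir && name && buf && len > 0);
	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/* Set the idle delay first, then allow the kernel to autosuspend. */
int enable_autosuspend(const char *power_dir, unsigned int delay_seconds)
{
	char delay[16];

	snprintf(delay, sizeof(delay), "%u", delay_seconds);
	if (write_attr(power_dir, "autosuspend", delay) != 0)
		return -1;
	return write_attr(power_dir, "level", "auto");
}
```

Writing the delay before switching the level to "auto" is a choice made in this sketch so that the device never runs with autosuspend enabled and a stale delay.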
Detecting idleness
Most devices are, obviously, idle most of the time. Think about how often one uses the fingerprint sensor or the camera built into most modern laptops. Even an Ethernet adapter is almost always unused while the WLAN is active and vice versa.
User space tells the kernel when it may require services of a device; an application must open a device before it can use it. This is true for any device that maps to a character device node and also for network devices, which are upped and downed. The notable exceptions to this rule are few, mainly framebuffers and input devices. These require considerable work to provide good runtime power savings.
Autosuspend based on open and close
When code follows this pattern, the kernel will not autosuspend a device for which a file descriptor is held open. The pattern can also be used for network devices because they have an equivalent to open() and close() in the form of ifconfig up and ifconfig down.
Let us have a look at a driver that implements this simple form of autosuspend:
From the kaweth driver:
    static int kaweth_open(struct net_device *net)
    {
        struct kaweth_device *kaweth = netdev_priv(net);
        int res;

        res = usb_autopm_get_interface(kaweth->intf);
        if (res) {
            err("Interface cannot be resumed.");
            return -EIO;
        }
        /* ... */
    }
The driver calls usb_autopm_get_interface() at the very beginning. This ensures that the device will not be autosuspended after it has returned without an error. The driver may henceforth assume that the device is usable and may ignore the issue of power management until the device is closed again. The driver must just make sure that it does no IO to the device before it calls usb_autopm_get_interface().
A similar pattern is followed when the device is closed:
    static int kaweth_close(struct net_device *net)
    {
        struct kaweth_device *kaweth = netdev_priv(net);

        netif_stop_queue(net);
        /* ... */
        kaweth_kill_urbs(kaweth);
        usb_autopm_put_interface(kaweth->intf);
        /* ... */
    }
The driver finishes all IO to the device, then calls usb_autopm_put_interface(). For a conventional driver waiting for all IO to finish is a very good idea; for a driver using this kind of autosuspend it is mandatory. Strictly speaking one cannot be sure exactly when transferred data has been processed by the hardware. That's why the core USB subsystem introduces a small delay between the counters reaching zero and the first attempt to autosuspend the device.
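The open/close pairing, including the error paths, can be modelled in user space. This sketch is not the kaweth code; the structure, the failure knobs (resume_fails, start_io_fails) and the function names are assumptions made so that the balancing of the counter can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for one interface and its driver state. */
struct model_intf {
	int pm_usage;		/* the usage counter */
	bool resume_fails;	/* test knob: device cannot be resumed */
	bool start_io_fails;	/* test knob: starting I/O fails */
	bool io_running;
};

/* Model of the synchronous get: on failure the counter is unchanged. */
int model_get(struct model_intf *intf)
{
	if (intf->resume_fails)
		return -EIO;
	intf->pm_usage++;
	return 0;
}

void model_put(struct model_intf *intf)
{
	assert(intf->pm_usage > 0);
	intf->pm_usage--;
}

int model_open(struct model_intf *intf)
{
	int res = model_get(intf);	/* before any I/O to the device */

	if (res)
		return -EIO;		/* nothing to undo: the get failed */
	if (intf->start_io_fails) {
		model_put(intf);	/* balance the successful get */
		return -EIO;
	}
	intf->io_running = true;
	return 0;
}

void model_close(struct model_intf *intf)
{
	intf->io_running = false;	/* finish all I/O first ... */
	model_put(intf);		/* ... then drop the reference */
}
```

The point of the model is that every path out of open() either keeps exactly one reference (success) or none (failure), so that close() can drop exactly one.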
The normal implementations of suspend() and resume() needed to support system sleep need not be altered much, if at all. The reason they may need to be changed is locking, because resume() can be called directly from usb_autopm_get_interface(). Thus, resume() must not attempt to retake a lock that is already held when usb_autopm_get_interface() is called. In theory this restriction is obvious; in practice this is the most common bug in resume().
The resume() function also operates under some restrictions concerning memory allocations. It may use only GFP_NOIO or GFP_ATOMIC to allocate memory. This restriction arises because the kernel might otherwise try to resume another device to launder pages. One should take care to get this right; otherwise this bug will show itself in very rare spurious deadlocks almost impossible to debug.
A driver's little helpers
For some types of devices there's a generic driver for which subdrivers are written; USB serial devices are in that category. For such devices this simple form of autosuspend is already supported in generic code. A subdriver needs only to set supports_autosuspend.
Autosuspend for devices that user space has opened
Some devices are open for most of the running time of the system. For such devices, power saving measures which are active only in the closed mode are futile. The canonical example is the keyboard which is literally always open. To get significant power savings, the detection of idleness must be refined to the point that periods of actual idleness can be detected after user space has informed the kernel that services of a device may be required.
For output this is a comparatively easy task. As user space requests that the kernel perform output to a device, the device ceases to be idle. It becomes idle again when the output has been completed.
Let us look at an example for how output in the simple case is done.
As the open() method is no longer fine-grained enough an instrument to determine idleness, the detection is pushed down into the write() code path.
From the cdc-wdm driver (unrelated code has been removed):
    static ssize_t wdm_write(struct file *file, const char __user *buffer,
                             size_t count, loff_t *ppos)
    {
        u8 *buf;
        int rv = -EMSGSIZE, r, we;
        struct wdm_device *desc = file->private_data;
        struct usb_ctrlrequest *req;

        /* ... */
        r = mutex_lock_interruptible(&desc->wlock); /* concurrent writes */
        r = usb_autopm_get_interface(desc->intf);
        set_bit(WDM_IN_USE, &desc->flags);
        rv = usb_submit_urb(desc->command, GFP_KERNEL);
        if (rv < 0) {
            kfree(buf);
            clear_bit(WDM_IN_USE, &desc->flags);
        }
        /* ... */
    }
After some preliminaries a lock is taken and usb_autopm_get_interface() is called. Thereafter the driver knows that the device is and will remain active. I/O can be started just as if the driver didn't do runtime power management. However, care must be taken to balance the counters in the error case by calling usb_autopm_put_interface().
As I/O finishes, the counter must be decremented again. This is done in the completion handler using usb_autopm_put_interface_async().
This example from usbnet shows how to do it.
    static void tx_complete (struct urb *urb)
    {
        /* ... */
        usb_autopm_put_interface_async(dev->intf);
        urb->dev = NULL;
        entry->state = tx_done;
        defer_bh(dev, skb, &dev->txq);
    }
It is literally a one-liner.
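Putting the write path and the completion handler together, the counter has to return to zero whether or not the submission succeeds. The following userspace sketch models that bookkeeping; the names and the submit_fails knob are invented, and the comments indicate which kernel call each step stands in for.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for a device with one outstanding write. */
struct model_dev {
	int pm_usage;
	bool urb_in_flight;
	bool submit_fails;	/* test knob: submission is rejected */
};

/* Model of the write() path. */
int model_write(struct model_dev *dev)
{
	int rv;

	dev->pm_usage++;			/* usb_autopm_get_interface() */
	rv = dev->submit_fails ? -ENOMEM : 0;	/* usb_submit_urb() */
	if (rv < 0) {
		/* No completion handler will run, so balance here. */
		dev->pm_usage--;
		return rv;
	}
	dev->urb_in_flight = true;
	return 0;
}

/* Model of the completion handler, which runs in atomic context. */
void model_complete(struct model_dev *dev)
{
	assert(dev->urb_in_flight);
	dev->urb_in_flight = false;
	dev->pm_usage--;		/* usb_autopm_put_interface_async() */
}
```

Exactly one of the two decrements runs for each write: the one in the error branch or the one in the completion handler.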
The PM message and using the return value of the suspend() method
There's another facet of autosuspend that deserves to be mentioned. In case all the counters mentioned here don't help, one can benignly fail an autosuspend by returning -EBUSY from suspend(). If this is done during a full system suspend, the whole suspend operation will be aborted. Therefore this should really be limited to autosuspend, and to rare cases. Automatic suspend can be detected by testing the PM_EVENT_AUTO bit in the event field of the message parameter to suspend().
When suspend is aborted in this way, the core USB subsystem will retry the autosuspension after the above-mentioned delay.
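A suspend() method following this advice might be structured as in the userspace model below. The MODEL_* constants and the structures are stand-ins defined here for illustration (the bit values are arbitrary and should not be read as the kernel's definitions); only the shape of the test against the "auto" bit is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the PM event bits; the values are arbitrary. */
#define MODEL_PM_EVENT_SUSPEND	0x0002
#define MODEL_PM_EVENT_AUTO	0x0400

struct model_pm_message {
	int event;
};

struct model_dev {
	bool busy;		/* the driver still has work pending */
	bool suspended;
};

/* Model of a driver's suspend() method. */
int model_suspend(struct model_dev *dev, struct model_pm_message msg)
{
	/* Refuse only autosuspend; never abort a system suspend. */
	if ((msg.event & MODEL_PM_EVENT_AUTO) && dev->busy)
		return -EBUSY;

	/* A real driver would stop its I/O here before suspending. */
	dev->suspended = true;
	return 0;
}
```

A system sleep request, which lacks the "auto" bit, always succeeds in this model, so a busy device cannot abort the whole operation.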
Remote wakeup and spontaneous input
Handling input in the same manner as output hits a fundamental obstacle. The usual semantics of input operations are that input data a device generates is stored in a buffer and handed to user space as the read() system call is executed. A driver cannot normally predict when a device will volunteer input data.
To overcome this obstacle, USB has a feature called "remote wakeup". The feature is optional, but generally supported by devices it makes sense for.
A suspended device using remote wakeup can tell the system that it would like to transfer input data. The system is then required to resume the device. The feature can best be thought of as an analog of interrupts: like interrupts on PCI devices, remote wakeup with a USB device has to be explicitly enabled.
A driver requests that remote wakeup be enabled by setting the aptly-named needs_remote_wakeup flag in struct usb_interface. The core USB subsystem will never autosuspend a device that does not support remote wakeup if any of its interfaces' drivers request that remote wakeup be enabled.
Let us look at an example of how a driver requests that remote wakeup be enabled:
From cdc-acm:
    static int acm_tty_open(struct tty_struct *tty, struct file *filp)
    {
        struct acm *acm;

        /* ... */
        if (usb_autopm_get_interface(acm->control) < 0)
            goto early_bail;
        else
            acm->control->needs_remote_wakeup = 1;
        /* ... */
        usb_autopm_put_interface(acm->control);
        /* ... */
    }
Note that a driver has to make sure its device is active when it requests that remote wakeup be enabled. The device will automatically be resumed as input data becomes ready to be transferred. The driver must take care that remote wakeup is disabled when the device is closed again.
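The rule about remote wakeup can be added to a userspace model of the autosuspend decision. As before, this is a sketch with invented names; it encodes two statements from the text: the flag is changed only while the device is held active, and a device without remote wakeup support is never autosuspended while any interface asks for it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INTERFACES 4

struct model_intf {
	int pm_usage;
	bool needs_remote_wakeup;
};

struct model_dev {
	struct model_intf intf[MAX_INTERFACES];
	int num_interfaces;
	bool supports_remote_wakeup;	/* a property of the hardware */
};

/* May the core autosuspend this device right now? */
bool model_may_autosuspend(const struct model_dev *dev)
{
	int i;

	for (i = 0; i < dev->num_interfaces; i++) {
		if (dev->intf[i].pm_usage > 0)
			return false;
		if (dev->intf[i].needs_remote_wakeup &&
		    !dev->supports_remote_wakeup)
			return false;
	}
	return true;
}

/* Change the flag only while a reference keeps the device active. */
static void model_set_wakeup(struct model_dev *dev, int i, bool on)
{
	assert(i >= 0 && i < dev->num_interfaces);
	dev->intf[i].pm_usage++;		/* get */
	dev->intf[i].needs_remote_wakeup = on;
	dev->intf[i].pm_usage--;		/* put */
}

void model_open(struct model_dev *dev, int i)
{
	model_set_wakeup(dev, i, true);
}

void model_close(struct model_dev *dev, int i)
{
	model_set_wakeup(dev, i, false);
}
```

After model_open() the counters are back at zero, so whether the device sleeps depends entirely on whether it can wake the host when input arrives.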
Marking a device busy
Waking up a device has some cost in time and power; it takes about 40ms to wake up the device. Therefore staying in the suspended mode for less than a few seconds is not sensible. As already mentioned, there's a configurable delay between the time the counters reach zero and the time autosuspend is attempted. When using remote wakeup, however, the counters remain at zero all the time unless they are incremented due to output. Yet a delay between the last time a device is busy, that is, does I/O, and the next attempt to autosuspend the device is highly desirable.
An API is provided for that purpose:
- usb_mark_last_busy(struct usb_device *);
    Start the delay for the autosuspend anew from now on (safe in atomic contexts).
This function restarts the delay every time it is called.
Let us look at an example - from cdc-acm:
    static void acm_read_bulk(struct urb *urb)
    {
        struct acm_ru *rcv = urb->context;
        struct acm *acm = rcv->instance;

        /* ... */
        if (!ACM_READY(acm)) {
            dev_dbg(&acm->data->dev, "Aborting, acm not ready");
            return;
        }
        usb_mark_last_busy(acm->dev);
    }
The driver marks the device busy as it receives data and then processes the received data. This way, autosuspend is attempted only if no input or output was performed for the duration of the configurable delay.
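The effect of restarting the delay can be expressed as a simple timestamp comparison. The sketch below uses plain seconds passed in by the caller rather than kernel time, and its names are invented; it models the behaviour described here, not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in; all times are in seconds. */
struct model_dev {
	int pm_usage;
	long last_busy;		/* when the device last did I/O */
	long delay;		/* the configurable autosuspend delay */
};

/* Model of marking the device busy: restart the delay from `now`. */
void model_mark_last_busy(struct model_dev *dev, long now)
{
	dev->last_busy = now;
}

/* Autosuspend is attempted only after a full idle period has passed. */
bool model_autosuspend_due(const struct model_dev *dev, long now)
{
	assert(dev->delay >= 0);
	return dev->pm_usage == 0 && now - dev->last_busy >= dev->delay;
}
```

With a delay of two seconds, input arriving every second keeps pushing the deadline out, so a steadily used device never suspends even though its counters are zero.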
Sleepless in the kernel
What is to be done if a driver cannot sleep in its write path? In that case a simple solution can no longer be given. The driver needs to call usb_autopm_get_interface_async() for every call to the write path, just as in the above example. The difference is that the driver cannot be sure that the device is active after the call. Obviously, since it cannot wait for the device to become active, I/O must be queued.
From usbnet's usbnet_start_xmit():
    spin_lock_irqsave(&dev->txq.lock, flags);
    retval = usb_autopm_get_interface_async(dev->intf);
    if (retval < 0) {
        spin_unlock_irqrestore(&dev->txq.lock, flags);
        goto drop;
    }

    #ifdef CONFIG_PM
    /* if this triggers the device is still asleep */
    if (test_bit(EVENT_DEV_ASLEEP, &dev->flags)) {
        /* transmission will be done in resume */
        usb_anchor_urb(urb, &dev->deferred);
        /* no use to process more packets */
        netif_stop_queue(net);
        spin_unlock_irqrestore(&dev->txq.lock, flags);
        devdbg(dev, "Delaying transmission for resumption");
        goto deferred;
    }
    #endif
The asynchronous API is used and errors handled. After that, if the device is still asleep, I/O is queued. The queued I/O must be actually started in resume().
From usbnet's usbnet_resume():
    spin_lock_irq(&dev->txq.lock);
    while ((res = usb_get_from_anchor(&dev->deferred))) {
        skb = (struct sk_buff *)res->context;
        retval = usb_submit_urb(res, GFP_ATOMIC);
        if (retval < 0) {
            dev_kfree_skb_any(skb);
            usb_free_urb(res);
            usb_autopm_put_interface_async(dev->intf);
        } else {
            dev->net->trans_start = jiffies;
            __skb_queue_tail(&dev->txq, skb);
        }
    }

    smp_mb();
    clear_bit(EVENT_DEV_ASLEEP, &dev->flags);
    spin_unlock_irq(&dev->txq.lock);
Here, I/O requests are taken from the queue and given to the hardware. Care must be taken to handle the counters correctly in the error case.
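The interplay between the transmit path and resume() can be followed in a single-threaded userspace model. Locking is deliberately left out, packets are plain integers, and all names are invented; the sketch only shows that requests made while the device sleeps are parked and later submitted in order, with the counter adjusted on every failure.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_LEN 8

struct model_dev {
	bool asleep;
	int pm_usage;
	int deferred[QUEUE_LEN];	/* parked while the device sleeps */
	int num_deferred;
	int submitted[QUEUE_LEN];	/* log of what reached the hardware */
	int num_submitted;
};

/* Hand one packet to the modelled hardware; -1 if there is no room. */
static int model_submit(struct model_dev *dev, int packet)
{
	if (dev->num_submitted == QUEUE_LEN)
		return -1;
	dev->submitted[dev->num_submitted++] = packet;
	return 0;
}

/* Model of a transmit path that must not sleep. */
int model_start_xmit(struct model_dev *dev, int packet)
{
	int rv;

	dev->pm_usage++;	/* async get: the device may still be asleep */
	if (dev->asleep) {
		if (dev->num_deferred == QUEUE_LEN) {
			dev->pm_usage--;	/* dropped: balance the get */
			return -1;
		}
		dev->deferred[dev->num_deferred++] = packet;
		return 0;	/* transmission will be done in resume */
	}
	rv = model_submit(dev, packet);
	if (rv < 0)
		dev->pm_usage--;
	return rv;
}

/* Model of resume(): start the parked I/O, then mark the device awake. */
void model_resume(struct model_dev *dev)
{
	int i;

	for (i = 0; i < dev->num_deferred; i++)
		if (model_submit(dev, dev->deferred[i]) < 0)
			dev->pm_usage--;	/* error case: drop the reference */
	dev->num_deferred = 0;
	dev->asleep = false;
}

/* Model of the completion handler for one transmitted packet. */
void model_tx_complete(struct model_dev *dev)
{
	assert(dev->pm_usage > 0);
	dev->pm_usage--;
}
```

Each packet accounts for exactly one reference from the moment it is accepted until it completes or is dropped, regardless of whether it went through the deferred queue.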
A driver's not so little helpers
Usbnet implements both forms of autosuspend for its subdrivers. If a subdriver sets supports_autosuspend it gets the simple form of autosuspend. If, instead, it defines
- manage_power(struct usbnet *dev, int on);
    Manage remote wakeup according to on (may sleep).
it also gets runtime power management while the interface is up. This function is supposed to set needs_remote_wakeup based on "on".
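A manage_power() implementation along these lines can be very small. The userspace sketch below uses invented stand-in structures, and it assumes, rather than quotes, an int return type and that the generic code invokes the callback when the interface goes up and down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the structures involved. */
struct model_intf {
	bool needs_remote_wakeup;
};

struct model_usbnet;

struct model_driver_info {
	/* Optional callback supplied by the subdriver. */
	int (*manage_power)(struct model_usbnet *dev, int on);
};

struct model_usbnet {
	struct model_intf *intf;
	const struct model_driver_info *info;
};

/* A subdriver's callback: request remote wakeup according to `on`. */
int model_manage_power(struct model_usbnet *dev, int on)
{
	assert(dev->intf != NULL);
	dev->intf->needs_remote_wakeup = (on != 0);
	return 0;
}

/* Assumed behaviour of the generic code when the interface goes up. */
int model_usbnet_open(struct model_usbnet *dev)
{
	if (dev->info->manage_power)
		return dev->info->manage_power(dev, 1);
	return 0;
}

/* Assumed behaviour of the generic code when the interface goes down. */
void model_usbnet_stop(struct model_usbnet *dev)
{
	if (dev->info->manage_power)
		dev->info->manage_power(dev, 0);
}
```

Keeping the callback optional lets subdrivers that only want the simple open/close form leave it unset.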
Conclusion
I've tried to show how, in most cases, significant power savings can be had with little effort. I hope that many coders will find this useful in their work. In runtime power management the whole is more than the sum of the parts. Remember that all a device's interfaces must support autosuspend for a device to be autosuspended and all a hub's children must be suspended for the hub to be suspended. In this case the chain breaks at the weakest link. Thus I hope every driver developer makes at least a small effort to consider runtime power management.
[ The author would like to thank B1-Systems for their support. ]
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Networking
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet