
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc1, released by Linus on March 29. "So 2.6.39-rc1 is out there, and the merge window is closed. I still have to look over the cleancache pull request (which I got in plenty of time, but decided that I want to review after the merge window craziness is over), but other than that, we're done." Significant changes in 2.6.39 will include the open by handle system calls, CLOCK_BOOTTIME, an option to force all interrupt handlers to run as threads, ipset, the transcendent memory core, the media controller subsystem, the CHOKe flow scheduler, and much more; see the long-format changelog for details.

Stable updates: the,,, and stable kernel updates were released on March 28. Each contains another long list of important fixes; note that the update will be the last for the 2.6.37 series.

Previously, was released on March 24 with a single build fix.


Quotes of the week

The C preprocessor... It is ugly, inelegant, painful, annoying, and should have been strangled at birth -- but it is always there when you need it!
-- Paul McKenney

Make Linux Software presents the fastest ever embedded Linux boot for 720 MHz ARM and NAND flash memory. Linux boot time is 300 milliseconds from boot loader to shell.
-- Constantine Shulyupin

Our hammer is kernel patches and all problems look like nails, but we'd end up with better user interfaces and a better kernel if we'd just stop stuffing more and fatter user interface code into the kernel.
-- Andrew Morton


Jump label reworked

By Jonathan Corbet
March 30, 2011
The jump label mechanism was last seen here in October, 2010. In short, jump label allows the optimization of "highly unlikely" code branches to the point that their normal overhead is close to zero. This speedup is done with runtime code patching; that is also the cost: enabling or disabling the unlikely case is an expensive operation. Thus, jump label is best used for code which is almost never enabled; tracepoints and dynamic debugging statements are obvious cases.

There were a number of complaints about the initial jump label implementation, including the fact that it was somewhat awkward to use. In response, a reworked version has been posted which changes the interface considerably. One starts by declaring a "jump key":

    #include <linux/jump_label.h>

    struct jump_label_key my_key;

Enabling and disabling the key is a simple matter of calling:

    jump_label_inc(struct jump_label_key *key);
    jump_label_dec(struct jump_label_key *key);

And using the key to control the execution of rarely-needed code becomes:

    if (static_branch(&my_key)) {
        /* Unlikely stuff happens here */
    }

In the absence of full jump label support, a jump key is represented by an atomic_t value. jump_label_inc() becomes atomic_inc(), jump_label_dec() becomes atomic_dec(), and static_branch() is implemented with atomic_read(). If jump label is configured into the kernel, enabling and disabling a jump key become heavier operations, while static_branch() becomes nearly free. For the intended use cases for jump labels, that is a worthwhile tradeoff.
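The fallback path described above can be sketched in user-space C. This is only an illustration: it assumes that C11 `<stdatomic.h>` can stand in for the kernel's atomic_t, and while the function names follow the article, the bodies are a guess at the shape of the fallback rather than the code from the posted patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a jump key when jump label support is absent:
 * just an atomic counter (atomic_t in the kernel, atomic_int here). */
struct jump_label_key {
    atomic_int enabled;
};

/* Each "inc" registers one more user of the key; mirrors atomic_inc(). */
static inline void jump_label_inc(struct jump_label_key *key)
{
    atomic_fetch_add(&key->enabled, 1);
}

/* Each "dec" drops one user; mirrors atomic_dec(). */
static inline void jump_label_dec(struct jump_label_key *key)
{
    int old = atomic_fetch_sub(&key->enabled, 1);
    assert(old > 0);  /* an unbalanced dec would be a caller bug */
}

/* The branch is taken while at least one user has enabled the key;
 * mirrors a test on atomic_read(). */
static inline bool static_branch(struct jump_label_key *key)
{
    return atomic_load(&key->enabled) != 0;
}
```

In this form every static_branch() costs an atomic load and a conditional branch; that is the overhead which the code-patching implementation is meant to remove.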

As of this writing, these changes have not been merged for 2.6.39. There is always a possibility that they could be pulled in before -rc2, but chances are that, at this point, the new jump label will have to jump into 2.6.40.


Powering down APM

By Jonathan Corbet
March 30, 2011
The APM power management interface has never been much loved - even ACPI was seen as a better alternative. There has been little or no hardware made which depends on APM for some years; Windows evidently stopped supporting it in 2006. Linux does still support APM, though, and that support has a cost, so it is perhaps not surprising that Len Brown would like to remove that support as of 2.6.40.

Removal of APM support on that schedule is almost certainly not going to happen; a number of developers have expressed concerns that there may still be hardware out there in use which would then be unable to run new kernels. In general, the Linux kernel tries not to abandon users running older hardware. So APM may stay for a while, but there is a problem: keeping APM support, it seems, conflicts with some needed changes to the cpuidle code. The need to keep APM working, in other words, threatens to hold back improvements for the majority of users who have more current hardware.

The solution to this conflict may take the form of a partial removal of APM support. The most important APM feature for users of old systems is likely to be the ability to power-off the system; other features may be less important. As Andi Kleen noted, idle support probably matters less to such users:

Phasing out APM idle at least would be reasonable. Presumably even if the old laptops still work they are likely on AC because their batteries have long died. So using a bit more power in idle shouldn't be a big issue.

So APM support, as such, may stick around for a while, but it may begin to lose features as the kernel moves on.


Kernel development news

The 2.6.39 merge window concludes

By Jonathan Corbet
March 29, 2011
There have been just over 2,200 non-merge changesets pulled into the mainline since the second installment in this series; that makes 8,757 total changes for this development cycle. The 2.6.39 merge window is now closed, so the feature set for this kernel development cycle should be complete. User-visible changes merged in the final part of the merge window include:

  • Beginning user namespace support has been merged. User namespaces are a sort of container where processes can safely be given root access within the container without being able to affect the rest of the system. Full container support is a long-term project, but the user namespace patches get the kernel one step closer.

  • It is now possible for a suitably privileged process to write to another process's /proc/pid/mem file.

  • The "group isolation" tunable for the CFQ I/O scheduler has been removed; group isolation is always provided now that the performance issues associated with that mode have been fixed.

  • There is a new "mtdswap" block device which allows swapping directly to memory technology devices.

  • New hardware support includes:

    • Processors and systems: Samsung Laptop SABI interfaces, WMI Hotkeys for Dell All-In-One series, Intel Medfield platform thermal sensors, and Asus Notebook WMI interfaces.

    • Miscellaneous: MSM chipset SMD packet ports, Texas Instruments TWL4030 hardware monitoring controllers, ST-Ericsson AB8500 voltage monitors, Maxim Semiconductor MAX8997/8966 PMICs, Maxim 8997/8966 regulators, Texas Instruments TPS61050/61052 boost converters, Ricoh R5C592 card readers, and OLPC XO-1.5 ebook switches.

    • Video4Linux: Technisat USB2.0 DVB-S/S2 receivers, Siliconfile NOON010PC30 CIF camera sensors, DiBcom 9000 tuners, 3com homeconnect "ViCam" cameras, OmniVision OV9740 sensors, ST Microelectronics STV0367 demodulators, OMAP3 camera controllers, Divio NW80x-based camera controllers, and ITE Tech IT8712/IT8512 infrared transceivers.

Changes visible to kernel developers include:

  • The dma64_addr_t type is no longer used in the kernel; it has been removed.

  • The videobuf framework in the Video4Linux2 subsystem has been replaced by a newer "videobuf2" version.

  • The media controller subsystem, intended to allow the system to export information about the topology of complex media subsystems to user space, has been merged.

  • printk() and friends have a new "%pB" format specifier which prints a backtrace symbol and its offset.

  • The m68k and m68knommu architectures have been merged in the kernel source tree.

  • A support library for BCH (Bose-Chaudhuri-Hocquenghem) encoding and decoding has been added.

  • Some low-level interrupt-related functions have changed names:

    Old name                           New name
    irq_data_get_irq_data()            irq_data_get_irq_handler_data()
    set_irq_chained_handler()          irq_set_chained_handler()
    set_irq_chip_and_handler_name()    irq_set_chip_and_handler_name()
    set_irq_nested_thread()            irq_set_nested_thread()

    The prototypes for these functions are otherwise unchanged.

The 2.6.39 kernel now goes into the stabilization phase of the development cycle. If the usual pattern holds, we can expect to see on the order of 2000 fixes merged between now and the final release, which is likely to happen in early June.


Dynamic devices and static configuration

By Jonathan Corbet
March 29, 2011
Linux users in the Good Old Days were treated to a number of experiences which are denied to newcomers; one of those was the tiresome task of figuring out where peripheral devices had chosen to put their I/O ports and interrupt lines and communicating that information to the kernel. Contemporary, self-describing hardware has taken a lot of the fun away in the name of making things Just Work. This kind of joy can still be had at the embedded level, though, where the trend toward discoverable hardware has not caught on in the same way. Recent discussions show that there is not, yet, a consensus among kernel developers regarding how such hardware should be configured.

The OMAP-based PandaBoard is a popular platform for those who are interested in experimenting with embedded applications. It comes with a dual-core processor, high-definition video capability, wireless networking, Bluetooth, an HDMI output, and the sadly standard closed graphics usually associated with these devices. It also has a "USB-attached" network port which is actually soldered to the board; it looks like a USB device, but it's not something the user could unplug without an act of significant violence.

This network port has moved developers toward violence for other reasons as well. It is recognizable as a network device, but there is no way to know that it is wired down. The board developers, in a move which is common in this area, also left out the small EEPROM which would normally contain the MAC address for this interface. In response to these design decisions, a standard Linux kernel booting on this board will call its network interface usb0 (a name normally used for USB point-to-point connections), and will generate a random MAC address for it. Anybody who might depend on a MAC address which is stable across boots will be out of luck.

This kind of non-discoverable hardware is common in the embedded sphere, so a number of techniques have been developed to allow the kernel to run on the resulting systems. The traditional approach is through the creation of "board files"; see board-msm7x30.c as an example. These files are meant to provide the kernel with enough information to understand the topology of the hardware it is running on; information related to specific devices is typically passed through a set of static platform_device structures, and through that structure's platform_data pointer in particular. As the driver initializes the device, it can refer to the platform_data pointer (which points to some sort of device-specific structure) for any information which it cannot get from the hardware itself.
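The platform data pattern can be illustrated with a small user-space mock. Everything here is an assumption made for the example: the structure is a simplified stand-in for the kernel's struct platform_device (where the pointer actually lives in the embedded struct device), and demo_eth_pdata, demo_eth_device, and demo_eth_probe() are hypothetical names that do not correspond to any real board file or driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct platform_device. */
struct platform_device {
    const char *name;
    void *platform_data;   /* opaque, device-specific configuration */
};

/* Hypothetical structure that board file and driver must agree on. */
struct demo_eth_pdata {
    unsigned char mac[6];
    int irq;
};

/* "Board file" side: a static description of hardware that cannot be probed. */
static struct demo_eth_pdata demo_eth_config = {
    .mac = { 0x02, 0x00, 0x00, 0x12, 0x34, 0x56 },
    .irq = 42,
};

static struct platform_device demo_eth_device = {
    .name = "demo-eth",
    .platform_data = &demo_eth_config,
};

/* "Driver" side: the void pointer is converted without any check, so the
 * driver simply trusts that the board file passed the right type.
 * Returns 0 on success, -1 if no platform data was supplied. */
static int demo_eth_probe(const struct platform_device *pdev,
                          unsigned char mac_out[6], int *irq_out)
{
    const struct demo_eth_pdata *pdata;

    assert(pdev != NULL);
    pdata = pdev->platform_data;
    if (pdata == NULL)
        return -1;
    memcpy(mac_out, pdata->mac, sizeof(pdata->mac));
    *irq_out = pdata->irq;
    return 0;
}
```

The unchecked conversion in demo_eth_probe() is the weak point: a board file that pointed platform_data at some other structure would still compile cleanly.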

The current platform_data implementation will not work for the PandaBoard, though, because platform_data is not passed to USB devices. These devices are meant to be entirely discoverable and self-describing, so it was thought that there would be no need for external configuration data in the kernel. The fact that these devices are dynamic means that their existence cannot be known or guaranteed when the board file is written, so trying to create static platform data for them would seem to make little sense.

The problem with this reasoning is that the PandaBoard's network interface is not fully discoverable and it is not dynamic. It is a sort of platform device disguised as a USB device. So Andy Green thought it would be reasonable to use platform data as a way of configuring this device; in particular, he would like to pass the device name (eth0 instead of usb0) and a MAC address via a platform_data pointer. What he got was an extended discussion making it clear that (1) the platform data mechanism is not universally loved, and (2) there is not a complete consensus on how this kind of problem should really be solved.

There are a couple of perceived problems with platform data; first of those is that it encodes the information about a specific hardware configuration in the kernel itself. That leads to a proliferation of board files in the kernel source - each of which is controlled by its own configuration option - and makes it hard to build kernels which can run on multiple boards. The platform_data pointer itself, being a void pointer, is seen as not being type-safe: there is no way for the compiler to ensure that every board file is passing the right type of pointer to every device driver. For these reasons, there is strong opposition to expanding the platform data mechanism.

What are the alternatives? One of those is to do everything in user space, using udev rules. This approach appeals to those who want to see no policy in kernel space, but it is hard to implement in this case; there is no information available to distinguish this wired-down network controller from the traditional USB variety. Some developers are also unconvinced that replacing in-kernel board files with fragile-looking (to them) user-space configuration files which must be pushed to distributors is the way toward a more robust solution. It is also argued that the device naming policy (usb0, in this case) is already in the kernel; the discussion is about the details of what that policy should be.

The other approach would be to use device trees, which are meant for just this type of application. A device tree would allow the passing of configuration-specific information into drivers without the need to put board-specific hacks into the drivers themselves. As more components show up in both consumer and deep embedded situations, this capability will only become more useful. For these reasons, Arnd Bergmann thought that this problem would be an ideal place to demonstrate the use of device trees:

Let's make this the first use case where a lot of people will want to have the device tree on ARM. The patch to the driver to check for a mac-address property is trivial, and we can probably come up with a decent way of parsing the device tree for USB devices, after all there is an existing spec for it.

The problem with the device tree approach is that its adoption, in general, is slow, especially in the ARM architecture which, arguably, has the most need of it. It does not seem like a solution for people who have a PandaBoard now and would like it to work; it is also not immediately applicable to all of those systems which are currently described by board files and platform data. While many people seem to see a transition to device trees as something which will happen eventually, few of them are holding their breath in anticipation of an immediate changeover.

So what is a PandaBoard owner to do? There are, it seems, a couple of short-term solutions which will fix this particular board without waiting for longer-term answers. One is a patch from Arnd which will cause USB-attached Ethernet devices to carry an ethN name unless they are known to be point-to-point connections. For the MAC address problem, Alan Cox has suggested a hack which would allow the board file to take control of the address assignment for a specific interface. Neither of these solutions addresses the real problem, but they will give some breathing room while the proper fix is debated.


Fighting fork bombs

By Jonathan Corbet
March 29, 2011
Unix-like systems tend to be well hardened against attacks from outside, but more vulnerable to attacks by local users. One of the softer spots in most systems has to do with "fork bombs" - processes which madly fork() until they run the system out of resources. These attacks are difficult to defend against and difficult to stop without a reboot; they can also, at times, be created inadvertently. If Hiroyuki Kamezawa has his way, fork bombs will be less of a problem in the future.

The problem with fork bombs is that they are moving targets; by the time a system administrator notices a rapidly-forking process, it may have created vast numbers of children and exited. Killing processes individually in a fork bomb situation is not really an option; even a program written especially for this task can be hard put to keep up with the stream of new processes. There is just no way to get a handle on the entire tree of offending processes from user space. So it is not surprising that the best response in this situation can be to hit the Big Red Button and start over. Even if, as in Kamezawa-san's case, hitting the button involves walking to another building where the afflicted system is housed.

Indeed, it can be hard to get a handle on this tree from kernel space as well. The process tree only exists, as such, as long as the parent processes remain alive; once a process exits, all of its children are reparented to the init process. That causes a flattening of the tree structure and makes it hard to identify all of the processes involved in the attack. So Kamezawa-san's patch starts with the addition of a new process tracking structure. It is organized as a simple tree reflecting the actual family structure of the processes on the system. It differs from existing data structures, though, in that this "history tree" persists even when some processes exit. That allows the kernel to view the entire tree of processes involved in a fork bomb even if those which launched the attack have long since gone away.
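A history tree of this kind might look something like the following user-space sketch. The structure and function names (hist_node, hist_fork(), hist_exit(), hist_subtree_size()) are invented for illustration and are not taken from the patch; the point is only that an exiting process is marked dead rather than unlinked, so its children keep their original place in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One node per process ever forked (until the history is pruned). */
struct hist_node {
    int pid;
    bool alive;
    struct hist_node *parent;
    struct hist_node *first_child;
    struct hist_node *next_sibling;
};

/* Record a fork: link the new node under its real parent (NULL for the root). */
static struct hist_node *hist_fork(struct hist_node *parent, int pid)
{
    struct hist_node *n = calloc(1, sizeof(*n));

    if (n == NULL)
        return NULL;
    n->pid = pid;
    n->alive = true;
    n->parent = parent;
    if (parent != NULL) {
        n->next_sibling = parent->first_child;
        parent->first_child = n;
    }
    return n;
}

/* Record an exit: the node stays in place and keeps its children,
 * unlike the live process tree, where they would be reparented to init. */
static void hist_exit(struct hist_node *n)
{
    assert(n != NULL);
    n->alive = false;
}

/* Count every node, dead or alive, in the subtree rooted at n (including n). */
static int hist_subtree_size(const struct hist_node *n)
{
    int count = 1;

    for (const struct hist_node *c = n->first_child; c != NULL; c = c->next_sibling)
        count += hist_subtree_size(c);
    return count;
}
```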

Keeping the entire history of all processes created over the lifetime of a Linux system would be a costly endeavor. Clearly, there comes a point where history needs to be discarded. Every so often (30 seconds by default), the kernel will try to determine whether there might possibly be a fork bomb attack in progress; if no signs of an attack are detected, any tracking history which has existed for more than 30 seconds will be deleted.

How does the kernel decide whether it might be under attack? The way fork bombs incapacitate a system is usually through memory exhaustion, so the code looks for signs of memory stress: in particular, it looks to see if there have been any memory allocation stalls or kswapd runs since the last check. It also looks at whether the total number of processes on the system has increased. If none of those checks shows any reason for concern, the older history data will be removed from the system. If, instead, memory allocations are getting harder to come by or the number of processes is growing, the tracking structure will be kept around.
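The check described above boils down to comparing a few counters against their values at the previous check. A minimal sketch follows, assuming a hypothetical mm_sample snapshot structure and a may_be_fork_bomb() helper; neither name comes from the patch, and the real code reads its counters from kernel statistics rather than from a structure like this one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the counters examined at each periodic check. */
struct mm_sample {
    unsigned long alloc_stalls;   /* memory allocation stalls so far */
    unsigned long kswapd_runs;    /* times kswapd has been run so far */
    unsigned long nr_processes;   /* processes currently on the system */
};

/* Returns true if the history should be kept because something looks
 * suspicious; false means older history may be discarded. */
static bool may_be_fork_bomb(const struct mm_sample *prev,
                             const struct mm_sample *now)
{
    assert(prev != NULL && now != NULL);

    if (now->alloc_stalls != prev->alloc_stalls)
        return true;              /* allocations stalled since the last check */
    if (now->kswapd_runs != prev->kswapd_runs)
        return true;              /* kswapd ran since the last check */
    if (now->nr_processes > prev->nr_processes)
        return true;              /* the process count is growing */
    return false;
}
```

Any single sign of trouble is enough to keep the history; only a completely quiet interval lets it be pruned.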

If a fork bomb runs the system out of memory, the kernel's first response will be to fire up the out-of-memory (OOM) killer. Given time, the OOM killer might manage to clean up the mess, but the fact of the matter is that the OOM killer is designed around finding the one process which is creating the problem and killing it. The OOM killer cannot identify a whole tree of rapidly-forking processes and do away with all of them.

Enter the fork bomb killer, which is invoked by the OOM killer. The fork bomb killer will perform a depth-first traversal of the process history tree, filling in each node with information on the total number of processes below that node and the total memory used by those processes. At the end, the process with the highest score is examined; if there are at least ten processes in the history below the high scorer, it is deemed to be a fork bomb; that process and all of its descendants will be killed. Problem solved - hopefully.
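The traversal can be sketched as follows. The article does not say how the score is calculated, so this example assumes, purely for illustration, that a node's score is the total memory used by it and everything below it, and that the root of the tree is a placeholder (such as init) which is never a candidate. The names proc_node, score_subtree(), and find_fork_bomb() are hypothetical; only the threshold of ten comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define FORKBOMB_MIN_DESCENDANTS 10   /* hardcoded threshold from the article */

struct proc_node {
    unsigned long mem_pages;          /* memory used by this process alone */
    struct proc_node *first_child;
    struct proc_node *next_sibling;
    /* Filled in by the traversal: */
    unsigned long nr_descendants;     /* processes below this node */
    unsigned long subtree_pages;      /* memory used by node plus descendants */
};

/* Depth-first pass: fill in the totals for every node and remember the
 * highest-scoring node other than the root. */
static void score_subtree(struct proc_node *n, const struct proc_node *root,
                          struct proc_node **best)
{
    n->nr_descendants = 0;
    n->subtree_pages = n->mem_pages;
    for (struct proc_node *c = n->first_child; c != NULL; c = c->next_sibling) {
        score_subtree(c, root, best);
        n->nr_descendants += 1 + c->nr_descendants;
        n->subtree_pages += c->subtree_pages;
    }
    if (n != root &&
        (*best == NULL || n->subtree_pages > (*best)->subtree_pages))
        *best = n;
}

/* Returns the node judged to be a fork bomb, or NULL if the top scorer
 * has fewer than ten processes below it. */
static struct proc_node *find_fork_bomb(struct proc_node *root)
{
    struct proc_node *best = NULL;

    assert(root != NULL);
    score_subtree(root, root, &best);
    if (best != NULL && best->nr_descendants >= FORKBOMB_MIN_DESCENDANTS)
        return best;
    return NULL;
}
```

A caller would then kill the returned node and all of its descendants. Under a different scoring rule a different node would win, but the shape of the traversal would stay the same.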

There are a couple of control knobs which have been placed under /sys/kernel/mm/oom. History tracking will only be performed if mm_tracking_enabled is set to "enabled" (which is the default setting). The value in mm_tracking_reset_interval_msecs controls how often the process tracking tree is cleaned up; the default value is 30,000 milliseconds. A possibly surprising omission is the lack of a knob controlling how many descendants a process must have before it is declared to be a fork bomb; the hardcoded value of ten seems low.

The reception for this patch has not been entirely favorable; commenters worry about the runtime cost of maintaining the tracking structure and suggest that user-space solutions may be better. Kamezawa-san seems resigned that the patch may not go in, saying "To go to other buildings to press reset-button is good for my health." Other administrators, who may not be within easy walking distance of their systems, may feel their health is better served by some extra fork bomb protection, though.


Patches and updates

  • Mimi Zohar: EVM (March 29, 2011)



Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds