Kernel development [LWN.net]

Kernel release status

The 2.6.38 kernel is out, released by Linus on March 14. "As to the "big picture", ie all the changes since 2.6.37, my personal favorite remains the VFS name lookup changes. They did end up causing some breakage, and Al has made it clear that he wants more cleanups, but on the whole I think it was surprisingly smooth." Other significant changes in 2.6.38 include transparent hugepage support, per-session group scheduling, a number of Btrfs improvements, and more. The always excellent KernelNewbies.org page has all the details.

Stable updates: the 2.6.37.4 and 2.6.32.33 updates were released on March 14. Both contain several important fixes.

Quotes of the week

I think this is the really fundamental issue: anybody who makes a hard error out of something that is recoverable is a total moron.

-- Linus Torvalds

Golden rule #12: When the comments do not match the code, they probably are both wrong.

-- Steven Rostedt

But if you are correct, then it worries me that your patch will be the first of a trickle growing to a stream to an avalanche of patches where people align and reorder structures so that the most commonly accessed fields are at the beginning of the cacheline, so that those can then be accessed minutely faster.

Aargh, and now I am setting off the avalanche with that remark. Please, someone, save us by discrediting George's argument.

-- Hugh Dickins

Schultz: Diving into the Linux Networking Stack, Part I

Michael Schultz has posted an introductory look at the Linux networking stack, focusing on driver initialization and packet reception. It's a "how it works" discussion, rather than a look at the actual code. "In general network drivers follow a fairly typical route in processing: the kernel boots up, initializes data structures, sets up some interrupt routines, and tells the network card where to put packets when they are received. When a packet is actually received, the card signals the kernel causing it to do some processing and then cleans up some resources. I'll talk about the fairly generic routines that network devices share in common and then move to a concrete example with the igb driver."

A group scheduling demonstration

By Jonathan Corbet
March 16, 2011

There has been much talk of the per-session group scheduling patch which is part of the 2.6.38 kernel, but it can be hard to see that code in action if one isn't doing a 20-process kernel build at the time. Recently, your editor inadvertently got a demonstration of group scheduling thanks to some unexpected results from a Rawhide system upgrade. The way the scheduler works was clearly shown in a way that could be captured at the time.

Rawhide users know that surprises often lurk behind the harmless-looking yum upgrade command. In this particular case, something in the upgrade (related to fonts, possibly) caused every graphical process in the system to decide that it was time to do some heavy processing. The result can be seen in this output from the top command:

The per-session heuristic had put most of the offending processes into a single control group, with the effect that they were mostly competing against each other for CPU time. These processes are, in the capture above, each currently getting 5.3% of the available CPU time. Two processes which were not in that control group were left essentially competing for the second core in the system; they each got 46%. The system had a load average of almost 22, and the desktop was entirely unresponsive. But it was possible to log into the system over the net and investigate the situation without really even noticing the load.

This isolation is one of the nicest features of group scheduling; even when a large number of processes go totally insane, their ability to ruin life for other tasks on the machine is limited. That, alone, justifies the cost of this feature.
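
The figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not a model of the scheduler; it assumes that top was reporting percentages per core (so that two cores add up to 200%), and the variable names are purely illustrative.

```python
CORES = 2
total = CORES * 100.0           # assumption: top percentages are per core
outside_group = 2 * 46.0        # the two processes outside the control group
per_grouped_process = 5.3       # share seen by each process inside the group

# Whatever the two outsiders did not use was split among the grouped processes.
grouped_share = total - outside_group
estimated_group_size = grouped_share / per_grouped_process

print(f"CPU left for the group: {grouped_share:.0f}%")
print(f"estimated processes in the group: {estimated_group_size:.1f}")
```

An estimate of roughly 20 grouped processes, plus the two outside the group, is in line with the reported load average of almost 22.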

2.6.39 merge window part 1

By Jonathan Corbet
March 16, 2011

Linus released the 2.6.38 kernel on March 14, and started merging patches for the 2.6.39 development cycle the following day. As of this writing, just over 1,000 patches have been merged into the mainline. Clearly the merging process has just begun for this cycle, but some interesting features have been added. User-visible changes merged so far include:

The open by handle system calls have been added. The final form of the API is:
```
    int name_to_handle_at(int dfd, const char *name, struct file_handle *handle,
                          int *mnt_id, int flag);
    int open_by_handle_at(int dirfd, struct file_handle *handle, int flags);
```
This functionality is intended for use by user-space file servers, which can more efficiently track files using file handles.
The open() system call has a new flag: O_PATH. A file opened with this flag will have had its path resolved by the kernel and is known to exist, but there is little else that can be done with it. System calls which operate on file descriptors directly (close() or dup(), for example) will work; these file descriptors can also be passed to another process over Unix-domain sockets using SCM_RIGHTS datagrams. O_PATH file descriptors exist to serve as the directory file descriptor in the various "*at()" system calls.
Tasks in the SCHED_IDLE class are now allowed to upgrade themselves into the SCHED_BATCH or SCHED_OTHER classes if their "nice" rlimit is adequate.
There is a new system call which allows the adjustment of POSIX clocks:
```
    int clock_adjtime(clockid_t which_clock, struct timex *time);
```
Time adjustments possible are the same as for adjtimex(), but specific POSIX clocks may not support all operations.
The CLOCK_BOOTTIME POSIX clock has been added.
The new Smack SMACK64MMAP attribute can be used to control when specific libraries can be mapped by running programs.
New hardware support includes:
- Systems and processors: Intel "SandyBridge" CPUs, CompuLab TrimSlice boards, and several variations of the Seaboard evaluation platform.
- Block: ARASAN CompactFlash PATA controllers.
- Miscellaneous: picoXcell IPSEC and Layer2 crypto engines.
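
The O_PATH behavior described in the list above can be tried from user space. What follows is a minimal sketch in Python, whose os module wraps the underlying system calls; it assumes a Linux kernel with O_PATH support and a Python build that exposes os.O_PATH. The temporary directory and the file name hello.txt are invented for the demonstration.

```python
import errno
import os
import tempfile

def o_path_demo():
    """Open a directory with O_PATH, then use it as the dirfd for openat()."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file, created only so there is something to open.
        with open(os.path.join(tmp, "hello.txt"), "w") as f:
            f.write("hi")

        dirfd = os.open(tmp, os.O_PATH)  # path is resolved; no I/O access
        try:
            # Reading through an O_PATH descriptor is refused with EBADF.
            try:
                os.read(dirfd, 1)
                results["read_errno"] = None
            except OSError as e:
                results["read_errno"] = e.errno

            # Descriptor-level calls such as dup() and close() still work.
            dup_fd = os.dup(dirfd)
            os.close(dup_fd)
            results["dup_ok"] = True

            # The descriptor serves as the directory argument of openat().
            fd = os.open("hello.txt", os.O_RDONLY, dir_fd=dirfd)
            try:
                results["data"] = os.read(fd, 16)
            finally:
                os.close(fd)
        finally:
            os.close(dirfd)
    return results

result = o_path_demo()
print(result)
```

The failed read is the point of the exercise: the descriptor names a location in the filesystem without granting access to the contents found there.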

Changes visible to kernel developers include:

There is a new interrupt flag (IRQF_FORCE_RESUME) which forces the interrupt to be re-enabled at resume time regardless of whether it was disabled during suspend.
The kernel can now force (almost) all interrupt handlers to be run in threads; this capability is controlled with the threadirqs command line option. This is a useful debugging feature, as a crashing interrupt handler will, when running in a thread, merely cause a kernel oops instead of bringing down the whole system. Interrupt handlers which should never be forced into threads can be marked with IRQF_NO_THREAD, but its use is expected to be rare.
The object debugging infrastructure now allows the specification of a "debug hint" function; it returns an address which can be used to better identify a specific object. See this commit for details.
The long-deprecated SPIN_LOCK_UNLOCKED and RW_LOCK_UNLOCKED lock initializers have been removed.
The perf events subsystem has a new monitoring mode wherein it only watches processes belonging to a specific control group. The new -G option to perf provides access to this functionality.
The directed yield feature has been added to the fair scheduler; this feature should improve performance for guests virtualized with KVM.
There is a new mechanism for the dynamic addition of POSIX clocks; see <linux/posix_clock.h> for the details of the interface.
The x86 architecture has gained minimal device tree support.
There is a new global workqueue called system_freezable_wq; it differs from the others in that it can be frozen at suspend time.
Core subsystems can make use of the new syscore_ops mechanism to register power management callbacks without the need to create otherwise useless system devices.

If the usual rules apply, the 2.6.39 merge window can be expected to close around March 29, and the 2.6.39 release should happen around the first week of June.

Uprobes: 11th time is the charm?

By Jonathan Corbet
March 16, 2011

Last week's Kernel Page included an article about improving the ptrace() interface; the author of that work, Tejun Heo, was quoted as saying that part of the problem with ptrace() is that it has been starved of developer attention in favor of efforts to replace it entirely. One of those efforts is uprobes, which has also been featured on this page a few times. A new uprobes patch was posted on March 14, so this seems like a good time to have a look at it and further deprive ptrace() of attention. Uprobes looks like it is getting closer to acceptance, but it seems unlikely that the 11th revision will be the last.

The purpose of the uprobes subsystem is what one might expect: to enable the placement of probes into user-space executable process memory. These probes might be used to support a debugger like gdb (though uprobes is said to be unsuitable for use by gdb in its current form) or to support user-space tracing. This feature does thus duplicate some of the functionality provided by ptrace(), which will make its acceptance harder, especially since ptrace() is (more or less) a standardized interface. To succeed, uprobes will clearly have to do things better than ptrace() does.

The ptrace() interface is tied to processes; uprobes, instead, works with files. A probe is placed at a certain offset within a specific file; it will then trigger for every process which executes through the probe's location. If the code placing the probe is only interested in specific processes, it will need to filter the events itself. The interface may seem a little strange - users will probably almost always be interested in specific processes - but there are some advantages to doing things this way.

Underneath the hood, uprobes works by faulting in the page which will contain the probe. The instruction at the probe location is copied aside and replaced by a breakpoint. Every process which has that file mapped then gets a pointer in its mm structure pointing to the data describing the probe(s) for that file. When a process executes the breakpoint, the probe's handler function will be called; on that handler's return, the kernel will single-step the displaced instruction, then return to the location following the probe.

This "execute out of line" (XOL) mechanism, in which the displaced instruction is single-stepped from its copy rather than from its original location, has been controversial in the past because it requires the injection of a new virtual memory area (VMA) into every process which encounters probes. That VMA is seen as a distortion of the process's behavior which could have strange effects. The alternatives, though, are not entirely appealing either. The ptrace() approach is to put the original instruction back into its original location, execute it, then restore the breakpoint; that only works if every process which has the file mapped is stopped for the duration of the operation (otherwise they might execute the affected code while the breakpoint is missing). Uprobes, instead, is able to handle breakpoint hits without perturbing other processes. Another alternative discussed in the past is emulating the displaced instruction in the kernel; that requires having a full x86 emulator in kernel space, which is hardly attractive. So the current plan appears to be to stick with XOL.
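
The difference between the two approaches may be easier to see in a toy model. The Python sketch below is not the kernel's implementation and borrows nothing from the patch; the class, the instruction names, and the process IDs are all hypothetical. It models a file's text as a list shared by every process, overwrites one slot with a breakpoint marker, and runs the saved copy out of line so that the shared text never has to change while a probe is being handled.

```python
BREAKPOINT = "int3"  # stand-in marker for a breakpoint instruction

class ProbedFile:
    """Toy model of a file's text, shared by every process that maps it."""

    def __init__(self, instructions):
        self.text = list(instructions)
        self.probes = {}  # offset -> (displaced instruction, handler)

    def insert_probe(self, offset, handler):
        # Copy the original instruction aside, then overwrite it in place.
        self.probes[offset] = (self.text[offset], handler)
        self.text[offset] = BREAKPOINT

    def run(self, pid):
        """Execute the whole text as process `pid`; return what it executed."""
        executed = []
        pc = 0
        while pc < len(self.text):
            insn = self.text[pc]
            if insn == BREAKPOINT:
                displaced, handler = self.probes[pc]
                handler(pid, pc)
                # Run the saved copy "out of line"; the shared text is not
                # modified, so the breakpoint stays armed for everyone else.
                insn = displaced
            executed.append(insn)
            pc += 1  # resume at the location following the probe
        return executed

hits = []
zsh = ProbedFile(["push", "mov", "call", "ret"])
zsh.insert_probe(2, lambda pid, offset: hits.append((pid, offset)))
trace_a = zsh.run(pid=100)
trace_b = zsh.run(pid=200)
print(hits)
```

Both simulated processes hit the probe, and both still execute the original instruction sequence. Any filtering by process would have to happen inside the handler, which matches the file-based design described earlier.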

Not having to stop the world when a breakpoint is hit is one of the advantages of uprobes, but there are others. It dispenses with the whole ptrace() mechanism involving signals, reparenting processes, and so on. Handling a probe hit does not require a context switch unless the probe itself does; many types of tracing tasks, for example, would never have to switch to another process. Uprobes also allows multiple applications to be tracing the same set of processes at the same time. All of these make the interface appealing to some users.

Who those users are is not clear to everybody, though. There is clearly some interest in the SystemTap camp, but the needs of SystemTap do not necessarily carry a lot of weight on linux-kernel. Thomas Gleixner put it this way:

And it does not matter at all whether systemtap can use this or not. If the main debuggers used like gdb are not going to use it then it's a complete waste. We don't need another debugging interface just for a single esoteric use case.

At times, gdb developers have indicated that they might be open to using a Linux-specific interface if there were advantages to doing so. Such use seems distant at the moment, though. More immediate users are likely to be found in the tracing community; uprobes opens the possibility of getting a single stream of trace data covering both user and kernel space. ptrace() is not a useful interface for tracing, so something needs to be done (though there is still some disagreement over whether user-space tracing needs to involve the kernel at all). Uprobes might be that something.

In fact, this version of the uprobes patch includes an ftrace-based interface. Part 20 of the patch contains the entirety of the documentation for this feature, quoted below:

    # cd /sys/kernel/debug/tracing/
    # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
    00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
    # objdump -T /bin/zsh | grep -w zfree
    0000000000446420 g    DF .text  0000000000000012  Base        zfree
    # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
    # cat uprobe_events
    p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
    # echo 1 > events/uprobes/enable
    # sleep 20
    # echo 0 > events/uprobes/enable
    # cat trace

An actual document is listed as a "TODO" item. The current interface looks a bit painful to use, and it appears to be limited to printing register contents for now. A more flexible and better documented interface could prove useful, though, especially if (as planned) it also can be made to work with the perf events subsystem.
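
The quoted session does not say where the 0x46420 in the probe definition comes from; it appears to be the address of zfree reported by objdump, translated into an offset within the file using the mapping shown in /proc. A small Python sketch of that calculation follows. The helper names are hypothetical, and the approach assumes a binary whose symbol addresses match its run-time addresses, as seems to be the case for the /bin/zsh in the example.

```python
def parse_maps_line(line):
    """Return (start, end, file_offset, path) from one /proc/<pid>/maps line."""
    fields = line.split()
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5] if len(fields) > 5 else ""
    return start, end, int(fields[2], 16), path

def probe_offset(maps_line, symbol_vaddr):
    """Translate a virtual address inside the mapping to an offset in the file."""
    start, end, file_offset, _path = parse_maps_line(maps_line)
    if not start <= symbol_vaddr < end:
        raise ValueError("address is outside this mapping")
    return symbol_vaddr - start + file_offset

# Values copied from the session quoted above.
maps_line = "00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh"
zfree_vaddr = 0x0000000000446420
offset = probe_offset(maps_line, zfree_vaddr)
print(f"p /bin/zsh:{offset:#x} %ip %ax")
```

The printed line matches the string written to uprobe_events in the quoted documentation.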

The comments on the patch set indicate some concern about whether the kernel needs the feature or not. But even the more critical reviewers have been going over the code pointing out small things - the kind of review one does when one wants to help the author get the code into shape for merging. This code will not be merged for 2.6.39, and, for this type of code, making predictions for merging at any definite time is a hazardous affair. But, given sufficient will, it seems like uprobes could be made ready for inclusion sometime this year.

APIs for sensors

By Jonathan Corbet
March 16, 2011

Environmental sensors were, once upon a time, equipment which was only found in specialized settings like industrial process control or scientific research. They were expensive and tuned to a specific task. Increasingly, though, sensors are being attached to all kinds of devices. Mobile handsets have compasses, accelerometers, and more. Sensors for temperature, pressure, etc. are becoming increasingly common as well. The implications are fun; any Linux machine can be a versatile data collection device.

The only problem with all of this is that the Linux kernel does not yet have an established API - either internal or to user space - for sensors. There are interfaces for specific types of sensors; Video4Linux2 handles cameras, for example, and the hwmon subsystem deals with the specific class of sensors aimed at monitoring the health of the computer itself. In these areas, the interfaces are well established and interoperation is possible. For sensors which fall outside of these classes, though, there are no real rules. The outcome of this kind of situation is always the same: new devices are added with inconsistent interfaces, making life hard for application developers.

This situation came to light (again) with the recent submission of a pressure sensor driver which was implemented as a misc device. It used the input subsystem to present its interface; Jonathan Cameron, who has been working on sensor interfaces, pointed out that the patch would not be accepted in that form. Input devices are meant for human input; since most humans do not communicate with their systems via large ambient pressure changes, this device did not fit. So the driver needs another home. The hwmon subsystem was suggested, but the pressure sensor is not really a hardware monitor, so the driver is not welcome there either. Arnd Bergmann also does not like the use of the misc interface:

I generally try to prevent people from adding more ad-hoc interfaces to drivers/misc. Anything that is called a drivers/misc driver to me must qualify as "there can't possibly be a second driver with the same semantics", otherwise it should be part of another subsystem with clear rules, or be put into its own file system.

That leaves the industrial I/O (IIO) subsystem, which is meant "for devices that in some sense are analog to digital converters." IIO tries to handle a wide variety of sensors in some sort of standard way with support for events, higher bandwidth I/O, and more. There are quite a few drivers in the IIO subsystem now; the only problem is that the whole thing lives in the staging tree and the associated "TODO" list is reasonably long. The devices which are represented there now are not all consistent in their interface use - and the form of the desired interface is not at all clear.

Still, putting together such an interface is Jonathan's goal:

To my mind, there will one day be a suitable 'sensors' subsystem so an important side point is to try and minimise interface changes needed to move to that (IIO or something better). Sysfs is easy to fix, so lets at least work on shared interfaces in there. Hwmon is a mature and reasonable starting point; it's where we got a lot of IIO's similar interfaces from. The trick is convincing people to consider generality and it's a hard trick to pull off.

He adds that the interface and support for simple devices (those with slow data rates and hwmon-style sysfs interfaces) is in reasonably good shape. The question is how to get the rest of the job done.
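
For readers who have not used one, a hwmon-style sysfs interface exposes one value per file as plain text. The Python sketch below shows how little code a consumer of such an interface needs. It relies on the hwmon convention that temperatures are reported in millidegrees Celsius; the helper names are hypothetical, and a temporary file stands in for a real attribute so that the example runs without sensor hardware.

```python
import os
import tempfile

def read_sysfs_value(path):
    """Read a one-value-per-file sysfs attribute and return it as an int."""
    with open(path) as f:
        return int(f.read().strip())

def millidegrees_to_celsius(raw):
    """hwmon temperature attributes report millidegrees Celsius."""
    return raw / 1000.0

# Stand-in for an attribute such as /sys/class/hwmon/hwmon0/temp1_input;
# the value written here is made up for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix="_input", delete=False) as fake:
    fake.write("42500\n")

reading = millidegrees_to_celsius(read_sysfs_value(fake.name))
os.unlink(fake.name)
print(f"{reading} degrees C")
```

An application written against an agreed set of attribute names and units could treat any sensor of a given type this way, which is the kind of generality the discussion is after.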

One alternative would be to define an essentially new IIO core which would be merged into the mainline. Individual drivers could then be worked into shape and moved over once they are ready. The problem is that this could be a long process, and that the mainline versions of the drivers might not initially have all of the functionality of their black-sheep staging cousins. That would mean more maintenance work keeping both versions of the driver working for some time.

Still, that's the approach that Arnd recommends. The move to the mainline is the last good chance to define an interface which will then need to be supported for many years. So some pain now, if put to good use, may be warranted in order to make life easier in the future. Getting driver developers to buy into this idea may not be entirely easy; most of them spend the bulk of their time doing something other than writing Linux driver code and may lack the desire to move to a new interface when what they have now works. But that's almost certainly the best way forward. Now would be a good time for people with an interest in this area to help in the development of the mainline version of the IIO interface.

Patches and updates

- Linus Torvalds Linux 2.6.38
- Greg KH Linux 2.6.37.4
- Greg KH Linux 2.6.32.33
- Grant Likely Refactor and enhance device tree platform registrations
- Srikar Dronamraju 0: Inode based uprobes
- Kirill A. Shutemov Introduce timer slack controller
- Peter Zijlstra Rewrite sched_domain/sched_group creation
- Lai Jiangshan rcu: introduce kfree_rcu()
- Christopher Yeoh Cross Memory Attach v3 [PATCH]
- KAMEZAWA Hiroyuki fork bomb killer
- Kirill A. Shutemov Coccinelle: introduce list_move.cocci
- Magnus Damm virtio: Virtio platform driver
- Waldemar Rymarkiewicz NFC: Driver for Inside Secure MicroRead NFC chip
- MyungJoo Ham MAX8997/8966 MFD (including PMIC&RTC) Initial Release
- Bill Gatliff Implement a generic PWM framework
- Rafael J. Wysocki Allow subsystems to avoid using sysdevs for defining "core" PM callbacks
- Andy Green PLATFORM: Support for async platform_data
- Po-Yu Chuang net: add Faraday FTGMAC100 Gigabit Ethernet driver
- Dan Williams isci: core
- mems applications Add STMicroelectronics LPS001WP pressure sensor device driver into misc
- Chris Wilson ACPI/Intel: Rework Opregion support
- Mike Waychison google firmware support
- Ian Campbell xen network backend driver
- Kim, Heungjun Add support for M-5MOLS 8 Mega Pixel camera
- Huang Shijie add the GPMI controller driver for IMX23/IMX28
- Sage Weil introduce sys_syncfs to sync a single file system
- Arne Jansen btrfs: scrub
- Li Zefan Btrfs: New inode number allocator
- Greg Thelen memcg: per cgroup dirty page accounting
- Andrea Arcangeli thp: mremap support and TLB optimization
- Stephen Wilson enable writing to /proc/pid/mem
- Eric Paris [PATCH -v2] capabilites: allow the application of capability limits to usermode helpers
- George Spelvin mm/slub: Add SLUB_RANDOMIZE support
- Kees Cook security: Yama LSM
- Thomas Renninger cpupowerutils - cpufrequtils extended with quite some features
- Douglas Gilbert lsscsi 0.25 beta 1, adds --size
