Kernel development
Brief items
Kernel release status
The 2.6.38 kernel is out, released by Linus on March 14. "As to the "big picture", ie all the changes since 2.6.37, my personal favorite remains the VFS name lookup changes. They did end up causing some breakage, and Al has made it clear that he wants more cleanups, but on the whole I think it was surprisingly smooth." Other significant changes in 2.6.38 include transparent hugepage support, per-session group scheduling, a number of Btrfs improvements, and more. The always excellent KernelNewbies.org page has all the details.
Stable updates: the 2.6.37.4 and 2.6.32.33 updates were released on March 14. Both contain several important fixes.
Quotes of the week
Aargh, and now I am setting off the avalanche with that remark. Please, someone, save us by discrediting George's argument.
Schultz: Diving into the Linux Networking Stack, Part I
Michael Schultz has posted an introductory look at the Linux networking stack, focusing on driver initialization and packet reception. It's a "how it works" discussion, rather than a look at the actual code. "In general network drivers follow a fairly typical route in processing: the kernel boots up, initializes data structures, sets up some interrupt routines, and tells the network card where to put packets when they are received. When a packet is actually received, the card signals the kernel causing it to do some processing and then cleans up some resources. I'll talk about the fairly generic routines that network devices share in common and then move to a concrete example with the igb driver."
A group scheduling demonstration
There has been much talk of the per-session group scheduling patch which is part of the 2.6.38 kernel, but it can be hard to see that code in action if one isn't doing a 20-process kernel build at the time. Recently, your editor inadvertently got a demonstration of group scheduling thanks to some unexpected results from a Rawhide system upgrade. The way the scheduler works was clearly shown in a way that could be captured at the time.
Rawhide users know that surprises often lurk behind the harmless-looking yum upgrade command. In this particular case, something in the upgrade (related to fonts, possibly) caused every graphical process in the system to decide that it was time to do some heavy processing. The result can be seen in this output from the top command:
The per-session heuristic had put most of the offending processes into a single control group, with the effect that they were mostly competing against each other for CPU time. These processes are, in the capture above, each currently getting 5.3% of the available CPU time. Two processes which were not in that control group were left essentially competing for the second core in the system; they each got 46%. The system had a load average of almost 22, and the desktop was entirely unresponsive. But it was possible to log into the system over the net and investigate the situation without really even noticing the load.
This isolation is one of the nicest features of group scheduling; even when a large number of processes go totally insane, their ability to ruin life for other tasks on the machine is limited. That, alone, justifies the cost of this feature.
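The percentages quoted above can be checked with a little arithmetic. What follows is a toy model, not the scheduler's actual weight-based algorithm: it assumes that a group's allocation is split evenly among its runnable tasks. The function name is hypothetical, and the task count of 19 used in the note below is an assumption chosen to match the 5.3% figure; the article does not state it.

```c
#include <assert.h>

/*
 * Toy model only: the CPU share of each task, as a percentage of one
 * core, when a group's allocation is divided evenly among its runnable
 * tasks. The real fair scheduler is weight-based and considerably more
 * subtle than this.
 */
double per_task_share(double group_percent, int runnable_tasks)
{
    assert(runnable_tasks > 0);
    return group_percent / runnable_tasks;
}
```

Under these assumptions, per_task_share(100.0, 19) comes to roughly 5.26, which rounds to the 5.3% seen for each process in the runaway group, while per_task_share(92.0, 2) gives the 46% obtained by each of the two processes sharing most of the other core.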
Kernel development news
2.6.39 merge window part 1
Linus released the 2.6.38 kernel on March 14, and started merging patches for the 2.6.39 development cycle the following day. As of this writing, just over 1,000 patches have been merged into the mainline. Clearly the merging process has just begun for this cycle, but some interesting features have been added. User-visible changes merged so far include:
- The open by handle system calls have been added. The final form of the API is:
int name_to_handle_at(int dfd, const char *name, struct file_handle *handle, int *mnt_id, int flag);
int open_by_handle_at(int dirfd, struct file_handle *handle, int flags);
This functionality is intended for use by user-space file servers, which can more efficiently track files using file handles.
- The open() system call has a new flag: O_PATH. A file opened with this flag will have had its path resolved by the kernel and is known to exist, but there is little else that can be done with it. System calls which operate on file descriptors directly (close() or dup(), for example) will work; these file descriptors can also be passed to another process over Unix-domain sockets using SCM_RIGHTS datagrams. The reason for the existence of O_PATH file descriptors is for use as the directory file descriptor in the various "*at()" system calls.
- Tasks in the SCHED_IDLE class are now allowed to upgrade
themselves into the SCHED_BATCH or SCHED_OTHER
classes if their "nice" rlimit is adequate.
- There is a new system call which allows the adjustment of POSIX
clocks:
int clock_adjtime(clockid_t which_clock, struct timex *time);
Time adjustments possible are the same as for adjtimex(), but specific POSIX clocks may not support all operations.
- The CLOCK_BOOTTIME POSIX clock has
been added.
- The new Smack SMACK64MMAP attribute can be used to control when
specific libraries can be mapped by running programs.
- New hardware support includes:
- Systems and processors: Intel "SandyBridge" CPUs,
CompuLab TrimSlice boards,
and several variations of the Seaboard evaluation platform.
- Block: ARASAN CompactFlash PATA controllers.
- Miscellaneous: picoXcell IPSEC and Layer2 crypto engines.
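As an illustration of the O_PATH item in the list above, here is a minimal user-space sketch, assuming a 2.6.39 or later kernel. The helper names stat_relative() and read_errno_on_path_fd() are hypothetical. The first helper uses an O_PATH descriptor as the directory argument to fstatat(); the second shows that such a descriptor cannot be read from, with read() expected to fail with EBADF. The fallback definition of O_PATH is the asm-generic value and is only there for headers that do not expose the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* O_PATH is a Linux-specific extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000        /* asm-generic value; fallback only */
#endif

/*
 * Open dir_path with O_PATH and use the resulting descriptor as the
 * directory argument to fstatat(). Returns 0 and fills in *st on
 * success, -1 (with errno set) on failure.
 */
int stat_relative(const char *dir_path, const char *name, struct stat *st)
{
    int dfd = open(dir_path, O_PATH | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int ret = fstatat(dfd, name, st, 0);
    int saved_errno = errno;    /* close() may overwrite errno */
    close(dfd);
    errno = saved_errno;
    return ret;
}

/*
 * Try to read from an O_PATH descriptor. Returns the errno value set
 * by read() (EBADF is expected), 0 if the read unexpectedly succeeded,
 * or -1 if the open itself failed.
 */
int read_errno_on_path_fd(const char *path)
{
    char byte;
    int fd = open(path, O_PATH);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, &byte, 1);
    int err = (n < 0) ? errno : 0;
    close(fd);
    return err;
}
```

A caller might use stat_relative("/", ".", &st) to look up an entry relative to the root directory without ever opening that directory for reading.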
Changes visible to kernel developers include:
- There is a new interrupt flag (IRQF_FORCE_RESUME) which
forces the interrupt to be re-enabled at resume time regardless of
whether it was disabled during suspend.
- The kernel can now force (almost) all interrupt handlers to be run in
threads; this capability is controlled with the threadirqs
command line option. This is a useful debugging feature, as a
crashing interrupt handler will, when running in a thread, merely
cause a kernel oops instead of bringing down the whole system.
Interrupt handlers which should never be forced into threads can be
marked with IRQF_NO_THREAD, but its use is expected to be rare.
- The object debugging infrastructure
now allows the specification of a "debug hint" function; it returns an
address which can be used to better identify a specific object. See
this
commit for details.
- The long-deprecated SPIN_LOCK_UNLOCKED and
RW_LOCK_UNLOCKED lock initializers have been removed.
- The perf events subsystem has a new monitoring mode wherein it only
watches processes belonging to a specific control group. The new
-G option to perf provides access to this
functionality.
- The directed yield feature has been
added to the fair scheduler; this feature should improve performance
for guests virtualized with KVM.
- There is a new mechanism for the dynamic addition of POSIX clocks; see
<linux/posix_clock.h> for the details of the interface.
- The x86 architecture has gained minimal device tree support.
- There is a new global workqueue called system_freezable_wq;
it differs from the others in that it can be frozen at suspend time.
- Core subsystems can make use of the new syscore_ops mechanism to register power management callbacks without the need to create otherwise useless system devices.
If the usual rules apply, the 2.6.39 merge window can be expected to close around March 29, and the 2.6.39 release should happen around the first week of June.
Uprobes: 11th time is the charm?
Last week's Kernel Page included an article about improving the ptrace() interface; the author of that work, Tejun Heo, was quoted as saying that part of the problem with ptrace() is that it has been starved of developer attention in favor of efforts to replace it entirely. One of those efforts is uprobes, which has also been featured on this page a few times. A new uprobes patch was posted on March 14; so this seems like a good time to have a look at it and further deprive ptrace() of attention. Uprobes looks like it is getting closer to acceptance, but it seems unlikely that the 11th revision will be the last.
The purpose of the uprobes subsystem is what one might expect: to enable the placement of probes into user-space executable process memory. These probes might be used to support a debugger like gdb (though uprobes is said to be unsuitable for use by gdb in its current form) or to support user-space tracing. This feature does thus duplicate some of the functionality provided by ptrace(), which will make its acceptance harder, especially since ptrace() is (more or less) a standardized interface. To succeed, uprobes will clearly have to do things better than ptrace() does.
The ptrace() interface is tied to processes; uprobes, instead, works with files. A probe is placed at a certain offset within a specific file; it will then trigger for every process which executes through the probe's location. If the code placing the probe is only interested in specific processes, it will need to filter the events itself. The interface may seem a little strange - users will probably almost always be interested in specific processes - but there are some advantages to doing things this way.
Underneath the hood, uprobes works by faulting in the page which will contain the probe. The instruction at the probe location is copied aside and replaced by a breakpoint. Every process which has that file mapped then gets a pointer in its mm structure pointing to the data describing the probe(s) for that file. When a process executes the breakpoint, the probe's handler function will be called; on that handler's return, the kernel will single-step the copy of the displaced instruction, outside of its original location, then return to the location following the probe.
This "execute out of line" (XOL) mechanism has been controversial in the past because it requires the injection of a new virtual memory area (VMA) into every process which encounters probes. That VMA is seen as a distortion of the process's behavior which could have strange effects. The alternatives, though, are not entirely appealing either. The ptrace() approach is to put the original instruction back into its original location, execute it, then replace the breakpoint; that only works if every process which has the file mapped is stopped for the duration of the operation (otherwise they might execute the affected code while the breakpoint is missing). Uprobes, instead, is able to handle breakpoint hits without perturbing other processes. Another alternative discussed in the past is emulating the displaced instruction in the kernel; that requires having a full x86 emulator in kernel space, which has little appeal of its own. So the current plan appears to be to stick with XOL.
Not having to stop the world when a breakpoint is hit is one of the advantages of uprobes, but there are others. It dispenses with the whole ptrace() mechanism involving signals, reparenting processes, and so on. Handling a probe hit does not require a context switch unless the probe itself does; many types of tracing tasks, for example, would never have to switch to another process. Uprobes also allows multiple applications to be tracing the same set of processes at the same time. All of these make the interface appealing to some users.
Who those users are is not clear to everybody, though. There is clearly some interest in the SystemTap camp, but the needs of SystemTap do not necessarily carry a lot of weight on linux-kernel. Thomas Gleixner put it this way:
At times, gdb developers have indicated that they might be open to using a Linux-specific interface if there were advantages to doing so. Such use seems distant at the moment, though. More immediate users are likely to be found in the tracing community; uprobes opens the possibility of getting a single stream of trace data covering both user and kernel space. ptrace() is not a useful interface for tracing, so something needs to be done (though there is still some disagreement over whether user-space tracing needs to involve the kernel at all). Uprobes might be that something.
In fact, this version of the uprobes patch includes an ftrace-based interface. Part 20 of the patch contains the entirety of the documentation for this feature, quoted below:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
An actual document is listed as a "TODO" item. The current interface looks a bit painful to use, and it appears to be limited to printing register contents for now. A more flexible and better documented interface could prove useful, though, especially if (as planned) it also can be made to work with the perf events subsystem.
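The quoted session does not explain where the 0x46420 in the probe specification comes from. It appears to be the address of zfree (0x446420, from objdump) translated into an offset within /bin/zsh, using the start address (0x00400000) and file offset (0) of the executable mapping shown in the maps output. A minimal sketch of that translation follows; the function name is hypothetical, and the calculation assumes that the address reported by objdump is the address at which the symbol is actually mapped, as it is for this non-relocated binary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate a symbol's virtual address into an offset within the file,
 * given the start address and file offset of the executable mapping
 * which contains it (both taken from /proc/<pid>/maps).
 */
uint64_t probe_file_offset(uint64_t sym_vaddr, uint64_t map_start,
                           uint64_t map_file_offset)
{
    assert(sym_vaddr >= map_start);
    return sym_vaddr - map_start + map_file_offset;
}
```

With the values from the session, probe_file_offset(0x446420, 0x400000, 0) yields 0x46420, the offset written to uprobe_events.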
The comments on the patch set indicate some concern about whether the kernel needs the feature or not. But even the more critical reviewers have been going over the code pointing out small things - the kind of review one does when one wants to help the author get the code into shape for merging. This code will not be merged for 2.6.39, and, for this type of code, making predictions for merging at any definite time is a hazardous affair. But, given sufficient will, it seems like uprobes could be made ready for inclusion sometime this year.
APIs for sensors
Environmental sensors were, once upon a time, equipment which was found only in specialized settings like industrial process control or scientific research. They were expensive and tuned to a specific task. Increasingly, though, sensors are being attached to all kinds of devices. Mobile handsets have compasses, accelerometers, and more. Sensors for temperature, pressure, etc. are becoming increasingly common as well. The implications are fun; any Linux machine can be a versatile data collection device.
The only problem with all of this is that the Linux kernel does not yet have an established API - either internal or to user space - for sensors. There are interfaces for specific types of sensors; Video4Linux2 handles cameras, for example, and the hwmon subsystem deals with the specific class of sensors aimed at monitoring the health of the computer itself. In these areas, the interfaces are well established and interoperation is possible. For sensors which fall outside of these classes, though, there are no real rules. The outcome of this kind of situation is always the same: new devices are added with inconsistent interfaces, making life hard for application developers.
This situation came to light (again) with the recent submission of a pressure sensor driver which was implemented as a misc device. It used the input subsystem to present its interface; Jonathan Cameron, who has been working on sensor interfaces, pointed out that the patch would not be accepted in that form. Input devices are meant for human input; since most humans do not communicate with their systems via large ambient pressure changes, this device did not fit. So the driver needs another home. The hwmon subsystem was suggested, but the pressure sensor is not really a hardware monitor, so the driver is not welcome there either. Arnd Bergmann also does not like the use of the misc interface:
That leaves the industrial I/O (IIO) subsystem, which is meant "for devices that in some sense are analog to digital converters." IIO tries to handle a wide variety of sensors in some sort of standard way with support for events, higher bandwidth I/O, and more. There are quite a few drivers in the IIO subsystem now; the only problem is that the whole thing lives in the staging tree and the associated "TODO" list is reasonably long. The devices which are represented there now are not all consistent in their interface use - and the form of the desired interface is not at all clear.
Still, putting together such an interface is Jonathan's goal:
He adds that the interface and support for simple devices (those with slow data rates and hwmon-style sysfs interfaces) is in reasonably good shape. The question is how to get the rest of the job done.
One alternative would be to define an essentially new IIO core which would be merged into the mainline. Individual drivers could then be worked into shape and moved over once they are ready. The problem is that this could be a long process, and that the mainline versions of the drivers might not initially have all of the functionality of their black-sheep staging cousins. That would mean more maintenance work keeping both versions of the driver working for some time.
Still, that's the approach that Arnd recommends. The move to the mainline is the last good chance to define an interface which will then need to be supported for many years. So some pain now, if used properly, may be warranted in order to make life easier in the future. Getting driver developers to buy into this idea may not be entirely easy; most of them spend the bulk of their time doing something other than writing Linux driver code and may lack the desire to move to a new interface when what they have now works. But that's almost certainly the best way forward, and now is a good time for people with an interest in this area to help in the development of the mainline version of the IIO interface.
Page editor: Jonathan Corbet