The 2.6.38 kernel is out, released
by Linus on March 14. "As to the "big picture", ie all the
changes since 2.6.37, my personal favorite remains the VFS name lookup
changes. They did end up causing some breakage, and Al has made it clear
that he wants more cleanups, but on the whole I think it was surprisingly
" Other significant changes in 2.6.38 include transparent hugepage support,
per-session group scheduling, a number of
Btrfs improvements, and more. The always excellent KernelNewbies.org page
has all the details.
Stable updates: the 126.96.36.199 and 188.8.131.52 updates were released on
March 14. Both contain several important fixes.
I think this is the really fundamental issue: anybody who makes a
hard error out of something that is recoverable is a total moron.
-- Linus Torvalds
Golden rule #12: When the comments do not match the code, they probably
are both wrong.
-- Steven Rostedt
But if you are correct, then it worries me that your patch will be
the first of a trickle growing to a stream to an avalanche of
patches where people align and reorder structures so that the most
commonly accessed fields are at the beginning of the cacheline, so
that those can then be accessed minutely faster.
Aargh, and now I am setting off the avalanche with that remark.
Please, someone, save us by discrediting George's argument.
-- Hugh Dickins
Michael Schultz has posted an introductory look at the Linux networking
stack, focusing on driver
initialization and packet reception. It's a "how it works" discussion,
rather than a look at the actual code. "In general network drivers
follow a fairly typical route in processing: the kernel boots up,
initializes data structures, sets up some interrupt routines, and tells the
network card where to put packets when they are received. When a packet is
actually received, the card signals the kernel causing it to do some
processing and then cleans up some resources. I'll talk about the fairly
generic routines that network devices share in common and then move to a
concrete example with the igb driver."
There has been much talk of the per-session group scheduling patch which is
part of the 2.6.38 kernel, but it can be hard to see that code in action
if one isn't doing a 20-process kernel build at the time. Recently, your editor
inadvertently got a demonstration of group scheduling thanks to
some unexpected results from a Rawhide system upgrade. The way the
scheduler works was clearly shown in a way that could be captured at the
time.
Rawhide users know that surprises often lurk behind the harmless-looking
yum upgrade command. In this particular case, something in
the upgrade (related to fonts, possibly) caused every graphical process in
the system to decide that it was time to do some heavy processing. The
result can be seen in this output from the top command:
The per-session heuristic had put most of the offending processes into a
single control group, with the effect that they were mostly competing
against each other for CPU time. These processes are, in the capture
above, each currently getting 5.3% of the available CPU time. Two processes
which were not in that control group were left essentially competing for
the second core in the system; they each got 46%. The system had a load
average of almost 22, and the desktop was entirely unresponsive. But it
was possible to log into the system over the net and investigate the situation
without really even noticing the load.
This isolation is one of the nicest features of group scheduling; even when
a large number of processes go totally insane, their ability to ruin life
for other tasks on the machine is limited. That, alone, justifies the cost
of this feature.
Kernel development news
Linus released the 2.6.38 kernel on March 14, and started merging patches
for the 2.6.39 development cycle the following day. As of this writing,
just over 1,000 patches have been merged into the mainline. Clearly the
merging process has just begun for this cycle, but some interesting
features have been added. User-visible
changes merged so far include:
- The open by handle system calls have
been added. The final form of the API is:
int name_to_handle_at(int dfd, const char *name, struct file_handle *handle,
int *mnt_id, int flag);
int open_by_handle_at(int dirfd, struct file_handle *handle, int flags);
This functionality is intended for use by user-space file servers,
which can more efficiently track files using file handles.
- The open() system call has a new flag: O_PATH. A
file opened with this flag will have had its path resolved by the
kernel, so it is known to exist, but there is little else that can be done
with it. System calls which operate on file descriptors directly
(close() or dup(), for example) will work; these
file descriptors can also be passed to another process over
Unix-domain sockets using SCM_RIGHTS datagrams. O_PATH file descriptors
exist to serve as the directory file
descriptor in the various "*at()" system calls.
- Tasks in the SCHED_IDLE class are now allowed to upgrade
themselves into the SCHED_BATCH or SCHED_OTHER
classes if their "nice" rlimit is adequate.
- There is a new system call which allows the adjustment of POSIX
clocks:
int clock_adjtime(clockid_t which_clock, struct timex *time);
Time adjustments possible are the same as for adjtimex(), but
specific POSIX clocks may not support all operations.
- The CLOCK_BOOTTIME POSIX clock has been added.
- The new Smack SMACK64MMAP attribute can be used to control when
specific libraries can be mapped by running programs.
- New hardware support includes:
- Systems and processors: Intel "SandyBridge" CPUs,
CompuLab TrimSlice boards,
and several variations of the Seaboard evaluation platform.
- Block: ARASAN CompactFlash PATA controllers.
- Miscellaneous: picoXcell IPSEC and Layer2 crypto engines.
Changes visible to kernel developers include:
- There is a new interrupt flag (IRQF_FORCE_RESUME) which
forces the interrupt to be re-enabled at resume time regardless of
whether it was disabled during suspend.
- The kernel can now force (almost) all interrupt handlers to be run in
threads; this capability is controlled with the threadirqs
command line option. This is a useful debugging feature, as a
crashing interrupt handler will, when running in a thread, merely
cause a kernel oops instead of bringing down the whole system.
Interrupt handlers which should never be forced into threads can be
marked with IRQF_NO_THREAD, but its use is expected to be rare.
- The object debugging infrastructure
now allows the specification of a "debug hint" function; it returns an
address which can be used to better identify a specific object. See
the corresponding commit for details.
- The long-deprecated SPIN_LOCK_UNLOCKED and
RW_LOCK_UNLOCKED lock initializers have been removed.
- The perf events subsystem has a new monitoring mode wherein it only
watches processes belonging to a specific control group. The new
-G option to perf provides access to this mode.
- The directed yield feature has been
added to the fair scheduler; this feature should improve performance
for guests virtualized with KVM.
- There is a new mechanism for the dynamic addition of POSIX clocks; see
<linux/posix_clock.h> for the details of the interface.
- The x86 architecture has gained minimal device tree support.
- There is a new global workqueue called system_freezable_wq;
it differs from the others in that it can be frozen at suspend time.
- Core subsystems can make use of the new syscore_ops
mechanism to register power management callbacks without the need to
create otherwise useless system devices.
If the usual rules apply, the 2.6.39 merge window can be expected to close
around March 29, and the 2.6.39 release should happen around the first
week of June.
Last week's Kernel Page included an article about improving the ptrace()
interface; the author of that
work, Tejun Heo, was quoted as saying that part of the problem with
ptrace() is that it has been starved of developer attention in
favor of efforts to replace it entirely. One of those efforts is uprobes,
which has also been featured on this page a few times. A new uprobes patch
was posted on March 14, so this seems like a good time to have a look at it and
further deprive ptrace() of attention. Uprobes looks like it is
getting closer to acceptance, but it seems unlikely that the 11th revision
will be the last.
The purpose of the uprobes subsystem is what one might expect: to enable
the placement of probes into user-space executable process memory. These
probes might be used to support a debugger like gdb (though uprobes is said to be unsuitable for use by gdb in its
current form) or to support user-space tracing. This feature does thus
duplicate some of the functionality provided by ptrace(), which
will make its acceptance harder, especially since ptrace() is
(more or less) a standardized interface. To succeed, uprobes will clearly
have to do things better than ptrace() does.
The ptrace() interface is tied to processes; uprobes, instead,
works with files. A probe is placed at a certain offset within a specific
file; it will then trigger for every process which executes through the
probe's location. If the code placing the probe is only interested in
specific processes, it will need to filter the events itself. The
interface may seem a little strange - users will probably almost always be
interested in specific processes - but there are some advantages to doing
things this way.
Underneath the hood, uprobes works by faulting in the page which will
contain the probe. The instruction at the probe location is copied aside
and replaced by a breakpoint. Every process which has that file mapped then
gets a pointer in its mm structure pointing to the data describing
the probe(s) for that file. When a process executes the breakpoint, the
probe's handler function will be called; on that handler's return, the kernel will
single-step the displaced instruction, then return to the location following
the breakpoint.
This "execute out of line" (XOL) mechanism has been controversial in the
past because it requires the injection of a new virtual memory area (VMA)
into every process which encounters probes. That VMA is seen as a
distortion of the process's behavior which could have strange effects. The
alternatives, though, are not entirely appealing either. The
ptrace() approach is to put the original instruction back into its
original location, execute it, then replace the breakpoint; that only works
if every process which has the file mapped is stopped for the duration of
the operation (otherwise they might execute the affected code while the
breakpoint is missing). Uprobes, instead, is able to handle breakpoint hits without
perturbing other processes. Another alternative discussed in the past is
emulating the displaced instruction in the kernel; that requires having a
full x86 emulator in kernel space, which is not entirely appealing either.
So the current plan appears to be to stick with XOL.
Not having to stop the world when a breakpoint is hit is one of the
advantages of uprobes, but there are others. It dispenses with the whole
ptrace() mechanism involving signals, reparenting processes, and
so on. Handling a probe hit does not require a context switch unless the
probe itself does; many types of tracing tasks, for example, would never
have to switch
to another process. Uprobes also allows multiple applications to be
tracing the same set of processes at the same time. All of these make the
interface appealing to some users.
Who those users are is not clear to everybody, though. There is clearly
some interest in the SystemTap camp, but the needs of SystemTap do not
necessarily carry a lot of weight on linux-kernel. Thomas Gleixner put it this way:
And it does not matter at all whether systemtap can use this or
not. If the main debuggers used like gdb are not going to use it
then it's a complete waste. We don't need another debugging
interface just for a single esoteric use case.
At times, gdb developers have indicated that they might be open to using
a Linux-specific interface if there were advantages to doing so. Such use
seems distant at the moment, though. More immediate users are likely to be
found in the tracing community; uprobes opens the possibility of getting a
single stream of trace data covering both user and kernel space.
ptrace() is not a useful interface for tracing, so something needs
to be done (though there is still some disagreement over whether user-space
tracing needs to involve the kernel at all). Uprobes might be that
something.
In fact, this version of the uprobes patch includes an ftrace-based
interface. Part 20 of the patch contains the entirety of the
documentation for this feature, quoted below:
# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g DF .text 0000000000000012 Base zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
An actual document is listed as a "TODO" item. The current interface looks
a bit painful to use, and it appears to be limited to printing register
contents for now. A more flexible and better documented interface could
prove useful, though, especially if (as planned) it also can be made to
work with the perf events subsystem.
The comments on the patch set indicate some concern about whether the
kernel needs the feature or not. But even the more critical reviewers
have been going over the code pointing out small things - the kind of
review one does when one wants to help the author get the code into shape
for merging. This code will not be merged for 2.6.39, and, for this type
of code, making predictions for merging at any definite time is a hazardous
affair. But, given sufficient will, it seems like uprobes could be made
ready for inclusion sometime this year.
Environmental sensors were, once upon a time, equipment which was only
found in specialized settings like industrial process control or scientific
research. They were expensive and tuned to a specific task.
Increasingly, though, sensors are being attached to all kinds of devices.
Mobile handsets have compasses, accelerometers, and more. Sensors for
temperature, pressure, etc. are becoming increasingly common as well. The
implications are fun; any Linux machine can be a versatile data collection
device.
The only problem with all of this is that the Linux kernel does not yet have
an established API - either internal or to user space - for sensors. There
are interfaces for specific types of sensors; Video4Linux2 handles
cameras, for example, and the hwmon subsystem deals with the specific class
of sensors aimed at monitoring the health of the computer itself. In these
areas, the interfaces are well established and interoperation is possible.
For sensors which fall outside of these classes, though, there are no real
rules. The outcome of this kind of situation is always the same: new
devices are added with inconsistent interfaces, making life hard for
application developers.
This situation came to light (again) with the recent submission of a pressure sensor driver which was implemented
as a misc device. It used the input subsystem to present its interface;
Jonathan Cameron, who has been working on sensor interfaces, pointed out
that the patch would not be accepted in that form. Input devices are meant
for human input; since most humans do not communicate with their systems
via large ambient pressure changes, this device did not fit. So the
driver needs another home. The hwmon subsystem was suggested, but the
pressure sensor is not really a hardware monitor, so the driver is not
welcome there either. Arnd Bergmann also does
not like the use of the misc interface:
I generally try to prevent people from adding more ad-hoc
interfaces to drivers/misc. Anything that is called a drivers/misc
driver to me must qualify as "there can't possibly be a second
driver with the same semantics", otherwise it should be part of
another subsystem with clear rules, or be put into its own file
That leaves the industrial I/O (IIO) subsystem, which is meant "for devices
that in some sense are analog to digital converters." IIO tries to handle
a wide variety of sensors in some sort of standard way with support for
events, higher bandwidth I/O, and more. There are quite a few drivers in
the IIO subsystem now; the only problem is that the whole thing lives in
the staging tree and the associated "TODO" list is reasonably long. The
devices which are represented there now are not all consistent in their
interface use - and the form of the desired interface is not at all clear.
Still, putting together such an interface is Jonathan's goal:
To my mind, there will one day be a suitable 'sensors' subsystem so
an important side point is to try and minimise interface changes
needed to move to that (IIO or something better). Sysfs is easy to
fix, so lets at least work on shared interfaces in there. Hwmon is
a mature and reasonable starting point; it's where we got a lot of
IIO's similar interfaces from. The trick is convincing people to
consider generality and it's a hard trick to pull off.
He adds that the interface and support for simple devices (those with slow
data rates and hwmon-style sysfs interfaces) is in reasonably good shape.
The question is how to get the rest of the job done.
One alternative would be to define an essentially new IIO core which would
be merged into the mainline. Individual drivers could then be worked into
shape and moved over once they are ready. The problem is that this could
be a long process, and that the mainline versions of the drivers might not
initially have all of the functionality of their black-sheep staging
cousins. That would mean more maintenance work keeping both versions of
the driver working for some time.
Still, that's the approach that Arnd
recommends. The move to the mainline is the last good chance to define
an interface which will then need to be supported for many years. So some
pain now, if used properly, may be warranted in order to make life easier
in the future. Getting driver developers to buy into this idea may not be
entirely easy; most of them spend the bulk of their time doing something
other than writing Linux driver code and may lack the desire to move to a
new interface when what they have now works. But that's almost certainly
the best way forward. Now is almost certainly a good time for people with
an interest in this area to help in the development of the mainline version
of the IIO interface.
Page editor: Jonathan Corbet