The 2.6.39 kernel is out; it was, as predicted, released by Linus immediately after last
week's Edition was published. He indicated some uncertainty about whether
another -rc release would have been appropriate:
However, since I'm going to be at LinuxCon Japan in two weeks, the
choice for me ended up whether I should just release, or drag it
out *three* more weeks, or have some really messy merge window with
a break in between.
Prominent features in this release include IPset, the media controller subsystem, a couple of new
network flow schedulers, the block plugging
rework, the long-awaited removal of the big kernel lock, and more.
page, the LWN merge window
summaries (part 1, part 2, and part 3) and Thorsten
Leemhuis's summary on The H for more information about this release.
Stable updates: 22.214.171.124 was
released on May 21, followed by 126.96.36.199 and 188.8.131.52 on May 23. Each contains the
usual list of important fixes.
Quite some time ago I was horrified by the private behaviour of a
hacker I deeply respected: malicious, hypocritical stuff. And it
caused an internal crisis for me: I thought we were all striving
together to make the world a better place. Here are the results I
- Being a great hacker does not imbue moral or ethical
- Being a great coder doesn't mean you're not a crackpot.
- Working on a great project doesn't mean you share my
motivations about it.
This wasn't obvious to me, and it seems it's not obvious to others.
-- Rusty Russell
The more I look at the arguments for why assholes are necessary to
good code, the more I have to wonder if some form of Stockholm
syndrome is at work.
Can we drop most of MCA, EISA and ISA bus if we are going to have a
big version change ? A driver spring clean is much overdue and it's
all in git in case someone wishes to sneak out at midnight and
bring some crawly horror back from the dead.
-- Alan Cox
UEFI stands for "Unified Extensible Firmware Interface", where
"Firmware" is an ancient African word meaning "Why do something
right when you can do it so wrong that children will weep and brave
adults will cower before you", and "UEI" is Celtic for "We missed
DOS so we burned it into your ROMs".
-- Matthew Garrett
Wireless networking hacker Luis Rodriguez has put together a set of videos
on how the Linux 802.11 layer works
aimed at developers writing and
supporting wireless drivers. "If you have engineers who need to
support the 802.11 Linux subsystem you at times see yourself needing to
educate each group through some sessions. In hopes of reusing educational
sessions I've decided to record my own series and post it on
" Topics covered include overviews of the 802.11 subsystem,
how the development process works, driver debugging, and more.
Linus's message warning that this merge window may be a little shorter than
usual ends with an interesting postscript: "The voices in my head
also tell me that the numbers are getting too big. I may just call the
thing 2.8.0. And I almost guarantee that this PS is going to result in more
discussion than the rest, but when the voices tell me to do things, I listen."
Last week's article
on the idea of
expanding the "secure computing" facility by integrating it with the
perf/ftrace mechanism mentioned the unsurprising fact that the developers
of the existing security module mechanism were not entirely enthusiastic
about the creation of a new and completely different security framework.
Since then, discussion of the patch has continued, and opposition has come
from an entirely different direction: the tracing and instrumentation developers.
Peter Zijlstra started off the new discussion with a brief note reading: "I strongly oppose
to the perf core being mixed with any sekurity voodoo (or any other active
role for that matter)." Thomas Gleixner jumped in with a more detailed description of
his objections. In his view, adding security features to tracepoints will
add overhead to the tracing system, make it harder to change things in the
future, and generally mix tasks which should not be mixed. It would be
better, he said, to keep seccomp as a separate facility which can share the
filtering mechanism once a suitable set of internal APIs has been worked out.
Ingo Molnar, a big supporter of this patch, is
undeterred; his belief is that more strongly integrated mechanisms will
create a more powerful and useful tool. Since the security decisions need
to be made anyway, he would like to see them made using the existing
instrumentation to the highest level possible. That argument does not
appear to be carrying the day, though; Peter replied:
But face it, you can argue until you're blue in the face, but both
tglx and I will NAK any and all patches that extend perf/ftrace
beyond the passive observing role.
As of this writing, that's where things stand. Meanwhile, the expanded
secure computing mechanism - which didn't use perf in its original form -
will miss this merge window and has no clear path into the mainline. Given
that Linus doesn't like the original idea
either, it's not at all clear that this functionality has a real future.
Kernel development news
As of this writing, some 5400 non-merge changesets have been pulled into
the mainline kernel for the next release. The initial indications are that
this development cycle will not have a huge number of exciting new
features, but there are still some interesting additions. Among the
user-visible changes are the following:
- There are two new POSIX clock types: CLOCK_REALTIME_ALARM and
CLOCK_BOOTTIME_ALARM; they can be used to set timers that
will wake the system from a suspended state. See this article for more information on
these new clocks.
- The Quick Fair Queue
packet scheduler has been added to the network stack.
- The just-in-time compiler for BPF packet
filters has been merged; only x86-64 is supported for now.
- There is a new networking system call:
int sendmmsg(int fd, struct mmsghdr *mmsg, unsigned int vlen,
unsigned int flags);
It is the counterpart to recvmmsg(), allowing a process to
send multiple messages with a single system call.
- The ICMP sockets feature has been
merged; its main purpose is to allow unprivileged programs to send ICMP echo requests (pings).
- Two new sysctl knobs allow the capabilities given to user-mode helpers
invoked by the kernel to be restricted; see the
commit for details.
- The tmpfs filesystem has gained support for extended attributes.
- The Xen block backend driver (allowing guests to export block devices
to other guests) has been merged.
- New hardware support includes:
- Systems and processors:
Netlogic XLR/XLS MIPS CPUs,
Lantiq MIPS-based SOCs,
PowerPC A2 and "wire speed processor" CPUs, and
Armadeus APF9328 development boards.
- Audio/video: Philips TEA5757 radio tuners,
Digigram Lola boards,
Apple iSight microphones,
Maxim max98095 codecs,
Wolfson Micro WM8915 codecs,
Asahi Kasei AK4641 codecs,
HP iPAQ hx4700 audio interfaces,
NXP TDA18212 silicon tuners,
Micron MT9V032 sensors,
Sony CXD2820R DVB-T/T2/C demodulators,
RedRat3 IR transceivers,
Samsung S5P and EXYNOS4 MIPI CSI receivers, and
Micronas DRXD tuners.
- Input: PenMount dual touch panels,
Maxim max11801 touchscreen controllers,
Analog Devices ADP5589 I2C QWERTY keypad and I/O expanders, and
Freescale MPR121 Touchkey controllers.
- Network: Marvell "WiFi-Ex" wireless adapters (SD8787 initially) and
Marvell 8787 Bluetooth interfaces.
- USB: Renesas USBHS controllers,
Samsung S5P EHCI controllers,
Freescale USB OTG transceivers, and
Samsung S3C24XX USB high-speed controllers.
- Miscellaneous: CARMA DATA-FPGA programmers,
Broadcom's "advanced microcontroller bus architecture,"
Freescale SEC4/CAAM security engines,
Samsung S5PV210 crypto accelerators,
Maxim MAX16065, MAX16066,
MAX16067, MAX16068, MAX16070, and MAX16071 system managers,
Maxim MAX6642 temperature sensors,
TI UCD90XXX system health controllers,
TI UCD9200 system controllers,
Analog Devices ADM1275 hot-swap controllers,
Analog Devices AD5504, AD5501, AD5760, and AD5780 DACs,
Analog Devices AD7780 and AD7781 analog to digital converters,
Analog Devices ADXRS450 Digital Output Gyroscopes,
Xilinx PS UARTs,
TAOS TSL2580, TSL2581, and TSL2583 light-to-digital converters,
Intel "management engine" interfaces,
nVidia Tegra embedded controllers, and
IEEE 1588 (precision time protocol) clocks.
Also added to the staging tree is the user-space support code for the
USB/IP subsystem which
allows a system to "export" its USB devices over the net.
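The sendmmsg() call listed among the user-visible changes above can be tried from user space along the following lines. This is only a sketch under stated assumptions: it relies on a C library that provides a sendmmsg() wrapper and struct mmsghdr (glibc does so from version 2.14, where the flags argument is a plain int), and the send_batch() helper and MAX_BATCH limit are hypothetical names invented for the illustration.

```c
#define _GNU_SOURCE             /* exposes sendmmsg() and struct mmsghdr in glibc */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_BATCH 8             /* hypothetical cap chosen for this sketch */

/*
 * Describe each string as its own message and hand the whole batch to the
 * kernel with a single sendmmsg() call. Returns the number of messages
 * sent, or -1 with errno set on failure.
 */
int send_batch(int fd, const char *const *texts, unsigned int count)
{
    struct mmsghdr msgs[MAX_BATCH];
    struct iovec iov[MAX_BATCH];
    unsigned int i;

    assert(texts != NULL && count <= MAX_BATCH);
    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < count; i++) {
        /* One iovec per message; the payload is not copied here. */
        iov[i].iov_base = (void *)texts[i];
        iov[i].iov_len = strlen(texts[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, count, 0);
}
```

On a datagram socket each entry arrives as a separate message, so a receiver can pick them up one at a time with recv() or in a batch with recvmmsg().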
Changes visible to kernel developers include:
- Prefetching is no longer used in linked list and hlist traversal;
this may be the
beginning of a much more extensive program to remove explicit prefetch
operations. See this article for more
information on the prefetch removal.
- There is a new strtobool() function for turning user-supplied
strings into boolean values:
int strtobool(const char *s, bool *res);
Anything starting with one of
[yY1] is considered to be true, while strings starting with
one of [nN0] are false; anything else gets an -EINVAL error.
- There is a whole series of new functions for converting user-space
strings to kernel-space integer values; all follow this pattern:
int kstrtol_from_user(const char __user *s, size_t count,
unsigned int base, long *res);
These functions take care of safely copying the string from user space
and performing the integer conversion.
- The kernel has a new generic binary search function:
void *bsearch(const void *key, const void *base, size_t num, size_t size,
int (*cmp)(const void *key, const void *elt));
This function will search for key in an array starting at
base containing num elements of the given size.
- The use of threads for the handling of interrupts on specific lines
can be controlled with irq_set_thread() and
- The static_branch() interface for
the jump label mechanism has been merged.
- The function tracer can now support multiple users with each tracing a
different set of functions.
- The alarm timer mechanism - which can set timers that fire even if the
system is suspended - has been merged.
- An object passed to kfree_rcu() will be handed to
kfree() after the next read-copy-update grace period. There
are a lot of RCU callbacks which only call kfree(); it should
be possible to replace those with kfree_rcu() calls.
- The -Os (optimize for size) option is no longer the default for kernel
compiles; the associated costs in code quality were deemed to be too
high. Linus said: "I still happen to believe that I$ miss
costs are a major thing, but sadly, -Os doesn't seem to be the
solution. With or without it, gcc will miss some obvious code size
improvements, and with it enabled gcc will sometimes make choices that
aren't good even with high I$ miss ratios."
- The first rounds of ARM architecture cleanup patches have gone in. A
number of duplicated functionalities have been consolidated, and
support for a
number of (probably) never-used platform and board configurations has
been removed.
- The W= parameter to kernel builds now takes values from 1
to 3. At the first level, only warnings deemed to have a high
chance of being relevant are emitted; a full kernel build generates "only" 4800 of
them. At W=3, developers get a full 86,000 warnings to look
at. Note that if you want all of the warnings, you need to say
The merge window for this development cycle is likely to end on
May 29, just before Linus boards a plane for Japan. At that time,
presumably, we will learn what the next release will be called; Linus has
made it clear that he thinks the 2.6.x numbers are getting too high and
that he thinks it's time for a change. Tune in next week for the
conclusion of this merge window and the end of kernel version number
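The behavior of the strtobool() and bsearch() helpers described in the list of developer-visible changes above can be modeled in user space. What follows is a sketch written from the descriptions and prototypes given there, not a copy of the kernel source; the model_ prefixes and the cmp_int() comparator are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Only the first character is examined: [yY1] is true, [nN0] is false. */
int model_strtobool(const char *s, bool *res)
{
    assert(s != NULL && res != NULL);
    switch (s[0]) {
    case 'y': case 'Y': case '1':
        *res = true;
        return 0;
    case 'n': case 'N': case '0':
        *res = false;
        return 0;
    default:
        return -EINVAL;     /* anything else, including "", is rejected */
    }
}

/* Binary search over num sorted elements of the given size starting at base. */
void *model_bsearch(const void *key, const void *base, size_t num, size_t size,
                    int (*cmp)(const void *key, const void *elt))
{
    const char *array = base;
    size_t start = 0, end = num;

    while (start < end) {
        size_t mid = start + (end - start) / 2;
        int result = cmp(key, array + mid * size);

        if (result < 0)
            end = mid;              /* key sorts before the middle element */
        else if (result > 0)
            start = mid + 1;        /* key sorts after the middle element */
        else
            return (void *)(array + mid * size);
    }
    return NULL;                    /* no matching element */
}

/* Example comparator for arrays of int. */
int cmp_int(const void *key, const void *elt)
{
    int k = *(const int *)key, e = *(const int *)elt;

    return (k > e) - (k < e);
}
```

The comparator receives the key first and the array element second, matching the prototype quoted in the list; the array must already be sorted in the order that the comparator defines.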
Last week's Kernel Page included a brief article
about the hiding of kernel addresses from user space. This
hiding has come under fire from a number of developers who say that it
breaks things (perf, for example) and that it does not provide any real
additional security. That said, there does seem to be a consensus around
the idea that it's better if attackers don't know where the kernel keeps
its data structures. As it turns out, there might be a better way to do
that than simply hiding pointer values.
There is no doubt that having access to the layout of the kernel in memory
is useful to attackers. As Dan Rosenberg put it:
I agree about the fact that kptr_restrict is an incomplete security
feature. However, I disagree that it lacks usefulness entirely.
Virtually every public kernel exploit in the past year leverages
/proc/kallsyms or other kernel address leakage to target an attack.
The hiding of kernel addresses is meant to deprive attackers of that extra
information, making their task harder. One big problem with that approach
is that most systems out there are running stock distribution kernels.
Getting the needed address information from the distributor's kernel
package is not a particularly challenging task. So, on these systems,
there is no real mystery about the layout of the kernel, regardless of
whether pointer values are allowed to leak to user space or not.
While all of this was being discussed, another idea came out: why not
randomize the location of the kernel in memory at boot time? Address space
layout randomization has been used to resist canned attacks for a long
time, but the kernel takes no such measure for itself. Given that the
kernel image is relocatable, there is no real reason why it always needs to
be loaded at the same address. If the kernel calculated a different offset
for itself at every boot, it could subtract that offset from pointer values
before passing them to user space. Those pointers could then be used by
tools like perf, but they would no longer be helpful for anybody seeking to
overwrite kernel data structures.
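The offset arithmetic described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the LINK_BASE and SLOT_ALIGN values, the function names, and the idea that the offset is a multiple of a fixed alignment are not taken from Dan's patches or from any kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration only. */
#define LINK_BASE  UINT64_C(0xffffffff81000000)  /* address the image was linked at */
#define SLOT_ALIGN UINT64_C(0x200000)            /* assumed placement granularity */

/* Reduce raw boot-time entropy to an aligned load offset using `bits` bits. */
uint64_t pick_offset(uint64_t entropy, unsigned int bits)
{
    assert(bits > 0 && bits < 32);
    return (entropy & ((UINT64_C(1) << bits) - 1)) * SLOT_ALIGN;
}

/* Where something linked at link_addr actually lives after relocation. */
uint64_t runtime_address(uint64_t link_addr, uint64_t offset)
{
    return link_addr + offset;
}

/* The value handed to user space: the boot-time offset is subtracted out. */
uint64_t exported_address(uint64_t runtime_addr, uint64_t offset)
{
    return runtime_addr - offset;
}
```

With this arrangement a tool like perf sees the same addresses on every boot and can still match them against the symbol table, while the value it sees is off by the secret offset from the place an attacker would need to write to.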
Dan has been looking into kernel-space randomization with some success; it
turns out that simply relocating the kernel is not that hard. That said,
he has run up against a few potential problems. The first of those is that
there is very little entropy available at the beginning of the boot
process, so the generation of a sufficiently random base address for the
kernel is not entirely straightforward. It seems that enough bits of
entropy can be derived from the real-time clock and time stamp counter to
make it hard for an attacker to derive the base address later on, but a
real random number would be better.
Next, as Linus pointed out, the kernel is
not infinitely relocatable. There are a number of alignment requirements
which constrain the kernel's placement, so, according to Linus, there is a
maximum of 8-12 bits of randomization available. That means that an
exploit attempt could find the right offset after a maximum of a few
thousand tries. Given that computers can try things very quickly, that
does not give a site administrator much time to respond.
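The "few thousand tries" figure follows directly from the bit counts Linus gave. As a small worked example (the function names are hypothetical, and the average assumes an attacker who tries each equally likely placement once, in any order):

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct placements available with `bits` bits of randomness. */
uint64_t placements(unsigned int bits)
{
    assert(bits < 64);
    return UINT64_C(1) << bits;
}

/* Average number of guesses needed when every placement is equally likely. */
double average_tries(unsigned int bits)
{
    return (double)(placements(bits) + 1) / 2.0;
}
```

Eight bits give 256 possible placements and twelve bits give 4096, so the worst case for the attacker is a few thousand attempts and the average case is about half that.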
As others pointed out, though, that amount of randomness is probably
enough. Failed exploit attempts have a high probability of generating a
kernel oops; even if an administrator does not notice the oops immediately,
it should come to their attention at some point. So the ability to
stealthily take over a system is gone. Beyond that, failed exploits may
well take down the system entirely (especially if, as is the case with many
RHEL systems, the "panic on oops" flag is set) or leave it in a state where
further exploit attempts cannot work. There is, it seems, a real advantage
to forcing an attacker to guess.
That advantage evaporates, though, if an attacker can somehow figure out
what offset a given system used at boot time. Dan noticed one way that could happen:
the unprivileged SIDT instruction can be used to locate the system's
interrupt descriptor table. That location could, in turn, be used to
calculate the kernel's base offset. Dynamic allocation of the table can
solve that problem at the cost of messing with some tricky very-early-boot
code. There could be other advantages to dynamically allocating the table,
though; if the table were put into the per-CPU
area, it might make the system a little more scalable.
So this problem can be solved,
but there will, beyond doubt, be other places where it will be possible for
an attacker to obtain a real kernel-space address. There are simply too
many ways in which that information might leak into user space. Plugging
all of those leaks looks like one of those long-term tasks that is never
really done. It may, however, be possible to get close enough to done that
attackers will not be able to count on knowing the true location of the
kernel in a running system. That may be a bit of security through
obscurity that is worth having.
Over time, software developers tend to learn that micro-optimization
efforts are generally not worthwhile, especially in the absence of hard
data pointing out a specific problem. Performance problems are often not
where we think they are, so undirected attempts to tweak things to make
them go faster can be entirely ineffective. Or, indeed, they can make
things worse. That is a lesson that the kernel developers have just relearned.
At the kernel level, performance often comes down to cache behavior.
Memory references which must actually be satisfied by memory are extremely
slow; good performance requires that needed data be in a CPU cache much of
the time. The kernel goes out of its way to use cache-hot memory when
possible; there has also been some significant work put into tasks like
reordering structures so that fields that are commonly accessed together
are found in the same cache line. As a general rule, these optimizations
have helped performance in measurable ways.
Cache misses are often unavoidable, but it is sometimes possible to attempt
to reduce their cost. If the kernel knows that it will be accessing memory
at a particular location in the near future, it can use a CPU-specific
prefetch instruction to begin the process of bringing the data into cache.
This instruction is made available to kernel code via the generic
prefetch() function; developers have made heavy use of it.
Consider, for example, this commonly-used macro from <linux/list.h>:
#define list_for_each(pos, head) \
for (pos = (head)->next; prefetch(pos->next), pos != (head); \
pos = pos->next)
This macro (in a number of variants) is used to traverse a linked list.
The idea behind the prefetch() call here is to begin the process
of fetching the next entry in the list while the current entry is being
processed. Hopefully by the time the next loop iteration starts, the data
will have arrived - or, at least, it will be in transit. Linked lists are
known to be cache-unfriendly data structures, so it makes sense that this
type of optimization can help to speed things up.
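For readers who want to experiment outside the kernel, the traversal macro can be reproduced in user space. The sketch below assumes a GCC-compatible compiler, using __builtin_prefetch() as a stand-in for the kernel's prefetch(); the list_for_each_prefetch and list_for_each_plain names and the counting helpers are hypothetical and exist only so the two variants can be compared.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Stand-in for the kernel's prefetch(); this is a GCC/Clang builtin. */
#define prefetch(x) __builtin_prefetch(x)

/* The traversal as quoted above, with a prefetch of the following entry. */
#define list_for_each_prefetch(pos, head) \
    for (pos = (head)->next; prefetch(pos->next), pos != (head); \
         pos = pos->next)

/* The same traversal with the prefetch simply dropped. */
#define list_for_each_plain(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

static inline void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static inline void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

size_t count_with_prefetch(struct list_head *head)
{
    struct list_head *pos;
    size_t n = 0;

    assert(head != NULL);
    list_for_each_prefetch(pos, head)
        n++;
    return n;
}

size_t count_plain(struct list_head *head)
{
    struct list_head *pos;
    size_t n = 0;

    assert(head != NULL);
    list_for_each_plain(pos, head)
        n++;
    return n;
}
```

Both variants visit exactly the same entries; the only difference is the hint given to the CPU, which is why the question of whether the hint helps can only be settled by measurement.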
Except that it doesn't - at least, not on x86 processors.
Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from
list operations last September. His patch generated little discussion,
though, and apparently fell through the cracks. Recently, Linus
did some profiling on one of his favorite
workloads (kernel builds) and found that the prefetch instructions were at
the top of the ranking. Performing the prefetching cost time, and that
time was not being repaid through better cache behavior; simply removing
the prefetch() calls made the build go faster.
Ingo Molnar, being Ingo, jumped in and did
a week's worth of research in an hour or so. Using perf and a slightly
tweaked kernel, he was able to verify that using the prefetch instructions
caused a performance loss of about 0.5%. That is not a headline-inspiring
performance regression, certainly, but this is an optimization which was
supposed to make things go faster. Clearly something is not working the
way that people thought it was.
Linus pointed out one problem at the outset: his test involved a lot of
traversals of singly-linked hlist hash table lists. Those lists
tend to be short, so there is not much scope for prefetching; in fact, much
of the time, the
only prefetch attempted used the null pointer that indicates the end
of the list. Prefetching with a null pointer seems silly, but it's also costly:
evidently every such prefetch on x86 machines (and, seemingly, ARM as well)
causes a translation lookaside buffer miss and a pipeline stall. Ingo
measured this effect and came to the conclusion that each null prefetch
cost about 20 processor cycles.
Clearly, null prefetches are a bad idea. It would be nice if the CPU
would simply ignore attempts to prefetch using a null pointer, but that's not how
things are, so, as is often the case, one ends up trying to solve the
problem in software instead. Ingo
did some testing with a version of prefetch() which would only
issue prefetch instructions for non-null pointers; that version did,
indeed, perform better. But it still performed measurably worse than
simply skipping the prefetching altogether.
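The kind of guarded prefetch described here might look something like the following. Ingo's actual test code is not shown in this article, so this is an assumption about its general shape rather than a reproduction; the structure is a simplified singly-linked node (the kernel's hlist_node carries an additional back pointer), and prefetch_nonnull() and hlist_count_guarded() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified null-terminated chain, as found in a hash bucket. */
struct hlist_node {
    struct hlist_node *next;
};

/* Issue the prefetch hint only when there is something to fetch. */
static inline void prefetch_nonnull(const void *p)
{
    if (p)
        __builtin_prefetch(p);  /* GCC/Clang builtin */
}

/* Walk the chain, prefetching the following node unless it is the end. */
size_t hlist_count_guarded(const struct hlist_node *first)
{
    const struct hlist_node *pos;
    size_t n = 0;

    for (pos = first; pos; pos = pos->next) {
        prefetch_nonnull(pos->next);
        n++;
    }
    return n;
}
```

The guard trades the cost of a null prefetch for an extra test and branch on every iteration, which on short chains may be part of why this variant still lost to no prefetching at all.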
CPU designers are well aware of the cost of waiting for memory; they have
put a great deal of effort into minimizing that cost whenever possible.
Among other things, contemporary CPUs have their own memory prefetch units
which attempt to predict which memory will be wanted next and start the
process of retrieving it early. One thing Ingo noticed in his tests is
that, even without any software prefetch operations, the number of prefetch
operations run by the CPU was about the same. So the hardware prefetcher
was busy during this time - and it was doing a better job than the software
at deciding what to fetch. Throwing explicit prefetch operations into the
mix, it seems, just had the effect of interfering with what the hardware
was trying to do.
Ingo summarized his results this way:
So the conclusion is: prefetches are absolutely toxic, even if the
NULL ones are excluded.
One immediate outcome from this work is that, for 2.6.40 (or whatever it
ends up being called), the prefetch() calls have been removed from
linked list, hlist, and sk_buff list traversal operations - just like Andi
Kleen tried to do in September. Chances are
good that other prefetch operations will be removed as well. There will
still be a place for prefetch() in the kernel, but only in
specific situations where it can be clearly shown to help performance. As
with other low-level optimizations (likely() comes to mind),
tossing in a prefetch because it seems like it might help is often not the
right thing to do.
One other lesson to be found in this experience is that numbers matter.
Andi was right when he wanted to remove these operations, but he did not
succeed in getting his patch merged. One could come up with a number of
reasons why things went differently this time, starting with the fact that
Linus took an interest in the problem. But it's also true that
performance-oriented patches really need to come with numbers to show that
they are achieving the desired effect; had Andi taken the time to quantify
the impact of his change, he would have had a stronger case for merging it.
Page editor: Jonathan Corbet