Brief items
The current development kernel remains 2.6.36-rc3; no new prepatches
have been released over the last week. Linus has returned from
his trip to Brazil and resumed
merging changes, so 2.6.36-rc4 can probably be expected in the near future.
Stable updates: there have been no 2.6 stable updates released over
the last week.
The 2.4 kernel lives - for a little while longer, at least. Willy Tarreau
has just released the
2.4.37.10 update,
with a small set of important fixes. This might just be the last update in
this series, unless some sort of important fix comes in. "If nothing
happens before September 2011, it's possible that there won't be any
2.4.37.11 at all. By that time, the 2.6 kernel will have been available for
almost 8 years, this should have been enough for anyone to have a look at
it. Users now have one year to migrate or to report critical bugs. I think
that's an honest deal." See the announcement for the full
description of his planned policy.
A lot of my conversations about union mounts with Al [Viro] go like
this:
Al: "Rewrite it this way."
Val: "But then how do we get the nameidata?"
Al: "Arrrrrrrrrrrrrggggh."
--
Valerie Aurora
Well damn, good detective work. I wonder how many of those nasty
random bad-page-state bug reports just got fixed. I dub thee
September's "Hero of the Linux kernel"!
--
Andrew Morton (to Jiri Slaby)
Kernel developers are paid to work on features, yes. They are not
paid to fix bugs for random folks who want run the latest stable
kernel.
--
Ted Ts'o
By Jonathan Corbet
September 8, 2010
As anybody who has read "What every programmer should know about
memory" knows, performance on
contemporary systems is often dominated by cache behavior. A single cache
miss can cause a processor stall lasting for hundreds of cycles. The
kernel employs many tricks and techniques to optimize cache behavior, but,
as is often the case with low-level optimization, it turns out that some of
those tricks are not as helpful as had been thought.
The kernel's linked list macros include a set of operators for iterating
through a list. At the top of a list-processing loop, the macros will
issue a prefetch operation for the next entry in the list. The hope is
that, by the time one entry has been processed, the CPU will have fetched
the following entry into its cache, avoiding a stall at the beginning of
the next trip through the loop. It
seems like the sort of micro-optimization which can only help, and nobody
has looked closely at these prefetch operations for a long time - until
now. Andi Kleen has just posted a patch removing most of those
prefetches.
Andi's contention is that, on contemporary processors, the prefetch
operations are actually making things worse. These processors already
prefetch everything they can get their hands on, so the explicit prefetch
is unlikely to help. Even if that prefetch does start a memory cycle
earlier than it would have otherwise happened, list processing loops tend to be so
short that the amount of additional parallelism gained is quite small.
Meanwhile the prefetch operations bloat the kernel image, increase register
use, and cause the compiler to generate worse code. So, he says, we are
better off without them.
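The pattern under discussion can be sketched in user space. This is only an illustration, not the kernel's list macros: it uses a simplified, NULL-terminated singly-linked list (the kernel's lists are circular and doubly linked), the names struct node, sum_with_prefetch() and sum_plain() are invented for the example, and __builtin_prefetch() is the GCC/Clang builtin standing in for the kernel's prefetch().

```c
#include <stddef.h>

/* Hypothetical list node, for illustration only. */
struct node {
    struct node *next;
    long value;
};

/* Walk the list, hinting the CPU to fetch the next entry at the top of
 * each iteration, in the style of the old list-iteration macros.
 * The prefetch is only a hint and does not fault on a NULL address. */
static long sum_with_prefetch(const struct node *head)
{
    long total = 0;
    const struct node *pos;

    for (pos = head; pos != NULL; pos = pos->next) {
        __builtin_prefetch(pos->next);
        total += pos->value;
    }
    return total;
}

/* The same walk with the explicit prefetch removed. */
static long sum_plain(const struct node *head)
{
    long total = 0;
    const struct node *pos;

    for (pos = head; pos != NULL; pos = pos->next)
        total += pos->value;
    return total;
}
```

Both functions compute the same result; the only difference is the extra instruction (and the register needed for its address) in each trip through the loop. With a body this short, there is little work for the prefetch to overlap with.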
With the prefetch operations removed, Andi's kernel image ends up being
10KB smaller. It also shows no performance regressions over mainline
kernels. Unless somebody else gets different results, that seems like
enough to justify putting this patch into the mainline.
By Jonathan Corbet
September 8, 2010
In kernel-related email communications, the general rule is to err on the side of
sending copies to too many recipients rather than too few. The volume on
the mailing lists is such that one can never assume that interested people
will see any specific message there, so it's customary to copy people
explicitly. Recently, though, a number of kernel developers have started
to
complain that they are getting copies of
patches that they have no interest in. Often, the selection of recipients
seems entirely random.
The culprit is the get_maintainer.pl script shipped with the
kernel source. This script is actually a useful tool; it will look in the
MAINTAINERS file to find the people who might be interested in a
specific patch. Potentially less usefully, it will also dig through the
repository history and list other developers who have made changes to the
files modified by a given patch. So anybody who has tweaked a given file
in the recent past - possibly making only trivial changes - will be listed
as somebody to copy on any other patches to the same file.
Looking at the file revision history can, indeed, be a useful way to find
the "real" maintainers; the information in MAINTAINERS, by contrast, can
be incomplete or outdated. But, clearly, one needs to look at what a
developer has actually done in a given area; fixing a file for an API
change does not mean that the developer is actively working on that code.
Many developers don't perform that check and, instead, just send mail to
everybody listed by the script.
The level of grumpiness caused by widely-broadcast patches seems to be on
the rise. Developers who don't want to receive an irritated response to
their postings might want to take a little care or, at least, use the
--nogit option to get_maintainer.pl.
Kernel development news
By Jonathan Corbet
September 8, 2010
Last week's article on stable
kernels drew a number of comments, both public and private. Those
comments suggested a couple of other ways of looking at how the stable tree
works and how patches get into it. That, in turn, has inspired this follow-up
look at the stable kernel process.
A certain amount of unhappiness was expressed regarding the tables of
the most active stable contributors. Those tables attribute stable
contributions to the individuals who write the patches. Things were done
this way for two reasons: (1) the patch author is, indeed, the person
who fixed the bug, and (2) that is the information which is available
in the stable kernel repository. It made sense, your editor thought, to
assign credit - for the fix, but, also, potentially, for the bug which
required the fix - in this way.
It turns out that a number of people see stable contributions in a
different way. The real credit, they say, belongs to the person who
notices that a patch fixes a bug in stable kernels and ensures that the fix
gets directed to the stable kernel maintainer. There are people, often
those working on maintaining distributor kernels, who spend a lot of time
watching the patch stream and looking for just this kind of fix. It is a
lot of work, and the people who do that work certainly deserve credit for
the service they are performing for the community.
Your editor would be delighted to be able to produce a table crediting this
type of stable contributor. Unfortunately, the electronic trail needed to
create this table simply does not exist. One could try to play games by
looking at how the patch tags differ between the mainline and stable
versions of the fix; there will often be an extra signoff or Cc:
tag naming the person who forwarded the patch to the stable tree. But
such schemes will be approximate and error-prone. If we really want to
track and credit developers who flag patches for the stable tree, we almost
certainly need to add a new patch tag making that credit explicit in every
patch.
A related complaint came, via private mail, from a subsystem maintainer;
his point of view was that the subsystem maintainers are the people doing
the real legwork to get important fixes into the stable tree. A diligent
maintainer will be evaluating all patches as they are merged into the
subsystem tree, catching those which have stable kernel implications and
directing them accordingly. He suggested a study to evaluate the
percentage of stable patches coming out of each subsystem tree as a way to
identify which maintainers are on top of things.
Your editor, intrigued by that idea, ran a quick study. The table below
shows numbers for some selected subsystems for the 2.6.32 stable series.
Since 2.6.32 is still under maintenance, it will have received patches from
all of the mainline releases from 2.6.33 to the present. For each
subsystem, we can look at how many patches have gone into the mainline
(through 2.6.36-rc3) and how many of those went into the stable series.
The results look like this:
| Subsystem     | Patches (mainline) | Patches (stable) | Pct |
| fs/ext4       |                216 |               90 | 42% |
| fs/btrfs      |                155 |               42 | 27% |
| drivers/usb   |               1003 |              112 | 11% |
| arch/x86      |               1877 |              176 |  9% |
| drivers/acpi  |                291 |               24 |  8% |
| mm            |                602 |               48 |  8% |
| kernel        |               1471 |               96 |  7% |
| sound         |               1369 |               88 |  6% |
| fs/ext3       |                 58 |                3 |  5% |
| drivers/scsi  |               1054 |               51 |  5% |
| net           |               2324 |               98 |  4% |
| drivers/input |                381 |               13 |  3% |
| arch/powerpc  |                917 |               18 |  2% |
| drivers/media |               1705 |               26 |  2% |
| block         |                182 |                3 |  2% |
| arch/arm      |               3221 |               19 | <1% |
| tools         |                873 |                3 | <1% |
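The percentage column is just the stable count divided by the mainline count. A minimal sketch of that arithmetic follows; the function name stable_pct() is invented for this example, and rounding to the nearest whole percent is an assumption that happens to match the figures shown. The table reports anything under one percent as "<1%", which this helper does not reproduce.

```c
/* Percentage of a subsystem's mainline patches which also went into the
 * stable series, rounded to the nearest whole percent (assumed rounding). */
static unsigned int stable_pct(unsigned int stable, unsigned int mainline)
{
    if (mainline == 0)
        return 0;   /* avoid dividing by zero for an empty subsystem */
    return (stable * 100u + mainline / 2u) / mainline;
}
```

For example, stable_pct(90, 216) gives the 42 shown for fs/ext4, and stable_pct(96, 1471) gives the 7 shown for kernel.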
At the upper end of the table, it is unsurprising to find the ext4 and
btrfs filesystems showing a high percentage of stable patches. Both of
those filesystems are undergoing heavy stabilization work at present,
so it makes sense that the bulk of the changes merged will be important
fixes. The relatively small percentage of ext3 changes going into the
stable tree was interesting; a quick check shows that many of the ext3
changes which did not go to stable
reflect API changes in the VFS and disk quota code.
That said, it also appears that a small number of fixes might have fallen
through the cracks.
It's hard to draw conclusions from much of the rest of the table; different
subsystems will naturally vary in the ratio of fixes to new features, so
they will never have the same percentage of patches going into the stable
tree. That said, there do seem to be some real variations in how many
fixes are being directed to stable by the subsystem maintainers.
One might, for example, wonder if a few more than 19 of the 3221
changes to the ARM architecture could have qualified for the stable tree.
This maintainer also pointed out one other aspect of the problem: the
maintainer's real job is often to say "no" in the same way as with mainline
patches. It seems that some developers have an expansive view of which
changes are suitable for the stable tree, so they flag patches which are
too large and invasive, or which do not actually fix serious bugs. In
these cases, the maintainer must remove the stable tag and keep the patch
from going in that direction. Needless to say, this kind of activity is
even harder to track, so there will be no "stable rejections" table.
In any case, maintainers needing to turn away marginal stable patches seems
like the right kind of problem to have. Bugs are annoying in the best of
times, but they are doubly annoying when a fix exists but is not
distributed to people who need it. The stable tree seems to be doing a
good job of getting those fixes out; that makes Linux better for all of us.
By Jonathan Corbet
September 7, 2010
One of the biggest internal changes in 2.6.36 will be the adoption of
concurrency-managed workqueues.
The short-term goal of this work is to reduce the number of kernel threads
running on the system while simultaneously increasing the concurrency of
tasks submitted to workqueues. To that end, the per-workqueue kernel
threads are gone, replaced by a central set of threads with names like
[kworker/0:0]; workqueue tasks are then dispatched to the threads
via an algorithm which tries to keep exactly one task running on each CPU
at all times. The result should be better use of the CPU for workqueue
tasks and less memory tied up by the workqueue machinery.
That is a worthwhile result in its own right, but it's really only a
beginning. The 2.6.36 workqueue patches were deliberately designed to
minimize the impact on the rest of the kernel, so they preserved the
existing workqueue API. But the new code is intended to do more than
replace workqueues with a cleverer implementation; it is really meant to be
a general-purpose task management system for the kernel. Making full use
of that capability will require changes in the calling code - and in code
which does not yet use workqueues at all.
In kernels prior to 2.6.36, workqueues are created with
create_workqueue() and a couple of variants. That function will,
among other things, start up one or more kernel threads to handle tasks
submitted to that workqueue. In 2.6.36, that interface has been preserved,
but the workqueue it creates is a different beast: it has no dedicated
threads and really just serves as a context for the submission of tasks.
The API is considered deprecated; the proper way to create a workqueue now is
with:
    struct workqueue_struct *alloc_workqueue(char *name, unsigned int flags, int max_active);
The name parameter names the queue, but, unlike in the older
implementation, it does not create threads using that name. The
flags parameter selects among a number of relatively complex
options on how work submitted to the queue will be executed; its value can
include:
- WQ_NON_REENTRANT: "classic" workqueues guaranteed
that no task would be run by two threads simultaneously on the same
CPU, but made no such guarantee across multiple CPUs. If it was
necessary to ensure that a task could not be run simultaneously
anywhere in the system, a single-threaded workqueue had to be used,
possibly limiting concurrency more than desired. With this flag, the
workqueue code will provide that systemwide guarantee while still
allowing different tasks to run concurrently.
- WQ_UNBOUND: workqueues were designed to run tasks on
the CPU where they were submitted in the hope that better memory cache
behavior would result. This flag turns off that behavior, allowing
submitted tasks to be run on any CPU in the system. It is intended
for situations where the tasks can run for a long time, to the point
that it's better to let the scheduler manage their location.
Currently the only user is the object processing code in the FS-Cache
subsystem.
- WQ_FREEZEABLE: this workqueue will be frozen when the
system is suspended. Clearly, workqueues which can run tasks as part
of the suspend/resume process should not have this flag set.
- WQ_RESCUER: this flag marks workqueues which may be
involved in memory reclaim; the workqueue code responds by ensuring
that there is always a thread available to run tasks on this queue.
It is used, for example, in the ATA driver code, which always needs to
be able to run its I/O completion routines to be sure it can free
memory.
- WQ_HIGHPRI: tasks submitted to this workqueue will be put
at the head of the queue and run (almost) immediately. Unlike
ordinary tasks, high-priority tasks do not wait for the CPU to become
available; they will be run right away. That means that multiple
tasks submitted to a high-priority queue may contend with each other
for the processor.
- WQ_CPU_INTENSIVE: tasks on this workqueue can be
expected to use a fair amount of CPU time. To keep those tasks from
delaying the execution of other workqueue tasks, they will not be
taken into account when the workqueue code determines whether the CPU
is available or not. CPU-intensive tasks will still be delayed
themselves, though, if other tasks are already making use of the CPU.
The combination of the WQ_HIGHPRI and WQ_CPU_INTENSIVE
flags takes this workqueue out of the concurrency management regime
entirely. Any tasks submitted to such a workqueue will simply run as soon
as the CPU is available.
The final argument to alloc_workqueue() (we are still
talking about alloc_workqueue(), after all) is
max_active. This parameter limits the number of tasks which can
be executing simultaneously from this workqueue on any given CPU. The
default value (used if max_active is passed as zero) is 256, but
the actual maximum is likely to be far lower,
given that the workqueue code really only wants one task using the CPU at
any given time.
Code which requires that workqueue tasks be executed in the order in which
they are submitted can use a WQ_UNBOUND workqueue with
max_active set to one.
(Incidentally, much of the above was cribbed from Tejun Heo's in-progress document on workqueue
usage).
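To make the calling convention concrete, here is a user-space model of how such calls might look. It is not kernel code: the flag values, the contents of struct workqueue_struct, the DEMO_DEFAULT_ACTIVE constant and the two helper functions are stand-ins invented so that the example compiles outside the kernel, and the stub alloc_workqueue() only records its arguments. The real definitions live in the kernel's workqueue header and should be consulted before writing actual driver code.

```c
#include <stdlib.h>

/* Stand-in flag values for illustration; the real ones come from the
 * kernel headers and may differ. */
enum {
    WQ_NON_REENTRANT = 1 << 0,
    WQ_UNBOUND       = 1 << 1,
    WQ_FREEZEABLE    = 1 << 2,
    WQ_RESCUER       = 1 << 3,
    WQ_HIGHPRI       = 1 << 4,
    WQ_CPU_INTENSIVE = 1 << 5,
};

/* Default applied when max_active is passed as zero, per the text above. */
#define DEMO_DEFAULT_ACTIVE 256

/* Stand-in structure: the real one is private to the workqueue code. */
struct workqueue_struct {
    const char *name;
    unsigned int flags;
    int max_active;
};

/* Stub which only records what was asked for; it starts no threads. */
static struct workqueue_struct *alloc_workqueue(const char *name,
                                                unsigned int flags,
                                                int max_active)
{
    struct workqueue_struct *wq = malloc(sizeof(*wq));

    if (!wq)
        return NULL;
    wq->name = name;
    wq->flags = flags;
    wq->max_active = max_active ? max_active : DEMO_DEFAULT_ACTIVE;
    return wq;
}

/* Strict submission-order execution: unbound, one task at a time. */
static struct workqueue_struct *demo_ordered_queue(const char *name)
{
    return alloc_workqueue(name, WQ_UNBOUND, 1);
}

/* A queue which may be needed for memory reclaim, default concurrency. */
static struct workqueue_struct *demo_reclaim_queue(const char *name)
{
    return alloc_workqueue(name, WQ_RESCUER, 0);
}
```

The point of the sketch is the shape of the calls: flags are combined with bitwise OR, a max_active of zero asks for the default, and the ordered case is simply WQ_UNBOUND with max_active set to one.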
The long-term plan, it seems, is to convert all create_workqueue()
users over to an appropriate alloc_workqueue() call; eventually
create_workqueue() will be removed. That task may take a little
while, though; a quick grep turns up nearly 300 call sites.
An even longer-term plan is to merge a number of other kernel threads into
the new workqueue mechanism. For example, the block layer maintains a set
of threads with names like flush-8:0 and bdi-default;
they are charged with getting data written out to block devices. Tejun
recently posted a patch to
replace those threads with workqueues. This patch has made some developers
a little nervous - problems with writeback could create no end of trouble
when the system is under memory pressure. So it may be slow to get into
the mainline, but it will probably get there eventually unless regressions
turn up.
After that, there is no end of special-purpose kernel threads elsewhere in
the system. Not all of them will be amenable to conversion to workqueues,
but quite a few of them should be. Over time, that should translate to less
system resource use, cleaner "ps" output, and a better-running
system.
By Jonathan Corbet
September 8, 2010
In August, a longstanding kernel security hole related to overflowing the
stack area
was closed. But
it turns out there are other problems in this area, at least one of which
has been known about since late last year. Fixes are in the
works, but it's hard not to wonder if we are not handling security issues as
well as we should be.
Once again, the problem was reported by Brad Spengler, who posted a short
program demonstrating how easily things can be made to go wrong. The
program allocates a single 128KB array, which is filled as a long C
string. Then, an array of over 24,000 char * pointers is
allocated, with each entry pointing to the large string. The final step is
to call execv(), using this array as the arguments to the program
to be run. In other words, the exploit is telling the kernel to run a
program with as many huge arguments as it can.
Once upon a time, the kernel had a limit on the maximum number of pages
which could be used by a new program's arguments. This limit would have
prevented any problems resulting from the sort of abuse shown by Brad's
program, but it was removed for
2.6.23; it seems that any sort of limit made life difficult for
Google. In its place, a new check was put in which looks like this (from
fs/exec.c):
    /*
     * Limit to 1/4-th the stack size for the argv+env strings.
     * This ensures that:
     *  - the remaining binfmt code will not run out of stack space,
     *  - the program will have a reasonable amount of stack left
     *    to work from.
     */
    rlim = current->signal->rlim;
    if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
        put_page(page);
        return NULL;
    }
The reasoning was clear: if the arguments cannot exceed one quarter of the
allowed size for the process's stack, they cannot get completely out of
control. It turns out that there's a fundamental flaw in that reasoning:
the stack size may well not be subject to a limit at all. In that case,
the value of the limit is -1 (all ones, in other words), and the
size check becomes meaningless. The end result
is that, in some situations, there is no real limit on the amount of stack
space which can be consumed by arguments to exec(). And,
unfortunately, the consequences are not limited to the offending process.
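The failure of the check can be shown with a little arithmetic. The sketch below is a user-space model, not the code from fs/exec.c: the names DEMO_RLIM_INFINITY, args_too_big() and demo_total_arg_bytes() are invented, the 8MB figure is just an assumed example of a finite stack limit, and 24,000 arguments is used as a round lower bound for the "over 24,000" pointers described above.

```c
#include <limits.h>

/* Stand-in for an unlimited RLIMIT_STACK: every bit set, as described. */
#define DEMO_RLIM_INFINITY ULLONG_MAX

/* Model of the quarter-of-the-stack test: nonzero means "reject". */
static int args_too_big(unsigned long long size,
                        unsigned long long stack_limit)
{
    return size > stack_limit / 4;
}

/* Total argument bytes for 24,000 pointers to one 128KB string. */
static unsigned long long demo_total_arg_bytes(void)
{
    return 24000ULL * 128ULL * 1024ULL;
}
```

With an assumed 8MB stack limit, args_too_big() rejects the roughly 3GB of arguments immediately. With the limit set to DEMO_RLIM_INFINITY, one quarter of "all ones" is still an enormous number, so the same request passes the test.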
At a minimum, Brad's exploit is able to oops the system once the stack
tries to expand too far. He mentioned the
possibility of expanding the stack down to address zero - thus reopening
the threat of null-pointer exploits - but has not been able to figure out a
way to make such exploits work. The copying of all those arguments will,
naturally, consume large amounts of system memory; due to another glitch,
that memory use is not properly accounted for, so, if the out-of-memory
killer is brought in to straighten things out, it will not target the
process which is actually causing the problem. And, as if that were not
enough, the counting and copying of the argument strings is not preemptible
or killable; given that it can run for a very long time, it can be very
hard on the performance of the rest of the system.
Brad says that he first reported this problem in December, 2009, but got no
response. More recently, he sent a note to Kees Cook, who posted a partial fix in response. That fix had some
technical problems and was not applied, but Roland McGrath has posted a new set of fixes which gets closer. Roland
has taken a minimal approach, not wanting to limit argument sizes more than
absolutely necessary. So his patch just ensures that the stack will not
grow below the minimum allowed user-space memory address
(mmap_min_addr). That check, combined with the guard page added
to the stack region by the August fix, should prevent the stack from
growing into harmful areas. Roland has also added a preemption point to
the argument-copying code to improve interactivity in the rest of the
system, and a signal check
allowing the process to be killed if necessary. He has not addressed the
OOM killer issue, which will need to be fixed separately.
Roland's patch seems likely to fix the worst problems, though some
commenters feel that it does not go far enough. One assumes that fixes
will be headed toward distribution kernels in the near future. But there
are a couple of discouraging things to note from this episode:
- It seems that the code which is intended to block runaway resource
use in a core Linux system call was never really tested at its
extremes. The Linux kernel community does not have a whole lot of
people who do this kind of auditing and testing, unfortunately; that
leaves the task to the people who have an interest (either benign or
malicious) in security issues.
- It took some nine months after the initial report before anybody tried
to fix the problem. That is not the sort of rapid response that this
community normally takes pride in.
The problem may indicate a key shortcoming in how Linux kernel development
is supported. There are thousands of developers who are funded to spend at
least some of their time doing kernel work. Some of those are paid to work
in security-related areas like SELinux or AppArmor. But it's not at all
clear that anybody is funded simply to make sure that the core kernel is
secure. That may make it easier for security problems to slip into the
kernel, and it may slow down the response when somebody points out problems
in the code. There is a strong (and increasing) economic interest in
exploiting security issues in the kernel; perhaps we need to find a way to
increase the level of interest in preventing these issues in the first
place.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet