The current development kernel is 2.6.32-rc6, released on November 3. Linus
said in the announcement:
There's been a number of other nasty regressions since 2.6.31 that
got fixed too (largely drivers, several of them suspend/resume related or
in some cases apparently most easily triggered that way), so I'm hoping the
delay resulted in a better -rc all around. And I'm obviously hopeful that
we didn't introduce any major new regressions.
The short-form changelog is in the announcement, or see the full changelog
for all the details.
There have been no stable kernel updates in the last week.
Unfortunately, our biggest competitors are our previous kernels,
and we (were?) really good at writing really fast kernels. And most
of our users who are running the competition are completely
satisfied with all the features it has, so an "upgrade" that
causes a slowdown does not go down well. A feature that 0.01% of
people might use but causes a 0.1% slowdown for everyone
else... may not actually be a good idea. Performance is a feature
too, and every time we do this, we trade off a little bit of that
for things most people don't need.
-- Nick Piggin
The fact is, maintainership does _not_ mean ownership. It means
that you should be _responsible_ for the code, and you get credit
for it, but if problems happen you do NOT "own" it. Not at all.
If you don't understand that, you shouldn't be a maintainer.
-- Linus Torvalds
It looks like the Linux kernel maintainers are frowning on the
FatELF patches. Some got the idea and disagreed, some didn't seem
to hear what I was saying, and some showed up just to be rude.
I didn't really expect to be walking into the buzzsaw that I did. I
imagined people would discuss the merits and flaws of the idea and
we'd work towards an agreeable solution that improves Linux for
everyone. It sure seemed to be going that way at first. Ultimately,
I got hit over the head with package management, the bane of
third-party development, as a panacea for everything.
-- Ryan Gordon
If anyone wants a choice quote from me about the recent Linux
holes, this is what I have to say: Linus is too busy thinking about
masturabating [sic] monkeys, he doesn't have time to care about security.
-- Theo de Raadt
Back in mid-October, Earl Chew reported
a null pointer crash in the kernel pipe code. The initial response to his report
was somewhat slow, partly because the kernel he was running was based on
2.6.21. Earl took the time to dig through the code and identify the
problem, though; it turns out to be an old vulnerability which is still
present in current kernels.
What it comes down to is that there is a race condition in the pipe code.
Prior to 2.6.32-rc6, the code which opens a pipe (for write-only access, in
this case) looked like:

    static int
    pipe_write_open(struct inode *inode, struct file *filp)
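    {
            mutex_lock(&inode->i_mutex);
            /* Nothing here rechecks inode->i_pipe: if the final close
               of the pipe ran just before this point, i_pipe is NULL
               and the increment below dereferences a null pointer.
               (Body reconstructed from 2.6.31-era fs/pipe.c, slightly
               abridged.) */
            inode->i_pipe->writers++;
            mutex_unlock(&inode->i_mutex);

            return 0;
    }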
The problem is that if the final close of this pipe slips in at the wrong
time, inode->i_pipe may have been set to null. So this is yet
another null pointer vulnerability; the rest is just a matter of writing
the exploit. That exploit must face the challenge that the window of
opportunity is quite short, but computers are very good at continually
trying things until something works.
The fix makes the code much more careful about checking the current status
of the pipe, refusing new opens if the final close has already happened.
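Reconstructed from the merged fix (and slightly abridged, so treat it as a
sketch rather than the verbatim patch), the write-open path now verifies
that the pipe still exists before touching it:

    static int
    pipe_write_open(struct inode *inode, struct file *filp)
    {
            int ret = -ENOENT;

            mutex_lock(&inode->i_mutex);
            if (inode->i_pipe) {
                    /* The pipe is still alive; safe to join it. */
                    ret = 0;
                    inode->i_pipe->writers++;
            }
            mutex_unlock(&inode->i_mutex);

            return ret;
    }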
Distributors are shipping updates.
This particular bug is attracting attention because it is in the core
kernel and (relatively) straightforward to trigger. But it is far from
unique. A quick look at commits since 2.6.31 turns up no fewer than 34
which explicitly fix null pointer dereference bugs. Quite a few more fix
things that could be null pointer bugs, and there's no telling how many
more were fixed without an explicit mention in the commit title. Null
pointer bugs are common, and are likely to remain so for quite some time.
What is surprising about this bug is that some distributions are still
vulnerable to it. We have had the ability to keep null pointer bugs from
being exploitable for some time, but certain distributions - generally of
the "enterprise" variety - disable that protection by default. Sites
running such distributions might want to be sure that they have the
vm.mmap_min_addr knob set to a reasonable value; either that or expect to
be vulnerable to more null pointer exploits in the future.
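To see why that knob matters: with vm.mmap_min_addr set to zero, an
unprivileged process can map the page at address zero and fill it with
instructions of its choosing, so a kernel null pointer dereference can end
up executing attacker-controlled code. A minimal sketch of that first
exploit step (error handling aside):

    #include <sys/mman.h>
    #include <string.h>

    int main(void)
    {
            /* Ask for the page at address 0.  This succeeds only if
               vm.mmap_min_addr permits such low mappings. */
            char *page = mmap((void *)0, 4096,
                              PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                              -1, 0);
            if (page == MAP_FAILED)
                    return 1;          /* mmap_min_addr did its job */
            memset(page, 0x90, 4096);  /* exploit payload would go here */
            return 0;
    }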
The IDE drivers have been a relative backwater for a while now; most
distributions have made the transition to the newer libata-based PATA
driver set. But IDE remains in the kernel with no indication that it's no
longer the preferred way of doing things. This can be a problem because,
among other things, it encourages developers to submit new IDE-based
drivers, only to be told that such drivers are no longer being accepted.
To help head off such problems, Robert Hancock has submitted a patch to mark IDE as deprecated. David
Miller has accepted the patch for 2.6.33,
but it may not actually get there yet. David sees a couple of things
which need to be fixed first:
- He would like to see libata create IDE-style device names
(/dev/hdX) so that systems using those names in their
fstab files will continue to work. One might argue that any
such change is a few years late - most systems have been through the
pain of that change already. At this point, mounting by label or UUID
is common, so few users should be affected by the loss of old-style
device names. And, as Alan Cox pointed
out, udev rules can always be written to create those names if
need be. So this requirement may not stick.
- There are some IDE devices which are not yet supported in libata; the
"pmac" driver (for PowerMac on-board IDE devices) is the most-cited
example. Until these devices have support in libata, the IDE layer
clearly cannot be deprecated or removed.
Alan has also suggested that IDE will die of its own accord, and that there
is no need to put additional pressure on users to move away from it. The warning
may go in anyway, though, just for those who don't get the message in other
ways. If it prevents one developer from spending time on a new IDE driver,
it's probably worthwhile.
Kernel development news
It can be tempting to dismiss scalability work as being of interest mainly
to companies running massive server systems; most "ordinary" Linux users
are not running into the kind of problems that scalability-oriented
developers are trying to fix. But, of course, the truth of the matter is
that those users haven't encountered those problems yet. The past
work of scalability-oriented developers is what makes our current desktop
and laptop systems work as well as they do; their current work will enable
next year's consumer-level systems. So Nick Piggin's Japan Linux Symposium
talk on virtual filesystem scalability will be of interest to anybody who
anticipates using Linux in the future.
That said, one of the key constraints on scalability work is that it must
not worsen performance on current systems. So Nick is taking care that his
VFS work will improve scalability with no impact on single-threaded
performance. Beyond that, he is aiming to improve scalability within a
single filesystem - forcing system administrators to split their
filesystems to get better performance would be cheating. To get there, he
has identified five specific bottlenecks which must be addressed.
The first of those is files_lock; it is, he says, the
easiest to fix. This global lock protects a per-superblock list of open
files; it is needed by the file open and close paths. As the number of
threads grows, this lock limits the scalability of filesystem-oriented
workloads. The lock itself is only part of the problem; the real issue is
that a single list_head is never going to be scalable in
multiprocessor situations. In this case, it turns out that the kernel
almost never needs to read the full list of open files; that only happens
at unmount time. So turning the single list into a per-CPU list is a
viable option; it eliminates the locking altogether and makes the
management of the list scalable. The only tricky part is when files are
removed; that requires cross-CPU access to the list.
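A sketch of the shape of that change (the names, including the struct file
fields, are illustrative rather than the actual patch):

    static DEFINE_PER_CPU(struct list_head, open_files);
    static DEFINE_PER_CPU(spinlock_t, open_files_lock);

    /* Add a newly-opened file to the local CPU's list: no global lock,
       no shared cache line. */
    static void sb_list_add(struct file *f)
    {
            int cpu = get_cpu();

            spin_lock(&per_cpu(open_files_lock, cpu));
            list_add(&f->f_sb_list, &per_cpu(open_files, cpu));
            f->f_sb_list_cpu = cpu;  /* hypothetical field: the close
                                        may run on another CPU, which
                                        must find this list and lock */
            spin_unlock(&per_cpu(open_files_lock, cpu));
            put_cpu();
    }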
Next on the list is vfsmount_lock, which is used when
finding mounts from directory entry ("dentry") structures. This lock is
taken when crossing mount points in the path lookup process; it is also
used at mount and unmount time. Pathname lookup is clearly a
performance-critical path in the kernel, so getting rid of a global lock
can only be a good thing. Nick considered using read-copy-update
(RCU) for pathname lookup, but he found it to still be too slow. Part of
the problem is the need to block all readers at unmount time, something
that RCU cannot do on its own.
The solution is to go to per-CPU locks. Nick has introduced a variant
on per-CPU locks called brlocks, or "big
reader locks." These locks share the name and goal of the 2.4.x brlocks which were removed in the 2.5
development cycle, but the implementation is different. Essentially, a
brlock is per-CPU for read access, but write access excludes all other
users on all CPUs. Since pathname lookup is a read-only operation, brlocks
will be fast where the kernel needs them to be; unmounts will be slow, but
those are relatively rare operations.
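The semantics, in rough outline (this sketch shows the idea, not Nick's
actual interface):

    static DEFINE_PER_CPU(spinlock_t, br_lock);

    static void br_read_lock(void)
    {
            /* Readers take only their own CPU's lock; the common
               path-lookup case never bounces a shared cache line. */
            spin_lock(&get_cpu_var(br_lock));
    }

    static void br_read_unlock(void)
    {
            spin_unlock(&__get_cpu_var(br_lock));
            put_cpu_var(br_lock);
    }

    /* Writers (mount, unmount) take every CPU's lock, excluding all
       readers everywhere.  Slow, but such operations are rare. */
    static void br_write_lock(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    spin_lock(&per_cpu(br_lock, cpu));
    }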
mnt_count is a per-filesystem reference count, incremented
for each open and decremented for each close. Like the global list
described above, this
global counter limits the scalability of opens and closes. Once again,
going per-CPU is the obvious solution here, with the minor problem that a
put() operation must check whether the (global) count is zero.
But, as it happens, that case only comes about when the filesystem is not
actually mounted, so this check need not be performed most of the time.
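Again in sketch form (one counter per mount in reality; the helpers here
are invented for illustration):

    static DEFINE_PER_CPU(long, mnt_count);   /* really one per mount */

    static void mnt_ref_get(void)
    {
            get_cpu_var(mnt_count)++;
            put_cpu_var(mnt_count);
    }

    static void mnt_ref_put(struct vfsmount *mnt)
    {
            get_cpu_var(mnt_count)--;
            put_cpu_var(mnt_count);

            /* The total can reach zero only once the filesystem is no
               longer mounted, so the expensive cross-CPU sum happens
               only in that rare case. */
            if (unlikely(!mnt_is_mounted(mnt)))     /* hypothetical */
                    mnt_check_total(mnt);           /* sums all CPUs */
    }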
The hardest one to fix is dcache_lock. Most VFS operations
need it, with the sole exception of name lookup, which has used RCU for a
while now. Some operations - LRU scanning and reclaim in the dentry cache
in particular - can hold the lock for a long time. And the lock covers a
whole bunch of different - and sometimes unknown - things. The exporting
of dcache_lock to filesystems has not helped here; individual
filesystems are using it for their own, not always clear, ends. So a
developer trying to bring dcache_lock under control must start by trying to
figure out what it is being used to protect.
Nick has done his best to split apart the various locking cases; these
include the dentry cache hash, the dentry LRU list, the inode dentry alias
list, various statistics, etc. Some of this stuff is moved under the
protection of the per-dentry spinlock (d_lock); other things, like
the dentry hash and LRU, get new locks. There are a lot of problems still,
starting with lock-ordering challenges. Nick is working around some of
these using non-blocking "trylock" operations, but that kind of code tends
to be hard to merge. The various locking cases are still not truly
independent from each other; among other things, that imposes more ordering
requirements. And walking up the directory tree (trying to determine a
path name from a dentry, usually) becomes much harder in the absence of a
single global lock.
In summary, cleaning up dcache_lock looks like a long and messy
project. This is just the lock which is showing up as the worst bottleneck
in some situations, though, so the work needs to be done.
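The lock-ordering problem in miniature: with per-dentry locks, code that
must hold both a dentry's lock and its parent's can deadlock against a
thread locking in the opposite order. The trylock-and-retry pattern looks
roughly like this (a sketch, not the patch set's code):

    /* Take a dentry's lock and its parent's without a global ordering
       rule: if the child's lock is contended, drop everything and
       retry rather than risk an ABBA deadlock. */
    static void lock_parent_and_child(struct dentry *parent,
                                      struct dentry *dentry)
    {
    again:
            spin_lock(&parent->d_lock);
            if (!spin_trylock(&dentry->d_lock)) {
                    spin_unlock(&parent->d_lock);
                    cpu_relax();
                    goto again;
            }
    }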
Finally, there is the matter of inode_lock, which is needed
by most inode operations (lookup, creation, destruction, writeback, sync,
etc). As with dcache_lock, Nick has split the locking into a
number of independent classes - the inode itself, the inode hash, the LRU
list, and so on. Some of these classes are moved under the per-inode lock,
while specific locks have been added for some cases. The per-superblock
inode list has been made into a per-CPU variable, as have the counters used
to generate statistics. Nick has also made the allocation of inode numbers
into a per-CPU operation by assigning a range of numbers to each
processor. This means that inode numbers are no longer allocated
sequentially; it's not clear whether that will be a problem or not.
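One plausible shape for such an allocator (a sketch under the stated
assumption of per-CPU number ranges, not necessarily the patch's code):
each CPU hands out numbers from a private batch and refills the batch from
a shared counter only occasionally.

    #define INO_BATCH 1024

    static atomic_long_t shared_last_ino;
    static DEFINE_PER_CPU(unsigned long, last_ino);

    static unsigned long next_ino(void)
    {
            unsigned long *p = &get_cpu_var(last_ino);
            unsigned long ino = *p;

            /* Touch the shared counter only once per INO_BATCH
               allocations; each CPU otherwise works through its own
               block of numbers. */
            if ((ino & (INO_BATCH - 1)) == 0)
                    ino = (unsigned long)atomic_long_add_return(INO_BATCH,
                                    &shared_last_ino) - INO_BATCH;

            *p = ++ino;
            put_cpu_var(last_ino);
            return ino;
    }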
So what comes of all this work? Nick claims "great" open/close
scalability, and "good" create/unlink scalability. He showed the results
of running a microbenchmark which just did close(open(path))
repeatedly; with current mainline, he was able to get 450 operations/second
on each of 64 CPUs. With the scalability patches added, that rate went up
to over 300,000 operations/second - a significant improvement. Running
unlink(creat(path)) shows better scalability even with two CPUs -
but it does, for some reason, impose a cost on single-threaded workloads on
the IA-64 architecture.
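The benchmark itself is trivial; per thread, it is essentially the
following (a reconstruction of the loop described above, not Nick's
harness):

    #include <fcntl.h>
    #include <unistd.h>

    /* One thread per CPU runs this loop while the harness counts
       completed iterations per second. */
    static void open_close_loop(const char *path)
    {
            for (;;)
                    close(open(path, O_RDONLY));
    }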
The VFS scalability work is clearly worth doing; we'll all be glad that
these problems have been ironed out someday. But there are still some messy
things to clean up, so this patch set (or the gnarlier parts of it, anyway)
may take a while to find its way into the mainline.
Sharing code where it is possible is normally considered a good thing, but
there are some limits to what can be shared. One of the limiting factors
is often license compatibility; GPL code, in particular, often cannot be
combined with code under other licenses and then distributed.
The kernel is licensed under the GPL, but, since it's rare that anyone
wants to combine its code with user-space applications, license
incompatibilities have not been much of a problem.
There is, however, some kernel infrastructure that could be shared with
user-space tracing tools, benefiting both, if those parts of the kernel
were available under
more permissive licenses. Mathieu Desnoyers, who has developed much of
that infrastructure, has set out to try to relicense some fairly small portions of the
kernel under dual licenses, so that the code can be shared.
Essentially, Desnoyers would like to be able to use the kernel tracing
infrastructure in the Linux Trace
Toolkit Next Generation (LTTng) user-space tracer (UST). He describes the need as follows:
The intent is to allow the tracer code developed both on the kernel-side
as part of Ftrace and LTTng and on the userspace side within UST to be
shared when appropriate. As a result, we can consider userland-only
solutions to user-space tracing without rewriting all the kernel
tracing infrastructure from scratch.
All of the files are currently licensed under the GPLv2, but Desnoyers
would like to
see the C files available under a dual GPLv2/LGPLv2.1 license, and the
header files under a dual GPLv2/BSD license. In order to do that—at
least under the most inclusive interpretation of copyright—he must
get permission for the relicensing from each contributor to those files.
His message to linux-kernel listed the few remaining contributors
that he had
not yet heard from.
The files of interest are kernel/marker.c and
kernel/tracepoint.c, along with the corresponding header files in
include/linux. For 2.6.32, kernel markers have been removed, with
all users converted over to use trace events, but marker.[ch] are
still used by UST. The idea is that the C files could be
turned into a user-space library that could be dynamically linked to
applications that required it, while the header files (with an even more
permissive license) could be used to add static tracepoints to any
application, proprietary or free.
For the most part, the relicensing has been met with approval from the
developers who responded, with several saying that they didn't think their
contributions warranted requiring their approval, but they gave it anyway.
Steven Rostedt ran the C file relicensing by Red Hat's legal department and
was granted permission for all of the Red Hat contributions to be dual
licensed under the GPLv2/LGPLv2.1. The header file GPLv2/BSD dual
licensing is still pending with Red Hat, according to Desnoyers.
There are still a few developers who have not responded, but their
contributions are quite small, and could be rewritten rather easily if
necessary. A bigger stumbling block may be opposition from Ingo Molnar, who seems to
consider the relicensing process to be legally dubious: "the
legality of such relicensing is questionable as that code was never
developed outside of the kernel but as part of the kernel". In
addition, he has technical concerns:
But i also disagree with it on a technical level: code duplication is
_bad_. Why does the code have to be duplicated in user-space like that?
I'd like Linux tracing code to be in the kernel repo. Why isn't this done
properly, as part of the kernel project - to make sure it all stays in
sync?
So for those two grounds i cannot give my permission for this relicensing.
Whether Molnar's permission is actually required is something of an open
question as his employer (Red Hat) has already given permission for his
work to be relicensed. But, if there are serious concerns that
lead to a "nack" from him on the relicensing patch, things get rather
murky. It may be that there is a disconnect between Desnoyers and Molnar
such that Desnoyers's intent is not clear. As Pierre-Marc Fournier points out, not relicensing the code leads to
code duplication as well:
So the GPL code will have to be rewritten. And this will result in the
exact same drawbacks you are trying to avoid by being against
dual-licensing. The goal of dual-licensing is to make it possible to
keep the code in sync between kernel and userspace, not the opposite!
Essentially, Desnoyers wants user-space applications to be able to contain
tracepoints that are based on the same code that is used now in the
kernel. Those applications may be under a variety of free or proprietary
licenses, but the tracepoints are just a static instrumentation technique
that could be shared. As Rostedt puts it:
But what I think is trying to be done here is to use the same types of
MACROS that we have in the kernel to do tracing in userspace. That a
userspace program can add their own "TRACE_EVENT" and that the headers
there will create a tracepoint for them the same way we currently do in
the kernel.
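In other words, an application might contain something like the following,
with the (permissively-licensed) header expanding the macro into a cheap
instrumentation point. This is purely illustrative; it is not UST's or the
kernel's actual API:

    /* Illustrative only.  Testing a global flag keeps the disabled
       case nearly free, as with the kernel's static tracepoints. */
    extern int tracing_active;
    extern void trace_record(const char *name, const char *fmt, ...);

    #define TRACEPOINT(name, fmt, ...)                               \
            do {                                                     \
                    if (__builtin_expect(tracing_active, 0))         \
                            trace_record(#name, fmt, ##__VA_ARGS__); \
            } while (0)

    void handle_request(int id)
    {
            TRACEPOINT(request_start, "id=%d", id);
            /* ... application work ... */
    }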
Molnar has gone quiet on the topic, as has the thread, but the idea,
overall, seems reasonable. While it does expose a kernel interface
to user space, it doesn't tie the kernel to any ABI/API for the future. If the
kernel needs to change, either the user-space libraries will change right
along with it, or there will be a fork. Given that the players involved
work on both the kernel and user-space sides of the problem, that seems
somewhat unlikely to happen, but it certainly doesn't seem like that split
need happen now.
The Linux memory management code does its best to ensure that memory will
always be available when some part of the system needs it. That effort
notwithstanding, it is still possible for a system to reach a point where
no memory is available. At that point, things can grind to a painful halt,
with the only possible solution (other than rebooting the system) being to
kill off processes until a sufficient amount of memory is freed up. That
grim task falls to the out-of-memory (OOM) killer. Anybody who has ever
had the OOM killer unleashed on a system knows that it does not always pick
the best processes to kill, so it is not surprising that making the OOM
killer smarter is a recurring theme in Linux virtual memory development.
Before looking at the latest attempt to improve the OOM killer, it is worth
mentioning that it is possible to configure a Linux system in a way which
all but guarantees that the OOM killer will never make an appearance. OOM
situations are caused by the kernel's willingness to overcommit memory. As
a general rule, processes only use a portion of the address space they have
allocated, so limiting allocations to the total amount of RAM and swap
space on the system would lead to underutilization of system memory. But
that limitation can be imposed on systems which can never be allowed to go
into an OOM state; simply set the vm.overcommit_memory sysctl knob
to 2. Individual processes are much more likely to see allocation
failures in this mode, but the system as a whole will not overcommit its
memory.
Most systems will allow overcommitted memory, though, because the
alternative is too limiting. Overcommit works almost always, but the
threat of a day when the Firefox developers add one memory leak too many
always looms. When that sad occasion comes to be, it would be nice if the
OOM killer would target that leaky Firefox process instead of, say, the X
server and PostgreSQL. Many attempts have been made to add
smarts to the OOM killer over the years; there's also a means by which the system
administrator can steer the OOM killer toward or away from specific
processes. But manual configuration is only suitable for certain,
relatively static workloads; for the rest, the OOM killer often proves less
discriminating than one would like.
The latest attempt to fix the OOM
killer comes from Hiroyuki Kamezawa. This patch makes a number of
fundamental changes to the selection of OOM victims. The result is an OOM
killer which is smarter in some ways, but which takes a somewhat different
approach to the selection of its victims.
One of the factors that the current OOM killer takes into account, naturally, is
the amount of memory being used by each process. But the measure used
(mm->total_vm) is somewhat crude: it penalizes processes using
a lot of shared memory and says little about how much physical memory the
process is using. Hiroyuki's patch tries to move away from total_vm in
most situations, looking at the actual resident set size (RSS) and possibly
taking into account the amount of swap space used as well.
Figuring in swap usage is controversial. A program which is using a lot of
swap is clearly putting pressure on memory, but, if that program has been
mostly swapped out, killing it will not immediately free much RAM. Eventually
other processes can be shifted into the newly-freed swap space, but it
might make more sense to just do away with those other processes at the
outset. Even so, Hiroyuki's patch, for now, will figure in swap space if
specific constraints do not force the use of other criteria.
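In rough terms, the revised calculation has this shape (the helper names
below are invented for illustration; only get_mm_rss() is the real kernel
accessor):

    static unsigned long oom_badness(struct task_struct *p,
                                     int lowmem_constrained)
    {
            struct mm_struct *mm = p->mm;
            /* Score by pages actually resident, not total address
               space, so shared mappings are no longer penalized. */
            unsigned long points = get_mm_rss(mm);

            if (!lowmem_constrained)
                    points += mm_swap_usage(mm);   /* hypothetical */

            return points;
    }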
One constraint which can change the calculation is when the memory shortage
is specific to low memory - the region of memory which can be directly
addressed by the kernel. When a low-memory allocation is required, nothing
else will do, so there is little value in killing processes which are not
hogging low-memory pages. With Hiroyuki's patch, the VM subsystem tracks
how much low memory each process is using as a separate statistic. If the
OOM situation is caused by an attempt to allocate low memory, the OOM
killer's "badness" function will focus on processes holding large amounts
of low memory.
The current OOM killer makes an attempt to target "fork bomb" processes by
adding half of each child's "badness" value to its parent. A process with
a lot of children will thus have a high badness and will thus come under
the OOM killer's baleful gaze sooner. The problem here, of course, is that
some processes legitimately have lots of children - the session manager
for the user's desktop environment is a good example. Killing
gnome-session is likely to free substantial amounts of memory, but
the user's gratitude may be surprisingly limited.
The patch changes the fork bomb detector significantly. The new code
counts only the child processes which have been running for less than a
specific amount of time (five minutes in the posted patch). If one process
has newborn children which make up at least 1/8 of the processes on the system,
that process is deemed to be a fork bomb; it is duly rewarded with a spot
at the top of the OOM killer's short list.
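A sketch of that test (the age helper is invented for illustration; the
list walk and nr_processes() are real kernel idioms):

    #define NEWBORN_AGE (5 * 60 * HZ)   /* "newborn" = under 5 minutes */

    static int looks_like_fork_bomb(struct task_struct *p)
    {
            struct task_struct *child;
            unsigned long newborn = 0;

            list_for_each_entry(child, &p->children, sibling)
                    if (task_age(child) < NEWBORN_AGE)  /* hypothetical */
                            newborn++;

            /* Newborn children amounting to an eighth of all processes
               in the system mark the parent as a fork bomb. */
            return newborn > nr_processes() / 8;
    }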
Finally, the current OOM killer tries to kill newly-created processes,
while allowing long-running processes to continue. Hiroyuki feels that
this approach creates a loophole for long-running processes which slowly
leak memory. That web browser may have been running for a long time and is
thus a high-value process, but it has been dropping memory on the floor for
that long time and is also the cause of the problem. So the new code
changes the calculation to look at how long it has been since the process last
expanded its virtual memory size. A process which has been running for a
long time, but which has not grown in that time, will look better than one
which has been expanding.
There seems to be little disagreement with the idea that the OOM killer
needs a rework, but not everybody is sold on this approach yet. It looks
like a very large change, which makes some people nervous. It also shifts
the focus of the OOM killer's attention in a significant way: the current
heuristics were designed to be as unsurprising to the user as possible,
while the new ones are focused more strongly on freeing RAM quickly. But,
given that the existing heuristics are still clearly producing plenty of
surprises, perhaps a more goal-oriented approach makes sense.
(Naturally, no article on the OOM killer is complete without a link to this
2004 comment from Andries Brouwer.)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet