The current development kernel is 2.6.33-rc5, released
on January 21. It
contains a number of fixes - the patch rate for 2.6.33 remains fairly high.
As of 2.6.33-rc5, there are 23
unresolved regressions (of 75 reported) in this development kernel.
Stable updates: 220.127.116.11 was released on
January 22, followed by 18.104.22.168 on January 25;
both contain a fair number of important fixes. 22.214.171.124 is in the review process as of this writing; it
contains 98 fixes, and can be expected sometime on or after January 28.
So I can work with crazy people, that's not the problem. They just
need to _sell_ their crazy stuff to me using non-crazy arguments,
and in small and well-defined pieces. When I ask for killer
features, I want them to lull me into a safe and cozy world where
the stuff they are pushing is actually useful to mainline people
In other words, every new crazy feature should be hidden in a nice
solid "Trojan Horse" gift: something that looks _obviously_ good at
-- Linus Torvalds
There is only one real sensible solution for this: Do _not_ use
kgdb - which is the modus operandi of every sane kernel developer
on the planet.
-- Thomas Gleixner
OK... lookup_instantiate_filp() is a god-awful mess, so it's OK to
be confused by it - its authors definitely had been.
-- Al Viro
Normally, a kernel which doesn't crash is considered to be a good thing.
It can be a source of true frustration, though, for those who want to see
the system go down in flames. The reliability of the system means that
somebody waiting for a crash may grow old indeed in the process.
Simon Kagstrom has heard the pain expressed by such users; in response, he
has posted a kernel module
just for people who want to be able to destroy their systems on demand.
This module creates a directory (provoke_crash) in debugfs, filled
with a number of useful files. For those with simple needs, a write to
bugon results in a straightforward BUG() call. Users with more
discriminating tastes can write to null_dereference to cause a null
pointer dereference, overwrite_allocation to write beyond a heap
allocation, or corrupt_stack to overwrite the stack. And truly
kinky users can go for oops_interrupt_context to get a null
dereference in softirq mode, write_after_free to step on freed
memory, or unaligned_load_store to perform badly-aligned memory accesses.
Needless to say, this isn't a module one would ordinarily want to leave
loaded into a production system; it's better kept in a secret place and
pulled out after the kids go to sleep. Unless, of course, you have a real
use for it; Simon has been employing it to make sure that kmsg_dump() does the
right thing in various crash scenarios. For most developers, though, work
is normally dominated by the need to avoid crashes; since they'll
have little use for this feature, it's not clear that this little module
will ever make its way into the mainline.
Linux has long had the mincore() system call, which allows an
application to determine whether a given page is in RAM or not. There is
no easy way, though, to tell whether a given page from a file is in the
page cache or not. An application can mmap() the file and use mincore()
on it, but that can be slow. So Chris Frost has proposed a new fincore()
system call to handle this task:
int fincore(int fd, loff_t start, loff_t len, unsigned char *vec);
A call to fincore() will look at the pages of the file associated
with fd in the range indicated by start and
len. For each page of the file, one byte of vec will be
set to a non-zero value if that page is in memory. Naturally, this answer
is an approximation - the situation can change while the system call is running.
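The existing mmap()-plus-mincore() approach can be sketched roughly as follows. This is only an illustration of the current workaround, not of the proposed fincore() call; the function name count_resident_pages() is made up for this example, and the result is just as much of a snapshot as anything fincore() would return.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes mincore() in <sys/mman.h> */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file behind fd are
 * currently resident in memory. The total number of pages is stored in
 * *total_pages. Returns the resident count, or -1 on error (an empty
 * file counts as an error, since it cannot be mapped).
 */
long count_resident_pages(int fd, long *total_pages)
{
    struct stat st;
    size_t page_size, length, npages, i;
    unsigned char *vec;
    void *map;
    long resident = -1;

    assert(total_pages != NULL);
    if (fstat(fd, &st) < 0 || st.st_size <= 0)
        return -1;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    length = (size_t)st.st_size;
    npages = (length + page_size - 1) / page_size;

    /* Mapping the file does not, by itself, read it into memory. */
    map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    vec = malloc(npages);
    if (vec != NULL && mincore(map, length, vec) == 0) {
        resident = 0;
        for (i = 0; i < npages; i++)
            resident += vec[i] & 1;  /* low bit set: page is resident */
    }

    free(vec);
    munmap(map, length);
    *total_pages = (long)npages;
    return resident;
}
```

The mapping has to be set up and torn down around every query, which is part of why this route can be slow; a call like fincore() would take the file descriptor directly.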
That, however, can be good enough for Chris's needs. His objective is to
speed up applications which perform large numbers of non-sequential file
reads. The traditional readahead code deals poorly with this kind of
application, since the access pattern cannot be predicted ahead of time.
But the application often does know about a sequence of reads in
advance; if the kernel could be told to pull in those pages ahead of time,
it could order the I/O operations optimally and make the whole thing go
faster. When doing this for sqlite and the GIMP, Chris reports significant performance improvements.
The fadvise() system call can be used to request prefetching of
file data. But there's a problem: it's hard for a prefetch library to know
how much system memory is available. If too little data is prefetched, the
performance gains will not be what they could be. Prefetching too much
data, however, can lead to thrashing. Hence the fincore() system
call: if prefetched pages are no longer present by the time the application
gets around to using them, the library knows that it is asking for too much
and can back off.
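A prefetch library's back-off logic might look something like the sketch below. The prefetch itself uses the real posix_fadvise() interface with POSIX_FADV_WILLNEED; everything else is an assumption made for illustration. In particular, adjust_window() and its 90% threshold are invented here and are not taken from libprefetch, and the residency counts it consumes would have to come from something like fincore() or mincore().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Ask the kernel to start reading a byte range of fd into the page cache.
 * posix_fadvise() returns 0 on success or an error number (it does not
 * set errno and return -1).
 */
int prefetch_range(int fd, off_t start, off_t len)
{
    return posix_fadvise(fd, start, len, POSIX_FADV_WILLNEED);
}

/*
 * Hypothetical policy: given how many of the pages prefetched last time
 * were still resident when the application got to them, pick the next
 * prefetch window (in pages). If fewer than 90% survived, assume memory
 * is tight and halve the window; otherwise double it. The result is
 * clamped to [min_window, max_window].
 */
size_t adjust_window(size_t window, size_t resident, size_t prefetched,
                     size_t min_window, size_t max_window)
{
    assert(min_window > 0 && min_window <= max_window);

    if (prefetched > 0) {
        if (resident * 10 < prefetched * 9)
            window /= 2;   /* pages were evicted before use: back off */
        else
            window *= 2;   /* everything survived: try prefetching more */
    }

    if (window < min_window)
        window = min_window;
    if (window > max_window)
        window = max_window;
    return window;
}
```

The real policy in any given library is likely to be more subtle; the point is only that a residency check gives the library a feedback signal it otherwise lacks.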
Andrew Morton likes the patch:
I must say, the syscall appeals to my inner geek. Lot of
applications are leaving a lot of time on the floor due to bad disk
access patterns. A really smart library which uses this facility
could help all over the place.
Jamie Lokier, though, wondered if it might
not be a better idea to find a way to inform applications more directly
that their pages are being evicted prior to use.
This is the first posting for this system call, so it has not gotten a lot
of attention yet; more discussion will certainly be necessary before it
could be merged. In the mean time, the libprefetch site has more
information on this whole project.
Kernel development news
Those of you who have come to appreciate Dave Airlie's kitten-filled
presentations might just have been dissatisfied with his linux.conf.au
talk, which was called "So you moved graphics drivers to the kernel.. what
next? I can haz ponies?" But ponies, too, can be cute, and the update on
the state of graphics drivers in the kernel was well worth the listening.
It has now been about a year since kernel mode setting (KMS) was merged
into the mainline kernel. KMS ends the "mess" which came from having
graphics drivers in user space; digging out of that particular hole took a
good seven years or so. But now our graphics drivers are in the kernel,
just like most other drivers.
Beyond cleaning up the mess, there are a few other good reasons for merging
KMS. One is that the system is now able to make full use of the
power-saving features of the hardware; before KMS, the kernel never really
knew enough about what was going on with the hardware to do this. The
Intel drivers can now perform as well as Windows with regard to power
saving; the ATI drivers, instead, are not quite there yet. Another nice
feature is the ability to use a kernel debugger on a system with graphics
running; it's now possible to trap into the debugger, then return to a
running system and have everything just work.
One of the reasons why KMS took so long to merge is that it places a number
of new requirements on the kernel. At the top of the list is a proper
manager for graphical memory. That's a hard problem, one that the graphics
developers always intended to get to sometime Real Soon Now. Eventually
the TTM developers got to it,
but they quickly ran into a number of API difficulties. After some effort,
the Intel developers
decided that a generic approach to the memory management API wasn't going
to work; out of that realization came the GEM memory manager, which only
tried to solve the Intel problem.
Developers working on ATI chipsets, in turn, soon realized that GEM did not
have the capabilities that they needed. So they went back to TTM, but not
before bolting something that looks a lot like the GEM API onto it. TTM
was recently merged, making KMS possible for ATI chipsets as well.
So what is coming? One future feature is the Gallium 3D
architecture. Gallium, says Dave, is starting to work, but full
functionality will take a while yet. Moving drivers to Gallium is going to
be a painful exercise; there are already plenty of APIs that these drivers
need to support.
DRI2 is also coming along. This
feature really needed KMS to work properly, especially when compositing is
being used. There are still performance issues which must be resolved.
Another thing to look forward to is the Wayland
display server. Wayland can be seen as a simpler, smaller replacement for
X built on KMS. It can run GTK and GL
applications now; there is also an X server emulator which can run on top
of it. A few difficulties remain, including the fundamental fact that
Wayland is not X; since X is the standard in this area, alternatives are
going to be hard to sell. The Wayland
developers also have not yet really dealt with the input problem, but input
is a big piece of the X code. So Wayland, too, will be a while in coming;
it may find its way into embedded situations first.
Dave spent some time on the current state of the graphics drivers.
Intel, he says, is currently in the leading position. It supports KMS
for everything - well, almost everything; the "chipset we won't name" (the
proprietary GMA500) still lacks support. The driver is feature-complete,
but Dave isn't quite ready to call it "mature"; another release or two will
be required first. As discussed
here previously, the driver will need to retain user mode setting (UMS)
support for some time, but the current upstream X.org sources have already
removed UMS from the X server.
The ATI/AMD drivers are further behind, but getting closer; this
driver is harder than the Intel driver, due to the large number of chipset
variations. Chipsets from R100 to R700 are currently supported; R800
support can be expected within a few weeks. The driver works "nearly as
well as the old stuff" at this point; suspend and resume work better than
before. Support for power-saving features is missing but expected for
2.6.34. The Radeon driver is currently in the staging tree, but it might
move out before the end of the 2.6.33 development cycle.
What about the RadeonHD driver? That fork of the driver is primarily the
result of a disagreement over the use of ATI's BIOS tables; the Radeon
driver has an interpreter for these tables, while RadeonHD reimplements the
functionality that those tables provide. Using the BIOS tables makes life
a lot easier; it lets the driver ignore a lot of the details associated
with different chipset variations. The BIOS table code is part of the KMS
implementation which has been merged into the mainline; that should, Dave
thinks, resolve this disagreement.
The "pony" displayed for the Nouveau discussion was a Trojan horse.
Nouveau, of course, was merged
for 2.6.33. The driver has just lost its
user-mode support; it will be KMS only. Chipsets from the NV4 through the
G80 are supported, with the final pieces to be filled in soon. The
"ctxprogs" firmware is being figured out; the NV40 version has already been
replaced with a rewritten, freely-licensed equivalent and NV50 is in the
works. Dave noted that, whatever one thinks about NVIDIA's approach to
working with the community, its hardware tends to be relatively good and
easy to work with.
When Dave was asked about support for non-Linux systems, he replied that
most of them have been left behind at this point. There is, apparently, an
OpenSolaris port being done within Sun, but no code has been released from
that group. One other audience member asked about running X without root
privileges: that does work now, and Moblin is doing it. There are some
problems remaining, though, especially with fast user switching. In the
absence of a revoke() system call, there's no way to guarantee
that one user isn't listening in on another. Since revoke() is
known to be a hard problem, it's not clear how this issue will be resolved.
The utrace tracing framework has had a tortuous path towards the mainline,
but it always seemed like it was headed that direction. Over the past week
or so, things have gotten rather murkier for the mainline inclusion of
utrace. Linus Torvalds made a pronouncement that would seem to
leave SystemTap without a future in the mainline—something that many
had suspected for a while—but also put the future of utrace in
doubt. Further discussion may have provided a way forward, but,
at least in its current form, mainline utrace seems very unlikely.
The discussion resulted from a request by
Frank Ch. Eigler to include utrace into linux-next. That led to a
discussion about whether it was ready for linux-next—because it was
likely to be merged in the next release cycle—or whether it should spend
some time in another tree. Since an earlier version of utrace
was in Andrew Morton's -mm tree, that was a potential path. Morton said
that utrace "didn't break anything", but:
I still don't think I've seen a really compelling reason for merging
it. At least, I wouldn't be able to explain why we did it. But
presumably there _are_ such reasons, because it was a lot of development
Someone please sell this to us.
Morton also dredged up a response he had
gotten from Oleg Nesterov the last time he asked, which listed various
potential uses for utrace. In-kernel uses for utrace are
important—new features are rarely merged without one—and
an earlier utrace merge attempt ran into
opposition because it lacked one. This time around,
Nesterov and Roland McGrath included a rewrite of the ptrace()
system call using utrace as part of the patch submission. It was hoped
that rewriting the notoriously ugly ptrace() code using the
cleaner utrace API would be the last hurdle for inclusion into the mainline.
But, replacing the guts of the ptrace() call, even though it may
clean things up, is controversial. ptrace() is part of the kernel
ABI that must be maintained—ugly or not—but cleaning it up is
not without its risks, as Morton points out:
ptrace is a nasty, complex part of the kernel which has a long history
of problems, but it's all been pretty quiet in there for the past few
years. This leads one to expect that a rip-out-n-rewrite is a
high-risk prospect. So, quite reasonably, one looks for a good reason
for taking such risk.
The risk is small, though, according to
Eigler, because "this code has been deployed in fedora
and rhel for several *years*, with millions of users. It's not some
rickety experiment." Eigler also added to Nesterov's list
of utrace uses as SystemTap's user-space probing is based on utrace. But
SystemTap and one of the other potential uses on that list, namely
reworking seccomp to use utrace, are what set Torvalds off:
So if things like system tap and "security models that go behind the
kernel by tying into utrace" are the reasons for utrace, color me utterly
uninterested. In fact, color me actively hostile. I think that's the worst
possible situation that we'd ever be in as kernel people (namely exactly
the "do things in kernel space by hiding behind utrace without having
kernel people involved")
Torvalds's complaint stems from the fact that utrace provides no user-space
interface at all. It is purely an internal kernel API that is meant to be
used by kernel code like the ptrace() rewrite, but also for kernel
modules, which is part of what worries Torvalds. It provides lots of hooks
that can be used by "random crazy out-of-tree crap", but
doesn't provide any benefit to user space at all, he said:
If somebody were to argue that "this is a simple series of patches to
clean up ptrace and make it possible to strace a debugged process", then
that would have been different. That's not what you or others have been
doing. You've been pushing exactly the _reverse_ of that, namely how great
it is for some random totally new features that I'm convinced aren't even
used by a lot of people.
One of the biggest problems with ptrace() is its signal-oriented
interface. Programs using ptrace() act as the parent process of
the tracee and must use wait() to detect state changes. For that
reason, there can only be one ptrace() active for a
particular process. So an strace of a program that is being
debugged with gdb will not succeed. The ptrace()
implementation using utrace would change that, but not directly, as there
would still need to be a kernel piece that attached another utrace engine.
An in-kernel gdb stub built on utrace, floated as an RFC back in November,
could provide that kernel
piece, but was met with a fair amount of resistance when it was proposed.
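The one-tracer limitation is easy to observe from user space. The sketch below is an illustration written for this purpose, not code from the discussion: one process traces a child with PTRACE_TRACEME, and then a second process tries to attach to the same child with PTRACE_ATTACH. The function name second_tracer_errno() is invented. On systems with additional ptrace restrictions (the Yama security module, for example) the second attach may be refused for other reasons as well, but the error reported is typically EPERM either way.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the errno seen by a second would-be tracer attaching to an
 * already-traced process, 0 if the attach unexpectedly succeeded, or
 * -1 if the experiment could not be set up.
 */
int second_tracer_errno(void)
{
    int status, result = -1;
    pid_t tracee, second;

    tracee = fork();
    if (tracee < 0)
        return -1;
    if (tracee == 0) {
        /* Tracee: ask to be traced by the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    /* First tracer: the stop is reported through waitpid(). */
    if (waitpid(tracee, &status, 0) != tracee || !WIFSTOPPED(status)) {
        kill(tracee, SIGKILL);
        waitpid(tracee, NULL, 0);
        return -1;
    }

    second = fork();
    if (second == 0) {
        /* Second tracer: try to attach to the same process. */
        long rc = ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
        _exit(rc < 0 ? errno : 0);
    }
    if (second > 0 && waitpid(second, &status, 0) == second &&
        WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Clean up the stopped tracee. */
    kill(tracee, SIGKILL);
    waitpid(tracee, NULL, 0);

    assert(result >= -1 && result <= 255);
    return result;
}
```

This is the situation an strace user runs into when the target is already under gdb: the second attach simply fails.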
The limitation that ptrace() imposes is seen as something that
could, perhaps should, be lifted, but adding a relatively large,
kernel-only API to do that
is excessive. As Torvalds puts it:
Maybe somebody would be interested in trying to take the utrace
improvements, and scaling down what they promise, and ignoring all input
except for "I want to strace and gdb at the same time".
So stop the crazy "new kernel interfaces" crap. Stop the crazy "maybe we
can use it for ftrace and generic user event tracing too". Stop the crazy.
The elephant in the room, of course, is SystemTap. It creates, builds, and
loads kernel modules for doing its tracing, and uses utrace for the user-space
tracing. That model is not popular with most kernel
developers, especially for an out-of-tree solution—the APIs that it
relies on are far too volatile. SystemTap must be
updated when those interfaces change, and all of the previous versions
must be maintained so that SystemTap can still be used with older kernels.
Because of that, SystemTap may be out-of-sync with development kernels, which
makes its utility for kernel hackers quite small.
The utrace proponents are pushing it as something useful in its own right,
completely separate from its use in SystemTap, but one gets the sense that
many of the kernel developers aren't quite buying that. Ted Ts'o tries to
explain his concerns to Eigler:
[...] export a syscall (which is an ABI that we are willing to promise will
be stable), but rather a set of kernel API's (which we never promise
to be stable), and the fact that there will be out-of-tree programs
that are going to be trying to depend on that interface (much like
Systemtap does today when it creates kernel modules) [...]
He goes on to compare the situation to that of the NVIDIA graphics drivers,
prompting Kyle Moffett to propose a variation on Godwin's
law: "As an LKML discussion grows longer, the probability of an unfavorable
comparison involving nVidia or Microsoft approaches 1." More to the
point, though, Moffett said he was uninterested in SystemTap:
[...] in things like the ability to stack gdb with strace, the RFC gdb-stub
posted a week ago, etc. None of those abilities would be out-of-tree
modules at all [...]
Ts'o sees those features as potentially
useful, but points out that they should be submitted with utrace for
review. It may be that utrace in its present form does not survive that review:
So what should be reviewed is utrace *plus* these other
userland interfaces, which may get critiqued and improved, and utrace
patches can be reviewed in light of these new features. But be
warned.... if it turns out that only 30% of utrace is only needed to
support gdb stacking with strace, etc., the other 70% will likely get
ejected and the utrace patches streamlined to support these in-tree
Without an in-tree "killer feature" that only utrace can provide, there is
going to be resistance to merging such an easily-abused API. Several
suggestions were made—notably by Torvalds and Ingo Molnar—to
enhance ptrace() itself to support some new features (such as
multiple active calls or the ability to read/write more than a word at a
time between the two processes), but that would mean scrapping much or all
of the utrace work. Nesterov and McGrath, who are the ptrace()
maintainers, have been largely silent
throughout the discussion, but, previously, they have made it clear that they
would much rather work with the utrace-based ptrace()
implementation. So it is unclear when or if enhancements to the current
code might happen.
Without utrace, SystemTap will have to find other ways to hook user space,
but that doesn't really faze the kernel developers—particularly after
Torvalds's unequivocal rejection of that approach—as there are other
tracing solutions in the pipeline. Ftrace and perf events are slowly
building capabilities, and are doing so in-tree. They are likely to grow
the needed features to support kernel and user-space tracing a la
SystemTap (and DTrace). Molnar specifically invites the SystemTap developers to help:
Also, if any systemtap person is interested in helping us create a more
generic filter engine out of the current ftrace filter engine (which is really
a precursor of a safe, sandboxed in-kernel script engine), that would be
excellent as well. Right now we support simple C-syntax expressions like:
perf record -R -f -e irq:irq_handler_entry --filter 'irq==18 || irq==19'
More could be done - a simple C-like set of function perhaps - some minimal
per probe local variable state, etc. (perhaps even looping as well, with a
limit on number of [predicate] executions per filter invocation.)
It is unfortunate, in many ways, that SystemTap has gotten to this point.
While it is possible that Torvalds could change his mind, he and other
kernel developers find the new tracing
features to be "a million times superior" to SystemTap. That
could leave Red Hat holding the SystemTap bag
for quite some time to come, as it will need to support it for existing,
and likely future,
RHEL versions. It is interesting to note that this alternate solution,
based on Ftrace, etc., is also largely coming out of Red Hat.
It seems possible that utrace will be a casualty here as well. By
incorporating features that were needed for SystemTap, and not providing a
user-space interface, it tried to both do too much and too little. There
are some potential ways forward, but it's unclear whether they
will be pursued. Torvalds points
to the realtime tree as an example of how to get "crazy" things merged:
Yeah, it's taken them years, and they still have out-of-tree stuff. And
yeah, they had to change some things to make them more palatable to the
mainline kernel - the whole fundamental raw spinlock change is just the
most recent example of that.
But on the whole, I think it's actually worked out pretty well for them. I
think the mainline kernel has improved in the process, but I also suspect
that _their_ RT patches have also improved thanks to having to make the
work more palatable to people like me who don't care all that deeply about
their particular flavor of crazy.
There are definitely lessons here, but the standard ones don't seem to
apply. SystemTap and utrace were developed in the open, as free software
from the outset, and were fairly often discussed on linux-kernel.
SystemTap in particular was regularly criticized, to seemingly no
avail. The biggest lesson—and the hardest to learn, especially after
a feature has shipped—may be that
ignoring the advice and complaints of the kernel developers is likely to
come back and bite in the end. It is not terribly surprising, really, but
that seems to be what is happening here.
Much of the POSIX system call interface is known for the elegance and
simplicity of its design; that is what has enabled this API to endure and
thrive for decades. The ptrace()
system call has no such
reputation. One of the many motivations behind the development of the
utrace layer (see the preceding article) was first to clean up the
implementation of ptrace(), but then
to enable it to be replaced entirely. Subsequent discussion shows that
this is a distant hope, though, and that we will be stuck with ptrace()
for a long time.
The purpose of ptrace() is to allow one process to monitor and
modify the state of another. It exists to support interactive debuggers
and related utilities like strace, but other users exist as well.
User-mode Linux uses ptrace() for its internal management, and
there are various sandboxing schemes which use it. In general,
users are able to get ptrace() to do what they want, but they
rarely come away pleased with the experience.
What are the problems with ptrace()? Whenever system calls have
to work with extended state within the kernel, the preferred mechanism for
referring to that state in user space is the file descriptor. With file
descriptors, many of the existing system calls do natural things, and
well-defined mechanisms exist for event multiplexing. But
ptrace() doesn't use file descriptors; it depends, instead, on a
rather more arcane mechanism. A process to be traced is removed from its
normal place in the process tree; the process doing the tracing becomes its
new parent. In other words, ptrace() sets up a sort of temporary
foster home for children under scrutiny. The new parent can then learn
about events in the child through the wait() system call.
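A minimal sketch of that model, written for illustration only, looks like this: the child asks to be traced, and the tracing parent learns about every event by blocking in waitpid() rather than by reading a file descriptor. The function name run_traced_child() is made up; the ptrace() requests and wait macros are the standard Linux ones.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that is traced by the caller, let it stop once, resume
 * it, and wait for it to exit with exit_code. Returns the number of
 * stops the tracer observed, or -1 if anything went wrong.
 */
int run_traced_child(int exit_code)
{
    int status, stops = 0;
    pid_t child;

    assert(exit_code >= 0 && exit_code <= 255);

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: make the parent our tracer, then stop for it. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        raise(SIGSTOP);
        _exit(exit_code);
    }

    /* Tracer: every event arrives as a wait status. */
    for (;;) {
        if (waitpid(child, &status, 0) != child)
            return -1;
        if (WIFEXITED(status))
            return WEXITSTATUS(status) == exit_code ? stops : -1;
        if (WIFSIGNALED(status))
            return -1;
        if (WIFSTOPPED(status)) {
            stops++;
            /* Resume the child without delivering the stop signal. */
            if (ptrace(PTRACE_CONT, child, NULL, NULL) < 0) {
                kill(child, SIGKILL);
                waitpid(child, NULL, 0);
                return -1;
            }
        }
    }
}
```

Note that the loop blocks in waitpid(); there is no descriptor to hand to poll() or select(), so a debugger with its own event loop has to arrange for SIGCHLD handling or a helper thread to notice tracee events.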
This API is hard to fit into normal application event loops. It also
implies that any given process can be traced by only one other process at
any given time. This may not seem like a problem - how often does one want
to run two debuggers on a process? - but it does get in the way.
Developers working on debugging tools and users wanting to trace a
sandboxed process are two types of users who cannot do what they want with
ptrace(). It is also defined as a complex, multiplexer call (see
the man page for details) which is hard to understand and hard to use.
Finally, ptrace() is hard to implement correctly and consistently.
As a result, there has been a long history of obnoxious bugs associated
with it, and user-space code which uses ptrace() tends to become
encrusted with non-portable workarounds. It is, in
summary, not surprising that there is interest in creating a replacement.
Oleg Nesterov expressed things succinctly:
"I must admit that personally I think the current ptrace api is
unfixable, we need the new one in the long term."
Getting to the new one could be hard, though. The first problem is that
ptrace() is a standard function which is part of the kernel ABI.
As long as users exist, it really cannot be removed from the kernel. So a
ptrace() replacement will not improve life for the kernel
development community anytime in the near future; indeed, it will make it
harder, since there will be two tracing interfaces to support instead of
one. Duplicating functionality in this way can be done when the need is
strong enough, but it's not something that the community will rush into
without a great deal of thought.
Maintaining ptrace() as a compatibility interface might be
acceptable if it were clearly a temporary thing with a clear possibility of
removal in the future, and if there were clear advantages of doing so. But
it's not entirely clear where the advantages are. For example, Kyle
Moffett suggested:
The killer app for this will be the ability to delete thousands of
lines of code from GDB, strace, and all the various other tools
that have to painfully work around the major interface gotchas of
ptrace(), while at the same time making their handling of complex
processes much more robust.
There are a couple of related problems with this idea, starting with the
fact that tools like GDB don't just run on Linux systems with shiny new
kernels. They need to work on older kernels indefinitely, not to mention on
all those other platforms which lack the good taste to implement every new
system call created for Linux. So those "thousands of lines" (and it
really is that much code) will not be going anywhere; the GDB developers
will have to maintain them forever - or something fairly close to that.
So for GDB, too, a new tracing API would represent an increase in the
maintenance load - if they use it. But the fact of the matter is that
special, Linux-only interfaces tend to have very limited uptake. As expressed by Ingo Molnar:
Special Linux system calls have a checkered past, they tend to
not be used by much anything, and thus they tend to be a breeding
ground of both bugs, maintenance complexity and security
problems. Lack of attention is never good.
That said, Tom Tromey has indicated that
GDB might use a new API if there were clear advantages to doing so:
Nevertheless, if the Linux kernel were to present a new user-space
API, and if it had an advantage over ptrace, then we would port GDB
to use it. There are other platforms where, IIRC, we now use some
/proc thing instead of ptrace.
Tom goes on to list a few features that he would like to see in a
replacement for ptrace(). That highlights one final obstacle to
any kind of new API: no such thing has been implemented or even specified
by anybody. The creation of a new system call - especially for a task as
complicated as tracing - is not an easy thing to do. Without a great deal
of care, we risk creating yet another substandard API with its own warts
which must be maintained forever. So a proposed
replacement would have to get through an extended process of criticism,
argument, and opposition, and it would have to demonstrate some real users
- a GDB port, for example. That, alone, ensures that any ptrace()
replacement will be years away.
So it's not surprising that justifying utrace as a means to replace
ptrace() is not working very well, and it's not surprising that
developers are talking about possible ways of extending ptrace()
instead. Playing with the ptrace() API is not without its risks -
code which uses it tends to be a bit of a house of cards which can be
broken by subtle changes in semantics. But it may still be an easier route
to moderately more sane and usable tracing in the relatively near future.
Page editor: Jonathan Corbet