Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.33-rc5, released on January 21. It contains a number of fixes - the patch rate for 2.6.33 remains fairly high. As of 2.6.33-rc5, there are 23 unresolved regressions (of 75 reported) in this development kernel.
Stable updates: 2.6.32.5 was released on January 22, followed by 2.6.32.6 on January 25; both contain a fair number of important fixes. 2.6.32.7 is in the review process as of this writing; it contains 98 fixes, and can be expected sometime on or after January 28.
Quotes of the week
In other words, every new crazy feature should be hidden in a nice solid "Trojan Horse" gift: something that looks _obviously_ good at first sight.
A module for crashing the kernel
Normally, a kernel which doesn't crash is considered to be a good thing. It can be a source of true frustration, though, for those who want to see the system go down in flames. The reliability of the system means that somebody waiting for a crash may grow old indeed in the process.
Simon Kagstrom has heard the pain expressed by such users; in response, he has posted a kernel module just for people who want to be able to destroy their systems on demand. This module creates a directory (provoke_crash) in debugfs, filled with a number of useful files. For those with simple needs, a write to bugon results in a straightforward BUG() call. Users with more discriminating tastes can write to null_dereference to cause a null pointer dereference, overwrite_allocation to write beyond a heap allocation, or corrupt_stack to overwrite the stack. And truly kinky users can go for oops_interrupt_context to get a null dereference in softirq mode, write_after_free to step on freed memory, or unaligned_load_store to perform badly-aligned memory operations.
Needless to say, this isn't a module one would ordinarily want to leave loaded into a production system; it's better kept in a secret place and pulled out after the kids go to sleep. Unless, of course, you have a real use for it; Simon has been employing it to make sure that kmsg_dump() does the right thing in various crash scenarios. For most developers, though, work is normally dominated by the need to avoid crashes; since they'll have little use for this feature, it's not clear that this little module will ever make its way into the mainline.
fincore()
Linux has long had the mincore() system call which allows an application to determine whether a given page is in RAM or not. There is no easy way, though, to tell whether a given page from a file is in the page cache or not. An application can mmap() the file and use mincore() on it, but that can be slow. So Chris Frost has proposed a new fincore() system call to handle this task:
int fincore(int fd, loff_t start, loff_t len, unsigned char *vec);
A call to fincore() will look at the pages of the file associated with fd in the range indicated by start and len. For each page of the file, one byte of vec will be set to a non-zero value if that page is in memory. Naturally, this answer is an approximation - the situation can change while the system call is running.
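For comparison, the existing mmap() plus mincore() approach can be sketched as follows. This is only an illustration of that technique, not code from Chris's patch; the function name resident_pages() is made up for the example, and the answer it gives is just as approximate as the one fincore() would return.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Count the pages of the file at `path` that mincore() reports as resident.
 * Returns -1 on error. On success, *total_pages (if non-NULL) receives the
 * number of pages covered by the file. resident_pages() is a hypothetical
 * helper name used only for this sketch.
 */
long resident_pages(const char *path, long *total_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t length = (size_t)st.st_size;
    size_t npages = (length + (size_t)page - 1) / (size_t)page;

    /* The mapping stays valid after the descriptor is closed. */
    void *map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    long count = -1;
    unsigned char *vec = malloc(npages);
    if (vec != NULL && mincore(map, length, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1)     /* low bit set: page reported as resident */
                count++;
    }

    free(vec);
    munmap(map, length);
    if (count >= 0 && total_pages != NULL)
        *total_pages = (long)npages;
    return count;
}
```

The cost the article alludes to is visible here: every query has to set up and tear down a mapping of the file before mincore() can be asked anything.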
That, however, can be good enough for Chris's needs. His objective is to speed up applications which perform large numbers of non-sequential file reads. The traditional readahead code deals poorly with this kind of application, since the access pattern cannot be predicted ahead of time. But the application often does know about a sequence of reads in advance; if the kernel could be told to pull in those pages ahead of time, it could order the I/O operations optimally and make the whole thing go faster. When doing this for sqlite and the GIMP, Chris reports significant speedups.
The fadvise() system call can be used to request prefetching of file data. But there's a problem: it's hard for a prefetch library to know how much system memory is available. If too little data is prefetched, the performance gains will not be what they could be. Prefetching too much data, however, can lead to thrashing. Hence the fincore() system call: if prefetched pages are no longer present by the time the application gets around to using them, the library knows that it is asking for too much and can back off.
Andrew Morton likes the patch:
Jamie Lokier, though, wondered if it might not be a better idea to find a way to inform applications more directly that their pages are being evicted prior to use.
This is the first posting for this system call, so it has not gotten a lot of attention yet; more discussion will certainly be necessary before it could be merged. In the meantime, the libprefetch site has more information on this whole project.
Kernel development news
LCA: Graphics driver ponies
Those of you who have come to appreciate Dave Airlie's kitten-filled presentations might just have been dissatisfied with his linux.conf.au talk, which was called "So you moved graphics drivers to the kernel.. what next? I can haz ponies?" But ponies, too, can be cute, and the update on the state of graphics drivers in the kernel was well worth the listening.
It has now been about a year since kernel mode setting (KMS) was merged into the mainline kernel. KMS ends the "mess" which came from having graphics drivers in user space; digging out of that particular hole took a good seven years or so. But now our graphics drivers are in the kernel, just like most other drivers.
Beyond cleaning up the mess, there are a few other good reasons for merging
KMS. One is that the system is now able to make full use of the
power-saving features of the hardware; before KMS, the kernel never really
knew enough about what was going on with the hardware to do this. The
Intel drivers can now perform as well as Windows with regard to power
saving; the ATI drivers, however, are not quite there yet. Another nice
feature is the ability to use a kernel debugger on a system with graphics
running; it's now possible to trap into the debugger, then return to a
running system and have everything just work.
One of the reasons why KMS took so long to merge is that it places a number of new requirements on the kernel. At the top of the list is a proper manager for graphical memory. That's a hard problem, one that the graphics developers always intended to get to sometime Real Soon Now. Eventually the TTM developers got to it, but they quickly ran into a number of API difficulties. After some effort, the Intel developers decided that a generic approach to the memory management API wasn't going to work; out of that realization came the GEM memory manager, which only tried to solve the Intel problem.
Developers working on ATI chipsets, in turn, soon realized that GEM did not have the capabilities that they needed. So they went back to TTM, but not before bolting something that looks a lot like the GEM API onto it. TTM was recently merged, making KMS possible for ATI chipsets as well.
So what is coming? One future feature is the Gallium 3D architecture. Gallium, says Dave, is starting to work, but full functionality will take a while yet. Moving drivers to Gallium is going to be a painful exercise; there are already plenty of APIs that these drivers need to support. DRI2 is also coming along. This feature really needed KMS to work properly, especially when compositing is being used. There are still performance issues which must be resolved, though.
Another thing to look forward to is the Wayland display server. Wayland can be seen as a simpler, smaller replacement for X built on KMS. It can run GTK and GL applications now; there is also an X server emulator which can run on top of it. A few difficulties remain, including the fundamental fact that Wayland is not X; since X is the standard in this area, alternatives are going to be hard to sell. The Wayland developers also have not yet really dealt with the input problem, but input is a big piece of the X code. So Wayland, too, will be a while in coming; it may find its way into embedded situations first.
Dave spent some time on the current state of the graphics drivers. Intel, he says, is currently in the leading position. It supports KMS for everything - well, almost everything; the "chipset we won't name" (the proprietary GMA500) still lacks support. The driver is feature-complete, but Dave isn't quite ready to call it "mature"; another release or two will be required first. As discussed here previously, the driver will need to retain user mode setting (UMS) support for some time, but the current upstream X.org sources have already removed UMS from the X server.
The ATI/AMD drivers are further behind, but getting closer; this driver is harder than the Intel driver, due to the large number of chipset variations. Chipsets from R100 to R700 are currently supported; R800 support can be expected within a few weeks. The driver works "nearly as well as the old stuff" at this point; suspend and resume work better than before. Support for power-saving features is missing but expected for 2.6.34. The Radeon driver is currently in the staging tree, but it might move out before the end of the 2.6.33 development cycle.
What about the RadeonHD driver? That fork of the driver is primarily the result of a disagreement over the use of ATI's BIOS tables; the Radeon driver has an interpreter for these tables, while RadeonHD reimplements the functionality that those tables provide. Using the BIOS tables makes life a lot easier; it lets the driver ignore a lot of the details associated with different chipset variations. The BIOS table code is part of the KMS implementation which has been merged into the mainline; that should, Dave thinks, resolve this disagreement.
The "pony" displayed for the Nouveau discussion was a Trojan horse. Nouveau, of course, was merged for 2.6.33. The driver has just lost its user-mode support; it will be KMS only. Chipsets from the NV4 through the G80 are supported, with the final pieces to be filled in soon. The "ctxprogs" firmware is being figured out; the NV40 version has already been replaced with a rewritten, freely-licensed equivalent and NV50 is in the works. Dave noted that, whatever one thinks about NVIDIA's approach to working with the community, its hardware tends to be relatively good and easy to work with.
When Dave was asked about support for non-Linux systems, he replied that most of them have been left behind at this point. There is, apparently, an OpenSolaris port being done within Sun, but no code has been released from that group. One other audience member asked about running X without root privileges: that does work now, and Moblin is doing it. There are some problems remaining, though, especially with fast user switching. In the absence of a revoke() system call, there's no way to guarantee that one user isn't listening in on another. Since revoke() is known to be a hard problem, it's not clear how this issue will be resolved.
Back to the drawing board for utrace?
The utrace tracing framework has had a tortuous path towards the mainline, but it always seemed like it was headed in that direction. Over the past week or so, things have gotten rather murkier for the mainline inclusion of utrace. Linus Torvalds made a pronouncement that would seem to leave SystemTap without a future in the mainline (something that many had suspected for a while), but also put the future of utrace in doubt. Further discussion may have provided a way forward, but, at least in its current form, mainline utrace seems very unlikely.
The discussion resulted from a request by
Frank Ch. Eigler to include utrace into linux-next. That led to a
discussion about whether it was ready for linux-next—because it was
likely to be merged in the next release cycle—or whether it should spend
some time in another tree. Since an earlier version of utrace
was in Andrew Morton's -mm tree, that was a potential path. Morton said
that utrace "didn't break anything", but:
Someone please sell this to us.
Morton also dredged up a response he had gotten from Oleg Nesterov the last time he asked, which listed various potential uses for utrace. In-kernel uses for utrace are important—new features are rarely merged without one—and an earlier utrace merge attempt ran into opposition because it lacked one. This time around, Nesterov and Roland McGrath included a rewrite of the ptrace() system call using utrace as part of the patch submission. It was hoped that rewriting the notoriously ugly ptrace() code using the cleaner utrace API would be the last hurdle for inclusion into the mainline.
But replacing the guts of the ptrace() call, even though it may clean things up, is controversial. ptrace() is part of the kernel ABI and must be maintained, ugly or not, and cleaning it up is not without its risks, as Morton points out:
The risk is small, though, according to
Eigler, because "this code has been deployed in fedora
and rhel for several *years*, with millions of users. It's not some
rickety experiment." Eigler also added to Nesterov's list
of utrace uses as SystemTap's user-space probing is based on utrace. But
SystemTap and one of the other potential uses on that list, namely
reworking seccomp to use utrace, are what set
Torvalds off:
Torvalds's complaint stems from the fact that utrace provides no user-space
interface at all. It is purely an internal kernel API that is meant to be
used by kernel code like the ptrace() rewrite, but also for kernel
modules, which is part of what worries Torvalds. It provides lots of hooks
that can be used by "random crazy out-of-tree crap", but
doesn't provide any benefit to user space at all, he said:
One of the biggest problems with ptrace() is its signal-oriented interface. Programs using ptrace() act as the parent process of the tracee and must use wait() to detect state changes. For that reason, there can only be one ptrace() active for a particular process. So an strace of a program that is being debugged with gdb will not succeed. The ptrace() implementation using utrace would change that, but not directly, as there would still need to be a kernel piece that attached another utrace engine.
An in-kernel gdb "stub" using utrace—floated as an RFC back in November—could provide that kernel piece, but was met with a fair amount of resistance when it was proposed. The limitation that ptrace() imposes is seen as something that could, perhaps should, be lifted, but adding a relatively large, kernel-only API to do that is excessive. As Torvalds puts it:
So stop the crazy "new kernel interfaces" crap. Stop the crazy "maybe we can use it for ftrace and generic user event tracing too". Stop the crazy.
The elephant in the room, of course, is SystemTap. It creates, builds, and loads kernel modules for doing its tracing, and uses utrace for the user-space tracing. That model is not popular with most kernel developers, especially for an out-of-tree solution—the APIs that it relies on are far too volatile. SystemTap must be updated when those interfaces change, and all of the previous versions must be maintained so that SystemTap can still be used with older kernels. Because of that, SystemTap may be out-of-sync with development kernels, which makes its utility for kernel hackers quite small.
The utrace proponents are pushing it as something useful in its own right, completely separate from its use in SystemTap, but one gets the sense that many of the kernel developers aren't quite buying that. Ted Ts'o tries to explain his concerns to Eigler:
He goes on to compare the situation to that of the NVIDIA graphics drivers,
which leads
Kyle Moffett to propose a variation on Godwin's
law: "As an LKML discussion grows longer, the probability of an unfavorable
comparison involving nVidia or Microsoft approaches 1." More to the
point, though, Moffett said he was uninterested in SystemTap:
Ts'o sees those features as potentially useful, but points out that they should be submitted with utrace for review. It may be that utrace in its present form does not survive that review:
Without an in-tree "killer feature" that only utrace can provide, there is going to be resistance to merging such an easily-abused API. Several suggestions were made—notably by Torvalds and Ingo Molnar—to enhance ptrace() itself to support some new features (such as multiple active calls or the ability to read/write more than a word at a time between the two processes), but that would mean scrapping much or all of the utrace work. Nesterov and McGrath, who are the ptrace() maintainers, have been largely silent throughout the discussion, but, previously, they have made it clear that they would much rather work with the utrace-based ptrace() implementation. So it is unclear when or if enhancements to the current code might happen.
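The word-at-a-time limitation mentioned above can be illustrated with a short sketch: copying a buffer out of a stopped tracee takes one PTRACE_PEEKDATA call per machine word. The helper names peek_buffer() and peek_demo() and the tracee_text buffer are invented for this example; the demo relies on fork() leaving the buffer at the same address in both processes.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Filled in by the tracee; after fork() it has the same address in both. */
static char tracee_text[32];

/*
 * Copy `len` bytes from `addr` in the stopped tracee `pid`, one word per
 * ptrace() call. Returns the number of ptrace() calls made, or -1 on error.
 */
long peek_buffer(pid_t pid, const char *addr, char *out, size_t len)
{
    long calls = 0;
    for (size_t done = 0; done < len; calls++) {
        errno = 0;      /* -1 can be valid data, so errno decides */
        long word = ptrace(PTRACE_PEEKDATA, pid, addr + done, NULL);
        if (word == -1 && errno != 0)
            return -1;
        size_t chunk = len - done < sizeof word ? len - done : sizeof word;
        memcpy(out + done, &word, chunk);
        done += chunk;
    }
    return calls;
}

/*
 * Fork a child that stores `text` in tracee_text and stops itself, then
 * read the buffer back from the parent. `out` must hold at least
 * sizeof tracee_text bytes. Returns the ptrace() call count or -1.
 */
long peek_demo(const char *text, char *out)
{
    assert(strlen(text) < sizeof tracee_text);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        strcpy(tracee_text, text);      /* visible only in the child */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        raise(SIGSTOP);                 /* hand control to the tracer */
        _exit(0);
    }

    int status;
    long calls = -1;
    if (waitpid(pid, &status, 0) == pid && WIFSTOPPED(status))
        calls = peek_buffer(pid, tracee_text, out, sizeof tracee_text);

    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return calls;
}
```

On a 64-bit system the 32-byte buffer costs four system calls; larger transfers scale the same way, which is the inefficiency the proposed extension would address.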
Without utrace, SystemTap will have to find other ways to hook user space, but that doesn't really faze the kernel developers—particularly after Torvalds's unequivocal rejection of that approach—as there are other tracing solutions in the pipeline. Ftrace and perf events are slowly building capabilities, and are doing so in-tree. They are likely to grow the needed features to support kernel and user-space tracing a la SystemTap (and DTrace). Molnar specifically invites the SystemTap developers to collaborate:
perf record -R -f -e irq:irq_handler_entry --filter 'irq==18 || irq==19'
More could be done - a simple C-like set of function perhaps - some minimal
per probe local variable state, etc. (perhaps even looping as well, with a
limit on number of [predicate] executions per filter invocation.)
It is unfortunate, in many ways, that SystemTap has gotten to this point.
While it is possible that Torvalds could change his mind, he and other
kernel developers find the new tracing
features to be "a million times superior" to SystemTap. That
could leave Red Hat holding the SystemTap bag
for quite some time to come, as it will need to support it for existing,
and likely future,
RHEL versions. It is interesting to note that this alternate solution,
based on Ftrace, etc., is also largely coming out of Red Hat.
It seems possible that utrace will be a casualty here as well. By incorporating features that were needed for SystemTap, and not providing a user-space interface, it tried to do both too much and too little. There are some potential ways forward, but it's unclear whether they will be pursued. Torvalds points to the realtime tree as an example of how to get "crazy" things merged:
But on the whole, I think it's actually worked out pretty well for them. I think the mainline kernel has improved in the process, but I also suspect that _their_ RT patches have also improved thanks to having to make the work more palatable to people like me who don't care all that deeply about their particular flavor of crazy.
There are definitely lessons here, but the standard ones don't seem to apply. SystemTap and utrace were developed in the open, as free software from the outset, and were fairly often discussed on linux-kernel. SystemTap in particular was regularly criticized, to seemingly no avail. The biggest lesson—and the hardest to learn, especially after a feature has shipped—may be that ignoring the advice and complaints of the kernel developers is likely to come back and bite in the end. It is not terribly surprising, really, but that seems to be what is happening here.
Replacing ptrace()
Much of the POSIX system call interface is known for the elegance and simplicity of its design; that is what has enabled this API to endure and thrive for decades. The ptrace() system call has no such reputation. One of the many motivations behind the development of the utrace layer (see the accompanying article) was first to clean up the implementation of ptrace(), but then to enable it to be replaced entirely. Subsequent discussion shows that this is a distant hope, though, and that we will be stuck with ptrace() for a long time.
The purpose of ptrace() is to allow one process to monitor and modify the state of another. It exists to support interactive debuggers and related utilities like strace, but other users exist as well. User-mode Linux uses ptrace() for its internal management, and there are various sandboxing schemes which use it. In general, users are able to get ptrace() to do what they want, but they rarely come away pleased with the experience.
What are the problems with ptrace()? Whenever system calls have to work with extended state within the kernel, the preferred mechanism for referring to that state in user space is the file descriptor. With file descriptors, many of the existing system calls do natural things, and well-defined mechanisms exist for event multiplexing. But ptrace() doesn't use file descriptors; it depends, instead, on a rather more arcane mechanism. A process to be traced is removed from its normal place in the process tree; the process doing the tracing becomes its new parent. In other words, ptrace() sets up a sort of temporary foster home for children under scrutiny. The new parent can then learn about events in the child through the wait() system call.
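The wait()-based event delivery described above can be seen in a minimal sketch. Here the child asks to be traced with PTRACE_TRACEME, so its real parent is already the tracer and no reparenting is visible; trace_child() is a name invented for this example.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, trace it, and report what the tracer learns via waitpid().
 * Returns the child's exit code, or -1 on error. *saw_stop is set to 1 if
 * the child's SIGSTOP was reported to the tracer as a stop event.
 */
int trace_child(int exit_code, int *saw_stop)
{
    assert(saw_stop != NULL);
    *saw_stop = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become a tracee of the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        raise(SIGSTOP);
        _exit(exit_code);
    }

    /* Tracer: the only notification channel is the wait() family. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP) {
        *saw_stop = 1;
        /* Resume the tracee without delivering the signal (data == 0). */
        if (ptrace(PTRACE_CONT, pid, NULL, NULL) < 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Every event arrives through a blocking waitpid() call rather than a file descriptor, so there is nothing to hand to poll() or select().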
This API is hard to fit into normal application event loops. It also implies that any given process can be traced by only one other process at any given time. This may not seem like a problem - how often does one want to run two debuggers on a process? - but it does get in the way. Developers working on debugging tools and users wanting to trace a sandboxed process are two types of users who cannot do what they want with ptrace(). It is also defined as a complex, multiplexer call (see the man page for details) which is hard to understand and hard to use efficiently.
Finally, ptrace() is hard to implement correctly and consistently.
As a result, there has been a long history of obnoxious bugs associated
with it, and user-space code which uses ptrace() tends to become
encrusted with non-portable workarounds. It is, in
summary, not surprising that there is interest in creating a replacement.
Oleg Nesterov expressed things succinctly:
"I must admit that personally I think the current ptrace api is
unfixable, we need the new one in the long term."
Getting to the new one could be hard, though. The first problem is that ptrace() is a standard function which is part of the kernel ABI. As long as users exist, it really cannot be removed from the kernel. So a ptrace() replacement will not improve life for the kernel development community anytime in the near future; indeed, it will make it harder, since there will be two tracing interfaces to support instead of one. Duplicating functionality in this way can be done when the need is strong enough, but it's not something that the community will rush into without a great deal of thought.
Maintaining ptrace() as a compatibility interface might be acceptable if it were plainly a temporary thing with a realistic possibility of removal in the future, and if there were clear advantages to doing so. But it's not entirely clear where the advantages are. For example, Kyle Moffett said:
There are a couple of related problems with this idea, starting with the fact that tools like GDB don't just run on Linux systems with shiny new kernels. They need to work on older kernels indefinitely, not to mention on all those other platforms which lack the good taste to implement every new system call created for Linux. So those "thousands of lines" (and it really is that much code) will not be going anywhere; the GDB developers will have to maintain them forever - or something fairly close to that.
So for GDB, too, a new tracing API would represent an increase in the maintenance load - if they use it. But the fact of the matter is that special, Linux-only interfaces tend to have very limited uptake. As expressed by Ingo Molnar:
That said, Tom Tromey has indicated that GDB might use a new API if there were clear advantages to doing so:
Tom goes on to list a few features that he would like to see in a replacement for ptrace(). That highlights one final obstacle to any kind of new API: no such thing has been implemented or even specified by anybody. The creation of a new system call - especially for a task as complicated as tracing - is not an easy thing to do. Without a great deal of care, we risk creating yet another substandard API with its own warts which must be maintained forever. So a proposed replacement would have to get through an extended process of criticism, argument, and opposition, and it would have to demonstrate some real users - a GDB port, for example. That, alone, ensures that any ptrace() replacement will be years away.
So it's not surprising that justifying utrace as a means to replace ptrace() is not working very well, and it's not surprising that developers are talking about possible ways of extending ptrace() instead. Playing with the ptrace() API is not without its risks - code which uses it tends to be a bit of a house of cards which can be broken by subtle changes in semantics. But it may still be an easier route to moderately more sane and usable tracing in the relatively near future.
Page editor: Jonathan Corbet
