
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.33-rc5, released on January 21. It contains a number of fixes - the patch rate for 2.6.33 remains fairly high.

As of 2.6.33-rc5, there are 23 unresolved regressions (of 75 reported) in this development kernel.

Stable updates: 2.6.32.5 was released on January 22, followed by 2.6.32.6 on January 25; both contain a fair number of important fixes. 2.6.32.7 is in the review process as of this writing; it contains 98 fixes, and can be expected sometime on or after January 28.


Quotes of the week

So I can work with crazy people, that's not the problem. They just need to _sell_ their crazy stuff to me using non-crazy arguments, and in small and well-defined pieces. When I ask for killer features, I want them to lull me into a safe and cozy world where the stuff they are pushing is actually useful to mainline people _first_.

In other words, every new crazy feature should be hidden in a nice solid "Trojan Horse" gift: something that looks _obviously_ good at first sight.

-- Linus Torvalds

There is only one real sensible solution for this: Do _not_ use kgdb - which is the modus operandi of every sane kernel developer on the planet.
-- Thomas Gleixner

OK... lookup_instantiate_filp() is a god-awful mess, so it's OK to be confused by it - its authors definitely had been.
-- Al Viro


A module for crashing the kernel

By Jonathan Corbet
January 26, 2010
Normally, a kernel which doesn't crash is considered to be a good thing. It can be a source of true frustration, though, for those who want to see the system go down in flames. The reliability of the system means that somebody waiting for a crash may grow old indeed in the process.

Simon Kagstrom has heard the pain expressed by such users; in response, he has posted a kernel module just for people who want to be able to destroy their systems on demand. This module creates a directory (provoke_crash) in debugfs, filled with a number of useful files. For those with simple needs, a write to bugon results in a straightforward BUG() call. Users with more discriminating tastes can write to null_dereference to cause a null pointer dereference, overwrite_allocation to write beyond a heap allocation, or corrupt_stack to overwrite the stack. And truly kinky users can go for oops_interrupt_context to get a null dereference in softirq mode, write_after_free to step on freed memory, or unaligned_load_store to perform badly-aligned memory operations.

Needless to say, this isn't a module one would ordinarily want to leave loaded into a production system; it's better kept in a secret place and pulled out after the kids go to sleep. Unless, of course, you have a real use for it; Simon has been employing it to make sure that kmsg_dump() does the right thing in various crash scenarios. For most developers, though, work is normally dominated by the need to avoid crashes; since they'll have little use for this feature, it's not clear that this little module will ever make its way into the mainline.


fincore()

By Jonathan Corbet
January 27, 2010
Linux has long had the mincore() system call which allows an application to determine whether a given page is in RAM or not. There is no easy way, though, to tell whether a given page from a file is in the page cache or not. An application can mmap() the file and use mincore() on it, but that can be slow. So Chris Frost has proposed a new fincore() system call to handle this task:

    int fincore(int fd, loff_t start, loff_t len, unsigned char *vec);

A call to fincore() will look at the pages of the file associated with fd in the range indicated by start and len. For each page of the file, one byte of vec will be set to a non-zero value if that page is in memory. Naturally, this answer is an approximation - the situation can change while the system call is running.
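
For comparison, the mmap()/mincore() workaround mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from Chris's patch; the helper name count_resident_pages() is invented for this example. It maps the whole file, asks mincore() for one status byte per page, and counts the pages whose low bit is set.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file at `path` are
 * currently resident in memory, using mmap() plus mincore().
 * Returns the resident page count, or -1 on error. On success,
 * *total_pages receives the number of pages the file spans.
 */
long count_resident_pages(const char *path, long *total_pages)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* nothing to map */
        close(fd);
        *total_pages = 0;
        return 0;
    }

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (size_t)st.st_size;
    size_t pages = (len + page_size - 1) / page_size;

    /* The mapping stays valid after the descriptor is closed. */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(pages);
    long resident = -1;
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;   /* low bit set: page is resident */
    }

    free(vec);
    munmap(map, len);
    *total_pages = (long)pages;
    return resident;
}
```

Setting up and tearing down the mapping on every query is the overhead that a direct, descriptor-based call would avoid; as with fincore(), the answer may be stale by the time the caller looks at it.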

That, however, can be good enough for Chris's needs. His objective is to speed up applications which perform large numbers of non-sequential file reads. The traditional readahead code deals poorly with this kind of application, since the access pattern cannot be predicted ahead of time. But the application often does know about a sequence of reads in advance; if the kernel could be told to pull in those pages ahead of time, it could order the I/O operations optimally and make the whole thing go faster. When doing this for sqlite and the GIMP, Chris reports significant speedups.

The fadvise() system call can be used to request prefetching of file data. But there's a problem: it's hard for a prefetch library to know how much system memory is available. If too little data is prefetched, the performance gains will not be what they could be. Prefetching too much data, however, can lead to thrashing. Hence the fincore() system call: if prefetched pages are no longer present by the time the application gets around to using them, the library knows that it is asking for too much and can back off.
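
The feedback loop described here might look something like the sketch below. The prefetch request uses the real posix_fadvise() call with POSIX_FADV_WILLNEED; everything else is an assumption made for illustration. In particular, adjust_window() and its thresholds (shrink when more than 10% of the requested pages have been evicted, grow by roughly a quarter otherwise) are invented here and are not taken from libprefetch. The count of still-resident pages would come from fincore() or an equivalent residency check.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* Ask the kernel to start reading a byte range of the file into the
 * page cache. Returns 0 on success or an error number on failure. */
int prefetch_range(int fd, off_t start, off_t len)
{
    return posix_fadvise(fd, start, len, POSIX_FADV_WILLNEED);
}

/*
 * Hypothetical policy: pick the next prefetch window (in pages) given
 * how many of the previously requested pages were still resident when
 * the application got around to using them.
 */
size_t adjust_window(size_t window, size_t requested, size_t still_resident,
                     size_t min_window, size_t max_window)
{
    if (requested == 0)
        return window;

    if (still_resident * 10 < requested * 9) {
        /* More than 10% evicted before use: asking for too much. */
        window /= 2;
        if (window < min_window)
            window = min_window;
    } else {
        /* Nearly everything survived: try a somewhat larger window. */
        window += window / 4 + 1;
        if (window > max_window)
            window = max_window;
    }
    return window;
}
```

A library would call prefetch_range() for the next batch of expected reads, check residency just before the application consumes the data, and feed the result into adjust_window() before issuing the next batch.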

Andrew Morton likes the patch:

I must say, the syscall appeals to my inner geek. Lot of applications are leaving a lot of time on the floor due to bad disk access patterns. A really smart library which uses this facility could help all over the place.

Jamie Lokier, though, wondered if it might not be a better idea to find a way to inform applications more directly that their pages are being evicted prior to use.

This is the first posting for this system call, so it has not gotten a lot of attention yet; more discussion will certainly be necessary before it could be merged. In the meantime, the libprefetch site has more information on this whole project.


Kernel development news

LCA: Graphics driver ponies

By Jonathan Corbet
January 26, 2010
Those of you who have come to appreciate Dave Airlie's kitten-filled presentations might just have been dissatisfied with his linux.conf.au talk, which was called "So you moved graphics drivers to the kernel.. what next? I can haz ponies?" But ponies, too, can be cute, and the update on the state of graphics drivers in the kernel was well worth the listening.

It has now been about a year since kernel mode setting (KMS) was merged into the mainline kernel. KMS ends the "mess" which came from having graphics drivers in user space; digging out of that particular hole took a good seven years or so. But now our graphics drivers are in the kernel, just like most other drivers.

Beyond cleaning up the mess, there are a few other good reasons for merging KMS. One is that the system is now able to make full use of the power-saving features of the hardware; before KMS, the kernel never really knew enough about what was going on with the hardware to do this. The Intel drivers can now perform as well as Windows with regard to power saving; the ATI drivers, by contrast, are not quite there yet. Another nice feature is the ability to use a kernel debugger on a system with graphics running; it's now possible to trap into the debugger, then return to a running system and have everything just work.

One of the reasons why KMS took so long to merge is that it places a number of new requirements on the kernel. At the top of the list is a proper manager for graphical memory. That's a hard problem, one that the graphics developers always intended to get to sometime Real Soon Now. Eventually the TTM developers got to it, but they quickly ran into a number of API difficulties. After some effort, the Intel developers decided that a generic approach to the memory management API wasn't going to work; out of that realization came the GEM memory manager, which only tried to solve the Intel problem.

Developers working on ATI chipsets, in turn, soon realized that GEM did not have the capabilities that they needed. So they went back to TTM, but not before bolting something that looks a lot like the GEM API onto it. TTM was recently merged, making KMS possible for ATI chipsets as well.

So what is coming? One future feature is the Gallium 3D architecture. Gallium, says Dave, is starting to work, but full functionality will take a while yet. Moving drivers to Gallium is going to be a painful exercise; there are already plenty of APIs that these drivers need to support. DRI2 is also coming along. This feature really needed KMS to work properly, especially when compositing is being used. There are still performance issues which must be resolved, though.

Another thing to look forward to is the Wayland display server. Wayland can be seen as a simpler, smaller replacement for X built on KMS. It can run GTK and GL applications now; there is also an X server emulator which can run on top of it. A few difficulties remain, including the fundamental fact that Wayland is not X; since X is the standard in this area, alternatives are going to be hard to sell. The Wayland developers also have not yet really dealt with the input problem, but input is a big piece of the X code. So Wayland, too, will be a while in coming; it may find its way into embedded situations first.

Dave spent some time on the current state of the graphics drivers. Intel, he says, is currently in the leading position. It supports KMS for everything - well, almost everything; the "chipset we won't name" (the proprietary GMA500) still lacks support. The driver is feature-complete, but Dave isn't quite ready to call it "mature"; another release or two will be required first. As discussed here previously, the driver will need to retain user mode setting (UMS) support for some time, but the current upstream X.org sources have already removed UMS from the X server.

The ATI/AMD drivers are further behind, but getting closer; this driver is harder than the Intel driver, due to the large number of chipset variations. Chipsets from R100 to R700 are currently supported; R800 support can be expected within a few weeks. The driver works "nearly as well as the old stuff" at this point; suspend and resume work better than before. Support for power-saving features is missing but expected for 2.6.34. The Radeon driver is currently in the staging tree, but it might move out before the end of the 2.6.33 development cycle.

What about the RadeonHD driver? That fork of the driver is primarily the result of a disagreement over the use of ATI's BIOS tables; the Radeon driver has an interpreter for these tables, while RadeonHD reimplements the functionality that those tables provide. Using the BIOS tables makes life a lot easier; it lets the driver ignore a lot of the details associated with different chipset variations. The BIOS table code is part of the KMS implementation which has been merged into the mainline; that should, Dave thinks, resolve this disagreement.

The "pony" displayed for the Nouveau discussion was a Trojan horse. Nouveau, of course, was merged for 2.6.33. The driver has just lost its user-mode support; it will be KMS only. Chipsets from the NV4 through the G80 are supported, with the final pieces to be filled in soon. The "ctxprogs" firmware is being figured out; the NV40 version has already been replaced with a rewritten, freely-licensed equivalent and NV50 is in the works. Dave noted that, whatever one thinks about NVIDIA's approach to working with the community, its hardware tends to be relatively good and easy to work with.

When Dave was asked about support for non-Linux systems, he replied that most of them have been left behind at this point. There is, apparently, an OpenSolaris port being done within Sun, but no code has been released from that group. One other audience member asked about running X without root privileges: that does work now, and Moblin is doing it. There are some problems remaining, though, especially with fast user switching. In the absence of a revoke() system call, there's no way to guarantee that one user isn't listening in on another. Since revoke() is known to be a hard problem, it's not clear how this issue will be resolved.


Back to the drawing board for utrace?

By Jake Edge
January 27, 2010

The utrace tracing framework has had a tortuous path towards the mainline, but it always seemed like it was headed that direction. Over the past week or so, things have gotten rather murkier for the mainline inclusion of utrace. Linus Torvalds made a pronouncement that would seem to leave SystemTap without a future in the mainline—something that many had suspected for a while—but also put the future of utrace in doubt. Further discussion may have provided a way forward, but, at least in its current form, mainline utrace seems very unlikely.

The discussion resulted from a request by Frank Ch. Eigler to include utrace into linux-next. That led to a discussion about whether it was ready for linux-next—because it was likely to be merged in the next release cycle—or whether it should spend some time in another tree. Since an earlier version of utrace was in Andrew Morton's -mm tree, that was a potential path. Morton said that utrace "didn't break anything", but:

I still don't think I've seen a really compelling reason for merging it. At least, I wouldn't be able to explain why we did it. But presumably there _are_ such reasons, because it was a lot of development work.

Someone please sell this to us.

Morton also dredged up a response he had gotten from Oleg Nesterov the last time he asked, which listed various potential uses for utrace. In-kernel uses for utrace are important—new features are rarely merged without one—and an earlier utrace merge attempt ran into opposition because it lacked one. This time around, Nesterov and Roland McGrath included a rewrite of the ptrace() system call using utrace as part of the patch submission. It was hoped that rewriting the notoriously ugly ptrace() code using the cleaner utrace API would be the last hurdle for inclusion into the mainline.

But, replacing the guts of the ptrace() call, even though it may clean things up, is controversial. ptrace() is part of the kernel ABI that must be maintained—ugly or not—but cleaning it up is not without its risks, as Morton points out:

ptrace is a nasty, complex part of the kernel which has a long history of problems, but it's all been pretty quiet in there for the past few years. This leads one to expect that a rip-out-n-rewrite is a high-risk prospect. So, quite reasonably, one looks for a good reason for taking such risk.

The risk is small, though, according to Eigler, because "this code has been deployed in fedora and rhel for several *years*, with millions of users. It's not some rickety experiment." Eigler also added to Nesterov's list of utrace uses as SystemTap's user-space probing is based on utrace. But SystemTap and one of the other potential uses on that list, namely reworking seccomp to use utrace, are what set Torvalds off:

So if things like system tap and "security models that go behind the kernel by tying into utrace" are the reasons for utrace, color me utterly uninterested. In fact, color me actively hostile. I think that's the worst possible situation that we'd ever be in as kernel people (namely exactly the "do things in kernel space by hiding behind utrace without having kernel people involved")

Torvalds's complaint stems from the fact that utrace provides no user-space interface at all. It is purely an internal kernel API that is meant to be used by kernel code like the ptrace() rewrite, but also for kernel modules, which is part of what worries Torvalds. It provides lots of hooks that can be used by "random crazy out-of-tree crap", but doesn't provide any benefit to user space at all, he said:

If somebody were to argue that "this is a simple series of patches to clean up ptrace and make it possible to strace a debugged process", then that would have been different. That's not what you or others have been doing. You've been pushing exactly the _reverse_ of that, namely how great it is for some random totally new features that I'm convinced aren't even used by a lot of people.

One of the biggest problems with ptrace() is its signal-oriented interface. Programs using ptrace() act as the parent process of the tracee and must use wait() to detect state changes. For that reason, there can only be one ptrace() active for a particular process. So an strace of a program that is being debugged with gdb will not succeed. The ptrace() implementation using utrace would change that, but not directly, as there would still need to be a kernel piece that attached another utrace engine.

An in-kernel gdb "stub" using utrace—floated as an RFC back in November—could provide that kernel piece, but was met with a fair amount of resistance when it was proposed. The limitation that ptrace() imposes is seen as something that could, perhaps should, be lifted, but adding a relatively large, kernel-only API to do that is excessive. As Torvalds puts it:

Maybe somebody would be interested in trying to take the utrace improvements, and scaling down what they promise, and ignoring all input except for "I want to strace and gdb at the same time".

So stop the crazy "new kernel interfaces" crap. Stop the crazy "maybe we can use it for ftrace and generic user event tracing too". Stop the crazy.

The elephant in the room, of course, is SystemTap. It creates, builds, and loads kernel modules for doing its tracing, and uses utrace for the user-space tracing. That model is not popular with most kernel developers, especially for an out-of-tree solution—the APIs that it relies on are far too volatile. SystemTap must be updated when those interfaces change, and all of the previous versions must be maintained so that SystemTap can still be used with older kernels. Because of that, SystemTap may be out-of-sync with development kernels, which makes its utility for kernel hackers quite small.

The utrace proponents are pushing it as something useful in its own right, completely separate from its use in SystemTap, but one gets the sense that many of the kernel developers aren't quite buying that. Ted Ts'o tries to explain his concerns to Eigler:

[...] utrace doesn't export a syscall (which is an ABI that we are willing to promise will be stable), but rather a set of kernel API's (which we never promise to be stable), and the fact that there will be out-of-tree programs that are going to be trying to depend on that interface (much like Systemtap does today when it creates kernel modules) [...]

He goes on to compare the situation to that of the NVIDIA graphics drivers, which leads Kyle Moffett to propose a variation on Godwin's law: "As an LKML discussion grows longer, the probability of an unfavorable comparison involving nVidia or Microsoft approaches 1." More to the point, though, Moffett said he was uninterested in SystemTap:

I'm interested in things like the ability to stack gdb with strace, the RFC gdb-stub posted a week ago, etc. None of those abilities would be out-of-tree modules at all [...]

Ts'o sees those features as potentially useful, but points out that they should be submitted with utrace for review. It may be that utrace in its present form does not survive that review:

So what should be reviewed is utrace *plus* these other userland interfaces, which may get critiqued and improved, and utrace patches can be reviewed in light of these new features. But be warned.... if it turns out that only 30% of utrace is only needed to support gdb stacking with strace, etc., the other 70% will likely get ejected and the utrace patches streamlined to support these in-tree users.

Without an in-tree "killer feature" that only utrace can provide, there is going to be resistance to merging such an easily-abused API. Several suggestions were made—notably by Torvalds and Ingo Molnar—to enhance ptrace() itself to support some new features (such as multiple active calls or the ability to read/write more than a word at a time between the two processes), but that would mean scrapping much or all of the utrace work. Nesterov and McGrath, who are the ptrace() maintainers, have been largely silent throughout the discussion, but, previously, they have made it clear that they would much rather work with the utrace-based ptrace() implementation. So it is unclear when or if enhancements to the current code might happen.
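
To see why transferring "more than a word at a time" comes up as a wish, consider what reading a buffer from a tracee looks like with the existing interface. The sketch below is an illustration only; the helper names and the test string are invented for this example. Each PTRACE_PEEKDATA request returns a single word, so copying a buffer takes one system call per word.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Data the parent will read back out of the child's address space. */
static const char tracee_message[] = "hello from the tracee";

/*
 * Copy len bytes from a stopped tracee, one word per ptrace() call.
 * The final request may read a few bytes past the end of the range;
 * a careful tool would handle that case near an unmapped page.
 */
int peek_bytes(pid_t pid, const void *remote, void *local, size_t len)
{
    size_t done = 0;

    while (done < len) {
        errno = 0;
        long word = ptrace(PTRACE_PEEKDATA, pid,
                           (const char *)remote + done, NULL);
        if (word == -1 && errno != 0)   /* -1 can also be valid data */
            return -1;

        size_t n = len - done < sizeof word ? len - done : sizeof word;
        memcpy((char *)local + done, &word, n);
        done += n;
    }
    return 0;
}

/* Fork a child that asks to be traced and stops itself, then read
 * tracee_message out of it. Returns 0 on success, -1 on failure. */
int read_from_stopped_child(char *out, size_t len)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);                 /* hand control to the tracer */
        _exit(0);
    }

    int status;
    int rc = -1;
    if (waitpid(pid, &status, 0) == pid && WIFSTOPPED(status)) {
        /* After fork() the child has the string at the same address. */
        rc = peek_bytes(pid, tracee_message, out, len);
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);       /* reap the child */
    }
    return rc;
}
```

On a 64-bit system, the 22-byte string above costs three ptrace() calls; a multi-megabyte buffer would cost a correspondingly large number of them.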

Without utrace, SystemTap will have to find other ways to hook user space, but that doesn't really faze the kernel developers—particularly after Torvalds's unequivocal rejection of that approach—as there are other tracing solutions in the pipeline. Ftrace and perf events are slowly building capabilities, and are doing so in-tree. They are likely to grow the needed features to support kernel and user-space tracing a la SystemTap (and DTrace). Molnar specifically invites the SystemTap developers to collaborate:

Also, if any systemtap person is interested in helping us create a more generic filter engine out of the current ftrace filter engine (which is really a precursor of a safe, sandboxed in-kernel script engine), that would be excellent as well. Right now we support simple C-syntax expressions like:
    perf record -R -f -e irq:irq_handler_entry --filter 'irq==18 || irq==19'
More could be done - a simple C-like set of function perhaps - some minimal per probe local variable state, etc. (perhaps even looping as well, with a limit on number of [predicate] executions per filter invocation.)

It is unfortunate, in many ways, that SystemTap has gotten to this point. While it is possible that Torvalds could change his mind, he and other kernel developers find the new tracing features to be "a million times superior" to SystemTap. That could leave Red Hat holding the SystemTap bag for quite some time to come, as it will need to support it for existing, and likely future, RHEL versions. It is interesting to note that this alternate solution, based on Ftrace, etc., is also largely coming out of Red Hat.

It seems possible that utrace will be a casualty here as well. By incorporating features that were needed for SystemTap, and not providing a user-space interface, it tried to do both too much and too little. There are some potential ways forward, but it's unclear whether they will be pursued. Torvalds points to the realtime tree as an example of how to get "crazy" things merged:

Yeah, it's taken them years, and they still have out-of-tree stuff. And yeah, they had to change some things to make them more palatable to the mainline kernel - the whole fundamental raw spinlock change is just the most recent example of that.

But on the whole, I think it's actually worked out pretty well for them. I think the mainline kernel has improved in the process, but I also suspect that _their_ RT patches have also improved thanks to having to make the work more palatable to people like me who don't care all that deeply about their particular flavor of crazy.

There are definitely lessons here, but the standard ones don't seem to apply. SystemTap and utrace were developed in the open, as free software from the outset, and were fairly often discussed on linux-kernel. SystemTap in particular was regularly criticized, to seemingly no avail. The biggest lesson—and the hardest to learn, especially after a feature has shipped—may be that ignoring the advice and complaints of the kernel developers is likely to come back and bite in the end. It is not terribly surprising, really, but that seems to be what is happening here.


Replacing ptrace()

By Jonathan Corbet
January 27, 2010
Much of the POSIX system call interface is known for the elegance and simplicity of its design; that is what has enabled this API to endure and thrive for decades. The ptrace() system call has no such reputation. One of the many motivations behind the development of the utrace layer (see the accompanying article) was first to clean up the implementation of ptrace(), but then to enable it to be replaced entirely. Subsequent discussion shows that this is a distant hope, though, and that we will be stuck with ptrace() for a long time.

The purpose of ptrace() is to allow one process to monitor and modify the state of another. It exists to support interactive debuggers and related utilities like strace, but other users exist as well. User-mode Linux uses ptrace() for its internal management, and there are various sandboxing schemes which use it. In general, users are able to get ptrace() to do what they want, but they rarely come away pleased with the experience.

What are the problems with ptrace()? Whenever system calls have to work with extended state within the kernel, the preferred mechanism for referring to that state in user space is the file descriptor. With file descriptors, many of the existing system calls do natural things, and well-defined mechanisms exist for event multiplexing. But ptrace() doesn't use file descriptors; it depends, instead, on a rather more arcane mechanism. A process to be traced is removed from its normal place in the process tree; the process doing the tracing becomes its new parent. In other words, ptrace() sets up a sort of temporary foster home for children under scrutiny. The new parent can then learn about events in the child through the wait() system call.
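
The parent-and-wait() model can be seen in a few lines of code. The sketch below is a minimal illustration with an invented function name, not an excerpt from any real debugger: the child requests tracing with PTRACE_TRACEME and stops itself, and the tracer learns about the stop only by calling waitpid(), with no file descriptor it could hand to poll() or select().

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, let it request tracing and stop itself, and observe
 * that stop through waitpid(). Returns the signal that stopped the
 * child, or -1 on failure.
 */
int trace_child_once(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced by the parent, then stop. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);
        _exit(0);
    }

    /* Tracer: the only notification channel is wait(). */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;                  /* child exited instead of stopping */

    int sig = WSTOPSIG(status);

    /* Resume the child without delivering the signal, then reap it. */
    if (ptrace(PTRACE_CONT, pid, NULL, NULL) < 0)
        kill(pid, SIGKILL);
    waitpid(pid, &status, 0);

    return sig;
}
```

A real tracer runs this wait-and-continue cycle in a loop, which is part of why the interface fits so awkwardly into an application that already has its own event loop.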

This API is hard to fit into normal application event loops. It also implies that any given process can be traced by only one other process at any given time. This may not seem like a problem - how often does one want to run two debuggers on a process? - but it does get in the way. Developers working on debugging tools and users wanting to trace a sandboxed process are two types of users who cannot do what they want with ptrace(). It is also defined as a complex, multiplexer call (see the man page for details) which is hard to understand and hard to use efficiently.

Finally, ptrace() is hard to implement correctly and consistently. As a result, there has been a long history of obnoxious bugs associated with it, and user-space code which uses ptrace() tends to become encrusted with non-portable workarounds. It is, in summary, not surprising that there is interest in creating a replacement. Oleg Nesterov expressed things succinctly: "I must admit that personally I think the current ptrace api is unfixable, we need the new one in the long term."

Getting to the new one could be hard, though. The first problem is that ptrace() is a standard function which is part of the kernel ABI. As long as users exist, it really cannot be removed from the kernel. So a ptrace() replacement will not improve life for the kernel development community anytime in the near future; indeed, it will make it harder, since there will be two tracing interfaces to support instead of one. Duplicating functionality in this way can be done when the need is strong enough, but it's not something that the community will rush into without a great deal of thought.

Maintaining ptrace() as a compatibility interface might be acceptable if it were clearly a temporary thing with a clear possibility of removal in the future, and if there were clear advantages of doing so. But it's not entirely clear where the advantages are. For example, Kyle Moffett said:

The killer app for this will be the ability to delete thousands of lines of code from GDB, strace, and all the various other tools that have to painfully work around the major interface gotchas of ptrace(), while at the same time making their handling of complex processes much more robust.

There are a couple of related problems with this idea, starting with the fact that tools like GDB don't just run on Linux systems with shiny new kernels. They need to work on older kernels indefinitely, not to mention on all those other platforms which lack the good taste to implement every new system call created for Linux. So those "thousands of lines" (and it really is that much code) will not be going anywhere; the GDB developers will have to maintain them forever - or something fairly close to that.

So for GDB, too, a new tracing API would represent an increase in the maintenance load - if they use it. But the fact of the matter is that special, Linux-only interfaces tend to have very limited uptake. As expressed by Ingo Molnar:

Special Linux system calls have a checkered past, they tend to not be used by much anything, and thus they tend to be a breeding ground of both bugs, maintenance complexity and security problems. Lack of attention is never good.

That said, Tom Tromey has indicated that GDB might use a new API if there were clear advantages to doing so:

Nevertheless, if the Linux kernel were to present a new user-space API, and if it had an advantage over ptrace, then we would port GDB to use it. There are other platforms where, IIRC, we now use some /proc thing instead of ptrace.

Tom goes on to list a few features that he would like to see in a replacement for ptrace(). That highlights one final obstacle to any kind of new API: no such thing has been implemented or even specified by anybody. The creation of a new system call - especially for a task as complicated as tracing - is not an easy thing to do. Without a great deal of care, we risk creating yet another substandard API with its own warts which must be maintained forever. So a proposed replacement would have to get through an extended process of criticism, argument, and opposition, and it would have to demonstrate some real users - a GDB port, for example. That, alone, ensures that any ptrace() replacement will be years away.

So it's not surprising that justifying utrace as a means to replace ptrace() is not working very well, and it's not surprising that developers are talking about possible ways of extending ptrace() instead. Playing with the ptrace() API is not without its risks - code which uses it tends to be a bit of a house of cards which can be broken by subtle changes in semantics. But it may still be an easier route to moderately more sane and usable tracing in the relatively near future.


Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.33-rc5
Greg KH Linux 2.6.32.5
Greg KH Linux 2.6.32.6
Thomas Gleixner 2.6.31.12-rt20

Miscellaneous

Jozsef Kadlecsik ipset-4.2 released

Page editor: Jonathan Corbet


Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds