Kernel development
Brief items
Kernel release status
The current 2.6 development kernel is 2.6.29-rc3, released on January 28. Some 430 changesets were merged since 2.6.29-rc2; most of these are fixes, but there's also a reorganization of the filesystem Kconfig files, a couple of drivers for the i.MX31 processor, a driver for TI OMAP High Speed Multimedia card interfaces, and a driver for Freescale QUICC Engine USB host controllers. The short-form changelog is in Linus's announcement; see the full changelog for lots of details.

The current stable 2.6 kernel is 2.6.28.2, released on January 24; the 2.6.27.13 update was released at the same time. Both contain a fairly long list of fixes for a number of serious problems.
Kernel development news
Quotes of the week
There's lowering the barrier for entry, and there's not having a barrier at all. The latter is what I'm concerned that staging/ has become.
I strongly support the notion that high-level review is only warranted on code that is reviewable and looks tasteful, and that code which doesn't meet basic style should not be merged at all.
But you're operating on a completely different level!
You chose this example to demonstrate, by (if I may) expandio ad absurdum, that our current approach is flawed. Obviously you *knew* that it could be converted to a pointer, and equally obviously this would require us to process relocations before parsing version symbols. Clearly, you understood that this would mean we had to find another solution for struct module versioning, but you knew that that was always the first symbol version anyway.
You no-doubt knew that we could potentially save 7% on our module size using this approach. But obviously not wanting to criticize my code, you instead chose this oh-so-subtle intimation where I would believe the triumph to be mine alone!
I am humbled by your genius, and I only hope that my patch series approaches the Nirvanic perfection you foresaw.
LCA: A new approach to asynchronous I/O
Asynchronous I/O has been a problematic issue for the Linux kernel for many years. The current implementation is difficult to use, incomplete in its coverage, and hard to support within the kernel. More recently, there has been an attempt to resolve the problem with the syslet concept, wherein kernel threads would be used to make almost any system call potentially asynchronous. Syslets have their own problems, though, not the least of which being that their use can cause a user-space process to change its process ID over time. Work on this area has slowed, with few updates being seen since mid-2007.
Zach Brown is still working on the asynchronous I/O problem, though; he used his linux.conf.au talk to discuss his current approach. The new "acall" interface has the potential to resolve many of the problems which have been seen in this area, but it is early-stage work which is likely to evolve somewhat before it is seriously considered for mainline inclusion.
One of the big challenges with asynchronous kernel operations is that the kernel's idea of how to access task state is limited. For the most part, system calls expect the "current" variable to point to the relevant task structure. That proves to be a problem when things are running asynchronously, and, potentially, no longer have direct access to the originating process's state. The current AIO interface resolves this problem by splitting things into two phases: submission and execution. The submission phase has access to current and is able to block, but the execution phase is detached from all that. The end result is that AIO support requires a duplicate set of system call handlers and a separate I/O path. That, says Zach, is "why our AIO support still sucks after ten years of work."
The fibril or syslet idea replaces that approach with one which is conceptually different: system call handlers remain synchronous, and kernel threads are used to add asynchronous operation on top. This work has taken the form of some tricky scheduler hacks; if an operation which is meant to be asynchronous blocks, the scheduler quickly shifts over to another thread and returns to user space in that thread. That allows the preservation of the state built up to the blocking point and it avoids the cost of bringing in a new thread if the operation never has to block. But these benefits come at the cost of changing the calling process's ID - a change which is sure to cause confusion.
When Zach inherited this work, he decided to take a fresh look at it with the explicit short-term goal of making it easy to implement the POSIX AIO specification. Other features, such as syslets (which allow a process to load a simple program into the kernel for asynchronous execution) can come later if it seems like a good idea. The end result is the "acall" API; this code has not yet been posted to the lists for review, but it is available from Zach's web site.
With this interface, a user-space process specifies an asynchronous operation with a structure like this:
    struct acall_submission {
	u32 nr;
	u32 flags;
	u64 cookie;
	u64 completion_ring_pointer;
	u64 completion_pointer;
	u64 id_pointer;
	u64 args[6];
    };
In this structure, nr identifies which system call is to be invoked asynchronously, while args is the list of arguments to pass to that system call. The cookie field is a value used by the calling program to identify the operation; it should be non-zero if it is to be used. The flags and various _pointer fields will be described shortly.
To submit one or more asynchronous requests, the application will call:
    long acall_submit(struct acall_submission **submissions,
		      unsigned long nr);
submissions is a list of pointers to requests, and nr is the length of that list. The return value will be the number of operations actually submitted. If something goes wrong in the submission process, the current implementation will return a value less than nr, but the error code saying exactly what went wrong will be lost if any operations were submitted successfully.
By default, acall_submit() will create a new kernel thread for each submitted operation. If the flags field for any request contains ACALL_SUBMIT_THREAD_POOL, that request will, instead, be submitted to a pool of waiting threads. Those threads are specific to the calling process, and they will only sit idle for 200ms before exiting. So submission to the thread pool may make sense if the application is submitting a steady stream of asynchronous operations; otherwise the kernel will still end up creating individual threads for each operation. Threads in the pool do not update their task state before each request, so they might be behind the current state of the calling process.
If the id_pointer field is non-NULL, acall_submit() will treat it as a pointer to an acall_id structure:
    struct acall_id {
	unsigned char opaque[16];
    };
This is a special value used by the application to identify this operation to the kernel. Internally it looks like this:
    struct acall_kernel_id {
	u64 cpu;
	u64 counter;
    };
It is, essentially, a key used to look up the operation in a red/black tree.
The completion_pointer field, if non-NULL, points to a structure like:
    struct acall_completion {
	u64 return_code;
	u64 cookie;
    };
The final status of the operation can be found in return_code, while cookie is the caller-supplied cookie value. Once that cookie has a non-zero value, the return code will be valid.
The application can wait for the completion of specific operations with a call to:
    long acall_comp_pwait(struct acall_id **uids,
			  unsigned long nr,
			  struct timespec *utime,
			  const sigset_t *sigmask,
			  size_t sigsetsize);
The uids array contains pointers to acall_id structures identifying the operations of interest; nr is the length of that array. If utime is not NULL, it points to a timespec structure specifying how long acall_comp_pwait() should wait before giving up. A set of signals to be masked during the operation can be given with sigmask and sigsetsize. A return value of one indicates that at least one operation actually completed.
An application submitting vast numbers of asynchronous operations may want to avoid making another system call to get the status of completed operations. Such applications can set up one or more completion rings, into which the status of completed operations will be written. A completion ring looks like:
    struct acall_completion_ring {
	uint32_t head;
	uint32_t nr;
	struct acall_completion comps[0];
    };
Initially, head should be zero, and nr should be the real length of the comps array. When the kernel is ready to store the results of an operation, it will first increment head, then put the results into comps[head % nr]. So a specific entry in the ring is only valid once the cookie field becomes non-zero. The kernel makes no attempt to avoid overwriting completion entries which have not yet been consumed by the application; it is assumed that the application will not submit more operations than will fit into a ring.
The actual ring to use is indicated by the completion_ring_pointer value in the initial submission. Among other things, that means that different operations can go into different rings, or that the application can switch to a differently-sized ring at any time. In theory, it also means that multiple processes could use the same ring, though waiting for completion will not work properly in that case.
If the application needs to wait until the ring contains at least one valid entry, it can call:
    long acall_ring_pwait(struct acall_completion_ring *ring,
			  u32 tail, u32 min,
			  struct timespec *utime,
			  const sigset_t *sigmask,
			  size_t sigsetsize);
This call will wait until the given ring contains at least min events since the one written at index tail. The utime, sigmask, and sigsetsize arguments have the same meaning as with acall_comp_pwait().
Finally, an outstanding operation can be canceled with:
long acall_cancel(struct acall_id *uid);
Cancellation works by sending a KILL signal to the thread executing the operation. Depending on what was being done, that could result in partial execution of the request.
This API is probably subject to change in a number of ways. There is, for example, no limit to the size of the thread pool other than the general limit on the number of processes. Every request is assigned to a thread immediately, with threads created as needed; there is no way to queue a request until a thread becomes available in the future. The ability to load programs into the kernel for asynchronous execution ("syslets") could be added as well, though Zach gave the impression that he sees syslets as a relatively low-priority feature.
Beyond the new API, this asynchronous operation implementation differs from its predecessors in a couple of ways. Requests will always be handed off to threads for execution; there is no concept of executing synchronously until something blocks. That may increase the overhead in cases where the request could have been satisfied without blocking, though the use of the thread pool should minimize that cost. But the big benefit is that the calling process no longer changes its ID when things do block. That results in a more straightforward user-space API with minimal surprises - certainly a good thing to do.
Linus was at the presentation, and seemed to think that the proposed API was not completely unreasonable. So it may well be that, before too long, we'll see a version of the acall API proposed for the mainline. And that could lead to a proper solution to the asynchronous I/O problem at last.
Snet and the LSM API
A new security module, called snet (which is short for "security for network syscalls") was recently posted as an RFC on the linux-security-module mailing list. Its purpose is rather simple—much simpler than the two current mainline users of the LSM interface—intercept system calls for networking and call out to user space to determine if they are to be allowed. The idea is to be able to create Linux versions of the "personal firewall" that is popular on Windows machines. Reaction to snet was mixed, partially because of a disdain for that type of security tool, but also because it is implemented using LSM.
Snet, developed by Samir Bellabes, consists of a kernel piece which uses LSM to hook the "interesting" socket-related system calls (socket(), bind(), connect(), listen(), and accept()), as well as a user space library that can be used to accept or deny those calls. Communication between the kernel and user space is handled by a netlink socket using libnl. The decisions are then cached in the kernel to reduce the number of calls required to user space. That last part is important because personal firewalls typically pop up a request on the user's display asking them to decide whether to allow the system call. Timeouts can be established for the user-space calls, along with a default response if the timeout is reached.
This "user request" feature of personal firewalls is one thing that many find objectionable. As Paul Moore puts it: "my opinion is that it is a poor option for security and typically only results in training the user to click the 'allow' button when the pfwall [dialog] box pops up on his/her screen". Yet it is a "feature" of other operating systems and not completely unreasonable for Linux to support. From that perspective, snet seems like a reasonable starting point.
There are a few other problems, though, stemming from the decision to use the LSM API. Peter Dolding seems to think this capability should be added to netfilter, rather than built as a standalone solution. Others pointed out that netfilter is sufficiently low-level that any context about users or processes that are performing these operations is not available. That could change, but it would take a concerted effort to change the netfilter code, which doesn't seem likely near-term, if ever.
A larger problem comes from the inability to stack LSM modules. If a user is interested in the kinds of protection that snet can provide, they must forgo any other LSM-implemented security solution (i.e. SELinux, Smack, AppArmor, TOMOYO, etc.). A parallel discussion about LSM stacking is also occurring on linux-security-module, partially motivated by the needs of snet and other "smaller" security solutions. Those tools do not implement a full-scale security solution a la SELinux or Smack, but instead try to handle a smaller subset of the problem.
LSM stacking also came up at the LCA security panel, so it is certainly on the minds of Linux security developers. Casey Schaufler sums up the current state of affairs along with a look to a possible future:
I would be very interested to see an LSM that does nothing but multiplex other LSMs. That would make multiple unrelated LSMs feasible without trying to create something that could deal with SELinux's and Smack's different notions of network access control model. You could revive the notion of loadable modules while you're at it. The LSM Multiplexer LSM could put any restrictions on the LSMs it is willing to support.
It seems likely that someone will try to build an LSM-multiplexer before too long. In addition to snet, the TuxGuardian project appears to be reawakening after a period of quiet. It is similar to snet, and also uses LSM to trap network accesses. Other projects are also mentioned in the threads on linux-security-module. In the end, it is just too limiting to require that all security modules implement a full-scale security solution, and since LSM is the only accepted way to implement some of these hooks, some middle ground will likely be found.
In another related thread, Schaufler notes that a lot of what is being described for personal firewalls could be implemented using SELinux—at least as a starting point. One sticking point to that particular solution is the user interaction required. It is hard to see how an SELinux-derived solution could interact with the user for some decisions, but not others. It also is clearly outside of the scope of what SELinux is intended for.
While snet may implement "bad security" in some minds, the discussion about it, especially with regard to LSM stacking has been very valuable. It may turn out that there is no sane way to stack arbitrary security modules in a way that a) makes sense and b) doesn't drive all of the security developers insane. But there are some reasonable use cases for that capability so it would seem that an investigation of those possibilities is warranted. With luck we will soon see where it leads.
A SystemTap update
SystemTap has been under active development for some years. More than 35 people have contributed enhancements in the last year. But newer developments, like the ability to dynamically trace user-space programs, have been introduced rather quietly and, thus, have not always been noticed by users who are not yet using SystemTap extensively. So this article will take a look at what currently works out of the box, what that box should contain to make things work, the work in progress, and the challenges SystemTap faces to be more powerful and get more widespread adoption.
SystemTap's goal is to provide full system observability on production systems, which is safe, non-intrusive, (near) zero-overhead and which allows ubiquitous data collection across the whole system for any interesting event that could happen. To achieve this goal, SystemTap defines the stap language, in which the user defines probes, actions, and data acquisition. The SystemTap translator and runtime guarantee that probe points are only placed on safe locations and that probe functions cannot generate too much overhead when collecting data. For dynamic probes on addresses inside the kernel, SystemTap uses kprobes; for dynamic probes in user-space programs, SystemTap uses its cousin uprobes [PDF]. This provides a unified way of probing and then collecting data for observing the whole system. To dynamically find locations for probe points, arguments of the probed functions, and the variables in scope at the probe point, SystemTap uses the debuginfo (DWARF) standard debugging information that the compiler generates.
So, to provide an ideal setting for using SystemTap, GNU/Linux distributions should provide easy access to debuginfo for the kernel and user space programs. Almost all distributors do this. The kernel supports kprobes, which has been in the upstream kernel for some years, and uprobes, which comes with (and is automatically loaded by) SystemTap, but which relies on the full utrace framework, which isn't yet in the mainline kernel. (The latest few releases of the Fedora family, including Red Hat Enterprise Linux and CentOS, do include full utrace support by default). SystemTap works without debuginfo, but the range of probes and the amount of data you can collect is then very limited. And it works without utrace support, but then you won't be able to do deep user space probing, only observe direct user/kernel space interactions.
There are various probe variants one can use with SystemTap, but the most interesting ones are the debuginfo-based probes for the kernel, kernel modules, and user space applications. These can use function, statement or return variants, and wildcards, such as:
- kernel.function("rpc_new_task"): a named kernel function
- process("/bin/ls").function("*"): any function entry in a specific process
- module("usb*").function("*sync*").return: every return of a function containing the word "sync", in any module starting with "usb"
- kernel.statement("bio_init@fs/bio.c+3"): a specific statement in a particular file
Depending on the type of probe, one can access specifics of the probe point. For the debuginfo-based probes these are $var for in-scope variables or function arguments, $var->field for accessing structure fields, $var[N] for array elements, $return for the return value of a function in a return probe, and meta variables like $$vars to get a string representation of all the in-scope variables at a particular probe point. All accesses to such constructs are safeguarded by the SystemTap runtime to make sure no illegal accesses can occur.
Given that one has the debuginfo of a program installed, one can easily get a simple call trace of a specific program, including all function parameters and return values with the following stap script:
    probe process("/bin/ls").function("*").call
    {
	printf("=>%s(%s)\n", probefunc(), $$parms);
    }

    probe process("/bin/ls").function("*").return
    {
	printf("<=%s:%s\n", probefunc(), $$return);
    }
The examples included with SystemTap come with much more powerful versions that show timed, per-thread call graphs, optionally showing only children of a particular function call.
While these probing and data extraction constructs are powerful, they do require some knowledge of the kernel or program code base. Since you are often interested in what is happening and not precisely how, SystemTap comes with "tapsets," which are utility functions and aliases for groups of interesting probes in a particular subsystem. Examples include system calls, NFS operations, signals, sockets, etc. Currently these tapsets are distributed with SystemTap itself, but ideally each program or subsystem would come with its own tapset of interesting events provided by the program or subsystem maintainer.
Just printing out events while they occur is not always ideal. First, you may be overwhelmed by the volume of the output; second, you might only be interested in a specific subset of those events (only certain parameters, only calls that take longer than a specific time, only events from the process that does the most calls over a specific time frame, etc.). Finally, processing all the events on your production system might interfere with the thing you are trying to observe. Especially at the start of your investigations, when you might not yet be sure what the interesting events are, you may do some very wide probing to see what is going on.
For this reason the stap language supports variables that can be used as associative arrays, simple control structures and data aggregation functions to do simple statistics during probe time, with very low overhead and without having to call external programs that might interfere with the system being probed.
The following script might be how you would start investigating a problem involving a system which seems to do an excessive amount of reads. It uses the "vfs" tapset and an associative array to store the number of reads a particular executable with a specific process ID does:
    global totals;

    probe vfs.read
    {
	totals[execname(), pid()]++
    }

    probe end
    {
	printf("== totals ==\n")
	foreach ([name,pid] in totals-)
	    printf("%s (%d): %d\n", name, pid, totals[name,pid])
    }
This will give you a list of executables and their PIDs, sorted by the total number of VFS reads done while the script was running. These facilities in the stap language help greatly to minimize any overhead of the tracing framework. If you tried to do the same thing by just printing each VFS event and then post-processing the results with Perl, you might end up with Perl itself being the process doing the most VFS calls; worse, by having to parse megabytes of trace data, Perl might start thrashing the system even more, making it harder to determine the root cause of the original problem.
SystemTap now also supports static markers in the kernel. This allows subsystem maintainers to mark specific events as interesting, providing a format string of the arguments to the event that can be easily parsed by tracing tools. The advantage of static markers over tapsets is that they are in-code and so might be easier to maintain, though you probably still want to have an associated tapset for utilities to nicely format the arguments or associate various markers with each other. Also, they can work without needing any DWARF debuginfo around, but you lose the ability to inspect local variables or function parameters not passed to the marker. You use them with a command like:
probe kernel.mark("kernel_sched_wakeup")
The tapset can then access the arguments through $argN and get the argument format string of the marker with $format.
An alternate way of adding static markers to the kernel, tracepoints, is not yet directly supported in SystemTap. Tracepoints have the disadvantage that they require the DWARF debuginfo to be around because they don't currently specify the types of their arguments except through their function prototypes. So SystemTap can currently only use tracepoints via hand-written intermediary code that maps them to markers.
The development version of SystemTap recently got support for user-space static markers. Although SystemTap defines its own STAP_PROBE macros for usage in applications that want to add static markers, there is also an alternative tracing tool, Dtrace, that has its own way for programs to embed static markers. SystemTap supports the convention used by Dtrace by providing an alternative include file and build preprocessor so that programs using DTRACE_PROBE macros can be compiled as if for Dtrace and have their static markers show up with SystemTap.
Luckily, there are various programs that already have such markers defined. For example PostgreSQL has various static markers to trace higher-level events like transactions and database locks. Currently one has to adapt the build process of such programs by hand, but the next version of SystemTap will come with scripts that will automate that process.
While SystemTap works well on GNU/Linux distributions that support it, there are a couple of challenges to overcome to make it more ubiquitous and easier for more people to use out of the box. This goes beyond work on the SystemTap code base itself. Since the goal is to provide full system observability, from low-level kernel events to high-level application events, there is work to be done all across the GNU/Linux stack. Also needed is better integration into more distributions, providing default installation of SystemTap and tapsets, easy access to debuginfo for deep inspection, binaries compiled with marker support for high-level events, etc. The two main challenges to make SystemTap more powerful and easier to use on any distribution are debuginfo and better kernel support.
A lot of SystemTap's power comes from the fact that it can use DWARF debuginfo from the kernel and applications to do very detailed inspection. But this power comes at a price, since the debuginfo is often large. For example, on Fedora, the kernel debuginfo package is far larger than the kernel package itself. One easy win will be to split the debuginfo package into the DWARF files and the source files, which are needed for a debugger but not directly for a tracer like SystemTap. Fedora plans to do this for its next release. The elfutils team is also working on a framework for DWARF transformation and compression that could be used as a post-processor on the output of the compiler.
SystemTap sometimes suffers from the same issues you might have with a debugger: the compiler has optimized the code, but forgot where it put a certain variable after the optimization. Of course this is always the variable you are most interested in. Alexandre Oliva is working on improving the local variable debug information in GCC. His variable tracking assignments [PDF] branch in GCC aims to improve debug information by annotating assignments early in the compilation process and carrying over such annotations throughout all optimization passes so that you can always accurately track variables, even in optimized code.
Finally, there is work being done on a SystemTap "client and server" that could be used on production systems where you might not even want to have any tools or debuginfo installed. You can then set up a development client that has the same configuration as the production system, with the addition of the SystemTap translator and all debuginfo, and create and test your scripts there. The final result of this work can then be used on the bare-bones production server.
Most of the SystemTap runtime, like the kprobes support, is maintained in the upstream Linux kernel, but there is some stuff still missing. This leads to distributions having to add patches to their kernels, especially to support user-space tracing. In particular, the utrace framework is still not upstream. Over the last few kernel releases, various parts have been merged, including the utrace user_regset framework, which creates an interface for code accessing the user-space view of any machine-specific state, and the tracehook work, which provides a framework for all user process tracing. The actual utrace framework sits on top of these components; the ptrace() interface is implemented as a utrace client. Anything that changes the ptrace implementation is hairy stuff, so there is a large ptrace testsuite to make sure that nothing breaks. One idea under consideration is to push utrace upstream in two installments. At first, using utrace or ptrace on a process would be mutually exclusive. That could pave the way for getting a pure utrace implementation upstream first, with proper ptrace cooperation following in a second step.
This approach would also provide the way for uprobes, which depends on the utrace framework, to be submitted upstream. Uprobes components such as breakpoint insertion and removal and the single-stepping infrastructure are also potentially useful for other user-space tracers and debuggers. As with utrace, one idea is to factor out these portions of uprobes so that they can be used by multiple clients as a shared user-space breakpoint support (ubs) layer. With multiple clients using the same layer, upstream acceptance might be easier.
One candidate for using both the utrace and the uprobes layer besides SystemTap is Froggy, which provides an alternative debugger interface to ptrace. The GDB Archer project would like to serve as testbed for Froggy, which they hope will also make GDB more robust when linked with libpython, which is being used for GDB scripting.
In the past, kernel maintainers were skeptical about tracing, which resulted in tracing frameworks like dprobes, LTT, and parts of the SystemTap runtime being maintained outside the main kernel tree. But now that there is actually no shortage of tracing options in the kernel, people like Ted Ts'o have been urging the SystemTap hackers to push as much as possible upstream. Ted also encourages the developers to focus more on kernel hackers as first-rate customers, rather than focusing exclusively on the whole-system experience for production setups. The SystemTap developers have been successful in making their module support "just work" with any kernel. It currently works with kernel versions between 2.6.9 and the latest, 2.6.28; it is also regularly tested against the latest -rc kernels. But perhaps they have been a little too successful: making this activity more visible on the linux-kernel mailing list would be good publicity. In response, there is now an active SystemTap bug called "Make upstream kernel developers happy" that calls for more frequent postings on the main kernel mailing list, improvements in the usage of debuginfo as described above, and pushing utrace and uprobes upstream first as a priority.
There is still work to do, but over the last couple of years the GNU/Linux tracing and debugging experience has kept improving. Hopefully soon, all these parts will fall into place and provide hackers with a fairly nice environment for not only debugging on development systems, but also for unobtrusive tracing on production systems.
About the author: Mark Wielaard is a Senior Software Engineer at Red Hat working in the Engineering Tools group hacking on SystemTap.
Page editor: Jonathan Corbet