LWN.net Logo

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.21-rc3, released by Linus on March 6. It contains quite a few fixes and some KVM enhancements. Says Linus: "...there's some hope that it will work more widely than -rc1 and -rc2 did." The long-form changelog has the details.

As of this writing, no patches have been merged into the mainline git repository since -rc3 was released.

The current -mm tree is 2.6.21-rc2-mm2. Recent changes to -mm include a set of memory anti-fragmentation patches (see below), the dropping of the buffered filesystem I/O patches, the Devicescape wireless stack (now rebranded "mac80211"), and a new krealloc() memory allocation function.

For older kernels: 2.6.19.6 and 2.6.19.7 were released on March 2. They contain a fair number of fixes, at least one of which is security-related. "Barring anything major, there will not be any more 2.6.19 releases. If you disagree with this, please let the stable team know about the patches that you feel must be in a new release. We need to move on to flushing out the very large backlog of 2.6.20-stable patches."

2.6.16.43-rc1 was released on March 1. It contains a fair number of fixes and a few new hwmon drivers.

Comments (none posted)

Kernel development news

Quotes of the week

Ooh you have a vm patch that helps swap on the desktop! I can help you here with my experience from swap prefetch.

1. Get it reviewed and have noone show any evidence it harms
2. Find hundreds of users who can testify it helps
3. Find a way of quantifying it.
4. ...
5. Merge into mainline.

There, that should get you as far as 4. I haven't figured out what 4 is yet. I believe it may be goto 1;

-- Con Kolivas (thanks to Jos Poortvliet).

-mm is crap at present. Well. Mainline is crap at present, and -mm is crap^2. I think I might be about to throw vast amounts of code overboard.
-- Andrew Morton

I'm really fed up with having to pull big changes after the merge window, because it just doesn't seem to let up. I'm going to go postal on the next maintainer who doesn't understand what "merge window" and "fixes only" means.
-- Linus Torvalds

Comments (3 posted)

The Rotating Staircase Deadline Scheduler

CPU scheduling seems to be one of those eternally unfinished jobs. Developers can work on the CPU scheduler for a while and make it work better, but there will always be workloads which are not served as well as users would like. Users of interactive systems, in particular, tend to be sensitive to scheduler latencies. In response, the current scheduler has grown an elaborate array of heuristics which attempt to detect which processes are truly interactive and give them priority in the CPU. The result is complicated code - and people still complain about interactive response.

Enter Con Kolivas, who has been working on improving interactivity for some time. His latest proposal is the Rotating Staircase Deadline Scheduler (RSDL), which attempts to provide good interactive response with a relatively simple design, complete fairness, and bounded latency. This work takes ideas from Con's earlier staircase scheduler (covered here in June, 2004), but with a significantly different approach.

[RSDL] Like many schedulers, the RSDL maintains a priority array, as is crudely diagrammed to the left. At each level there is a list of processes currently wanting to run at that priority; each process has a quota of time it is allowed to execute at that priority. The processes at the highest priority are given time slices, and the scheduler rotates through them using a typical round-robin algorithm.

When a process uses its quota at a given priority level, it is dropped down to the next priority and given a new quota. That process can thus continue to run, but only after the higher-priority processes have had their turn. As processes move down the staircase, they increasingly must contend with the lower-priority processes which have been patiently waiting on the lower levels. The end result is that even the lowest-priority processes get at least a little CPU time eventually.

An interesting feature of this scheduler is that each priority level has a quota of its own. Once the highest priority level has used its quota, all processes running at that level are pushed down to the next-lower level, regardless of whether they have consumed their individual CPU time quotas or not. As a result of this "minor rotation" mechanism, processes waiting at lower priority levels need only cool their heels for a bounded period of time before all other processes are running at their level. The maximum latency for any process waiting to run is thus bounded, and can be calculated; there is no starvation with this scheduler.

As processes use up their time, they are moved to a second array, called the "expired" array; there they are placed back at their original priority. Processes in the expired array do not run; they are left out in the cold until no more processes remain in the currently active array - or until all processes are pushed off the bottom of the active array as a result of minor rotations. At that point, a "major rotation" happens: the active and expired arrays are switched and the whole series of events restarts from the beginning.

The current scheduler tries to locate interactive tasks by tracking how often each process sleeps; those seen to be interactive are then rewarded with a priority boost. The RSDL does away with all that. Instead, processes which sleep simply do not use all of their time at the higher priority levels. When they run, they are naturally advantaged over their CPU-hungry competition. If a process sleeps through a major rotation, its quota goes back into the run queue's priority-specific quota value. Thus, it will be able to run at high priority even if other high-priority processes, which have been running during this time, have been pushed to lower priorities through minor rotations. All of this should add up to quick response from interactive applications.

A few benchmarks posted by Con show that systems running with RSDL perform slightly better than with the stock 2.6.20 scheduler. The initial reports from testers have been positive, with one person urging that RSDL go into 2.6.21. That will not happen at this point in the release cycle, but Linus is favorable to including RSDL in a future kernel:

I agree, partly because it's obviously been getting rave reviews so far, but mainly because it looks like you can think about behaviour a lot better, something that was always very hard with the interactivity boosters with process state history.

Con has recently been heard to complain about difficulties getting his interactivity improvements into the mainline. This time around, however, he may find the course of events to be rather more gratifying.

Comments (10 posted)

Short topics in memory management

Memory management has been a relatively quiet topic over much of the life of the 2.6.x kernels. Many of the worst problems have been solved and the MM hackers have gone on to other things. That does not mean that there is no more work to do, however; indeed, things might be about to heat up. A few recent discussions illustrate the sort of pressures which may lead to a renewed interest in memory management work in the near future.

Mel Gorman's fragmentation avoidance patches have been discussed here a few times in the past. The core idea behind Mel's work is to identify pages which can be easily moved or reclaimed and group them together. Movable pages include those allocated to user space; moving them is just a matter of changing the relevant page table entries. Reclaimable pages include kernel caches which can be released should the need arise. Grouping these pages together makes it easy for the kernel to free large blocks of memory, which is useful for enabling high-order allocations or for vacating regions of memory entirely.

In the past, reviewers of Mel's patches have disagreed over how they should work. Some argue in favor of maintaining separate free lists for the different types of allocations, while others feel that this sort of memory partitioning is just what the kernel's zone system was created to do. So, this time around, Mel has posted two sets of patches: a list-based grouping mechanism and a new ZONE_MOVABLE zone which is restricted to movable allocations.

[page distribution graphic] The difference this time around is that the two patches are designed to work together. By default, there is no movable zone, so the list-based mechanism handles the full job of keeping alike allocations together. The administrator can configure in ZONE_MOVABLE at boot time with the kernelcore= option, which specifies the amount of memory which is not to be put into that zone. In addition, Mel has posted some comprehensive information on how performance is affected by these patches. In an unusual move, Mel has included a set of videos showing just how memory allocations respond to system stress with different allocation mechanisms in place; the image at the right shows one frame from one of those videos. The demonstration is convincing, but one is left with the uneasy hope that the creation of multimedia demonstrations will not become necessary to get patches into the kernel in the future.

These patches have found their way into the -mm tree, though Andrew Morton is still unclear on whether he thinks they are worthwhile or not. Among other things, he is concerned about how they fit with other, related work, especially memory hot-unplugging and per-container memory limits. While patches addressing both areas have been posted, nothing is really at a point where it is ready to be merged. This discussion between Mel and Andrew is worth reading for those who are interested in this topic.

The hot removal of memory can clearly be helped by Mel's work - memory which is subject to removal can be restricted to movable and reclaimable allocations, allowing it to be vacated if need be. Not everybody is convinced that hot-unplugging is a useful feature, though. In particular, Linus is opposed to the idea. The biggest potential use for hot-unplugging is for virtualization; it allows a hypervisor to move memory resources between guests as their needs change. Linus points out that most virtualization mechanisms already have mechanisms which allow the addition and removal of individual pages from guests; there is, he says, no need for any other support for memory changes.

Another use for this technique is allowing systems to conserve power by turning off banks of memory when they are not needed. Clearly, one must be able to move all useful data out of a memory bank before powering it down. Linus is even more dismissive of this idea:

The whole DRAM power story is a bedtime story for gullible children. Don't fall for it. It's not realistic. The hardware support for it DOES NOT EXIST today, and probably won't for several years. And the real fix is elsewhere anyway...

More information on his objections is available here for those who are interested. In short, Linus thinks it would make much more sense to look at turning off entire NUMA nodes rather than individual memory banks. That notwithstanding, Mark Gross has posted a patch enabling memory power-down which includes some basic anti-fragmentation techniques. Says Mark:

To be clear PM-memory will not be useful unless you have workloads that can take advantage of it. The identified workloads are not desktop workloads. However; there is a non-zero number of interested users with applicable workloads that make pushing the enabling patches out to the community worth while. These workloads tend to be within network elements and servers where memory utilization tracks traffic load.

It has also been suggested that resident set size limits (generally associated with containers) can solve many of the same problems that the anti-fragmentation work is aimed at. Rik van Riel was heard to complain in response that RSS limits could aggravate the scalability problems currently being experienced by the Linux memory management system. That drew questions from people like Andrew, who were not really aware of those problems. Rik responded with a few relatively vague examples; his ability to be specific is evidently restricted by agreements with the customers experiencing the problems.

That led to a whole discussion on whether it makes any sense to try to address memory management problems without test cases which demonstrate those problems. Rik argues that fixing test cases tends to break things in the real world. Andrew responds:

Somehow I don't believe that a person or organisation which is incapable of preparing even a simple testcase will be capable of fixing problems such as this without breaking things.

Rik has put together a page describing some problem workloads in an attempt to push the discussion forward.

One of Andrew's points is that trying to fix memory management problems caused by specific workloads in the kernel will always be hard; the kernel simply does not always have the information to know which pages will be needed soon and which can be discarded. Perhaps, he says, the right answer is to make it easier for user space to communicate its expected future needs. To that end, he put together a pagecache management tool for testing. It works as an LD_PRELOAD library which intercepts file-related system calls, tracks application usage, and tells the kernel to drop pages out of the cache after they have been used. The result is that common operations (copying a kernel tree, for example) can be carried out without forcing other useful data out of the page cache.

There were some skeptical responses to this posting. There was also some interest and some discussion of how smarter, application-specific policies could be incorporated into the tool. A possible backup tool policy, for example, would force the output file out of memory immediately, track pages read from other files and force them back out - but only if they were not already in the page cache, and so on. It remains to be seen whether anybody will run with this tool and try to use it to solve real workload problems, but there is some potential there. The kernel does not always know best.

Comments (26 posted)

Introducing utrace

The interface for tracing programs under Linux is the ptrace() system call. It is used primarily by debuggers, but there are other applications too; User-mode Linux can use ptrace(), for example. The interface gets the job done, but there are few system calls which endure more criticism. The list of ptrace() shortcomings is long, its interface is difficult for user-space developers to use and for kernel-space developers to maintain, it is inefficient, and it has been the source of more than one security problem over the years. Still, ptrace() endures; it is part of the user-space API and there is nothing better available.

Soon there may be a better alternative, in the form of the "utrace" patch (by Roland McGrath) which is currently in the -mm tree. Utrace replaces ptrace() entirely, while maintaining the same interface to user space. As such, it is a useful cleanup of a difficult system call. The real value of utrace, however, is likely to be seen in new tracing interfaces in the future.

The core utrace code does not interface with user space at all; instead, it is an in-kernel API which can be used to build kernel-based tracing mechanisms. These mechanisms are based around the concept of a "tracing engine," which is defined by the usual structure full of method pointers. This structure (struct utrace_engine_ops) has fourteen callbacks, each covering something which the traced process might do or have done to it. For example, one callback is:

    u32 (*report_syscall_entry)(struct utrace_attached_engine *engine,
				struct task_struct *tsk,
				struct pt_regs *regs);

Whenever the traced process invokes a system call, the tracing engine will (if it has asked for this event) receive a call to its report_syscall_entry() callback. The call happens at a "safe" time before the system call is executed; no locks are held, and the tracing process can safely access the traced process's state. The callback returns a bitmask specifying what happens next; the bitmask can change the tracing state, detach the engine, hide the event from other tracing engines, and more.

A tracing engine is put into service with:

    struct utrace_attached_engine *
    utrace_attach(struct task_struct *target, int flags,
	      	  const struct utrace_engine_ops *ops, 
		  unsigned long data);

This call will attach the engine to the given target process. There can be more than one engine attached to any given process - a significant difference from ptrace(). A newly-attached engine does not actually do anything, one can think of it as being in an idling state. Putting the engine into gear requires setting one or more action flags with:

    int utrace_set_flags(struct task_struct *target,
			 struct utrace_attached_engine *engine,
			 unsigned long flags);

There is a special flag (UTRACE_EVENT(QUIESCE)) which puts the target process into a quiescent state. In general, operating on the task first requires setting this flag, then waiting for a callback (to the report_quiesce() engine method) that says the process is truly stopped. There is a whole other set of events which can be requested: forking, execing a new program, receiving a signal, process death, system call entry and exit, etc. Single-stepping through instructions and program blocks is also handled through the event mechanism.

A signal can be forced into the target process with:

    int utrace_inject_signal(struct task_struct *target,
			     struct utrace_attached_engine *engine,
			     u32 action, siginfo_t *info,
			     const struct k_sigaction *ka);

Signals injected in this manner are delivered to the target process immediately; they are not queued in the usual manner.

There is more to the utrace API than is described in this brief overview, including an API for describing and working with CPU registers; see the excellent documentation file packaged with the patch for more details. Also included with the patch is a complete reimplementation of ptrace() built on top of utrace.

Reimplementing ptrace() is only so interesting, however, even if the result is a big improvement. The real purpose behind utrace looks to be to inspire the creation of the next generation of user-space process tracing APIs, and more. Roland told your editor:

The intent of the utrace API is not just to facilitate my writing the one great new userland API to replace ptrace. Its core purpose is to put writing a new user debugging facility more on par with writing a software device driver, a filesystem, or a network stack, so that many people can come up with ideas and experiment without doing brain surgery every time. It ties up the really nasty low-level implementation issues, and lets different unrelated facilities coexist without interfering with each other.

In other words, while utrace should enable the eventual retirement of ptrace(), there is more coming than that. If and when utrace makes it into the mainline, look for it to inspire interesting developments in a number of areas.

Comments (11 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds