The current 2.6 prepatch is 2.6.21-rc3, released
by Linus on
March 6. It contains quite a few fixes and some KVM enhancements.
Says Linus: "...there's some hope that it will work more widely than
-rc1 and -rc2 did." The long-form changelog
has the details.
As of this writing, no patches have been merged into the mainline git
repository since -rc3 was released.
The current -mm tree is 2.6.21-rc2-mm2. Recent changes
to -mm include a set of memory anti-fragmentation patches (see below), the
dropping of the buffered filesystem I/O patches, the Devicescape wireless stack (now rebranded "mac80211"),
and a new krealloc()
memory allocation function.
For older kernels: new 2.6.19 and 2.6.20 stable updates were released on
March 2. They contain a fair number of fixes, at least one of which
is security-related. "Barring anything major, there will not be any
more 2.6.19 releases. If you disagree with this, please let the stable
team know about the patches that you feel must be in a new release. We
need to move on to flushing out the very large backlog of 2.6.20-stable
patches."
An additional stable release candidate was released on
March 1. It contains a fair number of fixes and a few new hwmon drivers.
Comments (none posted)
Kernel development news
Ooh you have a vm patch that helps swap on the desktop! I can help you here
with my experience from swap prefetch.
1. Get it reviewed and have noone show any evidence it harms
2. Find hundreds of users who can testify it helps
3. Find a way of quantifying it.
5. Merge into mainline.
There, that should get you as far as 4.
I haven't figured out what 4 is yet. I believe it may be goto 1;
-- Con Kolivas
(thanks to Jos Poortvliet).
-mm is crap at present. Well. Mainline is crap at present, and
-mm is crap^2. I think I might be about to throw vast amounts of
-- Andrew Morton
I'm really fed up with having to pull big changes after the merge
window, because it just doesn't seem to let up. I'm going to go
postal on the next maintainer who doesn't understand what "merge
window" and "fixes only" means.
-- Linus Torvalds
Comments (3 posted)
CPU scheduling seems to be one of those eternally unfinished jobs.
Developers can work on the CPU scheduler for a while and make it work
better, but there will always be workloads which are not served as well as
users would like. Users of interactive systems, in particular, tend to be
sensitive to scheduler latencies. In response, the current scheduler has
grown an elaborate array of heuristics which attempt to detect which
processes are truly interactive and give them priority in the CPU. The
result is complicated code - and people still complain about interactive
response.
Enter Con Kolivas, who has been working on improving interactivity for some
time. His latest proposal is the Rotating Staircase Deadline
Scheduler (RSDL), which attempts to provide good interactive response with a
relatively simple design, complete fairness, and bounded latency. This
work takes ideas from
Con's earlier staircase scheduler (covered here in June, 2004), but
with a significantly different approach.
Like many schedulers, the RSDL maintains a priority array, as is crudely
diagrammed to the left. At each level there is a list of processes
currently wanting to run at that priority; each process has a quota of time
it is allowed to execute at that priority. The processes at the highest
priority are given time slices, and the scheduler rotates through them
using a typical round-robin algorithm.
When a process uses its quota at a given priority level, it is dropped down
to the next priority and given a new quota. That process can thus continue
to run, but only after the higher-priority processes have had their turn.
As processes move down the staircase, they increasingly must contend with
the lower-priority processes which have been patiently waiting at the lower
levels. The
end result is that even the lowest-priority processes get at least a little
CPU time eventually.
An interesting feature of this scheduler is that each priority level has a
quota of its own. Once the highest priority level has used its quota, all
processes running at that level are pushed down to the next-lower level,
regardless of whether they have consumed their individual CPU time quotas
or not. As a result of this "minor rotation" mechanism, processes waiting
at lower priority levels need only
cool their heels for a bounded period of time before all other processes
are running at their level. The maximum latency for any process waiting to
run is thus bounded, and can be calculated; there is no starvation with
this scheduler.
As processes use up their time, they are moved to a second array, called the
"expired" array; there they are placed back at their original priority.
Processes in the expired array do not run; they are left out in the cold
until no more processes remain in the currently active array - or until all
processes are pushed off the bottom of the active array as a result of
minor rotations. At that point, a "major rotation" happens: the active and
expired arrays are switched and the whole series of events restarts from
the beginning.
The current scheduler tries to locate interactive tasks by tracking how
often each process sleeps; those seen to be interactive are then rewarded
with a priority boost. The RSDL does away with all that. Instead,
processes which sleep simply do not use all of their time at the higher
priority levels. When they run, they are naturally advantaged over their
CPU-hungry competition. If a process sleeps through a major rotation, its
quota goes back into the run queue's priority-specific quota value. Thus,
it will be able to run at high priority even if other high-priority
processes, which have been running during this time, have been pushed to
lower priorities through minor rotations. All of this should add up to
quick response from interactive applications.
A few benchmarks posted by Con show
that systems running with RSDL perform slightly better than with the stock
2.6.20 scheduler. The initial reports from testers have been positive,
with one person urging that RSDL go into
2.6.21. That will not happen at this point in the release cycle, but
Linus is favorable to including RSDL in a future release:
I agree, partly because it's obviously been getting rave reviews so
far, but mainly because it looks like you can think about behaviour
a lot better, something that was always very hard with the
interactivity boosters with process state history.
Con has recently been heard to complain about difficulties getting his
interactivity improvements into the mainline. This time around, however,
he may find the course of events to be rather more gratifying.
Comments (10 posted)
Memory management has been a relatively quiet topic over much of the life
of the 2.6.x kernels. Many of the worst problems have been solved and the
MM hackers have gone on to other things. That does not mean that there is
no more work to do, however; indeed, things might be about to heat up. A
few recent discussions illustrate the sort of pressures which may lead to a
renewed interest in memory management work in the near future.
Mel Gorman's fragmentation avoidance patches have been discussed here a few
times in the past. The core idea behind Mel's work is to identify pages
which can be easily moved or reclaimed and group them together. Movable
pages include those allocated to user space; moving them is just a matter
of changing the relevant page table entries. Reclaimable pages include
kernel caches which can be released should the need arise. Grouping
these pages together makes it easy for the kernel to free large blocks of
memory, which is useful for enabling high-order allocations or for vacating
regions of memory entirely.
In the past, reviewers of Mel's patches have disagreed over how they should
work. Some argue in favor of maintaining separate free lists for the
different types of allocations, while others feel that this sort of memory
partitioning is just what the kernel's zone system was created to do. So,
this time around, Mel has posted two sets of patches: a list-based grouping mechanism
and a new ZONE_MOVABLE
zone which is restricted to movable allocations.
The difference this time around is that the two patches are designed to
work together. By default, there is no movable zone, so the list-based
mechanism handles the full job of keeping alike allocations together. The
administrator can configure in ZONE_MOVABLE at boot time with the
kernelcore= option, which specifies the amount of memory which is
not to be put into that zone. In addition, Mel has posted some comprehensive information on how
performance is affected by these patches. In an unusual move, Mel has
included a set of videos showing just how memory allocations respond to
system stress with different allocation mechanisms in place; the image at
the right shows one frame from one of those videos. The demonstration is
convincing, but one is left with the uneasy hope that the creation of
multimedia demonstrations will not become necessary to get patches into the
kernel in the future.
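For concreteness, the kernelcore= option described above is a kernel boot parameter; the value and the rest of the command line below are purely hypothetical:

```shell
# Hypothetical boot entry: keep 512M of memory available for general
# (non-movable) kernel allocations; the rest goes into ZONE_MOVABLE.
linux /boot/vmlinuz root=/dev/sda1 kernelcore=512M
```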
These patches have found their way into the -mm tree, though Andrew Morton
is still unclear on whether he thinks they are worthwhile or not. Among
other things, he is concerned about how they fit with other, related work,
especially memory hot-unplugging and per-container memory limits. While
patches addressing both areas have been posted, nothing is really at a
point where it is ready to be merged. This
discussion between Mel and Andrew is worth reading for those who are
interested in this topic.
The hot removal of memory can clearly be helped by Mel's work - memory
which is subject to removal can be restricted to movable and reclaimable
allocations, allowing it to be vacated if need be. Not everybody is
convinced that hot-unplugging is a useful feature, though. In particular,
Linus is opposed to the idea. The biggest
potential use for hot-unplugging is for virtualization; it allows a
hypervisor to move memory resources between guests as their needs change.
Linus points out that most virtualization mechanisms already have
mechanisms which allow the addition and removal of individual pages from
guests; there is, he says, no need for any other support for memory
hot-unplugging.
Another use for this technique is allowing systems to conserve power by
turning off banks of memory when they are not needed. Clearly, one must be
able to move all useful data out of a memory bank before powering it down.
Linus is even more dismissive of this idea:
The whole DRAM power story is a bedtime story for gullible
children. Don't fall for it. It's not realistic. The hardware
support for it DOES NOT EXIST today, and probably won't for several
years. And the real fix is elsewhere anyway...
More information on his objections is available here for those who are interested. In short,
Linus thinks it would make much more sense to look at turning off entire
NUMA nodes rather than individual memory banks. That notwithstanding, Mark
Gross has posted a patch enabling
memory power-down which includes some basic anti-fragmentation
techniques. Says Mark:
To be clear PM-memory will not be useful unless you have workloads
that can take advantage of it. The identified workloads are not
desktop workloads. However; there is a non-zero number of
interested users with applicable workloads that make pushing the
enabling patches out to the community worth while. These workloads
tend to be within network elements and servers where memory
utilization tracks traffic load.
It has also been suggested that resident set size limits (generally
associated with containers) can solve many of the same problems that the
anti-fragmentation work is aimed at. Rik van Riel was heard to complain in response that RSS limits
could aggravate the scalability problems currently being experienced by
the Linux memory management system. That drew questions from people like
Andrew, who were not really aware of those problems. Rik responded with a few relatively vague
examples; his ability to be specific is evidently restricted by agreements
with the customers experiencing the problems.
That led to a whole discussion on whether it makes any sense to try to
address memory management problems without test cases which demonstrate
those problems. Rik argues that fixing
test cases tends to break things in the real world. Andrew responds:
Somehow I don't believe that a person or organisation which is
incapable of preparing even a simple testcase will be capable of
fixing problems such as this without breaking things.
Rik has put together a page
describing some problem workloads in an attempt to push the discussion
forward.
One of Andrew's points is that trying to fix memory management problems
caused by specific workloads in the kernel will always be hard; the kernel
simply does not always have the information to know which pages will be
needed soon and which can be discarded. Perhaps, he says, the right answer
is to make it easier for user space to communicate its expected future
needs. To that end, he put together a pagecache management tool for
testing. It works as an LD_PRELOAD library which intercepts
file-related system calls, tracks application usage, and tells the kernel
to drop pages out of the cache after they have been used. The result is
that common operations (copying a kernel tree, for example) can be carried
out without forcing other useful data out of the page cache.
There were some skeptical responses to this posting. There was also
some interest and some discussion of how smarter, application-specific
policies could be incorporated into the tool. A possible backup tool policy, for example,
would force the output file out of memory immediately, track pages read
from other files and force them back out - but only if they were not
already in the page cache, and so on. It remains to be seen whether
anybody will run with this tool and try to use it to solve real workload
problems, but there is some potential there.
Comments (26 posted)
The interface for tracing programs under Linux is the ptrace()
system call. It is used primarily by debuggers, but there are other
applications too; User-mode Linux can use ptrace(), for example.
The interface gets the job done, but there are few system calls which
endure more criticism. The list of ptrace()
shortcomings is long:
its interface is difficult for user-space developers to use and for
kernel-space developers to maintain, it is inefficient, and it has been the
source of more than one security problem over the years. Still,
ptrace() endures; it is part of the user-space API and there is
nothing better available.
Soon there may be a better alternative, in the form of the "utrace" patch
(by Roland McGrath) which is currently in the -mm tree. Utrace replaces
ptrace() entirely, while maintaining the same interface to user
space. As such, it is a useful cleanup of a difficult system call. The
real value of utrace, however, is likely to be seen in new tracing
interfaces in the future.
The core utrace code does not interface with user space at all; instead, it
is an in-kernel API which can be used to build kernel-based tracing
mechanisms. These mechanisms are based around the concept of a "tracing
engine," which is defined by the usual structure full of method pointers.
This structure (struct utrace_engine_ops) has fourteen callbacks,
each covering something which the traced process might do or have done to
it. For example, one callback is:
u32 (*report_syscall_entry)(struct utrace_attached_engine *engine,
struct task_struct *tsk,
struct pt_regs *regs);
Whenever the traced process invokes a system call, the tracing engine will
(if it has asked for this event) receive a call to its
report_syscall_entry() callback. The call happens at a "safe"
time before the system call is executed; no locks are held, and the tracing
process can safely access the traced process's state. The callback returns
a bitmask specifying what happens next; the bitmask can change the tracing
state, detach the engine, hide the event from other tracing engines, and
more.
A tracing engine is put into service with:
struct utrace_attached_engine *
utrace_attach(struct task_struct *target, int flags,
const struct utrace_engine_ops *ops,
unsigned long data);
This call will attach the engine to the given target process.
There can be more than one engine attached to any given process - a
significant difference from ptrace(). A newly-attached engine
does not actually do anything; one can think of it as being in an idling
state. Putting the engine into gear requires setting one or more action
flags:
int utrace_set_flags(struct task_struct *target,
struct utrace_attached_engine *engine,
unsigned long flags);
There is a special flag (UTRACE_EVENT(QUIESCE)) which puts the
target process into a quiescent state. In general, operating on the task
first requires setting this flag, then waiting for a callback (to the
report_quiesce() engine method) that says the process is truly
stopped. There is a whole set of other events which can be requested:
forking, execing a new program, receiving a signal, process death, system
call entry and exit, etc. Single-stepping through instructions and program
blocks is also handled through the event mechanism.
A signal can be forced into the target process with:
int utrace_inject_signal(struct task_struct *target,
struct utrace_attached_engine *engine,
u32 action, siginfo_t *info,
const struct k_sigaction *ka);
Signals injected in this manner are delivered to the target process
immediately; they are not queued in the usual manner.
There is more to the utrace API than is described in this brief overview,
including an API for describing and working with CPU registers;
see the excellent documentation file
packaged with the patch for more details. Also included with the patch is
a complete reimplementation of ptrace() built on top of utrace.
Reimplementing ptrace() is only so interesting, however, even if
the result is a big improvement. The real purpose behind utrace looks to
be to inspire the creation of the next generation of user-space process
tracing APIs, and more. Roland told your editor:
The intent of the utrace API is not just to facilitate my writing
the one great new userland API to replace ptrace. Its core purpose
is to put writing a new user debugging facility more on par with
writing a software device driver, a filesystem, or a network stack,
so that many people can come up with ideas and experiment without
doing brain surgery every time. It ties up the really nasty
low-level implementation issues, and lets different unrelated
facilities coexist without interfering with each other.
In other words, while utrace should enable the eventual retirement of
ptrace(), there is more coming than that. If and when utrace
makes it into the mainline, look for it to inspire interesting developments
in a number of areas.
Comments (11 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet