The current 2.6 prepatch is 2.6.18-rc5, released by Linus on
August 27. As one would expect at this stage in the 2.6.18 cycle, this
patch adds a bunch of fixes but not much else. See the long-format
changelog for the details.
A very small number of patches have gone into the mainline git repository
since -rc5 was released.
The current -mm tree is 2.6.18-rc4-mm3; changes in this
release are mostly bug fixes and minor updates.
Stable kernel 2.6.17.11 was
released on August 26. There is a fairly long list of fixes in this
release, including at least four which are security-related.
Kernel development news
Rik van Riel has put up a guide to getting code merged into the kernel
on the kernelnewbies.org site. "However, some people react badly to the opinions and suggestions of the people who took hours out of their time to review their code. Some people even flame them to a crisp. Once you have turned enough of the linux-kernel 'top dogs' against you, it will become extremely hard to get your code merged. If only because nobody will take the time again to review the next iteration of your code."
Modern processors support a number of power states. When there is
nothing of any real interest going on, they can be instructed to power down
to one of potentially several different levels. Since processors on most
systems are idle much of the time, this capability can be put to use to
bring about a significant reduction in power use. Cutting power demand is
most helpful on systems with limited power sources - laptops, portable
music players, Linux-powered penguin robots, etc. - but cutting power
consumption is a good thing to do in most other environments as well.
Powering down the CPU becomes an even more useful thing to do once a
dynamic tick mechanism is in use - something which appears possible for the
Linux i386 port in 2.6.19. The elimination of the periodic clock
interrupt will allow the processor to sleep for longer periods of time when
there is nothing to do. Longer sleeps can translate into deeper power
saving modes, reducing consumption even further.
The problem that can come up, however, is that the more aggressive power
management modes will, by their nature, cause the processor to take longer
to get back into an operating state. So, as the processor is put more
deeply to rest, the system's latency in responding to external events will
increase. In some situations, that latency can cause the system to fail to
operate properly. Audio or video data might get dropped, a network adapter
may start to see errors, or that robotic penguin could fail to respond in
time to a cyber-walrus threat. The usual response to that problem, beyond
hunting walruses to extinction, is to simply disable the power-saving
behavior. But such drastic responses should not really be necessary.
Various devices in the system, when operating in certain modes, will need
to obtain responses from the system
within a given period of time. The drivers for those devices
know how the device is being operated at any given moment, so they know what
the latency requirements are. If the system as a whole had that
information, it could tune its operations to the minimum latency
requirements in effect at the moment, and could change its operations as
the requirements change. But there is no mechanism in the system for
handling - and reacting to - this information.
Arjan van de Ven has set out to change this situation with a latency tracking infrastructure
patch. This work adds a set of new functions which may be used by drivers
to indicate their latency requirements:
void set_acceptable_latency(char *identifier, int usecs);
void modify_acceptable_latency(char *identifier, int usecs);
void remove_acceptable_latency(char *identifier);
When a driver enters a mode where it has specific latency requirements (a
camera driver starts acquiring frame data, say), it can tell the system
about the maximum latency it can handle with
set_acceptable_latency(). The identifier parameter is
only used for identifying the request later on; usecs is the
maximum latency in microseconds. The latency requirement can be changed
with modify_acceptable_latency(), or eliminated altogether with
remove_acceptable_latency().
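As a rough sketch of how a driver might use this interface - the identifier string, the timing values, and the stub bodies below are all illustrative, not part of the patch itself:

```c
#include <limits.h>

/* Minimal stand-ins for the functions added by the patch; the real
 * implementations live in the kernel and feed a notifier chain. */
static int system_max_latency_usecs = INT_MAX;   /* no requirement yet */

static void set_acceptable_latency(char *identifier, int usecs)
{
	(void)identifier;           /* used only to find the entry later */
	if (usecs < system_max_latency_usecs)
		system_max_latency_usecs = usecs;
}

static void modify_acceptable_latency(char *identifier, int usecs)
{
	(void)identifier;
	system_max_latency_usecs = usecs;   /* simplified: one requirement */
}

static void remove_acceptable_latency(char *identifier)
{
	(void)identifier;
	system_max_latency_usecs = INT_MAX;
}

/* A camera driver could bracket frame acquisition this way: */
void camera_start_capture(void)
{
	/* While frames are flowing, wakeups must happen within 500us. */
	set_acceptable_latency("camera0", 500);
}

void camera_switch_to_preview(void)
{
	/* A lower-rate preview mode tolerates more wakeup latency. */
	modify_acceptable_latency("camera0", 5000);
}

void camera_stop_capture(void)
{
	remove_acceptable_latency("camera0");
}

int current_latency_requirement(void)
{
	return system_max_latency_usecs;
}
```

The point of the pattern is that the requirement exists only while the device actually needs it; the rest of the time the system is free to sleep as deeply as it likes.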
The back end of the latency infrastructure includes a notifier chain for
letting interested subsystems know when the maximum acceptable latency has
changed. The current consumer of this information is the ACPI subsystem,
which can use it to adjust the processor's idle state to meet that
requirement. One could imagine that a smart dynamic tick implementation
could use this information as well.
In the current patch, only one subsystem (the IPW2100 wireless network
driver) declares its latency requirements. This version of the patch has
been proposed for inclusion in the -mm kernel, however, with the idea that
other driver maintainers could start to make use of it. Unless some sort
of surprising objection comes up, the latency management infrastructure
looks likely to be a part of the 2.6.19 kernel.
The internal kernel API has developed a number of conventions over the
years. One of the most prevalent has to do with the return values from
functions. In many cases, a function will return zero as an indicator of
success, or a negative error code on failure. This convention goes against
the normal C conventions for boolean values - a "false" value means that
everything is OK. But it reflects the fact that, while all happy functions
are alike, every unhappy function is unhappy in its own way. It is useful
to be able to return a variety of error codes.
There are exceptions to this convention, however. One of the more famous
is copy_to_user() and copy_from_user(), both of which
will, on failure, return the number of bytes which were not copied. Back
in 2002, Rusty Russell audited 5500 calls to these functions and determined
that 415 of them interpreted the return value incorrectly. He proposed
changing the interface to match the kernel's conventions, but had no
success. See the
May 23, 2002 LWN Kernel Page for more on this episode.
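The correct calling pattern for this unconventional interface looks like the following; the stub here stands in for the real copy_to_user(), which can fail partway through when a user-space page is not accessible:

```c
#include <string.h>

#define EFAULT 14   /* matching the kernel's errno value */

/* Userspace stand-in: the real copy_to_user() returns the number of
 * bytes NOT copied - zero means complete success. This stub never
 * faults, so it always copies everything and returns zero. */
static unsigned long copy_to_user(void *to, const void *from,
				  unsigned long n)
{
	memcpy(to, from, n);
	return 0;
}

/* Correct usage: any nonzero return becomes -EFAULT; the byte count
 * itself must not be passed back to the caller as an error code. */
int give_data_to_user(void *user_buf, const void *data, unsigned long len)
{
	if (copy_to_user(user_buf, data, len))
		return -EFAULT;
	return 0;   /* kernel convention: zero means success */
}
```

Rusty's 415 miscounted callers typically treated the return value as if it followed the usual zero-or-negative-errno convention, which this wrapper pattern avoids.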
More recently, Alan Stern has been burned by
the workqueue interface. Functions like queue_work() return a
"normal" boolean value - zero on failure, non-zero if the requested work
was actually queued. Alan suggested that these functions should be
changed, and offered to fix up all in-tree callers in the process. The
answer he got back was that fixing the return code would be a good thing,
but that the name of the functions should be changed at the same time. Otherwise
out-of-tree code could misinterpret the new return value with no indication
to the programmer.
The resulting patch does just
that. With this patch, the functions for adding work to an arbitrary
workqueue become:
    int add_work_to_q(struct workqueue_struct *queue,
                      struct work_struct *work);

    int add_delayed_work_to_q(struct workqueue_struct *queue,
                              struct work_struct *work,
                              unsigned long delay);

    int add_delayed_work_to_q_on(int cpu,
                                 struct workqueue_struct *queue,
                                 struct work_struct *work,
                                 unsigned long delay);
As expected, these functions return zero on success and a negative error
code (-EBUSY) on failure. The return code makes sense because the
only reason for the operation to fail in current code is if the given
work_struct is already on a workqueue.
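A caller under the new convention would check for that one failure mode explicitly. The stand-ins below are simplified (the real work_struct carries pointers and flags, and a real queue would link the item in), but they show the -EBUSY pattern:

```c
#define EBUSY 16   /* matching the kernel's errno value */

/* Simplified stand-ins: queuing fails with -EBUSY only when the work
 * item is already pending on a queue. */
struct work_struct {
	int pending;
};

int add_work(struct work_struct *work)
{
	if (work->pending)
		return -EBUSY;
	work->pending = 1;   /* real code would also link it into a queue */
	return 0;
}
```

With this shape, `if (add_work(&w))` reads the way error checks read everywhere else in the kernel, which is the whole point of the change.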
Similar changes have been made to the functions which operate on the
generic, shared workqueue (schedule_work() and friends). They become:
int add_work(struct work_struct *work);
int add_delayed_work(struct work_struct *work, unsigned long delay);
int add_delayed_work_on(int cpu, struct work_struct *work,
unsigned long delay);
In each case, wrapper functions with the old names have been provided
so that out-of-tree code which has not been updated will not break. Most
of the time, anyway. It seems that most in-tree callers never bothered to
check the return value from these functions in the first place, and Alan has concluded
that out-of-tree callers will be the same. So the new versions of the old
functions are declared as void, returning no value at all.
Instead, they log a warning when an operation fails. As a result of this
change, code which actually checks the return value will fail to compile,
and, presumably, the author will update it to the new functions.
Everything else will continue to run as it always did.
Alan has also proposed an addition to the kernel coding style document. It
reads (in part):
If the name of a function is an action or an imperative command,
the function should return an error-code integer. If the name
is a predicate, the function should return a "succeeded" boolean.
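Applied to a pair of hypothetical functions (the names and the single illustrative parameter are invented here to show the contrast), the proposed rule looks like this:

```c
#include <stdbool.h>

#define EIO 5   /* matching the kernel's errno value */

/* "reset" is an imperative command, so the function returns zero on
 * success or a negative error code on failure. */
int device_reset(int device_responding)
{
	if (!device_responding)
		return -EIO;
	return 0;
}

/* "is ready" is a predicate, so the function returns a plain boolean
 * answering the question its name asks. */
bool device_is_ready(int device_responding)
{
	return device_responding != 0;
}
```

Under this rule, a reader can tell from the name alone whether `if (f())` means "it failed" or "it is true".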
There does not seem to be much disagreement over this proposal, so that is
likely to be how things go. This convention is still not likely to extend
to copy_to_user() and copy_from_user(), however.
Your editor remembers a time when "the computer" was a single, large
machine shared among many users. This large machine was, one might say,
not quite as powerful as the systems we work on - or carry around to play
music on - today, so sharing it between dozens (or more) people was bound
to lead to conflicts. Accordingly, most timesharing systems in those days
implemented complex resource quota mechanisms to keep users in bounds. When
these systems worked well, they let people get their work done while
minimizing violence in the hallways.
It is probably safe to say that almost all deployed Linux systems spend
most of their time serving a single user or task. There is little need to
keep users from stepping on each others' toes within a single system;
instead, they can fight over the use of external resources like network
bandwidth. So patches which implement
such mechanisms (such as the class-based kernel resource
management system) have generally not gotten very far. The driving
need to fence users within a portion of a system's resources just has not
been there.
Virtualization and containers may change that situation, however. The
purpose of these systems is to isolate users from each other. But if one
container is able to use a disproportionate amount of some vital system
resource, the others will feel its presence. The illusion of having a
machine to one's self loses some of its credibility if that machine, say, has no
memory available to it. As these projects gather steam, they are
motivating another look at resource usage management structures.
CKRM, now known as resource
groups, may well make a resurgence. In the mean time, however, another
approach has been proposed in the form of the resource beancounters patch.
The beancounter developers appear to have tried to take a lighter-weight
approach, but this patch still ends up touching a number of places in the
core kernel.
The core object in this mechanism is, yes, the "beancounter." Each
beancounter in the system tracks the resource usage of a group of processes - presumably
all of the processes running within a specific container. Beancounters
contain a reference count, a unique ID, and an array of resource values; for
each tracked resource, this array contains a pair of limits, current usage, historical
minimum and maximum use, and a count of how many times an attempt to
increase usage of that resource was denied. Each process in the system
contains a pointer to its (probably shared) beancounter object. There is
also a second beancounter pointer, called fork_bc, which is used for any
child processes created with fork().
A new system call, get_bcid(), returns the ID number for the
current process's beancounter object. A suitably privileged user can call:
int set_bcid(bcid_t id);
to change its current and fork IDs to a new value. Privileged
processes can also change any process's limits with:
int set_bclimit(bcid_t id, unsigned long resource, unsigned long *limits);
Here, resource identifies which resource limit is being changed,
and limits points to an array of two values holding the "barrier"
and "limit" values. The barrier value is intended to be a sort of soft
limit, where some allocations might fail, but others are allowed to
proceed.
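A management daemon might set those two values like this. The wrapper below is purely a stand-in: the patch adds real system calls, but their numbers are not in any released libc, and the BC_KMEMSIZE resource index is an assumed name, so this stub only records its arguments to show the calling convention:

```c
typedef unsigned long bcid_t;

#define BC_KMEMSIZE 0   /* assumed index for the kernel-memory resource */

static unsigned long recorded_limits[2];

/* Hypothetical stand-in for the patch's set_bclimit() system call. */
static int set_bclimit(bcid_t id, unsigned long resource,
		       unsigned long *limits)
{
	(void)id;
	(void)resource;
	recorded_limits[0] = limits[0];  /* "barrier": the soft limit */
	recorded_limits[1] = limits[1];  /* "limit": the hard ceiling */
	return 0;
}

/* Cap a container's kernel memory: most allocations stop at 8MB,
 * but page-table allocations may reach 10MB. */
int limit_container_kmem(bcid_t id)
{
	unsigned long limits[2] = { 8UL << 20, 10UL << 20 };
	return set_bclimit(id, BC_KMEMSIZE, limits);
}

unsigned long recorded_barrier(void) { return recorded_limits[0]; }
unsigned long recorded_limit(void)   { return recorded_limits[1]; }
```

The gap between barrier and limit is what gives a container room to recover, as the next paragraph describes.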
In the posted patch, only one resource is tracked: kernel memory. For this
resource, the "barrier" limit applies to most allocations; once the barrier
is hit, allocation attempts will fail. The allocation of page tables and
related structures, however, can go all the way to the "limit" value. So,
while a process may start to see operations failing as a result of
excessive kernel memory use, it should still be able to have its page
faults handled normally while it tries to recover.
The kernel allocates memory in many places, and not all of those should be
charged to the process that happens to be running at the time. The
beancounter patch adds a couple of new GFP flags to make the difference
explicit. In the default case, memory allocations are not charged to any
specific beancounter. Whenever an allocation function is called with the
__GFP_BC flag set, however, the current beancounter will be
charged. An additional flag (__GFP_BC_LIMIT) specifies that the
higher limit value is to be used. There is also a SLAB_BC flag
which can cause all allocations from a given slab cache to be charged.
Finally, there is a new vmalloc_bc() function which performs the
charged version of vmalloc().
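In sketch form, a charged allocation would be requested as follows; the flag values and the allocator stub are illustrative (the real flags are defined by the patch, and real code would check the barrier or limit and fail the allocation when it is exceeded):

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative flag values; the real ones come from the patch. */
#define __GFP_BC        0x1   /* charge the current beancounter */
#define __GFP_BC_LIMIT  0x2   /* charge against the hard "limit" */

static unsigned long charged_bytes;   /* stands in for one beancounter */

/* Stand-in allocator: real code would refuse the allocation once the
 * barrier (or, with __GFP_BC_LIMIT, the limit) had been reached. */
void *bc_kmalloc(size_t size, unsigned int flags)
{
	if (flags & __GFP_BC)
		charged_bytes += size;
	return malloc(size);
}

/* Page-table-style allocations may dip into the reserve above the
 * barrier, so they pass both flags: */
void *alloc_page_table_page(void)
{
	return bc_kmalloc(4096, __GFP_BC | __GFP_BC_LIMIT);
}

unsigned long bc_charged(void) { return charged_bytes; }
```

Allocations made without __GFP_BC go entirely unaccounted, which is why marking call sites is where the real work of this patch lies.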
Needless to say, finding every allocation which should be tracked and
charged to a beancounter would be a large task. The current patch does not
even try; instead, it marks enough specific allocations to catch some of
the larger uses of kernel memory and show how the whole system works. That
may be as far as it gets; getting driver writers, for example, to think
about whether their memory allocations should be charged seems like an
uphill battle.
Whether this patch set will get any further than CKRM (sorry, "resource
groups") remains to be seen. There are some concerns about how accounting
for shared resources is handled - does the process group which first
faults in the C library get charged for the whole thing, giving others a
free ride? Then, many developers will continue to see no real need for this sort
of accounting structure. The growing use of virtualization techniques may
just be the factor which pushes this kind of patch into the kernel,
however.
Patches and updates
Core kernel code
Filesystems and block I/O
- Jörn Engel: LogFS.
(August 24, 2006)
Page editor: Jonathan Corbet