
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.18-rc5, released by Linus on August 27. As one would expect for this stage in the 2.6.18 cycle, this patch adds a bunch of fixes but not much else. See the long-format changelog for the details.

A very small number of patches have gone into the mainline git repository since -rc5 was released.

The current -mm tree is 2.6.18-rc4-mm3; changes in this release are mostly bug fixes and minor updates.

Stable kernel 2.6.16.28 was released on August 26. There is a fairly long list of fixes in this release, including at least four which are security-related.

Comments (none posted)

Kernel development news

A guide to getting code merged

Rik van Riel has put up a guide to getting code merged into the kernel on the kernelnewbies.org site. "However, some people react badly to the opinions and suggestions of the people who took hours out of their time to review their code. Some people even flame them to a crisp. Once you have turned enough of the linux-kernel 'top dogs' against you, it will become extremely hard to get your code merged. If only because nobody will take the time again to review the next iteration of your code."

Comments (none posted)

An API for specifying latency constraints

Modern processors support a number of power states. When there is nothing of any real interest going on, they can be instructed to power down to one of potentially several different levels. Since processors on most systems are idle much of the time, this capability can be put to use to bring about a significant reduction in power use. Cutting power demand is most helpful on systems with limited power sources - laptops, portable music players, Linux-powered penguin robots, etc. - but it is a good thing to do in most other environments as well.

Powering down the CPU becomes an even more useful thing to do once a dynamic tick mechanism is in use - something which appears possible for the Linux i386 port in 2.6.19. The elimination of the periodic clock interrupt will allow the processor to sleep for longer periods of time when there is nothing to do. Longer sleeps can translate into deeper power saving modes, reducing consumption even further.

The problem that can come up, however, is that the more aggressive power management modes will, by their nature, cause the processor to take longer to get back into an operating state. So, as the processor is put more deeply to rest, the system's latency in responding to external events will increase. In some situations, that latency can cause the system to fail to operate properly. Audio or video data might get dropped, a network adapter may start to see errors, or that robotic penguin could fail to respond in time to a cyber-walrus threat. The usual response to that problem, beyond hunting walruses to extinction, is to simply disable the power-saving behavior, but such drastic responses should not really be necessary.

Various devices in the system, when operating in certain modes, will need to obtain responses from the system within a given period of time. The drivers for those devices know how the device is being operated at any given moment, so they know what the latency requirements are. If the system as a whole had that information, it could tune its operations to the minimum latency requirements in effect at the moment, and could change its operations as the requirements change. But there is no mechanism in the system for handling - and reacting to - this information.

Arjan van de Ven has set out to change this situation with a latency tracking infrastructure patch. This work adds a set of new functions which may be used by drivers to indicate their latency requirements:

    #include <linux/latency.h>

    void set_acceptable_latency(char *identifier, int usecs);
    void modify_acceptable_latency(char *identifier, int usecs);
    void remove_acceptable_latency(char *identifier);

When a driver enters a mode where it has specific latency requirements (a camera driver starts acquiring frame data, say), it can tell the system about the maximum latency it can handle with set_acceptable_latency(). The identifier parameter is only used for identifying the request later on; usecs is the maximum latency in microseconds. The latency requirement can be changed with modify_acceptable_latency(), or eliminated altogether with remove_acceptable_latency().
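As a rough sketch of how a driver might use this interface - the device, function names, and latency numbers here are purely hypothetical:

    #include <linux/latency.h>

    /* Hypothetical camera driver: while frames are streaming, data will
     * be lost if the processor takes more than 500 microseconds to
     * respond to an interrupt. */
    static void my_cam_start_streaming(void)
    {
        set_acceptable_latency("my_cam", 500);
    }

    /* A lower frame rate relaxes the requirement. */
    static void my_cam_slow_mode(void)
    {
        modify_acceptable_latency("my_cam", 2000);
    }

    /* Once streaming stops, the constraint is withdrawn so the processor
     * can again drop into its deepest sleep states. */
    static void my_cam_stop_streaming(void)
    {
        remove_acceptable_latency("my_cam");
    }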

The back end of the latency infrastructure includes a notifier chain for letting interested subsystems know when the maximum acceptable latency has changed. The current consumer of this information is the ACPI subsystem, which can use it to adjust the processor's idle state to meet that requirement. One could imagine that a smart dynamic tick implementation could use this information as well.
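A subsystem wanting to react to latency changes hooks into that notifier chain; the sketch below assumes registration functions along the lines of register_latency_notifier(), with a hypothetical callback:

    #include <linux/latency.h>
    #include <linux/notifier.h>

    /* Hypothetical consumer: called whenever the system-wide maximum
     * acceptable latency changes; "latency" is the new value in
     * microseconds. */
    static int my_latency_event(struct notifier_block *nb,
                                unsigned long latency, void *data)
    {
        /* Pick a shallower or deeper idle state based on "latency"... */
        return NOTIFY_OK;
    }

    static struct notifier_block my_latency_nb = {
        .notifier_call = my_latency_event,
    };

    /* Registration, assuming the patch's register_latency_notifier():
     *     register_latency_notifier(&my_latency_nb);
     *     unregister_latency_notifier(&my_latency_nb);
     */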

In the current patch, only one subsystem (the IPW2100 wireless network driver) declares its latency requirements. This version of the patch has been proposed for inclusion in the -mm kernel, however, with the idea that other driver maintainers could start to make use of it. Unless some sort of surprising objection comes up, the latency management infrastructure looks likely to be a part of the 2.6.19 kernel.

Comments (8 posted)

Workqueues and internal API conventions

The internal kernel API has developed a number of conventions over the years. One of the most prevalent has to do with the return values from functions. In many cases, a function will return zero as an indicator of success, or a negative error code on failure. This convention goes against the normal C conventions for boolean values - a "false" value means that everything is OK. But it reflects the fact that, while all happy functions are alike, every unhappy function is unhappy in its own way. It is useful to be able to return a variety of error codes.

There are exceptions to this convention, however. One of the more famous is copy_to_user() and copy_from_user(), both of which will, on failure, return the number of bytes which were not copied. Back in 2002, Rusty Russell audited 5500 calls to these functions and determined that 415 of them interpreted the return value incorrectly. He proposed changing the interface to match the kernel's conventions, but had no success. See the May 23, 2002 LWN Kernel Page for more on this episode.
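The correct way to use these functions is to treat any non-zero return as a failure; a minimal example (the structure name is hypothetical):

    #include <asm/uaccess.h>

    /* Copy a structure in from user space; copy_from_user() returns the
     * number of bytes it could NOT copy, so any non-zero value means the
     * copy failed. */
    static int get_args(struct my_args *dst, const void __user *src)
    {
        if (copy_from_user(dst, src, sizeof(*dst)))
            return -EFAULT;
        return 0;
    }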

More recently, Alan Stern has been burned by the workqueue interface. Functions like queue_work() return a "normal" boolean value - zero on failure, non-zero if the requested work was actually queued. Alan suggested that these functions should be changed, and offered to fix up all in-tree callers in the process. The answer he got back was that fixing the return code would be a good thing, but that the name of the functions should be changed at the same time. Otherwise out-of-tree code could misinterpret the new return value with no indication to the programmer.

The resulting patch does just that. With this patch, the functions for adding work to an arbitrary workqueue become:

    int add_work_to_q(struct workqueue_struct *queue,
                      struct work_struct *work);
    int add_delayed_work_to_q(struct workqueue_struct *queue,
                              struct work_struct *work,
                              unsigned long delay);
    int add_delayed_work_to_q_on(int cpu,
                                 struct workqueue_struct *queue,
                                 struct work_struct *work,
                                 unsigned long delay);

As expected, these functions return zero on success and a negative error code (-EBUSY) on failure. The return code makes sense because the only reason for the operation to fail in current code is if the given work_struct is already on a workqueue.
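With the new convention, callers check for a negative return in the usual way; a brief sketch, where my_queue and my_work are hypothetical objects assumed to have been set up elsewhere with create_workqueue() and INIT_WORK():

    /* Queue a work item on a driver's private workqueue; failure now
     * comes back as an ordinary negative error code. */
    static int my_submit(void)
    {
        int err = add_work_to_q(my_queue, &my_work);

        if (err)	/* -EBUSY: this work_struct is already queued */
            printk(KERN_WARNING "mydriver: work already pending\n");
        return err;
    }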

Similar changes have been made to the functions which operate on the generic, shared workqueue (schedule_work() and friends). They are now:

    int add_work(struct work_struct *work);
    int add_delayed_work(struct work_struct *work, unsigned long delay);
    int add_delayed_work_on(int cpu, struct work_struct *work,
                            unsigned long delay);

In each case, wrapper functions with the old names have been provided so that out-of-tree code which has not been updated will not break. Most of the time, anyway. It seems that most in-tree callers never bothered to check the return value from these functions in the first place, and Alan has concluded that out-of-tree callers will be much the same. So the new versions of the old functions are declared void, returning no value at all; instead, they log a warning when an operation fails. As a result of this change, code which actually checks the return value will fail to compile and, presumably, its author will update it to the new functions. Everything else will continue to run as it always did.

Alan has also proposed an addition to the kernel coding style document. It reads (in part):

If the name of a function is an action or an imperative command, the function should return an error-code integer. If the name is a predicate, the function should return a "succeeded" boolean.
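Translated into declarations, the suggested rule looks something like this (the names are hypothetical, purely for illustration):

    /* Imperative name - an action - so it returns zero on success or a
     * negative error code on failure: */
    int start_device(struct my_device *dev);

    /* Predicate name - a yes/no question - so it returns a boolean: */
    int device_is_ready(struct my_device *dev);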

There does not seem to be much disagreement over this proposal, so that is likely to be how things go. This convention is still not likely to extend to copy_to_user() and copy_from_user(), however.

Comments (5 posted)

Resource beancounters

Your editor remembers a time when "the computer" was a single, large machine shared among many users. This large machine was, one might say, not quite as powerful as the systems we work on - or carry around to play music on - today, so sharing it between dozens (or more) people was bound to lead to conflicts. Accordingly, most timesharing systems in those days implemented complex resource quota mechanisms to keep users in bounds. When these systems worked well, they let people get their work done while minimizing violence in the hallways.

It is probably safe to say that almost all deployed Linux systems spend most of their time serving a single user or task. There is little need to keep users from stepping on each other's toes within a single system; instead, they can fight over the use of external resources like network bandwidth. So patches which implement such mechanisms (such as the class-based kernel resource management system) have generally not gotten very far. The driving need to fence users within a portion of a system's resources just has not been there.

Virtualization and containers may change that situation, however. The purpose of these systems is to isolate users from each other. But if one container is able to use a disproportionate amount of some vital system resource, the others will feel its presence. The illusion of having a machine to one's self loses some of its credibility if that machine, say, has no memory available to it. As these projects gather steam, they are motivating another look at resource usage management structures.

CKRM, now known as resource groups, may well make a resurgence. In the meantime, however, another approach has been proposed in the form of the resource beancounters patch. The beancounter developers appear to have tried to take a lighter-weight approach, but this patch still ends up touching a number of places in the kernel.

The core object in this mechanism is, yes, the "beancounter." Each beancounter in the system tracks the resource usage of a group of processes - presumably all of the processes running within a specific container. Beancounters contain a reference count, a unique ID, and an array of resource values; for each tracked resource, this array contains a pair of limits, current usage, historical minimum and maximum use, and a count of how many times an attempt to increase usage of that resource was denied. Each process in the system contains a pointer to its (probably shared) beancounter object. There is also a second beancounter, called fork_bc, which is used for any child processes created with fork().
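A rough sketch of the sort of structure being described - the field names here are hypothetical, not the patch's actual declarations:

    /* Per-resource accounting values, one entry per tracked resource. */
    struct bc_resource {
        unsigned long barrier;	/* soft limit */
        unsigned long limit;	/* hard limit */
        unsigned long held;	/* current usage */
        unsigned long minheld;	/* historical minimum usage */
        unsigned long maxheld;	/* historical maximum usage */
        unsigned long failcnt;	/* denied allocation attempts */
    };

    struct beancounter {
        atomic_t		refcount;
        bcid_t			id;	/* unique identifier */
        struct bc_resource	resources[BC_RESOURCES];
    };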

A new system call, get_bcid(), returns the ID number for the current process's beancounter object. A suitably privileged process can call:

    int set_bcid(bcid_t id);

to change its current and fork IDs to a new value. Privileged processes can also change any process's limits with:

    int set_bclimit(bcid_t id, unsigned long resource, unsigned long *limits);

Here, resource identifies which resource limit is being changed, and limits points to an array of two values holding the "barrier" and "limit" values. The barrier value is intended to be a sort of soft limit, where some allocations might fail, but others are allowed to proceed.
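From user space, a management tool might drive these calls roughly as follows. The syscall numbers and the resource index below are placeholders; real values would only be assigned if and when the patch is merged:

    #include <sys/syscall.h>
    #include <unistd.h>

    /* Placeholder syscall numbers - NOT the real assignments. */
    #define __NR_set_bcid		400
    #define __NR_set_bclimit	401

    /* Hypothetical resource index for kernel memory. */
    #define BC_KMEMSIZE		0

    int main(void)
    {
        unsigned long limits[2] = {
            8 * 1024 * 1024,	/* barrier: most allocations fail here */
            10 * 1024 * 1024,	/* limit: hard ceiling for page tables etc. */
        };

        /* Move this (privileged) process into beancounter 42, then cap
         * the kernel memory its container may consume. */
        syscall(__NR_set_bcid, 42);
        syscall(__NR_set_bclimit, 42, BC_KMEMSIZE, limits);
        return 0;
    }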

In the posted patch, only one resource is tracked: kernel memory. For this resource, the "barrier" limit applies to most allocations; once the barrier is hit, allocation attempts will fail. The allocation of page tables and related structures, however, can go all the way to the "limit" value. So, while a process may start to see operations failing as a result of excessive kernel memory use, it should still be able to have its page faults handled normally while it tries to recover.

The kernel allocates memory in many places, and not all of those should be charged to the process that happens to be running at the time. The beancounter patch adds a couple of new GFP flags to make the difference explicit. In the default case, memory allocations are not charged to any specific beancounter. Whenever an allocation function is called with the __GFP_BC flag set, however, the current beancounter will be charged. An additional flag (__GFP_BC_LIMIT) specifies that the higher limit value is to be used. There is also a SLAB_BC flag which can cause all allocations from a given slab cache to be charged. Finally, there is a new vmalloc_bc() function which performs the appropriate accounting.
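Inside the kernel, the charging is requested at allocation time; a hedged sketch of the sort of call sites involved (the variable and cache names are invented for illustration):

    /* Charge this allocation to the current process's beancounter. */
    buf = kmalloc(size, GFP_KERNEL | __GFP_BC);

    /* Page table allocations may go past the barrier, up to the hard
     * limit, so that page faults can still be handled. */
    pgd = kmem_cache_alloc(my_pgd_cache,
                           GFP_KERNEL | __GFP_BC | __GFP_BC_LIMIT);

    /* Charge every allocation made from this slab cache. */
    cachep = kmem_cache_create("my_objects", size, 0,
                               SLAB_HWCACHE_ALIGN | SLAB_BC, NULL, NULL);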

Needless to say, finding every allocation which should be tracked and charged to a beancounter would be a large task. The current patch does not even try; instead, it marks enough specific allocations to catch some of the larger uses of kernel memory and show how the whole system works. That may be as far as it gets; getting driver writers, for example, to think about whether their memory allocations should be charged seems like an uphill battle.

Whether this patch set will get any further than CKRM (sorry, "resource groups") remains to be seen. There are some concerns about how accounting for shared resources is handled - does the process group which first faults in the C library get charged for the whole thing, giving others a free ride? Beyond that, many developers will continue to see no real need for this sort of accounting structure. The growing use of virtualization techniques may just be the factor which pushes this kind of patch into the kernel, however.

Comments (5 posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Device drivers

Documentation

Filesystems and block I/O

  • Jörn Engel: LogFS. (August 24, 2006)

Janitorial

Memory management

Networking

Architecture-specific

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds