Kernel development

Kernel release status

The current 2.6 prepatch remains 2.6.9-rc1; no new prepatches have been released since August 24.

The flow of patches into Linus's BitKeeper repository continues, however, and a new prepatch could come out at any time. That repository now contains the removal of the ancient, unused "busmouse" driver, infrastructure for cluster-wide file locking, a number of DRM subsystem cleanups, the out-of-line spinlock patch, AMD dual-core support, more filesystem conversions to the new symbolic link resolution code (which will eventually allow an increase in the maximum link depth), a new waitid() system call implementing the POSIX call by the same name, a "fake NUMA" mode for x86-64 testing, a small-footprint tmpfs implementation, the base KProbes patch, a set of IDE updates, support for scheduler profiling (seeing where context switches come from), automatic TCP window scaling calculation, a kobject change (it uses kref now), a USB gadget interface update with "On The Go" support, a big ALSA update, the removal of the Philips webcam driver, numerous network driver updates, some random number generator fixes, a fix for the audio CD writing memory leak, some VFS interface improvements, executable support in hugetlb mappings, the Whirlpool digest algorithm, some virtual memory tweaks, a number of asynchronous I/O fixes and improvements, a User-mode Linux update, the "flex mmap" user-space memory layout (covered here last June), a number of scheduler tweaks, the removal of the very last suser() call, and lots of fixes.

The current tree from Andrew Morton is 2.6.9-rc1-mm4. Recent changes to -mm include CacheFS (covered here last week), the removal of lockmeter (it got broken by the out-of-line spinlock patch), special code for handling misrouted interrupts on x86 systems, the new sysfs event layer patch (see below), and M32R architecture support.

The current 2.4 prepatch remains 2.4.28-pre2; no prepatches have been released since August 25.

Kernel development news

Figuring out kernel event reporting

Robert Love's kernel event notification patch was covered here last July. This patch enables the reporting of events to interested user-space software, which can then communicate with the user and generally respond to the events. As the Linux desktop projects become more capable and all-encompassing, they need to know more about what is going on with the system; the events layer is meant to be the mechanism which makes that information available.

Robert has recently posted a new version of the patch which changes the proposed interface significantly. It looks, however, like the patch will change yet again. As it turns out, there is still a fair amount of uncertainty about how best to represent and report kernel events.

The initial version of the patch required four pieces of information for each event: the type (a general class, like "hotplug"), the object generating the event, the signal (saying what is happening), and an explanatory string. The new version eliminates the descriptive string, and turns the object into a proper kobject, which will be communicated to user space as its location in sysfs. This interface is simpler, and it solves the problem of how to generate predictable and consistent object names, but there are still questions on how events should really be represented.

The easier part of the discussion has to do with the "type" parameter, which allows user-space applications to filter out events which will not be of interest. Kernel-generated events are expected to be relatively rare, however, so there will be little cost in simply receiving all of them and ignoring the uninteresting ones. So the type value associated with events may go away.

The more interesting question has to do with the representation of the "signal" parameter. That signal is currently a verb, describing something which has happened with the object of interest. If the object is a CPU, the signal might be "overheating". An alternative implementation would be to replace the signal with an attribute of the object; for a processor event, the temperature attribute would be passed. User space would then read the value of that attribute in sysfs to figure out what is really going on. This approach would force a structure onto the signal names, and would point user space to where it needs to go to learn more about what is going on. On the other hand, there may not always be attributes available to describe a given event, and the approach could be seen as overly restrictive.

Meanwhile, Greg Kroah-Hartman pointed out that the simplified send_kevent() interface strongly resembles another, existing kernel interface:

    int send_kevent(struct kobject *kobj, const char *signal);
    void kobject_hotplug(const char *action, struct kobject *kobj);

Given that kobject_hotplug() is also an event reporting mechanism, why not unify the two? The big difference, at this point, would seem to be that send_kevent() uses the netlink interface to communicate with user space, while the hotplug code runs /sbin/hotplug and passes the relevant information via the environment. Perhaps the best thing to do, says Greg, is to have the hotplug code also send a copy of its events via netlink, and to use that one interface for everything.

The idea of sending the same events out by way of two different transports does not appeal to many developers, however; it seems better to decide which is best and go with it. The netlink transport is strongly favored by the desktop crowd, which dislikes the unpredictable delays and ordering associated with event handling via /sbin/hotplug. On the other hand, netlink is not available early in the boot process, when it is still important to be able to handle hotplug events.

In the end, the hybrid approach may persist for some time. A future system might use /sbin/hotplug at boot time, then turn it off once everything is up and running. The one sure conclusion is that this is an area in need of further thought and experimentation.

NETIF_F_LLTX

One of the key network driver methods is called hard_start_xmit(); its job is to put a network packet onto the wire (or, at least, queue it for transmission). The networking subsystem protects calls to this method with a lock (xmit_lock) in the net_device structure so that only one call will be happening at any given time. This lock also protects a few configuration operations.

As it turns out, quite a few network drivers implement their own locking internally as well. There are contexts (such as in interrupt handlers) where the xmit_lock will not be held, so some other provision must be made for mutual exclusion. So the hard_start_xmit() method, in those drivers, is called with a redundant lock held. It all works, but it adds overhead to a performance-critical path.

Andi Kleen has put together a patch which addresses this duplicate locking. With this patch (which appears likely to be merged), drivers which do their own transmit locking can set the NETIF_F_LLTX "feature" flag. When a packet is to be handed to an interface with that flag set, no additional locking is performed by the networking code. As an added feature, the driver can attempt to take its internal lock with spin_trylock(), and immediately return -1 if that attempt fails; the networking subsystem will then retry the transmission later. In this way, the driver can avoid stalling the CPU while waiting for the lock; there should be, after all, no slowdown if the packet is added to the transmission ring a little bit later.
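
The trylock pattern described above can be sketched in user space. What follows is not code from Andi's patch or from any driver: it is a toy model in which a C11 atomic_flag stands in for the driver's private spinlock, and every name (toy_nic, toy_trylock(), toy_hard_start_xmit(), TOY_RING_SIZE) is invented for illustration. The only detail taken from the text is the convention of returning -1 when the lock cannot be taken; the "ring full" value of 1 is an assumption of this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_RING_SIZE 4

/* Toy stand-in for a driver's private state (hypothetical). */
struct toy_nic {
    atomic_flag tx_lock;          /* plays the role of the driver's own lock */
    int ring[TOY_RING_SIZE];      /* pretend transmission ring */
    int used;                     /* number of ring slots in use */
};

void toy_nic_init(struct toy_nic *nic)
{
    atomic_flag_clear(&nic->tx_lock);
    nic->used = 0;
}

/* Analogue of spin_trylock(): true if the lock was acquired, never waits. */
bool toy_trylock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

void toy_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/*
 * Toy transmit method: returns 0 if the packet was queued, -1 if the
 * lock was busy (the caller is expected to retry later), and 1 if the
 * ring is full (a convention of this sketch only).
 */
int toy_hard_start_xmit(struct toy_nic *nic, int packet)
{
    int ret = 0;

    if (!toy_trylock(&nic->tx_lock))
        return -1;                /* do not spin; let the caller retry */

    if (nic->used < TOY_RING_SIZE)
        nic->ring[nic->used++] = packet;
    else
        ret = 1;

    toy_unlock(&nic->tx_lock);
    return ret;
}
```

The point of the design is that the transmit path never waits: if some other context (an interrupt handler, say) holds the lock, the call fails immediately and the packet is simply queued a little later.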

Kswapd and high-order allocations

The core memory allocation mechanism inside the kernel is page-based; it will attempt to find a certain number of contiguous pages in response to a request (where "a certain number" is always a power of two). After the system has been running for a while, however, "higher-order" allocations requiring multiple contiguous pages become hard to satisfy. The virtual memory subsystem fragments physical memory to the point that the free pages tend to be separated from each other.
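
To make the power-of-two rule concrete, here is a minimal user-space sketch of how a request size maps to an allocation "order" (the exponent of the power of two). It assumes 4KB pages; TOY_PAGE_SIZE and toy_order_for_size() are names invented for this illustration and are not kernel interfaces.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL   /* assumption: 4KB pages */

/*
 * Return the smallest order such that 2^order pages cover "size" bytes.
 * Order 0 is a single page, order 1 is two contiguous pages, and so on.
 * Sizes close to the maximum value of size_t are not handled here.
 */
int toy_order_for_size(size_t size)
{
    int order = 0;
    size_t span = TOY_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

Under these assumptions, a 9000-byte buffer (a size chosen for illustration, not taken from the discussion) needs order 2: two pages are not enough, and the next available step is four contiguous pages.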

Curious readers can query /proc/buddyinfo to see how fragmented the currently free pages are. On a 1GB system, your editor currently sees the following:

      Node 0, zone   Normal 258 9 5 0 1 2 0 1 1 0 0

On this system, 258 single pages could be allocated immediately, but only nine contiguous pairs exist, and only five groups of four pages can be found. If something comes along which needs a lot of higher-order allocations, the available memory will be exhausted quickly, and those allocations may start to fail.
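
The arithmetic behind that reading can be sketched in a few lines of user-space C. Each column after the zone name counts the free blocks of one order, starting at order 0, and a block of order k holds 2^k pages. The helpers parse_buddyinfo_line() and total_free_pages() are hypothetical names written for this example; they are not part of any existing tool.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ORDERS 16

/*
 * Extract the per-order free-block counts from one /proc/buddyinfo line.
 * Returns the number of columns found, or -1 if the line has no "zone".
 */
int parse_buddyinfo_line(const char *line, unsigned long counts[MAX_ORDERS])
{
    const char *p = strstr(line, "zone");
    int n = 0;

    if (p == NULL)
        return -1;
    p += strlen("zone");
    while (*p == ' ' || *p == '\t')
        p++;
    /* Skip the zone name itself (e.g. "Normal"). */
    while (*p != '\0' && *p != ' ' && *p != '\t')
        p++;

    while (n < MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)
            break;                /* no more numbers on the line */
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Total free pages: counts[k] blocks of 2^k pages each. */
unsigned long total_free_pages(const unsigned long *counts, int n)
{
    unsigned long total = 0;

    for (int k = 0; k < n; k++)
        total += counts[k] << k;
    return total;
}
```

Applied to the line shown above, this yields eleven columns and a total of 760 free pages (about 3MB with 4KB pages), more than a third of which are isolated single pages.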

Nick Piggin has recently looked at this issue and found one area where improvements can be made. The problem is with the kswapd process, which is charged with running in the background and making free pages available to the memory allocator (by evicting user pages). The current kswapd code only looks at the number of free pages available; if that number is high enough, kswapd takes a rest regardless of whether any of those pages are contiguous with others or not. That can lead to a situation where high-order allocations fail, but the system is not making any particular effort to free more contiguous pages.

Nick's patch is fairly straightforward; it simply keeps kswapd from resting until a sufficient number of higher-order allocations are possible.
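
The difference between the two tests can be illustrated with a small sketch. This is not Nick's patch, and it says nothing about how the kernel's watermark logic is actually written; it is a simplified model in which free_blocks[k] holds the number of free blocks of order k (as in /proc/buddyinfo), and both function names are hypothetical.

```c
#include <stdbool.h>

/*
 * Count-only test, in the spirit of the behavior described above:
 * are there at least "watermark" free pages, contiguous or not?
 */
bool enough_free_pages(const unsigned long *free_blocks, int norders,
                       unsigned long watermark)
{
    unsigned long total = 0;

    for (int k = 0; k < norders; k++)
        total += free_blocks[k] << k;
    return total >= watermark;
}

/*
 * Order-aware test: could a request for 2^order contiguous pages be
 * satisfied right now? Any free block of that order or larger will do,
 * since a larger block can be split. Assumes order >= 0.
 */
bool order_available(const unsigned long *free_blocks, int norders, int order)
{
    for (int k = order; k < norders; k++)
        if (free_blocks[k] > 0)
            return true;
    return false;
}
```

On a fragmented system, the first test can pass comfortably while the second fails for even modest orders; that gap is the situation in which a background reclaim thread relying only on the first test would go to sleep too early.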

It has been pointed out, however, that the approach used by kswapd has not really changed: it chooses pages to free without regard to whether those pages can be coalesced into larger groups or not. As a result, it may have to free a great many pages before it, by chance, creates some higher-order groupings of pages. In prior kernels, no better approach was possible, but 2.6 includes the reverse-mapping code. With reverse mapping, it should be possible to target contiguous pages for freeing and vastly improve the system's performance in that area.

Linus's objection to this idea is that it overrides the current page replacement policy, which does its best to evict pages which, with luck, will not be needed in the near future. Changing the policy to target contiguous blocks would make higher-order allocations easier, but it could also penalize system performance as a whole by throwing out useful pages. So, says Linus, if a "defragmentation" mode is to be implemented at all, it should be run rarely and as a separate process.

The other approach to this problem is to simply avoid higher-order allocations in the first place. The switch to 4K kernel stacks was a step in this direction; it eliminated a two-page allocation for every process created. In current kernels, one of the biggest users of high-order allocations would appear to be high-performance network adapter drivers. These adapters can handle large packets which do not fit in a single page, so the kernel must perform multi-page allocations to hold those packets.

Actually, those allocations are only required when the driver (and its hardware) cannot handle "nonlinear" packets which are spread out in memory. Most modern hardware can do scatter/gather DMA operations, and thus does not care whether the packet is stored in a single, contiguous area of memory. Using the hardware's scatter/gather capabilities requires additional work when writing the driver, however, and, for a number of drivers, that work has not yet been done. Even so, addressing the high-order allocation problem from the demand side may prove to be far more effective than adding another objective to the page reclaim code.

Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds