Kernel development

Brief items

Kernel release status

The current development kernel is 2.5.59, which was released by Linus on January 16. It includes a number of architecture-specific updates, an XFS update, support for the SHA-384 and SHA-512 algorithms in the crypto API, a new NUMA scheduler (see below), and some sysfs work. The long-format changelog has the details.

This will be the last release from Linus for a bit, since he will be traveling through the end of the month. There are currently no additional patches merged into his BitKeeper tree.

The current stable kernel is 2.4.20; Marcelo has not released any 2.4.21 prepatches since January 6.

Kernel development news

The new NUMA scheduler

The O(1) scheduler was integrated relatively early in the 2.5 development cycle with great results. So it could be a bit surprising to see a new set of scheduler changes going in at this late, feature-frozen date. The inclusion of a new NUMA scheduler in 2.5.59, however, is a relatively safe move which will help Linux perform well on high-end systems.

NUMA (non-uniform memory access) systems, of course, are distinguished by an architecture which makes some memory "closer" to certain processors than others. Each "node" in a NUMA system contains one or more processors, along with an array of local memory. Processors can access memory belonging to other nodes, but that access will be relatively slow. To get top (or even reasonable) performance on NUMA systems, the kernel must keep each process - and its memory - within a single node whenever possible.

The memory allocation side has been in place for some time; the Linux kernel memory allocator sets up one or more zones for each node, and allocates new pages from the current node's zones whenever possible. But the scheduler, as found in 2.5.58, will happily move processes between nodes in its efforts to keep all processors busy. There has been a NUMA scheduler patch floating around for a while, but it has not been merged, perhaps because it made too many changes to the scheduler for non-NUMA systems.

More recently, the NUMA scheduler patch has been reworked (by Martin Bligh, Erich Focht, Michael Hohnbaum, and others) around a simple observation: most of the NUMA problems can be solved by simply restricting the current scheduler's balancing code to processors within a single node. If the rebalancer - which moves processes across CPUs in order to keep them all busy - only balances inside a node, the worst processor imbalances will be addressed without moving processes into a foreign-node slow zone.
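The within-node restriction can be pictured with a small user-space sketch. This is not the kernel's code: the fixed topology, the sketch_cpu_to_node() helper, and the array of run-queue lengths are all assumptions made for illustration. The only point is the extra test that skips CPUs on other nodes when looking for a busy run queue to pull from.

```c
#include <assert.h>

#define NR_CPUS       8   /* hypothetical machine: eight CPUs ... */
#define CPUS_PER_NODE 4   /* ... arranged as two nodes of four */

/* Hypothetical mapping; a real system would ask the platform topology. */
static int sketch_cpu_to_node(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/*
 * Return the CPU in this_cpu's node with the longest run queue, provided
 * that queue is longer than this_cpu's own; return -1 if there is none.
 */
static int find_busiest_in_node(const int nr_running[NR_CPUS], int this_cpu)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);

    int node = sketch_cpu_to_node(this_cpu);
    int busiest = -1;
    int max_load = nr_running[this_cpu];

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        /* The within-node restriction: ignore CPUs on foreign nodes. */
        if (cpu == this_cpu || sketch_cpu_to_node(cpu) != node)
            continue;
        if (nr_running[cpu] > max_load) {
            max_load = nr_running[cpu];
            busiest = cpu;
        }
    }
    return busiest;
}
```

With this check in place, a heavily loaded CPU on another node is never chosen as a source, so no process is pulled away from its local memory by ordinary balancing.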

A simple (three-line) patch which did nothing but add the within-node restriction yielded most of the benefits of the full NUMA scheduler; indeed, it performed better on some benchmarks. Real-world loads, however, will require a scheduler which can distribute processes evenly across nodes. Occasionally it is even necessary to move a process to a node where its memory accesses will be slower; plenty of CPU time on a lightly-loaded node will give better performance than waiting in the run queue on a heavily-loaded node. So a bit of complexity had to be added back into the new scheduler to complete the job.

The 2.5.59 scheduler distributes processes across NUMA nodes in two places. The first is in the exec() system call. A process which calls exec() is very simple to move, since almost all of its context, including memory, is being thrown away. For many loads, proper balancing at exec() time is enough to get good performance.
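As a rough illustration of exec()-time balancing, the sketch below picks a destination node for a process that is about to call exec(). The selection rule (fewest runnable processes, staying put on ties) is an assumption; the exact criterion used by 2.5.59 is not described here, and NR_NODES and the function name are hypothetical.

```c
#include <assert.h>

#define NR_NODES 4   /* hypothetical node count */

/*
 * Choose a node for a process that is about to exec(). The process stays
 * on current_node unless some other node is strictly less loaded.
 */
static int pick_node_for_exec(const int node_nr_running[NR_NODES],
                              int current_node)
{
    assert(current_node >= 0 && current_node < NR_NODES);

    int best = current_node;

    for (int node = 0; node < NR_NODES; node++) {
        if (node_nr_running[node] < node_nr_running[best])
            best = node;
    }
    return best;
}
```

Because the old address space is discarded at exec() time, moving the process here costs little: its new memory will simply be allocated on the chosen node.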

Some loads, however, will tend to pile up processes within a single node. Any process which forks many times, for example, will find itself competing with all of its children for the same node's resources (unless, of course, those children call exec() and are moved to a new node). To address this problem, the new NUMA scheduler will occasionally look for a large load imbalance between nodes, and, if one is found, move processes to balance things out. This rebalancing happens once for every ten or hundred intra-node rebalancings, depending on the architecture.
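The "once for every ten or hundred" rule amounts to a simple counter. A minimal sketch, assuming a ratio of ten and a hypothetical per-CPU state structure (neither name comes from the kernel source):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ratio; the text says ten or one hundred, by architecture. */
#define NODE_REBALANCE_RATIO 10

struct balance_state {
    unsigned int intra_node_runs;   /* passes since the last cross-node pass */
};

/*
 * Called once per rebalancing pass. Returns true when this pass should
 * also look for a load imbalance between nodes.
 */
static bool should_balance_across_nodes(struct balance_state *st)
{
    assert(st != NULL);

    if (++st->intra_node_runs < NODE_REBALANCE_RATIO)
        return false;
    st->intra_node_runs = 0;
    return true;
}
```

Keeping the cross-node check rare means the common case stays as cheap as the plain within-node balancer.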

The scheduler has seen continued tweaking since 2.5.59 came out. The most significant change, perhaps, is to move the explicit load balancing out of the main scheduler code (where it could get called many times per second on an idle processor) and to restrict it to the scheduler's "timer tick" routine. That change allows more exact control over when the rebalancings happen. A recent patch from Ingo Molnar performs fairly frequent rebalancings (intra-node every 1ms, and globally every 2ms) when the current processor is idle; if the processor is busy, the rebalancings only happen every 200ms (local) and 400ms (global).
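The tick-driven intervals can be sketched as follows. This is only an illustration of the timing described above, not the patch itself: it assumes one timer tick per millisecond, and the modulo test, the enum names, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Intervals in milliseconds, as quoted in the text. */
enum {
    IDLE_NODE_MS = 1,   IDLE_GLOBAL_MS = 2,
    BUSY_NODE_MS = 200, BUSY_GLOBAL_MS = 400,
};

enum balance_kind { BALANCE_NONE, BALANCE_NODE, BALANCE_GLOBAL };

/*
 * Decide what kind of rebalancing, if any, is due on this tick.
 * now_ms is the tick count, assuming one tick per millisecond.
 */
static enum balance_kind rebalance_on_tick(unsigned long now_ms, bool cpu_idle)
{
    unsigned long node_ms   = cpu_idle ? IDLE_NODE_MS   : BUSY_NODE_MS;
    unsigned long global_ms = cpu_idle ? IDLE_GLOBAL_MS : BUSY_GLOBAL_MS;

    /* In this sketch every global pass falls on a node-pass tick. */
    assert(global_ms % node_ms == 0);

    if (now_ms % global_ms == 0)
        return BALANCE_GLOBAL;
    if (now_ms % node_ms == 0)
        return BALANCE_NODE;
    return BALANCE_NONE;
}
```

Driving the decision from the tick makes the rebalancing rate a function of elapsed time rather than of how often an idle processor happens to enter the scheduler.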

Linus raised an interesting point when he merged the NUMA scheduler: can this scheduler handle hyperthreading as well? Hyperthreaded processors implement two (or more) virtual CPUs on the same physical processor; one processor can be running while the other waits for memory access. Hyperthreading can certainly be seen as a sort of NUMA system, since the sibling processors share a cache and thus have faster access to memory that either one has accessed recently. So the same algorithm should really work in this case.

Treating hyperthreaded systems as NUMA systems has a certain conceptual elegance, but it may not be the way the Linux kernel goes in the end. The most recent hyperthreading patch from Ingo Molnar takes a different approach: rather than mess with "rebalancing" processes across the same physical processor, why not just use the same run queue for both? Sibling processors on a hyperthreaded core are truly equivalent; it does not matter which process runs on which virtual processor as long as they are all busy. So NUMA and hyperthreading may stay as distinct cases for now.

What ever happened to 32-bit dev_t?

One of the things that has been on the 2.5 "to do" list since before there was a 2.5 is expanding the dev_t type to 32 bits. dev_t, of course, is currently a 16-bit value holding the eight-bit major and minor device numbers. The small size of the device number fields has been a constraining factor for people building systems with thousands of devices for some time; it had been pretty well assumed that it would be expanded in this development cycle.
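For readers who have not looked at the layout, the 16-bit format puts the major number in the high byte and the minor number in the low byte. The user-space sketch below follows the same layout as the kernel's MKDEV, MAJOR, and MINOR macros, but the type and function names here are hypothetical stand-ins rather than the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t old_dev_t;   /* hypothetical name for the 16-bit dev_t */

/* Pack an eight-bit major and an eight-bit minor into 16 bits. */
static old_dev_t make_dev(unsigned int major, unsigned int minor)
{
    assert(major <= 0xff && minor <= 0xff);   /* nothing larger fits */
    return (old_dev_t)((major << 8) | minor);
}

static unsigned int dev_major(old_dev_t dev)
{
    return dev >> 8;
}

static unsigned int dev_minor(old_dev_t dev)
{
    return dev & 0xff;
}
```

The assertion marks the limit that is causing trouble: with eight bits each, there is no way to express more than 256 majors or more than 256 minors per major.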

Almost three months into the feature freeze, the dev_t expansion is nowhere in sight. It remains necessary, however; consider this statement from Alan Cox:

32bit dev_t IMHO is essential to 2.6. Essential enough that if its not in the base 2.6 all the vendors have to get together and issue a Linus incompatible but common 32bit dev_t interface.

32-bit dev_t as an added vendor patch would make for a big difference between the Linus kernel tree and that which is shipped by the distributors. But large distributor patches to the kernel are not that uncommon. The real issue here is that no 32-bit dev_t patch has been posted - whether for integration or not.

Expanding dev_t is not a trivial task. The interface with user space must be handled carefully to avoid breaking older applications. The kernel currently tracks devices through the static blkdevs and chrdevs arrays, which are indexed by the major device number. This approach works when there are only 256 possible major numbers, but falls apart when you can have thousands of them. And, despite a continued effort to stamp them out, there are, beyond doubt, many places in the kernel which assume implicitly that device numbers are eight bits wide.
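The scaling problem with a major-indexed static array is easy to see in miniature. The sketch below is a simplified user-space model, not the kernel's chrdevs code; the structure, table, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MAJOR 256   /* one slot per possible eight-bit major number */

struct driver_slot {
    const char *name;   /* NULL while the slot is free */
};

static struct driver_slot chr_table[MAX_MAJOR];

/* Claim a major number; returns 0 on success, -1 if out of range or taken. */
static int register_driver(unsigned int major, const char *name)
{
    assert(name != NULL);

    if (major >= MAX_MAJOR)
        return -1;      /* a wider major number has nowhere to go */
    if (chr_table[major].name != NULL)
        return -1;
    chr_table[major].name = name;
    return 0;
}

/* Direct indexing makes lookup trivial, but only for small major numbers. */
static const char *lookup_driver(unsigned int major)
{
    return major < MAX_MAJOR ? chr_table[major].name : NULL;
}
```

Simply growing the table to cover a much wider major number would waste memory on mostly empty slots, which is one reason a larger dev_t calls for a different lookup structure rather than a bigger array.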

So the dev_t expansion will be somewhat invasive and destabilizing - though certainly achievable. It really should happen sooner rather than later. If it is true that a larger dev_t will be a part of the 2.6 kernel actually seen by customers, then this work is one of the factors which is delaying the 2.6 release.

Patches and updates

Kernel trees

Stephen Hemminger: 2.5.58-dcl1
Andrea Arcangeli: 2.4.21pre3aa1

Architecture-specific

Build system

Core kernel code

Development tools

John Bradford: Kernel Bug Database 2.0

Device drivers

Documentation

Memory management

Andrew Morton: 2.5.59-mm1
Andrew Morton: 2.5.59-mm2

Security-related

Stephen D. Smalley: Add LSM sysctl hook to 2.5.59
Stephen D. Smalley: Add LSM syslog hook to 2.5.59
Chris Wright: 2.5.59-lsm1

Miscellaneous

Page editor: Jonathan Corbet
Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds