The current development kernel is 2.5.59, which was released by Linus on
January 16.
It includes a number of architecture-specific updates, an
XFS update, support for the SHA-384 and SHA-512 algorithms in the crypto
API, a new NUMA scheduler (see below), and some sysfs work. The
long-format changelog has the details.
This will be the last release from Linus for a bit, since he will be
traveling through the end of the month. There are currently no additional
patches merged into his BitKeeper tree.
The current stable kernel is 2.4.20; Marcelo has not released any
2.4.21 prepatches since January 6.
Kernel development news
The O(1) scheduler was integrated relatively early in the 2.5 development
cycle with great results. So it could be a bit surprising to see a new set
of scheduler changes going in at this late, feature-frozen date. The
inclusion of a new NUMA scheduler in 2.5.59, however, is a relatively safe
move which will help Linux perform well on high-end systems.
NUMA (non-uniform memory access) systems, of course, are distinguished by
an architecture which makes some memory "closer" to certain processors than
others. Each "node" in a NUMA system contains one or more processors,
along with an array of local memory. Processors can access memory belonging
to other nodes, but that access will be relatively slow. To get top (or
even reasonable) performance on NUMA systems, the kernel must keep each
process - and its memory - within a single node whenever possible.
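The cost asymmetry described above can be captured in a few lines. This is a toy model, not kernel code; the two-node topology and the cycle counts are invented for illustration:

```c
/* Toy model of NUMA access cost: memory on the local node is cheap to
 * reach, memory on a remote node is slower.  Node layout and latency
 * values are hypothetical. */
#include <assert.h>

#define NR_NODES       2
#define CPUS_PER_NODE  2

#define LOCAL_COST   1          /* made-up latency units */
#define REMOTE_COST  4

static int cpu_to_node(int cpu)
{
        return cpu / CPUS_PER_NODE;   /* CPUs 0,1 -> node 0; 2,3 -> node 1 */
}

/* Cost for 'cpu' to touch a page that lives on 'page_node'. */
static int access_cost(int cpu, int page_node)
{
        return cpu_to_node(cpu) == page_node ? LOCAL_COST : REMOTE_COST;
}
```

A scheduler that keeps a process on the node holding its memory keeps every access at `LOCAL_COST`; migrating it turns every access into `REMOTE_COST`.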
The memory allocation side has been in place for some time; the Linux
kernel memory allocator sets up one or more zones for each node, and
allocates new pages from the current node's zones whenever possible. But
the scheduler, as found in 2.5.58, will happily move processes between
nodes in its efforts to keep all processors busy. There has been a NUMA
scheduler patch floating around for a while, but it has not been merged,
perhaps because it made too many changes to the scheduler for non-NUMA
systems.
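The node-local allocation policy mentioned above (allocate from the current node's zones, fall back to other nodes) can be sketched as follows. This is a minimal stand-in for the real zone-list mechanism, with a toy per-node free-page count:

```c
/* Sketch (not the kernel's actual allocator) of node-local allocation
 * with fallback: try the current node first, then the others. */
#include <assert.h>

#define NR_NODES 2

/* Toy free-page counts per node, chosen for illustration. */
static int free_pages[NR_NODES] = { 2, 1 };

/* Returns the node a page was taken from, or -1 if memory is gone. */
static int alloc_page_node(int current_node)
{
        int i;

        for (i = 0; i < NR_NODES; i++) {
                int node = (current_node + i) % NR_NODES;  /* local first */
                if (free_pages[node] > 0) {
                        free_pages[node]--;
                        return node;
                }
        }
        return -1;
}
```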
More recently, the NUMA scheduler patch has been reworked (by Martin Bligh,
Erich Focht, Michael Hohnbaum, and others) around a simple observation:
most of the NUMA problems can be solved by simply restricting the current
scheduler's balancing code to processors within a single node. If the
rebalancer - which moves processes across CPUs in order to keep them all
busy - only balances inside a node, the worst processor imbalances will be
addressed without moving processes into a foreign-node slow zone.
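The within-node restriction amounts to one extra test in the "find the busiest runqueue" loop. A hedged sketch, with an invented four-CPU, two-node topology and per-CPU loads:

```c
/* Sketch of within-node rebalancing: when looking for the busiest CPU
 * to steal processes from, skip CPUs on other nodes.  Topology and
 * load values are hypothetical. */
#include <assert.h>

#define NR_CPUS        4
#define CPUS_PER_NODE  2

static int cpu_to_node(int cpu) { return cpu / CPUS_PER_NODE; }

static const int load[NR_CPUS] = { 1, 5, 9, 2 };  /* toy runqueue lengths */

/* Return the most loaded CPU on this_cpu's own node (possibly itself). */
static int busiest_cpu_in_node(int this_cpu, const int load[NR_CPUS])
{
        int busiest = this_cpu;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (cpu_to_node(cpu) != cpu_to_node(this_cpu))
                        continue;       /* the within-node restriction */
                if (load[cpu] > load[busiest])
                        busiest = cpu;
        }
        return busiest;
}
```

With the loads above, CPU 0 ignores the much busier CPU 2 because it sits on the other node, and balances against CPU 1 instead.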
A simple (three-line) patch which did nothing but add the within-node
restriction yielded most of the benefits of the full NUMA scheduler;
indeed, it performed better on some benchmarks. Real-world loads, however,
will require a scheduler which can distribute processes evenly across
nodes. It is occasionally even necessary to move processes to a slower
node; getting plenty of CPU time on a lightly-loaded node will give better
performance than waiting in the run queue on a heavily-loaded node. So a
bit of complexity had to be added back into the new scheduler to complete
the job.
The 2.5.59 scheduler distributes processes across NUMA nodes in two
places. The first is in the exec() system call. A process which
calls exec() is very simple to move, since almost all of its
context, including memory, is being thrown away. For many loads, proper
balancing at exec() time is enough to get good performance.
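Since an exec()ing process carries almost no context, the placement decision reduces to "pick the least loaded node." A minimal sketch; the helper name and load model are invented:

```c
/* Sketch of exec()-time balancing: place a freshly exec()ed process on
 * the least loaded node.  'sched_best_node' is a hypothetical name,
 * and the per-node load values are toy data. */
#include <assert.h>

#define NR_NODES 3

static const int node_load[NR_NODES] = { 4, 1, 3 };  /* toy loads */

/* Pick a home node for a process that has just called exec(). */
static int sched_best_node(const int node_load[NR_NODES])
{
        int best = 0, node;

        for (node = 1; node < NR_NODES; node++)
                if (node_load[node] < node_load[best])
                        best = node;
        return best;
}
```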
Some loads, however, will tend to pile up processes within a single node.
Any process which forks many times, for example, will find itself competing
with all of its children for the same node's resources (unless, of course,
those children call exec() and are moved to a new node). To
address this problem, the new NUMA scheduler will occasionally look for a
large load imbalance between nodes, and, if one is found, move processes to
balance things out. This rebalancing happens once for every ten or hundred
intra-node rebalancings, depending on the architecture.
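The "one inter-node pass per N intra-node passes" policy is just a modulo test on a counter. A sketch, using 10 as an example ratio (the actual value is architecture-dependent):

```c
/* Sketch of periodic cross-node balancing: only every Nth rebalance
 * pass also considers other nodes.  The ratio of 10 is illustrative;
 * 2.5.59 picks it per architecture. */
#include <assert.h>

#define NODE_REBALANCE_RATIO 10

/* Given the number of rebalance passes so far, should this pass also
 * look for imbalances between nodes? */
static int should_balance_across_nodes(unsigned int count)
{
        return count % NODE_REBALANCE_RATIO == 0;
}
```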
The scheduler has seen continued tweaking since 2.5.59 came out. The most
significant change, perhaps, is to move the explicit load balancing out of
the main scheduler code (where it could get called many times per second on
an idle processor) and to restrict it to the scheduler's "timer tick"
routine. That change allows more exact control over when the rebalancings
happen. A recent patch from Ingo Molnar performs fairly frequent
rebalancings (intra-node every 1ms, and globally every 2ms) when the
current processor is idle; if the processor is busy the rebalancings only
happen every 200ms (local) and 400ms (global).
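The interval selection described above (short intervals when idle, long when busy) can be sketched directly from those numbers:

```c
/* Sketch of the timer-tick rebalancing intervals described above:
 * frequent when the CPU is idle, rare when it is busy.  Values are
 * those quoted for Ingo Molnar's patch, in milliseconds. */
#include <assert.h>

#define IDLE_LOCAL_MS     1
#define IDLE_GLOBAL_MS    2
#define BUSY_LOCAL_MS   200
#define BUSY_GLOBAL_MS  400

/* How long until this CPU's next (local or global) rebalance pass? */
static int rebalance_interval_ms(int cpu_is_idle, int global)
{
        if (cpu_is_idle)
                return global ? IDLE_GLOBAL_MS : IDLE_LOCAL_MS;
        return global ? BUSY_GLOBAL_MS : BUSY_LOCAL_MS;
}
```

An idle processor thus rechecks for work every millisecond or two, while a busy one rebalances only a few times per second.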
Linus raised an interesting point when he
merged the NUMA scheduler: can this scheduler handle hyperthreading as
well? Hyperthreaded processors implement two (or more) virtual CPUs on the
same physical processor; one processor can be running while the other waits
for memory access. Hyperthreading can certainly be seen as a sort of NUMA
system, since the sibling processors share a cache and thus have faster
access to memory that either one has accessed recently. So the same
algorithm should really work in this case.
Treating hyperthreaded systems as NUMA systems has a certain conceptual
elegance, but it may not be the way the Linux kernel goes in the end. The
most recent hyperthreading patch from Ingo
Molnar takes a different approach: rather than mess with "rebalancing"
processes across the same physical processor, why not just use the same run
queue for both? Sibling processors on a hyperthreaded core are truly
equivalent; it does not matter which process runs on which virtual
processor as long as they are all busy. So NUMA and hyperthreading may
stay as distinct cases for now.
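The shared-runqueue idea reduces to mapping each virtual CPU to its physical core's queue, so there is nothing between siblings to balance. A sketch with an invented two-way-hyperthreaded topology:

```c
/* Sketch of the shared-runqueue approach to hyperthreading: both
 * virtual CPUs of a physical core index the same runqueue, so no
 * sibling-to-sibling rebalancing is ever needed.  Topology is toy. */
#include <assert.h>

#define NR_VIRT_CPUS  4
#define SIBLINGS      2         /* virtual CPUs per physical core */

/* Map a virtual CPU to the runqueue of its physical core. */
static int cpu_to_runqueue(int vcpu)
{
        return vcpu / SIBLINGS;
}
```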
One of the things that has been on the 2.5 "to do" list since before there
was a 2.5 is the expansion of the dev_t
type to 32 bits.
dev_t, of course, is currently a 16-bit value holding the
eight-bit major and minor device numbers. The small size of the device
number fields has been a constraining factor for people building systems
with thousands of devices for some time; it had been pretty well assumed
that it would be expanded in this development cycle.
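The packing in question looks like this. The 16-bit layout is the real 2.4/2.5 one; the 32-bit split shown (12-bit major, 20-bit minor) is only one possible layout under discussion, not anything merged at the time of writing:

```c
/* The classic 16-bit dev_t: eight bits of major, eight of minor. */
#include <assert.h>

#define OLD_MAJOR(dev)  (((dev) >> 8) & 0xff)
#define OLD_MINOR(dev)  ((dev) & 0xff)

/* One hypothetical expanded 32-bit layout: 12-bit major, 20-bit
 * minor.  The actual split was still an open question. */
#define NEW_MKDEV(ma, mi)  (((ma) << 20) | (mi))
#define NEW_MAJOR(dev)     ((dev) >> 20)
#define NEW_MINOR(dev)     ((dev) & 0xfffff)
```

Under the old scheme, device 0x0341 is major 3, minor 0x41, and no major or minor can exceed 255; the expanded form allows thousands of majors and over a million minors per major.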
Almost three months into the feature freeze, the dev_t expansion
is nowhere in sight. It remains necessary, however; consider this statement from Alan Cox:
32bit dev_t IMHO is essential to 2.6. Essential enough that if its
not in the base 2.6 all the vendors have to get together and issue
a Linus incompatible but common 32bit dev_t interface.
32-bit dev_t as an added vendor patch would make for a big
difference between the Linus kernel tree and that which is shipped by the
distributors. But large distributor patches to the kernel are not that
uncommon. The real issue here is that no 32-bit dev_t patch has
been posted - whether for integration or not.
Expanding dev_t is not a trivial task. The interface with user
space must be handled carefully to avoid breaking older applications. The
kernel currently tracks devices through the static blkdevs and
chrdevs arrays, which are indexed by the major device number.
This approach works when there are only 256 possible device numbers, but
falls apart when you can have thousands of them. And, despite a
continued effort to stamp them out, there are, beyond doubt, many places in
the kernel which assume implicitly that device numbers are eight bits
wide.
So the dev_t expansion will be somewhat invasive and
destabilizing - though certainly achievable. It really should happen
sooner rather than later. If a larger dev_t is indeed to be
part of the 2.6 kernel actually seen by customers, then this work is one of
the factors delaying the 2.6 release.
Patches and updates
Core kernel code
- Cliff White: Re-aim-7.
(January 22, 2003)
Page editor: Jonathan Corbet