Brief items
The current 2.6 development kernel remains 2.6.27-rc3. There have
been a lot of patches merged into the mainline repository, however, and the
2.6.27-rc4 release can be expected at almost any time. Along with all the
fixes, 2.6.27-rc4 will add support for the multitouch trackpad on new Apple
laptops, more reshuffling of architecture-specific include files, a number
of XFS improvements, interrupt stacks for the SPARC64 architecture, the
removal of the obsolete Auerswald USB sound driver, and new drivers for TI
TUSB 6010 USB controllers, Inventra HDRC USB controllers, and National
Semiconductor adcxx8sxxx analog-to-digital converters.
The current stable 2.6 kernel is 2.6.26.3; it was released (along with 2.6.25.16) on August 20.
Both updates contain a large number of fixes for a wide variety of serious
problems.
Kernel development news
This isn't the first time that I've seen kernel developers claim
that it's better to work around the kernel in userspace than it is
to fix it. I could understand this if we didn't have the source
code to our own kernel, but we do.
The kernel isn't sacred and it isn't a separate part of the
system. It needs to be seen as just one component of a fully
integrated system, especially by its developers.
-- Scott James Remnant
Our (complexity(config system) * complexity(header files)) is so
large that compilation testing doesn't prove anything useful. You
just have to check your homework very carefully and don earplugs
for the inevitable explosions.
-- Andrew Morton
Guys, please: regressions are serious, top-priority emergencies.
We drop everything and run around with our hair on fire when we
hear about one (don't we?). Please, if you have a report of a
regression or a fix for one, Cc: everyone in the world on it.
-- Andrew Morton
As it is, it seems like some people think that the merge window is
when you send any random crap that hasn't even been tested, and
then after the merge window you send the stuff that looks
"obviously good".
How about raising your quality control a bit, so that I don't have
to berate you? Send the _obviously good_ stuff during the merge
window, and don't send the "random crap" AT ALL. And then, during
the -rc series, you don't do any "obviously good" stuff at all, but
you do the "absolutely required" stuff.
-- Linus Torvalds
By Jonathan Corbet
August 18, 2008
Kernel code must often wait for something to happen elsewhere in the
system. The preferred way to wait is to use any of a number of interfaces
to wait queues, allowing the processor to perform other tasks in the
meantime. If the kernel code in question is running in an atomic mode, though,
it cannot block, so the use of wait queues is not an option.
Traditionally, in such situations, the programmer must simply code a busy
wait which sits in a tight loop until the required event takes place.
Busy waits are always undesirable, but, in some situations, they become
even more so. If the wait is going to be relatively long, it would be
better to put the processor into a lower power state. After all, nobody
cares if it executes its empty loop at full speed, or, even, whether the
loop executes at all. If the wait is running within a virtualized guest,
the situation can be even worse: by looping in the processor, a busy wait
can actively prevent the running of the code which will eventually provide
the event which is being waited for. In a virtualized environment, it is
far better to simply suspend the virtual system altogether than to let it
busy wait.
Jeremy Fitzhardinge has proposed a solution to this problem in the form of
the trigger API. A trigger
can be thought of as a special type of continuation intended for use in a
specific environment: situations where preemption is disabled and sleeping
is not possible, but where it is necessary to wait for an external event.
A trigger is set up in either of the two usual patterns:
    #include <linux/trigger.h>

    DEFINE_TRIGGER(my_trigger);
    /* ... or ... */
    trigger_t my_trigger;
    trigger_init(&my_trigger);
There is a sequence of calls which must be made by code intending to
wait for a trigger:
    trigger_reset(&my_trigger);
    while (!condition)
        trigger_wait(&my_trigger);
    trigger_finish(&my_trigger);
Triggers are designed to be safe against race conditions, in that if a
trigger is fired after the trigger_reset() call, the subsequent
trigger_wait() call will return immediately. As with any such
primitive, false "wakeups" are possible, so it is necessary to check for
the condition being waited for and wait again if need be.
Code which wishes to signal completion to a thread waiting on a trigger
need only make a call to:
    void trigger_kick(trigger_t *trigger);
This code should, of course, ensure that the waiting thread will see that
the resource it was waiting for is available before calling
trigger_kick().
A reader of the generic implementation of triggers may be forgiven for
wondering what the point is; most of the functions are empty, and
trigger_wait() turns into a call to cpu_relax(). In
other words, it's still a busy wait, just like before except that now it's
hidden behind a set of trigger functions. The idea, of course, is that
better versions of these functions can be defined in architecture-specific
code.
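To make the generic case concrete, here is a minimal user-space sketch of what such a fallback might look like. The function names come from the description above; everything else (the empty trigger_t, the spin counter inside the cpu_relax() stand-in, and the fake_device which becomes ready after a fixed number of polls) is an assumption made for illustration and is not taken from Jeremy's patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for a generic (fallback) trigger implementation.
 * Names mirror the article; the bodies are assumptions, not the patch. */
typedef struct { int unused; } trigger_t;

static unsigned long spin_count;        /* how many times we "relaxed" */

/* Stand-in for the kernel's cpu_relax(); here it only counts spins. */
static inline void cpu_relax(void)
{
    spin_count++;
}

/* In the generic case almost everything is a no-op. */
static inline void trigger_init(trigger_t *t)   { (void)t; }
static inline void trigger_reset(trigger_t *t)  { (void)t; }
static inline void trigger_finish(trigger_t *t) { (void)t; }
static inline void trigger_kick(trigger_t *t)   { (void)t; }

/* Waiting is one pass through a busy loop. */
static inline void trigger_wait(trigger_t *t)
{
    (void)t;
    cpu_relax();
}

/* Hypothetical device: reports "ready" on the Nth poll. */
struct fake_device { int polls_until_ready; };

static bool device_ready(struct fake_device *dev)
{
    return --dev->polls_until_ready <= 0;
}

/* Wait for the device using the reset/wait/finish sequence.
 * Returns the number of times the loop had to spin. */
static unsigned long wait_for_device(struct fake_device *dev)
{
    trigger_t trig;

    assert(dev != NULL);
    trigger_init(&trig);
    spin_count = 0;

    trigger_reset(&trig);
    while (!device_ready(dev))
        trigger_wait(&trig);
    trigger_finish(&trig);

    return spin_count;
}
```

A device which needs five polls causes wait_for_device() to spin four times; the caller's code would not change if trigger_wait() were replaced by something which halts the processor instead.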
If the target architecture is actually a virtual machine environment, for
example, a trigger can simply suspend the execution of the machine altogether. To
that end, there is a new set of paravirt_ops allowing hypervisors to
implement the trigger operations.
Jeremy has also created an implementation for the x86 architecture which
uses the relatively new monitor and mwait instructions.
In this implementation, a trigger is a simple integer variable. A call to
trigger_reset() turns into a monitor instruction,
informing the processor that it should watch out for changes to that
integer variable. The mwait instruction built into
trigger_wait() halts the processor until the monitored variable is
written to. No more busy waiting is required.
There is a certain elegance to the monitor/mwait
implementation, but Arjan van de Ven worries that it may prove to be too slow. So
changes to the x86 implementation are possible. There have not been a lot
of comments about the API itself, though, so the trigger functions may well
make it into the mainline in something close to their current form.
By Jonathan Corbet
August 19, 2008
Whenever a Linux system communicates with the rest of the world, it must
follow a whole set of rules on how that communication is done. Basic
TCP/IP networking would work poorly indeed in the absence of an observed
agreement on how the networking medium should be used. Wireless networking
has all of those constraints, plus a set of its own. Since wireless
interfaces are radios, they must conform to rules on the frequencies they
can use, how much power they may emit, and so on. If all goes well, Linux
will finally have a centralized mechanism for ensuring that wireless
devices are operated according to that wider set of rules.
Regulations on radio transmissions bring some extra challenges. They are
legal code, so their violation can bring users, vendors, and distributors
into unwanted conversations with representatives of spectrum enforcement
agencies. The legal code is inherently local, while wireless devices are
inherently mobile, so those devices must be able to modify their behavior
to match different sets of rules at different times. And some wireless
devices can be programmed in quite flexible ways; they can be operated far
outside of their allowed parameters. The possibility that one of these
devices could be configured - accidentally or intentionally - in a way
which interferes with other uses of the spectrum is very real.
The potential for legal problems associated with wireless interfaces has
cast a shadow over Linux for a while. Some vendors have used it as an
excuse for their failure to provide free drivers. Others (Intel, for
example) have reworked their hardware to lock up regulatory compliance
safely within the firmware. And still, vendors and Linux distributors have
worried about what kind of sanctions might come down if Linux systems are
seen to be operating in violation of the law somewhere on the planet.
Despite all that, the Linux kernel has no central mechanism for ensuring
regulatory compliance; it is up to individual drivers to make sure that
their hardware does not break the rules. This situation may be about to change,
though, as the Central Regulatory Domain Agent (CRDA) patch set, currently
being developed by Luis Rodriguez, approaches readiness.
At the core of CRDA is struct ieee80211_regdomain, which describes
the rules associated with a given legal regime. It is a somewhat
complicated structure, but its contents are relatively simple to
understand. They include a set of allowable frequency ranges; for each
range, the maximum bandwidth, allowable power, and antenna gain are
listed. There's also a set of flags for special rules; some domains, for
example, do not allow outdoor operation or certain types of modulation.
Each domain is associated with a two-letter identifying code which,
normally, is just a country code.
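The shape of such a description can be sketched in a few lines of C. What follows is not the layout of struct ieee80211_regdomain; the structure names, field names, units, the REG_SKETCH_NO_OUTDOOR flag, and the tx_allowed() helper are all hypothetical, and the numbers in the "XX" example domain are placeholders rather than any country's real limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: names and units are assumptions, not the kernel's. */
#define REG_SKETCH_NO_OUTDOOR (1u << 0)

struct reg_rule_sketch {
    uint32_t start_khz;             /* lower edge of the allowed range */
    uint32_t end_khz;               /* upper edge of the allowed range */
    uint32_t max_bandwidth_khz;
    int32_t  max_power_mbm;         /* 100 * dBm */
    int32_t  max_antenna_gain_mbi;  /* carried along, not checked below */
    uint32_t flags;
};

struct regdomain_sketch {
    char alpha2[3];                 /* two-letter code plus NUL */
    size_t n_rules;
    const struct reg_rule_sketch *rules;
};

/* Placeholder values for a made-up domain "XX". */
static const struct reg_rule_sketch example_rules[] = {
    { 2400000, 2483500, 40000, 2000, 300, 0 },
    { 5150000, 5250000, 40000, 1700, 300, REG_SKETCH_NO_OUTDOOR },
};

static const struct regdomain_sketch example_domain = {
    "XX", 2, example_rules
};

/* Would this transmission fit within some rule of the domain? */
static bool tx_allowed(const struct regdomain_sketch *rd,
                       uint32_t center_khz, uint32_t bandwidth_khz,
                       int32_t power_mbm, bool outdoor)
{
    uint32_t half = bandwidth_khz / 2;
    size_t i;

    assert(rd != NULL);
    if (half > center_khz)
        return false;               /* channel would extend below 0 Hz */

    for (i = 0; i < rd->n_rules; i++) {
        const struct reg_rule_sketch *r = &rd->rules[i];

        if (center_khz - half < r->start_khz ||
            center_khz + half > r->end_khz)
            continue;               /* channel is not inside this range */
        if (bandwidth_khz > r->max_bandwidth_khz ||
            power_mbm > r->max_power_mbm)
            return false;
        if (outdoor && (r->flags & REG_SKETCH_NO_OUTDOOR))
            return false;
        return true;
    }
    return false;                   /* no rule covers this channel */
}
```

With the placeholder table, a 20MHz channel centered on 2437MHz at 15dBm is accepted, while the same channel at 30dBm, or any outdoor use of the second range, is refused. A driver-side check of this kind is the sort of thing a central rules table makes possible.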
There is a new mac80211 function which drivers can call to get the current
regulatory domain information. But, unless the system has some clue of
where on the planet it is currently located, that information will be for
the "world domain," which, being
designed to avoid offending spectrum authorities worldwide, is quite
restrictive. Location information is often available from wireless access
points, allowing the system to configure itself without user intervention.
Individual drivers can also provide a "location hint" to the regulatory
core, perhaps based on regulatory information written to a device's EEPROM
by its vendor. If need be, the system administrator can also configure in
a location by hand.
The database of domains and associated rules lives in user space, where it
can be easily updated by distributors. When the name of the domain is set
within the kernel, an event is generated for udev which, in turn, will be
configured to run the crda utility. This tool will use the domain
name to look up the rules in the database, then use a netlink socket to
pass that information back to the kernel. From there, individual drivers
are told of the new rules via a notifier function.
The database is a binary file which is digitally signed; if the signature
does not match a set of public keys built into crda, then
crda will refuse to use it. This behavior will protect against a
corrupted database, but is also useful for keeping users from modifying it
by hand. No distributors have made any policy plans public, but one
assumes that the signing keys for the CRDA database will not be distributed
with the system. We're dealing with free software, so getting around this
kind of restriction will not prove challenging for even moderately
determined users, but it should prevent some people from cranking their
transmitted power to the maximum just to see what happens.
The CRDA mechanism, once merged into the kernel and once the wireless
drivers actually start using it, should be enough to ensure that Linux
systems with well-behaved users will be well-behaved transmitters. Whether
that will be enough to satisfy the regulatory agencies (some of which have
been quite explicit about their doubts that open-source regulatory
code can ever be acceptable) remains to be seen. But it is about the best
that we can do in a free software environment.
By Jonathan Corbet
August 19, 2008
Certain kinds of programmers are highly enamored with threads, to the point
that they use large numbers of them in their applications. In fact, some
applications create many thousands of threads. Happily for this kind of
developer - and their users - thread creation on Linux is quite fast. At
least, most of the time. A situation where that turned out not to be the
case gives an interesting look at what can happen when scalability and
historical baggage collide.
A user named Pardo recently noted that, in
some situations, thread creation time on x86_64 systems can slow
significantly - as in, by about two orders of magnitude. He was observing
thread creation rates of less than 100/second; at such rates, the term
"quite fast" no longer applies. Happily, Pardo also did much of the work
required to track down the problem, making its resolution quite a bit
easier.
The problem with thread creation is the allocation of the stack to be used
by the new thread. This allocation, done with mmap(), requires
locating a few pages' worth of space in the process's address range. Calls
to mmap() can be quite frequent, so the low-level code which finds
the address space for the new mapping is written to be quick. Normally, it
remembers (in mm->free_area_cache) the address just past the
end of the previous allocation, which
is usually the beginning of a big hole in the address space. So allocating
more space does not require any sort of search.
The mmap() call which creates a thread's stack is special, though,
in that it involves the obscure, Linux-specific MAP_32BIT flag.
This flag causes the allocation to be constrained to the bottom 2GB of the
virtual address space - meaning it should really have been called
MAP_31BIT instead. Thread stacks are kept in lower memory for a
historical reason: on some early 64-bit processors, context switches were
faster if the stack address fit into 32 bits. An application involving
thousands of threads cannot help being highly sensitive to context switch
times, so this was an optimization worth making.
The problem is that this kind of constrained allocation causes
mmap() to forget about mm->free_area_cache; instead,
it performs a linear search through all of the virtual memory areas (VMAs)
in the process's address space. Each thread stack will require at least
one VMA, so this search gets longer as more threads are created.
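The cost of that search is easy to see in a toy model. The code below is not kernel code; struct mm_sketch, its sorted array of VMAs, and the visited counter are assumptions made purely to count how many VMAs a first-fit linear search examines, compared with an allocator which trusts a cached address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_VMAS    1024
#define SKETCH_BASE 0x10000u        /* arbitrary lowest usable address */

struct vma_sketch { uint64_t start, end; };     /* covers [start, end) */

struct mm_sketch {
    struct vma_sketch vmas[MAX_VMAS];   /* kept sorted by address */
    size_t count;
    uint64_t free_area_cache;           /* just past the highest allocation */
    unsigned long visited;              /* VMAs examined by searches */
};

static void mm_init(struct mm_sketch *mm)
{
    memset(mm, 0, sizeof(*mm));
    mm->free_area_cache = SKETCH_BASE;
}

/* Record [addr, addr + len) at index i, shifting later entries up. */
static void insert_vma(struct mm_sketch *mm, size_t i,
                       uint64_t addr, uint64_t len)
{
    assert(mm->count < MAX_VMAS && i <= mm->count);
    memmove(&mm->vmas[i + 1], &mm->vmas[i],
            (mm->count - i) * sizeof(mm->vmas[0]));
    mm->vmas[i].start = addr;
    mm->vmas[i].end = addr + len;
    mm->count++;
    if (addr + len > mm->free_area_cache)
        mm->free_area_cache = addr + len;
}

/* Fast path: allocate at the cached address with no search at all. */
static uint64_t alloc_cached(struct mm_sketch *mm, uint64_t len)
{
    uint64_t addr = mm->free_area_cache;

    insert_vma(mm, mm->count, addr, len);
    return addr;
}

/* Constrained path: first-fit walk from the bottom; 0 means failure. */
static uint64_t alloc_linear(struct mm_sketch *mm, uint64_t len,
                             uint64_t limit)
{
    uint64_t addr = SKETCH_BASE;
    size_t i;

    for (i = 0; i < mm->count; i++) {
        mm->visited++;
        if (addr + len <= mm->vmas[i].start)
            break;                  /* the hole before VMA i is big enough */
        addr = mm->vmas[i].end;
    }
    if (addr + len > limit)
        return 0;                   /* no room below the limit */
    insert_vma(mm, i, addr, len);
    return addr;
}

/* Create n stacks with the linear search; return total VMAs examined. */
static unsigned long visits_for_linear(size_t n, uint64_t stack_len)
{
    static struct mm_sketch mm;
    size_t k;

    mm_init(&mm);
    for (k = 0; k < n; k++)
        alloc_linear(&mm, stack_len, UINT64_MAX);
    return mm.visited;
}
```

In this model, creating 100 stacks examines 4950 VMAs in total and creating 200 examines 19900, so doubling the thread count roughly quadruples the work; alloc_cached() never examines any.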
Where things really go wrong, though, is when there is no longer room to
allocate a stack in the bottom 2GB of memory. At that point, the
mmap() call will return failure to user space, which must then
retry the operation without the MAP_32BIT flag. Even worse, the
first call will have reset mm->free_area_cache, so the retry
operation must search through the entire list of VMAs a second time before
it is able to find a suitable piece of address space. Unsurprisingly,
things start to get really slow at that point.
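Seen from user space, the try-then-retry sequence described above looks something like the following. This is a sketch rather than glibc's actual code; alloc_stack_low_first() is a hypothetical name, and MAP_32BIT is defined to zero where the headers do not provide it (it only exists on x86-64), in which case the first call is simply an ordinary mmap().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0     /* not available here; first attempt is unconstrained */
#endif

/* Hypothetical helper: ask for a low-memory stack first, then retry
 * without the constraint if the kernel cannot satisfy the request.
 * Returns NULL if both attempts fail. */
static void *alloc_stack_low_first(size_t size)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *stack;

    assert(size > 0);

    /* First attempt: constrained to the low part of the address space. */
    stack = mmap(NULL, size, prot, flags | MAP_32BIT, -1, 0);

    /* Second attempt: anywhere at all. */
    if (stack == MAP_FAILED)
        stack = mmap(NULL, size, prot, flags, -1, 0);

    return stack == MAP_FAILED ? NULL : stack;
}
```

Once the low region is full, every thread creation pays for both calls, and each of them walks the VMA list.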
But the really sad thing is that the performance benefit which came from
using 32-bit stack addresses no longer exists with contemporary
processors. Whatever problem caused the context-switch slowdown for larger
addresses has long since been fixed. So this particular performance
optimization would appear to have become something other than optimal.
The solution which comes immediately to mind is to simply ignore the
MAP_32BIT flag altogether. That approach would require that
people experiencing this problem install a new kernel, but it would be
painless beyond that. Unfortunately, nobody really knows for sure when the
performance penalty for large stack addresses went away or how many
still-deployed systems might be hurt by removing the MAP_32BIT
behavior. So Andi Kleen, who first implemented this behavior, has argued against its removal. He also points
out that larger addresses could thwart a "pointer compression" optimization
used by some Java virtual machine implementations. Andi would rather see
the linear search through VMAs turned into something smarter.
In the end, MAP_32BIT will remain, but the allocation of thread
stacks in lower memory is going away anyway. Ingo Molnar has merged a single-line patch creating a new
mmap() flag called MAP_STACK. This flag is defined as
requesting a memory range which is suitable for use as a thread stack, but,
in fact, it does not actually do anything. Ulrich Drepper will have glibc
use this new flag as of the next release. The end result is that, once
a user system has a new glibc and a fixed kernel, the old stack behavior
will go away and that particular performance problem will be history.
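For code which allocates its own thread stacks, using the new flag might look like the sketch below. The alloc_thread_stack() and free_thread_stack() helpers are hypothetical names; the fallback definition of MAP_STACK as zero is an assumption for building against headers which predate the flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0     /* older headers: request nothing special */
#endif

/* Hypothetical helper: map an anonymous region meant for use as a
 * thread stack. Returns NULL on failure. */
static void *alloc_thread_stack(size_t size)
{
    void *stack;

    assert(size > 0);
    stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    return stack == MAP_FAILED ? NULL : stack;
}

/* Release a stack obtained from alloc_thread_stack(). */
static int free_thread_stack(void *stack, size_t size)
{
    return munmap(stack, size);
}
```

The resulting region could then be handed to pthread_attr_setstack() or clone(); the flag only states the caller's intent and leaves placement to the kernel.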
Given this outcome,
why not just ignore MAP_32BIT in the kernel and avoid the need
for a C library upgrade? MAP_32BIT is part of the user-space ABI,
and nobody really knows how somebody might be using it. Breaking the ABI
is not an option, so the old behavior must remain. On the other
hand, one could argue for simply removing the use of MAP_32BIT in
the creation of thread stacks, making the kernel upgrade unnecessary. As
it happens, switching to MAP_STACK will have the same effect;
older kernels, which do not recognize that flag, will simply ignore it.
But if, at some future point, it turns out there still is a performance
problem with higher-memory stacks on real systems, the kernel can be
tweaked to implement the older behavior when it's running on an affected
processor. So, with luck, all the bases are covered and this particular issue
will not come back again.
Page editor: Jonathan Corbet