Kernel development news
The kernel isn't sacred and it isn't a separate part of the system. It needs to be seen as just one component of a fully integrated system, especially by its developers.
How about raising your quality control a bit, so that I don't have to berate you? Send the _obviously good_ stuff during the merge window, and don't send the "random crap" AT ALL. And then, during the -rc series, you don't do any "obviously good" stuff at all, but you do the "absolutely required" stuff.
Busy waits are always undesirable, but, in some situations, they become even more so. If the wait is going to be relatively long, it would be better to put the processor into a lower power state. After all, nobody cares if it executes its empty loop at full speed, or, even, whether the loop executes at all. If the wait is running within a virtualized guest, the situation can be even worse: by looping in the processor, a busy wait can actively prevent the running of the code which will eventually provide the event which is being waited for. In a virtualized environment, it is far better to simply suspend the virtual system altogether than to let it busy wait.
Jeremy Fitzhardinge has proposed a solution to this problem in the form of the trigger API. A trigger can be thought of as a special type of continuation intended for use in a specific environment: situations where preemption is disabled and sleeping is not possible, but where it is necessary to wait for an external event.
A trigger is set up in either of the two usual patterns:
    #include <linux/trigger.h>

    DEFINE_TRIGGER(my_trigger);

    /* ... or ... */

    trigger_t my_trigger;
    trigger_init(&my_trigger);
There is a sequence of calls which must be made by code intending to wait for a trigger:
    trigger_reset(&my_trigger);
    while (!condition)
        trigger_wait(&my_trigger);
    trigger_finish(&my_trigger);
Triggers are designed to be safe against race conditions, in that if a trigger is fired after the trigger_reset() call, the subsequent trigger_wait() call will return immediately. As with any such primitive, false "wakeups" are possible, so it is necessary to check for the condition being waited for and wait again if need be.
Code which wishes to signal completion to a thread waiting on a trigger need only make a call to:
void trigger_kick(trigger_t *trigger);
This code should, of course, ensure that the waiting thread will see that the resource it was waiting for is available before calling trigger_kick().
A reader of the generic implementation of triggers may be forgiven for wondering what the point is; most of the functions are empty, and trigger_wait() turns into a call to cpu_relax(). In other words, it's still a busy wait, just like before except that now it's hidden behind a set of trigger functions. The idea, of course, is that better versions of these functions can be defined in architecture-specific code. If the target architecture is actually a virtual machine environment, for example, a trigger can simply suspend the execution of the machine altogether. To that end, there is a new set of paravirt_ops allowing hypervisors to implement the trigger operations.
Jeremy has also created an implementation for the x86 architecture which uses the relatively new monitor and mwait instructions. In this implementation, a trigger is a simple integer variable. A call to trigger_reset() turns into a monitor instruction, informing the processor that it should watch out for changes to that integer variable. The mwait instruction built into trigger_wait() halts the processor until the monitored variable is written to. No more busy waiting is required.
There is a certain elegance to the monitor/mwait implementation, but Arjan van de Ven worries that it may prove to be too slow. So changes to the x86 implementation are possible. There have not been a lot of comments about the API itself, though, so the trigger functions may well make it into the mainline in something close to their current form.
Regulations on radio transmissions bring some extra challenges. They are legal code, so their violation can bring users, vendors, and distributors into unwanted conversations with representatives of spectrum enforcement agencies. The legal code is inherently local, while wireless devices are inherently mobile, so those devices must be able to modify their behavior to match different sets of rules at different times. And some wireless devices can be programmed in quite flexible ways; they can be operated far outside of their allowed parameters. The possibility that one of these devices could be configured - accidentally or intentionally - in a way which interferes with other uses of the spectrum is very real.
The potential for legal problems associated with wireless interfaces has cast a shadow over Linux for a while. Some vendors have used it as an excuse for their failure to provide free drivers. Others (Intel, for example), have reworked their hardware to lock up regulatory compliance safely within the firmware. Even so, vendors and Linux distributors have worried about what kind of sanctions might come down if Linux systems are seen to be operating in violation of the law somewhere on the planet. Despite all that, the Linux kernel has no central mechanism for ensuring regulatory compliance; it is up to individual drivers to make sure that their hardware does not break the rules. This situation may be about to change, though, as the Central Regulatory Domain Agent (CRDA) patch set, currently being developed by Luis Rodriguez, approaches readiness.
At the core of CRDA is struct ieee80211_regdomain, which describes the rules associated with a given legal regime. It is a somewhat complicated structure, but its contents are relatively simple to understand. They include a set of allowable frequency ranges; for each range, the maximum bandwidth, allowable power, and antenna gain are listed. There's also a set of flags for special rules; some domains, for example, do not allow outdoor operation or certain types of modulation. Each domain is associated with a two-letter identifying code which, normally, is just a country code.
There is a new mac80211 function which drivers can call to get the current regulatory domain information. But, unless the system has some clue of where on the planet it is currently located, that information will be for the "world domain," which, being designed to avoid offending spectrum authorities worldwide, is quite restrictive. Location information is often available from wireless access points, allowing the system to configure itself without user intervention. Individual drivers can also provide a "location hint" to the regulatory core, perhaps based on regulatory information written to a device's EEPROM by its vendor. If need be, the system administrator can also configure in a location by hand.
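On the driver side, the hint mechanism is a single call. The sketch below is kernel code and cannot run stand-alone; regulatory_hint() is part of the wireless regulatory API, while the EEPROM helper is a hypothetical stand-in for whatever a given driver uses:

```c
/* Kernel-side sketch: a driver passing a country code read from
 * its EEPROM as a regulatory "location hint". */
static int mydrv_set_regdomain(struct wiphy *wiphy)
{
	char alpha2[2];

	mydrv_read_eeprom_country(alpha2);	/* hypothetical helper */
	return regulatory_hint(wiphy, alpha2);
}
```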
The database of domains and associated rules lives in user space, where it can be easily updated by distributors. When the name of the domain is set within the kernel, an event is generated for udev which, in turn, will be configured to run the crda utility. This tool will use the domain name to look up the rules in the database, then use a netlink socket to pass that information back to the kernel. From there, individual drivers are told of the new rules via a notifier function.
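The udev side of that hookup amounts to a one-line rule; this example mirrors the rule shipped with crda, though the exact match keys may vary by distribution:

```
# udev rule: run crda whenever the kernel requests a new domain
KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda"
```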
The database is a binary file which is digitally signed; if the signature does not match a set of public keys built into crda, then crda will refuse to use it. This behavior will protect against a corrupted database, but is also useful for keeping users from modifying it by hand. No distributors have made any policy plans public, but one assumes that the signing keys for the CRDA database will not be distributed with the system. We're dealing with free software, so getting around this kind of restriction will not prove challenging for even moderately determined users, but it should prevent some people from cranking their transmitted power to the maximum just to see what happens.
The CRDA mechanism, once merged into the kernel and once the wireless drivers actually start using it, should be enough to ensure that Linux systems with well-behaved users will be well-behaved transmitters. Whether that will be enough to satisfy the regulatory agencies (some of which have been quite explicit on their doubts about whether open-source regulatory code can ever be acceptable) remains to be seen. But it is about the best that we can do in a free software environment.
A user named Pardo recently noted that, in some situations, thread creation time on x86_64 systems can slow significantly - as in, by about two orders of magnitude. He was observing thread creation rates of less than 100/second; at such rates, the term "quite fast" no longer applies. Happily, Pardo also did much of the work required to track down the problem, making its resolution quite a bit easier.
The problem with thread creation is the allocation of the stack to be used by the new thread. This allocation, done with mmap(), requires locating a few pages' worth of space in the process's address range. Calls to mmap() can be quite frequent, so the low-level code which finds the address space for the new mapping is written to be quick. Normally, it remembers (in mm->free_area_cache) the address just past the end of the previous allocation, which is usually the beginning of a big hole in the address space. So allocating more space does not require any sort of search.
The mmap() call which creates a thread's stack is special, though, in that it involves the obscure, Linux-specific MAP_32BIT flag. This flag causes the allocation to be constrained to the bottom 2GB of the virtual address space - meaning it should really have been called MAP_31BIT instead. Thread stacks are kept in lower memory for a historical reason: on some early 64-bit processors, context switches were faster if the stack address fit into 32 bits. An application involving thousands of threads cannot help being highly sensitive to context switch times, so this was an optimization worth making.
The problem is that this kind of constrained allocation causes mmap() to forget about mm->free_area_cache; instead, it performs a linear search through all of the virtual memory areas (VMAs) in the process's address space. Each thread stack will require at least one VMA, so this search gets longer as more threads are created.
Where things really go wrong, though, is when there is no longer room to allocate a stack in the bottom 2GB of memory. At that point, the mmap() call will return failure to user space, which must then retry the operation without the MAP_32BIT flag. Even worse, the first call will have reset mm->free_area_cache, so the retry operation must search through the entire list of VMAs a second time before it is able to find a suitable piece of address space. Unsurprisingly, things start to get really slow at that point.
But the really sad thing is that the performance benefit which came from using 32-bit stack addresses no longer exists with contemporary processors. Whatever problem caused the context-switch slowdown for larger addresses has long since been fixed. So this particular performance optimization would appear to have become something other than optimal.
The solution which comes immediately to mind is to simply ignore the MAP_32BIT flag altogether. That approach would require that people experiencing this problem install a new kernel, but it would be painless beyond that. Unfortunately, nobody really knows for sure when the performance penalty for large stack addresses went away or how many still-deployed systems might be hurt by removing the MAP_32BIT behavior. So Andi Kleen, who first implemented this behavior, has argued against its removal. He also points out that larger addresses could thwart a "pointer compression" optimization used by some Java virtual machine implementations. Andi would rather see the linear search through VMAs turned into something smarter.
In the end, MAP_32BIT will remain, but the allocation of thread stacks in lower memory is going away anyway. Ingo Molnar has merged a single-line patch creating a new mmap() flag called MAP_STACK. This flag is defined as requesting a memory range which is suitable for use as a thread stack, but, in fact, it does not actually do anything. Ulrich Drepper will make glibc use this new flag as of the next release. The end result is that, once a user system has a new glibc and a fixed kernel, the old stack behavior will go away and that particular performance problem will be history.
Given this outcome, why not just ignore MAP_32BIT in the kernel and avoid the need for a C library upgrade? MAP_32BIT is part of the user-space ABI, and nobody really knows how somebody might be using it. Breaking the ABI is not an option, so the old behavior must remain. On the other hand, one could argue for simply removing the use of MAP_32BIT in the creation of thread stacks, making the kernel upgrade unnecessary. As it happens, switching to MAP_STACK will have the same effect; older kernels, which do not recognize that flag, will simply ignore it. But if, at some future point, it turns out there still is a performance problem with higher-memory stacks on real systems, the kernel can be tweaked to implement the older behavior when it's running on an affected processor. So, with luck, all the bases are covered and this particular issue will not come back again.
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jonathan Corbet
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds