No -mm kernels have been released over the last week.
The current stable 2.6 kernel is 18.104.22.168, released on June 11.
Kernel development news
-- David Miller
The full text of the DCO can be found in the SubmittingPatches file in the Documentation directory.
This change was motivated by the actions of one kernel subsystem maintainer who feels that the UK Data Protection Act requires that he strip email addresses from patches which pass through him. The new version of the DCO will, in theory, turn a "Signed-off-by:" header into an active granting of permission to redistribute the contact information which comes with the patch.
[...] a patch to remove devfs. The devfs filesystem, a virtual filesystem which provides a dynamic /dev directory, had been unpopular with many kernel developers since long before it was merged in 2.3.46. It was never enabled by most distributions, and, in more recent times, had seen little maintenance. Meanwhile, the user-space udev utility had developed to the point where it could fill in for devfs. Since there was no 2.7 on the horizon, and 2.6 was officially open to user-visible changes, it seemed like a good time to close the devfs chapter forevermore.
Except that, as it turns out, the developers were not quite ready to eliminate a user-visible feature on such short notice. After some discussion, it was decided that changes of this kind should happen after a one-year warning period. As a result, a file was created in the Documentation directory (here's the almost-2.6.12 version) which listed features scheduled for removal and the target date. Devfs went into the file, with July, 2005 as the time for its ultimate demise.
July is nearly here, and Greg has not forgotten. He has returned with a 22-part patch which removes every trace of devfs from a surprisingly large portion of the kernel. It would seem that devfs had gotten its fingers into just about everything. In the absence of some sort of surprise, this patch seems certain to be merged for 2.6.13. If there are any devfs users out there, they have gotten their last warning.
One of the biggest sources of interrupt latency is periods when the processor has simply disabled interrupt delivery. Device drivers often disable interrupts - on the local processor at least - to avoid creating race conditions with themselves. Even (or especially) when spinlocks are used to control concurrency with interrupt handlers, interrupts must be disabled. Imagine a driver which duly acquires a spinlock before working with its data structures. One of that driver's devices raises an interrupt while the lock is held, and the interrupt handler runs on the same CPU. That interrupt handler will try to acquire the same spinlock, and, finding it busy, will proceed to spin until the lock becomes free. But, since the interrupt handler has preempted the only thread which can ever release the lock, it will spin forever. That is a different sort of interrupt latency altogether, and one which even general-purpose kernels try to avoid. The usual technique is simply to disable interrupts while holding a spinlock which might be acquired by an interrupt handler. Disabling interrupts solves the immediate problem, but it can lead to increased interrupt latency.
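The deadlock described above, and the latency cost of the usual cure, can be sketched with a toy single-CPU model. This is a user-space illustration only: every name in it (cpu_model, model_raise_irq() and so on) is hypothetical, and the handler reports the deadlock rather than actually spinning. In real driver code the same effect is normally obtained with spin_lock_irqsave() and spin_unlock_irqrestore().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model; all names are hypothetical, not kernel API. */
struct cpu_model {
    bool irqs_enabled;  /* may the CPU take an interrupt right now? */
    bool lock_held;     /* is the driver's spinlock taken? */
    bool irq_pending;   /* interrupt raised while delivery was disabled */
    bool deadlocked;    /* handler found the lock busy on its own CPU */
    int  handled;       /* interrupts the handler has completed */
};

/* The interrupt handler needs the same lock as the driver code. */
void model_handler(struct cpu_model *cpu)
{
    if (cpu->lock_held) {
        /* The lock owner is the code that was just interrupted; it cannot
         * run again on this CPU, so a real handler would spin forever. */
        cpu->deadlocked = true;
        return;
    }
    cpu->handled++;
}

/* The device raises its interrupt line. */
void model_raise_irq(struct cpu_model *cpu)
{
    if (cpu->irqs_enabled)
        model_handler(cpu);        /* delivered immediately */
    else
        cpu->irq_pending = true;   /* held back: this is the added latency */
}

/* Re-enable delivery and run whatever was held back. */
void model_irq_enable(struct cpu_model *cpu)
{
    cpu->irqs_enabled = true;
    if (cpu->irq_pending) {
        cpu->irq_pending = false;
        model_handler(cpu);
    }
}

/* A critical section during which the device interrupts.
 * Returns true if the CPU got through without deadlocking. */
bool model_critical_section(struct cpu_model *cpu, bool disable_irqs)
{
    assert(!cpu->lock_held);
    if (disable_irqs)
        cpu->irqs_enabled = false;
    cpu->lock_held = true;

    model_raise_irq(cpu);          /* interrupt arrives while the lock is held */

    cpu->lock_held = false;
    if (disable_irqs)
        model_irq_enable(cpu);
    return !cpu->deadlocked;
}
```

Calling model_critical_section() with disable_irqs set to false leaves the model deadlocked; with it set to true, the interrupt is handled only after the lock has been dropped, which is exactly the delay that shows up as interrupt latency.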
Ingo Molnar's realtime preemption patches improve the situation by moving interrupt handlers into their own processes. Since interrupt handlers are scheduled with everything else, and since "spinlocks" no longer spin with this patch set, the sort of deadlock described in the previous paragraph cannot happen. So there is no longer any need to disable interrupts when acquiring spinlocks. Changing the locking primitives eliminated the major part of the code in the kernel which runs with interrupts disabled.
Daniel Walker recently noticed that one could do a little better - and followed up with a patch showing how. Fixing the locking primitives got rid of most of the driver code which runs with interrupts turned off, but it did nothing for all of the places where drivers explicitly disable interrupts themselves with a call to local_irq_disable(). In most of these cases, the driver is simply trying to avoid racing with its interrupt handler. But when interrupt handlers run in their own threads, all that is really needed to avoid concurrency problems is to disable preemption. So Daniel's patch reworks local_irq_disable() to turn off preemption while leaving the interrupt configuration alone. For the few cases where it is truly necessary to disable interrupts at the hardware level, hard_local_irq_disable() (later renamed to raw_local_irq_disable()) has been provided.
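The idea can be illustrated with another small user-space model. It is a sketch under stated assumptions, not the code from Daniel's patch: the rt_cpu structure and all of the model_*() functions are hypothetical names, and the "scheduler" is reduced to a single check of a preemption counter. What it shows is that, with threaded handlers, the reworked local_irq_disable() still lets the hardware interrupt through immediately, while the handler thread is simply kept off the CPU until preemption is re-enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU with threaded interrupt handlers (hypothetical). */
struct rt_cpu {
    bool hw_irqs_enabled;   /* hardware interrupt delivery */
    bool hw_irq_latched;    /* interrupt waiting for delivery to be enabled */
    int  preempt_count;     /* greater than zero: preemption is disabled */
    bool handler_runnable;  /* handler thread woken but not yet run */
    int  hw_irqs_taken;     /* hardware interrupts acknowledged */
    int  handler_runs;      /* times the handler thread body has run */
};

/* Stand-in for the scheduler: run the handler thread if it may preempt. */
void model_schedule(struct rt_cpu *cpu)
{
    if (cpu->preempt_count == 0 && cpu->handler_runnable) {
        cpu->handler_runnable = false;
        cpu->handler_runs++;
    }
}

/* A device interrupt: acknowledge it and wake the handler thread. */
void model_hw_interrupt(struct rt_cpu *cpu)
{
    if (!cpu->hw_irqs_enabled) {
        cpu->hw_irq_latched = true;   /* true hardware-level latency */
        return;
    }
    cpu->hw_irqs_taken++;
    cpu->handler_runnable = true;
    model_schedule(cpu);
}

/* Reworked primitive: only preemption is turned off. */
void model_local_irq_disable(struct rt_cpu *cpu)
{
    cpu->preempt_count++;
}

void model_local_irq_enable(struct rt_cpu *cpu)
{
    assert(cpu->preempt_count > 0);
    cpu->preempt_count--;
    model_schedule(cpu);
}

/* The "raw" variant really does mask interrupts in hardware. */
void model_raw_local_irq_disable(struct rt_cpu *cpu)
{
    cpu->hw_irqs_enabled = false;
}

void model_raw_local_irq_enable(struct rt_cpu *cpu)
{
    cpu->hw_irqs_enabled = true;
    if (cpu->hw_irq_latched) {
        cpu->hw_irq_latched = false;
        model_hw_interrupt(cpu);
    }
}
```

In this model an interrupt arriving between model_local_irq_disable() and model_local_irq_enable() is counted in hw_irqs_taken right away, but handler_runs does not change until the enable call; only the raw variants delay the hardware interrupt itself.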
One might argue that disabling preemption is counterproductive, given that any code which runs with preemption disabled will contribute to the scheduling latency problem. But any code which disables interrupts already runs with preemption turned off, so the situation is not made any worse by this patch. It could, in fact, be improved: all that really needs to be protected against is preemption by one specific interrupt handler thread. The extra scheduler complexity which would be required to implement that solution is unlikely to be worth it, however; better to just fix the drivers to use locks. So Ingo picked up Daniel's patch, spent a few minutes completely reworking it, and added it to his realtime preemption patch set.
Meanwhile, Karim Yaghmour was heard wondering:
It does seem that not everybody understands what the Adeos patch (available from the Gna server) does. The description of Adeos, in its current form, as a "nanokernel" probably does this work a disservice; what Adeos really comes down to is a patch to the kernel's interrupt handling code.
To reduce interrupt latency, Adeos takes the classic approach of adding a layer of indirection. The patch adds an "interrupt pipeline" to the low-level, architecture-specific code. Any "domain" (read "piece of code") can register itself with this interrupt pipeline, providing a priority as it does so. Whenever a hardware interrupt arrives, Adeos works its way down the pipeline, calling into the handler of each domain which has expressed an interest in that interrupt. The higher-priority handlers are, of course, called first.
In this world, the regular Linux interrupt subsystem is registered as just another Adeos domain. Any code which absolutely, positively must have its interrupts arrive within microseconds can register itself as a higher-priority domain. When interrupt time comes, the high-priority code can respond to the interrupt before Linux even hears about it. Since nothing in Linux can possibly get in the way (unless it does evil things to the hardware), there is no need to worry about which parts of Linux might create latency problems.
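A priority-ordered interrupt pipeline of this kind can be sketched in a few lines of C. The code below is a user-space illustration and makes no attempt to reproduce the actual Adeos interfaces: the structure layout, pipeline_register(), pipeline_dispatch(), and the record_irq() example handler are all assumptions made for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DOMAINS 8

typedef void (*irq_handler_fn)(int irq, void *ctx);

/* One registered domain (hypothetical layout, not the Adeos structures). */
struct domain {
    const char     *name;
    int             priority;  /* larger value: earlier in the pipeline */
    unsigned long   irq_mask;  /* bit n set: domain wants interrupt n */
    irq_handler_fn  handler;
    void           *ctx;
};

struct pipeline {
    struct domain domains[MAX_DOMAINS];  /* kept sorted, highest priority first */
    size_t        count;
};

/* Insert a domain, keeping the array ordered by descending priority. */
int pipeline_register(struct pipeline *p, struct domain d)
{
    size_t i;

    if (p->count == MAX_DOMAINS)
        return -1;
    i = p->count;
    while (i > 0 && p->domains[i - 1].priority < d.priority) {
        p->domains[i] = p->domains[i - 1];
        i--;
    }
    p->domains[i] = d;
    p->count++;
    return 0;
}

/* Walk down the pipeline, calling every domain interested in this
 * interrupt. Returns the number of handlers called. */
int pipeline_dispatch(const struct pipeline *p, int irq)
{
    int called = 0;
    size_t i;

    assert(irq >= 0 && irq < (int)(8 * sizeof(unsigned long)));
    for (i = 0; i < p->count; i++) {
        const struct domain *d = &p->domains[i];

        if (d->irq_mask & (1UL << irq)) {
            d->handler(irq, d->ctx);
            called++;
        }
    }
    return called;
}

/* Example handler: records the position at which its domain was called. */
struct trace { int seq; };
struct slot  { struct trace *t; int seen_at; };

void record_irq(int irq, void *ctx)
{
    struct slot *s = ctx;

    (void)irq;
    s->seen_at = ++s->t->seq;
}
```

Registering a catch-all domain with priority 0 to stand in for Linux, and a second domain with a higher priority and a mask covering a single interrupt, causes pipeline_dispatch() to call the high-priority handler first for that interrupt and the Linux stand-in alone for every other one.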
Some benchmark results were recently posted; they showed generally better performance from Adeos than from the realtime preemption patch. Some issues have been raised, however, with how those numbers were collected; the tests are set to be rerun in the near future.
Meanwhile, a slow debate over inclusion of the realtime work continues, with some participants pushing for the code to be merged eventually, others being skeptical, and a few asking for the realtime discussion to be removed from linux-kernel altogether. One viewpoint worth considering can be found in this posting from Gerrit Huizenga, who argued that the realtime patches of today resemble the scalability patches from a few years ago, and that they must follow a similar path toward inclusion:
Ingo Molnar clearly understands this; he has consistently worked toward making the realtime patches minimally intrusive and useful in many situations. Parts of the realtime work have already been merged, and this process may continue. There may come a time when developers will be surprised to discover that most of the realtime preemption patch can be found in the mainline.
[...] this LWN Driver Porting Series article; rather more details are available from the networking chapter in LDD3.
One of the things NAPI-compliant drivers must do is to specify the "weight" of each interface. The weight parameter helps to determine how important traffic from that interface is - it limits the number of packets each interface can feed to the networking core in each polling cycle. This parameter also controls whether the interface runs in the polling mode or not; by the NAPI conventions, an interface which does not have enough built-up traffic to fill its quota of packets (where the quota is determined by the interface's weight) should go back to the interrupt-driven mode. The weight is thus a fundamental parameter controlling how packet reception is handled, but there has never been any real guidance from the networking crew on how the weight should be set. Most driver writers pick a value between 16 and 64, with interfaces capable of higher speeds usually setting larger values.
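The relationship between weight, quota, and the polling mode can be made concrete with a small model of a driver's polling pass. This is a sketch modeled loosely on the NAPI conventions described above, not code from any real driver; model_nic, model_poll(), and the numbers used with them are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy network interface (hypothetical; not a kernel structure). */
struct model_nic {
    int  weight;     /* per-interface weight chosen by the driver */
    int  rx_queued;  /* packets waiting in the receive ring */
    bool polling;    /* true: polled mode; false: interrupt-driven */
    int  delivered;  /* packets handed to the networking core so far */
};

/* One polling pass. The quota is the interface's weight, capped by the
 * budget remaining for this round of polling. Returns the number of
 * packets delivered. */
int model_poll(struct model_nic *nic, int *budget)
{
    int quota = nic->weight < *budget ? nic->weight : *budget;
    int done  = nic->rx_queued < quota ? nic->rx_queued : quota;

    assert(nic->polling && nic->weight > 0);

    nic->rx_queued -= done;
    nic->delivered += done;
    *budget -= done;

    /* Not enough traffic to fill the quota: go back to interrupts. */
    if (done < quota)
        nic->polling = false;
    return done;
}
```

With a weight of 16 and 40 packets queued, three calls deliver 16, 16, and 8 packets; the third call falls short of the quota, so the interface drops back into interrupt-driven mode.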
Some recent discussions on the netdev list have raised the issue of how the weight of an interface should be set. In particular, the e1000 driver hackers have discovered that their interface tends to perform better when its weight is set lower - with the optimal value being around 10. Investigations into this behavior continue, but a few observations have come out; they give a view into what is really required to get top performance out of modern hardware.
One problem, which appears to be specific to the e1000, is that the interface runs out of receive buffers. The e1000 driver, in its poll() function, will deliver its quota of packets to the networking core; only when that process is complete does the driver concern itself with providing more receive buffers to the interface. So one short-term tactic would be to replenish the receive buffers more often. Other interface drivers tend not to wait until an entire quota has been processed to perform this replenishment. Lowering the weight of an interface is one way to force this replenishment to happen more often without actually changing the driver's logic.
But questions remain: why is the system taking so long to process 64 packets that a 256-packet ring is being exhausted? And why does performance increase for smaller weights even when packets are not being dropped? One possible explanation is that the actual amount of work being done for each packet in the networking core can vary greatly depending on the type of traffic being handled. Big TCP streams, in particular, take longer to process than bursts of small UDP packets. So, depending on the workload, processing one quota's worth of packets might take quite some time.
This processing time affects performance in a number of ways. If the system spends large bursts of time in software interrupt mode to deal with incoming packets, it will be starving the actual application for processor time. The overall latency of the system goes up, and performance goes down. Smaller weights can lead to better interleaving of system and application time.
A related issue is this check in the networking core's polling logic:
    if (budget <= 0 || jiffies - start_time > 1)
        goto softnet_break;
Essentially, once the jiffies counter has advanced by two or more since polling began (which happens after somewhere between one and two jiffies, or roughly 1 to 2 ms on a system with HZ set to 1000), the networking core decides that things have gone on for long enough and it's time to take a break. If one high-weight interface is taking a lot of time to get its packets through the system, the packet reception process can be cut short early, perhaps before other interfaces have had their opportunity to deal with their traffic. Once again, smaller weights can help to mitigate this problem.
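The effect of that check on a system with more than one busy interface can be explored with a simulated clock. The model below is a rough sketch, not the kernel's polling loop: the round-robin scheduling, the fixed per-packet cost, and every name in it are assumptions, and the cost figure in particular is invented to make the effect visible.

```c
#include <assert.h>

#define NIFACES 2

struct iface {
    int weight;     /* packets allowed per polling pass */
    int queued;     /* packets waiting to be processed */
    int delivered;  /* packets processed so far */
};

struct softnet_model {
    struct iface  ifs[NIFACES];
    unsigned long now_us;   /* simulated clock, in microseconds */
    unsigned long tick_us;  /* length of one jiffy */
    unsigned long cost_us;  /* assumed processing cost per packet */
};

unsigned long model_jiffies(const struct softnet_model *m)
{
    return m->now_us / m->tick_us;
}

/* Poll the interfaces round-robin until the budget or the time limit
 * runs out, or until nothing is left. Returns the number of polling
 * passes which did some work. */
int model_rx_action(struct softnet_model *m, int budget)
{
    unsigned long start_time;
    int polls = 0, next = 0, idle = 0;

    assert(m->tick_us > 0);
    start_time = model_jiffies(m);

    while (idle < NIFACES) {
        struct iface *ifp;
        int n;

        /* Same shape as the check quoted above. */
        if (budget <= 0 || model_jiffies(m) - start_time > 1)
            break;

        ifp = &m->ifs[next];
        next = (next + 1) % NIFACES;

        n = ifp->queued < ifp->weight ? ifp->queued : ifp->weight;
        if (n > budget)
            n = budget;
        if (n == 0) {
            idle++;
            continue;
        }
        idle = 0;

        ifp->queued -= n;
        ifp->delivered += n;
        budget -= n;
        m->now_us += (unsigned long)n * m->cost_us;
        polls++;
    }
    return polls;
}
```

With a 1000 μsec tick, an assumed cost of 40 μsec per packet, and 100 packets queued on each of two interfaces, a weight of 64 lets the first interface use up the whole time allowance in a single pass, leaving the second with nothing. A weight of 10 gives five passes before the break, with 30 and 20 packets delivered respectively.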
Finally, an overly large weight can work against the performance of an interface when traffic is at moderate levels. If the driver does not fill its entire quota in one polling cycle, it will turn off polling and go back into interrupt-driven mode. So a steady stream of traffic which does not quite fill the quota will cause the driver to bounce between the polling and interrupt modes, and the processor will have to handle far more interrupts than would otherwise be expected. Slower interfaces (100 Mb/sec and below) are particularly vulnerable to this problem; on a fast system, such interfaces simply cannot receive enough data to fill the quota every time.
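The bouncing between modes is easy to count in a simplified model. The function below assumes that a fixed number of packets arrives in each interval and that the interface is polled exactly once per interval; both are simplifications, and model_count_interrupts() is a hypothetical name rather than anything found in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Count the interrupts taken over a number of polling intervals, given
 * a steady arrival rate. One polling pass per interval is assumed. */
int model_count_interrupts(int weight, int arrivals_per_cycle, int cycles)
{
    bool polling = false;
    int queued = 0, interrupts = 0, i;

    assert(weight > 0 && arrivals_per_cycle >= 0);

    for (i = 0; i < cycles; i++) {
        int done;

        queued += arrivals_per_cycle;
        if (!polling) {
            if (queued == 0)
                continue;       /* nothing arrived, no interrupt */
            interrupts++;       /* the interrupt switches polling on */
            polling = true;
        }

        done = queued < weight ? queued : weight;
        queued -= done;

        /* Quota not filled: leave polled mode until the next interrupt. */
        if (done < weight)
            polling = false;
    }
    return interrupts;
}
```

With 12 packets arriving per interval over 100 intervals, a weight of 64 (or 16) costs one interrupt per interval, 100 in all, while a weight of 12 takes a single interrupt and then stays in polled mode.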
From all this information, some conclusions have emerged:
Changing the code to implement these conclusions is likely to be a long process. Fundamental tweaks in the core of the networking code can lead to strange performance regressions in surprising places. In the meantime, Stephen Hemminger has posted a patch which creates a sysfs knob for the interface weight. That patch has been merged for 2.6.12, so people working on networking performance problems will soon be able to see if adjustable interface weights can be part of the solution.
Page editor: Jonathan Corbet
Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds