Release status
Kernel release status
The current 2.6 prepatch remains 2.6.12-rc6. The trickle of patches
into Linus's git repository has slowed recently, and the official 2.6.12
release may well have happened by the time you read this.
No -mm kernels have been released over the last week.
The current stable 2.6 kernel is 2.6.11.12, released on June 11.
Kernel development news
Quote of the week
Me caveman
Me plug in wireless router
Me watch pretty lights
Me turn on computer
Me up interface
Computer work
Me no care other cavemen use wireless link
-- David Miller
A new home for netdev
If it seems like the networking hackers are especially quiet as of late, it
may be that you failed to note the netdev mailing list's move. This list,
long hosted on oss.sgi.com, is now one of the many (majordomo-managed)
kernel lists on vger.kernel.org. The move has not been broadly advertised,
and the subscriber list does not appear to have been transferred from the
old list to the new one. If your netdev mail has stopped, chances are you
need to subscribe to the new list.
The Developer's Certificate of Origin, v1.1
When 2.6.12 is released, it will include a new version of the "developer's
certificate of origin," the statement which must be made by anybody
submitting a patch for merging into the mainline. Version 1.1 of the
DCO includes a new phrase:
I understand and agree that this project and the contribution are
public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
The full text of the DCO can be found in the SubmittingPatches file in the
Documentation directory.
This change was motivated by the actions of one kernel subsystem maintainer
who feels that the UK Data Protection Act requires that he strip email
addresses from patches which pass through him. The new version of the DCO
will, in theory, turn a "Signed-off-by:" header into an active granting of
permission to redistribute the contact information which comes with the
patch.
The end of the devfs story
Almost one year ago, the kernel developers decided to formally recognize
the new development model, where large changes were welcome in the stable
2.6 series. At that time, Greg Kroah-Hartman decided to test out the new
model by posting a patch to remove devfs. The devfs filesystem, a virtual
filesystem which provides a dynamic /dev directory, had been unpopular with
many kernel developers since long before it was merged in 2.3.46. It was
never enabled by most distributions, and, in more recent times, had seen
little maintenance. Meanwhile, the user-space udev utility had developed to
the point where it could fill in for devfs. Since there was
no 2.7 on the horizon, and 2.6 was officially open to user-visible changes,
it seemed like a good time to close the devfs chapter forevermore.
Except that, as it turns out, the developers were not quite ready to
eliminate a user-visible feature on such short notice. After some
discussion, it was decided that changes of this kind should happen after a
one-year warning period. As a result, a file was created in the
Documentation directory (here's the almost-2.6.12 version) which listed features
scheduled for removal and the target date. Devfs went into the file, with
July, 2005 as the time for its ultimate demise.
July is nearly here, and Greg has not forgotten. He has returned with a
22-part patch which removes every trace of devfs from a surprisingly large
portion of the kernel. It would seem that
devfs had gotten its fingers into just about everything. In the absence of
some sort of surprise, this patch seems certain to be merged for 2.6.13.
If there are any devfs users out there, they have gotten their last
warning.
Realtime and interrupt latency
The realtime Linux patches, covered at length (too much length, according
to some) on these pages, have been aimed primarily at reducing scheduling
latency: the amount of time it takes to switch control to a high-priority
process in response to an event which makes it runnable. Scheduling
latency is important, but the harder end of the realtime spectrum also
places a premium on interrupt latency: how long the system takes to respond
to a hardware interrupt. In many realtime situations, the processor must
answer quickly when the hardware asks for attention; excessive latency can
lead to lost data and failure to respond quickly enough to external
events. A Linux-based, realtime beer monitoring system may only have a
few milliseconds to deal with a "refrigerator door opened" interrupt before
one's roommate has swiped a bottle and left the scene. In this sort of
high-stakes deployment, interrupt latency is everything.
One of the biggest sources of interrupt latency is periods when the
processor has simply disabled interrupt delivery. Device drivers often
disable interrupts - on the local processor at least - to avoid creating
race conditions with themselves. Even (or especially) when spinlocks are
used to control concurrency with interrupt handlers, interrupts must be
disabled. Imagine a driver which duly acquires a spinlock before working
with its data structures. One of that driver's devices raises an interrupt while
the lock is held, and the interrupt handler runs on the same CPU. That
interrupt handler will try to acquire the same spinlock, and, finding it
busy, will proceed to spin until the lock becomes free. But, since the
interrupt handler has preempted the only thread which can ever release the
lock, it will spin forever. That is a different sort of interrupt latency
altogether, and one which even general-purpose kernels try to avoid. The
usual technique is simply to disable interrupts while holding a spinlock
which might be acquired by an interrupt handler. Disabling interrupts
solves the immediate problem, but it can lead to increased interrupt
latency.
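The deadlock is easy to reproduce in a toy model. What follows is a minimal user-space sketch, not kernel code: struct toy_cpu and the toy_*() functions are invented for illustration and model a single processor with one lock and one interrupt source. In a real driver, the protective pattern shown here is what spin_lock_irqsave() and spin_unlock_irqrestore() provide.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single CPU; none of these names are kernel interfaces. */
struct toy_cpu {
	bool irqs_enabled;	/* may the CPU take an interrupt right now? */
	bool lock_held;		/* is the driver's spinlock held? */
	bool irq_pending;	/* interrupt raised while masked, not yet run */
};

enum toy_outcome { TOY_HANDLED, TOY_DEFERRED, TOY_DEADLOCK };

/* The driver takes its lock, optionally masking interrupts first. */
void toy_lock(struct toy_cpu *cpu, bool disable_irqs)
{
	assert(!cpu->lock_held);
	if (disable_irqs)
		cpu->irqs_enabled = false;
	cpu->lock_held = true;
}

/* The device raises an interrupt on this CPU. */
enum toy_outcome toy_raise_irq(struct toy_cpu *cpu)
{
	if (!cpu->irqs_enabled) {
		cpu->irq_pending = true;	/* held back until unlock */
		return TOY_DEFERRED;
	}
	/*
	 * The handler runs immediately and wants the lock. If the code it
	 * interrupted holds that lock, the handler would spin forever.
	 */
	return cpu->lock_held ? TOY_DEADLOCK : TOY_HANDLED;
}

/*
 * The driver drops its lock. This toy always re-enables interrupts here;
 * the return value says whether a deferred interrupt was delivered.
 */
bool toy_unlock(struct toy_cpu *cpu)
{
	bool delivered;

	assert(cpu->lock_held);
	cpu->lock_held = false;
	cpu->irqs_enabled = true;
	delivered = cpu->irq_pending;
	cpu->irq_pending = false;
	return delivered;
}
```

Taking the lock with interrupts enabled and then raising an interrupt yields TOY_DEADLOCK; taking it with interrupts masked yields TOY_DEFERRED, and the handler runs only after toy_unlock(). The time spent between those two calls is exactly the interrupt latency the realtime developers are trying to shrink.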
Ingo Molnar's realtime preemption patches improve the situation by moving
interrupt handlers into their own processes. Since interrupt handlers are
scheduled with everything else, and since "spinlocks" no longer spin with
this patch set, the sort of deadlock described in the previous paragraph
cannot happen. So there is no longer any need to disable interrupts when
acquiring spinlocks. Changing the locking primitives eliminated the major
part of the code in the kernel which runs with interrupts disabled.
Daniel Walker recently noticed that one could do a little better - and
followed up with a patch showing how.
Fixing the locking primitives got rid of most of the driver code which runs
with interrupts turned off, but it did nothing for all of the places where
drivers explicitly disable interrupts themselves with a call to
local_irq_disable(). In most of these cases, the driver is simply
trying to avoid racing with its interrupt handler. But when interrupt
handlers run in their own threads, all that is really needed to avoid
concurrency problems is to disable preemption. So Daniel's patch reworks
local_irq_disable() to turn off preemption while leaving the interrupt
configuration alone. For the few cases where it is truly necessary to
disable interrupts at the hardware level, hard_local_irq_disable()
(later renamed to raw_local_irq_disable())
has been provided.
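The effect of the change can be sketched in a few lines of user-space C. This is an illustrative model only, under the assumption that every interrupt handler runs as a schedulable thread; struct toy_rt_cpu and the toy_*() functions are hypothetical names, not the interfaces from Daniel's or Ingo's patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a CPU with threaded interrupt handlers. */
struct toy_rt_cpu {
	int preempt_count;	/* > 0: the current task may not be preempted */
	bool hw_irqs_enabled;	/* hardware interrupt delivery state */
	int handlers_waiting;	/* handler threads woken but not yet run */
	int handlers_run;	/* handler threads which have completed */
};

/* Run every waiting handler thread, but only if preemption is allowed. */
void toy_schedule(struct toy_rt_cpu *cpu)
{
	if (cpu->preempt_count == 0) {
		cpu->handlers_run += cpu->handlers_waiting;
		cpu->handlers_waiting = 0;
	}
}

/* The "soft" disable: interrupts still arrive, handlers just wait. */
void toy_soft_irq_disable(struct toy_rt_cpu *cpu)
{
	cpu->preempt_count++;
}

void toy_soft_irq_enable(struct toy_rt_cpu *cpu)
{
	assert(cpu->preempt_count > 0);
	cpu->preempt_count--;
	toy_schedule(cpu);
}

/* The "hard" disable, for the few places which really need it. */
void toy_hard_irq_disable(struct toy_rt_cpu *cpu)
{
	cpu->hw_irqs_enabled = false;
}

/* A hardware interrupt; returns true if the CPU accepted it at once. */
bool toy_hw_interrupt(struct toy_rt_cpu *cpu)
{
	if (!cpu->hw_irqs_enabled)
		return false;		/* this is where latency comes from */
	cpu->handlers_waiting++;	/* wake the handler thread */
	toy_schedule(cpu);
	return true;
}
```

In this model, an interrupt arriving inside a toy_soft_irq_disable() section is accepted by the hardware immediately, and its handler thread runs as soon as toy_soft_irq_enable() is called. Only toy_hard_irq_disable() keeps the interrupt from being seen at all.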
One might argue that disabling preemption is counterproductive, given that
any code which runs with preemption disabled will contribute to the
scheduling latency problem. But any code which disables interrupts already
runs with preemption turned off, so the situation is not made any worse by
this patch. It could, in fact, be improved: all that really needs to be
protected against is preemption by one specific interrupt handler thread.
The extra scheduler complexity which would be required to implement that
solution is unlikely to be worth it, however; better to just fix the
drivers to use locks. So Ingo picked up Daniel's patch, spent a few
minutes completely reworking it, and added it to his realtime preemption
patch set.
Meanwhile, Karim Yaghmour was heard wondering:
I'm not sure exactly why you guys are reinventing the wheel. Adeos
already does this soft-cli/sti stuff for you, it's been available
for a few years already, tested, and ported to a number of
architectures, and is generalized, why not just adopt it?
It does seem that not everybody understands what the Adeos patch (available
from the Gna server) does.
The description of Adeos, in its current form, as a "nanokernel" probably
does this work a disservice; what Adeos really comes down to is a patch to
the kernel's interrupt handling code.
To reduce interrupt latency, Adeos takes the classic approach of adding a
layer of indirection. The patch adds an "interrupt pipeline" to the
low-level, architecture-specific code. Any "domain" (read "piece of code")
can register itself with this interrupt pipeline, providing a priority as
it does so. Whenever a hardware interrupt arrives, Adeos works its way
down the pipeline, calling into the handler of each domain which has
expressed an interest in that interrupt. The higher-priority handlers are,
of course, called first.
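A priority-ordered pipeline of this sort is simple to sketch. The code below is a toy illustration of the idea as described here, not the Adeos API: every name in it (struct toy_domain, toy_register_domain(), toy_dispatch_irq(), and so on) is invented, and the real patch does a good deal more than walk an array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DOMAINS	8
#define TOY_NR_IRQS	32
#define TOY_LOG_SIZE	16

struct toy_domain;
typedef void (*toy_handler)(const struct toy_domain *self, unsigned int irq);

struct toy_domain {
	const char *name;
	int priority;		/* larger values sit earlier in the pipeline */
	unsigned int irq_mask;	/* bit n set: domain wants interrupt n */
	toy_handler handler;
	void *ctx;		/* private data for the handler */
};

struct toy_pipeline {
	struct toy_domain domains[TOY_MAX_DOMAINS];	/* highest priority first */
	size_t count;
};

/* Example handler: records the priority of each domain as it is called. */
struct toy_log {
	int priority_seen[TOY_LOG_SIZE];
	size_t n;
};

void toy_log_handler(const struct toy_domain *self, unsigned int irq)
{
	struct toy_log *log = self->ctx;

	(void)irq;
	if (log->n < TOY_LOG_SIZE)
		log->priority_seen[log->n++] = self->priority;
}

/* Insert a domain, keeping the array sorted by descending priority. */
int toy_register_domain(struct toy_pipeline *p, const struct toy_domain *d)
{
	size_t i;

	if (p->count == TOY_MAX_DOMAINS)
		return -1;
	for (i = p->count; i > 0 && p->domains[i - 1].priority < d->priority; i--)
		p->domains[i] = p->domains[i - 1];
	p->domains[i] = *d;
	p->count++;
	return 0;
}

/* Walk down the pipeline, calling each interested domain in turn. */
size_t toy_dispatch_irq(struct toy_pipeline *p, unsigned int irq)
{
	size_t i, called = 0;

	assert(irq < TOY_NR_IRQS);
	for (i = 0; i < p->count; i++) {
		if (p->domains[i].irq_mask & (1u << irq)) {
			p->domains[i].handler(&p->domains[i], irq);
			called++;
		}
	}
	return called;
}
```

If a domain standing in for Linux registers at priority 10 for all interrupts, and a realtime domain registers at priority 100 for interrupt 5 only, then dispatching interrupt 5 calls the realtime handler first and the Linux handler second, regardless of the order of registration.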
In this world, the regular Linux interrupt subsystem is registered as just
another Adeos domain. Any code which absolutely, positively must have its
interrupts arrive within microseconds can register itself as a
higher-priority domain. When interrupt time comes, the high-priority code
can respond to the interrupt before Linux even hears about it. Since
nothing in Linux can possibly get in the way (unless it does evil things to
the hardware), there is no need to worry about which parts of Linux might
create latency problems.
Some benchmark results were recently posted; they showed generally better
performance from Adeos than from the
realtime preemption patch. Some issues have been raised, however, with how
those numbers were collected; the tests are set to be rerun in the near
future.
Meanwhile, a slow debate over inclusion of the realtime work continues,
with some participants pushing for the code to be merged eventually, others
being skeptical, and a few asking for the realtime discussion to be removed
from linux-kernel altogether. One viewpoint worth considering can be found
in this posting from Gerrit Huizenga, who
argued that the realtime patches of today resemble the scalability patches
from a few years ago, and that they must follow a similar path toward
inclusion:
I believe that any effort towards mainline support of RT has to
follow a similar set of guidelines. And, I believe strongly that
*most* of the RT code should be crafted so that every single laptop
user is running most of the code *and* benefiting from it. If most
of the RT code goes unused by most of the population, and the only
way to get an RT kernel of any reasonable level is to ask the
distros to build yet another configuration, RT will always be a
poor, undertested, underutilized ugly stepchild of Linux.
Ingo Molnar clearly understands this; he has consistently worked toward
making the realtime patches minimally intrusive and useful in many
situations. Parts of the realtime work have already been merged, and this
process may continue. There may come a time when developers will be
surprised to discover that most of the realtime preemption patch can be
found in the mainline.
NAPI performance - a weighty matter
Modern network interfaces are easily capable of handling thousands of
packets per second. They are also capable of burying the host processor
under thousands of interrupts per second. As a way of dealing with the
interrupt problem (and fixing some other things as well), the networking
hackers added the NAPI driver interface. NAPI-capable drivers can, when
traffic gets high, turn off receive interrupts and collect incoming packets
in a polling mode. Polling is normally considered to be bad news, but,
when there is always data waiting on the interface, it turns out to be the
more efficient way to go. Some details on NAPI can be found in this LWN
Driver Porting Series article; rather more details are available from the
networking chapter in LDD3.
One of the things NAPI-compliant drivers must do is to specify the "weight"
of each interface. The weight parameter helps to determine how important
traffic from that interface is - it limits the number of packets each
interface can feed to the networking core in each polling cycle. This
parameter also controls whether the interface runs in the polling mode or
not; by the NAPI conventions, an interface which does not have enough
built-up traffic to fill its quota of packets (where the quota is determined
by the interface's weight) should go back to the interrupt-driven mode. The
weight is thus a fundamental parameter controlling how packet reception is
handled, but there has never been any real guidance from the networking
crew on how the weight should be set. Most driver writers pick a value
between 16 and 64, with interfaces capable of higher speeds usually setting
larger values.
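The double role played by the weight can be seen in a stripped-down model of a driver's polling pass. This is a sketch under simplifying assumptions, not the real NAPI interface: struct toy_nic, toy_rx_interrupt(), and toy_poll() are hypothetical names, and the quota is taken to be equal to the weight.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy network interface; not a real driver structure. */
struct toy_nic {
	int weight;	/* most packets delivered in one polling pass */
	int rx_pending;	/* packets waiting in the receive ring */
	bool polling;	/* true: receive interrupts are turned off */
};

/* Receive interrupt: mask further interrupts and switch to polling. */
void toy_rx_interrupt(struct toy_nic *nic)
{
	nic->polling = true;
}

/*
 * One polling pass. At most "weight" packets are handed to the
 * networking core; if the quota was not filled, the interface
 * goes back to interrupt-driven mode.
 */
int toy_poll(struct toy_nic *nic)
{
	int quota = nic->weight;
	int done;

	assert(nic->polling && quota > 0);
	done = nic->rx_pending < quota ? nic->rx_pending : quota;
	nic->rx_pending -= done;
	if (done < quota)
		nic->polling = false;	/* re-enable receive interrupts */
	return done;
}
```

With a weight of 16 and 40 packets waiting, three passes deliver 16, 16, and 8 packets; the third pass falls short of the quota, so the interface leaves polling mode.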
Some recent discussions on the netdev list have raised the issue of how the
weight of an interface should be set. In particular, the e1000 driver
hackers have discovered that their interface tends to perform better when
its weight is set lower - with the optimal value being around 10.
Investigations into this behavior continue, but a few observations have
come out; they give a view into what is really required to get top
performance out of modern hardware.
One problem, which appears to be specific to the e1000, is that the
interface runs out of receive buffers. The e1000 driver, in its
poll() function, will deliver its quota of packets to the
networking core; only when that process is complete does the driver concern
itself with providing more receive buffers to the interface. So one
short-term tactic would be to replenish the receive buffers more often.
Other interface drivers tend not to wait until an entire quota has been
processed to perform this replenishment. Lowering the weight of an
interface is one way to force this replenishment to happen more often
without actually changing the driver's logic.
But questions remain: why is the system taking so long to process 64
packets that a 256-packet ring is being exhausted? And why does
performance increase for smaller weights even when packets are not being
dropped? One possible explanation is that the actual amount of work being
done for each packet in the networking core can vary greatly depending on
the type of traffic being handled. Big TCP streams, in particular, take
longer to process than bursts of small UDP packets. So, depending on the
workload, processing one quota's worth of packets might take quite some
time.
This processing time affects performance in a number of ways. If the
system spends large bursts of time in software interrupt mode to deal with
incoming packets, it will be starving the actual application for processor
time. The overall latency of the system goes up, and performance goes
down. Smaller weights can lead to better interleaving of system and
application time.
A related issue is this check in the networking core's polling logic:
	if (budget <= 0 || jiffies - start_time > 1)
		goto softnet_break;
Essentially, if the networking core is still polling interfaces after the
jiffies counter has advanced by more than one tick (somewhere between one
and two milliseconds on a system running with HZ=1000),
it decides that things have gone on for long enough and it's time to take a
break. If one high-weight interface is taking a lot of time to get its
packets through the system, the packet reception process can be cut short
early, perhaps before other interfaces have had their opportunity to deal
with their traffic. Once again, smaller weights can help to mitigate this
problem.
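How one expensive interface can crowd out another is easiest to see by simulation. The following is a rough user-space model, not the kernel's polling loop: struct toy_iface and toy_rx_action() are invented names, the per-packet costs are made-up numbers, and the jiffies counter is modeled as elapsed microseconds divided by the tick length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy interface with a fixed per-packet processing cost. */
struct toy_iface {
	int weight;		/* quota per polling pass */
	int pending;		/* packets waiting to be received */
	int cost_us_per_packet;	/* assumed processing cost, microseconds */
	int delivered;		/* packets handed to the networking core */
};

/*
 * Poll the interfaces round-robin until all are drained or the break
 * condition fires. Returns true if polling was cut short.
 */
bool toy_rx_action(struct toy_iface *ifaces, size_t n, int budget, long tick_us)
{
	long elapsed_us = 0;
	bool work_left = true;
	size_t i;

	assert(tick_us > 0);
	while (work_left) {
		work_left = false;
		for (i = 0; i < n; i++) {
			struct toy_iface *it = &ifaces[i];
			int done;

			if (it->pending == 0)
				continue;
			/* Same test as above; the tick count is elapsed_us / tick_us. */
			if (budget <= 0 || elapsed_us / tick_us > 1)
				return true;
			done = it->pending < it->weight ? it->pending : it->weight;
			it->pending -= done;
			it->delivered += done;
			budget -= done;
			elapsed_us += (long)done * it->cost_us_per_packet;
			if (it->pending > 0)
				work_left = true;
		}
	}
	return false;
}
```

Give the first interface a weight of 64, 64 waiting packets, and a cost of 40 microseconds per packet, and the second interface 10 cheap packets; with a 1000-microsecond tick and a budget of 300, the loop breaks before the second interface is polled at all. Drop the first interface's weight to 16 and both interfaces are fully drained, since the second gets its turn after the first pass.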
Finally, an overly large weight can work against the performance of an
interface when traffic is at moderate levels. If the driver does not fill
its entire quota in one polling cycle, it will turn off polling and go back
into interrupt-driven mode. So a steady stream of traffic which does not
quite fill the quota will cause the driver to bounce between the polling
and interrupt modes, and the processor will have to handle far more
interrupts than would otherwise be expected. Slower interfaces
(100 Mb/sec and below) are particularly vulnerable to this problem; on a
fast system, such interfaces simply cannot receive enough data to fill the
quota every time.
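The bouncing effect can be counted with a small simulation. This is a deliberately crude sketch: toy_count_interrupts() is a hypothetical function which assumes a perfectly steady arrival rate and exactly one polling pass per cycle, neither of which holds on real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count the receive interrupts taken over a number of cycles.
 * Each cycle, a fixed number of packets arrives and (if the interface
 * is in polling mode) one polling pass runs.
 */
unsigned long toy_count_interrupts(int weight, int arrivals_per_cycle, int cycles)
{
	bool polling = false;
	int pending = 0;
	unsigned long interrupts = 0;
	int c;

	assert(weight > 0 && arrivals_per_cycle >= 0 && cycles >= 0);
	for (c = 0; c < cycles; c++) {
		pending += arrivals_per_cycle;
		if (!polling && pending > 0) {
			interrupts++;		/* interrupt switches to polling */
			polling = true;
		}
		if (polling) {
			int done = pending < weight ? pending : weight;

			pending -= done;
			if (done < weight)
				polling = false;	/* quota not filled */
		}
	}
	return interrupts;
}
```

With ten packets arriving per cycle over 1000 cycles, a weight of 64 costs 1000 interrupts, one for every cycle. A weight of 10 costs exactly one: the quota is filled every time, so the interface never leaves polling mode.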
From all this information, some conclusions have emerged:
- There needs to be a smarter way of setting each interface's weight;
the current "grab the setting from some other driver" approach does
not always yield the right results.
- The direct tie between an interface's weight and its packet quota is
too simple. Each interface's quota should actually be determined, at
run time, by the amount of work that interface's packet stream is
creating.
- The quota value should not also be the threshold at which drivers
return to interrupt-driven mode. The cost of processor interrupts is
high enough that polling mode should be used as long as traffic
exists, even when an interface almost never fills its quota.
Changing the code to implement these conclusions is likely to be a long
process. Fundamental tweaks in the core of the networking code can lead to
strange performance regressions in surprising places. In the meantime,
Stephen Hemminger has posted a patch which creates a sysfs knob for the
interface weight. That patch
has been merged for 2.6.12, so people working on networking performance
problems will soon be able to see if adjustable interface weights can be
part of the solution.
Patches and updates
Kernel trees
Core kernel code
Development tools
- Marco Costalba: qgit-0.4.
(June 13, 2005)
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Architecture-specific
Security-related
Benchmarks and bugs
Miscellaneous
- Nick Piggin: blkstat.
(June 13, 2005)
Page editor: Jonathan Corbet