Brief items
The current 2.6 prepatch remains 2.6.16-rc1. Linus has been busily
merging patches, however, with the apparent aim of releasing -rc2
immediately after this article is published. 2.6.16-rc2 will contain a lot
of fixes, but it also has another set of semaphore-to-mutex conversions, a
USB driver for ET61X151 and ET61X251 camera controllers, a big Video4Linux
update, the
direct migration
patches, and various architecture updates.
The current -mm tree is 2.6.16-rc1-mm4. Recent changes
to -mm include some per-CPU variable tweaks, a representation of system CPU
topology in sysfs, and various fixes. As Andrew puts it: "Things
have been pretty quiet lately - most activity seems to be concentrated
about putting bugs into the various subsystem trees."
The current stable 2.6 release is 2.6.15.2, announced on January 30. It
includes a handful of fixes and a security patch. Expect
another update before too long, however, as a
few "box-killing bugs" are still known to exist in 2.6.15.
The stable kernel team has recently agreed to continue support for the
previous kernel for a little longer. The result is 2.6.14.7, with a handful of important fixes.
Comments (2 posted)
Kernel development news
The Linux networking stack isn't broken. The people who take care
of the stack know what they're doing & do good work. Based on all
the measurements I'm aware of, Linux has the fastest & most
complete stack of any OS.
-- Van Jacobson's linux.conf.au slides
To do this stuff right you want networking experts (not UNIX
interface standards experts) to come up with how to do things,
because folks like POSIX are going to make a rocking implementation
next to impossible.
Only folks like Van Jacobson can take us out of the myopic view we
currently have of how networking receive is done.
-- David Miller
Comments (none posted)
A group of kernel developers has been working for some time to try to help OSDL improve its interactions with the development and vendor communities. The result was a set of proposals presented to the OSDL board. Greg Kroah-Hartman has now
published a summary of the proposals and noted that the OSDL board has agreed to implement the full set. "
There is no Linux technical conference in the US anymore. If this could be addressed with a conference much like ALS used to be, it would be a very good thing. We need to nurture the technical community across the US with regional conferences that are easy to access in order to help seed the creation of new developers for Linux."
Comments (17 posted)
Your editor had the good fortune to see Van Jacobson speak at the 1989
USENIX conference. His talk covered some of the bleeding-edge topics of
the time, including TCP slow start algorithms and congestion avoidance. It
was the "how Van saved the net" talk (though he certainly did not put it
in those terms), and, many years later, the impression from that talk
remains. Van Jacobson is a smart guy.
Unfortunately, attending Van's talk at linux.conf.au this year was not in
the program. Fortunately, David
Miller was there and listening carefully. Van has figured out how the
next round of networking performance improvements will happen, and he has
the numbers to prove it. Expect some very interesting (and fundamental)
changes in the Linux networking stack as Van's ideas are incorporated.
This article attempts to cover the fundamentals of Van's scheme (called
"channels") based on David's weblog entry and Van's slides
[PDF].
Van, like many others, points out that the biggest impediment to
scalability on contemporary hardware is memory performance. Current
processors can often execute multiple instructions per nanosecond, but
loading a cache line from memory still takes 50ns or more. So cache
behavior will often be the dominant factor in the performance of kernel
code. That is why simply making code smaller often makes it faster. The
kernel developers understand cache behavior well, and much work has gone
into improving cache utilization in the kernel.
The Linux networking stack (like all others) does a number of things which
reduce cache performance, however. These include:
- Passing network packets through multiple layers of the kernel. When a
packet arrives, the network card's interrupt handler begins the task
of feeding the packet to the kernel. The remainder of the work may
well be performed at software interrupt level within the driver (in a
tasklet, perhaps). The core network processing happens in another
software interrupt. Copying the data (an expensive operation in
itself) to the application happens in kernel context. Finally the
application itself does something interesting with the data. The
context changes are expensive, and if any of these changes causes the
work to move from one CPU to another, a big cache penalty results.
Much work has been done to improve CPU locality in the networking
subsystem, but much remains to be done.
- Locking is expensive. Taking a lock requires a cross-system atomic
operation and moves a cache line between processors. Locking costs
have led to the development of lock-free techniques like seqlocks and read-copy-update, but the
the networking stack (like the rest of the kernel) remains full of locks.
- The networking code makes extensive use of queues implemented with
doubly-linked lists. These lists have poor cache behavior since they
require each user to make changes (and thus move cache lines) in
multiple places.
To demonstrate what can happen, Van ran some netperf tests on
an instrumented kernel. On a single CPU system, processor utilization was
50%, of which 16% was in the socket code, 5% in the scheduler, and 1% in
the application. On a two-processor system, utilization went to 77%,
including 24% in the socket code and 12% in the scheduler. That is a worst
case scenario in at least one way: the application and the interrupt
handler were configured to run on different CPUs. Things will not always
be that bad in the real world, but, as the number of processors increases,
the chances of the interrupt handler running on the same processor as any
given application decrease.
The key to better networking scalability, says Van, is to get rid of
locking and shared data as much as possible, and to make sure that as much
processing work as possible is done on the CPU where the application is
running. It is, he says, simply the end-to-end principle in action yet
again. This principle, which says that all of the intelligence in the
network belongs at the ends of the connections, doesn't stop at the
kernel. It should continue, pushing as much work as possible out of the
core kernel and toward the actual applications.
The tool used to make this shift happen is the "net channel," intended to
be a replacement for the socket buffers and queues used in the kernel now.
Some details of how channels are implemented can be found in Van's slides,
but all that really matters is the core concept: a channel is a carefully
designed circular buffer. Properly done, circular buffers require no locks
and share no writable cache lines between the producer and the consumer.
So adding
data to (or removing data from) a net channel will be a fast,
cache-friendly operation.
As a first step, channels can be pushed into the driver interface. A
network driver need no longer be aware of sk_buff structures and
such; instead, it simply drops incoming packets into a channel as they are
received. Making this change cuts the CPU utilization in the two-processor case
back to 58%. But things need not stop there. A next logical step would be
to get rid of the networking stack processing at softirq level and to feed
packets directly into the socket code via a channel. Doing that requires
creating a separate channel for each socket and adding a simple packet
classifier so that the driver knows which
channel should get each packet. The socket code must also be rewritten to do
the protocol processing (using the existing kernel code). That change
drops the overall CPU utilization to
28%, with the portion spent at softirq level dropping to zero.
But why stop there? If one wants to be serious about this end-to-end
thing, one could connect the channel directly to the application. Said
application gets the packet buffers mapped directly into its address space
and performs protocol processing by way of a user-space library. This
would be a huge change in how Linux does networking, but Van's results
speak for themselves. Here is his table showing the percentage CPU
utilization for each of the cases described above:
| Total CPU | Interrupt | SoftIRQ |
Socket | Locks | Sched | App. |
| 1 CPU |
50 |
7 |
11 |
16 |
8 |
5 |
1 |
| 2 CPUs |
77 |
9 |
13 |
24 |
14 |
12 |
1 |
| Driver channel |
58 |
6 |
12 |
16 |
9 |
9 |
1 |
| Socket channel |
28 |
6 |
0 |
16 |
1 |
3 |
1 |
| App. channel |
14 |
6 |
0 |
0 |
0 |
2 |
5 |
The bottom line (literally) is this: processing time for the packet stream
dropped to just over 25% of the previous single-CPU case, and less than 20%
of the previous two-CPU behavior. Three layers of kernel code have been
shorted out altogether, with the remaining work performed in the driver
interrupt handler and the application itself. The test system running
with the full application channel code was able to handle twice the
network bandwidth as an unmodified system - with the processors idle most
of the time.
Linux networking hackers have always been highly attentive to performance
issues, so numbers like these are bound to get their attention. Beyond
performance, however, this approach promises simpler drivers and a
reasonably straightforward transition between the current stack and a
future stack built around channels. A channel-based user-space interface
will make it easy to create applications which can send and receive packets
using any protocol. If Van's results hold together in a "real-world"
implementation, the only remaining question would be: when will it be
merged so the rest of us can use it?
Comments (63 posted)
The kernel needs to count a lot of things. There are counters for
networking statistics, usage of various resources, and so on. One would
ordinarily think that operating a counter would be a relatively
straightforward task, but ordinarily simple things can become complicated
in the kernel context, especially when the number of processors involved
gets large.
In theory, a counter is just a simple integer variable. In an SMP
environment, however, that variable must be protected against concurrent
updates, or it will eventually get corrupted. The tool that kernel hackers
reach for first in this situation is the atomic_t type. Atomic
variables are simple integers with a set of atomic operations. If you have
an atomic_t variable called counter, that counter can be
incremented with a call like:
atomic_inc(&counter);
and its value will be changed in an SMP-safe, interrupt-safe manner. These
operations are
relatively fast, being hand-coded to use the mechanisms provided by each
host architecture. In many cases, an atomic_t counter is the best
solution to the problem.
The problem with atomic_t counters is that they use expensive
locked operations, and they require that the current CPU obtain exclusive
cache access for the variable. A frequently-modified atomic counter can
cause a cache line to bounce constantly between CPUs, impacting the
performance of the entire system. As an example, consider this patch set from Ravikiran
Thirumalai. He replaced a single counter (the memory_allocated
field of the proto structure) in the networking code with a more
SMP-friendly counter, and reported a 5% improvement in an Apache benchmark
on an eight-processor system. 5% is a nice improvement for changing a
single counter, but it seems that perhaps even better results could be had.
Ravikiran replaced the atomic_t counter with the
percpu_counter type. These counters use per-CPU variables to hold
a CPU-local count. Modifying that count is fast, since it is local to the
given CPU, no locking is required, and no cache lines need be moved from
other processors. If any given processor's count exceeds a given
threshold, its value is added to a (spinlock-protected) global count, and
the CPU-local count is set back to zero. Queries of the counter look only
at the global count. The result is a counter which is somewhat
approximate, but quite fast. In many cases, an "almost right" count is
entirely good enough.
Per-CPU counters become increasingly inaccurate as the number of processors
grows, however. Each processor has a certain residual count which has not
yet been folded into the global count. In situations where counters tend
to increase, the result will be a global count which underestimates the
real value, and which is increasingly wrong on larger systems. Per-CPU
counters are also memory-intensive, partly due to inefficiencies in how
per-CPU variables are allocated.
So the discussion wandered toward another
possibility implemented with the somewhat obscure local_t
type. This type is apparently intended to function as a sort of
atomic_t which is only visible to a single CPU; it is currently
only used in two places in the kernel: to manage module reference counts
and in the x86-64 architecture code. It supports a set of
operations similar to atomic_t: local_set(),
local_read, local_add(), etc. There is also a set of
variants (cpu_local_set(), ...) intended for use with a local_t
declared as a per-CPU variable. The default implementation uses
atomic_t for 32-bit systems and a strange three-variable
structure for 64-bit systems. All architectures are encouraged to
reimplement the type in a more efficient, interrupt-safe manner, however,
and that has been done for several of them.
The local_t solution would set up two counters for each CPU, a
flag saying which of the two is in use, and a global count. For many operations,
they would behave just like percpu_counter, and they could yield
the same approximate answer. Should a precise count be needed, however,
the "which counter" bit would be flipped and all of the per-CPU offsets
summed. The result would be an exact count at the time the bit was
flipped, at the cost of taking a spinlock and iterating through the array.
All of this starts to look a little elaborate, however, and that may be the
point where kernel developers lose interest. A counter should only be so
complex, and making the code more twisted can only improve things to a
point. Sooner or later, people will decide that there are more important
things to be working on.
Comments (10 posted)
The software suspend story seems to repeat itself endlessly. Developers
debate multiple implementations while no decision gets made and software
suspend in Linux continues to fall short of what it could really be. One
place where this discussion might actually come to a head soon is in the storage
and retrieval of the suspend image - the copy of system memory which is
stored on disk while the system is down. Two approaches are being pushed;
they reveal two very different views of the problem.
One approach is the user-space interface, currently being developed by
Rafael Wysocki. Rafael's patch is similar in spirit to the user-space
patch covered here last
September. It no longer uses /dev/kmem, however; instead, it
sets up a dedicated device for the software suspend operations. A
user-space program can then invoke a set of ioctl() operations to
freeze the system, allocate swap space, and move memory pages to their
resting place - possibly compressing or encrypting them on the way. The documentation file provided with the patch
gives a good introduction to the interface and how it should be used.
In the other corner we have Nigel Cunningham, who has recently broken out
the modules mechanism from
his Suspend2 patch set. Rather than move image writing and reading support
to user space, this patch sets up a complex kernel interface for plugins
which take on parts of that task. There are two types of plugins: "filter"
plugins which transform the image data (performing encryption, say) and
"writer" plugins which handle the actual storage I/O. Parts of the code
anticipate "misc" and "checksum" plugins as well, but those are not
currently supported.
The plugin API is somewhat complex. Each
plugin has eleven methods to provide to the core suspend code; these handle
memory allocation, configuration, initialization and cleanup. Filter
plugins must define three more methods to handle data passing through for
processing. And writer modules have an additional 21 methods to provide
for dealing with various parts of that task. There are, it seems, a lot of
things that have to be done to get an image written to (and read from)
persistent storage.
The two patches are clearly incompatible - there is no point in setting up
an elaborate in-kernel interface if the whole process is to be moved out of
the kernel altogether, and vice-versa. So, before merging either of these
patches, somebody will have to make a decision. Anyone looking for tea
leaves to read might take a hint from the fact that the user-space patches
are currently in the -mm tree. As the reiser4 folks (among others) know,
however, the road from -mm to mainline can be long and perilous.
Comments (3 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
- Junio C Hamano: GIT 1.1.5.
(January 28, 2006)
Device drivers
Memory management
Networking
Architecture-specific
Security-related
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>