Brief itemsdirect migration patches, and various architecture updates.
The current -mm tree is 2.6.16-rc1-mm4. Recent changes to -mm include some per-CPU variable tweaks, a representation of system CPU topology in sysfs, and various fixes. As Andrew puts it: "Things have been pretty quiet lately - most activity seems to be concentrated about putting bugs into the various subsystem trees."
The current stable 2.6 release is 22.214.171.124, announced on January 30. It includes a handful of fixes and a security patch. Expect another update before too long, however, as a few "box-killing bugs" are still known to exist in 2.6.15.
The stable kernel team has recently agreed to continue support for the previous kernel for a little longer. The result is 126.96.36.199, with a handful of important fixes.
Kernel development news
-- Van Jacobson's linux.conf.au slides
Only folks like Van Jacobson can take us out of the myopic view we currently have of how networking receive is done.
-- David Millerpublished a summary of the proposals and noted that the OSDL board has agreed to implement the full set. "There is no Linux technical conference in the US anymore. If this could be addressed with a conference much like ALS used to be, it would be a very good thing. We need to nurture the technical community across the US with regional conferences that are easy to access in order to help seed the creation of new developers for Linux."
Unfortunately, attending Van's talk at linux.conf.au this year was not in the program. Fortunately, David Miller was there and listening carefully. Van has figured out how the next round of networking performance improvements will happen, and he has the numbers to prove it. Expect some very interesting (and fundamental) changes in the Linux networking stack as Van's ideas are incorporated. This article attempts to cover the fundamentals of Van's scheme (called "channels") based on David's weblog entry and Van's slides [PDF].
Van, like many others, points out that the biggest impediment to scalability on contemporary hardware is memory performance. Current processors can often execute multiple instructions per nanosecond, but loading a cache line from memory still takes 50ns or more. So cache behavior will often be the dominant factor in the performance of kernel code. That is why simply making code smaller often makes it faster. The kernel developers understand cache behavior well, and much work has gone into improving cache utilization in the kernel.
The Linux networking stack (like all others) does a number of things which reduce cache performance, however. These include:
To demonstrate what can happen, Van ran some netperf tests on an instrumented kernel. On a single CPU system, processor utilization was 50%, of which 16% was in the socket code, 5% in the scheduler, and 1% in the application. On a two-processor system, utilization went to 77%, including 24% in the socket code and 12% in the scheduler. That is a worst case scenario in at least one way: the application and the interrupt handler were configured to run on different CPUs. Things will not always be that bad in the real world, but, as the number of processors increases, the chances of the interrupt handler running on the same processor as any given application decrease.
The key to better networking scalability, says Van, is to get rid of locking and shared data as much as possible, and to make sure that as much processing work as possible is done on the CPU where the application is running. It is, he says, simply the end-to-end principle in action yet again. This principle, which says that all of the intelligence in the network belongs at the ends of the connections, doesn't stop at the kernel. It should continue, pushing as much work as possible out of the core kernel and toward the actual applications.
The tool used to make this shift happen is the "net channel," intended to be a replacement for the socket buffers and queues used in the kernel now. Some details of how channels are implemented can be found in Van's slides, but all that really matters is the core concept: a channel is a carefully designed circular buffer. Properly done, circular buffers require no locks and share no writable cache lines between the producer and the consumer. So adding data to (or removing data from) a net channel will be a fast, cache-friendly operation.
As a first step, channels can be pushed into the driver interface. A network driver need no longer be aware of sk_buff structures and such; instead, it simply drops incoming packets into a channel as they are received. Making this change cuts the CPU utilization in the two-processor case back to 58%. But things need not stop there. A next logical step would be to get rid of the networking stack processing at softirq level and to feed packets directly into the socket code via a channel. Doing that requires creating a separate channel for each socket and adding a simple packet classifier so that the driver knows which channel should get each packet. The socket code must also be rewritten to do the protocol processing (using the existing kernel code). That change drops the overall CPU utilization to 28%, with the portion spent at softirq level dropping to zero.
But why stop there? If one wants to be serious about this end-to-end thing, one could connect the channel directly to the application. Said application gets the packet buffers mapped directly into its address space and performs protocol processing by way of a user-space library. This would be a huge change in how Linux does networking, but Van's results speak for themselves. Here is his table showing the percentage CPU utilization for each of the cases described above:
Total CPU Interrupt SoftIRQ Socket Locks Sched App. 1 CPU 50 7 11 16 8 5 1 2 CPUs 77 9 13 24 14 12 1 Driver channel 58 6 12 16 9 9 1 Socket channel 28 6 0 16 1 3 1 App. channel 14 6 0 0 0 2 5
The bottom line (literally) is this: processing time for the packet stream dropped to just over 25% of the previous single-CPU case, and less than 20% of the previous two-CPU behavior. Three layers of kernel code have been shorted out altogether, with the remaining work performed in the driver interrupt handler and the application itself. The test system running with the full application channel code was able to handle twice the network bandwidth as an unmodified system - with the processors idle most of the time.
Linux networking hackers have always been highly attentive to performance issues, so numbers like these are bound to get their attention. Beyond performance, however, this approach promises simpler drivers and a reasonably straightforward transition between the current stack and a future stack built around channels. A channel-based user-space interface will make it easy to create applications which can send and receive packets using any protocol. If Van's results hold together in a "real-world" implementation, the only remaining question would be: when will it be merged so the rest of us can use it?
In theory, a counter is just a simple integer variable. In an SMP environment, however, that variable must be protected against concurrent updates, or it will eventually get corrupted. The tool that kernel hackers reach for first in this situation is the atomic_t type. Atomic variables are simple integers with a set of atomic operations. If you have an atomic_t variable called counter, that counter can be incremented with a call like:
and its value will be changed in an SMP-safe, interrupt-safe manner. These operations are relatively fast, being hand-coded to use the mechanisms provided by each host architecture. In many cases, an atomic_t counter is the best solution to the problem.
The problem with atomic_t counters is that they use expensive locked operations, and they require that the current CPU obtain exclusive cache access for the variable. A frequently-modified atomic counter can cause a cache line to bounce constantly between CPUs, impacting the performance of the entire system. As an example, consider this patch set from Ravikiran Thirumalai. He replaced a single counter (the memory_allocated field of the proto structure) in the networking code with a more SMP-friendly counter, and reported a 5% improvement in an Apache benchmark on an eight-processor system. 5% is a nice improvement for changing a single counter, but it seems that perhaps even better results could be had.
Ravikiran replaced the atomic_t counter with the percpu_counter type. These counters use per-CPU variables to hold a CPU-local count. Modifying that count is fast, since it is local to the given CPU, no locking is required, and no cache lines need be moved from other processors. If any given processor's count exceeds a given threshold, its value is added to a (spinlock-protected) global count, and the CPU-local count is set back to zero. Queries of the counter look only at the global count. The result is a counter which is somewhat approximate, but quite fast. In many cases, an "almost right" count is entirely good enough.
Per-CPU counters become increasingly inaccurate as the number of processors grows, however. Each processor has a certain residual count which has not yet been folded into the global count. In situations where counters tend to increase, the result will be a global count which underestimates the real value, and which is increasingly wrong on larger systems. Per-CPU counters are also memory-intensive, partly due to inefficiencies in how per-CPU variables are allocated.
So the discussion wandered toward another possibility implemented with the somewhat obscure local_t type. This type is apparently intended to function as a sort of atomic_t which is only visible to a single CPU; it is currently only used in two places in the kernel: to manage module reference counts and in the x86-64 architecture code. It supports a set of operations similar to atomic_t: local_set(), local_read, local_add(), etc. There is also a set of variants (cpu_local_set(), ...) intended for use with a local_t declared as a per-CPU variable. The default implementation uses atomic_t for 32-bit systems and a strange three-variable structure for 64-bit systems. All architectures are encouraged to reimplement the type in a more efficient, interrupt-safe manner, however, and that has been done for several of them.
The local_t solution would set up two counters for each CPU, a flag saying which of the two is in use, and a global count. For many operations, they would behave just like percpu_counter, and they could yield the same approximate answer. Should a precise count be needed, however, the "which counter" bit would be flipped and all of the per-CPU offsets summed. The result would be an exact count at the time the bit was flipped, at the cost of taking a spinlock and iterating through the array.
All of this starts to look a little elaborate, however, and that may be the point where kernel developers lose interest. A counter should only be so complex, and making the code more twisted can only improve things to a point. Sooner or later, people will decide that there are more important things to be working on.
One approach is the user-space interface, currently being developed by Rafael Wysocki. Rafael's patch is similar in spirit to the user-space patch covered here last September. It no longer uses /dev/kmem, however; instead, it sets up a dedicated device for the software suspend operations. A user-space program can then invoke a set of ioctl() operations to freeze the system, allocate swap space, and move memory pages to their resting place - possibly compressing or encrypting them on the way. The documentation file provided with the patch gives a good introduction to the interface and how it should be used.
In the other corner we have Nigel Cunningham, who has recently broken out the modules mechanism from his Suspend2 patch set. Rather than move image writing and reading support to user space, this patch sets up a complex kernel interface for plugins which take on parts of that task. There are two types of plugins: "filter" plugins which transform the image data (performing encryption, say) and "writer" plugins which handle the actual storage I/O. Parts of the code anticipate "misc" and "checksum" plugins as well, but those are not currently supported.
The plugin API is somewhat complex. Each plugin has eleven methods to provide to the core suspend code; these handle memory allocation, configuration, initialization and cleanup. Filter plugins must define three more methods to handle data passing through for processing. And writer modules have an additional 21 methods to provide for dealing with various parts of that task. There are, it seems, a lot of things that have to be done to get an image written to (and read from) persistent storage.
The two patches are clearly incompatible - there is no point in setting up an elaborate in-kernel interface if the whole process is to be moved out of the kernel altogether, and vice-versa. So, before merging either of these patches, somebody will have to make a decision. Anyone looking for tea leaves to read might take a hint from the fact that the user-space patches are currently in the -mm tree. As the reiser4 folks (among others) know, however, the road from -mm to mainline can be long and perilous.
Patches and updates
Core kernel code
Page editor: Jonathan Corbet
Next page: Distributions>>
Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds