No fear of the fire engine
[Posted December 3, 2003 by corbet]
The Linux developers have long, and with reason, been pleased with
the performance of the kernel's networking subsystem. For various
reasons, there is also a longstanding rivalry between the Linux
networking hackers and their Sun counterparts. So when The Register posted
an article about the "Fire Engine" networking stack which will be part of future
Solaris releases, it drew some attention. This quote from John Fowler,
Sun's software CTO, also didn't help:
    Also we focused on CPU utilization. One of the little secrets of
    networking is high speed interfaces can in fact pump lots of bits,
    but they chew up lots of CPU, which means you aren't doing other
    things. We worked hard on efficiency, and we now measure, at a
    given network workload on identical x86 hardware, we use 30 percent
    less CPU than Linux.
The dissection of Sun's claim was quick to begin. It was pointed out that
we don't know which version of Linux is being referred to in the quote.
There are many differences between the 2.4 and 2.6 kernels, and it would
not be quite sporting for Sun to be comparing its upcoming, unreleased
technology with an old version of Linux.
Sun's performance improvements appear to be based on the use of "TCP
Offload Engine" (TOE) technology. The idea of a network adaptor which can
take on the network protocol overhead is not particularly new; such
hardware has been available for many years. The Linux networking hackers
have always had a low opinion of the TOE approach, however. TOE hardware
may offload a bit of work from the processor, but it suffers from a number
of disadvantages:
- When you use TOE hardware, you have just moved your networking
stack into a firmware-based, closed-source module. This code cannot
be inspected, fixed, or improved.
- TOE-based networking suffers from latency problems. The setup and
teardown of network connections still require the processor's
intervention, and that means several round trips over the bus for each
connection.
- As Larry McVoy heard from "Sun employee
#1," processors are getting faster much more quickly than TOE hardware
is. Even if a TOE adaptor performs reasonably when it is released, it
will be quickly outstripped by processor-based TCP implementations.
The 2.6 networking stack is happy to offload some functions to smart
interfaces; examples include packet checksumming and TCP segmentation. But
the full TCP offload approach is likely to remain unpopular into the
future.
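As a rough illustration of the per-byte work that checksum offload takes away from the CPU, here is a minimal C sketch of the 16-bit one's-complement Internet checksum described in RFC 1071. It is not the kernel's implementation; the function name inet_checksum is made up for this example, and the buffer is assumed to hold 16-bit words in network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: one's-complement sum of big-endian 16-bit words,
 * folded to 16 bits and inverted (RFC 1071). An odd trailing byte is
 * treated as if it were padded with a zero byte.
 */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    assert(data != NULL || len == 0);

    /* Every byte of the packet is touched once by the CPU. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can verify a packet by running the same function over the data with the checksum field included; a result of zero means the sum checks out. An interface that does this in hardware spares the processor one full pass over the payload.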
In general, the networking hackers do not feel threatened by "Fire Engine."
That didn't stop them from having a discussion of how Linux networking
could be made faster, however. The conversation was based around a shopping list of possible improvements posted
by Andi Kleen. This list includes a number of good ideas, but the bulk of
the debate concerned a relatively obscure topic: timestamp generation.
Certain applications want to get each packet packaged with a timestamp
saying exactly when that packet was received. Tools like tcpdump, for
example, make use of this capability. The socket interfaces were designed
in such a way that the networking subsystem cannot know if any particular
packet needs to be timestamped or not; as a result, it generates timestamps
for all incoming packets, even though they are rarely used.
The problem is that this timestamp generation gets to be expensive when you
have thousands of packets flowing through the system every second.
Depending on the architecture Linux is running on, generating the timestamp
can involve talking to a (slow) off-CPU timer or moving cache lines
frequently between processors. Improving the timestamp generation might be
the most straightforward way of speeding up Linux networking, at least at
the high end.
That fix is not entirely easy, however. Networking maintainer David Miller
is unwilling to make any changes that would reduce the accuracy of the
timestamps returned to user space. Any such changes would be seen as an
API change; somebody, somewhere, would be badly affected by it.
The proper solution, as proposed by David,
is the creation of a new fast_timestamp_t type which is quicker to
generate, but which can be converted to a real time when the need arises.
The optimal implementation of this type would be highly dependent on the
underlying architecture; on many systems the CPU cycle timer could be used,
but that approach would not work universally. A default,
architecture-independent "fast timestamp" implementation is easy to add,
however. Creating that sort of structure for the architecture maintainers
to play with may be one of the first things to happen when the 2.7 series
opens up.