A recurring Kernel Summit feature is a session run by a high-profile Linux
end user. For the 2010 event, that speaker was Bob Evans, representing
NASDAQ OMX. His talk covered many aspects of NASDAQ's use of Linux,
much of which was covered in an earlier article
back in October; that material will not be repeated here.
One of the main reasons why NASDAQ uses Linux is its speed. After
Monday's discussion on performance regressions, it was nice to hear Bob say
that they have not really experienced any problems in that area. Linux has
also never brought them down, which is much appreciated. The exchange is
using relatively current stable mainline kernels, and
has been happy with the result.
The talk covered a number of areas where the exchange is feeling pain, many
of which were covered in the other article. One thing which was discussed
in particular this time around is the overhead of simply waking a sleeping
process. The latencies involved are high enough that they have tried to
program around them by just having threads busy-wait for events. The
consensus in the room was that the biggest piece of wakeup overhead is
saving and restoring the floating-point unit status. The exchange doesn't
do floating-point, of course, but the FPU covers far more than basic
number crunching these days. In particular, SSE
instructions are used to
implement memcpy() in glibc, and SSE use will force a
save/restore. So one bit of homework for Bob is to try running on a system
without SSE enabled to see if that helps with his wakeup latency issues.
Asynchronous network I/O remains high on his list; 10G Ethernet cards are
out there, and 40G is not that far away. That kind of interface can
generate data rates that are seriously difficult for the system to keep up
with. So they are looking at a number of techniques adopted by the
InfiniBand industry: separating control and data paths, bypassing the
kernel for data streams, etc. There is a lot of pressure to be able to
keep up with these data rates; the kernel will have to do something to
reduce network stack overheads and make it possible.
One thing that Bob thinks could help is a special network asynchronous I/O
API which, among other things, would provide direct user-space access to
the hardware-managed packet queue. That would involve doing a fair amount
of the protocol processing in user space, preferably "just in time" as the
application reads that data. There are interesting issues having to do
with state shared between processes and protocol compliance, but one
assumes they can be worked out. As always, cache effects need to be dealt
with; he has gained significant performance benefits by prefetching packets
while processing the preceding packets. Evidently there are cards out
there which can push incoming data directly into the processor cache,
cutting out stalls caused by cache misses.
Another topic that was discussed briefly is realtime scheduling. One hears
that realtime is used in the financial trading industry, but NASDAQ has not
found it to be useful. The exchange tends to deal with long queues of
orders, and what matters is the time required to get to the end of the
queue. In other words, despite the strong focus on latency reduction, the
exchange is still very much throughput-driven.
The final question had to do with putting high- and low-level protocol
processing tasks together.
Rather than move TCP processing to user space, have they considered putting
their higher-level protocols into the kernel? The answer is "yes," but it
was not judged to be worthwhile. If nothing else, the ongoing maintenance
would be a real pain. Bob said that it seems better to him to work toward
general solutions instead of putting a specific protocol hack into a
locally-maintained kernel.