
KS2010: Linux at NASDAQ

By Jonathan Corbet
November 3, 2010

2010 Kernel Summit
A recurring Kernel Summit feature is a session run by a high-profile Linux end user. For the 2010 event, that speaker was Bob Evans, representing NASDAQ OMX. His talk covered many aspects of NASDAQ's use of Linux, much of which was covered in this article back in October; that material will not be repeated here.

One of the main reasons why NASDAQ uses Linux is its speed. After Monday's discussion on performance regressions, it was nice to hear Bob say that they have not really experienced any problems in that area. Linux has also never brought them down, which is much appreciated. The exchange is using relatively current stable mainline kernels (2.6.35.3, currently), and has been happy with the result.

The talk covered a number of areas where the exchange is feeling pain, many of which appeared in the other article as well. One topic discussed in particular this time around is the overhead of simply waking a sleeping process. The latencies involved are high enough that they have tried to program around them by having threads busy-wait for events instead. The consensus in the room was that the biggest piece of wakeup overhead is saving and restoring the floating-point unit state. The exchange doesn't do floating-point, of course, but the FPU handles far more than basic number crunching these days. In particular, SSE instructions are used to implement memcpy() in glibc, and SSE use will force a save/restore. So one bit of homework for Bob is to try running on a system without SSE enabled to see if that helps with his wakeup latency issues.

Asynchronous network I/O remains high on his list; 10G Ethernet cards are out there, and 40G is not that far away. That kind of interface can generate data rates that are seriously difficult for the system to keep up with. So they are looking at a number of techniques adopted by the InfiniBand industry: separating control and data paths, bypassing the kernel for data streams, etc. There is a lot of pressure to be able to keep up with these data rates; the kernel will have to do something to reduce network stack overheads and make it possible.

One thing that Bob thinks could help is a special network asynchronous I/O API which, among other things, would provide direct user-space access to the hardware-managed packet queue. That would involve doing a fair amount of the protocol processing in user space, preferably "just in time" as the application reads that data. There are interesting issues having to do with state shared between processes and protocol compliance, but one assumes they can be worked out. As always, cache effects need to be dealt with; he has gained significant performance benefits by prefetching packets while processing the preceding packets. Evidently there are cards out there which can push incoming data directly into the processor cache, cutting out stalls caused by cache misses.

Another topic that was discussed briefly is realtime scheduling. One hears that realtime is used in the financial trading industry, but NASDAQ has not found it to be useful. The exchange tends to deal with long queues of orders, and what matters is the time required to get to the end of the queue. In other words, despite the strong focus on latency reduction, the exchange is still very much throughput-driven.

The final question had to do with putting high- and low-level protocol processing tasks together. Rather than move TCP processing to user space, have they considered putting their higher-level protocols into the kernel? The answer is "yes," but it doesn't seem worthwhile. If nothing else, the ongoing maintenance would be a real pain. Bob said that it seems better to him to work toward solutions that work for everybody instead of putting a specific protocol hack into a locally-maintained kernel.




Network card

Posted Nov 3, 2010 22:35 UTC (Wed) by cma (guest, #49905) [Link] (2 responses)

"Evidently there are cards out there which can push incoming data directly into the processor cache, cutting out stalls caused by cache misses".

Any idea which cards those are?

Thanks!

Network card

Posted Nov 4, 2010 0:17 UTC (Thu) by cma (guest, #49905) [Link] (1 responses)

OK... found the technology term: DCA (Direct Cache Access)

One example: http://www.dell.com/downloads/global/products/pwcnt/en/nic-intel-gb-et-brief.pdf

Network card

Posted Nov 4, 2010 15:39 UTC (Thu) by i3839 (guest, #31386) [Link]

See http://daniel.haxx.se/blog/2008/12/18/10g-and-direct-cach...
It's also known as I/OAT.

To quote bgoglin:

> DCA doesn’t really reduce the memory bandwidth requirements since the
> data still has to be fetched by the cache from the main memory (the
> device doesn’t write into the cache, it just tells the cache that data
> should be fetched). The whole point of the approach is that this fetch
> is done in advance, so you don’t have to wait for it when the host starts
> processing the packet.

So it doesn't seem that good yet.

With integrated memory controllers I would think this could be done automatically, or at least very easily, by writing incoming data to the cache (at least the L3) first instead of going through RAM.

KS2010: Linux at NASDAQ

Posted Nov 4, 2010 15:21 UTC (Thu) by shtylman (guest, #70765) [Link]

By the way, it is 'NASDAQ OMX' not 'NASDAQ OMG'


Copyright © 2010, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds