KS2010: Linux at NASDAQ
One of the main reasons why NASDAQ uses Linux is its speed. After Monday's discussion on performance regressions, it was nice to hear Bob say that they have not really experienced any problems in that area. Linux has also never brought them down, which is much appreciated. The exchange is using relatively current stable mainline kernels (2.6.35.3, currently), and has been happy with the result.
The talk covered a number of areas where the exchange is feeling pain, many of which were covered in the other article. One thing which was discussed in particular this time around is the overhead of simply waking a sleeping process. The latencies involved are high enough that they have tried to program around them by just having threads busy-wait for events. The consensus in the room was that the biggest piece of wakeup overhead is saving and restoring the floating-point unit state. The exchange doesn't do floating-point, of course, but the FPU covers far more than basic number crunching these days. In particular, SSE instructions are used to implement memcpy() in glibc, and SSE use will force a save/restore. So one bit of homework for Bob is to try running on a system without SSE enabled to see if that helps with his wakeup latency issues.
Asynchronous network I/O remains high on his list; 10G Ethernet cards are out there, and 40G is not that far away. That kind of interface can generate data rates that are seriously difficult for the system to keep up with. So they are looking at a number of techniques adopted by the InfiniBand industry: separating control and data paths, bypassing the kernel for data streams, etc. There is a lot of pressure to be able to keep up with these data rates; the kernel will have to do something to reduce network stack overheads and make it possible.
One thing that Bob thinks could help is a special network asynchronous I/O API which, among other things, would provide direct user-space access to the hardware-managed packet queue. That would involve doing a fair amount of the protocol processing in user space, preferably "just in time" as the application reads the data. There are interesting issues having to do with state shared between processes and protocol compliance, but one assumes they can be worked out. As always, cache effects need to be dealt with; he has gained significant performance benefits by prefetching packets while processing the preceding packets. Evidently there are cards out there which can push incoming data directly into the processor cache, cutting out stalls caused by cache misses.
Another topic that was discussed briefly is realtime scheduling. One hears that realtime is used in the financial trading industry, but NASDAQ has not found it to be useful. The exchange tends to deal with long queues of orders, and what matters is the time required to get to the end of the queue. In other words, despite the strong focus on latency reduction, the exchange is still very much throughput-driven.
The final question had to do with putting high- and low-level protocol processing tasks together. Rather than move TCP processing to user space, have they considered putting their higher-level protocols into the kernel? The answer is "yes," but it doesn't seem worthwhile. If nothing else, the ongoing maintenance would be a real pain. Bob said that it seems better to him to work toward solutions that work for everybody instead of putting a specific protocol hack into a locally-maintained kernel.
