Linux at NASDAQ OMX
NASDAQ OMX's exchanges run on thousands of Linux-based servers, which handle real-time transaction processing, monitoring, and development. The big challenge in this environment, of course, is performance; real money depends on whether the exchange can keep up with the order stream. But latency matters as much as throughput: orders must be responded to (and executed) within a bounded period of time. Needless to say, reliability is also crucially important; downtime is not well received, to say the least.
To meet these requirements, NASDAQ OMX runs large clusters of thousands of machines. These clusters can process hundreds of millions of orders per day - up to one million orders per second - with 250µs latency.
According to Bob, Linux has incorporated some useful technologies in recent years. The NAPI interrupt mitigation technique for network drivers has, on its own, freed up about one-third of the available CPU time for other work. The epoll system call cuts out much of the per-call overhead, taking 33µs off the latency in one benchmark. Handling clock_gettime() in user space via the VDSO page saves almost another 60ns per call. Bob was also quite pleased with how the Linux page cache works; it is effective enough, he says, to eliminate the need for asynchronous I/O, simplifying the code considerably.
On the other hand, some things have not worked out as well for them. These include I/O signals, which are complex to program with and, when things get busy, can overflow the signal queue. The user-space libaio asynchronous I/O (AIO) implementation is thread-based; it scales poorly, he says, and does not integrate well with epoll. Kernel-based asynchronous I/O, meanwhile, lacks proper socket support. He also mentioned the recvmsg() system call, which requires a call into the kernel for every incoming packet.
There is some new stuff coming along which shows promise. The new recvmmsg() system call can receive multiple packets with a single call; for now, though, it is just a wrapper around the internal recvmsg() implementation and does not hold the socket lock across the entire operation. But, he said, recvmmsg() is a good example of how the ability to add new APIs to Linux is a good thing. He also likes the combination of kernel-based AIO and the eventfd() system call, which makes it possible to integrate file-based AIO into an application's normal event-processing loop. There is also some potential in syslets, which he sees as a way of delivering cheap notifications to user space; it is not clear whether syslets will scale usefully, though.
What NASDAQ OMX would really like to see in Linux now is good socket-based AIO. That would make it possible to replace epoll/recvmsg/sendmsg sequences with fewer system calls. Even better would be if the kernel could provide notifications for multiple events at a time. Best would be if the interface to this functionality were completely based on sockets. He described a vision of an "epoll-like kernel object" which would handle in-kernel network traffic processing. The application could post asynchronous send and receive requests to the queue, and receive notifications when they have been executed. He would like to see multiple sockets attached to a single object, and a file descriptor suitable for passing to poll() for notifications. With a setup like that, it should be possible to push more network traffic through the kernel with lower latencies.
In summary, NASDAQ OMX seems to be happy with its use of Linux. They also seem to like to go with current software - the exchange is currently rolling out 2.6.35.3 kernels. "Emerging APIs" are helping operations like NASDAQ OMX realize real-world performance gains in areas that matter. Linux, Bob says, is one of the few systems that are willing to introduce new APIs just for performance reasons. That is an interesting point of view to contrast with Linus Torvalds's often-stated claim that nobody uses Linux-specific APIs; it seems that there are users, they just tend to be relatively well hidden.
