Moving past TCP in the data center, part 1
At the recently concluded Netdev 0x16 conference, which was held both in Lisbon, Portugal and virtually, Stanford professor John Ousterhout gave his personal views on where networking in data centers needs to be headed. To solve the problems that he sees, he suggested some "fairly significant changes" to those environments, including leaving behind the venerable—ubiquitous—TCP transport protocol. While LWN was unable to attend the conference itself, due to scheduling and time-zone conflicts, we were able to view the video of Ousterhout's keynote talk to bring you this report.
The problems
There has been amazing progress in hardware, he began. The link speeds are now over 100Gbps and rising with hardware round-trip times (RTTs) of five or ten microseconds, which may go lower over the next few years. But that raw network speed is not accessible by applications; in particular, the latency and throughput for small messages is not anywhere near what the hardware numbers would support. "We're one to two orders of magnitude off—or more." The problem is the overhead in the software network stacks.
If we are going to make those capabilities actually available to applications, he said, some "radical changes" will be required. There are three things that need to happen, but the biggest is replacing the TCP protocol; he is "not going to argue that's easy, but there really is no other way [...] if you want that hardware potential to get through". He said that he would spend the bulk of the talk on moving away from TCP, but that there is also a need for a lighter-weight remote-procedure-call (RPC) framework. Beyond that, he believes that it no longer makes sense to implement transport protocols in software, so those will need to eventually move into the network interface cards (NICs), which will require major changes to NIC architectures as well.
There are different goals that one might have for data-center networks, but he wanted to focus on high performance. When sending large objects, you want to get the full link speed, which is something he calls "data throughput". That "has always been TCP's sweet spot" and today's data-center networks do pretty well on that measure. However there are two other measures where TCP does not fare as well.
For short messages, there is a need for low latency; in particular, "low tail latency", so that 99% or 99.9% of the messages have low round-trip latency. In principle, we should be able to have that number be below 10µs, but we are around two orders of magnitude away from that, he said; TCP is up in the millisecond range for short-message latencies.
Another measure, which he has not heard being talked about much, is the message throughput for short messages. The hardware should be able to send 10-100 million short messages per second, which is "important for large-scale data-center applications that are working closely together for doing various kinds of group-communication operations". Today, in software, doing one million messages per second is just barely possible. "We're just way off." He reiterated that the goal would be to deliver that performance all the way up to the applications
Those performance requirements then imply some other requirements, Ousterhout said. For one thing, load balancing across multiple cores is needed because a single core cannot keep up with speeds beyond 10Gbps. But load balancing is difficult to do well, so overloaded cores cause hot spots that hurt the throughput and tail latency. This problem is so severe that it is part of why he argues that the transport protocols need to move into the NICs.
Another implied requirement is for doing congestion control in the network; the buffers and queues in the network devices need to be managed correctly. Congestion in the core fabric is avoidable if you can do load balancing correctly, he argued, which is not happening today; "TCP cannot do load balancing correctly". Congestion at the edge (i.e. the downlink to the end host) is unavoidable because the downlink capacity can always be exceeded by multiple senders; if that is not managed well, though, latency increases because of buffering buildup.
TCP shortcomings
TCP is an amazing protocol that was designed 40 years ago when the internet looked rather different than it does today; it is surprising that it has lasted as long as it has with the changes in the network over that span. Even today, it works well for wide-area networks, but there were no data centers when TCP was designed, "so unsurprisingly it was not designed for data centers". Ousterhout said that he would argue that every major aspect of the TCP design is wrong for data centers. "I am not able to identify anything about TCP that is right for the data center."
He listed five major aspects of the TCP design (stream-oriented, connection-oriented, fair scheduling, sender-driven congestion control, and in-order packet delivery) that are wrong for data-center applications and said he would be discussing each individually in the talk; "we have to change all five of those". To do that, TCP must be removed, at least mostly, from the data center; it needs to be displaced, though not completely replaced, with something new. One candidate is the Homa transport protocol that he and others have been working on. Since switching away from TCP will be difficult, though, adding support for Homa or some other data-center-oriented transport under RPC frameworks would ease the transition by reducing the number of application changes required.
TCP is byte-stream-oriented, where each connection consists of a stream of bytes without any message boundaries, but applications actually care about messages. Receiving TCP data is normally done in fixed-size blocks that can contain multiple messages, part of a single message, or a mixture of those. So each application has to add its own message format on top of TCP and pay the price in time and complexity for reassembling the messages from the received blocks.
That is annoying, he said, but not a show-stopper; that you cannot do load balancing with TCP is the show-stopper. You cannot split the handling of the received byte-stream data to multiple threads because the threads may not receive a full message that can be dispatched and parts of a message may be shared with several threads. Trying to somehow reassemble the messages cooperatively between the threads would be fraught. If someday NICs start directly dispatching network data to user space, they will have the same problems with load balancing, he said.
There are two main ways to work around this TCP limitation: with a dispatcher thread that collects up full messages to send to workers or by statically allocating subsets of connections to worker threads. The dispatcher becomes a bottleneck and adds latency; that limits performance to around one million short messages per second, he said. But static load balancing is prone to performance problems because some workers are overloaded, while others are nearly idle.
Beyond that, due to head-of-line blocking, small messages can get trapped behind larger ones and need to wait for the messages ahead of them to be transmitted. The TCP streams also do not provide the reliability guarantees that applications are looking for. Applications want to have their message delivered, processed by the server, and a response returned; if any of those fail, they want some kind of error indication. Streams only deliver part of that guarantee and many of the failures that can occur in one of those round-trip transactions are not flagged to the application. That means applications need to add some kind of timeout mechanism of their own even though TCP has timeouts of various sorts.
The second aspect that is problematic is that TCP is connection-oriented. It is something of an "article of faith in the networking world" that you need to have connections for "interesting properties like flow control and congestion control and recovery from lost packets and so on". But connections require the storage of state, which can be rather expensive; it takes around 2000 bytes per connection on Linux, not including the packet buffers. Data-center applications can have thousands of open connections, however, and server applications can have tens of thousands, so that storing the state adds a lot of memory overhead. Attempts to pool connections to reduce that end up adding complexity—and latency, as with the dispatcher/worker workaround for TCP load balancing.
In addition, a round-trip is needed before any data is sent. Traditionally, that has not been a big problem because the connections were long-lived and the setup cost could be amortized, but in today's microservices and serverless worlds, applications may run for less than a second—or even just for a few tens of milliseconds. It turns out that the features that were thought to require connections, congestion control and so on, can be achieved without them, Ousterhout said.
TCP uses fair scheduling to share the available bandwidth among all of the active connections when there is contention. But that means that all of the connections finish slowly; "it's well-known that fair scheduling is a bad algorithm in terms of minimizing response time". Since there is no benefit to handling most (but not all) of a flow, it makes sense to take a run-to-completion approach; pick some flow and handle all of its data. But that requires knowing the size of the messages, so that the system knows how much to send or receive, which TCP does not have available; thus, fair scheduling is the best that TCP can do. He presented some benchmarking that he had done that showed TCP is not even actually fair, though; when short messages compete with long messages on a loaded network, the short messages show much more slowdown ("the short messages really get screwed").
The fourth aspect of TCP that he wanted to highlight is its sender-driven congestion control. Senders are responsible for reducing their transmission rates when there is congestion, but they have no direct way to know when they need to do so. Senders are trying to avoid filling up intermediate buffers, so the congestion signals are based on how full the buffers are in TCP.
In the extreme case, queues overflow and packets get dropped, which causes the packets to time out; that is "catastrophic enough" that it is avoided as much as possible. Instead, various queue-length indications are used as congestion notifications that the sender uses to scale back its transmission. But that means there is no way to know about congestion without having some amount of buffer buildup—which leads to delays. Since all TCP messages share the same class of service, all messages of all sizes queue up in the same queues; once again, short-message latency suffers.
The fifth aspect of the TCP design that works poorly for data centers is that it expects packets to be delivered in the same order they were sent in, he said; if packets arrive out of order, it is seen as indicating a dropped packet. That makes load balancing difficult both for hardware and software. In the hardware, the same path through the routing fabric must be used for every packet in a flow so that there is no risk of reordering packets, but the paths are chosen independently by the flows and if two flows end up using the same link, neither can use the full bandwidth. This can happen even if the overall load on the network fabric is low; if the hash function used to choose a path just happens to cause a collision, congestion will occur.
He hypothesizes that the dominant cause of congestion in today's data-center networks is this flow-consistent routing required by TCP. He has not seen any measurements of that, but would be interested; he invited attendees who had access to data-center networks to investigate it.
Processing the packets in software also suffers from this load-balancing problem. In Linux, normally a packet will traverse three CPU cores, one where the driver code is running, another where the network-stack processing is done (in a software interrupt), and a third for the application. In order to prevent out-of-order packets, the same cores need to be used for all of the packets in a flow. Like with the hardware, though, if two flows end up sharing a single core, that core becomes a bottleneck. That leads to uneven loading in the system; he has measured that it is the dominant cause of software-induced tail latency for TCP. That is also true for Homa on Linux, he said.
There is a question of whether TCP can be repaired, but Ousterhout does not think it is possible. There are too many fundamental problems that are interrelated to make that feasible. In fact, he can find no part of TCP that is worth keeping for data centers; if there are useful pieces, he would like to hear about them. So, in order to get around the "software tax" and allow applications to use the full potential of the available networking hardware, a new protocol that is different from TCP in every aspect will be needed.
That ended the first half of Ousterhout's keynote; next up is more on the Homa transport protocol that has been developed at Stanford. It has a clean-slate protocol design specifically targeting the needs of data centers. Tune in for our report on that part of the talk in a concluding article that is coming soon.
| Index entries for this article | |
|---|---|
| Conference | Netdev/2022 |
