Moving past TCP in the data center, part 1
At the recently concluded Netdev 0x16 conference, which was held both in Lisbon, Portugal and virtually, Stanford professor John Ousterhout gave his personal views on where networking in data centers needs to be headed. To solve the problems that he sees, he suggested some "fairly significant changes" to those environments, including leaving behind the venerable—ubiquitous—TCP transport protocol. While LWN was unable to attend the conference itself, due to scheduling and time-zone conflicts, we were able to view the video of Ousterhout's keynote talk to bring you this report.
The problems
There has been amazing progress in hardware, he began. The link speeds are now over 100Gbps and rising with hardware round-trip times (RTTs) of five or ten microseconds, which may go lower over the next few years. But that raw network speed is not accessible by applications; in particular, the latency and throughput for small messages is not anywhere near what the hardware numbers would support. "We're one to two orders of magnitude off—or more." The problem is the overhead in the software network stacks.
If we are going to make those capabilities actually available to applications, he said, some "radical changes" will be required. There are three things that need to happen, but the biggest is replacing the TCP protocol; he is "not going to argue that's easy, but there really is no other way [...] if you want that hardware potential to get through". He said that he would spend the bulk of the talk on moving away from TCP, but that there is also a need for a lighter-weight remote-procedure-call (RPC) framework. Beyond that, he believes that it no longer makes sense to implement transport protocols in software, so those will need to eventually move into the network interface cards (NICs), which will require major changes to NIC architectures as well.
There are different goals that one might have for data-center networks, but he wanted to focus on high performance. When sending large objects, you want to get the full link speed, which is something he calls "data throughput". That "has always been TCP's sweet spot" and today's data-center networks do pretty well on that measure. However there are two other measures where TCP does not fare as well.
For short messages, there is a need for low latency; in particular, "low tail latency", so that 99% or 99.9% of the messages have low round-trip latency. In principle, we should be able to have that number be below 10µs, but we are around two orders of magnitude away from that, he said; TCP is up in the millisecond range for short-message latencies.
Another measure, which he has not heard being talked about much, is the message throughput for short messages. The hardware should be able to send 10-100 million short messages per second, which is "important for large-scale data-center applications that are working closely together for doing various kinds of group-communication operations". Today, in software, doing one million messages per second is just barely possible. "We're just way off." He reiterated that the goal would be to deliver that performance all the way up to the applications.
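As a rough sanity check of that number (my own back-of-the-envelope figure, not one from the talk), the wire alone would allow on the order of 100 million small messages per second at 100Gbps:

    # Back-of-the-envelope (my numbers, not from the talk): how many ~128-byte
    # messages fit on a 100Gbps link each second, ignoring protocol overheads.
    link_bits_per_second = 100e9
    bytes_per_message = 128           # assumed size of a short message plus headers
    msgs_per_second = link_bits_per_second / 8 / bytes_per_message
    print(f"{msgs_per_second / 1e6:.0f} million messages/second")   # ~98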
Those performance requirements then imply some other requirements, Ousterhout said. For one thing, load balancing across multiple cores is needed because a single core cannot keep up with speeds beyond 10Gbps. But load balancing is difficult to do well, so overloaded cores cause hot spots that hurt the throughput and tail latency. This problem is so severe that it is part of why he argues that the transport protocols need to move into the NICs.
Another implied requirement is for doing congestion control in the network; the buffers and queues in the network devices need to be managed correctly. Congestion in the core fabric is avoidable if you can do load balancing correctly, he argued, which is not happening today; "TCP cannot do load balancing correctly". Congestion at the edge (i.e. the downlink to the end host) is unavoidable because the downlink capacity can always be exceeded by multiple senders; if that is not managed well, though, latency increases because of buffering buildup.
TCP shortcomings
TCP is an amazing protocol that was designed 40 years ago when the internet looked rather different than it does today; it is surprising that it has lasted as long as it has with the changes in the network over that span. Even today, it works well for wide-area networks, but there were no data centers when TCP was designed, "so unsurprisingly it was not designed for data centers". Ousterhout said that he would argue that every major aspect of the TCP design is wrong for data centers. "I am not able to identify anything about TCP that is right for the data center."
He listed five major aspects of the TCP design (stream-oriented, connection-oriented, fair scheduling, sender-driven congestion control, and in-order packet delivery) that are wrong for data-center applications and said he would be discussing each individually in the talk; "we have to change all five of those". To do that, TCP must be removed, at least mostly, from the data center; it needs to be displaced, though not completely replaced, with something new. One candidate is the Homa transport protocol that he and others have been working on. Since switching away from TCP will be difficult, though, adding support for Homa or some other data-center-oriented transport under RPC frameworks would ease the transition by reducing the number of application changes required.
TCP is byte-stream-oriented, where each connection consists of a stream of bytes without any message boundaries, but applications actually care about messages. Receiving TCP data is normally done in fixed-size blocks that can contain multiple messages, part of a single message, or a mixture of those. So each application has to add its own message format on top of TCP and pay the price in time and complexity for reassembling the messages from the received blocks.
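To make that cost concrete, here is a minimal sketch (my own illustration, not code from the talk) of the length-prefix framing layer that nearly every application ends up writing on top of a TCP socket:

    # Length-prefixed framing over a TCP byte stream: a 4-byte length header,
    # plus the loop that reassembles complete messages from arbitrarily sized reads.
    import socket
    import struct

    def send_msg(sock: socket.socket, payload: bytes) -> None:
        # Prepend the message length so the receiver can find the boundary.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exactly(sock: socket.socket, n: int) -> bytes:
        # recv() may return any number of bytes; keep reading until we have n.
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            buf += chunk
        return buf

    def recv_msg(sock: socket.socket) -> bytes:
        (length,) = struct.unpack("!I", recv_exactly(sock, 4))
        return recv_exactly(sock, length)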
That is annoying, he said, but not a show-stopper; that you cannot do load balancing with TCP is the show-stopper. You cannot split the handling of the received byte-stream data across multiple threads because a given thread may not receive a full message that can be dispatched, and parts of a single message may end up spread across several threads. Trying to somehow reassemble the messages cooperatively between the threads would be fraught. If someday NICs start directly dispatching network data to user space, they will have the same problems with load balancing, he said.
There are two main ways to work around this TCP limitation: with a dispatcher thread that collects up full messages to send to workers or by statically allocating subsets of connections to worker threads. The dispatcher becomes a bottleneck and adds latency; that limits performance to around one million short messages per second, he said. But static load balancing is prone to performance problems because some workers are overloaded, while others are nearly idle.
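A sketch of the dispatcher workaround (again my own illustration, reusing the hypothetical recv_msg() helper from the framing sketch above) shows where the bottleneck comes from: every message passes through one thread before any worker can touch it.

    # Dispatcher pattern sketch: one thread turns byte streams into whole
    # messages and feeds a queue; workers only ever see complete messages.
    import queue
    import selectors
    import threading

    work = queue.Queue()

    def handle(msg: bytes) -> None:
        pass                              # placeholder for application processing

    def dispatcher(sel: selectors.DefaultSelector) -> None:
        while True:
            for key, _ in sel.select():
                # All traffic funnels through this one thread, which is what
                # caps throughput at roughly a million short messages per second.
                work.put(recv_msg(key.fileobj))

    def worker() -> None:
        while True:
            handle(work.get())

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()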
Beyond that, due to head-of-line blocking, small messages can get trapped behind larger ones and need to wait for the messages ahead of them to be transmitted. The TCP streams also do not provide the reliability guarantees that applications are looking for. Applications want to have their message delivered, processed by the server, and a response returned; if any of those fail, they want some kind of error indication. Streams only deliver part of that guarantee and many of the failures that can occur in one of those round-trip transactions are not flagged to the application. That means applications need to add some kind of timeout mechanism of their own even though TCP has timeouts of various sorts.
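In practice that ends up looking something like the sketch below (mine, with do_rpc() standing in for whatever send-and-wait primitive the application uses), and it is only safe if the request can be retried:

    # Application-level deadline and retry wrapper; TCP's own timeouts do not
    # tell the caller whether the server ever processed the request.
    import time

    def call_with_deadline(do_rpc, request, timeout=0.2, attempts=3):
        for attempt in range(attempts):
            try:
                return do_rpc(request, timeout=timeout)   # hypothetical transport call
            except TimeoutError:
                time.sleep(min(0.01 * 2 ** attempt, 0.1)) # back off before retrying
        raise TimeoutError("gave up after retries; only safe for idempotent requests")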
The second aspect that is problematic is that TCP is connection-oriented. It is something of an "article of faith in the networking world" that you need to have connections for "interesting properties like flow control and congestion control and recovery from lost packets and so on". But connections require the storage of state, which can be rather expensive; it takes around 2000 bytes per connection on Linux, not including the packet buffers. Data-center applications can have thousands of open connections, however, and server applications can have tens of thousands, so that storing the state adds a lot of memory overhead. Attempts to pool connections to reduce that end up adding complexity—and latency, as with the dispatcher/worker workaround for TCP load balancing.
In addition, a round-trip is needed before any data is sent. Traditionally, that has not been a big problem because the connections were long-lived and the setup cost could be amortized, but in today's microservices and serverless worlds, applications may run for less than a second—or even just for a few tens of milliseconds. It turns out that the features that were thought to require connections, congestion control and so on, can be achieved without them, Ousterhout said.
TCP uses fair scheduling to share the available bandwidth among all of the active connections when there is contention. But that means that all of the connections finish slowly; "it's well-known that fair scheduling is a bad algorithm in terms of minimizing response time". Since there is no benefit to handling most (but not all) of a flow, it makes sense to take a run-to-completion approach; pick some flow and handle all of its data. But that requires knowing the size of the messages, so that the system knows how much to send or receive, which TCP does not have available; thus, fair scheduling is the best that TCP can do. He presented some benchmarking that he had done that showed TCP is not even actually fair, though; when short messages compete with long messages on a loaded network, the short messages show much more slowdown ("the short messages really get screwed").
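A toy calculation (my own, not Ousterhout's benchmark) makes the scheduling point: with four short messages and one long one sharing a unit-bandwidth link, fair sharing makes every short message wait for a fifth of the link, while run-to-completion finishes them almost immediately.

    # Compare mean completion time under processor sharing ("fair" scheduling,
    # as TCP does) versus shortest-remaining-processing-time (SRPT,
    # run-to-completion), for messages that all arrive at time 0.
    def processor_sharing(sizes):
        """All active messages share the link equally; return completion times."""
        remaining = sorted(sizes)
        now, done = 0.0, []
        while remaining:
            n = len(remaining)
            # Time until the smallest remaining message finishes at rate 1/n.
            now += remaining[0] * n
            remaining = [r - remaining[0] for r in remaining][1:]
            done.append(now)
        return done

    def srpt(sizes):
        """Run the shortest message to completion first; return completion times."""
        now, done = 0.0, []
        for s in sorted(sizes):
            now += s
            done.append(now)
        return done

    sizes = [1, 1, 1, 1, 100]     # four short messages competing with one long one
    for name, sched in (("fair", processor_sharing), ("SRPT", srpt)):
        times = sched(sizes)
        print(f"{name}: mean completion {sum(times)/len(times):.1f}, "
              f"short-message completions {times[:4]}")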
The fourth aspect of TCP that he wanted to highlight is its sender-driven congestion control. Senders are responsible for reducing their transmission rates when there is congestion, but they have no direct way to know when they need to do so. Senders are trying to avoid filling up intermediate buffers, so TCP's congestion signals are based on how full those buffers are.
In the extreme case, queues overflow and packets get dropped, which causes the packets to time out; that is "catastrophic enough" that it is avoided as much as possible. Instead, various queue-length indications are used as congestion notifications that the sender uses to scale back its transmission. But that means there is no way to know about congestion without having some amount of buffer buildup—which leads to delays. Since all TCP messages share the same class of service, all messages of all sizes queue up in the same queues; once again, short-message latency suffers.
The fifth aspect of the TCP design that works poorly for data centers is that it expects packets to be delivered in the same order they were sent in, he said; if packets arrive out of order, it is seen as indicating a dropped packet. That makes load balancing difficult both for hardware and software. In the hardware, the same path through the routing fabric must be used for every packet in a flow so that there is no risk of reordering packets, but the paths are chosen independently by the flows and if two flows end up using the same link, neither can use the full bandwidth. This can happen even if the overall load on the network fabric is low; if the hash function used to choose a path just happens to cause a collision, congestion will occur.
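The mechanism behind those collisions is easy to sketch (a toy model of ECMP-style flow-consistent routing with an invented hash, not any particular switch's implementation): every packet of a flow maps to the same uplink, so two heavy flows that land on the same uplink stay stuck together no matter how idle the other paths are.

    # Toy ECMP path selection: hash the flow's 5-tuple, pick one of the
    # equal-cost uplinks, and use it for every packet of that flow.
    import hashlib

    UPLINKS = 4   # hypothetical number of equal-cost paths out of a switch

    def pick_uplink(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        """Stand-in for a switch's ECMP hash: same flow -> same path, always."""
        key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
        return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % UPLINKS

    flows = [("10.0.0.1", 40000 + i, "10.0.1.2", 443) for i in range(8)]
    for f in flows:
        print(f, "-> uplink", pick_uplink(*f))
    # With only a handful of flows, collisions are common (a birthday-paradox
    # effect): two long flows hashed to the same uplink each get half its
    # bandwidth, even while other uplinks sit idle.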
He hypothesizes that the dominant cause of congestion in today's data-center networks is this flow-consistent routing required by TCP. He has not seen any measurements of that, but would be interested; he invited attendees who had access to data-center networks to investigate it.
Processing the packets in software also suffers from this load-balancing problem. In Linux, normally a packet will traverse three CPU cores, one where the driver code is running, another where the network-stack processing is done (in a software interrupt), and a third for the application. In order to prevent out-of-order packets, the same cores need to be used for all of the packets in a flow. Like with the hardware, though, if two flows end up sharing a single core, that core becomes a bottleneck. That leads to uneven loading in the system; he has measured that it is the dominant cause of software-induced tail latency for TCP. That is also true for Homa on Linux, he said.
There is a question of whether TCP can be repaired, but Ousterhout does not think it is possible. There are too many fundamental problems that are interrelated to make that feasible. In fact, he can find no part of TCP that is worth keeping for data centers; if there are useful pieces, he would like to hear about them. So, in order to get around the "software tax" and allow applications to use the full potential of the available networking hardware, a new protocol that is different from TCP in every aspect will be needed.
That ended the first half of Ousterhout's keynote; next up is more on the Homa transport protocol that has been developed at Stanford. It is a clean-slate protocol designed specifically to target the needs of data centers. Tune in for our report on that part of the talk in a concluding article that is coming soon.
Index entries for this article
Conference: Netdev/2022
Posted Nov 2, 2022 0:51 UTC (Wed)
by KaiRo (subscriber, #1987)
[Link] (33 responses)
Posted Nov 2, 2022 3:06 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (30 responses)
The criticisms of TCP for DC applications are on point. Any sane RPC design already has to deal with retries, timeouts and request idempotency. So switching to a connectionless protocol is definitely an interesting direction.
With a way to fall back onto a reliable streaming protocol as needed (e.g. if an RPC call needs to return a large amount of data).
Posted Nov 2, 2022 4:13 UTC (Wed)
by willy (subscriber, #9762)
[Link] (7 responses)
They still saw an advantage to maintaining a connection in order to manage reliable service. I don't know that was the right choice, but I'm looking forward to reading about Homa's design decisions.
Posted Nov 2, 2022 4:50 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
The idea was dead simple: just throw away EVERYTHING.
The network layer simply used fixed-size jumbo Ethernet frames and a request could contain up to 4 of them. The sender simply sent them one by one. No retransmissions or buffering, so the code was dead simple.
The receiver simply reassembled the frames into a buffer. Since the frame sizes were fixed, only 1 data copy was necessary (or none, if a request fits into 1 packet). No NAKs, ACKs or anything was needed, just a simple timer to discard the data and fail the request in case of a timeout due to a lost packet.
Everything else was handled on the upper levels. E.g. a packet loss simply got translated into a timeout error from the RPC layer. Instead of a network-level retry, regular retry policy for the service calls was used. It worked surprisingly well in experiments, and actually had a very nice property of making sure that the network congestion pressure rapidly propagates upstream.
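Something with roughly that shape could look like the sketch below (my reconstruction of the scheme described above, with UDP datagrams standing in for the fixed-size jumbo Ethernet frames and an invented two-byte header; the real thing presumably sat directly on Ethernet):

    # "Throw everything away" receiver: up to 4 frames per request, no ACKs,
    # NAKs or retransmissions, just a timer that fails the request so the RPC
    # layer's normal retry policy can take over.
    import socket
    import time

    FRAME_PAYLOAD = 8192        # assumed fixed frame payload size
    TIMEOUT = 0.05              # seconds to wait before declaring the request lost

    def recv_request(sock: socket.socket) -> bytes:
        frames = {}
        deadline = time.monotonic() + TIMEOUT
        sock.settimeout(TIMEOUT)
        while True:
            try:
                pkt, _ = sock.recvfrom(FRAME_PAYLOAD + 2)
            except socket.timeout:
                raise TimeoutError("frame lost; surfaced as an RPC timeout upstream")
            # Invented header: one byte of request id (ignored in this sketch),
            # one byte packing the frame index (high nibble) and the total
            # frame count (low nibble, <= 4).
            req_id, idx, total = pkt[0], pkt[1] >> 4, pkt[1] & 0x0F
            frames[idx] = pkt[2:]
            if len(frames) == total:
                # All frames present: one copy to reassemble, nothing else to do.
                return b"".join(frames[i] for i in range(total))
            if time.monotonic() > deadline:
                raise TimeoutError("frame lost; surfaced as an RPC timeout upstream")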
Posted Nov 2, 2022 9:21 UTC (Wed)
by amboar (subscriber, #55307)
[Link]
Posted Nov 2, 2022 12:36 UTC (Wed)
by paulj (subscriber, #341)
[Link] (3 responses)
NFS was of course RPC over UDP based too.
Posted Nov 2, 2022 21:12 UTC (Wed)
by amarao (guest, #87073)
[Link] (2 responses)
Posted Nov 2, 2022 21:32 UTC (Wed)
by joib (subscriber, #8541)
[Link]
Should also be noted that current Linux kernels actually no longer support NFS over UDP.
Posted Nov 3, 2022 9:52 UTC (Thu)
by paulj (subscriber, #341)
[Link]
Posted Nov 10, 2022 1:06 UTC (Thu)
by TheJosh (guest, #162094)
[Link]
Posted Nov 2, 2022 11:04 UTC (Wed)
by paulj (subscriber, #341)
[Link] (6 responses)
Posted Nov 2, 2022 23:55 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
The service for this protocol was used to move already encrypted data, so additional layers of encryption were unnecessary for us. But a more general protocol should definitely have support for it.
> Encryption at least prevents those with network super-user roles from snooping on server traffic, and those with super-user access on some server-roles from being able to snoop on traffic for other server-roles
Snooping at datacenter-scale is surprisingly not at all useful. You can observe only a small amount of traffic if you're trying to do it by physically splicing into cables, and if you can do it on the machines that run the network code, it's probably game over already.
Posted Nov 3, 2022 9:59 UTC (Thu)
by paulj (subscriber, #341)
[Link] (4 responses)
If an attacker has access to machines, without role-authentication and encryption an attacker can potentially use that to widen the number of services and access-roles the attacker can manipulate. (And authentication without a MAC is effectively the same - and if you're going to MAC the traffic, you can as easily encrypt it).
Bear in mind, in the large tech companies some of the server machines basically have a (host controlled) switch built-in to them, to allow multiple hosts in a brick/sled to share the same PHY to the network. Also, the switches are basically (specialised) servers too. So an attacker could insert code in the switching agents to programme the switching hardware to selectively pick out certain flows, for later analysis to use for privilege widening/escalation - quite efficient.
Posted Nov 3, 2022 17:42 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Not quite true for very high-traffic services.
> Bear in mind, in the large tech companies some of the server machines basically have a (host controlled) switch built-in to them, to allow multiple hosts in a brick/sled to share the same PHY to the network.
It's typically more complicated. A host can contain one or more PHYs and they typically go to the TOR (top-of-the-rack) device that does further routing/NAT/whatever.
Posted Nov 4, 2022 10:28 UTC (Fri)
by paulj (subscriber, #341)
[Link] (2 responses)
There is one more PHY, to the BMC, which controls the chassis and can give serial access to the hosts (and some very slow, hacky IP/TCP to the host, IIRC).
There are some large enterprise "blade" type systems which also use onboard switching ASICs I think, but the ones I know of use more traditional switching ASICs, (e.g. HPE were using Intel RRC10k) with actual NIC blocks for each host included into the ASIC. So these actually look and work a lot more like a traditional network, with proper buffering and flow-control between the hosts and the switching logic and the upstream (the shared NICs above did not do this properly - at least in earlier iterations - causing significant issues).
Posted Nov 4, 2022 10:32 UTC (Fri)
by paulj (subscriber, #341)
[Link]
Posted Nov 23, 2022 22:25 UTC (Wed)
by Rudd-O (guest, #61155)
[Link]
Cyberax's views would fit right into, let's say, a Google in the year 2011. In the year 2015 — after Snowden — it was already consensus that everything needed to be encrypted, no exceptions.
And so all traffic inside Google — anything at all using Stubby — is, in fact, encrypted. As you said, if you're gonna MAC the traffic, might as well crypt it too. All those primitives are accelerated in the CPU, and some in software now.
Posted Nov 2, 2022 20:36 UTC (Wed)
by MatejLach (guest, #84942)
[Link] (12 responses)
Is this still the general consensus even after Snowden? I am not saying it's a silver bullet against a dedicated attacker like the NSA, but I'd think that every additional obstacle in the way is going to raise the bar for dragnets just a tiny bit, no?
Posted Nov 2, 2022 21:58 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Posted Nov 2, 2022 22:40 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
Given the breakins and stuff businesses have suffered, yes it's an absolute pain, but it means that if anybody does manage to break in to my laptop, it's now rather harder for them to jump to someone else and break into their laptop, moving up the chain ...
Cheers,
Wol
Posted Nov 2, 2022 23:09 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
1. 2FA has nothing to do with encryption. 2FA is primarily about stopping phishing, and only used by humans (I was talking about machine-to-machine communication).
2. Unless your laptop is in the same physical building as all of the servers you will be interacting with, and your company has complete autonomy over that building (i.e. you're not leasing it out from someone who might have physical access), you're not using "trusted" lines in the sense I was referencing. I explicitly said this isn't about who owns or leases the lines. It's about who is able to physically touch and interact with the lines.
Posted Nov 3, 2022 7:56 UTC (Thu)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Nov 2, 2022 23:58 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
So NSA would simply officially ask Amazon to provide them with a covert way to access the data for a specific customer directly using the AWS services.
Posted Nov 3, 2022 9:48 UTC (Thu)
by paulj (subscriber, #341)
[Link] (1 responses)
Minimising the trust and scope of access of what employees can access, to what they need to access for their immediate role, is a good thing, security wise.
Posted Nov 3, 2022 17:31 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Nov 23, 2022 22:27 UTC (Wed)
by Rudd-O (guest, #61155)
[Link] (3 responses)
Snowden already revealed this. There are extensive talks on the subject presented by Jacob Applebaum at CCC.
Posted Nov 23, 2022 23:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
You maybe can mirror one port to another, but in an Amazon DC this means nothing. You'll just confuse some random EC2 hardware instance. You won't even be able to do much if you redirect the traffic to an instance that you control, because it's encrypted.
Posted Nov 24, 2022 9:54 UTC (Thu)
by paulj (subscriber, #341)
[Link]
Except the L3 switch ASIC can be programmed to redirect only certain flows to said CPU. They can also be programmed to encap and redirect certain flows to other hosts. Indeed, they can be programmed to mirror packets (but I can't remember if the L3 ASICs commonly used at the super-large DCs can /both/ mirror and encap the same flow - if not, it's just a matter of time till they do).
So:
1. You don't need to analyse the entire data flow on the puny switch CPU, cause the /powerful/ switching ASIC can be programmed to do hardware tcpdumping (basically). Given the CPUs on these switches aren't /that/ puny (low to mid end Xeons), further analysis on host is quite feasible.
2. Even better, you can just redirect the flow you're interested in to a bigger server by encapping it (the server can resend the flow's packets out again so they're not missed).
Posted Nov 24, 2022 9:56 UTC (Thu)
by paulj (subscriber, #341)
[Link]
I note this is a good counter-argument to your own argument in another sub-thread that intra-DC traffic doesn't need to be encrypted. ;)
Posted Nov 23, 2022 22:26 UTC (Wed)
by Rudd-O (guest, #61155)
[Link]
Google encrypts all its traffic at the Stubby layer.
Posted Nov 3, 2022 15:32 UTC (Thu)
by k3ninho (subscriber, #50375)
[Link]
nth-ing the pile-on about this. 'Zero Trust' is an important choice when you're sharing network links with unknown people or having malicious and incompetent staff mess things up in your datacentre. QUIC sucks because encrypted links are expensive to set up in the context of small message sizes in a datacentre.
K3n.
Posted Nov 10, 2022 18:32 UTC (Thu)
by flussence (guest, #85566)
[Link]
If your org's budget doesn't extend to CPUs with AES-NI instructions that operate at line rate and you're *sure* it's a closed system, there's nothing really stopping you from dropping in a highly optimised rot26 cipher, like SSH-HPN does.
Posted Nov 2, 2022 5:31 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
1. HTTP is overkill. For many services, you don't need e.g. If-Modified-Since, Cache-Control, etc., and if you do want that sort of thing, you probably want to fine-tune exactly how it works rather than having the HTTP library try to handle it.
2. HTTP is underkill. While it is technically possible for the client to present an X.509 certificate to the server to establish its identity, these certificates are (usually) designed to be long-lived and signed by a CA (which could be your own private CA). If you want to get into the business of minting your own bearer tokens, doing so in a way that is performant and not hilariously insecure or SPoF'd on the CA starts looking pretty dicey. Sometimes, you might also want to present multiple credentials for securing different aspects of the request (e.g. "this token proves I am the XYZ service, and this token proves that user 1234 has authorized me to access or manipulate their PII"), and doing that over HTTP is fiddly at best.
3. HTTP is text-based, so you have to finagle all of your data into text (or something reasonably resembling text, like JSON), and then unfinagle it at the other end. For every RPC. This ends up being ludicrously expensive when you consider the sheer scale that we're talking about here.
Posted Nov 2, 2022 5:34 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
OK, I lied, you don't have to do that. But you do have to emit and parse text-based headers, which is still unreasonably expensive.
Posted Nov 2, 2022 2:49 UTC (Wed)
by shemminger (subscriber, #5739)
[Link] (7 responses)
1. The cost of TCP is often that both the kernel and the application need to look at the data (ie cache misses).
2. Legacy Data Centers use legacy MTU of 1500 bytes or maybe 4K. The recent work for >64K segments reduces the per-packet cost.
3. The assumptions around TCP bubble up into the legacy application stack (HTTP etc).
4. The incoming packets don't have to hit 3 CPU's. This sounds like a case where academics don't understand the current Linux TCP stack.
Posted Nov 2, 2022 20:32 UTC (Wed)
by stephen.pollei (subscriber, #125364)
[Link] (5 responses)
Posted Nov 3, 2022 10:50 UTC (Thu)
by farnz (subscriber, #17727)
[Link] (4 responses)
Switches are the big pain here - they've created a world in which, because the switch can only forward or drop frames (where a router can also send errors back to the source, or possibly fragment frames), we can't increase packet size - a 100G switch that cannot interconnect to 10G networks is not useful, so the 100G switch uses 10G compatible frame size limits, and this repeats all the way down to a 100M switch wanting 10M compatible frame size limits.
One of the reasons IPv6 has the concept of link-local addressing is that design work on it started before switches were a thing in most computer networks - and the idea was that you'd have routers interconnecting network segments of different speeds (so your 100M FDDI network would go through a router to your 10M Ethernet segment, and thus devices on the FDDI network could use a higher MTU). If the industry as a whole had kept this idea alive, instead of having Ethernet switches interconnecting different speeds (10/100/1000/10G Ethernet all carrying identical frames), we'd have had routers, and we could have split the gains from faster wire speeds differently (because moving from one speed to another would go via a router that could send a packet too big message).
If we'd done that, we'd have had three choices:
Posted Nov 3, 2022 17:39 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
The MTU should have been reified to the IP level. Not doing this was one of the most unforgivable errors in the IPv6 development.
In the ideal world, all IP packets would have two fields: a forward minimum packet size and a reverse minimum packet size. A switch/router, while forwarding a packet, lowers the forward minimum size to its own maximum frame size if that is less than the current value. The reverse size is simply propagated back.
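Mechanically, that could be as simple as the following sketch (my reading of the proposal above, not an existing protocol): each hop clamps the forward field to its own limit and leaves the reverse field alone, so both ends learn the usable path MTU from live traffic rather than from ICMP.

    # Hypothetical per-packet path-MTU fields and the per-hop clamping rule.
    from dataclasses import dataclass

    @dataclass
    class PathMtuFields:
        fwd_min: int    # smallest max-frame-size seen along the forward path so far
        rev_min: int    # value learned on the reverse path, propagated unchanged

    def forward_hop(pkt: PathMtuFields, link_max: int) -> PathMtuFields:
        """What a router/switch would do when forwarding the packet on a link."""
        return PathMtuFields(fwd_min=min(pkt.fwd_min, link_max), rev_min=pkt.rev_min)

    # A packet crossing 9000- and 1500-byte links ends up advertising 1500:
    pkt = PathMtuFields(fwd_min=9000, rev_min=9000)
    for link in (9000, 1500, 9000):
        pkt = forward_hop(pkt, link)
    print(pkt)   # PathMtuFields(fwd_min=1500, rev_min=9000)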
Posted Nov 3, 2022 17:45 UTC (Thu)
by guus (subscriber, #41608)
[Link]
Actually the same should be done for congestion control; it makes no sense that multiple TCP connections to the same host each implement their own congestion control loop when all packets go via exactly the same path. This is in effect what QUIC is doing: combine multiple reliable, in-order streams with a single congestion control loop. But of course that only is effective if you really use a single QUIC connection.
Posted Nov 4, 2022 23:15 UTC (Fri)
by stephen.pollei (subscriber, #125364)
[Link] (1 responses)
I think that I'm in camp #3 "Split the difference". Some decrease in packet transmit latency seems good, but it isn't the only thing to optimize. I also remember 56k modems, where 1500 octets is 214 milliseconds worth of time. In practice, Path MTU discovery can break in interesting ways. People over-block ICMP, and perhaps a mechanism closer to what Cyberax suggested would be nice. With tunneling and MTU discovery having issues, CloudFlare at one point reduced their MTU to 1024 for ipv4 and 1280 for ipv6. It might have been a nicer world in many ways if routers/switches supported 16k packets, but most people only used 12k normally. The extra 4k could be margin for tunneling and encapsulation. This is of course all IMHO.
Posted Nov 5, 2022 9:47 UTC (Sat)
by farnz (subscriber, #17727)
[Link]
To an unfortunately large extent, broken Path MTU discovery is accepted on the Internet because you can pretty much assume that a 1280 byte MTU will be close enough to optimal that it makes no practical difference. If we had much larger MTUs - 6,000 bytes for 100M Ethernet, 24k for gig, 96k for 10G (yes, this needs jumbograms at the IP layer), WiFi probably at around 12k for 802.11n, and about 50k for 802.11ac, we'd have much more incentive to get PMTUD working well - the difference between 1280 bytes (minimum IPv6 frame size), and 12k for 802.11n is huge.
Posted Nov 2, 2022 23:38 UTC (Wed)
by stefanha (subscriber, #55072)
[Link]
I guess in the ideal case there is flow steering that ensures the incoming packets appear in the receive queue handled by the CPU where the application calls recv(2). So just one CPU?
Do those conditions arise automatically or does it require manual tuning? Is application support required?
Posted Nov 2, 2022 4:01 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (1 responses)
I wonder what it is missing? Low-latency for tiny messages maybe?
It is already implemented in (high end) hardware, and there are already RPC implementations layered over it.
Posted Nov 2, 2022 9:39 UTC (Wed)
by joib (subscriber, #8541)
[Link]
> However, RDMA shares most of TCP's problems. It is based on streams and connections (RDMA also offers unreliable datagrams, but these have problems similar to those for UDP). It requires in-order packet delivery. Its congestion control mechanism, based on priority flow control (PFC), is different from TCP's, but it is also problematic. And, it does not implement an SRPT priority mechanism.
Posted Nov 2, 2022 6:39 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Nov 2, 2022 10:08 UTC (Wed)
by ale2018 (guest, #128727)
[Link] (2 responses)
I'm surprised SCTP is not mentioned, either in the article or in the comments.
Posted Nov 3, 2022 18:16 UTC (Thu)
by jonesmz (subscriber, #130234)
[Link]
It's what the WebRTC data channel is built on.
It's relatively easy to implement over UDP (not simple, not easy, just relatively easy compared to other complex protocols).
Posted Nov 6, 2022 15:26 UTC (Sun)
by dullfire (guest, #111432)
[Link]
Even if it doesn't solve all the issues, it would be sane to at least state why SCTP is not a good solution.
And their aversion to encryption is... odd? Yes, it's a data center. Yes, you may not see an attack vector. But these kinds of attacks are only growing. Unless you are literally doing something like "cat /dev/zero | ssh ${other-dc-system} dd of=/dev/null" (in which case there is no useful information being transmitted, and nothing is being done with that info either), it's not really sane to assume clear traffic is good. I've heard of several cases where "surprise, your totally benign clear connections can be used as a weapon".
Posted Nov 2, 2022 11:15 UTC (Wed)
by smoogen (subscriber, #97)
[Link] (1 responses)
Posted Nov 2, 2022 19:30 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
1. If you're only going over fiber links that you own (or lease), then you can send whatever you want over those links, but if you start tampering with the IP layer or below, some routers and/or switches might be unhappy about that. Playing with the transport layer should be fine in most reasonable deployments.
2. If you're going over one fixed ISP's lines, then you can experiment and/or talk to them to figure out exactly how transparent they are to "weird" protocols. Most ISPs at least have the good sense to be transparent-ish to TCP, but some ISPs will be fully IP transparent and won't care what you do at the transport layer. This is probably easier if you are paying them a lot of money and/or have some sort of peering arrangement with them.
3. If you're going over the public internet, I imagine there's a lot of old gear that will do bizarre things if you try sending packets it doesn't recognize.
4. You can always encapsulate your non-TCP transport in UDP, although that adds enough overhead that it may not be worth it. Some ISPs may not treat UDP nicely, but IMHO you have good cause to complain if that happens.
Posted Nov 2, 2022 16:21 UTC (Wed)
by jccleaver (guest, #127418)
[Link] (2 responses)
Assuming we're keeping IP, I'd be happier to see an old transport protocol with latent potential rediscovered than we move from 14->15 standards.
Posted Nov 29, 2022 20:03 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
I remember discussions about replacing TCP 20 years ago. Should be easy to find if you follow the references in publications. It's like nuclear fusion: it's always 10 years away :-)
This all looks like a very good "state of the union" but commodity and "good enough" seems to win every time. As rich as data center companies are, they're still minuscule compared to the rest of the world using TCP. Moreover their lawyers and businessmen are very likely getting in the way of working together with competitors. It's not enough to tell them "look at how successful Linux is"; the obvious answer is "So what? Why would we help our competitors catch up with us?"
QUIC has been successful only because Google controls _both_ the browser and the servers - no cooperation required.
Posted Nov 29, 2022 23:43 UTC (Tue)
by marcH (subscriber, #57642)
[Link]
Off-topic but other-machine.local multicast DNS works most of the time. Often enough that I rarely ever typed an IP address in the last 10 years (and of course you really don't want to type an IPv6 address....)
Posted Nov 2, 2022 20:13 UTC (Wed)
by jenro (subscriber, #7024)
[Link] (7 responses)
If I summarize the article it seems to read like: TCP is not a good base for any kind of RPC-system, so ditch all of TCP and replace it with a shiny new thing.
Is it really true, that all or at least most traffic in a DC is some kind of RPC or request/reply system? Is it really true, that small packages are the ones that the whole DC needs to be optimized for? I highly doubt this.
I am an application developer. When I request data from any kind of information service - for example a database - I want the data to be complete and reliable. And to me that is far more important than to have some data faster. The nice thing about a TCP session is: as long as the session stays connected and error free it doesn't matter, if I retrieve 100 bytes or 100,000 bytes or 100,000,000 bytes, TCP guarantees, that I get all the data in the correct order.
On the other hand I have to work with market data feeds of different stock exchanges like NYSE or Xetra. They use UDP multicast to distribute the information. Dealing with this unreliable transmissions and recovering from lost packets is something you really don't want to do.
So let's wait for the second part of the article and then decide if I want to give up my reliable TCP connections for Homa.
Posted Nov 2, 2022 21:13 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link]
In my experience (as a Google SRE), yes, at least for the way that Google tends to architect its services, anyway.
> I am an application developer. When I request data from any kind of information service - for example a database - I want the data to be complete and reliable. And to me that is far more important than to have some data faster. The nice thing about a TCP session is: as long as the session stays connected and error free it doesn't matter, if I retrieve 100 bytes or 100,000 bytes or 100,000,000 bytes, TCP guarantees, that I get all the data in the correct order.
For a typical CRUD app, you're generally dealing with one tuple at a time, or at most a few dozen (e.g. for search results).
Yes, we do have batch jobs that look at many (or all) tuples and manipulate them in some way. No, those jobs are not top priority. Top priority is responding to incoming HTTP requests in a timely fashion. If that means throwing away hours of batch work because live serving needs the machine, so be it.
> On the other hand I have to work with market data feeds of different stock exchanges like NYSE or Xetra. They use UDP multicast to distribute the information. Dealing with this unreliable transmissions and recovering from lost packets is something you really don't want to do.
Sounds like you're in finance. This is a very different sort of industry to tech - your top priority is probably some variation of figuring out which way the market is moving, not responding to incoming HTTP. So things are going to look different on your side of the fence. That does not mean the article is wrong. It just doesn't apply to your particular situation.
Posted Nov 2, 2022 21:19 UTC (Wed)
by joib (subscriber, #8541)
[Link] (2 responses)
It's just you.
> Is it really true, that all or at least most traffic in a DC is some kind of RPC or request/reply system? Is it really true, that small packages are the ones that the whole DC needs to be optimized for? I highly doubt this.
No, the argument is that 'lots of small messages' is an important usecase that TCP struggles with, due to inherent features in the TCP design. Large transfers are certainly not unimportant, but OTOH, if you can fill your pipe with lots of small messages, then the special case of a few very large messages is kind of trivial.
> if I want to give up my reliable TCP connections for Homa.
So with TCP you get 'reliable stream', and with UDP you get 'unreliable datagram'. The argument here is that for a lot of applications, the correct primitive is neither of these, but rather a 'reliable datagram' kind of service. Which is what this Homa proposal is apparently providing. Incidentally, Amazon's SRD is another design that has converged to a roughly similar conclusion.
Posted Nov 2, 2022 23:39 UTC (Wed)
by willy (subscriber, #9762)
[Link]
Posted Nov 2, 2022 23:43 UTC (Wed)
by rcampos (subscriber, #59737)
[Link]
It is a very interesting protocol for local networks, but I don't know of any benchmarks for it.
Posted Nov 3, 2022 8:20 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
This isn't such a great advantage for many datacenter applications. You have to design services very differently if they need to handle 1K or 100M of data.
In fact, the ability to hide this leads to bad designs.
Posted Dec 2, 2022 20:57 UTC (Fri)
by marcH (subscriber, #57642)
[Link] (1 responses)
Posted Dec 4, 2022 5:25 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Nov 10, 2022 3:05 UTC (Thu)
by endric (guest, #137197)
[Link]
Many distributed systems (not just in the datacenter) communicate better via short messages rather than one big stream of data. Homa as a new reliable, rapid request-response protocol may demonstrate its benefits: lower (tail) latencies, less burden on the CPUs, and higher utilization of network infrastructure are interesting promises.
Anyone else interested in looking closer at this?
Posted Nov 14, 2022 5:23 UTC (Mon)
by gurugio (guest, #113827)
[Link]