
Moving past TCP in the data center, part 1

By Jake Edge
November 1, 2022

Netdev

At the recently concluded Netdev 0x16 conference, which was held both in Lisbon, Portugal and virtually, Stanford professor John Ousterhout gave his personal views on where networking in data centers needs to be headed. To solve the problems that he sees, he suggested some "fairly significant changes" to those environments, including leaving behind the venerable—ubiquitous—TCP transport protocol. While LWN was unable to attend the conference itself, due to scheduling and time-zone conflicts, we were able to view the video of Ousterhout's keynote talk to bring you this report.

The problems

There has been amazing progress in hardware, he began. The link speeds are now over 100Gbps and rising with hardware round-trip times (RTTs) of five or ten microseconds, which may go lower over the next few years. But that raw network speed is not accessible by applications; in particular, the latency and throughput for small messages is not anywhere near what the hardware numbers would support. "We're one to two orders of magnitude off—or more." The problem is the overhead in the software network stacks.
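As a rough back-of-the-envelope illustration of that gap (the numbers below are assumptions chosen for the arithmetic, not figures from the talk):

    # Illustrative comparison of hardware capability vs. software-stack latency.
    hw_rtt_us = 5        # assumed hardware round-trip time, in microseconds
    sw_rtt_us = 1000     # assumed short-message RTT through a TCP software stack (~1ms)
    print(f"software RTT is ~{sw_rtt_us / hw_rtt_us:.0f}x the hardware RTT")

    # Time a small 256-byte message actually spends on a 100Gbps wire:
    msg_bytes, link_gbps = 256, 100
    wire_ns = msg_bytes * 8 / (link_gbps * 1e9) * 1e9
    print(f"{wire_ns:.1f} ns of wire time")   # ~20ns; nearly everything else is software overhead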

If we are going to make those capabilities actually available to applications, he said, some "radical changes" will be required. There are three things that need to happen, but the biggest is replacing the TCP protocol; he is "not going to argue that's easy, but there really is no other way [...] if you want that hardware potential to get through". He said that he would spend the bulk of the talk on moving away from TCP, but that there is also a need for a lighter-weight remote-procedure-call (RPC) framework. Beyond that, he believes that it no longer makes sense to implement transport protocols in software, so those will need to eventually move into the network interface cards (NICs), which will require major changes to NIC architectures as well.

There are different goals that one might have for data-center networks, but he wanted to focus on high performance. When sending large objects, you want to get the full link speed, which is something he calls "data throughput". That "has always been TCP's sweet spot" and today's data-center networks do pretty well on that measure. However there are two other measures where TCP does not fare as well.

For short messages, there is a need for low latency; in particular, "low tail latency", so that 99% or 99.9% of the messages have low round-trip latency. In principle, we should be able to have that number be below 10µs, but we are around two orders of magnitude away from that, he said; TCP is up in the millisecond range for short-message latencies.
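"Tail latency" here is simply a high percentile of the per-message latency distribution; a quick sketch (with made-up, long-tailed sample data) shows how the 99th and 99.9th percentiles are read off:

    import random

    # Made-up latency samples in microseconds; real distributions are long-tailed.
    samples = sorted(random.lognormvariate(3, 1) for _ in range(100_000))
    p50  = samples[len(samples) // 2]
    p99  = samples[int(len(samples) * 0.99)]
    p999 = samples[int(len(samples) * 0.999)]
    print(f"median {p50:.0f}us  p99 {p99:.0f}us  p99.9 {p999:.0f}us")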

Another measure, which he has not heard being talked about much, is the message throughput for short messages. The hardware should be able to send 10-100 million short messages per second, which is "important for large-scale data-center applications that are working closely together for doing various kinds of group-communication operations". Today, in software, doing one million messages per second is just barely possible. "We're just way off." He reiterated that the goal would be to deliver that performance all the way up to the applications.

Those performance requirements then imply some other requirements, Ousterhout said. For one thing, load balancing across multiple cores is needed because a single core cannot keep up with speeds beyond 10Gbps. But load balancing is difficult to do well, so overloaded cores cause hot spots that hurt the throughput and tail latency. This problem is so severe that it is part of why he argues that the transport protocols need to move into the NICs.

Another implied requirement is for doing congestion control in the network; the buffers and queues in the network devices need to be managed correctly. Congestion in the core fabric is avoidable if you can do load balancing correctly, he argued, which is not happening today; "TCP cannot do load balancing correctly". Congestion at the edge (i.e. the downlink to the end host) is unavoidable because the downlink capacity can always be exceeded by multiple senders; if that is not managed well, though, latency increases because of buffering buildup.

TCP shortcomings

TCP is an amazing protocol that was designed 40 years ago when the internet looked rather different than it does today; it is surprising that it has lasted as long as it has with the changes in the network over that span. Even today, it works well for wide-area networks, but there were no data centers when TCP was designed, "so unsurprisingly it was not designed for data centers". Ousterhout said that he would argue that every major aspect of the TCP design is wrong for data centers. "I am not able to identify anything about TCP that is right for the data center."

He listed five major aspects of the TCP design (stream-oriented, connection-oriented, fair scheduling, sender-driven congestion control, and in-order packet delivery) that are wrong for data-center applications and said he would be discussing each individually in the talk; "we have to change all five of those". To do that, TCP must be removed, at least mostly, from the data center; it needs to be displaced, though not completely replaced, with something new. One candidate is the Homa transport protocol that he and others have been working on. Since switching away from TCP will be difficult, though, adding support for Homa or some other data-center-oriented transport under RPC frameworks would ease the transition by reducing the number of application changes required.


TCP is byte-stream-oriented, where each connection consists of a stream of bytes without any message boundaries, but applications actually care about messages. Receiving TCP data is normally done in fixed-size blocks that can contain multiple messages, part of a single message, or a mixture of those. So each application has to add its own message format on top of TCP and pay the price in time and complexity for reassembling the messages from the received blocks.
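To make that concrete, here is a minimal sketch (not from the talk) of the kind of framing layer applications end up writing on top of a TCP byte stream; the four-byte length prefix is an assumed convention:

    import socket
    import struct

    def send_message(sock: socket.socket, payload: bytes) -> None:
        # Prefix each message with its length, since TCP itself preserves
        # no message boundaries.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_exactly(sock: socket.socket, n: int) -> bytes:
        # recv() may return any number of bytes up to n, so keep reading
        # until the requested amount has been reassembled.
        chunks = []
        while n:
            chunk = sock.recv(n)
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def recv_message(sock: socket.socket) -> bytes:
        (length,) = struct.unpack("!I", recv_exactly(sock, 4))
        return recv_exactly(sock, length)

Every application pays some version of this reassembly cost before it can even look at a request.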

That is annoying, he said, but not a show-stopper; that you cannot do load balancing with TCP is the show-stopper. You cannot split the handling of the received byte-stream data to multiple threads because the threads may not receive a full message that can be dispatched and parts of a message may be shared with several threads. Trying to somehow reassemble the messages cooperatively between the threads would be fraught. If someday NICs start directly dispatching network data to user space, they will have the same problems with load balancing, he said.

There are two main ways to work around this TCP limitation: with a dispatcher thread that collects up full messages to send to workers or by statically allocating subsets of connections to worker threads. The dispatcher becomes a bottleneck and adds latency; that limits performance to around one million short messages per second, he said. But static load balancing is prone to performance problems because some workers are overloaded, while others are nearly idle.
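A minimal sketch of the dispatcher workaround (reusing the recv_message() helper from the sketch above; handle() stands in for an assumed application callback) shows where the bottleneck comes from: every message funnels through one thread.

    import queue
    import selectors
    import threading

    work_queue: "queue.Queue[bytes]" = queue.Queue()

    def dispatcher(connections):
        # A single thread collects complete messages from every connection
        # and feeds them to a shared queue; all traffic serializes here.
        sel = selectors.DefaultSelector()
        for sock in connections:
            sel.register(sock, selectors.EVENT_READ)
        while True:
            for key, _ in sel.select():
                # Simplification: recv_message() blocks until the whole
                # message on this connection has arrived.
                work_queue.put(recv_message(key.fileobj))

    def worker():
        while True:
            handle(work_queue.get())    # handle() is an assumed application callback

    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()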

Beyond that, due to head-of-line blocking, small messages can get trapped behind larger ones and need to wait for the messages ahead of them to be transmitted. The TCP streams also do not provide the reliability guarantees that applications are looking for. Applications want to have their message delivered, processed by the server, and a response returned; if any of those fail, they want some kind of error indication. Streams only deliver part of that guarantee and many of the failures that can occur in one of those round-trip transactions are not flagged to the application. That means applications need to add some kind of timeout mechanism of their own even though TCP has timeouts of various sorts.

The second aspect that is problematic is that TCP is connection-oriented. It is something of an "article of faith in the networking world" that you need to have connections for "interesting properties like flow control and congestion control and recovery from lost packets and so on". But connections require the storage of state, which can be rather expensive; it takes around 2000 bytes per connection on Linux, not including the packet buffers. Data-center applications can have thousands of open connections, however, and server applications can have tens of thousands, so that storing the state adds a lot of memory overhead. Attempts to pool connections to reduce that end up adding complexity—and latency, as with the dispatcher/worker workaround for TCP load balancing.
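The arithmetic behind that overhead is straightforward; the per-connection figure is the one he cited, while the connection count is an assumption for illustration:

    state_per_conn = 2000      # bytes of per-connection state on Linux, per the talk
    connections = 50_000       # assumed connection count for a busy server
    print(f"{state_per_conn * connections / 2**20:.0f} MiB")   # ~95 MiB, before packet buffers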

In addition, a round-trip is needed before any data is sent. Traditionally, that has not been a big problem because the connections were long-lived and the setup cost could be amortized, but in today's microservices and serverless worlds, applications may run for less than a second—or even just for a few tens of milliseconds. It turns out that the features that were thought to require connections, congestion control and so on, can be achieved without them, Ousterhout said.

TCP uses fair scheduling to share the available bandwidth among all of the active connections when there is contention. But that means that all of the connections finish slowly; "it's well-known that fair scheduling is a bad algorithm in terms of minimizing response time". Since there is no benefit to handling most (but not all) of a flow, it makes sense to take a run-to-completion approach; pick some flow and handle all of its data. But that requires knowing the size of the messages, so that the system knows how much to send or receive, which TCP does not have available; thus, fair scheduling is the best that TCP can do. He presented some benchmarking that he had done that showed TCP is not even actually fair, though; when short messages compete with long messages on a loaded network, the short messages show much more slowdown ("the short messages really get screwed").
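A toy example (not from the talk) shows why fair sharing is bad for response time; assume a 1-unit and a 10-unit message become ready at the same instant on a link that moves one unit per time step:

    # Fair sharing: both flows progress at half speed until the short one finishes.
    #   short message: 1 unit at 0.5 units/step -> finishes at t=2
    #   long message:  1 unit done by t=2, 9 left at full speed -> finishes at t=11
    fair = {"short": 2, "long": 11}

    # Run-to-completion (shortest first): the short message goes first.
    #   short message: finishes at t=1
    #   long message:  finishes at t=1+10=11, no worse than before
    srpt = {"short": 1, "long": 11}

    for name in ("short", "long"):
        print(f"{name}: fair={fair[name]}  run-to-completion={srpt[name]}")

The short message finishes twice as fast under run-to-completion while the long one is not delayed at all.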

The fourth aspect of TCP that he wanted to highlight is its sender-driven congestion control. Senders are responsible for reducing their transmission rates when there is congestion, but they have no direct way to know when they need to do so. Senders are trying to avoid filling up intermediate buffers, so the congestion signals are based on how full the buffers are in TCP.

In the extreme case, queues overflow and packets get dropped, which causes the packets to time out; that is "catastrophic enough" that it is avoided as much as possible. Instead, various queue-length indications are used as congestion notifications that the sender uses to scale back its transmission. But that means there is no way to know about congestion without having some amount of buffer buildup—which leads to delays. Since all TCP messages share the same class of service, all messages of all sizes queue up in the same queues; once again, short-message latency suffers.

The fifth aspect of the TCP design that works poorly for data centers is that it expects packets to be delivered in the same order they were sent in, he said; if packets arrive out of order, it is seen as indicating a dropped packet. That makes load balancing difficult both for hardware and software. In the hardware, the same path through the routing fabric must be used for every packet in a flow so that there is no risk of reordering packets, but the paths are chosen independently by the flows and if two flows end up using the same link, neither can use the full bandwidth. This can happen even if the overall load on the network fabric is low; if the hash function used to choose a path just happens to cause a collision, congestion will occur.
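The path selection being described is the usual ECMP-style flow hashing; the hash function and the four-uplink fabric in this sketch are assumptions for illustration:

    import random
    import zlib

    UPLINKS = 4   # assumed number of equal-cost paths out of a switch

    def pick_path(src, dst, sport, dport):
        # Hash the flow's addressing fields so that every packet of a flow
        # takes the same path and cannot be reordered in the fabric.
        return zlib.crc32(f"{src}:{sport}-{dst}:{dport}".encode()) % UPLINKS

    # Even a handful of flows on an otherwise idle fabric will often hash to
    # the same uplink and then have to share its bandwidth.
    flows = [("10.0.0.%d" % i, "10.0.1.1", random.randrange(32768, 61000), 80)
             for i in range(8)]
    paths = [pick_path(*f) for f in flows]
    print(paths, "collisions:", len(paths) - len(set(paths)))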

He hypothesizes that the dominant cause of congestion in today's data-center networks is this flow-consistent routing required by TCP. He has not seen any measurements of that, but would be interested; he invited attendees who had access to data-center networks to investigate it.

Processing the packets in software also suffers from this load-balancing problem. In Linux, normally a packet will traverse three CPU cores, one where the driver code is running, another where the network-stack processing is done (in a software interrupt), and a third for the application. In order to prevent out-of-order packets, the same cores need to be used for all of the packets in a flow. Like with the hardware, though, if two flows end up sharing a single core, that core becomes a bottleneck. That leads to uneven loading in the system; he has measured that it is the dominant cause of software-induced tail latency for TCP. That is also true for Homa on Linux, he said.

There is a question of whether TCP can be repaired, but Ousterhout does not think it is possible. There are too many fundamental problems that are interrelated to make that feasible. In fact, he can find no part of TCP that is worth keeping for data centers; if there are useful pieces, he would like to hear about them. So, in order to get around the "software tax" and allow applications to use the full potential of the available networking hardware, a new protocol that is different from TCP in every aspect will be needed.

That ended the first half of Ousterhout's keynote; next up is more on the Homa transport protocol that has been developed at Stanford. Homa is a clean-slate protocol designed specifically for the needs of data centers. Tune in for our report on that part of the talk in a concluding article that is coming soon.





Moving past TCP in the data center, part 1

Posted Nov 2, 2022 0:51 UTC (Wed) by KaiRo (subscriber, #1987) [Link] (33 responses)

I'm no expert, but it looks to me like TCP has established itself in datacenters because you can use the same standard components, tools, and software as everywhere else. Now, TCP having issues is also something that I've heard from different sources, and esp. from the HTTP camp, which has based HTTP/3 around IETF QUIC (not the same but an evolution of Google QUIC), a UDP-based protocol. Now, I wonder how much of datacenter traffic is HTTP, and how it would/will change when all of that moves to that HTTP/3 stack...

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 3:06 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (30 responses)

If anything, QUIC is even worse because it mandates encryption which is not really needed in a DC.

The criticisms of TCP for DC applications are on point. Any sane RPC design already has to deal with retries, timeouts and request idempotency. So switching to a connectionless protocol is definitely an interesting way.

With a way to fall back onto reliable streaming protocol as needed (e.g. if an RPC call needs to return a large amount of data).

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 4:13 UTC (Wed) by willy (subscriber, #9762) [Link] (7 responses)

A flow-control-free protocol is not exactly a new idea. Here's IL from plan9 (circa 1993; the "4th edition" in the URL is misleading): http://doc.cat-v.org/plan_9/4th_edition/papers/il/

They still saw an advantage to maintaining a connection in order to manage reliable service. I don't know whether that was the right choice, but I'm looking forward to reading about Homa's design decisions.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 4:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Even Homa is excessive for most cases. Way back when I worked in a large cloud company, we did some experiments with UDP-based transport for RPC.

The idea was dead simple: just throw away EVERYTHING.

The network layer simply used fixed-size jumbo Ethernet frames and a request could contain up to 4 of them. The sender simply sent them one by one. No retransmissions or buffering, so the code was dead simple.

The receiver simply reassembled the frames into a buffer. Since the frame sizes were fixed, only 1 data copy was necessary (or none, if a request fits into 1 packet). No NAKs, ACKs or anything was needed, just a simple timer to discard the data and fail the request in case of a timeout due to a lost packet.

Everything else was handled on the upper levels. E.g. a packet loss simply got translated into a timeout error from the RPC layer. Instead of a network-level retry, regular retry policy for the service calls was used. It worked surprisingly well in experiments, and actually had a very nice property of making sure that the network congestion pressure rapidly propagates upstream.
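A rough sketch of the scheme as described (the frame layout, the 8KB payload size, and the timeout below are all assumptions for illustration):

    import struct
    import time

    FRAME_PAYLOAD = 8192              # assumed fixed jumbo-frame payload size
    MAX_FRAMES = 4                    # a request fits in at most four frames
    HDR = struct.Struct("!QBB")       # request id, frame index, frame count

    def send_request(sock, addr, req_id, payload):
        frames = [payload[i:i + FRAME_PAYLOAD]
                  for i in range(0, len(payload), FRAME_PAYLOAD)] or [b""]
        assert len(frames) <= MAX_FRAMES
        for idx, frame in enumerate(frames):
            # Fire and forget: no ACKs, no retransmission, no sender buffering.
            sock.sendto(HDR.pack(req_id, idx, len(frames)) + frame, addr)

    class Reassembler:
        def __init__(self, timeout=0.2):
            self.pending = {}         # req_id -> (deadline, list of frames)
            self.timeout = timeout

        def feed(self, datagram):
            req_id, idx, count = HDR.unpack_from(datagram)
            deadline, frames = self.pending.setdefault(
                req_id, (time.monotonic() + self.timeout, [None] * count))
            frames[idx] = datagram[HDR.size:]
            if all(f is not None for f in frames):
                del self.pending[req_id]
                return req_id, b"".join(frames)    # complete request
            return None

        def expire(self):
            # A lost frame simply turns into a timeout; the RPC layer's
            # ordinary retry policy takes it from there.
            now = time.monotonic()
            for req_id, (deadline, _) in list(self.pending.items()):
                if now > deadline:
                    del self.pending[req_id]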

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 9:21 UTC (Wed) by amboar (subscriber, #55307) [Link]

This is roughly the strategy used by the DMTF's MCTP protocol for intra-platform communication between devices. It's kinda impressive in its simplicity and effectiveness.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 12:36 UTC (Wed) by paulj (subscriber, #341) [Link] (3 responses)

Sending back-to-back frames - at a low enough level to ensure no other station could try to send in between - was a trick SGI used in IRIX NFS to get performance - I think I remember reading this in comments on LWN before, perhaps in a reply to you mentioning the same thing before. :)

NFS was of course RPC over UDP based too.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 21:12 UTC (Wed) by amarao (guest, #87073) [Link] (2 responses)

I remember a time when people said that UDP for NFS is bad and we should switch to TCP. Has something changed?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 21:32 UTC (Wed) by joib (subscriber, #8541) [Link]

https://www.man7.org/linux/man-pages/man5/nfs.5.html#TRAN... has a discussion about the pitfalls of NFS over udp.

Should also be noted that current Linux kernels actually no longer support NFS over UDP.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 9:52 UTC (Thu) by paulj (subscriber, #341) [Link]

NFS switched to TCP /ages/ ago. However, in the beginning (and for quite a while) NFS ran its RPC protocol over UDP, just without any real flow control and with poor loss handling. TCP provides flow control and better loss handling - though still not great (because of the impedance mismatch of TCP knowing only about the data as a byte stream, and not being able to take the application's message boundaries into consideration), as per this talk.

Moving past TCP in the data center, part 1

Posted Nov 10, 2022 1:06 UTC (Thu) by TheJosh (guest, #162094) [Link]

Did this ever get used in production?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 11:04 UTC (Wed) by paulj (subscriber, #341) [Link] (6 responses)

Large tech companies want encryption within their DC. They have worked hard at compartmentalising access levels. This still leaves some super-user roles able to snoop on network traffic. Encryption at least prevents those with network super-user roles from snooping on server traffic, and those with super-user access on some server-roles from being able to snoop on traffic for other server-roles (if they can get at the traffic, which they probably can due to internal service-IP advertising tools).

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:55 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

We had integrity control and request authentication as part of the protocol (basically, hash the message along with the secret key that only you and the receiver know). Mostly to make sure the data is not corrupted in transit, which absolutely does happen at large enough scales.
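Something along these lines (a sketch, not the actual protocol described) gives that kind of integrity and request authentication with a shared secret:

    import hashlib
    import hmac

    def seal(key: bytes, message: bytes) -> bytes:
        # Append an HMAC computed with the shared secret; only a holder of the
        # key can produce or verify it, and corruption in transit changes it.
        return message + hmac.new(key, message, hashlib.sha256).digest()

    def open_sealed(key: bytes, sealed: bytes) -> bytes:
        message, tag = sealed[:-32], sealed[-32:]
        expected = hmac.new(key, message, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("corrupted or forged message")
        return message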

The service for this protocol was used to move already encrypted data, so additional layers of encryption were unnecessary for us. But a more general protocol should definitely have support for it.

> Encryption at least prevents those with network super-user roles from snooping on server traffic, and those with super-user access on some server-roles from being able to snoop on traffic for other server-roles

Snooping at datacenter-scale is surprisingly not at all useful. You can observe only a small amount of traffic if you're trying to do it by physically splicing into cables, and if you can do it on the machines that run the network code, it's probably game over already.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 9:59 UTC (Thu) by paulj (subscriber, #341) [Link] (4 responses)

> Snooping at datacenter-scale is surprisingly not at all useful. You can observe only a small amount of traffic if you're trying to do it by physically splicing into cables, and if you can do it on the machines that run the network code, it's probably game over already.

If an attacker has access to machines, without role-authentication and encryption an attacker can potentially use that to widen the number of services and access-roles the attacker can manipulate. (And authentication without a MAC is effectively the same - and if you're going to MAC the traffic, you can as easily encrypt it).

Bear in mind, in the large tech companies some of the server machines basically have a (host controlled) switch built-in to them, to allow multiple hosts in a brick/sled to share the same PHY to the network. Also, the switches are basically (specialised) servers too. So an attacker could insert code in the switching agents to programme the switching hardware to selectively pick out certain flows, for later analysis to use for privilege widening/escalation - quite efficient.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 17:42 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> And authentication without a MAC is effectively the same - and if you're going to MAC the traffic, you can as easily encrypt it

Not quite true for very high-traffic services.

> Bear in mind, in the large tech companies some of the server machines basically have a (host controlled) switch built-in to them, to allow multiple hosts in a brick/sled to share the same PHY to the network.

It's typically more complicated. A host can contain one or more PHYs and they typically go to the TOR (top-of-the-rack) device that does further routing/NAT/whatever.

Moving past TCP in the data center, part 1

Posted Nov 4, 2022 10:28 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

You're thinking of the 'enterprise' type server world. I'm thinking of the much, much, much bigger FANG-type DC world. At at least one of the biggest DC operators in the world, the (newer) servers do not have an ethernet PHY. The host has a PCIe connection to a special shared-NIC (which is like a very cut-down, hacky L2+ switching ASIC). 3 or more hosts connect to the shared-NIC. The shared NIC then switches the host ethernet packets, which are sent over hardware specific messaging to the shared-NIC, onto the PHY.

There is one more PHY, to the BMC, which controls the chassis and can give serial access to the hosts (and some very slow, hacky IP/TCP to the host, IIRC).

There are some large enterprise "blade" type systems which also use onboard switching ASICs I think, but the ones I know of use more traditional switching ASICs (e.g. HPE were using Intel RRC10k), with actual NIC blocks for each host included into the ASIC. So these actually look and work a lot more like a traditional network, with proper buffering and flow-control between the hosts and the switching logic and the upstream (the shared NICs above did not do this properly - at least in earlier iterations - causing significant issues).

Moving past TCP in the data center, part 1

Posted Nov 4, 2022 10:32 UTC (Fri) by paulj (subscriber, #341) [Link]

Oh, and the reasoning for this is that PHYs are expensive. Specifically, the optics are and the power needed for higher-end optics along with the thermal budget/constraints (the DC cabling is also a headache, but more minor). Fewer PHYs, fewer high-end transceivers == cheaper + lower-power, more efficient DCs. At the scales the really really big DC operators work at anyway.

Moving past TCP in the data center, part 1

Posted Nov 23, 2022 22:25 UTC (Wed) by Rudd-O (guest, #61155) [Link]

Having worked in FANG I can attest to the correctness of what you're saying.

Cyberax's views would fit right into, let's say, a Google in the year 2011. In the year 2015 — after Snowden — it was already the consensus that everything needed to be encrypted, no exceptions.

And so all traffic inside Google — anything at all using Stubby — is, in fact, encrypted. As you said, if you're gonna MAC the traffic, might as well crypt it too. All those primitives are accelerated in the CPU, and some in software now.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 20:36 UTC (Wed) by MatejLach (guest, #84942) [Link] (12 responses)

> it mandates encryption which is not really needed in a DC.

Is this still the general consensus even after Snowden? I am not saying it's a silver bullet against a dedicated attacker like the NSA, but I'd think that every additional obstacle in the way is going to raise the bar for dragnets just a tiny bit, no?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 21:58 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (3 responses)

You at least want encryption over any line you don't physically control (i.e. it is not inside a building you own), regardless of ownership/leasing rights, which in practice means you encrypt all inter-DC traffic. You also probably want integrity (signatures, certificates, etc.) for most if not all traffic, to make it more difficult for malware and other adversaries to move laterally within your systems and/or escalate their privileges. Encryption within the DC may not be strictly necessary if you have good physical security and you trust the networking equipment, but those are both big ifs.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 22:40 UTC (Wed) by Wol (subscriber, #4433) [Link] (2 responses)

My company - along with many others I suspect - is moving away from "trusted core". My work laptop now needs 2FA to authenticate to work servers - even when in the office and connecting over corporate infrastructure.

Given the break-ins and stuff businesses have suffered, yes it's an absolute pain, but it means that if anybody does manage to break into my laptop, it's now rather harder for them to jump to someone else and break into their laptop, moving up the chain ...

Cheers,
Wol

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:09 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

> My company - along with many others I suspect - is moving away from "trusted core". My work laptop now needs 2FA to authenticate to work servers - even when in the office and connecting over corporate infrastructure.

1. 2FA has nothing to do with encryption. 2FA is primarily about stopping phishing, and it is only used by humans (I was talking about machine-to-machine communication).
2. Unless your laptop is in the same physical building as all of the servers you will be interacting with, and your company has complete autonomy over that building (i.e. you're not leasing it out from someone who might have physical access), you're not using "trusted" lines in the sense I was referencing. I explicitly said this isn't about who owns or leases the lines. It's about who is able to physically touch and interact with the lines.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 7:56 UTC (Thu) by Wol (subscriber, #4433) [Link]

Yup. Might not be quite what you were talking about, but it's a general trend to restrict the trusted zone, even in a trusted network, such that there's minimal trust between any actors, even if you would assume that they are trusted actors.

Cheers,
Wol

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

NSA is not going to install spy boxes in your DC network switches. There's simply way too much traffic to analyze, and the network complexity makes it infeasible to even do that independently of AWS.

So NSA would simply officially ask Amazon to provide them with a covert way to access the data for a specific customer directly using the AWS services.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 9:48 UTC (Thu) by paulj (subscriber, #341) [Link] (1 responses)

The NSA may however have its employees go and work for your company under cover. It's pretty much a certainty that /multiple/ state intelligence agencies have agents working at the various large tech companies.

Minimising the trust and scope of access of what employees can access, to what they need to access for their immediate role, is a good thing, security wise.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 17:31 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Sure. But the complexity of AWS is so overwhelming that you won't really be able to do much unless you want a very targeted attack and have access through many layers of security (physical and virtual).

Moving past TCP in the data center, part 1

Posted Nov 23, 2022 22:27 UTC (Wed) by Rudd-O (guest, #61155) [Link] (3 responses)

Counterpoint: NSA absolutely does install spyware into network switches.

Snowden already revealed this. There are extensive talks on the subject presented by Jacob Applebaum at CCC.

Moving past TCP in the data center, part 1

Posted Nov 23, 2022 23:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

Sure. So you installed spyware into a TOR switch. Then what? The data plane is too fast for any meaningful analysis with the puny switch CPU.

You maybe can mirror one port to another, but in an Amazon DC this means nothing. You'll just confuse some random EC2 hardware instance. You won't even be able to do much if you redirect the traffic to an instance that you control, because it's encrypted.

Moving past TCP in the data center, part 1

Posted Nov 24, 2022 9:54 UTC (Thu) by paulj (subscriber, #341) [Link]

"The data plane is too fast for any meaningful analysis with the puny switch CPU."

Except the L3 switch ASIC can be programmed to redirect only certain flows to said CPU. They can also be programmed to encap and redirect certain flows to other hosts. Indeed, they can be programmed to mirror packets (but I can't remember if the L3 ASICs commonly used at the super-large DCs can /both/ mirror and encap the same flow - if not, it's just a matter of time till they do).

So:

1. You don't need to analyse the entire data flow on the puny switch CPU, cause the /powerful/ switching ASIC can be programmed to do hardware tcpdumping (basically). Given the CPUs on these switches aren't /that/ puny (low to mid end Xeons), further analysis on host is quite feasible.
2. Even better, you can just redirect the flow you're interested in to a bigger server by encapping it (the server can resend the flow's packets out again so they're not missed).

Moving past TCP in the data center, part 1

Posted Nov 24, 2022 9:56 UTC (Thu) by paulj (subscriber, #341) [Link]

"You won't even be able to do much if you redirect the traffic to an instance that you control, because it's encrypted."

I note this is a good counter-argument to your own argument in another sub-thread that intra-DC traffic doesn't need to be encrypted. ;)

Moving past TCP in the data center, part 1

Posted Nov 23, 2022 22:26 UTC (Wed) by Rudd-O (guest, #61155) [Link]

No it is not the consensus *at all*.

Google encrypt all its traffic at the Stubby layer.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 15:32 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

>mandates encryption which is not really needed in a DC.

nth-ing the pile-on about this. 'Zero Trust' is an important choice when you're sharing network links with unknown people or having malicious and incompetent staff mess things up in your datacentre. QUIC sucks because encrypted links are expensive to set up in the context of small message sizes in a datacentre.

K3n.

Moving past TCP in the data center, part 1

Posted Nov 10, 2022 18:32 UTC (Thu) by flussence (guest, #85566) [Link]

> it mandates encryption which is not really needed in a DC.

If your org's budget doesn't extend to CPUs with AES-NI instructions that operate at line rate and you're *sure* it's a closed system, there's nothing really stopping you from dropping in a highly optimised rot26 cipher, like SSH-HPN does.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 5:31 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (1 responses)

In my experience, HTTP is mostly used between the frontend, the load balancer, and the end user (who is not actually in the data center, one assumes). This is more than it sounds because the "frontend" may be composed of many interdependent microservices rather than just being one giant monolith, so you could have more datacenter HTTP than you might expect (at least for folks going the microservice route, anyway). But backend stuff is generally RPCs or RPC-like-messages, because:

1. HTTP is overkill. For many services, you don't need e.g. If-Modified-Since, Cache-Control, etc., and if you do want that sort of thing, you probably want to fine-tune exactly how it works rather than having the HTTP library try to handle it.
2. HTTP is underkill. While it is technically possible for the client to present an X.509 certificate to the server to establish its identity, these certificates are (usually) designed to be long-lived and signed by a CA (which could be your own private CA). If you want to get into the business of minting your own bearer tokens, doing so in a way that is performant and not hilariously insecure or SPoF'd on the CA starts looking pretty dicey. Sometimes, you might also want to present multiple credentials for securing different aspects of the request (e.g. "this token proves I am the XYZ service, and this token proves that user 1234 has authorized me to access or manipulate their PII"), and doing that over HTTP is fiddly at best.
3. HTTP is text-based, so you have to finagle all of your data into text (or something reasonably resembling text, like JSON), and then unfinagle it at the other end. For every RPC. This ends up being ludicrously expensive when you consider the sheer scale that we're talking about here.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 5:34 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> 3. HTTP is text-based, so you have to finagle all of your data into text (or something reasonably resembling text, like JSON), and then unfinagle it at the other end. For every RPC. This ends up being ludicrously expensive when you consider the sheer scale that we're talking about here.

OK, I lied, you don't have to do that. But you do have to emit and parse text-based headers, which is still unreasonably expensive.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 2:49 UTC (Wed) by shemminger (subscriber, #5739) [Link] (7 responses)

The overall question posed is correct, but there are problems with the details:
1. The cost of TCP is often that both the kernel and the application need to look at the data (i.e. cache misses).
2. Legacy data centers use a legacy MTU of 1500 bytes or maybe 4K. The recent work for >64K segments reduces the per-packet cost.
3. The assumptions around TCP bubble up into the legacy application stack (HTTP etc.).
4. The incoming packets don't have to hit three CPUs. This sounds like a case where academics don't understand the current Linux TCP stack.

legacy MTU of 1500 bytes

Posted Nov 2, 2022 20:32 UTC (Wed) by stephen.pollei (subscriber, #125364) [Link] (5 responses)

I really wish that path MTU discovery worked better and the default MTU would typically be much higher. We have networks that run a thousand times faster, so using a factor of about ten to increase the typical MTU seems reasonable enough. 16k at least, everywhere, might be a nice dream.

legacy MTU of 1500 bytes

Posted Nov 3, 2022 10:50 UTC (Thu) by farnz (subscriber, #17727) [Link] (4 responses)

Switches are the big pain here - they've created a world in which, because the switch can only forward or drop frames (where a router can also send errors back to the source, or possibly fragment frames), we can't increase packet size - a 100G switch that cannot interconnect to 10G networks is not useful, so the 100G switch uses 10G compatible frame size limits, and this repeats all the way down to a 100M switch wanting 10M compatible frame size limits.

One of the reasons IPv6 has the concept of link-local addressing is that design work on it started before switches were a thing in most computer networks - and the idea was that you'd have routers interconnecting network segments of different speeds (so your 100M FDDI network would go through a router to your 10M Ethernet segment, and thus devices on the FDDI network could use a higher MTU). If the industry as a whole had kept this idea alive, instead of having Ethernet switches interconnecting different speeds (10/100/1000/10G Ethernet all carrying identical frames), we'd have had routers, and we could have split the gains from faster wire speeds differently (because moving from one speed to another would go via a router that could send a packet too big message).

If we'd done that, we'd have had three choices:

  1. Just improve serialization delay of the maximum packet size. This is what we actually did - we went from 1.2 ms for 10BASE-T to 0.12 ms for 100BASE-T to 0.012 ms for gigabit.
  2. Just increase maximum MTU. We could have gone from 1,500 bytes to 15,000 bytes to 150,000 bytes for gigabit (with 100G allowing a maximum MTU of 15,000,000 bytes), maintaining serialization delay at 1.2 ms.
  3. Split the difference. As an example tradeoff, we could have increased MTU 4x each generation, resulting in a 2.5x drop in serialization delay - 1500 bytes and 1.2 ms at 10M becomes 6000 bytes and 0.48 ms at 100M, which becomes 24,000 bytes and 0.2 ms at gigabit.

legacy MTU of 1500 bytes

Posted Nov 3, 2022 17:39 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Switches are the big pain here - they've created a world in which, because the switch can only forward or drop frames

The MTU should have been reified to the IP level. Not doing this was one of the most unforgivable errors in the IPv6 development.

In the ideal world all IP packets would have two fields: the forward minimum packet size and the reverse minimum packet size. A switch/router, during packet forwarding, sets the forward minimum size to its own maximum packet size if that is less than the current value. The reverse size is simply propagated back.
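A sketch of how that could behave, with each hop simply clamping the field to its own limit (the field names and MTU values are assumptions):

    from dataclasses import dataclass

    @dataclass
    class Packet:
        fwd_min_mtu: int = 65535    # clamped by every device on the forward path
        rev_min_mtu: int = 65535    # carried back unchanged in replies

    def forward(pkt: Packet, link_mtu: int) -> Packet:
        # Each switch or router lowers the forward field to its own limit, so
        # the receiver (and, via the reverse field, the sender) learns the
        # path MTU without any separate ICMP-based discovery.
        pkt.fwd_min_mtu = min(pkt.fwd_min_mtu, link_mtu)
        return pkt

    pkt = Packet()
    for mtu in (9000, 1500, 4000):   # assumed link MTUs along the path
        pkt = forward(pkt, mtu)
    print(pkt.fwd_min_mtu)           # -> 1500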

legacy MTU of 1500 bytes

Posted Nov 3, 2022 17:45 UTC (Thu) by guus (subscriber, #41608) [Link]

> The MTU should have been reified to the IP level. Not doing this was one of the most unforgivable errors in the IPv6 development.

Actually the same should be done for congestion control; it makes no sense that multiple TCP connections to the same host each implement their own congestion control loop when all packets go via exactly the same path. This is in effect what QUIC is doing: combine multiple reliable, in-order streams with a single congestion control loop. But of course that only is effective if you really use a single QUIC connection.

legacy MTU of 1500 bytes

Posted Nov 4, 2022 23:15 UTC (Fri) by stephen.pollei (subscriber, #125364) [Link] (1 responses)

I think that I'm in camp #3 "Split the difference". Some decrease in packet transmit latency seems good, but it isn't the only thing to optimize. I also remember 56k modems, where 1500 octets is 214 milliseconds worth of time.

In practice, path MTU discovery can break in interesting ways. People over-block ICMP, and perhaps a mechanism closer to what Cyberax suggested would be nice. With tunneling and MTU discovery having issues, CloudFlare at one point reduced their MTU to 1024 for IPv4 and 1280 for IPv6. It might have been a nicer world in many ways if routers/switches supported 16k packets, but most people only used 12k normally. The extra 4k could be margin for tunneling and encapsulation. This is of course all IMHO.

legacy MTU of 1500 bytes

Posted Nov 5, 2022 9:47 UTC (Sat) by farnz (subscriber, #17727) [Link]

To an unfortunately large extent, broken path MTU discovery is accepted on the Internet because you can pretty much assume that a 1280 byte MTU will be close enough to optimal that it makes no practical difference. If we had much larger MTUs - 6,000 bytes for 100M Ethernet, 24k for gig, 96k for 10G (yes, this needs jumbograms at the IP layer), WiFi probably at around 12k for 802.11n, and about 50k for 802.11ac - we'd have much more incentive to get PMTUD working well - the difference between 1280 bytes (minimum IPv6 frame size) and 12k for 802.11n is huge.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:38 UTC (Wed) by stefanha (subscriber, #55072) [Link]

Can you describe which CPUs are involved in receiving TCP packets?

I guess in the ideal case there is flow steering that ensures the incoming packets appear in the receive queue handled by the CPU where the application calls recv(2). So just one CPU?

Do those conditions arise automatically or does it require manual tuning? Is application support required?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 4:01 UTC (Wed) by neilbrown (subscriber, #359) [Link] (1 responses)

I thought that RDMA was the "new" way to get better performance in the data center.
It is already implemented in (high end) hardware, and there are already RPC implementations layered over it.

I wonder what it is missing? Low latency for tiny messages, maybe?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 9:39 UTC (Wed) by joib (subscriber, #8541) [Link]

From https://web.stanford.edu/~ouster/cgi-bin/papers/replaceTc... (which I guess is the paper this presentation is based on), section 6 deals with Infiniband.

> However, RDMA shares most of TCP’s problems. It is based on streams and
> connections (RDMA also offers unreliable datagrams, but these have problems
> similar to those for UDP). It requires in-order packet delivery. Its
> congestion control mechanism, based on priority flow control (PFC), is
> different from TCP’s, but it is also problematic. And, it does not
> implement an SRPT priority mechanism.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 6:39 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

BTW, by some coincidence there's a very timely article from Geoff Huston comparing TCP and QUIC: https://www.potaroo.net/ispcol/2022-11/quicvtcp.html

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 10:08 UTC (Wed) by ale2018 (guest, #128727) [Link] (2 responses)

I'm surprised SCTP is not mentioned, neither in the article nor in comments.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 18:16 UTC (Thu) by jonesmz (subscriber, #130234) [Link]

Same.

It's what the WebRTC data channel is built on.

It's relatively easy to implement over UDP (not simple, not easy, just relatively easy compared to other complex protocols).

Moving past TCP in the data center, part 1

Posted Nov 6, 2022 15:26 UTC (Sun) by dullfire (guest, #111432) [Link]

SCTP does seem like it would solve several of the issues.

Even if it doesn't solve all the issues, it would be sane to at least state why SCTP is not a good solution.

And their aversion to encryption is... odd? Yes, it's a data center. Yes, you may not see an attack vector. But these kinds of attacks are only growing. Unless you are literally doing something like "cat /dev/zero | ssh ${other-dc-system} dd of=/dev/null" (in which case there is no useful information being transmitted, and nothing is being done with that info either), it's not really sane to assume clear traffic is good. I've heard of several cases where "surprise, your totally benign clear connections can be used as a weapon".

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 11:15 UTC (Wed) by smoogen (subscriber, #97) [Link] (1 responses)

How do you deal with inter DC traffic? It seems to be common these days to just move half your compute needs to a different DC for cost or similar reasons. These may even be different 'clouds', and you may end up moving stuff live depending on failure or rollout issues. Would this mean that applications would need to be dual stacked? As in 'use HOMA' for local and fail over to TCP for 'been moved to Canada'?

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 19:30 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

It depends on how your traffic is routed:

1. If you're only going over fiber links that you own (or lease), then you can send whatever you want over those links, but if you start tampering with the IP layer or below, some routers and/or switches might be unhappy about that. Playing with the transport layer should be fine in most reasonable deployments.
2. If you're going over one fixed ISP's lines, then you can experiment and/or talk to them to figure out exactly how transparent they are to "weird" protocols. Most ISPs at least have the good sense to be transparent-ish to TCP, but some ISPs will be fully IP transparent and won't care what you do at the transport layer. This is probably easier if you are paying them a lot of money and/or have some sort of peering arrangement with them.
3. If you're going over the public internet, I imagine there's a lot of old gear that will do bizarre things if you try sending packets it doesn't recognize.
4. You can always encapsulate your non-TCP transport in UDP, although that adds enough overhead that it may not be worth it. Some ISPs may not treat UDP nicely, but IMHO you have good cause to complain if that happens.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 16:21 UTC (Wed) by jccleaver (guest, #127418) [Link] (2 responses)

Curious how many ideas they've revisited from the past here. TCP won due to interconnection ubiquity in the '90s while things from AppleTalk (too chatty, but amazing local discovery) to IPX (painful) went by the wayside, especially as layered on top of IP, but there were a lot of baby ideas thrown out with the bathwater.

Assuming we're keeping IP, I'd be happier to see an old transport protocol with latent potential rediscovered than we move from 14->15 standards.

Moving past TCP in the data center, part 1

Posted Nov 29, 2022 20:03 UTC (Tue) by marcH (subscriber, #57642) [Link]

> Curious how many ideas they've revisited from the past here.

I remember discussions about replacing TCP 20 years ago. Should be easy to find if you follow the references in publications. It's like nuclear fusion: it's always 10 years away :-)

This all looks like a very good "state of the union" but commodity and "good enough" seems to win every time. As rich as data center companies are, they're still minuscule compared to the rest of the world using TCP. Moreover their lawyers and businessmen are very likely getting in the way of working together with competitors. It's not enough to tell them "look at how successful Linux is", the obvious answer is "So what? Why would we help our competitors catch up with us?"

QUIC has been successful only because Google controls _both_ the browser and the servers - no cooperation required.

Moving past TCP in the data center, part 1

Posted Nov 29, 2022 23:43 UTC (Tue) by marcH (subscriber, #57642) [Link]

> (too chatty, but amazing local discovery)

Off-topic but other-machine.local multicast DNS works most of the time. Often enough that I rarely ever typed an IP address in the last 10 years (and of course you really don't want to type an IPv6 address....)

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 20:13 UTC (Wed) by jenro (subscriber, #7024) [Link] (7 responses)

Is it just me who thinks some of these ideas sound rather strange?

If I summarize the article it seems to read like: TCP is not a good base for any kind of RPC-system, so ditch all of TCP and replace it with a shiny new thing.

Is it really true that all or at least most traffic in a DC is some kind of RPC or request/reply system? Is it really true that small messages are the ones that the whole DC needs to be optimized for? I highly doubt this.

I am an application developer. When I request data from any kind of information service - for example a database - I want the data to be complete and reliable. And to me that is far more important than to have some data faster. The nice thing about a TCP session is: as long as the session stays connected and error free, it doesn't matter if I retrieve 100 bytes or 100,000 bytes or 100,000,000 bytes; TCP guarantees that I get all the data in the correct order.

On the other hand I have to work with market data feeds of different stock exchanges like NYSE or Xetra. They use UDP multicast to distribute the information. Dealing with these unreliable transmissions and recovering from lost packets is something you really don't want to do.

So let's wait for the second part of the article and then decide if I want to give up my reliable TCP connections for Homa.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 21:13 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

> Is it really true, that all or at least most traffic in a DC is some kind of RPC or request/reply system?

In my experience (as a Google SRE), yes, at least for the way that Google tends to architect its services, anyway.

> I am an application developer. When I request data from any kind of information service - for example a database - I want the data to be complete and reliable. And to me that is far more important than to have some data faster. The nice thing about a TCP session is: as long as the session stays connected and error free it doesn't matter, if I retrieve 100 bytes or 100,000 bytes or 100,000,000 bytes, TCP guarantees, that I get all the data in the correct order.

For a typical CRUD app, you're generally dealing with one tuple at a time, or at most a few dozen (e.g. for search results).

Yes, we do have batch jobs that look at many (or all) tuples and manipulate them in some way. No, those jobs are not top priority. Top priority is responding to incoming HTTP requests in a timely fashion. If that means throwing away hours of batch work because live serving needs the machine, so be it.

> On the other hand I have to work with market data feeds of different stock exchanges like NYSE or Xetra. They use UDP multicast to distribute the information. Dealing with this unreliable transmissions and recovering from lost packets is something you really don't want to do.

Sounds like you're in finance. This is a very different sort of industry to tech - your top priority is probably some variation of figuring out which way the market is moving, not responding to incoming HTTP. So things are going to look different on your side of the fence. That does not mean the article is wrong. It just doesn't apply to your particular situation.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 21:19 UTC (Wed) by joib (subscriber, #8541) [Link] (2 responses)

> Is it just me, who thinks some of these ideas sound rather strange?

It's just you.

> Is it really true, that all or at least most traffic in a DC is some kind of RPC or request/reply system? Is it really true, that small packages are the ones that the whole DC needs to be optimized for? I highly doubt this.

No, the argument is that 'lots of small messages' is an important usecase that TCP struggles with, due to inherent features in the TCP design. Large transfers are certainly not unimportant, but OTOH, if you can fill your pipe with lots of small messages, then the special case of a few very large messages is kind of trivial.

> if I want to give up my reliable TCP connections for Homa.

So with TCP you get 'reliable stream', and with UDP you get 'unreliable datagram'. The argument here is that for a lot of applications, the correct primitive is neither of these, but rather a 'reliable datagram' kind of service. Which is what this Homa proposal is apparently providing. Incidentally, Amazon's SRD is another design that has converged to a roughly similar conclusion.

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:39 UTC (Wed) by willy (subscriber, #9762) [Link]

It sounds like it's reliable out-of-order datagrams which is different from SOCK_SEQPACKET (reliable in-order datagrams)

Moving past TCP in the data center, part 1

Posted Nov 2, 2022 23:43 UTC (Wed) by rcampos (subscriber, #59737) [Link]

TIPC is another protocol that offers reliable datagrams. It has been merged into Linux since at least the 2.6 era.

It is a very interesting protocol for local networks, but I don't know of any benchmarks for it.

Moving past TCP in the data center, part 1

Posted Nov 3, 2022 8:20 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> The nice thing about a TCP session is: as long as the session stays connected and error free it doesn't matter, if I retrieve 100 bytes or 100,000 bytes or 100,000,000 bytes, TCP guarantees, that I get all the data in the correct order.

This isn't such a great advantage for many datacenter applications. You have to design services very differently if they need to handle 1K or 100M of data.

In fact, the ability to hide this leads to bad designs.

Moving past TCP in the data center, part 1

Posted Dec 2, 2022 20:57 UTC (Fri) by marcH (subscriber, #57642) [Link] (1 responses)

Moving past TCP in the data center, part 1

Posted Dec 4, 2022 5:25 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Absolutely. Including discussions of emergent behavior, such as retry storms.

Moving past TCP in the data center, part 1

Posted Nov 10, 2022 3:05 UTC (Thu) by endric (guest, #137197) [Link]

Leaving legacy protocols behind, or at least complementing them with newer, domain-specific protocols, is sometimes the best option for advancing compute systems. We just experienced this with NVMe "replacing" legacy storage protocols SATA and SAS for better price/performance.

Many distributed systems (not just in the datacenter) communicate better via short messages rather than one big stream of data. Homa as a new reliable, rapid request-response protocol may demonstrate its benefits: lower (tail) latencies, less burden on the CPUs, and higher utilization of network infrastructure are interesting promises.

Anyone else interested in looking closer at this?

Moving past TCP in the data center, part 1

Posted Nov 14, 2022 5:23 UTC (Mon) by gurugio (guest, #113827) [Link]

I was not a big fan of InfiniBand, but it is getting promising to me these days for the data-center network. Some of the disadvantages of TCP in this document look covered by InfiniBand/RDMA.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds