Moving past TCP in the data center, part 2
Our earlier article on John Ousterhout's talk at Netdev 0x16 ended with his conclusion that TCP is unsuitable for data-center environments for a variety of reasons, and that there is no way to repair TCP so that it can serve the needs of data-center networking. For software to use the full potential of today's networking hardware, TCP needs to be replaced with a protocol that is different in almost every way, he said. The second half of the talk covered the Homa transport protocol that he and others at Stanford have been working on as a possible replacement for TCP in the data center.
The Homa project set out to design a protocol from scratch that would be ideal for the needs of data-center networking today. It turned out to be different from TCP in each of the five aspects that he had covered in the first half of the talk; the choices Homa makes in those areas work well together to "produce a really really high-performance data-center protocol". But, he stressed, Homa is not suitable for wide-area networks (WANs); it is only for data centers.
Homa overview
To start with, Homa is message-based, rather than byte-stream-based, which means that the "dispatchable units" are visible in the protocol itself. Each message is made up of multiple packets on the wire that get assembled into a full message for application processing. This solves the thread load-balancing problems that TCP has, because multiple threads can safely read from a single socket; it would also allow network interface cards (NICs) to dispatch messages directly to threads, should NICs ever gain that support.
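The difference is easy to see at the socket level. What follows is an illustrative sketch, not the Homa API: with a byte stream the application has to impose its own framing (here a length prefix) and loop until a whole message has arrived, while a socket with message semantics can hand back one complete "dispatchable unit" per call, so any number of worker threads can safely share it. The helper names and the 64KB limit are inventions for the example.

```c
/* Illustrative sketch only -- not the Homa API.  It contrasts the framing
 * an application must do on a TCP byte stream with what a message-based
 * socket can return directly.  MAX_MSG is an arbitrary limit chosen for
 * the example. */
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MAX_MSG (64 * 1024)

/* Byte stream: the receiver must reassemble message boundaries itself,
 * here with a 4-byte length prefix, looping until each piece arrives. */
static ssize_t recv_stream_message(int fd, void *buf)
{
        uint32_t len;
        size_t got = 0;

        while (got < sizeof(len)) {        /* read the length prefix */
                ssize_t n = recv(fd, (char *)&len + got, sizeof(len) - got, 0);
                if (n <= 0)
                        return -1;
                got += n;
        }
        len = ntohl(len);                  /* prefix assumed in network order */
        if (len > MAX_MSG)
                return -1;
        got = 0;
        while (got < len) {                /* then the payload, possibly split */
                ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
                if (n <= 0)
                        return -1;
                got += n;
        }
        return len;
}

/* Message-based (SOCK_SEQPACKET-like semantics): one receive call returns
 * one complete message, so multiple worker threads can call this on the
 * same socket and each gets a whole message to dispatch. */
static ssize_t recv_whole_message(int fd, void *buf)
{
        return recv(fd, buf, MAX_MSG, 0);
}
```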
Homa is connectionless; its fundamental unit is a remote procedure call (RPC), which consists of two messages: a request and a response. "The notion of 'round trip' is explicit in the Homa protocol." RPCs are independent with no ordering guarantees between them; multiple RPCs can be initiated and can finish in any order. There is no long-lived connection state stored by Homa; once an RPC completes, all of its state is removed. There is a small amount (roughly 200 bytes) of state stored for each peer host, for IP routing information and the like, however. There is also no connection setup overhead and a single socket can be used to send any number of RPCs to any number of peers. Homa ensures that an RPC either completes or an error is returned, so application-level timers are not needed.
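A rough sketch of what "no connection state" can mean in practice is shown below; the structure and field names are invented for illustration and are not Homa's actual data structures. The point is that the only state is per-RPC: it is created when a request goes out and freed as soon as the response, or an error, comes back.

```c
/* Conceptual sketch of per-RPC client state -- the field names are
 * invented for illustration and do not correspond to Homa's real
 * structures. */
#include <netinet/in.h>
#include <stdint.h>
#include <stdlib.h>

struct rpc_state {
        uint64_t id;                   /* unique RPC identifier */
        struct sockaddr_in6 peer;      /* where the request was sent */
        void *response;                /* buffer for the reply message */
        size_t response_len;
};

/* One socket can carry RPCs to any number of peers: state is allocated
 * when a request is issued ... */
static struct rpc_state *rpc_begin(uint64_t id, const struct sockaddr_in6 *peer)
{
        struct rpc_state *rpc = calloc(1, sizeof(*rpc));
        if (rpc) {
                rpc->id = id;
                rpc->peer = *peer;
        }
        return rpc;
}

/* ... and freed as soon as the response (or an error) arrives; nothing
 * lives on afterward, unlike a TCP connection. */
static void rpc_complete(struct rpc_state *rpc)
{
        free(rpc->response);
        free(rpc);
}
```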
Congestion control in Homa is receiver-driven, which has major advantages over the sender-driven congestion control used by TCP. Ousterhout described how flow control works in Homa. When a message needs to be sent to a receiver, the sender can transmit a few "unscheduled packets" immediately, but additional "scheduled packets" must each wait for an explicit "grant" from the receiver. The idea is that there are enough unscheduled packets to cover the round-trip time between the hosts; if there is no congestion, the sender will have received some grants by the time it has sent all of the unscheduled packets. That allows sending at the hardware line speed.
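The amount of unscheduled data is essentially the bandwidth-delay product of the path: enough to keep the link busy for one round trip while the first grant makes its way back. A back-of-the-envelope sketch, using assumed numbers (a 100Gbps link and a 5µs round trip) rather than anything mandated by Homa:

```c
/* Back-of-the-envelope: how much unscheduled data covers one round trip?
 * The link speed, RTT, and packet size below are assumptions for the
 * example, not values mandated by Homa. */
#include <stdio.h>

int main(void)
{
        double link_gbps = 100.0;      /* line rate of the host link */
        double rtt_usec  = 5.0;        /* intra-data-center round-trip time */
        double mtu       = 1500.0;     /* bytes per packet, roughly */

        /* bandwidth-delay product: bytes that fit "in flight" during one RTT */
        double bdp_bytes = link_gbps * 1e9 / 8.0 * rtt_usec * 1e-6;
        double pkts      = bdp_bytes / mtu;

        printf("unscheduled window: %.0f bytes (~%.0f packets)\n",
               bdp_bytes, pkts);
        /* ~62500 bytes, or about 42 full-size packets: if grants arrive
         * within one RTT, the sender never has to pause. */
        return 0;
}
```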
A receiver can choose not to send grants if it detects congestion in its top-of-rack (TOR) switch; it can pause or slow the grants until that condition subsides. It can also prioritize messages by sending or denying grants based on the remaining size of the message. Having the message size available in the protocol is helpful because "message sizes allow us to predict the future". Once a single packet of a message is received, the receiver knows how many unscheduled packets are coming, how many scheduled packets remain after that, and it can quickly decide on a strategy to minimize the congestion that could result from all of the different in-progress messages it is receiving. It can react much more quickly than TCP can.
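Because the first packet of a message carries the total message length, the receiver can immediately work out how much more is coming and how much of it will need grants. Below is a tiny sketch of that bookkeeping, with an assumed header layout and an assumed unscheduled limit; neither is Homa's actual wire format.

```c
/* Sketch: what a receiver can deduce from the first packet of a message.
 * The header layout and RTT_BYTES limit are assumptions for the example,
 * not Homa's wire format. */
#include <stdint.h>

#define RTT_BYTES 60000u        /* unscheduled data allowed per message */

struct first_packet_header {
        uint64_t rpc_id;
        uint32_t message_length;   /* total size of the whole message */
};

struct incoming_plan {
        uint32_t unscheduled;      /* will arrive without any grants */
        uint32_t scheduled;        /* must be granted, at the receiver's pace */
};

/* One packet is enough to "predict the future" for this message. */
static struct incoming_plan plan_message(const struct first_packet_header *h)
{
        struct incoming_plan p;

        p.unscheduled = h->message_length < RTT_BYTES ?
                        h->message_length : RTT_BYTES;
        p.scheduled = h->message_length - p.unscheduled;
        return p;
}
```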
Homa also takes advantage of the priority queues that modern switches have; each egress port in those switches typically has 8-16 priority queues. It uses the queues to implement the "shortest remaining processing time" (SRPT) algorithm; when a receiver has multiple messages in progress, it allows the shorter messages to use the higher-priority queues on the switch. In addition, the queues provide a way to achieve a "much better tradeoff between throughput and latency".
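Conceptually, the receiver's job is to map its in-progress messages onto the switch's priority levels so that the message with the least data remaining gets the most favorable treatment. The sketch below illustrates only that idea; the structures and the fixed eight levels are assumptions made for the example, not Homa's implementation.

```c
/* Conceptual SRPT sketch: map in-progress inbound messages onto a fixed
 * number of switch priority levels, shortest-remaining first.  The
 * structures and the 8-level assumption are illustrative, not Homa's
 * implementation. */
#include <stdlib.h>

#define PRIO_LEVELS 8          /* modern TOR switches offer 8-16 queues */

struct inbound_msg {
        unsigned long long bytes_remaining;
        int priority;          /* filled in below: higher = more urgent */
};

static int by_remaining(const void *a, const void *b)
{
        const struct inbound_msg *x = a, *y = b;

        if (x->bytes_remaining < y->bytes_remaining)
                return -1;
        return x->bytes_remaining > y->bytes_remaining;
}

/* Give the highest priorities to the messages with the least left to
 * receive; everything beyond PRIO_LEVELS shares the lowest level. */
static void assign_priorities(struct inbound_msg *msgs, int n)
{
        qsort(msgs, n, sizeof(*msgs), by_remaining);
        for (int i = 0; i < n; i++) {
                int p = PRIO_LEVELS - 1 - i;
                msgs[i].priority = p > 0 ? p : 0;
        }
}
```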
"Buffers are funny; you can't live with them, but you can't live without them either." Some amount of buffering is needed at the switch to ensure that link bandwidth is not wasted if the higher-priority senders stop sending for any reason. Throughput is maintained by having lower-priority packets queue up and be sent when there is a lull from a higher-priority sender; Homa can then send grants to other senders to get them ramped up. This "overcommitment" of granting more messages than can be handled, coupled with the prioritized buffering leads to higher throughput with low latency on the important, shorter messages, he said.
One might think that long messages could suffer from starvation on a loaded network. He said that he tried to produce that behavior in Homa so that he could study it, but found it really hard to do; he can create starvation scenarios, but had to "contort the benchmarks" to do it. Even so, to guard against starvation, Homa dedicates a small portion of the receiver's bandwidth (typically 5-10%) to the oldest message, rather than the shortest message as strict SRPT would dictate. That guarantees that those messages make progress; eventually their remaining size gets small enough to be prioritized with the other small messages.
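One way to picture the receiver's grant policy is as a scheduler that normally picks the message with the least data remaining, but periodically picks the oldest message instead. The following sketch is illustrative only; the data structures, and the choice of every tenth grant as the anti-starvation share, are assumptions made for the example.

```c
/* Sketch of receiver-side grant scheduling with an anti-starvation
 * carve-out.  The data structures and the STARVATION_SHARE value are
 * assumptions for illustration, not Homa's grant mechanism. */
#include <stddef.h>

#define STARVATION_SHARE 10     /* roughly 10% of grants go to the oldest message */

struct msg {
        unsigned long long bytes_remaining;
        unsigned long long start_time;   /* when the first packet arrived */
};

static struct msg *shortest(struct msg *m, int n)
{
        struct msg *best = NULL;

        for (int i = 0; i < n; i++)
                if (m[i].bytes_remaining &&
                    (!best || m[i].bytes_remaining < best->bytes_remaining))
                        best = &m[i];
        return best;
}

static struct msg *oldest(struct msg *m, int n)
{
        struct msg *best = NULL;

        for (int i = 0; i < n; i++)
                if (m[i].bytes_remaining &&
                    (!best || m[i].start_time < best->start_time))
                        best = &m[i];
        return best;
}

/* Pick the message to receive the next grant: mostly SRPT, but every
 * tenth grant goes to the oldest message so that large transfers keep
 * making progress until they become "short" themselves. */
static struct msg *next_grant(struct msg *msgs, int n, unsigned long grant_no)
{
        if (grant_no % STARVATION_SHARE == 0)
                return oldest(msgs, n);
        return shortest(msgs, n);
}
```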
Homa does not rely on in-order packet delivery; packets can arrive in any order and the receivers will sort them as needed. In practice, the packets arrive nearly in order anyway, he said, so it is not computationally expensive to do the reordering. He believes that Homa eliminates core-congestion problems in data centers, unless there is simply too much traffic overall, because packets can take different paths through the core fabric. That leads to better load balancing in the fabric as well as across CPU cores on the receiving hosts.
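Tolerating out-of-order arrival is straightforward when each data packet identifies its message and its offset within that message: the receiver just copies each packet into place and counts bytes. The sketch below illustrates the idea; the structures are invented for the example and are not Homa's packet format, and a real implementation would also have to discard retransmitted duplicates.

```c
/* Sketch of offset-based reassembly: because every data packet names its
 * message and its byte offset, packets can be placed as they arrive, in
 * any order.  The structures are illustrative, not Homa's packet format,
 * and duplicate packets are not handled here. */
#include <stdint.h>
#include <string.h>

struct data_packet {
        uint64_t rpc_id;        /* which message this belongs to */
        uint32_t offset;        /* byte offset within that message */
        uint32_t length;
        const char *payload;
};

struct assembly {
        char *buf;              /* buffer sized for the whole message */
        uint32_t total_len;     /* known from the first packet received */
        uint32_t bytes_received;
};

/* Copy the packet into place; completion is a simple byte count, with no
 * requirement that packets arrived in order. */
static int add_packet(struct assembly *a, const struct data_packet *p)
{
        if ((uint64_t)p->offset + p->length > a->total_len)
                return -1;                         /* corrupt or stray packet */
        memcpy(a->buf + p->offset, p->payload, p->length);
        a->bytes_received += p->length;
        return a->bytes_received >= a->total_len;  /* 1 when message complete */
}
```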
Replacing TCP?
It is hard to imagine a standard more entrenched than the TCP protocol is, Ousterhout said, so "I undertake this with full realization that I may be out of my mind". Based on the results he has seen from Homa, he has set a personal mission to figure out a way for Homa to take over a substantial portion of TCP's traffic in the data center. Either that, or learn why it is not possible; "I'm going to keep going until I hit a roadblock that I simply can't solve".
The first step to doing that is to have a production-quality implementation of Homa "that people can actually use". He is "maybe a little bit of a freak among academics", he said, because he loves to write code. A few years ago, he set out to write a Linux kernel driver for Homa; he "knew nothing about the Linux kernel", but now has a working driver for Homa.
The Homa module runs on Linux 5.17 and 5.18 and is not meant as a "research prototype", which is a term that he does not like. A research prototype is something that does not work, "but you can somehow write a paper about it and make claims about it, even though it doesn't actually really work". The Homa module is nearing production quality at this point, he said; the only way to find and fix the remaining bugs is to start running it in production.
In terms of performance, Homa "completely dominates" TCP and Data Center TCP (DCTCP) for all workloads and all message sizes, he said. He gave a few sample benchmarks that showed 3-7x improvements in short-message latency at the 50th percentile, and 19-72x improvements in the 99th percentile (the "tail latency"). More information about the kernel driver for Homa can be found in his paper from the 2021 USENIX annual technical conference.
Applications
The biggest problem he sees with his mission is the huge number of applications that directly use TCP via the socket interface. Homa's message-based API is different and he has been unable to find a way to "slide it in under the existing TCP sockets interface". Nor is it realistic to plan on converting even a substantial fraction of those applications. Many, perhaps most, of the applications that directly use TCP sockets also need to work over the WAN, where Homa is not a good choice; others really do not need the performance boost that Homa brings. The applications that will really benefit from Homa are the newer data-center applications that mostly use one of a handful of RPC frameworks.
Adding Homa support to RPC frameworks would allow applications that use them to switch to Homa with a small configuration change, instead of a major code change; much like applications can change the server name they use, they would be able to choose Homa instead of TCP. He has work in progress on adding Homa support to gRPC. The C++ gRPC integration is working now, though without support for encryption, while the Java gRPC support for Homa is "embryonic" but currently underway.
Working with gRPC has made him "very sad", however, because it is "unimaginably slow". The round-trip latency for gRPC on TCP is 90µs, with two-thirds of that time spent in the gRPC layers on the client and server (30µs each). If the goal is to get to 5µs round trips, it is pretty clear it cannot be done using gRPC, he said. With Homa, gRPC is roughly twice as fast, even in the gRPC layers on the endpoints, which he does not really have an explanation for. But even that is far from the goal, so he believes a "super lightweight" framework needs to be developed eventually.
Even with Homa, though, performance is still an order of magnitude off from what the hardware is capable of, he said. "Network speeds are increasing at this amazing rate, CPU speeds aren't changing much." Software simply cannot keep up and he sees no solution to that problem. So software implementations of transport protocols no longer make sense; those protocols need to move into the NICs.
Moving to user space
Some people will say that you simply need to get "those terrible operating systems" out of the way and move everything to user space; that will help some, he said, but it is not enough. In the RAMCloud storage project, Stanford researchers implemented Homa in user space and were able to achieve 99th percentile round-trip latencies of 14µs, while the tail latency for Homa on Linux is 100µs.
The primary culprit for high tail latencies is software overhead and, in particular, load-balancing overhead. A single core cannot handle more than 10Gbps, but as soon as the job is split over multiple cores, there is an inherent performance penalty just for doing the split. Beyond that, there is the problem of hot spots, where too much work ends up on certain cores while others are idle or nearly so; those can cause spikes of high latency.
He has not measured it, but his theory is that the inherent slowdown from splitting up the job to multiple cores is due to cache interference. Lots of cache-coherency traffic coming from the split reduces the performance when compared to doing the processing on a single core. It is not just his Homa implementation that sees a problem of this magnitude; he looked at Google's Snap/Pony system, which sees a 4-6x efficiency loss when moving from single to multiple cores. He said that the rule of thumb is that if you need to go beyond a single core, a second or third core is not enough; "you actually have to go to four cores before you actually get any performance improvement over one core".
If you look at some of the numbers from various papers, Ousterhout said, you might conclude that moving the protocol processing to user space is a good idea. But, all of those user-space implementations are "basically just research prototypes and they are way oversimplified"; they do not do a lot of the processing that any production system would require. They are also measured under the most ideal conditions, with either no load balancing or perfect by-hand partitioning, handling only short messages ("that's easy"), and so on.
But Snap is a production-quality user-space implementation that can be compared; it is roughly 2x better than Homa in the Linux kernel. If you look at the number of cores needed to drive 80Gbps (an 80% loaded 100Gbps network) bidirectionally, Snap requires 9-14 cores, while Linux Homa requires 17. So user space is better, but not on the scale needed to hit the goals.
To the NIC
The only alternative that he sees is to move the protocol processing of the transport layer into the NIC. Applications would access the NIC directly, bypassing the kernel, via a message-based interface; "the notion of a packet never gets to software". Other features like load balancing by choosing an idle application thread, virtualization, and encryption should be added to the NICs as well.
The architecture for these NICs is not going to be easy to figure out, he said; they need to process packets at line rate, while being programmable to support multiple protocols. It is also important that the programs for those protocols can still be open source, Ousterhout said. None of the existing "smart NIC" architectures is adequate, he thinks, so it makes for an interesting problem for computer architects to solve.
Homa is "not without its controversies"; there have been several papers that have claimed to find problems with Homa. There are also some alternatives being proposed in other papers, but "all of these papers have pretty significant flaws in them, unfortunately". They have used Homa in unrealistic configurations or hobbled it in some fashion in his opinion; while he does not "want to get into a food fight" about them, he has gathered some of his responses on the Homa wiki.
But there is a bigger "meta question" that needs to be answered: "do applications actually care about harnessing the full network performance?" Today, we are "stuck in a no-chicken-no-egg cycle": there are no applications that require the full hardware capabilities, because there would be no way to run them. So today's applications make do with the performance provided by today's network stacks and there is no real incentive to make the infrastructure faster, because there is no one who is "screaming for it".
So, he wondered, if we make the networking "as blindingly fast as I think we can do", will new applications appear that people have not even thought about today because the capability is not available? He would be interested in hearing about any applications that would be dramatically improved by these performance improvements. As an academic, he does not actually need to have a market; he can build something and hope that the applications come along later, but that "would be a terrible way to do a startup".
He concluded by reiterating the main points of what he thinks needs to be done if applications are going to take advantage of the amazing advances in networking speeds. A new transport protocol is needed; he is obviously a Homa advocate but would be happy to discuss other possibilities for that. Beyond that, a new lightweight RPC framework will be important, and, ultimately, whatever transport protocol is used will need to move down into the NIC. Ousterhout's keynote laid out a vision of a fairly radical reworking of the state of data-center networking today; it will be interesting to see how it plays out over the coming years.
Index entries for this article
Kernel: Networking/Protocols
Conference: Netdev/2022
Comments

Posted Nov 9, 2022 21:36 UTC (Wed) by fraetor (subscriber, #161147):

I could potentially see this being useful for some kind of Plan9-esque system to get truly massive vertical scaling.

Posted Nov 9, 2022 21:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523):

That's because gRPC is a huge framework with its own HTTP/2 implementation, support for streaming, asynchronous programming, etc. You can actually swap protobuf there for other serialization frameworks.

You don't need any of it if you want to make a simple request/reply framework. We used Twitch's Twirp as the base: https://twitchtv.github.io/twirp/docs/spec_v7.html and did tweaks on top of it.

Posted Nov 11, 2022 7:46 UTC (Fri) by jcolledge (subscriber, #153588):

People always want better performance at file and block levels. These protocols don't require complete ordering, so using TCP is a waste of resources.

Ceph is an option, as mentioned. NBD (Network Block Device) could be an easy place to start, but I'm not sure how much it is used. NVMe-oF (NVMe over Fabrics)? DRBD (Distributed Replicated Block Device)? NFS?

Posted Nov 10, 2022 13:14 UTC (Thu) by jtaylor (subscriber, #91739):

The paper has a small section on it, direct link: https://arxiv.org/abs/2210.00714

Posted Nov 10, 2022 14:02 UTC (Thu) by joib (subscriber, #8541):

From https://web.stanford.edu/~ouster/cgi-bin/papers/replaceTc... (which I guess is the paper this presentation is based on), section 6 deals with Infiniband.

> However, RDMA shares most of TCP's problems. It is based on streams and connections (RDMA also offers unreliable datagrams, but these have problems similar to those for UDP). It requires in-order packet delivery. Its congestion control mechanism, based on priority flow control (PFC), is different from TCP's, but it is also problematic. And, it does not implement an SRPT priority mechanism.

Posted Nov 10, 2022 4:55 UTC (Thu) by da4089 (subscriber, #1195):

I also think it'd be useful to have Homa implemented using some kernel-bypass stacks: the Mellanox/NVIDIA one, or the Solarflare/Xilinx/AMD one -- having a kernel module is fine and good, but ... if we really care about latency, why suffer the system call overhead?

Both of those tasks should be relatively simple given that the kernel module code exists, although perhaps it's a better investment to jump straight to the FPGA implementation of Homa and avoid the additional software layers altogether.

Why not over the Internet?

Posted Nov 10, 2022 11:55 UTC (Thu) by james (subscriber, #1325):

As I read it, a very low round-trip time is a key part of the assumptions:

> When a message needs to be sent to a receiver, the sender can transmit a few "unscheduled packets" immediately, but additional "scheduled packets" must each wait for an explicit "grant" from the receiver

which requires one round trip for each RPC larger than "a few" packets, then at least another one for the rest of the data and the response. If you have a suitable protocol running over TCP, the connection will already be set up and you can send all the data at once. Also, I suspect that

> A receiver can choose not to send grants if it detects congestion in its top-of-rack (TOR) switch; it can pause or slow the grants until that condition subsides

embeds the assumption that this is where congestion will occur.

Posted Nov 10, 2022 14:25 UTC (Thu) by joib (subscriber, #8541):

Isn't this similar to the problem of window size scaling in TCP? I.e. in a network with a big bandwidth delay product, you need a bigger window in TCP, or more 'unscheduled packets' and more in-flight 'grants' in Homa?

> the assumption that this is where congestion will occur.

That might actually be an issue, yes. If the idea is to not use packet drops as a congestion signal, how is congestion somewhere in the middle of the path detected? Guess you would need some kind of BBR-style timing based congestion control?

Posted Nov 10, 2022 17:03 UTC (Thu) by james (subscriber, #1325):

It's somewhat similar, but the difference is that TCP as it is used today has long-lived connections to which you can attach things like window size. From the article:

> There is no long-lived connection state stored by Homa; once an RPC completes, all of its state is removed

including anything like window size. The next RPC (larger than a "few packets") has to discover that all over again. Or so I understand.

Posted Nov 10, 2022 19:22 UTC (Thu) by NYKevin (subscriber, #129325):

According to https://homa-transport.atlassian.net/wiki/spaces/HOMA/pages/262171/Concerns+Raised+About+Homa, this should not be a problem if different packets can be routed differently from each other.

In theory, you could still run out of bandwidth for the entire network overall, but that probably requires a much higher volume than is typical of a TCP-driven network, and should likely be dealt with using some combination of explicit provisioning and QoS (both of which are far more suitable to a closed DC-type network than to an open WAN).

For those wondering, the presenter is indeed the John Ousterhout, the creator of the Tcl programming language.

Posted Nov 10, 2022 14:28 UTC (Thu) by joib (subscriber, #8541):

(I guess a bit similar to how Bittorrent games TCP by launching a gazillion streams?)

Posted Nov 10, 2022 19:34 UTC (Thu) by NYKevin (subscriber, #129325):

Or maybe Homa has some sort of mechanism to deal with this, I don't know, but you can't just run random adversarial code in a DC and expect to get away with it.

Posted Nov 10, 2022 14:32 UTC (Thu) by joib (subscriber, #8541):

SRD congestion control is different though; it seems inspired by BBR, based on measuring RTT.

Posted Nov 16, 2022 1:23 UTC (Wed) by RobertBrockway (guest, #48927):

There's a push to put QUIC in the Linux kernel to improve performance, and this is a general trend. I understand this aims to minimise context switches.

One of the main arguments against microkernels is that putting core OS functionality in user space kills performance.

Posted Nov 17, 2022 16:10 UTC (Thu) by mrugiero (guest, #153040):

It's a bit more complicated than that. The kernel has some extra work due to not knowing the exact needs of a given userspace service. It involves extra copies as well: you need to keep a separate buffer where you get the actual packets from the NIC, then a buffer per socket, then, if you don't use a memory map, yet another buffer in user space to read into. All this copying is irrelevant in latency-insensitive contexts, but the copies quickly pile up.

QUIC in the kernel improves performance because you can do things like packet filtering and decryption without extra copies, and without having to map the driver to a single owner, an approach that remains viable for a multitasking system, such as anything that runs a browser.

Note this is not the same as what microkernels do, where processes still copy data, just between userspace servers and applications, so instead of saving copies you just add extra IPC and scheduling overhead.

When a single application owns the whole traffic you can perform either a single copy (memory allocated for DMA to user space) or zero copies (if your driver allows direct access to the DMA area). For the common case this simply can't be done, but when you fire up a VM for a single application in a data center it is an entirely viable approach.