
OpenSSL 3.2.0 released

OpenSSL 3.2.0 released

Posted Nov 28, 2023 11:06 UTC (Tue) by DemiMarie (subscriber, #164188)
In reply to: OpenSSL 3.2.0 released by paulj
Parent article: OpenSSL 3.2.0 released

Which implementations are you referring to, and why is their performance so bad?



OpenSSL 3.2.0 released

Posted Nov 28, 2023 11:27 UTC (Tue) by paulj (subscriber, #341) [Link] (15 responses)

Public Google QUIC ("quiche" - not to be confused with Cloudflare's implementation of the same name). It's used by Chrome, and is fine there. But it has some bad performance issues in ACK processing, which affect server-side use.

OpenSSL 3.2.0 released

Posted Nov 28, 2023 11:47 UTC (Tue) by paulj (subscriber, #341) [Link] (14 responses)

Oh, and on "allowing too much": ACKs in QUIC can carry an unlimited number of ACK ranges (bounded only by what a QUIC frame can hold). This implies using general-purpose range-tracking data structures, which come with much higher overheads than a data structure that tracks a known, fixed set of ranges. Additionally, QUIC numbers packets and ACKs packet numbers. That is a double-edged sword: it fixes an ambiguity TCP has (there is no question about which sent packet is being ACKed), but it adds costs - you now have an extra level of indirection between what your ACKs are ACKing and your data stream, and you must manage that indirection efficiently.
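To make the data-structure point concrete, here is a minimal sketch (all names invented, not taken from any real QUIC stack) of the kind of general-purpose sorted-interval set a QUIC receiver needs to track arbitrarily many ACK ranges - the thing a TCP stack avoids, since SACK is capped at a handful of blocks that fit a small fixed array:

```python
# Hypothetical sketch: tracking received packet numbers as ACK ranges.
# Because a QUIC ACK frame may carry an unbounded number of ranges, the
# receiver needs a dynamic, sorted, mergeable interval structure rather
# than a fixed-size array.

import bisect

class AckRanges:
    """Sorted, disjoint list of [start, end] inclusive packet-number ranges."""

    def __init__(self):
        self.ranges = []  # sorted by start

    def add(self, pn):
        """Record receipt of packet number pn, merging adjacent ranges."""
        i = bisect.bisect_right(self.ranges, [pn, pn])
        if i > 0 and self.ranges[i - 1][1] >= pn - 1:
            # Extends (or falls inside) the previous range.
            i -= 1
            self.ranges[i][1] = max(self.ranges[i][1], pn)
        else:
            self.ranges.insert(i, [pn, pn])
        # Merge with following ranges made adjacent by this insertion.
        while (i + 1 < len(self.ranges)
               and self.ranges[i + 1][0] <= self.ranges[i][1] + 1):
            self.ranges[i][1] = max(self.ranges[i][1], self.ranges[i + 1][1])
            del self.ranges[i + 1]
```

Every insert here can shift or merge list entries; a real implementation has to be careful that reordering-heavy traffic doesn't turn this into the hot path.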

This is a major reason why QUIC server side code costs a lot more than TCP on ACK processing. And it's very easy to make a small change and blow your performance out by 100x.

I haven't seen clear evidence that QUIC's unbounded ACK ranges have improved its congestion control enough to make up for the data-structure costs. Most people who measure find HTTP over QUIC has lower throughput than HTTP over TCP. Usually they look to low-level UDP I/O optimisations to regain performance (UDP GRO, sendmmsg, etc.), but I think the protocol itself has some complications that require extra care to get performance - care which not all implementations take.

LsQuic is pretty fast though - much faster than Google Quiche - server side. Though, still not as efficient as kernel TCP.

OpenSSL 3.2.0 released

Posted Nov 29, 2023 12:35 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (12 responses)

Would a kernel-mode QUIC implementation solve the performance problems? I would be opposed to one written in C due to security concerns, but one written in Rust would be quite interesting, especially with sendfile and io_uring integration.

OpenSSL 3.2.0 released

Posted Nov 29, 2023 14:07 UTC (Wed) by paulj (subscriber, #341) [Link]

The performance issues come from protocol features that require data structures with more overhead to track than TCP's - those cannot be fixed by a kernel implementation. I.e., the unbounded ACK ranges, and the added indirection between the packet numbers in ACKs and the data byte ranges.
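That second indirection can be sketched as follows (a toy illustration, with invented names): the sender must remember, per packet number, which stream byte ranges that packet carried, so that an ACKed packet-number range can be resolved back to ACKed stream data - TCP ACKs byte sequence numbers directly and needs no such map.

```python
# Hypothetical sketch of the packet-number -> stream-byte-range
# indirection a QUIC sender must maintain.

class SentPacketMap:
    def __init__(self):
        # packet number -> list of (stream_id, offset, length) it carried
        self.in_flight = {}

    def on_packet_sent(self, pn, chunks):
        self.in_flight[pn] = chunks

    def on_ack_range(self, lo, hi):
        """Resolve an ACKed packet-number range [lo, hi] back to the
        stream byte ranges that are now confirmed delivered."""
        # NOTE: a per-packet-number loop is the naive version; with large
        # ACK ranges a real stack needs something smarter than this scan.
        acked = []
        for pn in range(lo, hi + 1):
            acked.extend(self.in_flight.pop(pn, []))
        return acked
```

Both this map and the per-stream state it points into have to stay cheap under large, fragmented ACK ranges - which is exactly where careless implementations blow up.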

You just need an implementation that is /super/ careful about both the /design/ and implementation of the state tracking code to handle those things.

If you have that, how much difference could a kernel implementation make? I don't know. Personally, I think with io_uring and segment overload there isn't much difference to be had on I/O costs.

One place a kernel implementation can win is that (at least some) handling of ACKs can be done in soft-IRQs, from any user context - no need to switch to the specific user context, or switch from kernel to user mode. So that would probably get some performance benefit - not just in CPU, but, e.g., also in terms of reduced jitter in sending ACKs for received data, which would improve the congestion control behaviour of the protocol and potentially squeeze out a little bit more performance from the network.

But hard to say. ;)

OpenSSL 3.2.0 released

Posted Nov 29, 2023 15:09 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (10 responses)

The cost of running in userland is the overhead of syscalls per packet. And sendmmsg() or recvmmsg() will not change much, since every buffer must have its address checked, which also comes with a cost. An alternate approach would be to make it run entirely in userland with a userland driver such as DPDK or using AF_PACKET or AF_XDP etc. In this case you retrieve batches of packets and don't need to recheck their individual addresses.

Regarding the choice of language, that's interesting. Low-level code like this requires cross-references between many elements (packets, ACK ranges, RX buffers, etc.), so writing it in an overly strict language would either require a lot of unsafe sections (hence no benefit) or force you to invent a complex and expensive model to track everything. Given that the first concern with packet-based protocols like this is processing cost, a more complex and more expensive implementation could very well become its own security issue by being easier to DoS. Such a design must not be neglected at all, and there is absolutely no room for compromise between performance and safety here: you need both, even if the second is only guaranteed by the developer.

OpenSSL 3.2.0 released

Posted Nov 29, 2023 15:45 UTC (Wed) by paulj (subscriber, #341) [Link] (6 responses)

There is also GSO segment offload (what I brainfartingly referred to as "overload" in sibling comment) - you can send many packets worth of data with /1/ sendmsg call and the kernel sends that as a series of packets to the same destination. You can combine with sendmmsg to send trains of packets to multiple destinations. Worth a good bit according to Cloudflare. You can also control the pacing (to an extent) with SO_TXTIME / SCM_TXTIME - you can specify the launch time for each message (before GSO) - may be important for congestion control.
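A sketch of the cmsg plumbing involved (Linux-specific; the numeric constants are taken from the uapi headers as I understand them and may differ on some architectures, and the socket setup and actual `sendmsg()` call are elided):

```python
# Illustrative sketch: the ancillary data that lets one sendmsg() call
# emit a whole GSO packet train with a scheduled launch time.

import struct

SOL_UDP = 17
UDP_SEGMENT = 103   # assumed value, from <linux/udp.h>
SOL_SOCKET = 1
SCM_TXTIME = 61     # == SO_TXTIME; assumed value, from <asm-generic/socket.h>

def gso_cmsgs(segment_size, txtime_ns):
    """Build ancillary data for socket.sendmsg(): have the kernel cut the
    buffer into segment_size-byte datagrams, launched at txtime_ns.
    (SCM_TXTIME additionally requires SO_TXTIME enabled on the socket.)"""
    return [
        (SOL_UDP, UDP_SEGMENT, struct.pack("@H", segment_size)),
        (SOL_SOCKET, SCM_TXTIME, struct.pack("@Q", txtime_ns)),
    ]

# Usage (sketch): sock.sendmsg([train], gso_cmsgs(1200, launch_ns))
# where `train` is up to ~64 KiB of data that the kernel segments into
# 1200-byte packets to the connected destination.
```

The per-call cmsg form is what makes mixing packet sizes across different trains practical, as discussed further down the thread.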

DPDK is not an option at all for many use-cases (mobile, shared servers, containers, VMs, etc..) - also an energy muncher in the typical busy-loop, poll-driven use. I think it is meant to support interrupts now though, not kept up with how well that works. (??)

I agree on the language thing.

OpenSSL 3.2.0 released

Posted Nov 29, 2023 16:54 UTC (Wed) by wtarreau (subscriber, #51152) [Link] (5 responses)

For GSO, we intend to study it. I'm not much convinced for now; I suspect it could add more complexity on the sender side to send perfectly aligned packets so that the stack cuts them on the correct boundaries. But that's still on the todo list.

DPDK is not interesting for regular servers, but network equipment vendors (DDoS protection, load balancers etc) need to cram the highest possible performance in a single device and they already use that extensively.

DPDK uses

Posted Nov 29, 2023 18:49 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (1 responses)

Does HAProxy Technologies use DPDK in its commercial products?

DPDK uses

Posted Dec 1, 2023 6:06 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

> Does HAProxy Technologies use DPDK in its commercial products?

No. We gave it a try 10 years ago for anti-DDoS work and found it much more efficient to implement it early in the regular driver (hence the NDIV framework I created back then; presentation here: https://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/ ). Recently we ported it to XDP, losing a few optimizations, but apparently recent updates should allow us to recover them. And that way we no longer have to maintain our patches to those drivers.

The reason solutions like netmap and DPDK are not interesting in our case is that we still want to use the NIC as a regular one. With these frameworks, you lose the NIC from the system, so it's up to the application to forward packets in and out using a much slower API (we tried). DPDK is very interesting when you process 100% of the NIC's traffic inside the DPDK application, and for TCP you'd need to use one of the available userland TCP stacks. But I still prefer to use the kernel's stack for TCP, as it's fast, reliable and proven. It's already possible for us to forward 40 GbE of L7 TLS traffic on an outdated 8th-gen 4-core desktop CPU, and 100 GbE on an outdated 8-core one. DPDK would allow us to use even smaller CPUs, but there's no point: those who need such levels of traffic are not seeking to save $50 on the CPU by reusing an old machine that will cost much more on the electricity bill! So when you use the right device for the job, these frameworks bring no benefit to L7 proxying.

OpenSSL 3.2.0 released

Posted Dec 1, 2023 11:53 UTC (Fri) by paulj (subscriber, #341) [Link] (2 responses)

On UDP GSO, you control the packet size - the kernel does not arbitrarily chop your buffer into whatever packets (that wouldn't work, for the reason you give). You specify the packet size either via a socket option on the socket, or via a per-call cmsg when you send your msg. See "Optimizing UDP for content delivery: GSO, pacing and zerocopy" for an example.

The packets then all have to be same size, but that's the common case when sending trains of max-size packets.

So basically with GSO + sendmmsg you can send:

Time t_1:
- burst x_1 to dest x
- burst y_1 to dest y
<etc>
Time t_2:
- burst x_2 to x
- <etc>
....
Time t_n:
<etc>

You can send a CWND worth of packet trains to many destinations, with the packets to each destination correctly spaced out into smaller bursts to be network congestion-control friendly. All in 1 syscall.
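The schedule above can be sketched as plain data (a toy illustration with invented names): each entry of the resulting batch is one GSO "super-message" - a destination, a launch time, and a buffer the kernel will segment - and the whole batch would go out in a single sendmmsg-style call.

```python
# Hypothetical sketch: flattening per-destination bursts into one
# sendmmsg()-style batch, with burst i of every destination launched
# at t0 + i * spacing_ns to keep bursts congestion-control friendly.

def build_batch(bursts, t0, spacing_ns):
    """bursts: {dest: [train_1, train_2, ...]} where each train is the
    concatenated payload of one GSO burst.  Returns a flat list of
    (dest, launch_time_ns, train) sorted by launch time."""
    batch = []
    for dest, trains in bursts.items():
        for i, train in enumerate(trains):
            batch.append((dest, t0 + i * spacing_ns, train))
    batch.sort(key=lambda msg: msg[1])  # kernel pacing wants time order
    return batch
```

Each tuple would become one msghdr carrying an SCM_TXTIME cmsg for the launch time and a UDP_SEGMENT cmsg for the packet size within the train.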

OpenSSL 3.2.0 released

Posted Dec 1, 2023 11:56 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

Oh, and to be clear, only packets in the same burst (i.e., same "super-"message) must have the same size - you can set the GSO packet size in the cmsg for the msg that is to be split via GSO.

OpenSSL 3.2.0 released

Posted Dec 1, 2023 13:09 UTC (Fri) by wtarreau (subscriber, #51152) [Link]

Thanks for the summary and the pointer, I've looked at the doc from Alex and Eric and it's pretty clear on how to proceed. This will definitely encourage us to start to experiment with it soon ;-)

OpenSSL 3.2.0 released

Posted Nov 29, 2023 18:47 UTC (Wed) by DemiMarie (subscriber, #164188) [Link] (2 responses)

Cloudflare’s QUIC implementation is written in Rust and powers their edge network, so I’m not concerned about Rust being too slow.

OpenSSL 3.2.0 released

Posted Nov 30, 2023 0:25 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

quiche seems to conveniently omit implementing the layers that require the complex reference graphs: "The application is responsible for providing I/O (e.g. sockets handling) as well as an event loop with support for timers." IOW, quiche exposes an interface for processing and producing packets for individual connections. The actual process I/O (blocking, non-blocking, aggregating with sendmmsg, etc.) as well as global connection bookkeeping (e.g. indexing connection state) is left up to the application.

This actually seems like a solid example of how to best make use of Rust's strengths, admitting some of its deficits as a standalone language or for writing soup-to-nuts frameworks.

OpenSSL 3.2.0 released

Posted Dec 1, 2023 11:35 UTC (Fri) by paulj (subscriber, #341) [Link]

This is a very common pattern for user-space network protocol libraries (ones that want to be widely used anyway).

You want to avoid them doing the actual I/O, and you want to avoid coding them against any specific event library. So they generally end up having two sets of interfaces: a) the direct API the user calls into the library with, to supply inbound packets, trigger timing events, etc.; and b) the indirect API by which the library calls out to the user and hands its work back, e.g. to send packets or to set up a timer event - i.e., a set of callbacks the user supplies at setup time, using the direct API.

Google Quiche (yay, multiple projects in the QUIC space have the same name!) and LsQuic have the same pattern.
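A minimal sketch of this two-interface pattern (all names invented; the "protocol" is a toy, not QUIC): the application drives the library through the direct API, and the library reaches back out only through the callbacks the application registered - it never touches sockets or the event loop itself.

```python
# Hypothetical sketch of the sans-I/O library pattern described above.

class QuicLikeLibrary:
    def __init__(self, send_packet, set_timer):
        # Indirect API: callbacks supplied by the application at setup.
        self._send_packet = send_packet   # called to emit an outbound packet
        self._set_timer = set_timer       # called to (re)arm a timer

    # Direct API: the application feeds inbound packets in...
    def handle_incoming(self, packet):
        self._send_packet(b"ACK:" + packet)  # toy response
        self._set_timer(0.2)                 # e.g. re-arm a loss timer

    # ...and delivers timer expirations when its event loop fires them.
    def handle_timeout(self):
        self._send_packet(b"PING")
```

The application then wires `send_packet` to its own sendmsg/sendmmsg path and `set_timer` to whatever event loop it uses, which is why the same library core can sit behind very different I/O strategies.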

OpenSSL 3.2.0 released

Posted Nov 29, 2023 16:09 UTC (Wed) by paulj (subscriber, #341) [Link]

There is maybe a simple tweak that can cut gQuiche's ACK processing costs by perhaps 20% to 50%... Just validating that.

However, that stack was over an order of magnitude higher in CPU costs compared to the likes of LsQuic, and scales worse. So even if I rerun those comparisons with that tweak, gQuiche will still not be very good (ACK intensive side - i.e. server side).


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds