LWN: Comments on "Avoiding unintended connection failures with SO_REUSEPORT" https://lwn.net/Articles/853637/ This is a special feed containing comments posted to the individual LWN article titled "Avoiding unintended connection failures with SO_REUSEPORT". en-us Sat, 01 Nov 2025 09:30:49 +0000 Sat, 01 Nov 2025 09:30:49 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/855858/ https://lwn.net/Articles/855858/ bernat <div class="FormattedComment"> The idea of listen(0) was to then allow you to drain the remaining connections. I remember you proposing this simple solution, but your patch was rejected because &quot;this should be done with BPF.&quot;<br> </div> Sun, 09 May 2021 06:49:47 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854968/ https://lwn.net/Articles/854968/ smurf <div class="FormattedComment"> That depends on whether you need kernel support for it. I could get by with an IPC socket to some other server to which I can send the file descriptors returned from accept()ing these connections.<br> <p> Or maybe we want a &quot;sock_inject(listener,conn)&quot; syscall that adds an open socket into a listener&#x27;s queue. To do this to another process, just open /proc/‹serverpid›/fd/‹bound_socket›. In fact something like this is also required for migrating a server to a different host, so it wouldn&#x27;t be a single-use syscall.<br> </div> Fri, 30 Apr 2021 05:53:05 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854922/ https://lwn.net/Articles/854922/ wtarreau <div class="FormattedComment"> Ah OK, but the internal problem remains the same: the difficulty of rehashing the queues without losing entries. listen(0), setsockopt(), shutdown(SHUT_RD) etc. would all have been valid candidates for me, if only I&#x27;d had a reliable way to move these queues around :-/<br> </div> Thu, 29 Apr 2021 17:48:01 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854694/ https://lwn.net/Articles/854694/ smurf <div class="FormattedComment"> You misunderstand (or I miswrote): my intent was to fix the kernel so that &quot;listen(fd,0)&quot; simply closes the queue for new arrivals. Then the process would accept() the remaining open connections (and somehow deal with them), and shut down when that blocks. No new syscall, new common queueing mechanism, or other intrusive shenanigans required.<br> </div> Wed, 28 Apr 2021 05:32:28 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854552/ https://lwn.net/Articles/854552/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; LWN&#x27;s server, for example, sweats hard when keeping up with the comment stream that accompanies any article mentioning the Rust programming language. 
But some organizations run truly busy servers and have to take some extraordinary measures to keep up with levels of traffic that even language advocates cannot create.</font><br> <p> It looks like our editor&#x27;s great mood made him temporarily stray from LWN&#x27;s legendary rigor and thoroughness and drop a piece of critical information highly relevant to this article: in such a difficult server situation, are the Rust, C or C++ advocates causing the most traffic?<br> <p> </div> Mon, 26 Apr 2021 22:39:40 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854551/ https://lwn.net/Articles/854551/ NYKevin <div class="FormattedComment"> That is fairly likely; I&#x27;m basing my assumptions on the &quot;lame duck state&quot; documented here: <a href="https://sre.google/sre-book/load-balancing-datacenter/">https://sre.google/sre-book/load-balancing-datacenter/</a><br> <p> But that probably wouldn&#x27;t work very well for frontends.<br> </div> Mon, 26 Apr 2021 22:04:00 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854540/ https://lwn.net/Articles/854540/ sodabrew <div class="FormattedComment"> You may be mistaken on Google&#x27;s lack of use for SO_REUSEPORT, given that SO_REUSEPORT was developed by Tom Herbert at Google and merged for the 3.9 kernel release: <a href="https://lwn.net/Articles/542629/">https://lwn.net/Articles/542629/</a> <br> <p> It&#x27;s certainly possible that the original needs have changed and been solved in other ways in the eight years that have passed, or that you&#x27;re working on a different product that doesn&#x27;t have the same requirements as the one for which this feature was developed, but either way, your statement that (paraphrased) &quot;Google doesn&#x27;t use SO_REUSEPORT&quot; ought to have a modifier of either &quot;...anymore because...&quot; or &quot;...in my group that&#x27;s doing something different.&quot;<br> </div> Mon, 26 Apr 2021 18:37:34 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854532/ https://lwn.net/Articles/854532/ wtarreau <div class="FormattedComment"> It&#x27;s essential to deal with the high SYN rates that happen during SYN floods (i.e. all the time on high traffic sites). You want that part to be ultra-scalable. The difference can be 1 vs 10 Mpps.<br> </div> Mon, 26 Apr 2021 16:13:46 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854531/ https://lwn.net/Articles/854531/ wtarreau <div class="FormattedComment"> It doesn&#x27;t work well; see my response above.<br> </div> Mon, 26 Apr 2021 16:11:46 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854530/ https://lwn.net/Articles/854530/ wtarreau <div class="FormattedComment"> No, that was among my first attempts. It simply results in the listen queue being zero for that socket and connection requests being dropped. Still, I kept that as a workaround for the RSTs for a few days because it managed to cause fewer RSTs by rejecting SYNs earlier when detecting that the queue was full. But that was not possible after the lockless SYN patches anyway, so there was no hope in this direction. 
Plus, this resulted in huge CPU usage for the user application, which had to call accept() in loops and was not able to group the accepts any more.<br> </div> Mon, 26 Apr 2021 16:10:20 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854427/ https://lwn.net/Articles/854427/ mm7323 <div class="FormattedComment"> Why not just let a process do something like call listen(fd, 0) to indicate that it doesn&#x27;t want any more connections queued to the socket? Then it can wait a few seconds for any handshaking to complete and use non-blocking accept() to empty its queue before exiting or reloading config gracefully.<br> <p> This doesn&#x27;t handle the crashing process scenario, but that&#x27;s a lost cause anyway - a crash after accept() was called could still result in a dropped connection or an invalid or partial response and aggravated users.<br> <p> This suggestion is pretty much the same as using a BPF program to steer connections as suggested in the article. It just provides a simpler API to achieve the same thing.<br> </div> Mon, 26 Apr 2021 04:45:08 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854417/ https://lwn.net/Articles/854417/ Sesse <div class="FormattedComment"> I&#x27;m a bit confused. What good does it do that SYN processing is lockless, if you&#x27;re still going to serialize it into a single listener?<br> </div> Sun, 25 Apr 2021 21:56:52 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854388/ https://lwn.net/Articles/854388/ smurf <div class="FormattedComment"> Couldn&#x27;t the unhashing be accomplished (with some minimal kernel support of course) by calling listen(fd,0)? Then either process the remaining connections or hand them off to another process. <br> </div> Sun, 25 Apr 2021 08:09:08 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854372/ https://lwn.net/Articles/854372/ NYKevin <div class="FormattedComment"> I should probably also emphasize that I deal with things at a much higher level than &quot;the specific flags we pass to individual syscalls when setting up sockets,&quot; so while I believe what I have written is generally correct, my understanding might be incomplete or incorrect with respect to SO_REUSEPORT in particular.<br> <p> Nevertheless, there are definitely lots of people who want to push their software frequently, and making that easier in one fashion or another can only be a good thing.<br> </div> Sat, 24 Apr 2021 23:28:53 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854369/ https://lwn.net/Articles/854369/ flussence <div class="FormattedComment"> Why not? Linux already has a bunch of similar complexity to prevent timestamp counters wrapping after hundreds of days of uptime, and we all know that would&#x27;ve only happened to evil lazy sysadmins that don&#x27;t apply security updates and totally deserve it. 
/s<br> <p> If this were hardware, the maker would&#x27;ve put out a product recall for such a high failure rate.<br> </div> Sat, 24 Apr 2021 23:08:00 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854360/ https://lwn.net/Articles/854360/ ms-tg <div class="FormattedComment"> Thanks for posting this; I find a lot of value in the “this is what I’m seeing in my industry” posts that LWN attracts related to specific kernel intricacies.<br> </div> Sat, 24 Apr 2021 19:07:03 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854349/ https://lwn.net/Articles/854349/ wtarreau <div class="FormattedComment"> Yes, that could be an option; I considered it but lost my way back then. The accept queue code became tricky with the reintroduction of SO_REUSEPORT, and I seem to remember that one of the difficulties was to pick pending connections, and another one was to unhash some of the queues while they were in use without losing what was in them. For whatever reason I remember not figuring out how to allow a program to still pick what was left in a queue with that queue not being visible to the rx path that distributes incoming requests. But these are old memories, and I remember that Eric was quite concerned about my fiddling there because he was about to finish killing the SYN queue lock.<br> <p> Also it&#x27;s important to keep in mind that we cannot afford to lose even a tenth of a percent of performance there, because such tricks would only be used during process reloads, and the code path they&#x27;re affecting is the one most stressed during DDoSes.<br> <p> Ideally we should just remove a queue in two steps: first it would simply be unhashed, and second it would be closed. From what I remember, pending entries were killed inside the unhashing code, but I could be saying crap.<br> </div> Sat, 24 Apr 2021 12:12:10 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854347/ https://lwn.net/Articles/854347/ gracinet <div class="FormattedComment"> This is interesting to me because SO_REUSEPORT is, as far as I know, the only way to do prefork multiprocessing in gRPC servers, which is something that&#x27;s typically wanted if implemented in Python, because of the global interpreter lock (GIL).<br> <p> In some cases the performance impact of the GIL can be overstated; it really depends on the workload. But then, many Python applications aren&#x27;t designed to be thread-safe anyway because it&#x27;s generally believed that the GIL would make the effort useless.<br> <p> A common reason for restarting worker processes would be reaching some limit, such as memory footprint. Lots of applications in the wild have at least minor memory leaks.<br> <p> </div> Sat, 24 Apr 2021 11:05:21 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854345/ https://lwn.net/Articles/854345/ Cyberax <div class="FormattedComment"> How about this method?<br> <p> Have two queues for each listening socket, one normal and one for &quot;extraordinary&quot; requests. In case of a process death during the closure process, take the queued connections and redistribute them across the extraordinary queues.<br> <p> Since these queues are special and are used infrequently, you can use simple locking-based algorithms there.<br> <p> Ideally this can be done transparently in the kernel, but it can also be done with some userspace assistance. 
Processes willing to &quot;mop up&quot; connections can open a new listening socket and communicate (via setsockopt/ioctl) that it should be used for connection migration.<br> </div> Sat, 24 Apr 2021 10:29:26 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854341/ https://lwn.net/Articles/854341/ wtarreau <div class="FormattedComment"> I tried to do exactly this several times in the past but failed. We had the same problem with haproxy, where high-traffic users were constantly seeing a few resets being emitted when one queue was unbound. I tried to figure out how to detach pending connections from a queue and reinject them into other queues, but never managed to; that area was too complex.<br> <p> I understand Eric&#x27;s concerns (and he already expressed them to me back then). It is possible that it&#x27;s not the best solution, but it addresses a real issue in the field that needs to be addressed.<br> <p> We worked around it by passing listening file descriptors between the old and the new process during reloads. All this just to avoid a bind+unbind cycle! It comes with its own set of limitations, of course.<br> <p> Also, SO_REUSEPORT is not just used for this; the initial purpose was to allow multiple processes to bind to the same port and avoid blackout periods. It used to work fine in 2.2 and was removed in 2.4. I had to maintain the patch to reintroduce it until someone else proposed a variant in 3.9 which also implemented the multiqueue balancing. But this is an essential feature in highly available environments.<br> <p> </div> Sat, 24 Apr 2021 08:46:11 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854335/ https://lwn.net/Articles/854335/ NYKevin <div class="FormattedComment"> Speaking as a Google SRE, we restart servers &quot;to effect a configuration change&quot; all the damn time. We push new code into production literally every single week[1] unless there&#x27;s a holiday or our error budget is depleted. Every push involves (slowly, carefully[2]) restarting all running instances of the server. Now, in practice, SO_REUSEPORT is probably not the most relevant flag in the world for us, but that&#x27;s mostly just because we&#x27;ve already solved this problem (i.e. &quot;don&#x27;t drop in-flight requests&quot;) at other levels of abstraction, and so asking the kernel for help is less useful. But any shop that&#x27;s less aggressively containerized than us[3] would probably find this sort of thing Nice To Have, if they want to do frequent releases.<br> <p> [1]: This is my experience on one team managing a small number of services. It is not necessarily representative; I know for a fact that other teams often have wildly different release cadences.<br> [2]: <a href="https://sre.google/workbook/canarying-releases/">https://sre.google/workbook/canarying-releases/</a><br> [3]: <a href="https://sre.google/sre-book/production-environment/">https://sre.google/sre-book/production-environment/</a><br> </div> Sat, 24 Apr 2021 04:45:21 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854332/ https://lwn.net/Articles/854332/ Cyberax <div class="FormattedComment"> That&#x27;s not necessarily true. For example, we were using plenty of Let&#x27;s Encrypt certs on internal NAS-like servers. 
We could have used self-signed certs, but maintaining our own CA and installing it on all computers was a bigger hassle.<br> </div> Sat, 24 Apr 2021 02:43:56 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854331/ https://lwn.net/Articles/854331/ clugstj <div class="FormattedComment"> If you are using Let&#x27;s Encrypt, your server isn&#x27;t that busy.<br> </div> Sat, 24 Apr 2021 02:39:58 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854330/ https://lwn.net/Articles/854330/ Cyberax <div class="FormattedComment"> Every couple of weeks with Let&#x27;s Encrypt.<br> </div> Sat, 24 Apr 2021 02:38:10 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854329/ https://lwn.net/Articles/854329/ clugstj <div class="FormattedComment"> The article says &quot;the server is being restarted to effect a configuration change or to switch to a new certificate&quot;. How often does this happen?<br> </div> Sat, 24 Apr 2021 02:37:22 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854322/ https://lwn.net/Articles/854322/ HenrikH <div class="FormattedComment"> The sockets/connections that the patch set is about have not reached userspace yet when the process is restarted, so there is nothing an application can do here. It&#x27;s about connections that have been assigned to a specific process by the kernel but where said process has not yet called accept() to get them when that process was restarted/stopped/crashed.<br> </div> Sat, 24 Apr 2021 01:07:46 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854315/ https://lwn.net/Articles/854315/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; IIRC, the purpose of SO_REUSEPORT wasn&#x27;t to permit multiple threads to dequeue connections; they could already do that, and if there was lock contention this was just a QoI issue. </font><br> SO_REUSEPORT is needed to allow multiple _processes_ to dequeue connections.<br> <p> The problem with accepting connections from multiple threads is that they are still effectively serialized, because all file operations take an implicit lock to allocate a file descriptor.<br> <p> </div> Fri, 23 Apr 2021 23:39:30 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854310/ https://lwn.net/Articles/854310/ wahern <div class="FormattedComment"> The article mentions that &quot;the TCP accept code has been reworked to run locklessly&quot;. IIRC, the purpose of SO_REUSEPORT wasn&#x27;t to permit multiple threads to dequeue connections; they could already do that, and if there was lock contention this was just a QoI issue. Rather, I think the primary issue was efficient polling and resolving the thundering herd problem. The reason an incoming connection is immediately assigned to a specific queue is so that only that descriptor (or descriptors if dup&#x27;d) will signal readiness while still avoiding introducing stalling and fair dispatch dilemmas, especially in the context of polling as opposed to threads actually waiting inside accept. The classic way to implement a multi-threaded accept in userspace while avoiding thundering herds was to ensure only a single thread was waiting in accept or polling on the accept descriptor at any one time. 
SO_REUSEPORT simply moved assignment earlier in the pipeline while largely preserving semantics.<br> <p> BSD supported SO_REUSEPORT long before Linux did, although support for TCP seems to be undocumented. Rather than round-robin, though, only the most recent binding is assigned connections. When that goes away the previous one starts to see connections again. However, at least on macOS queued connections are still lost on close. I see FreeBSD added SO_REUSEPORT_LB which does round-robin; not sure if it suffers from the lost connection problem.<br> <p> </div> Fri, 23 Apr 2021 23:29:09 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854299/ https://lwn.net/Articles/854299/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; I wonder if the right thing might be to have SO_REUSEPORT switch to actually dup()ing the single socket to the caller&#x27;s fd. Is there per-socket information that is allowed to be different among the sockets listening on the same port that would cause behavior changes in that situation?</font><br> The problem is that you&#x27;ll be funneling all the connections through effectively one thread, because all the file operations take a process-wide lock. This is what SO_REUSEPORT was designed to avoid.<br> </div> Fri, 23 Apr 2021 20:58:39 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854286/ https://lwn.net/Articles/854286/ ibukanov <div class="FormattedComment"> I do not see how this is useful in practice. nginx has supported live updates without dropping any connection for ages, using a careful protocol to transfer the file descriptors to another process. systemd made that very straightforward to implement, as the process can store descriptors before the restart in the pool provided by systemd. Those busy sites surely can implement something like that, allowing them to preserve not only listening sockets, with or without SO_REUSEPORT, but also accepted ones. <br> <p> As for crashing servers, losing the incoming queue is the least of the worries, as the crash is a clear sign that the system misbehaves. And for absolute robustness one can transfer important sockets to another process in the crash handler.<br> </div> Fri, 23 Apr 2021 19:15:38 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854280/ https://lwn.net/Articles/854280/ mss <div class="FormattedComment"> <font class="QuotedText">&gt; So, we are worrying about 1 in a billion connections to a busy web server failing?</font><br> Where did you get this number from? Is it stated somewhere in the patch series?<br> <p> A TCP connection could fail for many other reasons, too.<br> We definitely don&#x27;t want to add additional ones, as randomly failing server connections give a poor user experience.<br> <p> </div> Fri, 23 Apr 2021 19:10:14 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854282/ https://lwn.net/Articles/854282/ smurf <div class="FormattedComment"> It&#x27;s not just web sockets this is useful for.<br> <p> The connection in question has already been accepted and there&#x27;s a server willing to work with it. Dropping it is just plain rude, esp. as the client has no way to decide whether the server crashed or not. This is bad. You need retry code in the client (which otherwise could just blindly assume that the server crashed). 
The retry introduces latency you might want to avoid.<br> <p> Also, the client might be a load balancer which now thinks that your server just crashed. This is a bad idea, esp. if you use a sharded data set, because the requests now go to &quot;cold&quot; machines.<br> </div> Fri, 23 Apr 2021 19:07:47 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854270/ https://lwn.net/Articles/854270/ xi0n <div class="FormattedComment"> AFAIK it’s associated with a particular socket, but either way it does look like a legacy assumption. SO_REUSEPORT would likely be simpler in implementation (incl. possibly making the patch set discussed here unnecessary) if those queues were maintained on a per-bindpoint basis instead.<br> <p> (This would raise the question of what to do with the second argument to listen(). Since the number given there is defined more like a hint than an actual limit, it&#x27;s probably a minor issue, though.)<br> </div> Fri, 23 Apr 2021 18:29:56 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854241/ https://lwn.net/Articles/854241/ iabervon <div class="FormattedComment"> I wonder if the right thing might be to have SO_REUSEPORT switch to actually dup()ing the single socket to the caller&#x27;s fd. Is there per-socket information that is allowed to be different among the sockets listening on the same port that would cause behavior changes in that situation?<br> <p> I assume the now-recommended userspace code would bind the address in a single process and pass the bound socket over a unix domain socket to other processes (or fork after binding it), and it seems like this situation wouldn&#x27;t be too hard to replicate even when userspace didn&#x27;t ask for it like that, assuming that there aren&#x27;t any visible differences.<br> <p> I guess the max queue length may need to be adjusted in order to enqueue the same number of incoming connections total, and that might be noticed by the other local processes?<br> </div> Fri, 23 Apr 2021 16:16:35 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854237/ https://lwn.net/Articles/854237/ Tomasu <div class="FormattedComment"> I wonder why the queue has to be associated with a process at all till it&#x27;s been &quot;accepted&quot;. Probably just legacy assumptions.<br> </div> Fri, 23 Apr 2021 15:42:30 +0000 Avoiding unintended connection failures with SO_REUSEPORT https://lwn.net/Articles/854236/ https://lwn.net/Articles/854236/ clugstj <div class="FormattedComment"> So, we are worrying about 1 in a billion connections to a busy web server failing? I can&#x27;t see this being worth the added complexity.<br> </div> Fri, 23 Apr 2021 15:35:13 +0000
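Several of the comments above assume familiarity with how a SO_REUSEPORT listener group is set up and with the BPF steering that bernat's and mm7323's comments refer to. The following is a minimal sketch, not taken from haproxy, nginx, or the patch set under discussion: each worker process creates its own listening socket on the same port, and one of them attaches a classic-BPF program (SO_ATTACH_REUSEPORT_CBPF, available since Linux 4.5) that selects the target socket by the CPU that handled the incoming SYN. The function names are illustrative and error handling is omitted.
<pre>
#include <linux/filter.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* One such socket is created by every worker process. */
static int reuseport_listener(uint16_t port)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Must be set before bind() so the kernel groups the sockets. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}

/* Attached to one socket of the group; the value returned by the filter is
 * used as an index into the group, so this assumes one worker pinned per CPU. */
static void attach_cpu_steering(int fd)
{
    struct sock_filter code[] = {
        /* A = CPU that is processing the packet */
        { BPF_LD  | BPF_W | BPF_ABS, 0, 0, SKF_AD_OFF + SKF_AD_CPU },
        /* return A as the socket index */
        { BPF_RET | BPF_A,           0, 0, 0 },
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF, &prog, sizeof(prog));
}
</pre>
Note that such steering only affects where new SYNs land; as the article and the comments point out, it does nothing for connections already sitting in the queue of a socket that is being closed.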
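wtarreau, ibukanov, and iabervon all mention passing an already-bound listening socket to another process so that the port never goes through an unbind/rebind cycle. Below is a rough sketch of the sending side of such a handoff over a UNIX-domain socket; the function name and the one-byte payload are just illustration, and the receiving process would use recvmsg() with a matching control buffer to pick the descriptor up.
<pre>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send listen_fd to the peer connected on unix_fd (e.g. a freshly started
 * replacement process).  The kernel installs a duplicate of the descriptor
 * in the receiver, so the listening socket is never torn down. */
static ssize_t send_listener(int unix_fd, int listen_fd)
{
    char byte = 'L';                      /* SCM_RIGHTS needs some payload */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    union {                               /* properly aligned cmsg buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &listen_fd, sizeof(int));

    return sendmsg(unix_fd, &msg, 0);
}
</pre>
This sidesteps the bind+unbind cycle for planned reloads, which is the workaround wtarreau describes for haproxy; it does not help when a process dies and its socket is closed with connections still queued, which is the case the patch set targets.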
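mm7323 and smurf suggest having a departing worker stop queueing new connections and then drain whatever is already queued; wtarreau's replies dispute whether listen(fd, 0) actually achieves the first half. Purely to illustrate the second half, here is a sketch of a non-blocking drain loop; handle_connection() is a placeholder for the server's own request handling, not a real API.
<pre>
#define _GNU_SOURCE               /* for accept4() */
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>

extern void handle_connection(int fd);   /* placeholder */

static void drain_accept_queue(int listen_fd)
{
    /* Make the listening socket non-blocking so accept4() reports EAGAIN
     * once the queue is empty instead of waiting for new connections. */
    fcntl(listen_fd, F_SETFL, fcntl(listen_fd, F_GETFL, 0) | O_NONBLOCK);

    for (;;) {
        int conn = accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
        if (conn < 0) {
            if (errno == EINTR)
                continue;         /* interrupted, try again */
            break;                /* EAGAIN: queue empty; bail on other errors too */
        }
        handle_connection(conn);
    }
}
</pre>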