Zero-copy network transmission with io_uring
Posted Dec 31, 2021 16:31 UTC (Fri)
by Sesse (subscriber, #53779)
In reply to: Zero-copy network transmission with io_uring by jezuch
Parent article: Zero-copy network transmission with io_uring
char buf[256];   /* request assembled on the stack */
int len = snprintf(buf, sizeof(buf), "GET / HTTP/1.1\r\nHost: www.lwn.net\r\n\r\n");
int ret = send(sock, buf, len, MSG_ZEROCOPY);   /* zero-copy send; the socket needs SO_ZEROCOPY enabled */
Now you cannot return from the function safely! The kernel holds a pointer into your stack (the buf variable) all the way until it has received an ACK from the other side of the world. With default flags (no MSG_ZEROCOPY), you can do whatever you want with buf the nanosecond the send() call returns, because the kernel has taken a copy of your data for you.
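To make that concrete: with MSG_ZEROCOPY the sender has to keep buf alive until a completion notification shows up on the socket's error queue. Here is a minimal sketch of that dance, following the kernel's msg_zerocopy documentation; it assumes sock is a connected TCP socket on a v4.14+ kernel, a single send in flight, and that a real version would also loop over partial sends.

#include <linux/errqueue.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60                 /* from <asm-generic/socket.h>, kernel >= 4.14 */
#endif

/* Send buf with MSG_ZEROCOPY and block until the kernel says it is done
 * with the buffer; only then may buf go out of scope or be reused. */
static int send_zerocopy_and_wait(int sock, const char *buf, size_t len)
{
    int one = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;
    if (send(sock, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    /* The completion notification arrives on the error queue; poll()
     * reports it as POLLERR even with no events requested. */
    struct pollfd pfd = { .fd = sock, .events = 0 };
    if (poll(&pfd, 1, -1) != 1 || !(pfd.revents & POLLERR))
        return -1;

    char control[128];
    struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };
    if (recvmsg(sock, &msg, MSG_ERRQUEUE) < 0)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL)
        return -1;
    struct sock_extended_err *serr = (struct sock_extended_err *)CMSG_DATA(cm);
    if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY || serr->ee_errno != 0)
        return -1;

    /* serr->ee_info..serr->ee_data is the range of completed zero-copy
     * sends; with a single send in flight, buf is now safe to release. */
    return 0;
}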
Posted Dec 31, 2021 16:41 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (6 responses)
Maybe some immutable mechanism coupled with garbage collection, so the program can release it when the kernel's finished with it.
Cheers,
Wol
Posted Dec 31, 2021 17:07 UTC (Fri)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Jan 1, 2022 16:55 UTC (Sat)
by shemminger (subscriber, #5739)
[Link]
The experiment concluded that COW was slower because the cost of acquiring locks to invalidate the TLB entries on other CPUs exceeded the cost of the copy. The parameters might be different now with larger sends (64K or more) and huge pages. Definitely worth investigating, but the VM overhead is significant.
Posted Dec 31, 2021 21:47 UTC (Fri)
by NYKevin (subscriber, #129325)
[Link]
Alternatives:
* The kernel calls free or something similar to free for you. But malloc (or tcmalloc, or whatever you're using to allocate memory) is a black box from the kernel's perspective, because it lives in userspace, and the only way the kernel can plausibly invoke it is to either inject a thread (shudder) or preempt an existing thread and borrow its stack. You end up with all of the infelicities of signal handlers, which notoriously cannot allocate or free memory because malloc and free are usually not reentrant. That means your "something similar to free" just ends up being a more elaborate and complicated version of exactly the same completion mechanism, and userspace still has to do the actual memory management itself.
* The buffer is mmap'd, and the kernel unmaps it for you when it's done. There are performance issues here; mmap simply cannot compete with a highly optimized userspace malloc, assuming the workload is reasonably compatible with the latter. However, this does at least have the advantage of being *possible* without ridiculous "let's inject a thread" tricks. But since the whole point of zero-copy is to improve performance, that's probably not enough.
Posted Dec 31, 2021 22:07 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Jan 2, 2022 20:35 UTC (Sun)
by luto (guest, #39314)
[Link]
Posted Jan 25, 2022 14:37 UTC (Tue)
by Funcan (subscriber, #44209)
[Link]
A language like golang, which already has magic in the compiler to promote local variables to the heap automatically where needed, might be able to optimise this away behind the scenes (not necessarily mainline golang, since that would require baking the semantics of networking into the compiler, which would be odd, but a language with similar properties). Still, it is probably better to provide a less leaky abstraction to programmers. Things done behind the scenes to make one interface look like an older, simpler one are rarely optimal, and this code is about getting the last drop of network performance.
Posted Jan 1, 2022 8:17 UTC (Sat)
by epa (subscriber, #39769)
[Link] (17 responses)
Posted Jan 1, 2022 9:58 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (12 responses)
Posted Jan 1, 2022 11:04 UTC (Sat)
by smurf (subscriber, #17840)
[Link] (11 responses)
Nothing new here. File systems (particularly networked ones, but also physical disks when they are under heavy load and/or start going bad) may or may not behave like this also, depending on whether you mount them synchronously and/or open a file with O_SYNC and/or O_DIRECT turned on.
Posted Jan 1, 2022 11:14 UTC (Sat)
by Sesse (subscriber, #53779)
[Link] (1 responses)
Posted Jan 1, 2022 12:30 UTC (Sat)
by smurf (subscriber, #17840)
[Link]
Assume, for instance, that you are a server. You have a thread per client. You get a request, assemble the response, send it, then you free the data structure. An end-to-end-blocking send would obviate the need for copying the data to the kernel while not affecting anything else.
I do agree that doing the same thing is way more useful in an io_uring setup.
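A rough sketch of that thread-per-client shape, for illustration only: build_response() is a hypothetical helper returning a heap buffer, everything else is standard.

#include <stdint.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

char *build_response(size_t *len);               /* hypothetical helper */

/* One thread per client: assemble the response, send it, free it. */
static void *client_thread(void *arg)
{
    int sock = (int)(intptr_t)arg;
    size_t len;
    char *resp = build_response(&len);

    /* Today, send() copies resp into kernel buffers, so the free() below
     * is safe the instant send() returns.  An end-to-end-blocking
     * zero-copy send would only return once the data had been ACKed, so
     * exactly the same code would stay correct, minus the copy. */
    if (resp != NULL) {
        send(sock, resp, len, 0);
        free(resp);
    }
    close(sock);
    return NULL;
}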
Posted Jan 2, 2022 10:32 UTC (Sun)
by NYKevin (subscriber, #129325)
[Link] (8 responses)
0. The kernel has validated the operation (i.e. checked the arguments), and is preparing to push the data we care about into the TCP send queue, but it has not actually done that yet (i.e. it might still be sitting in a userspace buffer). This is probably not what you meant, but I include it for completeness.
1. The data we care about is in the TCP send queue and is enqueued for sending, but due to Nagle's algorithm and/or network congestion, we can't send it yet. Under zero-copy networking, this state is (or should be, anyway) functionally equivalent to state 0, and the application cannot safely free the buffer until it gets a completion notification. Time since previous state: Probably no more than a few hundred microseconds, possibly a lot less in a well-optimized implementation, especially if the data is small or we are async/zero-copy. Might be slower if the data is huge and we have to copy it.
2. The data we care about is "next in line," i.e. it will be transmitted in the next TCP segment, and we are not waiting on any ACKs. Time since previous: Could take multiple seconds if the network is badly congested, or rarely more than that. We might never reach this state, if the network drops altogether or we receive a RST packet. In a well-performing network, tens or hundreds of milliseconds would be typical depending on ping. Or this could be instant, if the send queue was already empty.
3. The data we care about has been packed into one or (rarely) more IP datagrams, and those datagrams have been sent. IP is an unreliable, connectionless protocol, so at the IP layer, sending is an event, not a process. This probably takes no more than a few milliseconds, but I'm not very familiar with this sort of low-level hardware timing, so I might be completely wrong there.
4. The data we care about has been ACKed. At this point, we can be somewhat confident that a well-behaved receiving application on the peer will eventually get a copy of the data, assuming it does not crash or abort before then. Time since previous: At least one ping round-trip, possibly forever if the network drops before we receive an ACK.
5. There has been some sort of application-level acknowledgement of the data we care about, such as an HTTP response code. This may or may not happen at all depending on the protocol and the server/client roles, and the kernel is obviously not in a position to figure that out anyway, so this is a non-starter.
You probably don't mean 0 or 1, because 1 is (I think) when regular old send(2) returns (and 0 is pretty much the exact opposite of "the data has been sent"). But even getting to (2) is potentially multiple seconds and might never happen at all (in which case, I assume you just fail with EIO or something?). If you want that to perform well, you had better not be doing one-thread-per-operation, or you will either spawn way too many threads, or take far too long to accomplish anything. Both of those are bad, so now we're in the realm of async networking and io_uring, or at least the realm of thread pools and epoll, so no matter what, you're going to be doing some sort of async, event-driven programming. There's no way for the kernel to provide the illusion of synchronous networking short of actually doing things synchronously, and that just doesn't scale in any kind of reasonable way. I dunno, maybe your language/OS can fake it using async/await and such? But that's still event-driven programming, it's just dressed up to look like it's synchronous (and the kernel has no part in this chicanery anyway).
Even getting to (4) is not really enough to have any assurances, because you have to treat every element of the system as unreliable, including the remote host. The peer could still crash after (4) and never see your data (because it was sitting in the receive queue). Until you see (5), you simply cannot know whether the peer received or will receive the data. And the kernel can't figure out whether (5) has happened for you, because (5) is application-specific.
Posted Jan 4, 2022 11:19 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (7 responses)
Getting to 4 is, however, useful in terms of RAII or other stack-based buffer management. Until you get to 4, the kernel may need access to the data again so that it can resend if not ACKed; once you get to 4, the kernel will never look at the buffer again, even if the remote application doesn't receive the data.
Basically, getting to 1 is the minimum useful state, but it makes zero-copy hard, because I now have to keep the buffer alive in user-space until the kernel gets to 4. Getting to 4 or clearly indicating that we will never get to 4 is useful because it means that when send returns, the kernel is promising to never look at the buffer again.
Posted Jan 5, 2022 4:37 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (6 responses)
Posted Jan 5, 2022 9:39 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (4 responses)
You've explained why it's inadequate in terms of CPU time given a large number of clients, but not in terms of memory usage, nor in terms of small servers handling a few tens of clients at peak; different optimization targets for different needs.
For the general case, io_uring and async is the "best" option, but it brings in a lot of complexity managing the state machines in user code rather than simply relying on thread per client. Zero-copy reduces memory demand as compared to current send syscalls, and having a way to do simple buffer management would be useful for the subset of systems that don't actually care about CPU load that much, don't have many clients at a time to multiplex (hence not many threads), but do want a simple "one thread per client" model that avoids cross-thread synchronisation fun.
Not everything is a Google/Facebook/Netflix level problem, and on embedded systems I've worked on, a zero-copy blocking until ACK send would have made the code smaller and simpler; we emulated it in userspace via mutexes, but that's not exactly a high performance option.
Posted Jan 5, 2022 9:46 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Each thread has a stack, whose default size seems to be measured in megabytes (by cursory Googling, anyway). If you spawn way too many threads, you are going to use way too much memory just allocating all of those stacks.
> nor in terms of small servers handling a few tens of clients at peak
I find it difficult to believe that blocking send(2) is too slow yet you only have tens of clients. You are well off the beaten path if that's really the shape of your problem. So I guess you get to build your own solution.
Posted Jan 5, 2022 10:06 UTC (Wed)
by farnz (subscriber, #17727)
[Link] (2 responses)
Setting stack sizes is trivial to do - you don't have to stick to the default, and when you're on a small system, you do tune the stack size down to a sensible level for your memory. Plus, those megabytes are VA space, not physical memory; there's no problem having a machine with 256 MiB physical RAM, no swap and 16 GiB of VA space allocated, as long as you don't actually try to use all your VA space.
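For what it's worth, tuning the stack down is only a couple of extra calls at thread-creation time; a minimal sketch (the 128 KiB figure is just an example):

#include <limits.h>                    /* PTHREAD_STACK_MIN */
#include <pthread.h>

/* Spawn a thread with a small, explicit stack instead of the
 * multi-megabyte default. */
static int spawn_small_thread(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    size_t stack = 128 * 1024;         /* example; must be >= PTHREAD_STACK_MIN */

    if (stack < PTHREAD_STACK_MIN)
        stack = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stack);
    int ret = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return ret;
}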
And you're focusing on speed again, not simplicity of programming a low memory usage system; I want to be able to call send, have the kernel not need to double my buffer (typically an entire compressed video frame in the application I was working on) by copying it into kernel space, and then poll the kernel until the IP stack has actually sent the data. I want to be able to call send, and know that when it returns, the video frame has been sent on the wire, and I can safely reuse the buffer for the next encoded frame.
It's not that send is slow - it's that doing a good job of keeping memory use down on a system with reasonable CPU (in order to keep the final BOM down while still having enough grunt to encode video) requires co-operation. And I can (and did) emulate this by using zero-copy send and mutexes in userspace, but it's not exactly easy to maintain code - and yet it's just doing stuff that the kernel already knows how to do well.
Posted Jan 6, 2022 6:47 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
I don't see why you would need to use mutexes. The sample code in https://www.kernel.org/doc/html/v4.15/networking/msg_zero... uses poll(2) to wait for the send to complete, and I tend to assume you could also use select(2) or epoll(2) instead if you find those easier or more familiar (until now, I had never heard of poll). Just write a five-line wrapper function that calls send(2) and then waits on the socket with one of those syscalls, and (as far as I can tell) you should be good to go.
Frankly, the kernel is not a library. If we *really* need to have a wrapper function for this, it ought to live in libc, not in the kernel.
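The wrapper described above might look roughly like this; a sketch only, assuming SO_ZEROCOPY is already enabled on the socket, that this is the only zero-copy send in flight (so the first error-queue notification is ours), and ignoring partial sends:

#include <poll.h>
#include <sys/socket.h>

/* Blocking wrapper: zero-copy send, then wait for the completion
 * notification on the error queue before returning. */
static ssize_t send_zc_blocking(int sock, const void *buf, size_t len)
{
    ssize_t ret = send(sock, buf, len, MSG_ZEROCOPY);
    if (ret < 0)
        return ret;

    struct pollfd pfd = { .fd = sock, .events = 0 };
    char control[128];
    struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };

    /* Error-queue readiness shows up as POLLERR; drain one notification. */
    do {
        if (poll(&pfd, 1, -1) < 0)
            return -1;
    } while (recvmsg(sock, &msg, MSG_ERRQUEUE) < 0);

    return ret;                        /* buf may now be reused or freed */
}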
Posted Jan 6, 2022 10:24 UTC (Thu)
by farnz (subscriber, #17727)
[Link]
I had to use mutexes because of the way the rest of the application was structured - the socket was already being polled elsewhere via epoll edge triggered (not my decision, and in a bit of code I had no control over), and I needed to somehow feed the notifications from the main loop to the client threads.
It was not a nice project to work on, and the moment I got the chance, I rewrote it as a single-threaded application that used less CPU (but same amount of memory) and was much easier to read. Unfortunately, this meant arguing with management that the "framework" from the contracted developers wasn't worth the money they'd paid for it.
Posted Jan 5, 2022 13:41 UTC (Wed)
by foom (subscriber, #14868)
[Link]
Waiting until step 4 effectively removes local kernel buffering -- when one send completes there must be no further data available to send, thus outgoing packets will pause. And, like TCP_NODELAY, it would cause extra partially filled packets to be sent at the tail of each send. If you're trying to send a stream of data as fast as the network allows, this will all be counterproductive, unless all the data is provided in a single send. Sometimes that may be possible, but it seems like a very limited use case.
And, if your primary goal is not scalability or cpu, but rather memory usage, it seems like it'd be far simpler and just as effective to reduce SO_SNDBUF and use a regular blocking send loop.
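A minimal sketch of that simpler approach, assuming a connected blocking socket (the 16 KiB figure is arbitrary; note that socket(7) says the kernel doubles the SO_SNDBUF value you pass):

#include <sys/socket.h>
#include <sys/types.h>

/* Cap kernel-side buffering, then send with ordinary copying writes;
 * send() blocks whenever the (small) socket buffer is full. */
static int send_all(int sock, const char *buf, size_t len)
{
    int sndbuf = 16 * 1024;
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        return -1;

    while (len > 0) {
        ssize_t n = send(sock, buf, len, 0);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}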
Posted Jan 1, 2022 20:27 UTC (Sat)
by jafd (subscriber, #129642)
[Link] (3 responses)
This is a small problem, but it sure opens a huge can of worms.
Years ago I wanted this too. But then I read more about this problem (in the context of a rather good article about buffer bloat and how it actually harms both disk and network performance; it may well have been here at LWN, years ago), and I don't want it anymore. We can only sensibly agree that the packet is sent as soon as the postman (the kernel) has found no immediate problems with it and taken it for processing.
Posted Jan 1, 2022 23:38 UTC (Sat)
by james (subscriber, #1325)
[Link] (1 responses)
Or as soon as the remote side acknowledges receipt?
And "acknowledgement" can happen at different levels. For example, a mail server might indicate that TCP packets containing an email have been received, but you shouldn't consider that email "sent" until the receiving side sends a "250 OK" message (which might be after SpamAssassin has run, so potentially many milliseconds later).
Posted Jan 3, 2022 8:25 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
Posted Jan 2, 2022 9:56 UTC (Sun)
by smurf (subscriber, #17840)
[Link]
There is no kernel buffer. The whole point of this is to enable Zero Copy.
Posted Jan 2, 2022 7:59 UTC (Sun)
by jezuch (subscriber, #52988)
[Link] (1 responses)
My guess is I don't really understand how managed buffers work :)
(The difficulty then is in making sure that the userspace does not reuse the buffer while it's "busy". Rust's ownership model looks like it was designed just with this in mind ;) )
Posted Jan 2, 2022 9:32 UTC (Sun)
by Sesse (subscriber, #53779)
[Link]