Zero-copy network transmission with io_uring

By Jonathan Corbet
December 30, 2021

When the goal is to push bits over the network as fast as the hardware can go, any overhead hurts. The cost of copying data to be transmitted from user space into the kernel can be especially painful; it adds latency, takes valuable CPU time, and can be hard on cache performance. So it is unsurprising that the developers working with io_uring, which is all about performance, have turned their attention to zero-copy network transmission. This patch set from Pavel Begunkov, now in its second revision, looks to be significantly faster than the MSG_ZEROCOPY option supported by current kernels.

As a reminder: io_uring is a relatively new API for asynchronous I/O (and related operations); it was first merged less than three years ago. User space sets up a pair of circular buffers shared with the kernel; the first buffer is used to submit operations to the kernel, while the second receives the results when operations complete. A suitably busy process that keeps the submission ring full can perform an indefinite number of operations without needing to make any system calls, which clearly improves performance. Io_uring also implements the concept of "fixed" buffers and files; these are held open, mapped, and ready for I/O within the kernel, saving the setup and teardown overhead that is otherwise incurred by every operation. It all adds up to a significantly faster way for I/O-intensive applications to work.

One thing that io_uring still does not have is zero-copy networking, even though the networking subsystem supports zero-copy operation via the MSG_ZEROCOPY socket option. In theory, adding that support is simply a matter of wiring up the integration between the two subsystems. In practice, naturally, there are a few more details to deal with.

A zero-copy networking implementation must have a way to inform applications when any given operation is truly complete; the application cannot reuse a buffer containing data to be transmitted if the kernel is still working on it. There is a subtle point that is relevant here: the completion of a send() call (for example) does not imply that the associated buffer is no longer in use. The operation "completes" when the data has been accepted into the networking subsystem for transmission; the higher layers may well be done with it, but the buffer itself may still be sitting in a network interface's transmission queue. A zero-copy operation is only truly done with its data buffers when the hardware has done its work — and, for many protocols, when the remote peer has acknowledged receipt of the data. That can happen long after the operation that initiated the transfer has completed.

So there needs to be a mechanism by which the kernel can tell applications that a given buffer can be reused. MSG_ZEROCOPY handles this by returning notifications via the error queue associated with the socket — a bit awkward, but it works. Io_uring, instead, already has a completion-notification mechanism in place, so the "really complete" notifications fit in naturally. But there are still a few complications resulting from the need to accurately tell an application which buffers can be reused.

An application doing zero-copy networking with io_uring will start by registering at least one completion context, using the IORING_REGISTER_TX_CTX registration operation. The context itself is a simple structure:

    struct io_uring_tx_ctx_register {
	__u64 tag;
    };

The tag is a caller-chosen value used to identify this particular context in future zero-copy operations on the associated ring. There can be a maximum of 1024 contexts associated with the ring; user space should register them all with a single IORING_REGISTER_TX_CTX operation, passing the structures as an array. An attempt to register a second set of contexts will fail unless an intervening IORING_UNREGISTER_TX_CTX operation has been done to remove the first set.

Zero-copy writes are initiated with the new IORING_OP_SENDZC operation. As usual, a set of buffers is passed to be written out to the socket (which must also be provided, obviously). Additionally, each zero-copy write must have a context associated with it, stored in the submission queue entry's user_data field. The context is specified as an index into the array of contexts that was registered previously (not as the tag associated with the context). These writes will use the kernel's zero-copy mechanism when possible and will "complete" in the usual way, with the usual result in the completion ring, perhaps while the supplied buffers are still in use.

To know that the kernel is done with the buffers, the application must wait for the second notification informing it of that fact. Those notifications are not (by default) sent for every zero-copy operation that is submitted; instead, they are batched into "generations". Each completion context has a sequence number that starts at zero. Multiple operations can be associated with each generation; the notification for that generation is sent once all of the associated operations have truly completed.

It is up to user space to tell the kernel when to move on to a new generation; that is done by setting the IORING_SENDZC_FLUSH flag in a zero-copy write request. The flag itself lives in the ioprio field of the submission queue entry. The presence of this flag indicates that the request being submitted is the last of the current generation; the next request will begin the new generation. Thus, if a separate done-with-the-buffers notification is needed for each write request, IORING_SENDZC_FLUSH should be set on every request.

When a given generation completes, the notification will show up in the completion ring. The user_data field will contain the context tag, while the res field will hold the generation number. Once the notification arrives, the application will be able to safely reuse the buffers associated with that generation.

The end result seems to be quite good; benchmarks included in the cover letter suggest that io_uring's zero-copy operations can perform more than 200% better than MSG_ZEROCOPY. Much of that improvement likely comes from the ability to use fixed buffers and files with io_uring, cutting out much of the per-operation overhead. Most applications won't see that kind of improvement, of course; they are not so heavily dominated by the cost of network transmission. If your business is providing the world with cat videos, though, zero-copy networking with io_uring is likely to be appealing.

For now, the new zero-copy operations are meticulously undocumented. Begunkov has posted a test application that can be read to see how the new interface is meant to be used. There have not been many comments on this version (the second) of this series. Perhaps that will change after the holidays, but it seems likely that this work is getting close to ready for inclusion.

Index entries for this article
Kernel	Asynchronous I/O
Kernel	io_uring
Kernel	Networking/Performance

Zero-copy network transmission with io_uring

Posted Dec 30, 2021 22:37 UTC (Thu) by itsmycpu (guest, #139639) [Link]

Appreciate the article, thanks for the concise information.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 0:09 UTC (Fri) by Sesse (subscriber, #53779) [Link] (4 responses)

Interestingly, as I understand the patch cover letter (https://lore.kernel.org/lkml/cover.1640029579.git.asml.si...), io_uring + zerocopy is only ~27% faster than io_uring alone. And this is with dummy netdev and UDP packets, so zero NIC and TCP overhead. There will surely be applications interested in this, but most users should be pretty happy with io_uring on its own.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 0:15 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (3 responses)

ISTM that it's hard to estimate the potential benefits by benchmarking with a dummy NIC: Isn't the whole point of zero-copy IO is that the to-be-sent data will be accessed off-cpu via hardware DMA units? But there's no DMA capable hardware involved here...

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 9:02 UTC (Fri) by jorgegv (subscriber, #60484) [Link] (1 responses)

I have always understood zero-copy as not copying the buffers around (i.e. userspace to kernel and back), and instead using always the same copy of the data in memory by playing tricks vía the MMU.

Copying around data in memory can be expensive and thrashes the caches...

Zero-copy network transmission with io_uring

Posted Jan 3, 2022 16:39 UTC (Mon) by schuyler_t (subscriber, #91921) [Link]

The overall bandwidth matters a lot too. With high bandwidth (handwave at >10 gbps), it's extremely easy to begin running into CPU<->memory bandwidth limitations, especially for non-server class big metal CPUs. With a dummy NIC it's hard to tell, like you said. But once you run into that cliff, ZC is pretty much the only way around it.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 9:07 UTC (Fri) by Sesse (subscriber, #53779) [Link]

I would assume that the dummy netdev made a copy for the non-zerocopy case, and simply never touched the data for zerocopy (ie., simulating an infinitely efficient DMA engine).

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 15:39 UTC (Fri) by jezuch (subscriber, #52988) [Link] (29 responses)

Silly question: why can't this be the default mode? What are the advantages of non-zero-copy that make it better for some use cases? Perhaps it's lower memory use because we can reuse a fixed set of buffers or something?

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 16:31 UTC (Fri) by Sesse (subscriber, #53779) [Link] (27 responses)

The advantage is that you don't have to worry about managing the memory once you've called send(). Just as a silly example:

char buf[256];
int len = snprintf(buf, sizeof(buf), "GET / HTTP/1.1\r\nHost: www.lwn.net\r\n\r\n");
int ret = send(sock, buf, len, MSG_ZEROCOPY);

Now you cannot return from the function safely! Because the kernel holds a pointer into your stack (the buf variable) all the way until it's received an ACK from the other side of the world,. With default flags (no MSG_ZEROCOPY), you can do whatever you want with buf the nanosecond the send() call returns, because it's taken a copy of your data for you.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 16:41 UTC (Fri) by Wol (subscriber, #4433) [Link] (6 responses)

What we now want :-) is a mechanism that flags that buffer as COW in the user program.

Maybe some immutable mechanism coupled with garbage collect so the program can release it when the kernel's finished with it.

Cheers,
Wol

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 17:07 UTC (Fri) by Sesse (subscriber, #53779) [Link] (1 responses)

FreeBSD (possibly some of the other BSDs?) have a zerocopy mechanism that flags the page as COW in the MMU (relieving userspace of the need to know, so you can essentially always use zerocopy), but of course then you pay a price if you don't leave it alone for long enough afterwards.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 16:55 UTC (Sat) by shemminger (subscriber, #5739) [Link]

Many years ago there were experiments with COW and networking. See Intel paper at Ottawa Linux Symposium.
The experiment concluded that COW was a slower because the cost of acquiring locks to invalidate the TLB entries on other CPU's exceeded the cost of the copy. The parameters might now with larger sends (64K or more) and huge pages. Definitely worth investigating but the VM overhead is significant.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 21:47 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

In order for GC to work, the application has to somehow know whether the kernel is still using the buffer (since the GC cannot descend into kernel memory and trace pointers directly). Therefore, you need some sort of completion mechanism like the one described in this patch in order to notify the GC of that fact. Once that mechanism exists, sure, you can hook it up to a GC or anything else you want, because userspace can manage these buffers in whatever manner it likes.

Alternatives:

* The kernel calls free or something similar to free for you. But malloc (or tcmalloc, or whatever you're using to allocate memory) is a black box from the kernel's perspective, because it lives in userspace, and the only way the kernel can plausibly invoke it is to either inject a thread (shudder) or preempt an existing thread and borrow its stack. You end up with all of the infelicities of signal handlers, which notoriously cannot allocate or free memory because malloc and free are usually not reentrant. That means your "something similar to free" just ends up being a more elaborate and complicated version of exactly the same completion mechanism, and userspace still has to do the actual memory management itself.
* The buffer is mmap'd, and the kernel unmaps it for you when it's done. There are performance issues here; mmap simply cannot compete with a highly optimized userspace malloc, assuming the workload is reasonably compatible with the latter. However, this does at least have the advantage of being *possible* without ridiculous "let's inject a thread" tricks. But since the whole point of zero-copy is to improve performance, that's probably not enough.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 22:07 UTC (Fri) by roc (subscriber, #30627) [Link]

That's really expensive because when you make the page COW, you have to tell all CPUs that might have a TLB entry for that page to invalidate that TLB entry so that the next time they try to write to it, they get a fault. For a process with many threads using many CPUs, that is slow and scales poorly.

Zero-copy network transmission with io_uring

Posted Jan 2, 2022 20:35 UTC (Sun) by luto (guest, #39314) [Link]

That would be *stupendously* slow. It requires changing the PTE (no big deal, although locking might be nasty), broadcasting a TLB flush to all threads (might as well throw out your fancy server and run on a single CPU from 1999), and then eventually handling a page fault (embarrassingly slow on x86).

Zero-copy network transmission with io_uring

Posted Jan 25, 2022 14:37 UTC (Tue) by Funcan (subscriber, #44209) [Link]

Cows are gassy and slow - even spherical cows in a vacuum.

A language like goland that already has magic in the compiler to promote local variables to the heap automatically where needed might be able to optimise this away behind the scenes (not necessarily mainline golang since it would require baking the semantics of networking into the compiler, which would be odd but a language with similar properties) but it is probably better to provide a less leaky abstraction to programmers. Things done behind the scenes to make one interface look like an older, simpler one are rarely optimal, and this code is about getting the last drop of network performance

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 8:17 UTC (Sat) by epa (subscriber, #39769) [Link] (17 responses)

Can there be a synchronous mode for send() so it doesn't return until the data has been sent? It would be a bit like turning off write-behind caching for disk I/O. And similarly, it would hit performance in some single-threaded programs, but lead to more predictable behaviour, and in programs with a separate network thread it might perform fine.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 9:58 UTC (Sat) by Sesse (subscriber, #53779) [Link] (12 responses)

How would that behavior be predictable? It would mean that your call could return in 10 ms or in 30 seconds depending on network conditions.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 11:04 UTC (Sat) by smurf (subscriber, #17840) [Link] (11 responses)

So? Networks are unpredictable, period. Put the call in a thread and deal with it.

Nothing new here. File systems (particularly networked ones, but also physical disks when they are under heavy load and/or start going bad) may or may not behave like this also, depending on whether you mount them synchronously and/or open a file with O_SYNC and/or O_DIRECT turned on.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 11:14 UTC (Sat) by Sesse (subscriber, #53779) [Link] (1 responses)

But how is the behavior more predictable? What are you trying to achieve by waiting for the call to return? (Even more so if you put it in a thread; the context switch cost is going to be way higher than the data copy cost.)

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 12:30 UTC (Sat) by smurf (subscriber, #17840) [Link]

That depends how much data you have to copy and how you set up your threads.

Assume, for instance, that you are a server. You have a thread per client. You get a request, assemble the response, send it, then you free the data structure. An end-to-end-blocking send would obviate the need for copying the data to the kernel while not affecting anything else.

I do agree that doing the same thing is way more useful in an io_uring setup.

Zero-copy network transmission with io_uring

Posted Jan 2, 2022 10:32 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (8 responses)

It depends on what you mean by "the data has been sent." There are multiple possible definitions of that phrase, in chronological order:

0. The kernel has validated the operation (i.e. checked the arguments), and is preparing to push the data we care about into the TCP send queue, but it has not actually done that yet (i.e. it might still be sitting in a userspace buffer). This is probably not what you meant, but I include it for completeness.
1. The data we care about is in the TCP send queue and is enqueued for sending, but due to Nagle's algorithm and/or network congestion, we can't send it yet. Under zero-copy networking, this state is (or should be, anyway) functionally equivalent to state 0, and the application cannot safely free the buffer until it gets a completion notification. Time since previous state: Probably no more than a few hundred microseconds, possibly a lot less in a well-optimized implementation, especially if the data is small or we are async/zero-copy. Might be slower if the data is huge and we have to copy it.
2. The data we care about is "next in line," i.e. it will be transmitted in the next TCP segment, and we are not waiting on any ACKs. Time since previous: Could take multiple seconds if the network is badly congested, or rarely more than that. We might never reach this state, if the network drops altogether or we receive a RST packet. In a well-performing network, tens or hundreds of milliseconds would be typical depending on ping. Or this could be instant, if the send queue was already empty.
3. The data we care about has been packed into one or (rarely) more IP datagrams, and those datagrams have been sent. IP is an unreliable, connectionless protocol, so at the IP layer, sending is an event, not a process. This probably takes no more than a few milliseconds, but I'm not very familiar with this sort of low-level hardware timing, so I might be completely wrong there.
4. The data we care about has been ACKed. At this point, we can be somewhat confident that a well-behaved receiving application on the peer will eventually get a copy of the data, assuming it does not crash or abort before then. Time since previous: At least one ping round-trip, possibly forever if the network drops before we receive an ACK.
5. There has been some sort of application-level acknowledgement of the data we care about, such as an HTTP response code. This may or may not happen at all depending on the protocol and the server/client roles, and the kernel is obviously not in a position to figure that out anyway, so this is a non-starter.

You probably don't mean 0 or 1, because 1 is (I think) when regular old send(2) returns (and 0 is pretty much the exact opposite of "the data has been sent"). But even getting to (2) is potentially multiple seconds and might never happen at all (in which case, I assume you just fail with EIO or something?). If you want that to perform well, you had better not be doing one-thread-per-operation, or you will either spawn way too many threads, or take far too long to accomplish anything. Both of those are bad, so now we're in the realm of async networking and io_uring, or at least the realm of thread pools and epoll, so no matter what, you're going to be doing some sort of async, event-driven programming. There's no way for the kernel to provide the illusion of synchronous networking short of actually doing things synchronously, and that just doesn't scale in any kind of reasonable way. I dunno, maybe your language/OS can fake it using async/await and such? But that's still event-driven programming, it's just dressed up to look like it's synchronous (and the kernel has no part in this chicanery anyway).

Even getting to (4) is not really enough to have any assurances, because you have to treat every element of the system as unreliable, including the remote host. The peer could still crash after (4) and never see your data (because it was sitting in the receive queue). Until you see (5), you simply cannot know whether the peer received or will receive the data. And the kernel can't figure out whether (5) has happened for you, because (5) is application-specific.

Zero-copy network transmission with io_uring

Posted Jan 4, 2022 11:19 UTC (Tue) by farnz (subscriber, #17727) [Link] (7 responses)

Getting to 4 is, however, useful in terms of RAII or other stack-based buffer management. Until you get to 4, the kernel may need access to the data again so that it can resend if not ACKed; once you get to 4, the kernel will never look at the buffer again, even if the remote application doesn't receive the data.

Basically, getting to 1 is the minimum useful state, but it makes zero-copy hard, because I now have to keep the buffer alive in user-space until the kernel gets to 4. Getting to 4 or clearly indicating that we will never get to 4 is useful because it means that when send returns, the kernel is promising to never look at the buffer again.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 4:37 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (6 responses)

In order to get to 4, you must first get to 2, and as I explained, there's no way to make that perform acceptably in the general case, so you might as well give up on zero-copy and go with regular old blocking, copying send(2) instead.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 9:39 UTC (Wed) by farnz (subscriber, #17727) [Link] (4 responses)

You've explained why it's inadequate in terms of CPU time given a large number of clients, but not in terms of memory usage, nor in terms of small servers handling a few tens of clients at peak; different optimization targets for different needs.

For the general case, io_uring and async is the "best" option, but it brings in a lot of complexity managing the state machines in user code rather than simply relying on thread per client. Zero-copy reduces memory demand as compared to current send syscalls, and having a way to do simple buffer management would be useful for the subset of systems that don't actually care about CPU load that much, don't have many clients at a time to multiplex (hence not many threads), but do want a simple "one thread per client" model that avoids cross-thread synchronisation fun.

Not everything is a Google/Facebook/Netflix level problem, and on embedded systems I've worked on, a zero-copy blocking until ACK send would have made the code smaller and simpler; we emulated it in userspace via mutexes, but that's not exactly a high performance option.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 9:46 UTC (Wed) by NYKevin (subscriber, #129325) [Link] (3 responses)

> but not in terms of memory usage,

Each thread has a stack, whose default size seems to be measured in megabytes (by cursory Googling, anyway). If you spawn way too many threads, you are going to use way too much memory just allocating all of those stacks.

> nor in terms of small servers handling a few tens of clients at peak

I find it difficult to believe that blocking send(2) is too slow yet you only have tens of clients. You are well off the beaten path if that's really the shape of your problem. So I guess you get to build your own solution.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 10:06 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

Setting stack sizes is trivial to do - you don't have to stick to the default, and when you're on a small system, you do tune the stack size down to a sensible level for your memory. Plus, those megabytes are VA space, not physical memory; there's no problem having a machine with 256 MiB physical RAM, no swap and 16 GiB of VA space allocated, as long as you don't actually try to use all your VA space.

And you're focusing on speed again, not simplicity of programming a low memory usage system; I want to be able to call send, have the kernel not need to double my buffer (typically an entire compressed video frame in the application I was working on) by copying it into kernel space, and then poll the kernel until the IP stack has actually sent the data. I want to be able to call send, and know that when it returns, the video frame has been sent on the wire, and I can safely reuse the buffer for the next encoded frame.

It's not that send is slow - it's that doing a good job of keeping memory use down on a system with reasonable CPU (in order to keep the final BOM down while still having enough grunt to encode video) requires co-operation. And I can (and did) emulate this by using zero-copy send and mutexes in userspace, but it's not exactly easy to maintain code - and yet it's just doing stuff that the kernel already knows how to do well.

Zero-copy network transmission with io_uring

Posted Jan 6, 2022 6:47 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

> And I can (and did) emulate this by using zero-copy send and mutexes in userspace, but it's not exactly easy to maintain code - and yet it's just doing stuff that the kernel already knows how to do well.

I don't see why you would need to use mutexes. The sample code in https://www.kernel.org/doc/html/v4.15/networking/msg_zero... uses poll(2) to wait for the send to complete, and I tend to assume you could also use select(2) or epoll(2) instead if you find those easier or more familiar (until now, I had never heard of poll). Just write a five-line wrapper function that calls send(2) and then waits on the socket with one of those syscalls, and (as far as I can tell) you should be good to go.

Frankly, the kernel is not a library. If we *really* need to have a wrapper function for this, it ought to live in libc, not in the kernel.

Zero-copy network transmission with io_uring

Posted Jan 6, 2022 10:24 UTC (Thu) by farnz (subscriber, #17727) [Link]

I had to use mutexes because of the way the rest of the application was structured - the socket was already being polled elsewhere via epoll edge triggered (not my decision, and in a bit of code I had no control over), and I needed to somehow feed the notifications from the main loop to the client threads.

It was not a nice project to work on, and the moment I got the chance, I rewrote it as a single-threaded application that used less CPU (but same amount of memory) and was much easier to read. Unfortunately, this meant arguing with management that the "framework" from the contracted developers wasn't worth the money they'd paid for it.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 13:41 UTC (Wed) by foom (subscriber, #14868) [Link]

It's not the time it blocks that's gonna be the problem, but the impact on buffering.

Waiting until step 4 effectively removes local kernel buffering -- when one send completes there must be no further data available to send, thus outgoing packets will pause. And, like TCP_NODELAY, would cause extra partially filled packets to be sent at the tail of each send. If you're trying to send a stream of data as fast as the network allowed, this will all be counterproductive, unless all the data is provided in a single send. Sometimes that may be possible, but it seems like a very limited use case.

And, if your primary goal is not scalability or cpu, but rather memory usage, it seems like it'd be far simpler and just as effective to reduce SO_SNDBUF and use a regular blocking send loop.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 20:27 UTC (Sat) by jafd (subscriber, #129642) [Link] (3 responses)

There is a small problem with this "synchronous send" idea. How do you define "sent"? Does it get sent as soon as it reaches the kernel's buffer? Or does it get sent as soon as it reaches the NIC? Or does it get sent as soon as the NIC has put it onto the wire? Note that the NIC has its own buffers, too. Or maybe when it reaches the nearest router? Or as soon as the remote side acknowledges receipt? But what about raw IP, UDP, and who knows what other stateless protocols?

This is a small problem, but it sure opens a huge can of worms.

Years ago I wanted this too. But then I read more about this problem (in the context of a rather good article about buffer bloat and how it actually harms both disk and network performance, who knows if it wasn't here at LWN, years ago), and I don't want it anymore. We can only sensibly agree that the packet is sent as soon as the postman (the kernel) found no immediate problems with it and took it for processing.

Zero-copy network transmission with io_uring

Posted Jan 1, 2022 23:38 UTC (Sat) by james (subscriber, #1325) [Link] (1 responses)

Or as soon as the remote side acknowledges receipt?

And "acknowledgement" can happen at different levels. For example, a mail server might indicate that TCP packets containing an email have been received, but you shouldn't consider that email "sent" until the receiving side sends a "250 OK" message (which might be after SpamAssassin has run, so potentially many milliseconds later).

Zero-copy network transmission with io_uring

Posted Jan 3, 2022 8:25 UTC (Mon) by taladar (subscriber, #68407) [Link]

While that is true the kernel can not really recover from a high level rejection of the message using just the buffer content anyway.

Zero-copy network transmission with io_uring

Posted Jan 2, 2022 9:56 UTC (Sun) by smurf (subscriber, #17840) [Link]

That's easy: as soon as the kernel is sure it doesn't need the buffer any more, i.e. when the TCP ACK has arrived for its last byte, or when the network interface tells the kernel that it has transmitted the data (UDP).

There is no kernel buffer. The whole point of this is to enable Zero Copy.

Zero-copy network transmission with io_uring

Posted Jan 2, 2022 7:59 UTC (Sun) by jezuch (subscriber, #52988) [Link] (1 responses)

Yes, but we're talking about io_uring and managed buffers. Ideally the user space would not touch the buffers at all, there would just be an op that says "gimme a (handle to a) buffer, any buffer, I can write to", give it to the device that fills it, then pass it along and never think of it again. The ring would handle the completion signals and return the buffer to the pool when it's really, truly done with it. This way the buffer would be "attached" not to the read or write, but to the entire sequence of operations. I think the media subsystem has something similar.

My guess is I don't really understand how managed buffers work :)

(The difficulty then is in making sure that the userspace does not reuse the buffer while it's "busy". Rust's ownership model looks like it was designed just with this in mind ;) )

Zero-copy network transmission with io_uring

Posted Jan 2, 2022 9:32 UTC (Sun) by Sesse (subscriber, #53779) [Link]

The mmap packet socket has something like this for raw packets.

Zero-copy network transmission with io_uring

Posted Dec 31, 2021 20:01 UTC (Fri) by MattBBaker (guest, #28651) [Link]

It's a matter of application latency and reasoning about how the application behaves. If the sent data is small enough you can get better application performance by copying the data into the kernel and giving the application the completed signal, then let the buffer work through the network stack at the kernel's pace.
There is also a problem in that the logic gets a lot harder if the application has to manage separate "message send begin" and "message send complete" signals. People have a hard enough time writing network programs that don't deadlock and zero copy makes it more difficult. So it makes a lot of sense to write an API where the simple and easy case is default, and if they need the power tools they have to ask.

Improvement percentage: ??

Posted Dec 31, 2021 17:17 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

There is no such thing as "200% better" time. If an operation takes a second, "100% better" means that it now completes instantly, thus with 200% the operation completes a second before it's started. Congratulations, you have invented a time machine.

While you can be 200% worse, i.e. take three times as long, the inverse of that would be 67% better.

Improvement percentage: ??

Posted Dec 31, 2021 18:03 UTC (Fri) by corbet (editor, #1) [Link]

I didn't write "better time", I wrote "perform more than 200% better". The amount of data transmitted per unit time can indeed improve by 200%.

Zero-copy network transmission with io_uring

Posted Jan 6, 2022 9:13 UTC (Thu) by developer122 (guest, #152928) [Link]

I'm really starting to warm to the concept of an exokernel as a high-performance solution. Just do away with all the kernel abstractions and system calls and even the message passing of a microkernel.

Two threads, one with access to the networking hardware and one from the application, communicating simultaneously through a lockless ring buffer in shared memory. If you ignore the fact that the former here is in kernelspace (in an exokernel userspace processes can be given direct hardware access) then this starts to look like that a lot.

Of course, in an exokernel you don't need the first thread that specializes in handling the network hardware if your webserver thread knows how to do it. Real examples exist.

Zero-copy network transmission with io_uring

Posted Jan 7, 2022 12:28 UTC (Fri) by kpfleming (subscriber, #23250) [Link]

'meticulously undocumented' has to be my favorite Corbet-ism. It conjures a mental image of a developer spending hours and hours poring over the patches to ensure that no accidental documentation has leaked through. Thanks :-)

Zero-copy network transmission with io_uring

Posted Jan 11, 2022 13:19 UTC (Tue) by al4711 (subscriber, #57932) [Link] (5 responses)

As we see today almost everything switch to encrypted communication how does fit this zero-copy concept with encryption and decryption traffic?

Zero-copy network transmission with io_uring

Posted Jan 11, 2022 14:26 UTC (Tue) by smurf (subscriber, #17840) [Link] (3 responses)

Simple. You don't need to copy the encrypted data.

What exactly is the question?

Zero-copy network transmission with io_uring

Posted Jan 12, 2022 12:25 UTC (Wed) by al4711 (subscriber, #57932) [Link] (2 responses)

> What exactly is the question?

My question is what's the benefit of zero-copy data when the decrypt/encrypt step is in between.

Maybe I misunderstand the benefit, so please let me draw a picture.

client -> data -> nic -> kernel -> reading data and write data to nic buffer -> client

When we look now into the decrypt/encrypt step is this my understanding.

client -> data -> nic -> kernel -> server reading data -> decrypt/encrypt -> write data to nic buffer -> client

Could the ktls help in this case?

Zero-copy network transmission with io_uring

Posted Jan 13, 2022 1:40 UTC (Thu) by neilbrown (subscriber, #359) [Link]

> My question is what's the benefit of zero-copy data when the decrypt/encrypt step is in between.

"Zero copy" is a marketing term. A more accurate term would be "reduced copy".
You might image an naive protocol stack where a copy happens when moving from each level to the next. Then the data is copied onto the network fabric, copied off into the destination, and copied back up the stack.

At any stage there is a potential benefit in avoiding the copy (and also a cost, so small messages are likely to be copied anyway).

Encrypt/decrypt may require a copy that would not otherwise be needed - though it may be possible to encrypt-in-place or encrypt-and-copy for one of the unavoidable copies (like copying onto the networking fabric). But that doesn't mean there aren't opportunities to reduce copying when encryption is used.

And also, encryption is not always used, even though it should always be available. On the open Internet, or in the public cloud, encryption is a must-have. In a private machine-room with a private network, there is minimal value in encryption, and there may be great value in reducing latency. In that case, it may be possible and beneficial to eliminate all the memory-to-memory copies ... particularly when an RDMA network fabric is used which allows the receiver to tell the sender when in memory to place different parts on an incoming message.

Zero-copy network transmission with io_uring

Posted Jan 13, 2022 13:55 UTC (Thu) by farnz (subscriber, #17727) [Link]

This does also reduce the number of copies when using kTLS. "Zero copy" is a bit of a misnomer - it's only there to eliminate memcpys from user owned memory to kernel owned memory, not all copies.

The point of "zero copy" is that in a normal transfer, data is copied from the user buffer to a kernel buffer, then the network card does DMA from the kernel buffer to its own transmit buffer. "zero copy" reduces that to a copy from the user buffer to the NIC's transmit buffer.

With kTLS, "zero copy" is a win with or without expensive NICs:

With expensive NICs, the NIC can do the encryption during DMA from CPU memory to the transmit buffer. You thus avoid copying the data into the kernel, and just have the NIC read and encrypt during DMA.
With cheap NICs, the kernel has to do a copy. Without zero copy, it copies plain text from user buffer to kernel buffer and then encrypts from kernel buffer to network buffer. With zero copy, it encrypts from user buffer to network buffer. In either case, the NIC will then DMA the network buffer into the on-chip transmit buffer.

Zero-copy network transmission with io_uring

Posted Jan 11, 2022 17:09 UTC (Tue) by farnz (subscriber, #17727) [Link]

Two things here:

Available to everyone; in the process of encrypting using the CPU, I end up with a copy of the data in encrypted form. I don't need to copy that again into kernel buffers, I can just point the kernel at the already encrypted data. Same reasoning applies if I'm not processing the plain text, just forwarding pre-encrypted data (E2E use cases).
For places with big pockets; there exist expensive network cards from brands like Mellanox capable of doing encryption as part of the scatter-gather DMA to the card. kTLS and IPSec both take advantage of this to have the network card encrypt the payload as it copies it out of main memory and onto the wire. This means that I can use kTLS or IPSec and have the communication encrypted on the wire without using CPU time to do the crypto; if I don't have one of the fancy cards, then the kernel encrypts for me, directly copying from my userspace buffer to the kernel's encrypted send buffer.

Zero-copy network transmission with io_uring

Posted Jan 13, 2022 9:56 UTC (Thu) by Lawless-M (guest, #155377) [Link]

> io_uring's zero-copy operations can perform more than 200% better than MSG_ZEROCOPY.

The maximum speedup posted was 2.27x which is 127% more than MSG_ZEROCOPY.

200% more would be 3x speedup

Zero-copy network transmission with io_uring

Posted Jan 31, 2022 5:59 UTC (Mon) by eximius (guest, #124510) [Link]

What is the reason behind the hoop jumping with extra notifications and generations and userspace informing the kernel when it can move on to next generations?

With the completion notifications, it would have seemed like the simplest, most resistant to misuse API would be to delay the completion notification until it was *actually* done - which is what we wanted? (Unless there is some semantic meaning worth differentiating between the two that I'm missing.)