
Zero-copy network transmission with io_uring

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 9:39 UTC (Wed) by farnz (subscriber, #17727)
In reply to: Zero-copy network transmission with io_uring by NYKevin
Parent article: Zero-copy network transmission with io_uring

You've explained why it's inadequate in terms of CPU time given a large number of clients, but not in terms of memory usage, nor in terms of small servers handling a few tens of clients at peak; different optimization targets for different needs.

For the general case, io_uring and async is the "best" option, but it brings in a lot of complexity: the state machines must be managed in user code rather than simply relying on a thread per client. Zero-copy reduces memory demand compared to the current send syscalls, and a way to do simple buffer management would be useful for the subset of systems that don't care much about CPU load and don't have many clients to multiplex at a time (hence not many threads), but do want a simple "one thread per client" model that avoids cross-thread synchronisation fun.

Not everything is a Google/Facebook/Netflix-level problem, and on embedded systems I've worked on, a zero-copy send that blocks until the data is ACKed would have made the code smaller and simpler; we emulated it in userspace via mutexes, but that's not exactly a high-performance option.



Zero-copy network transmission with io_uring

Posted Jan 5, 2022 9:46 UTC (Wed) by NYKevin (subscriber, #129325)

> but not in terms of memory usage,

Each thread has a stack, whose default size seems to be measured in megabytes (by cursory Googling, anyway). If you spawn way too many threads, you are going to use way too much memory just allocating all of those stacks.

> nor in terms of small servers handling a few tens of clients at peak

I find it difficult to believe that blocking send(2) is too slow yet you only have tens of clients. You are well off the beaten path if that's really the shape of your problem. So I guess you get to build your own solution.

Zero-copy network transmission with io_uring

Posted Jan 5, 2022 10:06 UTC (Wed) by farnz (subscriber, #17727)

Setting stack sizes is trivial to do - you don't have to stick to the default, and when you're on a small system, you do tune the stack size down to a sensible level for your memory. Plus, those megabytes are VA space, not physical memory; there's no problem having a machine with 256 MiB physical RAM, no swap and 16 GiB of VA space allocated, as long as you don't actually try to use all your VA space.

And you're focusing on speed again, not on how simple it is to program a low-memory-usage system. I want to be able to call send without the kernel needing to double my buffer (typically an entire compressed video frame in the application I was working on) by copying it into kernel space, and without then having to poll the kernel until the IP stack has actually sent the data. I want to be able to call send and know that when it returns, the video frame has been sent on the wire and I can safely reuse the buffer for the next encoded frame.

It's not that send is slow - it's that doing a good job of keeping memory use down on a system with reasonable CPU (in order to keep the final BOM down while still having enough grunt to encode video) requires co-operation. And I can (and did) emulate this by using zero-copy send and mutexes in userspace, but it's not exactly easy to maintain code - and yet it's just doing stuff that the kernel already knows how to do well.

Zero-copy network transmission with io_uring

Posted Jan 6, 2022 6:47 UTC (Thu) by NYKevin (subscriber, #129325)

> And I can (and did) emulate this by using zero-copy send and mutexes in userspace, but it's not exactly easy to maintain code - and yet it's just doing stuff that the kernel already knows how to do well.

I don't see why you would need to use mutexes. The sample code in https://www.kernel.org/doc/html/v4.15/networking/msg_zero... uses poll(2) to wait for the send to complete, and I tend to assume you could also use select(2) or epoll(2) instead if you find those easier or more familiar (until now, I had never heard of poll). Just write a five-line wrapper function that calls send(2) and then waits on the socket with one of those syscalls, and (as far as I can tell) you should be good to go.

Frankly, the kernel is not a library. If we *really* need to have a wrapper function for this, it ought to live in libc, not in the kernel.

Zero-copy network transmission with io_uring

Posted Jan 6, 2022 10:24 UTC (Thu) by farnz (subscriber, #17727)

I had to use mutexes because of the way the rest of the application was structured: the socket was already being polled elsewhere via edge-triggered epoll (not my decision, and in a bit of code I had no control over), and I needed some way to feed the notifications from the main loop to the client threads.
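The mutex-based hand-off described here can be sketched roughly as follows. This is a hypothetical reconstruction, not the project's code: `struct zc_board`, `board_complete`, `board_wait` and `fake_event_loop` are invented names, and the epoll thread is assumed to already turn each `MSG_ERRQUEUE` notification into an inclusive `[lo, hi]` id range for the right socket. Counting completed sends, rather than tracking the highest id seen, avoids assuming that notifications arrive in order.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-socket completion board: the epoll thread (which owns
 * the socket's error queue) records completions, and the client thread
 * that issued the sends sleeps on the condition variable until they land. */
struct zc_board {
    pthread_mutex_t lock;
    pthread_cond_t changed;
    uint64_t completed;  /* total zerocopy sends reported complete so far */
};

static void board_init(struct zc_board *b)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->changed, NULL);
    b->completed = 0;
}

/* Called from the event loop for each notification range [lo, hi]. */
static void board_complete(struct zc_board *b, uint32_t lo, uint32_t hi)
{
    pthread_mutex_lock(&b->lock);
    b->completed += (uint64_t)(hi - lo) + 1;
    pthread_cond_broadcast(&b->changed);
    pthread_mutex_unlock(&b->lock);
}

/* Called from a client thread: block until `issued` sends have completed. */
static void board_wait(struct zc_board *b, uint64_t issued)
{
    pthread_mutex_lock(&b->lock);
    while (b->completed < issued)  /* loop also handles spurious wakeups */
        pthread_cond_wait(&b->changed, &b->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Stand-in for the epoll thread: reports sends 0..2, then 3, as complete. */
static void *fake_event_loop(void *arg)
{
    struct zc_board *b = arg;

    assert(b != NULL);
    board_complete(b, 0, 2);
    board_complete(b, 3, 3);
    return NULL;
}
```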

It was not a nice project to work on, and the moment I got the chance, I rewrote it as a single-threaded application that used less CPU (but the same amount of memory) and was much easier to read. Unfortunately, this meant arguing with management that the "framework" from the contracted developers wasn't worth the money they'd paid for it.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds