LWN: Comments on "Zero-copy networking" https://lwn.net/Articles/726917/ This is a special feed containing comments posted to the individual LWN article titled "Zero-copy networking". en-us Sat, 08 Nov 2025 14:56:28 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Zero-copy networking https://lwn.net/Articles/752754/ https://lwn.net/Articles/752754/ mikemol <div class="FormattedComment"> I think zero-copy should be possible with a sockets API.<br> <p> You still use the ring buffer, but you only drain the buffer when there are read() calls on the socket. Send ACKs only for the packets whose data have been consumed by read() calls.<br> <p> This further allows you to combat bufferbloat at the application layer.<br> </div> Wed, 25 Apr 2018 14:48:57 +0000 Zero-copy networking https://lwn.net/Articles/741880/ https://lwn.net/Articles/741880/ immibis <div class="FormattedComment"> The Windows I/O stack has always felt light-years ahead of Linux to me.<br> <p> Just because several components are tied together in one product with annoying marketing doesn't mean they all suck. I'm sure the kernel I/O developers had nothing to do with the start screen.<br> </div> Tue, 19 Dec 2017 22:23:56 +0000 Zero-copy networking https://lwn.net/Articles/734411/ https://lwn.net/Articles/734411/ njs <div class="FormattedComment"> <font class="QuotedText">&gt; Last I checked it wasn't even possible for user space to tell the kernel: "I'm expecting to recv() 2kbytes, don't wake me up and waste our time until you got at least that".</font><br> <p> recv(2) says it was added in Linux 2.2 :-) (see MSG_WAITALL)<br> <p> Unfortunately this doesn't help if you're using non-blocking sockets.<br> </div> Thu, 21 Sep 2017 07:38:04 +0000 Zero-copy networking https://lwn.net/Articles/727788/ https://lwn.net/Articles/727788/ Wol <div class="FormattedComment"> Transistor miniaturisation has pretty much peaked ... 
The width of a conducting line, and of the insulator between lines, can now be measured in atoms, and it's not many of them ...<br> <p> The biggest problem with shrinking dies further is now quantum leakage, which will only grow. In order to make chips faster, designers now have to contend seriously with the speed of light, which is why chips are spreading more and more into the realms of 3d.<br> <p> Cheers,<br> Wol<br> </div> Thu, 13 Jul 2017 14:05:26 +0000 Zero-copy networking https://lwn.net/Articles/727529/ https://lwn.net/Articles/727529/ abo <div class="FormattedComment"> Netflix does have a zero-copy sendfile() in their FreeBSD-based CDN. They also (partially) implemented TLS in the kernel.<br> <p> <a href="https://medium.com/netflix-techblog/protecting-netflix-viewing-privacy-at-scale-39c675d88f45">https://medium.com/netflix-techblog/protecting-netflix-vi...</a><br> <p> </div> Tue, 11 Jul 2017 10:47:01 +0000 Zero-copy networking https://lwn.net/Articles/727513/ https://lwn.net/Articles/727513/ klempner <div class="FormattedComment"> The entire point of MSG_ZEROCOPY is the notification that the kernel is done, so the memory can be unpinned and potentially freed/reused. HAProxy's socket-to-pipe-to-socket splice() doesn't have that problem, because in that case the memory never has a userspace address.<br> <p> The fundamental problem with application-to-vmsplice-pipe-to-TCP-socket is that you don't know when the pages in question are done and can be freed/modified, and if you modify them you're liable to start, say, leaking crypto keys out to the network if that memory gets reused before the TCP stack is done with it.<br> <p> </div> Tue, 11 Jul 2017 00:00:12 +0000 fd <-> memory https://lwn.net/Articles/727335/ https://lwn.net/Articles/727335/ Jandar <div class="FormattedComment"> You could open /dev/zero and mmap it MAP_PRIVATE. 
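A minimal sketch of that setup, in Python for illustration only (an assumption of this sketch, not from the comment: writes to a MAP_PRIVATE mapping land in private copy-on-write pages and are never written back to /dev/zero itself):

```python
# Sketch of the /dev/zero + MAP_PRIVATE idea: map the zero device
# privately, then build content directly in the mapping.
import mmap
import os

fd = os.open("/dev/zero", os.O_RDWR)
buf = mmap.mmap(fd, 4096, flags=mmap.MAP_PRIVATE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE)
os.close(fd)  # the mapping stays valid after the fd is closed

buf[:13] = b"hello, world!"  # content created in place, no extra copy
```
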
This wouldn't give you an fd corresponding to an arbitrary memory region, but you could set up this memory in advance and create the content in it. Zero-copy with sendfile should be possible with this.<br> </div> Thu, 06 Jul 2017 22:16:49 +0000 Zero-copy networking https://lwn.net/Articles/727333/ https://lwn.net/Articles/727333/ wtarreau <div class="FormattedComment"> It has even worked for userland for a while. HAProxy successfully makes use of splice() to perform zero-copy transfers between sockets (receive AND send).<br> <p> Also it seems to me that this send(MSG_ZEROCOPY) is not much different from doing a vmsplice().<br> <p> </div> Thu, 06 Jul 2017 21:52:35 +0000 Zero-copy networking https://lwn.net/Articles/727321/ https://lwn.net/Articles/727321/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; In this case by the time the program is woken up trailing packets have already arrived.</font><br> <p> This fully meets the definition of a race condition.<br> <p> On one hand there's the performance question, which you addressed. On the other hand there's the fact that some applications don't bother coding a retry loop around the recv() call. Combine this with the race condition above and let the fun begin.<br> <p> </div> Thu, 06 Jul 2017 17:43:58 +0000 Zero-copy networking https://lwn.net/Articles/727317/ https://lwn.net/Articles/727317/ sorokin <div class="FormattedComment"> <font class="QuotedText">&gt; Last I checked it wasn't even possible for user space to tell the kernel: "I'm expecting to recv() 2kbytes, don't wake me up and waste our time until you got at least that".</font><br> <p> Although I agree that having this type of function in the kernel would be good, I believe its benefits are small compared to what we have now. Let me explain. There are two options: either the chunk of data you expect is small, or it is big.<br> <p> In case it is small. 
It is likely to fit in one packet, or a few packets with a small interval between them. In this case by the time the program is woken up trailing packets have already arrived.<br> <p> In case it is big. The program is unlikely to buffer all the data in memory. Instead it is likely to write this data to disk or handle it somehow as it is coming. And in this case waking up the program is what you want.<br> <p> So the only use case for this function would be a program that receives a large amount of data, buffering it in memory until the last byte arrives.<br> <p> </div> Thu, 06 Jul 2017 17:32:55 +0000 Zero-copy networking https://lwn.net/Articles/727267/ https://lwn.net/Articles/727267/ paulj <div class="FormattedComment"> IP fragments are how some applications send larger-than-MTU application-layer messages.<br> </div> Thu, 06 Jul 2017 10:42:29 +0000 Zero-copy networking https://lwn.net/Articles/727256/ https://lwn.net/Articles/727256/ ncm <div class="FormattedComment"> I have seen IP fragmentation in the wild.<br> </div> Thu, 06 Jul 2017 03:56:01 +0000 Zero-copy networking https://lwn.net/Articles/727240/ https://lwn.net/Articles/727240/ vomlehn <div class="FormattedComment"> It's still possible to put a sockets API on such a ring-buffer-based interface, but any approach I know of will then require, for receive, a copy out of the ring buffer. I think that, ultimately, we need to recognize that we need a new set of APIs for zero-copy operations. You use those if you want maximum performance; otherwise you use the familiar sockets interface, which is layered on the zero-copy APIs.<br> <p> By "new", I just mean new for Linux. There are candidate APIs out there that should be considered. The HPC guys are kind of nuts about shipping data, so they have at least one possibility, though I haven't used them. 
I've heard comments implying that their environment doesn't mandate the level of security a more general Linux deployment might need, but this is, again, hearsay.<br> </div> Thu, 06 Jul 2017 01:10:53 +0000 Zero-copy networking https://lwn.net/Articles/727225/ https://lwn.net/Articles/727225/ k8to <div class="FormattedComment"> Async I/O a la completion ports shouldn't be news to anyone, given that this was the standard means of doing I/O on VMS in the 1980s. The downsides are that it's trickier to get right (there's a lot more opportunity for creating bugs in the application code), and that reaping the benefits means receiving data in an interface that looks essentially nothing like sockets.<br> <p> Those are prices that must be paid for completely maximizing throughput in high-transaction-count scenarios, but they're an awkward price for most users.<br> <p> The registered I/O tweak is relatively recent and somewhat informative, however, driven as it is by modern hardware requirements instead of ancient design concerns.<br> </div> Wed, 05 Jul 2017 21:27:35 +0000 Zero-copy networking https://lwn.net/Articles/727157/ https://lwn.net/Articles/727157/ clameter <div class="FormattedComment"> Linux has had zero-copy networking for more than 10 years. Use the RDMA subsystem to send messages. The RDMA subsystem can even receive messages(!!!). The RDMA subsystem can not only perform remote DMA operations but also send and receive datagrams.<br> <p> </div> Wed, 05 Jul 2017 14:35:52 +0000 Zero-copy networking https://lwn.net/Articles/727149/ https://lwn.net/Articles/727149/ abatters <div class="FormattedComment"> I don't know Windows, but this is exactly what I was thinking. Register the buffers ahead of time to save the overhead of pinning and unpinning the pages over and over again. 
See also: SCSI generic (drivers/scsi/sg.c) mmap()ed I/O.<br> </div> Wed, 05 Jul 2017 13:17:23 +0000 Zero-copy networking https://lwn.net/Articles/727143/ https://lwn.net/Articles/727143/ marcH <div class="FormattedComment"> <font class="QuotedText">&gt; Reading is inherently harder because it is not generally known where a packet is headed when the network interface receives it.</font><br> <p> Interrupts are another reason why receiving is "harder"/more expensive than sending. As usual, predicting the future is very difficult.<br> <p> <font class="QuotedText">&gt; With remote DMA, the sender of the packet puts the memory address of the receiver's buffer into the packet, and the receiving interface just writes it there. This requires that the receiver has told the sender in advance...</font><br> <p> For high performance, not just RDMA but message-passing interfaces also involve a certain amount of "telling in advance", at least for large messages.<br> <p> The socket API sucks, really not designed for high performance: "I'm receiving, send me whatever whenever and I'll manage somehow!" Very slowly.<br> <p> Last I checked it wasn't even possible for user space to tell the kernel: "I'm expecting to recv() 2kbytes, don't wake me up and waste our time until you got at least that". BTW I suspect many applications would break if IP fragmentation was real.<br> </div> Wed, 05 Jul 2017 06:37:40 +0000 Zero-copy networking https://lwn.net/Articles/727137/ https://lwn.net/Articles/727137/ gdt <p>5% gains are worthwhile. CPUs get faster by about 5% a year from process improvements alone. So a 5% gain obtained from software can delay a hardware upgrade by a year. For which you can calculate a TCO value.</p> <p>The more worrying thought is that semiconductor process miniaturisation seems to be coming to an end: each node shrink arrives successively later. At some point in the late 2020s, the major source of throughput improvement will move to software. 
I.e., assume a 5nm process is economic around 2024 and allow two generations of CPU design tuning; after that, the "CPU" barrel is empty of economic performance improvements. We can still look to other hardware components for performance enhancement -- particularly in the way hardware hangs together -- but such system-level design change usually requires enabling software in the operating system and in key applications. In short, this patch is a look into the future of computer performance.</p> Wed, 05 Jul 2017 03:47:23 +0000 Zero-copy networking https://lwn.net/Articles/727130/ https://lwn.net/Articles/727130/ ncm <div class="FormattedComment"> It is legitimate to bring up numbers, but I doubt that the monetary value of performance is measured by TCO alone, or by average CPU utilization. There are service-level agreements, and peak-traffic latency, and potential income. How many ads can be delivered, and displayed for the required amount of time before the user clicks past, at 5:30 PM? How frequently does the streaming video skip, at 6:30 PM, and how does that affect subscription renewal rates? How many product pages can be presented per potential customer per minute, and how does the conversion rate (from looking to spending money) vary with how long they have to wait to see the next page?<br> <p> </div> Tue, 04 Jul 2017 21:46:15 +0000 Zero-copy networking https://lwn.net/Articles/727121/ https://lwn.net/Articles/727121/ einstein <div class="FormattedComment"> Good point - in this case, Linux devs ought to look at borrowing worthwhile features even from crappy OSes.<br> </div> Tue, 04 Jul 2017 18:49:08 +0000 Zero-copy networking https://lwn.net/Articles/727112/ https://lwn.net/Articles/727112/ jhoblitt <div class="FormattedComment"> I have heard that the CPU utilization on megasize infrastructures averages around 5%. I don't have a citation for that, and have no idea how close that number is to reality. 
However, if that is in the ballpark, a 5% syscall savings probably won't have a major TCO impact for most applications. <br> </div> Tue, 04 Jul 2017 17:55:38 +0000 Zero-copy networking https://lwn.net/Articles/727067/ https://lwn.net/Articles/727067/ sorokin <div class="FormattedComment"> Linux people should keep an eye on what Microsoft does. Microsoft has had zero-copy networking for ages. I believe they had had I/O completion ports (queues onto which notifications that an I/O operation has completed are pushed) even before Linux got epoll support.<br> <p> At Microsoft they realized that locking pages in memory is expensive. And in Windows 8 they came up with an API called Registered I/O. It requires an application to register buffers in advance. Then I/O operations just use these registered buffers and therefore don't need to lock any pages.<br> <p> I believe Linux kernel people should skip designing zero-copy operations altogether and just implement Registered I/O.<br> </div> Tue, 04 Jul 2017 12:18:10 +0000 Zero-copy networking https://lwn.net/Articles/727065/ https://lwn.net/Articles/727065/ Cyberax <div class="FormattedComment"> That's pretty much how userspace network stacks work these days. To replicate it, a client will have to set up some kind of a lockless ring buffer and a way for the kernel to wake up the client. But the API won't look like sockets anymore.<br> </div> Tue, 04 Jul 2017 08:36:46 +0000 Zero-copy networking https://lwn.net/Articles/727063/ https://lwn.net/Articles/727063/ cladisch <div class="FormattedComment"> With remote DMA, the sender of the packet puts the memory address of the receiver's buffer into the packet, and the receiving interface just writes it there. 
This requires that the receiver has told the sender in advance what a valid buffer address would be (so this cannot work with most existing protocols), and this is, of course, completely unsafe if you cannot trust the sender.<br> <p> In theory, it would be possible to configure the interface to put packets from a specific source address and port and to a specific destination address and port into a specific buffer. The article mentions that such fancy interfaces might exist.<br> </div> Tue, 04 Jul 2017 08:00:46 +0000 Zero-copy networking https://lwn.net/Articles/727062/ https://lwn.net/Articles/727062/ liam <div class="FormattedComment"> Mmmmmm....memfd?<br> </div> Tue, 04 Jul 2017 07:15:59 +0000 Zero-copy networking https://lwn.net/Articles/727060/ https://lwn.net/Articles/727060/ lkundrak <div class="FormattedComment"> There is memfd_create(2).<br> </div> Tue, 04 Jul 2017 03:47:06 +0000 Zero-copy networking https://lwn.net/Articles/727056/ https://lwn.net/Articles/727056/ ncm <div class="FormattedComment"> The last time I discussed zero-copy networking with a kernel person (this was in the context of NetBSD's UVM, which worked automatically, given certain preconditions), it was explained to me that the benefit available plummets when you go from one to two-or-more CPUs. The unintuitive reason was that flipping the page tables, as UVM did, invalidates caches on other CPUs, which more than eats up any savings from not copying. Anyway, chip designers put enormous efforts into making sequential copying fast. It's the single most heavily optimized operation in any system, so it is stiff competition.<br> <p> So, this scheme, that requires just a little more cooperation from user space to avoid re-mappings, seems like the minimum needed to make the process useful at all. (That approximates the definition of good engineering.) A practical saving of 5% isn't much for most of us, but it would pay salaries of dozens of engineers at a Facebook or Netflix. 
Indeed, it would be negligent of each of them not to have implemented something like this, so I count it likely that many have cobbled-up versions of this in-house that work just well enough for their purposes. This work was done at Google; probably they, too, had a cobbled-up version that worked just well enough, which then took several times more engineering effort to clean up for publication and upstreaming. The negligent among the other big shops get the benefit for free, but only years later, and the rest (Google included) can delete their hard-to-maintain cobbled-up versions.<br> <p> Probably the biggest benefits of this work go to (often big) customers of small vendors who haven't the engineering staff to devote to faking it, and who in any case don't control what kernel their code runs on. The small shops can reasonably be asked to have their code use the feature when running on a kernel that supports it. Google, meanwhile, will have reasonable hope that, eventually, companies they buy already use the feature, and the code that comes with them doesn't need to be jimmied to use a cobbled-up, proprietary version.<br> </div> Tue, 04 Jul 2017 02:04:53 +0000 Zero-copy networking https://lwn.net/Articles/727055/ https://lwn.net/Articles/727055/ mikemol <div class="FormattedComment"> Also, wouldn't zero-copy reception effectively be RDMA?<br> </div> Tue, 04 Jul 2017 00:38:42 +0000 Zero-copy networking https://lwn.net/Articles/727054/ https://lwn.net/Articles/727054/ mikemol <div class="FormattedComment"> <font class="QuotedText">&gt; That works well when transmitting static data that is in the page cache, but it cannot be used to transmit data that does not come directly from a file. 
If, as is often the case, the data to be transmitted is the result of some sort of computation — the application of a template in a content-management system, for example — sendfile() cannot be used, and zero-copy operation is not available.</font><br> <p> Is there no mechanism which can be used to wrap a chunk of memory in a file descriptor? I would think that would be an incredibly useful and powerful abstraction. If I can wrap a file with a block of memory via mmap, why couldn't I do the reverse?<br> <p> That's not to say this work doesn't have value, but still...<br> </div> Tue, 04 Jul 2017 00:35:49 +0000