
Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 13:55 UTC (Fri) by martin.langhoff (subscriber, #61417)
Parent article: Mazzoli: How fast are Linux pipes anyway?

I'm a bit disappointed it takes so much kernel API abuse to get fast throughput.

A lot of the 'magic' of Linux is that it makes common operations very fast without the userland developer using esoteric APIs. open() and stat() are famously fast.

Why isn't write() fast?

Or – why is the throughput of a naive write() so radically slower than the optimized version, and why are the steps to optimize it so obscure and arcane?



Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 14:53 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

I guess because a naive read simply says "get me this block" and it passes down the various block layers easily. It may need expanding slightly if it hits things like raid, or luks, or integrity, but the basic tenet is just "get me this (list of) blocks".

When writing, luks/raid/integrity add far more complexity to the request.

But even if you're not taking that into account, reads are by their very nature synchronous - the program MUST wait until it gets the information it wants, while a write can wait. So the elevator algorithm prioritises reads over writes ...

And then there's buffer(/cache)-bloat ...

Cheers,
Wol

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 19:44 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Well, io_uring in principle allows us to pipeline reads and writes to arbitrary depths without creating new kernel threads for each read-to-write data hazard.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 20:30 UTC (Fri) by martin.langhoff (subscriber, #61417) [Link]

To be clear - we're talking about writing to a pipe, which is writing to memory. Not to disk.

I had (naively) imagined that with the right open() parameters, and with sufficiently large writes, things would happen automagically fast.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 4, 2022 13:50 UTC (Sat) by willy (subscriber, #9762) [Link]

This program is limited by the performance of memcpy() when calling write(). We necessarily have to copy to implement the semantics of write() and so we need to use system calls with different semantics to get better performance.

The particular semantic that vmsplice() has over write() is that modifications to the buffer after making the call are not allowed. Or rather, they will show up in the recipient when they ought not to. We can't optimise write() to a pipe like this, because we'd need to shoot down TLB entries to prevent subsequent modifications to the buffer from showing up. That's why the new syscall was needed.
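The copy semantics of write() described above can be observed from userspace. The following is a minimal sketch in Python (the function name `write_then_modify` is made up for illustration): it writes a buffer into a pipe, overwrites the buffer afterwards, and then reads from the pipe. Because write() copied the data at call time, the reader still sees the original bytes. With vmsplice() that guarantee would not hold, which is the trade-off being discussed.

```python
import os


def write_then_modify():
    """Write a buffer to a pipe, scribble over it, then read the pipe."""
    r, w = os.pipe()
    try:
        buf = bytearray(b"hello")
        os.write(w, buf)      # the kernel copies the 5 bytes into the pipe here
        buf[:] = b"XXXXX"     # later modification of the user buffer
        return os.read(r, len(buf))
    finally:
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print(write_then_modify())
```

The reader gets the data as it was when write() returned, so the writer is free to reuse its buffer immediately.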

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 4, 2022 14:19 UTC (Sat) by atnot (guest, #124910) [Link] (10 responses)

The core of the issue is really with the ownership of the buffers in the write() api.

write() takes the address of a buffer that the user owns. Because it can not make any assumptions about how long that data is going to remain valid after the syscall, it has no choice but to copy. Because the memory was allocated ahead of time by the user, it can't really make any smart decisions about page size either. Then on read, it has to do the same thing again.

I guess for pipes specifically, one could imagine a flag which would make write() block until there is a corresponding read() call on the other side of the pipe, which would eliminate one copy. But with differing buffer sizes, non-blocking IO and other considerations I presume that'd be a lot of complexity that's unlikely to be worth it, especially considering the fact that you apparently can't even compute fizzbuzz fast enough to completely saturate a pipe like this. [1]

In that sense I'd say that the premise is wrong and write() is actually already plenty fast enough by default.

[1] In scenarios where this does matter there are usually already plenty of better, specialized solutions like mmap, XDP, dpdk, netfilter, etc.
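For anyone who wants to check how fast plain write() to a pipe is on their own machine, here is a rough sketch in Python. The function name `pipe_throughput` and the 64 KiB chunk size are assumptions made for illustration, and interpreter overhead means the result is only a lower bound on what a C program would achieve.

```python
import os
import threading
import time


def pipe_throughput(total_bytes, chunk_size=1 << 16):
    """Push total_bytes through a pipe with plain write()/read().

    Returns (bytes_received, elapsed_seconds).
    """
    r, w = os.pipe()
    received = []

    def drain():
        # Reader side: consume until the write end is closed (EOF).
        count = 0
        while True:
            data = os.read(r, chunk_size)
            if not data:
                break
            count += len(data)
        received.append(count)

    reader = threading.Thread(target=drain)
    reader.start()

    view = memoryview(bytes(chunk_size))
    sent = 0
    start = time.perf_counter()
    while sent < total_bytes:
        # os.write returns the number of bytes actually written.
        sent += os.write(w, view[: total_bytes - sent])
    os.close(w)               # signals EOF to the reader
    reader.join()
    elapsed = time.perf_counter() - start
    os.close(r)
    return received[0], elapsed


if __name__ == "__main__":
    n, secs = pipe_throughput(1 << 28)
    print(f"{n / secs / 1e9:.2f} GB/s")
```

Each byte is copied into the pipe by write() and copied out again by read(), which are the two copies mentioned above.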

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 5, 2022 17:10 UTC (Sun) by Wol (subscriber, #4433) [Link] (9 responses)

> write() takes the address of a buffer that the user owns. Because it can not make any assumptions about how long that data is going to remain valid after the syscall, it has no choice but to copy.

What about COW? (Might be tricky I know, but just COW the page containing the buffer.)

Cheers,
Wol

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 5, 2022 18:58 UTC (Sun) by atnot (guest, #124910) [Link] (8 responses)

I don't think that would help. CoW is only faster if the memory isn't actually written to again. But applications don't really keep buffers around unmodified for posterity, nor would there be any real way for them to know when it's okay to reuse them. So in practice, you're likely to just end up with the page that contains the buffer being immediately written to again, at which point you're at the status quo again except with additional page faults and fragmentation. There's not really any way to solve this without deviating significantly from the write() API, which is probably one reason people keep inventing new ways of doing it.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 8, 2022 1:08 UTC (Wed) by willy (subscriber, #9762) [Link]

Worse than that, write() would have to invalidate the TLB entry for the page in question (in order to make COW work). TLB invalidations are slower than copies.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 5:09 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (6 responses)

Why can't we just mmap the pipe? It's ultimately "just" a buffer in kernelspace, is it really that hard to add a userspace mapping for it?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 5:12 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

(Clarifying: I am aware that you cannot mmap pipes. I'm asking why this restriction shouldn't/can't be lifted.)

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 21:07 UTC (Thu) by anton (subscriber, #25547) [Link]

Peter Syrowatka added mmap() for pipes in his Diplomarbeit (master's thesis, in German).

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 15:01 UTC (Thu) by willy (subscriber, #9762) [Link] (3 responses)

What does it mean to mmap() a pipe?

Let's suppose I have a pipefd and addr = mmap(offset=0, length=1M). Then I call read(4kB) on pipefd. Does the pipe shuffle down so that *addr now refers to what was at 4kB, or does it still have a reference to what was at 0 when I called mmap()?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:32 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

Whichever is easier. As long as it's consistent and well-documented, userspace can figure out the rest.

However, I should point out that, if the pipe does not shuffle down, then you need to add an API for telling userspace the current read/write offsets, so that userspace knows where to begin reading or writing. Regardless, you also want an API for setting (or at least advancing) those offsets, so that userspace can emulate read/write calls. Therefore, you might as well not bother with the shuffling-down logic and just provide full get/set support for the offsets.
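To make the "get/set support for the offsets" idea concrete, here is a purely hypothetical sketch in Python. No such kernel API exists. The class `MappedRing`, its header layout, and its methods are all invented for illustration, and an anonymous mmap stands in for the mapped pipe buffer. The two offsets are kept in a small header at the start of the mapping, and read/write are emulated by advancing them.

```python
import mmap
import struct

# Hypothetical header: read offset and write offset, as running byte counts.
HEADER = struct.Struct("<QQ")


class MappedRing:
    """Ring buffer in a mapped region, with the offsets visible in the mapping."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.mem = mmap.mmap(-1, HEADER.size + capacity)  # anonymous mapping
        HEADER.pack_into(self.mem, 0, 0, 0)

    def offsets(self):
        """Return (read_offset, write_offset)."""
        return HEADER.unpack_from(self.mem, 0)

    def write(self, data):
        """Copy as much of data as fits; return the number of bytes stored."""
        rd, wr = self.offsets()
        n = min(len(data), self.capacity - (wr - rd))
        for i in range(n):
            self.mem[HEADER.size + (wr + i) % self.capacity] = data[i]
        HEADER.pack_into(self.mem, 0, rd, wr + n)  # advance the write offset
        return n

    def read(self, size):
        """Return up to size bytes and advance the read offset."""
        rd, wr = self.offsets()
        n = min(size, wr - rd)
        out = bytes(
            self.mem[HEADER.size + (rd + i) % self.capacity] for i in range(n)
        )
        HEADER.pack_into(self.mem, 0, rd + n, wr)
        return out
```

This sketch has no synchronization between producer and consumer, so a real design would need atomic offset updates and some way to block or wake the other side.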

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:36 UTC (Fri) by willy (subscriber, #9762) [Link] (1 responses)

Haven't you just reinvented shared memory?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Well gee, I thought that's what vmsplice(2) was trying to do in the first place.

But I think the big difference is this: If you e.g. create a file in /dev/shm (or any tmpfs) and just keep writing more and more data to it, it'll get bigger and bigger indefinitely, so you have to periodically seek to zero at both ends (and/or truncate it). Pipes are effectively ring buffers, so they don't have this problem. Your consumer can just call read, and doesn't have to know anything about your fancy mmap nonsense.
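The difference can be illustrated with a small Python sketch. The function name `stream_through_pipe_and_file` is made up, and a temporary file stands in for a file in /dev/shm. The same stream of chunks goes through a pipe, drained after every write, and is also appended to a file. The pipe never needs more than its fixed capacity (typically 64 KiB on Linux), while the file grows by the full amount written.

```python
import os
import tempfile


def stream_through_pipe_and_file(chunks, chunk_size=4096):
    """Send the same data through a pipe and into a file.

    Returns (bytes_read_from_pipe, final_file_size).
    """
    chunk = b"x" * chunk_size
    r, w = os.pipe()
    piped = 0
    try:
        with tempfile.TemporaryFile() as f:
            for _ in range(chunks):
                os.write(w, chunk)                       # producer
                piped += len(os.read(r, chunk_size))     # consumer frees the space
                f.write(chunk)                           # file just keeps growing
            f.flush()
            file_size = os.fstat(f.fileno()).st_size
    finally:
        os.close(r)
        os.close(w)
    return piped, file_size


if __name__ == "__main__":
    print(stream_through_pipe_and_file(1000))
```

With 1000 chunks of 4 KiB, about 4 MB passes through the pipe without any truncation or seeking, whereas the file ends up 4 MB long and would need explicit housekeeping to stay bounded.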


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds