
Mazzoli: How fast are Linux pipes anyway?

Francesco Mazzoli delves deeply into the kernel's implementation of pipes (and more) in an attempt to maximize the throughput of data.

The post was inspired by reading a highly optimized FizzBuzz program, which pushes output to a pipe at a rate of ~35GiB/s on my laptop. Our first goal will be to match that speed, explaining every step as we go along. We’ll also add an additional performance-improving measure, which is not needed in FizzBuzz since the bottleneck is actually computing the output, not IO, at least on my machine.
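For reference, the naive baseline that those optimizations are measured against is just a loop around write(); a minimal sketch (mine, not the article's code, with an arbitrary buffer size):

```c
/* A minimal sketch (not the article's code): the naive baseline,
 * a writer that loops over plain write() to stdout. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1 << 17)      /* 128 KiB; an arbitrary choice */

int main(void)
{
    static char buf[BUF_SIZE];
    memset(buf, 'x', sizeof(buf));

    for (;;) {
        ssize_t n = write(STDOUT_FILENO, buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            return 1;
        }
    }
}
```

Piped through pv (`./writer | pv > /dev/null`), this gives the kind of baseline throughput number the article starts from.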

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 2, 2022 14:43 UTC (Thu) by joey (guest, #328) [Link]

What a great article. One thing that was not entirely clear until I looked at the code is that the optimised read end uses splice() to send the pipe's contents to /dev/null, so it only counts the bytes read without ever actually seeing them. There's apparently no way to really speed up reading from a pipe, if that is your program's bottleneck.
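Roughly, the read end amounts to something like this (my paraphrase of the idea, not the actual code):

```c
/* Rough sketch: splice() moves the pipe's pages straight to /dev/null,
 * so the bytes get counted without ever being copied into user space. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0) {
        perror("open");
        return 1;
    }

    long long total = 0;
    for (;;) {
        /* Move up to 128 KiB per call out of the pipe on stdin. */
        ssize_t n = splice(STDIN_FILENO, NULL, devnull, NULL,
                           1 << 17, SPLICE_F_MOVE);
        if (n < 0) {
            perror("splice");
            return 1;
        }
        if (n == 0)             /* writer closed the pipe */
            break;
        total += n;
    }
    fprintf(stderr, "consumed %lld bytes\n", total);
    return 0;
}
```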

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 2, 2022 16:00 UTC (Thu) by atnot (guest, #124910) [Link] (2 responses)

This was super interesting!

One thing I'm a bit curious about: The hardware has support for resolving virtual memory and also has a handy dedicated TLB cache for it. The article mentions that the kernel reimplements this behavior. Is it possible for operating systems to take advantage of this dedicated hardware? Say, with a "resolve physical address" instruction. If not, is there a reason this isn't being done?

I know there are architectures where the opposite is done (the hardware calls into the kernel to resolve physical addresses), so perhaps get_user_pages just isn't performance critical enough outside of synthetic benchmarks like these?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 2, 2022 16:12 UTC (Thu) by Bigos (subscriber, #96807) [Link] (1 responses)

I believe the get_user_pages* call also increments the reference count on struct page, which I would say should be the bottleneck here.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 4, 2022 1:55 UTC (Sat) by scientes (guest, #83068) [Link]

get_user_pages_* was a personal project of Linus Torvalds, who voiced regret over ever adding splice, because it failed to live up to its expectations. I feel this article has (finally) explained to me what those expectations were.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 8:00 UTC (Fri) by Fowl (subscriber, #65667) [Link] (1 responses)

How much of this is actually testing the "pv" command's ability to read, or is it measuring "out of band" somehow?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 10:44 UTC (Thu) by heiner (guest, #158880) [Link]

The article does not actually use `pv` in its main part.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 13:55 UTC (Fri) by martin.langhoff (subscriber, #61417) [Link] (15 responses)

I'm a bit disappointed it takes so much kernel API abuse to get fast throughput.

A lot of the 'magic' of Linux is that it makes common operations very fast without the userland developer using esoteric APIs. open() and stat() are famously fast.

Why isn't write() fast?

Or – why is the throughput of a naive write() so radically slower than the optimized version, and why are the steps to optimize it so obscure and arcane?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 14:53 UTC (Fri) by Wol (subscriber, #4433) [Link] (2 responses)

I guess because a naive read simply says "get me this block" and it passes down through the various block layers easily. It may need expanding slightly if it hits things like raid, or luks, or integrity, but the basic request is just "get me this (list of) blocks".

When writing, luks/raid/integrity add far more complexity to the request.

But even if you're not taking that into account, reads are by their very nature synchronous - the program MUST wait until it gets the information it wants, while a write can wait. So the elevator algorithm prioritises reads over writes ...

And then there's buffer(/cache)-bloat ...

Cheers,
Wol

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 19:44 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

Well, io_uring in principle allows us to pipeline reads and writes to arbitrary depths without creating new kernel threads for each read-to-write data hazard.
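A rough sketch of the idea with liburing (the ring is assumed to have been set up with io_uring_queue_init(); the fds and buffer are placeholders): IOSQE_IO_LINK makes the write start only after the read completes, so the data hazard is resolved in the kernel rather than by a thread per read-to-write pair.

```c
#include <liburing.h>

int copy_chunk(struct io_uring *ring, int infd, int outfd,
               char *buf, unsigned len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, infd, buf, len, -1);   /* -1: current file pos */
    sqe->flags |= IOSQE_IO_LINK;    /* the next SQE waits for this one */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, outfd, buf, len, -1);

    io_uring_submit(ring);

    /* Reap both completions; a short or failed read severs the link and
     * the write comes back with -ECANCELED. */
    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        io_uring_cqe_seen(ring, cqe);
    }
    return 0;
}
```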

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 3, 2022 20:30 UTC (Fri) by martin.langhoff (subscriber, #61417) [Link]

To be clear - we're talking about writing to a pipe, which is writing to memory. Not to disk.

I had (naively) imagined that with the right open() parameters, and with sufficiently large writes, things would happen automagically fast.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 4, 2022 13:50 UTC (Sat) by willy (subscriber, #9762) [Link]

This program is limited by the performance of memcpy() when calling write(). We necessarily have to copy to implement the semantics of write() and so we need to use system calls with different semantics to get better performance.

The particular semantic that vmsplice() has over write() is that modifications to the buffer after making the call are not allowed. Or rather, they will show up in the recipient when they ought not to. We can't optimise write() to a pipe like this, because we'd need to shoot down TLB entries to prevent subsequent modifications to the buffer from showing up. That's why the new syscall was needed.
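Concretely, the contract looks something like this (a minimal sketch of the idea, not the article's code):

```c
/* The pages behind buf are handed to the pipe by reference, so the
 * caller must not modify them until the reader has consumed them. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

int give_to_pipe(int pipe_wr, char *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    while (iov.iov_len > 0) {
        ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
        if (n < 0)
            return -1;
        iov.iov_base = (char *)iov.iov_base + n;
        iov.iov_len -= n;
    }
    /* Writing to buf at this point may still change what the reader
     * sees -- that is exactly the semantic difference from write(). */
    return 0;
}
```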

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 4, 2022 14:19 UTC (Sat) by atnot (guest, #124910) [Link] (10 responses)

The core of the issue is really with the ownership of the buffers in the write() api.

write() takes the address of a buffer that the user owns. Because it cannot make any assumptions about how long that data will remain valid after the syscall, it has no choice but to copy. Because the memory was allocated ahead of time by the user, it can't really make any smart decisions about page size either. Then on read, it has to do the same thing again.

I guess for pipes specifically, one could imagine a flag which would make write() block until there is a corresponding read() call on the other side of the pipe, which would eliminate one copy. But with differing buffer sizes, non-blocking IO and other considerations I presume that'd be a lot of complexity that's unlikely to be worth it, especially considering the fact that you apparently can't even compute fizzbuzz fast enough to completely saturate a pipe like this. [1]

In that sense I'd say that the premise is wrong and write() is actually already plenty fast enough by default.

[1] In scenarios where this does matter there are usually already plenty of better, specialized solutions like mmap, XDP, dpdk, netfilter, etc.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 5, 2022 17:10 UTC (Sun) by Wol (subscriber, #4433) [Link] (9 responses)

> write() takes the address of a buffer that the user owns. Because it can not make any assumptions about how long that data is going to remain valid after the syscall, it has no choice but to copy.

What about COW? (Might be tricky I know, but just COW the page containing the buffer.)

Cheers,
Wol

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 5, 2022 18:58 UTC (Sun) by atnot (guest, #124910) [Link] (8 responses)

I don't think that would help. CoW is only faster if the memory isn't actually written to again. But applications don't really keep buffers around unmodified for posterity, nor would there be any real way for them to know when it's okay to reuse them. So in practice, you're likely to just end up with the page that contains the buffer being immediately written to again, at which point you're at the status quo again except with additional page faults and fragmentation. There's not really any way to solve this without deviating significantly from the write() API, which is probably one reason people keep inventing new ways of doing it.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 8, 2022 1:08 UTC (Wed) by willy (subscriber, #9762) [Link]

Worse than that, write() would have to invalidate the TLB entry for the page in question (in order to make COW work). TLB invalidations are slower than copies.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 5:09 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (6 responses)

Why can't we just mmap the pipe? It's ultimately "just" a buffer in kernelspace, is it really that hard to add a userspace mapping for it?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 5:12 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

(Clarifying: I am aware that you cannot mmap pipes. I'm asking why this restriction shouldn't/can't be lifted.)
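The restriction is easy to demonstrate; as I understand it, pipefs simply provides no mmap operation, so something like this fails, typically with ENODEV:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }

    void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fds[0], 0);
    if (p == MAP_FAILED)
        perror("mmap on a pipe");   /* the expected outcome today */
    return 0;
}
```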

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 21:07 UTC (Thu) by anton (subscriber, #25547) [Link]

Peter Syrowatka added mmap() for pipes in his Diplomarbeit (master's thesis, in German).

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 15:01 UTC (Thu) by willy (subscriber, #9762) [Link] (3 responses)

What does it mean to mmap() a pipe?

Let's suppose I have a pipefd and addr = mmap(offset=0, length=1M). Then I call read(4kB) on pipefd. Does the pipe shuffle down so that *addr now refers to what was at 4kB, or does it still have a reference to what was at 0 when I called mmap()?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:32 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (2 responses)

Whichever is easier. As long as it's consistent and well-documented, userspace can figure out the rest.

However, I should point out that, if the pipe does not shuffle down, then you need to add an API for telling userspace the current read/write offsets, so that userspace knows where to begin reading or writing. Regardless, you also want an API for setting (or at least advancing) those offsets, so that userspace can emulate read/write calls. Therefore, you might as well not bother with the shuffling-down logic and just provide full get/set support for the offsets.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:36 UTC (Fri) by willy (subscriber, #9762) [Link] (1 responses)

Haven't you just reinvented shared memory?

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 10, 2022 17:55 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

Well gee, I thought that's what vmsplice(2) was trying to do in the first place.

But I think the big difference is this: If you e.g. create a file in /dev/shm (or any tmpfs) and just keep writing more and more data to it, it'll get bigger and bigger indefinitely, so you have to periodically seek to zero at both ends (and/or truncate it). Pipes are effectively ring buffers, so they don't have this problem. Your consumer can just call read, and doesn't have to know anything about your fancy mmap nonsense.

CPU chip characteristics matter

Posted Jun 3, 2022 16:33 UTC (Fri) by jreiser (subscriber, #11027) [Link] (1 responses)

Serious performance work should report grep -E 'cpu family|model name|model|stepping|microcode|cache size|siblings|cpu cores' /proc/cpuinfo, trimmed of redundancies.

There are more details regarding cache architecture. For Intel Core chips of the last 12 years or so, each core has its own L1 (separate I and D, each 32kB) and its own L2 (256kB unified I and D). Then L3 (unified I and D) is on the non-core side of an internal bus, and is shared by all cores. (PCIe I/O devices also talk to L3.) Typical chips for non-server consumer machines have an L3 of 8MB for Core i7, 6MB for Core i5, 4MB for Core i3. Server chips have much larger L3: up to 28MB or more. Finally, L3 talks to physical RAM.

In some ways the best situation for communication via cache is 2 hyperthreads in the same core, where sharing is guaranteed and the hardware resolves cache contention on a cycle-by-cycle basis. Separate non-hyper threads in the same core suffer CPU contention via software task switching. Running on separate cores forces all write operations to use L3, although the reads can be short-circuited by cross-core L2 cache snooping on that shared bus.

It would be interesting to see some measurements of the analogous use of io_uring.

CPU chip characteristics matter

Posted Jun 5, 2022 20:42 UTC (Sun) by flussence (guest, #85566) [Link]

> Serious performance work should report grep -E 'cpu family|model name|model|stepping|microcode|cache size|siblings|cpu cores' /proc/cpuinfo, trimmed of redundancies.

There's a better command for that nowadays: `lscpu`. It's much easier than trying to make sense of /proc/cpuinfo (and much more reliable than cpuinfo's unfixable legacy format trying to describe modern CPUs; it incorrectly shows my desktop as having 12MB of L2 when there should be 6, and also omits the L1 and 64MB of L3 entirely).

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 9, 2022 0:38 UTC (Thu) by Trelane (guest, #56877) [Link] (2 responses)

Also, the Hacker News thread: https://news.ycombinator.com/item?id=31592934

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 13, 2022 22:32 UTC (Mon) by ras (subscriber, #33059) [Link] (1 responses)

That HN discussion was more interesting than the discussion here in many ways. I was hoping the discussion here might expand on it, but alas it hasn't happened.

In the original article, he achieved the speed by changing the problem from "transferring bytes via Linux pipes" to "transferring pages via Linux pipes". You can't move bytes at 35GiB/s purely locally within the same program, so claims of doing it over pipes make you wonder what he's smoking. Turns out the title involved a little poetic licence. Perhaps the article could have been better called "zero copy via pipes".

In case it tickles someone's interest: when you transfer pages, you are transferring ownership (permission to write) of the page. The HN discussion centred on how that transfer of ownership was managed. In the article it was dealt with by not handling it at all. This is just a benchmark program, and so he could get away with writing to the pages once before he started and then never writing to them again during the test. But that means he was transferring a constant over the pipe over and over again by vmsplice()'ing a page that never changed. The HN discussion revolved around whether it was possible to write to (ie, later reuse) the transferred pages in a way that didn't create a race between the reader and writer and didn't need to set up some second channel (eg, a futex) to avoid the race condition. It centred on transferring ownership of the page to the kernel. It hadn't reached a conclusion when I read it.
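One mitigation that came up (and that, if I remember the article right, it leans on itself) is double buffering sized against the pipe capacity: with two buffers each as large as the pipe, finishing the vmsplice() of buffer B means at least a pipe-full has drained, i.e. everything from buffer A has left the pipe, so A can be refilled next round with no futex or other side channel. A sketch (my understanding, untested):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define PIPE_SZ (1 << 20)   /* must match the F_SETPIPE_SZ below */

/* Placeholder producer: refills a buffer for the next round. */
static void fill(char *buf, size_t len, int round)
{
    (void)buf; (void)len; (void)round;
}

int run(int pipe_wr)
{
    char *bufs[2] = {
        aligned_alloc(4096, PIPE_SZ),
        aligned_alloc(4096, PIPE_SZ),
    };
    if (fcntl(pipe_wr, F_SETPIPE_SZ, PIPE_SZ) < 0)
        return -1;

    for (int round = 0; ; round++) {
        char *buf = bufs[round % 2];    /* alternate between the buffers */
        fill(buf, PIPE_SZ, round);

        struct iovec iov = { .iov_base = buf, .iov_len = PIPE_SZ };
        while (iov.iov_len > 0) {
            ssize_t n = vmsplice(pipe_wr, &iov, 1, 0);
            if (n < 0)
                return -1;
            iov.iov_base = (char *)iov.iov_base + n;
            iov.iov_len -= n;
        }
    }
}
```

(The argument only holds if the reader actually consumes the pages rather than splicing them onward, which is part of what the HN thread was chewing on.)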

The HN discussion was more interesting than the article in some ways, because it explored whether what was really just a clever sleight of hand could ever be useful in practice.

Mazzoli: How fast are Linux pipes anyway?

Posted Jun 21, 2022 20:07 UTC (Tue) by flussence (guest, #85566) [Link]

If the problem is already reduced to "transferring ownership of memory chunks between processes" then this all becomes moot anyway: use a socket and send memfds over; now it's O(1) instead of O(n).
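The sending side is short; a sketch (assumes a connected AF_UNIX socket; the pwrite() stands in for the producer, which would normally write into an mmap() of the memfd directly):

```c
/* Only the descriptor crosses the socket, via SCM_RIGHTS,
 * regardless of how big the chunk behind it is. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int send_memfd(int sock, const void *data, size_t len)
{
    int fd = memfd_create("chunk", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, len) < 0 ||
        pwrite(fd, data, len, 0) != (ssize_t)len)
        goto fail;

    /* One byte of real data is needed to carry the ancillary message. */
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    if (sendmsg(sock, &msg, 0) < 0)
        goto fail;
    close(fd);      /* the receiver's copy of the descriptor stays valid */
    return 0;
fail:
    close(fd);
    return -1;
}
```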

