
Not even close to saturating consumer NVMe disks


Posted Jun 7, 2024 12:16 UTC (Fri) by adobriyan (subscriber, #30858)
In reply to: Not even close to saturating consumer NVMe disks by intgr
Parent article: Measuring and improving buffered I/O

I don't really want to comment on Linux pagecache performance; it can't make things faster, only slower, and that's quite obvious.

The answer in your case is probably a different I/O scheduler plus filesystem mount options, TRIM, and, as you've already discovered, LBA size.

I want to comment on the methodological traps and mistakes, so to speak.

1) dd

dd is a single-threaded program. Comparing single-threaded performance against a CrystalMark "SEQ 1MiB (Q= 8, T= 1):" result is not a fair comparison.
Q=8 means "queue depth 8", which means 8 NVMe commands in flight, not 1.

Internally, SSDs are quite parallelized (imagine an HDD with dozens of platters, each having an _independent_ actuator arm),
so queue depth 1 is the worst-case scenario for an SSD. The CrystalMark report shows a measly "130.946 MB/s" for QD1 4 KiB random reads; that's SSDs for you.

Don't use dd, and don't use those pretty apps unless you have traced them and know how they open descriptors and issue I/O.

Learn fio. It does O_DIRECT, maintains queue depth, can emulate dd(1)/cp(1) easily, and has more config options than you can imagine.
Use ioengine=libaio with iodepth=N; use ioengine=psync to emulate dd and file-copy routines.
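As a concrete starting point, a minimal fio job along these lines compares QD1 against QD8 sequential writes (the filename, size, and runtime below are placeholders; point them at the filesystem or device you actually want to measure):

```shell
# Sketch: write a two-job fio file comparing queue depths, then run it.
# All paths and sizes here are hypothetical examples.
cat > seqwrite.fio <<'EOF'
[global]
ioengine=libaio
direct=1
rw=write
bs=1M
size=1g
filename=/tmp/fio-target.bin
runtime=30
time_based

[seq-qd1]
iodepth=1
stonewall

[seq-qd8]
iodepth=8
stonewall
EOF
# fio seqwrite.fio    # uncomment once fio is installed
```

The stonewall option makes the second job wait for the first, so the two queue depths are measured back to back rather than competing.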

Windows may do QD>1 copies; someone needs to check that. A naive POSIX read/write loop is not the pinnacle of performance.

Again, it is easy to check: find the block size with peak QD=1 sequential write performance; any advertised number beyond that must come from QD>1.
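That check can be sketched with plain dd. Note this sketch uses conv=fdatasync rather than oflag=direct so it also runs on tmpfs; the target path is a hypothetical example:

```shell
# Sweep QD=1 sequential-write block sizes; throughput should plateau at the
# drive's QD1 limit. Anything quoted above that plateau needs QD>1.
out=/tmp/bs-sweep.bin   # hypothetical target; use a file on the disk under test
for bs in 64K 256K 1M; do
    printf 'bs=%s: ' "$bs"
    dd if=/dev/zero of="$out" bs="$bs" count=64 conv=fdatasync 2>&1 | tail -n 1
done
```

Each pass prints dd's summary line (bytes, elapsed time, throughput) for that block size.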

1a) dd bs=1G is a bad idea.

First, you need to free 1 GiB of RAM by flushing everything else, only to move a big file through the pagecache twice (not /dev/zero; let's say a video file).

If the target file is unused afterwards, use "dd oflag=direct" with a big-but-not-really-big bs=.

Internally, there is a RAM buffer that stages the data; any write bigger than that has to be throttled!
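A related host-side limit: buffered writes are also throttled by the kernel's dirty-page thresholds, which are visible (and tunable as root) under /proc/sys/vm/:

```shell
# Show the page-cache dirty-data thresholds that trigger write-back throttling.
# dirty_background_ratio: % of RAM dirtied before background write-back starts.
# dirty_ratio: % of RAM dirtied before writing processes are throttled.
grep . /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
# To tune (as root), e.g.: sysctl -w vm.dirty_ratio=10
```

If vm.dirty_bytes/vm.dirty_background_bytes are set instead, the ratio files read 0 and the byte limits apply.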

2) Manufacturers do _not_ use filesystems for benchmarks ...
... because a filesystem only makes things slower, not faster, and because they can't control the host OS kernel, application, mount options, SCSI/NVMe stack, and so on.

Internally they may use even more stripped-down environments (think DPDK) to separate what the hardware can do from what OS kernel developers can do afterwards.

So what happens is: some Gnome dev writes a naive file copier, and then users compare it with the maximum possible number from real benchmarks that simulate the maximum workload.

3) numbers are sensitive to what's called preconditioning

Manufacturers do preconditioning before benchmarks so that results aren't fudged. Those big enterprise buyers open devices with O_CONTRACT and reserve the right to reopen with O_LAWSUIT, so you can't lie to them. Consumer benchmarks are a crapshoot, of course.

Writing 2-4 full drive capacities _before_ doing the run is a must if you want to know what the hardware is capable of.
It takes time and eats into the block erase count, but it is necessary.



Not even close to saturating consumer NVMe disks

Posted Jun 7, 2024 14:02 UTC (Fri) by intgr (subscriber, #39733) [Link] (10 responses)

> I don't really want to comment on Linux pagecache performance

I guess I failed to put together a good narrative around my expectations, how it conflicted with reality, and what my benchmarks were supposed to illustrate.

As a user, I *only* care about performance of the buffered I/O path, which includes the page cache. Workloads on my desktop/laptop computer don't use O_DIRECT; they are extremely simple and non-parallel: copy some files, unpack some files, install packages, git operations, etc. I upgraded the SSD so that I could do these operations faster.

My expectation is that an operation as simple as copying a file should be able to mostly saturate the I/O bandwidth of my hardware. Not 100%, but I sure as hell wasn't expecting it to be ≤35%. I thought sequential I/O was simple. Is this a crazy expectation?

(To put it another way, I expected to reach 50% of the SSD's capability due to the PCIe bandwidth limitation, but I got <20%.)

Or, if computers in 2024 are fundamentally unable to provide this, would it be fair to argue that cp/tar/rsync implementations should all be rewritten with parallel I/O or O_DIRECT?

> dd is single threaded program.

And so are the workloads I am interested in. cp, tar and rsync are all single-threaded in the write I/O path, as far as I am aware. Not sure about git.

> Comparing single threaded performance with report of CrystalMark "SEQ 1MiB (Q= 8, T= 1):" is not correct.

I didn't compare to CrystalMark. Are you referring to the KDiskMark screenshots?

More importantly, if KDiskMark with buffered I/O gives me 1.1 GB/s and `dd` (from the forum post) gives 1.3 GB/s, they're both in the same ballpark and way below my expectation. The additional queue depth didn't seem to make any difference. I think it's valid to demonstrate that?

The reason for including the KDiskMark O_DIRECT results was to demonstrate that this is not a hardware limitation. To drive home the fact that Linux buffered I/O is the bottleneck.

> Learn fio. It does O_DIRECT, maintains queue depth, can emulate dd(1)/cp(1)

KDiskMark uses fio internally. I included the numbers with O_DIRECT both on and off. It's true that I didn't benchmark fio with QD=1, but as explained above, it didn't make much difference, and both results highlight how much performance is left untapped compared to hardware capabilities.

> 1a) dd bs=1G is bad idea.

I didn't use `bs=1G`. In the forum post, I used `dd bs=1M`.

But again, it doesn't matter; I could also benchmark `cp` copying a file from tmpfs to a real disk. The results are in the same ballpark and still a fraction of maximum I/O capacity.
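For reference, that tmpfs-to-disk copy can be sketched like this (paths are hypothetical; the destination must be on the disk under test, not tmpfs, for the numbers to mean anything):

```shell
# Stage a file in RAM-backed tmpfs, then time copying it out to disk.
src=/dev/shm/big.bin       # tmpfs source, so reads cost next to nothing
dst=/tmp/big-copy.bin      # hypothetical destination on the disk under test
dd if=/dev/zero of="$src" bs=1M count=128 2>/dev/null
# echo 3 > /proc/sys/vm/drop_caches   # as root, for a cold-cache run
time cp "$src" "$dst"
```

Dividing the file size by the elapsed time gives the effective buffered-write throughput of the cp path.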

> 2) manufacturers do _not_ use filesystems for benchmarks. ...

Yes, I didn't expect that they would. My complaint is not with manufacturers. My complaint is with Linux leaving so much hardware capability underutilized. And how nobody seems to be talking about this.

> some Gnome dev writes naive file copier and than users compare it with maximum possible number from real benchmarks which simulate maximum workload.

Exactly, this. Now we're getting somewhere. :D

Why is a simple buffered I/O writer limited to 35% or less of potential hardware throughput?

Not even close to saturating consumer NVMe disks

Posted Jun 7, 2024 14:53 UTC (Fri) by pizza (subscriber, #46) [Link] (6 responses)

> (To put it anther way, I expected to reach 50% of SSD capability due to PCIe bandwidth limitation, but I got <20%)

The point is that this "SSD capability" (if not a completely theoretical maximum burst speed [1]) is, at best, derived from a synthetic multi-threaded benchmark that is all but guaranteed to bypass the I/O mechanisms available to real-world applications in real-world operating systems.

> Why is a simple buffered I/O writer limited to 35% or less of potential hardware throughput?

Due to fixed overhead (eg making system calls, I/O scheduling, copying data into/out of buffers, servicing interrupts, and latencies for all of these meaning that your bus duty cycle is well below the theoretical maximum) -- and the problem that optimizations that maximize throughput for a single thread can severely penalize other usage patterns. Case in point: the "bufferbloat" problem on wifi and ISP home routers.

[1] e.g. quoting max speeds as the theoretical maximum PCIe transfer speed; other strategies include omitting IOP transactional overhead, using transfers that fit entirely within the SSD's cache, and/or preconditioning the SSD in some way. As the saying goes, "lies, damn lies, and benchmarks".

Not even close to saturating consumer NVMe disks

Posted Jun 7, 2024 15:30 UTC (Fri) by Wol (subscriber, #4433) [Link] (4 responses)

> The point is that that "SSD capability" (if not a completely theoretical maximum burst speed [1]) is, at best, derived from a synthetic multi-threaded benchmark that is all but guaranteed to bypass the I/O mechanisms available to real-world applications in real-world operating systems.

The point, looked at the other way, is that Linux is caching a lot of disk I/O that is actually "write and forget". That cache is (a) expensive, and (b) pointless.

I get that knowing the difference between useful and useless caching is a very difficult problem, but there's a lot of evidence that caching is a prime example of "premature optimisation is the root of all evil". If on average half of that written data is then re-used, the implication is that the cost of caching it is more than the cost of reading it again from disk ... not a good trade-off.

I believe a lot of databases cache what they've read themselves. Why is the OS caching it too? That's why I believe Postgres uses direct I/O as a matter of course.

Back in the day, Pick/MultiValue databases never cached data, because they could retrieve it from disk so fast (and I've seen them drive disks like that!) (That was with disks and ram measured in megabytes. Or even less.)

What's needed is a toggle, probably per device, to turn direct/buffered/cached I/O on and off. Take the guy streaming to disk: if the disks were unbuffered, all the normal utilities would work. Databases with a dedicated storage partition can control their own caching if they wish. Even normal users with a normal /home - do they really need caching? How often do they re-read the same file in normal usage? Probably pretty much never ...

Buffered/cached i/o is an anachronism that probably doesn't make sense in a lot of workloads ...

Cheers,
Wol

Not even close to saturating consumer NVMe disks

Posted Jun 7, 2024 15:42 UTC (Fri) by adobriyan (subscriber, #30858) [Link]

> Even normal users with a normal /home - do they really need caching? How often do they re-read the same file in normal usage? Probably pretty much never ...

Devs need pagecache, all of it!

# from cold cache, top end consumer NVMe SSD
$ time rg -e 'page->private' -w -n
real 0m1.498s

# from pagecache
real 0m0.093s

rg(1) is multithreaded.

Not even close to saturating consumer NVMe disks

Posted Jun 8, 2024 18:46 UTC (Sat) by jkl (subscriber, #95256) [Link]

> That's why I believe Postgres uses direct i/o as a matter of course.

It does not. Direct I/O was only just introduced as an optional, experimental dev-only feature in PostgreSQL 16: https://wiki.postgresql.org/wiki/AIO

Not even close to saturating consumer NVMe disks

Posted Jun 17, 2024 17:03 UTC (Mon) by fest3er (guest, #60379) [Link] (1 responses)

«Even normal users with a normal /home - do they really need caching?»

IMO, normal users benefit from disk cache, just as they benefit from having at least 4 CPU cores (GUI, disk IO, local program, other activity).

Some years ago, I experimented to see the difference between Linux's disk caching and a ramdisk. The system had 16GiB RAM and 8 or 16 CPU cores. I was working on a custom Linux system at the time. The whole build required about 13GiB, using as many CPU cores as each pkg could (GCC and Linux would use all allowed cores for minutes). First, I built it on spinning disk (after having dropped the cache). It took X minutes. Then I cleared it, loaded the tree into a 14GiB ramdisk formatted with some filesystem, and built again. This build took X - (time to read files from the spinning drive) minutes. My conclusion 10 years ago was that Linux's disk caching was quite efficient; reads weren't really noticeable and writes vanished into unused CPU cycles. But perhaps this isn't 'normal use'.

Once upon a time, I noticed that EXT4 filesystems would 'lock' or 'hang' the system for some seconds while data were being flushed to disk, something I didn't see on other FSen (ReiserFS). But I haven't seen that happen for probably 10 years now. (Maybe I had enabled something on that particular EXT4 FS and forgot I'd done it.)

Not even close to saturating consumer NVMe disks

Posted Jun 18, 2024 11:29 UTC (Tue) by Wol (subscriber, #4433) [Link]

> IMO, normal users benefit from disk cache, just as they benefit from having at least 4 CPU cores (GUI, disk IO, local program, other activity).

So you were doing a build. What exactly do you mean?

If you're talking about compile-and-link, dev-type stuff, you're doing a lot of "write then read", and yes, a cache is useful. But how typical is that sort of behaviour for a normal user?

> Then I cleared it, loaded the tree into a 14GiB ramdisk formatted with some filesystem and built again. This build took X - (time to read files from spinning drive) minutes.

So you're now measuring "time for first read from spinning rust". And how exactly is caching going to improve THAT figure? It's not! It'll make a big difference to "time to read it again", or "time to retrieve what I just wrote", but how much does that figure into a NORMAL USER'S workflow? If I'm working on a document, I'll read it ONCE, then it'll sit in RAM til I'm finished with it. Cache saves me nothing.

> Once upon a time, I noticed that EXT4 filesystems would 'lock' or 'hang' the system for some seconds while data were being flushed to disk, something I didn't see on other FSen (ReiserFS).

And that's background writes, again nothing to do with caching. Except that there's a LOT of evidence that leaving all that i/o behind in cache is damaging to performance.

I'm an analyst. My whole job is to analyse performance. I'm looking at all this and thinking "where are the performance gains coming from?". All the evidence in front of me says that caching has a very measurable performance cost. And when I try and work out where the offsetting gains are going to come from, there are some obvious places - development work for example. A "make" cycle will certainly benefit. But there are also places where I struggle to find any benefit - YOUR TYPICAL USER - for example.

All I'm asking is "is caching worth the cost?". For servers, "it depends". Do they do their own caching, do they rely on the OS? For a development workstation, almost certainly, reading and writing the same files over and over will obviously benefit. But for a typical user PC? Caching will *interfere* with a normal work-pattern. And for a backup server? All the evidence is that caching is actually a major hindrance. I've very recently seen someone complaining about NVME performance - the difference in speed between cached and uncached i/o is about three HUNDRED percent.

The ability to switch it on and off by volume is probably very useful.

Cheers,
Wol

Not even close to saturating consumer NVMe disks

Posted Jun 11, 2024 17:44 UTC (Tue) by anton (subscriber, #25547) [Link]

> Due to fixed overhead (eg making system calls, I/O scheduling, copying data into/out of buffers, servicing interrupts, and latencies for all of these meaning that your bus duty cycle is well below the theoretical maximum) -- and the problem that optimizations that maximize throughput for a single thread can severely penalize other usage patterns.
I don't see these issues as fundamentally limiting buffered I/O bandwidth, especially for the kinds of applications that intgr mentions. I did
strace cp lp_solve_2.2.tar.gz xxx
cp performs 101 system calls, most of them startup stuff. The actual work seems to be:
openat(AT_FDCWD, "xxx", O_RDONLY|O_PATH|O_DIRECTORY) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "lp_solve_2.2.tar.gz", {st_mode=S_IFREG|0644, st_size=87030, ...}, 0) = 0
openat(AT_FDCWD, "lp_solve_2.2.tar.gz", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=87030, ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "xxx", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = -1 EOPNOTSUPP (Operation not supported)
newfstatat(4, "", {st_mode=S_IFREG|0644, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 87030
copy_file_range(3, NULL, 4, NULL, 9223372035781033984, 0) = 0
close(4)                                = 0
close(3)                                = 0
If cp used direct I/O, the number of system calls would not be less, so system calls are not the issue.

Let's assume that this performs a standard copy rather than a reflink. Then inside the kernel a bunch of blocks (including some metadata blocks) are marked as dirty and will be written at some later time. If you do a lot of such copies, a lot of dirty pages accumulate, and when the kernel decides to write this data, it can do that with as many threads as is appropriate. If it uses too few, that's an optimization opportunity for the kernel.

However, for the kind of usage that intgr mentions, as long as the data fits into RAM, it does not matter when the kernel starts and finishes writing the data to the disk, and what the bandwidth of that is. So for this kind of usage the write bandwidth to the buffer cache is what counts, and it's no wonder that he does not see a speedup from switching to a faster SSD: In this setting the limit is RAM bandwidth, not bandwidth to the SSD.
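That distinction is easy to demonstrate: a buffered dd returns as soon as the data is in the page cache, while adding conv=fdatasync makes it wait for the device (the file path below is a hypothetical example):

```shell
# The same write twice: first measured to the page cache, then through to the device.
f=/tmp/pagecache-demo.bin
dd if=/dev/zero of="$f" bs=1M count=128 2>&1 | tail -n 1                 # RAM speed: no flush
dd if=/dev/zero of="$f" bs=1M count=128 conv=fdatasync 2>&1 | tail -n 1  # waits for fdatasync()
```

On a machine with dirty-page limits far above 128 MiB, the first number tracks memory bandwidth and the second tracks the storage device.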

There is a catch: People who argue with the term O_PONIES for file systems with bad crash consistency guarantees tell us that the applications need to sync at some points, and that syncs mean that the application waits every time until the data resides on the disk. Unfortunately, last I looked, the only Linux file system with a good crash consistency guarantee is NILFS2. So if applications sync frequently, write bandwidth to the SSDs would become the limiter. The cp invocation I straced does not call anything that has "sync" in its name.

My guess is that writing from buffered I/O to SSDs has not been optimized because it has not hurt kernel developers much. In the usual case git performance is plenty fast (or maybe my repos are just too small to notice problems:-).

Not even close to saturating consumer NVMe disks

Posted Jun 7, 2024 18:10 UTC (Fri) by Tobu (subscriber, #24111) [Link]

I would guess most of the reason why rsync and tar don't do multithreaded writes is that C makes concurrency hard and risky (with bad consequences for corruption in the write path; maybe it's not worth the risk of silently corrupted backups).

Anyway, Rust makes correct and highly concurrent tools feasible, and quite a few are best in their class. Try cpz and rmz. Also look at gitoxide; it's not full-featured (currently targeting use cases where it's embedded inside another app), but it can git checkout the kernel much faster. Or fclones and ripgrep for more read-heavy use cases.

Not even close to saturating consumer NVMe disks

Posted Jun 9, 2024 12:22 UTC (Sun) by malmedal (subscriber, #56172) [Link] (1 responses)

Maybe try recompiling the kernel with a higher limit on nr_requests? I have no idea if this is easy to do or if it would corrupt your data. You may want to ask on linux-kernel.

Not even close to saturating consumer NVMe disks

Posted Jun 12, 2024 12:40 UTC (Wed) by intgr (subscriber, #39733) [Link]

I already tested with various nr_requests values. I don't have the numbers, but I believe the delta between nr_requests=1 and nr_requests=1023 was only around 10-15%. And above values 100, the difference was in the noise. So I doubt that will help much, at least in my case.

Not even close to saturating consumer NVMe disks

Posted Jun 8, 2024 18:44 UTC (Sat) by dcoutts (guest, #5387) [Link]

> [...] Linux pagecache performance, it can't make things faster, only slower, that's quite obvious.

I think this is far from obvious. The crucial observation is that SSDs are highly parallel (with e.g. a 20x factor between serial (QD1) and parallel (QD32) use).

Certainly, one can expect to achieve the best I/O throughput and lowest CPU use by using nice modern async I/O APIs like io_uring, with O_DIRECT, and submitting I/O from multiple cores. That allows one to take full advantage of the SSD's parallelism.

But how can older programs written using classic serial I/O APIs take advantage of parallel I/O capabilities of SSDs? Or are all these programs condemned to poor serial performance or rewrites using new (non-portable) APIs? One potential answer is the page cache!

The page cache allows a program using serial I/O APIs to quickly move data into the page cache, from where it should _in principle_ be possible to write it to disk in parallel. And similarly for reading data via readahead. The kernel readahead could use parallel I/O to get data from the SSD and put it into the page cache, while the application still uses serial I/O APIs to copy that data from the page cache.

Of course this will have a substantially higher CPU overhead than using io_uring with O_DIRECT, but it should still be possible to achieve the full I/O throughput at least for current generation SSDs (e.g. in the 1M IOPS range).

So in summary, the problem is how to mediate between parallel SSDs and (older) serial applications, and the page cache _should_ be the solution.

Not even close to saturating consumer NVMe disks

Posted Jun 11, 2024 20:56 UTC (Tue) by willy (subscriber, #9762) [Link]

> I don't really want to comment on Linux pagecache performance, it can't make things faster, only slower, that's quite obvious.

That's an exceptionally stupid thing to say. It's a cache. Its entire purpose is to make things faster. If what you meant was "for an uncachable workload it can only make things slower", well, I disagree with that too. It decouples the application from the characteristics of the underlying storage, allowing write() of a single byte to succeed, no matter the block size of the underlying device.


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds