Linux offers two modes for file I/O: buffered and direct. Buffered I/O passes through the kernel's page cache; it is relatively easy to use and can yield significant performance benefits for data that is accessed multiple times. Direct I/O, instead, moves data directly between a user-space buffer and the storage device. It can be much faster for situations where caching by the operating system isn't necessary, but it is complex to use and contains traps for the unwary. Now, it seems, Jens Axboe has come up with a way to get many of the benefits of direct I/O with a lot less bother.
Direct I/O can give better performance than buffered I/O in a couple of ways. One of those is simply avoiding the cost of copying the data between user space and the page cache; that cost can be significant, but in many cases it is not the biggest problem. The real issue may be the effect of buffered I/O on the page cache.
A process that performs large amounts of buffered I/O spread out over one or more large (relative to available memory) files will quickly fill the page cache (and thus memory) with cached file data. If the process in question does not access those pages after performing I/O, there is no benefit to keeping the data in memory, but it's there anyway. To be able to allocate memory for other uses, the kernel will have to reclaim some pages from somewhere. That can be expensive for the system as a whole, even if "somewhere" is the data associated with this I/O activity.
The memory-management subsystem tries to do the right thing in this situation. Pages added to the cache via buffered I/O go onto the inactive list; unless they are accessed a second time in the near future, they will be the first pages to be kicked back out. But there is still a fair amount of overhead associated with implementing this behavior; in a simple test, Axboe found that kswapd ended up consuming a significant amount of CPU time just reclaiming the pages that his I/O had pulled into the cache.
This kind of problem can be avoided by switching to direct I/O, but that brings challenges and problems of its own. Axboe has concluded that there may be a third way that can provide the best of both worlds.
That third way is a new flag, RWF_UNCACHED, which is provided to the preadv2() and pwritev2() system calls. If present, this flag changes the requested I/O operation in two ways, depending on whether the affected file pages are currently in the page cache or not. When the data is present in the page cache, the operation proceeds as if the RWF_UNCACHED flag were not present; data is copied to or from the pages in the cache. If the pages are absent, instead, they will be added to the page cache, but only for the duration of the operation; those pages will be removed from the page cache once the operation completes.
The result, in other words, is buffered I/O that does not change the state of the page cache; whatever was present there before will still be there afterward, but nothing new will be added. I/O performed in this way will gain most of the benefits of buffered I/O, including ease of use and access to any data that is already cached, but without filling memory with unneeded cached data. The result, Axboe says, is readily observable: in his tests, RWF_UNCACHED I/O ran at full speed while leaving the page cache essentially empty.
This new flag thus seems like a significant improvement for a variety of workloads. In particular, workloads where it is known that the data will only be used once, or where the application performs its own caching in user space, may well benefit from running with the RWF_UNCACHED flag.
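For the curious, here is a minimal sketch of how an application might use the new flag. It assumes headers from a kernel carrying Axboe's patches (RWF_UNCACHED is not defined by mainline kernels at the time of this writing) and simply falls back to an ordinary buffered read if the flag is unavailable:

    /* Hypothetical example, not taken from the patch set itself: read a
     * file with RWF_UNCACHED so that the reads do not leave new data in
     * the page cache. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef RWF_UNCACHED
    #define RWF_UNCACHED 0   /* flag unavailable: degrade to plain buffered I/O */
    #endif

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        static char buf[1 << 20];   /* 1MB read buffer */
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        off_t off = 0;
        ssize_t ret;

        /* Any pages brought into the page cache solely for these reads
         * are dropped again as each operation completes. */
        while ((ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED)) > 0)
            off += ret;

        if (ret < 0)
            perror("preadv2");
        close(fd);
        return ret < 0 ? 1 : 0;
    }

The same flag can be passed to pwritev2() for writes.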
The implementation of this new behavior is not complicated; the entire patch set (which also adds support to io_uring) involves just over 200 lines of code. Of course, as Dave Chinner pointed out, there is something missing: all of the testing infrastructure needed to ensure that RWF_UNCACHED behaves as expected and does not corrupt data. Chinner also noted some performance issues in the write implementation, suggesting that an entire I/O operation should be flushed out at a time rather than the page-at-a-time approach taken in the original patch set. Axboe has already reworked the code to address that problem; the boring work of writing tests and precisely documenting semantics will follow at some future point.
If RWF_UNCACHED proves to work as well in real-world workloads as it does in these initial tests, it may eventually be seen as one of those things that somebody should have thought of many years ago. Things often turn out this way. Solving the problem isn't hard; the hard part is figuring out which problem needs to be solved in the first place. That, and writing tests and documentation, of course.
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 15:11 UTC (Thu) by corbet (editor, #1) [Link]
I would expect the impact on the cache to be small; a matter of kilobytes rather than gigabytes. The data is added to the cache because that way the entire buffered I/O setup — both user space and in the kernel — Just Works without additional trouble.
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 7:34 UTC (Fri) by epa (subscriber, #39769) [Link]
But why only kilobytes? What if the request was for much more than that? The article seems to imply the whole request gets into the cache, then once the request is completed it moves out again. Even if it was in fact for multiple pages: "those pages will be removed from the page cache once the operation completes". And the API for readv() and writev() doesn't have a maximum size. So if there are a million pages to read, do they all go into the cache and get removed afterwards, or can they be removed piecemeal even before the operation is finished?
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 9:19 UTC (Fri) by tlamp (subscriber, #108540) [Link]
Because a page is normally 4k, and only one page is allocated for a request, even if the request itself is much bigger (e.g., many GB) - IIUC.
With hugepages that unit may get a bit bigger, but it is still far less than the whole thing at once.
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 16:34 UTC (Thu) by axboe (subscriber, #904) [Link]
1) Lookup page cache page for the read. This is a very cheap operation.
2) If page is there, lock and copy data, done.
3) If page is not there, do IO to private page, copy data, free page, done.
Writes aren't (yet) as optimal; they will always use the page cache. If _any_ page in the written range was not in page cache to begin with, the range is dropped from cache.
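As a rough illustration of the read-side steps described above (this is a sketch, not the actual patch; read_into_private_page() is an invented placeholder for the I/O step, while the page-cache helpers are existing kernel APIs):

    #include <linux/pagemap.h>
    #include <linux/uio.h>
    #include <linux/gfp.h>

    static ssize_t uncached_read_one(struct address_space *mapping,
                                     struct iov_iter *to, pgoff_t index,
                                     size_t offset, size_t len)
    {
        struct page *page;
        size_t copied;

        /* 1) Cheap lookup: is the page already in the page cache? */
        page = find_get_page(mapping, index);
        if (page) {
            /* 2) It is: lock it, copy the data out, done. */
            lock_page(page);
            copied = copy_page_to_iter(page, offset, len, to);
            unlock_page(page);
            put_page(page);
            return copied;
        }

        /* 3) It is not: read into a private page that never enters
         *    the page cache, copy the data out, then free the page. */
        page = alloc_page(GFP_KERNEL);
        if (!page)
            return -ENOMEM;
        read_into_private_page(mapping, page, index);  /* placeholder */
        copied = copy_page_to_iter(page, offset, len, to);
        __free_page(page);
        return copied;
    }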
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 23:59 UTC (Thu) by Paf (subscriber, #91811) [Link]
It’s not really clear to me where you’re piling up here when you’re spinning - is this the per file stuff (the old mapping->tree_lock, now an xarray of course) or is this the global provisioning of pages? (Would expect pileup there for a multi-file workload, but not for a single file workload.)
I suppose this could help with either, but it’s not 100% clear to me from the patch set description which case is being discussed. (it seems to imply single file)
Presumably write performance improvements are the multi-file case, since AFAIK all the main Linux file systems are single writer per file at the inode mutex. (A limitation it would be fun to see lifted - the currently-out-of-tree Lustre file system has had that for a few years. But there’s nothing there to share, the hard part wasn’t replacing the inode mutex, it’s fixing all the bits of the fs level write path to work with concurrency.)
Anyway, this is very nice!
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 0:48 UTC (Fri) by axboe (subscriber, #904) [Link]
But the kswapd usage is just part of it, it's not the main thing for me. With a full page cache, anyone attempting to do buffered IO will suffer. If I can do 2GB/sec of RWF_UNCACHED IO with a basically empty page cache, others doing IO are hardly disturbed.
And agree, buffered writes suck due to the inode mutex. My patchset does nothing to fix that... io_uring has hashed (by inode) items serialized to work around that issue.
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 3:03 UTC (Fri) by Paf (subscriber, #91811) [Link]
The single file case will be thrashing on the lock protecting that specific mapping, and the many file case will be thrashing on the global stuff, which is much higher throughput. (Mostly because it’s lists rather than an Xarray, so insertion/removal is much faster)
I’m overly familiar with this stuff because I picked up improving page cache throughput for Lustre. (Being out of tree sucks.)
But for reads you avoid putting pages in cache and for writes it seems you just remove them immediately. I’m curious why you get benefits there (presumably in the many files case) vs just rapid flushing of pages by kswapd - if you still have to put pages in the cache and remove them, it’s not immediately obvious where the benefit comes from. Are you skipping something? Are you able to batch? Is kswapd inefficient for some reason? (eg, navigating the LRU list where you can just work through your array)
Sorry for all the questions, and doubly sorry if they’re addressed in parts of the thread I haven’t read :)
“ io_uring has hashed (by inode) items serialized to work around that issue.”
I don’t quite follow that statement (mostly because I have minimal knowledge of io_uring :) ) - is this saying that io_uring won’t do parallel submission of writes to the same inode because they’d just block?
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 4:01 UTC (Fri) by axboe (subscriber, #904) [Link]
> is this saying that io_uring won’t do parallel submission of writes to the same inode because they’d just block?
Yes, that's exactly right. It's pointless and just causes tons of contention. If this gets fixed at some point we can just lift this restriction, or even do it per fs if it's flagged somehow. As it stands, all the important file systems suffer from this, which is why io_uring behaves that way.
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 14:14 UTC (Fri) by Paf (subscriber, #91811) [Link]
And now I get it. I thought you were looking at large streaming, which is the case I’ve worked with in the past. Never mind what I said, then, many small reads would be different (and I’ve never really benchmarked it). Sorry for my confusion.
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 13:44 UTC (Fri) by smooth1x (subscriber, #25322) [Link]
"all the main Linux file systems are single writer per file at the inode mutex"
Does that still apply to XFS with O_DIRECT?
Regards,
David.
Buffered I/O without page-cache thrashing
Posted Dec 13, 2019 21:28 UTC (Fri) by dgc (subscriber, #6611) [Link]
1. there is no "inode mutex" anymore - it's a rwsem. :)
2. buffered writes on every filesystem take the rwsem in exclusive mode, so yes it's still single writer.
3. Filesystems can do what they like with O_DIRECT, so XFS still uses shared writer locking for O_DIRECT.
and...
4. I'm slowly working on range locking for IO in XFS, such that we can do concurrent buffered writes that still have exclusive access guarantees against other buffered/direct IO, truncate, hole punching, etc....
Basically, IO range locking is what io_uring really needs for concurrent buffered IO without needing nasty hacks to avoid exclusive writer serialisation...
-Dave.
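To make the locking pattern concrete, here is a minimal sketch (not taken from any particular filesystem) of the difference described above: buffered writes take the inode's i_rwsem exclusively, while an XFS-style O_DIRECT write can take it shared:

    #include <linux/fs.h>

    /* Buffered write path: one writer at a time per inode. */
    static void buffered_write_locking(struct inode *inode)
    {
        inode_lock(inode);          /* exclusive on inode->i_rwsem */
        /* ... copy data into the page cache ... */
        inode_unlock(inode);
    }

    /* XFS-style O_DIRECT write path: concurrent writers can proceed. */
    static void direct_write_locking(struct inode *inode)
    {
        inode_lock_shared(inode);   /* shared on inode->i_rwsem */
        /* ... submit the direct I/O ... */
        inode_unlock_shared(inode);
    }

Range locking, as mentioned above, would narrow the exclusive section to just the byte range being written rather than the whole file.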
Buffered I/O without page-cache thrashing
Posted Dec 14, 2019 18:41 UTC (Sat) by axboe (subscriber, #904) [Link]
Another thing that buffered writes really needs (for io_uring) is RWF_NOWAIT support. Are you (or anyone you're aware of) working on that? I might just take a stab at it if not.
Buffered I/O without page-cache thrashing
Posted Dec 16, 2019 3:51 UTC (Mon) by dgc (subscriber, #6611) [Link]
AFAIA no one is working on RWF_NOWAIT for buffered writes. I'm not sure it's worth the trouble at this point in time because any amount of other IO (e.g. other buffered writes) will result in buffered writes always returning -EAGAIN to the caller so you'll just end up punting them all to an async thread, anyway...
-Dave.
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 17:50 UTC (Thu) by thoughtpolice (subscriber, #87455) [Link]
Therefore, the RWF_UNCACHED flag allows you to keep using the (simpler) buffered I/O interface, but simply fixes the pollution issue. It makes the simpler API less troublesome to use, effectively. The fact that the page is quickly added to the cache and then thrown away is really just a natural consequence of the design. End goal (easier to program) vs means of achieving that goal (quick add/remove), and all that.
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 17:50 UTC (Thu) by Sesse (subscriber, #53779) [Link]
“Pages added to the cache via buffered I/O go onto the inactive list; unless they are accessed a second time in the near future, they will be the first pages to be kicked back out. But there is still a fair amount of overhead associated with implementing this behavior”
I didn't know at all that this was the behavior, but it sounds very reasonable to me, so I'm surprised that it's still bad. What is all the overhead about? Simply traversing the inactive list? (It can't be splitting of huge pages, since they are not supported in the page cache yet…)
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 20:00 UTC (Thu) by Sesse (subscriber, #53779) [Link]
In a theoretical world with a hugepage-backed buffer cache, would the equation be any different?
Buffered I/O without page-cache thrashing
Posted Dec 12, 2019 20:39 UTC (Thu) by hnaz (subscriber, #67104) [Link]
It would be, because we'd instantiate and reclaim cache entries in units of 2M (or whatever the huge page size on the architecture) instead of 4k. That's a 512x reduction of list and tree operations for the same amount of data going through.
Buffered I/O without page-cache thrashing
Posted Dec 19, 2019 20:30 UTC (Thu) by mklwn (subscriber, #121081) [Link]
What kernel is this in, or what rev will it go in?
thx,
-m