LWN: Comments on "Readahead: the documentation I wanted to read"
https://lwn.net/Articles/888715/
This is a special feed containing comments posted to the individual LWN article titled "Readahead: the documentation I wanted to read".

Readahead: the documentation I wanted to read
https://lwn.net/Articles/891090/
donald.buczek - Tue, 12 Apr 2022 19:08:28 +0000

> Backup is always tricky, and your proposal would fix one-large-file but do nothing for many-small-files.

To be exact, it's not backup. Our maintenance jobs run locally and are tamed via cgroups. It's (other) users. Our scientific users often process rather big files.

> Or rather, we'd like the page cache to initially exert pressure only on the page cache. The dcache should initially exert pressure only on the dcache. Etc. If a cache can't reclaim enough memory easily, then it should pressure other caches to shrink.

This would probably help a lot in the problem area I described, and also in some others. It's good to know that this is on your mind. The negative-dentries discussion mentioned in the later LWN article seems to get into the same field.

SAM / DAM files
https://lwn.net/Articles/891016/
farnz - Tue, 12 Apr 2022 09:44:13 +0000

And on the POSIX side, there's posix_fadvise (https://pubs.opengroup.org/onlinepubs/000095399/functions/posix_fadvise.html), which has those two options, plus three more for telling the kernel what you're planning to do in the near future.
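For concreteness, a minimal sketch of passing these hints on Linux; the file path is invented for the example and error handling is pared down:

    /* fadvise_hints.c: declare an access pattern before reading a file.
     * Build: cc -o fadvise_hints fadvise_hints.c
     */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/bigfile.dat", O_RDONLY);  /* illustrative path */
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* The two hints matching the Windows flags: SEQUENTIAL (on Linux
         * this doubles the default readahead window) and RANDOM (which
         * disables readahead). offset = 0, len = 0 means "whole file". */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)  /* returns an error number rather than setting errno */
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        /* The "three more": POSIX_FADV_WILLNEED (populate the cache now),
         * POSIX_FADV_DONTNEED (drop cached pages), POSIX_FADV_NOREUSE
         * (data will be accessed only once). */

        /* ... sequential read() loop would go here ... */

        close(fd);
        return EXIT_SUCCESS;
    }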
SAM / DAM files
https://lwn.net/Articles/891002/
Fowl - Tue, 12 Apr 2022 02:46:24 +0000

Windows has FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_RANDOM_ACCESS "hints" that can be passed when opening files, indeed.

https://devblogs.microsoft.com/oldnewthing/20120120-00/?p=8493

SAM / DAM files
https://lwn.net/Articles/890972/
joib - Mon, 11 Apr 2022 18:28:02 +0000

> An OS I used in the past (Pr1mos, what else :-) flagged all files as either SAM or DAM. From the coder's POV there was no real difference between the two - the same primitives worked the same way on all files. But the S and D stood for Sequential and Direct, and the documentation was very clear that sequential files were meant to be read from the beginning, while Direct files were quite happy reading random blocks.

Sounds like an OS designed for Fortran, which has sequential-access and direct-access I/O. Nowadays Fortran also has stream access, which is more like the stream-of-bytes model Unix and Windows provide. Fortran sequential files allow stepping forwards or backwards one record at a time (or going all the way to the beginning or end), but going forwards or backwards N records is an O(N) operation. Direct access, conceptually, is a bunch of fixed-size records allowing access in any order.

Needless to say, on a modern OS that provides only the stream-of-bytes model, it's the task of the Fortran runtime library to implement direct and sequential access on top of the stream of bytes the OS provides.

> Is there any way this sort of information could be fed through to these routines? There's clearly no point reading ahead in a DAM file, while there is no point caching a SAM file once that bit of data has been synchronously read ...

At least from the perspective of the typical Fortran applications I've seen, this would be a very simplistic and bad caching strategy. For instance, reading direct-access files sequentially (as in, first read record #1, then record #2, etc.) is actually very common, as is rereading files (maybe by rerunning the application with partially different input parameters).
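To illustrate the runtime's job that joib describes, a minimal sketch (assuming a POSIX system; the record length and file name are invented) of direct, fixed-size-record access implemented on top of the byte-stream model with pread():

    /* record_io.c: fixed-size-record access on top of a byte stream,
     * roughly what a Fortran runtime does for direct-access files.
     * Build: cc -o record_io record_io.c
     */
    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECL 512  /* record length in bytes; invented for the example */

    /* Read record number recno (1-based, as in Fortran) into buf.
     * Any record is a single pread() away: O(1), unlike a sequential
     * file, where skipping N records costs O(N). */
    static ssize_t read_record(int fd, long recno, char *buf)
    {
        return pread(fd, buf, RECL, (off_t)(recno - 1) * RECL);
    }

    int main(void)
    {
        char buf[RECL];
        int fd = open("data.rec", O_RDONLY);  /* illustrative path */
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Records can be fetched in any order... */
        read_record(fd, 7, buf);
        read_record(fd, 2, buf);

        /* ...but reading them strictly in order is also common, which is
         * why "never read ahead on direct-access files" would be a poor
         * heuristic, as noted above. */
        for (long r = 1; r <= 3; r++)
            read_record(fd, r, buf);

        close(fd);
        return EXIT_SUCCESS;
    }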
SAM / DAM files
https://lwn.net/Articles/890958/
Wol - Mon, 11 Apr 2022 14:18:58 +0000

An OS I used in the past (Pr1mos, what else :-) flagged all files as either SAM or DAM. From the coder's POV there was no real difference between the two - the same primitives worked the same way on all files. But the S and D stood for Sequential and Direct, and the documentation was very clear that sequential files were meant to be read from the beginning, while Direct files were quite happy reading random blocks.

Is there any way this sort of information could be fed through to these routines? There's clearly no point reading ahead in a DAM file, while there is no point caching a SAM file once that bit of data has been synchronously read ...

Cheers,
Wol

Readahead: the documentation I wanted to read
https://lwn.net/Articles/890956/
bfields - Mon, 11 Apr 2022 14:17:13 +0000

> Additional NFS drawbacks: fadvise() data not available

There is actually an IO_ADVISE operation that Linux doesn't implement: https://www.rfc-editor.org/rfc/rfc7862.html#section-1.4.2

Maybe there's a good reason it hasn't been implemented yet, but, anyway, it might be another thing worth looking into here.

Readahead: the documentation I wanted to read
https://lwn.net/Articles/890941/
willy - Mon, 11 Apr 2022 13:39:05 +0000

Ah, I see that in my inbox now ... I read the third email in the chain (the first two went only to xfs?), but didn't read the fourth and fifth. The usual too-much-email problem.

Anyway, I think recognising this special case probably isn't the right solution. Backup is always tricky, and your proposal would fix one-large-file but do nothing for many-small-files.

I suspect the right way to go is to recognise that the page cache is large and has many easily-reclaimable pages, and shrink only the page cache. That is, the problem is that backup is exerting general memory pressure when we'd really like it to exert pressure only on the page cache. Or rather, we'd like the page cache to initially exert pressure only on the page cache. The dcache should initially exert pressure only on the dcache. Etc. If a cache can't reclaim enough memory easily, then it should pressure other caches to shrink.

Readahead: the documentation I wanted to read
https://lwn.net/Articles/890881/
donald.buczek - Mon, 11 Apr 2022 08:19:11 +0000

> I am a bit confused that it evicts useful data.

Sorry, I was unclear with the term "valuable". I'm not talking about hot pages, which are accessed by the system; these can probably avoid eviction by returning to the active list fast enough. The (possibly) useful data lost, which I talked about, are other inactive pages and data from other caches (namely the dcache). The original user complaint was: "`ls` takes ages in the morning." So only when the user took a break was his data replaced. That by itself is not wrong and is the basic strategy of LRU - how should the system know that the user is going to return the next morning? On the other hand, the system *could* notice that a big file, which is never going to fit into the cache, is being read sequentially from the beginning. So keeping the already-processed head of the file when memory is needed is even more likely to be useless, because it will be evicted anyway if the observed pattern continues.

> Do you see a difference if the files are accessed locally versus over NFS?

No, the same is true for access from the local system. NFS is just a complication in the regards I mentioned (sometimes out of order, no fadvise(), no cgroups). In the thread referenced below, I've posted a reproducer script for a local file access.

> would you mind taking this to linux-mm and/or linux-fsdevel?

A colleague of mine did so in August 2021 [1].

Best
Donald

[1]: https://lore.kernel.org/all/878157e2-b065-aaee-f26b-5c87e9ddc2d6@molgen.mpg.de/T/#m933e86097e9e02430ca6d09554648e6b3ba1c87d

Readahead: the documentation I wanted to read
https://lwn.net/Articles/890851/
willy - Sun, 10 Apr 2022 18:06:33 +0000

Thanks for bringing this up! It's a problem I'm aware of and want to fix, but don't have a plan for how to fix yet.

I am a bit confused that it evicts useful data. The way it *should* work is that the use-once pages go on the inactive list, then the inactive list gets pruned, and the pages on the active list stay there.

Do you see a difference if the files are accessed locally versus over NFS? If so, it may be a bug in NFSd (that it's adding pages to the active list instead of the inactive list, perhaps).

I'm not sure that LWN comments are the best place to help you debug this; would you mind taking this to linux-mm and/or linux-fsdevel?

I don't think that readahead is the right place to fix this, but it may be the right place to fix a related problem (which might end up fixing your problem). That is, on an SMP (indeed, NUMA) system, a sequential read much larger than memory will end up triggering reclaim, as it should. The problem is that each CPU tries to take pages from the end of the LRU list and then remove them from the page cache. But all the pages belong to the same file, so they all fight over the same i_pages lock, and do not make useful progress.

Since readahead is already using the i_pages lock to add the new pages to the page cache, I think it's the right place to remove unwanted pages from the page cache. But, as you note, we need to find the right pages to toss (or not ... there's an argument that throwing away the wrong pages is cheaper than finding the right ones ...).
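As a rough userspace approximation of the idea in this subthread (dropping the pages behind a large sequential read), a minimal sketch using posix_fadvise(POSIX_FADV_DONTNEED); the chunk size and fallback path are invented, and this is only a mitigation an application could apply itself, not the kernel-side fix being discussed:

    /* stream_read.c: read a huge file sequentially while discarding the
     * page cache behind us, so the use-once data exerts little pressure
     * on other caches. A userspace sketch only.
     * Build: cc -o stream_read stream_read.c
     */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)  /* 1 MiB per read; invented for the example */

    int main(int argc, char **argv)
    {
        static char buf[CHUNK];
        off_t done = 0;
        ssize_t n;
        int fd = open(argc > 1 ? argv[1] : "bigfile.dat", O_RDONLY);

        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }

        /* Ask for aggressive readahead in front of us... */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        while ((n = read(fd, buf, CHUNK)) > 0) {
            /* ... process buf here ... */
            done += n;
            /* ...and drop everything already consumed. If the sequential
             * pattern continues, the head of the file is the least likely
             * part to be needed again, as discussed above. (A real tool
             * would drop only the most recent chunk each time rather than
             * the whole prefix.) */
            posix_fadvise(fd, 0, done, POSIX_FADV_DONTNEED);
        }

        close(fd);
        return EXIT_SUCCESS;
    }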
Readahead: the documentation I wanted to read
https://lwn.net/Articles/890846/
donald.buczek - Sun, 10 Apr 2022 11:29:22 +0000

Thanks for the article and the docs!

A year or so ago I dug my way through the code without the help of your new documentation, because I was looking for a way to fix something which I think is sub-optimal behavior of mm: we've noticed that a single sequential read of a single big file (bigger than the memory available for the page cache) triggers the shrinkers and makes you lose all your valuable caches for the never-again-needed data from the big file.

As the readahead code is in a position to detect sequential access patterns and has access to information about the backing file (is it "big"?), I wonder if that would be the right place to detect the scenario and maybe drop specific pages from that file before more valuable pages and other caches are affected.

I made some experiments with the detection part, which in our use case is complicated by the fact that accesses come over NFS, so they are occasionally out of order. Additional NFS drawbacks: fadvise() data not available, alternative mitigations based on cgroups not available...

I could more or less identify this pattern and do a printk to show that an opportunity was detected, but I didn't get to the other parts, which would be:

- Decide which pages of the sequential file to sacrifice. LRU might not be optimal here, because if the file is read a second time, it will be from the beginning again.

- How to drop specific pages from the cache. I guess there are a lot of things which can be done wrongly.

Probably I won't get very far. Maybe other work (multi-generational LRU?) will help in that problem area.

Readahead: the documentation I wanted to read
https://lwn.net/Articles/890836/
jreiser - Sun, 10 Apr 2022 02:10:19 +0000

I applaud this and other efforts to enhance maintainability. Almost every study has concluded that maintenance costs are at least 60% of the life-cycle cost of software. When the writing and initial testing have been completed, more than half of the total work remains to be done (over time). Making maintenance easier through better documentation (thorough and accurate) creates higher-quality and less-expensive software.
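Finally, to make the eviction behavior donald.buczek describes observable from userspace, a minimal sketch (Linux-specific; file names come from the command line) that reports page-cache residency per file with mincore(). Running it over a working set before and after a large sequential read of a huge file shows the use-once data displacing the previously cached pages:

    /* cache_residency.c: report how much of each file is resident in the
     * page cache, via mmap() + mincore(). Linux-specific sketch; note
     * that on recent kernels mincore() gives meaningful answers only
     * for files the caller owns or can write.
     * Build: cc -o cache_residency cache_residency.c
     */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        long pagesz = sysconf(_SC_PAGESIZE);

        if (argc < 2) {
            fprintf(stderr, "usage: %s FILE...\n", argv[0]);
            return EXIT_FAILURE;
        }

        for (int i = 1; i < argc; i++) {
            struct stat st;
            int fd = open(argv[i], O_RDONLY);

            if (fd < 0) {
                perror(argv[i]);
                continue;
            }
            if (fstat(fd, &st) != 0 || st.st_size == 0) {
                close(fd);
                continue;
            }

            /* Map the file without touching it; mincore() then reports,
             * one byte per page, whether each page is in the cache. */
            size_t pages = (size_t)((st.st_size + pagesz - 1) / pagesz);
            void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            unsigned char *vec = malloc(pages);

            if (map != MAP_FAILED && vec &&
                mincore(map, st.st_size, vec) == 0) {
                size_t resident = 0;
                for (size_t p = 0; p < pages; p++)
                    resident += vec[p] & 1;  /* low bit: page is resident */
                printf("%s: %zu of %zu pages cached\n",
                       argv[i], resident, pages);
            }

            free(vec);
            if (map != MAP_FAILED)
                munmap(map, st.st_size);
            close(fd);
        }
        return EXIT_SUCCESS;
    }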