
Checking page-cache status with cachestat()

By Jonathan Corbet
December 6, 2022
The kernel's page cache holds pages from files in RAM, allowing those pages to be accessed without expensive trips to persistent storage. Applications are normally entirely unaware of the page cache's operation; it speeds things up and that is all that matters. Some applications, though, can benefit from knowledge about how much of a given file is present in the page cache at any given time; the proposed cachestat() system call from Nhat Pham is the latest in a long series of attempts to make that information available.

In truth, even current kernels make it possible to learn which pages of a file are present in the page cache. The application just needs to map the file into its address space with mmap(), after which a call to mincore() will return a vector showing which pages in that file are resident. This is an expensive solution, though; it requires setting up a (possibly unneeded otherwise) mapping and returns information that, for many applications, has a higher resolution than is necessary.
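For illustration, the mmap()-plus-mincore() approach described above can be sketched from Python with ctypes. This is a minimal sketch rather than a reference implementation: it assumes a Linux system where the C library's mincore() can be reached through `ctypes.CDLL(None)`, and the helper name `resident_pages` is made up for this example.

```python
import ctypes
import mmap
import os


def resident_pages(path):
    """Return (resident, total) page counts for `path` using mmap() + mincore()."""
    # Assumption: mincore() is visible through the symbols already loaded
    # into the process, which is the case with glibc on Linux.
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                             ctypes.POINTER(ctypes.c_ubyte)]
    libc.mincore.restype = ctypes.c_int

    size = os.path.getsize(path)
    if size == 0:
        return 0, 0  # an empty file cannot be mapped
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE

    with open(path, "rb") as f:
        # A private (copy-on-write) mapping is writable from ctypes' point
        # of view, so its address can be taken even for a read-only file.
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY) as m:
            buf = (ctypes.c_char * size).from_buffer(m)
            vec = (ctypes.c_ubyte * npages)()  # one status byte per page
            try:
                if libc.mincore(ctypes.addressof(buf), size, vec) != 0:
                    err = ctypes.get_errno()
                    raise OSError(err, os.strerror(err))
            finally:
                del buf  # drop the buffer export so the mapping can close

    # The low bit of each byte says whether that page is resident.
    return sum(b & 1 for b in vec), npages
```

Even in this small form, the cost is visible: a mapping must be created and torn down, and a vector with one byte per page is filled in just to produce a single count.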

The proposed cachestat() system call is rather simpler:

    struct cachestat {
        __u64 nr_cache;
        __u64 nr_dirty;
        __u64 nr_writeback;
        __u64 nr_evicted;
        __u64 nr_recently_evicted;
    };

    int cachestat(unsigned int fd, off_t offset, size_t len, size_t cstat_size,
                  struct cachestat *cstat);

This call will check the pages of the file indicated by fd, starting at the given offset and going for len bytes, and count the number of pages that are in various states of residency. The offset must be page-aligned; len will be rounded up to a multiple of the page size if needed. The counts will then be returned in the structure pointed to by cstat. In that structure, nr_cache is the number of pages in the given range that are present in the page cache, nr_dirty is the number of those pages that are dirty (have been modified and not yet written back to persistent storage), and nr_writeback is the number of pages currently being written back.
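The alignment and rounding rules can be made concrete with a little arithmetic. The following sketch only models the rules as described above; `pages_examined` is a hypothetical helper, not part of the proposed interface, and rejecting an unaligned offset with ValueError is this sketch's choice.

```python
import mmap


def pages_examined(offset, length, page_size=mmap.PAGESIZE):
    """Return (first_page_index, page_count) for a cachestat()-style range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if offset % page_size != 0:
        raise ValueError("offset must be page-aligned")
    # Round the length up to the next multiple of the page size.
    rounded = -(-length // page_size) * page_size
    return offset // page_size, rounded // page_size
```

With 4096-byte pages, `pages_examined(8192, 5000, 4096)` yields `(2, 2)`: the range starts at the third page of the file, and the 5000-byte length is rounded up to cover two full pages.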

The nr_evicted field provides the count of how many pages were once resident in the cache but have since been forced out, and nr_recently_evicted is the number of those that have been forced out in the recent past. In this case, the "recent past" is defined by the number of pages that have been evicted since the page in question was forced out; if that number is smaller than the process's working-set size, the eviction is deemed to be recent. These counts are obtained by looking at the shadow page-table information that was added to the kernel about ten years ago.
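The "recent" test amounts to a single comparison. The sketch below is a simplified model of the description above, not kernel code; the function name and the idea of passing a list of per-page eviction "ages" are assumptions made for the example.

```python
def count_evictions(evictions_since, workingset_size):
    """Return (nr_evicted, nr_recently_evicted) for a set of evicted pages.

    `evictions_since` holds, for each evicted page, how many pages have
    been evicted since that page was forced out.
    """
    nr_evicted = len(evictions_since)
    # An eviction is "recent" if fewer pages than the working-set size
    # have been evicted since it happened.
    nr_recently_evicted = sum(1 for age in evictions_since
                              if age < workingset_size)
    return nr_evicted, nr_recently_evicted
```

For example, with a working-set size of 1000 pages, evictions that happened 10, 500, and 2000 evictions ago would be reported as three evicted pages, two of them recent.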

The size of the cachestat structure must be provided to cachestat() as cstat_size. This interface allows new fields to be added to that structure in the future; if cstat_size is smaller than the size as known within the kernel, data will only be provided up to the provided size, preserving compatibility. (If, instead, cstat_size is larger than what the kernel expects, the call will fail with an EINVAL error.)
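The size-based compatibility scheme can be modeled in a few lines. This is only a user-space sketch of the behavior described above, assuming the five-field layout shown earlier; `copy_cachestat_out` and `KERNEL_SIZE` are illustrative names, not kernel symbols.

```python
import errno
import os
import struct

FIELDS = ("nr_cache", "nr_dirty", "nr_writeback",
          "nr_evicted", "nr_recently_evicted")
KERNEL_SIZE = 8 * len(FIELDS)  # five __u64 fields, 40 bytes


def copy_cachestat_out(counts, cstat_size):
    """Return the bytes a caller passing `cstat_size` would receive."""
    if cstat_size > KERNEL_SIZE:
        # The caller knows about fields that this "kernel" does not.
        raise OSError(errno.EINVAL, os.strerror(errno.EINVAL))
    full = struct.pack("=5Q", *(counts[name] for name in FIELDS))
    # An older caller with a smaller structure gets only the leading fields.
    return full[:cstat_size]
```

A caller built against a three-field version of the structure would pass 24 and receive just nr_cache, nr_dirty, and nr_writeback. This works only as long as new fields are appended and existing ones keep their offsets and meanings.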

By not requiring the mapping and unmapping of the file(s) to be queried, cachestat() avoids most of the overhead created by the mincore() method. The fact that this call returns simple counts rather than detailed, by-page information is also helpful in the end; it seems that applications wanting this kind of information are interested in the number of cache-resident pages, but they don't really care about which pages are resident. So there is no point in returning the more detailed data.

One open question that is not well answered in this patch set, though, is: what kinds of applications will benefit from this information? When LWN covered a similar effort in 2010 (the system call was called fincore() then), the use case involved applications that call posix_fadvise() to bring data into the page cache prior to accessing it. These applications (SQLite is evidently one of them) know what their data-access patterns will be, but they have less information about how much of their data will fit into the page cache at any given time. By calling cachestat(), such an application can learn whether the pages it is prefetching into the cache are still there by the time it gets around to using them. If those pages are being evicted, the prefetching is overloading the page cache and causing more work overall; in such situations, the application can back off and get better performance.
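The back-off idea might look something like the following. This is a hedged sketch of one possible policy, not anything taken from the patch set or from SQLite: the function name, the halving and doubling steps, and the thresholds are all hypothetical, and `still_cached` stands in for the nr_cache value an application would obtain for the range it prefetched.

```python
def next_prefetch_window(window, prefetched, still_cached,
                         min_window=16, max_window=4096, keep_ratio=0.9):
    """Choose the next prefetch size (in pages) from cache-survival feedback."""
    if prefetched <= 0:
        return window  # nothing was prefetched, so there is no feedback
    survived = still_cached / prefetched
    if survived < keep_ratio:
        # Prefetched pages were evicted before use: back off.
        return max(min_window, window // 2)
    # The cache held on to the prefetched pages: try a larger window.
    return min(max_window, window * 2)
```

An application that prefetched 256 pages and found only 100 of them still cached would shrink its next window to 128 pages; if all 256 survived, it would try 512.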

So cachestat() appears to be useful, but whether there is room in the kernel for this new system call remains to be seen. Attempts to add this functionality have faltered for over a decade, perhaps due to the highly specialized nature of the use case. But, just maybe, the new interface and renewed push for inclusion will get it over the bar this time.

Index entries for this article
Kernel: Releases/6.5
Kernel: System calls/cachestat()



Checking page-cache status with cachestat()

Posted Dec 6, 2022 19:24 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> unsigned int fd

Since when are file descriptors unsigned?

Checking page-cache status with cachestat()

Posted Dec 6, 2022 20:53 UTC (Tue) by jhoblitt (subscriber, #77733) [Link]

It has always amazed me there isn't a typedef for file descriptors.

Checking page-cache status with cachestat()

Posted Dec 6, 2022 20:13 UTC (Tue) by mss (subscriber, #138799) [Link] (8 responses)

It would be really great if there was a way to list the system's page-cache contents: like which inodes are there and which of their offset ranges are covered.

Same goes for files currently being read or written from a block device and their offsets - blktrace only shows raw block device operations, AFAIK there's no way to see which files they pertain to.

This would really help debugging or fine tuning the page cache replacement algorithm.

Checking page-cache status with cachestat()

Posted Dec 6, 2022 20:27 UTC (Tue) by willy (subscriber, #9762) [Link] (1 responses)

You can enable the tracepoints.

/sys/kernel/debug/tracing/events/filemap/

mm_filemap_add_to_page_cache
mm_filemap_delete_from_page_cache

Is there more information you'd like?

Checking page-cache status with cachestat()

Posted Dec 6, 2022 21:18 UTC (Tue) by mss (subscriber, #138799) [Link]

Those tracepoints' output looks sensible, thanks.

I've also looked at related tracepoints and writeback:writeback_single_inode and writeback:writeback_written seem to be the most useful ones for tracing inode writeback.

Although an ability to list page-cache contents would still be useful for debugging cases which occur only infrequently, where it is not practical to have tracing enabled 24/7.

Checking page-cache status with cachestat()

Posted Dec 6, 2022 21:39 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

Some information is available, but the tooling for it is quite janky. The kernel's tools/vm/page-types can show the set of pages that exist for a file, the pages for a process, etc. I don't think it can show the set of files that are cached though.

Checking page-cache status with cachestat()

Posted Dec 6, 2022 22:41 UTC (Tue) by osandov (subscriber, #97963) [Link] (4 responses)

If you have debug symbols installed, you can get this information fairly easily with drgn. I threw this together in 10 minutes, so no, it's not super polished, but it's a good start:

    #!/usr/bin/env drgn

    import os

    from drgn.helpers.linux.fs import inode_path
    from drgn.helpers.linux.list import list_for_each_entry
    from drgn.helpers.linux.radixtree import radix_tree_for_each

    page_size = prog["PAGE_SIZE"].value_()
    # Walk every mounted filesystem, then every in-memory inode on it.
    for sb in list_for_each_entry(
        "struct super_block", prog["super_blocks"].address_of_(), "s_list"
    ):
        printed_sb = False
        for inode in list_for_each_entry(
            "struct inode", sb.s_inodes.address_of_(), "i_sb_list"
        ):
            printed_path = False
            # [start, end) is the current run of consecutive cached page indexes.
            start = end = -1
            for index, page in radix_tree_for_each(inode.i_mapping.i_pages.address_of_()):
                # Print the headers lazily, only for inodes with cached pages.
                if not printed_path:
                    if not printed_sb:
                        print(f"Filesystem {os.fsdecode(sb.s_id.string_())}:")
                        printed_sb = True
                    path = inode_path(inode)
                    if path is None:
                        print("  <unknown>:")  # Name isn't cached.
                    else:
                        print(f"  {os.fsdecode(path)}:")
                    printed_path = True
                if index == end:
                    # This page extends the current run.
                    end = index + 1
                else:
                    # Gap found: print the finished run (as a byte range)
                    # and start a new one.
                    if start < end:
                        print(f"    {start * page_size}-{end * page_size - 1}")
                    start = index
                    end = index + 1
            # Print the last run for this inode, if any.
            if start < end:
                print(f"    {start * page_size}-{end * page_size - 1}")

Checking page-cache status with cachestat()

Posted Dec 7, 2022 4:16 UTC (Wed) by willy (subscriber, #9762) [Link] (2 responses)

This needs a minor tweak to handle large folios correctly -- ask each page that you retrieve how large it is instead of assuming it's PAGE_SIZE

Checking page-cache status with cachestat()

Posted Dec 7, 2022 4:52 UTC (Wed) by osandov (subscriber, #97963) [Link] (1 responses)

Ah, thanks. I seem to remember that at one point, we would put the head page of a huge page in i_pages multiple times, one for each PAGE_SIZE unit that it covered. Is that not the case anymore?

Checking page-cache status with cachestat()

Posted Dec 7, 2022 12:58 UTC (Wed) by willy (subscriber, #9762) [Link]

We've had three representations of THPs in the page cache. Before I got to it, each page in a THP was inserted into the tree. I first changed that to inserting the head page N times. For a while now, we've used the sibling feature of the radix tree / XArray to insert the head page just once.

There are two reasons for the latest change; the first is that it handles tags/marks correctly; if you mark any index as dirty, it marks the entire range as dirty. That's important for range writeback (which wasn't needed for shmem but is for XFS!). The second reason is that it saves memory once you have a page of order 6 or higher; a THP of order 9 saves 8 radix tree nodes, each of which occupies a seventh of a 4kB page, so about 4682 bytes.

Checking page-cache status with cachestat()

Posted Dec 8, 2022 0:50 UTC (Thu) by mss (subscriber, #138799) [Link]

That's an impressive script made on demand, thanks!

I think it would be great if it was included as an example in the contrib directory of your drgn repo for posterity as it will get quickly lost buried deeply in comments here.

There's even a related fs_inodes.py script already there.

Checking page-cache status with cachestat()

Posted Dec 6, 2022 20:58 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> The proposed cachestat() system call is rather simpler

Le sigh.

Where's the flags argument? Where's the version argument? Where are the input arguments (e.g. to get per-core stats)?

Checking page-cache status with cachestat()

Posted Dec 6, 2022 21:09 UTC (Tue) by jhoblitt (subscriber, #77733) [Link]

In the prototype for cachestat2()...

Checking page-cache status with cachestat()

Posted Dec 7, 2022 10:07 UTC (Wed) by bof (subscriber, #110741) [Link] (3 responses)

> Where's the flags argument? Where's the version argument?

The article seems to say that they are called "len".

> Where are the input arguments (e.g. to get per-core stats)?

What is per-core about pagecache?

Checking page-cache status with cachestat()

Posted Dec 7, 2022 13:00 UTC (Wed) by willy (subscriber, #9762) [Link]

I've seen situations where it might be useful to know which NUMA node the pagecache is allocated from.

Checking page-cache status with cachestat()

Posted Dec 8, 2022 0:35 UTC (Thu) by gray_-_wolf (subscriber, #131074) [Link] (1 responses)

> > Where's the flags argument? Where's the version argument?
>
> The article seems to say that they are called "len".

I always wondered, how do you handle situations where you for example want to
replace 1 int with 4 chars or something like that? The size stays the same (on
some platforms). How do you handle that? Add some padding or something to make
sure the size is different? Or is it that you basically never remove fields and
just mark them unused/reserved, so the structure is always just growing?

Checking page-cache status with cachestat()

Posted Dec 8, 2022 8:37 UTC (Thu) by bof (subscriber, #110741) [Link]

I think that's the sane thing, only adding to the structure and keeping older fields at some dummy default value.

If you think about mismatch between kernel and userlevel, with potentially ancient userlevel tools inside containers or whatever, you don't want to confuse them by changing representation of fields.

Checking page-cache status with cachestat()

Posted Dec 8, 2022 8:46 UTC (Thu) by rwmj (subscriber, #5474) [Link]

No one seems to have mentioned the excellent little cachestats tool (https://github.com/Feh/nocache). A quick look at the code shows it is using the mmap+mincore method.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds