Defending against page-cache attacks
The page cache holds copies of portions of files (in 4KB pages on most systems) in main memory. When a process needs data from a file, the presence of that data in the page cache eliminates the need to read it from disk, speeding access considerably. Multiple processes accessing the same files (the C library, for example) will share the same copies in the page cache, reducing the amount of memory required by the current workload. On systems hosting containers, much of the runtime system can be shared in this manner.
All of this is good, but it has been known for some time that this kind of shared caching can disclose information between processes. If an attacker can determine which files are currently represented in the page cache, they can learn about what processes running in the system are doing. If the attacker can observe when specific pages are brought into the cache, they can draw conclusions about when specific accesses are being made. For example, it is possible to figure out when a specific function has been called by noting when the page containing that function appears in the cache. Gruss and company have demonstrated a number of exploits, including covert channels and keystroke timing, that can be built on this information.
There are two components to a successful page-cache attack. The first is the ability to determine whether a given page is in the cache, preferably without perturbing the state of the cache in the process. The second is the ability to evict specific pages from the cache; that is required to be able to see when a target accesses those pages. In the paper, eviction is done simply by faulting in enough other pages to force the target pages out; as it turns out, though, there may be an easier way.
Fixing mincore()
Most of the focus in the development community has been on the ability to obtain information on page-cache residency. It may never be possible to completely prevent an attacker from changing the state of the cache (though memory control groups can probably help here), but if the attacker cannot observe the state of the cache, most attacks become quite a bit harder. Indeed, it would be hard even to know that the target pages have been successfully pushed out. Unfortunately, securing this information channel will not be easy.
The Gruss paper targeted mincore(), which is an obvious thing to use since its job is to report on the state of the page cache. By mapping a target file and calling mincore(), an attacker can get immediate information on which pages in that file are currently resident in the page cache. The response that was merged for 5.0 is to change the behavior of mincore() to only report on pages that have been faulted in by the calling process. An attacker can still use mincore() to learn when a page has been evicted, but it can no longer be used to observe when the page is faulted back in by some other process; to do so, the attacker would have to fault the page in first, destroying the desired information.
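The observation primitive is easy to express in code. What follows is a minimal sketch (not code from the paper) that maps a file and asks mincore() which of its pages are resident; the target path is just an illustrative choice, and error handling is abbreviated:

    /* Sketch: map a file read-only and report which of its pages are
     * currently in the page cache (pre-5.0 mincore() semantics). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *path = "/lib/x86_64-linux-gnu/libc.so.6"; /* example target */
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0)
            return 1;

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            return 1;

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);

        /* Bit 0 of each vector entry reports page-cache residency. */
        if (vec && mincore(map, st.st_size, vec) == 0)
            for (size_t i = 0; i < npages; i++)
                printf("page %zu: %s\n", i,
                       (vec[i] & 1) ? "resident" : "not resident");
        return 0;
    }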
This is a significant change to how mincore() works; it has been deliberately held back from the stable updates because of concerns that it might break a user-space program and have to be reverted; those concerns appear to have a basis in reality. Kevin Easton put together a list of Debian packages that use mincore(), but it's not yet clear which of these might have been broken. Perhaps the application from that list that raised the most concern is vmtouch, which is used in some settings to pre-fault a known working set into the page cache, speeding the startup of virtual machines.
The fatal blow, though, seems to have come from Josh Snyder, who reported that: "For Netflix, losing accurate information from the mincore syscall would lengthen database cluster maintenance operations from days to months". That has led developers to reconsider their options, including adding a system mode that would turn mincore() into a privileged operation. Perhaps the idea that is most likely to be adopted came from Dominique Martinet, who suggested that information for a given mapping should only be provided if the caller would be allowed to write to the file underlying that mapping. That would fix the Netflix use case while preventing the monitoring of pages from system executable files. A patch implementing this approach has been posted by Jiri Kosina.
The larger problem
Assuming that a workable solution is found, one might be tempted to conclude that the bigger problem is solved, but that is not yet the case. Dave Chinner pointed out that preadv2() can be used with the RWF_NOWAIT flag to perform non-destructive testing of page-cache contents. A possible solution here is to initiate readahead when an RWF_NOWAIT read fails to find data in the page cache, thus changing the state of the cache and possibly improving performance for normal users at the same time. The patch set from Kosina linked above contains this change as well.
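The probe Chinner described can be sketched as follows; this is an illustration of the mechanism, not code from the discussion. With RWF_NOWAIT, a read succeeds only if it can be satisfied without blocking, so an EAGAIN return reveals that the data was not cached:

    /* Sketch: non-destructively test whether the page at "offset" of an
     * open file is in the page cache, using preadv2() with RWF_NOWAIT.
     * Requires a kernel and glibc new enough to provide RWF_NOWAIT. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    static int page_is_cached(int fd, off_t offset)
    {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

        ssize_t n = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
        if (n >= 0)
            return 1;                 /* satisfied from the cache */
        return (errno == EAGAIN) ? 0  /* would have blocked: not cached */
                                 : -1;
    }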
Chinner sees such patches as playing a game of Whack-A-Mole, though, in a setting containing an abundance of moles. He noted that a number of kernel interfaces have been designed to communicate whether data is immediately available (which generally means that it is in the page cache); this information is legitimately useful to a number of applications. Another possible exploit path, he said, is overlayfs, which is used as a way of sharing page-cache contents across containers. Overall, he said, the mincore() change was the wrong approach.
Later in the discussion, he identified another exploit path: with some filesystems at least, performing a direct-I/O read on a page will force that page out of the cache, greatly simplifying the invalidation problem for attackers. There was some heated discussion over whether this was the right thing for filesystems like XFS to do (Linus Torvalds sees it as a bug), but one clear outcome from the discussion is that this behavior is unlikely to change anytime soon.
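On the filesystems where that behavior holds, the eviction half of an attack reduces to something like the following sketch. The 4KB alignment is an assumption, and whether the cached copy is actually dropped depends on the filesystem:

    /* Sketch: use a direct-I/O read to (on some filesystems) force the
     * page covering "offset" out of the page cache.  O_DIRECT requires
     * aligned buffers and offsets; 4096-byte alignment is assumed. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int try_evict(const char *path, off_t offset)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        void *buf = NULL;
        if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0) {
            if (fd >= 0)
                close(fd);
            return -1;
        }

        /* The direct read bypasses the page cache; as a side effect,
         * some filesystems invalidate cached pages in the range read. */
        ssize_t n = pread(fd, buf, 4096, offset & ~(off_t)4095);

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }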
Even if all of these holes are plugged, there is still the blunt weapon: simple timing attacks. If a read of a specific page goes quickly, that page was almost certainly in the cache; if it takes more time, it probably had to be read in from persistent storage. Timing attacks are generally destructive and are more easily noticed, but they can still be used. And new holes are likely to appear in the future; in a separate discussion Chinner commented on how the recently posted virtio pmem device functionality could be exploited in the same way. The io_uring feature, if merged in its current form, will also make it easy for an attacker to query the state of the page cache.
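A timing probe needs nothing more exotic than a clock, as in this rough sketch; the 50µs threshold is an arbitrary assumption that a real attack would have to calibrate per system, and note that the probe itself faults the page in, which is what makes it destructive:

    /* Sketch: guess at page-cache residency from read latency.  A
     * one-byte read from a cached page completes quickly; one that has
     * to go to persistent storage takes much longer. */
    #include <time.h>
    #include <unistd.h>

    static int probably_cached(int fd, off_t offset)
    {
        char byte;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        (void) pread(fd, &byte, 1, offset);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        return ns < 50000;   /* under ~50µs: probably served from cache */
    }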
In other words, the problem seems nearly unsolvable, at least in any absolute sense. Probably the best that can be done is to try to raise the bar high enough to head off most attacks. So the known mechanisms for non-destructively querying the state of the page cache are likely to be shut down, perhaps only if the kernel is configured into a "secure mode". Timing attacks may prove to be too hard (or costly) to close off entirely. So, as Torvalds put it, those wanting any sort of absolute security will be disappointed, as usual:
So at no point is this going to be some kind of absolute line in the sand _anyway_. There is no black-and-white "you're protected", there's only levels of convenience.
That still leaves open the problem of closing off the known exploitation vectors without creating problems for existing user-space applications. Like Meltdown and Spectre, this looks like the kind of problem that will be able to keep kernel developers busy for some time yet.
Posted Jan 31, 2019 11:31 UTC (Thu) by sourcejedi (guest, #45153):
> The report was on “applying working set heuristics to the Linux kernel“: essentially testing to see if there were ways to overlay some elements of local page replacement to the kernel’s global page replacement policy that would speed turnaround times.
> The answer to that appears to be ‘no’ – at least not in the ways I attempted, though I think there may be some ways to improve performance if some serious studies of phases of locality in programs gave us a better understanding of ways to spot the end of one phase and the beginning of another.
> But, generally speaking, my work showed the global LRU policy of the kernel was pretty robust.
Posted Jan 17, 2019 21:44 UTC (Thu) by kucharsk (subscriber, #115077):
You can extend the paradigm as far out into the computing arena as you like; if a system has both SSD and hard drives, data from SSD will probably be more important or of greater interest than that on the spinning media. If you have a storage solution that sends data off to secondary or tertiary storage, the time it takes to access said data reveals how old the data is.
Likewise on systems with NVRAM, information in NVRAM will generally be more important or interesting than data not kept in non-volatile storage.
This paradigm is of course true for all operating systems, not just Linux.
Timing is always an issue; during the Cold War, Soviet spies were able to wiretap IBM Selectric typewriters in embassies by detecting how long it took the type ball to rotate to each character, giving them a reasonable chance of determining each character being typed.
We obviously can't take the approach of "slow everything down to the time taken to access the slowest device," and there will always be a need to be able to pre-populate clusters, containers or other mechanisms to provide for fast startup times or to provide instant failover. Someone will need access to that information, and as soon as someone does, that's a potential leak.
It's more a matter of reducing exposure than eliminating it; the question is where the balance between security and the need for ever-faster operation lies.
Posted Jan 18, 2019 1:32 UTC (Fri) by Nahor (subscriber, #51583):
Easy solution: just cache everything. Load the whole disk in RAM at boot. No slow access, no timing attack and the system becomes faster. Win-win! :)
Posted Jan 20, 2019 20:00 UTC (Sun) by farnz (subscriber, #17727):
Your i7-4790K has 32 KiB I$ and 32 KiB D$ - so about as much total L1 cache as your C64 had RAM, but not enough to cover the ROM as well.
My first Z80 machine would fit in L1 cache on your CPU, though - the ZX81 had 1 KiB RAM, 8 KiB ROM, and could be expanded commercially to 16 KiB RAM, 8 KiB ROM.
Posted Jan 30, 2019 14:42 UTC (Wed) by nix (subscriber, #2304):
(Obviously I couldn't fix it. An eight-year-old with terrible coordination go messing in a power supply?! HELL NO.)
Posted Jan 24, 2019 5:15 UTC (Thu) by marcH (subscriber, #57642):
> The future of computing is straight-up partitioning, sharing nothing. It's a much simpler and more robust world.
To avoid a myriad of new CONFIG_SECURE_SIDE_CHANNEL_FOO options, how about a single CONFIG_SHARED_SYSTEM setting controlling all these at once?
"Shared" can unfortunately apply to single-user systems too, think Android applications for instance :-(
Posted Jan 18, 2019 13:20 UTC (Fri) by amarao (guest, #87073):

A lot of server apps, specifically on the I/O side (iSCSI, various storage/cluster/database software). The faster the underlying device, the more desirable it is to use O_DIRECT for access.
Posted Jan 18, 2019 14:47 UTC (Fri) by bof (subscriber, #110741):
Anything with a use case that wants to *avoid* perturbing the page cache. As a sysadmin I regularly use dd iflag=direct or oflag=direct when checksumming or network copying block devices. Applicable to all do-once I/O, actually, and the last time I played with fadvise FADV_NOREUSE (which dd does not support anyway) it was much less reliable.
Posted Jan 21, 2019 1:49 UTC (Mon) by Paf (subscriber, #91811):
The page cache allows write aggregation and readahead, and lets writes complete asynchronously from the submitting syscall. Both of these have enormous (positive) performance impacts, which rise as the amount of I/O the filesystem/device can have in flight increases, and also as the response latency of the device increases.

The page cache allows your single-threaded dd to have the system queue up a bunch of writes that may be able to be processed all at once, as contrasted with direct I/O, which is one I/O per process.

Additionally, if your whole write fits in the page cache and you're not doing other heavy I/O (i.e., semi-idle time is available to write out your data), the ability to write to memory and complete asynchronously means your application-level performance (where the app doesn't wait for the write to be on disk) will stomp almost any standard storage device or RAID array.

This means it's not beneficial in general to use direct I/O for single-use I/O; it really depends on your case. DIO is essentially only faster in the cases where your device is *extremely* fast or you have many threads and a very high-bandwidth back end (so you can overwhelm the page cache).

In cases with higher-latency devices (HDDs, network filesystems) or where there is device-level parallelism to exploit (SSDs), direct I/O is often much, much slower, even for well-formed I/O. (In real deployments of the Lustre parallel file system, which I work on, single-threaded DIO can be 5-10x slower than normal I/O. That's an extreme case, but the reasons for it hold for local filesystems too.)