Defending against page-cache attacks
The page cache holds copies of portions of files (in 4KB pages on most systems) in main memory. When a process needs to access data from a file, the presence of that data in the page cache eliminates the need to read it from disk, speeding things up considerably. Multiple processes accessing the same files (the C library, for example) will share the same copies in the page cache, reducing the amount of memory required by the current workload. On systems hosting containers, much of the runtime system can be shared in this manner.
All of this is good, but it has been known for some time that this kind of shared caching can disclose information between processes. If an attacker can determine which files are currently represented in the page cache, they can learn about what processes running in the system are doing. If the attacker can also observe when specific pages are brought into the cache, they can draw conclusions about when specific accesses are being made. For example, it is possible to figure out when a specific function has been called by noting when the page containing that function appears in the cache. Gruss and colleagues have demonstrated a number of exploits, including covert channels and keystroke timing, that can be accomplished with this information.
There are two components to a successful page-cache attack. One of them is being able to determine whether a given page is in the cache, preferably without perturbing the state of the cache in the process. The other half of the problem, though, is the ability to evict specific pages from the cache; that is required to be able to see when a target accesses those pages. In the paper, this is done simply by faulting in enough other pages to force the target pages out; as it turns out, though, there may be an easier way.
Fixing mincore()
Most of the focus in the development community has been on the ability to obtain information on page-cache residency. It may never be possible to completely prevent an attacker from changing the state of the cache (though memory control groups can probably help here), but if the attacker cannot observe the state of the cache, most attacks become quite a bit harder. Indeed, it would be hard even to know that the target pages have been successfully pushed out. Unfortunately, securing this information channel will not be easy.
The Gruss paper targeted mincore(), which is an obvious thing to use since its job is to report on the state of the page cache. By mapping a target file and calling mincore(), an attacker can get immediate information on which pages in that file are currently resident in the page cache. The response that was merged for 5.0 is to change the behavior of mincore() to only report on pages that have been faulted in by the calling process. An attacker can still use mincore() to learn when a page has been evicted, but it can no longer be used to observe when the page is faulted back in by some other process; to do so, the attacker would have to fault the page in first, destroying the desired information.
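As a rough illustration of the interface under discussion, the sketch below maps a file and asks mincore() which of its pages are resident, much as a tool like vmtouch might. The helper name count_resident_pages() is hypothetical, and what it reports depends on which of the mincore() behaviors described here the running kernel implements.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the pages of `path` that mincore() reports
 * as resident. Returns that count, or -1 on error (including an empty
 * file). On success the total number of pages is stored in *total_pages.
 */
long count_resident_pages(const char *path, long *total_pages)
{
    assert(path != NULL && total_pages != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t length = (size_t)st.st_size;
    size_t pages = (length + page_size - 1) / page_size;

    /* Creating the mapping does not, by itself, touch any page. */
    void *map = mmap(NULL, length, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    long resident = -1;
    unsigned char *vec = malloc(pages);   /* one status byte per page */
    if (vec != NULL && mincore(map, length, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;       /* low bit set: page is resident */
    }

    free(vec);
    munmap(map, length);
    *total_pages = (long)pages;
    return resident;
}
```

Calling count_resident_pages() on a file before and after some other process reads it is the kind of observation that the 5.0 change is meant to block.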
This is a significant change to how mincore() works; it has been deliberately held back from the stable updates because of concerns that it might break a user-space program and have to be reverted. Those concerns appear to have a basis in reality. Kevin Easton put together a list of Debian packages that use mincore(), but it's not yet clear which of these might have been broken. Perhaps the application from that list that raised the most concern is vmtouch, which is used in some settings to pre-fault in a known working set to speed the startup of a virtual machine.
The fatal blow, though, seems to have come from Josh Snyder, who reported that: "For Netflix, losing accurate information from the mincore syscall would lengthen database cluster maintenance operations from days to months". That has led developers to reconsider their options, including adding a system mode that would turn mincore() into a privileged operation. Perhaps the idea that is most likely to be adopted came from Dominique Martinet, who suggested that information for a given mapping should only be provided if the caller would be allowed to write to the file underlying that mapping. That would fix the Netflix use case while preventing the monitoring of pages from system executable files. A patch implementing this approach has been posted by Jiri Kosina.
The larger problem
Assuming that a workable solution is found, one might be tempted to conclude that the bigger problem is solved, but that is not yet the case. Dave Chinner pointed out that preadv2() can be used with the RWF_NOWAIT flag to perform non-destructive testing of page-cache contents. A possible solution here is to initiate readahead when an RWF_NOWAIT read fails to find data in the page cache, thus changing the state of the cache and possibly improving performance for normal users at the same time. The patch set from Kosina linked above contains this change as well.
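The preadv2() probe mentioned above can be sketched as follows. This is a minimal illustration rather than code from the discussion: the helper name probe_cached() is hypothetical, and it assumes a kernel and C library recent enough to provide preadv2() and RWF_NOWAIT, on a filesystem that supports non-blocking buffered reads.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to read one byte of `path` at `offset`
 * without blocking.
 *   returns  1  the byte was returned immediately (data was cached)
 *   returns  0  the kernel answered EAGAIN (data not immediately available)
 *   returns -1  any other outcome (open failure, EOF, unsupported, ...)
 */
int probe_cached(const char *path, off_t offset)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    /* RWF_NOWAIT asks the kernel to fail rather than wait for I/O. */
    ssize_t n = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
    int saved_errno = errno;
    close(fd);

    if (n == 1)
        return 1;
    if (n < 0 && saved_errno == EAGAIN)
        return 0;
    return -1;
}
```

With the readahead change described above, a 0 result would no longer be free of side effects: the failed read would itself start bringing the data into the cache.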
Chinner sees such patches as playing a game of Whack-A-Mole, though, in a setting containing an abundance of moles. He noted that a number of kernel interfaces have been designed to communicate whether data is immediately available (which generally means that it is in the page cache); this information is legitimately useful to a number of applications. Another possible exploit path, he said, is overlayfs, which is used as a way of sharing page-cache contents across containers. Overall, he said, the mincore() change was the wrong approach.
Later in the discussion, he identified another exploit path: with some filesystems at least, performing a direct-I/O read on a page will force that page out of the cache, greatly simplifying the invalidation problem for attackers. There was some heated discussion over whether this was the right thing for filesystems like XFS to do (Linus Torvalds sees it as a bug), but one clear outcome from the discussion is that this behavior is unlikely to change anytime soon.
Even if all of these holes are plugged, there is still the blunt weapon: simple timing attacks. If a read of a specific page goes quickly, that page was almost certainly in the cache; if it takes more time, it probably had to be read in from persistent storage. Timing attacks are generally destructive and are more easily noticed, but they can still be used. And new holes are likely to appear in the future; in a separate discussion Chinner commented on how the recently posted virtio pmem device functionality could be exploited in the same way. The io_uring feature, if merged in its current form, will also make it easy for an attacker to query the state of the page cache.
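The timing approach needs nothing beyond an ordinary read and a clock. The sketch below, with the hypothetical helper name timed_read_ns(), measures how long a one-byte pread() takes; any threshold separating "cached" from "not cached" would have to be calibrated for the machine and storage device in question, and the read itself pulls the page into the cache, which is why such probes are destructive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the time, in nanoseconds, taken to read
 * one byte of `path` at `offset`, or -1 on failure. A short time
 * suggests the page was already cached; a long one suggests it had to
 * be read from storage.
 */
int64_t timed_read_ns(const char *path, off_t offset)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct timespec start, end;
    char byte;
    int ok = clock_gettime(CLOCK_MONOTONIC, &start) == 0;

    /* This read faults the page in, changing the state being measured. */
    ssize_t n = pread(fd, &byte, 1, offset);

    ok = ok && clock_gettime(CLOCK_MONOTONIC, &end) == 0;
    close(fd);

    if (!ok || n != 1)
        return -1;

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
           + (end.tv_nsec - start.tv_nsec);
}
```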
In other words, the problem seems nearly unsolvable, at least in any absolute sense. Probably the best that can be done is to try to raise the bar high enough to head off most attacks. The known mechanisms for non-destructively querying the state of the page cache are thus likely to be shut down, perhaps only if the kernel is configured into a "secure mode". Timing attacks may prove to be too hard (or costly) to close off entirely. As Torvalds put it, those wanting any sort of absolute security will be disappointed, as usual:
So at no point is this going to be some kind of absolute line in the sand _anyway_. There is no black-and-white "you're protected", there's only levels of convenience.
That still leaves open the problem of closing off the known exploitation vectors without creating problems for existing user-space applications. Like Meltdown and Spectre, this looks like the kind of problem that will be able to keep kernel developers busy for some time yet.
