Defending against page-cache attacks
The page cache holds copies of portions of files (in 4KB pages on most systems) in main memory. When a process needs data from a file, the presence of that data in the page cache eliminates the need to read it from disk, speeding access considerably. Multiple processes accessing the same files (the C library, for example) will share the same copies in the page cache, reducing the amount of memory required by the current workload. On systems hosting containers, much of the runtime system can be shared in this manner.
All of this is good, but it has been known for some time that this kind of shared caching can disclose information between processes. If an attacker can determine which files are currently represented in the page cache, they can learn about what processes running in the system are doing. If the attacker can observe when specific pages are brought into the cache, they can draw conclusions about when specific accesses are being made. For example, it is possible to figure out when a specific function has been called by noting when the page containing that function appears in the cache. Gruss and company have demonstrated a number of exploits, including covert channels and keystroke timing, that can be built on this information.
There are two components to a successful page-cache attack. The first is the ability to determine whether a given page is in the cache, preferably without perturbing the state of the cache in the process. The second is the ability to evict specific pages from the cache; that is required to be able to see when a target accesses those pages. In the paper, eviction is done simply by faulting in enough other pages to force the target pages out; as it turns out, though, there may be an easier way.
Fixing mincore()
Most of the focus in the development community has been on the ability to obtain information on page-cache residency. It may never be possible to completely prevent an attacker from changing the state of the cache (though memory control groups can probably help here), but if the attacker cannot observe the state of the cache, most attacks become quite a bit harder. Indeed, it would be hard even to know that the target pages have been successfully pushed out. Unfortunately, securing this information channel will not be easy.
The Gruss paper targeted mincore(), which is an obvious thing to use since its job is to report on the state of the page cache. By mapping a target file and calling mincore(), an attacker can get immediate information on which pages in that file are currently resident in the page cache. The response that was merged for 5.0 is to change the behavior of mincore() to only report on pages that have been faulted in by the calling process. An attacker can still use mincore() to learn when a page has been evicted, but it can no longer be used to observe when the page is faulted back in by some other process; to do so, the attacker would have to fault the page in first, destroying the desired information.
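The observation primitive is easy to express in code. What follows is a minimal sketch (not code from the paper) that maps a file and asks mincore() which of its pages are resident; the target path is just an illustrative choice, and error handling is abbreviated:

    /* Sketch: map a file read-only and report which of its pages are
     * currently in the page cache (pre-5.0 mincore() semantics). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *path = "/lib/x86_64-linux-gnu/libc.so.6"; /* example target */
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0)
            return 1;

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            return 1;

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);

        /* Bit 0 of each vector entry reports page-cache residency. */
        if (vec && mincore(map, st.st_size, vec) == 0)
            for (size_t i = 0; i < npages; i++)
                printf("page %zu: %s\n", i,
                       (vec[i] & 1) ? "resident" : "not resident");
        return 0;
    }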
This is a significant change to how mincore() works; it has been deliberately held back from the stable updates because of concerns that it might break a user-space program and have to be reverted; those concerns appear to have a basis in reality. Kevin Easton put together a list of Debian packages that use mincore(), but it's not yet clear which of these might have been broken. Perhaps the application from that list that raised the most concern is vmtouch, which is used in some settings to pre-fault a known working set into the page cache, speeding the startup of virtual machines.
The fatal blow, though, seems to have come from Josh Snyder, who reported that: "For Netflix, losing accurate information from the mincore syscall would lengthen database cluster maintenance operations from days to months". That has led developers to reconsider their options, including adding a system mode that would turn mincore() into a privileged operation. Perhaps the idea that is most likely to be adopted came from Dominique Martinet, who suggested that information for a given mapping should only be provided if the caller would be allowed to write to the file underlying that mapping. That would fix the Netflix use case while preventing the monitoring of pages from system executable files. A patch implementing this approach has been posted by Jiri Kosina.
The larger problem
Assuming that a workable solution is found, one might be tempted to conclude that the bigger problem is solved, but that is not yet the case. Dave Chinner pointed out that preadv2() can be used with the RWF_NOWAIT flag to perform non-destructive testing of page-cache contents. A possible solution here is to initiate readahead when an RWF_NOWAIT read fails to find data in the page cache, thus changing the state of the cache and possibly improving performance for normal users at the same time. The patch set from Kosina linked above contains this change as well.
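The probe Chinner described can be sketched as follows; this is an illustration of the mechanism, not code from the discussion. With RWF_NOWAIT, a read succeeds only if it can be satisfied without blocking, so an EAGAIN return reveals that the data was not cached:

    /* Sketch: non-destructively test whether the page at "offset" of an
     * open file is in the page cache, using preadv2() with RWF_NOWAIT.
     * Requires a kernel and glibc new enough to provide RWF_NOWAIT. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    static int page_is_cached(int fd, off_t offset)
    {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

        ssize_t n = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
        if (n >= 0)
            return 1;                 /* satisfied from the cache */
        return (errno == EAGAIN) ? 0  /* would have blocked: not cached */
                                 : -1;
    }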
Chinner sees such patches as playing a game of Whack-A-Mole, though, in a setting containing an abundance of moles. He noted that a number of kernel interfaces have been designed to communicate whether data is immediately available (which generally means that it is in the page cache); this information is legitimately useful to a number of applications. Another possible exploit path, he said, is overlayfs, which is used as a way of sharing page-cache contents across containers. Overall, he said, the mincore() change was the wrong approach.
Later in the discussion, he identified another exploit path: with some filesystems at least, performing a direct-I/O read on a page will force that page out of the cache, greatly simplifying the invalidation problem for attackers. There was some heated discussion over whether this was the right thing for filesystems like XFS to do (Linus Torvalds sees it as a bug), but one clear outcome from the discussion is that this behavior is unlikely to change anytime soon.
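On the filesystems where that behavior holds, the eviction half of an attack reduces to something like the following sketch. The 4KB alignment is an assumption, and whether the cached copy is actually dropped depends on the filesystem:

    /* Sketch: use a direct-I/O read to (on some filesystems) force the
     * page covering "offset" out of the page cache.  O_DIRECT requires
     * aligned buffers and offsets; 4096-byte alignment is assumed. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int try_evict(const char *path, off_t offset)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        void *buf = NULL;
        if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0) {
            if (fd >= 0)
                close(fd);
            return -1;
        }

        /* The direct read bypasses the page cache; as a side effect,
         * some filesystems invalidate cached pages in the range read. */
        ssize_t n = pread(fd, buf, 4096, offset & ~(off_t)4095);

        free(buf);
        close(fd);
        return n < 0 ? -1 : 0;
    }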
Even if all of these holes are plugged, there is still the blunt weapon: simple timing attacks. If a read of a specific page goes quickly, that page was almost certainly in the cache; if it takes more time, it probably had to be read in from persistent storage. Timing attacks are generally destructive and are more easily noticed, but they can still be used. And new holes are likely to appear in the future; in a separate discussion Chinner commented on how the recently posted virtio pmem device functionality could be exploited in the same way. The io_uring feature, if merged in its current form, will also make it easy for an attacker to query the state of the page cache.
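A timing probe needs nothing more exotic than a clock, as in this rough sketch; the 50µs threshold is an arbitrary assumption that a real attack would have to calibrate per system, and note that the probe itself faults the page in, which is what makes it destructive:

    /* Sketch: guess at page-cache residency from read latency.  A
     * one-byte read from a cached page completes quickly; one that has
     * to go to persistent storage takes much longer. */
    #include <time.h>
    #include <unistd.h>

    static int probably_cached(int fd, off_t offset)
    {
        char byte;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        (void) pread(fd, &byte, 1, offset);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                + (t1.tv_nsec - t0.tv_nsec);
        return ns < 50000;   /* under ~50µs: probably served from cache */
    }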
In other words, the problem seems nearly unsolvable, at least in any absolute sense. Probably the best that can be done is to try to raise the bar high enough to head off most attacks. So the known mechanisms for non-destructively querying the state of the page cache are likely to be shut down, perhaps only if the kernel is configured into a "secure mode". Timing attacks may prove to be too hard (or costly) to close off entirely. So, as Torvalds put it, those wanting any sort of absolute security will be disappointed, as usual:
So at no point is this going to be some kind of absolute line in the sand _anyway_. There is no black-and-white "you're protected", there's only levels of convenience.
That still leaves open the problem of closing off the known exploitation vectors without creating problems for existing user-space applications. Like Meltdown and Spectre, this looks like the kind of problem that will be able to keep kernel developers busy for some time yet.
Posted Jan 31, 2019 11:31 UTC (Thu) by sourcejedi (guest, #45153):
> The report was on “applying working set heuristics to the Linux kernel“: essentially testing to see if there were ways to overlay some elements of local page replacement to the kernel’s global page replacement policy that would speed turnaround times.
> The answer to that appears to be ‘no’ – at least not in the ways I attempted, though I think there may be some ways to improve performance if some serious studies of phases of locality in programs gave us a better understanding of ways to spot the end of one phase and the beginning of another.
> But, generally speaking, my work showed the global LRU policy of the kernel was pretty robust.
Posted Jan 17, 2019 21:44 UTC (Thu) by kucharsk (subscriber, #115077):
You can extend the paradigm as far out into the computing arena as you like; if a system has both SSD and hard drives, data from SSD will probably be more important or of greater interest than that on the spinning media. If you have a storage solution that sends data off to secondary or tertiary storage, the time it takes to access said data reveals how old the data is.
Likewise on systems with NVRAM, information in NVRAM will generally be more important or interesting than data not kept in non-volatile storage.
This paradigm is of course true for all operating systems, not just Linux.
Timing is always an issue; during the Cold War, Soviet spies were able to wiretap IBM Selectric typewriters in embassies by detecting how long it took the type ball to rotate to each character, giving them a reasonable chance of determining each character being typed.
We obviously can't take the approach of "slow everything down to the time taken to access the slowest device," and there will always be a need to be able to pre-populate clusters, containers or other mechanisms to provide for fast startup times or to provide instant failover. Someone will need access to that information, and as soon as someone does, that's a potential leak.
It's more a matter of reducing exposure than eliminating it; the question is where the balance between security and the need for ever-faster operation lies.
Posted Jan 18, 2019 1:32 UTC (Fri) by Nahor (subscriber, #51583):
Easy solution: just cache everything. Load the whole disk in RAM at boot. No slow access, no timing attack and the system becomes faster. Win-win! :)
Posted Jan 20, 2019 20:00 UTC (Sun) by farnz (subscriber, #17727):
Your i7-4790K has 32 KiB I$ and 32 KiB D$ - so about as much total L1 cache as your C64 had RAM, but not enough to cover the ROM as well.
My first Z80 machine would fit in L1 cache on your CPU, though - the ZX81 had 1 KiB RAM, 8 KiB ROM, and could be expanded commercially to 16 KiB RAM, 8 KiB ROM.
Posted Jan 30, 2019 14:42 UTC (Wed) by nix (subscriber, #2304):
(Obviously I couldn't fix it. An eight-year-old with terrible coordination go messing in a power supply?! HELL NO.)
Posted Jan 24, 2019 5:15 UTC (Thu) by marcH (subscriber, #57642):
> The future of computing is straight-up partitioning, sharing nothing. It's a much simpler and more robust world.
To avoid a myriad of new CONFIG_SECURE_SIDE_CHANNEL_FOO options, how about a single CONFIG_SHARED_SYSTEM setting controlling all these at once?
"Shared" can unfortunately apply to single-user systems too, think Android applications for instance :-(
Posted Jan 18, 2019 13:20 UTC (Fri) by amarao (guest, #87073):

A lot of server apps, specifically on the I/O side (iSCSI, various storage/cluster/database software). The faster the underlying device, the more desirable it is to use O_DIRECT for access.
Posted Jan 18, 2019 14:47 UTC (Fri) by bof (subscriber, #110741):
Anything with a use case that wants to *avoid* perturbing the page cache. As a sysadmin I regularly use dd iflag=direct or oflag=direct when checksumming or network copying block devices. Applicable to all do-once I/O, actually, and the last time I played with fadvise FADV_NOREUSE (which dd does not support anyway) it was much less reliable.
Posted Jan 21, 2019 1:49 UTC (Mon) by Paf (subscriber, #91811):
The page cache allows write aggregation and readahead, and lets writes complete asynchronously from the submitting syscall. Both of these have enormous (positive) performance impacts, which rise as the amount of I/O the filesystem/device can have in flight increases, and also as the response latency of the device increases.

The page cache allows your single-threaded dd to have the system queue up a bunch of writes that may be able to be processed all at once, as contrasted with direct I/O, which is one I/O per process.

Additionally, if your whole write fits in the page cache and you're not doing other heavy I/O (i.e., semi-idle time is available to write out your data), the ability to write to memory and complete asynchronously means your application-level performance (where the app doesn't wait for the write to be on disk) will stomp almost any standard storage device or RAID array.

This means it's not beneficial in general to use direct I/O for single-use I/O; it really depends on your case. DIO is essentially only faster in the cases where your device is *extremely* fast or you have many threads and a very high-bandwidth back end (so you can overwhelm the page cache).

In cases with higher-latency devices (HDDs, network filesystems) or where there is device-level parallelism to exploit (SSDs), direct I/O is often much, much slower, even for well-formed I/O. (In real deployments of the Lustre parallel file system, which I work on, single-threaded DIO can be 5-10x slower than normal I/O. That's an extreme case, but the reasons for it hold for local filesystems too.)