More timing side-channels for the page cache
In 2019, researchers published a way to identify which file-backed pages were being accessed on a system using timing information from the page cache, leading to a handful of unpleasant consequences and a change to the design of the mincore() system call. Discussion at the time led to a number of ad-hoc patches to address the problem. The lack of new page-cache attacks suggested that attempts to fix things in a piecemeal fashion had succeeded. Now, however, Sudheendra Raghav Neela, Jonas Juffinger, Lukas Maar, and Daniel Gruss have found a new set of holes in the Linux kernel's page-cache-timing protections that allow the same general class of attack.
The impact
The ability to determine when pages are present in memory, and when they are accessed, may not sound particularly easy to exploit. There are some subtle attack vectors, however, such as the fact that knowing which page of an executable is in memory can indicate which code is being executed. That, in turn, allows other attacks that depend on precise timing to be deployed more reliably. For example, the timing information can be used to defeat address-space-layout randomization. Another possible use is detecting when a user is entering a password, by watching when the privileged application accepting the password resumes executing in response to an event. That reveals how long it takes the user to press each key, which can be used to reconstruct the actual words typed with reasonable fidelity. [A reader pointed out that the linked paper applies specifically to written text, and that "reconstructing passwords, passphrases, and pseudorandom strings presents a very different and more difficult problem". Depending on how similar a password is to normal text, timing information may be more or less usable to narrow the search space of possible passwords.]
This is not a recommendation to start changing your password every thirty days, however: the real problem is not any specific attack that can be performed with access to page-cache-timing information, but the fact that the page cache touches on the timing of nearly every operation on a modern computer. For longstanding performance reasons, the page cache is shared between applications running at different privilege levels, and can be monitored and flushed by any of those applications. Changing the semantics of page-cache operations would almost certainly cause noticeable breakage in user space.
The mechanisms
The original page-cache-timing exploits from 2019 used the mincore() system call, which lets programs check whether a page is already present in the page cache so that they can optimize their access patterns. The fix at the time made mincore() return fake information for pages that are not mapped in the calling process's page tables, so that one application could not spy on another. That check, however, was not correctly applied to the cachestat() system call when it was added to the kernel in 2023, reopening the same set of vulnerabilities.
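The original mincore() probe is simple enough to sketch in a few lines. This Python version (our illustration, not code from the paper) uses ctypes to call mincore() directly; since the 2019 fix, it only reports real residency information in the cases the kernel still allows, such as files the caller could also write. The private, writable mapping is an artifact of ctypes needing a writable buffer; a C implementation would simply map the file read-only.

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL(None, use_errno=True)

def resident_pages(path):
    """Map a file and ask mincore() which of its pages are resident.

    Returns one boolean per page. Note: the kernel only reports real
    residency when the post-2019 permission checks pass.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # MAP_PRIVATE + PROT_WRITE so ctypes can take the buffer's
        # address; the copy-on-write mapping still reflects the cache.
        m = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * npages)()  # one status byte per page
    addr = ctypes.addressof(ctypes.c_char.from_buffer(m))
    if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec):
        raise OSError(ctypes.get_errno(), "mincore() failed")
    return [bool(b & 1) for b in vec]  # bit 0 = page is resident
```

cachestat() provides the same kind of information in bulk, which is why the missing check there reopened the hole.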
The recent paper also laid out a handful of other mechanisms that don't depend on specific system calls, however. The most basic is simply measuring the amount of time that it takes for a page to be read; when the page is not in the cache, it takes noticeably longer to read the page. Reading the page can be done with a number of system calls, such as read(), mmap(), or even sendfile().
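The basic timing probe can be sketched as follows; this is our illustrative version, not code from the paper. A read that hits the page cache typically completes in well under a microsecond, while one that has to go to storage takes orders of magnitude longer:

```python
import os
import time

def timed_read(fd, offset, length=4096):
    """Return how long (in nanoseconds) a single-page pread() takes.

    Comparing this value against a calibrated threshold distinguishes
    cached pages from uncached ones. Note that the probe itself pulls
    the page into the cache.
    """
    start = time.perf_counter_ns()
    os.pread(fd, length, offset)
    return time.perf_counter_ns() - start
```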
The disadvantage to using timing information to detect whether a page is present in the page cache is that attempting to read the page brings it into the cache. If a malicious program loads the page of interest right before the program being attacked, the malicious program may not be able to flush the page out of the cache quickly enough to observe the subsequent access. Still, there are well-known statistical techniques for turning an unreliable timing channel into a reliable leak of information.
Flushing a page from the page cache by brute force is fairly simple: just access enough other pages that the cache fills up and evicts the least-recently used page (although the exact details get more complicated with the kernel's multi-generational LRU). There is a subtler way, though: the posix_fadvise() system call. It allows applications to advise the kernel on how they expect to access a file; the POSIX_FADV_DONTNEED advice reliably makes the kernel remove the relevant page from the page cache, as long as the page is not mapped anywhere else with mmap(). The system call is also usable as a timing side channel itself: the call completes more quickly when the targeted page is not already in the cache. Calling posix_fadvise() in a loop can therefore both keep a page out of the page cache and determine when another process faults it back in.
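The dual use of posix_fadvise() — eviction primitive and timing probe in one — can be sketched like this (our example, with our choice of names, not the paper's code):

```python
import os
import time

def evict_and_time(fd, offset, length=4096):
    """Evict one page with POSIX_FADV_DONTNEED and report how long
    the call took, in nanoseconds.

    The call returns faster when the page was already absent, so the
    elapsed time doubles as a residency signal while the eviction
    keeps the page out of the cache for the next round.
    """
    start = time.perf_counter_ns()
    os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    return time.perf_counter_ns() - start
```

Distinguishing "fast" from "slow" calls requires calibrating a threshold on the target system first, as with any timing channel.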
Even if posix_fadvise() isn't available, the preadv2() system call, when invoked with the RWF_NOWAIT flag, also provides a way to check whether a page is in the page cache. The flag makes the system call return EAGAIN when the page is not immediately available. The exact semantics of the call are left to individual filesystems, so the probe could theoretically still bring the page into the cache, but the researchers cite another recent paper (unfortunately paywalled) that claims this doesn't happen in practice.
These mechanisms can be combined in flexible ways. The most reliable attack technique demonstrated by the paper was to use posix_fadvise() to remove a page from the cache, and then wait for it to be faulted back in with preadv2(). If preadv2() is blocked (by seccomp, perhaps), using posix_fadvise() on its own still works, just a little less reliably. If posix_fadvise() is blocked, evicting the page from the cache the old-fashioned way still works. And if both system calls are blocked, pages can still be repeatedly evicted and loaded to obtain rough timing information.
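The most reliable combination described above — evict with posix_fadvise(), detect the refault with preadv2() — might look roughly like this sketch (our construction; the loop structure, names, and parameters are our choices, not the paper's):

```python
import os
import time

def watch_page(fd, offset, duration=0.1, length=4096):
    """Record the times at which some other process faults a page in.

    Repeatedly probes the page with RWF_NOWAIT; whenever the probe
    succeeds, the page must have been brought back by someone else,
    so the time is logged and the page is evicted again.
    """
    faults = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        try:
            # Succeeds only if the page is resident in the cache.
            if os.preadv(fd, [bytearray(length)], offset,
                         os.RWF_NOWAIT) > 0:
                faults.append(time.monotonic())
                # Push it back out for the next observation.
                os.posix_fadvise(fd, offset, length,
                                 os.POSIX_FADV_DONTNEED)
        except BlockingIOError:
            pass  # still evicted; keep polling
    return faults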
What to do
It's tempting to try to introduce another set of targeted fixes for posix_fadvise() and preadv2(). The problem is that these system calls have existed since 2003 and 2016, respectively, and are widely used by existing user-space applications. Changing the semantics of either would almost certainly break existing programs. In an informal writeup of the paper on his blog, Neela quoted Linus Torvalds's opinion from their private correspondence:
Yeah, while I'm very comfortable changing cachestats, I'm not so sure about POSIX_FADV_DONTNEED.
In particular, I can easily see cases where people really want to say "drop the caches" on files that they really cannot write to.
Even if kernel developers were to change the semantics of POSIX_FADV_DONTNEED, however, there are multiple other mechanisms that accomplish the same thing. The only complete solution would be to partition the page cache so that privileged processes no longer share pages with unprivileged ones. There is existing research on the performance impact of such a change, but the results are mixed: the impact depends heavily on the particular workloads making use of the cache. A less-invasive mitigation might be for security-sensitive software to use mlock() to pin its executable code in memory, although that would introduce yet another subtle detail that writers of secure software would have to be aware of and apply judiciously.
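The mlock()-based mitigation might look roughly like this (a sketch of the idea only; ctypes is used because Python's mmap module does not expose mlock(), and a real program would instead pin its already-mapped text segment from C, mindful of RLIMIT_MEMLOCK):

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL(None, use_errno=True)

def pin_file_pages(path):
    """Map a file and mlock() it so that eviction-based probes cannot
    push its pages out of memory.

    Returns the mapping; the lock is dropped when the mapping goes
    away, so the caller must keep a reference for as long as the
    protection is needed.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Writable private mapping only so ctypes can take its
        # address; C code would use a plain read-only mapping.
        m = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(m))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(size)):
        raise OSError(ctypes.get_errno(), "mlock() failed")
    return m
```

Locked memory counts against RLIMIT_MEMLOCK, which is one reason this cannot simply be applied to every security-sensitive binary on a system.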
Since the researchers' disclosure of the new set of page-cache-based attacks in January, cachestat() has been patched, but there has been relatively little discussion about other changes to the page cache. Without any clever ideas about how to mitigate the risks without harming backward compatibility, this may become one of those attacks that, like Spectre or Rowhammer, can be mitigated but not properly prevented.
| Index entries for this article | |
|---|---|
| Security | Information leak |
