Fixing page-cache side channels, second attempt

By Jonathan Corbet
February 5, 2019

The kernel's page cache, which holds copies of data stored in filesystems, is crucial to the performance of the system as a whole. But, as has recently been demonstrated, it can also be exploited to learn about what other users in the system are doing and extract information that should be kept secret. In January, the behavior of the mincore() system call was changed in an attempt to close this vulnerability, but that solution was shown to break existing applications while not fully solving the problem. A better solution will have to wait for the 5.1 development cycle, but the shape of the proposed changes has started to come into focus.

The mincore() change for 5.0 caused this system call to report only the pages that are mapped into the calling process's address space rather than all pages currently resident in the page cache. That change does indeed take away the ability for an attacker to nondestructively test whether specific pages are present in the cache (using mincore() at least), but it also turned out to break some user-space applications that legitimately needed to know about all of the resident pages. The kernel community is unwilling to accept such regressions unless there is absolutely no other solution, so this change could not remain; it was thus duly reverted for 5.0-rc4.
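For context, the residency query at the center of this discussion can be sketched from user space. The following is a minimal illustration in Python, assuming a Linux system where `ctypes.CDLL(None)` exposes the C library's `mmap()`, `munmap()`, and `mincore()`; the helper name `resident_pages()` is hypothetical and not part of any patch discussed here. What it reports depends on which mincore() behavior the running kernel implements.

```python
import ctypes
import mmap
import os

# Assumption: on Linux, CDLL(None) resolves symbols from the C library
# already loaded into the interpreter.
_libc = ctypes.CDLL(None, use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]
_libc.munmap.restype = ctypes.c_int
_libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
_libc.mincore.restype = ctypes.c_int
_libc.mincore.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                          ctypes.POINTER(ctypes.c_ubyte)]

_MAP_FAILED = ctypes.c_void_p(-1).value


def _raise_errno():
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))


def resident_pages(path):
    """Return one bool per page of the file, as reported by mincore()."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return []
        # Mapping the file does not, by itself, read any of its pages.
        addr = _libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
        if addr == _MAP_FAILED:
            _raise_errno()
        try:
            npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
            vec = (ctypes.c_ubyte * npages)()
            if _libc.mincore(addr, size, vec) != 0:
                _raise_errno()
            # Only the least significant bit of each byte is defined.
            return [bool(b & 1) for b in vec]
        finally:
            _libc.munmap(addr, size)
    finally:
        os.close(fd)
```

Calling `resident_pages()` on a file does not touch its data pages, which is what makes the query nondestructive.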

Regressions are against the community's policy, but so is allowing known security holes to remain open. A replacement for the mincore() change is thus needed; it can probably be found in this patch set posted by Vlastimil Babka at the end of January. It applies a new test to determine whether mincore() will report on the presence of pages in the page cache; in particular, it will only provide that information for memory regions that (1) are anonymous memory, or (2) are backed by a file that the calling process would be allowed to open for write access. In the first case, anonymous mappings should not be shared across security boundaries, so there should be no need to protect information about page-cache residency. For the second case, the ability to write a given file would give an attacker the ability to create all kinds of mischief, of which learning about which pages are cached is relatively minor.

Interestingly, in the cases where mincore() does not return actual page-cache residency information, it reports all pages as being present. This was done out of worries that applications might exist that will make repeated attempts to fault in pages until mincore() confirms that they are present in the cache; reporting a "present" state will prevent such applications from looping forever. But it might also prevent them from bringing in the pages they need, harming performance later. In an attempt to avoid the second problem, Babka has added another patch partially restoring the behavior that was removed from 5.0: if information about page-cache residency for a given region is restricted by the criteria described above, pages will be marked as present only if they are mapped in the calling process's page tables. That will allow a process to observe the effect of explicitly faulting a page in while hiding information about pages that the process has not touched.
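The combined rule described in the last two paragraphs can be written down as a small truth function. This is only a model of the policy as summarized here, not the kernel code from the patch set; the function name and its parameters are hypothetical.

```python
def reported_present(anonymous, caller_can_write, resident, mapped_by_caller):
    """Model of what mincore() would report for one page under the proposal.

    anonymous         -- the region is anonymous memory
    caller_can_write  -- the caller could open the backing file for writing
    resident          -- the page really is resident
    mapped_by_caller  -- the page is mapped in the caller's own page tables
    """
    if anonymous or caller_can_write:
        # Unrestricted case: report the real residency state.
        return resident
    # Restricted case: reveal only what the caller has itself faulted in.
    return mapped_by_caller
```

Under this model, a page that some other process has pulled into the cache from a read-only file stays hidden until the caller touches it: `reported_present(False, False, True, False)` evaluates to `False`.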

It appears that these changes should suffice to close off the use of mincore() to watch the page-cache behavior of other processes without breaking any legitimate use cases. The real world is always capable of providing surprises, though, so these changes will have to be tested for a while before they can be trusted not to break anything. For this reason, they are unlikely to be merged for the 5.0 release. They are likely to be backported to the stable updates, though, if and when they get into the mainline and nobody complains.

In the earlier discussions, though, Dave Chinner pointed out that there are other ways of obtaining the same information. In particular, the preadv2() system call, when used with the RWF_NOWAIT flag, will return immediately (without performing I/O) if the requested data is not in the page cache. It, too, can thus be used to query the presence of pages in the cache without changing that state, which is just the sort of tool an attacker would like to have. The proposed solution here can also be found in the patch set from Babka; it works by always initiating readahead on the pages read with RWF_NOWAIT. That will bring the queried page(s) into the cache, turning the test into a destructive one. A single query can still reveal whether a given page was in the cache at that moment, but an attacker can no longer query repeatedly to observe when a target process faults a page into the page cache. That should block most of the attacks of interest.
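The probe that Chinner described might look something like the sketch below. It relies on Python's `os.preadv()` and `os.RWF_NOWAIT`, which exist only on sufficiently recent Linux and Python versions; the helper name `probe_cached()` and its three-way return value are assumptions made for illustration. Not every filesystem accepts RWF_NOWAIT, so the "unsupported" case is reported separately.

```python
import errno
import os

# os.RWF_NOWAIT is only defined where the platform supports it.
_RWF_NOWAIT = getattr(os, "RWF_NOWAIT", None)


def probe_cached(fd, offset, length):
    """Try a non-blocking read of `length` bytes at `offset`.

    Returns True if the whole range was read without blocking, False if
    the kernel refused because I/O would be needed (or the read came up
    short), and None if RWF_NOWAIT is not usable here. The range should
    lie within the file, or a short read at end-of-file will look like
    a miss.
    """
    if _RWF_NOWAIT is None or not hasattr(os, "preadv"):
        return None
    buf = bytearray(length)
    try:
        nread = os.preadv(fd, [buf], offset, _RWF_NOWAIT)
    except BlockingIOError:
        # EAGAIN: the data was not immediately available.
        return False
    except OSError as exc:
        if exc.errno in (errno.EOPNOTSUPP, errno.ENOSYS, errno.EINVAL):
            return None
        raise
    return nread == length
```

With the proposed readahead change applied, a `False` result from such a probe would no longer leave the cache untouched, since the kernel would start bringing the queried pages in.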

In theory, this change does not affect the semantics of preadv2() as seen by applications. In practice, it could still prove problematic. The existing preadv2() implementation takes pains to avoid performing I/O or blocking for any reason; the changed version could well block in the process of initiating readahead. It is hard to tell whether that change will create performance problems for specific applications, and it may take a long time before any such problems are actually observed and reported. Nobody has suggested a better solution thus far, though.

Assuming that these patches find their way into the mainline, the known mechanisms for nondestructively testing the state of the page cache will have been closed off. It will, of course, remain possible to do destructive testing by simply measuring how long it takes to access a given page; if the access happens quickly, the page was already resident. But destructive attacks are much harder to block; they are also harder to exploit, since each measurement changes the state it is trying to observe. A much bigger problem is likely to be nondestructive attacks that have not yet been discovered; like Spectre, such problems have the potential to haunt us for some time.
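The destructive timing test mentioned above needs nothing more than an ordinary read and a clock. A minimal sketch follows; the helper name `timed_read()` is hypothetical, and any threshold separating "fast" from "slow" would have to be calibrated on the machine in question.

```python
import os
import time


def timed_read(fd, offset, length=4096):
    """Read `length` bytes at `offset`; return (elapsed_ns, bytes_read).

    The read itself pulls the data into the page cache, so a second call
    on the same range says nothing about what other processes did.
    """
    start = time.perf_counter_ns()
    data = os.pread(fd, length, offset)
    elapsed = time.perf_counter_ns() - start
    return elapsed, len(data)
```

A short elapsed time suggests that the page was already cached, but the measurement can be made only once per page before the cache has to be disturbed again.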

Posted Feb 6, 2019 19:04 UTC (Wed) by donald.buczek (subscriber, #112892)

Can't a capability be used to grant the right to get the complete view only to specific programs?