Fixing page-cache side channels, second attempt
The mincore() change for 5.0 caused this system call to report only the pages that are mapped into the calling process's address space rather than all pages currently resident in the page cache. That change does indeed take away the ability for an attacker to nondestructively test whether specific pages are present in the cache (using mincore() at least), but it also turned out to break some user-space applications that legitimately needed to know about all of the resident pages. The kernel community is unwilling to accept such regressions unless there is absolutely no other solution, so this change could not remain; it was thus duly reverted for 5.0-rc4.
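For readers who have not used mincore() directly, the kind of query under discussion can be sketched as follows. This is a minimal illustration, not code from any of the patches: the helper name count_resident_pages() is invented here, and what the returned count actually means depends on which of the kernel behaviors described in this article is in effect.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `path` read-only and count the pages that
 * mincore() reports as resident. Returns the count, or -1 on any error.
 * If `total` is non-NULL, the number of pages in the mapping is stored there.
 */
long count_resident_pages(const char *path, long *total)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping keeps the file referenced */
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1; /* only the low bit is meaningful */
    }

    free(vec);
    munmap(map, len);
    if (total != NULL)
        *total = (long)npages;
    return resident;
}
```

Note that the helper never touches the mapped pages itself, which is what makes the query nondestructive: calling it repeatedly does not change the state being observed.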
Regressions are against the community's policy, but so is allowing known security holes to remain open. A replacement for the mincore() change is thus needed; it can probably be found in a patch set posted by Vlastimil Babka at the end of January. It applies a new test to determine whether mincore() will report on the presence of pages in the page cache; in particular, it will only provide that information for memory regions that (1) are anonymous memory, or (2) are backed by a file that the calling process would be allowed to open for write access. In the first case, anonymous mappings should not be shared across security boundaries, so there should be no need to protect information about page-cache residency. For the second case, the ability to write a given file would give an attacker the ability to create all kinds of mischief, of which learning about which pages are cached is relatively minor.
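The two-part test can be modeled in user space roughly as shown below. This is only an approximation under stated assumptions: may_report_residency() is a hypothetical name, the real check is made inside the kernel against the mapping rather than a path name, and access() tests the real user ID rather than the effective one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical user-space model of the policy described above.
 * Residency information is exposed only when the region is anonymous,
 * or when the caller could open the backing file for writing.
 */
bool may_report_residency(bool is_anonymous, const char *backing_path)
{
    if (is_anonymous)
        return true;            /* case (1): anonymous memory */
    if (backing_path == NULL)
        return false;           /* no backing file to test against */
    /* case (2): approximate "could open for write" with access() */
    return access(backing_path, W_OK) == 0;
}
```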
Interestingly, in the cases where mincore() does not return actual page-cache residency information, it reports all pages as being present. This was done out of worries that applications might exist that will make repeated attempts to fault in pages until mincore() confirms that they are present in the cache; reporting a "present" state will prevent such applications from looping forever. But it might also prevent them from bringing in the pages they need, harming performance later. In an attempt to avoid the second problem, Babka has added another patch partially restoring the behavior that was removed from 5.0: if information about page-cache residency for a given region is restricted by the criteria described above, pages will be marked as present only if they are mapped in the calling process's page tables. That will allow a process to observe the effect of explicitly faulting a page in while hiding information about pages that the process has not touched.
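The reporting rule that results from the additional patch can be summarized with a small model. The structure and function names here are invented for illustration and do not correspond to anything in the kernel; the sketch only captures the decision described in the paragraph above.

```c
#include <stdbool.h>

/* Hypothetical description of one page, as the model sees it. */
struct page_state {
    bool in_page_cache;     /* page is resident in the page cache */
    bool mapped_by_caller;  /* page is present in the caller's page tables */
};

/*
 * Returns the bit that mincore() would report for this page in the model.
 * `allowed` is the outcome of the permission test: anonymous memory, or a
 * file the caller could open for write access.
 */
unsigned char reported_bit(struct page_state page, bool allowed)
{
    if (allowed)
        return page.in_page_cache ? 1 : 0;  /* true residency information */
    /* Restricted region: reveal only what the caller did itself. */
    return page.mapped_by_caller ? 1 : 0;
}
```

In this model a page that some other process has faulted into the cache, but that the caller has not touched, is reported as absent for a restricted region, so the caller learns nothing about the other process's activity.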
It appears that these changes should suffice to close off the use of mincore() to watch the page-cache behavior of other processes without breaking any legitimate use cases. The real world is always capable of providing surprises, though, so these changes will have to be tested for a while before they can be trusted not to break anything. For this reason, they are unlikely to be merged for the 5.0 release. They are likely to be backported to the stable updates, though, if and when they get into the mainline and nobody complains.
In the earlier discussions, though, Dave Chinner pointed out that there are other ways of obtaining the same information. In particular, the preadv2() system call, when used with the RWF_NOWAIT flag, will return immediately (without performing I/O) if the requested data is not in the page cache. It, too, can thus be used to query the presence of pages in the cache without changing that state — just the sort of tool an attacker would like to have. The proposed solution here can also be found in the patch set from Babka; it works by always initiating readahead on the pages read with RWF_NOWAIT. That will bring the queried page(s) into the cache, turning the test into a destructive one. That does not entirely foil the ability to determine whether a given page is in the cache, but it does eliminate the ability to repeatedly query to observe when a target process faults a page into the page cache. That should block most of the attacks of interest.
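A probe of the sort Chinner described might look something like the sketch below. The function name probe_cached() is hypothetical, and the code assumes a C library that exposes preadv2() and RWF_NOWAIT; where the flag is not defined, or the filesystem or kernel does not support it, the function simply reports an error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical probe: try to read one byte at `offset` without waiting.
 * Returns 1 if data came back (the page was cached), 0 if the kernel
 * answered EAGAIN (the read would have required I/O), and -1 for
 * anything else (bad descriptor, end of file, no RWF_NOWAIT support).
 */
int probe_cached(int fd, off_t offset)
{
#ifdef RWF_NOWAIT
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    ssize_t n = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
    if (n > 0)
        return 1;
    if (n < 0 && errno == EAGAIN)
        return 0;
    return -1;
#else
    (void)fd;
    (void)offset;
    return -1;      /* assumption: no RWF_NOWAIT in this C library */
#endif
}
```

With the proposed change, a call that returns 0 would also start readahead on the page, so a second call made shortly afterward would likely return 1 regardless of what any other process had done in the meantime.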
In theory, this change does not affect the semantics of preadv2() as seen by applications. In practice, it could still prove problematic. The existing preadv2() implementation takes pains to avoid performing I/O or blocking for any reason; the changed version could well block in the process of initiating readahead. It is hard to tell whether that change will create performance problems for specific applications, and it may take a long time before any such problems are actually observed and reported. Nobody has suggested a better solution thus far, though.
Assuming that these patches find their way into the mainline, the known mechanisms for nondestructively testing the state of the page cache will have been closed off. It will, of course, remain possible to do destructive testing by simply measuring how long it takes to access a given page; if the access happens quickly, the page is resident. But destructive attacks are much harder to block; they are also harder to exploit. A much bigger problem is likely to be nondestructive attacks that have not yet been discovered; like Spectre, such problems have the potential to haunt us for some time.
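The timing measurement mentioned above can be sketched in a few lines. This is an illustration only: time_access_ns() is a hypothetical helper, and a real attack would need a calibrated threshold separating cached from uncached accesses, which is not attempted here.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical helper: return the number of nanoseconds taken to read one
 * byte at `addr`. For a page in a file-backed mapping, a long delay
 * suggests that the access had to go to storage. The access itself brings
 * the page into the cache, which is why this test is destructive.
 */
int64_t time_access_ns(const volatile unsigned char *addr)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    unsigned char value = *addr;    /* the access being timed */
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)value;

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}
```

Because the first measurement changes the state being measured, an attacker cannot simply poll a page this way to learn when another process touches it.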
Index entries for this article | |
---|---|
Kernel | Memory management/Page cache |
Kernel | Security |
Posted Feb 6, 2019 19:04 UTC (Wed)
by donald.buczek (subscriber, #112892)