
Stray-write protection for persistent memory

By Jonathan Corbet
February 3, 2022
Persistent memory has a number of advantages; it is fast, CPU-addressable, available in large quantities and, of course, persistent. But it also, arguably, poses a higher risk of suffering corruption as a result of bugs in the kernel. Protecting against this possibility is the objective of this patch set from Ira Weiny, which makes use of Intel's "protection keys supervisor" (PKS) feature to make it harder for the kernel to inadvertently write to persistent memory.

The stray-write problem

Data stored on rotating or solid-state drives can, normally, only be overwritten with an explicit request issued by the CPU, a complex operation that takes several steps to carry out. Kernel bugs are certainly capable of causing the wrong data to be written to a drive — or the right data to end up in the wrong place — so kernel developers go out of their way to prevent that from happening. On the other hand, the kernel is highly unlikely to accidentally trigger an erroneous write to a storage device when the real objective was, say, to move a process into a new control group or allocate a block of memory. Things can always go wrong, but an error that just happens to go through all of the steps of launching an I/O operation is nearly impossible.

Persistent memory changes that situation. A system can be equipped with terabytes of this memory, all of which may contain important data, and all of which is directly addressable by the CPU. Now all that is required to corrupt that memory is a single write to an incorrect pointer, which is a much more probable sort of bug. The corrupted data could lurk undetected for years before, for example, some poor user discovers that their cryptocurrency wallet has become inaccessible. This kind of corruption seems likely enough that it is worth making an effort to prevent, regardless of how one feels about cryptocurrencies.

An obvious way to protect against this sort of bug would be to set the protections on persistent memory to prevent kernel writes. Those protections would only be changed when the kernel actually needs to write to persistent memory, and only for the duration of the operation. The normal permissions stored in the page tables could accomplish this task, but at a huge performance cost. Changing page-table permissions is expensive; doing so for every persistent-memory write would take away much of the performance gain that justified the use of persistent memory in the first place.

There is another way, though, at least on Intel CPUs. The "memory protection keys" feature allows a four-bit key value to be associated with each page-table entry; there is a per-thread mask that sets the read and write permissions that currently apply to each key. So, for example, that mask might say that all pages marked with key 3 are readable but not writable by the current thread. Changing the mask is not a privileged operation, so protection keys are not a strict security feature, but they can raise the bar for malicious access and prevent accidents. Meanwhile, changing the access mask is a fast operation that can change the accessibility of a large number of pages, so it is a far more efficient way to provide this protection than tweaking page permissions.
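As a rough illustration of how such a mask can work, the following C sketch models it as a 32-bit value with two bits per key: an access-disable bit and a write-disable bit, which is the layout Intel documents for its protection-key registers. The helper names are invented for this example; they are not kernel or C-library interfaces, and the code only simulates the permission check that the hardware performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PKEYS            16u   /* a four-bit key gives 16 possible values */
#define PKEY_ACCESS_DISABLE 0x1u  /* no reads or writes */
#define PKEY_WRITE_DISABLE  0x2u  /* reads allowed, writes denied */

/* Return a copy of mask with the two permission bits for key replaced. */
static inline uint32_t pkey_mask_set(uint32_t mask, unsigned key, uint32_t bits)
{
    assert(key < NR_PKEYS && bits <= 3u);
    unsigned shift = key * 2;
    return (mask & ~(3u << shift)) | (bits << shift);
}

/* Would a read of a page tagged with key be permitted under mask? */
static inline bool pkey_may_read(uint32_t mask, unsigned key)
{
    assert(key < NR_PKEYS);
    return !((mask >> (key * 2)) & PKEY_ACCESS_DISABLE);
}

/* Would a write to a page tagged with key be permitted under mask? */
static inline bool pkey_may_write(uint32_t mask, unsigned key)
{
    assert(key < NR_PKEYS);
    uint32_t bits = (mask >> (key * 2)) & 3u;
    return !(bits & (PKEY_ACCESS_DISABLE | PKEY_WRITE_DISABLE));
}
```

Calling pkey_mask_set(0, 3, PKEY_WRITE_DISABLE) yields a mask under which key 3 is readable but not writable, which is the example given above. Changing one word-sized value changes the rules for every page carrying that key, with no page-table walk required.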

Protecting persistent memory with PKS

Memory protection keys can be used with any type of memory; one use case, for example, is protecting cryptographic keys in ordinary memory. Linux has supported memory protection keys for user space since the 4.9 release in 2016. But memory protection keys can also be used for the kernel, with a kernel-specific key value associated with each page; memory protection keys for the kernel is also known as PKS. Linux is not currently making use of this capability, though. Thus, the first objective of Weiny's 44-part patch set is to add support for PKS to the kernel, a task that involves a number of little details.

For example, the access-control map for PKS is stored in a model-specific register (MSR), which makes changes relatively fast. That MSR is not automatically saved by the CPU, though, when a thread is preempted, so the kernel has to take care of that detail itself. That involved some detailed changes to the pt_regs structure so that it could hold the extra information without breaking code that might not be expecting it. The new "auxiliary pt_regs" infrastructure can hold the PKS permission mask, along with other information that will surely need to be stored in the future.
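The bookkeeping involved can be pictured with a deliberately simplified model. In the sketch below, a global variable stands in for the MSR and each thread carries a saved copy of its mask; all of the names are hypothetical, and the real patch set stores the value in the auxiliary pt_regs area rather than in a structure like this one.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the MSR that holds the current PKS permission mask. */
static uint32_t sim_pks_msr;

struct sim_thread {
    uint32_t saved_pks;  /* mask preserved while the thread is not running */
};

/*
 * Switch from prev to next. The hardware does not save the MSR on its
 * own, so software must stash the outgoing value and load the incoming one.
 */
static inline void sim_context_switch(struct sim_thread *prev,
                                      struct sim_thread *next)
{
    assert(prev && next);
    prev->saved_pks = sim_pks_msr;
    sim_pks_msr = next->saved_pks;
}
```

Without the save and restore steps, a thread that had temporarily opened a key for writing could be preempted and leave that permission in force for whatever ran next.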

There also needs to be a way to allocate the 16 available key values within the kernel. Of those, key 0, which is the default key applied to all pages, must retain the "all access allowed" permissions, so 15 keys remain for other uses. In the current implementation, these keys are allocated statically in the code. That approach will have to change if the kernel runs out of keys, but it is not clear that there will ever be enough users to get to that point. Until that happens, there is little value in adding a more sophisticated key-allocation mechanism, so the static approach prevails. In this patch set, key 1 is reserved for kernel self tests, and key 2 is for stray-write protection.
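A static allocation of this kind might look something like the following sketch. The enum names are made up for illustration, and the choice to start every unassigned key as fully inaccessible is an assumption of the example rather than something stated here; only the key numbers (0, 1, and 2) and the write-denied default for persistent memory come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_ACCESS_DISABLE 0x1u
#define SIM_WRITE_DISABLE  0x2u

enum sim_pks_key {
    SIM_PKS_KEY_DEFAULT = 0,  /* applied to all pages; must stay fully accessible */
    SIM_PKS_KEY_TEST    = 1,  /* kernel self tests */
    SIM_PKS_KEY_PMEM    = 2,  /* stray-write protection */
    SIM_PKS_NR_USED,          /* number of keys handed out so far */
    SIM_PKS_NR_KEYS     = 16  /* hardware limit */
};

/* Adding a seventeenth consumer fails at build time rather than at run time. */
_Static_assert(SIM_PKS_NR_USED <= SIM_PKS_NR_KEYS, "out of protection keys");

/* Build the initial mask: key 0 open, persistent memory read-only,
 * everything else inaccessible (an assumption of this sketch). */
static inline uint32_t sim_pks_default_mask(void)
{
    uint32_t mask = 0;
    for (int key = 1; key < SIM_PKS_NR_KEYS; key++) {
        uint32_t bits = (key == SIM_PKS_KEY_PMEM) ? SIM_WRITE_DISABLE
                                                  : SIM_ACCESS_DISABLE;
        mask |= bits << (key * 2);
    }
    return mask;
}
```

The appeal of the static approach is visible here: there is no allocator state to manage, and running out of keys becomes a compile-time error for whoever adds the next user.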

With that infrastructure in place, it's a relatively straightforward matter to set up persistent-memory pages with the new key, and to set the access permissions to deny writes. That will prevent any stray writes from corrupting the memory, which is good, but will also block legitimate writes, which is not quite as good. So the final step is to change the permissions whenever the kernel needs to write to a persistent-memory page. This is easily enough done in the drivers that deal directly with persistent memory; the appropriate calls can just be inserted before and after the writes are performed. Once again, the change to the permissions mask only applies to the current thread, so no other part of the kernel can write to persistent memory while this operation is underway.
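The bracketing pattern is simple enough to show in a small simulation. Everything below is hypothetical: an ordinary array stands in for persistent memory, a variable stands in for the current thread's permission mask, and a denied store merely returns false where real hardware would raise a fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_PMEM_KEY 2u
#define SIM_WD(key)  (2u << ((key) * 2))   /* write-disable bit for key */

static uint32_t sim_thread_mask = SIM_WD(SIM_PMEM_KEY); /* writes denied by default */
static char sim_pmem[64];                                /* pretend persistent memory */

static inline void sim_pmem_allow_write(void) { sim_thread_mask &= ~SIM_WD(SIM_PMEM_KEY); }
static inline void sim_pmem_deny_write(void)  { sim_thread_mask |= SIM_WD(SIM_PMEM_KEY); }

/* Model a store into persistent memory, honoring the current mask. */
static inline bool sim_store(size_t off, const void *src, size_t len)
{
    assert(src != NULL);
    if (sim_thread_mask & SIM_WD(SIM_PMEM_KEY))
        return false;                       /* a stray write is stopped here */
    if (off > sizeof(sim_pmem) || len > sizeof(sim_pmem) - off)
        return false;
    memcpy(sim_pmem + off, src, len);
    return true;
}

/* A legitimate writer opens the window, writes, and closes it again. */
static inline bool sim_driver_write(size_t off, const void *src, size_t len)
{
    sim_pmem_allow_write();
    bool ok = sim_store(off, src, len);
    sim_pmem_deny_write();
    return ok;
}
```

A direct call to sim_store() outside of sim_driver_write() plays the part of the stray write: it is refused, and the contents of the array are left alone.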

There is a remaining problem, though, in the form of the higher layers of the kernel, which may also need to write to persistent memory. One example is the DAX code within filesystems that allows for the reading and writing of data directly to persistent memory without the need to go through the page cache. Attempts to add the necessary calls around every location that might write to persistent memory seem doomed to failure, so another approach is required.

As it happens, kernel code is already required to make special calls before accessing arbitrary pages of memory. These calls, in the form of kmap() and friends, currently do nothing on most 64-bit systems, but they are needed on 32-bit systems where it is not possible to directly map all of physical memory into the kernel's address space. Weiny's patch set causes the short-term calls (such as kmap_atomic() and kmap_local_page()) to check whether the memory in question is protected by the persistent-memory key. If so, the permissions are changed to allow persistent-memory writes for the duration of the mapping. Long-term mappings (with plain kmap()) of protected persistent memory are not allowed, so the window of opportunity to write to persistent memory is kept small and is limited to the thread that needs it.
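The idea of hooking the mapping calls can be sketched as follows. This is a user-space simulation with invented names; in particular, the real kunmap_local() takes the mapped address rather than a page, and the nesting counter is an assumption about how overlapping mappings might be handled, not a description of the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_PMEM_KEY 2u
#define SIM_WD(key)  (2u << ((key) * 2))   /* write-disable bit for key */

struct sim_page {
    unsigned pkey;     /* protection key stored with the page's mapping */
    char data[32];
};

static uint32_t sim_mask = SIM_WD(SIM_PMEM_KEY); /* current thread's mask */
static unsigned sim_pmem_maps;                   /* depth of protected mappings */

static inline bool sim_can_write(const struct sim_page *p)
{
    return !(sim_mask & SIM_WD(p->pkey));
}

/* Short-term mapping: open the write window if the page is protected. */
static inline char *sim_kmap_local_page(struct sim_page *p)
{
    if (p->pkey == SIM_PMEM_KEY && sim_pmem_maps++ == 0)
        sim_mask &= ~SIM_WD(SIM_PMEM_KEY);
    return p->data;
}

/* Unmapping: close the window once the last protected mapping is gone. */
static inline void sim_kunmap_local(struct sim_page *p)
{
    if (p->pkey != SIM_PMEM_KEY)
        return;
    assert(sim_pmem_maps > 0);
    if (--sim_pmem_maps == 0)
        sim_mask |= SIM_WD(SIM_PMEM_KEY);
}
```

Code in the higher layers that already maps a page, writes to it, and unmaps it needs no changes under this scheme; the permission change rides along with calls that are there anyway.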

This patch set is in its eighth revision and has changed considerably since version 2 was covered here in mid-2020. A number of core developers have had a close look and made comments, resulting in a redesign of much of the functionality. Whether this version will pass muster remains to be seen, but it should be getting closer. The objective seems good: if the CPU provides a mechanism that can efficiently detect and prevent accidents, the kernel should make use of it.

Index entries for this article
Kernel: Memory protection keys
Kernel: Security/Kernel hardening



Stray-write protection for persistent memory

Posted Feb 3, 2022 17:04 UTC (Thu) by Bigos (subscriber, #96807) [Link] (4 responses)

Regarding stray writes corrupting regular disk storage, I believe a write to page cache might have similar repercussions. If a page is mapped read-only, it probably is not written back, but if it is read/write the kernel could corrupt it and the next write-back operation will persist the corruption on disk.

It is still probably a bigger problem with persistent memory, though.

Stray-write protection for persistent memory

Posted Feb 3, 2022 17:45 UTC (Thu) by ttuttle (subscriber, #51118) [Link] (2 responses)

Yeah, but "in-flight writes" is a much smaller target than "the entire device".

Stray-write protection for persistent memory

Posted Feb 5, 2022 15:54 UTC (Sat) by willy (subscriber, #9762) [Link] (1 responses)

Not "in flight". A page that gets marked as dirty. Unless the stray write gets overwritten by the dirtying, the stray write will be written back as a consequence of the page being dirtied.

Stray-write protection for persistent memory

Posted Feb 6, 2022 0:53 UTC (Sun) by ttuttle (subscriber, #51118) [Link]

Oh! Right, sorry. Still a good bit less than the whole device, hopefully.

Stray-write protection for persistent memory

Posted Feb 5, 2022 15:55 UTC (Sat) by willy (subscriber, #9762) [Link]

I made this point when I was at Intel. It seems that the other side have managed to introduce the extra complexity that they wanted.

Stray-write protection for persistent memory

Posted Feb 3, 2022 17:33 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

Note that protection keys for userspace and supervisor are different features. Protection keys for userspace debuted on Skylake Xeon Scalable processors, while protection keys for supervisor are new in Sapphire Rapids (which is still unobtainium as far as I know).

Stray-write protection for persistent memory

Posted Feb 3, 2022 18:24 UTC (Thu) by walters (subscriber, #7396) [Link]

Cool stuff. Though it seems like this problem domain is another argument for https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

Tangentially related, out of curiosity what are the userspace programs using pkey today? I skimmed through the debian and github code search but only saw kernel source code hits and things that list system calls (e.g. systemd seccomp allowlist)


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds