memfd_secret() in 5.14
The prototype for memfd_secret() is:
int memfd_secret(unsigned int flags);
The only allowed flag is O_CLOEXEC, which causes the file descriptor to be closed when the calling process calls execve(), so the secret area is not carried over into the new program. The secret area will be available to children after a fork, though. The return value from memfd_secret() will be a file descriptor attached to the newly created secret memory area.
At this point, the process can't actually access that memory, which doesn't even have a size yet. A call must be made to ftruncate() to set the size of the area, then mmap() is used to map that "file" into the owning process's address space. The pages allocated to populate that mapping will be removed from the kernel's direct map, and specially marked to prevent them from being mapped back in by mistake. Thereafter, the memory is accessible to that process, but to nobody else, not even the kernel. The memory is thus about as well protected as it can get, which is good, but there are some consequences as well. Pointers to the secret-memory region cannot be used in system calls, for example; this memory is also inaccessible to DMA operations.
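The sequence described above (create the descriptor, size it with ftruncate(), then mmap() it) can be sketched in C as follows. This is a minimal illustration, not code from the patch set: the helper name secret_area_demo() and its return-value convention are made up for this example, the call goes through syscall() on the assumption that the C library provides no wrapper, and the mapping is created with MAP_SHARED on the assumption that the kernel rejects private mappings of a secret area.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the article): create a one-page secret
 * area, store a dummy value in it, and read the value back.
 * Returns 0 on success, 1 if the descriptor could not be created
 * (for example, because the feature is not enabled), -1 on any later
 * failure.
 */
int secret_area_demo(void)
{
#ifdef SYS_memfd_secret
    /* Assumption: no libc wrapper is available, so use syscall(2). */
    int fd = (int)syscall(SYS_memfd_secret, O_CLOEXEC);
#else
    /* The installed headers predate the system call. */
    int fd = -1;
    errno = ENOSYS;
#endif
    if (fd < 0)
        return 1;

    /* The new area has no size yet; give it exactly one page. */
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return -1;
    }

    /* Map the "file" into this process's address space. */
    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Ordinary loads and stores from the owning process work. */
    static const char key[] = "not-a-real-key";
    memcpy(p, key, sizeof(key));
    int ok = (memcmp(p, key, sizeof(key)) == 0);

    munmap(p, (size_t)page);
    close(fd);
    return ok ? 0 : -1;
}
```

A caller would treat a return value of 1 as "secret memory is not available here" and fall back to ordinary memory. Note that, as described above, a pointer into the mapping cannot be handed to system calls such as write(), so any data that must leave the process has to be copied into a normal buffer first.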
The first public posting of this work happened in October 2019. A number of changes have been made over the nearly two years that this system call has been under development — beyond the shift to a separate system call (in July 2020), which was done because this functionality was deemed to have little in common with ordinary memfds. For example, the ability to reserve a dedicated range of memory for memfd_secret() was added in the version-2 posting later in July 2020, only to be removed in version 5 in September.
Early versions of the patch set included a flag to require the removal of the pages in the secret area from the kernel's direct map. Doing so has the advantage of making the memory inaccessible to the kernel (and to anybody who might be able to compromise the kernel), but there were fears that breaking up the 1GB pages used for the direct mapping would degrade performance. Those fears have subsided over time, though; performance with 2MB pages is not much different than with 1GB pages. So that option disappeared in version 4 in August 2020, and direct-map removal became the rule.
Version 8, posted in November 2020, added a couple of changes. Rather than allocating arbitrary kernel memory, CMA was used as the memory pool for memfd_secret(). Another change, which created a bit of controversy over the life of the patch, disables hibernation when a secret memory area is active. The purpose is to prevent the secret data from being written to persistent storage, but some users may become disgruntled if they find that they can no longer hibernate their systems. That notwithstanding, this behavior was part of the version that went into 5.14.
Since the beginning, memfd_secret() has supported a flag requesting uncached mappings — memory that bypasses the memory caches — for the secret area. This feature makes the area even more secure by eliminating copies in the caches (which could be exfiltrated via Spectre vulnerabilities), but setting memory to be uncached will drastically reduce performance. The caches exist for a reason, after all. Andy Lutomirski repeatedly opposed making uncached mappings available, objecting to the performance cost and more; Rapoport finally agreed to remove it. Version 11 removed that feature, leading to the current state where there are no flags that are specific to this system call.
In version 17 (February 2021), memfd_secret() was disabled by default and a command-line option (secretmem_enable=) was added to enable it at boot time. This decision was made out of fears that system performance could be degraded by breaking up the direct map and locking secret-area memory in RAM, so the feature is unavailable unless the system administrator turns it on. This version also ended the use of CMA for memory allocations.
And that leads to what is essentially the current state of memfd_secret(). It took 23 versions to get there, which seems like a lot, but that is often the nature of memory-management changes. Once 5.14 comes out, we'll see how big the user community is for this feature, and what changes will be deemed necessary. For now, though, this work would appear to have finally reached a successful conclusion.
For the curious, this draft man page has a few more details.
