Patching until the COWs come home (part 1)
Copy on write is a standard mechanism for sharing a single instance of an object between processes in a situation where each process has the illusion of an independent, private copy of that object. Examples include memory pages shared between processes or data extents shared between files. To see how COW is used in the memory-management subsystem, consider what happens when a process calls fork(): the pages in that process's private memory areas should no longer be shared between the parent and child. But, instead of creating new copies of those pages for the child process during the fork() call, the kernel will simply map the parent's pages in the child's page tables. Importantly, the page-table entries in both parent and child are set as read-only (write-protected).
If either process attempts to write to one of these pages, a page fault will occur, and the kernel's page-fault handler will create a new copy of the page, replacing the page-table entry (PTE) in the faulting process with a PTE that references the new page, but which allows the write to proceed. This action is often referred to as "breaking COW". If the other process then tries to write to that same page, another page fault will occur, as that process's PTE is still marked read-only. But now the page-fault handler will recognize that the page is no longer shared, so the PTE can just be made writable and the process can resume.
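The fault-handling sequence just described can be sketched as a small userspace model. This is only a toy under stated assumptions: `toy_page`, `toy_pte`, `toy_fork()`, `toy_write_fault()`, and `toy_write()` are hypothetical names invented for illustration, not kernel structures or functions, and the model tracks nothing but a mapping count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Toy stand-in for a physical page; mapcount counts the PTEs pointing at it. */
struct toy_page {
    int mapcount;
    char data[TOY_PAGE_SIZE];
};

/* Toy page-table entry: which page is mapped and whether writes are allowed. */
struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/* fork(): the child maps the parent's page and both PTEs become read-only. */
static struct toy_pte toy_fork(struct toy_pte *parent)
{
    parent->writable = false;
    parent->page->mapcount++;
    return (struct toy_pte){ .page = parent->page, .writable = false };
}

/* Write fault on a read-only PTE: copy if still shared, otherwise reuse. */
static void toy_write_fault(struct toy_pte *pte)
{
    if (pte->page->mapcount > 1) {
        struct toy_page *copy = malloc(sizeof(*copy));
        assert(copy != NULL);
        memcpy(copy->data, pte->page->data, TOY_PAGE_SIZE);
        copy->mapcount = 1;
        pte->page->mapcount--;  /* this PTE no longer maps the old page */
        pte->page = copy;       /* "breaking COW" */
    }
    pte->writable = true;       /* the faulting write may now proceed */
}

/* A store through a PTE, faulting first if the PTE is write-protected. */
static void toy_write(struct toy_pte *pte, const char *s)
{
    if (!pte->writable)
        toy_write_fault(pte);
    snprintf(pte->page->data, TOY_PAGE_SIZE, "%s", s);
}
```

In this model, the first writer after `toy_fork()` ends up with a fresh copy, while the second writer finds a mapping count of one and simply has its PTE made writable, matching the two faults described above.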
The benefits of this scheme are lower memory consumption and a reduction of CPU time spent copying pages during fork() calls. Often the price of copying is never paid for many of the pages because the child might call exit() or exec() before either the parent or the child writes to those pages.
While the COW mechanism looks simple, the devil is in the details, as past experience has already shown. The recent trouble in this area started in 2020; an attempt to fix a vulnerability (which is actually still not fixed in all scenarios) resulted in two major changes and in many corner cases, some of which are still not fully ironed out.
The trouble begins
The first public sign of issues with the COW mechanism appeared in the form of commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") at the end of May 2020. The changelog doesn't fully describe the problem scenario, but what is there is ominous enough:
> End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead.
Any doubts about whether the commit fixed a security vulnerability vanish when one notices the Reported-by tag mentioning Jann Horn; presumably Horn's report went through the appropriate non-public security channels. The practice of making fixes to some vulnerabilities immediately public without explicitly marking them as such is not new, especially in the COW area. Nevertheless, the related Project Zero issue was made public in August, and CVE-2020-29374 was assigned in December; both point to the above-mentioned commit as the fix.
As the Project Zero issue includes proof-of-concept (PoC) code, we can look at the fix with that code in mind and not rely on the incomplete commit log. The most important parts of the PoC are the following:
    static void *data;
    posix_memalign(&data, 0x1000, 0x1000);
    strcpy(data, "BORING DATA");
    if (fork() == 0) {
        // child
        int pipe_fds[2];
        struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
        char buf[0x1000];
        pipe(pipe_fds);
        vmsplice(pipe_fds[1], &iov, 1, 0);
        munmap(data, 0x1000);
        sleep(2);
        read(pipe_fds[0], buf, 0x1000);
        printf("read string from child: %s\n", buf);
    } else {
        // parent
        sleep(1);
        strcpy(data, "THIS IS SECRET");
    }
The code starts by allocating an anonymous, private page and writing some data there; it then calls fork(). At that point, the page becomes a COW page: it is write-protected for the parent process by making the corresponding page-table entry read-only, and an identical PTE is created for the child process. Then, while the parent is blocked inside sleep(), the child creates a pipe and passes the page to that pipe with vmsplice(), a system call that is similar to write() but which allows a zero-copy data transfer of the page's contents. In order to achieve that, the kernel takes a reference on the source page (by increasing its reference count) through get_user_pages() or one of its variants; the set of these functions is often referred to as "GUP". The child then unmaps the page from its own page tables (but retains the reference in the pipe) and goes to sleep.
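For readers who have not used vmsplice() before, the call on its own can be tried in a single process. The sketch below is Linux-specific; `splice_page_roundtrip()` and `DEMO_PAGE_SIZE` are hypothetical names made up for this example. It only shows a page-aligned buffer going into a pipe and coming back out through read(), and does not attempt to reproduce the vulnerability.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define DEMO_PAGE_SIZE 0x1000

/*
 * Hypothetical helper: put msg into a page-aligned buffer, hand that
 * buffer to a pipe with vmsplice(), then read the pipe into out.
 * Returns the number of bytes read, or -1 on any failure.
 */
static ssize_t splice_page_roundtrip(const char *msg, char *out, size_t out_len)
{
    void *page = NULL;
    int fds[2];
    ssize_t n = -1;

    if (out_len < DEMO_PAGE_SIZE)
        return -1;
    if (posix_memalign(&page, DEMO_PAGE_SIZE, DEMO_PAGE_SIZE) != 0)
        return -1;
    memset(page, 0, DEMO_PAGE_SIZE);
    snprintf((char *)page, DEMO_PAGE_SIZE, "%s", msg);

    if (pipe(fds) == 0) {
        struct iovec iov = { .iov_base = page, .iov_len = DEMO_PAGE_SIZE };

        /* The pipe now refers to the user page rather than a copy of it. */
        if (vmsplice(fds[1], &iov, 1, 0) == (ssize_t)DEMO_PAGE_SIZE)
            n = read(fds[0], out, DEMO_PAGE_SIZE);
        close(fds[0]);
        close(fds[1]);
    }
    free(page);
    return n;
}
```

The buffer is freed only after the pipe has been drained and closed, since the pipe keeps referring to the page until then.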
The parent wakes up from its sleep and writes new data to the page. The page-table entry is write-protected, so the write causes a page fault. The page-fault handler can tell that this is a fault on a COW page because the mapping allows write access while the PTE is write-protected. If there were more processes mapping the page, then the content would have to be copied (breaking COW), but if there is a single mapping, the page can just be made writable. The kernel relies on the value returned by page_mapcount() to determine how many mappings exist.
Here is the problem: page_mapcount() at this point in the PoC's execution includes only the parent's mapping, because the child has already called munmap() on that page. This function does not take into account the fact that the child can still access the parent's page through the pipe; it ignores the elevated page reference count. Thus, the kernel allows the parent process to write new data into the page, which is no longer considered to be a COW page. Finally, the child wakes up and reads that new data from the pipe, which might include sensitive information that the parent did not expect the child to see.
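The bookkeeping behind this failure can be replayed with two counters. The following is a toy model, not kernel code: `toy_page`, `can_reuse_by_mapcount()`, and `poc_leaks()` are hypothetical names, and the assumption that each mapping also holds one reference is a simplification made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/* Toy page: mapcount counts page-table mappings, refcount all references. */
struct toy_page {
    int mapcount;
    int refcount;
    char data[TOY_PAGE_SIZE];
};

/* The reuse decision as the article describes it: only mappings are counted. */
static bool can_reuse_by_mapcount(const struct toy_page *p)
{
    return p->mapcount == 1;
}

/*
 * Replay the PoC's ordering on the toy page. Returns true if the parent's
 * write landed in a page that something other than a mapping (here, the
 * pipe) still references.
 */
static bool poc_leaks(struct toy_page *page, const char *secret)
{
    /* fork(): parent and child both map the page (one reference each) */
    page->mapcount = 2;
    page->refcount = 2;

    /* child: vmsplice() takes an extra reference on behalf of the pipe */
    page->refcount++;

    /* child: munmap() drops the child's mapping and that mapping's reference */
    page->mapcount--;
    page->refcount--;

    /* parent: write fault; the elevated refcount is never consulted */
    if (!can_reuse_by_mapcount(page))
        return false;   /* the page would have been copied instead */
    snprintf(page->data, TOY_PAGE_SIZE, "%s", secret);

    return page->refcount > page->mapcount;
}
```

At the time of the parent's fault, the toy page has a mapping count of one but a reference count of two; the extra reference is the pipe, which is exactly what the mapcount-only check fails to see.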
Corralling the problem
One might rightfully ask why this potential for leaking data from parent to child can matter in practice, as both processes are normally executing the code from the same binary and the fork() only acts as a branch in the code. So we can assume that either the binary is trusted, and thus the child process is too, or it is not, in which case we probably should not let the parent access any sensitive data in the first place. And, in the scenario where fork() from a trusted binary is followed by an exec() of a potentially malicious binary, exec() removes all shared pages from the address space of the child process before loading the new binary.
But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.
Moreover, the vmsplice() syscall might just be a symptom of a broader issue, since there are many other callers of the GUP functions in the kernel. So it is a good idea in general not to let a child process hold on to a page shared through the COW mechanism with the parent while letting the parent write new contents to the page.
To prevent exploits of this behavior, commit 17839856fd58 made it impossible to get a reference (even a read-only reference) via GUP to a COW-shared page. All such attempts now result in breaking COW and returning a reference to the new copy instead. Thus, in the PoC code above, calling vmsplice() now causes the child process to replace the shared COW page in the corresponding page table entry with a new page, which is then passed to the pipe. Afterward, the child no longer has any way to access the parent's page and the new contents written there.
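The effect of this change on the PoC can be illustrated with the same kind of toy model. Everything here is a hypothetical sketch: `toy_page`, `toy_pte`, and `toy_gup()` are invented names, and the sharing test (a read-only PTE on a page with more than one mapping) is an assumption standing in for whatever the real commit checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

struct toy_page {
    int mapcount;               /* page-table mappings */
    int refcount;               /* all references, mappings included */
    char data[TOY_PAGE_SIZE];
};

struct toy_pte {
    struct toy_page *page;
    bool writable;
};

/*
 * Toy GUP with the behavior described above: a reference to a COW-shared
 * page is never handed out; COW is broken first and the caller gets a
 * reference to the new copy. Returns NULL if the copy cannot be allocated.
 */
static struct toy_page *toy_gup(struct toy_pte *pte)
{
    if (!pte->writable && pte->page->mapcount > 1) {
        struct toy_page *copy = malloc(sizeof(*copy));

        if (copy == NULL)
            return NULL;
        memcpy(copy->data, pte->page->data, TOY_PAGE_SIZE);
        copy->mapcount = 1;
        copy->refcount = 1;
        pte->page->mapcount--;  /* the caller's PTE leaves the shared page */
        pte->page->refcount--;
        pte->page = copy;
        pte->writable = true;
    }
    pte->page->refcount++;      /* the reference the caller (the pipe) keeps */
    return pte->page;
}
```

With this model, the page that ends up in the pipe is the child's private copy, so a later write by the parent to the original page is not visible through the pipe.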
The commit notes the possibility of worse performance for some GUP users, especially those that rely on a lockless variant of the interface like get_user_pages_fast(). The changelog goes on to say that finer-grained rules could be added later for situations where it is clear that it is safe to keep sharing the COW page because it can never be overwritten with new, potentially sensitive contents. The system-wide zero page would be one example of this sort of situation. But otherwise, Linus Torvalds (the author of the change) expected no fundamental issues with this aggressively COW-breaking approach for GUP. Linux 5.8 was duly released with this commit.
And this, one might think, was the end of the problem. But, as was mentioned at the outset, COW is a complicated and subtle beast. In truth, the problems were just beginning. The second half of this article will delve into how the COW fix led to a stampede of new problems that still have yet to be completely solved.
Index entries for this article | |
---|---|
Kernel | Memory management/get_user_pages() |
Kernel | Security/Vulnerabilities |
Security | Linux kernel |
GuestArticles | Babka, Vlastimil |
Posted Mar 22, 2021 18:55 UTC (Mon) by tux3 (subscriber, #101245)
I certainly am glad that Project Zero is there to root out these bugs (pun intended).
Posted Mar 23, 2021 12:37 UTC (Tue) by gerdesj (subscriber, #5446)
Posted Mar 22, 2021 19:22 UTC (Mon) by nix (subscriber, #2304)
Posted Mar 22, 2021 20:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
Can we already use io_uring to spawn new processes?
Posted Mar 22, 2021 23:09 UTC (Mon) by NYKevin (subscriber, #129325)
Incidentally, while researching this comment, I stumbled across this tidbit from clone(2):
> In contrast to the glibc wrapper, the raw clone() system call
> accepts NULL as a stack argument (and clone3() likewise allows
> cl_args.stack to be NULL). In this case, the child uses a
> duplicate of the parent's stack. (Copy-on-write semantics ensure
> that the child gets separate copies of stack pages when either
> process modifies the stack.) In this case, for correct
> operation, the CLONE_VM option should not be specified. (If the
> child shares the parent's memory because of the use of the
> CLONE_VM flag, then no copy-on-write duplication occurs and chaos
> is likely to result.)
That sounds like it would be fun to debug. Now I'm wondering if any developers have decided that they "need to" bypass glibc and pull this sort of chicanery on any of my systems... (I'm an SRE, so if it broke in production, it would be my problem to fix it). "Fortunately," most of the bugs I've seen have tended to be higher-level than this, but it's still a bit frightening that the kernel will just let you do something like that.
* Which is obviously not going to happen given the kernel's fanatical devotion to not breaking userspace, but let's pretend for a moment.
Posted Mar 23, 2021 10:08 UTC (Tue) by pbonzini (subscriber, #60935)
Posted Mar 23, 2021 16:12 UTC (Tue) by NYKevin (subscriber, #129325)
Posted Mar 23, 2021 23:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)
But I feel what we might actually get is a io_uring-based API that does this using BPF.
Posted Mar 24, 2021 1:45 UTC (Wed) by foom (subscriber, #14868)
Minor problem being that ptrace is really awful. It's too bad the proposals to add a saner handle based version have all died off.
Posted Mar 23, 2021 11:01 UTC (Tue) by joib (subscriber, #8541)
Posted Mar 23, 2021 14:17 UTC (Tue) by abatters (✭ supporter ✭, #6932)
glibc <= 2.23:
posix_spawn() uses vfork() if POSIX_SPAWN_USEVFORK is set or if there is no cleanup expected in the child before it exec(3)s the requested file. However, this implementation of vfork() was the source of a number of bugs.
Linux glibc >= 2.24:
glibc commit 9ff72da471a509a8c19791efe469f47fa6977410
posix_spawn() switches from vfork() to clone(CLONE_VM | CLONE_VFORK) which uses a separate stack for the child. This fixes a number of vfork()-related bugs ("possible parent clobber due stack spilling"), making it possible to enable by default and ignore POSIX_SPAWN_USEVFORK.
recent non-Linux glibc:
glibc commit ccfb2964726512f6669fea99a43afa714e2e6a80
POSIX_SPAWN_USEVFORK is ignored and regular fork() is always used, due to difficulties getting vfork() to work without the Linux-specific clone() semantics.
Note that using clone(CLONE_VM | CLONE_VFORK) safely requires blocking all signals, including NPTL-internal signals. But the glibc wrappers don't let you block NPTL-internal signals, making it much more difficult to do outside of glibc. See the glibc implementation for all the gory details.
Posted Mar 22, 2021 23:42 UTC (Mon) by milesrout (subscriber, #126894)
For what it's worth, this all reminds me of the SCM_RIGHTS/io_uring issue (https://lwn.net/Articles/779472/).
Posted Mar 23, 2021 0:20 UTC (Tue) by vbabka (subscriber, #91706)
I'm not sure I understand your point, but that problem is prevented by the commit. When the child wants to take that extra reference (for vmsplice()) it gets a copy instead of the page shared with the parent. Afterwards both the page tables of the child and the reference held by the pipe point to this new copy, and the access to parent's page is lost.
> Just because the page has been munmap'd doesn't mean that the process can't read from it, so why is the page table entry removed?
That's simply the semantic of munmap() - it has to adjust the VMA tree and zap page table entries so that the munmapped range is no longer represented there. Then if the process tries to read/write to an address within the area, it segfaults. We can't leave the page table entry there just because another reference exists. The read from the pipe doesn't go through these page tables.
Posted Mar 23, 2021 3:37 UTC (Tue) by PengZheng (subscriber, #108006)
In retrospect, is this COW breaking fundamentally wrong?
Why not treat vmsplice reference "as references"? Is it possible to make the page inaccessible to the userspace process after munmap while still keeping the kernel's reference?
Posted Mar 23, 2021 8:42 UTC (Tue) by NYKevin (subscriber, #129325)
For example, mmap can be used to allocate memory at a fixed address. It can be difficult to tell whether any given address is suitable (because other pages etc. might be in the way), but if you had just munmapped it, it would be really weird for a subsequent mmap of the same address and size to fail. Userspace might assume that it doesn't need to check the error code for mmap in that case (or it might not have suitable recovery code, and just call abort(3)).
So now your solution needs to accommodate stacking a new page on top of the hidden page, or relocating the hidden page, either of which is nontrivial. That's not even mentioning the fact that you need to teach vmsplice to track per-process references in the same way as pages do, without that page actually existing in the userspace page map. These are all rather difficult problems to solve, and this is a security issue, so solving hard problems is not the ideal form of a fix. Throwing in a simple COW break is a much more straightforward solution (but, as the story alludes, there was presumably some complication which they failed to account for).
Posted Mar 23, 2021 10:28 UTC (Tue) by excors (subscriber, #95769)
That reminds me of an old bug: About a decade ago, Firefox (using code from jemalloc) would try to do a large aligned allocation like "p = mmap(NULL, size*2); munmap(p); p = mmap(round_up(p, alignment), size);" i.e. using the first mmap+munmap to discover a large-enough hole in the address space, then allocating at a correctly-aligned address within that hole. If the second mmap didn't return the address that was requested, there must have been a race condition with another thread that allocated in the same hole, so it would loop around and try again and hope for better luck next time.
That worked okay until it ran on kernels with a security feature that randomised mmap and entirely ignored the address parameter (which is technically okay since it's defined as just a hint, not a requirement), so the code got stuck in an infinite loop.
That was fixed ages ago, but it does seem plausible that some userspace code may still make similarly unwise assumptions.
Posted Mar 23, 2021 16:00 UTC (Tue) by iabervon (subscriber, #722)
I think the semantics that would be clear is that, if a reference can be used to read the contents of the page, or can be converted into a reference that can be used to read the contents of the page, it counts as a map. But then you'd have to identify what needs the addition.
Posted Mar 23, 2021 8:40 UTC (Tue) by Sesse (subscriber, #53779)
Posted Mar 23, 2021 10:06 UTC (Tue) by pbonzini (subscriber, #60935)
* The fio benchmarking tool supports it
* openssl uses it with AF_ALG (which is also a bit on the obscure side), and so does libkcapi (a library for the kernel crypto API)
* VLC is an interesting one: it uses vmsplice to dump the decompressed output of gzip/bzip2/xz into memory (probably because the rest of the program prefers to work with memory-mapped buffers?)
* FUSE also uses it
Also, Samba doesn't use vmsplice but it uses splice to implement the opposite of sendfile (read from a socket into a file).
Posted Mar 23, 2021 21:49 UTC (Tue) by dgc (subscriber, #6611)
FWIW, given the well-known, documented caveats around combining Direct IO, mmap buffers and fork to avoid data corruption and "undefined results" (see open(2) man page) because of interactions with COW, I'm also not at all surprised that there are very similar issues resulting from vmsplice() taking temporary unaccounted references to user mapped pages...
-Dave.
Posted Mar 23, 2021 22:45 UTC (Tue) by pbonzini (subscriber, #60935)
Posted Mar 25, 2021 4:17 UTC (Thu) by alison (subscriber, #63752)
Posted Mar 25, 2021 7:29 UTC (Thu) by cladisch (✭ supporter ✭, #50193)
RCU requires that all software that accesses these data structures uses certain patterns and helper functions. (This is why we have many articles about it.) COW uses the CPU's built-in virtual memory features to make it look to userspace software as if the pages were never shared to begin with.