Patching until the COWs come home (part 1)

March 22, 2021

This article was contributed by Vlastimil Babka

The kernel's memory-management subsystem is built upon many concepts, one of which is called "copy on write", or "COW". The idea behind COW is conceptually simple, but its details are tricky and its past is troublesome. Any change to its implementation can have unexpected consequences and cause subtle breakage for existing workloads. So it is somewhat surprising that last year we saw two major changes to the kernel's COW code; less surprising is the fact that, both times, these changes had unexpected consequences and broke things. Some of the resulting problems are still not fixed today, almost ten months after the first change, while the original reason for the changes (a security vulnerability) is also not fully fixed. Read on for a description of COW, the vulnerability, and the initial fix; the concluding article in the series will describe the complications that arose thereafter.

Copy on write is a standard mechanism for sharing a single instance of an object between processes in a situation where each process has the illusion of an independent, private copy of that object. Examples include memory pages shared between processes or data extents shared between files. To see how COW is used in the memory-management subsystem, consider what happens when a process calls fork(): the pages in that process's private memory areas should, logically, no longer be shared between the parent and child. But, instead of creating new copies of those pages for the child process during the fork() call, the kernel will simply map the parent's pages in the child's page tables. Importantly, the page-table entries in both parent and child are set as read-only (write-protected).

If either process attempts to write to one of these pages, a page fault will occur, and the kernel's page-fault handler will create a new copy of the page, replacing the page-table entry (PTE) in the faulting process with a PTE that references the new page and allows the write to proceed. This action is often referred to as "breaking COW". If the other process then tries to write to that same page, another page fault will occur, as that process's PTE is still marked read-only. But now the page-fault handler will recognize that the page is no longer shared, so the PTE can just be made writable and the process can resume.

The benefits of this scheme are lower memory consumption and a reduction of CPU time spent copying pages during fork() calls. Often the price of copying is never paid for many of the pages because the child might call exit() or exec() before either the parent or the child writes to those pages.

While the COW mechanism looks simple, the devil is in the details, as has been shown already in the past. The recent trouble in this area started in 2020; it resulted in two major changes while attempting to fix a vulnerability — which is actually still not fixed in all scenarios — and resulted in many corner cases, some of which are still not fully ironed out.

The trouble begins

The first public sign of issues with the COW mechanism appeared in the form of commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") at the end of May 2020. The changelog doesn't fully describe the problem scenario, but what is there is ominous enough:

End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead.

Any doubts about whether the commit fixed a security vulnerability vanish when one notices the Reported-by tag mentioning Jann Horn; presumably Horn's report went through the appropriate non-public security channels. The practice of making fixes to some vulnerabilities immediately public without explicitly marking them as such is not new, especially in the COW area. Nevertheless, the related Project Zero issue was made public in August, and CVE-2020-29374 was assigned in December; both point to the above-mentioned commit as the fix.

As the Project Zero issue includes proof-of-concept (PoC) code, we can look at the fix with that code in mind and not rely on the incomplete commit log. The most important parts of the PoC are the following:

    static void *data;

    posix_memalign(&data, 0x1000, 0x1000);
    strcpy(data, "BORING DATA");

    if (fork() == 0) {
        // child
        int pipe_fds[2];
        struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
        char buf[0x1000];

        pipe(pipe_fds);
        vmsplice(pipe_fds[1], &iov, 1, 0);
        munmap(data, 0x1000);

        sleep(2);
        read(pipe_fds[0], buf, 0x1000);
        printf("read string from child: %s\n", buf);
    } else {
        // parent
        sleep(1);
        strcpy(data, "THIS IS SECRET");
    }

The code starts by allocating an anonymous, private page and writing some data there; it then calls fork(). At that point, the page becomes a COW page: it is write-protected for the parent process by making the corresponding page-table entry read-only, and an identical PTE is created for the child process. Then, while the parent is blocked inside sleep(), the child creates a pipe and passes the page to that pipe with vmsplice(), a system call that is similar to write() but which allows a zero-copy data transfer of the page's contents. In order to achieve that, the kernel takes a reference on the source page (by increasing its reference count) through get_user_pages() or one of its variants; the set of these functions is often referred to as "GUP". The child then unmaps the page from its own page tables (but retains the reference in the pipe) and goes to sleep.

The parent wakes up from its sleep and writes new data to the page. The page-table entry is write-protected, so the write causes a page fault. The page-fault handler can tell that this is a fault on a COW page because the mapping allows write access while the PTE is write-protected. If there were more processes mapping the page, then the content would have to be copied (breaking COW), but if there is a single mapping, the page can just be made writable. The kernel relies on the value returned by page_mapcount() to determine how many mappings exist.

Here is the problem: page_mapcount() at this point in the PoC's execution includes only the parent's mapping, because the child has already called munmap() on that page. This function does not take into account the fact that the child can still access the parent's page through the pipe; it ignores the elevated page reference count. Thus, the kernel allows the parent process to write new data into the page, which is no longer considered to be a COW page. Finally, the child wakes up and reads that new data from the pipe, which might include sensitive information that the parent did not expect the child to see.
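The mismatch between the two counters can be replayed on paper. The sketch below is a toy model only: struct toy_page, reuse_on_write_fault(), and poc_leaks_in_toy_model() are invented names, and the accounting is deliberately simplified (one reference per mapping plus one for GUP), so it should not be read as the kernel's actual data structures or fault-handling code.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's per-page counters; not the real struct page. */
struct toy_page {
    int mapcount;  /* number of page-table entries mapping the page */
    int refcount;  /* all references: mappings plus GUP references */
};

/* The reuse rule as described in the text: a single mapping means no copy. */
static bool reuse_on_write_fault(const struct toy_page *page)
{
    return page->mapcount == 1;
}

/*
 * Replays the PoC's steps on the toy counters. Returns true if the parent
 * ends up writing in place while a non-mapping reference still exists.
 */
bool poc_leaks_in_toy_model(void)
{
    /* Before fork(): only the parent maps the page. */
    struct toy_page page = { .mapcount = 1, .refcount = 1 };

    /* fork(): the child maps the same page. */
    page.mapcount++;
    page.refcount++;

    /* vmsplice(): GUP takes a reference on behalf of the pipe. */
    page.refcount++;

    /* munmap() in the child: its mapping and that mapping's reference go away. */
    page.mapcount--;
    page.refcount--;

    /* Parent write fault: mapcount is 1, yet refcount is 2. */
    return reuse_on_write_fault(&page) && page.refcount > page.mapcount;
}
```

In this model the function returns true: the mapping count says "exclusive" while the reference count still records the pipe's reference, which is exactly the gap the PoC exploits.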

Corralling the problem

One might rightfully ask why this potential of leaking data from parent to child can matter in practice, as both processes are normally executing the code from the same binary and the fork() only acts as a branch in the code. So we can assume that either the binary is trusted and thus the child process is too, or it is not and then we probably should not let the parent access any sensitive data in the first place. And, in the scenario where fork() from a trusted binary is followed by an exec() of a potentially malicious binary, exec() removes all shared pages from the address space of the child process before loading the new binary.

But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.

Moreover, the vmsplice() syscall might just be a symptom of a broader issue, since there are many other callers of the GUP functions in the kernel. So it is a good idea in general not to let a child process hold on to a page shared through the COW mechanism with the parent while letting the parent write new contents to the page.

To prevent exploits of this behavior, commit 17839856fd58 made it impossible to get a reference (even a read-only reference) via GUP to a COW-shared page. All such attempts now result in breaking COW and returning a reference to the new copy instead. Thus, in the PoC code above, calling vmsplice() now causes the child process to replace the shared COW page in the corresponding page table entry with a new page, which is then passed to the pipe. Afterward, the child no longer has any way to access the parent's page and the new contents written there.
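The effect of the fix on the PoC can be illustrated with another toy model. This is a sketch under stated assumptions, not kernel code: the "pages" are plain buffers, the pipe's reference is a pointer, and child_sees_parent_write() and its gup_breaks_cow parameter are hypothetical names chosen for this illustration.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

/*
 * Toy model: returns true if the data later read through the "pipe"
 * reference is the parent's new contents. gup_breaks_cow selects the old
 * behavior (false) or the behavior after the fix (true).
 */
bool child_sees_parent_write(bool gup_breaks_cow)
{
    char shared_page[TOY_PAGE_SIZE] = "BORING DATA"; /* COW-shared after fork() */
    char child_copy[TOY_PAGE_SIZE];
    const char *pipe_ref;

    if (gup_breaks_cow) {
        /* Fixed: GUP on a COW-shared page copies first, then references the copy. */
        memcpy(child_copy, shared_page, sizeof(child_copy));
        pipe_ref = child_copy;
    } else {
        /* Old: GUP hands out a reference to the shared page itself. */
        pipe_ref = shared_page;
    }

    /* The child unmaps; the parent, now the only mapper, writes in place. */
    strcpy(shared_page, "THIS IS SECRET");

    /* The child reads from the pipe. */
    return strcmp(pipe_ref, "THIS IS SECRET") == 0;
}
```

With gup_breaks_cow set to false the function returns true (the leak); with it set to true the pipe holds the child's own copy and the function returns false. The cost of the fix is visible in the model as well: the copy is made even though the child only wanted to read the page.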

The commit notes the possibility of worse performance for some GUP users, especially those that rely on a lockless variant of the interface like get_user_pages_fast(). The changelog adds that finer-grained rules could be introduced later for situations where it is clearly safe to keep sharing the COW page because it can never be overwritten with new, potentially sensitive contents. The system-wide zero-page would be one example of this sort of situation. But otherwise, Linus Torvalds (the author of the change) expected no fundamental issues with this aggressively COW-breaking approach for GUP. Linux 5.8 was duly released with this commit.

And this, one might think, was the end of the problem. But, as was mentioned at the outset, COW is a complicated and subtle beast. In truth, the problems were just beginning. The second half of this article will delve into how the COW fix led to a stampede of new problems that have yet to be completely solved.


Posted Mar 22, 2021 18:55 UTC (Mon) by tux3 (subscriber, #101245)

What a cliffhanger!
I certainly am glad that Project Zero is there to root out these bugs (pun intended).

Posted Mar 23, 2021 12:37 UTC (Tue) by gerdesj (subscriber, #5446)

"But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug."

Posted Mar 22, 2021 19:22 UTC (Mon) by nix (subscriber, #2304)

It's definitely a problem because it doesn't match the user's mental model of how fork() is supposed to work. It's clear that either COW must be broken in this case or a mapping must be retained (or the refcounts split into per-mm versions, which seems likely to be far more expensive). The conceptually ideal approach would have everything act just like normal data, i.e. recognise things like vmsplice references *as* references so you don't need to specially break COW early for them -- but this seems likely to be viciously complex and of only minor benefit. Of course, hunting down every single way a reference can be taken by a child and arranging to COW-break on all of them seems likely to be a nightmarish game of whack-a-mole too...

Posted Mar 22, 2021 20:52 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)

Over the years I found that COWs are just bad and it's best to avoid them altogether.

Can we already use io_uring to spawn new processes?

Posted Mar 22, 2021 23:09 UTC (Mon) by NYKevin (subscriber, #129325)

If you're going to break API compatibility anyway,* then you might as well just tell people to use vfork() or posix_spawn(), as those are already mature and well-understood interfaces, and the latter is even portable.

Incidentally, while researching this comment, I stumbled across this tidbit from clone(2):

> In contrast to the glibc wrapper, the raw clone() system call
> accepts NULL as a stack argument (and clone3() likewise allows
> cl_args.stack to be NULL). In this case, the child uses a
> duplicate of the parent's stack. (Copy-on-write semantics ensure
> that the child gets separate copies of stack pages when either
> process modifies the stack.) In this case, for correct
> operation, the CLONE_VM option should not be specified. (If the
> child shares the parent's memory because of the use of the
> CLONE_VM flag, then no copy-on-write duplication occurs and chaos
> is likely to result.)

That sounds like it would be fun to debug. Now I'm wondering if any developers have decided that they "need to" bypass glibc and pull this sort of chicanery on any of my systems... (I'm an SRE, so if it broke in production, it would be my problem to fix it). "Fortunately," most of the bugs I've seen have tended to be higher-level than this, but it's still a bit frightening that the kernel will just let you do something like that.

* Which is obviously not going to happen given the kernel's fanatical devotion to not breaking userspace, but let's pretend for a moment.

Posted Mar 23, 2021 10:08 UTC (Tue) by pbonzini (subscriber, #60935)

posix_spawn() is not a single system call (technically both fork() and vfork() are wrappers around clone(2), but at least the latter is a single system call).

Posted Mar 23, 2021 16:12 UTC (Tue) by NYKevin (subscriber, #129325)

Well sure, but in the (absurd) hypothetical where we're eliminating COW, the kernel would presumably need to grow an interface with capabilities similar to fork+exec, and the POSIX standard name for that interface is posix_spawn.

Posted Mar 23, 2021 23:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523)

Apparently, people dislike posix_spawn. Perhaps we would grow a full replacement that would allow creating a suspended process, tweaking its attributes (using a file-handle-based API), and then resuming it.

But I feel what we might actually get is an io_uring-based API that does this using BPF.

Posted Mar 24, 2021 1:45 UTC (Wed) by foom (subscriber, #14868)

We _almost_ have that api already: ptrace.

Minor problem being that ptrace is really awful. It's too bad the proposals to add a saner handle based version have all died off.

Posted Mar 23, 2021 11:01 UTC (Tue) by joib (subscriber, #8541)

A couple of years ago we discussed this paper, which argues that fork() is fundamentally the wrong primitive to build OS process management around: https://lwn.net/Articles/785430/

Posted Mar 23, 2021 14:17 UTC (Tue) by abatters (✭ supporter ✭, #6932)

I considered using vfork() recently, but ultimately decided against it after encountering too many warnings about it. For example, see the history of glibc using vfork() for posix_spawn():

glibc <= 2.23:
posix_spawn() uses vfork() if POSIX_SPAWN_USEVFORK is set or if there is no cleanup expected in the child before it exec(3)s the requested file. However, this implementation of vfork() was the source of a number of bugs.

Linux glibc >= 2.24:
glibc commit 9ff72da471a509a8c19791efe469f47fa6977410
posix_spawn() switches from vfork() to clone(CLONE_VM | CLONE_VFORK) which uses a separate stack for the child. This fixes a number of vfork()-related bugs ("possible parent clobber due stack spilling"), making it possible to enable by default and ignore POSIX_SPAWN_USEVFORK.

recent non-Linux glibc
glibc commit ccfb2964726512f6669fea99a43afa714e2e6a80
POSIX_SPAWN_USEVFORK is ignored and regular fork() is always used, due to difficulties getting vfork() to work without the Linux-specific clone() semantics.

Note that using clone(CLONE_VM | CLONE_VFORK) safely requires blocking all signals, including NPTL-internal signals. But the glibc wrappers don't let you block NPTL-internal signals, making it much more difficult to do outside of glibc. See the glibc implementation for all the gory details.

Posted Mar 22, 2021 23:42 UTC (Mon) by milesrout (subscriber, #126894)

Naïve non-kernel-developer perspective: this seems _intuitively_ like the wrong solution to the problem. The child still has access to the page, so surely it should still be marked as having access to that page as long as it does indeed have a reference to it, rather than pretending it doesn't have access to the page just because it's been munmap'd. Just because the page has been munmap'd doesn't mean that the process can't read from it, so why is the page table entry removed? I assume there's a very good reason why it's done this way, though. Could someone clear up my misunderstanding?

For what it's worth, this all reminds me of the SCM_RIGHTS/io_uring issue (https://lwn.net/Articles/779472/).

Posted Mar 23, 2021 0:20 UTC (Tue) by vbabka (subscriber, #91706)

> The child still has access to the page, so surely it should still be marked as having access to that page as long as it does indeed have a reference to it, rather than pretending it doesn't have access to the page just because it's been munmap'd.

I'm not sure I understand your point, but that problem is prevented by the commit. When the child wants to take that extra reference (for vmsplice()) it gets a copy instead of the page shared with the parent. Afterwards both the page tables of the child and the reference held by the pipe point to this new copy, and the access to parent's page is lost.

> Just because the page has been munmap'd doesn't mean that the process can't read from it, so why is the page table entry removed?

That's simply the semantics of munmap() - it has to adjust the VMA tree and zap page table entries so that the munmapped range is no longer represented there. Then if the process tries to read/write to an address within the area, it segfaults. We can't leave the page table entry there just because another reference exists. The read from the pipe doesn't go through these page tables.

Posted Mar 23, 2021 3:37 UTC (Tue) by PengZheng (subscriber, #108006)

As a non-kernel developer, I'm curious:

In retrospect, is this COW breaking fundamentally wrong?
Why not treat vmsplice references "as references"? Is it possible to make the page inaccessible to the userspace process after munmap while still keeping the kernel's reference?

Posted Mar 23, 2021 8:42 UTC (Tue) by NYKevin (subscriber, #129325)

In principle yes, but then you have other problems.

For example, mmap can be used to allocate memory at a fixed address. It can be difficult to tell whether any given address is suitable (because other pages etc. might be in the way), but if you had just munmapped it, it would be really weird for a subsequent mmap of the same address and size to fail. Userspace might assume that it doesn't need to check the error code for mmap in that case (or it might not have suitable recovery code, and just call abort(3)).

So now your solution needs to accommodate stacking a new page on top of the hidden page, or relocating the hidden page, either of which is nontrivial. That's not even mentioning the fact that you need to teach vmsplice to track per-process references in the same way as pages do, without that page actually existing in the userspace page map. These are all rather difficult problems to solve, and this is a security issue, so solving hard problems is not the ideal form of a fix. Throwing in a simple COW break is a much more straightforward solution (but, as the story alludes, there was presumably some complication which they failed to account for).

Posted Mar 23, 2021 10:28 UTC (Tue) by excors (subscriber, #95769)

> if you had just munmapped it, it would be really weird for a subsequent mmap of the same address and size to fail. Userspace might assume that it doesn't need to check the error code for mmap in that case (or it might not have suitable recovery code, and just call abort(3)).

That reminds me of an old bug: About a decade ago, Firefox (using code from jemalloc) would try to do a large aligned allocation like "p = mmap(NULL, size*2); munmap(p); p = mmap(round_up(p, alignment), size);" i.e. using the first mmap+munmap to discover a large-enough hole in the address space, then allocating at a correctly-aligned address within that hole. If the second mmap didn't return the address that was requested, there must have been a race condition with another thread that allocated in the same hole, so it would loop around and try again and hope for better luck next time.

That worked okay until it ran on kernels with a security feature that randomised mmap and entirely ignored the address parameter (which is technically okay since it's defined as just a hint, not a requirement), so the code got stuck in an infinite loop.

That was fixed ages ago, but it does seem plausible that some userspace code may still make similarly unwise assumptions.

Posted Mar 23, 2021 16:00 UTC (Tue) by iabervon (subscriber, #722)

The PTE has to go away because that's what munmap() is specified to do. Furthermore, the fact that it's the same process at both ends of the pipe is irrelevant to the issue, I think. It seems to me like the simple solution would be to elevate the page map count while it's in the pipe (but where else might a page be kept that a process other than the parent could get it?) or use the refcount to decide if the parent is the exclusive owner (but maybe the parent has extra references?).

I think the semantics that would be clear is that, if a reference can be used to read the contents of the page, or can be converted into a reference that can be used to read the contents of the page, it counts as a map. But then you'd have to identify what needs the addition.

Posted Mar 23, 2021 8:40 UTC (Tue) by Sesse (subscriber, #53779)

Do any applications actually use vmsplice() at all? I remember making a prototype for a minimal HTTP server (it only ever existed to serve two static responses), and it helped a few percent, but was never put in production. It seems to have caused a fair bit of security headache over the years…

Posted Mar 23, 2021 10:06 UTC (Tue) by pbonzini (subscriber, #60935)

Based on a quick Debian Code Search query:

* The fio benchmarking tool supports it

* openssl uses it with AF_ALG (which is also a bit on the obscure side), and so does libkcapi (a library for the kernel crypto API)

* VLC is an interesting one, it uses vmsplice to dump the decompressed output of gzip/bzip2/xz into memory (probably because the rest of the program prefers to work with memory-mapped buffers?)

* FUSE also uses it

Also, Samba doesn't use vmsplice but it uses splice to implement the opposite of sendfile (read from a socket into a file).

Posted Mar 23, 2021 21:49 UTC (Tue) by dgc (subscriber, #6611)

Yup, vmsplice is the classic "solution looking for a problem to solve" interface.

FWIW, given the well-known, documented caveats around combining Direct IO, mmap buffers and fork to avoid data corruption and "undefined results" (see open(2) man page) because of interactions with COW, I'm also not at all surprised that there are very similar issues resulting from vmsplice() taking temporary unaccounted references to user mapped pages...

-Dave.

Posted Mar 23, 2021 22:45 UTC (Tue) by pbonzini (subscriber, #60935)

On the other hand vmsplice is just a handy shortcut. AIO or any use of g_u_p() can cause the same issues.

Posted Mar 25, 2021 4:17 UTC (Thu) by alison (subscriber, #63752)

Here is a naive question: why do we have both COW and RCU? Is there any reason fork() couldn't use RCU? Is RCU safer? RCU has extensive formal documentation but COW less so.

Posted Mar 25, 2021 7:29 UTC (Thu) by cladisch (✭ supporter ✭, #50193)

RCU requires that all software that accesses these data structures uses certain patterns and helper functions. (This is why we have many articles about it.)

COW uses the CPU's built-in virtual memory features to make it look to userspace software as if the pages were never shared to begin with.