LWN: Comments on "Patching until the COWs come home (part 1)" https://lwn.net/Articles/849638/ This is a special feed containing comments posted to the individual LWN article titled "Patching until the COWs come home (part 1)". Thu, 04 Sep 2025 00:41:39 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850442/ https://lwn.net/Articles/850442/ cladisch <p>RCU requires that all software that accesses these data structures uses certain patterns and helper functions. (This is why we have <a href="https://lwn.net/Kernel/Index/#Read-copy-update">many articles</a> about it.)</p> <p>COW uses the CPU's built-in virtual memory features to make it look to userspace software as if the pages were never shared to begin with.</p> Thu, 25 Mar 2021 07:29:38 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850432/ https://lwn.net/Articles/850432/ alison <div class="FormattedComment"> Here is a naive question: why do we have both COW and RCU? Is there any reason fork() couldn&#x27;t use RCU? Is RCU safer? RCU has extensive formal documentation but COW less so.<br> </div> Thu, 25 Mar 2021 04:17:34 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850248/ https://lwn.net/Articles/850248/ foom <div class="FormattedComment"> We _almost_ have that API already: ptrace.<br> <p> Minor problem being that ptrace is really awful. It&#x27;s too bad the proposals to add a saner handle-based version have all died off.<br> </div> Wed, 24 Mar 2021 01:45:12 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850223/ https://lwn.net/Articles/850223/ Cyberax <div class="FormattedComment"> Apparently, people dislike posix_spawn. 
Perhaps we could grow a full replacement that would allow creating a suspended process, tweaking its attributes (using a file-handle-based API), and then resuming it.<br> <p> But I feel what we might actually get is an io_uring-based API that does this using BPF. <br> </div> Tue, 23 Mar 2021 23:00:09 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850222/ https://lwn.net/Articles/850222/ pbonzini <div class="FormattedComment"> On the other hand vmsplice is just a handy shortcut. AIO or any use of g_u_p() can cause the same issues.<br> </div> Tue, 23 Mar 2021 22:45:09 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850210/ https://lwn.net/Articles/850210/ dgc <div class="FormattedComment"> Yup, vmsplice is the classic &quot;solution looking for a problem to solve&quot; interface. <br> <p> FWIW, given the well-known, documented caveats around combining Direct IO, mmap buffers and fork to avoid data corruption and &quot;undefined results&quot; (see open(2) man page) because of interactions with COW, I&#x27;m also not at all surprised that there are very similar issues resulting from vmsplice() taking temporary unaccounted references to user mapped pages...<br> <p> -Dave.<br> </div> Tue, 23 Mar 2021 21:49:45 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850191/ https://lwn.net/Articles/850191/ NYKevin <div class="FormattedComment"> Well sure, but in the (absurd) hypothetical where we&#x27;re eliminating COW, the kernel would presumably need to grow an interface with capabilities similar to fork+exec, and the POSIX standard name for that interface is posix_spawn.<br> </div> Tue, 23 Mar 2021 16:12:09 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850189/ https://lwn.net/Articles/850189/ iabervon <div class="FormattedComment"> The PTE has to go away because that&#x27;s what munmap() is specified to do. 
Furthermore, the fact that it&#x27;s the same process at both ends of the pipe is irrelevant to the issue, I think. It seems to me like the simple solution would be to elevate the page map count while it&#x27;s in the pipe (but where else might a page be kept such that a process other than the parent could get at it?) or use the refcount to decide if the parent is the exclusive owner (but maybe the parent has extra references?).<br> <p> I think the clear semantics would be that, if a reference can be used to read the contents of the page, or can be converted into a reference that can be used to read the contents of the page, it counts as a map. But then you&#x27;d have to identify what needs the addition.<br> </div> Tue, 23 Mar 2021 16:00:25 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850170/ https://lwn.net/Articles/850170/ abatters <div class="FormattedComment"> I considered using vfork() recently, but ultimately decided against it after encountering too many warnings about it. For example, see the history of glibc using vfork() for posix_spawn():<br> <p> glibc &lt;= 2.23:<br> posix_spawn() uses vfork() if POSIX_SPAWN_USEVFORK is set or if there is no cleanup expected in the child before it exec(3)s the requested file. However, this implementation of vfork() was the source of a number of bugs.<br> <p> Linux glibc &gt;= 2.24:<br> glibc commit 9ff72da471a509a8c19791efe469f47fa6977410<br> posix_spawn() switches from vfork() to clone(CLONE_VM | CLONE_VFORK) which uses a separate stack for the child. 
This fixes a number of vfork()-related bugs (&quot;possible parent clobber due stack spilling&quot;), making it possible to enable by default and ignore POSIX_SPAWN_USEVFORK.<br> <p> recent non-Linux glibc<br> glibc commit ccfb2964726512f6669fea99a43afa714e2e6a80<br> POSIX_SPAWN_USEVFORK is ignored and regular fork() is always used, due to difficulties getting vfork() to work without the Linux-specific clone() semantics.<br> <p> Note that using clone(CLONE_VM | CLONE_VFORK) safely requires blocking all signals, including NPTL-internal signals. But the glibc wrappers don&#x27;t let you block NPTL-internal signals, making it much more difficult to do outside of glibc. See the glibc implementation for all the gory details.<br> <p> </div> Tue, 23 Mar 2021 14:17:06 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850163/ https://lwn.net/Articles/850163/ gerdesj <div class="FormattedComment"> &quot;But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.&quot;<br> <p> </div> Tue, 23 Mar 2021 12:37:35 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850158/ https://lwn.net/Articles/850158/ joib <div class="FormattedComment"> A couple of years ago we discussed this paper, which argues that fork() is fundamentally the wrong primitive to build OS process management around: <a href="https://lwn.net/Articles/785430/">https://lwn.net/Articles/785430/</a><br> </div> Tue, 23 Mar 2021 11:01:18 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850154/ https://lwn.net/Articles/850154/ excors <div class="FormattedComment"> <font class="QuotedText">&gt; if you had just munmapped it, it would be really weird for a subsequent mmap of the same address and size to fail. 
Userspace might assume that it doesn&#x27;t need to check the error code for mmap in that case (or it might not have suitable recovery code, and just call abort(3)).</font><br> <p> That reminds me of an old bug: About a decade ago, Firefox (using code from jemalloc) would try to do a large aligned allocation like &quot;p = mmap(NULL, size*2); munmap(p); p = mmap(round_up(p, alignment), size);&quot; i.e. using the first mmap+munmap to discover a large-enough hole in the address space, then allocating at a correctly-aligned address within that hole. If the second mmap didn&#x27;t return the address that was requested, there must have been a race condition with another thread that allocated in the same hole, so it would loop around and try again and hope for better luck next time.<br> <p> That worked okay until it ran on kernels with a security feature that randomised mmap and entirely ignored the address parameter (which is technically okay since it&#x27;s defined as just a hint, not a requirement), so the code got stuck in an infinite loop.<br> <p> That was fixed ages ago, but it does seem plausible that some userspace code may still make similarly unwise assumptions.<br> </div> Tue, 23 Mar 2021 10:28:40 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850153/ https://lwn.net/Articles/850153/ pbonzini <div class="FormattedComment"> posix_spawn() is not a single system call (technically both fork() and vfork() are wrappers around clone(2), but at least the latter is a single system call).<br> </div> Tue, 23 Mar 2021 10:08:13 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850152/ https://lwn.net/Articles/850152/ pbonzini <div class="FormattedComment"> Based on a quick Debian Code Search query:<br> <p> * The fio benchmarking tool supports it<br> <p> * openssl uses it with AF_ALG (which is also a bit on the obscure side), and so does libkcapi (a library for the kernel crypto API)<br> <p> * VLC is an interesting one, it 
uses vmsplice to dump the decompressed output of gzip/bzip2/xz into memory (probably because the rest of the program prefers to work with memory-mapped buffers?)<br> <p> * FUSE also uses it<br> <p> Also, Samba doesn&#x27;t use vmsplice but it uses splice to implement the opposite of sendfile (read from a socket into a file).<br> </div> Tue, 23 Mar 2021 10:06:50 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850147/ https://lwn.net/Articles/850147/ NYKevin <div class="FormattedComment"> In principle yes, but then you have other problems.<br> <p> For example, mmap can be used to allocate memory at a fixed address. It can be difficult to tell whether any given address is suitable (because other pages etc. might be in the way), but if you had just munmapped it, it would be really weird for a subsequent mmap of the same address and size to fail. Userspace might assume that it doesn&#x27;t need to check the error code for mmap in that case (or it might not have suitable recovery code, and just call abort(3)).<br> <p> So now your solution needs to accommodate stacking a new page on top of the hidden page, or relocating the hidden page, either of which is nontrivial. That&#x27;s not even mentioning the fact that you need to teach vmsplice to track per-process references in the same way as pages do, without that page actually existing in the userspace page map. These are all rather difficult problems to solve, and this is a security issue, so solving hard problems is not the ideal form of a fix. Throwing in a simple COW break is a much more straightforward solution (but, as the story alludes, there was presumably some complication which they failed to account for).<br> </div> Tue, 23 Mar 2021 08:42:23 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850148/ https://lwn.net/Articles/850148/ Sesse <div class="FormattedComment"> Do any applications actually use vmsplice() at all? 
I remember making a prototype for a minimal HTTP server (it only ever existed to serve two static responses), and it helped a few percent, but was never put into production. It seems to have caused a fair bit of security headache over the years…<br> </div> Tue, 23 Mar 2021 08:40:04 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850143/ https://lwn.net/Articles/850143/ PengZheng <div class="FormattedComment"> As a non-kernel developer, I&#x27;m curious:<br> <p> In retrospect, is this COW breaking fundamentally wrong?<br> Why not treat vmsplice references &quot;as references&quot;? Is it possible to make the page inaccessible to the userspace process after munmap while still keeping the kernel&#x27;s reference?<br> <p> </div> Tue, 23 Mar 2021 03:37:43 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850135/ https://lwn.net/Articles/850135/ vbabka <div class="FormattedComment"> <font class="QuotedText">&gt; The child still has access to the page, so surely it should still be marked as having access to that page as long as it does indeed have a reference to it, rather than pretending it doesn&#x27;t have access to the page just because it&#x27;s been munmap&#x27;d.</font><br> <p> I&#x27;m not sure I understand your point, but that problem is prevented by the commit. When the child wants to take that extra reference (for vmsplice()) it gets a copy instead of the page shared with the parent. Afterwards both the page tables of the child and the reference held by the pipe point to this new copy, and access to the parent&#x27;s page is lost.<br> <p> <font class="QuotedText">&gt; Just because the page has been munmap&#x27;d doesn&#x27;t mean that the process can&#x27;t read from it, so why is the page table entry removed?</font><br> <p> That&#x27;s simply the semantics of munmap() - it has to adjust the VMA tree and zap page table entries so that the munmapped range is no longer represented there. 
Then if the process tries to read/write to an address within the area, it segfaults. We can&#x27;t leave the page table entry there just because another reference exists. The read from the pipe doesn&#x27;t go through these page tables.<br> </div> Tue, 23 Mar 2021 00:20:21 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850131/ https://lwn.net/Articles/850131/ milesrout <div class="FormattedComment"> Naïve non-kernel-developer perspective: this seems _intuitively_ like the wrong solution to the problem. The child still has access to the page, so surely it should still be marked as having access to that page as long as it does indeed have a reference to it, rather than pretending it doesn&#x27;t have access to the page just because it&#x27;s been munmap&#x27;d. Just because the page has been munmap&#x27;d doesn&#x27;t mean that the process can&#x27;t read from it, so why is the page table entry removed? I assume there&#x27;s a very good reason why it&#x27;s done this way, though. Could someone clear up my misunderstanding?<br> <p> For what it&#x27;s worth, this all reminds me of the SCM_RIGHTS/io_uring issue (<a href="https://lwn.net/Articles/779472/">https://lwn.net/Articles/779472/</a>). 
<br> <p> </div> Mon, 22 Mar 2021 23:42:24 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850123/ https://lwn.net/Articles/850123/ NYKevin <div class="FormattedComment"> If you&#x27;re going to break API compatibility anyway,* then you might as well just tell people to use vfork() or posix_spawn(), as those are already mature and well-understood interfaces, and the latter is even portable.<br> <p> Incidentally, while researching this comment, I stumbled across this tidbit from clone(2):<br> <p> <font class="QuotedText">&gt; In contrast to the glibc wrapper, the raw clone() system call</font><br> <font class="QuotedText">&gt; accepts NULL as a stack argument (and clone3() likewise allows</font><br> <font class="QuotedText">&gt; cl_args.stack to be NULL). In this case, the child uses a</font><br> <font class="QuotedText">&gt; duplicate of the parent&#x27;s stack. (Copy-on-write semantics ensure</font><br> <font class="QuotedText">&gt; that the child gets separate copies of stack pages when either</font><br> <font class="QuotedText">&gt; process modifies the stack.) In this case, for correct</font><br> <font class="QuotedText">&gt; operation, the CLONE_VM option should not be specified. (If the</font><br> <font class="QuotedText">&gt; child shares the parent&#x27;s memory because of the use of the</font><br> <font class="QuotedText">&gt; CLONE_VM flag, then no copy-on-write duplication occurs and chaos</font><br> <font class="QuotedText">&gt; is likely to result.)</font><br> <p> That sounds like it would be fun to debug. Now I&#x27;m wondering if any developers have decided that they &quot;need to&quot; bypass glibc and pull this sort of chicanery on any of my systems... (I&#x27;m an SRE, so if it broke in production, it would be my problem to fix it). 
&quot;Fortunately,&quot; most of the bugs I&#x27;ve seen have tended to be higher-level than this, but it&#x27;s still a bit frightening that the kernel will just let you do something like that.<br> <p> * Which is obviously not going to happen given the kernel&#x27;s fanatical devotion to not breaking userspace, but let&#x27;s pretend for a moment.<br> </div> Mon, 22 Mar 2021 23:09:48 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850113/ https://lwn.net/Articles/850113/ Cyberax <div class="FormattedComment"> Over the years I found that COWs are just bad and it&#x27;s best to avoid them altogether.<br> <p> Can we already use io_uring to spawn new processes?<br> </div> Mon, 22 Mar 2021 20:52:47 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850107/ https://lwn.net/Articles/850107/ nix <div class="FormattedComment"> It&#x27;s definitely a problem because it doesn&#x27;t match the user&#x27;s mental model of how fork() is supposed to work. It&#x27;s clear that either COW must be broken in this case or a mapping must be retained (or the refcounts split into per-mm versions, which seems likely to be far more expensive). The conceptually ideal approach would have everything act just like normal data, i.e. recognise things like vmsplice references *as* references so you don&#x27;t need to specially break COW early for them -- but this seems likely to be viciously complex and of only minor benefit. Of course, hunting down every single way a reference can be taken by a child and arranging to COW-break on all of them seems likely to be a nightmarish game of whack-a-mole too...<br> </div> Mon, 22 Mar 2021 19:22:27 +0000 Patching until the COWs come home (part 1) https://lwn.net/Articles/850097/ https://lwn.net/Articles/850097/ tux3 <div class="FormattedComment"> What a cliffhanger! <br> I certainly am glad that Project Zero is there to root out these bugs (pun intended).<br> </div> Mon, 22 Mar 2021 18:55:47 +0000