LWN.net Weekly Edition for March 25, 2021
Welcome to the LWN.net Weekly Edition for March 25, 2021
This edition contains the following feature content:
- WireGuard bounces off FreeBSD—for now: FreeBSD 13 was set to release with WireGuard support; then things got messy.
- Clarifying memory management with page folios: a new abstraction to help keep different types of pages straight — and improve performance too.
- Lockless patterns: more read-modify-write operations: the series continues with a deeper look at what can be done with RMW.
- Patching until the COWs come home (part 1): an exploitable bug turns up in the kernel's complex copy-on-write code.
- Extending Python's enums: Python developers explore ways to add membership tests to enum types.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
WireGuard bounces off FreeBSD—for now
The WireGuard VPN tunnel is a fast and easy-to-use solution for those who need or want a secure tunnel for their traffic. The project has been around since 2016, but it has had a somewhat circuitous route into Linux; it was merged for the 5.6 kernel, which was released in March 2020. Getting into Linux required WireGuard developer Jason A. Donenfeld to acquiesce to having WireGuard use some of the existing kernel crypto primitives, rather than merging his Zinc crypto library. Some of the same tensions that were seen in that process seem to be cropping up again in the more recent efforts to add WireGuard support to several BSD kernels.
Complaints
A March 15 posting from Donenfeld on the WireGuard mailing list set off something of a chain reaction; in it he described the current status of WireGuard support in FreeBSD 13, which was in the release candidate stage at that point. A week earlier, FreeBSD developer Kyle Evans had asked on the list about supporting FreeBSD's port of the if_wg implementation of WireGuard from OpenBSD in the user-space wireguard-tools. He noted that FreeBSD 13 was soon to be released. That upcoming deadline caused a flurry of activity that Donenfeld relayed in his message:
In the ensuing discussion I mentioned that we really need to get the actual if_wg kernel implementation up to snuff. We took the conversation to IRC, and agreed that we should work on figuring out what to do before the release date. At the same time, Matt Dunwoodie, who worked on the OpenBSD implementation, also took a look at what had become of that implementation in FreeBSD. Over the next week, the three of us dug in and completely reworked the implementation from top to bottom, each one of us pushing commits and taking passes through the code to ensure correctness.
He went on at some length about the problems that were found in the existing port, which had been done at the behest of Netgate, a network security company that makes the FreeBSD-based pfSense router and firewall distribution. Donenfeld's description of the state of the code in FreeBSD before he, Evans, and Dunwoodie did their week-long rework was rather eye-opening:
It was not pretty. I imagined strange Internet voices jeering, “this is what gives C a bad name!” There were random sleeps added to “fix” race conditions, validation functions that just returned true, catastrophic cryptographic vulnerabilities, whole parts of the protocol unimplemented, kernel panics, security bypasses, overflows, random printf statements deep in crypto code, the most spectacular buffer overflows, and the whole litany of awful things that go wrong when people aren't careful when they write C. Or, more simply, it seems typical of what happens when code ships that wasn't meant to. It was essentially an incomplete half-baked implementation – nothing close to something anybody would want on a production machine.
The lengthy post also laid out some of the philosophy behind the WireGuard
project and Donenfeld's approach to compatible implementations, which is
decidedly hands-on. It is "a radical departure from the traditional model, and one surely to
raise some grumbles amongst graybeards
". He concluded the message by saying
that they had set out hoping to pull something together for the upcoming FreeBSD
release, but had fallen short of that goal. Evans committed
the changes they had worked on, but it "looks
exceedingly likely
" that FreeBSD will "just disable the module from the release, and
revisit for 13.1, rather than merging our fixes a few short days before
the release
". The post led to coverage
in the tech media—and a bunch of other effects.
Response
As might be guessed, Netgate was not particularly happy with what Donenfeld had to say—even less with how he said it—and was also less than pleased about the lack of warning regarding the problems. Netgate director of software engineering Scott Long posted a strongly worded response on the company blog; he also sent an email, apparently privately, to Donenfeld, who publicly replied on the WireGuard mailing list. In the blog post, Long pointed out that the WireGuard work had been put out for public review back in August 2020 and he took exception to some of what was posted:
Unfortunately, the public discussion has also veered into vague claims and slanderous attacks. This is where the lack of transparency, the lack of respect, and the inflation of ego is damaging and unproductive. We had hoped for a better collaboration than this, and it makes me doubt the motives of the attackers. And yes, I make deliberate use of the word “attacker” here, because that’s what this is, an attack on Netgate and on the FreeBSD and pfSense communities. Beware of anyone who says that they have all the answers. I also worry about the integrity of those who make vague statements and blanket, over-the-top accusations.[...] By following the normal, well understood security disclosure process this entire incident could have been handled quickly and efficiently through normal channels. We have yet to see a full description of the problems claimed; their choice to do a complete rewrite obscures the evidence of what they believe they were fixing, and they have yet to submit their work through the normal FreeBSD Phabricator process for review. That said, we do look forward to the bug reports and subsequent evaluation of the code through this review process. Code development is an iterative process, and one that we continue to strive to be better at. In the end, we will all benefit.
For his part, Evans was displeased with his role in the uproar and, on March 16, announced that he would be removing WireGuard from FreeBSD until sometime after the upcoming release. He and others planned to keep working on it moving forward.
The problem is that this work, in particular, is a driver with fairly severe security implications. Review by "three developers working and beating on it" is not the higher bar that we should be holding this to.
He followed that up with a message
stepping down from the wireguard-freebsd project entirely on March 18,
as well as a response
to Donenfeld's original post. He said that he did not "appreciate
how you handled this initial communication
" and felt that some of
the complaints were exaggerated, though he did acknowledge
multiple problems in the existing code. Beyond that:
I've additionally recommended, as a developer and not in any kind of official capacity, that we can't include if_wg in any future version of base. We cannot have our users being put at the risk of this kind of publicity if we can't get security advisories issued in time. Ports is a fine place, where security issues can be addressed expeditiously and more in line with how the WireGuard project chooses to handle them.
For now, development will
continue in the wireguard-freebsd Git
repository, but by the sounds, it will not be returning to the FreeBSD
kernel anytime soon. On March 22, Warner Losh put out a message
from the FreeBSD core team on the freebsd-hackers mailing list about
the "WireGuard controversy
":
Core unconditionally values the work of all contributors, and seeks a culture of cooperation, respect, and collaboration. The public discourse over WireGuard in the past week does not meet these standards and is damaging to our community if not checked. As such, WireGuard development for FreeBSD will now proceed outside of the base system. For those who wish to evaluate, test, or experiment with WireGuard, snapshots will be available via the ports and package systems.
NetBSD
As alluded to in Donenfeld's original post, there have also been efforts to
add WireGuard support in NetBSD, but those have not gone entirely smoothly,
seemingly. An August
2020 thread started by Donenfeld on the tech-kern and tech-net mailing
lists for NetBSD was similarly critical of an existing WireGuard
implementation; he asked that the existing code be reverted in favor of an
evaluation of the proper path forward. He offered to work with Taylor R
Campbell, who had picked
up some code written by Ryota Ozaki back in 2018;
"it seemed to be in pretty good
shape when I reviewed it this year, with a few small issues I saw, so
I dusted it off and merged it
".
In that thread, Campbell and others repeatedly asked Donenfeld to describe the problems he said that he found, but that was not how Donenfeld wanted to proceed:
This isn't about some random bug here or there, or a TODO list of little errors and optimizations, or something that one would normally say, "well, the bones are there, so let's throw this in the tree, and we'll clean it up as a gradual evolutionary thing." Rather, the foundation is incomplete, and its current form does not belong in a tree.[...] The right course of action here is to revert the code, and then after we can evaluate which direction we want to go in -- whether it's doing a full teardown of Ozaki-san's code, or whether it's doing a simple port of the OpenBSD code.
It would seem that Campbell and Donenfeld never actually got to that point;
Donenfeld said that he hoped "to be able to lend a hand similarly to the
NetBSD developers soon
" in the original FreeBSD announcement. Part
of the disconnect here may be the type of review that Donenfeld is looking
for; he talked about what he and Dunwoodie had done over a year's time to get the OpenBSD
implementation into shape ("continual code reviews,
knowledge transfers, and he even showed up at my house at some point
to read code together
"). Matt Macy, who was the author of
the FreeBSD version that was so heavily criticized mentioned a similar
requirement in a FreeBSD
bug report:
Jason had insisted that the code needed to be reviewed by him, but he won't actually spend any time on reviewing the code unless I'm actually sitting there engaging with him directly. The OpenBSD dev appears to have spent many dozens of hours with him. However, I simply have no time or inclination for that. I'm not sure who in the FreeBSD community has the crypto background to review the protocol bits as well as the energy to do so.
By all accounts, WireGuard itself is an excellent VPN solution, but it would seem that unlike the usual approach for a network protocol, where multiple implementations are made independently based on a specification, WireGuard needs to be treated differently. Donenfeld is justifiably proud of his accomplishment, but his requirements for other implementations seem far too rigid—at least for some communities. As we have seen in several different operating system projects (Linux, FreeBSD, NetBSD), Donenfeld often expects that the other, much larger projects conform to his exacting standards and methods. In the end, that attitude may discomfit more than just graybeards.
Clarifying memory management with page folios
Memory management generally works at the level of pages, which typically contain 4,096 bytes but may be larger. The kernel, though, has extended the concept of pages to include compound pages, which are groups of contiguous single pages. That, in turn, has made the definition of what a "page" is a bit fuzzy. Matthew Wilcox has been working since last year on a concept called "page folios" which is meant to bring the picture back into focus; whether the memory-management community will accept it remains unclear, though.At the lowest level, pages are a concept implemented by the hardware; the tracking of memory and whether it is present in RAM or not is done at page granularity. Any given CPU architecture may offer a limited selection of page sizes, but one "base" page size must be chosen, and the most common choice remains 4,096 bytes — the same as it was when the first Linux kernels were released 30 years ago.
The kernel, though, often has reason to work with memory in larger chunks. One example is the management of "huge pages" which, once again, are implemented by the hardware. The x86 architecture, for example, can work with 2MB huge pages, and there are performance advantages to using them where they are applicable. The kernel will also allocate groups of pages in other sizes, though, typically for DMA buffers or other uses where a set of physically contiguous pages is needed. This sort of grouping of pages is known as a "compound page" in the kernel.
Every base page of memory managed by the kernel is represented by a page structure in the system memory map. When a compound page is created out of a set of base pages, the page structure for the first page in the set (the "head page") is specially marked to make its compound nature explicit. The other information in that structure refers to the compound page as a whole. All of the other pages (the "tail pages") are marked as such, with a pointer to the page structure for the associated head page. See this article for details on how compound pages are organized.
This mechanism makes it easy to go from the page structure of a tail page to the head page for the compound page. Many interfaces within the kernel make use of that feature, but it creates a fundamental ambiguity: if a function is passed a pointer to a page structure for a tail page, is it expected to act on that tail page or on the compound page as a whole? Or, as Wilcox put it in the first posting of the folio series in December:
A function which has a struct page argument might be expecting a head or base page and will BUG if given a tail page. It might work with any kind of page and operate on PAGE_SIZE bytes. It might work with any kind of page and operate on page_size() bytes if given a head page but PAGE_SIZE bytes if given a base or tail page. It might operate on page_size() bytes if passed a head or tail page. We have examples of all of these today.
(PAGE_SIZE is the size of a base page, while page_size() returns the full size of a — possibly compound — page.) There does not seem to be an extensive history of bugs resulting from this particular API, but an interface that is this poorly defined seems likely to encourage problems sooner or later.
In an attempt to clarify the situation, Wilcox has come up with the concept of a "page folio", which is really just a page structure that is guaranteed not to be a tail page. Any function accepting a folio will operate on the full compound page (if, indeed, it is a compound page) with no ambiguity. The result is greater clarity in the kernel's memory-management subsystem; as functions are converted to take folios as arguments, it will become clear that they are not meant to operate on tail pages.
When Wilcox first posted this patch series, though, he emphasized a different benefit from the change. Any function that might be passed a tail page, but which must operate on the full compound page containing that tail page, must exchange any pointers to tail-page page structures for pointers to the head page instead. That is typically done with a call to:
struct page *compound_head(struct page *page);
This function is relatively cheap, but it may be called many times over the course of a single operation on a page. That makes the kernel bigger (since it's an inline function) and slows things down. A function that accepts a folio, instead, knows that it is not dealing with a tail page; thus it need not call compound_head(). That saves both time and memory.
The folio type itself is defined as a simple wrapper structure:
struct folio {
struct page page;
};
From there, a new set of infrastructure is built up. For example, get_folio() and put_folio() will manage references to the folio much like get_page() and put_page(), but without the unneeded calls to compound_head(). A whole set of higher-level functions follows from there. Much of the real work, though, will be in converting various kernel subsystems to use the new type; Wilcox didn't sugarcoat the nature of that task:
This is going to be a ton of work, and massively disruptive. It'll touch every filesystem, and a good few device drivers! But I think it's worth it.
By the time the fourth version of this patch set was posted on March 5, the core patches and the conversions (which Wilcox didn't post) added up to about 100 commits, which is a fair amount to review.
Perhaps as a result of the size of the patch series, the previous postings did not elicit that much discussion. In response to the latest one, though, Andrew Morton took a look and was worried by what he saw:
Geeze it's a lot of noise. More things to remember and we'll forever have a mismash of `page' and `folio' and code everywhere converting from one to the other. Ongoing addition of folio accessors/manipulators to overlay the existing page accessors/manipulators, etc.It's unclear to me that it's all really worth it.
Hugh Dickins, too, expressed
a lack of enthusiasm for this work. On the other hand, Kirill
Shutemov and Michal Hocko
both expressed support for it, in concept at least. Dave Chinner said
that "this abstraction is absolutely necessary
" for filesystem
developers, especially if and when the page cache gains the ability to
manage compound pages of multiple sizes.
So, in other words, there is currently no consensus among the core developers regarding whether this work improves the kernel or not. That may change over time as more people look at it and its advantages (or the lack thereof) become more clear. But change tends to happen slowly in the memory-management subsystem in general, even when the patch set in question is not so large and messy. One should also bear in mind that there is an inevitable discussion on naming to be had; it is already clear that "folio" is not popular, though alternatives are currently thin on the ground. One conclusion is thus clear: the kernel may well get folios or something like them, but it seems unlikely to happen soon.
Lockless patterns: more read-modify-write operations
Last week's installment in this series on lockless patterns took a first look at the compare-and-swap (CAS) operation. CAS is a powerful tool that can be used to implement a number of lockless primitives. The next step is to look at other atomic read-modify-write operations that can be implemented on top of compare-and-swap.CAS-based primitives usually operate on int values. The Linux kernel uses atomic_t, a struct type that wraps int so that loads and stores are marked explicitly. For example, it is not possible to write x++; if x is an atomic_t. Instead one must write atomic_inc(&x);. All operations on atomic_t start with "atomic_".
Linux has roughly 30 read-modify-write operations on atomic_t, of which compare-and-swap, in the form of atomic_cmpxchg(), is the real workhorse since it can be used to implement all of the others. For example, here is an "atomic increment":
/* atomic_read() is like READ_ONCE(), but for Linux's atomic_t. */
old = atomic_read(&x);
do {
expected = old;
old = atomic_cmpxchg(&x, expected, expected + 1);
} while (old != expected);
Which, as you can see, is similar to the "add to list" operation that was presented earlier in this series. As a bonus, at completion, old contains the value that was incremented, making the above sequence equivalent to the Linux macro atomic_fetch_inc().
Instruction sets vary wildly in the number of read-modify-write instructions that they offer. Some only include compare-and-swap; x86 has many more, but only three of them (CMPXCHG, XCHG, and the "exchange and add" instruction XADD) return the old value of the memory location. And, even with the most comprehensive instruction set, some cases are bound to occur that the processor cannot cover. That's when compare-and-swap comes in handy. For example, we could define a read-modify-write implementation for the "maximum of two values" operator:
/* Store max(x, new) into x. */
old = atomic_read(&x);
do {
expected = old;
if (old > new)
break;
old = atomic_cmpxchg(&x, expected, new);
} while (old != expected);
or one that increments a value only if it is non-zero:
old = atomic_read(&x);
do {
expected = old;
if (old == 0)
break;
old = atomic_cmpxchg(&x, expected, expected + 1);
} while (old != expected);
Sequences similar to the latter are useful to implement lock-free fast paths. This lockless programming pattern lets the programmer use locks only in rare cases, thus avoiding contention most of the time. A typical example is reference counting.
Lock-free reference counting
Consider a typical, reference-counted data structure that we'll call struct gadget. Each gadget has a parent, and keeps a reference to its parent in order to keep the parent itself alive. This is the simplest possible implementation of get_gadget() and put_gadget(), the functions that respectively increment and decrement the gadget's reference count:
void get_gadget(struct gadget *g) {
mutex_lock(&gadgets_lock);
g->refcnt++;
mutex_unlock(&gadgets_lock);
}
void put_gadget(struct gadget *g) {
mutex_lock(&gadgets_lock);
if (g->refcnt-- == 1) {
mutex_unlock(&gadgets_lock);
put_gadget(g->parent);
kfree(g);
return;
}
mutex_unlock(&gadgets_lock);
}
However, this would be unnecessarily inefficient. When obtaining an additional reference to a gadget g, the thread that calls get_gadget() must already have a reference to g. Furthermore, that reference cannot go away until get_gadget() returns; therefore, any concurrent call to put_gadget() would never go down the kfree() branch. We can do the following instead and get rid of the lock:
void get_gadget(struct gadget *g) {
/*
* Unlike atomic_fetch_dec(), this increment is atomic but has
* no acquire or release semantics. This is true of all Linux
* atomic operations that do not return a value.
*/
atomic_inc(&g->refcnt);
}
void put_gadget(struct gadget *g) {
/*
* Like atomic_cmpxchg(), this has both acquire and release semantics.
*/
if (atomic_fetch_dec(&g->refcnt) > 1)
return;
put_gadget(g->parent);
kfree(g);
}
Linux provides kref_get() and kref_put() as a simple abstraction for this idiom. There are two things worth noting in the above code:
- put_gadget() is not using atomic_fetch_dec_release(); decrementing the reference count is done with both acquire and release semantics. Just like in the lock-free list example, this weaves the "happens before" relation through all the calls to put_gadget(), so that when reaching for other gadgets via g->parent, it is possible to use regular loads.
- The only synchronization point is in put_gadget(); get_gadget() does need the increment to be atomic, but it does not need to have acquire semantics. This is because the calling thread must already have a reference to g. Acquire and release semantics had to be involved whenever the thread initially got that reference, for example at thread creation; after that point, however, the thread proceeds independently until it needs to give back the reference and potentially free the gadget.
Now, imagine that each gadget also stored a list of sub-gadgets. Just before destroying itself, the gadget could remove itself from its parent's list of sub-gadgets:
void put_gadget(struct gadget *g) {
if (atomic_fetch_dec(&g->refcnt) > 1)
return;
mutex_lock(&g->parent->gadgets_lock);
list_del(&g->node);
mutex_unlock(&g->parent->gadgets_lock);
put_gadget(g->parent);
kfree(g);
}
However, as is often the case, the code above only tells half of the story. The gadget has a reference to its parent; if the parent held a reference to each of its children as well, this would create a reference-count cycle and leak memory. Therefore, the list must contain pointers to the children without having a reference to them; sometimes you'll hear that the list has weak references to the children. Threads are free to visit the list and operate on the gadgets therein, but they cannot call get_gadget() because they do not already have a reference. That means that threads cannot prevent child gadgets from going away at a surprising time.
The simplest solution is to operate on weak references only within the protection of the lock; the gadgets in the list will not disappear until put_gadget() can take the lock and remove the gadget from the list. If this is too restrictive, however, we can instead refine the rules for calling get_gadget(). The following would be a workable alternative:
- As before, a thread that already has a strong reference can obtain an extra reference with get_gadget().
- A thread can upgrade a weak reference to strong by calling get_gadget() while holding the lock that protects the list.
The implementation of get_gadget() is still the same one-liner, but put_gadget() must be more careful: it must take the lock before decrementing the reference count from one to zero, and not release the lock until it has deleted itself from the list. This way, visitors to the list will never find weak references to gadgets whose reference count is zero. If the reference count is greater than one, however, put_gadget() can proceed locklessly. This is how it could be implemented using compare-and-swap:
void put_gadget(struct gadget *g) {
for (;;) {
int old = atomic_read(&g->refcnt);
if (old > 1) {
if (atomic_cmpxchg(&g->refcnt, old, old - 1) == old)
return;
} else {
/* old was 1, fence off accesses to weak references! */
mutex_lock(&g->parent->gadgets_lock);
if (atomic_cmpxchg(&g->refcnt, 1, 0) == 1)
break;
/*
* Somebody snuck in and upgraded a weak reference before the
* mutex_lock(). Try again.
*/
mutex_unlock(&g->parent->gadgets_lock);
}
}
list_del(&g->node);
mutex_unlock(&g->parent->gadgets_lock);
put_gadget(g->parent);
kfree(g);
}
Read the code again and again until you can convince yourself that every call to put_gadget() will result in exactly one successful compare-and-swap operation.
This is a common pattern in Linux, and the functions kref_put_mutex() and kref_put_lock() help developers implement it. As with llist.h, you are strongly encouraged not to roll your own and to use the library functions instead. The alert reader will have noticed that there is no kref_put_rwsem() function, which might be handy if a struct rw_semaphore protects the list of sub-gadgets. It may be a good exercise to sit down and try to write that function.
Sometimes there is no lock that could protect both the access and the destruction. This happens if the weak reference is held by a separate subsystem, for example by a debugfs inode's i_private field. In this case, the upgrade operation must be allowed to fail if the reference count is zero. refcount_inc_not_zero() and kref_get_unless_zero() can be helpful in this scenario, and you can see the former at work in kvm_debugfs_open():
/*
* The debugfs files still hold a reference to the kvm struct at the
* time kvm_destroy_vm is called. The files are removed, and the
* reference disappears, before kvm_destroy_vm frees the kvm struct.
*
* To avoid a race between the opening and the removal of the debugfs
* files, return -ENOENT if kvm_destroy_vm is in progress.
*/
if (!refcount_inc_not_zero(&stat_data->kvm->users_count))
return -ENOENT;
Compared to the "increment if not zero" implementation at the top of the article, refcount_inc_not_zero() is more complicated in order to implement overflow protection—further reinforcing the importance of using higher-level primitives when available.
Conclusions
A few loose ends and simplifications that were made throughout the series will be covered in the next article, but this mostly concludes our introduction to lockless programming patterns. Outside of complex parts of Linux, such as the scheduler or read-copy-update (RCU), these synchronization primitives and patterns should cover almost all of the lockless code that you will encounter. My goal with these articles was to help you to understand the basic ideas and how the high-level APIs wrap those ideas, so that you can apply them even in slightly different cases. I hope they will be useful as both learning material and a reference.
[The author would like to thank Jon Corbet, Laszlo Ersek, and Stefan Hajnoczi for help proofreading the drafts of these articles.]
Patching until the COWs come home (part 1)
The kernel's memory-management subsystem is built upon many concepts, one of which is called "copy on write", or "COW". The idea behind COW is conceptually simple, but its details are tricky and its past is troublesome. Any change to its implementation can have unexpected consequences and cause subtle breakage for existing workloads. So it is somewhat surprising that last year we saw two major changes the kernel's COW code; less surprising is the fact that, both times, these changes had unexpected consequences and broke things. Some of the resulting problems are still not fixed today, almost ten months after the first change, while the original reason for the changes — a security vulnerability — is also not fully fixed. Read on for a description of COW, the vulnerability, and the initial fix; the concluding article in the series will describe the complications that arose thereafter.Copy on write is a standard mechanism for sharing a single instance of an object between processes in a situation where each process has the illusion of an independent, private copy of that object. Examples include memory pages shared between processes or data extents shared between files. To see how COW is used in the memory-management subsystem, consider what happens when a process calls fork(): the pages in that process's private memory areas should no longer be shared between the parent and child. But, instead of creating new copies of those pages for the child process during the fork() call, the kernel will simply map the parent's pages in the child's page tables. Importantly, the page-table entries in both parent and child are set as read-only (write-protected).
If either process attempts to write to one of these pages, a page fault will occur, and the kernel's page-fault handler will create a new copy of the page, replacing the page-table entry (PTE) in the faulting process with a PTE that references the new page, but which allows the write to proceed. This action is often referred to as "breaking COW". If the other process then tries to write to that same page, another page fault will occur, as that process's PTE is still marked read-only. But now the page-fault handler will recognize that the page is no longer shared, so the PTE can just be made writable and the process can resume.
The benefits of this scheme are lower memory consumption and a reduction of CPU time spent copying pages during fork() calls. Often the price of copying is never paid for many of the pages because the child might call exit() or exec() before either the parent or the child writes to those pages.
While the COW mechanism looks simple, the devil is in the details, as has been shown already in the past. The recent trouble in this area started in 2020; it resulted in two major changes while attempting to fix a vulnerability — which is actually still not fixed in all scenarios — and resulted in many corner cases, some of which are still not fully ironed out.
The trouble begins
The first public sign of issues with the COW mechanism appeared in the form of commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") at the end of May 2020. The changelog doesn't fully describe the problem scenario, but what is there is ominous enough:
End result: the get_user_pages() call might result in a page pointer that is no longer associated with the original VM, and is associated with - and controlled by - another VM having taken it over instead.
Any doubts about whether the commit fixed a security vulnerability vanish when one notices the Reported-by tag mentioning Jann Horn; presumably Horn's report went through the appropriate non-public security channels. The practice of making fixes to some vulnerabilities immediately public without explicitly marking them as such is not new, especially in the COW area. Nevertheless, the related Project Zero issue was made public in August, and CVE-2020-29374 was assigned in December; both point to the above-mentioned commit as the fix.
As the Project Zero issue includes proof-of-concept (PoC) code, we can look at the fix with that code in mind and not rely on the incomplete commit log. The most important parts of the PoC are the following:
static void *data;
posix_memalign(&data, 0x1000, 0x1000);
strcpy(data, "BORING DATA");
if (fork() == 0) {
// child
int pipe_fds[2];
struct iovec iov = {.iov_base = data, .iov_len = 0x1000 };
char buf[0x1000];
pipe(pipe_fds);
vmsplice(pipe_fds[1], &iov, 1, 0);
munmap(data, 0x1000);
sleep(2);
read(pipe_fds[0], buf, 0x1000);
printf("read string from child: %s\n", buf);
} else {
// parent
sleep(1);
strcpy(data, "THIS IS SECRET");
}
The code starts by allocating an anonymous, private page and writing some data there; it then calls fork(). At that point, the page becomes a COW page — it is write-protected for the parent process by making the corresponding page-table entry read-only, and for the child process an identical PTE is created. Then, while the parent is blocked inside sleep(), the child creates a pipe and passes the page to that pipe with vmsplice(), a system call that is similar to write() but which allows a zero-copy data transfer of the page's contents. In order to achieve that, the kernel takes a reference on the source page (by increasing its reference count) through get_user_page() or one of its variants; the set of these functions is often referred to as "GUP". The child then unmaps the page from its own page tables (but retains the reference in the pipe) and goes to sleep.
The parent wakes up from its sleep and writes new data to the page. The page table entry is write-protected, so the write causes a page fault. The page-fault handler can tell that this is fault on a COW page because the the mapping allows write access while the PTE is write protected. If there were more processes mapping the page then the content would have to be copied (breaking COW), but if there is a single mapping, the page can be just made writeable. The kernel relies on the value returned by page_mapcount() to determine how many mappings exist.
Here is the problem: page_mapcount() at this point in the PoC's execution includes only the parent's mapping, because the child has already called munmap() on that page. This function does not take into account the fact that the child can still access the parent's page through the pipe; it ignores the elevated page reference count. Thus, the kernel allows the parent process to write new data into the page, which is no longer considered to be a COW page. Finally, the child wakes up and reads that new data from the pipe, which might include sensitive information that the parent did not expect the child to see.
Corralling the problem
One might rightfully ask why this potential of leaking data from parent to child can matter in practice, as both processes are normally executing the code from the same binary and the fork() only acts as a branch in the code. So we can assume that, either the binary is trusted and thus the child process is too, or it is not and then we probably should not let the parent access any sensitive data in the first place. And, in the scenario where fork() from a trusted binary is followed by an exec() of a potentially malicious binary, exec() removes all shared pages from the address space of the child process before loading the new binary.
But, as the Project Zero issue mentions, there are environments, such as Android, where each process is forked from a zygote process without a subsequent exec(), for performance reasons. That could lead to a situation that looks a lot like the PoC exploit for this bug.
Moreover, the vmsplice() syscall might just be a symptom of a broader issue, since there are many other callers of the GUP functions in the kernel. So it is a good idea in general not to let a child process hold on to a page shared through the COW mechanism with the parent while letting the parent write new contents to the page.
To prevent exploits of this behavior, commit 17839856fd58 made it impossible to get a reference (even a read-only reference) via GUP to a COW-shared page. All such attempts now result in breaking COW and returning a reference to the new copy instead. Thus, in the PoC code above, calling vmsplice() now causes the child process to replace the shared COW page in the corresponding page table entry with a new page, which is then passed to the pipe. Afterward, the child no longer has any way to access the parent's page and the new contents written there.
The commit notes the possibility of worse performance for some GUP users, especially those that rely on a lockless variant of the interface like get_user_pages_fast(). The changelog continues that finer-grained rules could be added later for situations where it is clear that it is safe to keep sharing the COW page because it can never be overwritten with new, potentially sensitive contents. The system-wide zero-page would be one example of this sort of situation. But otherwise, Linus Torvalds (the author of the change) expected no fundamental issues with this aggressively COW-breaking approach for GUP. Linux 5.8 was duly released with this commit.
And this, one might think, was the end of the problem. But, as was mentioned at the outset, COW is a complicated and subtle beast. In truth, the problems were just beginning. The second half of this article will delve into how the COW fix led to a stampede of new problems that still have yet to be completely solved.
Extending Python's enums
Enumerated types or "enums" are a feature of many languages, including Python; enums provide a convenient way to collect up a bunch of related symbols that (typically) evaluate to integer values. The canonical example would seem to be for colors, at least for demonstration purposes, but there are others, especially for handling "magic" constants from source likes POSIX or the host operating system. A recent thread on the python-ideas mailing list discusses different ways to add a new feature to enums—seven years after they were added to the standard library as part of Python 3.4.
Background
We covered some of the discussion around
PEP 435
("Adding an Enum type to the Python standard library
") back in
2013 when it was approved. For those who are unfamiliar with the enum feature,
a mini-introduction may be in order; following the documentation for the
enum module, the examples will use upper case for the names, though it
is not required (and the PEP does not).
>>> from enum import Enum
>>> class Color(Enum):
... RED = 1
... GREEN = 2
... BLUE = 3
...
>>> Color.GREEN
<Color.GREEN: 2>
>>> Color.GREEN == 2
False
>>> Color.GREEN.value == 2
True
>>> Color(3)
<Color.BLUE: 3>
>>> Color.RED != Color.BLUE
True
>>> print(Color['RED'])
Color.RED
Here we define Color as an Enum with three name-value
pairs as "members" of the enum that correspond to colors with consecutive
integer values. There is no
requirement that the integers be consecutive or, even, integers. As the
PEP puts it:
In the vast majority of use-cases, one doesn't care what the actual value of an enumeration is. But if the value is important, enumerations can have arbitrary values.
Beyond that, enums can be iterated over in the order the members were defined and enum members are hashable, so they can be used as keys in dictionaries. While each member name must be unique, values do not have to be, so aliases are allowed:
>>> class Color(Enum):
... RED = 1
... GREEN = 2
... BLUE = 3
... ROJO = 1
... VERDE = 2
... AZUL = 3
...
>>> Color.AZUL
<Color.BLUE: 3>
>>> Color.VERDE is Color.GREEN
True
>>> for i in Color:
... print(i)
...
Color.RED
Color.GREEN
Color.BLUE
>>> Color.__members__
mappingproxy({'RED': <Color.RED: 1>, 'GREEN': <Color.GREEN: 2>, 'BLUE': <Color.BLUE: 3>,
'ROJO': <Color.RED: 1>, 'VERDE': <Color.GREEN: 2>, 'AZUL': <Color.BLUE: 3>})
As can be seen, the __members__ attribute stores all of the
members defined, though there are only three members created in the
example. The aliases can be used, but the first member with a given value
is the "true" member that gets created.
Testing for an enum member
Ethan Furman, who is one of the authors of the PEP and maintainers of the
enum module, posted
a question about a common Stack Overflow query on March 12: is there a
way, other than using try and except, to determine
whether a given value is a member of an enum? There was a way to do so in
an earlier version of Python, he said, but that was fixed as a bug along
the way. He suggested adding a get() to Enum with a
default to be
returned if the value is not found,
similar to dict.get(),
adding a recipe to the documentation for enum, or to "do nothing
".
Since it was a python-ideas thread, "do nothing" was probably not going to win out, though Guido van Rossum acknowledged it as a possibility. He also said that get() did not make sense and had an alternative to suggest:
But the way to convert a raw value to an enum value is Color(1), not Color[1], so Color.get(1) seems inconsistent.Maybe you can just change the constructor so you can spell this as Color(1, default=None) (and then check whether that's None)?
Furman agreed that get() was not the right interface and liked the idea of changing the constructor, though he wanted a slightly different approach:
I think I like your constructor change idea, with a small twist:
Color(value=<sentinel>, name=<sentinel>, default=<sentinal>)
This would make it possible to search for an enum by value or by name, and
also specify a default return value (raising an exception if the default is
not set and a member cannot be found).
In that message, Furman also answered Van Rossum's question about the earlier removal of a way that worked to determine if a value was a member of an enum:
if 1 in Color:
...
The __contains__()
special method for Enum had formerly worked for values, but that was removed because
of the ambiguity over whether it should work for values or names. Van
Rossum pointed
out that since the sets of names and values do not overlap,
both could be supported with in (which uses
__contains__()). He also seemed reasonably happy with Furman's
elaboration of the constructor idea, though he suggested that getattr()
might make a better fit for name lookup (with an optional default) to
eliminate the need for the default= keyword argument.
But Van Rossum also
wondered about a corner
case: "I'm not sure what Color(value=1, name='RED') would do --
insist that both value and name match? Would that have a use case?
"
Furman said
that he did not really see a good use case, but that requiring both to
match seemed like the right approach. He also
noted
that in some strange cases, names and values could overlap, "although
having the name of one member match the value of a different member seems
odd
". He seemed inclined to ignore that possibility and concluded:
Everything considered, I think I like allowing `__contains__` to verify both names and values, adding `default=<sentinel>` to the constructor for the value-based "gimme an Enum or None" case, and recommending `getattr` for the name-based "gimme an Enum or None" case.
With an explicit "+1
" from
Van Rossum, it seemed like this relatively minor detail for the
enum module had been resolved. But perhaps both Furman
and Van Rossum got caught up in the discussion and did not fully consider
the nature of a class constructor returning None (or whatever other
random object was given as the default), rather than an object of the type
being constructed. That did
not strike Matt Wozniski as particularly Pythonic:
I find the idea of having the constructor potentially return something other than an instance of the class to be very... off-putting. Maybe it's the best option, but my first impression of it isn't favorable, and I can't think of any similar case that exists in the stdlib today off the top of my head. It seems like we should be able to do better.
He suggested adding from_value() and from_name() class
methods, each of which would take an optional default value as a better
approach. Van Rossum agreed
that the constructor idea was not the right one ("and as static
typing proponent I should have thought of that
"), but he thought
there might not really be a need for an arbitrary default, so changing
__contains__() might be best. "The ergonomics of that seem
better for the dominant use case ('is
this a valid value for that enum?').
"
There was some discussion of whether aliases (i.e. multiple names with the same value) would cause a problem for "by value" lookups, but Furman made it clear that there actually is no ambiguity. Values are associated with the first name that uses them, which is why "Color.AZUL" displays as "Color.BLUE" in the example above.
Furman has not circled back with a final decision on his plans, but it seems clear that the constructor-based idea is a non-starter even though it seemed like the right way to go from early on. Overall, the short thread gave a nice little peek into the kinds of thinking that go on in language and library design. Sometimes, the obvious solution is really not quite right, even when it seems to tick all of the right boxes. And, once again, this kind of discussion shows the value of FOSS communities openly working to arrive at a better solution; while it is not perfect by any means, the results generally speak for themselves.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: .0 IP addresses; Rust hits linux-next; Firefox 87; GNOME 40; 2021 Free Software Awards; Quotes; ...
- Announcements: Newsletters; conferences; security updates; kernel patches; ...
